## Unsupervised learning for red wine characteristics

In [0]:
import pandas as pd

wine = pd.read_csv("winequality-red.csv", sep=';')
wine.head()

### 1. 
Use K Means Cluster Analysis to identify cluster(s) of observations that have high and low values of the wine quality. (Assume all variables are continuous.)

Describe variables that cluster with higher values of wine quality. Describe variables that cluster with lower values of wine quality.

If you want to make a good bottle of wine, then what characteristics are most important according to this analysis?

In [0]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

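# Cluster on the physicochemical features only; 'quality' is held out so it can be
# compared across clusters afterwards ('cluster' is dropped too in case the cell is re-run)
X = wine.drop(columns=['quality', 'cluster'], errors='ignore')

# Standardize so that no single feature dominates the Euclidean distances
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
num_clusters = 2

kmeans = KMeans(n_clusters=num_clusters, random_state=42)
wine['cluster'] = kmeans.fit_predict(X_scaled)

# Per-cluster means of every variable, including quality
cluster_summary = wine.groupby('cluster').mean()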

# Print summary statistics for each cluster
print(cluster_summary)

# Visualize the clusters on the first two standardized features
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=wine['cluster'], cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', s=300, c='red')
plt.title('K-Means Clustering Result')
plt.xlabel(f'{X.columns[0]} (standardized)')
plt.ylabel(f'{X.columns[1]} (standardized)')
plt.show()
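The cell above fixes `num_clusters = 2` without checking alternatives. One common way to sanity-check that choice is to compare mean silhouette scores across several values of k. The sketch below is an illustration only: it runs on synthetic blobs standing in for `X_scaled`, and the helper name `silhouette_by_k` is hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_by_k(X_scaled, k_values, random_state=42):
    """Return {k: mean silhouette score} for K-Means fits on X_scaled."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X_scaled)
        scores[k] = silhouette_score(X_scaled, labels)
    return scores


# Synthetic stand-in for the standardized wine features (assumption: 11 columns)
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=11, random_state=0)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

scores = silhouette_by_k(X_demo_scaled, range(2, 7))
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("k with the highest silhouette score:", best_k)
```

To apply this to the wine data, pass the notebook's `X_scaled` in place of `X_demo_scaled`. A higher score suggests tighter, better-separated clusters, but it is a heuristic rather than a definitive answer.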

The cluster with the higher mean wine quality also has higher mean values of fixed acidity, citric acid, residual sugar, and sulphates. The cluster with the lower mean wine quality has higher mean values of volatile acidity, free sulfur dioxide, total sulfur dioxide, and pH.

According to this analysis, the characteristics that matter most for wine quality are those whose means differ most between the two clusters: fixed acidity, citric acid, free sulfur dioxide, and total sulfur dioxide.
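Raw cluster means are hard to compare across variables measured on different scales, so one option is to express each gap in units of that variable's overall standard deviation and sort by size. The sketch below assumes exactly two clusters; the helper `rank_cluster_differences` is a hypothetical name, and the demo values are synthetic, not taken from the wine data.

```python
import numpy as np
import pandas as pd


def rank_cluster_differences(df, cluster_col="cluster"):
    """Rank columns by the gap between two cluster means, in overall-std units."""
    features = df.drop(columns=[cluster_col])
    means = features.groupby(df[cluster_col]).mean()
    if len(means) != 2:
        raise ValueError("expected exactly two clusters")
    # Signed gap: second cluster label minus first, scaled by each column's std
    gap = (means.iloc[1] - means.iloc[0]) / features.std(ddof=0)
    order = gap.abs().sort_values(ascending=False).index
    return gap.reindex(order)


# Synthetic demo: 'alcohol' differs strongly between clusters, 'pH' does not
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "alcohol": np.r_[rng.normal(9.5, 0.5, 50), rng.normal(11.5, 0.5, 50)],
    "pH": rng.normal(3.3, 0.15, 100),
    "cluster": [0] * 50 + [1] * 50,
})
ranked = rank_cluster_differences(demo)
print(ranked)
```

Calling `rank_cluster_differences(wine)` after the clustering cell would rank every column, including `quality`, by how strongly it separates the two clusters.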

### 2. 
Use Hierarchical Cluster Analysis to identify cluster(s) of observations that have high and low values of the wine quality. (Assume all variables are continuous.) Use complete linkage and the same number of groups that you found to be the most meaningful in question 1.

Describe variables that cluster with higher values of wine quality. Describe variables that cluster with lower values of wine quality.

If you want to make a good bottle of wine, then what characteristics are most important according to this analysis? Have your conclusions changed using hierarchical clustering rather than K-Means clustering? Present any figures that assist you in your analysis.


In [0]:
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

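# Agglomerative clustering with complete linkage on the standardized features
linkage_matrix = linkage(X_scaled, method='complete', metric='euclidean')

# Pass the row labels as a plain list
dendrogram(linkage_matrix, labels=wine.index.to_list(), leaf_rotation=90, leaf_font_size=8)
plt.title('Hierarchical Clustering Dendrogram')
plt.show()

# Cut the tree into the same number of groups used for K-Means
num_clusters = 2
wine['cluster'] = fcluster(linkage_matrix, num_clusters, criterion='maxclust')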

cluster_summary = wine.groupby('cluster').mean()

print(cluster_summary)

# Plot the first two (unscaled) features, colored by hierarchical cluster
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=wine['cluster'], cmap='viridis')
plt.title('Hierarchical Clustering Result')
plt.xlabel(X.columns[0])
plt.ylabel(X.columns[1])
plt.show()

The cluster with the higher mean wine quality also has higher mean values of volatile acidity, pH, and alcohol. The cluster with the lower mean wine quality has higher mean values of citric acid, chlorides, free sulfur dioxide, and sulphates.

According to this analysis, the characteristics that matter most for wine quality are those whose means differ most between the two clusters: citric acid, chlorides, free sulfur dioxide, total sulfur dioxide, and sulphates.

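These conclusions differ substantially from the K-Means results. K-Means grouped higher quality with higher fixed acidity, citric acid, residual sugar, and sulphates, whereas hierarchical clustering groups higher quality with higher volatile acidity, pH, and alcohol, and places citric acid and sulphates in the lower-quality cluster.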
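Because the second cell overwrites `wine['cluster']`, the two partitions are never compared row by row. A minimal sketch of such a comparison, using a cross-tabulation and the adjusted Rand index, is given here; it runs on synthetic blobs as a stand-in for `X_scaled`, so the printed numbers say nothing about the wine data.

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized wine features (assumption: 11 columns)
X_demo, _ = make_blobs(n_samples=200, centers=2, n_features=11, random_state=1)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Same settings as the notebook: K-Means with k=2, complete linkage cut into 2 groups
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_demo_scaled)
hier_labels = fcluster(linkage(X_demo_scaled, method="complete"), 2, criterion="maxclust")

# Rows: K-Means clusters; columns: hierarchical clusters
print(pd.crosstab(kmeans_labels, hier_labels, rownames=["kmeans"], colnames=["hierarchical"]))

# The adjusted Rand index ignores label names: 1.0 means identical partitions,
# values near 0 mean agreement no better than chance
ari = adjusted_rand_score(kmeans_labels, hier_labels)
print(f"Adjusted Rand index: {ari:.3f}")
```

In the notebook, storing the two results in separate columns (for example `wine['cluster_kmeans']` and `wine['cluster_hier']`, both hypothetical names) would allow the same comparison on the real data.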

### 3.
Use Principal Components Analysis to reduce the dimensions of your data. How much of the variation in your data is explained by the first two principal components? How might you use the first two components to do supervised learning on some other variable tied to wine (e.g., wine price)?

In [0]:
from sklearn.decomposition import PCA
import numpy as np

pca = PCA()
X_pca = pca.fit_transform(X_scaled)

explained_variance_ratio = pca.explained_variance_ratio_

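print(f"Explained Variance Ratio for PC1: {explained_variance_ratio[0]:.2f}")
print(f"Explained Variance Ratio for PC2: {explained_variance_ratio[1]:.2f}")
# Combined share of variance captured by the first two components
print(f"Explained Variance Ratio for PC1 + PC2: {explained_variance_ratio[:2].sum():.2f}")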

cumulative_explained_variance = np.cumsum(explained_variance_ratio)
plt.plot(range(1, len(cumulative_explained_variance) + 1), cumulative_explained_variance, marker='o')
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance vs. Number of Principal Components')
plt.show()
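One way to use the first two components for supervised learning is principal component regression: standardize the features, project them onto PC1 and PC2, and fit a regression of the target on those two scores. The sketch below is a hedged illustration. The dataset contains no price column, so `price_demo` is a hypothetical target and all data here are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 rows, 11 features, and a made-up 'price' target
X_demo = rng.normal(size=(200, 11))
X_demo[:, 1] = X_demo[:, 0] + 0.1 * rng.normal(size=200)  # correlated pair of features
price_demo = 10 + 3 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# Scaling and PCA sit inside the pipeline, so each cross-validation fold
# fits them on its own training rows only
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
r2_scores = cross_val_score(pcr, X_demo, price_demo, cv=5, scoring="r2")
print(f"Mean cross-validated R^2: {r2_scores.mean():.2f}")

# Refit on all rows to inspect how much variance the two components retain
pcr.fit(X_demo, price_demo)
two_pc_share = pcr.named_steps["pca"].explained_variance_ratio_.sum()
print(f"Variance explained by PC1 + PC2: {two_pc_share:.2%}")
```

With the real data, the unscaled feature table `X` would replace `X_demo` and an actual price column would replace `price_demo`. If the first two components explain only a modest share of the variance, a regression on them may discard information that is useful for prediction.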