## Unsupervised learning for red wine characteristics

In [0]:
import pandas as pd

wine = pd.read_csv("winequality-red.csv", sep=';')
wine.head()

### 1. 
Use K-Means cluster analysis to identify clusters of observations that have high and low values of wine quality. (Assume all variables are continuous.)

Describe variables that cluster with higher values of wine quality. Describe variables that cluster with lower values of wine quality.

If you want to make a good bottle of wine, then what characteristics are most important according to this analysis?

In [0]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

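# Cluster on the physicochemical features only; quality is held out for interpretation.
# errors='ignore' keeps the cell safe to re-run once the 'cluster' column exists.
X = wine.drop(columns=['quality', 'cluster'], errors='ignore')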
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
num_clusters = 2

kmeans = KMeans(n_clusters=num_clusters, random_state=42)
wine['cluster'] = kmeans.fit_predict(X_scaled)

cluster_summary = wine.groupby('cluster').mean()

# Print summary statistics for each cluster
print(cluster_summary)

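# Visualize the clusters on the first two standardized features only
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=wine['cluster'], cmap='viridis')
# Cluster centers live in the same standardized space as X_scaled
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', s=300, c='red')
plt.title('K-Means Clustering Result')
plt.xlabel(f'{X.columns[0]} (standardized)')
plt.ylabel(f'{X.columns[1]} (standardized)')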
plt.show()

The cluster with higher mean wine quality also has higher mean fixed acidity, citric acid, residual sugar, and sulphates. The cluster with lower mean wine quality has higher mean volatile acidity, free sulfur dioxide, total sulfur dioxide, and pH.
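Raw cluster means are hard to compare because the variables are measured in different units. A minimal sketch of one way to put them on a common scale is shown here; the helper name `cluster_profile` and the toy data are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def cluster_profile(df, labels, target="quality"):
    """Per-cluster mean of each z-scored feature, plus the raw mean of `target`."""
    labels = np.asarray(labels)
    features = df.drop(columns=[target])
    # z-score each column so means are comparable across different units
    z = (features - features.mean()) / features.std(ddof=0)
    profile = z.groupby(labels).mean()
    # Keep the target in its original units for easy reading
    profile[target] = df[target].groupby(labels).mean()
    return profile


# Tiny made-up data set (NOT the wine data), used only to show the output shape.
toy = pd.DataFrame({
    "citric acid": [0.50, 0.45, 0.05, 0.10],
    "pH": [3.1, 3.2, 3.5, 3.6],
    "quality": [7, 6, 5, 4],
})
toy_labels = [0, 0, 1, 1]  # hypothetical cluster assignments
profile = cluster_profile(toy, toy_labels)
print(profile.round(2))
```

With the notebook's data, the call would presumably look like `cluster_profile(wine.drop(columns='cluster'), wine['cluster'])`. Positive entries mark features that sit above the overall average in that cluster, negative entries those below it.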

According to this analysis, the characteristics that matter most for the quality of a bottle of wine are those whose means differ most between the two clusters: fixed acidity, citric acid, free sulfur dioxide, and total sulfur dioxide.
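One way to make "differ most between the two clusters" concrete is to rank features by the absolute gap between the two cluster centers in standardized units. The sketch below assumes exactly two clusters; the function name `rank_features_by_center_gap` and the synthetic data are hypothetical, and `n_init=10` is an illustrative choice rather than the notebook's setting.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def rank_features_by_center_gap(features, random_state=42):
    """Rank columns by the absolute gap between two k-means centers (z-score units)."""
    scaled = StandardScaler().fit_transform(features)
    km = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(scaled)
    # Centers are in standardized space, so gaps are comparable across features
    gap = np.abs(km.cluster_centers_[0] - km.cluster_centers_[1])
    return pd.Series(gap, index=features.columns).sort_values(ascending=False)


# Synthetic data (NOT the wine data): "separating" splits into two groups, "noise" does not.
rng = np.random.default_rng(0)
toy_features = pd.DataFrame({
    "separating": np.concatenate([rng.normal(0, 1, 50), rng.normal(10, 1, 50)]),
    "noise": rng.normal(0, 1, 100),
})
ranking = rank_features_by_center_gap(toy_features)
print(ranking)
```

Applied to the notebook's data, the input would be the feature columns only, for example `wine.drop(columns=['quality', 'cluster'])`. A large gap means the feature separates the clusters strongly; it does not by itself show that the feature drives quality.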