In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("1-235e6.ipynb")

# K-means Clustering and Dimensionality Reduction on Wine Dataset

## Introduction

In this problem set, we will revisit clustering and dimensionality reduction by applying k-means and Principal Component Analysis (PCA) to the Wine dataset. As you work through the notebook, please follow the sequence and answer the questions embedded along the way. We'll also visualize the results, interpret them, and discuss their implications for a hypothetical wine retailer. Please keep your written answers to a maximum of 300 words.


Let's begin by importing the necessary packages and loading the Wine dataset. 

If you encounter any issues with missing packages, install them by running `%pip install <package_name>`, for example `%pip install matplotlib`. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

%matplotlib inline
plt.style.use('ggplot')

# Load the Wine dataset
wine_data = load_wine()

df = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)
df['label'] = wine_data.target
display(df.head())  # a bare expression in the middle of a cell is not rendered, so display it explicitly

print(wine_data.DESCR)


In [None]:
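X_wine = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)

# Plot a histogram for every column in df (the 13 features plus the label)
features = df.columns.to_list()
plt.figure(figsize=(20, 10))
for i, feature in enumerate(features):
    plt.subplot(5, 3, i + 1)
    plt.hist(df[feature], bins=20, color='green', alpha=0.6, edgecolor='black', density=True)
    plt.title(feature)  # label each panel so the distributions can be told apart
# Adjust the spacing once, after all subplots have been drawn
plt.tight_layout()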

Before applying K-means clustering, we need to standardize the data to ensure all features contribute equally to the clustering process.
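
As a rough illustration of what standardization does, the sketch below computes z-scores by hand and compares them with `StandardScaler`. The array `X_toy` and its values are made up for this example and are not part of the Wine dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): two features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Manual z-score: subtract each column's mean, divide by its standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), NumPy's default.
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

X_sklearn = StandardScaler().fit_transform(X_toy)

print(np.allclose(X_manual, X_sklearn))
print(X_manual.mean(axis=0), X_manual.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates the Euclidean distances that k-means relies on.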

In [None]:
# Standardize the data
scaler = StandardScaler()
X_wine_scaled = scaler.fit_transform(X_wine)

## K-means Clustering

K-means clustering is an unsupervised machine learning algorithm used to partition a dataset into $k$ clusters. Each observation belongs to the cluster with the nearest mean (centroid). 

K-means clustering can be used in various scenarios, such as customer segmentation in marketing, anomaly detection, or document clustering. In our case, we'll use it to group similar wines together based on their characteristics.
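
To make the "nearest mean" idea concrete, here is a minimal sketch of the classic assign-then-update loop (often called Lloyd's algorithm) on synthetic data. It is a teaching sketch only: the function `lloyd_kmeans` and the toy data are hypothetical, and scikit-learn's `KMeans` uses a more careful initialization and implementation.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain assign/update k-means loop; a sketch, not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids as k distinct points drawn from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the assignments are stable
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs (hypothetical data).
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
                   rng.normal(5, 0.5, size=(20, 2))])
labels, centroids = lloyd_kmeans(X_toy, k=2)
print(labels)
print(centroids)
```

On blobs this far apart, the loop should place one centroid near each blob; on real data the result can depend on the starting centroids, which is why `KMeans` accepts a `random_state`.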

In [None]:
# Perform K-means clustering with 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X_wine_scaled)

<!-- BEGIN QUESTION -->

### Question 1 (0.5 points)

Produce a scatterplot with a combination of two features from the dataset, colored to show the k-means clusters. Choose a combination of features which shows the three clusters as relatively distinct.

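Hint: consider alcohol, malic acid, flavanoids and color intensity (columns `alcohol`, `malic_acid`, `flavanoids` and `color_intensity`; note the dataset's spelling of `flavanoids`).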

In [None]:
# Choose two features for visualization
feature_1 = ...
feature_2 = ...


# Create a scatter plot for the clusters
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_wine[feature_1], X_wine[feature_2], c=kmeans_labels, cmap='viridis', s=50)
plt.title('K-means Clustering on Wine Dataset')
plt.xlabel(feature_1.capitalize())
plt.ylabel(feature_2.capitalize())
plt.colorbar(scatter, label='Cluster Label')
plt.show()

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 2 (2 points)
Based on the plot, describe the characteristics of each cluster in terms of your features. How well-separated are the clusters?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Principal Component Analysis (PCA)



Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms data into a new coordinate system. PCA takes a complex, multi-dimensional dataset and finds a simpler way to represent it that still keeps key information. It's often used to visualize high-dimensional data in 2D or 3D plots, reduce noise in data, compress data while minimizing information loss, or prepare data for machine learning algorithms.

Let's apply PCA to our wine dataset and reduce it to 4 dimensions.
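
For intuition about what PCA computes, the sketch below derives the principal directions and explained-variance ratios from a singular value decomposition and checks them against scikit-learn. The data `X_toy` is synthetic and the variable names are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data (hypothetical), built by mixing independent columns.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                               [1.0, 1.0, 0.0],
                                               [0.0, 0.5, 0.2]])
X_centred = X_toy - X_toy.mean(axis=0)

# Rows of Vt are the principal directions; squared singular values give the variances.
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
variances = S**2 / (len(X_toy) - 1)
ratio_manual = variances / variances.sum()

# Projecting onto the first two directions gives the reduced 2D representation.
X_projected = X_centred @ Vt[:2].T

pca_toy = PCA(n_components=2).fit(X_toy)
print(ratio_manual[:2], pca_toy.explained_variance_ratio_)
print(np.cumsum(ratio_manual))  # cumulative share of variance per component count
```

The projected coordinates match `pca_toy.transform(X_toy)` up to the sign of each component, since the direction of a principal axis is only defined up to a flip.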

In [None]:
import seaborn as sns

n_components = 4
pca = PCA(n_components=n_components, random_state=211)
X_pca = pca.fit_transform(X_wine_scaled)

explained_variance = pca.explained_variance_ratio_
print(f'Explained variance by PCA components: {explained_variance}')


component_nums = list(range(1, n_components+1))
sns.lineplot(x=component_nums, y=np.cumsum(explained_variance))
ax = sns.scatterplot(x=component_nums, y=np.cumsum(explained_variance))
ax.set_xlabel('Number of Components')
ax.set_ylabel('Cumulative Explained Variance Ratio')
ax.set_title('Cumulative Explained Variance vs. Number of PCA Components')
ax.set(xticks=component_nums)

### Question 3 (0.5 points)

What is the smallest number of components needed to account for 50% of the variance?


In [None]:
n_components_which_account_for_50_percent_variance = ...

## K-means Clustering on PCA-reduced Data

Now, let's perform K-means clustering on the PCA-reduced dataset and visualize the results.

In [None]:
# Perform K-means on the reduced PCA dataset
kmeans_pca = KMeans(n_clusters=3, random_state=42)
kmeans_labels_pca = kmeans_pca.fit_predict(X_pca)

# Scatter plot of the clusters in the PCA-reduced space
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=kmeans_labels_pca, cmap='viridis', s=50)
plt.title('K-means Clustering after PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(scatter, label='Cluster Label')

# Get the centroids of the clusters after PCA
centroids_pca = kmeans_pca.cluster_centers_

# Plot the centroids on the scatter plot. This will help you visualise how well separated the data is around each centroid
plt.scatter(centroids_pca[:, 0], centroids_pca[:, 1], s=200, c='red', marker='X', label='Centroids')
plt.legend()
plt.show()

We're also going to compare the quality of clustering before and after PCA using the silhouette score:
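
As a reminder of what the score measures: for each point, let a be the mean distance to the other points in its own cluster and b the mean distance to the points of the nearest other cluster; the silhouette coefficient is (b - a) / max(a, b), and the score is its average over all points. The sketch below computes this by hand on a tiny made-up dataset and compares it with `silhouette_score`. The helper `silhouette_manual` is hypothetical and assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Four points in two tight, well-separated clusters (hypothetical data).
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels_toy = np.array([0, 0, 1, 1])

def silhouette_manual(X, labels):
    """Mean silhouette coefficient computed directly from the definition."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        a = dist[i, same].mean()  # mean intra-cluster distance
        # Mean distance to each other cluster; keep the nearest one.
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

score_manual = silhouette_manual(X_toy, labels_toy)
score_sklearn = silhouette_score(X_toy, labels_toy)
print(score_manual, score_sklearn)
```

Scores close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.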

In [None]:
# K-means on the original dataset (without PCA)
kmeans_original = KMeans(n_clusters=3, random_state=42)
kmeans_labels_original = kmeans_original.fit_predict(X_wine_scaled)

# Compare the clustering results
silhouette_pca = silhouette_score(X_pca, kmeans_labels_pca)
silhouette_original = silhouette_score(X_wine_scaled, kmeans_labels_original)

print(f'Silhouette Score after PCA: {silhouette_pca:.4f}')
print(f'Silhouette Score on Original Data: {silhouette_original:.4f}')

<!-- BEGIN QUESTION -->

### Question 3 (1 point)
Compare the clustering results before and after PCA. How do they differ? 

What new observations can you make from the PCA-based clustering?

Do the silhouette scores agree with what you see on the plots?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Interpreting PCA Components

<!-- BEGIN QUESTION -->

### Question 4a (1.5 points)

After performing PCA on the Wine dataset, your goal is to display the top 5 contributing features for Principal Component 1 (PC1) and Principal Component 2 (PC2). Complete the code below to extract the features that contribute most to each of these principal components.



In [None]:
# Get the PCA component loadings
pca_component_loadings = pd.DataFrame(pca.components_, columns=wine_data.feature_names)
pca_component_loadings


In [None]:
#  Extract the first PC loadings from the DataFrame by indexing it
pc_1 = ...

# Sort the PC and take the first 5 values
top_5_features_pc_1 = ...

# Do the same thing for the second PC
top_5_features_pc_2 = ...
print(top_5_features_pc_1)
print(top_5_features_pc_2)

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 4b (0.5 points)

What do the positive and negative contributions signify?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Application to Wine Retail



<!-- BEGIN QUESTION -->

### Question 5 (1 point)
Based on the K-means clustering results and the PCA analysis, identify two features you would choose to create distinct sections in your wine store. Justify your selection and explain how you would use these features to create distinct sections in the store. Consider how clustering and data visualization can guide organizational decisions. This is a key aspect of data science — not only extracting insights from data but also applying them in a practical, business-oriented context.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 6 (0.5 points) 
Do you see any outliers in the K-means result? What do they signify? How would you handle them in a context similar to wine retail?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 7 (0 points)

Did you use an LLM like ChatGPT or Claude to assist in answering this problem set?

Write "No" if you did not.
Write "Yes" and paste a link to the transcript (e.g. https://chat.openai.com/share/5c14a304-1b7f-4fb9-b400-21e65ad545bb ) if you did.


_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### Question 8 (0 points)

Please use the link below to provide feedback on how well the assignment aligned with the concepts covered in class. Your input will help us improve and refine future assignments. 

Form Link - https://forms.gle/LtPwzFayDMUyBcay6

Did you fill out the feedback form? 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

# Submission

Follow the instructions below, then upload the ZIP file to bCourses.

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)