# Week 8: Applying Machine Learning
# Additional Exercises

In this notebook, you will explore new machine learning techniques. Below, we have imported the necessary modules/packages and prepared the data the same way as in Weeks 5-7. Run the code below for setup.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv('bc_data.csv', index_col=0)

# Data cleaning
# remove the 'Unnamed: 32' column
df = df.drop('Unnamed: 32', axis=1)

# encode target feature to binary class and split target/predictor vars
y = df["diagnosis"].map({"B" : 0, "M" : 1})
X = df.drop("diagnosis", axis = 1)

### 1. Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique. It transforms high-dimensional data into a lower-dimensional form while preserving as much variability (or information) as possible. PCA achieves this by identifying the principal components of the data, which are the directions of maximum variance.

- **Goal:** Reduce the number of features (dimensions) while retaining the most important information in the data.
- **How it works:**
  1. **Standardization:** Make sure all features have the same scale (mean 0, variance 1), to ensure no variable dominates.
  2. **Calculate Covariance Matrix:** This helps to understand the relationships between the variables.
  3. **Compute Eigenvectors and Eigenvalues:** The eigenvectors correspond to the directions of maximum variance, and the eigenvalues indicate how much variance each eigenvector captures.
  4. **Sort Eigenvectors:** Sort them by eigenvalue in descending order. The top eigenvectors become the principal components. Choose which components to keep (low eigenvalue = less significant).
  5. **Transform the Data:** Project the data onto the new subspace formed by the principal components.

- **Use Cases:**
  - Reducing the dimensionality of large datasets to improve performance in downstream tasks.
  - Visualizing high-dimensional data (often in 2D or 3D).
  - Noise reduction by keeping only the most significant components.
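
The five steps listed under "How it works" can be sketched directly in NumPy. This is a minimal illustration under stated assumptions, not how scikit-learn implements `PCA` (which uses an SVD-based solver); the function name `pca_from_scratch` and the toy data are hypothetical and exist only for this example.

```python
import numpy as np

def pca_from_scratch(X, n_components=2):
    """Illustrative PCA via eigendecomposition of the covariance matrix."""
    # 1. Standardize each feature to mean 0 and variance 1
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix between features (shape: n_features x n_features)
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigenvalues and eigenvectors; eigh is meant for symmetric matrices
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Sort by eigenvalue, largest first
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 5. Project the data onto the top n_components eigenvectors
    X_proj = X_std @ eigvecs[:, :n_components]
    # Share of total variance captured by each kept component
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return X_proj, explained_ratio

# Hypothetical toy data: three features, the first two strongly correlated
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X_toy = np.column_stack([a, 2 * a + rng.normal(scale=0.1, size=200), rng.normal(size=200)])

X_proj, ratio = pca_from_scratch(X_toy, n_components=2)
print(X_proj.shape)
print(ratio)
```

Because the first two toy features carry nearly the same information, the first component should capture roughly two thirds of the total variance. The signs of the projected coordinates may differ from those returned by scikit-learn, since an eigenvector is only defined up to sign.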


In [None]:
# Standardize the data (important for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA - reduce the data to 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Visualize the result
plt.figure(figsize=(8,6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.7)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Breast Cancer Dataset')
plt.colorbar(scatter, label='Diagnosis (0 = Benign, 1 = Malignant)')
plt.grid(True)
plt.tight_layout()
plt.show()

### 2. K-Means Clustering
K-means is an unsupervised machine learning algorithm used for clustering. It groups data into a predefined number of clusters based on similarity.

- **Goal:** Partition the data into K distinct clusters such that the data points within each cluster are as similar as possible.
- **How it works:**
  1. **Initialization:** Randomly select K points as the initial cluster centroids.
  2. **Assignment Step:** Assign each data point to the nearest centroid.
  3. **Update Step:** Recalculate the centroids by taking the mean of all data points assigned to each cluster.
  4. **Repeat:** Continue steps 2 and 3 until the centroids stop changing or a set number of iterations is reached.

- **Use Cases:**
  - Grouping similar data points for pattern recognition, like customer segmentation, document classification, etc.
  - Image compression and anomaly detection.
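
The four steps listed under "How it works" can be sketched in a few lines of NumPy. This is a minimal illustration, assuming Euclidean distance and initial centroids drawn from the data points; `kmeans_from_scratch` and the toy blobs are hypothetical names and data for this example, and scikit-learn's `KMeans` uses a more careful initialization by default.

```python
import numpy as np

def kmeans_from_scratch(X, k, n_iters=100, seed=0):
    """Illustrative K-means: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: mean of the points in each cluster
        #    (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated 2D blobs of 50 points each
rng = np.random.default_rng(1)
X_blobs = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

labels, centroids = kmeans_from_scratch(X_blobs, k=2)
print(centroids)
```

The result depends on the random initialization, which is why library implementations typically run the algorithm several times and keep the best run.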


Clustering is used to discover natural groupings or structure in the data when you don't have predefined labels. It can also serve as a check on a dimensionality reduction technique: if the clusters found in the reduced space line up with known labels, the reduction has likely kept the structure that matters. Let's try running K-means on our PCA data. We will cluster with 2 centroids since the target feature (y) has 2 classes.

In [None]:
# Apply K-means clustering (we use K=2 because there are two classes: Benign and Malignant)
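# n_init is set explicitly so the result does not depend on the installed scikit-learn version's default
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)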
kmeans.fit(X_pca)

# Get the predicted cluster labels
y_kmeans = kmeans.labels_

# Visualize the KMeans clusters
plt.figure(figsize=(8,6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_kmeans, cmap='viridis', alpha=0.7)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('K-Means Clustering on PCA-Reduced Breast Cancer Data')
plt.colorbar(scatter, label='Cluster')
plt.grid(True)
plt.tight_layout()
plt.show()

# Print the cluster centers
print("Cluster Centers (in PCA-reduced space):")
print(kmeans.cluster_centers_)


Let's plot the results of PCA with the true labels and compare them side-by-side with the KMeans clustering results.

In [None]:
# Get cluster centroids from previously fitted KMeans
centroids = kmeans.cluster_centers_

# Set up figure and axes
fig, axs = plt.subplots(1, 2, figsize=(14, 6))

# --- Plot 1: KMeans Clusters ---
scatter1 = axs[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y_kmeans, cmap='viridis', alpha=0.7, edgecolor='k')
axs[0].scatter(centroids[:, 0], centroids[:, 1], c='red', marker='X', s=200, label='Centroids')
axs[0].set_title('KMeans Clustering (PCA Space)', fontsize=14)
axs[0].set_xlabel('Principal Component 1')
axs[0].set_ylabel('Principal Component 2')
axs[0].legend()
axs[0].grid(True)

# --- Plot 2: True Labels ---
scatter2 = axs[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.7, edgecolor='k')
axs[1].set_title('True Labels (PCA Space)', fontsize=14)
axs[1].set_xlabel('Principal Component 1')
axs[1].set_ylabel('Principal Component 2')
axs[1].grid(True)

# Layout adjustment
plt.tight_layout()
plt.show()


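From the plots above, the K-means clusters closely resemble the true labels, which suggests that the first two principal components preserved most of the structure separating the two classes. Note that cluster IDs are arbitrary, so the colours in the two panels may be swapped even when the groupings agree.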
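
A visual comparison can be backed up with numbers. The sketch below reports how much variance the two components retain and how well the clusters agree with the true labels, using the adjusted Rand index (1 = identical groupings, around 0 = chance level, unaffected by swapped cluster IDs). It assumes that scikit-learn's bundled `load_breast_cancer` data can stand in for `bc_data.csv`; the variable names (`X_bc`, `y_bc`, and so on) are chosen for this example so that they do not overwrite the notebook's own variables.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset is used here in place of bc_data.csv
X_bc, y_bc = load_breast_cancer(return_X_y=True)

# Same pipeline as above: standardize, reduce to 2 components, cluster with K=2
X_bc_scaled = StandardScaler().fit_transform(X_bc)
pca_bc = PCA(n_components=2)
X_bc_pca = pca_bc.fit_transform(X_bc_scaled)
clusters_bc = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_bc_pca)

# Fraction of the total variance kept by each principal component
explained = pca_bc.explained_variance_ratio_
print("Explained variance ratio:", explained, "total:", explained.sum())

# Agreement between cluster assignments and true labels
ari = adjusted_rand_score(y_bc, clusters_bc)
print("Adjusted Rand index:", ari)
```

With the notebook's own variables, the equivalent calls would be `pca.explained_variance_ratio_` and `adjusted_rand_score(y, y_kmeans)`.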

**Q*1: Run PCA and K-means on the wine dataset (refer to [sklearn.datasets.load_wine](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html)) and plot your results.**

> Hint: Consider the number of centroids (K) you set.

<span style="background-color: #FFD700">**Write your code below**</span> 

In [None]:
# Import necessary libraries
from sklearn.datasets import load_wine

# Load the wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Standardize the data (important for PCA)


# Apply PCA - reduce the data to 2 components


# Visualize the PCA result


In [None]:
# Apply K-means clustering (we use K=3)


# Get the predicted cluster labels


# Visualize the K-Means clusters


# Print the cluster centers



## Conclusion

In this module, we explored two fundamental techniques in unsupervised learning: Principal Component Analysis (PCA) for dimensionality reduction and K-Means Clustering for grouping similar data points. PCA allowed us to reduce the complexity of high-dimensional data while retaining most of its variance, making patterns more interpretable and computationally manageable. K-Means, on the other hand, enabled us to discover inherent structures within the data by assigning observations to clusters based on similarity.

Together, these methods demonstrate a powerful workflow: using PCA to simplify data before applying K-Means can enhance clustering performance and visualization. Understanding and combining these techniques is a valuable step in making sense of complex datasets, forming the foundation for more advanced data analysis and machine learning tasks. In HMB301, we will explore further techniques for dimensionality reduction and clustering.