# **MENG 15100** Lab 3

Welcome to the third lab of MENG 15100: Machine Learning and Artificial Intelligence for Molecular Discovery and Engineering.

In this lab, you’ll learn the basics of **Dimensionality Reduction** and **Clustering**, powerful tools in Machine Learning workflows.

We’ll cover two major techniques:
- **Principal Component Analysis** (for Dimensionality Reduction)
- **k-means clustering**


This lab starts from first principles—**no prior coding experience is required.** For every exercise, you will be provided with baseline Python code, and you will only be asked to make minor edits or adaptations (e.g., change a variable’s value, adjust the number of iterations, or modify a plot’s formatting).

## Lab Structure and Grading
The lab is organized as follows:
- **Topics** – Broad units (e.g., *1. Introduction to Dimensionality Reduction, 2. Principal Component Analysis*).
- **Sections** – Subdivisions within each Topic (e.g., *1.1 Variance*).
- **Problems** – Each Section ends with a Problem to be completed in the Jupyter Notebook. Problems are marked with the ✅ character and list the number of points available, which indicates the level of effort the solution requires. Problems may involve:
  - Short-answer questions
  - Modifications to existing Python code
  - Note: Many Problems contain multiple tasks.

- **Custom code** – Many sections include custom code and interactive graphical user interfaces. The code implementing these functions is located in a custom library titled `menglab`.

## **MENG 15100** Lab 3

Lab 3 begins with **Section 0**, which asks you to read selected parts of a scientific paper. This section is designed to help you practice reading and interpreting research articles, a skill we’ll keep coming back to throughout the course.  

Please also plan ahead: although the reading is only partial, it is not something you can rush through. To understand the material properly, set aside around **30 minutes** to work carefully through Section 0.  

In summary:  
- Section 0 = reading one paper (30 minutes).  
- Sections 1–5 = coding in Python, best done during the lab session.  
- Do the reading early so your lab time can be dedicated to coding practice.  

## Table of Contents

**TIP**: An interactive table of contents is available on the left sidebar.

### 0. PCA in Scientific Literature
&emsp; 0.1 Paper (available on Canvas under Modules/Lab3)<br>

### 1. Introduction to Dimensionality Reduction

&emsp; 1.1 Projection: The simplest dimensionality reduction technique <br>

&emsp; 1.2 Variance: Measuring the importance of different dimensions <br>

&emsp; 1.3 Scree Plots and Explained Variance <br>

&emsp; 1.4 Drawbacks of Simple Projection<br>

### 2. Introduction to Principal Component Analysis

&emsp; 2.1 Interactive PCA: Step 1, Rotation <br>

&emsp; 2.2 Interactive PCA: Step 2, Projection <br>

&emsp; 2.3 PCA vs. Clustering <br>


### 3. PCA on Wines dataset

&emsp; 3.1 Import Wine Chemistry Dataset (sklearn) <br>

&emsp; 3.2 Standardizing the Data <br>

&emsp; 3.3 PCA on Wines Dataset <br>

&emsp; 3.4 Scree Plot and Cumulative Variance Explained <br>

&emsp; 3.5 Visualize Data Projected into PCA Dimensions <br>

### 4. Introduction to K-means clustering

&emsp; 4.1 Step 1: Initialize Cluster Centers <br>

&emsp; 4.2 Step 2: Assign Points to Nearest Cluster <br>

&emsp; 4.3 Step 3: Update Cluster Centers <br>

&emsp; 4.4 Step 4: Repeat! (Putting it all together) <br>

&emsp; 4.5 Determining Best k with Silhouette Scores <br>

### 5. K-means clustering on wines dataset.

&emsp; 5.1 Interactive K-means Clustering Demo <br>

&emsp; 5.2 WCSS and Elbow Plots <br>

&emsp; 5.3 Silhouette Score Plot <br>





# Imports (Execute Once)
Run the code cell below to install and import the modules necessary for this lab:

In [None]:
# Execute this cell once to import and install modules (may take several seconds)

# menglab library
%pip install -q --no-cache-dir --upgrade --force-reinstall \
  "git+https://github.com/SorenKyhl/MENG15100.git@Lab3#subdirectory=labs/L3/menglab3"

import menglab3 as menglab
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

import io, requests

# interactive gui settings
from google.colab import output
output.enable_custom_widget_manager()
import plotly.io as pio
pio.renderers.default = "colab"

# Plotting settings
import matplotlib as mpl
mpl.rcParams.update({
    "figure.figsize": (6, 4),
    "figure.dpi": 120,
    "axes.titlesize": 13,
    "axes.labelsize": 12,
    "axes.grid": True,
    "grid.linestyle": "--",
    "grid.alpha": 0.35,
    "lines.linewidth": 2.0,
    "lines.markersize": 6,
    "font.size": 12,
    "xtick.direction": "out",
    "ytick.direction": "out",
    "xtick.minor.visible": True,
    "ytick.minor.visible": True,
    "legend.frameon": False,
    "savefig.bbox": "tight",
})
%config InlineBackend.figure_format = 'retina'


# 0 PCA in Scientific Literature

Before we dive into applying Principal Component Analysis (PCA) ourselves, let’s explore how this method is used in real research within the molecular sciences. PCA is not just a mathematical tool—it’s a powerful way to uncover structure in complex chemical datasets such as spectra, fingerprints, or compositional measurements.

In this exercise, we will look at a scientific study that used PCA to distinguish wines by their country of origin based on molecular characteristics. This type of analysis is an excellent example of how unsupervised learning can reveal meaningful chemical and geographical patterns hidden within high-dimensional data.

## 0.1 Paper: (available on Canvas under Modules/Lab3)

Hu, X.-Z., Liu, S.-Q., Li, X.-H., Wang, C.-X., Ni, X.-L., Liu, X., Wang, Y., Liu, Y., & Xu, C.-H. (2019). *Geographical origin traceability of Cabernet Sauvignon wines based on infrared fingerprint technology combined with chemometrics.* Scientific Reports, 9, 8256. https://doi.org/10.1038/s41598-019-44521-8


### ✅ Exercise 1 [8 Points]: Skim Paper


**Tasks:** Skim the specified sections of the paper and answer the following questions. Focus on identifying key ideas rather than technical details.

Skim the **Abstract** and **Introduction** of the paper carefully, then answer the following questions.

1. What **molecular property, experimental method, or dataset** was used as the input to PCA to classify wines by country of origin?

2. What **three countries** were the Cabernet Sauvignon wines produced in?

3. What **three algorithms** were used to classify the wines by country of origin?

4. What is the main practical motivation for classifying wines by geographical region?

Skim the **Figures** of the paper quickly, then answer the following questions:

5. Which figure displays example input data used to cluster wines by origin?

6. Which figure displays wines clustered by principal component analysis in 2D?

7. Which figure displays wines clustered by principal component analysis in 3D?

8. Although this paper did not use k-means clustering, imagine you were to cluster the wines in PCA space (as in the plots from questions 6 and 7). How many clusters do you think would best describe the data, and why?


1. What **molecular property, experimental method, or dataset** was used as the input to PCA to classify wines by country of origin?

YOUR ANSWER HERE


2. What **three countries** were the Cabernet Sauvignon wines produced in?

YOUR ANSWER HERE


3. What **three algorithms** were used to classify the wines by country of origin?

YOUR ANSWER HERE


4. What is the main practical motivation for classifying wines by geographical region?

YOUR ANSWER HERE

5. Which figure displays example input data used to cluster wines by origin?

YOUR ANSWER HERE

6. Which figure displays wines clustered by principal component analysis in 2D?

YOUR ANSWER HERE


7. Which figure displays wines clustered by principal component analysis in 3D?

YOUR ANSWER HERE

8. Although this paper did not use k-means clustering, imagine you were to cluster the wines in PCA space (as in the plots from questions 6 and 7). How many clusters do you think would best describe the data, and why?

YOUR ANSWER HERE

# 1 Introduction to Dimensionality Reduction

In many areas of **machine learning** and **artificial intelligence**, we work with datasets that have **many dimensions** — sometimes dozens, hundreds, or even thousands.  
Each “dimension” represents a different measured property or feature of our samples.

For example, in the molecular dataset from Lab 2, each compound was described by a set of numerical features such as:  
- Molecular weight  
- Boiling point  
- Surface area  
- Branching index  
- Number of hydrogen bond donors or acceptors  

These features together form a **high-dimensional feature space** — one in which each molecule is represented as a single point defined by all its measured properties.

---

### **The Challenge of High Dimensionality**

When we first look at a dataset like this, several natural questions arise:

- Do the data points form **distinct groups or clusters**?  
- What **features** or **properties** drive those clusters?  
- How many **clusters or patterns** might exist in the data?  

However, there’s a problem: we humans can **only visualize data directly in 1D, 2D or 3D**.  
Once our dataset has more than three features, it becomes impossible to “see” its structure directly. Even though computational models can operate in hundreds of dimensions, our intuition cannot.


---

### **What Is Dimensionality Reduction?**

**Dimensionality reduction** refers to a family of techniques that **compress high-dimensional data** into a **lower-dimensional space** while preserving as much meaningful structure as possible.  

These reduced dimensions are sometimes called **latent dimensions** (or, in PCA, **principal components**), and they often capture the **essential variation** or **patterns** in the data.

In other words, dimensionality reduction tries to answer:

> *“Can we find a smaller set of new features that summarize what really matters in the data?”*

---

### **Why Use Dimensionality Reduction?**

Dimensionality reduction helps us:
- **Visualize** complex datasets in 2D or 3D plots  
- **Reveal clusters or trends** that may not be obvious in the original feature space  
- **Remove noise or redundancy** by combining correlated variables  
- **Simplify models** and reduce computational cost  

However, there is always a **trade-off**. By compressing data into fewer dimensions, we **lose some information**. The goal is to **minimize that loss** while still capturing the main structure of the dataset.

---

### **In This Lab**

In this lab, we’ll focus on two key techniques that often go hand-in-hand:

1. **Principal Component Analysis (PCA)** — a method for finding new coordinate axes that capture the greatest variation in the data.  
2. **k-Means Clustering** — an algorithm for grouping similar data points.

We’ll begin by using PCA to reduce the dimensionality of a molecular dataset (the *Wine* dataset from `scikit-learn`), and then explore whether those reduced dimensions reveal meaningful groupings using k-means.



## 1.1 Projection: the simplest dimensionality reduction technique

The simplest way to perform **dimensionality reduction** is by using a **projection** — that is, selecting only a few of the available features (axes) to visualize.

In this interactive example, we have a *fictitious dataset* with three numerical features: **x**, **y**, and **z**. These form a 3D coordinate system in which each data point has a specific position in space.

However, let's imagine we can only visualize **two dimensions** at a time. This means that when we project a high-dimensional dataset into 2D, **important structure might be lost or hidden**.

Use the 3D plot and the projection controls to explore how each view changes your perception of the data.
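
To make the idea concrete, here is a minimal sketch of an axis-aligned projection on a tiny made-up table. The values and the name `toy_points` are illustrative assumptions, not the dataset used in the interactive tool. Projecting onto a coordinate plane simply means keeping two of the columns and discarding the third:

```python
import pandas as pd

# Hypothetical 3D points (illustrative values only, not the lab dataset)
toy_points = pd.DataFrame({
    "x": [0.1, -0.3, 0.2, -0.1],
    "y": [2.0, 2.1, -2.0, -1.9],
    "z": [0.4, 0.0, -0.2, 0.3],
})

# Projection onto the XY plane: keep x and y, ignore z
xy_projection = toy_points[["x", "y"]]

# Projection onto the XZ plane: keep x and z, ignore y
xz_projection = toy_points[["x", "z"]]

print(xy_projection.shape)  # 4 points, now described by 2 features
print(xz_projection)
```

In this toy table the two groups of points differ only in `y`, so the XZ projection discards exactly the feature that separates them, while the XY projection keeps it.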

---

### **How to Use the Interactive Tool**

The interactive figure below allows you to visualize this dataset and experiment with **different 2D projections** of the same 3D data.

1. **Run the code cell below** to launch the interactive visualization.  
   You’ll see two panels:
   - **Left panel** — a 3D scatter plot of the dataset.  
     - You can **rotate** the view by dragging your mouse.  
     - You can **zoom** in/out using the scroll wheel or trackpad.  
   - **Right panel** — a 2D projection (or “flattened” view) of the same data.

2. **Use the “Project onto ...” dropdown menu** above the figure to select which pair of axes to visualize:
   - **XY projection:** shows what the data looks like if we ignore the z-axis.  
   - **XZ projection:** shows what the data looks like if we ignore the y-axis.  
   - **YZ projection:** shows what the data looks like if we ignore the x-axis.

---

In [None]:
menglab.interactive_projection()

### ✅ Exercise 2 [5 Points]: Interactive Dimensionality Reduction (Projection)

**Tasks:** Use the interactive projection plots above to answer the following questions:

1. **Visual interpretation**
   - When viewing the **3D plot**, what do you notice about the structure of the dataset?  
     - How many clusters are visible?  
     - How are they arranged in space?

2. **Projection onto the XY plane**
   - Switch the projection to **XY**.
   - How do the two clusters appear in this view?  
     - Do they seem to overlap or remain distinct?  
     - Why do you think that happens?

3. **Projection onto the XZ plane**
   - Switch the projection to **XZ**.
   - Are the clusters easier or harder to distinguish here compared to the XY projection?  
     - Which dimension (x, y, or z) seems to separate the data most clearly?

4. **Projection onto the YZ plane**
   - Switch to the **YZ** view.
   - How is this view similar to, and different from, the XZ projection?  
     - Which projection (XZ vs. YZ) preserves the most information (i.e., has the least loss of information)?

5. **Connecting to PCA**
   - Principal Component Analysis (PCA) is a technique that finds the *most informative* directions in the dataset.
     - Based on your prior analysis, list the three cardinal directions (x, y, z) in order of *most informative* to *least informative*.


1. When viewing the **3D plot**, what do you notice about the structure of the dataset? How many clusters are visible? How are they arranged in space?

YOUR SOLUTION HERE


2. **Projection onto the XY plane**. Switch the projection to **XY**. How do the two clusters appear in this view? Do they seem to overlap or remain distinct? Why do you think that happens?

YOUR SOLUTION HERE

3. **Projection onto the XZ plane**. Switch the projection to **XZ**. Are the clusters easier or harder to distinguish here compared to the XY projection? Which dimension (x, y, or z) seems to separate the data most clearly?


YOUR SOLUTION HERE

4. **Projection onto the YZ plane**. Switch to the **YZ** view. How is this view similar and different from the XZ projection?  Which projection (XZ vs. YZ) preserves the most information (i.e. has the least loss of information)?

YOUR SOLUTION HERE


5. **Connecting to PCA**. Principal Component Analysis (PCA) is a technique that finds the *most informative* directions in the dataset. Based on your prior analysis, list the three cardinal directions (x, y, z) in order of *most informative* to *least informative*.

YOUR SOLUTION HERE

## 1.2 Variance: Measuring the Importance of Different Dimensions

From the previous section, we saw that not all dimensions (or features) in our dataset are equally informative.  
Some dimensions — like **z** in our example — clearly help distinguish between clusters, while others — like **x** and **y** — do not.

So how can we *quantitatively* identify which dimensions are most important?

---

### **Understanding Variance**

**Variance** is a measure of how much the values in a dataset **spread out** from their average (mean).  
It tells us *how much change* or *variation* exists along a particular feature or dimension.

- A **high-variance** feature means that data points are widely dispersed along that axis.  
  → This feature may capture important differences or structure in the data.
- A **low-variance** feature means that data points are tightly grouped near the mean.  
  → This feature carries less distinctive information.

---

### **How Variance is Calculated**

To compute variance, we follow three simple steps:

1. **Compute the mean** of the data along one dimension  
   $$
   \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
   $$

2. **Compute the squared difference** of each point from the mean  
   $$
   (x_i - \bar{x})^2
   $$

3. **Average these squared differences**  
   $$
   \text{Variance} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
   $$

The larger the average squared difference, the more the data vary along that dimension.

**TIP:** The units of variance are the **square of the original units**, which is why we often take its square root (the **standard deviation**) when we want to measure spread in the same units as the data.

$$ \text{Standard Deviation} = \sqrt{\text{Variance}} $$
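
As a quick check of the formulas above, the sketch below uses NumPy's built-in `np.var` and `np.std` on a small made-up array (the values are illustrative, not from the lab dataset). By default `np.var` divides by $n$, matching the formula above. Passing `ddof=1` divides by $n - 1$ instead, which is the default for the pandas `.var()` method, so comparisons against pandas need `ddof=0` to agree with this definition.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # illustrative data, mean = 3

population_variance = np.var(values)        # divides by n (ddof=0 is the default)
sample_variance = np.var(values, ddof=1)    # divides by n - 1
standard_deviation = np.std(values)         # square root of the population variance

print(population_variance)  # 2.0
print(sample_variance)      # 2.5
print(standard_deviation)
```

The squared differences from the mean are 4, 1, 0, 1, and 4, which sum to 10. Dividing by $n = 5$ gives 2.0, and dividing by $n - 1 = 4$ gives 2.5.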

---

### **Let’s Explore Variance Numerically**

In the next step, we’ll calculate the variance of each feature in our dataset and compare them.

We’ll use simple Python operations to:
1. Compute the **mean** and **variance** for each dimension (`x`, `y`, and `z`).  
2. Interpret which dimension carries the most information.  
3. Visualize the differences in variance as a simple bar chart.

This will give us a quantitative understanding of which features vary most — a key first step before we later connect this idea to **dimensionality reduction methods** like PCA.


### ✅  Exercise 3 [8 Points]: Calculate Variance

**Tasks:** In this exercise, you’ll quantify how much the dataset varies along each dimension (`x`, `y`, and `z`).  
This will help you connect the visual patterns you observed earlier to a numerical measure of **spread** — the **variance**.

The dataset from **Exercise 2** (the interactive projection) has been preloaded into the variable `dataset`.

1. **Plot histograms of the dataset in the x, y, and z dimensions.**  

   A **histogram** is a plot that shows how data are distributed along a single dimension.  
   It divides the range of values into small intervals (called **bins**) and counts how many data points fall into each bin.  
   The taller the bar, the more data points lie in that range.

   In this way, a histogram shows the **density** or **frequency** of data along one axis.  
   You can think of it as a way of **projecting** a higher-dimensional dataset into **1D** —  
   for example, looking only at how the data spread along `x`, while ignoring `y` and `z`.

   **Hint:** You can access each dimension of the dataset like this:
   `dataset['x']` or `dataset.x`, and likewise for the y and z directions.

2. Modify the code below to implement the `variance` function, which takes in data from one dimension, and calculates the variance of the data in that dimension. Calculate the following intermediate variables:

  - `mean`: the mean of the input data
  - `difference`: The difference of each datapoint from the mean
  - `squared_difference`: the squared difference of each datapoint from the mean
  - `variance`: the mean squared difference.

    **Tip:** Use NumPy functions such as `np.mean` and `np.square` to do calculations on entire arrays of data at once.

3. Use the `variance` function you wrote in the previous task to calculate the variance of the data in the x, y, and z directions.

4. Execute the code cell to visualize the magnitude of the variance in the x, y, and z directions. How does the magnitude of the variance in these directions correspond to your answer from Exercise 2 ("Based on your prior analysis, list the three cardinal directions (x, y, z) in order of *most informative* to *least informative*")?


In [None]:
# Variables (provided)
# Run this cell to initialize the variables
variances = [0, 0, 0]
labels = ['x','y','z']

dataset = menglab.generate_projection_data()
dataset.head()

In [None]:
# Task 1
fig, ax = plt.subplots(1, 3, figsize=(12, 4), sharey=True)  # 1 row, 3 columns

data_x_dimension = NotImplemented ### YOUR SOLUTION HERE
data_y_dimension = NotImplemented ### YOUR SOLUTION HERE
data_z_dimension = NotImplemented ### YOUR SOLUTION HERE

# Plot histograms for each column
ax[0].hist(data_x_dimension, bins=25, color="steelblue", alpha=0.7)
ax[1].hist(data_y_dimension, bins=25, color="seagreen", alpha=0.7)
ax[2].hist(data_z_dimension, bins=25, color="indianred", alpha=0.7)

# Set axis limits for consistency
for i in range(3):
    ax[i].set_xlim(-4, 4)
    ax[i].set_ylim(0, None)
    ax[i].set_title(f"Distribution of {['x','y','z'][i].upper()}")
    ax[i].set_xlabel(['x','y','z'][i])
    ax[i].grid(alpha=0.3)
    if i == 0:
        ax[i].set_ylabel("Count")

fig.suptitle("Histograms along each dimension", fontsize=14)
fig.tight_layout()
plt.show()

In [None]:
# Task 2
def variance(data):
  """
  Compute the variance of data in one dimension.

  Parameters
  ----------
  data : array-like
      A one-dimensional list, NumPy array, or pandas Series containing numerical values.

  Returns
  -------
  variance: float
      The variance of the input data.
  """
  data = np.array(data) # convert input to numpy array
  mean = NotImplemented               ### YOUR SOLUTION HERE
  difference = NotImplemented         ### YOUR SOLUTION HERE
  squared_difference = NotImplemented ### YOUR SOLUTION HERE
  variance = NotImplemented           ### YOUR SOLUTION HERE
  return variance

In [None]:
# Task 3
variances[0] = NotImplemented ### YOUR SOLUTION HERE
variances[1] = NotImplemented ### YOUR SOLUTION HERE
variances[2] = NotImplemented ### YOUR SOLUTION HERE

# Check your answer - this assert statement should pass!
assert(np.allclose(variances, dataset[['x', 'y', 'z']].var(ddof=0)))
print("✔️ Test passed! Your variance calculation looks correct.")

In [None]:
# Task 4: plot for visualization
plt.bar(labels, variances)
plt.xlabel("dimension")
plt.ylabel("variance")

Task 4: Based on your prior analysis, list the three cardinal directions (x, y, z) in order of most informative to least informative. Provide a quantitative rationale for your answer.

YOUR SOLUTION HERE

## 1.3 Scree Plots and Explained Variance

So far, we’ve explored how different **dimensions** (x, y, and z) of our dataset can have different amounts of **variance** —  
some directions capture a lot of variation (information), while others capture very little.

We are almost ready to jump into **Principal Component Analysis (PCA)**, but it’s worth previewing a key idea from PCA that we can already apply to our simpler **projection-based** dimensionality reduction.

---

### **Explained Variance**

If we think of projecting our dataset onto one dimension (say, the x-axis),  
the **explained variance** tells us how much of the total spread in the dataset we still retain after that projection.

Mathematically, the explained variance ratio for a dimension (or component) is:

$$
\text{Explained Variance Ratio}_i = \frac{\text{Variance of Dimension } i}{\text{Total Variance in all Dimensions (x + y + z)}}
$$

These ratios tell us **how informative** each dimension is relative to the others.
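
As a worked example with made-up numbers (not the values from the lab dataset), suppose the variances along three dimensions are 6, 3, and 1. The total variance is $6 + 3 + 1 = 10$, so the explained variance ratios are:

$$
\frac{6}{10} = 0.6, \qquad \frac{3}{10} = 0.3, \qquad \frac{1}{10} = 0.1
$$

The ratios always sum to 1, and in this example the first dimension alone retains 60% of the total variance.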

---

### **Scree Plots**

A **Scree plot** is a simple plot (drawn as a line or bar chart) that shows how much variance is captured by each dimension (or later, each principal component in PCA).

- The **x-axis** shows each dimension (in this case, x, y, or z), **sorted** in order from highest variance explained to lowest variance explained.
- The **y-axis** shows the **explained variance ratio** — how much of the total variance that dimension captures.

Typically, the first dimension explains the most variance, and the curve **drops off** as we move to less informative dimensions.  

---

### **Cumulative Explained Variance**

We can also plot the **cumulative explained variance**, which shows how much total variance is captured as we include more dimensions:

$$
\text{Cumulative Explained Variance}_k = \sum_{i=1}^{k} \text{Explained Variance Ratio}_i
$$

For example:
- In the case where the first two dimensions ($k = 2$) capture **90%** of the total variance, most of the structure in the dataset can be understood just by looking at those two directions.
- In the case where the first 10 dimensions ($k = 10$) capture only **50%** of the total variance, a low-dimensional projection is unlikely to represent the data well.
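
The sketch below illustrates how cumulative explained variance can guide the choice of $k$. The ratios are made-up values for a hypothetical five-feature dataset (already sorted from largest to smallest), and the 90% threshold is an arbitrary choice for illustration, not a rule prescribed by this lab.

```python
import numpy as np

# Hypothetical explained variance ratios, sorted from largest to smallest
ratios = np.array([0.45, 0.25, 0.15, 0.10, 0.05])

# Running total: entry k-1 is the variance retained by the first k dimensions
cumulative = np.cumsum(ratios)

# Smallest number of dimensions whose cumulative ratio reaches the threshold
threshold = 0.90
k_needed = int(np.argmax(cumulative >= threshold)) + 1

print(np.round(cumulative, 2))
print(k_needed)  # 4
```

Here three dimensions retain 85% of the variance, which falls short of the threshold, so a fourth dimension is needed to reach 95%.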

---

### **Why This Matters**

Although we’re not performing PCA yet, this same idea —  
ranking dimensions by how much variance they explain —  
is at the core of **dimensionality reduction** techniques.

When we *do* move on to PCA, we’ll always make:
1. a **Scree plot** (to see how variance is distributed among components), and  
2. a **Cumulative explained variance plot** (to decide how many components to keep).

For now, we’ll create both of these plots using our simple **x–y–z projection** data.  
This will help us practice the same logic that PCA uses to identify the *most informative directions* in a dataset.



### ✅ Exercise 4 [8 Points]: Plot Explained Variance

**Tasks:** In the following tasks, we will calculate explained variance ratios and plot them in a Scree plot and Cumulative Explained Variance plot.

We will use the `variances` list calculated in Exercise 3, so make sure that variable is defined and calculated correctly!

1. First, sort the variances in each dimension so that the highest variance dimension is first. Modify the code to calculate `sorted_variances` from the `variances` list you calculated in Exercise 3.
- You can use the Python function `sorted(list)` to sort a list or NumPy array.
- You can pass the optional argument `reverse` to reverse the ordering of the sorted list: `sorted(list, reverse=True)`. Should `reverse` be `True` or `False` in this case?

2. Next, calculate the total variance across all dimensions, and then get the ratios for each dimension. Execute the code cell to display a Scree Plot.

3. Finally, calculate the cumulative variance explained. Execute the code cell to display the cumulative variance explained plot.

    **Tip:** Use the numpy function `np.cumsum(list)` to calculate the cumulative sum of a list of values

4. Examine the Scree Plot and Cumulative Variance Explained Plot. If you project onto only the **one** best dimension, how much variance will be explained/retained? If you project onto the **two** best dimensions, how much variance is explained/retained? Answer in the text box below.

    **Hint:** Can you use a code cell to print the value of python variables to answer this question rather than just reading off the plot?

In [None]:
# Task 1
sorted_variances = NotImplemented ### YOUR SOLUTION HERE
print("The sorted variances are: ", sorted_variances)

In [None]:
# Task 2
total_variance = NotImplemented            ### YOUR SOLUTION HERE
explained_variance_ratios = NotImplemented ### YOUR SOLUTION HERE

# Scree Plot
labels = ['z', 'x', 'y']
plt.bar(labels, explained_variance_ratios)
plt.xlabel("Dimension")
plt.ylabel("Explained Variance Ratios")
plt.title("Scree Plot")
plt.show()

In [None]:
# Task 3
cumulative_variance_ratios = NotImplemented ### YOUR SOLUTION HERE

cumulative_labels = ['z', 'z+x', 'z+x+y']
plt.plot(cumulative_labels, cumulative_variance_ratios, '-o')
plt.ylabel("Cumulative Explained Variance")
plt.title("Cumulative Explained Variance")


Task 4: If you project onto only the **one** best dimension, how much variance will be explained/retained? If you project onto the **two** best dimensions, how much variance is explained/retained?

**Hint:** Can you use a code cell to print the value of python variables to answer this question rather than just reading off the plot?

YOUR SOLUTION HERE

## 1.4 Drawbacks of Simple Projection

At first glance, it seems natural to look at complex data by simply projecting it — for example, plotting 3D points on a flat 2D screen using the x-y, x-z, or y-z planes. But what if that simple view hides something important?

Execute the interactive projection code below again, but now with a different, more problematic dataset.

In [None]:
menglab.interactive_projection(projection_problem=True)

### ✅ Exercise 5 [6 Points]: Problematic Projection

**Tasks:** Use the interactive plots above, including both the 3D scatter and its 2D projections, to answer the following questions. Code is provided below to load the new dataset into the variable `problematic_dataset`.

1. **Overall structure:**  
   - How many clusters do you observe in the full 3D view? Are they well separated in 3D?
   - Along what direction, roughly speaking, are they separated?

2. **Projection views:**  
   - Examine each 2D projection (XY, XZ, YZ).  
   - Do you still see two clusters in each projection, or do they appear merged?  
   - What is it about the shape of the clusters in the dataset that gives this result?

3. **Variances**.
    - Calculate the variance in each cardinal direction, and execute the code to plot the variances.

4. **Which dimension is best?** By this visualization, does any cardinal dimension have dramatically more variance explained than others? Is there a clear choice for the best cardinal dimension to project onto?






In [None]:
# Variables (provided)
# Run this cell to initialize the variables
problematic_variances = [0, 0, 0]
labels = ['x','y','z']

problematic_dataset = menglab.generate_projection_data(projection_problem=True)
problematic_dataset.head()

1. **Overall structure:** How many clusters do you observe in the full 3D view? Are they well separated in 3D? Along what direction, roughly speaking, are they separated?

YOUR SOLUTION HERE

2. **Projection views:** Examine each 2D projection (XY, XZ, YZ). Do you still see two clusters in each projection, or do they appear merged? What is it about the shape of the clusters in the dataset that gives this result?

YOUR SOLUTION HERE

In [None]:
# Task 3
problematic_variances[0] = NotImplemented ### YOUR SOLUTION HERE
problematic_variances[1] = NotImplemented ### YOUR SOLUTION HERE
problematic_variances[2] = NotImplemented ### YOUR SOLUTION HERE

# Check your answer - this assert statement should pass!
assert(np.allclose(problematic_variances, problematic_dataset[['x', 'y', 'z']].var(ddof=0)))
print("✔️ Test passed! Your variance calculation looks correct.")

# Task 4: plot for visualization
plt.bar(labels, problematic_variances)
plt.xlabel("dimension")
plt.ylabel("variance")

4. **Which dimension is best?** By this visualization, does any cardinal dimension have dramatically more variance explained than others? Is there a clear choice for the best cardinal dimension to project onto? Support your answer with quantitative evidence.

YOUR SOLUTION HERE

# 2 Introduction to Principal Component Analysis

From the last section, we saw that simple projection (e.g., looking only at x–y, x–z, or y–z) isn’t always reliable. A projection **throws away** whole directions, so if the important variation runs along some **diagonal combination** of axes, you may miss it.

**Principal Component Analysis (PCA)** solves this by *rotating* the coordinate system to find directions that best capture the data’s variability—without restricting ourselves to the original cardinal axes.

### Intuition

- **Goal:** keep as much information (variance) as possible using as few dimensions as possible.  
- **How:** find new, orthogonal directions (the **principal components**, PCs) along which the data vary the most.  
- **Result:** PC1 is the single direction that maximizes spread; PC2 is the next best direction orthogonal to PC1, and so on.

**Bottom line:** PCA doesn’t just *drop* dimensions; it **discovers better ones**—rotated axes that capture the most important variation—so you can reduce dimensionality **while preserving structure** that simple projection might hide.

### The two "Steps" of PCA:

1) **Rotation: find the best view.**  
   Think of turning your camera around a point cloud to find the view where the data looks **most spread out** in one direction.
   - That widest direction is **PC1** (Principal Component 1).
   - The next widest direction, at a right angle to PC1, is **PC2**, and so on.
   - In plain terms, we’re just **rotating the axes** to line up with how the data naturally varies.

2) **Projection: keep the important parts.**  
   After you’ve found these new directions (PC1, PC2, …), you **keep only the top few** (often 1–3) and **project out (collapse) the rest**.
   - Keeping PC1 (and maybe PC2) gives a simpler picture that still captures **most of the spread**.
   - Projecting out the small leftover directions removes detail/noise and reduces dimensionality.
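
The two steps can be sketched with NumPy on a small synthetic 2D dataset. This is a minimal illustration, not the `menglab` implementation: the data, the variable names (`points`, `pc1`, `scores`), and the choice of `np.linalg.eigh` on the covariance matrix are assumptions made for the example.

```python
import numpy as np

# Hypothetical 2D dataset: a cloud stretched along a diagonal direction.
rng = np.random.default_rng(0)
long_axis = rng.normal(0.0, 3.0, size=500)   # large spread
short_axis = rng.normal(0.0, 0.5, size=500)  # small spread
angle = np.deg2rad(40.0)
points = np.column_stack([
    long_axis * np.cos(angle) - short_axis * np.sin(angle),
    long_axis * np.sin(angle) + short_axis * np.cos(angle),
])

# Center the data so rotations are about the middle of the cloud.
centered = points - points.mean(axis=0)

# Step 1 (rotation): eigenvectors of the covariance matrix give the new axes.
cov = np.cov(centered, rowvar=False, ddof=0)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # returned in ascending order
order = np.argsort(eigenvalues)[::-1]            # re-sort, largest first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
pc1 = eigenvectors[:, 0]

# Step 2 (projection): keep only the PC1 coordinate of every point.
scores = centered @ pc1

print("variance along x, y:", centered.var(axis=0))
print("variance along PC1 :", scores.var())
print("fraction of total variance kept:", eigenvalues[0] / eigenvalues.sum())
```

In this sketch the variance along PC1 is at least as large as the variance along either original axis, which is the sense in which the rotated axis is a "better view" of the data.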



## 2.1 Interactive PCA: Step 1, Rotation

Let's look at the first step, *Rotation*.

Execute the interactive code cell below, where you’ll carry out PCA manually by **sweeping the rotation angle θ.**

**What to look for**
- As you vary θ, the **variance along PC1** changes.  
- The angle where the PC1 variance is **largest** is the PCA solution for **PC1**.  
- **PC2** is always orthogonal to PC1; when PC1 captures more variance, PC2 captures less.  
- The **total variance** of the dataset is unchanged by rotation.
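
The points above can also be checked numerically. The sketch below is an illustration on made-up data, not the code behind the interactive demo; the helper `rotated_variances` and the dataset are hypothetical. It sweeps θ in one-degree steps, records the variance along each rotated axis, and confirms that their sum does not depend on θ.

```python
import numpy as np

# Hypothetical centered 2D data with more spread in one diagonal direction.
rng = np.random.default_rng(1)
raw = rng.normal(size=(400, 2)) @ np.array([[2.0, 1.0], [0.0, 0.7]])
centered = raw - raw.mean(axis=0)

def rotated_variances(data, theta_deg):
    """Return the variance along each of the two axes after rotating them by theta."""
    theta = np.deg2rad(theta_deg)
    axis1 = np.array([np.cos(theta), np.sin(theta)])
    axis2 = np.array([-np.sin(theta), np.cos(theta)])  # orthogonal to axis1
    return (data @ axis1).var(), (data @ axis2).var()

thetas = np.arange(0, 180, 1)
var1 = np.array([rotated_variances(centered, t)[0] for t in thetas])
var2 = np.array([rotated_variances(centered, t)[1] for t in thetas])

best_theta = thetas[np.argmax(var1)]
print("angle with the largest axis-1 variance:", best_theta)
print("total variance is constant:", np.allclose(var1 + var2, (var1 + var2)[0]))
```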

**How to use the interactive PCA demo:**
1. Run the cell, then drag the **θ slider** to rotate the axes.  
2. Observe when the PC1 variance is **highest**: this is your **principal direction**.  
3. Relate that angle to the visible structure (e.g., cluster separation) in the scatter plot.


In [None]:
PCA_dataset, _, _ = menglab.generate_PCA_data()
menglab.interactive_PCA_rotation(PCA_dataset)

### ✅ Exercise 6 [8 Points] : Interactive PCA, Rotation

**Tasks:** Use the interactive PCA code cell above to answer the following questions:

1. Set $\theta = 0$, $\theta = 25$, $\theta = 50$, and $\theta = 75$.
    - For each setting of $\theta$, record the variance in PC1 and PC2.  
    - Calculate the total variance (the sum of PC1 and PC2 variances) for each setting of $\theta$.
    - Execute the code cell to plot the variances as a function of $\theta$.

2. From the analysis in Task 1, does the total variance change with $\theta$ (within rounding error of approximately $\pm$1 unit)?

3. What choice of $\theta$ ($\theta = 0$, $\theta = 25$, $\theta = 50$, or $\theta = 75$) results in PC1 containing the most variance in the data?

4. For this optimal choice of $\theta$, how much variance is explained by PC1 alone?

In [None]:
# Variables (provided)
# execute this code cell to define the provided variables
thetas = [0, 25, 50, 75]
PC1_var = np.zeros(4)  # float array, so decimal variance values are not truncated
PC2_var = np.zeros(4)

In [None]:
# Task 1
PC1_var[0] = NotImplemented ### Your Solution here, theta = 0
PC1_var[1] = NotImplemented ### Your Solution here, theta = 25
PC1_var[2] = NotImplemented ### Your Solution here, theta = 50
PC1_var[3] = NotImplemented ### Your Solution here, theta = 75

PC2_var[0] = NotImplemented ### Your Solution here, theta = 0
PC2_var[1] = NotImplemented ### Your Solution here, theta = 25
PC2_var[2] = NotImplemented ### Your Solution here, theta = 50
PC2_var[3] = NotImplemented ### Your Solution here, theta = 75

total_variance = NotImplemented ### Your Solution here

plt.plot(thetas, PC1_var, label = 'PC1 variance')
plt.plot(thetas, PC2_var, label = 'PC2 variance')
plt.plot(thetas, total_variance, label = 'total variance')
plt.legend()
plt.xlabel("theta")
plt.ylabel("variance")

Task 2. Does the total variance change with $\theta$ (within rounding error of approximately $\pm$1 unit)?

YOUR SOLUTION HERE

Task 3. What choice of $\theta$ ($\theta = 0$, $\theta = 25$, $\theta = 50$, or $\theta = 75$) results in PC1 containing the most variance in the data?

YOUR ANSWER HERE

Task 4. For this optimal choice of $\theta$, how much variance is explained by PC1 alone?

YOUR ANSWER HERE

## 2.2 Interactive PCA: Projection

Once we’ve found the **best rotation** (the principal components), we can **project** the data onto just a few of those directions and **drop the rest**. This keeps most of the important structure while making the dataset smaller and easier to work with.

### What “projection” means (informally)
- Think of shining a light so your 3D (or higher-D) cloud casts a shadow onto 1 or 2 chosen axes.
- The axes we keep are the **top principal components** (PC1, PC2, …) because they capture the most **spread** (variance).
- Everything **orthogonal** to those axes is discarded (set to zero in the reduced view).

### Choosing how many components to keep (k)
- **k = 1**: Collapse the data onto **PC1**. You get a single number per point (its **score** on PC1).  
  Great for seeing the main trend or ordering along the dominant direction.
- **k = 2**: Keep **PC1 and PC2**. You get a 2D view (scores on PC1–PC2), which often reveals clusters or elongated shapes.
- **k > 2**: Keep the top **k** PCs for modeling while still reducing dimensionality (e.g., from 100 features down to 10–20).
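
As a rough sketch of what keeping `k` components looks like in code, the example below uses scikit-learn's `PCA` with its `n_components` argument on a made-up dataset (the array `features` is hypothetical and purely random, so it has no real structure to find). Only the shape of the output is of interest here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 200 samples with 10 features.
rng = np.random.default_rng(2)
features = rng.normal(size=(200, 10))

for k in (1, 2, 5):
    # Keep only the top k principal components.
    reduced = PCA(n_components=k).fit_transform(features)
    print(f"k = {k}: reduced data has shape {reduced.shape}")
```

Each row still corresponds to one sample; only the number of columns (one score per kept component) changes.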

### What the interactive does here
In this demo, we’ll **project onto just one dimension (PC1)**:
- The **left plot** shows your data in the rotated frame (PC1 horizontal, PC2 vertical).
- The **right plot** shows a **histogram of PC1 scores**—that’s the 1D projection.
- As you adjust the rotation angle  $\theta$, you’ll see the PC1 histogram widen or narrow.  
  The **widest** histogram (largest variance) corresponds to the **true PC1**.



In [None]:
PCA_dataset, _, _ = menglab.generate_PCA_data()
menglab.interactive_PCA_projection(PCA_dataset)

### ✅ Exercise 7 [1 Point] : Interactive PCA, Projection

**Tasks:** Use the interactive PCA code cell above to answer the following questions:

Task 1. For the optimal choice of $\theta$ (the angle that maximizes the PC1 variance), are the two clusters in the dataset distinguishable along PC1?

YOUR ANSWER HERE

## 2.3 PCA vs. Clustering

So far we’ve used dimensionality reduction (simple projections and PCA) to *visualize* datasets that appear clustered.  
It’s crucial to remember:

> **PCA’s primary goal is not to find clusters.**  
> PCA seeks directions of **maximum variance** (PC1, PC2, …) and—optionally—projects data onto a few of them to reduce dimension.

### What this implies
- PCA **optimizes variance**, not separation. A dataset can be clustered yet still look mixed in a PC1–PC2 plot.
- We’ll use **k-means clustering** later to ask the *separate* question: “Is this data actually clustered?”

### What you’ll do next
Run the two cells below to:
1. **Load a “paradoxical” PCA dataset**
2. **Launch the interactive PCA demos** (Rotation and Projection -- these are the same analyses we ran above in Exercises 6/7) to see how:
   - Rotating the axes changes the variance captured by “PC1,” and
   - Projecting onto a small number of PCs can either **reveal** or **obscure** structure.

In [None]:
# Generate data: paradoxical PCA dataset
PCA_dataset_nocluster, y, u_true = menglab.generate_PCA_data(
    n_per_cluster=300,
    theta_data_deg=75.0,
    sep=3.0,
    sigma_parallel=0.5,
    sigma_perp=3.0,
    random_state=42
)
menglab.interactive_PCA_rotation(PCA_dataset_nocluster)

In [None]:
menglab.interactive_PCA_projection(PCA_dataset_nocluster)

### ✅ Exercise 8 [3 points]: Paradoxical PCA

**Tasks:** Use the interactive demos above to explore a paradoxical PCA dataset.

1. What value for $\theta$ maximizes the variance in the PC1 direction? What is the variance in PC1 and variance explained ratio/percentage?

2. Does the optimal PCA result distinguish the two clusters of data after projection onto PC1?

3. What about the structure of the data and the objectives of the PCA algorithm explains your observations?

Task 1. What value for $\theta$ maximizes the variance in the PC1 direction? What is the variance in PC1 and variance explained ratio/percentage?

YOUR SOLUTION HERE

Task 2. Does the optimal PCA result distinguish the two clusters of data after projection onto PC1?

YOUR SOLUTION HERE

Task 3. What about the structure of the data and the objectives of the PCA algorithm explains your observations?

YOUR ANSWER HERE

# 3 PCA on Wines dataset

Let’s apply PCA to a **real** dataset: the classic *Wine* dataset from **scikit-learn**.  
We’ll loosely reproduce the analysis from the paper in Section 0.

> **Note:** The paper we read did not release its original data. As a close stand-in, we’ll use scikit-learn’s Wine dataset, which includes physicochemical (molecular) measurements for Italian wines.

Our goal is to use these features to uncover low-dimensional structure in the data—and, if present, cluster patterns (e.g., differences that might align with producers/cultivars or regions).  
We’ll standardize the features and then run PCA to visualize and interpret the dominant sources of variation.

### What’s in the dataset?
- **wines** grown in the same region in Italy, belonging to **3 cultivars** (classes).
- **chemical measurements** per wine (features), e.g. Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315, Proline.
- Goal: **unsupervised dimensionality reduction** via PCA.

### What is scikit-learn (sklearn)?

**scikit-learn** is a popular, open-source **Python library for machine learning**. It gives you reliable, well-tested implementations of common algorithms—like **PCA**, **k-means**, **logistic regression**, **random forests**, and many more—**without** having to code the math and optimization details yourself.

Why we use it here:
- **Preprocessing:** tools like `StandardScaler` to standardize features before PCA.
- **Dimensionality reduction:** `PCA` (and others) with a simple, uniform API.
- **Model selection:** utilities like `train_test_split`, `Pipeline`, and cross-validation.
- **Consistency:** almost everything follows the same pattern: `fit(...)`, then `transform(...)` or `predict(...)`.
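
The `fit(...)` then `transform(...)` pattern can be illustrated with any scikit-learn transformer. The sketch below uses `MinMaxScaler` (a different scaler from the one used later in this lab) on a tiny made-up array; the names `samples`, `rescaler`, and `rescaled` are placeholders chosen for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 4 samples, 2 features on very different scales.
samples = np.array([[1.0, 100.0],
                    [2.0, 300.0],
                    [3.0, 500.0],
                    [4.0, 700.0]])

rescaler = MinMaxScaler()                # 1) construct the object
rescaler.fit(samples)                    # 2) learn each feature's min and max
rescaled = rescaler.transform(samples)   # 3) apply the learned rescaling

print("learned minima:", rescaler.data_min_)
print("learned maxima:", rescaler.data_max_)
print(rescaled)
```

After `fit`, the learned parameters are stored on the object as attributes ending in an underscore; `transform` then maps every feature onto the range 0 to 1.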



## 3.1 Import Wine Chemistry Dataset (sklearn)

Execute the code cell below to load the wines dataset and display the first 5 rows of the dataset.

In [None]:
# Load wines dataset
wine = load_wine()
X_wine = pd.DataFrame(wine.data, columns=wine.feature_names)
y_wine = pd.Series(wine.target, name='class')  # 0,1,2 (wine cultivars)

display(X_wine.head())


### ✅ Exercise 9 [5 points]: Exploring the data

**Tasks:** The wines dataset has been loaded into `X_wine` and `y_wine`
  - `X_wine` is a matrix of wines and their chemical features: each row corresponds to a wine, and each column corresponds to a chemical feature.
  - `y_wine` is a list containing the cultivar type of each wine (represented as an integer).

1. How many wines are there in this dataset? How many chemical features are provided for each wine?
    - Use the `np.shape()` function to determine the dimensions of `X_wine` (returns a tuple of dimension sizes).
    - Then, use indexing to access the entries corresponding to the number of wines and number of features.

2. How many unique cultivars are there?

    - Use the `np.unique()` function to produce a list of unique elements in a list.
    - Use the `len()` function to determine the length of a list.
    - Do we want to operate on `X_wine` or `y_wine` for this analysis?



In [None]:
# Task 1
dataset_shape = NotImplemented ### YOUR SOLUTION HERE

print("The shape of the dataset is: ", dataset_shape)

n_wines = NotImplemented ### YOUR SOLUTION HERE
n_features =  NotImplemented ### YOUR SOLUTION HERE

print("The number of wines is: ", n_wines)
print("The number of features is: ", n_features)

In [None]:
# Task 2
unique_cultivars = NotImplemented ### YOUR SOLUTION HERE
print("The unique cultivars are: ", unique_cultivars)

n_unique_cultivars = NotImplemented ### YOUR SOLUTION HERE
print("The number of unique cultivars is: ", n_unique_cultivars)

### ✅ Exercise 10 [5 Points]: Visualizing the data

**Tasks:** Execute the following two code cells to visualize simple 2D projections of the wines dataset.

  - In the first code cell, use the interactive dropdowns to visualize any simple 2D projection of the wines dataset using two of the feature dimensions.

  - In the second code cell, all pairs of 2D feature projections are shown in a matrix.

1. In visualizing the 2D projections, do any of them show distinct, highly separated clusters of data?

2. Compare the values for different features:
    - How does the magnitude and range of `magnesium` values compare to the magnitude and range of `hue` values?
    - If PCA is looking for directions of highest variance, how do you think these different magnitudes and ranges will affect the PCA algorithm?

3. What does the 'Standardize (z-score)' checkbox in the interactive code cell appear to do to the data? (In rough terms).


In [None]:
menglab.interactive_2d_projection(X_wine)

In [None]:
# Scatter plots of all possible 2D projections (may take a few seconds)
from pandas.plotting import scatter_matrix
scatter_matrix(X_wine, figsize=(14, 14), diagonal='hist', color='0.2', alpha=0.6)
plt.show()

Task 1. In visualizing the 2D projections, do any of them show distinct, highly separated clusters of data?

YOUR SOLUTION HERE

Task 2. Compare the values for different features: How does the magnitude and range of `magnesium` values compare to the magnitude and range of `hue` values? If PCA is looking for directions of highest variance, how do you think these different magnitudes and ranges will affect the PCA algorithm?

YOUR SOLUTION HERE

Task 3. What does the 'Standardize (z-score)' checkbox in the interactive code cell appear to do to the data? (In rough terms).

YOUR SOLUTION HERE

## 3.2 Standardizing the Data

Standardizing is a critical step in the PCA algorithm.

**Why standardize for PCA?**  
1. PCA is doing **rotations** on the data. Therefore, it's important for the data to be centered on the origin (mean-zero) for these rotations to be interpretable.

2. PCA looks for directions of **maximum variance**. If one feature has large units and another has small units, the large-scale feature can dominate the PCs. Standardizing puts features on a comparable footing so PCA reflects **structure**, not **units**.
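
The effect of units can be seen in a small numerical experiment. The sketch below is an illustration on made-up data (two independent features, not the Wine dataset), and the helper `explained_variance_ratio` is a hypothetical function written for this example.

```python
import numpy as np

# Hypothetical data: two independent features recorded in very different units.
rng = np.random.default_rng(3)
big_units = rng.normal(100.0, 15.0, size=1000)   # values near 100
small_units = rng.normal(1.0, 0.2, size=1000)    # values near 1
table = np.column_stack([big_units, small_units])

def explained_variance_ratio(data):
    """Fraction of total variance along each principal direction, largest first."""
    centered = data - data.mean(axis=0)
    eigenvalues = np.linalg.eigvalsh(np.cov(centered, rowvar=False))  # ascending
    return eigenvalues[::-1] / eigenvalues.sum()

# z-score each column: subtract its mean, divide by its standard deviation.
z_scored = (table - table.mean(axis=0)) / table.std(axis=0)

print("raw data    :", explained_variance_ratio(table))
print("standardized:", explained_variance_ratio(z_scored))
```

On the raw data, almost all of the variance is attributed to one direction simply because one feature has larger numbers. After standardizing, the two independent features contribute roughly equally.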

### What we do
1. **Center (mean-zero) each feature**  
   Subtract the column mean so the data cloud is centered at the origin.  
2. **Scale to unit variance (z-score)**  
   Divide by the column standard deviation so each feature has variance ≈ 1.
   
---
**Z-score formula (per feature $i$):**
$$
z_{i} \;=\; \frac{x_{i} - \mu_{i}}{\sigma_{i}}
$$

where $x_i$ is the column of values for feature $i$, and $\mu_i$ and $\sigma_i$ are the mean and standard deviation of that feature.
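
As a small worked example (hypothetical numbers, not taken from the Wine dataset), consider a single feature with the three values 2, 4, and 6, and use the population standard deviation, which is what `np.std` computes by default:

$$
\mu = \frac{2 + 4 + 6}{3} = 4, \qquad
\sigma = \sqrt{\frac{(2-4)^2 + (4-4)^2 + (6-4)^2}{3}} = \sqrt{\tfrac{8}{3}} \approx 1.63
$$

$$
z = \left( \frac{2-4}{1.63},\; \frac{4-4}{1.63},\; \frac{6-4}{1.63} \right) \approx (-1.22,\; 0,\; 1.22)
$$

The standardized values are centered on 0 and have standard deviation 1, regardless of the units of the original feature.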



### ✅ Exercise 11 [7 Points]: Standardizing manually

**Tasks:**

1. Complete the `standardize` function, which takes in an input `x` corresponding to one feature (dimension) and returns a standardized version of that feature.
    
    - Use numpy functions like `np.mean` and `np.std` to calculate the mean and standard deviation of a list/np.array

2. Use the `standardize` function to create a standardized version of the 'alcohol' feature.

    - As a sanity check, verify that the mean and standard deviation are approximately 0 and 1, respectively.

    - **Hint:** Individual features of the dataset can be accessed as follows: `X_wine['alcohol']` or `X_wine.alcohol`

3. Complete the code cell to display a histogram of the alcohol feature before and after standardization.

In [None]:
# Task 1
def standardize(x):
  """
  Standardize a numeric array by z-scoring (mean 0, std 1).

  This computes the global mean (μ) and standard deviation (σ) of `x` and returns
  a standardized version of the input `x`

  Parameters
  ----------
  x : array_like
      A 1D or n-D numeric array

  Returns
  -------
  z : numpy.ndarray
      An array with the same shape as `x`, standardized to have mean ~0
      and standard deviation ~1
  """
  mu = NotImplemented ### YOUR SOLUTION HERE
  sigma = NotImplemented ### YOUR SOLUTION HERE
  z = NotImplemented ### YOUR SOLUTION HERE
  return z

In [None]:
# Task 2
alcohol_feature = NotImplemented ### YOUR SOLUTION HERE
alcohol_feature_standardized = NotImplemented ### YOUR SOLUTION HERE

print("the mean of the standardized feature is: ", np.mean(alcohol_feature_standardized))
print("the standard deviation of the standardized feature is:", np.std(alcohol_feature_standardized))

In [None]:
# Task 3
fig, axs = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

axs[0].hist(NotImplemented) ### YOUR SOLUTION HERE
axs[0].set_title('Alcohol')
axs[0].set_xlabel('alcohol')
axs[0].set_ylabel('count')
axs[0].grid(True, alpha=0.3)

axs[1].hist(NotImplemented) ### YOUR SOLUTION HERE
axs[1].set_title('Alcohol (z-score)')
axs[1].set_xlabel('alcohol (standardized)')
axs[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


### ✅ Exercise 12 [2 Points]: Standardizing with `scikit-learn`

In this exercise you’ll use **scikit-learn**’s `StandardScaler` to z-score features with a clean, consistent API.

**Why sklearn?**  
`scikit-learn` provides well-tested implementations of common ML steps (scaling, PCA, models) so you don’t have to hand-code the details. Most estimators follow the same pattern:

- `fit(X)`: learn parameters from data (e.g., means/standard deviations).
- `transform(X)`: apply the learned transformation to data.
- `fit_transform(X)`: do both in one step.

---

#### Your tasks

1) **construct** a `StandardScaler` object.  
    - The sklearn library provides an object called `StandardScaler()`, which can be assigned to a variable (`scaler`).

    **Hint:** Literally assign the variable `scaler` the value `StandardScaler()`

2) **Fit** the scaler and **transform** the `X_wine` dataset.  
    - Use `scaler.fit_transform(data)` to return a standardized version of the data.
    
    - Execute the code cell and check your work by confirming that each standardized feature in `Xw_scaled` has mean ~0 and std ~1.
    
    **N.B.** The `DataFrame.describe()` method displays the mean and standard deviation of every feature in a DataFrame.
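
One detail to keep in mind when reading the `describe()` output: pandas reports the *sample* standard deviation (dividing by N-1), whereas `np.std` and `StandardScaler` use the *population* standard deviation (dividing by N) by default. The short sketch below, on a made-up four-value feature, shows the size of the difference; the variable names are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical feature, z-scored with the population formula (divide by N).
values = np.array([2.0, 4.0, 6.0, 8.0])
z = (values - values.mean()) / values.std()

print("NumPy std  (divides by N):  ", np.std(z))
print("pandas std (divides by N-1):", pd.Series(z).std())
print("describe() std:             ", pd.Series(z).describe()["std"])
```

With only four values the gap is noticeable; with more rows it shrinks, so a `describe()` std slightly different from 1 does not indicate a mistake.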

In [None]:
# Task 1
scaler = NotImplemented ### YOUR SOLUTION HERE

In [None]:
# Task 2
Xw_scaled = NotImplemented ### YOUR SOLUTION HERE

# recast to pandas dataFrame
Xw_scaled = pd.DataFrame(Xw_scaled, columns=X_wine.columns)

# visualize properties of each feature column
Xw_scaled.describe()



## 3.3 PCA on wines dataset

### ✅ Exercise 13 [2 Points]: PCA with scikit-learn

Now that the Wine features are **standardized**, let’s run **Principal Component Analysis (PCA)**.

**Protocol (same pattern as scaling):**
1) **Create** a PCA object  
2) **Fit & transform** the standardized data

**Tasks**:

1. Define a pca object using the `PCA()` function from sklearn.

**HINT:** Like for the scaler task above, literally assign the variable `pca_object` the value `PCA()`.

2. Call `.fit_transform(data)` on your pca object and set `data` to be the standardized data set you just constructed above. Look back to see what you called this variable.

**That’s it! With just a few lines, scikit-learn handles all the heavy lifting of PCA!**

In [None]:
# Task 1
pca_object = NotImplemented ### YOUR SOLUTION HERE

In [None]:
# Task 2
Xw_pca = NotImplemented ### YOUR SOLUTION HERE

## 3.4 Scree plot and Cumulative Variance Explained

### ✅ Exercise 14 [6 Points]: Plot PCA Variance Explained
**Tasks:** Use sklearn functionality to plot the results of PCA

1. Modify the code below to make a scree plot.
    - The scree plot displays the explained variance ratios for each of the principal components of the data, in decreasing order.

**HINT:** sklearn provides the explained variance ratios within the pca_object: `pca_object.explained_variance_ratio_`

2. Modify the code below to calculate the cumulative explained variance, and plot the results.

**HINT:** use `np.cumsum` to calculate the cumulative sum of a list/numpy array.

3. How much cumulative variance is explained by the first 2 principal components? How much cumulative variance is explained by the first 3 principal components?

**HINT:** Can you get Python to print out a variable that will help you answer this question more accurately?
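
To see what `np.cumsum` does, here is a minimal sketch on made-up numbers (the `ratios` array is hypothetical and is not the Wine result):

```python
import numpy as np

# Hypothetical explained variance ratios for a 4-component example (they sum to 1).
ratios = np.array([0.5, 0.25, 0.15, 0.1])

running_total = np.cumsum(ratios)
print(running_total)

# Entry j of the running total is the sum of the first j+1 ratios.
print("first two components together:", running_total[1])
```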

In [None]:
# Task 1
variance_explained_ratios = NotImplemented ### YOUR SOLUTION HERE

pc_labels = np.arange(1, len(variance_explained_ratios)+1)
plt.figure()
plt.plot(pc_labels, variance_explained_ratios, marker='o')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('scree plot')
plt.show()

In [None]:
# Task 2
cumulative_variance = NotImplemented ### YOUR SOLUTION HERE

x_labels = np.arange(1, len(cumulative_variance)+1)
plt.figure()
plt.plot(x_labels, cumulative_variance, marker='o')
plt.xlabel('Number of PCs')
plt.ylabel('Cumulative Explained Variance')
plt.title('Wine: Cumulative Explained Variance')
plt.show()

Task 3. How much cumulative variance is explained by the first 2 principal components? How much cumulative variance is explained by the first 3 principal components?

**HINT:** Can you get Python to print out a variable that will help you answer this question more accurately?

YOUR SOLUTION HERE

## 3.5 Visualize data projected into PCA dimensions

The output of the `.fit_transform(...)` function is the data projected onto the principal component directions. We can plot this data in 2D (using the first two principal components) or in 3D (using the first three principal components).

In [None]:
# Visualize first two PCs, with true classes to build intuition
menglab.pca_scatter(Xw_pca, y=y_wine, pcx=1, pcy=2, title='Wine: PC1 vs PC2 (colored by true class)')

In [None]:
menglab.PCA_projection_3D(Xw_pca, y=y_wine, pcs=(1,2,3), evr=pca_object.explained_variance_ratio_)

# 4 Introduction to K-means Clustering

In the Wine dataset, the **PC1–PC2** plot *looks* clustered — but that’s because we’ve been peeking at the cultivar **labels**.  
What if we **couldn’t** see the labels? Could a computer still discover groups on its own — and even suggest **how many** there are?

This is the purpose of **k-means** clustering.
- **Unsupervised:** K-means ignores labels and looks only at the feature values.
- **Goal:** Partition the data into **k** clusters where points are close to their cluster’s **centroid** (mean).

## The core algorithm (in plain English)

0. **Pick k** (how many groups you want).
1. **Start with k centers** (often chosen automatically).
2. **Assign** each point to its **nearest** center.
3. **Update** each center to be the **average** of the points assigned to it.
4. Repeat **assign → update** until things stop changing much.

That’s it! The algorithm tries to put points into compact, well-separated clumps.
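
For orientation, the sketch below shows what these steps produce when run through scikit-learn's ready-made `KMeans` class rather than a from-scratch implementation. The dataset `toy` is made up for this example (it is not the output of `menglab.generate_kmeans_data()`), and the settings `n_init=10` and `random_state=0` are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D data: three groups centered near 0, 5, and 10.
rng = np.random.default_rng(4)
toy = np.concatenate([
    rng.normal(0.0, 0.3, size=30),
    rng.normal(5.0, 0.3, size=30),
    rng.normal(10.0, 0.3, size=30),
])

# scikit-learn expects a 2-D array of shape (n_samples, n_features).
model = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = model.fit_predict(toy.reshape(-1, 1))

print("cluster centers:", np.sort(model.cluster_centers_.ravel()))
print("points per cluster:", np.bincount(cluster_ids))
```

Note that the cluster indices (0, 1, 2) are arbitrary labels: the algorithm finds the groups, but which group gets which number carries no meaning.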

---

**Important:** **Step 0 is to Pick `k`.**  
You must start with a **guess** for how many clusters (`k`) you think the data contains.

Don’t worry—we’ll soon learn practical ways to **choose `k` automatically** (e.g., elbow plots and silhouette scores). For now, pick a reasonable starting value and iterate.

---

**Now, let's implement k-means clustering from scratch on a simple 1-D dataset.**

Execute the code below to see a 1-D visualization of the toy dataset, which is constructed to have three distinct groupings of datapoints spread along the x-direction.


In [None]:
data = menglab.generate_kmeans_data()

plt.scatter(data, np.zeros_like(data), s=100, alpha=0.8, edgecolor='k', linewidths=0.2)
plt.xlabel("x")
plt.ylabel("y")
#plt.scatter(centers, np.zeros_like(centers), c='r', s=500, alpha=0.5, edgecolor='k', linewidths=0.2)
plt.title("Toy k-means dataset")

## 4.1 Step 1: Initialize cluster centers to 'k' random data points

### ✅ Exercise 15 [2 points]: Initialize clusters

**Tasks:** The first step in **k-means** clustering is to initialize the cluster centers. The simplest way to do this is to select $k$ data points at random, and simply assign the cluster centers to lie on top of those data points.

1. Complete the function `initialize_centers(data, k)` that takes as input:
    - `data`, a 1-D list of data points scattered in the x-direction
    - `k` (int), the number of clusters
    
    ... and returns:

    - `centers`, a 1-D list of initialized cluster centers.

    To accomplish this, assign the clusters to $k$ random points from the dataset.

    - **Hint:** Random data points can be selected with the numpy function `np.random.choice(a, size)` where `a` is the source of data points and `size` is the number of randomly drawn points from the data source.

2. Initialize the cluster locations using the above function with $k=3$. Execute the code cell to visualize the dataset and corresponding cluster centers.
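
A brief sketch of how `np.random.choice` behaves, using a made-up array (`pool` is hypothetical). By default it samples *with replacement*, so the same value can be drawn more than once; passing `replace=False` guarantees distinct picks.

```python
import numpy as np

np.random.seed(0)  # fixed seed so repeated runs give the same draws

# Hypothetical pool of values to sample from.
pool = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

with_replacement = np.random.choice(pool, size=3)                    # default: replace=True
without_replacement = np.random.choice(pool, size=3, replace=False)  # no repeated picks

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```

For k-means initialization, drawing the same data point twice would place two centers on top of each other, which is one reason `replace=False` may be worth considering.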

In [None]:
# Task 1.
def initialize_centers(data, k):
  """
  Initialize k-means cluster centers by sampling from the data.

  This function selects `k` values from `data` using `np.random.choice`
  and returns them as the initial cluster centers.

  Parameters
  ----------
  data : array-like
      Input data. This implementation assumes **1D** data (e.g., shape (n,)).
  k : int
      Number of initial centers to select.

  Returns
  -------
  centers : numpy.ndarray
      Array of length `k` containing the chosen initial centers
  """
  np.random.seed(0) # Set a seed for reproducibility
  centers = NotImplemented ### YOUR SOLUTION HERE
  return centers


In [None]:
# Task 2
# Visualization - execute to visualize the initialization of cluster centers
centers = NotImplemented ### YOUR SOLUTION HERE
plt.scatter(data, np.zeros_like(data), s=100, alpha=0.8, edgecolor='k', linewidths=0.2, label='data')
plt.xlabel("x")
plt.ylabel("y")
plt.scatter(centers, np.zeros_like(centers), c='r', s=500, alpha=0.5, edgecolor='k', linewidths=0.2, label='Cluster centers')
plt.title("Toy k-means dataset")
plt.legend()

## 4.2 Step 2: Assign points to their nearest cluster center.

### ✅ Exercise 16 [2 points]: Assign points to clusters.

**Tasks**

1. Complete the function `assign_labels(data, centers, labels)` that takes as arguments:
    - `data`, a 1-D list of data points scattered in the x-direction
    - `centers`, a list of 'k' cluster centers
    - `labels`, a 1-D list of cluster label assignments, one for each point in `data`

    ... And returns

    - `labels`, an updated list of cluster label assignments

    To assign points to clusters, for each point calculate:
    - `distances`, a 1-D list of distances from the point to the cluster centers.
    - `closest_cluster` the index corresponding to the smallest distance (the closest cluster)

    **Hint:** use `np.argmin` to get the **index** corresponding to the smallest value in a list. Do *not* use `np.min` as this would give the value of the smallest distance itself, not the index of the cluster with the smallest distance!

2. Assign points to clusters using the function you just developed in Task 1, with cluster centers defined from the previous exercise (Exercise 15). Execute the code cell to visualize the data points colored by cluster assignment.
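
The difference between `np.min` and `np.argmin` is easiest to see on a tiny made-up array (`example_distances` is hypothetical):

```python
import numpy as np

# Hypothetical distances from one point to three cluster centers.
example_distances = np.array([4.0, 1.5, 3.0])

print("np.min    ->", np.min(example_distances))     # the smallest distance itself
print("np.argmin ->", np.argmin(example_distances))  # the position of that distance
```

Here the smallest distance is 1.5, and it sits at index 1, so the point would belong to cluster 1 (counting from 0).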

In [None]:
# TASK 1
def assign_labels(data, centers, labels):
  """
  Assign each data point to its nearest cluster center (1D k-means step).

  For every point in `data`, compute the absolute distance to each value in
  `centers` and write the index of the closest center into `labels`.

  Parameters
  ----------
  data : array-like of shape (n_samples,)
      1D data points to be clustered.
  centers : array-like of shape (k,)
      Current cluster center locations (in the same 1D space as `data`).
  labels : array-like of shape (n_samples,)
      Preallocated integer array; will be filled with cluster indices in 0..k-1.

  Returns
  -------
  labels : numpy.ndarray
      The same array passed in, filled with the nearest-center index per sample.
  """
  for i, point in enumerate(data):
    distances = np.abs(point - centers)
    closest_cluster = NotImplemented ### YOUR SOLUTION HERE
    labels[i] = closest_cluster
  return labels

In [None]:
# Task 2
# Visualization
labels = np.zeros_like(data, dtype=int) # integer cluster indices, initialized to zeros
labels = NotImplemented ### YOUR SOLUTION HERE
plt.scatter(data, np.zeros_like(data), c = labels, s=100, alpha=0.8, edgecolor='k', linewidths=0.2, label='data')
plt.xlabel("x")
plt.ylabel("y")
plt.scatter(centers, np.zeros_like(centers), c='r', s=500, alpha=0.5, edgecolor='k', linewidths=0.2, label='Cluster centers')
plt.title("Toy k-means dataset")
plt.legend()

## 4.3 Step 3: Update Cluster Centers

### ✅ Exercise 17 [2 points]: Update Cluster Centers

**Tasks**

1. Complete the function `update_cluster_centers(data, labels, k)` that takes in the following as arguments:
    - `data`, a 1-D list of data points scattered in the x-direction
    - `labels`, a 1-D list of cluster label assignments, one for each point in `data`
    - `k` (int), the number of clusters
    
    ... And returns:
    - `centers`, a list of 'k' updated cluster centers.

    To update cluster centers, for each cluster: calculate the mean value of data points ("members") in the cluster

**HINT:** Use the numpy function `np.mean` and take the mean value of all of the `cluster_members`
  
2. Update cluster centers using the function you just developed in Task 1, with labels defined according to the previous exercise (Exercise 16). Execute the code cell to visualize the new cluster centers. Note that the data points are not yet re-assigned to the closest cluster for these new cluster positions!

Remember that the value of $k$ we are using is $k=3$.
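
Selecting the members of one cluster relies on NumPy *boolean masking*. The sketch below demonstrates the idea on made-up arrays (`values` and `groups` are hypothetical stand-ins, not the lab's variables):

```python
import numpy as np

# Hypothetical values and a group index (0 or 1) for each value.
values = np.array([1.0, 2.0, 10.0, 12.0, 3.0])
groups = np.array([0, 0, 1, 1, 0])

mask = groups == 1        # array of True/False, one entry per value
selected = values[mask]   # keeps only the entries where mask is True

print("mask:    ", mask)
print("selected:", selected)
```

If no point carries a given label, the selection is empty, and `np.mean` of an empty array returns `nan` with a warning. This can happen in k-means when a center ends up with no assigned points.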
    

In [None]:
# TASK 1
def update_cluster_centers(data, labels, k):
  """
  Update k-means cluster centers as the mean of their assigned points (1D).

  This function recomputes each cluster center by taking the arithmetic mean of
  all data points currently assigned to that cluster.

  Parameters
  ----------
  data : array-like of shape (n_samples,)
      1D data values.
  labels : array-like of shape (n_samples,)
      Integer cluster assignments in the range 0..k-1 for each sample.
  k : int
      Number of clusters.

  Returns
  -------
  centers : numpy.ndarray of shape (k,)
      The updated center positions.
  """
  centers = np.zeros(k) # new array, so centers defined in other cells are not overwritten
  for cluster in range(k):
    cluster_members = data[labels == cluster] # select data points belonging to a cluster
    centers[cluster] = NotImplemented ### YOUR SOLUTION HERE

  return centers

In [None]:
# Task 2
# Visualization
# for this cell only: don't overwrite centers, store results in new_centers
new_centers = NotImplemented ### YOUR SOLUTION HERE
plt.scatter(data, np.zeros_like(data), c = labels, s=100, alpha=0.8, edgecolor='k', linewidths=0.2, label='data')
plt.xlabel("x")
plt.ylabel("y")
plt.scatter(new_centers, np.zeros_like(new_centers), c='r', s=500, alpha=0.5, edgecolor='k', linewidths=0.2, label='Cluster centers')
plt.title("New cluster centers (data points not re-assigned yet)")
plt.legend()

## 4.4 Step 4: Repeat! (Putting it all together)

### ✅ Exercise 18 [4 Points]: Implementing full K-means from scratch

**Tasks:** Complete the `kmeans(data, k, iterations)` function to implement the k-means algorithm. Use the functions you previously defined to complete the code.

The code cell below outlines the scaffolding of the full k-means algorithm. For this implementation, we will simply specify the number of iterations to run rather than detect when convergence has occurred.

1. Initially assign cluster centers using `initialize_centers`

For a specified number of iterations, on each iteration:

2. Assign points to their closest cluster using `assign_labels`

3. Calculate new cluster centers using `update_cluster_centers`

4. Use the completed `kmeans` function to run **k-means** clustering with k=4 for 5 iterations, and plot the results for each step. (set `plot=True`)
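
This lab uses a fixed number of iterations, but for reference, a convergence check could look something like the sketch below. The helper `centers_converged` and its tolerance `tol` are hypothetical and are not part of the lab code or of `menglab`.

```python
import numpy as np

def centers_converged(old_centers, new_centers, tol=1e-6):
    """Hypothetical helper: True when no center moved by more than `tol`."""
    old_centers = np.asarray(old_centers, dtype=float)
    new_centers = np.asarray(new_centers, dtype=float)
    return bool(np.max(np.abs(new_centers - old_centers)) <= tol)

print(centers_converged([0.0, 5.0, 10.0], [0.2, 5.1, 9.7]))  # centers still moving
print(centers_converged([0.2, 5.1, 9.7], [0.2, 5.1, 9.7]))   # centers unchanged
```

A loop using such a check would compare the centers before and after each update and stop early once it returns `True`.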

In [None]:
# Tasks 1, 2, 3
def kmeans(data, k, iterations=5, plot=False):
  """
  Run a simple 1D k-means loop for a fixed number of iterations and optionally plot progress.

  This implementation clusters 1D points on the real line using Lloyd's algorithm:
    (1) initialize k centers by sampling data points,
    (2) assign each point to its nearest center (absolute distance),
    (3) update each center to the mean of its assigned points,
    (4) repeat for `iterations` steps.

  Parameters
  ----------
  data : array-like of shape (n_samples,)
      1D data values to cluster.
  k : int
      Number of clusters (centroids).
  iterations : int, default=5
      Number of Lloyd iterations to perform (fixed-iteration stopping).
  plot : bool, default=False
      If True, plot the results after every iteration.

  Returns
  -------
  centers : numpy.ndarray of shape (k,)
      Final center positions after the last update.
  labels : numpy.ndarray of shape (n_samples,)
      Cluster index (0..k-1) assigned to each data point.

  Dependencies
  ------------
  Uses three helper functions assumed to be defined previously:
    - `initialize_centers(data, k)` -> centers
    - `assign_labels(data, centers, labels)` -> labels
    - `update_cluster_centers(data, labels, k)` -> centers
  """
  labels = np.zeros_like(data)

  # initially assign cluster centers to k random data points
  centers = NotImplemented ### YOUR SOLUTION HERE

  for i in range(iterations):
    # for each point, calculate distance to clusters, assign to closest one
    labels = NotImplemented ### YOUR SOLUTION HERE

    # for each cluster, calculate the new center of mass
    centers = NotImplemented ### YOUR SOLUTION HERE

    # optionally, plot after each iteration
    if plot:
      plt.figure()  # start a new figure so each iteration gets its own plot
      plt.scatter(data, np.zeros_like(data), c = labels, s=100, alpha=0.8, edgecolor='k', linewidths=0.2, label='data')
      plt.scatter(centers, np.zeros_like(centers), c='r', s=500, alpha=0.5, edgecolor='k', linewidths=0.2, label='Cluster centers')
      plt.xlabel("x")
      plt.ylabel("y")
      plt.title(f"K-means after iteration {i + 1}")
      plt.legend()
      plt.show()

  return centers, labels



In [None]:
# Task 4
centers, labels = NotImplemented ### YOUR SOLUTION HERE


## 4.5 Determining the best k with Silhouette Scores

**Motivation: choosing $k$ in k-means.**  
K-means needs you to pick the number of clusters $k$ *before* fitting. How do we know a good $k$?  
The **silhouette score** gives a **scale-free** measure of cluster quality you can compare **across different $k$**. Higher is better.

---

### Definition (per point)
For a data point $i$:

- $a(i)$: the **average distance** from $i$ to all **other points in its own cluster**.
- $b(i)$: for every **other** cluster, compute the average distance from $i$ to that cluster; take the **smallest** of these averages (the “nearest other cluster”).

The silhouette value for $i$ is
$$
s(i) \;=\; \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \;\in\; [-1,\,1].
$$

**Interpretation:**
- $s(i) \approx 1$: well matched to its cluster and far from others.  
- $s(i) \approx 0$: on a boundary.  
- $s(i) < 0$: may be misassigned.

The **overall silhouette score** is the **mean** of $s(i)$ over all points.
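
To make the definition concrete, here is a minimal sketch that computes $s(i)$ by hand for one point of a tiny, made-up 1D dataset. The helper name `silhouette_value` is illustrative, and the sketch assumes every cluster contains at least two points (otherwise $a(i)$ is undefined here).

```python
import numpy as np

def silhouette_value(x, labels, i):
    """Silhouette value s(i) for one point of a 1D dataset (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    dist = np.abs(x - x[i])          # distance from point i to every point
    same = labels == labels[i]
    same[i] = False                  # exclude point i itself from a(i)
    a = dist[same].mean()            # assumes the cluster has at least 2 points
    # b(i): smallest average distance from point i to any other cluster
    b = min(dist[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

x = [1.0, 2.0, 3.0, 10.0, 11.0]      # made-up data
labels = [0, 0, 0, 1, 1]
# For point 0: a = 1.5, b = 9.5, so s = 8 / 9.5 (about 0.842)
print(silhouette_value(x, labels, 0))
```

Averaging this value over all points gives the overall score.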

---

### Using silhouette to pick $k$

1. For each $k \in \{2,3,\dots,K\}$:
   - Fit k-means (on **standardized** features).
   - Compute the **mean silhouette** $\bar{s}_k$.
2. Plot $\bar{s}_k$ vs. $k$ and choose the $k$ with the **highest** score (balance with simplicity and domain sense).
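
Because k-means and the silhouette score are both distance-based, a feature with a large numeric range can dominate the result, which is why step 1 mentions standardized features. The sketch below shows one common way to standardize (z-scores per column) on made-up numbers. The helper name `standardize` is illustrative, and it assumes no column is constant.

```python
import numpy as np

def standardize(X):
    """Rescale each column to mean 0 and standard deviation 1 (z-scores)."""
    X = np.asarray(X, dtype=float)
    # Assumes no column is constant, otherwise the division would be by zero
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Two made-up features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])
Z = standardize(X)
print(Z.mean(axis=0))  # approximately 0 for each column
print(Z.std(axis=0))   # approximately 1 for each column
```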



### ✅ Exercise 19 [4 Points]: Implementing silhouette score

**Tasks:** Use `scikit-learn` to execute k-means clustering and get a silhouette score.

1. Complete the `get_silhouette_score(data, k)` function that takes in as arguments:
     - `data`, a 1-D list of data points scattered in the x-direction
     - `k`, the number of clusters

     ... And returns
     - `score`, the silhouette score for the k-means clustering.

     **Tips:**

     - First, we use sklearn to define a k-means estimator: `KMeans(n_clusters)`

     - Then, we use the k-means estimator to fit and predict labels using:
     `km.fit_predict(data)`. (Note that we're using *fit_predict* and not *fit_transform* here because the labels are separate predictions, not transformations of the underlying dataset.)
     
     - Finally we need to use `silhouette_score(data, labels)` from sklearn to calculate the score from the data and labels.

2. Complete the `silhouette_sweep(data, k_min, k_max, ...)` function that executes k-means clustering on the data and calculates a silhouette score for values of k ranging from `k_min` to `k_max`. Use the `get_silhouette_score` function defined in Task 1.

3. Execute the `silhouette_sweep` on the 1-D toy dataset, for values of k from 2 to 10.

4. Examine the plot produced by the silhouette sweep. Which choice of k (how many clusters) best separates the data? How does this line up with your intuition, given your observation of the underlying dataset?


In [None]:
# Task 1
def get_silhouette_score(data, k):
  """
  Compute the mean silhouette score for a K-Means clustering.

  This convenience function fits `KMeans(n_clusters=k)` on `data`,
  obtains cluster labels, and returns the mean silhouette score
  (range: [-1, 1], higher is better).

  Parameters
  ----------
  data : array-like of shape (n_samples, n_features)
      Feature matrix. For 1D data, pass `x.reshape(-1, 1)`.
  k : int
      Number of clusters (must be >= 2).

  Returns
  -------
  score : float
      Mean silhouette score for the fitted clustering.
  """
  km = KMeans(n_clusters=k)
  labels = km.fit_predict(data)
  score = NotImplemented ### YOUR SOLUTION HERE
  return score

In [None]:
# Task 2
def silhouette_sweep(data, k_min, k_max, plot=True, random_state=0):
  """
  Compute (and optionally plot) the mean silhouette score across a range of k.

  For each k in [k_min, k_max], this function fits a K-Means model on `data`,
  obtains cluster labels, and computes the mean silhouette score
  (range: [-1, 1], higher is better). If `plot=True`, it produces a line plot
  of silhouette score vs. k.

  Parameters
  ----------
  data : array-like of shape (n_samples,) or (n_samples, n_features)
      Input data. If 1D, pass a 1D array; implementations typically reshape
      internally to (n_samples, 1) for compatibility with scikit-learn.
  k_min : int
      Smallest number of clusters to evaluate (must be >= 2).
  k_max : int
      Largest number of clusters to evaluate. Implementations may clip this
      to at most n_samples - 1.
  plot : bool, default=True
      If True, plot silhouette score vs. k.
  random_state : int, default=0
      Seed used for K-Means initialization to make results reproducible.

  Returns
  -------
  ks : list of int
      The evaluated k values.
  scores : list of float
      The mean silhouette score for each k (NaN where undefined).
  """
  data = np.asarray(data)  # accept lists as well as arrays
  if data.ndim == 1:
    data = data.reshape(-1, 1) # reshape for compatibility with sklearn
  ks = list(range(k_min, k_max + 1))
  scores = []

  # calculate silhouette score for each value of k
  for k in ks:
      score = NotImplemented ### YOUR SOLUTION HERE
      scores.append(score)

  if plot:
    plt.figure(figsize=(6, 4))
    plt.plot(ks, scores, "-o")
    plt.xlabel("k")
    plt.ylabel("Silhouette score")
    plt.title("Silhouette scores")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
  return ks, scores

In [None]:
# Task 3
ks, scores = NotImplemented ### YOUR SOLUTION HERE

Task 4. Examine the plot produced by the silhouette sweep. Which choice of k (how many clusters) best separates the data? How does this line up with your intuition, given your observation of the underlying dataset?

YOUR SOLUTION HERE

# 5 K-means clustering on PCA of wines dataset

Now let's return to the wines dataset. Execute the code below to visualize the wines dataset projected onto its first two principal components.

In [None]:
menglab.pca_scatter(Xw_pca, y=y_wine, pcx=1, pcy=2, title='Wine: PC1 vs PC2 (colored by true class)')

## 5.1  Interactive K-Means Clustering Demo

This interactive demo helps you **visualize how K-Means clustering works step by step**.  
You can adjust the **number of clusters (`k`)** and the **random seed** to see how the algorithm behaves differently with each setup.

At each iteration, the algorithm:
1. Assigns each point to its **nearest cluster centroid**.
2. Recalculates the **centroid positions** based on those assignments.
3. Updates the visualization to show how the clusters evolve.


In [None]:
k = 2
wine_2PCs = Xw_pca[:, :2]
menglab.interactive_k_means(wine_2PCs, y_wine)

## 5.2 WCSS and Elbow Plots
The interactive demo above calculates the **Within Cluster Sum of Squares** (**WCSS**).

WCSS measures **how tightly grouped** the points are inside each cluster.  
- If points are **very close** to their cluster center, the WCSS value will be **small**.  
- If points are **spread out**, the WCSS value will be **large**.

So, for a fixed number of clusters, a **lower WCSS** means the clusters are better: they are more compact and consistent.
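
As a rough illustration, WCSS can be computed directly from the points, their labels, and the cluster centers. The sketch below assumes the usual definition (squared Euclidean distance from each point to its own cluster center, summed over all points), which is also the quantity scikit-learn's `KMeans` reports as `inertia_`. The helper name `wcss` and the data are made up for this example.

```python
import numpy as np

def wcss(X, labels, centers):
    """Sum of squared distances from each point to its assigned cluster center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    diffs = X - centers[labels]      # each point minus its own center
    return float(np.sum(diffs ** 2))

# Made-up 2D points in two tight clusters
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])
print(wcss(X, labels, centers))  # each point is 1 unit from its center: 1 + 1 + 1 + 1 = 4.0
```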

### Why Do We Care About WCSS?

As we increase the number of clusters (for example, from 2 to 3 to 4, and so on), the WCSS usually **goes down**, because each cluster gets smaller and fits the data better.  
But if we keep adding clusters, we eventually stop getting much improvement.  

To find the **best number of clusters**, we often use the **elbow method**:
- Plot the WCSS value for different numbers of clusters.
- Look for the point where the line starts to **bend** like an elbow.
- That “elbow” is often a good choice for the number of clusters.
- This method can be used alongside the **silhouette method** for determining k.
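
The elbow is usually judged by eye, but one simple heuristic is to pick the k where the slope of the WCSS curve changes the most (the largest second difference). The sketch below applies that heuristic to made-up WCSS values. The helper name `elbow_k` is illustrative, the heuristic assumes evenly spaced k values, and it is only one of several possible ways to locate an elbow.

```python
import numpy as np

def elbow_k(ks, wcss_values):
    """Return the k where the WCSS curve bends most sharply (largest second difference)."""
    ks = np.asarray(ks)
    wcss_values = np.asarray(wcss_values, dtype=float)
    second_diff = np.diff(wcss_values, n=2)     # change in slope at each interior k
    return int(ks[np.argmax(second_diff) + 1])  # +1: second_diff[0] belongs to ks[1]

# Made-up WCSS values for illustration only (not from the wine dataset)
ks = [1, 2, 3, 4, 5, 6]
wcss_values = [100.0, 60.0, 20.0, 17.0, 15.0, 14.0]
print(elbow_k(ks, wcss_values))  # 3
```

Here the curve drops steeply up to k=3 and flattens afterwards, so the heuristic returns 3.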


### ✅ Exercise 20 [2 Points]: Elbow Plot

**Tasks** Use the interactive k-means demo above to answer the following questions.

1. What is the within-cluster sum of squares (**WCSS**) for $k=2, 3, 4,$ and $5$ after k-means clustering has converged?

**N.B.** Leave the random seed set to 0!

2. Execute the code cell below to populate an "Elbow plot" (i.e. WCSS vs. k). Is there a visible elbow in the plot where the WCSS stops dropping dramatically? What value of k occurs at the elbow? How does this correspond to the number of clusters we might expect from the dataset?

In [None]:
# Variables (provided)
# Execute once to define these variables
ks = [2, 3, 4, 5]
WCSS = [0, 0, 0, 0]


In [None]:
# Task 1
WCSS[0] = NotImplemented ### YOUR SOLUTION HERE, k=2
WCSS[1] = NotImplemented ### YOUR SOLUTION HERE, k=3
WCSS[2] = NotImplemented ### YOUR SOLUTION HERE, k=4
WCSS[3] = NotImplemented ### YOUR SOLUTION HERE, k=5

In [None]:
# Task 2: Visualization
plt.plot(ks, WCSS, '-o')
plt.xlabel("k (number of clusters)")
plt.ylabel("Within Cluster Sum of Squares (WCSS)")
plt.title("Elbow Plot")

2. Execute the code cell above to populate an "Elbow plot" (i.e. WCSS vs. k). Is there a visible elbow in the plot where the WCSS stops dropping dramatically? What value of k occurs at the elbow? How does this correspond to the number of clusters we might expect from the dataset?

YOUR ANSWER HERE

## 5.3 Silhouette Score Plot

Let's see whether the silhouette plot agrees with the elbow plot analysis for determining the optimal number of clusters.

### ✅ Exercise 21 [2 points]: Silhouette Score

**Tasks**

1. Use the `silhouette_sweep` function you implemented in Exercise 19 to plot silhouette scores for k values between 2 and 8 for the `wine_2PCS` dataset defined below.

2. What is the optimal value of k (number of clusters) according to the Silhouette Score analysis? Does it agree with the Elbow plot?

In [None]:
# Variables (provided)
# Execute this code cell to define provided variables
wine_2PCS = Xw_pca[:, :2] # Wines dataset projected into the first 2 PCs

In [None]:
# Task 1
ks, scores = NotImplemented ### YOUR SOLUTION HERE

2. What is the optimal value of k (number of clusters) according to the Silhouette Score analysis? Does it agree with the Elbow plot?

YOUR SOLUTION HERE

**Congratulations!** You have completed Lab 3!  🥳