In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw12.ipynb")

In [None]:
rng_seed = 60

In [None]:
#imports
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import scipy as sp
import pandas as pd
#below line allows matplotlib plots to appear in cell output
%matplotlib inline

# **Question 1**: Machine Learning with the Iris Dataset

In this question, you'll explore fundamental machine learning techniques using the famous Iris dataset. You'll implement data loading, unsupervised clustering, and supervised classification methods.

### Background: The Iris Dataset

The Iris dataset is one of the most well-known datasets in machine learning and statistics. It was introduced by Ronald Fisher in 1936 and contains measurements of 150 iris flowers from three different species:
- **Iris Setosa**
- **Iris Versicolor**
- **Iris Virginica**

Each sample has four features:
1. **Sepal Length** (cm)
2. **Sepal Width** (cm)
3. **Petal Length** (cm)
4. **Petal Width** (cm)

The goal is to classify flowers into one of the three species based on these measurements.

### Machine Learning Overview

We'll explore two types of machine learning:

**Unsupervised Learning (K-Means Clustering):**
- No labeled data is used during training
- Algorithm finds patterns and groups data into clusters
- Useful when you don't have labeled examples

**Supervised Learning (Classification):**
- Labeled data is used to train a model
- Model learns to predict labels for new, unseen data
- Decision Trees and Neural Networks are popular classifiers

## **Part A**: Loading the Iris Dataset

In this part, you'll load the Iris dataset using scikit-learn's built-in dataset module.

### Your Task

Write a function `load_iris_data()` that:
1. Loads the Iris dataset from sklearn
2. Extracts the feature data and class labels
3. Returns them as numpy arrays

### Background: sklearn.datasets

Scikit-learn provides several built-in datasets for practice and testing. The `load_iris()` function returns a dictionary-like object with the following key attributes:
- `.data`: The feature matrix (measurements)
- `.target`: The class labels (0, 1, or 2 for the three species)
- `.feature_names`: Names of the four features
- `.target_names`: Names of the three species

### Requirements

- Import and use `sklearn.datasets.load_iris()`
- Extract the features (`.data` attribute) as a numpy array
- Extract the class labels (`.target` attribute) as a numpy array
- Return both arrays as a tuple: `(features, labels)`

**Parameters:**
- None

**Returns:**
- `features`: numpy array of shape `(150, 4)`, the feature matrix
- `labels`: numpy array of shape `(150,)`, the class labels (0, 1, or 2)

In [None]:
def load_iris_data():
    
    return features, labels

In [None]:
grader.check("q1a")

## **Part B**: K-Means Clustering

In this part, you'll implement unsupervised clustering using the K-Means algorithm.

### Background: K-Means Clustering

**K-Means** is an unsupervised learning algorithm that groups data points into `k` clusters based on their features. The algorithm works as follows:

1. **Initialize**: Randomly place `k` cluster centers (centroids)
2. **Assignment**: Assign each data point to the nearest centroid
3. **Update**: Move each centroid to the mean position of all points assigned to it
4. **Repeat**: Continue steps 2-3 until convergence

Mathematically, K-Means minimizes the within-cluster sum of squares:

$$J = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2$$

where $C_i$ is cluster $i$ and $\mu_i$ is its centroid.
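
The four steps and the objective $J$ can be sketched in plain NumPy. This is only an illustration on made-up toy points, not scikit-learn's implementation: the function name `lloyd_kmeans`, its parameters, and the data are all hypothetical, and the initialization simply picks `k` data points at random.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Toy K-Means (hypothetical helper): returns (labels, centroids, J)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize: use k distinct data points as the starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment: squared distance from every point to every centroid
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[i] = members.mean(axis=0)
        # Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and within-cluster sum of squares J
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    J = ((points - centroids[labels]) ** 2).sum()
    return labels, centroids, J

# Two well-separated toy groups of three points each (made-up data)
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels, toy_centroids, toy_J = lloyd_kmeans(toy_points, k=2)
print(toy_labels, toy_J)
```

The cluster numbers themselves are arbitrary: which group is called 0 and which is called 1 depends on the initialization, so only the grouping is meaningful.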

### Your Task

Write a function `kmeans_clustering(features, k)` that:
1. Takes the feature matrix from Part A
2. Applies K-Means clustering with `k` clusters
3. Returns the cluster labels assigned to each data point

### Requirements

- Use scikit-learn's K-Means implementation from `sklearn.cluster`
- Set `random_state=rng_seed` for reproducible results
- Fit the model to the features and extract cluster labels
- Return the cluster assignments as a numpy array

**Hint:** Look for the `KMeans` class in sklearn.cluster and its `.fit_predict()` method.

**Parameters:**
- `features`: numpy array of shape `(n_samples, n_features)`, the feature matrix
- `k`: int, the number of clusters

**Returns:**
- `cluster_labels`: numpy array of shape `(n_samples,)`, cluster assignments (0 to k-1)

In [None]:
def kmeans_clustering(features, k):
    
    return cluster_labels

In [None]:
grader.check("q1b")

### Example: Visualizing K-Means Clustering with Different k Values

Let's visualize how K-Means clustering performs with different numbers of clusters. We'll use the first two features (sepal dimensions) for easy visualization:

In [None]:
# Load the data
features, true_labels = load_iris_data()

# Try different values of k
k_values = [2, 3, 4, 5]
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.flatten()

for idx, k in enumerate(k_values):
    # Perform K-means clustering
    cluster_labels = kmeans_clustering(features, k=k)
    
    # Plot using first two features (sepal length and sepal width)
    scatter = axes[idx].scatter(features[:, 0], features[:, 1], 
                                c=cluster_labels, cmap='viridis', 
                                s=50, alpha=0.7, edgecolors='black')
    
    axes[idx].set_xlabel('Sepal Length (cm)', fontsize=11)
    axes[idx].set_ylabel('Sepal Width (cm)', fontsize=11)
    axes[idx].set_title(f'K-Means with k={k} clusters', fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)
    
    # Add colorbar
    plt.colorbar(scatter, ax=axes[idx], label='Cluster')

plt.tight_layout()
plt.show()

print("Note: The true Iris dataset has 3 species, but K-Means can be applied with any k value.")

In [None]:
# Compare K-means clusters with true labels
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# K-means with k=3
clusters_3 = kmeans_clustering(features, k=3)
scatter1 = axes[0].scatter(features[:, 0], features[:, 1], 
                           c=clusters_3, cmap='viridis', 
                           s=50, alpha=0.7, edgecolors='black')
axes[0].set_xlabel('Sepal Length (cm)', fontsize=11)
axes[0].set_ylabel('Sepal Width (cm)', fontsize=11)
axes[0].set_title('K-Means Clustering (k=3)', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3)
plt.colorbar(scatter1, ax=axes[0], label='Cluster')

# True labels
scatter2 = axes[1].scatter(features[:, 0], features[:, 1], 
                           c=true_labels, cmap='viridis', 
                           s=50, alpha=0.7, edgecolors='black')
axes[1].set_xlabel('Sepal Length (cm)', fontsize=11)
axes[1].set_ylabel('Sepal Width (cm)', fontsize=11)
axes[1].set_title('True Species Labels', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3)
plt.colorbar(scatter2, ax=axes[1], label='Species')

plt.tight_layout()
plt.show()

print("K-Means often finds similar (but not identical) groupings to the true species labels!")

## **Part C**: Decision Tree Classification

In this part, you'll train a supervised learning model using Decision Trees and evaluate its performance.

### Background: Decision Tree Classifier

**Decision Trees** are supervised learning models that make predictions by learning simple decision rules from the training data. The tree structure consists of:

- **Root Node**: The top of the tree, which holds the entire dataset and applies the first split
- **Internal Nodes**: Decision points based on feature values
- **Branches**: Outcomes of decisions
- **Leaf Nodes**: Final predictions (class labels)

For example, a decision might be: "If petal length < 2.5 cm, classify as Setosa, otherwise continue..."

Decision trees work by recursively splitting the data to maximize **information gain** or minimize **impurity** (measured by Gini impurity or entropy).
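
As a rough illustration of what "minimize impurity" means, the sketch below computes the Gini impurity of a set of labels and the impurity decrease from one candidate split. The helper names `gini` and `impurity_decrease` and the petal-length values are made up for this example; scikit-learn's tree builder searches over many such splits internally.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(feature, labels, threshold):
    """Parent impurity minus the size-weighted impurity of the two children."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    left = labels[feature < threshold]
    right = labels[feature >= threshold]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted

# Made-up petal lengths (cm) and species labels for eight flowers
petal_length = [1.4, 1.5, 1.3, 4.5, 4.7, 5.9, 6.0, 5.1]
species = [0, 0, 0, 1, 1, 2, 2, 2]

parent_gini = gini(species)
gain = impurity_decrease(petal_length, species, threshold=2.5)
print(parent_gini, gain)
```

The split at 2.5 cm sends all class-0 flowers to one side, so that child is perfectly pure (impurity 0) and the overall impurity drops.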

### Confusion Matrix

A **confusion matrix** is a table that shows the performance of a classification model:

|                  | Predicted Class 0    | Predicted Class 1    | Predicted Class 2    |
|------------------|----------------------|----------------------|----------------------|
| **True Class 0** | Correct              | True 0, predicted 1  | True 0, predicted 2  |
| **True Class 1** | True 1, predicted 0  | Correct              | True 1, predicted 2  |
| **True Class 2** | True 2, predicted 0  | True 2, predicted 1  | Correct              |

Diagonal elements represent correct predictions, while off-diagonal elements represent misclassifications.
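
To make the layout concrete, here is a small NumPy sketch that counts a confusion matrix by hand, with rows as true classes and columns as predicted classes. The helper name `confusion_counts` and the label arrays are invented for illustration.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Count matrix with rows = true class, columns = predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # Add 1 to cell (true, predicted) for every sample
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

# Made-up true labels and predictions for eight samples
y_true_toy = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_pred_toy = np.array([0, 0, 1, 2, 1, 2, 2, 1])

cm_toy = confusion_counts(y_true_toy, y_pred_toy, n_classes=3)
accuracy_toy = np.trace(cm_toy) / cm_toy.sum()  # correct predictions / all predictions
print(cm_toy)
print(accuracy_toy)
```

The accuracy is the sum of the diagonal divided by the total count, which is the same `np.trace(...) / np.sum(...)` calculation used in the example cells of this notebook.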

### Your Task

Write a function `train_decision_tree(features, labels, show_plot=False)` that:
1. Trains a Decision Tree classifier on the full dataset
2. Makes predictions on the same data
3. Computes and optionally plots the confusion matrix
4. Returns the trained model, the confusion matrix, and the figure

### Requirements

- Use `sklearn.tree.DecisionTreeClassifier` with `random_state=rng_seed`
- Fit the model using `.fit(features, labels)`
- Make predictions using `.predict(features)`
- Compute the confusion matrix using `sklearn.metrics.confusion_matrix`
- If `show_plot=True`, create a heatmap of the confusion matrix using matplotlib
- Return the trained model, the confusion matrix (as a numpy array), and the figure (or `None` if `show_plot=False`)

**Hint:** For plotting, use `plt.imshow()` with a colormap and add labels for clarity.

**Parameters:**
- `features`: numpy array of shape `(n_samples, n_features)`, the feature matrix
- `labels`: numpy array of shape `(n_samples,)`, the true class labels
- `show_plot`: bool, whether to display the confusion matrix plot (default: False)

**Returns:**
- `model`: trained DecisionTreeClassifier object
- `conf_matrix`: numpy array, the confusion matrix
- `fig`: matplotlib figure object (or None if show_plot=False)

In [None]:
def train_decision_tree(features, labels, show_plot=False):
    
    return model, conf_matrix, fig

In [None]:
grader.check("q1c")

### Example: Visualizing Decision Tree Performance

In [None]:
# Train and visualize
features, labels = load_iris_data()
model, conf_matrix, fig = train_decision_tree(features, labels, show_plot=True)

# Calculate and display accuracy
accuracy = np.trace(conf_matrix) / np.sum(conf_matrix)
print(f"Decision Tree Accuracy: {accuracy:.1%}")
print(f"\nConfusion Matrix:")
print(conf_matrix)
print("\nNote: Decision trees can achieve perfect accuracy on training data,")
print("but may overfit. In practice, we should use train/test splits!")

## **Part D**: Neural Network Classification (MLPClassifier)

In this part, you'll train a neural network classifier using scikit-learn's Multi-Layer Perceptron (MLP).

### Background: Neural Networks

**Artificial Neural Networks** are inspired by biological neurons and consist of layers of interconnected nodes:

- **Input Layer**: Receives the feature values
- **Hidden Layer(s)**: Intermediate layers that learn complex patterns
- **Output Layer**: Produces class probabilities

Each connection has a **weight** that is adjusted during training. The network learns by:
1. Making predictions (forward propagation)
2. Computing the error
3. Adjusting weights to reduce error (backpropagation)

The **Multi-Layer Perceptron (MLP)** is a feedforward neural network that uses:
- **Activation functions**: Non-linear transformations (e.g., ReLU, tanh)
- **Optimization**: Gradient descent to minimize loss
- **Regularization**: Techniques to prevent overfitting

### Mathematical Background

For a simple neural network with one hidden layer:

$$h = \text{activation}(W_1 \cdot x + b_1)$$
$$y = \text{softmax}(W_2 \cdot h + b_2)$$

where:
- $x$ is the input features
- $W_1, b_1$ are weights and biases for the hidden layer
- $h$ is the hidden layer output
- $W_2, b_2$ are weights and biases for the output layer
- $y$ is the predicted class probabilities
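
The two equations can be traced with a few lines of NumPy. This sketch only runs a forward pass with random, untrained weights; the hidden-layer size of 8, the ReLU activation, and the input values are assumptions chosen for illustration, and no training happens here.

```python
import numpy as np

def relu(z):
    """ReLU activation: negative values become zero."""
    return np.maximum(z, 0.0)

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = np.array([5.1, 3.5, 1.4, 0.2])  # one flower: four measurements (illustrative)

# Untrained random weights: 4 inputs -> 8 hidden units -> 3 classes
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)

h = relu(W1 @ x + b1)     # hidden layer output
y = softmax(W2 @ h + b2)  # predicted class probabilities
print(y, y.sum())
```

Because the weights are random, the probabilities are meaningless until training adjusts `W1`, `b1`, `W2`, and `b2`.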

### Your Task

Write a function `train_neural_network(features, labels, show_plot=False)` that:
1. Trains an MLP classifier on the full Iris dataset
2. Makes predictions and computes the confusion matrix
3. Optionally plots the confusion matrix
4. Returns the trained model, confusion matrix, and figure

### Requirements

- Use `sklearn.neural_network.MLPClassifier`
- Set `random_state=rng_seed` and `max_iter=1000` (or more if needed)
- Choose appropriate hyperparameters (hidden layer sizes, activation function, solver, etc.)
- The model should achieve at least 90% accuracy on the training data
- Follow the same return pattern as Part C: `(model, conf_matrix, fig)`

**Parameters:**
- `features`: numpy array of shape `(n_samples, n_features)`, the feature matrix
- `labels`: numpy array of shape `(n_samples,)`, the true class labels
- `show_plot`: bool, whether to display the confusion matrix plot (default: False)

**Returns:**
- `model`: trained MLPClassifier object
- `conf_matrix`: numpy array, the confusion matrix
- `fig`: matplotlib figure object (or None if show_plot=False)

In [None]:
def train_neural_network(features, labels, show_plot=False):
    
    return model, conf_matrix, fig

In [None]:
grader.check("q1d")

### Example: Comparing All Three Methods

Let's compare the performance of K-Means, Decision Tree, and Neural Network:

In [None]:
# Load data
features, labels = load_iris_data()

# Train all three models
print("Training models...")
print("-" * 50)

# K-Means (unsupervised)
clusters = kmeans_clustering(features, k=3)
print("✓ K-Means clustering complete")

# Decision Tree (supervised)
dt_model, dt_conf, dt_fig = train_decision_tree(features, labels)
dt_accuracy = np.trace(dt_conf) / np.sum(dt_conf)
print(f"✓ Decision Tree trained - Accuracy: {dt_accuracy:.1%}")

# Neural Network (supervised)
nn_model, nn_conf, nn_fig = train_neural_network(features, labels)
nn_accuracy = np.trace(nn_conf) / np.sum(nn_conf)
print(f"✓ Neural Network trained - Accuracy: {nn_accuracy:.1%}")

print("-" * 50)
print("\nNote: K-Means is unsupervised (no accuracy metric),")
print("while Decision Tree and Neural Network use labeled data.")

In [None]:
# Visualize confusion matrices side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Decision Tree confusion matrix
im1 = axes[0].imshow(dt_conf, cmap='Blues', interpolation='nearest')
axes[0].set_xlabel('Predicted Label', fontsize=11)
axes[0].set_ylabel('True Label', fontsize=11)
axes[0].set_title(f'Decision Tree\nAccuracy: {dt_accuracy:.1%}', fontsize=12, fontweight='bold')
axes[0].set_xticks(np.arange(3))
axes[0].set_yticks(np.arange(3))
plt.colorbar(im1, ax=axes[0])

# Add text annotations for decision tree
for i in range(3):
    for j in range(3):
        text = axes[0].text(j, i, dt_conf[i, j],
                           ha="center", va="center", 
                           color="black" if dt_conf[i, j] < dt_conf.max()/2 else "white",
                           fontsize=14, fontweight='bold')

# Neural Network confusion matrix
im2 = axes[1].imshow(nn_conf, cmap='Greens', interpolation='nearest')
axes[1].set_xlabel('Predicted Label', fontsize=11)
axes[1].set_ylabel('True Label', fontsize=11)
axes[1].set_title(f'Neural Network\nAccuracy: {nn_accuracy:.1%}', fontsize=12, fontweight='bold')
axes[1].set_xticks(np.arange(3))
axes[1].set_yticks(np.arange(3))
plt.colorbar(im2, ax=axes[1])

# Add text annotations for neural network
for i in range(3):
    for j in range(3):
        text = axes[1].text(j, i, nn_conf[i, j],
                           ha="center", va="center", 
                           color="black" if nn_conf[i, j] < nn_conf.max()/2 else "white",
                           fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nBoth supervised methods achieve excellent performance on the Iris dataset!")

# **Question 2**: Regression with the Diabetes Dataset

In this question, you'll explore **regression** - predicting continuous values rather than discrete classes. You'll work with the diabetes dataset to predict disease progression and learn about proper train/test splitting and data standardization.

### Background: Regression vs Classification

While classification predicts discrete categories (like iris species), **regression** predicts continuous numerical values. Examples include:
- Predicting house prices from features
- Forecasting temperature from weather data
- Estimating disease progression from medical measurements

### The Diabetes Dataset

The diabetes dataset contains 442 samples with 10 baseline features:
- Age
- Sex
- Body mass index (BMI)
- Average blood pressure
- Six blood serum measurements

The target variable is a quantitative measure of disease progression one year after baseline.

### Key Concepts

**Train/Test Split:**
- Training data: Used to fit the model
- Test/validation data: Used to evaluate performance on unseen data
- Prevents overfitting and provides realistic performance estimates
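
A shuffled split can be sketched with NumPy alone. The helper name `shuffle_split` and the toy arrays are hypothetical, and this sketch will not reproduce the exact rows chosen by scikit-learn's `train_test_split`; it only shows the idea.

```python
import numpy as np

def shuffle_split(X, y, train_fraction=0.8, seed=0):
    """Shuffle row indices, then cut them into a training part and a test part."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(round(train_fraction * len(X)))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = shuffle_split(X_toy, y_toy, train_fraction=0.8)
print(X_tr.shape, X_te.shape)
```

Every sample lands in exactly one of the two sets, so the test rows are never seen while fitting.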

**Data Standardization:**
- Neural networks perform better when features are on similar scales
- Standardization: Transform features to have mean=0 and std=1
- Formula: $z = \frac{x - \mu}{\sigma}$
- **Important:** Fit standardization on training data only, then apply to test data

## **Part A**: Loading and Preprocessing the Diabetes Dataset

In this part, you'll load the diabetes dataset, split it into training and test sets, and standardize the features.

### Your Task

Write a function `prepare_diabetes_data(train_fraction=0.8)` that:
1. Loads the diabetes dataset from sklearn
2. Splits the data into training and test sets
3. Standardizes both feature sets using a scaler fit on the training features only (important!)
4. Returns the preprocessed training and test data

### Background: Why Standardize?

Neural networks are sensitive to feature scales. Features with larger values can dominate the learning process. **Standardization** transforms each feature to have:
- Mean ($\mu$) = 0
- Standard deviation ($\sigma$) = 1

The transformation is: $z = \frac{x - \mu}{\sigma}$

**Critical:** Always fit the standardization parameters (mean, std) on the training data only, then apply the same transformation to the test data. This prevents "data leakage" where test information influences training.
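
The fit-on-train, apply-to-test pattern looks like this when written out by hand in NumPy. The arrays are made-up toy values. `np.std` with its default `ddof=0` is used here, which is the same population standard deviation that `StandardScaler` uses.

```python
import numpy as np

# Toy features on very different scales (made-up values)
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test_toy = np.array([[2.5, 250.0], [5.0, 500.0]])

# "Fit": compute mean and standard deviation from the training data only
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)

# "Transform": apply the SAME mu and sigma to both sets
X_train_std = (X_train_toy - mu) / sigma
X_test_std = (X_test_toy - mu) / sigma

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std)
```

The standardized training features have mean 0 and standard deviation 1 by construction. The test features generally do not, and that is expected, since they are scaled with the training statistics.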

### Requirements

- Use `sklearn.datasets.load_diabetes()` to load the dataset
- Use `sklearn.model_selection.train_test_split()` with `train_size=train_fraction` and `random_state=rng_seed`
- Use `sklearn.preprocessing.StandardScaler` for standardization
  - Fit the scaler on training features only: `.fit(X_train)`
  - Transform training features: `.transform(X_train)`
  - Transform test features using the same scaler: `.transform(X_test)`
- Return four arrays: `(X_train, X_test, y_train, y_test)`

**Hint:** The StandardScaler class has methods `.fit()`, `.transform()`, and `.fit_transform()`.

**Parameters:**
- `train_fraction`: float, fraction of data to use for training (default: 0.8)

**Returns:**
- `X_train`: numpy array, standardized training features
- `X_test`: numpy array, standardized test features  
- `y_train`: numpy array, training targets
- `y_test`: numpy array, test targets

In [None]:
def prepare_diabetes_data(train_fraction=0.8):
    
    return X_train, X_test, y_train, y_test

In [None]:
grader.check("q2a")

## **Part B**: Training a Neural Network Regressor

In this part, you'll train a Multi-Layer Perceptron regressor and visualize its training progress.

### Your Task

Write a function `train_mlp_regressor(X_train, X_test, y_train, y_test, show_plot=False)` that:
1. Trains an MLP regressor on the training data
2. Tracks validation scores during training (note: validation is taken from training data)
3. Computes the test R² score on the actual test set
4. Optionally plots training and validation scores over epochs
5. Returns the trained model, test R² score, and figure

### Background: MLPRegressor and Validation Curves

**MLPRegressor** is scikit-learn's neural network for regression. Unlike MLPClassifier, it predicts continuous values.

**Validation during training:**
- Training score: Performance on training data (can overfit)
- Validation score: Performance on a subset held out from training data
- Test score: Performance on completely separate test data (most realistic)
- Monitoring validation helps detect overfitting during training

The MLPRegressor can track validation scores when you set:
- `validation_fraction`: Portion of training data held out for validation
- `early_stopping=True`: Stop training when the validation score stops improving

After fitting, the model stores the training loss per iteration in `.loss_curve_` and, when `early_stopping=True`, the validation R² scores in `.validation_scores_`.
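
The idea behind early stopping can be sketched as a simple "patience" rule over a list of validation scores. This is a generic illustration, not scikit-learn's exact stopping logic: the function name `early_stop_epoch`, the `patience` and `tol` parameters, and the scores are all hypothetical.

```python
def early_stop_epoch(validation_scores, patience=3, tol=1e-4):
    """Return the epoch (1-based) at which a simple patience rule would stop."""
    best = float("-inf")
    stale = 0  # consecutive epochs without a meaningful improvement
    for epoch, score in enumerate(validation_scores, start=1):
        if score > best + tol:
            best = score
            stale = 0
        else:
            stale += 1
        if stale >= patience:
            return epoch
    return len(validation_scores)  # never triggered: ran all epochs

# Made-up validation R² scores that improve and then plateau
scores_toy = [0.10, 0.30, 0.40, 0.41, 0.40, 0.39, 0.41]
stop_at = early_stop_epoch(scores_toy, patience=3)
print(stop_at)
```

Stopping once the validation score plateaus keeps the model from continuing to fit noise in the training data.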

**Important:** Since validation is taken from the training data, we also compute the test R² score on the true test set to get an unbiased performance estimate.

### Evaluation Metric: R² Score

For regression, we use the **coefficient of determination** ($R^2$):

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where:
- $y_i$ are true values
- $\hat{y}_i$ are predicted values
- $\bar{y}$ is the mean of true values

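$R^2 = 1$ is perfect prediction, and $R^2 = 0$ is no better than always predicting the mean $\bar{y}$. $R^2$ can be negative when the model does worse than that baseline.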
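
The formula can be checked by hand on a few numbers. The helper name `r2_manual` and the toy values below are made up for illustration.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared prediction errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared deviations from the mean
    return 1.0 - ss_res / ss_tot

y_toy = [3, 5, 7, 9]
r2_perfect = r2_manual(y_toy, [3, 5, 7, 9])  # exact predictions
r2_mean = r2_manual(y_toy, [6, 6, 6, 6])     # always predict the mean
r2_bad = r2_manual(y_toy, [9, 7, 5, 3])      # predictions in reverse order
print(r2_perfect, r2_mean, r2_bad)
```

The three cases give 1, 0, and a negative value respectively, which shows that $R^2$ is not bounded below by zero.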

### Requirements

- Use `sklearn.neural_network.MLPRegressor`
- Set `random_state=rng_seed`, `max_iter=500` (or more if needed)
- Enable validation tracking with `validation_fraction=0.1` and `early_stopping=True`
- Choose appropriate hyperparameters (hidden layers, activation, etc.)
- If `show_plot=True`, create a plot showing:
  - Training loss over iterations (from `.loss_curve_`)
  - Validation score over iterations (from `.validation_scores_`)
  - Use two y-axes if needed (loss decreases, score increases)
- Compute the test R² score using `.score(X_test, y_test)`
- Return the trained model, test R² score, and figure

**Hint:** The `.loss_curve_` attribute contains training loss per epoch, and `.validation_scores_` contains validation R² scores. Use `model.score(X_test, y_test)` to get the test R² score.

**Parameters:**
- `X_train`: numpy array, standardized training features
- `X_test`: numpy array, standardized test features
- `y_train`: numpy array, training targets
- `y_test`: numpy array, test targets
- `show_plot`: bool, whether to display the training curves (default: False)

**Returns:**
- `model`: trained MLPRegressor object
- `test_r2`: float, R² score on the test set
- `fig`: matplotlib figure object (or None if show_plot=False)

In [None]:
def train_mlp_regressor(X_train, X_test, y_train, y_test, show_plot=False):
    
    return model, test_r2, fig

In [None]:
grader.check("q2b")

### Example: Complete Regression Pipeline

Let's demonstrate the full pipeline from data preparation to model training and evaluation:

In [None]:
# Prepare the data
print("Step 1: Loading and preprocessing data...")
X_train, X_test, y_train, y_test = prepare_diabetes_data(train_fraction=0.8)

print(f"  Training set: {X_train.shape[0]} samples")
print(f"  Test set: {X_test.shape[0]} samples")
print(f"  Number of features: {X_train.shape[1]}")
print(f"  Training features - Mean: {np.mean(X_train):.2e}, Std: {np.mean(np.std(X_train, axis=0)):.2f}")
print()

# Train the model
print("Step 2: Training MLP Regressor...")
model, test_score, fig = train_mlp_regressor(X_train, X_test, y_train, y_test, show_plot=True)
print(f"  Training completed in {model.n_iter_} iterations")
print()

# Evaluate performance
print("Step 3: Evaluating model performance...")
train_score = model.score(X_train, y_train)

print(f"  Training R² score: {train_score:.4f}")
print(f"  Test R² score: {test_score:.4f}")

if train_score - test_score > 0.1:
    print("  ⚠ Warning: Training score is much higher than test score")
    print("    This suggests some overfitting.")
else:
    print("  ✓ Good generalization - train and test scores are similar")

In [None]:
# Visualize predictions vs actual values
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training data
axes[0].scatter(y_train, y_train_pred, alpha=0.5, s=30, edgecolors='black', linewidth=0.5)
axes[0].plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 
             'r--', linewidth=2, label='Perfect prediction')
axes[0].set_xlabel('Actual Disease Progression', fontsize=12)
axes[0].set_ylabel('Predicted Disease Progression', fontsize=12)
axes[0].set_title(f'Training Set\nR² = {train_score:.4f}', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Test data
axes[1].scatter(y_test, y_test_pred, alpha=0.5, s=30, edgecolors='black', linewidth=0.5, color='green')
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
             'r--', linewidth=2, label='Perfect prediction')
axes[1].set_xlabel('Actual Disease Progression', fontsize=12)
axes[1].set_ylabel('Predicted Disease Progression', fontsize=12)
axes[1].set_title(f'Test Set\nR² = {test_score:.4f}', fontsize=13, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nPoints closer to the red line indicate better predictions.")

## Required disclosure of use of AI technology

Please indicate whether you used AI to complete this homework. If you did, explain how you used it as a comment in the Python cell below.

In [None]:
"""
# write ai disclosure here:

"""

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit.

Upload the .zip file to Gradescope!

In [None]:
grader.export(pdf=False, force_save=True, run_tests=True)