# Assignment 01

This assignment consists of two tasks with subtasks. Every subtask has a point value and lists expectations for answers. Please read both the task and its expectations carefully before answering.

### Hand-in Instructions

Submit a single `.ipynb` file with all outputs saved. The notebook must be fully self-contained and ready to read without running any cells.

### Overview

| Task  | Topic                                     | Points  |
| ----- | ----------------------------------------- | ------- |
| **1** | **PCA**                                   |         |
| 1.1   | 3D scatter plot                           | 5       |
| 1.2   | 2D scatter plot                           | 5       |
| 1.3   | Interpreting variance                     | 10      |
| 1.4   | Variance and geometry                     | 10      |
| **2** | **Breast Cancer Classification Pipeline** |         |
| 2.1   | Exploratory Data Analysis                 | 30      |
| 2.2   | Train/Test Split                          | 5       |
| 2.3   | Baseline Model                            | 5       |
| 2.4   | Kitchen Sink Model                        | 10      |
| 2.5   | Build Your Own Pipeline                   | 20      |
|       | **Total**                                 | **100** |


---

## Setup


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.datasets import load_breast_cancer


from assignment_utils import generate_annulus_4d

---

## Task 1: PCA

You've been given a mysterious dataset with **4 dimensions** (F1, F2, F3, F4). We can't directly visualize 4D data, but we can look at 3 dimensions at a time and use color for the 4th.


In [2]:
data_4d, radius = generate_annulus_4d()
df = pd.DataFrame(data_4d, columns=["F1", "F2", "F3", "F4"])
print(f"Dataset shape: {df.shape}")

# Let's visualize the first 3 dimensions (F1, F2, F3) in 3D
# The 4th dimension (F4) is represented as color
fig = px.scatter_3d(
    df,
    x="F1",
    y="F2",
    z="F3",
    color=df["F4"],
    color_continuous_scale="viridis",
    title="3D View of the 4D Dataset (F1, F2, F3, color=F4)",
    labels={"color": "F4"},
)
fig.update_traces(marker=dict(size=3))
fig.update_layout(width=800, height=600)
fig.show()

Dataset shape: (1500, 4)


### Principal Component Analysis (PCA)

**PCA** is a technique that finds new axes (called _principal components_) that capture the most variance in the data.

**Mathematical formulation:**

1. **Standardize** the data: $\mathbf{Z} = \frac{\mathbf{X} - \boldsymbol{\mu}}{\boldsymbol{\sigma}}$

2. Compute the **covariance matrix**: $\mathbf{C} = \frac{1}{n-1} \mathbf{Z}^T \mathbf{Z}$

3. Find the **eigenvectors** and **eigenvalues** of $\mathbf{C}$:
   $$\mathbf{C} \mathbf{v}_i = \lambda_i \mathbf{v}_i$$
4. **Project** the data onto the principal components: $\mathbf{Z}_{PC} = \mathbf{Z} \mathbf{V}$

where the columns of $\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots]$ are the eigenvectors, sorted by decreasing eigenvalue $\lambda_i$.
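
The four steps above can be sketched directly in NumPy. This is a minimal illustration on made-up toy data (the array `X` and its mixing matrix are assumptions, not the assignment dataset), not the way scikit-learn implements PCA internally.

```python
import numpy as np

# Hypothetical toy data: 200 samples, 3 correlated features
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

# Step 1: standardize (ddof=1 so that C below has ones on its diagonal)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 2: covariance matrix of the standardized data
n = Z.shape[0]
C = Z.T @ Z / (n - 1)

# Step 3: eigendecomposition; eigh returns eigenvalues in ascending order,
# so reorder them from largest to smallest
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], eigvecs[:, order]

# Step 4: project the data onto the principal components
Z_pc = Z @ V

explained_variance_ratio = eigvals / eigvals.sum()
print("Explained variance ratio:", explained_variance_ratio)
```

Note that scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`). That rescales every eigenvalue by the same factor, so the explained variance ratios are unaffected.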

**Key ideas:**

- **PC1** points in the direction of maximum variance (largest $\lambda$)
- **PC2** is perpendicular to PC1 and captures the next most variance
- The **explained variance ratio** for each PC is: $\frac{\lambda_i}{\sum_j \lambda_j}$
- The eigenvectors $\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots]$ form an orthonormal basis of the principal component space. Any standardized data vector can be reconstructed exactly as a linear combination of all of these basis vectors; keeping only the first few gives an approximation.
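
The basis-vector idea in the last bullet can be checked numerically. The sketch below uses random toy data and a separate, hypothetical `pca_demo` object so it does not interfere with the variables used in the rest of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data, centered so it can be compared with PCA directly
rng = np.random.default_rng(1)
Z_demo = rng.normal(size=(100, 4))
Z_demo = Z_demo - Z_demo.mean(axis=0)

pca_demo = PCA().fit(Z_demo)
V_demo = pca_demo.components_.T  # columns are the principal directions

# The principal directions are orthonormal: V^T V = I
orthonormal = np.allclose(V_demo.T @ V_demo, np.eye(4))

# Coordinates of every sample in the PC basis
scores = Z_demo @ V_demo

# All 4 PCs reconstruct the data; only 2 PCs give a lossy approximation
Z_full = scores @ V_demo.T
Z_two = scores[:, :2] @ V_demo[:, :2].T

err_full = np.linalg.norm(Z_demo - Z_full)
err_two = np.linalg.norm(Z_demo - Z_two)
print(f"orthonormal: {orthonormal}")
print(f"reconstruction error with 4 PCs: {err_full:.2e}")
print(f"reconstruction error with 2 PCs: {err_two:.2e}")
```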

If the data lies in (or close to) a lower-dimensional subspace, PCA can reveal it by finding the directions that matter most.

Let's standardize the data first (so all features have equal scale), then apply PCA.


In [None]:
# Standardize and apply PCA
scaler = StandardScaler()
data_scaled = scaler.fit_transform(df)

pca = PCA()
data_pca = pca.fit_transform(data_scaled)

# Create a DataFrame with principal components
df_pca = pd.DataFrame(data_pca, columns=["PC1", "PC2", "PC3", "PC4"])
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Cumulative explained variance ratio:", np.cumsum(pca.explained_variance_ratio_))

# Get the PC directions (principal axes, often loosely called loadings) - each row is a PC, each column is a feature
components = pca.components_  # Shape: (4, 4) - 4 PCs x 4 features

# Create the scatter plot of scaled data (first 3 features)
fig = go.Figure()

# Add the data points
fig.add_trace(
    go.Scatter3d(
        x=data_scaled[:, 0],
        y=data_scaled[:, 1],
        z=data_scaled[:, 2],
        mode="markers",
        marker=dict(size=2, opacity=0.5),
        name="Data",
    )
)

# Add arrows for PC1, PC2, PC3 directions
# Arrow length proportional to explained variance ratio (with minimum for visibility)
colors = ["red", "green", "blue"]
base_scale = 5  # Base scale factor
min_scale = 0.5  # Minimum scale so small PCs are still visible

for i in range(3):
    pc_direction = components[i, :3]  # First 3 components of each PC
    # Scale by explained variance ratio - longer arrow = more variance
    # Use minimum scale so all arrows are visible
    variance_ratio = pca.explained_variance_ratio_[i] / max(
        pca.explained_variance_ratio_
    )
    scale = max(base_scale * variance_ratio, min_scale)

    # Arrow line
    fig.add_trace(
        go.Scatter3d(
            x=[0, pc_direction[0] * scale],
            y=[0, pc_direction[1] * scale],
            z=[0, pc_direction[2] * scale],
            mode="lines",
            line=dict(color=colors[i], width=8),
            name=f"PC{i + 1} ({pca.explained_variance_ratio_[i]:.1%} var)",
        )
    )

    # Arrow head (cone)
    fig.add_trace(
        go.Cone(
            x=[pc_direction[0] * scale],
            y=[pc_direction[1] * scale],
            z=[pc_direction[2] * scale],
            u=[pc_direction[0]],
            v=[pc_direction[1]],
            w=[pc_direction[2]],
            colorscale=[[0, colors[i]], [1, colors[i]]],
            showscale=False,
            sizemode="absolute",
            sizeref=0.3,
        )
    )

fig.update_layout(
    title="Principal Component Directions (arrow length ∝ variance explained)",
    scene=dict(
        xaxis_title="F1 (scaled)",
        yaxis_title="F2 (scaled)",
        zaxis_title="F3 (scaled)",
    ),
    width=800,
    height=600,
)
fig.show()

### Task 1.1 3D scatter plot

- Task: Create a 3D scatter plot using PC1, PC2, PC3 as axes
- Points: 5
- Expectations: A working 3D scatter plot of the PCA-transformed data (similar in style to the first 3D plot). No further analysis or comments.


In [None]:
# Your code here

### Task 1.2 2D scatter plot

- Task: Create a 2D scatter plot using PC1, PC2 as axes
- Points: 5
- Expectations: A working 2D scatter plot of the PCA-transformed data. No further analysis or comments.


In [None]:
# Your code here

### Task 1.3 Interpreting variance

- Task: How much variance do PC1 and PC2 capture together? Based on this, what can you conclude about the original 4D dataset?
- Points: 10
- Expectations: A written response (1 paragraph).


#### Answer


### Task 1.4 Variance and geometry

- Task: If PC1 explained 90% of the variance and PC2 only 10%, what shape would you expect the data to form? Now compare this to your actual ~50/50 split — what does this tell you about the geometry of your data?
- Points: 10
- Expectations: A written response (1 paragraph).


#### Answer


---

## Task 2: Breast Cancer Classification Pipeline

Now let's apply what you've learned to a real-world dataset: the **Wisconsin Breast Cancer** dataset. This dataset contains measurements from cell nuclei in breast tissue samples, and the goal is to classify tumors as **malignant** or **benign**.

**Sources:**

- [sklearn.datasets.load_breast_cancer](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html)
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic)

We'll work through a complete machine learning workflow:

1. Exploratory Data Analysis (EDA)
2. Train/test split
3. Baseline model
4. "Kitchen sink" model (all features, no preprocessing)
5. Build your own pipeline


In [None]:
# Load the breast cancer dataset
cancer = load_breast_cancer()
df_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df_cancer["target"] = cancer.target

print(f"Dataset shape: {df_cancer.shape}")
print(f"Target classes: {cancer.target_names}")

### Task 2.1 Exploratory Data Analysis (EDA)

- Task: Conduct an EDA of the breast cancer dataset. For each analysis you perform, explain _why_ you chose to look at it and what it tells you.
- Points: 30
- Expectations: A mix of code, plots, and written commentary. Quality of reasoning and plots matters more than quantity of plots.


In [None]:
# Your EDA here. Add as many code and markdown cells as needed.

### Task 2.2 Train/Test Split

- Task: Split the data into training and test sets (80/20) before any modeling.
- Points: 5
- Expectations: Complete the TODO line to create an 80/20 split with `random_state=42` for reproducibility.


In [None]:
# Separate features and target
X = df_cancer.drop("target", axis=1)
y = df_cancer["target"]

# TODO: Split into train/test sets
# X_train, X_test, y_train, y_test =

### Task 2.3 Baseline Model

Before building a real model, it's wise to establish a **baseline** — a classifier that any real model should beat.

A **confusion matrix** shows how predictions compare to actual labels:

|                     | Predicted Negative  | Predicted Positive  |
| ------------------- | ------------------- | ------------------- |
| **Actual Negative** | TN (True Negative)  | FP (False Positive) |
| **Actual Positive** | FN (False Negative) | TP (True Positive)  |

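For cancer diagnosis, "positive" means malignant: FN means missing a malignant tumor (bad!), FP means a false alarm (less bad, but still costly). Note that `load_breast_cancer` encodes malignant as 0 and benign as 1, so in the plots below malignant is the first row and column, not the second.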
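
As a small illustration of the table above, the sketch below counts the four cells for a made-up set of labels. The arrays `y_true_demo` and `y_pred_demo` are hypothetical and use the table's convention (1 = positive, 0 = negative), not the encoding of the breast cancer dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = positive, 0 = negative
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# With labels=[0, 1], rows are actual classes and columns are predictions,
# so ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)  # share of actual positives that were found

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Accuracy treats both kinds of error alike, while recall only looks at the actual positives, which is why the two can tell different stories on the same predictions.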

- Task: Run the code below, note the accuracy and examine the confusion matrix. Describe what this classifier does. Would you trust it for diagnosis? Why or why not?
- Points: 5
- Expectations: A written response (1 paragraph).


In [None]:
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

baseline_accuracy = dummy.score(X_test, y_test)
print(f"Baseline accuracy: {baseline_accuracy:.3f}")

ConfusionMatrixDisplay.from_estimator(
    dummy, X_test, y_test, display_labels=cancer.target_names, cmap="Blues"
)
plt.title("Confusion Matrix: Baseline Model")
plt.show()

#### Answer


### Task 2.4 Kitchen Sink Model

The "kitchen sink" approach: throw all features into the model without any preprocessing. Let's see what happens.

**Logistic Regression** is a linear classifier that predicts the probability of a binary outcome. It models:

$$P(y=1 | \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}$$

where $\sigma$ is the sigmoid function, $\mathbf{w}$ are the feature weights, and $b$ is the bias. The model is trained by minimizing the logistic loss using an iterative optimizer. Here we use **SAGA** (`solver="saga"`), a stochastic gradient method whose fast convergence is only guaranteed on features with approximately the same scale ([sklearn docs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)).
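
To make the formula concrete, here is a minimal sketch that evaluates it by hand. The weights `w_demo`, bias `b_demo`, and inputs `x_demo` are made-up values chosen for illustration, not parameters fitted to any dataset.

```python
import numpy as np


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights, bias, and three input vectors
w_demo = np.array([0.8, -0.5])
b_demo = 0.1
x_demo = np.array([[0.0, 0.0],
                   [2.0, 1.0],
                   [-3.0, 2.0]])

# P(y=1 | x) for each row, then a 0.5 threshold to get class labels
proba_demo = sigmoid(x_demo @ w_demo + b_demo)
pred_demo = (proba_demo >= 0.5).astype(int)

print("probabilities:", np.round(proba_demo, 3))
print("predictions:  ", pred_demo)
```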

- Task: Run the code below. Did the model converge? Why or why not? Explain based on your EDA findings and how gradient-based optimization works.
- Points: 10
- Expectations: A written response (1 paragraph).


In [None]:
lr_kitchen = LogisticRegression(solver="saga", max_iter=100, random_state=42)
lr_kitchen.fit(X_train, y_train)

kitchen_train_accuracy = lr_kitchen.score(X_train, y_train)
kitchen_accuracy = lr_kitchen.score(X_test, y_test)
print(f"Kitchen sink train accuracy: {kitchen_train_accuracy:.3f}")
print(f"Kitchen sink test accuracy:  {kitchen_accuracy:.3f}")

ConfusionMatrixDisplay.from_estimator(
    lr_kitchen, X_test, y_test, display_labels=cancer.target_names, cmap="Blues"
)
plt.title("Confusion Matrix: Kitchen Sink Model")
plt.show()

#### Answer


### Task 2.5 Build Your Own Pipeline

Now it's your turn. Based on your EDA findings, build a classification pipeline.

A **Pipeline** chains multiple preprocessing steps and a final estimator into a single object. This ensures:

- No data leakage (preprocessing is fit only on training data)
- Clean, reproducible code
- Easy experimentation with different configurations

Example pipeline structure:

```python
Pipeline([
    ("step1_name", SomeTransformer()),
    ("step2_name", AnotherTransformer()),
    ("classifier", SomeClassifier()),
])
```
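
The "no data leakage" point can be verified on a toy problem. The sketch below is an illustration only: it uses synthetic data from `make_classification` and hypothetical names (`demo_pipe`, `Xd_train`, and so on) so that it does not overwrite the assignment's variables, and it is not meant as a recommended solution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic toy data (assumption: not related to the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# fit() fits the scaler on the training rows only, then fits the classifier
demo_pipe.fit(Xd_train, yd_train)

# The scaler's learned mean equals the training mean, so no test rows leaked in
scaler_mean = demo_pipe.named_steps["scaler"].mean_
fitted_on_train_only = np.allclose(scaler_mean, Xd_train.mean(axis=0))

# score() applies the already-fitted scaler to the test rows before predicting
test_accuracy = demo_pipe.score(Xd_test, yd_test)
print(f"scaler fitted on training data only: {fitted_on_train_only}")
print(f"test accuracy: {test_accuracy:.3f}")
```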

- Task: Build a pipeline that preprocesses the data and fits a classifier. Evaluate your model, compare it to the kitchen sink model, and justify your preprocessing choices based on your EDA insights.
- Points: 20
- Notes:
  - You are free to use any preprocessing technique (e.g., StandardScaler, PCA, column selection via ColumnTransformer, or others)
  - There is no single "correct" answer — the goal is thoughtful justification
- Expectations:
  - A working pipeline with at least one preprocessing step
  - A confusion matrix plot for your model
  - A comparison with the kitchen sink model's confusion matrix
  - A reflection on your model's errors — consider which types of mistakes matter most in a medical diagnosis context (1 paragraph)
  - A brief explanation of why you chose your preprocessing steps (1 paragraph)
  - **NOTE:** The objective here is your understanding and evaluation of the model's performance. The performance itself (how accurately the model classifies the data) will not lower your grade. If your model performs poorly but you can explain why, you can still receive the full 20 points.


In [None]:
# TODO: Build your pipeline
# Consider: What preprocessing steps would help based on your EDA?
# Available transformers: StandardScaler, PCA, ColumnTransformer, etc.

# Example pipeline
pipe = Pipeline(
    [
        # Example: ColumnTransformer to select/transform specific columns
        # ("preprocessor", ColumnTransformer([
        #     ("selected_features", StandardScaler(), ["mean radius", "mean texture", ...]),
        # ])),
        # ("scaler", StandardScaler()),
        # ("pca", PCA(n_components=10)),
        (
            "classifier",
            LogisticRegression(solver="saga", max_iter=1000, random_state=42),
        ),
    ]
)

# TODO: Fit the pipeline on training data


# TODO: Evaluate and print accuracy


# TODO: Plot confusion matrix

# Add as many code and markdown cells as needed.