# MIS 382N: Advanced Machine Learning Assignment 4

**Total points**: 55 pts

**Due**: 11:59 PM CST, Tuesday, November 4th, 2025.

**Submission**:
1. Submit your **Jupyter Notebook via Canvas**, AND
2. **Save your Jupyter Notebook to a PDF, and submit the PDF via Gradescope**.

You may work in groups of two if you wish. Only one student per team needs to submit the assignment on Canvas and Gradescope. But be sure to include the name and UT EID for both students.

Homework groups will be created and managed through Canvas, so please do not arbitrarily change your homework group. If you do change groups, let the TAs know.

For questions involving mathematical derivations, you can write your answer on paper and then upload an image. Also, please make sure your code runs and that the graphics (and anything else) are displayed in your notebook before submitting (for example, by using `%matplotlib inline`).

**Name(s) and EID(s)**:

-------------------------

Bookmarks:

Q1. <a href=#Q1>Dimensionality Reduction - tSNE</a>

Q2. <a href=#Q2>Classification With a Loss Matrix</a>

Q3. <a href=#Q3>Visualizing the Confusion Matrices, ROC Curves, and PR Curves</a>

Q4. <a href=#Q4>Concepts about AU-ROC and AU-PRC</a>

-------------------------

**Q1. Dimensionality Reduction - tSNE** (15 pts) <a name='Q1' />

In this problem, you will apply t-distributed Stochastic Neighbor Embedding (t-SNE) to the Superconductivity Dataset and, in Part 3, compare it with Principal Component Analysis (PCA). More details about the dataset are available [here](https://archive.ics.uci.edu/ml/datasets/Superconductivty+Data#). The goal of this exercise is to characterize the structure of the input feature space.

Let's start with the following steps to prepare the dataset:

1. Load the dataset from the file `Q1_data.csv` into a pandas DataFrame named `df`.
2. The dataset contains a column called `critical_temp`, representing the critical temperature of each superconductor. Please select the column `critical_temp` as `y`.
**We will not use this column when fitting t-SNE. However, you may retain it later for visualization (e.g., coloring points by `y`).**
3. Define the feature matrix: select all remaining columns (except `critical_temp`) from `df` as `X`.
4. Standardize the features using [Standard Scaling](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler): Fit the scaler on `X`, and then transform `X` to obtain the standardized data. Apply scaling only to `X`, not to `y`.

Note: After this step, `X` should contain 81 features.


In [None]:
# Only use this code block if you are using Google Colab.
# If you are using Jupyter Notebook, please ignore this code block. You can upload the file directly to your Jupyter Notebook file system.
from google.colab import files

## It will prompt you to select a local file. Click on “Choose Files” then select and upload the file.
## Wait for the file to be 100% uploaded. You should see the name of the file once Colab has uploaded it.
uploaded = files.upload()

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SequentialFeatureSelector
import os, sys, re
import time
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler,StandardScaler
import matplotlib.pyplot as plt

df = pd.read_csv("Q1_data.csv")

In [None]:
y = df["critical_temp"]
X = df.drop(columns=["critical_temp"])

# Fit the scaler on X, then replace X with its standardized version
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)

**Part 1.** (5 points) Now apply t-SNE to the dataset. You are required to carry out the following tasks:

1.   Initialize a t-SNE model with number of dimensions = 3, perplexity = 300, number of iterations = 300 and random state = 42.
2.   Apply the t-SNE model to the data.
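
The two steps above follow the usual scikit-learn pattern of constructing the estimator and then calling `fit_transform`. The following is a minimal sketch on synthetic data, not the assignment solution: the array `X_demo`, the perplexity of 30, and the helper variable `iter_arg` are assumptions made for illustration. It also handles the fact that, depending on the installed scikit-learn version, the iteration-count argument is called `n_iter` or `max_iter`.

```python
import inspect

import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in data (hypothetical): 200 samples, 10 standardized features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))

# The iteration-count argument is `n_iter` in older scikit-learn releases and
# `max_iter` in newer ones, so pick whichever this installation accepts.
tsne_params = inspect.signature(TSNE.__init__).parameters
iter_arg = "max_iter" if "max_iter" in tsne_params else "n_iter"

# Perplexity must be smaller than the number of samples; 30 is illustrative only.
tsne_demo = TSNE(n_components=3, perplexity=30, random_state=0, **{iter_arg: 300})
X_demo_tsne = tsne_demo.fit_transform(X_demo)

print(X_demo_tsne.shape)  # one 3-dimensional embedding per input row
```

Note that t-SNE has no separate `transform` method for new data, so the embedding is always obtained with `fit_transform` on the data being visualized.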

**Answer**:

In [None]:
from sklearn.manifold import TSNE

### START CODE ###
## Initialize t-SNE with: n_components=3, perplexity=300, n_iter=300, random_state=42
## (recent scikit-learn releases name the iteration argument max_iter instead of n_iter)

### END CODE ###

### START CODE ###
## Apply t-SNE to the standardized data X

### END CODE ###

**Part 2**. (5 points) Sample 1000 random data points from the dataset, and show the pairwise 2D scatter plots of their three t-SNE components (1 vs. 2, 1 vs. 3, and 2 vs. 3).
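
One way to draw a random subset and color a scatter plot by a continuous target is sketched below. It is only an illustration under stated assumptions: `emb_demo` and `target_demo` are hypothetical random arrays standing in for an embedding and its target values, and the seed and plot styling are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Hypothetical stand-ins for an embedding and its target values.
emb_demo = rng.normal(size=(5000, 3))
target_demo = rng.uniform(0, 100, size=5000)

# Draw 1000 distinct row indices uniformly at random.
idx = rng.choice(emb_demo.shape[0], size=1000, replace=False)

fig, ax = plt.subplots(figsize=(6, 5))
points = ax.scatter(emb_demo[idx, 0], emb_demo[idx, 1],
                    c=target_demo[idx], s=8, cmap="viridis")
fig.colorbar(points, ax=ax, label="target value")
ax.set_xlabel("component 1")
ax.set_ylabel("component 2")
```

If the target is a pandas Series rather than a NumPy array, positional indexing with `.iloc[idx]` selects the same rows.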

**Answer**:

In [None]:
### START CODE ###
# Sample 1000 random data points from the dataset
# Plot first two components (2D scatter)
# Hint: Use X_tsne[:1000, 0] and X_tsne[:1000, 1] as x/y axes.
# Color by y

### END CODE ###

In [None]:
### START CODE ###
# Plot the first and the third components (2D scatter)

### END CODE ###

In [None]:
### START CODE ###
# Plot the second and the third components (2D scatter)

### END CODE ###

**Part 3**. (5 points) Now we will plot the PCA and t-SNE projections of the data and compare the plots side by side to see how the scatter patterns produced by the two methods differ. You may use the first 1000 data points for this.

**Answer**:

In [None]:
from sklearn.decomposition import PCA

### START CODE ###
## Obtain components from PCA

### END CODE ###

### START CODE ###
## Obtain components from t-SNE

### END CODE ###

# Plot side-by-side
plt.figure(figsize=(12, 5))  # Adjust the figure size as needed

plt.subplot(1, 2, 1)  # 1 row, 2 columns, select the first subplot
plt.title('PCA')
### START CODE ###
## Left plot: scatter plot of PCA results for the first 1000 data points

### END CODE ###

plt.subplot(1, 2, 2)  # 1 row, 2 columns, select the second subplot
plt.title('t-SNE')
### START CODE ###
## Right plot: scatter plot of t-SNE results for the first 1000 data points

### END CODE ###

plt.tight_layout()
plt.show()

**Q2. Classification With a Loss Matrix** (10 pts) <a name='Q2' />

Consider a binary classification problem with the following loss matrix:
$$
\begin{array}{ccccc}
 & & & \text{Predicted class} & \\
 & & C_1 & C_2 & \text{Reject} \\
\text{True class} & C_1 & 0 & r & c \\
 & C_2 & s & 0 & c
\end{array}
$$

where the cost of rejection, $c$, is a constant, and the misclassification costs $r$ and $s$ are positive real numbers. Let $f(x)=P(C_1 \mid x)$.
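
To make the setup concrete, the sketch below computes the expected loss of each action as the posterior-weighted average of the corresponding column of the loss matrix. The cost values `r_demo`, `s_demo`, and `c_demo`, the helper `expected_losses`, and the evaluation point $f(x)=0.5$ are hypothetical and chosen only for illustration; they are not the values used in the parts below.

```python
import numpy as np

# Rows: true class (C1, C2); columns: action (predict C1, predict C2, reject).
# The cost values are hypothetical and chosen only for illustration.
r_demo, s_demo, c_demo = 2.0, 1.0, 0.4
loss_matrix = np.array([[0.0, r_demo, c_demo],
                        [s_demo, 0.0, c_demo]])


def expected_losses(f, loss):
    """Return the expected loss of each action given f = P(C1 | x)."""
    posterior = np.array([f, 1.0 - f])
    return posterior @ loss


actions = ["predict C1", "predict C2", "reject"]
losses = expected_losses(0.5, loss_matrix)
best_action = actions[int(np.argmin(losses))]
print(dict(zip(actions, losses)), "->", best_action)
```

Evaluating `expected_losses` over a grid of values of `f` is a quick way to sanity-check an analytical answer.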


**Part 1.** (2 points) Show that the expected loss when $x$ is predicted as $C_1$ is a decreasing function of $f(x)$, while the expected loss when $x$ is predicted as $C_2$ is an increasing function of $f(x)$.

Hint: Express the expected losses with $f(x)$.

**Answer:**


**Part 2.** (2 points) If $c=0$, show that the decision that minimizes the expected loss is to reject all instances of $x$.

**Answer**:




**Part 3.** (3 points) Let $r=4$ and $s=3$. What is the minimum value of $c$ such that no instance of $x$ should be rejected under the optimal decision?

**Answer:**

**Part 4.** (3 points) Let $r=6$, $s=3$, and $c=1$. Determine the ranges of $f(x)$ for which the optimal decision is $C_1$, reject, and $C_2$, respectively.

**Answer:**


**Q3. Visualizing the Confusion Matrices, ROC Curves, and PR Curves** (20 pts) <a name='Q3' />

In this question, we will train a logistic regression classifier and a multi-layer perceptron classifier on a binary classification problem. Then, we will visualize their confusion matrices, ROC curves, and PR curves, and evaluate the AU-ROC and average precision (AP) of these models.

In [None]:
# Only use this code block if you are using Google Colab.
# If you are using Jupyter Notebook, please ignore this code block. You can upload the file directly to your Jupyter Notebook file system.
from google.colab import files

## It will prompt you to select a local file. Click on “Choose Files” then select and upload the file.
## Wait for the file to be 100% uploaded. You should see the name of the file once Colab has uploaded it.
uploaded = files.upload()

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, ParameterGrid
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, roc_auc_score, average_precision_score, ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

pd.set_option('future.no_silent_downcasting', True)

**Dataloading and Preprocessing**

In [None]:
df = pd.read_csv('customer_churn_telcom.csv', index_col = [0])

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# Print out unique values of categorical columns
def print_unique_col_values(df):
    for column in df:
        if df[column].dtypes=='object':
            print(f'{column}: {df[column].unique()}')

print_unique_col_values(df)

In [None]:
# Replace values of 'no internet service' and 'no phone service' with the value  'No'
df.replace('No internet service', 'No', inplace=True)
df.replace('No phone service', 'No', inplace=True)

In [None]:
# Change categorical columns with 2 categories to 0/1
yes_no_columns = ['Partner','Dependents','PhoneService','MultipleLines','OnlineSecurity','OnlineBackup',
                  'DeviceProtection','TechSupport','StreamingTV','StreamingMovies','PaperlessBilling','Churn']

for col in yes_no_columns:
    df[col] = df[col].replace({'Yes': 1, 'No': 0})

df['gender'] = df['gender'].replace({'Female': 1, 'Male': 0})

for col in yes_no_columns + ['gender']:
    print(f'{col}: {df[col].unique()}')

In [None]:
# One hot encoding for categorical columns with more than two categories
df = pd.get_dummies(data = df, columns = ['InternetService','Contract','PaymentMethod'])
print(df.columns)

In [None]:
# Train test split
X = df.drop('Churn', axis='columns')
y = df.Churn.astype(np.float32)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)

In [None]:
# Standardize numerical columns
cols_to_scale = ['tenure','MonthlyCharges','TotalCharges']

scaler = StandardScaler()
X_train[cols_to_scale] = scaler.fit_transform(X_train[cols_to_scale])
X_val[cols_to_scale] = scaler.transform(X_val[cols_to_scale])
X_test[cols_to_scale] = scaler.transform(X_test[cols_to_scale])

In [None]:
# The positive class (churn) has far fewer samples than the negative class, so the data is imbalanced
print('Churn occurrences in the training set \n')
print(y_train.value_counts())
print('\n')
print('Churn occurrences throughout the data \n')
print(y.value_counts())

**Part 1.** (5 points) In this part, train a **logistic regression model** for the churn prediction problem. Then, evaluate its AU-ROC and AP on the **train, validation, and test sets**. Finally, visualize its confusion matrices on the **validation and test sets** side by side. Use ```random_state=42``` when training your logistic regression model.

Helpful resources:
1. [sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
2. [sklearn.metrics.confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)
3. [sklearn.metrics.ConfusionMatrixDisplay](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html)
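
Both AU-ROC and AP are ranking metrics, so they are computed from continuous scores (for example, the predicted probability of the positive class) rather than from hard 0/1 predictions. The sketch below illustrates this on a synthetic dataset; the data from `make_classification`, the variable names, and the seed are assumptions for illustration and are unrelated to the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (hypothetical; not the churn dataset).
X_toy, y_toy = make_classification(n_samples=1000, n_features=8,
                                   weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25,
                                          stratify=y_toy, random_state=0)

clf_toy = LogisticRegression(max_iter=1000, random_state=0).fit(X_tr, y_tr)

# Ranking metrics need a continuous score: the predicted probability of class 1.
score_te = clf_toy.predict_proba(X_te)[:, 1]

auroc_toy = roc_auc_score(y_te, score_te)
ap_toy = average_precision_score(y_te, score_te)
print(f"AU-ROC: {auroc_toy:.3f}  AP: {ap_toy:.3f}")
```

Passing the output of `predict` instead of `predict_proba` would collapse the curve to a single operating point and generally understate both metrics.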

**Answer:**


In [None]:
# Initialize and fit logistic regression model
### START CODE ###

### END CODE ###


# Evaluate AU-ROC and AP on train/validation/test sets
### START CODE ###
train_y_pred_score_lr = ...
val_y_pred_score_lr = ...
test_y_pred_score_lr = ...

train_auroc_lr, val_auroc_lr, test_auroc_lr = ..., ..., ...
train_ap_lr, val_ap_lr, test_ap_lr = ..., ..., ...
### END CODE ###
print(f"Logistic Regression Train AU-ROC: {train_auroc_lr:.3f} Validation AU-ROC: {val_auroc_lr:.3f} Test AU-ROC: {test_auroc_lr:.3f}")
print(f"Logistic Regression Train AP: {train_ap_lr:.3f} Validation AP: {val_ap_lr:.3f} Test AP: {test_ap_lr:.3f}")

# Visualize confusion matrices on validation and test sets
### START CODE ###

### END CODE ###

**Part 2a.** (3 points) In this part, we will train a **multi-layer perceptron** (MLP) for the churn prediction problem. For simplicity and consistency, use sklearn's ```MLPClassifier``` with ```random_state=42```. Before training the final MLP, first do hyper-parameter selection by validation average precision score.

Hint: Try out different ```hidden_layer_sizes``` and ```learning_rate_init``` configurations for ```MLPClassifier```.

Helpful resources:
1. [sklearn.neural_network.MLPClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)
2. [sklearn.metrics.average_precision_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html)
3. [sklearn.model_selection.ParameterGrid](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ParameterGrid.html)
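
The general pattern of selecting hyper-parameters on a held-out validation set with `ParameterGrid` can be sketched as follows. To keep the example fast and generic, it uses a logistic regression on synthetic data with a hypothetical grid over `C` and `class_weight`; the same loop structure should carry over to other estimators and search spaces.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic toy data with a train/validation split (hypothetical).
X_toy, y_toy = make_classification(n_samples=600, n_features=8,
                                   weights=[0.8, 0.2], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.25,
                                          stratify=y_toy, random_state=0)

# Hypothetical search space; the values are illustrative only.
grid_toy = ParameterGrid({"C": [0.01, 1.0, 100.0],
                          "class_weight": [None, "balanced"]})

best_ap_toy, best_params_toy = -1.0, None
for params in grid_toy:
    # Fit on the training split only; score on the validation split.
    model = LogisticRegression(max_iter=1000, random_state=0, **params)
    model.fit(X_tr, y_tr)
    ap = average_precision_score(y_va, model.predict_proba(X_va)[:, 1])
    if ap > best_ap_toy:
        best_ap_toy, best_params_toy = ap, params

print(best_params_toy, round(best_ap_toy, 3))
```

The test set is deliberately not touched during the search, so that it remains an unbiased estimate of the selected model's performance.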

**Answer:**

In [None]:
# Define hyper-parameter search space, train models with different hyper-parameters, and record the best hyper-parameters and validation AP
### START CODE ###

best_val_ap_mlp = ...
best_hparams_mlp = ...


### END CODE ###
print(f"Best Hyper-parameters: {best_hparams_mlp} Best Val Average Precision: {best_val_ap_mlp:.3f}")


**Part 2b.** (2 points) Use your selected best hyper-parameters to train a final MLP. Then, evaluate its AU-ROC and AP on the **train, validation, and test sets**. Finally, visualize its confusion matrices on the **validation and test sets** side by side.

**Answer:**

In [None]:
# Initialize and fit MLP
### START CODE ###

### END CODE ###

# Evaluate AU-ROC and AP on train/validation/test sets
### START CODE ###
train_y_pred_score_mlp = ...
val_y_pred_score_mlp = ...
test_y_pred_score_mlp = ...

train_auroc_mlp, val_auroc_mlp, test_auroc_mlp = ..., ..., ...
train_ap_mlp, val_ap_mlp, test_ap_mlp = ..., ..., ...
### END CODE ###

print(f"MLP Train AU-ROC: {train_auroc_mlp:.3f} Validation AU-ROC: {val_auroc_mlp:.3f} Test AU-ROC: {test_auroc_mlp:.3f}")
print(f"MLP Train AP: {train_ap_mlp:.3f} Validation AP: {val_ap_mlp:.3f} Test AP: {test_ap_mlp:.3f}")

# Visualize confusion matrices on validation and test sets
### START CODE ###

### END CODE ###

**Part 3.** (3 points) Plot the ROC curves of the trained logistic regression and MLP models on the validation and test sets side by side. The left plot is for the validation set and the right plot is for the test set. Each plot should contain two ROC curves, one for the logistic regression model and one for the MLP. Please also display the AU-ROC (or AUC) to the **3rd decimal place** in each plot.

Helpful resources:
1. [sklearn.metrics.RocCurveDisplay](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.RocCurveDisplay.html)
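
As an alternative to the display class, a ROC curve can also be drawn directly from `sklearn.metrics.roc_curve`, which gives full control over the legend text (for example, the number of decimal places). The sketch below overlays two curves on one axis; the labels `y_demo`, the two score arrays, and the model names are hypothetical random data, not outputs of the models in this assignment.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)

# Hypothetical labels and scores for two models ("A" is less noisy than "B").
y_demo = rng.integers(0, 2, size=500)
scores_demo = {
    "Model A": y_demo + rng.normal(scale=0.8, size=500),
    "Model B": y_demo + rng.normal(scale=2.0, size=500),
}

fig, ax = plt.subplots(figsize=(6, 5))
for name, score in scores_demo.items():
    fpr, tpr, _ = roc_curve(y_demo, score)
    auc_value = roc_auc_score(y_demo, score)
    # Format the AUC to three decimal places in the legend entry.
    ax.plot(fpr, tpr, label=f"{name} (AUC = {auc_value:.3f})")

ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend(loc="lower right")
```

Passing an existing axis (such as one returned by `plt.subplots(1, 2)`) in place of `ax` places the curves in a chosen panel of a side-by-side figure.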

**Answer:**


In [None]:
# Plot ROC curves
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(12,5))

### START CODE ###

### END CODE ###

plt.show()

**Part 4.** (3 points) Plot the PR curves of the trained logistic regression and MLP models on the validation and test sets side by side. The left plot is for the validation set and the right plot is for the test set. Each plot should contain two PR curves, one for the logistic regression model and one for the MLP. Please also display the average precision (AP) to the **3rd decimal place** in each plot.

Helpful resources:
1. [sklearn.metrics.PrecisionRecallDisplay](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.PrecisionRecallDisplay.html)
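
A PR curve can likewise be built by hand from `sklearn.metrics.precision_recall_curve`. The sketch below draws one curve as a step plot and adds a horizontal reference line at the positive-class prevalence; the labels and scores are hypothetical random data with roughly 20% positives, and the names `y_demo` and `score_demo` are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.default_rng(1)

# Hypothetical imbalanced labels (about 20% positives) and noisy scores.
y_demo = (rng.uniform(size=1000) < 0.2).astype(int)
score_demo = y_demo + rng.normal(scale=1.0, size=1000)

precision, recall, _ = precision_recall_curve(y_demo, score_demo)
ap_value = average_precision_score(y_demo, score_demo)

fig, ax = plt.subplots(figsize=(6, 5))
# A step plot matches the piecewise-constant way AP sums over thresholds.
ax.plot(recall, precision, drawstyle="steps-post",
        label=f"Demo model (AP = {ap_value:.3f})")
ax.axhline(y_demo.mean(), linestyle="--", color="gray",
           label="Positive-class prevalence")
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
ax.legend(loc="upper right")
```

The prevalence line is a useful visual baseline because, unlike the ROC diagonal, the reference level of a PR plot depends on the class balance of the dataset being evaluated.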

**Answer:**

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

### START CODE ###

### END CODE ###
plt.tight_layout()
plt.show()

**Part 5.** (4 points) Briefly describe your takeaways from the plotted ROC and PR curves, and explain how the two models you trained compare.

**Answer:**

**Q4. Concepts about AU-ROC and AU-PRC** (10 pts) <a name='Q4' />

**Part 1.** (4 points) Consider a binary classification problem and a no-skill random classifier $f(x)$ that outputs a number randomly sampled from $U[0,1]$ as the predicted probability of $x$ being positive. In other words, the probability returned by $f(x)$ does not actually depend on $x$ (hence no-skill). Why is the Area Under the Precision-Recall Curve (AU-PRC) of $f(x)$ equal to the prior probability of the positive class, i.e., $P(y = 1)$?

HINT: Think about the definition of precision and recall, and how they relate to the decision threshold.

**Answer:**

**Part 2.** (3 points) Consider the previous binary classification problem and classifier. Why is the Area Under the Receiver Operating Characteristic Curve (AU-ROC) of $f(x)$ equal to $0.5$?

HINT: Think about the definition of true positive rate and false positive rate, and how they relate to the decision threshold.
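
The two claims in Parts 1 and 2 can be checked empirically before reasoning about them. The simulation below is a sanity check rather than an explanation: it assumes a hypothetical prior of $P(y=1)=0.1$, draws scores from $U[0,1]$ independently of the labels, and uses average precision as the estimate of AU-PRC.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
n = 100_000
prior = 0.1  # hypothetical P(y = 1)

y_sim = (rng.uniform(size=n) < prior).astype(int)
score_sim = rng.uniform(size=n)  # scores drawn from U[0, 1], independent of labels

ap_sim = average_precision_score(y_sim, score_sim)
auroc_sim = roc_auc_score(y_sim, score_sim)
print(f"prevalence={y_sim.mean():.3f}  AP={ap_sim:.3f}  AU-ROC={auroc_sim:.3f}")
```

Rerunning with a different value of `prior` should move the AP along with the prevalence while leaving the AU-ROC near 0.5.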

**Answer:**

**Part 3.** (3 points) Why is AP a better metric than AU-ROC for an imbalanced binary classification problem in which positive instances are extremely rare compared to negative instances?

**Answer:**