# Assignment 4 - Neural Networks and Deep Learning

## Deadline: Thursday, November 7 at 11:59 PM
## The assignment must be submitted in the form of a Jupyter notebook and uploaded to eClass.

## Marks:
- Classification task: 5 marks
- Regression task: 5 marks

**Total: 10 marks**


In this assignment, we will revisit datasets we are already familiar with. We will implement multilayer neural networks to perform a classification and a regression task.

**Notes:** Some required imports are provided; you will need additional imports. Please make sure no errors occur when you run the notebook sequentially (e.g., `Runtime` $\to$ `Run all`).

## Classification

In the first part of the assignment we will revisit the binary classification problem from Assignment 2. You will use the [Parkinson Disease Detection](https://www.kaggle.com/datasets/jainaru/parkinson-disease-detection/data) dataset from Kaggle to discriminate healthy people from those with Parkinson's Disease using features extracted from voice recordings.



### Marks:
- Preprocessing: Load the data, explore the dataset, and create a feature matrix and a target array. Create training and test sets and scale the data.
- Step 1. Convert the datasets and target vectors to PyTorch tensors. 0.5 marks.
- Step 2. Implement a neural network with at least one hidden layer and train it on the training set. Evaluate the performance of the model on the training set using at least accuracy, sensitivity (a.k.a. recall on class = 1) and specificity (a.k.a. recall for class 0). 1.5 marks.
- Step 3. Evaluate the performance of the model on the test set using at least accuracy, sensitivity (a.k.a. recall on class = 1) and specificity (a.k.a. recall for class 0). 1 mark.

You may repeat Steps 2 and 3 for different architectures and observe how the performance changes.
- Step 4. Does the network overfit the data? Discuss briefly (200 words max). 1 mark.
- Step 5. Compare the performance of the neural network with the best classifier from Assignment 2 and discuss briefly (200 words max). 1 mark.

**Total = 5 marks.**
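
The three metrics named in Steps 2 and 3 can all be read off a binary confusion matrix. The following is a minimal sketch in plain Python, assuming labels are encoded as 0 (healthy) and 1 (PD) as in the `status` column. The helper name `binary_metrics` and the toy labels are illustrative and not part of the assignment.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, sensitivity and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # PD predicted as PD
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy predicted as healthy
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy predicted as PD
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # PD predicted as healthy
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # recall on class 1
        "specificity": tn / (tn + fp),  # recall on class 0
    }

# Toy example: 4 PD recordings and 2 healthy ones (made-up labels)
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The sketch assumes both classes are present; with no positives or no negatives the corresponding ratio is undefined.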

### Context


Parkinson's Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests itself through a deterioration of movement, including the presence of tremors and stiffness. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone (reduced pitch range). Additionally, cognitive impairments and changes in mood can occur, and risk of dementia is increased.

Traditional diagnosis of Parkinson’s Disease involves a clinician taking a neurological history of the patient and observing motor skills in various situations. Since there is no definitive laboratory test to diagnose PD, diagnosis is often difficult, particularly in the early stages when motor effects are not yet severe. Monitoring progression of the disease over time requires repeated clinic visits by the patient. An effective screening process, particularly one that doesn’t require a clinic visit, would be beneficial. Since PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive tool for diagnosis. If machine learning algorithms could be applied to a voice recording dataset to accurately diagnose PD, this would be an effective screening step prior to an appointment with a clinician.



### Data


This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column which is set to 0 for healthy and 1 for PD.

### Imports

In [1]:
# Imports
import numpy as np
import pandas as pd

### Data dictionary




- name - ASCII subject name and recording number
- MDVP:Fo(Hz) - Average vocal fundamental frequency
- MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
- MDVP:Flo(Hz) - Minimum vocal fundamental frequency
- MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency
- MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude
- NHR,HNR - Two measures of ratio of noise to tonal components in the voice
- **status** - Health status of the subject: 1 = Parkinson's, 0 = healthy
- RPDE,D2 - Two nonlinear dynamical complexity measures
- DFA - Signal fractal scaling exponent
- spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation

### Preprocessing: Load the data, explore the dataset, and create a feature matrix and a target array. Create training and test sets and scale the data.

In [2]:
# Load the dataset
df = pd.read_csv('PD-1.csv')
df.head()

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


In [3]:
# Explore the dataset:
# Check for missing values and impute missing values if you find any
print("\nMissing values:")
print(df.isnull().sum())


Missing values:
name                0
MDVP:Fo(Hz)         0
MDVP:Fhi(Hz)        0
MDVP:Flo(Hz)        0
MDVP:Jitter(%)      0
MDVP:Jitter(Abs)    0
MDVP:RAP            0
MDVP:PPQ            0
Jitter:DDP          0
MDVP:Shimmer        0
MDVP:Shimmer(dB)    0
Shimmer:APQ3        0
Shimmer:APQ5        0
MDVP:APQ            0
Shimmer:DDA         0
NHR                 0
HNR                 0
status              0
RPDE                0
DFA                 0
spread1             0
spread2             0
D2                  0
PPE                 0
dtype: int64


In [4]:
# The 'name' is not necessary - exclude it from the dataframe
df_clean = df.drop('name', axis=1)
print("Shape after removing 'name':", df_clean.shape)
print(df_clean.head())

Shape after removing 'name': (195, 23)
   MDVP:Fo(Hz)  MDVP:Fhi(Hz)  MDVP:Flo(Hz)  MDVP:Jitter(%)  MDVP:Jitter(Abs)  \
0      119.992       157.302        74.997         0.00784           0.00007   
1      122.400       148.650       113.819         0.00968           0.00008   
2      116.682       131.111       111.555         0.01050           0.00009   
3      116.676       137.871       111.366         0.00997           0.00009   
4      116.014       141.781       110.655         0.01284           0.00011   

   MDVP:RAP  MDVP:PPQ  Jitter:DDP  MDVP:Shimmer  MDVP:Shimmer(dB)  ...  \
0   0.00370   0.00554     0.01109       0.04374             0.426  ...   
1   0.00465   0.00696     0.01394       0.06134             0.626  ...   
2   0.00544   0.00781     0.01633       0.05233             0.482  ...   
3   0.00502   0.00698     0.01505       0.05492             0.517  ...   
4   0.00655   0.00908     0.01966       0.06425             0.584  ...   

   Shimmer:DDA      NHR     HNR  st

In [5]:
# Create feature matrix and target array
X = df_clean.drop('status', axis=1)
y = df_clean['status']

In [6]:
# Create training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
# Standardize data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
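
Note that the scaler is fit on the training split only and then reused on the test split, so no information from the test data leaks into preprocessing. A small sketch of the same idea in plain Python, using made-up values for a single feature (`StandardScaler` uses the population standard deviation, which `pstdev` mirrors here):

```python
from statistics import fmean, pstdev

train = [100.0, 120.0, 140.0, 160.0]  # made-up training values for one feature
test = [110.0, 200.0]                 # made-up test values

# Statistics are learned from the training data only
mu, sigma = fmean(train), pstdev(train)

train_scaled = [(v - mu) / sigma for v in train]
test_scaled = [(v - mu) / sigma for v in test]  # reuse the training mu and sigma

print(train_scaled)
print(test_scaled)
```

The scaled training values have mean 0 and standard deviation 1 by construction; the scaled test values generally do not, which is expected.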

### Step 1. Convert the datasets and target vectors to PyTorch tensors. 0.5 marks.

In [8]:
# Convert to Pytorch tensors
# Make sure to reshape the target vectors so their shape is (n_samples, 1) and to cast every tensor to float
import torch

X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).reshape(-1, 1)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).reshape(-1, 1)

### Step 2. Implement a neural network with at least one hidden layer and train it on the training set. Evaluate the performance of the model on the training set using at least accuracy, sensitivity (a.k.a. recall on class = 1) and specificity (a.k.a. recall for class 0). 1.5 marks.

In [9]:
import torch
import torch.nn as nn
import torch.optim as optim

# ======================
# Neural Network (Classification Version)
# ======================
class ClassificationNetwork(nn.Module):
    def __init__(self, input_dim):
        super(ClassificationNetwork, self).__init__()

        self.fc1 = nn.Linear(input_dim, 128)
        self.bn1 = nn.BatchNorm1d(128)

        self.fc2 = nn.Linear(128, 64)
        self.bn2 = nn.BatchNorm1d(64)

        self.fc3 = nn.Linear(64, 32)
        self.bn3 = nn.BatchNorm1d(32)

        self.fc4 = nn.Linear(32, 1)

        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)

    def forward(self, x):
        x = self.dropout(self.relu(self.bn1(self.fc1(x))))
        x = self.dropout(self.relu(self.bn2(self.fc2(x))))
        x = self.dropout(self.relu(self.bn3(self.fc3(x))))
        x = self.fc4(x)   # no sigmoid here; the loss function applies it internally
        return x


# ======================
# Build model, loss, optimizer
# ======================
input_dim = X_train_tensor.shape[1]
model = ClassificationNetwork(input_dim=input_dim)

criterion = nn.BCEWithLogitsLoss()     # loss suited to binary classification; expects raw logits
optimizer = optim.Adam(model.parameters(), lr=0.001)

# ======================
# Training
# ======================
num_epochs = 150
train_losses = []

for epoch in range(num_epochs):
    model.train()
    outputs = model(X_train_tensor)  # no squeeze needed
    loss = criterion(outputs, y_train_tensor)  # shapes match: [N, 1]

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    train_losses.append(loss.item())

    if (epoch + 1) % 15 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')


# ======================
# Evaluate on test set
# ======================
model.eval()
with torch.no_grad():
    test_outputs = model(X_test_tensor).squeeze()
    y_pred_prob = torch.sigmoid(test_outputs)      # convert logits to probabilities
    y_pred = (y_pred_prob > 0.5).float()           # binarize at 0.5

Epoch [15/150], Loss: 0.5769
Epoch [30/150], Loss: 0.4864
Epoch [45/150], Loss: 0.4127
Epoch [60/150], Loss: 0.3492
Epoch [75/150], Loss: 0.2949
Epoch [90/150], Loss: 0.2680
Epoch [105/150], Loss: 0.2168
Epoch [120/150], Loss: 0.2063
Epoch [135/150], Loss: 0.1661
Epoch [150/150], Loss: 0.1507
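
Because the network returns raw logits and `nn.BCEWithLogitsLoss` applies the sigmoid internally, predictions need to pass through a sigmoid before a 0.5 probability threshold is applied (equivalently, the logit can be compared with 0). The sketch below checks both facts in plain Python. The function names are illustrative, and the stable formula is the standard log-sum-exp form, not a copy of PyTorch's source.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_on_probability(z, y):
    """Binary cross-entropy computed after an explicit sigmoid."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_with_logits(z, y):
    """Same loss computed directly from the logit in a numerically stable form."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

logits = [-3.0, -0.2, 0.0, 0.3, 0.5, 2.5]

# Both formulations agree for either label
for z in logits:
    for y in (0.0, 1.0):
        assert abs(bce_on_probability(z, y) - bce_with_logits(z, y)) < 1e-12

# Thresholding the probability at 0.5 is the same as thresholding the logit at 0
by_probability = [sigmoid(z) > 0.5 for z in logits]
by_logit_zero = [z > 0.0 for z in logits]

# Thresholding the raw logit at 0.5 is stricter: it requires probability > ~0.62
by_logit_half = [z > 0.5 for z in logits]
```

Comparing a raw logit with 0.5 therefore labels fewer samples as positive than the intended 0.5 probability cut-off.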


In [10]:
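# Evaluate the performance on the training set
from sklearn.metrics import confusion_matrix

model.eval()  # disable dropout and use BatchNorm running statistics
with torch.no_grad():
  train_logits = model(X_train_tensor)
  # The model outputs logits, so convert to probabilities before thresholding at 0.5
  y_pred_train = (torch.sigmoid(train_logits) > 0.5).float()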

cm = confusion_matrix(y_train_tensor, y_pred_train)
tn, fp, fn, tp = cm.ravel()

accumulated_accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print("=== Training Set Performance ===")
print(f'Accuracy: {accumulated_accuracy:.4f}')
print(f'Sensitivity: {sensitivity:.4f}')
print(f'Specificity: {specificity:.4f}')

=== Training Set Performance ===
Accuracy: 0.9872
Sensitivity: 0.9826
Specificity: 1.0000


### Step 3. Evaluate the performance of the model on the test set using at least accuracy, sensitivity (a.k.a. recall on class = 1) and specificity (a.k.a. recall for class 0). 1 mark.

In [11]:
# Performance on the test set
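from sklearn.metrics import confusion_matrix

model.eval()  # disable dropout and use BatchNorm running statistics
with torch.no_grad():
  test_logits = model(X_test_tensor)
  # The model outputs logits, so convert to probabilities before thresholding at 0.5
  y_pred_test = (torch.sigmoid(test_logits) > 0.5).float()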

cm = confusion_matrix(y_test_tensor, y_pred_test)
tn, fp, fn, tp = cm.ravel()

accumulated_accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print("=== Test Set Performance ===")
print(f'Accuracy: {accumulated_accuracy:.4f}')
print(f'Sensitivity: {sensitivity:.4f}')
print(f'Specificity: {specificity:.4f}')

=== Test Set Performance ===
Accuracy: 0.9231
Sensitivity: 0.9688
Specificity: 0.7143


### Step 4. Does the network overfit the data? Discuss briefly (200 words max). 1 mark.

The network shows moderate overfitting. Training performance was nearly perfect (accuracy = 0.9872, sensitivity = 0.9826, specificity = 1.0000), but test accuracy dropped to 0.9231 and specificity to 0.7143, while test sensitivity stayed high at 0.9688. This gap indicates that the model learned patterns specific to the training data and generalizes less well to unseen samples, especially healthy (negative) cases. Possible reasons include the model’s relatively high capacity (three hidden layers with 128, 64, and 32 neurons), the small dataset (195 recordings), and class imbalance, which may bias the network toward the positive class. Although dropout and batch normalization were used to reduce overfitting, they may not fully compensate for the network’s complexity. Overall, the model performs strongly, but its reduced test specificity suggests it memorized some training details rather than learning fully generalizable representations.
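
To put the test-set gap in context, the reported test metrics can be mapped back to whole-sample counts. The sketch below assumes the 39-recording test set implied by `test_size=0.2` on 195 rows and searches for class sizes that reproduce the printed accuracy, sensitivity and specificity to four decimals. The counts it finds are an inference from those rounded values, not something the notebook printed.

```python
n_test = 39  # assumed: 20% of 195 recordings
reported = {"accuracy": 0.9231, "sensitivity": 0.9688, "specificity": 0.7143}

matches = []
for n_pos in range(1, n_test):
    n_neg = n_test - n_pos
    for tp in range(n_pos + 1):
        for tn in range(n_neg + 1):
            acc = (tp + tn) / n_test
            sens = tp / n_pos
            spec = tn / n_neg
            if (round(acc, 4) == reported["accuracy"]
                    and round(sens, 4) == reported["sensitivity"]
                    and round(spec, 4) == reported["specificity"]):
                matches.append((n_pos, n_neg, tp, tn))

# Each tuple is (PD recordings, healthy recordings, true positives, true negatives)
print(matches)

# One misclassified healthy recording moves specificity by 1 / n_neg
n_pos, n_neg, tp, tn = matches[0]
specificity_step = 1 / n_neg
```

Under these assumptions the test split contains only a handful of healthy recordings, so each misclassified healthy recording shifts specificity by a large step and the specificity estimate is coarse.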

### Step 5. Compare the performance of the neural network with the best classifier from Assignment 2 and discuss briefly (200 words max). 1 mark.

Compared with the Random Forest classifier from Assignment 2, the neural network achieved slightly higher test performance. It reached an accuracy of 0.9231 and a sensitivity of 0.9688, versus 0.90 and 0.95 for the Random Forest, so it identified slightly more of the positive (PD) samples. Its specificity (0.7143) was slightly lower than the Random Forest’s 0.73, meaning it produced proportionally more false positives among healthy subjects. The improvement may stem from the network’s deeper architecture and the use of batch normalization and dropout, which can help it model nonlinear interactions between features; the Random Forest relies on an ensemble of decision trees and may be less flexible in capturing subtle interactions. Overall, both models performed well, and the neural network showed marginally better accuracy and sensitivity at the cost of slightly reduced specificity.

## Regression

For our regression task, we will revisit the dataset we used to predict the gestational age of preterm babies using brain structure volumes.

### Marks:
- Preprocessing: Load the data and create a feature matrix and a target array. Create training and test sets and scale the data.
- Step 1. Convert the datasets and target vectors to PyTorch tensors. 0.5 marks.
- Step 2. Implement multivariate linear regression using a multilayer neural network and train it on the training set. Evaluate the performance of the model on the training set. 1.5 marks.
- Step 3. Evaluate the performance of the model on the test set. 1 mark.

You may repeat Steps 2 and 3 for different architectures and observe how the performance changes.
- Step 4. Does the network overfit the data? Discuss briefly (200 words max). 1 mark.
- Step 5. Implement and optimize a Kernel Ridge Regression model. Evaluate its performance and compare it with the neural network you implemented in Step 2. Discuss briefly (200 words max). 1 mark.

**Total = 5 marks.**


### Imports

In [12]:
# Imports
import numpy as np
import pandas as pd

### Preprocessing: Load the data and create a feature matrix and a target array. Create training and test sets and scale the data.

In [13]:
# Load the dataset
df = pd.read_csv('GA-brain-volumes-86-features-2.csv')
df.head()

Unnamed: 0,35.714,0.492,0.617,0.366,0.378,0.608,0.642,0.548,0.525,0.842,...,0.85,0.841,24.571,24.33,15.489,14.542,54.091,5.85,0.264,0.203
0,37.429,0.497,0.624,0.361,0.389,0.548,0.589,0.572,0.533,0.883,...,1.13,0.85,26.06,26.072,15.089,15.399,55.739,6.15,0.286,0.286
1,36.143,0.532,0.594,0.374,0.423,0.522,0.648,0.738,0.635,0.797,...,1.12,1.11,24.534,24.456,15.616,15.685,80.195,5.89,0.386,0.301
2,36.714,0.596,0.624,0.407,0.398,0.584,0.642,0.579,0.659,0.787,...,1.12,0.998,25.1,24.88,14.396,15.068,64.121,6.11,0.31,0.365
3,42.286,0.648,0.844,0.414,0.468,0.743,0.874,0.967,1.1,0.936,...,1.12,0.964,20.447,19.86,13.999,13.905,120.67,6.44,0.465,0.436
4,40.143,0.789,0.82,0.4,0.407,0.71,0.829,0.821,0.784,0.966,...,1.12,1.3,26.017,25.454,14.307,14.702,110.02,6.34,0.328,0.312


In [14]:
# Create feature matrix and target array
# GA is stored in the first column
y = df.iloc[:, 0]
X = df.iloc[:, 1:]

In [15]:
# Create training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [16]:
# Scale the data
from sklearn.preprocessing import StandardScaler

scaler_X = StandardScaler()
X_train_scaled = scaler_X.fit_transform(X_train)
X_test_scaled = scaler_X.transform(X_test)

scaler_y = StandardScaler()
y_train_scaled = scaler_y.fit_transform(y_train.values.reshape(-1, 1))
y_test_scaled = scaler_y.transform(y_test.values.reshape(-1, 1))

### Step 1. Convert the datasets and target vectors to PyTorch tensors. 0.5 marks.

In [17]:
# Convert to Pytorch tensors
# Make sure to reshape the target vectors so their shape is (n_samples, 1) and to cast every tensor to float
import torch

X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32)

y_train_tensor = torch.tensor(y_train_scaled, dtype=torch.float32).reshape(-1, 1)
y_test_tensor = torch.tensor(y_test_scaled, dtype=torch.float32).reshape(-1, 1)
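
The reshape to `(n_samples, 1)` matters because the network output has that shape, and an element-wise loss such as MSE can broadcast a `(n, 1)` prediction against a `(n,)` target into an `(n, n)` array instead of comparing matching rows. The sketch below illustrates the effect with NumPy as a stand-in, on the assumption that PyTorch follows the same broadcasting rules; the numbers are made up.

```python
import numpy as np

pred = np.array([[0.2], [0.9], [0.4]])   # model output, shape (3, 1)
target_flat = np.array([0.0, 1.0, 0.0])  # target left as shape (3,)
target_col = target_flat.reshape(-1, 1)  # target reshaped to (3, 1)

diff_bad = pred - target_flat  # broadcasts to shape (3, 3)
diff_ok = pred - target_col    # element-wise, shape (3, 1)

mse_bad = np.mean(diff_bad ** 2)  # averages over all 9 mismatched pairs
mse_ok = np.mean(diff_ok ** 2)    # averages over the 3 matching pairs

print(diff_bad.shape, diff_ok.shape)
print(mse_bad, mse_ok)
```

The mismatched version still returns a number, so the mistake is easy to miss unless shapes are checked explicitly.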

### Step 2. Implement multivariate linear regression using a multilayer neural network and train it on the training set. Evaluate the performance of the model on the training set. 1.5 marks.

In [18]:
# Implement a neural network with at least one hidden layer and train it on the training set.
# Hint: use the mean squared error loss from Pytorch
# Hint: start with a low learning rate for the optimizer (e.g. 0.001)
import torch.nn as nn
import torch.optim as optim

# Network Architecture
class RegressionNetwork(nn.Module):
  def __init__(self, input_dim):
    super(RegressionNetwork, self).__init__()
    self.fc1 = nn.Linear(input_dim, 64)
    self.relu = nn.ReLU()
    self.fc2 = nn.Linear(64, 32)
    self.fc3 = nn.Linear(32, 1)

  def forward(self, x):
    x = self.fc1(x)
    x = self.relu(x)
    x = self.fc2(x)
    x = self.relu(x)
    x = self.fc3(x)
    return x

#Build model, loss function, optimizer
input_dim = X_train_tensor.shape[1]
model = RegressionNetwork(input_dim=input_dim)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the network on the training set
num_epochs = 100
train_losses = []

for epoch in range(num_epochs):
  model.train()

  # Forward pass
  outputs = model(X_train_tensor)
  loss = criterion(outputs, y_train_tensor)  # Compute loss

  # Backward pass and optimization
  optimizer.zero_grad()  # Clear gradients from the previous step
  loss.backward()        # Compute gradients
  optimizer.step()       # Update weights

  train_losses.append(loss.item())

  if (epoch+1) % 10 == 0:
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

print(f'\nFinal Training Loss: {train_losses[-1]:.4f}')


Epoch [10/100], Loss: 0.1587
Epoch [20/100], Loss: 0.1404
Epoch [30/100], Loss: 0.1064
Epoch [40/100], Loss: 0.0829
Epoch [50/100], Loss: 0.0661
Epoch [60/100], Loss: 0.0546
Epoch [70/100], Loss: 0.0456
Epoch [80/100], Loss: 0.0387
Epoch [90/100], Loss: 0.0331
Epoch [100/100], Loss: 0.0284

Final Training Loss: 0.0284


In [19]:
# Calculate RMSE and r2 scores on the training set
# Hint: use .detach().numpy() to transform y_pred and y from torch tensors to numpy arrays.
# This explicitly removes the "computational graph" layer that is intrinsic to torch tensors (used for autograd).
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

model.eval() #Evaluation mode
with torch.no_grad(): #No gradient
  y_pred_train = model(X_train_tensor)

y_pred_train_np = y_pred_train.detach().numpy()
y_train_np = y_train_tensor.detach().numpy()

y_pred_train_original = scaler_y.inverse_transform(y_pred_train_np)
y_train_original = scaler_y.inverse_transform(y_train_np)

rmse_train = np.sqrt(mean_squared_error(y_train_original, y_pred_train_original))
r2_train = r2_score(y_train_original, y_pred_train_original)

print("=== Training Set Performance ===")
print(f'RMSE: {rmse_train:.4f}')
print(f'R2: {r2_train:.4f}')

=== Training Set Performance ===
RMSE: 0.6661
R2: 0.9720
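
Because the targets were standardized with `StandardScaler` (which divides by the population standard deviation of the training targets), the MSE loss on the scaled training set and the metrics on the original scale are tied together: training R² equals 1 minus the scaled MSE, and the RMSE in weeks equals the scaled RMSE times that standard deviation. The following is a small self-contained check on made-up numbers; the values in `y` and `y_hat` are illustrative and not taken from the dataset.

```python
from statistics import fmean, pstdev

# Made-up gestational ages (weeks) and predictions on the original scale
y = [28.0, 31.5, 33.0, 36.0, 40.0, 41.5]
y_hat = [28.6, 31.0, 33.9, 35.2, 39.5, 41.9]

mu, sigma = fmean(y), pstdev(y)            # the statistics StandardScaler stores
y_s = [(v - mu) / sigma for v in y]        # scaled targets
y_hat_s = [(v - mu) / sigma for v in y_hat]

mse_scaled = fmean((a - b) ** 2 for a, b in zip(y_s, y_hat_s))
mse_orig = fmean((a - b) ** 2 for a, b in zip(y, y_hat))

rmse_orig = mse_orig ** 0.5
r2 = 1 - mse_orig / fmean((v - mu) ** 2 for v in y)

# R^2 on the training set equals 1 - scaled MSE
assert abs(r2 - (1 - mse_scaled)) < 1e-12
# RMSE in original units equals sigma times the scaled RMSE
assert abs(rmse_orig - sigma * mse_scaled ** 0.5) < 1e-12
```

Applied to the run above, the logged final training loss of 0.0284 gives 1 - 0.0284 ≈ 0.972, close to the reported training R² of 0.9720; the small difference is expected because the logged loss is computed before the last weight update. The identity holds only on the training set, where the scaler was fit.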


### Step 3. Evaluate the performance of the model on the test set. 1 mark.

In [20]:
# Calculate RMSE and r2 scores on the test set
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

model.eval()  # Evaluation mode
with torch.no_grad():  # No gradient
    y_pred_test = model(X_test_tensor)

# Convert to numpy
y_pred_test_np = y_pred_test.detach().numpy()
y_test_np = y_test_tensor.detach().numpy()

# Transform back to original scale
y_pred_test_original = scaler_y.inverse_transform(y_pred_test_np)
y_test_original = scaler_y.inverse_transform(y_test_np)

# Calculate metrics on original scale
rmse_test = np.sqrt(mean_squared_error(y_test_original, y_pred_test_original))
r2_test = r2_score(y_test_original, y_pred_test_original)

print("=== Test Set Performance ===")
print(f'RMSE: {rmse_test:.4f}')
print(f'R² Score: {r2_test:.4f}')

=== Test Set Performance ===
RMSE: 1.2678
R² Score: 0.8971


### Step 4. Does the network overfit the data? Discuss briefly (200 words max). 1 mark.

The network shows mild overfitting but still generalizes well. The training set achieved an R² of 0.9720 (RMSE: 0.6661 weeks), while the test set achieved an R² of 0.8971 (RMSE: 1.2678 weeks), a gap of about 7.5 percentage points in R².

A test R² close to 0.90 indicates that the model learned generalizable patterns from brain volume features to predict gestational age. The RMSE increase from 0.67 to 1.27 weeks (approximately 4 days) reflects a noticeable but not severe generalization error.

The network architecture (input→64→32→1) with ReLU activations uses no explicit regularization such as dropout or weight decay, which may explain part of the gap. Standardizing both features and targets likely contributed to stable training.

### Step 5. Implement and optimize a Kernel Ridge Regression model. Evaluate its performance and compare it with the neural network you implemented in Step 2. Discuss briefly (200 words max). 1 mark.

In [21]:
# Kernel Ridge Regression

# Imports
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Create model
krr = KernelRidge()

# Define parameter grid
param_grid = {
    'alpha': [0.01, 0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': [0.001, 0.01, 0.1, 1, None]
}

# Perform grid search on the training set
print("Performing Grid Search...")
grid_search = GridSearchCV(
    krr,
    param_grid,
    cv=5,                    # 5-fold cross-validation
    scoring='r2'
)

# Fit on scaled training data (use the scaled versions, not tensors)
grid_search.fit(X_train_scaled, y_train_scaled)

# Remember optimised model
best_krr = grid_search.best_estimator_
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV R² score: {grid_search.best_score_:.4f}")

# Calculate r2 and RMSE on the training and test set
#Training and Test Sets Predictions
y_pred_train_krr = best_krr.predict(X_train_scaled)
y_pred_test_krr = best_krr.predict(X_test_scaled)

#Transform back to original scale
y_pred_train_krr_original = scaler_y.inverse_transform(y_pred_train_krr.reshape(-1, 1))
y_pred_test_krr_original = scaler_y.inverse_transform(y_pred_test_krr.reshape(-1, 1))

# Get original scale targets
y_train_original = scaler_y.inverse_transform(y_train_scaled.reshape(-1, 1))
y_test_original = scaler_y.inverse_transform(y_test_scaled.reshape(-1, 1))

# Calculate metrics
rmse_train_krr = np.sqrt(mean_squared_error(y_train_original, y_pred_train_krr_original))
r2_train_krr = r2_score(y_train_original, y_pred_train_krr_original)

rmse_test_krr = np.sqrt(mean_squared_error(y_test_original, y_pred_test_krr_original))
r2_test_krr = r2_score(y_test_original, y_pred_test_krr_original)

print("\n" + "="*50)
print("=== Kernel Ridge Regression Results ===")
print("="*50)
print("\nTraining Set:")
print(f"  RMSE: {rmse_train_krr:.4f}")
print(f"  R² Score: {r2_train_krr:.4f}")
print("\nTest Set:")
print(f"  RMSE: {rmse_test_krr:.4f}")
print(f"  R² Score: {r2_test_krr:.4f}")

print("\n" + "="*50)
print("=== Comparison: Neural Network vs KRR ===")
print("="*50)
print("\n                 Neural Network  |  Kernel Ridge")
print(f"Train RMSE:      {rmse_train:.4f}          |  {rmse_train_krr:.4f}")
print(f"Test RMSE:       {rmse_test:.4f}          |  {rmse_test_krr:.4f}")
print(f"Train R²:        {r2_train:.4f}          |  {r2_train_krr:.4f}")
print(f"Test R²:         {r2_test:.4f}          |  {r2_test_krr:.4f}")


Performing Grid Search...

Best parameters: {'alpha': 0.1, 'gamma': 0.001, 'kernel': 'poly'}
Best CV R² score: 0.9501

=== Kernel Ridge Regression Results ===

Training Set:
  RMSE: 0.6139
  R² Score: 0.9762

Test Set:
  RMSE: 1.0612
  R² Score: 0.9279

=== Comparison: Neural Network vs KRR ===

                 Neural Network  |  Kernel Ridge
Train RMSE:      0.6661          |  0.6139
Test RMSE:       1.2678          |  1.0612
Train R²:        0.9720          |  0.9762
Test R²:         0.8971          |  0.9279


Both models perform well, with KRR ahead on every reported metric. Grid search identified the optimal KRR configuration as a polynomial kernel with alpha=0.1 and gamma=0.001, achieving a cross-validation R² of 0.9501.

KRR achieves marginally better training performance (R²: 0.9762, RMSE: 0.6139 weeks) than the neural network (R²: 0.9720, RMSE: 0.6661 weeks). It also generalizes better on the test set, with an R² of 0.9279 and RMSE of 1.0612 weeks compared with the neural network's test R² of 0.8971 and RMSE of 1.2678 weeks.

The smaller train-test gap for KRR (R² gap: 4.8 vs 7.5 percentage points for the neural network; RMSE difference: 0.45 vs 0.60 weeks) indicates better resistance to overfitting. KRR captures non-linear relationships through its kernel without the complexity of a deep learning architecture, and its regularization strength was tuned by cross-validation, which makes it well suited to this moderately sized structured dataset (86 features).
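
For reference, the model that `KernelRidge` fits has a closed form: with kernel matrix K on the training data, the dual coefficients a solve (K + alpha·I) a = y, and predictions are kernel evaluations against the training points multiplied by a. The following is a minimal NumPy sketch with a polynomial kernel, assuming scikit-learn's default `degree=3` and `coef0=1` for the `poly` kernel. The helper names, the toy data, and the `gamma` value used here are illustrative.

```python
import numpy as np

def poly_kernel(A, B, gamma, degree=3, coef0=1.0):
    """Polynomial kernel (gamma * <a, b> + coef0) ** degree for all row pairs."""
    return (gamma * A @ B.T + coef0) ** degree

def krr_fit(X, y, alpha, gamma):
    """Solve (K + alpha * I) a = y for the dual coefficients a."""
    K = poly_kernel(X, X, gamma)
    return np.linalg.solve(K + alpha * np.eye(len(X)), y)

def krr_predict(X_train, dual_coef, X_new, gamma):
    """Predict by evaluating the kernel between new and training points."""
    return poly_kernel(X_new, X_train, gamma) @ dual_coef

# Toy standardized features and a mildly non-linear target (made-up data)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=40)

alpha, gamma = 0.1, 0.1
a = krr_fit(X, y, alpha=alpha, gamma=gamma)
y_fit = krr_predict(X, a, X, gamma=gamma)
```

A larger `alpha` shrinks the dual coefficients and pulls the fitted values away from the training targets, which is the regularization that the grid search tunes.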