# Task 2
## 1. Env set up
1. ... (already done before)
2. pip install scikit-learn torch (scikit-learn provides the sklearn module and pulls in scipy as a dependency)
---
## 2. Learning/Working route
1. figure out the knowledge the task involves: in this case MLP, sklearn, PyTorch, pandas, numpy, data preprocessing, feature selection, and model evaluation
2. learn the new concepts, such as MLP, feature selection, and PCA
3. learn from example code how to train models with different configurations using these packages
4. analyze and explain the results, then ask AI for further analysis (don't trust all of it)
5. try to improve model performance through feature selection, scaling, etc.
6. noticed that none of the experimental models converged within 50 epochs, so tried more epochs, such as 200 and 300
---
## 3. Work Sequentially
#### a. Before start
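- numpy arrays are best suited to purely numerical data, so still use pandas to read the CSV files (they contain categorical columns)
- preprocess before any further steps:
    1. fill missing data (numeric columns with the mean, categorical columns with the mode)
    2. encode categorical features (using LabelEncoder) and the 'international' column (cast to int)
    3. prepare the target variable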
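
One caveat about step 1: if the fill values are computed separately on each file, the test set ends up filled with its own statistics. A common alternative is to learn the fill values on the training data only and reuse them for the test data. A minimal sketch of that idea, assuming hypothetical helper names (fit_fill_values, apply_fill_values) and toy columns that are not from the real dataset:

```python
import numpy as np
import pandas as pd

def fit_fill_values(train_df):
    """Learn one fill value per column from the training frame only (hypothetical helper)."""
    fill = {}
    for col in train_df.columns:
        if pd.api.types.is_numeric_dtype(train_df[col]):
            fill[col] = train_df[col].mean()          # numeric: training mean
        else:
            fill[col] = train_df[col].mode().iloc[0]  # categorical: training mode
    return fill

def apply_fill_values(df, fill):
    """Fill missing values with the values learned on the training frame."""
    return df.fillna(value=fill)

# toy frames; 'gpa' and 'major' are illustrative column names only
toy_train = pd.DataFrame({"gpa": [3.0, np.nan, 3.6], "major": ["STEM", None, "STEM"]})
toy_test = pd.DataFrame({"gpa": [np.nan], "major": [None]})

fill = fit_fill_values(toy_train)
toy_train_filled = apply_fill_values(toy_train, fill)
toy_test_filled = apply_fill_values(toy_test, fill)
print(toy_test_filled)
```

With only a handful of missing values the difference is usually small, but it keeps test-set information out of the preprocessing step.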

In [1]:
import numpy as np
import scipy as sp
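import sklearn as sk
# import the submodules explicitly so sk.preprocessing, sk.linear_model, etc. resolve on any scikit-learn version
import sklearn.preprocessing
import sklearn.linear_model
import sklearn.neural_network
import sklearn.metrics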
import pandas as pd

# hyperparameters
LR = 0.0001
EPOCH = 50
BATCH_SIZE = 64

# read data
train_data = pd.read_csv('../MBAAdmission/train.csv')
test_data = pd.read_csv('../MBAAdmission/test.csv')

# preprocess data
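def preprocess(data):
    """Fill missing values in place: mean for numeric columns, mode for categorical ones."""
    # numeric columns: fill with the column mean
    num_cols = data.select_dtypes(include=[np.number]).columns
    data[num_cols] = data[num_cols].fillna(data[num_cols].mean())
    # categorical columns: fill with the most frequent value, leaving the target untouched
    for col in data.select_dtypes(include=[object]).columns:
        if col != 'admission':
            data[col] = data[col].fillna(data[col].mode()[0])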

preprocess(train_data)
preprocess(test_data)

categorical_cols = train_data.select_dtypes(include=[object]).columns
numeric_cols = train_data.select_dtypes(include=[np.number]).columns

# prepare features: drop the ID column and the target
X_train = train_data.drop(columns=['application_id', 'admission'])
X_test = test_data.drop(columns=['application_id', 'admission'])

# simple label encoding for categorical variables
for col in categorical_cols:
    if col in X_train.columns:
        le = sk.preprocessing.LabelEncoder()
        X_train[col] = le.fit_transform(X_train[col].astype(str))
        # handle unseen labels in test set
        if col in X_test.columns:
            X_test[col] = X_test[col].astype(str)
            X_test[col] = X_test[col].map(lambda x: le.transform([x])[0] if x in le.classes_ else -1)
            
# encode 'international' column
X_train['international'] = X_train['international'].astype(int)
X_test['international'] = X_test['international'].astype(int)

# target variable
y_encoder = sk.preprocessing.LabelEncoder()
y_train = y_encoder.fit_transform(train_data['admission'])

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"Classes: {y_encoder.classes_}")


Training set shape: (6095, 8)
Test set shape: (99, 8)
Classes: ['Admit' 'Reject' 'Waitlist']




#### b. Subtask 1
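- call sk.linear_model.LinearRegression, then fit, predict, and assess (predictions are clipped and rounded to the nearest class index).
- call LogisticRegression; got a warning that 1000 iterations wouldn't lead to convergence; tried 3000, same warning; with 4000 it converged.
- output accuracy on both sets.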
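
Convergence warnings like the ones above are often a symptom of features on very different scales, and the warning text itself suggests scaling the data. Standardizing before fitting usually lets the default solver finish in far fewer iterations. A sketch on synthetic data (make_classification and the column scaling factors are stand-ins, not the admission data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic 3-class problem with 8 features
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
# stretch the columns to mimic features measured on very different scales
X_toy = X_toy * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])

# the scaler is fitted inside the pipeline, so it only ever sees the data passed to fit()
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=42))
pipe.fit(X_toy, y_toy)

scaled_iters = int(pipe[-1].n_iter_.max())  # iterations actually used by the solver
train_acc = pipe.score(X_toy, y_toy)
print(f"iterations used: {scaled_iters}, training accuracy: {train_acc:.4f}")
```

Whether the same holds for the admission data would need to be checked by re-running Subtask 1 on the scaled features.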


In [2]:
# Subtask 1: Linear and Logistic Regression
# a. Linear Regression
print("\n--- Linear Regression ---")
linear_reg = sk.linear_model.LinearRegression()
linear_reg.fit(X_train, y_train)

# make predictions
y_train_pred_linear = linear_reg.predict(X_train)
y_test_pred_linear = linear_reg.predict(X_test)

# convert to int for classification
y_train_pred_linear_class = np.round(np.clip(y_train_pred_linear, 0, len(y_encoder.classes_)-1)).astype(int)
y_test_pred_linear_class = np.round(np.clip(y_test_pred_linear, 0, len(y_encoder.classes_)-1)).astype(int)

# Calculate accuracy
train_acc_linear = sk.metrics.accuracy_score(y_train, y_train_pred_linear_class)
test_acc_linear = sk.metrics.accuracy_score(y_encoder.transform(test_data['admission']), y_test_pred_linear_class)

print(f"   Training Accuracy: {train_acc_linear:.4f}")
print(f"   Test Accuracy: {test_acc_linear:.4f}")
print(f"   Linear regression coefficients shape: {linear_reg.coef_.shape}")


# b. Logistic Regression
print("\n--- Logistic Regression ---")
logistic_reg = sk.linear_model.LogisticRegression(max_iter=4000, random_state=42)
logistic_reg.fit(X_train, y_train)

# make predictions
y_train_pred_logistic = logistic_reg.predict(X_train)
y_test_pred_logistic = logistic_reg.predict(X_test)

# Calculate accuracy
train_acc_logistic = sk.metrics.accuracy_score(y_train, y_train_pred_logistic)
test_acc_logistic = sk.metrics.accuracy_score(y_encoder.transform(test_data['admission']), y_test_pred_logistic)

print(f"   Training Accuracy: {train_acc_logistic:.4f}")
print(f"   Test Accuracy: {test_acc_logistic:.4f}")
print(f"   Logistic regression coefficients shape: {logistic_reg.coef_.shape}")

print("\nRegression Done.")


--- Linear Regression ---
   Training Accuracy: 0.8466
   Test Accuracy: 0.3333
   Linear regression coefficients shape: (8,)

--- Logistic Regression ---
   Training Accuracy: 0.8409
   Test Accuracy: 0.3838
   Logistic regression coefficients shape: (3, 8)

Regression Done.


#### c. Subtask 2
- learn the common process of training an MLP model with sklearn
- normalize the features (StandardScaler) for better training
- write a function that builds an MLP model with overridable default parameters
- build the model
- random_state and '42': a widely used meme seed LOL
- Glad to see 'adam' again btw.
- Noticed that epoch = 50 is too small and the model did not converge; tried 100, still not good; 200, better; 300, better still and yielded better results.
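
As a sanity check on the model size, the parameter count of a fully connected network can be worked out by hand: each layer contributes (inputs x outputs) weights plus one bias per output. A small sketch, assuming 8 input features and 3 classes as in this task (count_mlp_params is a hypothetical helper name):

```python
def count_mlp_params(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )

# 8 inputs -> 128 -> 256 -> 3 outputs
# (8*128 + 128) + (128*256 + 256) + (256*3 + 3) = 1152 + 33024 + 771
print(count_mlp_params([8, 128, 256, 3]))  # 34947
```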

In [3]:
# Subtask 2: sklearn MLP Classifier
print("\n--- MLP Classifier ---")

# normalize features
scaler = sk.preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Scaled Training set shape: {X_train_scaled.shape}")
print(f"Scaled Test set shape: {X_test_scaled.shape}")

# build and train MLP using provided parameters
input_dim = X_train_scaled.shape[1]
output_dim = len(y_encoder.classes_)

print(f"Network Architecture: [{input_dim}, 128] -> [128, 256] -> [256, {output_dim}]")

# define model for later uses
def make_default_mlp(activation='relu', lr=LR, epoch=EPOCH, batch=BATCH_SIZE):
    return sk.neural_network.MLPClassifier(
        hidden_layer_sizes=(128, 256),
        activation=activation,
        learning_rate_init=lr,
        max_iter=epoch,
        batch_size=batch,
        random_state=42,
        solver='adam',
        early_stopping=False,
        verbose=False
    )

# build model
mlp_sk = make_default_mlp()

# train model
print("Training MLP...")
mlp_sk.fit(X_train_scaled, y_train)
print("MLP Training Complete.")

# make predictions
y_train_pred_mlp = mlp_sk.predict(X_train_scaled)
y_test_pred_mlp = mlp_sk.predict(X_test_scaled)

# Calculate accuracy
train_acc_mlp = sk.metrics.accuracy_score(y_train, y_train_pred_mlp)
test_acc_mlp = sk.metrics.accuracy_score(y_encoder.transform(test_data['admission']), y_test_pred_mlp)
print(f"   Training Accuracy: {train_acc_mlp:.4f}")
print(f"   Test Accuracy: {test_acc_mlp:.4f}")
total_params = sum(coef.size for coef in mlp_sk.coefs_) + sum(bias.size for bias in mlp_sk.intercepts_)
print(f"   Total parameters: {total_params}")




# too few epochs, not converged
# try more epochs like 300
print("\nTrying with more epochs. epoch=300")

mlp_sk_1 = make_default_mlp(epoch=300)
mlp_sk_1.fit(X_train_scaled, y_train)
y_train_pred_mlp_1 = mlp_sk_1.predict(X_train_scaled)
y_test_pred_mlp_1 = mlp_sk_1.predict(X_test_scaled)
train_acc_mlp_1 = sk.metrics.accuracy_score(y_train, y_train_pred_mlp_1)
test_acc_mlp_1 = sk.metrics.accuracy_score(y_encoder.transform(test_data['admission']), y_test_pred_mlp_1)
print(f"   Training Accuracy: {train_acc_mlp_1:.4f}")
print(f"   Test Accuracy: {test_acc_mlp_1:.4f}")
total_params_1 = sum(coef.size for coef in mlp_sk_1.coefs_) + sum(bias.size for bias in mlp_sk_1.intercepts_)
print(f"   Total parameters: {total_params_1}")


print("MLP Done.")


--- MLP Classifier ---
Scaled Training set shape: (6095, 8)
Scaled Test set shape: (99, 8)
Network Architecture: [8, 128] -> [128, 256] -> [256, 3]
Training MLP...




MLP Training Complete.
   Training Accuracy: 0.8527
   Test Accuracy: 0.3535
   Total parameters: 34947

Trying with more epochs. epoch=300
   Training Accuracy: 0.8858
   Test Accuracy: 0.3939
   Total parameters: 34947
MLP Done.




#### d. Subtask 3
- install torch
- look up what 'super' is: a built-in function that returns a proxy for calling methods of the parent class (here, the constructor of nn.Module)
- why a model is commonly written as a class in Torch but not in sklearn: PyTorch is more customizable, while scikit-learn provides ready-made standard ML estimators.
- adopt the same optimizer, 'Adam', as above
- learn the training process in Torch
- learn some APIs
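
To make the 'super' note concrete, here is a plain-Python sketch with no torch dependency. Base stands in for a parent class such as nn.Module that sets up internal bookkeeping in its constructor; all class and attribute names here are made up for illustration:

```python
class Base:
    def __init__(self):
        # bookkeeping that subclasses rely on (a stand-in for what a parent class sets up)
        self.registered = []

    def register(self, name):
        self.registered.append(name)


class Child(Base):
    def __init__(self, name):
        super().__init__()   # run Base.__init__ first so self.registered exists
        self.register(name)


class BrokenChild(Base):
    def __init__(self, name):
        # Base.__init__ is never called, so self.registered does not exist yet
        self.register(name)


print(Child("fc1").registered)

try:
    BrokenChild("fc1")
    broken_error = None
except AttributeError as err:
    broken_error = err
    print("without super().__init__():", err)
```

In Python 3, super().__init__() is shorthand for the longer super(Child, self).__init__() form.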

In [4]:
# Subtask 3: Torch MLP Classifier
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# define MLP model
class MLP_PyTorch(nn.Module):
    def __init__(self, input_dim, num_classes):
        super(MLP_PyTorch, self).__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 256), 
            nn.ReLU(),
            nn.Linear(256, num_classes)
        )
    
    def forward(self, x):
        return self.network(x)

# prepare data for PyTorch
X_train_tensor = torch.FloatTensor(X_train_scaled)
X_test_tensor = torch.FloatTensor(X_test_scaled)
y_train_tensor = torch.LongTensor(y_train)

# create DataLoaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

# initialize model, loss function, optimizer
model = MLP_PyTorch(input_dim, output_dim)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LR)

# train model
print("\nTraining PyTorch MLP...")
model.train()
for epo in range(EPOCH):
    running_loss = 0.0
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * batch_x.size(0)
    
    epoch_loss = running_loss / len(train_loader.dataset)
    print(f"Epoch {epo+1}/{EPOCH}, Loss: {epoch_loss:.4f}")

print("PyTorch MLP Training Complete.")

# evaluate model
model.eval()
with torch.no_grad():
    # Training accuracy
    train_outputs = model(X_train_tensor)
    _, train_preds = torch.max(train_outputs, 1)
    train_acc_torch = (train_preds == y_train_tensor).float().mean().item()
    
    # Test accuracy
    test_outputs = model(X_test_tensor)
    _, test_preds = torch.max(test_outputs, 1)
    test_acc_torch = (test_preds == torch.LongTensor(y_encoder.transform(test_data['admission']))).float().mean().item()
    
print(f"   Training Accuracy: {train_acc_torch:.4f}")
print(f"   Test Accuracy: {test_acc_torch:.4f}")
total_params_torch = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"   Total parameters: {total_params_torch}")




# not converged yet, try more epochs
print("\nTrying with more epochs. epoch=300")
model_1 = MLP_PyTorch(input_dim, output_dim)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_1.parameters(), lr=LR)

# train model
print("\nTraining PyTorch MLP...")
model_1.train()
for epoch in range(300):
    running_loss = 0.0
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        outputs = model_1(batch_x)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * batch_x.size(0)
    
    epoch_loss = running_loss / len(train_loader.dataset)
    print(f"Epoch {epoch+1}/{300}, Loss: {epoch_loss:.4f}")
print("PyTorch MLP_1 Training Complete.")

# evaluate model
model_1.eval()
with torch.no_grad():
    # Training accuracy
    train_outputs_1 = model_1(X_train_tensor)
    _, train_preds_1 = torch.max(train_outputs_1, 1)
    train_acc_torch_1 = (train_preds_1 == y_train_tensor).float().mean().item()

    # Test accuracy
    test_outputs_1 = model_1(X_test_tensor)
    _, test_preds_1 = torch.max(test_outputs_1, 1)
    test_acc_torch_1 = (test_preds_1 == torch.LongTensor(y_encoder.transform(test_data['admission']))).float().mean().item()

print(f"   Training Accuracy: {train_acc_torch_1:.4f}")
print(f"   Test Accuracy: {test_acc_torch_1:.4f}")
total_params_torch_1 = sum(p.numel() for p in model_1.parameters() if p.requires_grad)
print(f"   Total parameters: {total_params_torch_1}")

print("PyTorch MLP Done.")


# compare sklearn and PyTorch
print("\n--- Comparison ---")
print(f"Sklearn MLP (50) - Training Accuracy: {train_acc_mlp:.4f}, Test Accuracy: {test_acc_mlp:.4f}, Total Params: {total_params}")
print(f"Sklearn MLP(300) - Training Accuracy: {train_acc_mlp_1:.4f}, Test Accuracy: {test_acc_mlp_1:.4f}, Total Params: {total_params_1}")
print(f"PyTorch MLP (50) - Training Accuracy: {train_acc_torch:.4f}, Test Accuracy: {test_acc_torch:.4f}, Total Params: {total_params_torch}")
print(f"PyTorch MLP(300) - Training Accuracy: {train_acc_torch_1:.4f}, Test Accuracy: {test_acc_torch_1:.4f}, Total Params: {total_params_torch_1}")

print("Comparison Done.")



Training PyTorch MLP...
Epoch 1/50, Loss: 0.8123
Epoch 2/50, Loss: 0.4655
Epoch 3/50, Loss: 0.4053
Epoch 4/50, Loss: 0.3917
Epoch 5/50, Loss: 0.3873
Epoch 6/50, Loss: 0.3849
Epoch 7/50, Loss: 0.3834
Epoch 8/50, Loss: 0.3815
Epoch 9/50, Loss: 0.3803
Epoch 10/50, Loss: 0.3790
Epoch 11/50, Loss: 0.3779
Epoch 12/50, Loss: 0.3771
Epoch 13/50, Loss: 0.3765
Epoch 14/50, Loss: 0.3751
Epoch 15/50, Loss: 0.3750
Epoch 16/50, Loss: 0.3737
Epoch 17/50, Loss: 0.3728
Epoch 18/50, Loss: 0.3720
Epoch 19/50, Loss: 0.3713
Epoch 20/50, Loss: 0.3707
Epoch 21/50, Loss: 0.3699
Epoch 22/50, Loss: 0.3692
Epoch 23/50, Loss: 0.3687
Epoch 24/50, Loss: 0.3681
Epoch 25/50, Loss: 0.3675
Epoch 26/50, Loss: 0.3665
Epoch 27/50, Loss: 0.3663
Epoch 28/50, Loss: 0.3650
Epoch 29/50, Loss: 0.3646
Epoch 30/50, Loss: 0.3637
Epoch 31/50, Loss: 0.3629
Epoch 32/50, Loss: 0.3621
Epoch 33/50, Loss: 0.3614
Epoch 34/50, Loss: 0.3608
Epoch 35/50, Loss: 0.3604
Epoch 36/50, Loss: 0.3596
Epoch 37/50, Loss: 0.3587
Epoch 38/50, Loss: 0.3

#### Analysis:
##### 0. According to the results:
1. Neither the sklearn nor the PyTorch MLP converged within 50 epochs, and 100, 200, or 300 epochs did not clearly get there either (convergence may need roughly 250 to 300+ epochs).
2. Both improved with more epochs, but may still not have fully converged.
3. Both models showed signs of overfitting, with training accuracy significantly higher than test accuracy.
4. On the test set, the sklearn MLP achieved higher accuracy than the PyTorch MLP at both epoch settings (50 and 300). Caveat: in the recorded run, the PyTorch 300-epoch test accuracy was computed from the 50-epoch predictions ('test_preds' instead of 'test_preds_1'), so that figure does not actually measure the 300-epoch model.
5. It seems that the sklearn MLP reached its best performance more slowly than the PyTorch MLP (see the PyTorch MLP accuracy on the test set), but its best performance was better.
6. Both models approached similar training accuracies with more epochs, but sklearn overfitted less.
7. For this dataset and configuration, sklearn outperformed PyTorch.
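##### 1. The difference between sklearn and PyTorch may come from
1. Different default weight initialization
2. Different implementations of the optimization algorithm (sklearn's Adam vs PyTorch's Adam)
3. Different batch handling, shuffling, etc.
4. Different numerical precision (the PyTorch tensors here are float32; sklearn typically works in float64)
5. Different regularization defaults (sklearn's MLPClassifier applies an L2 penalty, alpha=0.0001, by default; the PyTorch Adam here uses no weight decay)
##### 2. The gap between train and test accuracy is likely due to
1. Overfitting: the model fits the training data too closely
2. Too few epochs, so the model may not have converged
3. Model complexity too high for the dataset size
4. Distribution shift between the train and test sets
5. Limited training data (insufficient samples); the test set is also small (99 rows), so its accuracy estimate is noisy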
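
One cheap check that may help separate overfitting from distribution shift is to compare the class shares of the two sets and score a "always predict the majority training class" baseline on both. The sketch below uses made-up label arrays purely for illustration; the real shares in train.csv and test.csv are not known here, and the helper names are hypothetical:

```python
import numpy as np

def class_shares(labels, n_classes):
    """Fraction of samples belonging to each class."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()

def majority_baseline_accuracy(train_labels, eval_labels, n_classes):
    """Accuracy of always predicting the most frequent training class."""
    majority = int(np.argmax(np.bincount(train_labels, minlength=n_classes)))
    return float(np.mean(eval_labels == majority))

# hypothetical labels: an imbalanced training set and a balanced test set
toy_train = np.array([1] * 84 + [0] * 10 + [2] * 6)
toy_test = np.array([0] * 33 + [1] * 33 + [2] * 33)

print("train shares:", class_shares(toy_train, 3))
print("test shares: ", class_shares(toy_test, 3))
print("baseline on train:", majority_baseline_accuracy(toy_train, toy_train, 3))
print("baseline on test: ", majority_baseline_accuracy(toy_train, toy_test, 3))
```

If the real data showed a pattern like this toy one, a high training accuracy next to a near-chance test accuracy would say more about the label distributions than about the number of epochs.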

#### e. Subtask 4
- consult names for the activation funcs in package
- train and run model with each
- output comparison and analysis

In [5]:
# Subtask 4: Activation Function Comparison
print("\n--- Activation Function Comparison ---")

# Test different activation functions using sklearn MLPClassifier
# identity=no activation, logistic=sigmoid
activation_functions = ['identity', 'relu', 'logistic']  
results = {}

for activation in activation_functions:
    print(f"Testing activation function: {activation}")
    
    mlp_activation = make_default_mlp(activation=activation)
    
    # Train model
    mlp_activation.fit(X_train_scaled, y_train)
    
    # Evaluate
    train_acc = sk.metrics.accuracy_score(y_train, mlp_activation.predict(X_train_scaled))
    
    results[activation] = {
        'train_accuracy': train_acc,
        'final_loss': mlp_activation.loss_,
        'iterations': mlp_activation.n_iter_
    }
    
    print(f"   Training Accuracy: {train_acc:.4f}")
    print(f"   Final Loss: {mlp_activation.loss_:.6f}")
    print(f"   Iterations: {mlp_activation.n_iter_}\n")

print("Activation Function Comparison:")
print(f"{'Function':<12} {'Accuracy':<10} {'Loss':<10} {'Iterations':<12}")
print("-" * 50)
for func, metrics in results.items():
    print(f"{func:<12} {metrics['train_accuracy']:<10.4f} {metrics['final_loss']:<10.6f} {metrics['iterations']:<12}")



--- Activation Function Comparison ---
Testing activation function: identity
   Training Accuracy: 0.8384
   Final Loss: 0.391846
   Iterations: 18

Testing activation function: relu




   Training Accuracy: 0.8527
   Final Loss: 0.350416
   Iterations: 50

Testing activation function: logistic
   Training Accuracy: 0.8399
   Final Loss: 0.389615
   Iterations: 50

Activation Function Comparison:
Function     Accuracy   Loss       Iterations  
--------------------------------------------------
identity     0.8384     0.391846   18          
relu         0.8527     0.350416   50          
logistic     0.8399     0.389615   50          




#### Activation Function Analysis:

As I learned, an activation function in ML (especially in neural networks):
- introduces non-linearity, which allows the network to learn complex patterns and relationships

For this dataset and model configuration:
- 'identity' (no activation) stops earliest by a clear margin: 18 iterations, versus the full 50 for the other two.
- 'relu' achieves the best training accuracy (0.8527) and the lowest loss (0.3504).
- 'logistic' gives slightly better results than 'identity' (0.8399 vs 0.8384 accuracy) but uses all 50 iterations.
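
The different iteration counts can be read as a plateau rule: as documented for MLPClassifier, training ends when the loss fails to improve by at least tol for n_iter_no_change consecutive epochs (defaults 1e-4 and 10), or when max_iter is reached. The function below is a simplified paraphrase of that rule with a hypothetical name, not sklearn's actual code:

```python
import math

def epoch_of_plateau_stop(losses, tol=1e-4, patience=10):
    """Return the 1-based epoch at which a plateau rule would stop, or None if it never triggers."""
    best = math.inf
    stale = 0  # consecutive epochs without an improvement of at least tol
    for epoch, loss in enumerate(losses, start=1):
        if loss > best - tol:
            stale += 1
        else:
            stale = 0
        best = min(best, loss)
        if stale >= patience:
            return epoch
    return None

# a loss curve that flattens after three epochs
flat_curve = [1.0, 0.5, 0.4] + [0.4] * 20
# a loss curve that keeps improving by 0.01 per epoch
improving_curve = [1.0 - 0.01 * i for i in range(50)]

print(epoch_of_plateau_stop(flat_curve, patience=2))
print(epoch_of_plateau_stop(improving_curve, patience=2))
```

Under this reading, a run that reports exactly 50 iterations has hit the epoch limit rather than a plateau.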

For each:
1. No Activation (identity):
   - f(x) = x
   - Linear transformations only
   - Cannot learn complex non-linear patterns
   - With a softmax output, equivalent to a linear classifier (multinomial logistic regression), since stacked linear layers collapse into a single linear map

2. ReLU Activation:
   - Introduces non-linearity: f(x) = max(0, x)
   - Mitigates the vanishing gradient problem (the gradient is 1 for positive inputs)
   - Most commonly used in deep networks
   - Can suffer from the 'dying ReLU' problem (units stuck at zero output)
   - Loosely like a biological neuron: it passes a signal only when its input is positive

3. Sigmoid Activation:
   - f(x) = 1/(1+exp(-x))
   - Smooth non-linearity, infinitely differentiable
   - Output range (0, 1)
   - Can suffer from vanishing gradients in deep networks
   - Most sensitive around x=0, where its slope peaks at 0.25
   - Saturates for inputs of large magnitude
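
The properties listed above can be checked numerically. The sketch below implements the three functions in numpy, shows where the sigmoid gradient peaks and where it vanishes, and confirms that two stacked identity layers equal one linear layer. The matrix sizes and random values are arbitrary choices for illustration:

```python
import numpy as np

def identity(x):
    return x

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print("relu:        ", relu(x))
print("sigmoid:     ", sigmoid(x))
print("sigmoid grad:", sigmoid_grad(x))  # largest at 0, tiny at +-10 (saturation)

# two identity-activation layers collapse into a single linear map
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
v = rng.normal(size=3)

two_layers = W2 @ identity(W1 @ v + b1) + b2
one_layer = (W2 @ W1) @ v + (W2 @ b1 + b2)
print("collapse holds:", np.allclose(two_layers, one_layer))
```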


#### f. Subtask 5
- ways to select features:
   1. Filter methods, like SelectKBest(): based on statistical tests between each feature and the target (ignoring feature interactions)
   2. Wrapper methods, like RFE (recursive feature elimination): use a predictive model to assess feature subsets
   3. Embedded methods, like Lasso: perform feature selection during model training through an L1 penalty that drives some coefficients to zero
   4. Dimensionality reduction, like PCA: transforms features into a lower-dimensional space (similar to a latent space) while retaining most of the variance
- try:
   1. Filter: SelectKBest
   2. Dimensionality reduction: PCA
   3. Different scalers applied to all features: StandardScaler (the baseline), MinMaxScaler, and RobustScaler
- noticed that none of the experimental models converged within 50 epochs
- compare the results
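
To show what a filter method scores, here is a hand-rolled one-way ANOVA F statistic per feature: between-class variance divided by within-class variance. It is meant as an illustration of the idea behind an F-test filter, not as a replacement for the library function; the data is synthetic and anova_f_scores is a hypothetical helper name:

```python
import numpy as np

def anova_f_scores(X, y):
    """One-way ANOVA F statistic for each column of X against the class labels y."""
    classes = np.unique(y)
    n_samples, n_classes = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        ss_between += Xc.shape[0] * (class_mean - grand_mean) ** 2
        ss_within += ((Xc - class_mean) ** 2).sum(axis=0)
    return (ss_between / (n_classes - 1)) / (ss_within / (n_samples - n_classes))

rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1, 2], 50)
noise_feature = rng.normal(size=150)                     # unrelated to the class
informative_feature = y_toy + rng.normal(scale=0.1, size=150)  # tracks the class closely
X_toy = np.column_stack([noise_feature, informative_feature])

scores = anova_f_scores(X_toy, y_toy)
top_feature = int(np.argsort(scores)[::-1][0])  # index of the highest-scoring feature
print("F scores:", scores, "-> keep feature", top_feature)
```

Because each feature is scored on its own, two features that are only useful together would both receive low scores, which is the "ignoring feature interactions" limitation noted above.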

In [6]:
print(X_train.shape[1])

8


In [7]:
# Subtask 5: Feature processing for better performance
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

print("\n--- Feature Processing for Better Performance ---")

# Baseline performance
print(f"Baseline MLP:")
print(f"   Training Accuracy: {train_acc_mlp:.4f}")
print(f"   Test Accuracy: {test_acc_mlp:.4f}")
print("--------------------------------------------------------")


exp = {}

# Try 1: Filter, SelectKBest()
print("\nTrying SelectKBest feature selection...")

# select top k features using ANOVA F-test
for k in [2, 4, 6]:
    print(f"Selecting top {k} features...")
    
    selector = SelectKBest(score_func=f_classif, k=k)
    X_train_selected = selector.fit_transform(X_train_scaled, y_train)
    X_test_selected = selector.transform(X_test_scaled)
    
    # train MLP on selected features
    mlp_fs =  make_default_mlp()
    
    mlp_fs.fit(X_train_selected, y_train)
    
    exp[f'fs_k={k}'] = mlp_fs
    train_acc_fs = sk.metrics.accuracy_score(y_train, mlp_fs.predict(X_train_selected))
    test_acc_fs = sk.metrics.accuracy_score(y_encoder.transform(test_data['admission']), mlp_fs.predict(X_test_selected))
    
    print(f"   Training Accuracy with {k} features: {train_acc_fs:.4f}")
    print(f"   Test Accuracy with {k} features: {test_acc_fs:.4f}")
    print(f"   Selected features: {np.where(selector.get_support())[0]}")

# k = 8 is the baseline
print(f"   Training Accuracy with {8} features: {train_acc_mlp:.4f}")
print(f"   Test Accuracy with {8} features: {test_acc_mlp:.4f}")
print(f"   Selected features: {range(X_train.shape[1])}")

print("--------------------------------------------------------")
    
# Try 2: Dimensionality Reduction, PCA
print("\nTrying PCA for dimensionality reduction...")

for n_components in [2, 4, 6, X_train.shape[1]]:
    print(f"Testing PCA with {n_components} components...")
    print(f"Reducing to {n_components} principal components...")
    
    pca = PCA(n_components=n_components, random_state=42)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)
    
    # train MLP on PCA components
    mlp_pca = make_default_mlp()
    
    mlp_pca.fit(X_train_pca, y_train)
    
    exp[f'pca_n={n_components}'] = mlp_pca
    train_acc_pca = sk.metrics.accuracy_score(y_train, mlp_pca.predict(X_train_pca))
    test_acc_pca = sk.metrics.accuracy_score(y_encoder.transform(test_data['admission']), mlp_pca.predict(X_test_pca))
    
    print(f"   Training Accuracy with {n_components} PCA components: {train_acc_pca:.4f}")
    print(f"   Test Accuracy with {n_components} PCA components: {test_acc_pca:.4f}")
    print(f"   Explained variance ratio: {pca.explained_variance_ratio_}")
    
print("--------------------------------------------------------")

# Try 3: scale some features differently
from sklearn.preprocessing import MinMaxScaler, RobustScaler
print("\nTrying different scaling strategies...")

scalers = {
    'minmax': MinMaxScaler(),
    'robust': RobustScaler()
}

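for name, alt_scaler in scalers.items():
    print(f"Using {name} scaler...")
    
    # use a separate loop variable so the StandardScaler stored in 'scaler' is not overwritten
    X_train_alt = alt_scaler.fit_transform(X_train)
    X_test_alt = alt_scaler.transform(X_test)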
    
    mlp_alt = make_default_mlp()
    mlp_alt.fit(X_train_alt, y_train)
    
    exp[f'scaler_{name}'] = mlp_alt
    train_acc_alt = sk.metrics.accuracy_score(y_train, mlp_alt.predict(X_train_alt))
    test_acc_alt = sk.metrics.accuracy_score(y_encoder.transform(test_data['admission']), mlp_alt.predict(X_test_alt))
    print(f"   Training Accuracy with {name} scaler: {train_acc_alt:.4f}")
    print(f"   Test Accuracy with {name} scaler: {test_acc_alt:.4f}")
    
print("--------------------------------------------------------")
print("Feature Processing Done.")


--- Feature Processing for Better Performance ---
Baseline MLP:
   Training Accuracy: 0.8527
   Test Accuracy: 0.3535
--------------------------------------------------------

Trying SelectKBest feature selection...
Selecting top 2 features...




   Training Accuracy with 2 features: 0.8468
   Test Accuracy with 2 features: 0.3333
   Selected features: [2 5]
Selecting top 4 features...




   Training Accuracy with 4 features: 0.8474
   Test Accuracy with 4 features: 0.3434
   Selected features: [0 2 4 5]
Selecting top 6 features...




   Training Accuracy with 6 features: 0.8510
   Test Accuracy with 6 features: 0.3333
   Selected features: [0 1 2 3 4 5]
   Training Accuracy with 8 features: 0.8527
   Test Accuracy with 8 features: 0.3535
   Selected features: range(0, 8)
--------------------------------------------------------

Trying PCA for dimensionality reduction...
Testing PCA with 2 components...
Reducing to 2 principal components...




   Training Accuracy with 2 PCA components: 0.8468
   Test Accuracy with 2 PCA components: 0.3333
   Explained variance ratio: [0.19722454 0.18674995]
Testing PCA with 4 components...
Reducing to 4 principal components...




   Training Accuracy with 4 PCA components: 0.8484
   Test Accuracy with 4 PCA components: 0.3434
   Explained variance ratio: [0.19722454 0.18674995 0.12820658 0.12605284]
Testing PCA with 6 components...
Reducing to 6 principal components...




   Training Accuracy with 6 PCA components: 0.8489
   Test Accuracy with 6 PCA components: 0.3333
   Explained variance ratio: [0.19722454 0.18674995 0.12820658 0.12605284 0.12336417 0.12219387]
Testing PCA with 8 components...
Reducing to 8 principal components...




   Training Accuracy with 8 PCA components: 0.8533
   Test Accuracy with 8 PCA components: 0.3535
   Explained variance ratio: [0.19722454 0.18674995 0.12820658 0.12605284 0.12336417 0.12219387
 0.06049988 0.05570816]
--------------------------------------------------------

Trying different scaling strategies...
Using minmax scaler...




   Training Accuracy with minmax scaler: 0.8456
   Test Accuracy with minmax scaler: 0.3434
Using robust scaler...
   Training Accuracy with robust scaler: 0.8504
   Test Accuracy with robust scaler: 0.3535
--------------------------------------------------------
Feature Processing Done.




#### Analysis
##### According to the results:
1. None of the attempted methods improved performance much.
2. For this dataset with only 8 features, feature selection and alternative scalers do not seem to have a significant impact on model performance.
3. PCA with 8 components is the only variant that beat the baseline, and only on training accuracy (0.8533 vs 0.8527); its test accuracy is identical (0.3535). Keeping all 8 components only rotates the scaled features, so this gap is probably noise.
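
Why PCA with all components cannot add information: it is an orthogonal change of basis, so distances between samples are preserved and the original (centered) data can be recovered exactly. A numpy sketch using an SVD on random data (the data and sizes are arbitrary, not the admission features):

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))  # correlated toy features
X_centered = X_toy - X_toy.mean(axis=0)

# rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
Z = X_centered @ Vt.T                 # projection onto all 8 components
explained = S**2 / np.sum(S**2)       # explained variance ratio per component

dist_before = np.linalg.norm(X_centered[0] - X_centered[1])
dist_after = np.linalg.norm(Z[0] - Z[1])
X_back = Z @ Vt                       # undo the rotation

print("explained variance ratios sum to", explained.sum())
print("distance preserved:", np.isclose(dist_before, dist_after))
print("exactly invertible:", np.allclose(X_back, X_centered))
```

Only when components are dropped does PCA discard variance, which matches the small accuracy drops seen with 2, 4, and 6 components.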

#### g. Subtask 6 (by AI)
##### **Notations**

- $\mathbf{x}$: Input vector (shape: $[d_{in}, 1]$)
- $\mathbf{W}_1, \mathbf{b}_1$: Weights and bias for first hidden layer ($[128, d_{in}]$, $[128, 1]$)
- $\mathbf{W}_2, \mathbf{b}_2$: Weights and bias for second hidden layer ($[256, 128]$, $[256, 1]$)
- $\mathbf{W}_3, \mathbf{b}_3$: Weights and bias for output layer ($[d_{out}, 256]$, $[d_{out}, 1]$)
- $\mathrm{ReLU}(z)$: Activation function, $\mathrm{ReLU}(z) = \max(0, z)$
- $\odot$: Element-wise (Hadamard) product
- $L$: Loss function (e.g., cross-entropy)
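- $\delta_i$: Gradient of the loss w.r.t. the pre-activation of layer $i$, i.e., $\delta_1 = \frac{\partial L}{\partial \mathbf{a}_1}$, $\delta_2 = \frac{\partial L}{\partial \mathbf{a}_2}$, $\delta_3 = \frac{\partial L}{\partial \mathbf{z}}$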

---

##### **Forward Pass**

\[
\begin{align*}
\mathbf{a}_1 &= \mathbf{W}_1 \mathbf{x} + \mathbf{b}_1 \\
\mathbf{h}_1 &= \mathrm{ReLU}(\mathbf{a}_1) \\
\mathbf{a}_2 &= \mathbf{W}_2 \mathbf{h}_1 + \mathbf{b}_2 \\
\mathbf{h}_2 &= \mathrm{ReLU}(\mathbf{a}_2) \\
\mathbf{z}   &= \mathbf{W}_3 \mathbf{h}_2 + \mathbf{b}_3 \\
\end{align*}
\]

---

##### **Backward Pass (Backpropagation)**

###### 1. **Output Layer**

\[
\delta_3 = \frac{\partial L}{\partial \mathbf{z}}
\]

- For cross-entropy with softmax, $\delta_3 = \hat{\mathbf{y}} - \mathbf{y}$, where $\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{z})$ and $\mathbf{y}$ is the one-hot target

###### 2. **Second Hidden Layer**

\[
\delta_2 = (\mathbf{W}_3^\top \delta_3) \odot \mathrm{ReLU}'(\mathbf{a}_2)
\]
- $\mathrm{ReLU}'(\mathbf{a}_2)$ is $1$ where $\mathbf{a}_2 > 0$, else $0$

###### 3. **First Hidden Layer**

\[
\delta_1 = (\mathbf{W}_2^\top \delta_2) \odot \mathrm{ReLU}'(\mathbf{a}_1)
\]

---

##### **Gradients w.r.t. Parameters**

\[
\begin{align*}
\frac{\partial L}{\partial \mathbf{W}_3} &= \delta_3 \mathbf{h}_2^\top \\
\frac{\partial L}{\partial \mathbf{b}_3} &= \delta_3 \\
\frac{\partial L}{\partial \mathbf{W}_2} &= \delta_2 \mathbf{h}_1^\top \\
\frac{\partial L}{\partial \mathbf{b}_2} &= \delta_2 \\
\frac{\partial L}{\partial \mathbf{W}_1} &= \delta_1 \mathbf{x}^\top \\
\frac{\partial L}{\partial \mathbf{b}_1} &= \delta_1 \\
\end{align*}
\]

---

##### **Summary of Steps**

1. **Forward:** Compute activations $\mathbf{a}_1, \mathbf{h}_1, \mathbf{a}_2, \mathbf{h}_2, \mathbf{z}$
2. **Backward:**  
   - Compute $\delta_3$ from loss  
   - Propagate to $\delta_2$ using $\mathbf{W}_3$ and ReLU derivative  
   - Propagate to $\delta_1$ using $\mathbf{W}_2$ and ReLU derivative  
3. **Parameter Gradients:** Use $\delta_i$ and previous layer activations to get gradients for all weights and biases.
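
The three steps above can be sketched in numpy and verified against finite differences. This is a minimal single-sample version under stated assumptions: softmax with cross-entropy as the loss, 1-D vectors instead of column matrices, and small hypothetical layer widths (4, 5, 6, 3) so the check runs quickly:

```python
import numpy as np

def relu(v):
    return np.maximum(0.0, v)

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def forward(params, x):
    W1, b1, W2, b2, W3, b3 = params
    a1 = W1 @ x + b1
    h1 = relu(a1)
    a2 = W2 @ h1 + b2
    h2 = relu(a2)
    z = W3 @ h2 + b3
    return a1, h1, a2, h2, z

def loss_fn(params, x, y_onehot):
    z = forward(params, x)[-1]
    return -float(np.sum(y_onehot * np.log(softmax(z))))

def backward(params, x, y_onehot):
    W2, W3 = params[2], params[4]
    a1, h1, a2, h2, z = forward(params, x)
    d3 = softmax(z) - y_onehot      # delta_3 for softmax + cross-entropy
    d2 = (W3.T @ d3) * (a2 > 0)     # delta_2, ReLU'(a2) as a 0/1 mask
    d1 = (W2.T @ d2) * (a1 > 0)     # delta_1
    # same order as params: dW1, db1, dW2, db2, dW3, db3
    return [np.outer(d1, x), d1, np.outer(d2, h1), d2, np.outer(d3, h2), d3]

rng = np.random.default_rng(0)
sizes = [4, 5, 6, 3]
params = []
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    params += [rng.normal(scale=0.5, size=(n_out, n_in)), rng.normal(scale=0.5, size=n_out)]
x = rng.normal(size=4)
y_onehot = np.array([0.0, 1.0, 0.0])

grads = backward(params, x, y_onehot)

# central finite differences over every single parameter
eps = 1e-6
max_err = 0.0
for p, g in zip(params, grads):
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps
        loss_plus = loss_fn(params, x, y_onehot)
        p[idx] = old - eps
        loss_minus = loss_fn(params, x, y_onehot)
        p[idx] = old
        numeric = (loss_plus - loss_minus) / (2 * eps)
        max_err = max(max_err, abs(numeric - g[idx]))

print(f"largest gap between analytic and numeric gradient: {max_err:.2e}")
```

A gap close to zero indicates that the analytic gradients agree with the numerical ones; the check can misfire if a pre-activation happens to sit almost exactly at the ReLU kink.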

---