# **COMP 2211 - Exploring Artificial Intelligence**  
## Lab 6 · Multilayer Perceptron (MLP)

Welcome to Lab 6!  
Your mission is to design, train, and evaluate a **Multilayer Perceptron (MLP)** that predicts a student's `final_grade` (`G3` in the original dataset) using an altered version of the [UCI Student-Performance dataset](https://archive.ics.uci.edu/ml/datasets/Student+Performance).

![MLP Illustration](https://media.geeksforgeeks.org/wp-content/uploads/nodeNeural.jpg)


## **Tasks Breakdown**

1. **Task 0 · Import & visualise** the dataset (feature histograms + correlation heat-map).  
2. **Task 1 · Pre-process** by standardising every feature and the target (Z-score) using training-set statistics.  
3. **Task 2 · Build** a compact MLP (≤ 20 000 params) with Keras `Dense`, `BatchNormalization` and `Dropout` layers.  
4. **Task 3 · Compile & train** the model (Adam optimizer + MSE loss, monitor MAE & MSE validation scores), save the best model as **`lab6_model.keras`**.  
5. **Task 4 · Evaluate & save**: compute MSE, RMSE, MAE & R², draw an *Actual vs Predicted* scatter plot.


## **Task 0 · Import & Visualise the Dataset** (no TODOs)

The dataset, collected by Paulo Cortez *et al.* and hosted on the UCI Machine Learning Repository, combines demographic, social, and academic information for Portuguese secondary-school students. Here are the features:


#### Feature list

| Formal name | Description |
|-------------|-------------|
| `school` | Secondary school attended |
| `gender` | Student's gender |
| `age` | Age in years (15 - 22) |
| `address` | Home setting |
| `family_size` | Size of household (number of people living together) |
| `parent_status` | Whether parents live together |
| `mother_education` | Mother's highest education level (0 = none … 4 = tertiary) |
| `father_education` | Father's highest education level |
| `mother_job` | Mother's main occupation |
| `father_job` | Father's main occupation |
| `school_reason` | Main reason for choosing the current school |
| `guardian` | Primary legal guardian |
| `travel_time` | One-way commute time to school (1 = < 15 min … 4 = > 60 min) |
| `study_time` | Weekly hours of individual study (1 = < 2 h … 4 = ≥ 10 h) |
| `failures` | Number of past class failures (0 - 3) |
| `school_support` | Extra academic support provided by the school |
| `family_support` | Educational support provided by family |
| `extra_paid_classes` | Attendance at paid private classes |
| `extracurricular_activities` | Participation in sports, arts, clubs, … |
| `nursery_attended` | Attendance at nursery school |
| `higher_education` | Intention to pursue higher education |
| `internet_access` | Internet access at home |
| `romantic_relationship` | Ongoing romantic relationship |
| `family_relationship` | Quality of family relationships (1 = very bad … 5 = excellent) |
| `free_time` | Free time after school (1 = very low … 5 = very high) |
| `going_out` | Evenings out with friends (same 1-5 scale) |
| `weekday_alcohol_consumption` | Alcohol intake Monday-Friday (1 = none … 5 = very high) |
| `weekend_alcohol_consumption` | Alcohol intake on weekends |
| `health_status` | Current health condition (1 = very poor … 5 = very good) |
| `absences` | Number of school absences |
| `first_period_grade` | Grade at the end of the 1st term (0 - 20) |
| `second_period_grade` | Grade at the end of the 2nd term (0 - 20) |



#### Categorical Feature Encoded List:

| Feature                      | Integer → Meaning                                                                 |
|------------------------------|-----------------------------------------------------------------------------------|
| **school**                   | 0 = GP, 1 = MS                                                                    |
| **gender**                   | 0 = F, 1 = M                                                                      |
| **address**                  | 0 = R, 1 = U                                                                      |
| **family_size**              | 0 = LE3, 1 = GT3                                                                  |
| **parent_status**            | 0 = T, 1 = A                                                                      |
| **mother_job**               | 0 = services, 1 = at_home, 2 = teacher, 3 = health, 4 = other                     |
| **father_job**               | 0 = at_home, 1 = teacher, 2 = other, 3 = services, 4 = health                     |
| **school_reason**            | 0 = reputation, 1 = other, 2 = home, 3 = course                                   |
| **guardian**                 | 0 = father, 1 = other, 2 = mother                                                 |
| **school_support**           | 0 = yes, 1 = no                                                                   |
| **family_support**           | 0 = yes, 1 = no                                                                   |
| **extra_paid_classes**       | 0 = no, 1 = yes                                                                   |
| **extracurricular_activities** | 0 = no, 1 = yes                                                                 |
| **nursery_attended**         | 0 = no, 1 = yes                                                                   |
| **higher_education**         | 0 = no, 1 = yes                                                                   |
| **internet_access**          | 0 = yes, 1 = no                                                                   |
| **romantic_relationship**    | 0 = no, 1 = yes                                                                   |

Attribute to predict: `final_grade`, which is the grade at the end of the 3rd term.

In [None]:
# Import libraries

import numpy as np
import pandas as pd
import warnings

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

if __name__ == "__main__":
  import matplotlib.pyplot as plt
  import seaborn as sns

In [None]:
# load data

if __name__ == "__main__":
    X_train = pd.read_csv("X_train.csv")
    X_test  = pd.read_csv("X_test.csv")
    y_train = pd.read_csv("y_train.csv").squeeze()
    y_test  = pd.read_csv("y_test.csv").squeeze()

In [None]:
# Visualise attribute distributions & correlations

if __name__ == "__main__":
    train_df = X_train.copy()
    train_df[y_train.name] = y_train
    display(train_df.head())

    # Histograms
    ax = train_df.hist(bins=40, figsize=(14,10))
    plt.suptitle("Feature distributions (training set)", y=1.03, fontsize=16)
    plt.tight_layout(rect=[0, 0.03, 1, 0.97])
    plt.show()

    # Correlation heat‑map
    plt.figure(figsize=(20,16))
    sns.heatmap(train_df.corr(), annot=True, fmt=".1f", cmap="coolwarm")
    plt.title("Correlation heat‑map")
    plt.show()


## **Task 1 · Data Preprocessing**

### Z-score Standardisation

All features are numeric, so we apply **Z-score normalisation** using the **training set** mean and standard deviation:

$$
z \;=\;\frac{x - \mu_{\text{train}}}{\sigma_{\text{train}}}
$$

**Important Notes**

- Statistics ($\mu_{\text{train}}, \sigma_{\text{train}}$) are computed **only** on the **training** set, because we must assume that no information about the test set is available while the model is being trained. The same training statistics are then applied to the test set.  
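
To make the formula concrete, here is a minimal sketch on toy NumPy arrays. The array names and values are hypothetical and are not the lab data; the point is only that the mean and standard deviation come from the training rows and are reused unchanged for the test rows.

```python
import numpy as np

# Toy data (hypothetical values): 4 training rows and 2 test rows, 2 features each.
X_train_toy = np.array([[15.0, 2.0], [16.0, 4.0], [17.0, 6.0], [18.0, 8.0]])
X_test_toy = np.array([[16.5, 5.0], [20.0, 0.0]])

# Statistics are estimated from the training rows only.
mu_train = X_train_toy.mean(axis=0)
sigma_train = X_train_toy.std(axis=0)  # population standard deviation (ddof=0)

# The same mu_train / sigma_train are applied to both splits.
X_train_std_toy = (X_train_toy - mu_train) / sigma_train
X_test_std_toy = (X_test_toy - mu_train) / sigma_train

print(X_train_std_toy.mean(axis=0))  # approximately 0 for every column
print(X_train_std_toy.std(axis=0))   # approximately 1 for every column
print(X_test_std_toy)                # test rows need not have mean 0 / std 1
```

One detail worth checking in your own code: NumPy's `std` defaults to `ddof=0`, whereas pandas' `std` defaults to `ddof=1`. The evaluation cell later in this lab uses `y_train.std(ddof=0)`, so using the same convention in `preprocess` keeps the inverse transform consistent.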

In [None]:
def preprocess(X_train,X_test,y_train,y_test):
    """
    Perform z-score normalisation on feature and target data.

    Parameters
    ----------
    X_train : pandas.DataFrame
        Feature matrix for the training split.
    X_test : pandas.DataFrame
        Feature matrix for the test split.
    y_train : pandas.Series
        Target vector for the training split.
    y_test : pandas.Series
        Target vector for the test split.

    Returns
    -------
    tuple
        (
        X_train_std : pandas.DataFrame, standardised training features
        X_test_std  : pandas.DataFrame, standardised test features
        y_train_std : pandas.Series, standardised training targets
        y_test_std  : pandas.Series, standardised test targets
        )

    """

    ###############################################################################
    # TODO: your code starts here





    # TODO: your code ends here
    ###############################################################################

    return (
        X_train_std,
        X_test_std,
        y_train_std,
        y_test_std
    )


In [None]:
if __name__ == "__main__":
  X_train = pd.read_csv("X_train.csv")
  X_test  = pd.read_csv("X_test.csv")
  y_train = pd.read_csv("y_train.csv").squeeze()
  y_test  = pd.read_csv("y_test.csv").squeeze()
  print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

## **Task 2 · Model Building**

We'll build a **compact multilayer perceptron** that predicts `final_grade` (a float in the range 0 - 20) from student performance features.  
Your model can *NOT* contain more than **20,000 parameters**.

#### Common layers used in MLP

| # | Layer | What it does | Why it helps here |
|---|-------|--------------|-------------------|
| **1** | **Dense**<br>([`keras.layers.Dense`](https://keras.io/api/layers/core_layers/dense/)) | Fully-connected computation $y = W \cdot x + b$, followed by an activation (here **ReLU**). | Captures non-linear relationships between socio-demographic features and exam marks. |
| **2** | **Batch Normalization**<br>([`keras.layers.BatchNormalization`](https://keras.io/api/layers/normalization_layers/batch_normalization/)) | Normalises layer activations to zero mean and unit variance during training. | Speeds convergence and allows slightly higher learning rates on small tabular sets. |
| **3** | **Dropout**<br>([`keras.layers.Dropout`](https://keras.io/api/layers/regularization_layers/dropout/)) | Randomly “switches off” neurons each step. | Mitigates over-fitting on the modest-sized UCI dataset. |

> **Regularisation note** - Every Dense layer also uses **l2 weight-decay** (`kernel_regularizer=keras.regularizers.l2(1e-4)`), adding more generalisation power without adding parameters.
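
Before building anything in Keras, it can help to estimate the parameter count by hand. The sketch below is a hypothetical helper (`count_mlp_params` is not part of the lab code) and the layer sizes `(64, 32)` are only an illustration, not a recommended architecture. It assumes 32 input features, matching the feature table above. A `Dense` layer contributes `inputs * units + units` parameters, a `BatchNormalization` layer contributes 4 per unit (two trainable, two non-trainable, all counted in the "Total params" of `model.summary()`), and `Dropout` contributes none.

```python
def count_mlp_params(input_dim, hidden_units, batch_norm=True):
    """Count parameters of a Dense(+BatchNormalization) stack ending in Dense(1)."""
    total = 0
    fan_in = input_dim
    for units in hidden_units:
        total += fan_in * units + units  # Dense: weight matrix + bias vector
        if batch_norm:
            total += 4 * units           # gamma, beta, moving mean, moving variance
        fan_in = units
    total += fan_in * 1 + 1              # output layer: Dense(1) for regression
    return total


# Hypothetical architecture: 32 inputs -> 64 -> 32 -> 1
n_params = count_mlp_params(input_dim=32, hidden_units=(64, 32))
print(n_params)            # 4609
print(n_params <= 20_000)  # True
```

The L2 regulariser does not appear in this count because it only adds a penalty term to the loss.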


In [None]:
def create_model(input_dim):
    """
    Tiny fully-connected network for regression.

    Parameters
    ----------
    input_dim : int
        Number of input features (after preprocessing).

    Returns
    -------
    keras.Sequential
        The MLP model.
    """
    l2 = regularizers.l2(1e-4)

    ###############################################################################
    # TODO: your code starts here














    # TODO: your code ends here
    ###############################################################################


    return model

In [None]:
if __name__ == "__main__":
  # Your model summary. Make sure it doesn't contain more than 20,000 Total params.
  warnings.filterwarnings('ignore',message=r'.*input_shape.*Sequential.*',category=UserWarning)
  X_train_std, X_test_std, y_train_std, y_test_std = preprocess(
    X_train, X_test, y_train, y_test
    )
  model = create_model(X_train_std.shape[1])
  model.summary()

## **Task 3 · Compile & Train**

For detailed information on configuring your model before training, including all available parameters for the `.compile()` method, see the [TensorFlow Keras API documentation](https://www.tensorflow.org/api_docs/python/tf/keras/Model#compile):

- **Model.compile** configures the model for training by specifying:
  - `optimizer` (e.g. `Adam`, `RMSprop`, `SGD`, …). You need to use **Adam** in this task.
  - `loss` (e.g. `"mse"`, `MeanSquaredError()`, …). You need to use **MSE** in this task.
  - `metrics` (e.g. `["mae", "mse"]`, `BinaryAccuracy()`, …). You need to use **MAE and MSE** in this task.
  
For a hands-on example showing how to set up the training workflow, including `.compile()`, `.fit()`, and monitoring validation metrics, see the [Keras guide](https://www.tensorflow.org/guide/keras/training_with_built_in_methods#the_compile_method_specifying_a_loss_metrics_and_an_optimizer) on built-in training methods.

### Compile the model

In [None]:
def compile_model(model):
  """
    Compile a Keras model for a regression task.
  """
  ###############################################################################
  # TODO: your code starts here








  # TODO: your code ends here
  ###############################################################################
  return model

> ### **Practical Tip: Diagnosing Over-fitting and Under-fitting & How Hyperparameters Affect Training**  
>
> When you train a model, you'll monitor two curves over epochs: **training loss** (and metrics) vs. **validation loss** (and metrics). Their shapes tell you if your model is learning well, under-fitting, or over-fitting. Here's how to adjust common hyperparameters and what you'll observe in those curves:

> #### 1. Epochs  

 - **Too few** (under-fitting):  
   - *What you'll see:* Both training & validation loss stay high and flat.  
   - *Fix:* Increase epochs so the model has more opportunity to learn.  
 - **Too many** (over-fitting):  
   - *What you'll see:* Training loss continues to drop, but validation loss bottoms out then rises.  
   - *Fix:* Lower the number of epochs or add regularization (e.g. Dropout, weight decay).

> #### 2. Batch Size  

 - **Smaller** (e.g. 16 → 8):  
   - *Effect:* Noisier gradient estimates → more “wiggle” in the loss curves. Can help escape local minima but may slow convergence.  
   - *What you'll see:* Loss curves jump up-and-down but might find a better final minimum.  
 - **Larger** (e.g. 64 → 128):  
   - *Effect:* Smoother, more stable training but potentially gets stuck in sharp minima and over-fits.  
   - *What you'll see:* Very smooth, monotonic decrease in training loss; if validation loss rises, you may be over-fitting.

> #### 3. Learning Rate (via Optimizer)  

 - **Higher LR**:  
   - *Effect:* Bigger steps → faster initial decrease but risk of divergence or oscillation.  
   - *What you'll see:* Loss may bounce up and down wildly or even increase.  
 - **Lower LR**:  
   - *Effect:* Smaller steps → more stable but slower convergence; may get stuck if too low.  
   - *What you'll see:* Gradual, steady decline; risk of plateauing early.
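
The effect of the learning rate can be illustrated without any neural network. The sketch below runs plain gradient descent (not Adam, which this lab uses) on the toy function $f(w) = w^2$; the function name and the three learning rates are chosen purely for illustration.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimise f(w) = w**2 with plain gradient descent; return the loss after each step."""
    w = w0
    losses = []
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # gradient step
        losses.append(w ** 2)
    return losses


slow = gradient_descent(lr=0.001)     # tiny steps: loss barely moves
good = gradient_descent(lr=0.1)       # reasonable steps: loss shrinks quickly
diverging = gradient_descent(lr=1.1)  # steps overshoot: loss grows every step

for name, losses in [("lr=0.001", slow), ("lr=0.1", good), ("lr=1.1", diverging)]:
    print(f"{name}: first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

With `lr=1.1` each update multiplies `w` by `1 - 2 * 1.1 = -1.2`, so the iterate flips sign and grows, which is the "bouncing and increasing" pattern described above.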

> #### 4. Regularization  

 - **Dropout**:  
   - *Effect:* Randomly “drops” neurons each batch to prevent co-adaptation.  
   - *What you'll see:* Training loss will be higher (noisier), but validation loss should flatten rather than rise steeply if over-fitting was a problem.  
 - **Weight Decay (L2 Regularization)**:  
   - *Effect:* Penalizes large weights → smoother function.  
   - *What you'll see:* Similar to Dropout: less gap between training & validation loss.

> #### 5. Model Capacity  

 - **Too small** (under-fitting):  
   - *Fix:* Add more layers or neurons.  
   - *What you'll see:* Both curves high, little improvement across epochs.  
 - **Too large** (over-fitting):  
   - *Fix:* Reduce layers or neurons; add regularization.  
   - *What you'll see:* Training loss very low, validation loss rising after a point.

> #### 6. Putting It All Together  

 1. **Plot** `history.history['loss']` vs. `history.history['val_loss']`.  
 2. **Assess**:  
    - **Under-fit** (flat & high): increase capacity, epochs, or LR.  
    - **Over-fit** (train ↓, val ↑): add Dropout/L2, reduce capacity or epochs, or try a smaller batch size.  
    - **Plateau** (both flat): adjust learning rate or try a different optimizer.  
 3. **Iterate**: Change one hyperparameter at a time to see its specific effect.
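
As a rough aid for the "Assess" step, the sketch below summarises a pair of loss curves numerically. The helper `summarise_curves` and the example curves are hypothetical; it is a simple heuristic, not a formal test for over-fitting.

```python
def summarise_curves(train_loss, val_loss):
    """Return the best epoch (1-based) and how the curves look at the end of training."""
    best_idx = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": best_idx + 1,
        "best_val_loss": val_loss[best_idx],
        # > 0 means validation loss has risen since its minimum (over-fitting signal)
        "val_rise_since_best": val_loss[-1] - val_loss[best_idx],
        # large positive gap: the model fits the training data much better than unseen data
        "final_gap": val_loss[-1] - train_loss[-1],
    }


# Hypothetical curves: training loss keeps falling, validation loss turns upward.
train_loss = [1.00, 0.60, 0.40, 0.30, 0.22, 0.15]
val_loss = [1.05, 0.70, 0.55, 0.50, 0.58, 0.66]

summary = summarise_curves(train_loss, val_loss)
print(summary["best_epoch"])  # 4
print(summary)
```

With a real Keras run you could pass `history.history["loss"]` and `history.history["val_loss"]`, which are plain Python lists with one entry per epoch.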


### Train the model

In [None]:
from tensorflow.keras.callbacks import ModelCheckpoint

if __name__ == "__main__":


    X_train_std, X_test_std, y_train_std, y_test_std = preprocess(
    X_train, X_test, y_train, y_test
    )

    model = compile_model(create_model(input_dim=X_train_std.shape[1]))

    # Save only the best weights
    checkpoint_cb = ModelCheckpoint(
        filepath="best_weights.weights.h5",
        monitor="val_loss",
        save_best_only=True,
        save_weights_only=True,
        mode="min",
        verbose=1
    )

    ################################################################
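    # TODO: Feel free to adjust the batch size, number of epochs, etc. to achieve better results,
    # as long as you don't make them too large: computation power is limited on Colab.
    # Keep the object returned by model.fit(...) in `history` (it is plotted in the next cell)
    # and pass `checkpoint_cb` as a callback so the best weights are saved.
    # Remember that your model must have <= 20,000 parameters.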

   

    # TODO: your code ends here
    ###############################################################################

    # Load the best weights & Save the model
    model.load_weights("best_weights.weights.h5")
    model.save("lab6_model.keras")


In [None]:
# Visualize the training process

if __name__ == "__main__":
    hist   = pd.DataFrame(history.history)
    epochs = hist.index + 1

    for key, ylabel in [("loss", "Mean-Squared Error"),
                        ("mae",  "Mean Absolute Error")]:
        plt.figure(figsize=(8,4))
        plt.plot(epochs, hist[key],        label="Train")
        plt.plot(epochs, hist[f"val_{key}"],label="Val")
        if key == "loss": plt.yscale("log")
        plt.xlabel("Epoch"); plt.ylabel(ylabel)
        plt.legend(); plt.tight_layout(); plt.show()

## **Task 4 · Evaluate & Save**

### **Evaluation & Visualisation**

* **MSE (Mean Squared Error)**  
  Measures the average of the squared differences between predictions and true values, heavily penalizing larger errors to focus the model on big mistakes.  
  $$ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

* **RMSE (Root Mean Squared Error)**  
  The square root of MSE, bringing the error back to the same units as the target for more intuitive interpretation while still emphasizing larger deviations.  
  $$ \mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} $$

* **MAE (Mean Absolute Error)**  
  Computes the average absolute difference between predictions and true values, offering a straightforward “per‐unit” error metric that is less sensitive to outliers than MSE.  
  $$ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert $$

* **R² (Coefficient of Determination)**  
  Reflects the proportion of variance in the true values that the model explains (1 = perfect fit, values <0 indicate worse than guessing the mean).  
  $$ R^2 = 1 - \frac{\displaystyle\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\displaystyle\sum_{i=1}^{n} (y_i - \bar{y})^2} $$
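
The four formulas above translate directly into NumPy. The sketch below uses a hypothetical helper `regression_metrics` and made-up toy values so the results can be checked by hand; in the lab itself the same quantities are obtained from `sklearn.metrics`.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R² directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(residuals))
    ss_res = np.sum(residuals ** 2)                 # numerator of the R² fraction
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # denominator of the R² fraction
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}


# Toy grades (hypothetical): residuals are -1, 0, 1, -2.
y_true_toy = [10.0, 12.0, 14.0, 16.0]
y_pred_toy = [11.0, 12.0, 13.0, 18.0]

metrics = regression_metrics(y_true_toy, y_pred_toy)
print(metrics)  # mse = 1.5, rmse ≈ 1.2247, mae = 1.0, r2 ≈ 0.7
```

Note how the single residual of size 2 contributes 4 of the 6 units of squared error, but only 2 of the 4 units of absolute error, which is why MSE and RMSE react more strongly to large mistakes than MAE.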

Let's compute these on the test split, after converting the standardised targets and predictions back to the original grade scale, and plot the predictions.  



In [None]:
if __name__ == "__main__":
      model = keras.models.load_model("lab6_model.keras")
      y_pred_std = model.predict(X_test_std, verbose=0).flatten()

      y_mean = y_train.mean()
      y_std  = y_train.std(ddof=0)
      y_true = y_test_std * y_std + y_mean
      y_pred = y_pred_std * y_std + y_mean

      mse  = mean_squared_error(y_true, y_pred)
      rmse = np.sqrt(mse)
      mae  = mean_absolute_error(y_true, y_pred)
      r2   = r2_score(y_true, y_pred)
      print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}  R²={r2:.4f}")

      min_val = min(y_true.min(), y_pred.min())
      max_val = max(y_true.max(), y_pred.max())

      plt.figure(figsize=(6, 6))
      plt.scatter(y_true, y_pred, s=20, alpha=0.55)
      plt.plot([min_val, max_val],
               [min_val, max_val],
               color='purple',
               linestyle=':',
               linewidth=1.5,
               alpha=0.75,
               label='Ideal: $y = x$')
      plt.xlabel("Actual final_grade")
      plt.ylabel("Predicted final_grade")
      plt.title(f"Actual vs Predicted  (R²={r2:.3f})")
      plt.legend()  # show the label of the ideal y = x line
      plt.tight_layout()
      plt.show()


## **Grading Scheme**

Please export your notebook from Colab as `lab6_tasks.py` (File -> Download -> Download .py), download your saved model file `lab6_model.keras`, compress both into `lab6_tasks.zip`, and submit it.


* You get **2 points** for data preprocessing (task 1)
* You get **2 points** for a valid implementation of the MLP model with at most 20,000 parameters (task 2)
* You get **2 points** for model compilation (task 3)
* You get **2 points** for achieving an RMSE score of <= 2.4 on the given test case
* You get **2 points** for achieving an RMSE score of <= 2.4 on the hidden test case

**IMPORTANT**: If your model has more than 20,000 parameters, test cases 4 and 5 will automatically receive **0 points** because of the limited computation power of ZINC.