[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/UNSW-COMP9414/Assignment2/blob/main/COMP9414-Assignment2.ipynb)

# COMP9414 24T3 - Assignment 2 - Neural Networks, Decision Trees and Random Forests

## UNSW Sydney

Designed by Gustavo Batista.

Last change: 20th October, 2024.

Boyang He - z5575322


## Instructions

**Submission deadline:** Friday, 8th November 2024, at 17:00:00 AEDT.

**Submission:** You can submit your solution via the give system using the command ``give ass2 ass2.ipynb``.

**Instructions:**
* This is an **individual** assignment.
* Write your name and zID on the top of this Jupyter Notebook.
* You can only use the libraries listed in this notebook
* You can reuse any piece of source code developed in the tutorials.
* Do not modify the existing code in this notebook except to answer the questions. The cells that should be modified are indicated.
* If you want to submit additional code (e.g., for generating plots), write it at the end of the notebook.
* This notebook is worth **75** marks and will be rescaled to **25** marks.

**Late Submission Policy:** The penalty is a 5% reduction of the assignment value (i.e. 1.25 marks) in the mark per day. For example, if an assignment gets an on-time mark of $20/25$ and is submitted three days late, it will receive a mark reduction of $3*1.25 = 3.75$, so the assignment will get $16.25$ after three days. After five days, the assignment will receive a mark reduction of $100\%$.


**Plagiarism:**

Remember that ALL work submitted for this assignment must be your own work, and no sharing or copying of code or answers is allowed. You may discuss the assignment with other students but must not collaborate to develop answers to the questions. You may use code from the Internet only with suitable attribution of the source. You may not use ChatGPT or any similar software to generate any part of your explanations, evaluations or code. Do not use public code repositories on sites such as GitHub or file-sharing sites such as Google Drive to save any part of your work &ndash; make sure your code repository or cloud storage is private, and do not share any links. This also applies after you have finished the course, as we do not want next year’s students accessing your solution, and plagiarism penalties can still apply after the course has finished.

All submitted assignments will be run through plagiarism detection software to detect similarities to other submissions, including from past years. You should **carefully** read the UNSW policy on academic integrity and plagiarism (linked from the course web page), noting, in particular, that collusion (working together on an assignment or sharing parts of assignment solutions) is a form of plagiarism.

Finally, do not use any contract cheating “academies” or online “tutoring” services. This counts as serious misconduct with heavy penalties up to automatic failure of the course with 0 marks and expulsion from the university for repeat offenders.

## Technical prerequisites

These are the libraries you are allowed to use. No other libraries will be accepted. Make sure you are using Python 3.

In [25]:
!pip install keras-tuner
!pip install tensorflow



In [1]:
# These are the allowed libraries. You can add other libraries used in the tutorials.

# Common Python libraries
import math
import copy
import requests
import zipfile
import os
import time
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib as mp
import matplotlib.pyplot as plt
from collections import defaultdict

# Scikit-Learn libraries for data preprocessing and model assessment
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Libraries for the tree models
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Scikit-learn libraries for hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV

# Tensorflow/keras libraries for shallow and deep-learning models
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Flatten, Dropout
from tensorflow.keras.optimizers import Adam, SGD, RMSprop

# Keras Tuner libraries for hyperparameter tuning
from keras_tuner import HyperModel
from keras_tuner.tuners import RandomSearch

# Libraries to present results in tabular format
from tabulate import tabulate

Collecting keras-tuner
  Downloading keras_tuner-1.4.7-py3-none-any.whl.metadata (5.4 kB)
Collecting kt-legacy (from keras-tuner)
  Downloading kt_legacy-1.0.5-py3-none-any.whl.metadata (221 bytes)
Downloading keras_tuner-1.4.7-py3-none-any.whl (129 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m129.1/129.1 kB[0m [31m8.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading kt_legacy-1.0.5-py3-none-any.whl (9.6 kB)
Installing collected packages: kt-legacy, keras-tuner
Successfully installed keras-tuner-1.4.7 kt-legacy-1.0.5


This assignment compares three Machine Learning approaches: Neural Networks, Decision Trees, and Random Forests. We will assess these approaches in five benchmark datasets with diverse characteristics.

We would like to test a few hypotheses based on common Machine Learning wisdom and misconceptions.

1. Neural networks are the best general classifiers regarding prediction quality (accuracy, error rate, precision, recall, etc.).
2. Neural networks are time-consuming for training as fitting model parameters is slow and has many hyperparameters.
3. Random forests are an excellent compromise between classification performance and hyperparameter tuning. They can often provide competitive accuracy without requiring much hyperparameter tuning.
4. Neural networks are data-hungry and perform poorly in small datasets.
5. Decision trees offer model interpretability but are not competitive in accuracy.
6. Neural networks are the best models when learning from unstructured data such as images.
7. Random forests are the best models when learning from structured data such as a tabular dataset.

## Task 0 - Datasets description, downloading and loading the data into a Pandas dataframe

We have selected five publicly available benchmark datasets:

1. **UCI adult income dataset.** This is a binary classification dataset in which we want to predict if a person earns more than $50k/year. It is a mid-size dataset (48K examples) with 14 features of mixed data types (categorical and continuous) with missing values.

2. **Forest cover type dataset.** This is a large multi-class dataset with 580K examples and 54 features of mixed types. The objective is to predict the type of forest cover based on features such as soil type, elevation, and slope.

3. **California housing prices**. This is a regression problem in which we want to predict housing prices based on numerical features, such as population, median income and location. It has 20K instances and nine features.

4. **Fashion MNIST dataset**. It is an image classification dataset that is very similar to MNIST. Images are $28 \times 28$ grayscale pixels. The objective is to classify ten different types of clothing. It has 60k training and 10K test instances.

5. **Credit card fraud detection**. This is a binary classification dataset for detecting fraudulent transactions in credit card data. It is highly imbalanced, meaning that most transactions are normal, with some rare fraud cases. It has 284K instances and 30 numerical features.

This table summarises the datasets.

| Dataset                          | Problem Type        | Feature Type                          | Size        | Notable Challenge                                    |
|-----------------------------------|---------------------|---------------------------------------|-------------|------------------------------------------------------|
| **UCI Adult Income**              | Binary Classification| Categorical and Numerical             | 48,000      | Mix of feature types with missing values           |
| **Forest Cover Type**             | Multi-class Classification | Categorical and Numerical       | 580,000     | Large dataset with mix of feature types      |
| **California Housing Prices**     | Regression          | Numerical                              | 20,000      | Regression task      |
| **Fashion MNIST**                 | Multi-class Classification (Image)| Image (grayscale)        | 60,000      | Weak features in the form of individual pixels brightness       |
| **Credit Card Fraud Detection**   | Binary Classification (Imbalanced)| Numerical                | 284,000     | Highly imbalanced dataset        |

Let's start by downloading the data from GitHub. The cell below will download and save the data into a local ``data`` folder. We will use the data later to train and assess our models.

In [2]:
# Do not change the code in this cell.
# This cell has no code to write. It downloads and unzips the datasets to your local disk.

def download_and_extract(url, extract_to):
    """
    Download a zip file from the URL and extract it to the specified directory.

    Parameters:
    - url (str): The URL of the zip file to download.
    - extract_to (str): The directory where the zip file's contents will be extracted.

    Returns:
    None
    """
    # Get the file, dataset names from the URL
    zip_filename = url.split("/")[-1]
    dataset_name = zip_filename.split(".")[0]

    # Each dataset will have its folder
    extract_to = extract_to + "/" + dataset_name

    # Download the zip file
    print(f"Downloading {zip_filename} from {url}...")
    response = requests.get(url)
    with open(zip_filename, "wb") as file:
        file.write(response.content)

    # Create the extraction directory if it doesn't exist
    if not os.path.exists(extract_to):
        os.makedirs(extract_to)

    # Unzip the file
    with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
        zip_ref.extractall(extract_to)

    # Remove the zip files
    os.remove(zip_filename)

# These are the URLs to the datasets. We have hosted the data on GitHub.
urls = [
    "https://raw.githubusercontent.com/UNSW-COMP9414/Assignment2/main/data/adult/adult.zip",
    "https://raw.githubusercontent.com/UNSW-COMP9414/Assignment2/main/data/covertype/covertype.zip",
    "https://raw.githubusercontent.com/UNSW-COMP9414/Assignment2/main/data/california_housing/california_housing.zip",
    "https://raw.githubusercontent.com/UNSW-COMP9414/Assignment2/main/data/creditcard/creditcard.zip",
    "https://raw.githubusercontent.com/UNSW-COMP9414/Assignment2/main/data/fashion_mnist/fashion_mnist.zip",
]

for i, url in enumerate(urls):
    download_and_extract(url, "data")

Downloading adult.zip from https://raw.githubusercontent.com/UNSW-COMP9414/Assignment2/main/data/adult/adult.zip...
Downloading covertype.zip from https://raw.githubusercontent.com/UNSW-COMP9414/Assignment2/main/data/covertype/covertype.zip...
Downloading california_housing.zip from https://raw.githubusercontent.com/UNSW-COMP9414/Assignment2/main/data/california_housing/california_housing.zip...
Downloading creditcard.zip from https://raw.githubusercontent.com/UNSW-COMP9414/Assignment2/main/data/creditcard/creditcard.zip...
Downloading fashion_mnist.zip from https://raw.githubusercontent.com/UNSW-COMP9414/Assignment2/main/data/fashion_mnist/fashion_mnist.zip...


### Loading data into pandas

The datasets are well-diversified in size (number of examples and features), number of class labels, feature types (continuous and discrete), class distribution, and presence of missing data.

All datasets have pre-defined training and testing splits. We will use the training set to train the models and choose hyperparameters. You may further split the training set into training and validation sets. The test set should only be used to assess and compare the models.

The next cell has a supporting function that loads a specified dataset training and test sets into a pandas' dataframe.

In [3]:
# Do not change the code in this cell.
# This cell has no code to write. It is a helper function that loads data from this into a Pandas dataframe.

def load_train_test_data(path):
    """
    Loads the train and test CSV files and returns them split into features (X) and labels (y).

    Parameters:
    - path (str): Path to the train and test CSV files.

    Returns:
    - X_train (DataFrame): Features of the training dataset.
    - y_train (DataFrame): Labels of the training dataset.
    - X_test (DataFrame): Features of the test dataset.
    - y_test (DataFrame): Labels of the test dataset.
    """
    # Load the training and testing data
    train_df = pd.read_csv(f"{path}/train.csv")
    test_df = pd.read_csv(f"{path}/test.csv")

    # Select class label columns (those starting with 'Target')
    y_train = train_df.filter(regex='^Target')
    y_test = test_df.filter(regex='^Target')

    # Select feature columns (all columns except the ones with 'Target' prefix)
    X_train = train_df.drop(columns=y_train.columns)
    X_test = test_df.drop(columns=y_test.columns)

    return X_train, y_train, X_test, y_test

# Example usage:
path = "data/adult"
print(f"Loading {path}...")
X_train, y_train, X_test, y_test = load_train_test_data(path)

print(f"Train Features Shape: {X_train.shape}, Train Labels Shape: {y_train.shape}")
print(f"Test Features Shape: {X_test.shape}, Test Labels Shape: {y_test.shape}")




Loading data/adult...
Train Features Shape: (32561, 14), Train Labels Shape: (32561, 1)
Test Features Shape: (16281, 14), Test Labels Shape: (16281, 1)


## Task 1 [14 Marks] - Data preprocessing

Your first task is to preprocess the datasets. Preprocessing usually involves data cleaning and transformation to improve data quality and prepare the data for the specific requirements of the learning approaches.

For data preparation, we have the following tasks:
1. **Missing imputation (all models)**: The adult dataset has missing values, and none of our learning algorithm implementations can directly handle missing data. Two missing data treatments are eliminating the rows with missing data or replacing the missing values with estimated ones. *Mean imputation*, as the name suggests, replaces missing values with the attribute mean, median (continuous features) or mode (discrete features). These statistics must only be estimated in the training set.
2. **Feature encoding (all models)**: Neural networks, tree and random forest implementations available in the Scikit-Learn library do not handle categorical attributes directly. Therefore, these attributes need to be converted into numerical attributes. Although several encoding approaches exist, we will use one-hot encoding, as it is simple and recommended for categorical features with a small cardinality.
3. **Class attribute encoding (neural networks only)**: Neural networks also need a one-hot encoding for the class attribute. This step is not necessary for the tree models.
4. **Rescaling attribute values (neural networks only)**: The neural network's training benefits from rescaling the attribute values. In this task, we will convert each attribute to a number in the 0 to 1 range by using a simple linear rescaling: $x_s = \frac{x-min_f}{max_f-min_f}$, where $x_s$ is the recalled $x$ value, $min_f$ is the minimum and $max_f$ the maximum values for feature $f$ in the training data.

Tree models typically do not use class encoding and rescaling. The reason is twofold: first, this preprocessing does not help these models fit better parameters; second, tree models are known for their interpretability, and these manipulations create models that are not easier and often harder to understand. Feature encoding is also unnecessary for many tree model implementations, including the well-known [XGBoost](https://xgboost.readthedocs.io/en/stable/) and [LightGBM](https://lightgbm.readthedocs.io/en/stable/). Unfortunately, the Scikit-Learn implementation of tree models does not support categorical attributes.

**Warning**: Leaking information from the test set to the training set, even if such information is aggregated data such as means, maximums, and minimums, is considered a serious methodological error. For instance, mean imputation should use the mean only in the training set. Similarly, the maximum and minimum for attribute rescaling should be calculated in the training set. Consequently, we may see values outside the range of 0-1 in the rescaled test set. This mimics the situation in which we find extreme values after the model deployment.

### Task 1.1 [6 Marks] - Missing data removal or imputation

Create a function ``missing_data(X_train, X_test)`` that imputes missing values in the dataframes `X_train` and `X_test`. When the function returns, both dataframes should have no missing values.

In [4]:
def missing_data(X_train, X_test):
    """
    Impute missing values in the train and test DataFrames using median/mode imputation.
    Missing data statistics are only estimated on the training set and applied to the training and test sets.
    Pro-tip: you can use Scikit-Learn's SimpleImputer.

    Parameters:
    - X_train (DataFrame): Training features.
    - X_test (DataFrame): Test features.

    Returns:
    - X_train_filled (DataFrame): Training features with no missing values.
    - X_test_filled (DataFrame): Test features with no missing values.
    """
    # find out attributes which have missing values
    missing_train_attributes = X_train.columns[X_train.isna().any()].tolist()
    missing_test_attributes = X_test.columns[X_test.isna().any()].tolist()

    X_train_filled = X_train.copy()
    for mi_train in missing_train_attributes:
      if X_train[mi_train].dtype == 'int64' or X_train[mi_train].dtype == 'float64':
        median_train = X_train[mi_train].median()
        X_train_filled[mi_train] = X_train[mi_train].fillna(median_train)
      elif X_train[mi_train].dtype == 'object':
        mode_train = X_train[mi_train].mode()[0]
        X_train_filled[mi_train] =X_train[mi_train].fillna(mode_train)

    X_test_filled = X_test.copy()
    for mi_test in missing_test_attributes:
      if X_test[mi_test].dtype == 'int64' or X_test[mi_test].dtype == 'float64':
        median_test = X_test[mi_test].median()
        X_test_filled[mi_test]=X_test[mi_test].fillna(median_test)
      elif X_test[mi_test].dtype == 'object':
        mode_test = X_test[mi_test].mode()[0]
        X_test_filled[mi_test]=X_test[mi_test].fillna(mode_test)

    return X_train_filled, X_test_filled

# Example usage:
path = "data/adult"
print(f"Loading {path}...")
X_train, y_train, X_test, y_test = load_train_test_data(path)
X_train_filled, X_test_filled = missing_data(X_train, X_test)

# compare the origin with the cleanup
print(f"after_cleanup_train：\n{X_train_filled.isna().sum()}")
print(f"\nafter_cleanup_test：\n{X_test_filled.isna().sum()}")

print(f"\norigin_train：\n{X_train.isna().sum()}")
print(f"\norigin_test：\n{X_test.isna().sum()}")



Loading data/adult...
after_cleanup_train：
Age               0
Workclass         0
Fnlwgt            0
Education         0
Education-num     0
Marital-status    0
Occupation        0
Relationship      0
Race              0
Sex               0
Capital-gain      0
Capital-loss      0
Hours-per-week    0
Native-country    0
dtype: int64

after_cleanup_test：
Age               0
Workclass         0
Fnlwgt            0
Education         0
Education-num     0
Marital-status    0
Occupation        0
Relationship      0
Race              0
Sex               0
Capital-gain      0
Capital-loss      0
Hours-per-week    0
Native-country    0
dtype: int64

origin_train：
Age                  0
Workclass         1836
Fnlwgt               0
Education            0
Education-num        0
Marital-status       0
Occupation        1843
Relationship         0
Race                 0
Sex                  0
Capital-gain         0
Capital-loss         0
Hours-per-week       0
Native-country     583
dtype: int64


### Task 1.2 [4 Marks] - Feature and class encoding

Let's implement a function ``encoding(X_train, X_test)`` that creates one-hot encodings for all categorical attributes. All categorical attributes are encoded as one-hot numeric features when the function returns.

In [5]:
# This cell will be assessed. Replace the ... with your code

def encoding(X_train, X_test):
    """
    Encodes categorical features and class labels into one-hot numeric features.
    Ensure that you have a consistent encoding across training and test sets.
    Pro-tip: use Panda's get_dummies.

    Parameters:
    - X_train (DataFrame): Training features.
    - X_test (DataFrame): Test features.

    Returns:
    - X_train_encoded (DataFrame): One-hot encoded training features.
    - X_test_encoded (DataFrame): One-hot encoded test features.
    """

    X_train_encoded = pd.get_dummies(X_train,drop_first=True).astype(int)
    X_test_encoded = pd.get_dummies(X_test,drop_first=True).astype(int)

    # make sure that the two charts are equal size and are aligned
    X_train_encoded, X_test_encoded = X_train_encoded.align(X_test_encoded, join='left', axis=1, fill_value=0)

    return X_train_encoded, X_test_encoded


'''
path = "data/adult"
print(f"Loading {path}...")
X_train, y_train, X_test, y_test = load_train_test_data(path)
X_train_filled, X_test_filled = missing_data(X_train, X_test)

X_train_encoded, X_test_encoded = encoding(X_train_filled, X_test_filled)

print(f"after_encoding_train：\n{X_train_encoded.head()}")
print(f"\nafter_encoding_test：\n{X_test_encoded.head()}")
'''







'\npath = "data/adult"\nprint(f"Loading {path}...")\nX_train, y_train, X_test, y_test = load_train_test_data(path)\nX_train_filled, X_test_filled = missing_data(X_train, X_test)\n\nX_train_encoded, X_test_encoded = encoding(X_train_filled, X_test_filled)\n\nprint(f"after_encoding_train：\n{X_train_encoded.head()}")\nprint(f"\nafter_encoding_test：\n{X_test_encoded.head()}")\n'

#### Task 1.3 [4 Marks] - Rescaling attributes

To conclude the pre-processing task, let's create a function ``rescale(X_train, X_test)`` that rescales all continuous attributes so that each attribute is between 0 and 1. When the function returns, all numerical attributes should be rescaled.

In [6]:
# This cell will be assessed. Replace the ... with your code

def rescale(X_train, X_test):
    """
    Rescales all continuous attributes in the train and test datasets to be in the range [0, 1].
    Rescaling statistics should only be estimated on the training set and applied to the training and test sets.
    Pro-tip: use MinMaxScaler.

    Parameters:
    - X_train (DataFrame): Training features.
    - X_test (DataFrame): Test features.

    Returns:
    - X_train_rescaled (DataFrame): Rescaled training features.
    - X_test_rescaled (DataFrame): Rescaled test features.
    """
    # select all the continuous attributes
    numerical_cols = X_train.select_dtypes(include=['float64', 'int64']).columns

    mMscaler = MinMaxScaler()
    X_train_rescaled = X_train.copy()
    X_test_rescaled = X_test.copy()

    X_train_rescaled[numerical_cols] = mMscaler.fit_transform(X_train[numerical_cols])
    X_test_rescaled[numerical_cols] = mMscaler.transform(X_test[numerical_cols])

    return X_train_rescaled, X_test_rescaled

'''
path = "data/adult"
print(f"Loading {path}...")
X_train, y_train, X_test, y_test = load_train_test_data(path)
X_train_filled, X_test_filled = missing_data(X_train, X_test)


X_train_rescaled, X_test_rescaled = rescale(X_train_filled, X_test_filled)
print(f"after_rescale_train：\n{X_train_rescaled.head()}")
print(f"\nafter_rescale_test：\n{X_test_rescaled.head()}")
'''

'\npath = "data/adult"\nprint(f"Loading {path}...")\nX_train, y_train, X_test, y_test = load_train_test_data(path)\nX_train_filled, X_test_filled = missing_data(X_train, X_test)\n\n\nX_train_rescaled, X_test_rescaled = rescale(X_train_filled, X_test_filled)\nprint(f"after_rescale_train：\n{X_train_rescaled.head()}")\nprint(f"\nafter_rescale_test：\n{X_test_rescaled.head()}")\n'

### Preprocessing the datasets

In the cell below, we will call your functions to preprocess the datasets. We will create two versions of each dataset: the first is suitable for the tree models and will have no missing values and encoded attributes. The second will have no missing values, encoded categorical and class features, and numeric features rescaled. We will save these datasets for use later. The datasets pre-processed for trees will be saved in a ``tree`` folder. The datasets for neural networks will be saved in a ``nn`` folder.

In [7]:
# Do not change the code in this cell.
# This cell has no code to write. It calls your pre-processing functions and saves the preprocessed datasets on disk.

datasets = ["adult", "covertype", "california_housing", "fashion_mnist", "creditcard"]

for dataset in datasets:
    print(f"Processing dataset: {dataset}")

    # Load the train and test data
    X_train, y_train, X_test, y_test = load_train_test_data("data/" + dataset)

    # Preprocessing for tree-based models
    tree_path = f"data/{dataset}/tree"
    if not os.path.exists(tree_path):
        os.makedirs(tree_path)

    # Handle missing data
    X_train, X_test = missing_data(X_train, X_test)

    # Apply encoding features
    X_train_encoded, X_test_encoded = encoding(X_train, X_test)

    # Concatenate X and y for train and test data.
    # For decision trees, we do not encode the class attribute
    train_tree = pd.concat([X_train_encoded, y_train.reset_index(drop=True)], axis=1)
    test_tree = pd.concat([X_test_encoded, y_test.reset_index(drop=True)], axis=1)

    # Save tree-preprocessed datasets
    train_tree.to_csv(f"{tree_path}/train.csv", index=False)
    test_tree.to_csv(f"{tree_path}/test.csv", index=False)

    # Preprocessing for neural networks
    nn_path = f"data/{dataset}/nn"
    if not os.path.exists(nn_path):
        os.makedirs(nn_path)

    # Apply encoding class attribute. For a regression dataset, the next line should do nothing
    y_train_encoded, y_test_encoded = encoding(y_train, y_test)

    # Rescale the features
    X_train_rescaled, X_test_rescaled = rescale(X_train_encoded, X_test_encoded)

    # Concatenate X and y for train and test data
    train_nn = pd.concat([X_train_rescaled, y_train_encoded.reset_index(drop=True)], axis=1)
    test_nn = pd.concat([X_test_rescaled, y_test_encoded.reset_index(drop=True)], axis=1)

    # Save nn-preprocessed datasets
    train_nn.to_csv(f"{nn_path}/train.csv", index=False)
    test_nn.to_csv(f"{nn_path}/test.csv", index=False)

Processing dataset: adult
Processing dataset: covertype
Processing dataset: california_housing
Processing dataset: fashion_mnist
Processing dataset: creditcard


## Task 2 - [16 Marks] Model Training

We have the data ready, and in this task, we will train some initial models for each dataset. We will refine the models later, but for now, we will create a swallow model for the neural network. The decision tree and the random forest models will use Scikit-Learn's default hyperparameter values for these models.

The neural network will have three layers: the input layer ($i$), one hidden layer ($h$) and one output layer ($o$). We will use a simple rule-of-thumb for the number of units in the hidden layer: $D_h = \sqrt{D_i * D_o}$. The other hyperparameters are similar to the ones used in the Week 07 tutorial.

### Task 2.1 [4 Marks] - Shallow neural net for classification

Create a function ``train_shallow_net_class(X_train, y_train)`` that trains a shallow neural net for classification using the training data ``X_train`` and labels ``y_train``. Use the following hyperparameters:
1. A single hidden layer with $D_h = \text{round}(\sqrt{D_i * D_o})$ units.
2. ReLU activation in the hidden layer and softmax on the output layer.
3. Categorical cross-entropy as loss function.
4. Train for 30 epochs.
5. Batch size of 32 instances.
6. Validation split of 20% of the training data.
7. Adam optimiser.

In [8]:
# This cell will be assessed. Replace the ... with your code

def train_shallow_net_class(X_train, y_train):
    """
    Trains a shallow neural net for classification problems with one hidden layer using the training data.

    Parameters:
    - X_train (DataFrame): Training features.
    - y_train (DataFrame): Training labels (one-hot encoded for classification).
    - Dh (int): Number of units in the hidden layer.

    Returns:
    - model: Trained Keras neural network model.
    """

    Di = X_train.shape[1]  # Number of input features
    Do = y_train.shape[1]  # Number of output classes

    Dh = round(math.sqrt(Di * Do))


    model = Sequential([
    Input(shape=(Di,)),                  # the input layer
    Dense(Dh, activation='relu'),           # Hidden layer with Dh neurons and ReLU activation
    Dense(Do, activation='softmax')         # Output layer with Do neurons (one for each digit)
    ])

    model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])


    model_trained = model.fit(X_train, y_train, epochs=30, validation_split=0.2, batch_size=32)


    return model

The next cell will call your function to train a shallow model for each classification dataset, compute the training time, and test error.

In [9]:
# Do not change the code in this cell.
# This cell has no code to write. It trains and assesses your shallow model using the classification datasets.

results = defaultdict(dict)
datasets = ["adult", "covertype", "fashion_mnist", "creditcard"]

for dataset in datasets:
    print(f"Processing dataset: {dataset}")

    # Load the train and test data
    X_train, y_train, X_test, y_test = load_train_test_data(f"data/{dataset}/nn/")

    start = time.time()
    model = train_shallow_net_class(X_train, y_train)
    end = time.time()

    test_loss, test_accuracy = model.evaluate(X_test, y_test)
    print(f'Test error rate: {1 - test_accuracy:.4f}')
    print(f'Runtime to train the model: {end-start} seconds')

    results["Neural Net"].setdefault(dataset, {})
    results["Neural Net"][dataset]["Prediction quality"] =  1 - test_accuracy         # Error rate = 1 - accuracy
    results["Neural Net"][dataset]["Training time"] = end-start

Processing dataset: adult
Epoch 1/30


  return self.fn(y_true, y_pred, **self._fn_kwargs)


[1m808/814[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 1ms/step - accuracy: 0.2413 - loss: 0.0000e+00

  return self.fn(y_true, y_pred, **self._fn_kwargs)


[1m814/814[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 2ms/step - accuracy: 0.2413 - loss: 0.0000e+00 - val_accuracy: 0.2457 - val_loss: 0.0000e+00
Epoch 2/30
[1m814/814[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - accuracy: 0.2444 - loss: 0.0000e+00 - val_accuracy: 0.2457 - val_loss: 0.0000e+00
Epoch 3/30
[1m814/814[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - accuracy: 0.2411 - loss: 0.0000e+00 - val_accuracy: 0.2457 - val_loss: 0.0000e+00
Epoch 4/30
[1m814/814[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - accuracy: 0.2370 - loss: 0.0000e+00 - val_accuracy: 0.2457 - val_loss: 0.0000e+00
Epoch 5/30
[1m814/814[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - accuracy: 0.2427 - loss: 0.0000e+00 - val_accuracy: 0.2457 - val_loss: 0.0000e+00
Epoch 6/30
[1m814/814[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/s

  return self.fn(y_true, y_pred, **self._fn_kwargs)


[1m4747/4747[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.9983 - loss: 0.0000e+00

  return self.fn(y_true, y_pred, **self._fn_kwargs)


[1m4747/4747[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 2ms/step - accuracy: 0.9983 - loss: 0.0000e+00 - val_accuracy: 0.9982 - val_loss: 0.0000e+00
Epoch 2/30
[1m4747/4747[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 1ms/step - accuracy: 0.9981 - loss: 0.0000e+00 - val_accuracy: 0.9982 - val_loss: 0.0000e+00
Epoch 3/30
[1m4747/4747[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 1ms/step - accuracy: 0.9983 - loss: 0.0000e+00 - val_accuracy: 0.9982 - val_loss: 0.0000e+00
Epoch 4/30
[1m4747/4747[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 1ms/step - accuracy: 0.9980 - loss: 0.0000e+00 - val_accuracy: 0.9982 - val_loss: 0.0000e+00
Epoch 5/30
[1m4747/4747[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 1ms/step - accuracy: 0.9983 - loss: 0.0000e+00 - val_accuracy: 0.9982 - val_loss: 0.0000e+00
Epoch 6/30
[1m4747/4747[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [

### Task 2.2 [4 Marks] - Shallow neural net for regression

Like the previous task, create a function ``train_shallow_net_regression(X_train, y_train)`` that trains a shallow neural net for regression using the training data ``X_train`` and labels ``y_train``. Use the following hyperparameters:
1. A single hidden layer with $D_h = \text{round}(\sqrt{D_i * D_o})$ units.
2. ReLU activation in the hidden layer and linear on the output layer.
3. MSE loss function.
4. Train for 30 epochs.
5. Batch size of 32 instances.
6. Validation split of 20% of the training data.
7. Adam optimiser.

In [10]:
# This cell will be assessed. Replace the ... with your code

def train_shallow_net_regression(X_train, y_train):
    """
    Trains a shallow neural net with one hidden layer using the training data.

    Parameters:
    - X_train (DataFrame): Training features.
    - y_train (DataFrame): Training labels (one-hot encoded for classification).
    - Dh (int): Number of units in the hidden layer.

    Returns:
    - model: Trained Keras neural network model.
    """

    Di = X_train.shape[1]  # Number of input features
    Do = 1  # Number of output classes

    Dh = round(math.sqrt(Di * Do))


    model = Sequential([
    Input(shape=(Di,)),                  # the input layer
    Dense(Dh, activation='relu'),           # Hidden layer with Dh neurons and ReLU activation
    Dense(Do, activation='softmax')         # Output layer with Do neurons (one for each digit)
    ])

    model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])


    model_trained = model.fit(X_train, y_train, epochs=30, validation_split=0.2, batch_size=32)


    return model


Once again, we will run your code for each regression dataset. This assignment has only one of such datasets, but we will keep a similar code we implemented before.

In [11]:
# Do not change the code in this cell.
# This cell has no code to write. It trains and assesses your shallow model using the regression dataset.

datasets = ["california_housing"]

for dataset in datasets:
    print(f"Processing dataset: {dataset}")

    # Load the train and test data
    X_train, y_train, X_test, y_test = load_train_test_data(f"data/{dataset}/nn/")

    start = time.time()
    model = train_shallow_net_regression(X_train, y_train)
    end = time.time()

    # Make predictions
    test_loss, test_mse = model.evaluate(X_test, y_test)
    print(f'Test MSE: {test_mse:.4f}')
    print(f'Runtime to train the model: {end-start} seconds')

    results["Neural Net"].setdefault(dataset, {})
    results["Neural Net"][dataset]["Prediction quality"] = test_mse
    results["Neural Net"][dataset]["Training time"] = end-start

Processing dataset: california_housing
Epoch 1/30


  return self.fn(y_true, y_pred, **self._fn_kwargs)


[1m310/344[0m [32m━━━━━━━━━━━━━━━━━━[0m[37m━━[0m [1m0s[0m 1ms/step - accuracy: 0.3998 - loss: 0.0000e+00

  return self.fn(y_true, y_pred, **self._fn_kwargs)


[1m344/344[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 3ms/step - accuracy: 0.4001 - loss: 0.0000e+00 - val_accuracy: 0.4023 - val_loss: 0.0000e+00
Epoch 2/30
[1m344/344[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - accuracy: 0.4005 - loss: 0.0000e+00 - val_accuracy: 0.4023 - val_loss: 0.0000e+00
Epoch 3/30
[1m344/344[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - accuracy: 0.4012 - loss: 0.0000e+00 - val_accuracy: 0.4023 - val_loss: 0.0000e+00
Epoch 4/30
[1m344/344[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - accuracy: 0.4065 - loss: 0.0000e+00 - val_accuracy: 0.4023 - val_loss: 0.0000e+00
Epoch 5/30
[1m344/344[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 2ms/step - accuracy: 0.4018 - loss: 0.0000e+00 - val_accuracy: 0.4023 - val_loss: 0.0000e+00
Epoch 6/30
[1m344/344[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/s

### Task 2.3 [2 Marks] - Decision tree models for classification

Implement a function ``train_classification_tree(X_train, y_train)`` that trains a decision tree model using the training data ``X_train`` and labels ``y_train``. This function should return a trained Scikit-Learn decision tree classifier.

In [12]:
# This cell will be assessed. Replace the ... with your code

def train_classification_tree(X_train, y_train):
    """
    Trains a Decision Tree for classification.

    Parameters:
    - X_train (DataFrame): Training features.
    - y_train (DataFrame): Training class labels.

    Returns:
    - model: Trained Decision Tree Classifier.
    """
    tree_model = DecisionTreeClassifier(random_state=42)
    tree_model.fit(X_train, y_train)
    return tree_model

The code below executes the tree models and records the test accuracy and training time.

In [13]:
# Do not change the code in this cell.
# This cell has no code to write. It trains and assesses your decision tree model using the classification datasets.

datasets = ["adult", "covertype", "fashion_mnist", "creditcard"]

for dataset in datasets:
    print(f"Processing dataset: {dataset}")

    # Load the train and test data
    X_train, y_train, X_test, y_test = load_train_test_data(f"data/{dataset}/tree/")

    print("Training decision tree model")

    start = time.time()
    model = train_classification_tree(X_train, y_train)
    end = time.time()

    # Make predictions
    y_pred = model.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_pred)

    print(f'Test error rate: {1 - test_accuracy:.4f}')
    print(f'Runtime to train the model: {end-start} seconds')

    results["Decision Tree"].setdefault(dataset, {})
    results["Decision Tree"][dataset]["Prediction quality"] = 1 - test_accuracy
    results["Decision Tree"][dataset]["Training time"] = end-start

Processing dataset: adult
Training decision tree model
Test error rate: 0.1849
Runtime to train the model: 0.35666728019714355 seconds
Processing dataset: covertype
Training decision tree model
Test error rate: 0.0947
Runtime to train the model: 6.823147535324097 seconds
Processing dataset: fashion_mnist
Training decision tree model
Test error rate: 0.2110
Runtime to train the model: 43.96087837219238 seconds
Processing dataset: creditcard
Training decision tree model
Test error rate: 0.0009
Runtime to train the model: 1.6340649127960205 seconds


### Task 2.4 [2 Marks] - Decision tree models for regression

Implement a function ``train_regression_tree(X_train, y_train)`` that trains a regression tree model using the training data ``X_train`` and labels ``y_train``. This function should return a trained Scikit-Learn decision tree regressor.

In [14]:
# This cell will be assessed. Replace the ... with your code

def train_regression_tree(X_train, y_train):
    """
    Trains a Decision Tree for regression.

    Parameters:
    - X_train (DataFrame): Training features.
    - y_train (DataFrame): Training target values.

    Returns:
    - model: Trained Decision Tree Regressor.
    """
    model = DecisionTreeRegressor(random_state=42)

    # Train the model on the training data
    model.fit(X_train, y_train)

    return model

The code below executes the regression tree models and saves the running time and test accuracy.

In [15]:
# Do not change the code in this cell.
# This cell has no code to write. It trains and assesses your regression tree model using the regression dataset.

datasets = ["california_housing"]

for dataset in datasets:
    print(f"Processing dataset: {dataset}")

    # Load the train and test data
    X_train, y_train, X_test, y_test = load_train_test_data(f"data/{dataset}/tree/")

    print("Training regression tree model")

    start = time.time()
    model = train_regression_tree(X_train, y_train)
    end = time.time()

    # Make predictions
    y_pred = model.predict(X_test)
    test_mse = mean_squared_error(y_test, y_pred)

    print(f'Test accuracy: {test_accuracy:.4f}')
    print(f'Runtime to train the model: {end-start} seconds')

    results["Decision Tree"].setdefault(dataset, {})
    results["Decision Tree"][dataset]["Prediction quality"] = test_mse
    results["Decision Tree"][dataset]["Training time"] = end-start

Processing dataset: california_housing
Training regression tree model
Test accuracy: 0.9991
Runtime to train the model: 0.06874966621398926 seconds


### Task 2.5 [2 Marks] - Random forest models for classification

Implement a function ``train_classification_forest(X_train, y_train)`` that trains a random forest model for classification using the training data ``X_train`` and labels ``y_train``. This function should return a trained Scikit-Learn random forest classifier.

In [16]:
# This cell will be assessed. Replace the ... with your code

def train_classification_forest(X_train, y_train):
    """
    Trains a Random Forest for classification.

    Parameters:
    - X_train (DataFrame): Training features.
    - y_train (DataFrame): Training class labels.

    Returns:
    - model: Trained Random Forest Classifier.
    """
    rf_classifier = RandomForestClassifier(random_state=42)

    rf_classifier.fit(X_train, y_train)

    return rf_classifier


The code below executes the randon forest models and records the training time and test accuracy

In [17]:
# Do not change the code in this cell.
# This cell has no code to write. It trains and assesses your random forest model using the classification datasets.

datasets = ["adult", "covertype", "fashion_mnist", "creditcard"]

for dataset in datasets:
    print(f"Processing dataset: {dataset}")

    # Load the train and test data
    X_train, y_train, X_test, y_test = load_train_test_data(f"data/{dataset}/tree/")

    print("Training random forest model")

    start = time.time()
    model = train_classification_forest(X_train, np.array(y_train).ravel())
    end = time.time()

    # Make predictions
    y_pred = model.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_pred)

    print(f'Test error rate: {1 - test_accuracy:.4f}')
    print(f'Runtime to train the model: {end-start} seconds')

    results["Random Forest"].setdefault(dataset, {})
    results["Random Forest"][dataset]["Prediction quality"] = 1 - test_accuracy
    results["Random Forest"][dataset]["Training time"] = end-start

Processing dataset: adult
Training random forest model
Test error rate: 0.1487
Runtime to train the model: 3.969787836074829 seconds
Processing dataset: covertype
Training random forest model
Test error rate: 0.0759
Runtime to train the model: 86.13231229782104 seconds
Processing dataset: fashion_mnist
Training random forest model
Test error rate: 0.1242
Runtime to train the model: 95.5809075832367 seconds
Processing dataset: creditcard
Training random forest model
Test error rate: 0.0004
Runtime to train the model: 19.953065633773804 seconds


### Task 2.6 [2 Marks] - Random Forest Models for Regression

Finally, implement a function ``train_regression_forest(X_train, y_train)`` that trains a random forest model for regression using the training data ``X_train`` and labels ``y_train``. This function should return a trained Scikit-Learn random forest regressor.

In [18]:
# This cell will be assessed. Replace the ... with your code

def train_regression_forest(X_train, y_train):
    """
    Trains a Random Forest for regression.

    Parameters:
    - X_train (DataFrame): Training features.
    - y_train (DataFrame): Training target values.

    Returns:
    - model: Trained Random Forest Regressor.
    """
    rf_regressor = RandomForestRegressor(random_state=42)

    rf_regressor.fit(X_train, y_train)

    return rf_regressor

The code below executes the random forest models for regression and records the training time and mean squared error.



In [19]:
# Do not change the code in this cell.
# This cell has no code to write. It trains and assesses your random forest model using the regression dataset.

datasets = ["california_housing"]

for dataset in datasets:
    print(f"Processing dataset: {dataset}")

    # Load the train and test data
    X_train, y_train, X_test, y_test = load_train_test_data(f"data/{dataset}/tree/")

    print("Training random forest model")

    start = time.time()
    model = train_regression_forest(X_train, np.array(y_train).ravel())
    end = time.time()

    # Make predictions
    y_pred = model.predict(X_test)
    test_mse = mean_squared_error(y_test, y_pred)

    print(f'Test MSE: {test_mse:.4f}')
    print(f'Runtime to train the model: {end-start} seconds')

    results["Random Forest"].setdefault(dataset, {})
    results["Random Forest"][dataset]["Prediction quality"] = test_mse
    results["Random Forest"][dataset]["Training time"] = end-start

Processing dataset: california_housing
Training random forest model
Test MSE: 0.4177
Runtime to train the model: 4.542239427566528 seconds


### Summarising the Results

Congratulations, we have reached the end of Task 2. The next cell will summarise the results obtained in a single table.

In [20]:
# Do not change the code in this cell.
# This cell has no code to write. It summarises the results in a tabular format

def format_values(row):
    """Format numeric values to 4 decimal places."""
    return {k: f"{v:.4f}" if isinstance(v, float) else v for k, v in row.items()}

def print_results_table(results):
    """
    Converts a nested dictionary of results into a table and prints it with separation lines for datasets.
    """
    # Flatten the nested dictionary into a list of rows
    flattened_data = [
        {"Dataset": dataset, "Model": model, **metrics}
        for model, datasets in results.items()
        for dataset, metrics in datasets.items()
    ]

    # Sort the data by the "Dataset" column
    flattened_data_sorted = sorted(flattened_data, key=lambda x: x["Dataset"])

    # Add separator rows between datasets
    formatted_data = []
    previous_dataset = None
    for row in flattened_data_sorted:
        if previous_dataset and row["Dataset"] != previous_dataset:
            # Insert a separator row
            formatted_data.append({key: "----" for key in row.keys()})
        formatted_data.append(format_values(row))
        previous_dataset = row["Dataset"]

    # Extract headers
    headers = list(formatted_data[0].keys())

    # Generate and print the table
    table = tabulate(formatted_data, headers="keys", tablefmt="pretty", missingval="N/A")
    print(table)

print_results_table(results)

+--------------------+---------------+--------------------+---------------+
|      Dataset       |     Model     | Prediction quality | Training time |
+--------------------+---------------+--------------------+---------------+
|       adult        |  Neural Net   |       0.7638       |    39.9626    |
|       adult        | Decision Tree |       0.1849       |    0.3567     |
|       adult        | Random Forest |       0.1487       |    3.9698     |
|        ----        |     ----      |        ----        |     ----      |
| california_housing |  Neural Net   |       0.4009       |    17.5221    |
| california_housing | Decision Tree |       0.7229       |    0.0687     |
| california_housing | Random Forest |       0.4177       |    4.5422     |
|        ----        |     ----      |        ----        |     ----      |
|     covertype      |  Neural Net   |       0.4301       |   425.1212    |
|     covertype      | Decision Tree |       0.0947       |    6.8231     |
|     covert

## Task 3 [32 Marks] - Hyperparameter optimisation

So far, we have used a fixed set of hyperparameters, but it is unclear if they are a good choice for our datasets. We will use Keras Tuner and Scikit Learn libraries to test different hyperparameter combinations. We will start with the Neural Net models.

### Task 3.1 [8 Marks] - Hyperparameter optimisation for classification neural nets

Create a function ``tune_train_classification_net(X_train, y_train, n_iter, project_name)`` that uses Keras tuner's ``RandomSearch`` to optimise the hyperparameters of a neural net model. ``X_train`` and ``y_train`` are pandas dataframes with the training data and labels. ``n_iter`` is the maximum number of iterations in the random search. ``project_name`` is an identifier used by Keras tuner to save the results on disk.

You have the freedom to choose your hyperparameter search space. Here are some suggestions based on the tutorials:
1. Depth. To make your model deeper, test a larger number of hidden layers, up to 3.
2. Width. Try different combinations of numbers of neurons per layer. For instance, you can try from $D_h / 2$ to $D_h * 2$.
3. Activation functions. ReLU, TANH and Sigmoid are common choices.
4. Optimiser. Adam and SGD.
5. Learning rate. A typical range is 1e-4 to 1e-2.

Your function should return the Keras model that achieved the best performance in a validation set of 20% of the training data. Average performance over three runs (``executions_per_trial=3``).

In [23]:
# This cell will be assessed. Replace the ... with your code

def tune_train_classification_net(X_train, y_train, n_iter, project_name):
    """
    Tunes and trains a classification neural network using Random Search.

    Parameters:
    - X_train (DataFrame): Training features.
    - y_train (DataFrame): Training target values (one-hot or integer labels).
    - n_iter (int): Number of hyperparameter configurations to try.
    - project_name (str): Name for organizing logs and results.

    Returns:
    - model: Trained Keras model with the best hyperparameters.
    """
    def build(hp):
        model = Sequential()

        # input layer
        model.add(Input(shape=(X_train.shape[1],)))

        # adjust the hidden layer（1 to 3）
        for i in range(hp.Int('num_layers', 1, 3)):
            model.add(Dense(
                units=hp.Int('units_' + str(i), min_value=32, max_value=512, step=32),
                activation=hp.Choice('activation_' + str(i), ['relu', 'tanh', 'sigmoid'])
            ))

        # output layer
        model.add(Dense(y_train.shape[1], activation='softmax'))

        # optimizer and control learning rate
        optimizer = hp.Choice('optimizer', ['adam', 'sgd'])
        if optimizer == 'adam':
            opt = Adam(learning_rate=hp.Float('learning_rate', 1e-4, 1e-2, sampling='log'))
        else:
            opt = SGD(learning_rate=hp.Float('learning_rate', 1e-4, 1e-2, sampling='log'))

        # compile the model
        model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

        return model


    tuner = RandomSearch(
          build,
          objective='val_accuracy',
          max_trials=n_iter,
          executions_per_trial=3,
          directory='log_results',
          project_name=project_name
      )

    tuner.search(X_train, y_train, epochs=10, validation_split=0.2)
    best_model = tuner.get_best_models(num_models=1)[0]

    return best_model

#### Important notice about the runtime

Hyperparameter tuning can take a lot of time, as most Machine Learning algorithms have many hyperparameters, and testing all possible combinations can lead to a combinatorial explosion.

To make comparisons fairer, we will limit the hyperparameter search to using no more than **approximately** 30 minutes of computing time.

The table above tells us the training time for a single model. For instance, if a random forest takes 4s for the adult dataset, then in 1,800 seconds (30 minutes), we can train 1,800 / 4 = 450 models. Each hyperparameter combination performance will be an average of three repetitions. Thus, we can assess 450 / 3 = 150 hyperparameter combinations.

We will control the time using the ``n_iter`` parameter. This parameter defines the maximum number of parameter combinations sampled and tested during the search. Given their smaller number of hyperparameters, some inducers, particularly the trees, may run much faster than 30 minutes.

This is a rough approximation based on a single run of the default models. Thus, some models may run faster or slower than 30 minutes.

In [24]:
# Do not change the code in this cell.
# This cell has no code to write. It trains and assesses your optimised neural net model using the classification datasets.

timeout_in_seconds = 1800
datasets = ["adult", "covertype", "fashion_mnist", "creditcard"]

for dataset in datasets:
    print(f"Processing dataset: {dataset}")

    # Load the train and test data
    X_train, y_train, X_test, y_test = load_train_test_data(f"data/{dataset}/nn/")

    print("\tTuning and training neural net model")

    start = time.time()
    model = tune_train_classification_net(X_train, y_train, int(timeout_in_seconds / 3 / results["Neural Net"][dataset]["Training time"]), dataset)
    end = time.time()

    # Make predictions
    test_loss, test_accuracy = model.evaluate(X_test, y_test)
    print(f'\t\tTest error rate: {1 - test_accuracy:.4f}')
    print(f'\t\tRuntime to hyperparameter search and model training: {end-start} seconds')

    results["Neural Net (HO)"].setdefault(dataset, {})
    results["Neural Net (HO)"][dataset]["Prediction quality"] = 1 - test_accuracy
    results["Neural Net (HO)"][dataset]["Training time"] = end-start

Trial 2 Complete [00h 03m 35s]
val_accuracy: 0.9982093572616577

Best val_accuracy So Far: 0.9982093572616577
Total elapsed time: 00h 07m 11s


  saveable.load_own_variables(weights_store.get(inner_path))
  return self.fn(y_true, y_pred, **self._fn_kwargs)


[1m2967/2967[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 1ms/step - accuracy: 0.9984 - loss: 0.0000e+00
		Test error rate: 0.0016
		Runtime to hyperparameter search and model training: 431.81633019447327 seconds


### Task 3.2 [8 Marks] - Hyperparameter optimisation for regression Neural Nets

Create a function ``tune_train_regression_net(X_train, y_train, n_iter, project_name)`` that uses Keras tuner's ``RandomSearch`` to optimise the hyperparameters of a regression neural net model. ``X_train`` and ``y_train`` are pandas dataframes with the training data and target values. ``n_iter`` is the maximum number of iterations in the random search. ``project_name`` is an identifier used by Keras tuner to save the results on disk.

You have the freedom to choose your hyperparameter search space. You can use the same hyperparameter recommendations given for classification.

Your function should return the Keras model that achieved the best performance in a validation set of 20% of the training data. Average performance over three runs (``executions_per_trial=3``).

In [35]:
# This cell will be assessed. Replace the ... with your code

def tune_train_regression_net(X_train, y_train, n_iter, project_name):
    """
    Tunes and trains a regression neural network using Random Search.

    Parameters:
    - X_train (DataFrame): Training features.
    - y_train (DataFrame): Training target values.
    - n_iter (int): Number of hyperparameter configurations to try.
    - project_name (str): Name for organising logs and results.

    Returns:
    - model: Trained Keras model with the best hyperparameters.
    """
    def build(hp):
        model = Sequential()

        # input layer
        model.add(Input(shape=(X_train.shape[1],)))

        # adjust the hidden layer（1 to 3）
        for i in range(hp.Int('num_layers', 1, 3)):
            model.add(Dense(
                units=hp.Int('units_' + str(i), min_value=32, max_value=512, step=32),
                activation=hp.Choice('activation_' + str(i), ['relu', 'tanh', 'sigmoid'])
            ))

        # output layer
        model.add(Dense(1))

        # optimizer and control learning rate
        optimizer = hp.Choice('optimizer', ['adam', 'sgd'])
        if optimizer == 'adam':
            opt = Adam(learning_rate=hp.Float('learning_rate', 1e-4, 1e-2, sampling='log'))
        else:
            opt = SGD(learning_rate=hp.Float('learning_rate', 1e-4, 1e-2, sampling='log'))

        # compile the model
        model.compile(optimizer=opt,loss='mean_squared_error', metrics=['mae'])

        return model

    tuner = RandomSearch(
          build,
          objective='val_mae',
          max_trials=n_iter,
          executions_per_trial=3,
          directory='log_results',
          project_name=project_name
      )

    tuner.search(X_train, y_train, epochs=10, validation_split=0.2)
    best_model = tuner.get_best_models(num_models=1)[0]

    return best_model

In [34]:
# Do not change the code in this cell.
# This cell has no code to write. It trains and assesses your neural net model using the regression dataset.

timeout_in_seconds = 1800
datasets = ["california_housing"]

for dataset in datasets:
    print(f"Processing dataset: {dataset}")

    # Load the train and test data
    X_train, y_train, X_test, y_test = load_train_test_data(f"data/{dataset}/nn/")

    print("Tuning and training neural net model")

    start = time.time()
    model = tune_train_regression_net(X_train, y_train, int(timeout_in_seconds / 3 / results["Neural Net"][dataset]["Training time"]), dataset)
    end = time.time()

    # Make predictions
    test_loss, test_mse = model.evaluate(X_test, y_test)
    print(f'Test MSE: {test_mse:.4f}')
    print(f'Runtime to hyperparameter search and model training: {end-start} seconds')

    results["Neural Net (HO)"].setdefault(dataset, {})
    results["Neural Net (HO)"][dataset]["Prediction quality"] = 1 - test_mse
    results["Neural Net (HO)"][dataset]["Training time"] = end-start

Trial 34 Complete [00h 00m 31s]
val_mae: 0.6185547709465027

Best val_mae So Far: 0.176962211728096
Total elapsed time: 00h 20m 46s


  saveable.load_own_variables(weights_store.get(inner_path))


[1m215/215[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 1ms/step - loss: 3.8458 - mae: 1.5209
Test MSE: 1.5308
Runtime to hyperparameter search and model training: 192.0619351863861 seconds


### Task 3.3 [4 Marks] - Hyperparameter optimisation for decision trees

We will train the decision trees with hyperparameter optimisation. Our code will implement the search using the ``RandomizedSearchCV`` class.

We will create the function ``tune_train_classification_tree(X_train, y_train, n_iter)``, which optimises hyperparameters and returns a scikit-learn model trained with the best parameters.

You have the freedom to define your hyperparameter search space. Here are some suggestions:
- Maximum tree depth from 10 to 40 with increments of 10. Include None, too.
- Minimum samples in a split: 2, 5, 10, 20.
- Minimum samples in a leaf node: 1, 2, 5, and 10.
- Splitting criteria: gine and entropy.

The function will search for the best combination of hyperparameter values and return a model trained in such a combination in the complete training set. During the search, average the performance using 3-fold cross-validation (``cv=3``).

In [38]:
# This cell will be assessed. Replace the ... with your code

def tune_train_classification_tree(X_train, y_train, n_iter):
    """
    Tunes and trains a Decision Tree for classification using Randomized Search.

    Parameters:
    - X_train (DataFrame): Training features.
    - y_train (DataFrame): Training target values (class labels).
    - n_iter (int): Number of hyperparameter configurations to try during the search.

    Returns:
    - model: Trained Decision Tree classifier with the best-found hyperparameters.
    """
    param ={
        'max_depth': [10, 20, 30, 40, None],
        'min_samples_split': [2, 5, 10, 20],
        'min_samples_leaf': [1, 2, 5, 10],
        'criterion': ['gini', 'entropy']
    }

    model = DecisionTreeClassifier(random_state=42)

    random_search = RandomizedSearchCV(
        estimator = model,
        param_distributions = param,
        n_iter = n_iter,
        cv = 3,
        n_jobs = -1,
        random_state=42
    )

    random_search.fit(X_train, y_train)

    best_model = random_search.best_estimator_
    return best_model


In [40]:
# Do not change the code in this cell.
# This cell has no code to write. It trains and assesses your decision tree model using the classification datasets.

timeout_in_seconds = 1800
datasets = ["adult", "covertype", "fashion_mnist", "creditcard"]

for dataset in datasets:
    print(f"Processing dataset: {dataset}")

    # Load the train and test data
    X_train, y_train, X_test, y_test = load_train_test_data(f"data/{dataset}/tree/")

    print("\tTuning and training decision tree model")

    start = time.time()
    model = tune_train_classification_tree(X_train, np.array(y_train).ravel(), int(timeout_in_seconds / 3 / results["Decision Tree"][dataset]["Training time"]))
    end = time.time()

    # Make predictions
    y_pred = model.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_pred)

    print(f'\t\tTest error rate: {1 - test_accuracy:.4f}')
    print(f'\t\tRuntime to hyperparameter search and model training: {end-start} seconds')

    results["Decision Tree (HO)"].setdefault(dataset, {})
    results["Decision Tree (HO)"][dataset]["Prediction quality"] = 1 - test_accuracy
    results["Decision Tree (HO)"][dataset]["Training time"] = end-start

Processing dataset: adult
	Tuning and training decision tree model


  _data = np.array(data, dtype=dtype, copy=copy,


		Test error rate: 0.1380
		Runtime to hyperparameter search and model training: 19.798728942871094 seconds
Processing dataset: covertype
	Tuning and training decision tree model


  _data = np.array(data, dtype=dtype, copy=copy,


		Test error rate: 0.0884
		Runtime to hyperparameter search and model training: 157.65669870376587 seconds
Processing dataset: fashion_mnist
	Tuning and training decision tree model
		Test error rate: 0.1908
		Runtime to hyperparameter search and model training: 130.84939694404602 seconds
Processing dataset: creditcard
	Tuning and training decision tree model


  _data = np.array(data, dtype=dtype, copy=copy,


		Test error rate: 0.0005
		Runtime to hyperparameter search and model training: 71.62795519828796 seconds


### Task 3.4 [4 Marks] - Hyperparameter optimisation for regression trees

We will train the regression trees with hyperparameter optimisation through the function ``tune_train_regression_tree(X_train, y_train, n_iter)``, which optimises hyperparameters and returns a scikit-learn model trained with the best parameters.

You can use the same suggestion for the hyperparameter space provided in the previous task. However, the splitting criteria suitable for regression trees are different. We suggest ``squared_error``, ``friedman_mse``, and ``absolute_error``.

The function will search for the best combination of hyperparameter values and return a model trained in such combination in the complete training set. During the search, average the performance using 3-fold cross-validation (``cv=3``).

In [43]:
# This cell will be assessed. Replace the ... with your code

def tune_train_regression_tree(X_train, y_train, n_iter):
    """
    Tunes and trains a Regression Tree using Randomized Search.

    Parameters:
    - X_train (DataFrame): Training features.
    - y_train (DataFrame): Training target values (continuous values).
    - n_iter (int): Number of hyperparameter configurations to try during the search.

    Returns:
    - model: Trained Regression Tree with the best-found hyperparameters.
    """
    param ={
        'max_depth': [10, 20, 30, 40, None],
        'min_samples_split': [2, 5, 10, 20],
        'min_samples_leaf': [1, 2, 5, 10],
        'criterion': ['squared_error', 'friedman_mse','absolute_error']
    }

    model = DecisionTreeRegressor(random_state=42)

    random_search = RandomizedSearchCV(
        estimator = model,
        param_distributions = param,
        n_iter = n_iter,
        cv = 3,
        scoring='neg_mean_squared_error',
        n_jobs=-1,
        random_state=42
    )

    random_search.fit(X_train, y_train)

    best_model = random_search.best_estimator_
    return best_model

In [44]:
# Do not change the code in this cell.
# This cell has no code to write. It trains and assesses your regression model using the regression dataset.

timeout_in_seconds = 1800
datasets = ["california_housing"]

for dataset in datasets:
    print(f"Processing dataset: {dataset}")

    # Load the train and test data
    X_train, y_train, X_test, y_test = load_train_test_data(f"data/{dataset}/tree/")

    print("Training random forest model")

    start = time.time()
    model = tune_train_regression_tree(X_train, np.array(y_train).ravel(), int(timeout_in_seconds / 3 / results["Decision Tree"][dataset]["Training time"]))
    end = time.time()

    # Make predictions
    y_pred = model.predict(X_test)
    test_mse = mean_squared_error(y_test, y_pred)

    print(f'Test MSE: {test_mse:.4f}')
    print(f'\t\tRuntime to hyperparameter search and model training: {end-start} seconds')

    results["Decision Tree (HO)"].setdefault(dataset, {})
    results["Decision Tree (HO)"][dataset]["Prediction quality"] = test_mse
    results["Decision Tree (HO)"][dataset]["Training time"] = end-start

Processing dataset: california_housing
Training random forest model




Test MSE: 0.4357
		Runtime to hyperparameter search and model training: 69.89708518981934 seconds


  _data = np.array(data, dtype=dtype, copy=copy,


### Task 3.5 [4 Marks] - Hyperparameter optimisation for decision forest

We will create the function ``tune_train_classification_forest(X_train, y_train, n_iter)``, which optimises hyperparameters for a classification random forest and returns a scikit-learn model trained with the best parameters.

You have the freedom to define your hyperparameter search space. Here are some suggestions:
- Number of estimators (trees): 50, 100, 200.
- Maximum tree depth from 10 to 40 with increments of 10. Include None, too.
- Minimum samples in a split: 2, 5, 10, 20.
- Minimum samples in a leaf node: 1, 2, 5, and 10.
- Splitting criteria: gine and entropy.
- Bootstrap sampling: yes and no.

The function will search for the best combination of hyperparameter values and return a model trained in such combination in the complete training set. During the search, average the performance using 3-fold cross-validation (``cv=3``).

In [45]:
# This cell will be assessed. Replace the ... with your code

def tune_train_classification_forest(X_train, y_train, n_iter):
    """
    Tunes and trains a Random Forest classifier using Randomized Search.

    Parameters:
    - X_train (DataFrame): Training features.
    - y_train (DataFrame): Training target values (class labels).
    - n_iter (int): Number of hyperparameter configurations to try during the search.

    Returns:
    - model: Trained Random Forest classifier with the best-found hyperparameters.
    """
    param = {
    'n_estimators': [50, 100, 200],  # Number of trees in the forest
    'max_depth': [10, 20, 30, 40, None], # Maximum depth of each tree
    'min_samples_split': [2, 5, 10, 20], # Minimum samples required to split a node
    'min_samples_leaf': [1, 2, 5,10],   # Minimum samples required to form a leaf
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False]
    }

    model = RandomForestClassifier(random_state=42)

    random_search = RandomizedSearchCV(
        estimator = model,
        param_distributions = param,
        n_iter = n_iter,
        cv = 3,
        n_jobs = -1,
        random_state=42,
        scoring='accuracy'
    )

    random_search.fit(X_train, y_train)
    best_model = random_search.best_estimator_

    return best_model

In [46]:
# Do not change the code in this cell.
# This cell has no code to write. It trains and assesses your random forest model using the classification datasets.

timeout_in_seconds = 1800
datasets = ["adult", "covertype", "fashion_mnist", "creditcard"]

for dataset in datasets:
    print(f"Processing dataset: {dataset}")

    # Load the train and test data
    X_train, y_train, X_test, y_test = load_train_test_data(f"data/{dataset}/tree/")

    print("\tTuning and training random forest model")

    start = time.time()
    model = tune_train_classification_forest(X_train, np.array(y_train).ravel(), int(timeout_in_seconds / 3 / results["Random Forest"][dataset]["Training time"]))
    end = time.time()

    # Make predictions
    y_pred = model.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_pred)

    print(f'\t\tTest error rate: {1 - test_accuracy:.4f}')
    print(f'\t\tRuntime to hyperparameter search and model training: {end-start} seconds')

    results["Random Forest (HO)"].setdefault(dataset, {})
    results["Random Forest (HO)"][dataset]["Prediction quality"] = 1 - test_accuracy
    results["Random Forest (HO)"][dataset]["Training time"] = end-start

Processing dataset: adult
	Tuning and training random forest model
		Test error rate: 0.1360
		Runtime to hyperparameter search and model training: 204.7143087387085 seconds
Processing dataset: covertype
	Tuning and training random forest model
		Test error rate: 0.0790
		Runtime to hyperparameter search and model training: 241.41599941253662 seconds
Processing dataset: fashion_mnist
	Tuning and training random forest model
		Test error rate: 0.1270
		Runtime to hyperparameter search and model training: 452.8285641670227 seconds
Processing dataset: creditcard
	Tuning and training random forest model
		Test error rate: 0.0003
		Runtime to hyperparameter search and model training: 232.9679627418518 seconds


### Task 3.6 [4 Marks] - Hyperparameter optimisation for regression forest

We will create the function ``tune_train_regression_forest(X_train, y_train, n_iter)``, which optimises hyperparameters for a regression random forest and returns a scikit-learn model trained with the best parameters.

You have the freedom to define your hyperparameter search space. Our recommendations are similar for the classification forest. However, the splitting criteria suitable for regression problems are ``squared_error``, ``friedman_mse``, and ``absolute_error``.

The function will search for the best combination of hyperparameter values and return a model trained in such combination in the complete training set. During the search, average the performance using 3-fold cross-validation (``cv=3``).

In [47]:
# This cell will be assessed. Replace the ... with your code

def tune_train_regression_forest(X_train, y_train, n_iter):
    """
    Tunes and trains a Random Forest regressor using Randomized Search.

    Parameters:
    - X_train (DataFrame): Training features.
    - y_train (DataFrame): Training target values (continuous values).
    - n_iter (int): Number of hyperparameter configurations to try during the search.

    Returns:
    - model: Trained Random Forest regressor with the best-found hyperparameters.
    """
    param = {
    'n_estimators': [50, 100, 200],  # Number of trees in the forest
    'max_depth': [10, 20, 30, 40, None], # Maximum depth of each tree
    'min_samples_split': [2, 5, 10, 20], # Minimum samples required to split a node
    'min_samples_leaf': [1, 2, 5,10],   # Minimum samples required to form a leaf
    'criterion': ['squared_error', 'friedman_mse','absolute_error'],
    'bootstrap': [True, False]
    }

    model = RandomForestRegressor(random_state=42)

    random_search = RandomizedSearchCV(
        estimator = model,
        param_distributions = param,
        n_iter = n_iter,
        cv = 3,
        n_jobs = -1,
        random_state=42,
        scoring='neg_mean_squared_error'
    )

    random_search.fit(X_train, y_train)
    best_model = random_search.best_estimator_

    return best_model

In [None]:
# Do not change the code in this cell.
# This cell has no code to write. It trains and assesses your random forest model using the regression dataset.

timeout_in_seconds = 1800
datasets = ["california_housing"]

for dataset in datasets:
    print(f"Processing dataset: {dataset}")

    # Load the train and test data
    X_train, y_train, X_test, y_test = load_train_test_data(f"data/{dataset}/tree/")

    print("Training random forest model")

    start = time.time()
    model = tune_train_regression_forest(X_train, np.array(y_train).ravel(), int(timeout_in_seconds / 3 / results["Random Forest"][dataset]["Training time"]))
    end = time.time()

    # Make predictions
    y_pred = model.predict(X_test)
    test_mse = mean_squared_error(y_test, y_pred)

    print(f'Test MSE: {test_mse:.4f}')
    print(f'Runtime to hyperparameter search and model training: {end-start} seconds')

    results["Random Forest (HO)"].setdefault(dataset, {})
    results["Random Forest (HO)"][dataset]["Prediction quality"] = test_mse
    results["Random Forest (HO)"][dataset]["Training time"] = end-start

Processing dataset: california_housing
Training random forest model


The next cell tabulates all the results. HO stands for Hyperparameter Optimisation.

In [None]:
print_results_table(results)

Congratulations! You have reached the end of the assignment. In the remaining of this document, you will analyse the results in a report.

## Task 4 [13 Marks] - Report

Write a report with less than 1,000 words (around two pages) in the following cells using markdown. You can include graphs and tables in your report. Answer the following questions in your report.

- [3 Marks] Discuss the performance of the algorithms in terms of prediction quality and training time. Use plots to compare these methods. Is there a method that stands out?
- [3 Marks] Do you think any of the seven hypotheses (machine learning wisdom and misconceptions) presented at the beginning of this assignment are correct? Have you observed any evidence that supports them?
- [3 Marks] Is the hyperparameter optimisation worth the time spent? Did you observe significant improvements in prediction quality?
- [2 Marks] We have measured the training time of these models, but another important aspect is the inference time. Would you expect some models to be more efficient than others for inference? What is the importance of having efficient models for inference? What is the importance of having efficient models for training?
- [2 Marks] The credit card dataset is imbalanced; in this situation, the error rate tends to be very small and difficult to interpret. Extend the performance analysis in this dataset to include the confusion matrix and F1 score. Analyse the performance of the classifiers under these performance measures.

Use one or more cell here to write your report.