# Machine Learning - Practical 4 - Deep Learning VS Trees


Names: Pratik Dhameliya, Injamam Ul Karim, Hasan Marwan Mahmood
Summer Term 2023   
Due Date: Tuesday, June 13, 2pm

In this practical we are going to use a tabular dataset. We will test two different approaches, tree ensembles and neural networks, and compare their performance. We are also going to learn how to make trees interpretable.

To prepare this tutorial we used [this paper](https://arxiv.org/pdf/2207.08815.pdf) with its [repository](https://github.com/LeoGrin/tabular-benchmark).

For explained variance in trees, you can read more [here](https://scikit-learn.org/0.15/auto_examples/ensemble/plot_gradient_boosting_regression.html#example-ensemble-plot-gradient-boosting-regression-py).


In [None]:
%matplotlib inline

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pickle
import os

from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score, precision_score, accuracy_score

from PIL import Image
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset, Subset
import torchvision.datasets as datasets
import torchvision.transforms as transforms
from torchvision.utils import make_grid
torch.manual_seed(42) # Set manual seed

In [None]:
# DO NOT CHANGE
use_cuda = True
use_cuda = False if not use_cuda else torch.cuda.is_available()
device = torch.device('cuda:0' if use_cuda else 'cpu')
torch.cuda.get_device_name(device) if use_cuda else 'cpu'
print('Using device', device)

## Load, clean and split the tabular dataset

We use the preprocessing pipeline from [Grinsztajn, 2022](https://arxiv.org/pdf/2207.08815.pdf).

**No missing data**    

Remove all rows containing at least one missing entry.    

*In practice, rows with missing values are often not removed. Instead, the gaps in a column are filled with the mean or median for numerical data and with the mode or median for categorical data. Sometimes even simple prediction models are used to fill in the gaps. For the sake of simplicity, we remove rows or columns with missing values.*
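
As a minimal sketch of the imputation alternative mentioned above (not part of the pipeline used in this practical), the snippet below fills a numerical column with its median and a categorical column with its mode. The toy frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the column names are hypothetical.
toy = pd.DataFrame({
    "speed_limit": [30.0, np.nan, 60.0, 30.0],
    "road_type": ["single", "dual", None, "single"],
})

imputed = toy.copy()
# Numerical column: fill gaps with the column median.
imputed["speed_limit"] = imputed["speed_limit"].fillna(imputed["speed_limit"].median())
# Categorical column: fill gaps with the most frequent value (mode).
imputed["road_type"] = imputed["road_type"].fillna(imputed["road_type"].mode().iloc[0])

print(imputed)
```

In a real train/test setup the fill values would normally be computed on the train split only and then reused for the test split.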

**Balanced classes**   

For classification, the target is binarised if there are multiple classes, by taking the two most numerous classes, and we keep half of samples in each class.
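
The following is a rough sketch of one way to binarise and balance a target column, assuming "balanced" means that both remaining classes end up with the same number of samples. The helper name `binarise_and_balance` and the toy data are hypothetical; the dataset used in this practical is already preprocessed.

```python
import pandas as pd

def binarise_and_balance(df, target, random_state=42):
    """Keep the two most frequent classes and downsample both to the same size."""
    top_two = df[target].value_counts().index[:2]
    kept = df[df[target].isin(top_two)]
    n_per_class = kept[target].value_counts().min()
    # draw the same number of rows from each of the two classes
    return kept.groupby(target).sample(n=n_per_class, random_state=random_state)

toy = pd.DataFrame({
    "label": ["a"] * 5 + ["b"] * 3 + ["c"] * 1,
    "value": range(9),
})
balanced = binarise_and_balance(toy, "label")
print(balanced["label"].value_counts())
```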

**Low cardinality categorical features**   

Remove categorical features with more than 20 categories.

**High cardinality numerical features**   

Remove numerical features with fewer than 10 unique values. Convert numerical features with 2 unique values to categorical.

**Data description:**  
Data reported to the police about the circumstances of personal injury road accidents in Great Britain from 1979. This version includes data up to 2015. We will try to predict the sex of the driver based on the data provided.

In [None]:
## In case you have any issues with loading the pickle file
## check that your pandas version is 1.4.1
## or just simply run:
## pip install pandas==1.4.1

with open('/kaggle/input/adopted/adopted_road_safety.pkl', 'rb') as f:
    dataset = pickle.load(f)

In [None]:
dataset

In [None]:
target_column = 'Sex_of_Driver'
test_size = 0.2
random_state = 42

In [None]:
def remove_nans(df):
    '''
    this function removes rows with nans
    '''
    # TODO
    return df.dropna()


def numerical_to_categorical(df, n=2, ignore=[target_column]):
    '''
    change the type of the column to categorical
    if it has <= n unique values
    '''
    # TODO
    for column in df.columns:
        if column in ignore:
            continue

        if df[column].nunique() <= n:
            df[column] = df[column].astype('category')

    return df


def remove_columns_by_n(df, n=10, condition=np.number, direction='less',
                        ignore=[target_column]):
    '''

    Remove columns with more or less than n unique values.
    Usually it makes sense to apply this function to columns with categorical values (see below where it is called).
    With the default values we remove all numerical columns which have less than 10 unique values (except for the target_column).
    '''
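    # TODO
    # select_dtypes handles both np.number and 'category'; a plain
    # `dtype == np.number` comparison does not reliably match concrete
    # dtypes such as int64
    candidates = [c for c in df.select_dtypes(include=condition).columns
                  if c not in ignore]
    for column in candidates:
        n_unique = df[column].nunique()
        if direction == 'less' and n_unique < n:
            df = df.drop(column, axis=1)
        elif direction == 'more' and n_unique > n:
            df = df.drop(column, axis=1)
    return df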

In [None]:
df = dataset
df = remove_nans(df)
df = numerical_to_categorical(df, n=3, ignore=[target_column])
df = remove_columns_by_n(df, n=10, condition=np.number, direction='less', 
                         ignore=[target_column])
df = remove_columns_by_n(df, n=40, condition='category', direction='more', 
                         ignore=[target_column])
assert not df.isna().any().any(), 'There are still nans in the dataframe'

In [None]:
# TODO : make train-test split from the dataframe using the parameters above
# expected results variable names - train_X, test_X, train_y, test_y
Y = df[target_column]
X = df.drop(target_column, axis=1)

train_X, test_X, train_y, test_y = train_test_split(X, Y, test_size=test_size, random_state=random_state)

In [None]:
Y

## Task 1: Create a GradientBoostingClassifier

In [None]:
## TODO : define the GradientBoostingClassifier, 
## train it on the train set and predict on the test set
gradientBoostingClassifier = GradientBoostingClassifier()
gradientBoostingClassifier.fit(train_X, train_y)
y_pred = gradientBoostingClassifier.predict(test_X)

In [None]:
## TODO : print  accuracy, precision, recall
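## Hint : use functions from sklearn metrics
# micro-averaged precision and recall both equal accuracy for single-label
# classification, so macro averaging is used to get informative values
accuracy = accuracy_score(test_y, y_pred)
precision = precision_score(test_y, y_pred, average="macro")
recall = recall_score(test_y, y_pred, average="macro")
print(accuracy, precision, recall)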

In [None]:
## TODO : Write a function which iterates over trees_amount, 
## train a classifier with a specified amount of trees and print accuracy, precision, and recall.
## Note: the calculations may take several minutes (depending on the computer efficiency).

def trees_amount_exploration(train_X, train_y, test_X, test_y, trees_amount=[1, 20, 50, 100]):
    #TODO
    accuracyList = []
    precisionList = []
    recallList = []
    for tree in trees_amount:
        gradientBoostingClassifier = GradientBoostingClassifier(n_estimators=tree)
        gradientBoostingClassifier.fit(train_X, train_y)
        y_pred = gradientBoostingClassifier.predict(test_X)
        # macro averaging: micro-averaged precision/recall would just repeat the accuracy
        accuracy = accuracy_score(test_y, y_pred)
        precision = precision_score(test_y, y_pred, average="macro")
        recall = recall_score(test_y, y_pred, average="macro")
        print(accuracy, precision, recall)
        accuracyList.append(accuracy)
        precisionList.append(precision)
        recallList.append(recall)
    return accuracyList, precisionList, recallList


In [None]:
accuracyList, precisionList, recallList = trees_amount_exploration(train_X, train_y, test_X, test_y)

In [None]:
## TODO : Write a function which iterates over the learning rate, 
## train a classifier with a specified learning rate and print accuracy, precision, and recall.
## Note: the calculations may take several minutes (depending on the computer efficiency).

def learning_rate_exploration(train_X, train_y, test_X, test_y, learning_rates = [0.1, 0.2, 0.3, 0.4, 0.5], trees_amount=100):
    #TODO
    accuracyList = []
    precisionList = []
    recallList = []
    for rate in learning_rates:
        gradientBoostingClassifier = GradientBoostingClassifier(n_estimators=trees_amount, learning_rate=rate)
        gradientBoostingClassifier.fit(train_X, train_y)
        y_pred = gradientBoostingClassifier.predict(test_X)
        # macro averaging: micro-averaged precision/recall would just repeat the accuracy
        accuracy = accuracy_score(test_y, y_pred)
        precision = precision_score(test_y, y_pred, average="macro")
        recall = recall_score(test_y, y_pred, average="macro")
        print(accuracy, precision, recall)
        accuracyList.append(accuracy)
        precisionList.append(precision)
        recallList.append(recall)
    return accuracyList, precisionList, recallList

In [None]:
accuracyList_learning_rate_exploration, precisionList_learning_rate_exploration, recallList_learning_rate_exploration = learning_rate_exploration(train_X, train_y, test_X, test_y)

In [None]:
## TODO : Write a function which iterates over different depths, 
## train a classifier with a specified depth and print accuracy, precision, and recall
## Set trees_amount= 50 to make the calculations faster
## Note: the calculations may take several minutes (depending on the computer efficiency).

def max_depth_exploration(train_X, train_y, test_X, test_y, depths=[1,2,3,5]):
    # TODO
    accuracyList = []
    precisionList = []
    recallList = []
    for depth in depths:
        gradientBoostingClassifier = GradientBoostingClassifier(n_estimators=50, max_depth=depth)
        gradientBoostingClassifier.fit(train_X, train_y)
        y_pred = gradientBoostingClassifier.predict(test_X)
        # macro averaging: micro-averaged precision/recall would just repeat the accuracy
        accuracy = accuracy_score(test_y, y_pred)
        precision = precision_score(test_y, y_pred, average="macro")
        recall = recall_score(test_y, y_pred, average="macro")
        print(accuracy, precision, recall)
        accuracyList.append(accuracy)
        precisionList.append(precision)
        recallList.append(recall)
    return accuracyList, precisionList, recallList

In [None]:
accuracyList_max_depth_exploration, precisionList_max_depth_exploration, recallList_max_depth_exploration = max_depth_exploration(train_X, train_y, test_X, test_y)

**TODO :**   

* How does the max_depth parameter influence the results? 
* How does the learning rate influence the results?
* How does the number of trees in the ensemble influence the results?
* Try to improve the accuracy by combining different max_depth, learning rate and number of trees. How well does your best model perform?
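
One possible way to approach the last question is a small grid over the three parameters. The sketch below runs on synthetic stand-in data from `make_classification` so that it is self-contained; the grid values are arbitrary, and the variable names (`X_demo`, `results`, ...) are not part of the practical.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

results = []
# Try every combination of depth, learning rate and number of trees.
for depth, rate, n_trees in product([1, 3], [0.1, 0.3], [20, 50]):
    clf = GradientBoostingClassifier(max_depth=depth, learning_rate=rate,
                                     n_estimators=n_trees, random_state=42)
    clf.fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    results.append((acc, depth, rate, n_trees))

best_acc, best_depth, best_rate, best_trees = max(results)
print(f"best accuracy {best_acc:.3f} with max_depth={best_depth}, "
      f"learning_rate={best_rate}, n_estimators={best_trees}")
```

Picking the best combination on the test set gives an optimistic estimate; a separate validation split or cross-validation would be the more careful choice.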

In [None]:
## TODO -  sklearn trees have the attribute feature_importances_
## make a plot, to show relative importance (maximum is 1) of your classifier and
## order features from most relevant feature to the least relevant in the plot

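def plot_explained_variance(clf, X):
    # TODO plot the explained variance
    importances = clf.feature_importances_
    # scale so that the most important feature has relative importance 1
    importances = importances / importances.max()
    indices = np.argsort(importances)[::-1]
    plt.figure()
    plt.title("Relative feature importances")
    plt.bar(range(X.shape[1]), importances[indices],
            color="r", align="center")
    # label the bars with feature names instead of column indices
    plt.xticks(range(X.shape[1]), X.columns[indices], rotation=90)
    plt.xlim([-1, X.shape[1]])
    plt.show()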

In [None]:
## TODO : display the plot
plot_explained_variance(gradientBoostingClassifier, train_X)

**TODO :** Interpret the plot.

**TODO (optional):** Try to remove the least important features and see what happens. Does the quality improve or degrade? Why?
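
A minimal sketch of the optional experiment, again on synthetic data rather than the road safety dataset: train once on all features, keep only the most important half according to `feature_importances_`, and retrain. The cut-off of 5 features is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with only a few informative features.
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     n_informative=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

full = GradientBoostingClassifier(n_estimators=30, random_state=42).fit(X_tr, y_tr)

# Column indices sorted from most to least important; keep the top 5.
order = np.argsort(full.feature_importances_)[::-1]
keep = order[:5]

reduced = GradientBoostingClassifier(n_estimators=30, random_state=42).fit(X_tr[:, keep], y_tr)

acc_full = accuracy_score(y_te, full.predict(X_te))
acc_reduced = accuracy_score(y_te, reduced.predict(X_te[:, keep]))
print(f"all features: {acc_full:.3f}, top 5 features: {acc_reduced:.3f}")
```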

## Prepare for deep learning
### Add all the necessary training functions 
*You can reuse them from previous practical exercises*

In [None]:
## TODO write a function that calculates the accuracy
## Hint - you can use yours from practical 3 

def accuracy(correct, total): 
    """
    function to calculate the accuracy given the
        correct: number of correctly classified samples
        total: total number of samples
    returns the ratio
    """

    return float(correct) / total

In [None]:
## TODO : Define a train and validation functions here
## Hint - you can use yours from practical 3 
def train(dataloader, optimizer, model, loss_fn, device, master_bar):
    """ method to train the model """
    epoch_loss = []
    epoch_correct, epoch_total = 0, 0

    model.train()
    for x, y in dataloader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()

        # Forward pass
        y_pred = model(x)
        # For calculating the accuracy, save the number of correctly classified
        # samples and the total number
        epoch_correct += (y == y_pred.argmax(dim=1)).sum().item()
        epoch_total += len(y)

        # Compute loss
        loss = loss_fn(y_pred, y)

        # Backward pass
        loss.backward()
        optimizer.step()

        # For plotting the train loss, save it for each batch
        epoch_loss.append(loss.item())

    # Return the mean loss and the accuracy of this epoch
    return np.mean(epoch_loss), accuracy(epoch_correct, epoch_total)


def validate(dataloader, model, loss_fn, device, master_bar):
    """ method to compute the metrics on the validation set """
    # TODO: write a validation function that calculates the loss and accuracy on the validation set
    # you can also combine it with the training function
    epoch_loss = []
    epoch_correct, epoch_total = 0, 0

    model.eval()
    with torch.no_grad():
        for x, y in dataloader:
            x, y = x.to(device), y.to(device)
            # make a prediction on validation set
            y_pred = model(x)

            # For calculating the accuracy, save the number of correctly
            # classified samples and the total number
            epoch_correct += (y == y_pred.argmax(dim=1)).sum().item()
            epoch_total += len(y)

            # Compute loss
            loss = loss_fn(y_pred, y)

            # For plotting the validation loss, save it for each batch
            epoch_loss.append(loss.item())

    # Return the mean loss and the accuracy of this epoch
    return np.mean(epoch_loss), accuracy(epoch_correct, epoch_total)

In [None]:
#TODO write a run_training function that 
# - calls the train and validate functions for each epoch
# - saves the train_losses, val_losses, train_accs, val_accs as arrays for each epoch
## Hint - you can use yours from practical 3 
from tqdm import trange

def run_training(model, optimizer, loss_function, device, num_epochs, train_dataloader, val_dataloader):
    """ method to run the training procedure """
    # TODO: write a run_training function that
    # - calls the train and validate functions for each epoch
    # - saves the train_losses, val_losses, train_accs, val_accs as arrays for each epoch
    master_bar = trange(num_epochs)
    train_losses, val_losses, train_accs, val_accs = [],[],[],[]

    for epoch in master_bar:
        # Train the model
        epoch_train_loss, epoch_train_acc = train(train_dataloader, optimizer, model, 
                                                  loss_function, device, master_bar)
        # Validate the model
        epoch_val_loss, epoch_val_acc = validate(val_dataloader, model, loss_function, 
                                                 device, master_bar)

        # Save loss and acc for plotting
        train_losses.append(epoch_train_loss)
        val_losses.append(epoch_val_loss)
        train_accs.append(epoch_train_acc)
        val_accs.append(epoch_val_acc)
        
        master_bar.write(f'Train loss: {epoch_train_loss:.2f}, val loss: {epoch_val_loss:.2f}, train acc: {epoch_train_acc:.3f}, val acc: {epoch_val_acc:.3f}')
            
    return train_losses, val_losses, train_accs, val_accs


In [None]:
# TODO write a plot function 
## Hint - you can use yours from practical 2 or 3 
import seaborn as sns
import pandas as pd

def plot(train_metrics, validation_metrics, x, x_label, y_label, legend_names):
    df = pd.DataFrame(list(zip(x, train_metrics, [legend_names[0]] * len(train_metrics))), columns=[x_label, y_label, "type"])
    df = pd.concat([df, pd.DataFrame(list(zip(x, validation_metrics, [legend_names[1]] * len(validation_metrics))), columns=[x_label, y_label, "type"])])
    ax = sns.pointplot(data=df, x=x_label, y=y_label, hue='type', linestyles=['--', '--'])
    plt.show()

### Convert a pandas dataframe to a PyTorch dataset

In [None]:
## TODO : Define the dataset, apply normalization in the getitem method
## Hint : you can use/adapt your code from practical 2
class TabularDataset(torch.utils.data.Dataset):
    def __init__(self, df_x, df_y, mean=None, std=None, normalise=True):
        '''
        TODO: save params to self attributes, 
        x is data without target column
        y is target column
        transform df to_numpy
        '''
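        # store features as float32 and labels as int64 numpy arrays
        self.x = np.asarray(df_x, dtype=np.float32)
        self.y = np.asarray(df_y, dtype=np.int64)
        # mean/std may arrive as pandas Series; keep plain numpy copies
        self.mean = None if mean is None else np.array(mean, dtype=np.float32)
        self.std = None if std is None else np.array(std, dtype=np.float32)
        if self.std is not None:
            # avoid division by zero for constant columns
            self.std[self.std == 0] = 1.0
        self.normalise = normalise

    def __len__(self):
        # TODO: return the length of the whole dataset
        return len(self.x)

    def __getitem__(self, index):
        ## TODO: return X, y by index, normalized if needed
        x = self.x[index]
        if self.normalise and self.mean is not None and self.std is not None:
            # standardise with the statistics passed in (computed on the train set)
            x = (x - self.mean) / self.std
        return x, self.y[index]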

In [None]:
# Convert categorical values to numerical codes
def convert_categorical_to_numerical(df):
    df = df.copy()
    for column in df.columns:
        # handle both raw object columns and columns already cast to 'category'
        if df[column].dtype.name in ('object', 'category'):
            df[column] = df[column].astype('category').cat.codes
    return df

In [None]:
from sklearn.preprocessing import LabelEncoder


# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the categorical variable
Y = label_encoder.fit_transform(Y)

# Convert the numerical variable to the desired dtype
Y = Y.astype('int64')

In [None]:
## TODO : calculate mean and std for the train set
## Hint : be careful with categorical values. Convert them to numerical
## Hint : the response variable should be of datatype integer
X = convert_categorical_to_numerical(X)


X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=random_state)
mean = X_train.mean()
std = X_train.std()

print(mean)
print('=======')
print(std)
print(type(Y[0]))

In [None]:
# TODO : define new datasets with mean, std and normalise=True
# be careful with the labels, they should start from 0!
train_dataset = TabularDataset(X_train, y_train, mean, std, normalise=True)
test_dataset = TabularDataset(X_test, y_test, mean, std, normalise=True)
## TODO : define dataloaders, with specified batch size and shuffled
batch_size = 256
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

## Logistic regression

In [None]:
class LR(torch.nn.Module):
    """
    The logistic regression model inherits from torch.nn.Module 
    which is the base class for all neural network modules.
    """
    def __init__(self, input_dim, output_dim):
        """ Initializes internal Module state. """
        super(LR, self).__init__()
        # TODO define linear layer for the model
        self.linear = torch.nn.Linear(input_dim, output_dim)


    def forward(self, x):
        """ Defines the computation performed at every call. """
        # What are the dimensions of your input layer?
        x = x.to(torch.float32)
        # TODO run the data through the layer
        outputs = self.linear(x)
        return outputs

In [None]:
## TODO define model, loss and optimisers
## don't forget to move everything for the correct devices
## 
lr=0.001
input_dim, output_dim = X_train.shape[1], len(np.unique(y_train))
model = LR(input_dim, output_dim)
model.to(device)
model.train()
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

In [None]:
## TODO train the network
from tqdm import trange
     

num_epochs = 30
train_losses_lr, val_losses_lr, train_accs_lr, val_accs_lr = run_training(model = model, optimizer = optimizer, loss_function = loss_function, device = device, num_epochs =num_epochs, train_dataloader = train_dataloader, val_dataloader = val_dataloader)

In [None]:
## todo - plot losses and accuracies
plot(train_losses_lr, val_losses_lr, range(num_epochs), 'Epoch', 'Loss' ,['Train', 'Validation'])


In [None]:
plot(train_accs_lr, val_accs_lr, range(num_epochs), 'Epoch', 'Accuracy', ['Train', 'Validation'])



## Create a simple MLP

Since the trees in the default `GradientBoostingClassifier` have a maximum depth of 3, let's make an MLP with 3 linear layers and ReLU.
Note that applying convolutions to tabular data does not make much sense, even though it is technically possible.   

**TODO :** Explain why making convolutions on tabular data does not make much sense. Why do we use an MLP, not a CNN from the previous homework?

In [None]:
class TabularNetwork(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        """ Initializes internal Module state. """
        super(TabularNetwork, self).__init__()
        self.network = nn.Sequential(
            # TODO : define 3 linear layer with sizes 
            # input_dim -> input_dim // 2 -> output_dim
            # using ReLU as nonlinearity
            nn.Linear(input_dim, input_dim // 2),
            nn.ReLU(),
            nn.Linear(input_dim // 2, output_dim),
        )
      

    def forward(self, x):
        """ Defines the computation performed at every call. """
        # TODO
        # run the data through the network
        x = x.to(self.network[0].weight.dtype)
        outputs = self.network(x)
        return outputs

In [None]:
## TODO : define model, optimiser, cross entropy loss,
## put model to the device, and train mode
## you can optionally apply regularisation between 0.0005 and 0.005 
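lr = 0.001
model = TabularNetwork(input_dim, output_dim)
model.to(device)
model.train()
loss_function = torch.nn.CrossEntropyLoss()
# a new optimizer is needed here: the previous one still holds the
# parameters of the logistic regression model, so the MLP would never be updated
optimizer = torch.optim.Adam(model.parameters(), lr=lr)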


In [None]:
## TODO : Train model
num_epochs = 50
train_losses_mlp , val_losses_mlp, train_accs_mlp, val_accs_mlp = run_training(model = model, optimizer = optimizer, loss_function = loss_function, device = device, num_epochs =num_epochs, train_dataloader = train_dataloader, val_dataloader = val_dataloader)

In [None]:
# TODO plot losses
plot(train_losses_mlp, val_losses_mlp, range(num_epochs), 'Epoch', 'Loss', ['Train', 'Validation'])


In [None]:
# TODO plot accuracies
plot(train_accs_mlp, val_accs_mlp, range(num_epochs), 'Epoch', 'Accuracy', ['Train', 'Validation'])

**TODO:** Did your network perform better or worse than the GradientBoostingClassifier on this dataset? Why? 


## Bonus tasks (optional)
* Try to use SGD instead of Adam as optimiser. What do you notice?
Here are different opinions on this topic:
  * https://codeahoy.com/questions/ds-interview/33/#:~:text=Adam%20tends%20to%20converge%20faster,converges%20to%20more%20optimal%20solutions.
  * https://shaoanlu.wordpress.com/2017/05/29/sgd-all-which-one-is-the-best-optimizer-dogs-vs-cats-toy-experiment/ 
  * https://datascience.stackexchange.com/questions/30344/why-not-always-use-the-adam-optimization-technique

* Try to make your MLP twice as deep. What do you notice? Why?
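
For the optimiser comparison, a self-contained sketch could look like the following. It trains the same small MLP once with SGD and once with Adam on a tiny synthetic problem; the data, network sizes, learning rates and the helper name `final_loss` are all assumptions made for illustration, not values prescribed by this practical.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

# Tiny synthetic binary problem; sizes are arbitrary.
X_demo = torch.randn(256, 8)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).long()

def final_loss(optimizer_name, epochs=50):
    """Train the same small MLP with the chosen optimiser and return the last loss."""
    torch.manual_seed(0)  # identical initial weights for both runs
    net = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
    if optimizer_name == "sgd":
        opt = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
    else:
        opt = torch.optim.Adam(net.parameters(), lr=0.001)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(net(X_demo), y_demo)
        loss.backward()
        opt.step()
    return loss.item()

losses = {name: final_loss(name) for name in ("sgd", "adam")}
print(losses)
```

In the practical itself, only the line that creates the optimiser would need to change; the training loop stays the same.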

## Advanced topic to read about:
**Tools which may be helpful for data exploration:**
* df.describe() - returns some basic statistics for your dataset - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html
* ydata-profiling (previously pandas-profiling) - generates an interactive data exploration report: basic statistics, nans, correlations between different features - https://github.com/ydataai/ydata-profiling
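
As a quick illustration of `df.describe()`, here is a small example on a made-up frame (the column names are hypothetical):

```python
import pandas as pd

# Toy numerical frame for illustration only.
toy = pd.DataFrame({
    "age": [21, 35, 48, 35],
    "engine_cc": [1200, 1600, 2000, 1600],
})

# One row per statistic (count, mean, std, min, quartiles, max), one column per feature.
summary = toy.describe()
print(summary)
```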

**Tree libraries**
* XGBoost - XGBoost stands for “Extreme Gradient Boosting”, where the term “Gradient Boosting” originates from the paper Greedy Function Approximation: A Gradient Boosting Machine, by Friedman. https://xgboost.readthedocs.io/en/stable/tutorials/model.html
* LightGBM - a gradient boosting framework from Microsoft that uses tree-based learning algorithms. It is an alternative to XGBoost and is designed to be distributed and efficient. https://lightgbm.readthedocs.io/en/v3.3.2/