# Assignment 2

## Instructions
- Your submission should be the `.ipynb` file with your name,
  like `YusufMesbah.ipynb`. It should include the answers to the questions in
  markdown cells.
- You are expected to follow the best practices for code writing and model
training. Poor coding style will be penalized.
- You are allowed to discuss ideas with your peers, but no sharing of code.
Plagiarism in the code will result in failing. If you use code from the
internet, cite it.
- If the instructions seem vague, use common sense.

# Task 1: ANN (30%)
For this task, you are required to build a fully connected feed-forward ANN model
for a multi-label regression problem.

For the given data, you need to do proper data preprocessing, design the ANN model,
then fine-tune your model architecture (number of layers, number of neurons,
activation function, learning rate, momentum, regularization).

For evaluating your model, do $80/20$ train test split.

### Data
You will be working with the data in `Task 1.csv` for predicting students'
scores in 3 different exams: math, reading and writing. The columns include:
 - gender
 - race
 - parental level of education
 - lunch meal plan at school
 - whether the student undertook the test preparation course

In [None]:
#Installations

#!pip install torch
#!pip install torchvision
#!pip install ray[all]

In [None]:
#Imports

#Numpy: Used for getting random values for hyper-parameter tuning
import numpy as np

#Pandas: Used for loading the dataset
import pandas as pd

#Sklearn: Used for data preparation 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

#Pytorch: Used for ANN
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import random_split, Dataset, DataLoader
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

#Ray: Used for hyper-parameter tuning
from ray import tune
from ray.tune import CLIReporter

from functools import partial

In [None]:
#Get the dataset

df = pd.read_csv("Task 1.csv")
print(f'number of entries: {len(df)}')
print(f'columns: {[column for column in df.columns]}')
features = ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']
targets = ['math score', 'reading score', 'writing score']
print(f"features: {features}")
print(f"targets: {targets}")
for feature in features:
  values = df[feature].unique()
  print(f'{feature}: {values}')

In [None]:
#Data preparation.
#References: Lab 2, Assignment 1 submission

#Features that have no natural ordered relationship
onehot_features = {
    'gender': ['male', 'female'],
    'race/ethnicity': ['group A', 'group B', 'group C', 'group D', 'group E'] 
}
#Features that have a natural ordered relationship
ordinal_features = {
    'parental level of education': ['some high school', 'high school', 'some college', "associate's degree", "bachelor's degree", "master's degree"],
    'test preparation course': ['none', 'completed'],
    'lunch': ['standard', 'free/reduced']
}
#note that test preparation course and lunch features will be represented in the same way whether they get encoded ordinally or with one hot method

#Encoding categorical data
def ordinal_encoder(df, feature_name, categories):
  old_column = df[feature_name]
  old_column = np.array(old_column).reshape(-1,1)
  encoder = OrdinalEncoder(categories=[categories])
  new_column = encoder.fit_transform(old_column)
  new_column = pd.DataFrame(data=new_column, columns=[feature_name])
  new_df = df.drop(feature_name, axis=1)
  new_df = pd.concat([new_df, new_column], axis=1)
  return new_df

def onehot_encoder(df, features_names):
  encoder = OneHotEncoder(sparse_output=False, drop='first') #encoder model imported from sklearn; sparse_output replaces the `sparse` argument in scikit-learn >= 1.2
  new_columns = encoder.fit_transform(df[features_names])
  new_columns = pd.DataFrame(new_columns, dtype=int, columns=encoder.get_feature_names_out(features_names))
  new_df = df.drop(features_names, axis=1)
  new_df = pd.concat([new_df, new_columns], axis=1)   

  return new_df

def categorical_encoder(df, ordinal_features, onehot_features):
  new_df = df
  for key, val in ordinal_features.items():
    new_df = ordinal_encoder(new_df, key, val)
  onehot_features_names = []
  for key, val in onehot_features.items():
    onehot_features_names.append(key)
  new_df = onehot_encoder(new_df, onehot_features_names)
  return new_df

df = categorical_encoder(df=df, ordinal_features=ordinal_features, onehot_features=onehot_features)

print(f'columns: {[column for column in df.columns]}')

In [None]:
#Splitting the dataset

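#Select columns by name so the split does not depend on column order
X = df.drop(columns=targets).values #Features: every encoded column except the targets
y = df[targets].values #Targets: math, reading and writing scores

#split 80/20; the fixed random_state (an arbitrary value) makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)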

In [None]:
#Scaling

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
#Loading Custom Dataset in pytorch
#Reference: Self Practice 7

class CustumData(Dataset):
  def __init__(self, X, y):
    super().__init__()
    self.y = torch.tensor(y).float()
    self.X = torch.tensor(X).float()

  def __len__(self):
    return len(self.X)

  def __getitem__(self, idx):
    return self.X[idx, :], self.y[idx]


In [None]:
#NN Model
#Reference: Lab 7, https://pytorch.org/docs/stable/generated/torch.nn.ModuleList.html, https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html#torch.nn.Sequential

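class Net(nn.Module):
  def __init__(self, units = 64, layers = 10, activation = nn.ReLU()):
    super(Net, self).__init__()
    #8 = number of input features after encoding
    self.input = nn.Sequential(
      nn.Linear(8 ,units),
      activation
    )
    #Create a separate Linear for every hidden layer. Repeating one module
    #object in the list would make all hidden layers share the same weights.
    self.linears = nn.ModuleList([
      nn.Sequential(nn.Linear(units, units), activation)
      for _ in range(layers)
    ])
    self.output = nn.Linear(units, 3)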
    
  def forward(self, x):  
    x = self.input(x)
    for layer in self.linears:
      x = layer(x)
    x = self.output(x)
    return x


In [None]:
#RMSE loss 
#Reference: https://stackoverflow.com/questions/61990363/rmse-loss-for-multi-output-regression-problem-in-pytorch, https://gist.github.com/jamesr2323/33c67ba5ac29880171b63d2c7f1acdc5
class RMSELoss(torch.nn.Module):
    def __init__(self):
        super(RMSELoss,self).__init__()

    def forward(self,x,y):
        criterion = nn.MSELoss()
        eps = 1e-9
        loss = torch.sqrt(criterion(x, y) + eps)
        return loss


In [None]:
#Training and Hyper-parameter tuning
#Reference: https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html

def train(config):
  activation_dict = {
    "relu": nn.ReLU(),
    "leaky": nn.LeakyReLU(),
    "elu": nn.ELU(),
    "hardswish": nn.Hardswish(),
    "selu": nn.SELU(),
    "silu": nn.SiLU()
  }
  
  net = Net(config["units"], config["layers"], activation_dict[config["activation"]])
    
  device = "cpu"
  if torch.cuda.is_available():
    device = "cuda"
  
  net.to(device)

  crit_dict = {
    "l1": nn.L1Loss(),
    "rmse": RMSELoss()
  }

  criterion = crit_dict[config['loss']]
  
  #  "optim":  tune.choice(["sgd", "adam", "adadelta", "adagrad", "sparseadam", "adamw", "asgd", "lbfgs", "nadam", "radam", "rmsprop", "rprop"]),
  optim_dict = {
    "sgd": optim.SGD(net.parameters(), lr=config["lr"], momentum=config["momentum"]),
    "adam": optim.Adam(net.parameters(), lr=config['lr']),
    "adadelta": optim.Adadelta(net.parameters(), lr=config['lr']),
    "adagrad": optim.Adagrad(net.parameters(), lr=config['lr']),
    "adamw": optim.AdamW(net.parameters(), lr=config['lr']),
    "asgd": optim.ASGD(net.parameters(), lr=config['lr']),
    "nadam": optim.NAdam(net.parameters(), lr=config['lr']),
    "radam": optim.RAdam(net.parameters(), lr=config['lr']),
    "rmsprop": optim.RMSprop(net.parameters(), lr=config['lr'], momentum=config["momentum"]),
    "rprop": optim.Rprop(net.parameters(), lr=config['lr'])
  }
  optimizer = optim_dict[config["optim"]]

  trainset = CustumData(X_train, y_train)
  testset = CustumData(X_test, y_test) 

  trainloader = DataLoader(trainset, batch_size=int(config["batch_size"]), shuffle=True)
  testloader = DataLoader(testset, batch_size=int(config["batch_size"]), shuffle=True)
  
  #Reduce learning rate when a metric has stopped improving
  schedular = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=5, cooldown=5)
    
  min_loss = 1e18
  
  for epoch in range(200):  # loop over the dataset multiple times
    #Training
    net.train()
    train_loss = 0.0
    for inputs, labels in trainloader:
      inputs, labels = inputs.to(device).float(), labels.to(device)

      optimizer.zero_grad()
    
      outputs = net(inputs)
            
      loss = criterion(outputs, labels)
            
      loss.backward()
      optimizer.step()
      train_loss += loss.item() * len(inputs)
        
    schedular.step(train_loss / len(trainloader.dataset))
    
    #Validation/Testing
    net.eval()
    test_loss = 0.0
    for inputs, labels in testloader:
      with torch.no_grad():
        inputs, labels = inputs.to(device).float(), labels.to(device)

        outputs = net(inputs)

        loss = criterion(outputs, labels)
                
        test_loss += loss.item() * len(inputs)
        
    epoch_loss = test_loss/len(testloader.dataset)
    if epoch_loss < min_loss:
      min_loss = epoch_loss

  tune.report(loss=min_loss)
  print("Finished Training")

In [None]:
#Main

def main_task1(num_samples=20, max_num_epochs=10):
  config = {
    "units": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),
    "layers": tune.sample_from(lambda _: np.random.randint(1, 30)),
    "activation": tune.choice(["relu", "leaky", "elu", "hardswish", "selu", "silu"]),
    "optim":  tune.choice(["sgd", "adam", "adadelta", "adagrad", "adamw", "asgd",  "nadam", "radam", "rmsprop", "rprop"]),
    "loss": tune.choice(["l1", "rmse"]),
    "lr": tune.loguniform(1e-5, 1e-1),
    "momentum": tune.sample_from(lambda _: np.random.uniform(0,1.0)),
    "batch_size": tune.sample_from(lambda _: 2 ** np.random.randint(5, 10)),
  }
  reporter = CLIReporter(
    metric_columns=["loss"])
  result = tune.run(
    train,
    config=config,
    num_samples=num_samples,
    progress_reporter=reporter
  )
  best_trial = result.get_best_trial("loss", "min", "last")
  print("Best trial config: {}".format(best_trial.config))
  print("Best trial final validation loss: {}".format(
    best_trial.last_result["loss"]))
    
main_task1(num_samples=200, max_num_epochs=200)

### Questions
1. What preprocessing techniques did you use? Why?
    - Encoding:
      - OneHotEncoding
        - Features: 
          - gender 
          - race/ethnicity 
        - Reasoning: 
          - Gender and race have no natural order, so one-hot encoding is the appropriate choice here.
      - OrdinalEncoding:
        - Features: 
          - parental level of education  
          - test preparation course 
          - lunch 
        - Reasoning: 
          - parental level of education: there's a natural order from uneducated to educated.
          - Order:
            - some high school -> 0  
            - high school -> 1 
            - some college -> 2 
            - associate's degree -> 3 
            - bachelor's degree -> 4 
            - master's degree -> 5 
          - lunch: there's a natural order in terms of what is better, free/reduced or full priced. 
          - Order:
            - standard -> 0 
            - free/reduced -> 1 
          - test preparation course: there's a natural order in terms of who is prepared and not.
          - Order:
            - none -> 0 
            - completed -> 1 
      - Note: For the lunch and test preparation course features, OneHotEncoding and OrdinalEncoding give an equivalent representation because each feature has only 2 possible values.
    - Scaling
      - Used StandardScaler to scale the features. It was fit on the training split only and then applied to both splits.
      - Reasoning: to put all features on a comparable scale, instead of letting the ordinally encoded feature with the widest range (parental level of education, 0 to 5) carry more weight than the others.
2. Describe the fine-tuning process and how you reached your model architecture.
    - Fine-tuning: 
      - Used Ray Tune for hyperparameter tuning
        - Reference: https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html
      - Decided to tune the following:
        - Number of hidden layers:
          - Possible values: 
            - random integer between 1 and 29 inclusive (`np.random.randint(1, 30)` excludes the upper bound)
        - Number of neurons in hidden layers:
          - Possible values: 
            - 2^i where i is a random integer between 2 and 8 inclusive, i.e. 4 to 256
        - Learning rate:
          - Possible values: 
            - random float between 0.00001 and 0.1, sampled log-uniformly
        - Momentum (only passed to SGD and RMSprop):
          - Possible values: 
            - random float in [0.0, 1.0)
        - Activation functions:
          - Possible values: 
            - ReLU
            - LeakyReLU
            - Hardswish
            - ELU
            - SELU
            - SiLU
        - Optimizers:
          - Possible values:
            - SGD
            - Adam
            - Adadelta
            - Adagrad
            - AdamW
            - ASGD
            - NAdam
            - RAdam
            - RMSprop
            - Rprop
        - Loss functions:
          - Possible values:
            - L1Loss
            - RMSELoss
        - Batch size:
          - Possible values:
            - 2^i where i is a random integer between 5 and 9 inclusive, i.e. 32 to 512
      - Also used ReduceLROnPlateau to halve the lr when the training loss stops improving between epochs
      - Reference: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.ReduceLROnPlateau.html
    - After running 200 samples, each trained for 200 epochs, the best parameters found by the hyperparameter tuning are:
      - Number of hidden layers: 1 
      - Number of neurons: 32
      - Learning rate: 0.025644775234438002
      - Momentum: 0.2103397790313165
      - Activation function: SiLU
      - Optimizer: ASGD
      - Loss function: L1Loss
      - Batch size: 64
    - Which produced validation loss equal to 10.603941497802735
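
The bounds of the integer search spaces can be checked empirically. The sketch below repeats the three `np.random.randint` expressions from `main_task1` outside of Ray. The seed and the number of draws are arbitrary choices made here, not part of the original run.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the sketch is repeatable
n_draws = 10_000   # arbitrary; large enough to hit every possible value

# np.random.randint(low, high) samples from the half-open interval [low, high)
layers = np.random.randint(1, 30, size=n_draws)
units = 2 ** np.random.randint(2, 9, size=n_draws)
batch_sizes = 2 ** np.random.randint(5, 10, size=n_draws)

print("hidden layers:", layers.min(), "to", layers.max())
print("units per layer:", units.min(), "to", units.max())
print("batch size:", batch_sizes.min(), "to", batch_sizes.max())
```

Because the upper bound is exclusive, the largest exponent actually sampled is one less than the second argument.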

# Task 2: CNN (40%)
For this task, you will be doing image classification:
- First, adapt your best model from Task 1 to work on this task, and
fit it on the new data. Then, evaluate its performance.
- After that, build a CNN model for image classification.
- Compare both models in terms of accuracy, number of parameters and speed of
inference (the time the model takes to predict 50 samples).
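
A minimal sketch of how the last two comparison criteria could be measured, assuming CPU inference. The helper names `count_parameters` and `time_inference`, the toy model and the 1x64x64 input shape are placeholders introduced here, not part of the assignment.

```python
import time

import torch
import torch.nn as nn


def count_parameters(model: nn.Module) -> int:
    """Number of trainable parameters in the model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def time_inference(model: nn.Module, batch: torch.Tensor, repeats: int = 10) -> float:
    """Average wall-clock seconds for one forward pass over `batch` (CPU timing)."""
    model.eval()
    with torch.no_grad():
        model(batch)  # warm-up pass, excluded from the measurement
        start = time.perf_counter()
        for _ in range(repeats):
            model(batch)
        elapsed = time.perf_counter() - start
    return elapsed / repeats


# Hypothetical stand-ins: a tiny fully connected model and 50 random "images"
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 32), nn.SiLU(), nn.Linear(32, 10))
batch = torch.randn(50, 1, 64, 64)

n_params = count_parameters(toy_model)
seconds = time_inference(toy_model, batch)
print(f"parameters: {n_params}")
print(f"seconds per 50 samples: {seconds:.6f}")
```

On a GPU, CUDA kernels launch asynchronously, so `torch.cuda.synchronize()` would be needed before each clock reading for the timing to be meaningful.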

For the given data, you need to do proper data preprocessing and augmentation,
and set up data loaders.
Then fine-tune your model architecture (number of layers, number of filters,
activation function, learning rate, momentum, regularization).

### Data
You will be working with the data in `triple_mnist.zip` for predicting 3-digit
numbers written in the image. Each image contains 3 digits similar to the
following example (whose label is `039`):

![example](https://github.com/shaohua0116/MultiDigitMNIST/blob/master/asset/examples/039/0_039.png?raw=true)
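
One possible way to frame the target (an assumption, the task does not prescribe it) is as three separate 10-class digit predictions instead of a single 1000-class prediction. The hypothetical helpers below convert a label string such as `039` to per-digit class indices and back.

```python
def label_to_digits(label):
    """Split a 3-character label such as '039' into digit classes [0, 3, 9]."""
    if len(label) != 3 or any(ch not in "0123456789" for ch in label):
        raise ValueError(f"expected a 3-digit label, got {label!r}")
    return [int(ch) for ch in label]


def digits_to_label(digits):
    """Join three predicted digit classes back into a label string."""
    return "".join(str(int(d)) for d in digits)


print(label_to_digits("039"))
print(digits_to_label([0, 3, 9]))
```

Keeping the label as a string until this step preserves the leading zero, which would be lost if `039` were parsed as an integer first.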

In [1]:
#Imports

#Numpy: Used for getting random values for hyper-parameter tuning
import numpy as np

#Pandas: Used for loading the dataset
import pandas as pd

#Sklearn: Used for data preparation 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

#Pytorch: Used for ANN
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import random_split, Dataset, DataLoader
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

#Ray: Used for hyper-parameter tuning
from ray import tune
from ray.tune import CLIReporter

from functools import partial

import zipfile

In [2]:
#Get the dataset
#Reference: https://stackoverflow.com/questions/3451111/unzipping-files-in-python
with zipfile.ZipFile('triple_mnist.zip', 'r') as zip_ref:
    zip_ref.extractall('task2/') #relative path; extracting to the filesystem root ('/task2/') raised PermissionError


In [None]:
#Loading Custom Dataset in pytorch
#Reference: Self Practice 7

class CustumData(Dataset):
  def __init__(self, X, y):
    super().__init__()
    self.y = torch.tensor(y).float()
    self.X = torch.tensor(X).float()

  def __len__(self):
    return len(self.X)

  def __getitem__(self, idx):
    return self.X[idx, :], self.y[idx]


### Questions
1. What preprocessing techniques did you use? Why?
    - *Answer*
2. What data augmentation techniques did you use?
    - *Answer*
3. Describe the fine-tuning process and how you reached your final CNN model.
    - *Answer*

# Task 3: Decision Trees and Ensemble Learning (15%)

For the `loan_data.csv` data, predict if the bank should give a loan or not.
You need to do the following:
- Fine-tune a decision tree on the data
- Fine-tune a random forest on the data
- Compare their performance
- Visualize your DT and one of the trees from the RF

For evaluating your models, do $80/20$ train test split.
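
The steps above could be sketched as follows with scikit-learn. This is only an outline under stated assumptions: the data is a synthetic stand-in generated with `make_classification` (not `loan_data.csv`), the parameter grids are illustrative, and `export_text` is used as a dependency-free alternative to a graphical tree plot.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the loan data (assumption: features are numeric after encoding)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fine-tune a single decision tree with cross-validated grid search
dt_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 20]},
    cv=5,
    scoring="accuracy",
)
dt_search.fit(X_train, y_train)

# Fine-tune a random forest the same way (smaller grid to keep the sketch fast)
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [4, 8, None]},
    cv=3,
    scoring="accuracy",
)
rf_search.fit(X_train, y_train)

# Compare both tuned models on the held-out 20%
dt_acc = accuracy_score(y_test, dt_search.best_estimator_.predict(X_test))
rf_acc = accuracy_score(y_test, rf_search.best_estimator_.predict(X_test))
print(f"DT: params={dt_search.best_params_}, test accuracy={dt_acc:.3f}")
print(f"RF: params={rf_search.best_params_}, test accuracy={rf_acc:.3f}")

# Text rendering of the tuned DT and of the first tree inside the tuned RF
dt_text = export_text(dt_search.best_estimator_, max_depth=2)
rf_tree_text = export_text(rf_search.best_estimator_.estimators_[0], max_depth=2)
print(dt_text)
print(rf_tree_text)
```

The tuned `max_depth` values in `best_params_` are what question 2 below asks to compare.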

### Data
- `credit.policy`: Whether the customer meets the credit underwriting criteria.
- `purpose`: The purpose of the loan.
- `int.rate`: The interest rate of the loan.
- `installment`: The monthly installments owed by the borrower if the loan is funded.
- `log.annual.inc`: The natural logarithm of the self-reported annual income of the borrower.
- `dti`: The debt-to-income ratio of the borrower.
- `fico`: The FICO credit score of the borrower.
- `days.with.cr.line`: The number of days the borrower has had a credit line.
- `revol.bal`: The borrower's revolving balance.
- `revol.util`: The borrower's revolving line utilization rate.

In [None]:
# TODO: Implement task 3

### Questions
1. How did the DT compare to the RF in performance? Why?
    - *Answer*
2. After fine-tuning, how does the max depth in DT compare to RF? Why?
    - *Answer*
3. What is ensemble learning? What are its pros and cons?
    - *Answer*
4. Briefly explain 2 types of boosting methods and 2 types of bagging methods.
Which of these categories does RF fall under?
    - *Answer*

# Task 4: Domain Gap (15%)

Evaluate your CNN model from task 2 on SVHN data without retraining your model.
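
Evaluating without retraining means running the trained network in inference mode only, with no optimizer step. The sketch below shows such a loop. The function name `evaluate_accuracy`, the linear stand-in model and the random tensors are placeholders introduced here. Real SVHN images are 32x32 RGB, so they would first need to be converted to whatever input format the Task 2 CNN expects.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate_accuracy(model, loader, device="cpu"):
    """Classification accuracy of `model` over `loader`; no weights are updated."""
    model.to(device)
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total


# Placeholder data and model standing in for SVHN and the trained CNN
images = torch.randn(20, 1, 32, 32)
labels = torch.randint(0, 10, (20,))
loader = DataLoader(TensorDataset(images, labels), batch_size=8)
stand_in_model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 10))

accuracy = evaluate_accuracy(stand_in_model, loader)
print(f"accuracy: {accuracy:.3f}")
```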

In [None]:
# TODO: Implement task 4

### Questions
1. How did your model perform? Why is it better/worse?
    - *Answer*
2. What is domain gap in the context of ML?
    - *Answer*
3. Suggest two ways through which the problem of domain gap can be tackled.
    - *Answer*