# Assignment 2

## Instructions
- Your submission should be the `.ipynb` file with your name,
  like `YusufMesbah.ipynb`. It should include the answers to the questions in
  markdown cells.
- You are expected to follow the best practices for code writing and model
training. Poor coding style will be penalized.
- You are allowed to discuss ideas with your peers, but no sharing of code.
Plagiarism in the code will result in failing. If you use code from the
internet, cite it.
- If the instructions seem vague, use common sense.

# Task 1: ANN (30%)
For this task, you are required to build a fully connected feed-forward ANN model
for a multi-output regression problem.

For the given data, you need to do proper data preprocessing, design the ANN model,
then fine-tune your model architecture (number of layers, number of neurons,
activation function, learning rate, momentum, regularization).

For evaluating your model, use an $80/20$ train/test split.
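
One way to realize the 80/20 split is sketched below. It uses a small synthetic array as a stand-in for the real data (the values are made up), and it fits the scaler on the training portion only so that no test-set statistics leak into preprocessing. That ordering is a suggestion, not a requirement stated in this assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 100 samples, 4 features, 3 targets (hypothetical values).
rng = np.random.default_rng(seed=0)
features = rng.integers(0, 5, size=(100, 4)).astype(np.float32)
targets = rng.uniform(0, 100, size=(100, 3)).astype(np.float32)

# 80/20 split; random_state fixes the shuffle so the split is reproducible.
x_train, x_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.2, random_state=42
)

# Fit the scaler on the training features only, then apply it to both parts.
scaler = MinMaxScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

print(x_train_scaled.shape, x_test_scaled.shape)
```

Because the scaler never sees the test rows, scaled test values may fall slightly outside $[0, 1]$; that is expected.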

### Data
You will be working with the data in `Task 1.csv` for predicting students'
scores in 3 different exams: math, reading and writing. The columns include:
 - gender
 - race
 - parental level of education
 - lunch meal plan at school
 - whether the student undertook the test preparation course

In [8]:
!pip3 install -U scikit-image
!pip3 install torch torchvision
!pip3 install -U albumentations


In [13]:


import pandas as pd
import numpy as np

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import MinMaxScaler

import torch
from torch.utils.data import DataLoader, Dataset, Subset

In [14]:
# TODO: Implement task 1

df = pd.read_csv('Task 1.csv')

df = df.drop(['lunch'], axis=1)

df.head(5)

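|   | gender | race/ethnicity | parental level of education | test preparation course | math score | reading score | writing score |
|---|--------|----------------|-----------------------------|-------------------------|------------|---------------|---------------|
| 0 | male   | group A | high school        | completed | 67 | 67 | 63 |
| 1 | female | group D | some high school   | none      | 40 | 59 | 55 |
| 2 | male   | group E | some college       | none      | 59 | 60 | 50 |
| 3 | male   | group B | high school        | none      | 77 | 78 | 68 |
| 4 | male   | group E | associate's degree | completed | 78 | 73 | 68 |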


In [15]:
# Make sure there are no missing values before encoding.
assert not df.isnull().values.any()

# Encode each categorical column as integer category codes.
categorical_columns = [
    'gender',
    'test preparation course',
    'race/ethnicity',
    'parental level of education',
]
for column in categorical_columns:
    df[column] = df[column].astype('category').cat.codes

# Scale every column (features and targets) to [0, 1].
# Note: the scaler is fitted on the full data set, before any train/test split.
scaler = MinMaxScaler()
df = pd.DataFrame(scaler.fit_transform(df))

df.head()

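|   | 0   | 1    | 2   | 3   | 4        | 5        | 6        |
|---|-----|------|-----|-----|----------|----------|----------|
| 0 | 1.0 | 0.0  | 0.4 | 0.0 | 0.62069  | 0.547945 | 0.519481 |
| 1 | 0.0 | 0.75 | 1.0 | 1.0 | 0.310345 | 0.438356 | 0.415584 |
| 2 | 1.0 | 1.0  | 0.8 | 1.0 | 0.528736 | 0.452055 | 0.350649 |
| 3 | 1.0 | 0.25 | 0.4 | 1.0 | 0.735632 | 0.69863  | 0.584416 |
| 4 | 1.0 | 1.0  | 0.0 | 0.0 | 0.747126 | 0.630137 | 0.584416 |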


In [19]:
class MyDataset(Dataset):
    """Wraps a preprocessed DataFrame: columns 0-3 are features, columns 4-6 are targets."""

    def __init__(self, data_frame):
        x = data_frame.iloc[:, 0:4].values
        y = data_frame.iloc[:, 4:7].values

        self.x_train = torch.tensor(x, dtype=torch.float32)
        self.y_train = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.y_train)

    def __getitem__(self, idx):
        return self.x_train[idx], self.y_train[idx]



In [20]:
dataset = MyDataset(df)

# Inspect the first sample to verify the tensor shapes.
data, labels = dataset[0]
print(data.shape, labels.shape)
print(data, labels)

torch.Size([4]) torch.Size([3])
tensor([1.0000, 0.0000, 0.4000, 0.0000]) tensor([0.6207, 0.5479, 0.5195])
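
With samples of 4 features and 3 targets, as the shapes above show, a fully connected regressor and its training loop could be sketched as follows. The layer widths, optimizer settings, epoch count, and the random stand-in tensors are all assumptions chosen for illustration, not tuned values.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random stand-in data with the same shapes as the real samples (4 inputs, 3 targets).
inputs = torch.rand(64, 4)
targets = torch.rand(64, 3)
loader = DataLoader(TensorDataset(inputs, targets), batch_size=16, shuffle=True)

# Hypothetical architecture: two hidden layers; the widths are placeholders to tune.
model = nn.Sequential(
    nn.Linear(4, 32),
    nn.ReLU(),
    nn.Linear(32, 16),
    nn.ReLU(),
    nn.Linear(16, 3),
)

loss_fn = nn.MSELoss()
# weight_decay adds L2 regularization; all three values are untuned assumptions.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4
)

epoch_losses = []
for epoch in range(5):
    model.train()
    running_loss = 0.0
    for batch_inputs, batch_targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(batch_inputs), batch_targets)
        loss.backward()
        optimizer.step()
        # Weight each batch loss by its size to get a per-sample mean per epoch.
        running_loss += loss.item() * batch_inputs.size(0)
    epoch_losses.append(running_loss / len(loader.dataset))

print(epoch_losses)
```

In the notebook, `loader` would wrap the training subset of `MyDataset` instead of random tensors.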


### Questions
1. What preprocessing techniques did you use? Why?
    - *Answer*
2. Describe the fine-tuning process and how you reached your model architecture.
    - *Answer*

# Task 2: CNN (40%)
For this task, you will be doing image classification:
- First, adapt your best model from Task 1 to work on this task, and
fit it on the new data. Then, evaluate its performance.
- After that, build a CNN model for image classification.
- Compare both models in terms of accuracy, number of parameters and speed of
inference (the time the model takes to predict 50 samples).
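
The comparison criteria above (number of parameters and inference time on 50 samples) can be measured with two small helpers, sketched here. The helper names, the toy model, and the number of timing repeats are hypothetical; accuracy is omitted because it depends on how the labels are encoded.

```python
import time

import torch
from torch import nn


def count_parameters(model: nn.Module) -> int:
    """Return the number of trainable parameters in the model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def time_inference(model: nn.Module, samples: torch.Tensor, repeats: int = 10) -> float:
    """Return the mean wall-clock time in seconds for one forward pass over samples."""
    model.eval()
    with torch.no_grad():
        model(samples)  # warm-up pass, excluded from the timing
        start = time.perf_counter()
        for _ in range(repeats):
            model(samples)
        elapsed = time.perf_counter() - start
    return elapsed / repeats


# Hypothetical stand-in model and a batch of 50 flattened samples.
toy_model = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 3))
batch = torch.rand(50, 20)

print(count_parameters(toy_model))  # 20*8 + 8 + 8*3 + 3 = 195
print(time_inference(toy_model, batch))
```

Averaging over several repeats after a warm-up pass reduces noise from one-off initialization costs; both models should be timed on the same device.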

For the given data, you need to do proper data preprocessing and augmentation,
and set up data loaders.
Then fine-tune your model architecture (number of layers, number of filters,
activation function, learning rate, momentum, regularization).
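
As a starting point for the architecture search, a small CNN might look like the sketch below. Several things here are assumptions rather than facts about the data set: single-channel input, an 84x84 image size for the dummy batch, and treating the label as three separate 10-way digit classifications. The class name `TripleDigitCNN` and all layer sizes are hypothetical.

```python
import torch
from torch import nn


class TripleDigitCNN(nn.Module):
    """Small CNN that predicts three digits, one 10-way classification per position."""

    def __init__(self, num_digits: int = 3, num_classes: int = 10):
        super().__init__()
        self.num_digits = num_digits
        self.num_classes = num_classes
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            # Adaptive pooling keeps the classifier independent of the input size.
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p=0.3),
            nn.Linear(32 * 4 * 4, num_digits * num_classes),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        logits = self.classifier(self.features(images))
        # Output shape: (batch, digit position, class).
        return logits.view(-1, self.num_digits, self.num_classes)


model = TripleDigitCNN()
dummy_images = torch.rand(8, 1, 84, 84)      # assumed grayscale 84x84 input
dummy_labels = torch.randint(0, 10, (8, 3))  # one digit label per position

logits = model(dummy_images)
# CrossEntropyLoss expects (N, C), so batch and position are flattened together.
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 10), dummy_labels.reshape(-1))
predicted_digits = logits.argmax(dim=2)
print(logits.shape, predicted_digits.shape, loss.item())
```

An alternative design is a single 1000-way output over all 3-digit numbers; the per-position heads used here keep the output layer much smaller.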

### Data
You will be working with the data in `triple_mnist.zip` for predicting 3-digit
numbers written in the image. Each image contains 3 digits similar to the
following example (whose label is `039`):

![example](https://github.com/shaohua0116/MultiDigitMNIST/blob/master/asset/examples/039/0_039.png?raw=true)

In [None]:
# TODO: Implement task 2

### Questions
1. What preprocessing techniques did you use? Why?
    - *Answer*
2. What data augmentation techniques did you use?
    - *Answer*
3. Describe the fine-tuning process and how you reached your final CNN model.
    - *Answer*

# Task 3: Decision Trees and Ensemble Learning (15%)

For the `loan_data.csv` data, predict if the bank should give a loan or not.
You need to do the following:
- Fine-tune a decision tree on the data
- Fine-tune a random forest on the data
- Compare their performance
- Visualize your DT and one of the trees from the RF

For evaluating your models, use an $80/20$ train/test split.
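
The workflow above (tune a decision tree, tune a random forest, compare, visualize) could be sketched as follows. The data is synthetic and stands in for `loan_data.csv`; the parameter grids are deliberately tiny and hypothetical, and `export_text` is used as a text-only alternative to a plotted tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic binary classification data as a stand-in for the loan table.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Small illustrative grids; the real search space is up to you.
tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]},
    cv=3,
)
forest_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [20, 50], "max_depth": [4, None]},
    cv=3,
)
tree_search.fit(X_train, y_train)
forest_search.fit(X_train, y_train)

# Compare the tuned models on the held-out 20%.
tree_accuracy = accuracy_score(y_test, tree_search.predict(X_test))
forest_accuracy = accuracy_score(y_test, forest_search.predict(X_test))
print(tree_search.best_params_, tree_accuracy)
print(forest_search.best_params_, forest_accuracy)

# Text rendering of the tuned tree and of the first tree in the tuned forest.
print(export_text(tree_search.best_estimator_, max_depth=2))
print(export_text(forest_search.best_estimator_.estimators_[0], max_depth=2))
```

For a graphical view, `sklearn.tree.plot_tree` accepts the same fitted estimators. Accuracy is only one possible metric; if the loan classes are imbalanced, a different metric may be more informative.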

### Data
- `credit.policy`: Whether the customer meets the credit underwriting criteria.
- `purpose`: The purpose of the loan.
- `int.rate`: The interest rate of the loan.
- `installment`: The monthly installments owed by the borrower if the loan is funded.
- `log.annual.inc`: The natural logarithm of the self-reported annual income of the borrower.
- `dti`: The debt-to-income ratio of the borrower.
- `fico`: The FICO credit score of the borrower.
- `days.with.cr.line`: The number of days the borrower has had a credit line.
- `revol.bal`: The borrower's revolving balance.
- `revol.util`: The borrower's revolving line utilization rate.

In [None]:
# TODO: Implement task 3

### Questions
1. How did the DT compare to the RF in performance? Why?
    - *Answer*
2. After fine-tuning, how does the max depth in DT compare to RF? Why?
    - *Answer*
3. What is ensemble learning? What are its pros and cons?
    - *Answer*
4. Briefly explain 2 types of boosting methods and 2 types of bagging methods.
Which of these categories does RF fall under?
    - *Answer*

# Task 4: Domain Gap (15%)

Evaluate your CNN model from Task 2 on SVHN data without retraining your model.
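
Evaluating without retraining usually still requires the new images to match the input format the model expects. The sketch below assumes the CNN takes single-channel images of a fixed size (84x84 is an assumed value) and that the new images arrive as RGB tensors in $[0, 1]$. The helper names and the toy classifier are hypothetical, and random tensors stand in for real SVHN images.

```python
import torch
import torch.nn.functional as F
from torch import nn


def to_model_input(rgb_images: torch.Tensor, size: int = 84) -> torch.Tensor:
    """Convert (N, 3, H, W) RGB images to (N, 1, size, size) grayscale images."""
    # ITU-R BT.601 luma weights for the R, G and B channels.
    weights = torch.tensor([0.299, 0.587, 0.114]).view(1, 3, 1, 1)
    gray = (rgb_images * weights).sum(dim=1, keepdim=True)
    return F.interpolate(gray, size=(size, size), mode="bilinear", align_corners=False)


@torch.no_grad()
def accuracy(model: nn.Module, images: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of samples whose arg-max prediction matches the label."""
    model.eval()
    predictions = model(images).argmax(dim=1)
    return (predictions == labels).float().mean().item()


# Random stand-ins: 4 RGB images of 32x32 pixels with single-digit labels.
fake_rgb = torch.rand(4, 3, 32, 32)
fake_labels = torch.randint(0, 10, (4,))

# Toy single-digit classifier, used only to exercise the helpers.
toy_classifier = nn.Sequential(nn.Flatten(), nn.Linear(84 * 84, 10))

converted = to_model_input(fake_rgb)
print(converted.shape, accuracy(toy_classifier, converted, fake_labels))
```

The `accuracy` helper here handles one label per image; how predictions are matched to labels for a 3-digit model is a design decision left open.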

In [None]:
# TODO: Implement task 4

### Questions
1. How did your model perform? Why is it better/worse?
    - *Answer*
2. What is domain gap in the context of ML?
    - *Answer*
3. Suggest two ways through which the problem of domain gap can be tackled.
    - *Answer*