# Build Your First Machine Learning Model!

Welome to the **Machine Learning** programming assignment of the **Artificial Intelligence** course.

In this assignment, a Jupyter Notebook is prepared for examining almost all critical factors in designing a Machine Leaning model for a supervised learning task.

You will complete this notebook to build a ML model, and apply it to a binary classification problem. Additionally you will get familiar with varoius factors which may enhance the performance of a ML model.

In this notebook, we will take advantage of four principal python libraries for data science and machine learning tasks.

 - [Numpy](https://numpy.org): The fundamental package for **scientific computing** with Python.
 - [Pandas](https://pandas.pydata.org): An open source **data analysis and manipulation tool**, built on top of the Python programming language.
 - [Matplotlib](https://matplotlib.org) : A comprehensive library for **creating static, animated, and interactive visualizations** in Python.
 - [Scikit-Learn](https://scikit-learn.org): Simple and efficient tools for predictive data analysis and building Machine Learning models.
 
**After this assignment you will be able to:**
1. Run Exploratory Data Analysis (EDA) and know how to prepare data for a predictive model.
2. Build and apply a Machine Learning model for a supervised learning problem using known frameworks.
 
 
 **Before you start:** Please read the ***Submission*** section at the bottom of the notebook carefully.
 
 Let's get started!



# 0. User UI Preference - Introduction

<p align="center">
  <img src="images/UI.png" width=1000>
</p>

The **User Interface (UI)** is the graphical layout of an application. It consists of the buttons users click on, the text they read, the images, sliders, text entry fields, and all the rest of the items the user interacts with. This includes screen layout, transitions, interface animations and every single micro-interaction. Any sort of visual element, interaction, or animation must all be designed.

An ecommerce company has changed the UI of it's website recently but they do not know how it is possible to evaluate their new UI. Do customers like it? Do they prefer the new UI or the old one?

They have also collected feedbacks from their customers based on some pre-organized surveys. You as Machine Learning engineers are aksed to **build a Machine Learning Model so to identify the User UI preference on the basis of their UI engagement information.**

## Data

There are two files that you need to consider. *(i)* `train.csv` for training, and *(ii)* `test.csv` for testing purposes. They consist 14 columns with the following description:


| Column name | Description |
| ------------- | ------------- |
| **CustomerID** | Represents a unique identification of a user |
| **Age** | Represents the age of the user |
| **City** | Represents the city in which the user lives |
| **State** | Represents the state in which the user lives |
| **No_of_orders_placed** | Represents the total number of orders placed by a customer |
| **Last order placed_date** | Represents the last date when the customer placed the order |
| **is premium_member** | Represents whether a customer is a premium member or not. ( 0 or 1) |
| **Women's Clothing** | Represents user's engagement score in Women's_Clothing section ( 0 to 10 ) |
| **Men's Clothing** | Represents user's engagement score in Men's Clothing section( 0 to 10 ) |
| **Kid's Clothing** | Represents user's engagement score in Kid's Clothing section ( 0 to 10 ) |
| **Home &_Living** | Represents user's engagement score in Home_&_Living section ( 0 to 10 ) |
| **Beauty** | Represents user's engagement score in Beauty products section ( 0 to 10 ) |
| **Electronics** | Represents user's engagement score in Electronics products section( 0 to 10 ) |
| **Preferred_Theme** | Represents the preferred theme ( Old_UI or New_UI) |


## Packages
Let's first import all the packages that you will need during this assignment.

In [58]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from scipy.stats import mode
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier 

# 1. Data preprocessing


What's Data preprocessing? **Data preprocessing** is an integral step in Machine Learning as the quality of data and the useful information that can be derived from it directly affects the ability of our model to learn; therefore, it is extremely important that we preprocess our data before feeding it into our model.

Some of the main preprocessing steps are:

  - **Handling missing values**: impute missing (NaN) values
  - **Standardization**: normalize our data for a more quick convergence of the model
  - **Handling Categorical Variables**: one-hot encoding is an appropirate solution
  - ...
  
In this assignment, you will apply some of these methods step by step.

In [2]:
# load the dataset into a dataframe
train_df = pd.read_csv('data/train.csv', index_col=0)
test_df = pd.read_csv('data/test.csv', index_col=0)

print(f'Number of training samples: {len(train_df)}')
print(f'Number of test samples: {len(test_df)}')
train_df.head(10)

Number of training samples: 10605
Number of test samples: 4545


Unnamed: 0,CustomerID,Age,Gender,City,State,No_of_orders_placed,Sign_up_date,Last_order_placed_date,is_premium_member,Women’s_Clothing,Men’s_Clothing,Kid’s_Clothing,Home_&_Living,Beauty,Electronics,Preferred_Theme
0,CusID_00685,19.0,Not_Specified,Bercelona,Singapore,14.0,2017-01-17,2020-09-19,1,3.445937,2.620136,5.4575,6.141905,8.10649,4.389933,Old_UI
1,CusID_06121,31.0,Male,Sydney,New South Wales,10.0,2016-01-22,2021-12-09,0,1.320304,9.025863,6.378695,2.825636,0.97743,7.789076,New_UI
2,CusID_09847,,Not_Specified,Toronto,Ontario,12.0,2019-08-07,2021-10-13,1,3.79341,0.72649,3.957772,3.0,2.589856,7.426173,Old_UI
3,CusID_01433,29.0,Not_Specified,Toronto,Ontario,15.0,2016-02-27,2020-10-22,1,2.362948,2.701855,3.522246,3.121818,2.259882,5.970419,Old_UI
4,CusID_02167,38.0,Female,?,British Columbia,3.0,2019-07-04,2020-03-17,0,7.568971,2.161103,7.535511,0.865851,4.63852,9.061972,Old_UI
5,CusID_02674,30.0,Female,Kolkata,West Bengal,10.0,2017-07-29,2021-02-21,0,8.790859,4.219936,3.0,6.395232,9.53983,1.06681,Old_UI
6,CusID_05594,27.0,Female,Kuala Lampur,Singapore,7.0,2017-09-29,2020-07-24,0,7.496507,6.065087,6.757594,8.390672,2.0,3.440012,New_UI
7,CusID_09297,27.0,Female,Munich,Catalonia,7.0,2018-04-18,2020-03-15,0,7.447626,0.724823,8.645873,7.822508,7.069303,2.882355,New_UI
8,CusID_03771,18.0,Female,Vienna,Vienna,4.0,2018-04-02,2021-02-15,0,7.573536,4.267514,4.359906,,6.521518,0.482545,New_UI
9,CusID_00667,19.0,Female,London,England,13.0,2017-08-11,2020-01-09,1,5.061261,3.822283,,7.37724,9.669255,2.410884,Old_UI


## 1.1 Missing values (10 points):

In any real-world dataset, there are always few null values. It doesn’t really matter whether it is a regression, classification or any other kind of problem, no model can handle these NULL or NaN values on its own so we need to intervene.

```
- In python NULL is reprsented with NaN. So don’t get confused between these two,they can be used interchangably.
```

First of all, we need to check whether we have null values in our dataset or not:

In [3]:
print("Training set:")
print(train_df.isna().sum())
print("============================")
print("Test set:")
print(test_df.isna().sum())

Training set:
CustomerID                  0
Age                       729
Gender                      0
City                        0
State                       0
No_of_orders_placed       535
Sign_up_date                0
Last_order_placed_date      0
is_premium_member           0
Women’s_Clothing            0
Men’s_Clothing              0
Kid’s_Clothing            648
Home_&_Living             595
Beauty                      0
Electronics                 0
Preferred_Theme             0
dtype: int64
Test set:
CustomerID                  0
Age                       274
Gender                      0
City                        0
State                       0
No_of_orders_placed       238
Sign_up_date                0
Last_order_placed_date      0
is_premium_member           0
Women’s_Clothing            0
Men’s_Clothing              0
Kid’s_Clothing            287
Home_&_Living             253
Beauty                      0
Electronics                 0
Preferred_Theme             0
dty

There are various ways for us to handle this problem. The easiest way to solve this problem is by dropping the rows or columns that contain null values.

Here is a link for you too choose your method and then apply it on both train and test set:
https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/

In [4]:
# choose an appropriate way to impute the missing values
# Please  mention the method you have selected by writing in a markdown cell.

################ YOUR CODE STARTS HERE ################
train_df = train_df.fillna(method='ffill')
test_df = test_df.fillna(method='ffill')

#check to see if there are any NaN values left
print("Training set:")
print(train_df.isna().sum())
print("============================")
print("Test set:")
print(test_df.isna().sum())
#No NaN values left so this method is good enough
################ YOUR CODE ENDS HERE ##################

Training set:
CustomerID                0
Age                       0
Gender                    0
City                      0
State                     0
No_of_orders_placed       0
Sign_up_date              0
Last_order_placed_date    0
is_premium_member         0
Women’s_Clothing          0
Men’s_Clothing            0
Kid’s_Clothing            0
Home_&_Living             0
Beauty                    0
Electronics               0
Preferred_Theme           0
dtype: int64
Test set:
CustomerID                0
Age                       0
Gender                    0
City                      0
State                     0
No_of_orders_placed       0
Sign_up_date              0
Last_order_placed_date    0
is_premium_member         0
Women’s_Clothing          0
Men’s_Clothing            0
Kid’s_Clothing            0
Home_&_Living             0
Beauty                    0
Electronics               0
Preferred_Theme           0
dtype: int64


## 1.2 Standardization (10 points):

It is another integral preprocessing step. In Standardization, we transform our values such that the mean of the values is 0 and the standard deviation is 1.

**Hint:** Use `sklearn.preprocessing.StandardScaler` to do it.

In [5]:
# Standardize your train/test data.
# Important note: You must use the mean and standard deviation of train data for test set! You know why :)

################ YOUR CODE STARTS HERE ################
cols = ["Age", "No_of_orders_placed", "Women’s_Clothing", \
    "Men’s_Clothing", "Kid’s_Clothing", "Home_&_Living", "Beauty", "Electronics"]

scaler = StandardScaler().fit(train_df[cols])
train_df[cols] = scaler.transform(train_df[cols])
test_df[cols] = scaler.transform(test_df[cols])
################ YOUR CODE ENDS HERE ##################

In [6]:
#just checking the training set after standardization
train_df

Unnamed: 0,CustomerID,Age,Gender,City,State,No_of_orders_placed,Sign_up_date,Last_order_placed_date,is_premium_member,Women’s_Clothing,Men’s_Clothing,Kid’s_Clothing,Home_&_Living,Beauty,Electronics,Preferred_Theme
0,CusID_00685,-1.233946,Not_Specified,Bercelona,Singapore,1.614207,2017-01-17,2020-09-19,1,0.124713,-0.619362,0.354941,0.647660,1.057802,0.120510,Old_UI
1,CusID_06121,0.355093,Male,Sydney,New South Wales,0.500834,2016-01-22,2021-12-09,0,-0.043409,1.524337,0.703492,-0.738850,-0.826738,0.375443,New_UI
2,CusID_09847,0.355093,Not_Specified,Toronto,Ontario,1.057520,2019-08-07,2021-10-13,1,0.152196,-1.253078,-0.212510,-0.665950,-0.400500,0.348226,Old_UI
3,CusID_01433,0.090253,Not_Specified,Toronto,Ontario,1.892550,2016-02-27,2020-10-22,1,0.039057,-0.592015,-0.377300,-0.615018,-0.487727,0.239045,Old_UI
4,CusID_02167,1.282032,Female,?,British Columbia,-1.447568,2019-07-04,2020-03-17,0,0.450816,-0.772980,1.141196,-1.558223,0.141057,0.470910,Old_UI
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10600,CusID_09355,1.679291,Male,Kuala Lampur,Singapore,1.335864,2016-01-10,2021-09-18,1,0.139572,0.022257,0.841254,-1.424021,1.779937,0.207596,Old_UI
10601,CusID_02156,-0.704266,Male,London,England,1.335864,2016-01-30,2020-05-20,0,0.292578,-0.157586,-1.354068,0.776611,-0.188789,0.457578,New_UI
10602,CusID_06497,-0.704266,Male,Toronto,California,0.500834,2017-11-07,2020-10-07,0,0.111349,1.542969,-1.354068,0.114446,-0.662771,0.440885,Old_UI
10603,CusID_07331,-0.704266,Not_Specified,Toronto,Ontario,1.057520,2018-03-04,2020-11-26,1,0.062950,0.240591,0.295807,-0.983980,-0.820772,0.211391,Old_UI


In [7]:
#just checking the test set after standardization
test_df

Unnamed: 0,CustomerID,Age,Gender,City,State,No_of_orders_placed,Sign_up_date,Last_order_placed_date,is_premium_member,Women’s_Clothing,Men’s_Clothing,Kid’s_Clothing,Home_&_Living,Beauty,Electronics,Preferred_Theme
0,CusID_09265,1.944131,Male,Munich,Bavaria,1.614207,2019-04-17,2021-05-25,1,0.117018,0.269042,1.073437,-0.267650,-0.988934,0.189234,New_UI
1,CusID_05791,0.355093,Male,Kolkata,California,0.500834,2017-06-23,2021-07-30,0,-0.039028,1.288364,-1.767676,-0.332475,-0.971954,0.491256,Old_UI
2,CusID_07705,2.473810,Female,Florence,Tuscany,1.335864,2017-09-29,2021-01-06,1,0.388687,-0.977850,1.562905,-1.502137,1.437588,0.023630,Old_UI
3,CusID_08159,-0.439427,Not_Specified,Vienna,Central Hungary,1.057520,2018-11-06,2020-09-02,0,0.313424,-1.484451,0.380756,1.241508,-0.550542,-0.018703,Old_UI
4,CusID_09737,-0.571847,Female,Vancouver,British Columbia,-1.169225,2016-04-02,2020-07-13,1,0.345644,-0.406287,0.344027,-0.665950,1.101348,-1.033725,Old_UI
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4540,CusID_09757,-1.233946,Male,London,Tuscany,0.500834,2016-10-15,2021-04-01,0,0.088635,1.270232,0.046598,-0.682612,-0.621079,0.406094,New_UI
4541,CusID_06709,0.487512,Female,Sydney,Catalonia,0.500834,2017-05-25,2021-06-01,1,0.365588,-0.616121,0.046598,1.717865,3.863224,0.047784,Old_UI
4542,CusID_02496,-0.969106,Female,Berlin,Maharashtra,-0.334196,2019-10-14,2020-12-13,0,0.437151,-1.180012,1.898797,0.907879,1.248499,-0.009736,Old_UI
4543,CusID_02347,1.679291,Not_Specified,Mumbai,Central Hungary,1.057520,2018-04-26,2021-08-05,1,0.105961,-0.550430,0.228421,0.865079,-0.315529,0.132251,Old_UI


## 1.3 Handling categorical Variables (10 points):

Handling categorical variables is another integral aspect of Machine Learning. Categorical variables are basically the variables that are discrete and not continuous. One of the methods to do this is **One-hot encoding**:

**-One-hot encoding:**

For categorical variables where no such ordinal relationship exists, the integer encoding is not enough.

In fact, using this encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

In this case, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.

In the “color” variable example, there are 3 categories and therefore 3 binary variables are needed. A “1” value is placed in the binary variable for the color and “0” values for the other colors.

```
red,	green,	blue
1,		0,		0
0,		1,		0
0,		0,		1
```

In [8]:
# Now you have to convert the categorical features in the 
# dataset to one-hot encoded representation.
# Hint: Use pd.DataFrame.get_dummies()

# before that, we drop 'City' column from both training and test set, we wont use it.
train_df.drop('City', axis=1, inplace=True)
test_df.drop('City', axis=1, inplace=True)

cols_to_be_encoded = ['Gender', 'State']

################ YOUR CODE STARTS HERE ################
train_df = pd.get_dummies(train_df, columns=cols_to_be_encoded)
train_df = pd.get_dummies(test_df, columns=cols_to_be_encoded)

################ YOUR CODE ENDS HERE ##################

In [9]:
train_df

Unnamed: 0,CustomerID,Age,No_of_orders_placed,Sign_up_date,Last_order_placed_date,is_premium_member,Women’s_Clothing,Men’s_Clothing,Kid’s_Clothing,Home_&_Living,...,State_New York,State_Ontario,State_Singapore,State_Taiwan,State_Tamil Nadu,State_Tokyo,State_Tuscany,State_Vienna,State_West Bengal,State_Western Australia
0,CusID_09265,1.944131,1.614207,2019-04-17,2021-05-25,1,0.117018,0.269042,1.073437,-0.267650,...,0,0,0,0,0,0,0,0,0,0
1,CusID_05791,0.355093,0.500834,2017-06-23,2021-07-30,0,-0.039028,1.288364,-1.767676,-0.332475,...,0,0,0,0,0,0,0,0,0,0
2,CusID_07705,2.473810,1.335864,2017-09-29,2021-01-06,1,0.388687,-0.977850,1.562905,-1.502137,...,0,0,0,0,0,0,1,0,0,0
3,CusID_08159,-0.439427,1.057520,2018-11-06,2020-09-02,0,0.313424,-1.484451,0.380756,1.241508,...,0,0,0,0,0,0,0,0,0,0
4,CusID_09737,-0.571847,-1.169225,2016-04-02,2020-07-13,1,0.345644,-0.406287,0.344027,-0.665950,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4540,CusID_09757,-1.233946,0.500834,2016-10-15,2021-04-01,0,0.088635,1.270232,0.046598,-0.682612,...,0,0,0,0,0,0,1,0,0,0
4541,CusID_06709,0.487512,0.500834,2017-05-25,2021-06-01,1,0.365588,-0.616121,0.046598,1.717865,...,0,0,0,0,0,0,0,0,0,0
4542,CusID_02496,-0.969106,-0.334196,2019-10-14,2020-12-13,0,0.437151,-1.180012,1.898797,0.907879,...,0,0,0,0,0,0,0,0,0,0
4543,CusID_02347,1.679291,1.057520,2018-04-26,2021-08-05,1,0.105961,-0.550430,0.228421,0.865079,...,0,0,0,0,0,0,0,0,0,0


The preprocess steps you have done till now,were the most essential methods to preprocess a raw data. Otherwise, you are free to do more on this phase.

For instance, **Feature Engineering** is a great example of preprocessing steps. It includes creating new features by mixing existing variables in order to help the ML model classify the sampels more accuratly.

# 2. Designing ML models

In this section, we are going to build, compile and train our ML models. There are a huge number of various models which you can use and fortunatly, almost all of them has been implemented in [Scikit-Learn](https://scikit-learn.org).

However in this assignment, **you have to choose one of the following ML algorithms**, and implement it from scratch **without using existing frameworks**.

   - **KNN**
   - **Naive Bayes**
   
There are a plenty of resourcse and tutorials for these two algorithm on the internet. Just search them and read about them.

## 2.1 KNN or Naive Bayes from scratch (25 points):

In [61]:
# Implement one of above-mentioned algorithms from scratch.
# You are free to use Numpy and also pandas functionalities to implement them.
# But avoid using sklearn!

################ YOUR CODE STARTS HERE ################
#dropping string columns since they cause some difficulties in the ML algorithms
traindf = train_df.drop('Sign_up_date', axis=1, inplace=False)
traindf.drop('Last_order_placed_date', axis=1, inplace=True)
traindf = traindf.set_index('CustomerID')

testdf = test_df.drop('Sign_up_date', axis=1, inplace=False)
testdf.drop('Last_order_placed_date', axis=1, inplace=True)
testdf = testdf.set_index('CustomerID')

################ YOUR CODE ENDS HERE ##################

# Before that use train_test_split() to split your train data into train/validation
# in order to evaluate your model based on validation data

X_train, X_val, y_train, y_val = train_test_split(traindf.drop('Preferred_Theme', axis=1),
                                                    traindf['Preferred_Theme'],
                                                    test_size=0.2)

In [14]:
################ YOUR CODE STARTS HERE ################
#implementing KNN from scratch
#note that Scikit-learn uses a KD Tree or Ball Tree to compute nearest neighbors in O[N log(N)] time
#while this algorithm is a direct approach including nested loops that requires O[N^2] time
#therefore processing 900 points in x_val takes about 9 minutes 

def eucledian_dist(p1, p2):
    p = p2 - p1
    return np.sum(p*p) #not using np.sqr because it doesn't change the result + it slows down the process

def KNN_predict(x_train, y_train, x_val, k):
    y_vals = []

    for i in range(len(x_val)):
        point_dist = []
        for j in range(len(x_train)): #number of rows in x_train
            distances = eucledian_dist(np.array(x_val.iloc[i]), np.array(x_train.iloc[j])) 
            point_dist.append(distances) 
        point_dist = np.array(point_dist) 
         
        #sort the array while preserving the index
        dist = np.argsort(point_dist)[:k]
         
        #val of the K nearest datapoints
        vals = y_train[dist]
         
        #mode of val
        v = mode(vals) 
        v = v.mode[0]
        y_vals.append(v)
    return y_vals

y_pred = KNN_predict(X_train, y_train, X_val, 5)
np.array(y_pred)
################ YOUR CODE ENDS HERE ##################

array(['New_UI', 'New_UI', 'Old_UI', 'Old_UI', 'Old_UI', 'New_UI',
       'New_UI', 'New_UI', 'New_UI', 'Old_UI', 'Old_UI', 'Old_UI',
       'Old_UI', 'Old_UI', 'New_UI', 'Old_UI', 'New_UI', 'Old_UI',
       'Old_UI', 'New_UI', 'Old_UI', 'New_UI', 'Old_UI', 'Old_UI',
       'New_UI', 'Old_UI', 'New_UI', 'New_UI', 'Old_UI', 'Old_UI',
       'New_UI', 'Old_UI', 'New_UI', 'New_UI', 'Old_UI', 'Old_UI',
       'Old_UI', 'Old_UI', 'Old_UI', 'Old_UI', 'New_UI', 'New_UI',
       'Old_UI', 'New_UI', 'New_UI', 'New_UI', 'New_UI', 'New_UI',
       'New_UI', 'Old_UI', 'New_UI', 'Old_UI', 'Old_UI', 'New_UI',
       'Old_UI', 'New_UI', 'New_UI', 'Old_UI', 'Old_UI', 'New_UI',
       'Old_UI', 'Old_UI', 'Old_UI', 'Old_UI', 'Old_UI', 'New_UI',
       'Old_UI', 'New_UI', 'New_UI', 'Old_UI', 'New_UI', 'Old_UI',
       'Old_UI', 'New_UI', 'New_UI', 'New_UI', 'New_UI', 'New_UI',
       'Old_UI', 'New_UI', 'Old_UI', 'New_UI', 'New_UI', 'New_UI',
       'Old_UI', 'New_UI', 'Old_UI', 'New_UI', 'New_UI', 'New_

## 2.2 Other common ML models (25 points):

In this part you have to use predefined ML algorithms in [Scikit-Learn](https://scikit-learn.org) to implement other known models. **You must at least try 2 differnt models** to get the full score of this part.

**Note 1:** But you are free to try other model as much as you can to get higher performances.

**Note 2:** Writing about how those algorithms work has bonus!


You can see a complete list of exisiting models in Scikit-learn documentation: https://scikit-learn.org/stable/supervised_learning.html

Answer: The first method used is https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification so that I can compare the results to my own implementation of it.

In [18]:
################ YOUR CODE STARTS HERE ################
#model 1: KNN
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
y_pred2 = classifier.predict(X_val)
y_pred2
################ YOUR CODE ENDS HERE ##################

array(['New_UI', 'New_UI', 'Old_UI', 'Old_UI', 'Old_UI', 'New_UI',
       'New_UI', 'New_UI', 'New_UI', 'Old_UI', 'Old_UI', 'Old_UI',
       'Old_UI', 'Old_UI', 'New_UI', 'Old_UI', 'New_UI', 'Old_UI',
       'Old_UI', 'New_UI', 'Old_UI', 'New_UI', 'Old_UI', 'Old_UI',
       'New_UI', 'Old_UI', 'New_UI', 'New_UI', 'Old_UI', 'Old_UI',
       'New_UI', 'Old_UI', 'New_UI', 'New_UI', 'Old_UI', 'Old_UI',
       'Old_UI', 'Old_UI', 'Old_UI', 'Old_UI', 'New_UI', 'New_UI',
       'Old_UI', 'New_UI', 'New_UI', 'New_UI', 'New_UI', 'New_UI',
       'New_UI', 'Old_UI', 'New_UI', 'Old_UI', 'Old_UI', 'New_UI',
       'Old_UI', 'New_UI', 'New_UI', 'Old_UI', 'Old_UI', 'New_UI',
       'Old_UI', 'Old_UI', 'Old_UI', 'Old_UI', 'Old_UI', 'New_UI',
       'Old_UI', 'New_UI', 'New_UI', 'Old_UI', 'New_UI', 'Old_UI',
       'Old_UI', 'New_UI', 'New_UI', 'New_UI', 'New_UI', 'New_UI',
       'Old_UI', 'New_UI', 'Old_UI', 'New_UI', 'New_UI', 'New_UI',
       'Old_UI', 'New_UI', 'Old_UI', 'New_UI', 'New_UI', 'New_

In [62]:
################ YOUR CODE STARTS HERE ################
#model2: Decision Tree
clf = DecisionTreeClassifier(max_depth=6) #max_depth=5 and 6 showed highest accuracy between values [1,9]
clf = clf.fit(X_train,y_train)
y_pred3 = clf.predict(X_val)
y_pred3
################ YOUR CODE ENDS HERE ################

array(['Old_UI', 'New_UI', 'New_UI', 'Old_UI', 'New_UI', 'New_UI',
       'New_UI', 'Old_UI', 'Old_UI', 'New_UI', 'Old_UI', 'Old_UI',
       'Old_UI', 'Old_UI', 'New_UI', 'Old_UI', 'Old_UI', 'New_UI',
       'New_UI', 'New_UI', 'New_UI', 'New_UI', 'New_UI', 'New_UI',
       'New_UI', 'Old_UI', 'New_UI', 'Old_UI', 'New_UI', 'Old_UI',
       'New_UI', 'Old_UI', 'New_UI', 'Old_UI', 'New_UI', 'Old_UI',
       'New_UI', 'New_UI', 'Old_UI', 'Old_UI', 'Old_UI', 'New_UI',
       'Old_UI', 'New_UI', 'Old_UI', 'New_UI', 'Old_UI', 'Old_UI',
       'Old_UI', 'New_UI', 'Old_UI', 'Old_UI', 'Old_UI', 'New_UI',
       'New_UI', 'Old_UI', 'New_UI', 'Old_UI', 'New_UI', 'New_UI',
       'New_UI', 'Old_UI', 'Old_UI', 'Old_UI', 'Old_UI', 'Old_UI',
       'Old_UI', 'New_UI', 'New_UI', 'New_UI', 'Old_UI', 'New_UI',
       'Old_UI', 'New_UI', 'New_UI', 'Old_UI', 'New_UI', 'Old_UI',
       'Old_UI', 'Old_UI', 'Old_UI', 'Old_UI', 'New_UI', 'New_UI',
       'Old_UI', 'New_UI', 'New_UI', 'New_UI', 'Old_UI', 'Old_

## 2.3 Evaluation (20 points):

Great! You are now able to evaluate your model and see how nice your algorithm is able to classify positive and negative samples both on train and validation data.

For this, we are going to use 4 metrics. It is highly recommnded to read about them and know how benficial they are for different purposes:

- Accuracy
- Precision
- Recall
- F1-score

Use `sklearn.metrics` to implement them and report the results on both train and validation data.

In [57]:
################ YOUR CODE STARTS HERE ################
#Accuracy:how often is the classifier correct?
print("Accuracy1:", accuracy_score(y_val, y_pred))
print("Accuracy2:", accuracy_score(y_val, y_pred2))
print("Accuracy3:", accuracy_score(y_val, y_pred3))

#Precision: the ability of the classifier not to label as positive a sample that is negative.
#Weighted: Calculate metrics for each label, and find their average weighted by support 
#(the number of true instances for each label). This alters ‘macro’ to account for label 
#imbalance; it can result in an F-score that is not between precision and recall.
print("Precision1:", precision_score(y_val, y_pred, average='weighted'))
print("Precision2:", precision_score(y_val, y_pred2, average='weighted'))
print("Precision3:", precision_score(y_val, y_pred3, average='weighted'))

#Recall: the ability of the classifier to find all the positive samples.
print("Recall1:", recall_score(y_val, y_pred, average='weighted'))
print("Recall2:", recall_score(y_val, y_pred2, average='weighted'))
print("Recall3:", recall_score(y_val, y_pred3, average='weighted'))

#F1: a harmonic mean of the precision and recall
print("F1-score1:", f1_score(y_val, y_pred, average='weighted'))
print("F1-score2:", f1_score(y_val, y_pred2, average='weighted'))
print("F1-score3:", f1_score(y_val, y_pred3, average='weighted'))
################ YOUR CODE ENDS HERE ##################

Accuracy1: 0.6116611661166117
Accuracy2: 0.6490649064906491
Accuracy3: 0.6677667766776678
Precision1: 0.6119816925229575
Precision2: 0.649141892568211
Precision3: 0.6685193582234751
Recall1: 0.6116611661166117
Recall2: 0.6490649064906491
Recall3: 0.6677667766776678
F1-score1: 0.61155399109101
F1-score2: 0.6490649064906491
F1-score3: 0.6672051308938801


# 3. Submission

Please read the notes here carefully:

1. In addition to completing the code files, please send a report including your answer to these questions as well. Do not forget to put the diagrams and visualizations needed in each part.

2. The file you upload must be named as `[Student ID]-[Your name].zip.`
3. Your notebook must be executed without any problem. If not, you will lose points for each part consequently.
4. **Important Note:** The outputs of the code blocks must be remained in your notebook, otherwise, you definitly lose all the points of that part.



In case you have any questions, contact **mohammad99hashemi@gmail.com**.


Good luck :)