# Build Your First Machine Learning Model!

Welcome to the **Machine Learning** programming assignment of the **Artificial Intelligence** course.

In this assignment, a Jupyter Notebook has been prepared for examining almost all the critical factors in designing a Machine Learning model for a supervised learning task.

You will complete this notebook to build an ML model and apply it to a binary classification problem. Additionally, you will become familiar with various factors that may enhance the performance of an ML model.

In this notebook, we will take advantage of four principal Python libraries for data science and machine learning tasks.

 - [Numpy](https://numpy.org): The fundamental package for **scientific computing** with Python.
 - [Pandas](https://pandas.pydata.org): An open source **data analysis and manipulation tool**, built on top of the Python programming language.
 - [Matplotlib](https://matplotlib.org): A comprehensive library for **creating static, animated, and interactive visualizations** in Python.
 - [Scikit-Learn](https://scikit-learn.org): Simple and efficient tools for predictive data analysis and building Machine Learning models.
 
**After this assignment you will be able to:**
1. Run Exploratory Data Analysis (EDA) and know how to prepare data for a predictive model.
2. Build and apply a Machine Learning model for a supervised learning problem using known frameworks.
 
 
 **Before you start:** Please read the ***Submission*** section at the bottom of the notebook carefully.
 
 Let's get started!



# 0. User UI Preference - Introduction

<p align="center">
  <img src="images/UI.png" width=1000>
</p>

The **User Interface (UI)** is the graphical layout of an application. It consists of the buttons users click on, the text they read, the images, sliders, text entry fields, and all the rest of the items the user interacts with. This includes screen layout, transitions, interface animations and every single micro-interaction. Any sort of visual element, interaction, or animation must all be designed.

An e-commerce company recently changed the UI of its website, but it does not know how to evaluate the new UI. Do customers like it? Do they prefer the new UI or the old one?

The company has also collected feedback from its customers through pre-organized surveys. As Machine Learning engineers, you are asked to **build a Machine Learning model that identifies each user's UI preference on the basis of their UI engagement information.**

## Data

There are two files that you need to consider: *(i)* `train.csv` for training, and *(ii)* `test.csv` for testing. Each file consists of 16 columns with the following descriptions:


| Column name | Description |
| ------------- | ------------- |
| **CustomerID** | Represents a unique identification of a user |
| **Age** | Represents the age of the user |
| **Gender** | Represents the gender of the user |
| **City** | Represents the city in which the user lives |
| **State** | Represents the state in which the user lives |
| **No_of_orders_placed** | Represents the total number of orders placed by a customer |
| **Sign_up_date** | Represents the date on which the customer signed up |
| **Last_order_placed_date** | Represents the last date when the customer placed an order |
| **is_premium_member** | Represents whether a customer is a premium member or not (0 or 1) |
| **Women’s_Clothing** | Represents the user's engagement score in the Women's Clothing section (0 to 10) |
| **Men’s_Clothing** | Represents the user's engagement score in the Men's Clothing section (0 to 10) |
| **Kid’s_Clothing** | Represents the user's engagement score in the Kid's Clothing section (0 to 10) |
| **Home_&_Living** | Represents the user's engagement score in the Home & Living section (0 to 10) |
| **Beauty** | Represents the user's engagement score in the Beauty products section (0 to 10) |
| **Electronics** | Represents the user's engagement score in the Electronics products section (0 to 10) |
| **Preferred_Theme** | Represents the preferred theme (Old_UI or New_UI) |


## Packages
Let's first import all the packages that you will need during this assignment.

In [89]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1. Data preprocessing


What is data preprocessing? **Data preprocessing** is an integral step in Machine Learning: the quality of the data, and of the useful information that can be derived from it, directly affects the ability of our model to learn. It is therefore extremely important to preprocess our data before feeding it into our model.

Some of the main preprocessing steps are:

  - **Handling missing values**: impute missing (NaN) values
  - **Standardization**: rescale the features so that the model converges more quickly
  - **Handling categorical variables**: one-hot encoding is an appropriate solution
  - ...
  
In this assignment, you will apply some of these methods step by step.

In [90]:
# load the dataset into a dataframe
train_df = pd.read_csv('data/train.csv', index_col=0,na_values='?')
test_df = pd.read_csv('data/test.csv', index_col=0,na_values='?')

print(f'Number of training samples: {len(train_df)}')
print(f'Number of test samples: {len(test_df)}')
train_df.head(10)



Number of training samples: 10605
Number of test samples: 4545


|  | CustomerID | Age | Gender | City | State | No_of_orders_placed | Sign_up_date | Last_order_placed_date | is_premium_member | Women’s_Clothing | Men’s_Clothing | Kid’s_Clothing | Home_&_Living | Beauty | Electronics | Preferred_Theme |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| 0 | CusID_00685 | 19.0 | Not_Specified | Bercelona | Singapore | 14.0 | 2017-01-17 | 2020-09-19 | 1 | 3.445937 | 2.620136 | 5.4575 | 6.141905 | 8.10649 | 4.389933 | Old_UI |
| 1 | CusID_06121 | 31.0 | Male | Sydney | New South Wales | 10.0 | 2016-01-22 | 2021-12-09 | 0 | 1.320304 | 9.025863 | 6.378695 | 2.825636 | 0.97743 | 7.789076 | New_UI |
| 2 | CusID_09847 | NaN | Not_Specified | Toronto | Ontario | 12.0 | 2019-08-07 | 2021-10-13 | 1 | 3.79341 | 0.72649 | 3.957772 | 3.0 | 2.589856 | 7.426173 | Old_UI |
| 3 | CusID_01433 | 29.0 | Not_Specified | Toronto | Ontario | 15.0 | 2016-02-27 | 2020-10-22 | 1 | 2.362948 | 2.701855 | 3.522246 | 3.121818 | 2.259882 | 5.970419 | Old_UI |
| 4 | CusID_02167 | 38.0 | Female | NaN | British Columbia | 3.0 | 2019-07-04 | 2020-03-17 | 0 | 7.568971 | 2.161103 | 7.535511 | 0.865851 | 4.63852 | 9.061972 | Old_UI |
| 5 | CusID_02674 | 30.0 | Female | Kolkata | West Bengal | 10.0 | 2017-07-29 | 2021-02-21 | 0 | 8.790859 | 4.219936 | 3.0 | 6.395232 | 9.53983 | 1.06681 | Old_UI |
| 6 | CusID_05594 | 27.0 | Female | Kuala Lampur | Singapore | 7.0 | 2017-09-29 | 2020-07-24 | 0 | 7.496507 | 6.065087 | 6.757594 | 8.390672 | 2.0 | 3.440012 | New_UI |
| 7 | CusID_09297 | 27.0 | Female | Munich | Catalonia | 7.0 | 2018-04-18 | 2020-03-15 | 0 | 7.447626 | 0.724823 | 8.645873 | 7.822508 | 7.069303 | 2.882355 | New_UI |
| 8 | CusID_03771 | 18.0 | Female | Vienna | Vienna | 4.0 | 2018-04-02 | 2021-02-15 | 0 | 7.573536 | 4.267514 | 4.359906 | NaN | 6.521518 | 0.482545 | New_UI |
| 9 | CusID_00667 | 19.0 | Female | London | England | 13.0 | 2017-08-11 | 2020-01-09 | 1 | 5.061261 | 3.822283 | NaN | 7.37724 | 9.669255 | 2.410884 | Old_UI |


## 1.1 Missing values (10 points):

In almost any real-world dataset, there are some null values. Whether the problem is regression, classification, or anything else, most models cannot handle NULL or NaN values on their own, so we need to intervene.

```
- In Python (pandas), NULL is represented as NaN. Don’t get confused between the two; the terms are used interchangeably here.
```

First of all, we need to check whether we have null values in our dataset or not:

In [91]:
print("Training set:")
print(train_df.isna().sum())
print("============================")
print("Test set:")
print(test_df.isna().sum())

Training set:
CustomerID                  0
Age                       729
Gender                      0
City                      311
State                       0
No_of_orders_placed       535
Sign_up_date              102
Last_order_placed_date      0
is_premium_member           0
Women’s_Clothing            0
Men’s_Clothing              0
Kid’s_Clothing            648
Home_&_Living             595
Beauty                      0
Electronics                 0
Preferred_Theme             0
dtype: int64
Test set:
CustomerID                  0
Age                       274
Gender                      0
City                      135
State                       0
No_of_orders_placed       238
Sign_up_date               52
Last_order_placed_date      0
is_premium_member           0
Women’s_Clothing            0
Men’s_Clothing              0
Kid’s_Clothing            287
Home_&_Living             253
Beauty                      0
Electronics                 0
Preferred_Theme             0
dtype: int64

There are various ways for us to handle this problem. The easiest way to solve this problem is by dropping the rows or columns that contain null values.

Here is a link to help you choose your method; apply it to both the train and test sets:
https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/
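As a minimal sketch of the two broad options mentioned above (dropping versus imputing), the example below uses a small made-up frame; `toy_df` and its values are assumptions for illustration, not part of the assignment data. It imputes a numeric column with its median and a categorical column with its most frequent value, which is only one of several reasonable strategies.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the assignment data (values are made up)
toy_df = pd.DataFrame({
    "Age": [19.0, np.nan, 31.0, 38.0],
    "City": ["Toronto", None, "Toronto", "London"],
})

# Option 1: drop every row that has at least one missing value
dropped = toy_df.dropna()

# Option 2: impute instead of dropping
imputed = toy_df.copy()
# Numeric column: fill with the median of the observed values
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())
# Categorical column: fill with the most frequent value (the mode)
imputed["City"] = imputed["City"].fillna(imputed["City"].mode()[0])

print(f"Rows left after dropping: {len(dropped)}")
print(f"Missing values left after imputing: {imputed.isna().sum().sum()}")
```

Dropping is simple but discards data, while imputing keeps every row at the cost of introducing estimated values. When imputing with statistics such as the median, it is usually safer to compute them on the training set and reuse them for the test set.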

In [92]:
# choose an appropriate way to impute the missing values
# Please  mention the method you have selected by writing in a markdown cell.

################ YOUR CODE STARTS HERE ################
# Drop rows that are entirely empty, then backward-fill the remaining gaps
# (each NaN takes the next valid value in its column).
train_df = train_df.dropna(how='all')
train_df = train_df.bfill()

test_df = test_df.dropna(how='all')
test_df = test_df.bfill()

print("Training set:")
print(train_df.isna().sum())
print("============================")
print("Test set:")
print(test_df.isna().sum())
################ YOUR CODE ENDS HERE ##################

Training set:
CustomerID                0
Age                       0
Gender                    0
City                      0
State                     0
No_of_orders_placed       0
Sign_up_date              0
Last_order_placed_date    0
is_premium_member         0
Women’s_Clothing          0
Men’s_Clothing            0
Kid’s_Clothing            0
Home_&_Living             0
Beauty                    0
Electronics               0
Preferred_Theme           0
dtype: int64
Test set:
CustomerID                0
Age                       0
Gender                    0
City                      0
State                     0
No_of_orders_placed       0
Sign_up_date              0
Last_order_placed_date    0
is_premium_member         0
Women’s_Clothing          0
Men’s_Clothing            0
Kid’s_Clothing            0
Home_&_Living             0
Beauty                    0
Electronics               0
Preferred_Theme           0
dtype: int64


## 1.2 Standardization (10 points):

Standardization is another integral preprocessing step. In standardization, we transform each feature so that its mean is 0 and its standard deviation is 1.

**Hint:** Use `sklearn.preprocessing.StandardScaler` to do it.
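The transformation can be written as z = (x - mean) / std, where the mean and standard deviation come from the training data. The sketch below, on made-up numbers (`train` and `test` are illustrative arrays, not the assignment data), applies the formula by hand and checks that `StandardScaler` gives the same result.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data for illustration
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0]])

# Manual standardization using the statistics of the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)
manual_train = (train - mean) / std
manual_test = (test - mean) / std

# The same thing with scikit-learn: fit on train, transform both sets
scaler = StandardScaler().fit(train)
sk_train = scaler.transform(train)
sk_test = scaler.transform(test)

print(np.allclose(manual_train, sk_train), np.allclose(manual_test, sk_test))
```

Fitting the scaler on the training data alone keeps information about the test set from leaking into the preprocessing step.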

In [93]:
# Standardize your train/test data.
# Important note: You must use the mean and standard deviation of train data for test set! You know why :)
from sklearn.preprocessing import StandardScaler
################ YOUR CODE STARTS HERE ################
numeric_cols = ['No_of_orders_placed','Women’s_Clothing','Men’s_Clothing','Kid’s_Clothing','Home_&_Living','Beauty','Electronics']
# Fit on the training data only, so no test-set statistics leak into the scaler
scaler = StandardScaler()
scaler.fit(train_df[numeric_cols])
# Keep the original row index so the scaled columns align when they are concatenated later
train_df_scaled_numeric = pd.DataFrame(scaler.transform(train_df[numeric_cols]),
                                       columns=numeric_cols, index=train_df.index)
test_df_scaled_numeric = pd.DataFrame(scaler.transform(test_df[numeric_cols]),
                                      columns=numeric_cols, index=test_df.index)
################ YOUR CODE ENDS HERE ##################

## 1.3 Handling categorical Variables (10 points):

Handling categorical variables is another integral aspect of Machine Learning. Categorical variables take values from a discrete set of categories rather than a continuous range. One common way to handle them is **one-hot encoding**:

**One-hot encoding:**

A simple option is integer encoding, where each category is mapped to an integer. For categorical variables with no ordinal relationship, however, integer encoding is not enough.

In fact, integer encoding lets the model assume a natural ordering between categories, which may result in poor performance or unexpected results (such as predictions halfway between categories).

In this case, one-hot encoding can be applied instead: the integer-encoded variable is removed and a new binary variable is added for each unique category.

For example, a “color” variable with 3 categories (red, green, blue) needs 3 binary variables. A “1” is placed in the binary variable for the sample's color and “0” in the others.

```
red,	green,	blue
1,		0,		0
0,		1,		0
0,		0,		1
```
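The color table above can be reproduced with `pd.get_dummies`. This is a minimal sketch on a made-up `colors` frame; the column reordering at the end is only there to match the red, green, blue order of the table.

```python
import pandas as pd

# Made-up frame with a single categorical column
colors = pd.DataFrame({"color": ["red", "green", "blue"]})

# One binary column per category; dtype=int gives 0/1 instead of False/True
one_hot = pd.get_dummies(colors, columns=["color"], dtype=int)

# get_dummies sorts the categories alphabetically, so reorder to match the table
one_hot = one_hot[["color_red", "color_green", "color_blue"]]
print(one_hot)
```

Each row contains exactly one 1, marking the category of that sample.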

In [94]:
# Now you have to convert the categorical features in the
# dataset to a one-hot encoded representation.
# Hint: Use pd.get_dummies()

# Before that, we drop the 'City' column from both the training and test sets; we won't use it.
train_df.drop('City', axis=1, inplace=True)
test_df.drop('City', axis=1, inplace=True)

cols_to_be_encoded = ['Gender', 'State']

################ YOUR CODE STARTS HERE ################

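# dtype=float keeps the encoded columns numeric alongside the scaled features
train_df_encoded = pd.get_dummies(train_df[cols_to_be_encoded], prefix=cols_to_be_encoded, dtype=float)
test_df_encoded = pd.get_dummies(test_df[cols_to_be_encoded], prefix=cols_to_be_encoded, dtype=float)
# Give the test set exactly the training columns; categories missing from the test set become all-zero columns
test_df_encoded = test_df_encoded.reindex(columns=train_df_encoded.columns, fill_value=0.0)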

################ YOUR CODE ENDS HERE ##################

The preprocessing steps you have completed so far are the most essential ones for raw data. Beyond these, you are free to do more in this phase.

For instance, **Feature Engineering** is a great example of an additional preprocessing step. It involves creating new features by combining existing variables to help the ML model classify the samples more accurately.
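As one possible illustration, the two date columns of the dataset could be combined into a single numeric feature. The sketch below uses a made-up frame, and the feature name `days_active` is a hypothetical choice, not something the assignment prescribes.

```python
import pandas as pd

# Made-up rows using the dataset's date column names
toy_df = pd.DataFrame({
    "Sign_up_date": ["2020-01-01", "2019-12-31"],
    "Last_order_placed_date": ["2020-01-31", "2020-01-01"],
})

# Parse the date strings into datetime values
sign_up = pd.to_datetime(toy_df["Sign_up_date"])
last_order = pd.to_datetime(toy_df["Last_order_placed_date"])

# New feature (hypothetical name): days between sign-up and the last order
toy_df["days_active"] = (last_order - sign_up).dt.days
print(toy_df)
```

A derived feature like this would still need the same treatment as the other numeric columns, for example imputation of missing dates and standardization.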

# 2. Designing ML models

In this section, we are going to build, compile and train our ML models. There is a huge number of models that you can use and, fortunately, almost all of them have been implemented in [Scikit-Learn](https://scikit-learn.org).

However in this assignment, **you have to choose one of the following ML algorithms**, and implement it from scratch **without using existing frameworks**.

   - **KNN**
   - **Naive Bayes**
   
There are plenty of resources and tutorials for these two algorithms on the internet. Search for them and read about how they work.
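For orientation, here is a minimal sketch of the second option, a Gaussian Naive Bayes classifier written with NumPy only. It assumes every feature is numeric and roughly normally distributed within each class, which is a simplification (one-hot columns, for instance, do not fit that assumption well). The class name `GaussianNaiveBayes` and the toy data are illustrative and not part of the assignment.

```python
import numpy as np


class GaussianNaiveBayes:
    """Naive Bayes with one Gaussian per (class, feature) pair."""

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # Class priors P(c), estimated as class frequencies
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        # Per-class feature means and variances, shape (n_classes, n_features)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # A small constant avoids division by zero for constant features
        self.vars_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-9
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Differences to each class mean, shape (n_samples, n_classes, n_features)
        diff = X[:, None, :] - self.means_[None, :, :]
        # Sum of per-feature Gaussian log-densities (the "naive" independence assumption)
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * self.vars_)[None, :, :] + diff ** 2 / self.vars_[None, :, :],
            axis=2,
        )
        log_posterior = np.log(self.priors_)[None, :] + log_likelihood
        # Pick the class with the highest (unnormalized) log-posterior
        return self.classes_[np.argmax(log_posterior, axis=1)]


# Toy data: two well-separated clusters with made-up labels
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=-2.0, scale=1.0, size=(50, 2)),
    rng.normal(loc=2.0, scale=1.0, size=(50, 2)),
])
y_toy = np.array(["Old_UI"] * 50 + ["New_UI"] * 50)

model = GaussianNaiveBayes().fit(X_toy, y_toy)
preds = model.predict(np.array([[-2.0, -2.0], [2.0, 2.0]]))
print(preds)
```

Working with log-probabilities instead of raw probabilities avoids numerical underflow when many features are multiplied together.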

## 2.1 KNN or Naive Bayes from scratch (25 points):

In [96]:
# Implement one of above-mentioned algorithms from scratch.
# You are free to use Numpy and also pandas functionalities to implement them.
# But avoid using sklearn!

# Before that use train_test_split() to split your train data into train/validation
# in order to evaluate your model based on validation data

#merge all columns
train_df = pd.concat([train_df_encoded, train_df_scaled_numeric,train_df['Preferred_Theme']], axis=1)
test_df = pd.concat([test_df_encoded, test_df_scaled_numeric,test_df['Preferred_Theme']], axis=1)

X_train, X_val, y_train, y_val = train_test_split(train_df.drop('Preferred_Theme', axis=1),
                                                    train_df['Preferred_Theme'],
                                                    test_size=0.2)
################ YOUR CODE STARTS HERE ################
from collections import Counter

import numpy as np
from sklearn.metrics import accuracy_score  # used only for scoring, not for the model


def minkowski_distance(a, b, p=1):
    """Minkowski distance of order p between point(s) a and point b."""
    # p=1 is the Manhattan distance, p=2 the Euclidean distance
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return np.sum(diff ** p, axis=-1) ** (1 / p)


def knn_predict(X_train, X_test, y_train, y_test, k, p):
    """Predict one label per row of X_test by majority vote of its k nearest training rows.

    y_test is not used; it is kept so that the call signature stays unchanged.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    y_hat_test = []
    for test_point in X_test:
        # Distances from this test point to every training point at once
        distances = minkowski_distance(X_train, test_point, p=p)
        # Indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        # Most common label among the k nearest neighbors
        prediction = Counter(y_train[nearest]).most_common(1)[0][0]
        y_hat_test.append(prediction)
    return y_hat_test

y_hat_test = knn_predict(X_train.values, X_val.values, y_train, y_val, k=3, p=1)
print(accuracy_score(y_val, y_hat_test))
################ YOUR CODE ENDS HERE ##################

0.6360207449316361


## 2.2 Other common ML models (25 points):

In this part you have to use the predefined ML algorithms in [Scikit-Learn](https://scikit-learn.org) to implement other well-known models. **You must try at least 2 different models** to get the full score for this part.

**Note 1:** You are free to try as many other models as you like to get higher performance.

**Note 2:** Writing about how those algorithms work earns a bonus!


You can see a complete list of existing models in the Scikit-learn documentation: https://scikit-learn.org/stable/supervised_learning.html
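Scikit-learn classifiers share the same `fit` / `predict` interface, so trying several models mostly means swapping the estimator object. The sketch below shows that pattern on synthetic data from `make_classification`; the two models and all variable names are arbitrary illustrative choices, not a recommendation for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (stands in for the real features)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Any estimator with fit/predict can be added to this dictionary
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_va, model.predict(X_va))
    print(f"{name}: {scores[name]:.3f}")
```

Fixing `random_state` makes the split and the synthetic data reproducible, which helps when comparing models across runs.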

In [97]:
################ YOUR CODE STARTS HERE ################
from sklearn import tree
from sklearn import svm
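from sklearn.ensemble import RandomForestClassifier

# Features and labels of the held-out test set, taken from the merged test_df
X_test = test_df.drop('Preferred_Theme', axis=1)
y_test = test_df['Preferred_Theme']

#DecisionTreeClassifier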
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train.values, y_train.values)
y_hat_test_DT = clf.predict(X_test.values)
print('DecisionTreeClassifier: ',accuracy_score(y_test, y_hat_test_DT))

#SVC
clf = svm.SVC()
clf = clf.fit(X_train.values, y_train.values)
y_hat_test_SVC = clf.predict(X_test.values)
print('SVC: ',accuracy_score(y_test, y_hat_test_SVC))

#RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X_train.values, y_train.values)
y_hat_test_RF = clf.predict(X_test.values)
print('RandomForestClassifier: ',accuracy_score(y_test, y_hat_test_RF))
################ YOUR CODE ENDS HERE ##################

DecisionTreeClassifier:  0.5991199119911991
SVC:  0.6941694169416942
RandomForestClassifier:  0.6664466446644665


## 2.3 Evaluation (20 points):

Great! You are now able to evaluate your model and see how well your algorithm classifies positive and negative samples on both train and validation data.

For this, we are going to use 4 metrics. It is highly recommended to read about them and learn how beneficial each is for different purposes:

- Accuracy
- Precision
- Recall
- F1-score

Use `sklearn.metrics` to implement them and report the results on both train and validation data.
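To make the definitions concrete, the sketch below computes the four metrics by hand from the counts of true/false positives and negatives, then checks them against `sklearn.metrics`. The labels are made up, and treating `New_UI` as the positive class is an arbitrary assumption for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up true labels and predictions
y_true = np.array(["New_UI", "New_UI", "New_UI", "Old_UI", "Old_UI", "Old_UI"])
y_pred = np.array(["New_UI", "New_UI", "Old_UI", "New_UI", "Old_UI", "Old_UI"])
positive = "New_UI"  # assumption: which class counts as "positive"

# Confusion-matrix counts
tp = np.sum((y_pred == positive) & (y_true == positive))
fp = np.sum((y_pred == positive) & (y_true != positive))
fn = np.sum((y_pred != positive) & (y_true == positive))
tn = np.sum((y_pred != positive) & (y_true != positive))

accuracy = (tp + tn) / len(y_true)       # share of all predictions that are correct
precision = tp / (tp + fp)               # share of predicted positives that are truly positive
recall = tp / (tp + fn)                  # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

# The library functions should agree with the manual computation
print(np.isclose(accuracy, accuracy_score(y_true, y_pred)))
print(np.isclose(precision, precision_score(y_true, y_pred, pos_label=positive)))
print(np.isclose(recall, recall_score(y_true, y_pred, pos_label=positive)))
print(np.isclose(f1, f1_score(y_true, y_pred, pos_label=positive)))
```

Accuracy alone can be misleading when the classes are imbalanced, which is why precision, recall and F1 are reported alongside it.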

In [88]:
################ YOUR CODE STARTS HERE ################
from sklearn.metrics import precision_recall_fscore_support

# Per-label precision, recall, F-score and support for each fitted model
predictions = {
    'DecisionTreeClassifier': y_hat_test_DT,
    'SVC': y_hat_test_SVC,
    'RandomForestClassifier': y_hat_test_RF,
}
for i, (name, y_hat) in enumerate(predictions.items()):
    if i > 0:
        print('---------------------------------')
    precision, recall, fscore, support = precision_recall_fscore_support(y_test, y_hat)
    print(f'precision for each label ({name}): ', precision)
    print(f'recall for each label ({name}): ', recall)
    print(f'fscore for each label ({name}): ', fscore)
    print(f'support for each label ({name}): ', support)
################ YOUR CODE ENDS HERE ##################

precision for each label (DecisionTreeClassifier):  [0.92450331 0.91622807]
recall for each label (DecisionTreeClassifier):  [0.91641138 0.92433628]
fscore for each label (DecisionTreeClassifier):  [0.92043956 0.92026432]
support for each label (DecisionTreeClassifier):  [2285 2260]
---------------------------------
precision for each label (SVC):  [0.72904762 0.69161554]
recall for each label (SVC):  [0.67002188 0.74823009]
fscore for each label (SVC):  [0.69828962 0.71880978]
support for each label (SVC):  [2285 2260]
---------------------------------
precision for each label (RandomForestClassifier):  [0.90752417 0.94182825]
recall for each label (RandomForestClassifier):  [0.94485777 0.90265487]
fscore for each label (RandomForestClassifier):  [0.92581475 0.92182558]
support for each label (RandomForestClassifier):  [2285 2260]


# 3. Submission

Please read the notes here carefully:

1. In addition to completing the code files, please send a report including your answers to these questions as well. Do not forget to include the diagrams and visualizations needed in each part.

2. The file you upload must be named `[Student ID]-[Your name].zip`.
3. Your notebook must run without any problems. If it does not, you will lose points for each affected part.
4. **Important Note:** The outputs of the code blocks must remain in your notebook; otherwise, you will definitely lose all the points for that part.



In case you have any questions, contact **mohammad99hashemi@gmail.com**.


Good luck :)