<img src="Logo.png" width="100" align="left"/> 

# <center> Unit 3 Project </center>
# <center> Third section: supervised task </center>

In this notebook you will build and train a supervised learning model to classify your data.

For this task we will use another classification model: the random forest.

Steps for this task:
1. Load the already clustered dataset.
2. Note that this task does not use the previously added "Cluster" column.
3. Split your data.
4. Build your model using the scikit-learn RandomForestClassifier class.
5. Classify your data and test the performance of your model.
6. Evaluate the model (accepted models should reach an accuracy of at least 86%). Experiment with the hyperparameters and report what you find.
7. Provide evidence of the quality of your model (good metrics, no overfitting).
8. Create a new test dataset that contains the test set plus an additional column called "predicted_class", holding the class predicted by your random forest classifier for each data point of the test set.

## 1. Load and split the data:

In [265]:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd 
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import precision_score, accuracy_score

In [266]:
# To-Do:  load the data 
df = pd.read_csv("clustered_HepatitisC.csv")
df.head()

,Unnamed: 0,Category,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT,cluster
0,0,0.0,32,1,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0,3
1,1,0.0,32,1,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5,3
2,2,0.0,32,1,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3,3
3,3,0.0,32,1,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7,3
4,4,0.0,32,1,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7,3


In [267]:
# To-Do : keep only the columns to be used : all features except ID, cluster 
# The target here is the Category column 
# Do not forget to split your data (this is a classification task)
# test set size should be 20% of the data 

In [268]:
# Drop the ID column ("Unnamed: 0") and the cluster label so neither is used as a feature
df = df.drop(["Unnamed: 0", "cluster"], axis=1)

In [269]:
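# Target: the Category column; features: every remaining column
y = df["Category"]
x = df.drop(["Category"], axis=1)

# Hold out 20% of the rows as the test set; random_state makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=100)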

In [270]:
# Count how many test samples belong to class 4
(y_test == 4).sum()

3
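
Only three test rows belong to class 4, so a purely random split can leave rare classes under-represented in one of the partitions. The following is a minimal sketch of a stratified split, assuming a synthetic imbalanced target instead of the Hepatitis C data (the names `x_demo`, `y_demo`, and the `xd_`/`yd_` variables are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the Hepatitis C file):
# 100 rows, with class 1 making up only 10% of the target.
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series([0] * 90 + [1] * 10, name="Category")

# stratify=y_demo keeps the class ratio the same in both partitions.
xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=100, stratify=y_demo
)

print(yd_train.value_counts().to_dict())
print(yd_test.value_counts().to_dict())
```

Whether stratification is worth using here depends on how rare the smallest class is; it may not be possible if a class has fewer than two samples.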

## 2. Build, train, and evaluate the model:

In [271]:
# To-do build the model and train it 
# note that you will be providing explanation about the hyper parameter tuning 
# So you will be iterating a number of times before getting the desired performance 


In [272]:
# To tune the hyperparameters we first run a randomized search with cross-validation.
# It samples combinations from the ranges below (only a subset of the hyperparameters,
# to keep computation time reasonable) and narrows down the range used in the next step.

from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# ('auto' is equivalent to 'sqrt' for classifiers and was removed in scikit-learn 1.3)
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'bootstrap': [True, False]}
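
To see why a randomized search is used before an exhaustive one, it helps to count the combinations in this grid. A small sketch using scikit-learn's `ParameterGrid` on the same values as printed above (the variable names `grid_demo` and `n_combinations` are hypothetical):

```python
from sklearn.model_selection import ParameterGrid

# Same value lists as the random grid printed above.
grid_demo = {
    "n_estimators": [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
    "max_features": ["auto", "sqrt"],
    "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# ParameterGrid only enumerates combinations; it does not validate or fit anything.
n_combinations = len(ParameterGrid(grid_demo))  # 10 * 2 * 12 * 3 * 3 * 2

print("combinations in the full grid:", n_combinations)
print("fits for an exhaustive 3-fold search:", n_combinations * 3)
print("fits for 100 sampled candidates, 3 folds:", 100 * 3)
```

Sampling 100 candidates covers only a small fraction of the grid, which is why the result is treated as a starting range for a narrower grid search rather than a final answer.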


In [273]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3-fold cross-validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(
    estimator=rf, param_distributions=random_grid, n_iter=100,
    cv=3, verbose=2, random_state=42, n_jobs=-1,
)
# Fit the random search model
rf_random.fit(x_train, y_train)

rf_random.best_params_

Fitting 3 folds for each of 100 candidates, totalling 300 fits


{'n_estimators': 1000,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 10,
 'bootstrap': False}

In [279]:
# Next, refine the search with GridSearchCV around the best values found by the random search

from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [False],
    'max_depth': [10, 20, 30, 40, 50],
    'max_features': ["auto"],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [2, 4, 6],
    'n_estimators': [600, 1000, 1400, 1800]
}
# Create the base model
rf = RandomForestClassifier()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)

# Fit the grid search to the data
grid_search.fit(x_train, y_train)
grid_search.best_params_


Fitting 3 folds for each of 180 candidates, totalling 540 fits


{'bootstrap': False,
 'max_depth': 50,
 'max_features': 'auto',
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 600}
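
The tuning report asked for in this task can be built from the `cv_results_` attribute of a fitted search object. The sketch below assumes synthetic data from `make_classification` and a deliberately tiny grid so that it runs quickly; `X_demo`, `small_grid`, `search`, and `report` are hypothetical names, not part of the notebook above:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption, not the Hepatitis C file).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# Deliberately tiny grid so the sketch runs in a few seconds.
small_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=3,
    return_train_score=True,  # also record scores on the training folds
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best mean validation score first.
columns = ["param_n_estimators", "param_max_depth",
           "mean_train_score", "mean_test_score", "rank_test_score"]
report = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(report)
```

A large difference between `mean_train_score` and `mean_test_score` for a given row is one sign that the combination overfits the training folds.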

In [280]:
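# Final model built from the grid-search result (min_samples_leaf=1 and min_samples_split=2
# are the defaults); criterion="entropy" was set by hand and was not part of the search
rfc = RandomForestClassifier(n_estimators=600, criterion="entropy", max_depth=50, bootstrap=False)
# Train on the training set and predict the classes of the test set
y_hat = rfc.fit(x_train, y_train).predict(x_test)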

In [281]:
# To-do : evaluate the model in terms of accuracy and precision 
# Provide evidence that your model is not overfitting 

print(precision_score(y_test,y_hat,labels=[0,1,2,3,4],average=None))
print(accuracy_score(y_test,y_hat))

[0.92727273 0.57142857 1.         1.         1.        ]
0.9105691056910569
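
The first printed line above is the precision of each class and the second is the overall accuracy. As a small worked example of how these values relate, here is a sketch on hand-made labels (an assumption for illustration, not the notebook's predictions):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Hand-made labels (an assumption for illustration only).
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]

# average=None returns one precision value per class, in the order of `labels`.
per_class = precision_score(y_true, y_pred, labels=[0, 1, 2], average=None)
# average="macro" is the unweighted mean of the per-class values.
macro = precision_score(y_true, y_pred, labels=[0, 1, 2], average="macro")
accuracy = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])

print(per_class)
print(macro, accuracy)
print(matrix)
```

Class 1 is predicted three times but is correct only twice, so its precision is 2/3 even though the overall accuracy is 5/6. The confusion matrix shows where the wrong prediction came from.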


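The model reaches about 91% accuracy on the test set (an error of about 9%), which is above the required 86%. Per-class precision is 1.0 for classes 2, 3, and 4 and 0.93 for class 0, but only 0.57 for class 1, and class 4 has just 3 test samples, so its score rests on very few predictions. A test score on its own does not rule out overfitting: that requires comparing it with the accuracy on the training set or with the cross-validation scores.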

> Hint: A perfect accuracy on the training set suggests an overfitted model. You should therefore provide a detailed table of the hyperparameter tuning, with a conclusion showing that the model reaches at least 86% accuracy on the test set without signs of overfitting.
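
One common way to look for overfitting is to compare the accuracy on the training set, the accuracy on the test set, and a cross-validated accuracy. The sketch below assumes synthetic data from `make_classification` and default-like forest settings; the names `model`, `train_acc`, `test_acc`, and `cv_acc` are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (an assumption, not the Hepatitis C file).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(Xd_train, yd_train)

# Accuracy on data the model has seen versus data it has not seen.
train_acc = accuracy_score(yd_train, model.predict(Xd_train))
test_acc = accuracy_score(yd_test, model.predict(Xd_test))
# cross_val_score refits clones of the model on 5 splits of the training set.
cv_acc = cross_val_score(model, Xd_train, yd_train, cv=5).mean()

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"5-fold CV accuracy on the training set: {cv_acc:.3f}")
print(f"train/test gap: {train_acc - test_acc:.3f}")
```

A training accuracy near 1.0 combined with a clearly lower test or cross-validation accuracy points towards overfitting; limiting `max_depth` or raising `min_samples_leaf` are typical ways to reduce the gap.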

## 3. Create the summary test set with the additional predicted class column: 
In this part you need to add the predicted class as a column to your test dataframe and save the result.

In [282]:
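# To-Do : create the complete test dataframe : it should contain all the feature columns + the actual target and the ID as well
# Reload the file, rename the saved index column to "ID", and move "Category" to the end.
# Note: this frame holds every row of the dataset, not only the 20% test split.
test_df = pd.read_csv("clustered_HepatitisC.csv").drop(["Category"], axis=1)
test_df.rename(columns={"Unnamed: 0": "ID"}, inplace=True)
test_df["Category"] = y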

In [283]:
test_df.head()

,ID,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT,cluster,Category
0,0,32,1,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0,3,0.0
1,1,32,1,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5,3,0.0
2,2,32,1,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3,3,0.0
3,3,32,1,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7,3,0.0
4,4,32,1,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7,3,0.0


In [284]:
# To-Do : Add the predicted_class column
# x contains every row (training and test), so predictions are made for the whole dataset
test_df["Predicted_class"] = rfc.predict(x)

In [285]:
test_df.head()

,ID,Age,Sex,ALB,ALP,ALT,AST,BIL,CHE,CHOL,CREA,GGT,PROT,cluster,Category,Predicted_class
0,0,32,1,38.5,52.5,7.7,22.1,7.5,6.93,3.23,106.0,12.1,69.0,3,0.0,0.0
1,1,32,1,38.5,70.3,18.0,24.7,3.9,11.17,4.8,74.0,15.6,76.5,3,0.0,0.0
2,2,32,1,46.9,74.7,36.2,52.6,6.1,8.84,5.2,86.0,33.2,79.3,3,0.0,0.0
3,3,32,1,43.2,52.0,30.6,22.6,18.9,7.33,4.74,80.0,33.8,75.7,3,0.0,0.0
4,4,32,1,39.2,74.1,32.6,24.8,9.6,9.15,4.32,76.0,29.9,68.7,3,0.0,0.0


> Make sure you have 16 columns in this test set

In [286]:
# Save the test set (index=False avoids writing the row index as an extra unnamed column)
test_df.to_csv("test_summary.csv", index=False)
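
The task description asks for a summary that contains only the held-out test rows. One possible way to build it, sketched here on a synthetic frame (an assumption, not the Hepatitis C file; `demo`, `clf`, and `summary` are hypothetical names), is to select rows by the index that `train_test_split` preserves:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in frame (an assumption, not the Hepatitis C file).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
demo.insert(0, "ID", range(len(demo)))
demo["Category"] = (demo["f1"] > 0).astype(int)

features = demo.drop(columns=["ID", "Category"])
target = demo["Category"]
xd_train, xd_test, yd_train, yd_test = train_test_split(
    features, target, test_size=0.2, random_state=100
)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(xd_train, yd_train)

# train_test_split keeps the original row labels, so xd_test.index selects
# exactly the held-out rows (with ID and target) from the full frame.
summary = demo.loc[xd_test.index].copy()
summary["predicted_class"] = clf.predict(xd_test)

print(summary.shape)
print(summary.head())
```

Selecting by index keeps the ID, the features, the actual target, and the prediction aligned row by row, and the training rows never enter the summary.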