# The Machine Learning process

<img src="images/process_ML.png" alt="Drawing" style="width: 1000px;"/>

# 0. Identify Business needs

First of all, we need to clearly identify the business needs.

<img src="images/phase01.png" alt="Drawing" style="width: 500px;"/>

We already saw this in the exercise of the previous class.

# 1. Import the needed libraries

The first step is always to import the libraries that we are going to use.
- `pandas` is a library for data manipulation and analysis.
- Later on, we are going to apply a Decision Tree Classifier, so we need to import `DecisionTreeClassifier` from `sklearn.tree`.
- Since we are going to create a predictive model, we need to split our data into at least two datasets: the train dataset (used to build the model) and the validation dataset (used to evaluate the performance of our model). For that, we need to import the function `train_test_split` from `sklearn.model_selection`.
- Finally, we want to assess the quality of our model. This time we are going to import `confusion_matrix` from `sklearn.metrics`.


__`Step 1`__ Import the following libraries/functions:

- `pandas` as `pd`
- `DecisionTreeClassifier` from `sklearn.tree`
- `train_test_split` from `sklearn.model_selection`
- `confusion_matrix` from `sklearn.metrics`

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# 2. Import data

<img src="images/phase02.png" alt="Drawing" style="width: 500px;"/>

The second step is to import our data. To do that, we can use the pandas library.

__`Step 2`__ Import the sheet `ClassifiedData` from the Excel file `Exercise1.xlsx` and store it in the object `drugs_truth`.

In [2]:
drugs_truth = pd.read_excel('Exercise1.xlsx', sheet_name = 'ClassifiedData', index_col='ID')

__`Step 3`__ Import the sheet `Data2Classify` from the Excel file `Exercise1.xlsx` and store it in the object `drugs_2classify`.

In [3]:
drugs_2classify = pd.read_excel('Exercise1.xlsx', sheet_name = 'Data2Classify', index_col='ID')

# 3. Explore the data

It is time to explore and understand the data we have.

<img src="images/phase03.png" alt="Drawing" style="width: 500px;"/>

__`Step 4`__ Check the first five rows of the dataset `drugs_truth` using the method `.head()`

In [4]:
drugs_truth.head()

ID,BD1,BD2,BD3,BD4,BD5,BD6,BD7,BD8,BD9,DrugPlant,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
1001.0,789.0,68.0,16.0,90782.0,0.0,0.0,29.0,66.0,1402.0,0.0,...,,,,,,,,,,
1002.0,623.0,78.0,20.0,113023.0,0.0,0.0,31.0,6.0,1537.0,0.0,...,Row Labels,Average of BD9,Average of BD7,Average of BD8,Average of BD6,Average of BD5,Average of BD4,Average of BD3,Average of BD2,Average of BD1
1003.0,583.0,24.0,18.0,28344.0,1.0,0.0,4.0,69.0,44.0,0.0,...,0,550.330564,13.363259,62.820716,0.491223,0.442985,67573.95337,16.721828,46.632989,899.088838
1004.0,893.0,59.0,19.0,93571.0,0.0,1.0,21.0,10.0,888.0,0.0,...,1,1640.487896,32.651769,52.521415,0.193669,0.063315,103339.467412,16.932961,67.404097,892.603352
1006.0,792.0,32.0,20.0,22386.0,1.0,1.0,5.0,65.0,56.0,0.0,...,Grand Total,623.507375,14.658,62.129375,0.47125,0.4175,69974.7135,16.736,48.02725,898.6535


__`Step 5`__ Using the method `.info()`, check the data types of the variables of `drugs_truth` and if there are any missing values.

In [5]:
drugs_truth.info()

<class 'pandas.core.frame.DataFrame'>
Float64Index: 8002 entries, 1001.0 to nan
Data columns (total 25 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   BD1          8000 non-null   float64
 1   BD2          8000 non-null   float64
 2   BD3          8000 non-null   float64
 3   BD4          8000 non-null   float64
 4   BD5          8000 non-null   float64
 5   BD6          8000 non-null   float64
 6   BD7          8000 non-null   float64
 7   BD8          8000 non-null   float64
 8   BD9          8000 non-null   float64
 9   DrugPlant    8002 non-null   float64
 10  Model 1      8002 non-null   object 
 11  Dif 1        8000 non-null   float64
 12  Model 2      8000 non-null   float64
 13  Dif 2        8000 non-null   float64
 14  Unnamed: 15  0 non-null      float64
 15  Unnamed: 16  5 non-null      object 
 16  Unnamed: 17  6 non-null      object 
 17  Unnamed: 18  6 non-null      object 
 18  Unnamed: 19  4 non-null      object 
 19  
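
The outputs above suggest that the sheet holds more than the raw observations: there are `Unnamed` columns containing pivot-table text, and 8002 index entries for only 8000 complete rows. Left in place, such residue can carry missing values into later steps. A minimal sketch of one way to strip it, using a small hypothetical frame (`raw`, its values, and the cleaning rule are assumptions for illustration, not the contents of `Exercise1.xlsx`):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of a sheet that carries extra material next to the data
raw = pd.DataFrame(
    {
        "BD1": [789.0, 623.0, np.nan],
        "DrugPlant": [0.0, 1.0, 0.5],  # the last row mimics a summary row
        "Unnamed: 15": [np.nan, np.nan, np.nan],
        "Unnamed: 16": [np.nan, "Row Labels", "Grand Total"],
    },
    index=pd.Index([1001.0, 1002.0, np.nan], name="ID"),
)

# Drop columns whose name starts with "Unnamed" (the label pandas gives to header-less columns)
unnamed = [col for col in raw.columns if str(col).startswith("Unnamed")]
clean = raw.drop(columns=unnamed)

# Drop rows that have no feature values (here: BD1 is missing)
clean = clean.dropna(subset=["BD1"])

print(clean.shape)  # (2, 2)
```

Whether dropping is the right choice depends on what the extra rows and columns mean, so it is worth inspecting them first.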

__`Step 6`__ Get the main descriptive statistics for all the variables in `drugs_truth` using the method `.describe()`

In [6]:
drugs_truth.describe()

Unnamed: 0,BD1,BD2,BD3,BD4,BD5,BD6,BD7,BD8,BD9,DrugPlant,Dif 1,Model 2,Dif 2,Unnamed: 15
count,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8002.0,8000.0,8000.0,8000.0,0.0
mean,898.6535,48.02725,16.736,69974.7135,0.4175,0.47125,14.658,62.129375,623.507375,0.134225,0.222125,0.0,0.067125,
std,202.201258,17.236775,1.871161,27540.800759,0.493178,0.499204,11.937173,68.38289,645.552196,6.007555,0.415701,0.0,0.250254,
min,550.0,18.0,12.0,10000.0,0.0,0.0,1.0,0.0,6.0,0.0,0.0,0.0,0.0,
25%,723.0,33.0,15.0,47841.5,0.0,0.0,4.0,26.0,63.0,0.0,0.0,0.0,0.0,
50%,894.0,48.0,17.0,70176.0,0.0,0.0,12.0,53.0,385.5,0.0,0.0,0.0,0.0,
75%,1075.25,63.0,18.0,92076.25,1.0,1.0,24.0,79.0,1076.0,0.0,0.0,0.0,0.0,
max,1250.0,78.0,20.0,139730.0,1.0,1.0,56.0,549.0,3052.0,537.0,1.0,0.0,1.0,


__NOTE:__ The features in this dataset (`BD1` to `BD9`) are all numeric. However, if we want to check the descriptive statistics for categorical data, we just need to use the method `.describe(include=['O'])`.
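
As a small illustration of that option, here is a sketch on a hypothetical frame (`toy` and its columns are made up for this example). For object columns, `.describe()` reports the count, the number of unique values, the most frequent value (`top`) and how often it occurs (`freq`):

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical (object) column
toy = pd.DataFrame(
    {
        "dose": [10, 20, 20, 30],
        "color": pd.Series(["red", "blue", "red", "red"], dtype=object),
    }
)

# include=['O'] restricts the summary to object columns
summary = toy.describe(include=["O"])
print(summary)
```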

__`Step 7`__ What is the mean value of `BD4` when the target `DrugPlant` is equal to 0? And when it is equal to 1?

In [7]:
drugs_truth.groupby('DrugPlant')['BD4'].mean()

DrugPlant
0.000000       67573.953370
0.067125                NaN
1.000000      103339.467412
537.000000              NaN
Name: BD4, dtype: float64

__`Step 8`__ How many observations do we have where DrugPlant is equal to 0? And to 1?

In [8]:
drugs_truth['DrugPlant'].value_counts()

0.000000      7463
1.000000       537
537.000000       1
0.067125         1
Name: DrugPlant, dtype: int64
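
Absolute counts are easier to compare as proportions. The sketch below rebuilds a target with the two main class counts reported above (7463 and 537) and uses `normalize=True`; the series `target` is a stand-in created for this example, not the original column:

```python
import pandas as pd

# Stand-in target with the class counts reported above (7463 zeros, 537 ones)
target = pd.Series([0] * 7463 + [1] * 537, name="DrugPlant")

# normalize=True returns proportions instead of absolute counts
shares = target.value_counts(normalize=True)
print(shares.round(4))
```

Only about 6.7% of the observations belong to class 1, which matters later when the model is assessed.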

__`Step 9`__ How many observations do we have where `BD3` is equal to 15?

In [9]:
drugs_truth[drugs_truth['BD3'] == 15]

ID,BD1,BD2,BD3,BD4,BD5,BD6,BD7,BD8,BD9,DrugPlant,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
1013.0,657.0,60.0,15.0,95586.0,0.0,0.0,29.0,54.0,1407.0,1.0,...,,,,,,,,,,
1020.0,589.0,26.0,15.0,40944.0,1.0,0.0,4.0,67.0,52.0,0.0,...,,,,,,,,,,
1026.0,824.0,56.0,15.0,90390.0,0.0,0.0,22.0,52.0,959.0,0.0,...,,,,,,,,,,
1034.0,657.0,32.0,15.0,54327.0,1.0,0.0,4.0,21.0,47.0,0.0,...,,,,,,,,,,
1049.0,927.0,35.0,15.0,45434.0,1.0,1.0,4.0,14.0,90.0,0.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10980.0,835.0,65.0,15.0,89275.0,0.0,0.0,23.0,5.0,1022.0,1.0,...,,,,,,,,,,
10981.0,771.0,74.0,15.0,120600.0,0.0,0.0,38.0,16.0,1978.0,0.0,...,,,,,,,,,,
10984.0,616.0,47.0,15.0,74348.0,0.0,1.0,6.0,97.0,153.0,0.0,...,,,,,,,,,,
10995.0,1025.0,39.0,15.0,58121.0,1.0,1.0,4.0,6.0,61.0,0.0,...,,,,,,,,,,
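
Filtering displays the matching rows; to get the number of observations directly, the boolean mask can be summed. A minimal sketch on a hypothetical column (`toy` is made up for this example):

```python
import pandas as pd

toy = pd.DataFrame({"BD3": [15, 18, 15, 20, 15]})

# The comparison yields a boolean Series; True counts as 1 when summed
mask = toy["BD3"] == 15
n_matches = int(mask.sum())

print(n_matches)       # 3
print(len(toy[mask]))  # 3, the same answer via the filtered frame
```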


__`Step 10`__ Look for correlations between the different features with the method `.corr(method = 'spearman')`

In [10]:
#compute the correlation matrix of the features
drugs_truth.drop(columns = 'DrugPlant').corr(method = 'spearman')



Unnamed: 0,BD1,BD2,BD3,BD4,BD5,BD6,BD7,BD8,BD9,Dif 1,Model 2,Dif 2,Unnamed: 15
BD1,1.0,-0.017646,0.001168,-0.021609,0.005449,0.024921,0.169162,-0.004261,0.181528,0.112285,,-0.008087,
BD2,-0.017646,1.0,0.180672,0.938112,-0.616049,0.006364,0.837983,-0.068571,0.849685,0.561905,,0.302672,
BD3,0.001168,0.180672,1.0,0.164215,-0.072241,0.089748,0.127293,0.000734,0.13815,0.067146,,0.025491,
BD4,-0.021609,0.938112,0.164215,1.0,-0.585185,0.006957,0.874746,-0.068718,0.890018,0.57524,,0.318559,
BD5,0.005449,-0.616049,-0.072241,-0.585185,1.0,-0.057875,-0.603314,0.049657,-0.610291,-0.371909,,-0.192657,
BD6,0.024921,0.006364,0.089748,0.006957,-0.057875,1.0,-0.123633,-0.021259,-0.113225,-0.196037,,-0.149166,
BD7,0.169162,0.837983,0.127293,0.874746,-0.603314,-0.123633,1.0,-0.110537,0.993572,0.633204,,0.342952,
BD8,-0.004261,-0.068571,0.000734,-0.068718,0.049657,-0.021259,-0.110537,1.0,-0.107492,-0.02789,,-0.004545,
BD9,0.181528,0.849685,0.13815,0.890018,-0.610291,-0.113225,0.993572,-0.107492,1.0,0.632482,,0.34239,
Dif 1,0.112285,0.561905,0.067146,0.57524,-0.371909,-0.196037,0.633204,-0.02789,0.632482,1.0,,-0.064029,
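
In a matrix of this size it is easy to miss strongly related pairs such as `BD7` and `BD9` (0.993572 above). One possible way to list them is to keep the upper triangle of the matrix and filter by absolute value. The sketch below uses synthetic columns `a`, `b`, `c` and a threshold of 0.9, all of which are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical features: b is a monotonic transform of a, c is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": np.exp(a), "c": rng.normal(size=200)})

corr = toy.corr(method="spearman")

# Keep each pair once: upper triangle only, diagonal excluded
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Turn the matrix into (row, column) -> value and keep the strong pairs
pairs = upper.stack()
strong = pairs[pairs.abs() > 0.9]
print(strong)
```

Because Spearman works on ranks, `a` and `exp(a)` come out as perfectly correlated even though their relationship is not linear.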


# 4. Modify the data

<img src="images/phase04.png" alt="Drawing" style="width: 500px;"/>

After exploring and understanding the data, we need to fix possible problems such as missing values or outliers, and we can create new variables with higher predictive power. <br>
For now, we are going to set this aside. <br>However, to create a predictive model we need to identify the independent variables and the dependent one (the target), and we need to split our data into at least two different datasets: the train and the validation.

__`Step 11`__ Create a new dataset named `X` that includes all the independent variables (`BD1` to `BD9`). The sheet also holds helper columns and two summary rows without feature values, so select the features by name and drop the incomplete rows.

In [11]:
# keep the nine features and drop the two summary rows that have no feature values
X = drugs_truth.loc[:, 'BD1':'BD9'].dropna()

__`Step 12`__ Create a new dataset named `y` that includes the dependent variable (`DrugPlant`) for the same rows as `X`.

In [12]:
# select the target by name, restricted to the rows kept in X
y = drugs_truth.loc[X.index, 'DrugPlant']

__`Step 13`__ Using `train_test_split()`, split the data into train and validation, where the training dataset should contain 70% of the observations. (We are going to talk more about this in a future class.)

In [13]:
X_train, X_validation, y_train, y_validation = train_test_split(X, y,
                                                                train_size = 0.7,
                                                                shuffle = True,
                                                                stratify = y)
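
The `stratify` argument asks `train_test_split` to keep the class proportions of the target the same in both subsets, which is useful when one class is rare. A small sketch on hypothetical arrays (`X_toy`, `y_toy` and the fixed `random_state` are assumptions for this example; `random_state` only makes the shuffle reproducible):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 100 rows, 10% positives
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 90 + [1] * 10)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, train_size=0.7, shuffle=True, stratify=y_toy, random_state=42
)

print(len(X_tr), len(X_val))      # 70 30
print(y_tr.sum(), y_val.sum())    # 7 3 -> 10% positives in both subsets
```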

# 5. Modelling - Create a predictive model

It is time to create a model. At this step, we are going to implement a simple algorithm called a decision tree.

<img src="images/phase05.png" alt="Drawing" style="width: 500px;"/>

__`Step 14`__ Create an instance of `DecisionTreeClassifier` named `dt` with the default parameters and fit it to the training data (again, we are going to talk more about this later).

In [None]:
dt = DecisionTreeClassifier().fit(X_train, y_train)

__`Step 15`__ Using the model just created in the previous step, predict the values of the target in the train dataset using the method `.predict()`. Assign those values to the object `predictions_train`

In [None]:
predictions_train = dt.predict(X_train)

__`Step 16`__ Similarly to what you have done in the previous step, predict the target values for the validation dataset and assign those values to the object `predictions_val`

In [None]:
predictions_val = dt.predict(X_validation)

# 6. Assess

We already have the ground truth and the predicted values, so we can start evaluating the performance of our model on the train and the validation datasets.

<img src="images/phase06.png" alt="Drawing" style="width: 500px;"/>

__`Step 17`__ Using the method `.score()`, check the mean accuracy of the model `dt` on the train dataset.

In [None]:
dt.score(X_train, y_train)

__`Step 18`__ Similarly to what you have done in step 17, check the mean accuracy now for the validation dataset.

In [None]:
dt.score(X_validation, y_validation)

Are we dealing with a case of __overfitting__? <br>
Yes, decision trees are known to be prone to overfitting. <br>
Luckily, there are strategies to avoid this problem. <br>
We are going to look more closely at what overfitting is, and how to avoid it in the different algorithms, in the next classes.
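
As a preview, the sketch below compares an unrestricted tree with one whose depth is capped. It runs on synthetic data from `make_classification`, not on the exercise data, and `max_depth=3` is an arbitrary value chosen for illustration; it is one possible strategy, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy, imbalanced data (an assumption for illustration only)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=9, n_informative=4, weights=[0.9, 0.1],
    flip_y=0.05, random_state=0,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, train_size=0.7, stratify=y_syn, random_state=0
)

# An unrestricted tree can memorise the training set; max_depth limits its growth
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Print train accuracy and validation accuracy for both trees
for name, model in [("unrestricted", deep), ("max_depth=3", shallow)]:
    print(name, model.score(X_tr, y_tr), model.score(X_val, y_val))
```

A large gap between the train and the validation accuracy is the usual sign of overfitting.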

It is time to check the confusion matrix of the model for the training and the validation dataset. <br> <br>
__`Step 19`__ Check the confusion matrix for the training dataset, passing as parameters the ground truth (`y_train`) and the predicted values (`predictions_train`). For a binary target, the result has the following layout:<br>
[[TN, FP],<br>
[FN, TP]]

In [None]:
confusion_matrix(y_train, predictions_train)

__`Step 20`__ Do the same for the validation dataset.

In [None]:
confusion_matrix(y_validation, predictions_val)
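
To make the `[[TN, FP], [FN, TP]]` layout concrete, the four cells can be counted by hand. The helper `confusion_counts` and the label lists below are hypothetical and only mirror the layout described above for a binary 0/1 target:

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0/1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 0 and pred == 0:
            tn += 1  # correctly predicted 0
        elif truth == 0 and pred == 1:
            fp += 1  # predicted 1, but the truth is 0
        elif truth == 1 and pred == 0:
            fn += 1  # predicted 0, but the truth is 1
        else:
            tp += 1  # correctly predicted 1
    return [[tn, fp], [fn, tp]]


y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0]
matrix = confusion_counts(y_true, y_pred)
print(matrix)  # [[4, 1], [2, 1]]
```

Rows correspond to the ground truth and columns to the predictions.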

__Can we conclude something from the results above?__ <br>It seems that our model is not very good at predicting the 1's in the target. <br>__Why?__ <br>Because we are dealing with an unbalanced dataset (more about this in the future).

We are also going to learn different metrics that give a better understanding of the performance of our model on unbalanced datasets; the mean accuracy is not a good metric for those cases.
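
A quick calculation shows why accuracy can mislead here. Using the class counts seen during exploration (7463 zeros and 537 ones), consider a hypothetical "model" that always predicts 0. Recall, used below, is one example of such an alternative metric:

```python
# Class counts reported earlier for DrugPlant
n_negative, n_positive = 7463, 537

# Always predicting 0 gets every negative right and every positive wrong
tn, fp, fn, tp = n_negative, 0, n_positive, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)  # share of the actual 1's that were found

print(round(accuracy, 4), recall)  # 0.9329 0.0
```

The accuracy looks high although not a single observation of class 1 is detected.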

# 7. Deploy

In the end, we want to classify the unclassified data. If we are satisfied with our model, we can now predict the target for the new dataset.

__`Step 21`__ Check the dataset that we want to classify, imported as `drugs_2classify`

In [None]:
drugs_2classify

__`Step 22`__ Using the `.predict()` method of the model `dt`, predict the target for the new dataset and assign those values to a column named `DrugPlant`.

In [None]:
drugs_2classify['DrugPlant'] = dt.predict(drugs_2classify)
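
One thing to keep in mind: once the prediction is stored as a column, the frame no longer contains only features, so passing the whole frame to `.predict()` a second time would include the new column as well. A possible safeguard is to select the feature columns explicitly. The sketch below uses hypothetical frames and names (`train`, `new_rows`, `feature_cols`, `label`):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: f1 fully determines the label
train = pd.DataFrame({"f1": [0, 1, 0, 1], "f2": [5, 6, 7, 8], "label": [0, 1, 0, 1]})
feature_cols = ["f1", "f2"]
model = DecisionTreeClassifier(random_state=0).fit(train[feature_cols], train["label"])

new_rows = pd.DataFrame({"f1": [1, 0], "f2": [6, 7]})

# Pass only the feature columns, so the cell can be re-run safely
new_rows["label"] = model.predict(new_rows[feature_cols])
new_rows["label"] = model.predict(new_rows[feature_cols])

print(new_rows["label"].tolist())  # [1, 0]
```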

__`Step 23`__ Check the new dataset.

In [None]:
drugs_2classify

Now we have predicted the target for our new dataset! Next, if we wish to save a set of predictions, we can export them to a CSV file.

In [None]:
#export test data predictions
drugs_2classify['DrugPlant'].to_csv('Exercise1_predictions.csv')