# The Machine Learning process

<img src="images/process_ML.png" alt="Drawing" style="width: 1000px;"/>

# 0. Identify Business needs

First of all, we need to clearly identify the business needs.

<img src="images/phase01.png" alt="Drawing" style="width: 500px;"/>

We already covered this in the exercise from the previous class.

# 1. Import the needed libraries

The first step is always to import the libraries that we are going to use.
- `pandas` is a library for data manipulation and analysis.
- We are going to apply a Decision Tree Classifier, so we need to import `DecisionTreeClassifier` from `sklearn.tree`.
- Since we are going to create a predictive model, we need to split our data into at least two datasets: the train dataset (used to build the model) and the validation dataset (used to evaluate the performance of our model). For that, we need to import the function `train_test_split` from `sklearn.model_selection`.
- Finally, we want to assess the quality of our model. This time we are going to import `confusion_matrix` from `sklearn.metrics`.


__`Step 1`__ Import the following libraries/functions:

- `pandas` as `pd`
- `DecisionTreeClassifier` from `sklearn.tree`
- `train_test_split` from `sklearn.model_selection`
- `confusion_matrix` from `sklearn.metrics`

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# 2. Import data

<img src="images/phase02.png" alt="Drawing" style="width: 500px;"/>

The second step is to import our data. To do that, we can use the pandas library.

__`Step 2`__ Import the sheet `ClassifiedData` from the Excel file `Exercise1.xlsx` and store it in the object `drugs_truth`

In [2]:
drugs_truth = pd.read_excel('Exercise1.xlsx', sheet_name = 'ClassifiedData')
drugs_truth

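| | ID | BD1 | BD2 | BD3 | BD4 | BD5 | BD6 | BD7 | BD8 | BD9 | DrugPlant |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 789 | 68 | 16 | 90782 | 0 | 0 | 29 | 66 | 1402 | 0 |
| 1 | 1002 | 623 | 78 | 20 | 113023 | 0 | 0 | 31 | 6 | 1537 | 0 |
| 2 | 1003 | 583 | 24 | 18 | 28344 | 1 | 0 | 4 | 69 | 44 | 0 |
| 3 | 1004 | 893 | 59 | 19 | 93571 | 0 | 1 | 21 | 10 | 888 | 0 |
| 4 | 1006 | 792 | 32 | 20 | 22386 | 1 | 1 | 5 | 65 | 56 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7995 | 10995 | 1025 | 39 | 15 | 58121 | 1 | 1 | 4 | 6 | 61 | 0 |
| 7996 | 10996 | 967 | 28 | 17 | 54292 | 0 | 0 | 23 | 72 | 1011 | 1 |
| 7997 | 10997 | 637 | 76 | 15 | 125962 | 0 | 1 | 33 | 75 | 1668 | 0 |
| 7998 | 10998 | 586 | 69 | 19 | 99628 | 0 | 0 | 30 | 98 | 1469 | 0 |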


__`Step 3`__ Import the sheet `Data2Classify` from the Excel file `Exercise1.xlsx` and store it in the object `drugs_2classify`

In [1]:
drugs_2classify = pd.read_excel('Exercise1.xlsx', sheet_name = 'Data2Classify')
drugs_2classify


# 3. Explore the data

It is time to explore and understand the data we have.

<img src="images/phase03.png" alt="Drawing" style="width: 500px;"/>

__`Step 4`__ Check the first five rows of the dataset `drugs_truth` using the method `.head()`

In [None]:
drugs_truth.head()

__`Step 5`__ Using the method `.info()`, check the data types of the variables of `drugs_truth` and if there are any missing values.

In [None]:
drugs_truth.info()

__`Step 6`__ Get the main descriptive statistics for all the variables in `drugs_truth` using the method `.describe()`

In [None]:
drugs_truth.describe()

__NOTE:__ In this dataset we don't have categorical variables. However, if we want to check the descriptive statistics for categorical data, we just need to use the method `.describe(include = ['O'])`
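
As a small illustration of the note above, the sketch below builds a toy DataFrame with one categorical and one numeric column. The data and the names `toy`, `plant` and `height` are hypothetical and are not part of `Exercise1.xlsx`.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate .describe() on categorical columns
toy = pd.DataFrame({
    "plant": pd.Series(["A", "B", "A", "C"], dtype="object"),  # categorical
    "height": [1.2, 0.8, 1.5, 1.1],                            # numeric
})

# Default behaviour: only the numeric column is summarised
numeric_summary = toy.describe()

# include=['O'] summarises object columns: count, unique, top and freq
categorical_summary = toy.describe(include=["O"])
print(categorical_summary)
```

Here `top` is the most frequent category and `freq` is the number of times it appears.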

__`Step 7`__ What is the mean value of `BD4` when the target `DrugPlant` is equal to 0? And when it is equal to 1?

In [None]:
drugs_truth.groupby('DrugPlant')['BD4'].mean()

__`Step 8`__ How many observations do we have where DrugPlant is equal to 0? And to 1?

In [None]:
drugs_truth['DrugPlant'].value_counts()
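
Besides absolute counts, it can be useful to look at the share of each class. The sketch below uses a hypothetical toy target (not the real `DrugPlant` column) and the `normalize` parameter of `.value_counts()`.

```python
import pandas as pd

# Hypothetical toy target with three 0's and one 1
toy_target = pd.Series([0, 0, 0, 1], name="DrugPlant")

# Absolute number of observations per class
counts = toy_target.value_counts()

# Proportion of observations per class (the values sum to 1)
shares = toy_target.value_counts(normalize=True)

print(counts)
print(shares)
```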

__`Step 9`__ How many observations do we have where `BD3` is equal to 15?

In [None]:
# Count the rows where BD3 is equal to 15 (True counts as 1 when summed)
(drugs_truth['BD3'] == 15).sum()

__`Step 10`__ What is the Pearson correlation between all the variables?

In [None]:
drugs_truth.corr(method = 'pearson')
# drugs_truth.corr(method = 'spearman')
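
The full correlation matrix can be hard to read when there are many variables. One possible way to focus on the target, sketched below on hypothetical toy data (`feature_a`, `feature_b` and `target` are made-up names), is to keep only the target column of the matrix and rank the features by absolute correlation.

```python
import pandas as pd

# Hypothetical toy data, not taken from Exercise1.xlsx
toy = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5, 6],
    "feature_b": [2, 1, 2, 1, 2, 1],
    "target":    [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the target
corr_with_target = toy.corr(method="pearson")["target"].drop("target")

# Rank the features by the strength (absolute value) of the correlation
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Keep in mind that Pearson correlation only captures linear relationships, so a low value does not necessarily mean a variable is useless for a tree-based model.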

# 4. Modify the data

<img src="images/phase04.png" alt="Drawing" style="width: 500px;"/>

After exploring and understanding the data, we need to fix possible problems such as missing values or outliers, and we can create new variables with higher predictive power. <br>
For now, we are going to skip this. <br>However, to create a predictive model we need to identify our independent variables and the dependent one (the target), and we also need to split our data into at least two different datasets: train and validation.

__`Step 11`__ Create a new dataset named `X` that includes all the independent variables.

In [None]:
X = drugs_truth.iloc[:,:-1]
X

__`Step 12`__ Create a new dataset named `y` that includes the dependent variable (the last column, `DrugPlant`)

In [None]:
y = drugs_truth.iloc[:,-1]
y
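
Selecting by position with `.iloc` works because the target is the last column. An alternative that does not depend on column order is to select by name. The sketch below compares both on a hypothetical three-row DataFrame that only mimics the structure of `drugs_truth`.

```python
import pandas as pd

# Hypothetical toy data with the same kind of layout as drugs_truth
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "BD1": [10, 20, 30],
    "DrugPlant": [0, 1, 0],
})

# Selection by position: every column except the last / only the last
X_by_position = toy.iloc[:, :-1]
y_by_position = toy.iloc[:, -1]

# Selection by name: drop the target / keep only the target
X_by_name = toy.drop(columns="DrugPlant")
y_by_name = toy["DrugPlant"]

print(X_by_position.equals(X_by_name), y_by_position.equals(y_by_name))
```

Both versions keep `ID` among the independent variables, as the notebook does.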

__`Step 13`__ Using `train_test_split()`, split the data into train and validation, where the training dataset should contain 70% of the observations. (We are going to talk more about this in a future class).

In [None]:
X_train, X_validation, y_train, y_validation = train_test_split(X, y,
                                                                train_size = 0.7,
                                                                shuffle = True,
                                                                stratify = y)
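
To see what `stratify` does, the sketch below splits a hypothetical unbalanced toy dataset (90 zeros and 10 ones) and checks the proportion of 1's on each side. The `random_state` value is an assumption added only to make the sketch reproducible; the notebook does not set it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical unbalanced toy data: 90 zeros and 10 ones
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([0] * 90 + [1] * 10, name="target")

# stratify=y_toy keeps the class proportions similar in both splits
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy,
    train_size=0.7,
    shuffle=True,
    stratify=y_toy,
    random_state=42,  # assumption: fixed seed for reproducibility
)

# Share of 1's in the train and in the validation targets
print(y_tr.mean(), y_val.mean())
```

Without `stratify`, a random split of a small unbalanced dataset could leave very few 1's on one of the two sides.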

# 5. Modelling - Create a predictive model

It is time to create a model. At this step, we are going to use a simple algorithm called a "Decision Tree".

<img src="images/phase05.png" alt="Drawing" style="width: 500px;"/>

__`Step 14`__ Create an instance of a `DecisionTreeClassifier` named `dt` with the default parameters and fit the instance to the training data (again, we are going to talk more about this later).

In [None]:
dt = DecisionTreeClassifier().fit(X_train, y_train)

__`Step 15`__ Using the model just created in the previous step, predict the values of the target in the train dataset using the method `.predict()`. Assign those values to the object `predictions_train`

In [None]:
predictions_train = dt.predict(X_train)
predictions_train 

__`Step 16`__ Similarly to what you have done in the previous step, predict the target values for the validation dataset and assign those values to the object `predictions_val`

In [None]:
predictions_val = dt.predict(X_validation)
predictions_val

# 6. Assess

We already have the ground truth and the predicted values, so we can start evaluating the performance of our model on the train and the validation datasets.

<img src="images/phase06.png" alt="Drawing" style="width: 500px;"/>

__`Step 17`__ Using the method `.score()`, check the mean accuracy of the model `dt` in the train dataset.

In [None]:
dt.score(X_train, y_train)

__`Step 18`__ Similarly to what you have done in step 17, check the mean accuracy now for the validation dataset.

In [None]:
dt.score(X_validation, y_validation)

Are we dealing with a case of __overfitting__? <br>
Yes, decision trees are known to be prone to overfitting. <br>
Luckily, there are strategies to avoid this problem. <br>
In the next classes, we are going to understand better what overfitting is and how to avoid it in the different algorithms.
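
A rough way to spot overfitting is to compare the accuracy on the train and on the validation data: a large gap suggests the model memorised the training set. The sketch below illustrates this on synthetic data generated with `make_classification` (not the course dataset). Limiting `max_depth` is shown as one commonly used option, not necessarily the strategy covered in the next classes, and all parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (flip_y randomly reassigns part of the labels)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    flip_y=0.2, random_state=0,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, train_size=0.7, stratify=y_syn, random_state=0
)

# A fully grown tree versus a tree limited to three levels
models = {
    "default tree": DecisionTreeClassifier(random_state=0),
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Store (train accuracy, validation accuracy) for each model
    scores[name] = (model.score(X_tr, y_tr), model.score(X_val, y_val))
    print(name, scores[name])
```

The fully grown tree typically reaches a perfect score on the training data while scoring clearly lower on the validation data, which is the pattern to look for in steps 17 and 18.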

It is time to check the confusion matrix of the model for the training and the validation dataset. <br> <br>
__`Step 19`__ Check the confusion matrix for the training dataset, passing as parameters the ground truth (`y_train`) and the predicted values (`predictions_train`). For a binary target, the output has the following layout:<br>
[[TN, FP],<br>
[FN, TP]]
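
The layout above can be checked on a small hand-made example. The labels below are hypothetical; for a binary target, `.ravel()` flattens the matrix in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_pred)

# Flatten the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```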

In [None]:
confusion_matrix(y_train, predictions_train)

__`Step 20`__ Do the same for the validation dataset.

In [None]:
confusion_matrix(y_validation, predictions_val)

__Can we conclude something from the results above?__ <br>It seems that our model is not so good at predicting the 1's in the target. <br>__Why?__ <br>Because we are dealing with an unbalanced dataset (more about this in the future). 

We are also going to learn different metrics that allow us to better understand the performance of our model on unbalanced datasets, since the mean accuracy is not a good metric for those cases.
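
To illustrate why accuracy can be misleading on unbalanced data, the sketch below uses hypothetical labels where only 4 out of 20 observations are 1's and the predictions find just one of them. Precision, recall and F1 are shown as examples of alternative metrics available in `sklearn.metrics`; they are not necessarily the ones that will be covered in class.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical unbalanced labels: 16 zeros and 4 ones
y_true = [0] * 16 + [1] * 4
# Hypothetical predictions that only find one of the four 1's
y_pred = [0] * 16 + [1, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # share of correct predictions
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

The accuracy looks high because most 0's are predicted correctly, while the recall shows that three out of four 1's were missed.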

# 7. Deploy

In the end, we want to classify the unclassified data. If we are satisfied with our model, we can now predict the target for the new dataset.

__`Step 21`__ Check the dataset that we want to classify, imported as `drugs_2classify`

In [None]:
drugs_2classify

__`Step 22`__ Using the `.predict()` method and the model `dt` created before, predict the target on the new dataset and assign those values to a column named `DrugPlant`

In [None]:
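# Select the same columns used for training, so the cell can be re-run
# after the DrugPlant column has been added
drugs_2classify['DrugPlant'] = dt.predict(drugs_2classify[X.columns])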

__`Step 23`__ Check the new dataset.

In [None]:
drugs_2classify

We have now predicted the target for our new dataset!