

# ***Supervised Machine Learning: Classification***

## ***Final Project***

$ \ $

One of the main objectives of this course is to help you gain hands-on experience in communicating insightful and impactful findings to stakeholders. In this project you will use the tools and techniques you learned throughout this course to train a few classification models on a data set that you feel passionate about, select the model that best suits your needs, and communicate the insights you found from your modeling exercise.

$ \ $

-----

## ***Objectives***

After completing this lab you will be able to:

* Understand the dataset.

* Use ***LabelEncoder*** to turn the categorical features of the dataset into numerical values.

* Scale the feature data.

* Split the data into training and testing data.

* Take several classification models and use their output as input to the metaclassifier for the final prediction.





$ \ $

----

## ***Load and explore dataset***

$ \ $

$(1)$ Import the required packages.

In [31]:
import pandas as pd
import numpy as np
import scipy.optimize as opt
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn import metrics
from sklearn import preprocessing
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

$ \ $ 

$(2)$ Suppress warnings.

In [4]:
# Suppress deprecation, user, runtime, and future warnings
import warnings
warnings.filterwarnings("ignore", category = DeprecationWarning) 
warnings.filterwarnings("ignore", category = UserWarning)
warnings.filterwarnings("ignore", category = RuntimeWarning) 
warnings.filterwarnings("ignore", category = FutureWarning)

$ \ $

----

## ***Dataset to study***

$ \ $

We are going to take the collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of 5 medications:

* Drug $A$, 

* Drug $B$,

* Drug $C$, 

* Drug $X$,

* Drug $Y$.

Part of our job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The $\color{lightblue}{\text{features}}$ of this dataset are:

* $\color{lightblue}{\text{Age}}$, 

* $\color{lightblue}{\text{Sex}}$, 

* $\color{lightblue}{\text{Blood Pressure}}$,

* $\color{lightblue}{\text{Cholesterol}}$,

* $\color{lightblue}{\text{Na_to_K}}$ (the sodium-to-potassium ratio),

and the $\color{aquamarine}{\text{target}}$ is the $\color{aquamarine}{\text{drug}}$ that each patient responded to. This is an example of a multiclass classification problem: you can use the training part of the dataset to build a classifier (for example, a decision tree), and then use it to predict the class of an unknown patient, that is, to prescribe a drug to a new patient.

$ \ $

$(1)$ Let's load the dataset.



In [5]:
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/drug200.csv", delimiter=",")
df.head()

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,drugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,drugY


$ \ $ 

$(2)$ Let's create the  $X$  and  $y$  for our dataset.

In [15]:
X = df[['Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K']].values
y = df["Drug"] 

$ \ $

$(3)$ Now let's use a <code>LabelEncoder</code> to turn categorical features into numerical.

In [16]:
# create a label encoder for the Sex column (column index 1 of X)
le_sex = preprocessing.LabelEncoder()

# fit the label encoder on the labels ['F', 'M']
le_sex.fit(['F','M'])

# apply the fitted encoder to transform the data in X[:, 1]
X[:,1] = le_sex.transform(X[:, 1]) 

In [17]:
# create a label encoder for the BP column (column index 2 of X)
le_BP = preprocessing.LabelEncoder()

# fit the label encoder on the labels ['LOW', 'NORMAL', 'HIGH']
le_BP.fit(['LOW', 'NORMAL', 'HIGH'])

# apply the fitted encoder to transform the data in X[:, 2]
X[:,2] = le_BP.transform(X[:, 2])

In [18]:
# create a label encoder for the Cholesterol column (column index 3 of X)
le_Chol = preprocessing.LabelEncoder()

# fit the label encoder on the labels ['NORMAL', 'HIGH']
le_Chol.fit(['NORMAL', 'HIGH'])

# apply the fitted encoder to transform the data in X[:, 3]
X[:, 3] = le_Chol.transform(X[:, 3]) 
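
As a small check on the encoding step, the sketch below shows which integer each blood pressure label receives. `LabelEncoder` stores its classes in sorted order, so the codes follow alphabetical order rather than the order passed to `fit`. The names `le_bp_demo` and `bp_mapping` are illustrative only and are not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Fit a separate demo encoder on the same labels used for the BP column
le_bp_demo = LabelEncoder()
le_bp_demo.fit(['LOW', 'NORMAL', 'HIGH'])

# classes_ is sorted, so each label's code is its position in that array
bp_mapping = {
    str(label): int(code)
    for label, code in zip(le_bp_demo.classes_,
                           le_bp_demo.transform(le_bp_demo.classes_))
}
print(bp_mapping)  # {'HIGH': 0, 'LOW': 1, 'NORMAL': 2}
```

The codes do not follow the natural LOW, NORMAL, HIGH ordering, which may matter for distance-based models such as KNN.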

$ \ $

$(4)$  We scale the feature data.

In [20]:
# create a scaler
scaler = preprocessing.StandardScaler()

# fit the scaler on the data X
scaler.fit(X)

# transform X into standardized numerical data using the fitted scaler
X = scaler.transform(X)
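
To make the scaling step concrete, here is a minimal sketch on a toy array (the values are taken from the first rows of the `Age` and `Na_to_K` columns shown earlier; the variable names are illustrative). It checks that `StandardScaler` subtracts each column's mean and divides by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two numeric columns (Age, Na_to_K) for three patients
toy = np.array([[23.0, 25.355],
                [47.0, 13.093],
                [61.0, 18.043]])

# Standardize with scikit-learn
scaled = StandardScaler().fit_transform(toy)

# Same computation by hand: z = (x - mean) / std, column by column
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

After scaling, each column has mean 0 and standard deviation 1, so no single feature dominates distance computations because of its units.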

$ \ $

$(5)$ Split the data into training and testing data with an $80/20$ split.

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 4)

In [37]:
print('Train set:', X_train.shape,  y_train.shape)
print('Test set:', X_test.shape,  y_test.shape)

Train set: (160, 5) (160,)
Test set: (40, 5) (40,)


$ \ $

$(6)$ We define our list of estimators: the individual model objects, or base learners, each paired with a name.

In [23]:
estimators = [('SVM', SVC(random_state = 42)), ('knn', KNeighborsClassifier()), ('dt', DecisionTreeClassifier())]
estimators

[('SVM', SVC(random_state=42)),
 ('knn', KNeighborsClassifier()),
 ('dt', DecisionTreeClassifier())]

$ \ $ 

$(7)$  We create a Stacking Classifier.

In [24]:
clf = StackingClassifier(estimators = estimators, final_estimator = LogisticRegression())
clf 
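
To illustrate how stacking passes base-model outputs to the final estimator, here is a minimal self-contained sketch. It uses synthetic three-class data from `make_classification` instead of the drug dataset, and every name ending in `_demo` is an assumption made for illustration. `transform` returns the base-model outputs that the final `LogisticRegression` is trained on.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data (an assumption for illustration, not the drug data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=4)

# Same base learners and final estimator as in the notebook
stack_demo = StackingClassifier(
    estimators=[('SVM', SVC(random_state=42)),
                ('knn', KNeighborsClassifier()),
                ('dt', DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000))
stack_demo.fit(X_tr, y_tr)

# Base-model outputs that serve as inputs to the final estimator
meta_features_demo = stack_demo.transform(X_te)
pred_demo = stack_demo.predict(X_te)
print(meta_features_demo.shape, pred_demo.shape)
```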

$ \ $ 

$(8)$ To tune the hyperparameters of the base models, each key in the hyperparameter dictionary is the name of a base model, followed by a double underscore and the name of the parameter we would like to vary.

In [25]:
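# note: range(10) starts at 0, which scikit-learn rejects as a tree depth, so those candidates fail to fit
param_grid = {'dt__max_depth': list(range(10)),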
              'dt__random_state': [0], 
              'SVM__C': [0.01, 0.1, 1], 
              'SVM__kernel': ['linear', 'poly', 'rbf'],
              'knn__n_neighbors': [1, 4, 8, 9]}
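
The `name__parameter` convention can be checked directly. The sketch below, which rebuilds the stacking classifier under the illustrative name `stack_demo`, confirms that every key used in the grid is a valid parameter of the stacked model and shows that `set_params` accepts the same key format.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

stack_demo = StackingClassifier(
    estimators=[('SVM', SVC(random_state=42)),
                ('knn', KNeighborsClassifier()),
                ('dt', DecisionTreeClassifier())],
    final_estimator=LogisticRegression())

# Every tunable parameter is exposed as '<estimator name>__<parameter>'
valid_keys = stack_demo.get_params().keys()
grid_keys = ['dt__max_depth', 'dt__random_state',
             'SVM__C', 'SVM__kernel', 'knn__n_neighbors']
print(all(key in valid_keys for key in grid_keys))  # True

# The same key format works with set_params
stack_demo.set_params(dt__max_depth=4)
print(stack_demo.get_params()['dt__max_depth'])  # 4
```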

$ \ $

$(9)$ We use ***GridSearchCV*** to search over specified parameter values of the model.


In [26]:
# create a GridSearchCV object with the clf model above and the parameter grid defined previously
search = GridSearchCV(estimator = clf, param_grid = param_grid, scoring = 'accuracy')

# fit the grid search on the training data
search.fit(X_train, y_train)
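
To get a sense of the cost of this search, the following sketch counts the parameter combinations in the grid with `ParameterGrid`. The dictionary is copied under the illustrative name `param_grid_demo`, and the fold count of 5 is `GridSearchCV`'s default when `cv` is not specified.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the grid used above
param_grid_demo = {'dt__max_depth': list(range(10)),
                   'dt__random_state': [0],
                   'SVM__C': [0.01, 0.1, 1],
                   'SVM__kernel': ['linear', 'poly', 'rbf'],
                   'knn__n_neighbors': [1, 4, 8, 9]}

# Number of distinct combinations: 10 * 1 * 3 * 3 * 4
n_candidates = len(ParameterGrid(param_grid_demo))
n_folds = 5  # default number of cross-validation folds in GridSearchCV

print(n_candidates)            # 360
print(n_candidates * n_folds)  # 1800
```

Each of these fits trains a full stacking classifier, which helps explain why the search can take a while even on a small dataset.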

$ \ $

$(10)$  We compute the accuracy on the training and test data.

In [27]:
def get_accuracy(X_train, X_test, y_train, y_test, model):

  # compute the accuracy of the model on the test data
  accuracy_test = metrics.accuracy_score(y_test, model.predict(X_test))

  # compute the accuracy of the model on the training data
  accuracy_train = metrics.accuracy_score(y_train, model.predict(X_train))

  # return a dictionary with the accuracy of the model on the training data and the test data
  return  {"test Accuracy": accuracy_test, "train Accuracy": accuracy_train}

In [28]:
get_accuracy(X_train, X_test, y_train, y_test, search)

{'test Accuracy': 0.95, 'train Accuracy': 1.0}
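
A test accuracy of 0.95 on 40 test samples corresponds to 38 correct predictions. Accuracy alone does not show which classes are confused, so a confusion matrix can complement it. The sketch below uses made-up toy labels (not the notebook's predictions) to show how the two relate.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels, invented for illustration only
y_true_demo = ['drugA', 'drugB', 'drugX', 'drugX',
               'drugY', 'drugY', 'drugY', 'drugC']
y_pred_demo = ['drugA', 'drugB', 'drugX', 'drugY',
               'drugY', 'drugY', 'drugY', 'drugC']
labels_demo = ['drugA', 'drugB', 'drugC', 'drugX', 'drugY']

# Rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=labels_demo)
acc_demo = accuracy_score(y_true_demo, y_pred_demo)

print(cm_demo)
print(acc_demo)  # 0.875

# Accuracy equals the diagonal (correct predictions) over the total
print(cm_demo.trace() / cm_demo.sum())  # 0.875
```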

$ \ $

$(11)$ We can find the best parameter values.

In [29]:
search.best_params_

{'SVM__C': 0.01,
 'SVM__kernel': 'linear',
 'dt__max_depth': 4,
 'dt__random_state': 0,
 'knn__n_neighbors': 1}

$ \ $

$(12)$ We fit each base model on its own, using the best parameter values found above, to compare it with the stacked model.

In [39]:
svm_model = SVC(kernel = 'linear', C = 0.01)
svm_model.fit(X_train, y_train)
get_accuracy(X_train, X_test, y_train, y_test, svm_model)

{'test Accuracy': 0.45, 'train Accuracy': 0.65}

In [40]:
dt_model = DecisionTreeClassifier(random_state = 0, max_depth = 4)
dt_model.fit(X_train, y_train)
get_accuracy(X_train, X_test, y_train, y_test, dt_model)

{'test Accuracy': 0.95, 'train Accuracy': 1.0}

In [41]:
knn_model = KNeighborsClassifier(n_neighbors = 1)
knn_model.fit(X_train, y_train)
get_accuracy(X_train, X_test, y_train, y_test, knn_model)

{'test Accuracy': 0.875, 'train Accuracy': 1.0}

$ \ $

----

## ***SUMMARY***

$ \ $

We have taken data collected about a set of patients, all of whom suffered from the same illness, and we have found the following:

$ \ $

* The dataset features are 'Age', 'Sex', 'BP', 'Cholesterol', 'Na_to_K' and the dataset target is "Drug".

$ \ $

* We split the data as:

$$\begin{cases}
\text{Test data} \longrightarrow 20\%,\\
\text{Train data} \longrightarrow 80\%.
\end{cases}$$

$ \ $

* We take as base estimators, with LogisticRegression as the final estimator:

$$\begin{cases}
\text{SVM} \longrightarrow \text{SVC (support vector machine)},\\
\text{knn} \longrightarrow \text{KNeighborsClassifier},\\
\text{dt} \longrightarrow \text{DecisionTreeClassifier}
\end{cases}$$

$ \ $

* To tune the hyperparameters of the base models, each key in the hyperparameter dictionary is the name of a base model, followed by a double underscore and the name of the parameter we would like to vary:

$$\text{param_grid} = \begin{cases}
\text{dt__max_depth: } [0,1,2,3,4,5,6,7,8,9], \\ \\
\text{dt__random_state: } [0], \\ \\
\text{SVM__C: } [0.01, 0.1, 1], \\ \\
\text{SVM__kernel: } [\text{'linear'}, \text{'poly'}, \text{'rbf'}],\\ \\
\text{knn__n_neighbors: } [1, 4, 8, 9]            
\end{cases}$$


and we have shown that the best parameter values are 

$$\begin{cases}
\text{dt__max_depth: } 4, \\ \\
\text{dt__random_state: } 0, \\ \\
\text{SVM__C: } 0.01, \\ \\
\text{SVM__kernel: } \text{'linear'},\\ \\
\text{knn__n_neighbors: } 1            
\end{cases}$$

$ \ $

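* Lastly, we choose the DecisionTreeClassifier model: with a test accuracy of 0.95 it outperforms the SVM (0.45) and KNN (0.875) base models, and it matches the stacked classifier (0.95) while being simpler.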