# Machine Learning Lab 4 - Illustrate Decision Tree Classifier on Drug Dataset
<hr/>

Submitted by <br>
Name: **Drashty Ranpara** <br>
Register Number: **21122018** <br>
Class: **2MSCDS** <br> 
<hr/>

# **The End-to-End Pipeline of an ML Project**
<br>

---
<br>

## **This notebook covers**
* Getting familiar with the end-to-end pipeline for conducting an ML project

* Preparing data for ML models (data collection and preprocessing)

* Generating and selecting features to enhance the performance of the ML algorithm

* Building up linear regression and decision tree models

* Fine-tuning an ML model with grid search
<br>
---
<br>

## **1.1 An overview of the end-to-end pipeline**
1. **Problem Framing and Data Collection -** Frame the problem as an ML problem and collect the data you need
2. **Data Preprocessing and Feature Engineering -** Process the data into a format suitable as input to ML algorithms. Select or generate features related to the target output to improve the performance of the algorithms.
3. **ML Algorithm Selection -** Try various algorithms suitable for the problem statement and choose the best one.
4. **Model Training and Evaluation -** Apply the selected algorithm to train an ML model on your training data, and evaluate its performance on a validation set.
5. **Hyperparameter Tuning -** Attempt to achieve better performance by iteratively tuning the model's hyperparameters.
6. **Service Deployment and Monitoring -** Deploy the final ML solution and monitor its performance in order to update and improve the pipeline continuously.

![picture](https://drek4537l1klr.cloudfront.net/song/Figures/02-01.png)
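
Step 5 of the list above can be illustrated with a minimal sketch. It assumes scikit-learn's ```GridSearchCV``` and a synthetic dataset from ```make_classification``` rather than the drug data, and the parameter grid is an arbitrary choice for illustration, not a setting used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the drug dataset)
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

# Hold out a test set that the search never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Hypothetical grid: the values are chosen only for illustration
param_grid = {"criterion": ["gini", "entropy"], "max_depth": [2, 4, 6, None]}

# Every combination in the grid is scored with 5-fold cross-validation
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", round(search.best_score_, 3))
print("held-out accuracy:", round(search.score(X_test, y_test), 3))
```

The held-out score is reported separately because the cross-validated score of the winning combination was itself used to pick that combination.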

### **Problem Statement :** 
Let's start working on a real problem to get you familiar with each component in the pipeline. The problem we explore here is how to use scikit-learn, pickle, Flask, Microsoft Azure and ipywidgets to fully deploy a Python machine learning algorithm into a live, production environment.

### **Dataset :** 
I selected a dataset from Kaggle (https://www.kaggle.com/prathamtripathi/drug-classification).

<br>

The Python code to develop a predictive machine learning algorithm to classify drug prescriptions given a range of patient criteria is as follows -

# **Step 1: Develop a Machine Learning Algorithm**
---


In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold, cross_val_score

df_drug = pd.read_csv("/content/sample_data/drug200.csv")

label_encoder = LabelEncoder()

categorical_features = [feature for feature in df_drug.columns if df_drug[feature].dtypes == 'O']
for feature in categorical_features:
    df_drug[feature]=label_encoder.fit_transform(df_drug[feature])
    
X = df_drug.drop("Drug", axis=1)
y = df_drug["Drug"]

model = DecisionTreeClassifier(criterion="entropy")
model.fit(X, y)

kfold = KFold(random_state=42, shuffle=True)
cv_results = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
print(cv_results.mean(), cv_results.std())

0.99 0.012247448713915901


At this point we have a machine learning model trained to predict drug prescriptions, and k-fold cross-validation (i.e. splitting the data into folds and scoring the model on each held-out fold in turn) has been used to evaluate the **model accuracy at 99%.**
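
What folding the data means can be shown with a small sketch. It assumes ten dummy samples instead of the real rows, and it sets ```n_splits=5``` explicitly, which is also the default that ```KFold``` uses when no value is passed.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy sample indices stand in for the rows of the dataset (assumption)
samples = np.arange(10)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

held_out = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(samples), start=1):
    # Each fold trains on 8 samples and evaluates on the remaining 2
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
    held_out.extend(test_idx.tolist())

# Every sample is used for evaluation exactly once across the folds
print(sorted(held_out) == list(range(10)))
```

Assuming the file holds 200 rows, as its name suggests, each of the five folds is scored on 40 rows, so a single misclassification lowers that fold's accuracy by 2.5 percentage points.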

### **Saving the Trained Model**
Once you’re confident enough to take your trained and tested model into a production-ready environment, the first step is to save it to a file (here ```model.pkl```) using a library like pickle.

In [2]:
import pickle

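# Write mode ('wb') overwrites any earlier file, so model.pkl holds exactly one model.
# Append mode ('ab') would leave older models in front, and pickle.load reads the first one.
with open('model.pkl', 'wb') as pickle_file:
    pickle.dump(model, pickle_file)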

Now whenever we want to use the trained model, we simply need to reload its state from the model.pkl file rather than re-executing the training step.
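
The save-and-reload round trip can be checked with a short sketch. It assumes a synthetic dataset and a temporary directory instead of the notebook's own data and working directory.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded drug features (assumption)
X, y = make_classification(
    n_samples=100, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
model = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Save the fitted model's state to disk
    with open(path, "wb") as pickle_file:
        pickle.dump(model, pickle_file)

    # Load it back without re-running the training step
    with open(path, "rb") as pickle_file:
        restored = pickle.load(pickle_file)

# The restored model reproduces the original predictions
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Pickle files should only be loaded from a trusted source, because unpickling can execute arbitrary code.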

# **Step 2: Make an Individual Prediction from the Trained Model**

A couple of assumptions -

1. Consumers of the machine learning algorithm have a requirement to make predictions for individual patients rather than for a batch of patients.
2. Those consumers wish to communicate with the algorithm using text-like values for the parameters (for example blood pressure = ```"NORMAL"``` or ```"HIGH"```) rather than their label encoded equivalents like ```0``` and ```1```.

In [3]:
df_drug = pd.read_csv("/content/sample_data/drug200.csv")

label_encoder = LabelEncoder()

categorical_features = [feature for feature in df_drug.columns if df_drug[feature].dtypes == 'O']
for feature in categorical_features:
    print(feature, list(df_drug[feature].unique()), list(label_encoder.fit_transform(df_drug[feature].unique())), "\n")


Sex ['F', 'M'] [0, 1] 

BP ['HIGH', 'LOW', 'NORMAL'] [0, 1, 2] 

Cholesterol ['HIGH', 'NORMAL'] [0, 1] 

Drug ['DrugY', 'drugC', 'drugX', 'drugA', 'drugB'] [0, 3, 4, 1, 2] 



And there we have it, a list of each categorical feature with the unique values that appear in the data and the corresponding numerical values as transformed by the ```LabelEncoder()```.
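
The same mapping can also be read off a fitted encoder instead of being typed by hand. The sketch below assumes a hypothetical helper named ```label_map``` and uses the category values printed above. ```LabelEncoder``` stores the sorted categories in ```classes_```, and each category's position in that array is its encoded value.

```python
from sklearn.preprocessing import LabelEncoder


def label_map(values):
    """Return {text value: encoded integer} for a column of categories.

    Hypothetical helper for illustration, not part of scikit-learn.
    """
    encoder = LabelEncoder().fit(values)
    return {str(label): code for code, label in enumerate(encoder.classes_)}


bp_map = label_map(["HIGH", "LOW", "NORMAL", "LOW", "HIGH"])
print(bp_map)  # {'HIGH': 0, 'LOW': 1, 'NORMAL': 2}

# Uppercase letters sort before lowercase ones, which is why "DrugY" gets 0
drug_codes = label_map(["DrugY", "drugC", "drugX", "drugA", "drugB"])
print(drug_codes)  # {'DrugY': 0, 'drugA': 1, 'drugB': 2, 'drugC': 3, 'drugX': 4}

# Reverse mapping, for turning an encoded prediction back into a drug name
drug_names = {code: label for label, code in drug_codes.items()}
```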

Armed with this knowledge we can provide a set of dictionaries that map the text-like values (e.g. “HIGH”, “LOW” etc.) into their encoded equivalents and then develop a simple function to make individual predictions as follows …

In [4]:
gender_map = {"F": 0, "M": 1}
bp_map = {"HIGH": 0, "LOW": 1, "NORMAL": 2}
cholestol_map = {"HIGH": 0, "NORMAL": 1}
drug_map = {0: "DrugY", 3: "drugC", 4: "drugX", 1: "drugA", 2: "drugB"}

def predict_drug(Age, 
                 Sex, 
                 BP, 
                 Cholesterol, 
                 Na_to_K):

    # 1. Read the machine learning model from its saved state ...
    with open('model.pkl', 'rb') as pickle_file:  # the file is closed automatically
        model = pickle.load(pickle_file)
    
    # 2. Transform the "raw data" passed into the function to the encoded / numerical values using the maps / dictionaries
    Sex = gender_map[Sex]
    BP = bp_map[BP]
    Cholesterol = cholestol_map[Cholesterol]

    # 3. Make an individual prediction for this set of data
    y_predict = model.predict([[Age, Sex, BP, Cholesterol, Na_to_K]])[0]

    # 4. Return the "raw" version of the prediction i.e. the actual name of the drug rather than the numerical encoded version
    return drug_map[y_predict] 

This implementation can then be verified by invoking the function to make some predictions based on values from the original data so that we know what the outputs should be …

In [5]:
predict_drug(47, "F", "LOW",  "HIGH", 14)

  "X does not have valid feature names, but"


'drugC'

In [6]:
predict_drug(60, "F", "LOW",  "HIGH", 20)

  "X does not have valid feature names, but"


'DrugY'

Note that our ```predict_drug``` function does not need to train the model. Instead it "rehydrates" the model whose state ```pickle``` previously saved into the ```model.pkl``` file, and the output shows that the predicted drug recommendations are correct.
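
The feature names warning in the output appears because the model was fitted on a DataFrame with column names but is asked to predict from a plain list. One way to avoid it is sketched below. It assumes the training columns are named like the function's parameters, and it uses a tiny made-up training frame in place of the real data.

```python
import warnings

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Assumption: column names match the parameters of predict_drug
feature_names = ["Age", "Sex", "BP", "Cholesterol", "Na_to_K"]

# Tiny made-up training frame (values are illustrative only)
X_train = pd.DataFrame(
    [[23, 0, 0, 0, 25.4], [47, 1, 1, 0, 13.1], [28, 0, 2, 0, 7.8], [61, 0, 1, 0, 18.0]],
    columns=feature_names,
)
y_train = [0, 3, 4, 0]

model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(X_train, y_train)

# One-row frame carrying the same column names the model was fitted on
row = pd.DataFrame([[47, 0, 1, 0, 14.0]], columns=feature_names)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record every warning raised in this block
    prediction = model.predict(row)[0]

feature_name_warnings = [w for w in caught if "feature names" in str(w.message)]
print(prediction, len(feature_name_warnings))
```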

# **Step 3: Develop a Web Service Wrapper**
* This is where web services come in. A web service is a ```"wrapper"``` that receives HTTP GET and HTTP PUT requests from clients and consumers, invokes the Python code and returns the result as an HTTP response.

* This means that the clients and callers only need to be able to formulate HTTP requests and nearly all programming languages and environments will have a way of doing this.

* In the Python world there are several different approaches available but the one I have selected is to use ```flask``` to construct our web service wrapper.
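
A minimal sketch of such a wrapper is given below. The route name ```/predict```, the query parameter names, the JSON response format and the ```predict_drug_stub``` placeholder are all assumptions made for illustration. A real service would call ```predict_drug``` in place of the stub.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Allowed text values, taken from the encoder output shown earlier
ALLOWED = {
    "Sex": {"F", "M"},
    "BP": {"HIGH", "LOW", "NORMAL"},
    "Cholesterol": {"HIGH", "NORMAL"},
}
PARAMETERS = ("Age", "Sex", "BP", "Cholesterol", "Na_to_K")


def predict_drug_stub(Age, Sex, BP, Cholesterol, Na_to_K):
    # Placeholder (assumption): the real service would call predict_drug here
    return "DrugY"


@app.route("/predict", methods=["GET"])  # hypothetical route name
def predict():
    params = {name: request.args.get(name) for name in PARAMETERS}

    # Reject requests with missing or unknown values before touching the model
    missing = [name for name, value in params.items() if value is None]
    if missing:
        return jsonify(error=f"missing parameters: {missing}"), 400
    for name, allowed in ALLOWED.items():
        if params[name] not in allowed:
            return jsonify(error=f"{name} must be one of {sorted(allowed)}"), 400
    try:
        age, na_to_k = int(params["Age"]), float(params["Na_to_K"])
    except ValueError:
        return jsonify(error="Age and Na_to_K must be numeric"), 400

    drug = predict_drug_stub(
        age, params["Sex"], params["BP"], params["Cholesterol"], na_to_k
    )
    return jsonify(drug=drug)


# Exercise the route in-process with Flask's test client, no server needed
client = app.test_client()
ok = client.get("/predict?Age=60&Sex=F&BP=LOW&Cholesterol=HIGH&Na_to_K=20")
bad = client.get("/predict?Age=60&Sex=F&BP=UNKNOWN&Cholesterol=HIGH&Na_to_K=20")
print(ok.status_code, ok.get_json())
print(bad.status_code, bad.get_json())
```

Validating the text values in the wrapper means a caller who sends an unknown category gets a readable 400 response instead of a ```KeyError``` from the mapping dictionaries.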

### **References:**
1. https://medium.com/@nikovrdoljak/deploy-your-flask-app-on-azure-in-3-easy-steps-b2fe388a589e
2. Machine Learning in Action Book