# End-to-End Machine Learning with Deployment

**Problem statement** Create a medical diagnostic app for predicting diabetes in women

**Dataset** The Pima Indians Diabetes Database from Kaggle

**Steps to follow**
1. Data exploration
2. Data cleaning
3. Exploratory data analysis (EDA)
4. Data preprocessing
5. Model fitting and evaluation
6. Model optimization
7. Model interpretation
8. Model deployment

In [1]:
!pip install imbalanced_learn



In [2]:
!pip install shap




In [3]:
!pip install streamlit



In [4]:
!pip install xgboost



In [5]:
!pip install -U scikit-learn==1.2.2



In [6]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from imblearn.over_sampling import SMOTE

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import roc_auc_score, f1_score, confusion_matrix, classification_report


import shap
import pickle
import streamlit as st
print('All libraries are imported')

All libraries are imported


In [7]:
data = pd.read_csv('diabetes.csv')
data.head()

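| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | Yes |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | No |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | Yes |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | No |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | Tested_Positive |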


In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    object 
dtypes: float64(2), int64(6), object(1)
memory usage: 54.1+ KB


**Columns of the dataset are**
1.   Pregnancies : number of pregnancies
2.   Glucose : plasma glucose concentration at 2 hours in an oral glucose tolerance test (mg/dL)
3.   BloodPressure : diastolic blood pressure (mm Hg)
4.   SkinThickness : triceps skin fold thickness (mm)
5.   Insulin : 2-hour serum insulin (µU/mL)
6.   BMI : body mass index (weight in kg/(height in m)^2)
7.   DiabetesPedigreeFunction : diabetes pedigree function
8.   Age : age (years)
9.   **Outcome** : whether the patient has diabetes or not. Class variable, stored as string labels in this file and mapped to 0/1 during cleaning. 268 of 768 rows are positive, the others are negative.


In [9]:
data.shape

(768, 9)

### Data Cleaning

- Check for Null values
- Check for duplicate rows
- Check for corrupt characters
- Check for nonsensical numbers
- Check for inconsistent labels
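
The five checks above can be gathered into one helper. This is a minimal sketch rather than the notebook's own method: the function name `cleaning_report`, its parameters, and the toy `sample` frame are all hypothetical, and "corrupt" is taken here to mean a feature value that cannot be parsed as a number.

```python
import pandas as pd


def cleaning_report(frame, target, zero_invalid):
    """Summarise the five cleaning checks for one DataFrame (illustrative helper)."""
    features = frame.drop(columns=target)
    # Values that cannot be parsed as numbers become NaN under coercion.
    numeric = features.apply(pd.to_numeric, errors="coerce")
    return {
        "null_cells": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        # Corrupt = present in the raw data but lost when coerced to a number.
        "corrupt_cells": int((numeric.isnull() & features.notnull()).sum().sum()),
        # Zeros in columns where a value of 0 makes no physical sense.
        "zero_counts": {c: int((numeric[c] == 0).sum()) for c in zero_invalid},
        # Distinct target labels, to spot inconsistent spellings.
        "labels": sorted(frame[target].dropna().unique().tolist()),
    }


# Made-up rows for illustration only (not taken from diabetes.csv).
sample = pd.DataFrame({
    "Glucose": [148, 0, "12?", 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Outcome": ["Yes", "No", "Tested_Positive", "No"],
})
report = cleaning_report(sample, target="Outcome", zero_invalid=["Glucose", "BMI"])
print(report)
```

On the real data the same call would be `cleaning_report(data, "Outcome", [...])` with whichever columns should never be zero.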

In [10]:
data.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [11]:
data.duplicated().sum()

0

In [12]:
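# Rows where any feature value is not a real number (Outcome is excluded because it holds string labels)
data[~data.drop(columns='Outcome').applymap(np.isreal).all(axis=1)]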

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|


**The dataset has 768 rows and 9 columns, with no nulls, no duplicates and no corrupt characters**

In [13]:
data.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 |


In [14]:
data['Outcome'].value_counts()

Outcome
No                 470
Yes                248
Tested_Negative     30
Tested_Positive     20
Name: count, dtype: int64

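**Glucose, BloodPressure, SkinThickness, Insulin and BMI have a minimum of 0, which is not a plausible measurement, so we impute the 0 values in columns 1 to 5 with the column median. The four Outcome labels also need to be mapped to 0 and 1.**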

In [15]:
df = data.copy()

In [16]:
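# Replace 0 with the column median (the median is computed with the zeros still included)
zerofill = lambda x : x.replace(0, x.median())
# Columns 1 to 5: Glucose, BloodPressure, SkinThickness, Insulin, BMI
cols = df.columns[1:6]
df[cols] = df[cols].apply(zerofill, axis=0)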
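
The cell above takes each column's median with the zeros still in it, which is why the Insulin median is 30.5. One possible variant, shown here only as a sketch and not as the notebook's method, treats zeros as missing first so the median comes from the non-zero values alone. The helper name `impute_zeros_with_nonzero_median` and the `toy` frame are hypothetical.

```python
import numpy as np
import pandas as pd


def impute_zeros_with_nonzero_median(frame, columns):
    """Return a copy where zeros in `columns` are replaced by the median of the non-zero values."""
    out = frame.copy()
    for col in columns:
        as_missing = out[col].replace(0, np.nan)            # treat 0 as a missing measurement
        out[col] = as_missing.fillna(as_missing.median())   # Series.median skips NaN by default
    return out


# Made-up values for illustration: more than half of the Insulin entries are 0,
# so the plain median would itself be 0 and replace(0, median) would change nothing.
toy = pd.DataFrame({"Insulin": [0, 0, 0, 94, 168], "Age": [50, 31, 32, 21, 33]})
fixed = impute_zeros_with_nonzero_median(toy, ["Insulin"])
print(fixed["Insulin"].tolist())
```

Switching to this variant would change the summary statistics that follow, so it is a design choice to weigh rather than a drop-in replacement.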

In [17]:
d = {'Yes' : 1 , "Tested_Positive": 1, "No" : 0, 'Tested_Negative': 0}
df['Outcome'] = df['Outcome'].map(d)
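
`Series.map` with a dictionary silently turns any label missing from the dictionary into `NaN`. A small guard can make that failure visible. The sketch below is illustrative: `encode_outcome` and `LABEL_MAP` are hypothetical names, and the mapping simply repeats the four labels seen in the value counts.

```python
import pandas as pd

# Same four labels as in the notebook's mapping dictionary.
LABEL_MAP = {"Yes": 1, "Tested_Positive": 1, "No": 0, "Tested_Negative": 0}


def encode_outcome(labels, mapping):
    """Map string labels to 0/1 and raise if any label is missing from `mapping`."""
    encoded = labels.map(mapping)
    # Labels that mapped to NaN were not covered by the dictionary.
    unknown = labels[encoded.isnull()].unique().tolist()
    if unknown:
        raise ValueError(f"Unmapped Outcome labels: {unknown}")
    return encoded.astype(int)


labels = pd.Series(["Yes", "No", "Tested_Positive", "Tested_Negative"])
encoded = encode_outcome(labels, LABEL_MAP)
print(encoded.tolist())
```

With the notebook's data this would be called as `encode_outcome(df['Outcome'], LABEL_MAP)` before the labels are overwritten.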

In [18]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 768.0 | 121.65625 | 30.438286 | 44.0 | 99.75 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 768.0 | 72.386719 | 12.096642 | 24.0 | 64.0 | 72.0 | 80.0 | 122.0 |
| SkinThickness | 768.0 | 27.334635 | 9.229014 | 7.0 | 23.0 | 23.0 | 32.0 | 99.0 |
| Insulin | 768.0 | 94.652344 | 105.547598 | 14.0 | 30.5 | 31.25 | 127.25 | 846.0 |
| BMI | 768.0 | 32.450911 | 6.875366 | 18.2 | 27.5 | 32.0 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


In [19]:
!jupyter --version

Selected Jupyter core packages...
IPython          : 8.15.0
ipykernel        : 6.25.0
ipywidgets       : 8.0.4
jupyter_client   : 7.4.9
jupyter_core     : 5.3.0
jupyter_server   : 1.23.4
jupyterlab       : 3.6.3
nbclient         : 0.5.13
nbconvert        : 6.5.4
nbformat         : 5.9.2
notebook         : 6.5.4
qtconsole        : 5.4.2
traitlets        : 5.7.1
