# Business Problem


An AI-based healthcare diagnostics firm needs to diagnose diabetes in its patients. Using the dataset available on Kaggle, we need to build a machine learning model that predicts whether a patient is diabetic. The dataset has the following features:

- <b>Pregnancies</b> describes the number of times the person has been pregnant.
- <b>Glucose</b> describes the blood glucose level on testing.
- <b>Blood pressure</b> describes the diastolic blood pressure.
- <b>Skin Thickness</b> describes the skin fold thickness of the triceps.
- <b>Insulin</b> describes the amount of insulin in a 2-hour serum test.
- <b>BMI</b> describes the body mass index.
- <b>DiabetesPedigreeFunction</b> describes the person's family history of diabetes as a single score.
- <b>Age</b> describes the age of the person.
- <b>Outcome</b> indicates whether the person has diabetes (1) or not (0). This is the target the model predicts.

# Present Scenario & Requirements


We need to develop a machine learning web app, deployed on a cloud platform, with a clear user interface. Once the user fills in the required inputs in the app, it should return the prediction together with a detailed analysis.

# Solution

In [1]:
# Importing essential libraries
import numpy as np
import pandas as pd
import pickle

In [2]:
# Loading the dataset
df = pd.read_csv('kaggle_diabetes.csv')

In [3]:
df.head()

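|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 138 | 62 | 35 | 0 | 33.6 | 0.127 | 47 | 1 |
| 1 | 0 | 84 | 82 | 31 | 125 | 38.2 | 0.233 | 23 | 0 |
| 2 | 0 | 145 | 0 | 0 | 0 | 44.2 | 0.630 | 31 | 1 |
| 3 | 0 | 135 | 68 | 42 | 250 | 42.3 | 0.365 | 24 | 1 |
| 4 | 1 | 139 | 62 | 41 | 480 | 40.7 | 0.536 | 21 | 0 |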


In [4]:
# Renaming DiabetesPedigreeFunction as DPF
df = df.rename(columns={'DiabetesPedigreeFunction':'DPF'})

In [5]:
# Replacing the 0 values in ['Glucose','BloodPressure','SkinThickness','Insulin','BMI'] with NaN,
# because a reading of 0 in these columns marks a missing measurement
df_copy = df.copy(deep=True)
zero_as_missing = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
df_copy[zero_as_missing] = df_copy[zero_as_missing].replace(0, np.nan)  # np.NaN was removed in NumPy 2.0

In [6]:
df_copy.head()

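|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DPF | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 138.0 | 62.0 | 35.0 | NaN | 33.6 | 0.127 | 47 | 1 |
| 1 | 0 | 84.0 | 82.0 | 31.0 | 125.0 | 38.2 | 0.233 | 23 | 0 |
| 2 | 0 | 145.0 | NaN | NaN | NaN | 44.2 | 0.630 | 31 | 1 |
| 3 | 0 | 135.0 | 68.0 | 42.0 | 250.0 | 42.3 | 0.365 | 24 | 1 |
| 4 | 1 | 139.0 | 62.0 | 41.0 | 480.0 | 40.7 | 0.536 | 21 | 0 |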
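
Before imputing, it helps to know how many readings were actually turned into NaN. The sketch below rebuilds only the five rows shown by `df.head()` by hand (the `sample` frame is an illustration, not the full Kaggle file) and counts the missing values per column.

```python
import numpy as np
import pandas as pd

# The five rows displayed by df.head(), restricted to the columns where 0 means "missing".
sample = pd.DataFrame({
    'Glucose': [138, 84, 145, 135, 139],
    'BloodPressure': [62, 82, 0, 68, 62],
    'SkinThickness': [35, 31, 0, 42, 41],
    'Insulin': [0, 125, 0, 250, 480],
    'BMI': [33.6, 38.2, 44.2, 42.3, 40.7],
})

# Same replacement as in the notebook cell: every 0 becomes NaN.
sample_clean = sample.replace(0, np.nan)

# Number of missing readings per column.
missing_counts = sample_clean.isna().sum()
print(missing_counts)
```

On the real data the same check is `df_copy.isna().sum()`; columns with many missing values deserve a closer look before they are filled in.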


In [9]:
# Replacing NaN values with the mean or the median, depending on each column's distribution.
# Assigning the result back avoids chained inplace fillna, which newer pandas versions warn about.
df_copy['Glucose'] = df_copy['Glucose'].fillna(df_copy['Glucose'].mean())
df_copy['BloodPressure'] = df_copy['BloodPressure'].fillna(df_copy['BloodPressure'].mean())
df_copy['SkinThickness'] = df_copy['SkinThickness'].fillna(df_copy['SkinThickness'].median())
df_copy['Insulin'] = df_copy['Insulin'].fillna(df_copy['Insulin'].median())
df_copy['BMI'] = df_copy['BMI'].fillna(df_copy['BMI'].median())
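
The cell above picks the mean for some columns and the median for others "depending upon distribution", but the notebook does not show how that choice was made. One possible rule, sketched here, is to use the median when a column is strongly skewed and the mean otherwise. The helper `choose_fill_value`, its threshold of 1.0, and the two toy series are assumptions made for illustration, not the notebook's actual procedure.

```python
import pandas as pd

def choose_fill_value(series, skew_threshold=1.0):
    """Return (strategy, value): median for strongly skewed data, mean otherwise.

    skew_threshold is an illustrative cut-off, not a value taken from the notebook.
    """
    skewness = series.skew()  # NaN entries are skipped by default
    if abs(skewness) > skew_threshold:
        return 'median', series.median()
    return 'mean', series.mean()

# Toy columns (made up): one symmetric, one with a large outlier.
symmetric = pd.Series([60.0, 62.0, 64.0, 66.0, 68.0, None])
skewed = pd.Series([10.0, 12.0, 11.0, 13.0, 500.0, None])

symmetric_choice = choose_fill_value(symmetric)
skewed_choice = choose_fill_value(skewed)
print(symmetric_choice, skewed_choice)
```

The median is less sensitive to extreme readings, which is why it is the usual pick for long-tailed columns.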

In [12]:
# Model Building
from sklearn.model_selection import train_test_split
# Use df_copy (the imputed data), not df, so the cleaning steps above take effect
X = df_copy.drop(columns='Outcome')
y = df_copy['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

In [13]:
# Creating Random Forest Model
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=20)
classifier.fit(X_train, y_train)

RandomForestClassifier(n_estimators=20)
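
The notebook fits the classifier but never scores it on the held-out 20%. A minimal evaluation sketch follows. Synthetic data from `make_classification` stands in for the Kaggle file so the block runs on its own, which means the printed numbers say nothing about the real model; all `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the diabetes data: 500 rows, 8 features, binary target.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Same split settings as the notebook: 20% held out, fixed seed.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# random_state is added here only to make the demo reproducible.
clf_demo = RandomForestClassifier(n_estimators=20, random_state=0)
clf_demo.fit(X_train_demo, y_train_demo)

# Score on data the model has not seen during training.
y_pred_demo = clf_demo.predict(X_test_demo)
test_accuracy = accuracy_score(y_test_demo, y_pred_demo)
cm = confusion_matrix(y_test_demo, y_pred_demo)

print(f"Test accuracy: {test_accuracy:.3f}")
print(cm)  # rows: true class, columns: predicted class
```

With the notebook's own variables, the equivalent calls are `classifier.predict(X_test)` followed by `accuracy_score(y_test, ...)`. For a diagnostic use case the confusion matrix matters as much as accuracy, because a missed diabetic patient and a false alarm carry different costs.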

In [14]:
# Creating a pickle file for the classifier
filename = 'diabetespredictor_randomforest_model.pkl'
# The with-block closes the file even if pickling fails
with open(filename, 'wb') as model_file:
    pickle.dump(classifier, model_file)
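
The web app described in the requirements would load this pickle and score one patient at a time. The sketch below shows that round trip under stated assumptions: a throwaway model is trained on the five rows shown by `df.head()` (far too few for a real model), saved to a temporary file named `demo_model.pkl`, reloaded, and asked for a prediction on a made-up patient. None of these names or values come from the deployed app.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                 'Insulin', 'BMI', 'DPF', 'Age']

# The five rows displayed by df.head(), used only so the demo has something to fit.
X_demo = pd.DataFrame([
    [2, 138, 62, 35, 0, 33.6, 0.127, 47],
    [0, 84, 82, 31, 125, 38.2, 0.233, 23],
    [0, 145, 0, 0, 0, 44.2, 0.630, 31],
    [0, 135, 68, 42, 250, 42.3, 0.365, 24],
    [1, 139, 62, 41, 480, 40.7, 0.536, 21],
], columns=feature_names)
y_demo = [1, 0, 1, 1, 0]

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Save and reload the model, as the app would do at start-up.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pkl')
    with open(path, 'wb') as f:
        pickle.dump(model, f)
    with open(path, 'rb') as f:
        loaded_model = pickle.load(f)

# A hypothetical patient, with columns in the same order used for training.
patient = pd.DataFrame([[1, 120, 70, 30, 100, 28.5, 0.4, 35]], columns=feature_names)
prediction = int(loaded_model.predict(patient)[0])
print("Predicted Outcome:", prediction)  # 1 = diabetic, 0 = not diabetic
```

The input columns must match the training columns in name and order. Pickle files can execute arbitrary code when loaded, so the app should only load model files it produced itself.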