# **Problem Statement**
As a data professional working for a pharmaceutical company, you need to develop a model that predicts whether a patient will be diagnosed with diabetes. The model must achieve an accuracy score greater than 0.85.

You will be required to document the following steps:

● Data Importation

● Data Exploration

● Data Cleaning

● Data Preparation

● Data Modeling (Using Decision Trees, Random Forest and Logistic Regression)

● Model Evaluation

● Hyperparameter Tuning

● Findings and Recommendations

# **Data Importation**

In [1]:
#load dataset

import pandas as pd

diabetes = pd.read_csv('https://bit.ly/DiabetesDS')

# **Data Exploration**

In [2]:
#first 5 rows
diabetes.head()

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1


In [3]:
#no. of rows and columns
diabetes.shape

(768, 9)

# **Data Cleaning**

In [4]:
#check for missing data
diabetes.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [8]:
diabetes.duplicated().sum()

0
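
`isnull()` only detects NaN values. In the `head()` output above, `Insulin` and `SkinThickness` contain zeros, which are unlikely to be real measurements and may be placeholders for missing data. The following is a minimal sketch of how that could be checked. It assumes that a zero in the five listed columns means "not recorded", which the dataset itself does not confirm here. The names `sample` and `zero_as_missing` are illustrative, and `sample` holds only the five rows printed earlier so the snippet runs without downloading anything.

```python
import pandas as pd

# The five rows printed by diabetes.head(), re-typed so the sketch runs offline.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Assumption: a zero in these columns is a placeholder, not a real measurement.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column; isnull() would report none of these.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)
```

In the notebook itself, the same check would be run on `diabetes` instead of `sample`. Whether to impute or drop such rows is a separate decision that depends on how many there are.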

# **Data Preparation**

In [10]:
from sklearn.model_selection import train_test_split

#divide dataset into features and target
features = diabetes.drop(['Outcome'], axis =1)
target = diabetes['Outcome']

features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.25, random_state=12345)
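
The split above does not stratify on the target, so the share of positive cases can differ between the training and validation parts. The sketch below shows what the `stratify` argument of `train_test_split` does. It uses a hypothetical synthetic target (`X_demo`, `y_demo`, 35% positives), not the real class balance of the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 800 rows, 35% positives (illustrative, not the real class balance).
X_demo = np.arange(800).reshape(-1, 1)
y_demo = np.array([0] * 520 + [1] * 280)

# stratify=y_demo keeps the positive rate (nearly) identical in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=12345, stratify=y_demo
)

print("Positive rate, train:", y_tr.mean())
print("Positive rate, valid:", y_va.mean())
```

Applied to the notebook, this would mean passing `stratify=target` in the existing call. Doing so changes which rows land in each part, so the accuracy figures reported later would change as well.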

# **Data Modeling**

In [90]:
#Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=12345, n_estimators=4)

rf_model.fit(features_train, target_train)

rf_pred = rf_model.predict(features_valid)

In [29]:
#Logistic Regression 
from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression(random_state=12345, solver='liblinear')

model_lr.fit(features_train, target_train)

lr_pred = model_lr.predict(features_valid)

In [30]:
#DecisionTree Classifier
from sklearn.tree import DecisionTreeClassifier
#No random_state is set here, so the baseline score can vary between runs
model_dt = DecisionTreeClassifier()

model_dt.fit(features_train, target_train)

dt_pred = model_dt.predict(features_valid)

# **Model Evaluation**

In [32]:
#Accuracy score of each model
#RF accuracy
print("Random Forest Accuracy: ",rf_model.score(features_valid, target_valid))

#LR accuracy
print("Logistic Regression Accuracy:", model_lr.score(features_valid, target_valid))

#DT accuracy
print("Decision Tree Accuracy: ",model_dt.score(features_valid, target_valid))

Random Forest Accuracy:  0.734375
Logistic Regression Accuracy: 0.7916666666666666
Decision Tree Accuracy:  0.75
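
Accuracy alone does not show which kind of error a model makes. For a diagnosis task, a missed positive case (false negative) and a false alarm (false positive) usually carry different costs. The sketch below shows how a confusion matrix, precision, and recall could be computed alongside accuracy. The `y_true` and `y_pred` lists are hypothetical stand-ins for `target_valid` and one of the prediction arrays such as `lr_pred`; they are not results from this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, standing in for target_valid and lr_pred.
y_true = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)

acc = accuracy_score(y_true, y_pred)    # share of all predictions that are correct
prec = precision_score(y_true, y_pred)  # share of predicted positives that are correct
rec = recall_score(y_true, y_pred)      # share of actual positives that were found

print(cm)
print("Accuracy:", acc, "Precision:", prec, "Recall:", rec)
```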


# **Hyperparameter Tuning**

In [89]:
#Random Forest Tuning
rf_model = RandomForestClassifier(random_state=12345, n_estimators=40)
#rf_model = RandomForestClassifier(random_state=12345, n_estimators=42, max_depth= 120)
rf_model.fit(features_train, target_train)
print("Random Forest Accuracy: ",rf_model.score(features_valid, target_valid))

Random Forest Accuracy:  0.8072916666666666


In [52]:
#Decision Tree Tuning
model_dt = DecisionTreeClassifier(max_depth=6, random_state= 12345)
model_dt.fit(features_train, target_train)
print("Decision Tree Accuracy: ",model_dt.score(features_valid, target_valid))

Decision Tree Accuracy:  0.8229166666666666


In [81]:
#Logistic Regression Tuning
#Note: scikit-learn 1.2 and later expect penalty=None; the string 'none' was removed in 1.4
model_lr = LogisticRegression(random_state=12345, solver='lbfgs', penalty='none', max_iter=130)
#model_lr = LogisticRegression(random_state=12345, solver='liblinear', penalty='l1')

model_lr.fit(features_train, target_train)
print("Logistic Regression Accuracy:", model_lr.score(features_valid, target_valid))

Logistic Regression Accuracy: 0.828125
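
The tuning cells above try parameter values by hand and score each attempt on the validation set, so the reported validation accuracy is also the figure that guided the choice and may be optimistic. One possible alternative is a cross-validated grid search on the training data only, with the validation set kept for a final check. The sketch below illustrates this with `GridSearchCV`. It runs on synthetic data from `make_classification` so that it is self-contained, and the grid values are arbitrary examples rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data (768 rows, 8 features) so the sketch runs offline.
X, y = make_classification(n_samples=768, n_features=8, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# Illustrative grid; the values are assumptions, not tuned recommendations.
param_grid = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 5, 10]}

# 5-fold cross-validation on the training part only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=12345),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", search.best_score_)
# The validation set is used once, after the parameters are chosen.
print("Validation accuracy:", search.score(X_valid, y_valid))
```

In the notebook, `features_train` and `target_train` would replace `X_train` and `y_train`, and the same pattern applies to the random forest and logistic regression models with their own grids.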


# **Findings and Recommendations**

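- Based on validation accuracy after tuning, logistic regression is the best of the three models (0.828), narrowly ahead of the decision tree (0.823) and the random forest (0.807).
- None of the models reaches the required accuracy of 0.85, so further work is needed before the target in the problem statement is met.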