# Instructions

## Problem Statement

As a data professional working for a pharmaceutical company, you need to develop a model that predicts whether a patient will be diagnosed with diabetes. The model needs to have an accuracy score greater than 0.85.

You will be required to document the following steps:

● Data Importation

● Data Exploration

● Data Cleaning

● Data Preparation

● Data Modeling (Using Decision Trees, Random Forest and Logistic Regression)

● Model Evaluation

● Hyperparameter Tuning

● Findings and Recommendations

## Loading the necessary Libraries

In [7]:
# Importing our libraries
# ---
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

## Data Importation

In [8]:
# Load the dataset from url https://bit.ly/DiabetesDS

df = pd.read_csv('https://bit.ly/DiabetesDS')

## Data Exploration

In [14]:
# print first 5 rows of data
# ---
#
df.head()



,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [15]:
# print last 5 rows of data

df.tail()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.34,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
767,1,93,70,31,0,30.4,0.315,23,0


In [16]:
# Sample 10 rows of data

df.sample(10)

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
665,1,112,80,45,132,34.8,0.217,24,0
733,2,106,56,27,165,29.0,0.426,22,0
135,2,125,60,20,140,33.8,0.088,31,0
502,6,0,68,41,0,39.0,0.727,41,1
314,7,109,80,31,0,35.9,1.127,43,1
286,5,155,84,44,545,38.7,0.619,34,0
264,4,123,62,0,0,32.0,0.226,35,1
335,0,165,76,43,255,47.9,0.259,26,0
484,0,145,0,0,0,44.2,0.63,31,1
351,4,137,84,0,0,31.2,0.252,30,0


In [17]:

# check number of rows and columns

df.shape

(768, 9)

In [18]:

# Check datatypes
df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

Observation:
The dataset has 768 rows and 9 columns. Seven columns are of type int64 and two (BMI and DiabetesPedigreeFunction) are float64.


## Data Cleaning

In [19]:
# Check on missing entries for all variables

df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

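Observation: No null values were found in any column. Note that isnull() only detects NaN, so zero entries such as the Glucose value of 0 in row 502 and the BloodPressure value of 0 in row 484 of the sample above are not counted as missing.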
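Because `isnull()` does not flag zeros, a separate count of zero entries can show how many measurements may actually be missing. The snippet below is a minimal sketch: it uses four rows copied from the `df.sample(10)` output above as a stand-in for the full frame, and it assumes (this is not stated in the dataset) that a zero in the listed columns means "not measured".

```python
import pandas as pd

# Four rows (index 665, 502, 484, 351) copied from the df.sample(10) output,
# used as a stand-in for the full dataset so the snippet runs on its own.
rows = pd.DataFrame(
    {
        "Glucose": [112, 0, 145, 137],
        "BloodPressure": [80, 68, 0, 84],
        "SkinThickness": [45, 41, 0, 0],
        "Insulin": [132, 0, 0, 0],
        "BMI": [34.8, 39.0, 44.2, 31.2],
    }
)

# Assumption: a zero in these columns means the value was not recorded.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zero entries per column; isnull() reports no missing values for these rows.
zero_counts = (rows[zero_as_missing] == 0).sum()
print(zero_counts)
```

On the full dataset the same expression could be applied to `df[zero_as_missing]`, using the column names as they are at this point in the notebook.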

In [20]:

# Change column names and headers to have lower case characters
df.columns = df.columns.str.lower()
df.head()

,pregnancies,glucose,bloodpressure,skinthickness,insulin,bmi,diabetespedigreefunction,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Observation: All column headings were changed to lower case

In [21]:
# Checking for duplicate rows
# ---
df.duplicated().sum()

0

Observation:  No duplicates found in the data

In [22]:
# Checking if any of the columns are all null

df.isnull().all(axis = 0)

pregnancies                 False
glucose                     False
bloodpressure               False
skinthickness               False
insulin                     False
bmi                         False
diabetespedigreefunction    False
age                         False
outcome                     False
dtype: bool

Observation:  None of the columns contain all null values

In [23]:

# Checking if any of the rows are completely null

sum(df.isnull().all(axis = 1))

0

Observation:  No row contains completely null values

In [24]:
#creating a copy of our dataframe and exploring top 5 records

df_clean_copy = df.copy()
df_clean_copy.head()

,pregnancies,glucose,bloodpressure,skinthickness,insulin,bmi,diabetespedigreefunction,age,outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [25]:
#printing information about the copy dataset
df_clean_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   pregnancies               768 non-null    int64  
 1   glucose                   768 non-null    int64  
 2   bloodpressure             768 non-null    int64  
 3   skinthickness             768 non-null    int64  
 4   insulin                   768 non-null    int64  
 5   bmi                       768 non-null    float64
 6   diabetespedigreefunction  768 non-null    float64
 7   age                       768 non-null    int64  
 8   outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [26]:
# Checking statistical properties and transposing the summary table

df_clean_copy.describe().T


,count,mean,std,min,25%,50%,75%,max
pregnancies,768.0,3.845052,3.369578,0.0,1.0,3.0,6.0,17.0
glucose,768.0,120.894531,31.972618,0.0,99.0,117.0,140.25,199.0
bloodpressure,768.0,69.105469,19.355807,0.0,62.0,72.0,80.0,122.0
skinthickness,768.0,20.536458,15.952218,0.0,0.0,23.0,32.0,99.0
insulin,768.0,79.799479,115.244002,0.0,0.0,30.5,127.25,846.0
bmi,768.0,31.992578,7.88416,0.0,27.3,32.0,36.6,67.1
diabetespedigreefunction,768.0,0.471876,0.331329,0.078,0.24375,0.3725,0.62625,2.42
age,768.0,33.240885,11.760232,21.0,24.0,29.0,41.0,81.0
outcome,768.0,0.348958,0.476951,0.0,0.0,0.0,1.0,1.0
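The `outcome` row of the summary is useful for interpreting accuracy later on: the mean of a 0/1 column is the share of positive cases. The short calculation below is a sketch that uses only the two figures read from the table above (768 rows, mean 0.348958) to derive the class counts and the accuracy of a trivial model that always predicts the majority class.

```python
# Values read from the describe() output above.
n_rows = 768
outcome_mean = 0.348958  # mean of a 0/1 column = share of positive cases

# Recover the class counts from the mean.
n_positive = round(n_rows * outcome_mean)
n_negative = n_rows - n_positive

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(n_positive, n_negative) / n_rows
print(n_positive, n_negative, round(baseline_accuracy, 4))  # prints: 268 500 0.651
```

Any model accuracy is best read against this baseline of roughly 0.65 rather than against 0.5.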


In [27]:
df_clean_copy.columns

Index(['pregnancies', 'glucose', 'bloodpressure', 'skinthickness', 'insulin',
       'bmi', 'diabetespedigreefunction', 'age', 'outcome'],
      dtype='object')

In [28]:
# Separate the features (x) from the target (y)
x = df_clean_copy.drop(['outcome'], axis=1)
y = df_clean_copy.loc[:, 'outcome'].values

In [2]:
# The library is imported as `sklearn`, but the package to install is `scikit-learn`;
# the `sklearn` name on PyPI is a deprecated placeholder and should not be installed.
!pip install scikit-learn


In [33]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


In [29]:
from sklearn.linear_model import LogisticRegression

# Fit logistic regression on the full dataset
mdl = LogisticRegression(random_state=12345, solver='liblinear')
mdl.fit(x, y)

# score() is called on the same rows used for fitting, so this is training accuracy
print(mdl.score(x, y))


0.7747395833333334
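The score above comes from calling `score` on the same `x` and `y` that were passed to `fit`, so it describes training accuracy. One common way to estimate accuracy on unseen patients is to hold out part of the data with `train_test_split`. The sketch below illustrates that idea on synthetic stand-in data from `make_classification` (an assumption made so the snippet runs on its own; the names `X_demo` and `y_demo` are hypothetical). Its numbers say nothing about the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 768 rows and 8 features, the same shape as x.
# In the notebook, x and y from the earlier cell would be used instead.
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=12345)

# Hold out 25% of the rows; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=12345
)

model = LogisticRegression(random_state=12345, solver="liblinear")
model.fit(X_train, y_train)

# Compare accuracy on the fitted rows with accuracy on the held-out rows.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)

# Per-class precision and recall on the held-out rows.
print(classification_report(y_test, model.predict(X_test)))
```

The test accuracy is the figure that would be compared against the 0.85 target in the problem statement.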


In [30]:
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Features: every column except the target; target: the outcome column
x = df_clean_copy.drop(['outcome'], axis=1)
y = df_clean_copy.loc[:, 'outcome'].values

# Fit a depth-limited decision tree on the full dataset
mdl = DecisionTreeClassifier(random_state=12345, max_depth=5)
mdl.fit(x, y)

# No rows were held out, so the predictions are for the data the model was fitted on
train_predictions = mdl.predict(x)

print('Accuracy')
print('Training set:', accuracy_score(y, train_predictions))

Accuracy
Training set: 0.8372395833333334
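The figure above is measured on the rows the tree was fitted on, and a decision tree can fit those rows more closely than it generalises. Cross-validation is one way to check this without giving up any data. The following is a minimal sketch using `cross_val_score` on synthetic stand-in data (an assumption; `X_demo` and `y_demo` are hypothetical names, and the printed values do not describe the diabetes data).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption) with the same shape as the notebook's x and y.
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=12345)

tree = DecisionTreeClassifier(random_state=12345, max_depth=5)

# 5-fold CV: each fold is scored on rows the tree did not see while fitting.
cv_scores = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", cv_scores)
print("Mean CV accuracy:", cv_scores.mean())

# For contrast: accuracy on the same rows the tree was fitted on.
train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)
print("Training accuracy:", train_acc)
```

A large gap between the training accuracy and the mean cross-validated accuracy would suggest overfitting, in which case a smaller `max_depth` may be worth trying.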


In [31]:
from sklearn.ensemble import RandomForestClassifier

# x = features, y = target
mdl = RandomForestClassifier(random_state=12345, n_estimators=3)
mdl.fit(x, y)

# Accuracy on the same data used for fitting (training accuracy)
z = mdl.score(x, y)
print(z)

0.9466145833333334
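The problem statement lists hyperparameter tuning as a step, and the random forest above uses a fixed `n_estimators=3`. One possible approach is a small grid search with `GridSearchCV`, sketched below. The grid values are illustrative assumptions rather than recommendations, and the data is a synthetic stand-in (`X_demo`, `y_demo` are hypothetical names), so the printed scores do not apply to the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data (assumption) with the same shape as the notebook's x and y.
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=12345
)

# Hypothetical grid: the values are illustrative only.
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [3, 5, None],
}

# Each parameter combination is scored with 3-fold CV on the training rows only.
search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)

# The held-out rows are used once, after tuning, for the final estimate.
print("Held-out accuracy:", search.score(X_test, y_test))
```

Keeping the test rows out of the search means the final accuracy is not influenced by the choice of hyperparameters.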


## Results

DecisionTreeClassifier=0.8372395833333334


LogisticRegression=0.7747395833333334


RandomForestClassifier=0.9466145833333334



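From the analysis, the best model is RandomForestClassifier, with an accuracy of 0.9466145833333334, which is greater than the required 0.85. Note that all three scores were computed on the same data used to fit the models, so they are training accuracies; the result should be confirmed on held-out data.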






