<a href="https://colab.research.google.com/github/Lilwm/Model-Quality-and-Improvements/blob/main/Model_Quality_and_Improvements_Assignment_Lillian_Miiri.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
As a data professional working for a pharmaceutical company, you need to develop a model that predicts whether a patient will be diagnosed with diabetes.
The model needs to have an accuracy score greater than 0.85.


# 1. Data Importation

In [56]:
#Data Importation
import pandas as pd
import numpy as np

pharm_df = pd.read_csv("https://bit.ly/DiabetesDS")


# 2. Data Exploration

In [57]:
#check number of rows and columns
pharm_df.shape

(768, 9)

In [58]:
#check the first five records
pharm_df.head()

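|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |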


In [59]:
#check the last five records
pharm_df.tail()

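|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.34 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |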


In [60]:
#check data types and column info
pharm_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [61]:
# check summary of the data
pharm_df.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


## Observations
- The data has 9 columns and 768 records.
- The data frame appears to have no missing (NaN) values, but several columns have a minimum value of zero, which warrants a closer look.

# 3. Data Cleaning


In [62]:
# Many columns seem to have a minimum value of zero; check the percentage of zeros in each column
print("Proportion of missing values")
missing_percentage = (pharm_df==0).sum()*100/pharm_df.shape[0]
missing_percentage

Proportion of missing values


Pregnancies                 14.453125
Glucose                      0.651042
BloodPressure                4.557292
SkinThickness               29.557292
Insulin                     48.697917
BMI                          1.432292
DiabetesPedigreeFunction     0.000000
Age                          0.000000
Outcome                     65.104167
dtype: float64

In [63]:
# check whether zeros in SkinThickness coincide with zeros in other columns
bp_df = pharm_df.loc[pharm_df['SkinThickness']==0]
print("Count of zeros in blood_pressure:", (bp_df['BloodPressure']==0).sum())
print("Count of zeros in skinfold_thickness:", (bp_df['SkinThickness']==0).sum())
print("Count of zeros in insulin:", (bp_df['Insulin']==0).sum())

Count of zeros in blood_pressure: 33
Count of zeros in skinfold_thickness: 227
Count of zeros in insulin: 227


In [64]:
# check whether zeros in Insulin coincide with zeros in other columns
bp_df = pharm_df.loc[pharm_df['Insulin']==0]
print("Count of zeros in blood_pressure:", (bp_df['BloodPressure']==0).sum())
print("Count of zeros in skinfold_thickness:", (bp_df['SkinThickness']==0).sum())
print("Count of zeros in insulin:", (bp_df['Insulin']==0).sum())

Count of zeros in blood_pressure: 35
Count of zeros in skinfold_thickness: 227
Count of zeros in insulin: 374


## Observations
- Zeros in the Pregnancies column (a feature) and the Outcome column (the target) are valid values, so they cannot be considered missing values.
- Zeros in the remaining columns (Glucose, BloodPressure, SkinThickness, Insulin, BMI) are considered missing values and need to be replaced with NaN.
- All 35 zeros in the BloodPressure column occur in rows where Insulin is also zero, which suggests the missing values are dependent.
- Missing values in SkinThickness are likewise dependent on Insulin: all 227 rows with a zero SkinThickness also have a zero Insulin.
- Missing values make up almost 50% of Insulin and 30% of SkinThickness. We cannot drop these columns, as this would significantly affect the data.
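
The replacement described in the observations above could be sketched as follows. This is a minimal illustration on a few made-up rows, not the notebook's own cleaning step; `toy_df`, `zero_as_missing` and `cleaned_df` are hypothetical names, and only three of the affected columns are included to keep it short.

```python
import numpy as np
import pandas as pd

# Toy rows laid out like the dataset (values are illustrative only).
toy_df = pd.DataFrame({
    "Pregnancies": [6, 0, 8],
    "Glucose": [148, 0, 183],
    "SkinThickness": [35, 0, 0],
    "Insulin": [0, 0, 94],
    "Outcome": [1, 0, 1],
})

# Columns where a zero is read as "missing" rather than a real measurement.
zero_as_missing = ["Glucose", "SkinThickness", "Insulin"]

cleaned_df = toy_df.copy()
cleaned_df[zero_as_missing] = cleaned_df[zero_as_missing].replace(0, np.nan)

# Zeros in Pregnancies and Outcome are left untouched.
print(cleaned_df.isna().sum())
```

Once the zeros are NaN, `isna()` and `describe()` report the missing values directly instead of hiding them inside the statistics.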


# 4. Data Preparation


In [44]:
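# replace the missing values (zeros) in the SkinThickness column with the column mean
# note: this mean is computed with the zero placeholders still included
mean_value = pharm_df['SkinThickness'].mean()

# convert the float to int (truncates the decimal part)
mean_value = int(mean_value)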
mean_value

20

In [45]:
#replace 0 with the mean
# assign the result back rather than using inplace=True on a column selection
pharm_df['SkinThickness'] = pharm_df['SkinThickness'].replace(0, mean_value)

#confirm replacement by checking unique records
pharm_df['SkinThickness'].unique()

array([35, 29, 20, 23, 32, 45, 19, 47, 38, 30, 41, 33, 26, 15, 36, 11, 31,
       37, 42, 25, 18, 24, 39, 27, 21, 34, 10, 60, 13, 22, 28, 54, 40, 51,
       56, 14, 17, 50, 44, 12, 46, 16,  7, 52, 43, 48,  8, 49, 63, 99])
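
The mean used above is taken over all rows, including the zeros that stand for missing values, which pulls it down. The sketch below is a possible alternative rather than the notebook's method: it compares the two means on a small made-up series and imputes with the mean of the observed values only. The series and variable names are hypothetical.

```python
import pandas as pd

# Made-up skin thickness readings; 0 marks a missing measurement.
skin = pd.Series([35, 29, 0, 23, 0, 32], name="SkinThickness")

mean_with_zeros = skin.mean()                 # zeros drag the mean down
mean_without_zeros = skin[skin != 0].mean()   # mean of observed values only

# Impute the zeros with the (truncated) mean of the observed values.
imputed = skin.replace(0, int(mean_without_zeros))

print("mean including zeros:", mean_with_zeros)
print("mean excluding zeros:", mean_without_zeros)
print(imputed.tolist())
```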

In [None]:
pharm_df.info()

# 5. Data Modelling

We model the data using Decision Tree, Random Forest and Logistic Regression classifiers.


In [303]:
#import Libraries
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score


In [149]:
# Split dataset into training set and test set
train_df, test_df = train_test_split(pharm_df, test_size=0.25, random_state=12345) # 75% training and 25% test

features_train = train_df.drop(['Outcome'], axis=1)
target_train = train_df['Outcome']

features_test = test_df.drop(['Outcome'], axis=1)
target_test = test_df['Outcome']

In [150]:
#Decision Tree

model = DecisionTreeClassifier()

#train the model
model.fit(features_train, target_train)
#predict
DTC_prediction = model.predict(features_test)

#evaluate accuracy 
print("Accuracy Score:",accuracy_score(target_test, DTC_prediction))

Accuracy Score: 0.7864583333333334


In [151]:
#Random Forest
model=RandomForestClassifier()
#train the model
model.fit(features_train, target_train)
#predict
RF_prediction = model.predict(features_test)
#evaluate accuracy 
print("Accuracy Score:",accuracy_score(target_test, RF_prediction))

Accuracy Score: 0.8020833333333334


In [152]:
#logistic regression

model=LogisticRegression()
#train the model
model.fit(features_train, target_train)
#predict
LR_prediction = model.predict(features_test)
#evaluate accuracy 
print("Accuracy Score:",accuracy_score(target_test, LR_prediction))

Accuracy Score: 0.828125


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


# 6. Model Evaluation


In [153]:
print("Accuracy Scores: before Tuning")
print("Decision Tree Classifier:",accuracy_score(target_test, DTC_prediction))
print("Random Forest:",accuracy_score(target_test, RF_prediction))
print("Logistic Regression:",accuracy_score(target_test, LR_prediction))

Accuracy Scores: before Tuning
Decision Tree Classifier: 0.7864583333333334
Random Forest: 0.8020833333333334
Logistic Regression: 0.828125


## Observations


*   Logistic regression had the highest accuracy score at 82.8%, followed by random forest at 80.2% and lastly the decision tree classifier at 78.6%.
*   Given that we're looking for an accuracy score > 85%, none of the models meets the criterion.
*   We need to tune the hyperparameters to try to achieve a higher accuracy score.
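
To make the gap to the target concrete, the arithmetic behind these scores can be sketched in plain Python. The `accuracy` helper below is a hypothetical stand-in for `accuracy_score`, and the 192 test rows are assumed from 25% of the 768 records.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Worked example on four made-up labels: 3 of 4 predictions are right.
example_score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])

# Test set size assumed from the split: 25% of 768 rows.
test_rows = 192

# Number of correct predictions behind the logistic regression score.
lr_correct = round(0.828125 * test_rows)

# Smallest number of correct predictions that exceeds the 0.85 target.
needed = int(0.85 * test_rows) + 1

print(example_score, lr_correct, needed)
```

Under these assumptions the best untuned model is only a handful of correct predictions short of the target, so small changes in the split or in the random seed can move a model across the threshold.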





# 7. Hyperparameter Tuning


In [147]:
#Decision Tree


for depth in range(1, 8):
  model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
  model.fit(features_train, target_train)
  DTC_prediction = model.predict(features_test)
  print('max_depth =', depth, ': ', end='')
  print("Accuracy Score:",accuracy_score(target_test, DTC_prediction))

max_depth = 1 : Accuracy Score: 0.7835497835497836
max_depth = 2 : Accuracy Score: 0.7662337662337663
max_depth = 3 : Accuracy Score: 0.7748917748917749
max_depth = 4 : Accuracy Score: 0.7662337662337663
max_depth = 5 : Accuracy Score: 0.7965367965367965
max_depth = 6 : Accuracy Score: 0.7835497835497836
max_depth = 7 : Accuracy Score: 0.7575757575757576


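In the output above, the best accuracy score (79.65%) is reached at max_depth = 5. These scores are fractions of 231 test samples rather than 192, which suggests this cell was run on a different split from the one in section 5, so they are not directly comparable with the scores before tuning.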
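
Choosing `max_depth` by its test-set score lets the test set influence the model. One common alternative is to compare depths with cross-validation on the training data, using the `cross_val_score` function imported earlier. The sketch below assumes synthetic stand-in data, with `X` and `y` playing the role of `features_train` and `target_train`; its scores do not reflect the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption).
X, y = make_classification(n_samples=768, n_features=8, random_state=12345)

cv_scores = {}
for depth in range(1, 8):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    # Mean accuracy over 5 cross-validation folds for this depth.
    cv_scores[depth] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

best_depth = max(cv_scores, key=cv_scores.get)
print("best max_depth:", best_depth, "mean CV accuracy:", cv_scores[best_depth])
```

The held-out test set is then scored once, with the selected depth, to get an unbiased estimate.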

In [320]:
#Random Forest
for trees in range(20, 30):
  model = RandomForestClassifier(
      n_estimators=trees,
      min_samples_split=2,
      min_samples_leaf=2,
      max_features='sqrt',
      max_depth=10,
      bootstrap=True,
  )
  model.fit(features_train, target_train)
  RF_prediction = model.predict(features_test)
  print('n_estimators =', trees, ': ', end='')
  print("Accuracy Score:",accuracy_score(target_test, RF_prediction))

n_estimators = 20 : Accuracy Score: 0.78125
n_estimators = 21 : Accuracy Score: 0.8177083333333334
n_estimators = 22 : Accuracy Score: 0.8020833333333334
n_estimators = 23 : Accuracy Score: 0.8020833333333334
n_estimators = 24 : Accuracy Score: 0.7864583333333334
n_estimators = 25 : Accuracy Score: 0.8125
n_estimators = 26 : Accuracy Score: 0.8489583333333334
n_estimators = 27 : Accuracy Score: 0.828125
n_estimators = 28 : Accuracy Score: 0.8177083333333334
n_estimators = 29 : Accuracy Score: 0.8125


We get an improved accuracy score of 84.9% at n_estimators = 26, which is still just under the 0.85 target. Because no random_state is set, these scores vary from run to run. The trade-off is that training takes longer as n_estimators grows.
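
Since a random forest is randomized, a score found at one value of `n_estimators` may not reappear on a rerun unless the seed is fixed. A minimal sketch of making the fit reproducible with `random_state`, again on synthetic stand-in data (an assumption), with a hypothetical helper `fit_predict`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), split the same way as in the notebook.
X, y = make_classification(n_samples=768, n_features=8, random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

def fit_predict(seed):
    """Fit a forest with a fixed seed and return its test-set predictions."""
    model = RandomForestClassifier(
        n_estimators=26, max_depth=10, min_samples_leaf=2, random_state=seed
    )
    model.fit(X_train, y_train)
    return model.predict(X_test)

# Two fits with the same seed give identical predictions.
same_seed_match = np.array_equal(fit_predict(12345), fit_predict(12345))
print("Identical predictions with the same seed:", same_seed_match)
```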

# 8. Findings and Recommendations




*   Random Forest had the highest accuracy score, about 84.9% after setting n_estimators to 26, which is just below the 0.85 target.
*   After hyperparameter tuning, the best decision tree score shown was 79.7% (at max_depth = 5), compared with 78.6% before tuning.

