## Model Quality and Improvements

### Problem Statement

As a data professional working for a pharmaceutical company, you need to develop a model that predicts whether a patient will be diagnosed with diabetes. You will be required to document the following steps:

* Data Importation
* Data Exploration
* Data Cleaning
* Data Preparation
* Data Modeling (Using Decision Trees, Random Forest and Logistic Regression)
* Model Evaluation
* Hyperparameter Tuning
* Findings and Recommendations

Dataset: https://bit.ly/DiabetesDS

Project Source: https://bit.ly/3CU4b7d

### Measure of Success

The model must achieve an accuracy score greater than 0.80.

### Data Importation

Import Libraries

In [39]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler 
from sklearn.linear_model import LogisticRegression # Logistic Regression Classifier
from sklearn.tree import DecisionTreeClassifier     # Decision Tree Classifier
from sklearn.ensemble import RandomForestClassifier # Random Forest Classifier
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, mean_squared_error

Load Data

In [42]:
df = pd.read_csv('https://bit.ly/DiabetesDS')

### Data Exploration, Cleaning, and Preparation

In [4]:
df.head(3)

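|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |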


In [7]:
df.tail(3)

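|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |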


In [5]:
df.shape

(768, 9)

In [6]:
df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

In [10]:
df.duplicated().sum()

0

In [9]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
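
The null check above reports no missing values, but the rows shown by `df.head(3)` and `df.tail(3)` contain zeros for `SkinThickness` and `Insulin`. A minimal sketch, assuming (the source does not state this) that a zero in these measurement columns stands for an unrecorded value rather than a real reading, counts the zeros per column. The small `sample` frame re-types the six displayed rows so the snippet runs on its own; `zero_as_missing` is a hypothetical name.

```python
import pandas as pd

# The six rows shown by df.head(3) and df.tail(3), re-typed so the sketch is self-contained.
sample = pd.DataFrame(
    {
        "Glucose": [148, 85, 183, 121, 126, 93],
        "BloodPressure": [72, 66, 64, 72, 60, 70],
        "SkinThickness": [35, 29, 0, 23, 0, 31],
        "Insulin": [0, 0, 0, 112, 0, 0],
        "BMI": [33.6, 26.6, 23.3, 26.2, 30.1, 30.4],
    }
)

# Assumption: a zero in these columns means "not recorded", not a real measurement.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column; isnull() cannot see them because they are stored as numbers.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)
```

Running the same comparison on the full `df` would show how widespread the zeros are before deciding whether to impute or drop them.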

### Data Modeling

Set predictors and the target

In [43]:
predictors = df.drop('Outcome', axis=1)
target = df['Outcome']

Split data into training and testing sets

In [44]:
predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size=0.25, random_state=12345)
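
With `test_size=0.25`, the 768 rows split into 576 training rows and 192 test rows, which matches the `support` total in the classification reports further down. The sketch below checks that arithmetic on synthetic stand-in data and also shows the `stratify` argument of `train_test_split`, an option the original split does not use, which keeps the class proportions similar in both subsets. The arrays `X` and `y` and the one-third positive rate are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same row count as the dataset (768); the values are synthetic.
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
y = np.array([1] * 256 + [0] * 512)  # hypothetical class balance, one third positive

# stratify=y asks for the same share of each class in the training and test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```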

In [75]:
predictors_train

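|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 731 | 8 | 120 | 86 | 0 | 0 | 28.4 | 0.259 | 22 |
| 33 | 6 | 92 | 92 | 0 | 0 | 19.9 | 0.188 | 28 |
| 221 | 2 | 158 | 90 | 0 | 0 | 31.6 | 0.805 | 66 |
| 534 | 1 | 77 | 56 | 30 | 56 | 33.3 | 1.251 | 24 |
| 62 | 5 | 44 | 62 | 0 | 0 | 25.0 | 0.587 | 36 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 420 | 1 | 119 | 88 | 41 | 170 | 45.3 | 0.507 | 26 |
| 129 | 0 | 105 | 84 | 0 | 0 | 27.9 | 0.741 | 62 |
| 285 | 7 | 136 | 74 | 26 | 135 | 26.0 | 0.647 | 51 |
| 485 | 0 | 135 | 68 | 42 | 250 | 42.3 | 0.365 | 24 |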


Decision Tree Classifier

In [60]:
# instantiate classifier
decision_classifier = DecisionTreeClassifier()

# fitting
decision_classifier.fit(predictors_train, target_train)

# predict results
decision_prediction = decision_classifier.predict(predictors_test)

Random Forest Classifier

In [61]:
# instantiate classifier
random_classifier = RandomForestClassifier()

# fitting
random_classifier.fit(predictors_train, target_train)

# predict results
random_prediction = random_classifier.predict(predictors_test)
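
`DecisionTreeClassifier()` and `RandomForestClassifier()` are created above without a `random_state`, so their scores can change from run to run. A minimal sketch, on synthetic data from `make_classification` (an assumption, standing in for the diabetes data), shows that fixing `random_state` makes two separately trained forests agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the predictors (768 rows, 8 features).
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# Two forests trained independently but with the same seed.
forest_a = RandomForestClassifier(random_state=12345).fit(X_train, y_train)
forest_b = RandomForestClassifier(random_state=12345).fit(X_train, y_train)

same = np.array_equal(forest_a.predict(X_test), forest_b.predict(X_test))
print("identical predictions:", same)
```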

Logistic Regression

In [62]:
# instantiate classifier
logistic_classifier = LogisticRegression(max_iter = 1000)

# fitting
logistic_classifier.fit(predictors_train, target_train)

# predict results
logistic_prediction = logistic_classifier.predict(predictors_test)
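
`MinMaxScaler` is imported at the top of the notebook but never applied. A hedged sketch of how it could be combined with logistic regression through `make_pipeline` follows; it runs on synthetic data (an assumption), and `scaled_logistic` is a hypothetical name. Whether scaling changes the score on the diabetes data is not something the source reports.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in with the same shape as the predictors (768 rows, 8 features).
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# The pipeline fits the scaler on the training rows only, then reuses it on the test rows.
scaled_logistic = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
scaled_logistic.fit(X_train, y_train)

test_accuracy = scaled_logistic.score(X_test, y_test)
print(test_accuracy)
```

Wrapping the scaler and the classifier in one pipeline keeps test rows out of the scaler's fitted minimum and maximum.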

### Model Evaluation

Accuracy scores

In [63]:
# accuracy_score expects the true labels first, then the predictions
print(accuracy_score(target_test, decision_prediction))
print(accuracy_score(target_test, random_prediction))
print(accuracy_score(target_test, logistic_prediction))

0.7552083333333334
0.8177083333333334
0.8229166666666666


Classification report

In [64]:
print('Decision Tree Classifier:')
print(classification_report(target_test, decision_prediction))

print('\n')

print('Random Forest Classifier:')
print(classification_report(target_test, random_prediction))

print('\n')

print('Logistic Regression:')
print(classification_report(target_test, logistic_prediction))

Decision Tree Classifier:
              precision    recall  f1-score   support

           0       0.82      0.83      0.82       132
           1       0.61      0.60      0.61        60

    accuracy                           0.76       192
   macro avg       0.71      0.71      0.71       192
weighted avg       0.75      0.76      0.75       192



Random Forest Classifier:
              precision    recall  f1-score   support

           0       0.84      0.90      0.87       132
           1       0.75      0.63      0.68        60

    accuracy                           0.82       192
   macro avg       0.79      0.77      0.78       192
weighted avg       0.81      0.82      0.81       192



Logistic Regression:
              precision    recall  f1-score   support

           0       0.82      0.95      0.88       132
           1       0.82      0.55      0.66        60

    accuracy                           0.82       192
   macro avg       0.82      0.75      0.77       192


Confusion matrix

In [65]:
print('Decision Tree Classifier:')
print(confusion_matrix(target_test, decision_prediction))

print('\n')

print('Random Forest Classifier:')
print(confusion_matrix(target_test, random_prediction))

print('\n')

print('Logistic Regression:')
print(confusion_matrix(target_test, logistic_prediction))

Decision Tree Classifier:
[[109  23]
 [ 24  36]]


Random Forest Classifier:
[[119  13]
 [ 22  38]]


Logistic Regression:
[[125   7]
 [ 27  33]]
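
The figures in the classification report can be recovered from these matrices. As a worked check, the sketch below takes the logistic regression matrix printed above (rows are true classes, columns are predicted classes, following scikit-learn's convention) and recomputes accuracy plus precision and recall for class 1. The variable names are illustrative.

```python
import numpy as np

# Logistic regression confusion matrix as printed above.
cm = np.array([[125, 7],
               [27, 33]])

# For a 2x2 matrix, ravel() yields true negatives, false positives,
# false negatives, true positives, in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all 192 test rows predicted correctly
precision_1 = tp / (tp + fp)      # of the rows predicted diabetic, share that were
recall_1 = tp / (tp + fn)         # of the truly diabetic rows, share that were found

print(float(accuracy), float(precision_1), float(recall_1))
```

The results agree with the reported accuracy of 0.8229 and with the class 1 precision (0.82) and recall (0.55) in the classification report.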


Mean squared error

In [66]:
print('Decision Tree Classifier:')
print(mean_squared_error(target_test, decision_prediction))

print('\n')

print('Random Forest Classifier:')
print(mean_squared_error(target_test, random_prediction))

print('\n')

print('Logistic Regression:')
print(mean_squared_error(target_test, logistic_prediction))

Decision Tree Classifier:
0.24479166666666666


Random Forest Classifier:
0.18229166666666666


Logistic Regression:
0.17708333333333334
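
For 0/1 class labels, each squared error is 1 for a wrong prediction and 0 for a correct one, so `mean_squared_error` equals the misclassification rate (1 minus accuracy), and the root mean square error is its square root. The sketch below rebuilds label arrays from the logistic regression confusion matrix counts to check this; the row ordering is arbitrary and assumed, since only the counts matter here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Labels rebuilt from the logistic regression confusion matrix [[125, 7], [27, 33]]:
# 132 true negatives-class rows followed by 60 true positive-class rows.
y_true = np.array([0] * 132 + [1] * 60)
y_pred = np.array([0] * 125 + [1] * 7 + [0] * 27 + [1] * 33)

mse = mean_squared_error(y_true, y_pred)   # fraction of wrong predictions for 0/1 labels
rmse = np.sqrt(mse)                        # root mean square error
acc = accuracy_score(y_true, y_pred)

print(mse, 1 - acc, rmse)
```

Because the value carries the same information as accuracy, it adds no independent evidence when ranking these classifiers.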


### Model Testing

Testing with the best model: logistic regression

In [None]:
new_data = [[0, 120, 140, 40, 0, 37, 0.627, 26]]
test_df = pd.DataFrame(new_data, columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'])
test_df.sample()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 120 | 140 | 40 | 0 | 37 | 0.627 | 26 |


In [None]:
new_data2 = [[10, 140, 150, 45, 0, 41, 0.627, 36]]
test_df2 = pd.DataFrame(new_data2, columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'])
test_df2.sample()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 140 | 150 | 45 | 0 | 41 | 0.627 | 36 |


In [None]:
print('Test 1', logistic_classifier.predict(test_df))
print('Test 2', logistic_classifier.predict(test_df2))

Test 1 [0]
Test 2 [1]
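
`predict` returns only the class label. A minimal sketch, using a model trained on synthetic data as a stand-in (an assumption, since the fitted `logistic_classifier` is not available outside the notebook), shows `predict_proba`, which returns the estimated probability of each class and makes it possible to see how close a case is to the 0.5 decision threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the diabetes predictors and outcome.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Two rows play the role of the new patients.
new_rows = X[:2]
labels = model.predict(new_rows)
probabilities = model.predict_proba(new_rows)  # one row per case: [P(class 0), P(class 1)]

for label, (p0, p1) in zip(labels, probabilities):
    print(f"predicted class {label}, P(class 1) = {p1:.2f}")
```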


### Hyperparameter Tuning

In [72]:
for depth in range(1, 6):
    decision_classifier = DecisionTreeClassifier(random_state=12345, max_depth=depth)

    decision_classifier.fit(predictors_train, target_train)

    predictions = decision_classifier.predict(predictors_test)

    print("max_depth =", depth)
    print(accuracy_score(target_test, predictions))

max_depth = 1
0.7708333333333334
max_depth = 2
0.7708333333333334
max_depth = 3
0.7604166666666666
max_depth = 4
0.75
max_depth = 5
0.8177083333333334
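
The loop above picks `max_depth` by scoring on the test set, so the winning test score is no longer an independent estimate. A hedged alternative, sketched here on synthetic data (an assumption), selects `max_depth` by 5-fold cross-validation on the training rows with `GridSearchCV` and scores the test rows only once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with the same shape as the predictors (768 rows, 8 features).
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# Try the same depths as the loop above, but judge them by cross-validated accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=12345),
    param_grid={"max_depth": list(range(1, 6))},
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
final_accuracy = search.score(X_test, y_test)  # refit best model, scored once on test rows
print("best max_depth:", best_depth)
print("test accuracy:", final_accuracy)
```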


### Findings and Recommendations

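Logistic Regression was the best model, with a test accuracy of about 0.82 (0.8229), which exceeds the 0.80 success threshold. Random Forest followed closely at 0.8177, and the Decision Tree reached the same 0.8177 only after tuning to `max_depth = 5` (0.7552 untuned). Logistic Regression's recall for the diabetic class was 0.55, however, so 27 of the 60 diabetic patients in the test set were missed.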