
# **Model Quality and Improvements**

## **Problem Statement**
As a data professional working for a pharmaceutical company, you need to develop a model that predicts whether a patient will be diagnosed with diabetes. The model needs to have an accuracy score greater than 0.85.

You will be required to document the following steps:
* Data Importation
* Data Exploration
* Data Cleaning
* Data Preparation
* Data Modeling (Using Decision Trees, Random Forest and Logistic Regression)
* Model Evaluation
* Hyperparameter Tuning
* Findings and Recommendations

### **Dataset**
* Dataset URL: https://bit.ly/DiabetesDS
* Project Source: https://bit.ly/3CU4b7d

## Data Importation

In [2]:
# Modules to be used

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [3]:
# reading dataset
diabetes_df = pd.read_csv("https://bit.ly/DiabetesDS")

diabetes_df.head()

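|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |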


## Data Exploration

In [4]:
# dataset shape

diabetes_df.shape

(768, 9)

In [5]:
# Dataset information
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [6]:
# checking for null values
diabetes_df.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

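* **No null values in our dataset.** Zeros in columns such as `Insulin` and `BloodPressure` (examined below) may still stand in for missing measurements.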

In [7]:
diabetes_df.skew()

Pregnancies                 0.901674
Glucose                     0.173754
BloodPressure              -1.843608
SkinThickness               0.109372
Insulin                     2.272251
BMI                        -0.428982
DiabetesPedigreeFunction    1.919911
Age                         1.129597
Outcome                     0.635017
dtype: float64

In [8]:
#finding unique value counts of the Insulin feature
diabetes_df.Insulin.value_counts().sort_index(ascending = True)

0      374
14       1
15       1
16       1
18       2
      ... 
579      1
600      1
680      1
744      1
846      1
Name: Insulin, Length: 186, dtype: int64

In [9]:
diabetes_df.BloodPressure.value_counts().sort_index(ascending = True)

0      35
24      1
30      2
38      1
40      1
44      4
46      2
48      5
50     13
52     11
54     11
55      2
56     12
58     21
60     37
61      1
62     34
64     43
65      7
66     30
68     45
70     57
72     44
74     52
75      8
76     39
78     45
80     40
82     30
84     23
85      6
86     21
88     25
90     22
92      8
94      6
95      1
96      4
98      3
100     3
102     1
104     2
106     3
108     2
110     3
114     1
122     1
Name: BloodPressure, dtype: int64

In [10]:
diabetes_df.DiabetesPedigreeFunction.value_counts().sort_values(ascending = True)

1.174    1
0.318    1
0.325    1
1.222    1
0.179    1
        ..
0.261    5
0.207    5
0.268    5
0.254    6
0.258    6
Name: DiabetesPedigreeFunction, Length: 517, dtype: int64

**The value counts above suggest that `BloodPressure`, `Insulin` and `DiabetesPedigreeFunction` are poorly distributed. `Insulin` in particular is zero in 374 of the 768 rows, and `BloodPressure` is zero in 35.**
We recommend:
* Dropping the `Insulin` and `DiabetesPedigreeFunction` fields
* Replacing zeros in `BloodPressure` with the mean
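The zero counts can be put in proportion to the 768 rows. The sketch below uses only the counts printed earlier, plus a tiny made-up frame to show how the same check could be run per column. The helper name `zero_share` is hypothetical, and reading zeros as missing measurements is an assumption, since the dataset itself does not say so.

```python
import pandas as pd


def zero_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of rows equal to zero for each column."""
    return (df == 0).mean()


# Arithmetic from the value counts printed above
n_rows = 768
print(f"Insulin zeros: {374 / n_rows:.1%}")        # 48.7%
print(f"BloodPressure zeros: {35 / n_rows:.1%}")   # 4.6%

# Tiny illustrative frame (made-up values, not the real dataset)
demo = pd.DataFrame({
    "Insulin": [0, 0, 94, 168],
    "BloodPressure": [72, 66, 0, 40],
})
shares = zero_share(demo)
print(shares)
```

On the real data, `zero_share(diabetes_df)` would report the share for every column. Columns such as `Pregnancies` and `Outcome` can legitimately be zero, so the result needs reading column by column.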

## Data Cleaning

In [11]:
# Drop the Insulin and DiabetesPedigreeFunction columns
clean_df = diabetes_df.drop(['Insulin', 'DiabetesPedigreeFunction'], axis=1)
clean_df.head(2)

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | BMI | Age | Outcome |
|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 33.6 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 26.6 | 31 | 0 |


In [12]:
# Replace zeros in 'BloodPressure' and 'SkinThickness' with the column mean
for col in ['BloodPressure', 'SkinThickness']:
    # treat zeros as missing values
    clean_df[col] = clean_df[col].replace(0, np.nan)
    # fill the missing values with the mean of the remaining readings
    clean_df[col] = clean_df[col].fillna(clean_df[col].mean())

In [13]:
clean_df.skew()

Pregnancies      0.901674
Glucose          0.173754
BloodPressure    0.137305
SkinThickness    0.822173
BMI             -0.428982
Age              1.129597
Outcome          0.635017
dtype: float64

## Data Preparation

In [14]:
#feature dataset
features = clean_df.drop('Outcome', axis =1)
features.head(2)

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | BMI | Age |
|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72.0 | 35.0 | 33.6 | 50 |
| 1 | 1 | 85 | 66.0 | 29.0 | 26.6 | 31 |


In [15]:
#Target dataset
target = clean_df['Outcome']
target.head(2)

0    1
1    0
Name: Outcome, dtype: int64

In [16]:
#Splitting features and target into train and test data
feature_train , feature_test, target_train, target_test = train_test_split(features, target, random_state=17, test_size=0.2)

print(f"Features Train: {feature_train.shape}")
print(f"Features Test: {feature_test.shape}")
print(f"Target Train: {target_train.shape}")
print(f"Target Test: {target_test.shape}")

Features Train: (614, 6)
Features Test: (154, 6)
Target Train: (614,)
Target Test: (154,)
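In this notebook the zeros were replaced with means computed over all 768 rows, before the split, so the test rows contributed to the imputed values. One possible alternative, sketched here on made-up stand-in data, is to split first and fit scikit-learn's `SimpleImputer` on the training rows only. The data, column values and variable names below are illustrative assumptions, not the notebook's own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up stand-in data; 0 plays the role of a missing reading
rng = np.random.default_rng(17)
X = pd.DataFrame({
    "BloodPressure": rng.integers(50, 100, size=50),
    "SkinThickness": rng.integers(10, 40, size=50),
}).astype(float)
X.iloc[::7] = 0.0  # inject zeros into every seventh row
y = pd.Series(rng.integers(0, 2, size=50))

# Split first, so the test rows never influence the imputed values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=17
)

# Treat 0 as the missing marker and learn the column means on training rows only
imputer = SimpleImputer(missing_values=0, strategy="mean")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)  # reuses the training means

print(imputer.statistics_)  # per-column means learned from the training rows
```

This only suits columns where zero cannot be a real value; a column like `Pregnancies` would need to be left out.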


## Data Modeling (Using Decision Trees, Random Forest and Logistic Regression)


### Using Decision Trees

In [17]:

model_1 = DecisionTreeClassifier(random_state = 17, max_depth=4)

model_1.fit(feature_train, target_train)

DT_target_predict = model_1.predict(feature_test)

### Using Random Forest

In [18]:
model_2 = RandomForestClassifier(random_state = 17, max_depth=4)

model_2.fit(feature_train, target_train)

RF_target_predict = model_2.predict(feature_test)

### Using Logistic Regression

In [19]:
model_3 = LogisticRegression(random_state = 17)

model_3.fit(feature_train, target_train)

LR_target_predict = model_3.predict(feature_test)

## Model Evaluation

In [20]:
dtm_accuracy = accuracy_score(target_test, DT_target_predict)
rfm_accuracy = accuracy_score(target_test, RF_target_predict)
lrm_accuracy = accuracy_score(target_test, LR_target_predict)

print(f"The accuracy for our Decision Tree model is: {dtm_accuracy}")
print(f"The accuracy for our Random Forest model is: {rfm_accuracy}")
print(f"The accuracy for our Logistic Regression model is: {lrm_accuracy}")

The accuracy for our Decision Tree model is: 0.7532467532467533
The accuracy for our Random Forest model is: 0.7727272727272727
The accuracy for our Logistic Regression model is: 0.7792207792207793
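Accuracy on its own can be hard to interpret when the classes are unbalanced, and the positive skew of `Outcome` reported earlier indicates more 0s than 1s. As a hedged sketch, the code below compares a model against a majority-class baseline and prints a confusion matrix. It runs on synthetic data from `make_classification`; the class ratio and all variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (class ratio is an illustrative assumption)
X, y = make_classification(
    n_samples=768, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.65, 0.35], random_state=17,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=17, stratify=y
)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, model.predict(X_test))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, model.predict(X_test))

print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Model accuracy:    {model_acc:.3f}")
print(cm)
```

With the notebook's own objects, the same calls could be applied to `target_test` and each model's predictions to see how far each model sits above the baseline.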


## Hyperparameter Tuning


In [21]:
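#Tuned Logistic Regression model
# penalty=None replaces the string 'none', which newer scikit-learn releases no longer accept
# (on scikit-learn older than 1.2, keep penalty='none')
tuned_lr_model = LogisticRegression(random_state=53, solver='sag', penalty=None, max_iter=1000)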
tuned_lr_model.fit(feature_train, target_train)

TLR_target_predict = tuned_lr_model.predict(feature_test)

In [22]:
tuned_lr_accuracy = accuracy_score(target_test, TLR_target_predict)
print(f"The accuracy for our Tuned Logistic Regression model is: {tuned_lr_accuracy}")

The accuracy for our Tuned Logistic Regression model is: 0.7077922077922078
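The tuned model scores lower than the untuned one. One possible contributor, stated here as an assumption rather than something verified on this data, is feature scale: the scikit-learn documentation notes that fast convergence of the `sag` solver is only guaranteed when features are on approximately the same scale. The sketch below shows how a `StandardScaler` could be combined with the solver in a pipeline. It uses synthetic data and the default L2 penalty, and the name `scaled_lr` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, stretched so the columns sit on very different scales
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0,
    n_clusters_per_class=1, random_state=17,
)
X = X * np.array([1, 100, 50, 20, 30, 40])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=17
)

# The scaler is fitted on the training data only, then applied before the solver runs
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="sag", max_iter=1000, random_state=53),
)
scaled_lr.fit(X_train, y_train)

scaled_acc = scaled_lr.score(X_test, y_test)
print(f"Accuracy with scaling + sag: {scaled_acc:.3f}")
```

Whether scaling would recover the lost accuracy on the diabetes data is untested here; the pipeline only shows how the two steps can be combined.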


In [23]:
tuned_rfm = RandomForestClassifier(bootstrap=True,
                                   max_depth=70,
                                   # 'auto' meant 'sqrt' for classifiers and was removed in newer scikit-learn
                                   max_features='sqrt',
                                   min_samples_leaf=4,
                                   min_samples_split=10,
                                   n_estimators=400)

tuned_rfm.fit(feature_train, target_train)

TRF_target_predict = tuned_rfm.predict(feature_test)

In [24]:
tuned_rf_accuracy = accuracy_score(target_test, TRF_target_predict)
print(f"The accuracy for our Tuned Random Forest model is: {tuned_rf_accuracy}")

The accuracy for our Tuned Random Forest model is: 0.7727272727272727
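The notebook does not show how the tuned values were selected. One common way to pick them is a cross-validated grid search, sketched below with `GridSearchCV` on synthetic data. The grid values are illustrative assumptions, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    random_state=17,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=17
)

# Small illustrative grid; a real search would likely cover more values
param_grid = {
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 4],
    "n_estimators": [50, 100],
}

# 5-fold cross-validation on the training set only; the test set stays untouched
search = GridSearchCV(
    RandomForestClassifier(random_state=17),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")

# The best estimator is refitted on the full training set by default
test_acc = search.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.3f}")
```

Choosing settings by cross-validation rather than by test accuracy keeps the test set as an independent check.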


## Findings and Recommendations

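* The untuned Logistic Regression model has the highest accuracy (0.779), narrowly ahead of Random Forest (0.773) and Decision Tree (0.753).
* The tuned Logistic Regression model has the lowest accuracy (0.708), so the chosen `sag` settings made the model worse rather than better.
* Tuning the Random Forest left its accuracy unchanged at 0.773.
* None of the models reaches the required accuracy of 0.85, so further work on data cleaning, feature preparation and hyperparameter search is recommended.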
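The accuracies printed in the cells above can be ranked directly and compared with the 0.85 target from the problem statement. This is a small sketch using only those reported numbers; the variable names are hypothetical.

```python
# Accuracies copied from the outputs printed in this notebook
reported = {
    "Decision Tree": 0.7532467532467533,
    "Random Forest": 0.7727272727272727,
    "Logistic Regression": 0.7792207792207793,
    "Tuned Logistic Regression": 0.7077922077922078,
    "Tuned Random Forest": 0.7727272727272727,
}
target_accuracy = 0.85  # threshold from the problem statement

# Sort from highest to lowest accuracy
ranked = sorted(reported.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranked:
    status = "meets" if acc > target_accuracy else "below"
    print(f"{name:<26} {acc:.3f}  ({status} target)")

best_name = ranked[0][0]
worst_name = ranked[-1][0]
print(f"Highest: {best_name}; lowest: {worst_name}")
```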