
# Model Quality and Improvements Project

## 1. Defining the Question


### a) Specifying the Data Analysis Question

Predict whether a patient will be diagnosed with diabetes.

### b) Defining the Metric for Success

The analysis question will be answered by a model that predicts whether a patient will be diagnosed with diabetes. The model needs an accuracy score greater than 0.85.
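
As a minimal sketch of how this success criterion can be checked, the snippet below computes accuracy with scikit-learn's `accuracy_score` and compares it with the 0.85 target. The labels and predictions are made-up values for illustration only, and `ACCURACY_TARGET` is a hypothetical name.

```python
from sklearn.metrics import accuracy_score

# Made-up labels and predictions, for illustration only (not project data)
y_true = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 0, 0, 1, 0, 1]

ACCURACY_TARGET = 0.85  # threshold stated in the metric for success

# Accuracy is the fraction of predictions that match the true labels
accuracy = accuracy_score(y_true, y_pred)
meets_target = accuracy > ACCURACY_TARGET

print(f"accuracy = {accuracy:.2f}, meets target: {meets_target}")
```

Here 8 of the 10 predictions match, so the accuracy is 0.80 and the target is not met.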

### c) Understanding the context 

As a data professional working for a pharmaceutical company, you need to develop a model that predicts whether a patient will be diagnosed with diabetes. 




### d) Recording the Experimental Design

1. Read in the data from the source so that it is available for analysis
2. Explore the data to understand its structure
3. Prepare the data for analysis:
* Check for and handle missing values
* Find and remove duplicate records
* Delete null columns and rows
* Rename columns
* Check the columns for uniformity, correcting errors in values and datatypes
4. Modeling
* Define and train the model
* Hyperparameter tuning
* Make predictions using the model
5. Model Evaluation
6. Findings and Recommendations



### e) Data Relevance

The dataset includes patient information that is relevant to answering the research question.


## 2. Reading the Data

In [1]:
# Importing our libraries

import pandas as pd

import numpy as np

In [2]:
# Load the dataset

# Dataset url = https://bit.ly/DiabetesDS
diabetes_df = pd.read_csv('https://bit.ly/DiabetesDS')

In [3]:
# Checking the first 5 rows of data
diabetes_df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [4]:
# Checking the last 5 rows of data
diabetes_df.tail()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.34,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
767,1,93,70,31,0,30.4,0.315,23,0


In [5]:
# Determine the size of the dataset
diabetes_df.shape

(768, 9)

In [6]:
# Checking datatypes
diabetes_df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

In [7]:
# View the features in the dataset
diabetes_df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

The dataset provided has a total of 768 observations and 9 variables. The variables are:
* Pregnancies: Number of times pregnant
* Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test
* BloodPressure: Diastolic blood pressure (mm Hg)
* SkinThickness: Triceps skin fold thickness (mm)
* Insulin: 2-Hour serum insulin (mu U/ml)
* BMI: Body mass index (weight in kg / (height in m)^2)
* DiabetesPedigreeFunction: Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
* Age: Age (years)
* Outcome: Class variable (0 if non-diabetic, 1 if diabetic)

## 3. External Data Source Validation

The data was collected and made available by the National Institute of Diabetes and Digestive and Kidney Diseases as part of the Pima Indians Diabetes Database, and is therefore considered valid for this analysis.

## 4. Data Preparation

### Performing Data Cleaning

In [8]:
# Standardize column names - remove whitespaces and convert to lowercase
diabetes_df.columns = diabetes_df.columns.str.strip().str.lower()
list(diabetes_df.columns)

['pregnancies',
 'glucose',
 'bloodpressure',
 'skinthickness',
 'insulin',
 'bmi',
 'diabetespedigreefunction',
 'age',
 'outcome']

In [9]:
# Checking for duplicate rows in the dataset

sum(diabetes_df.duplicated())

0

In [10]:
# Checking if any of the columns are all null (axis=0 tests down each column)

diabetes_df.isnull().all(axis=0).any()

False

In [11]:
# Checking if any of the rows are all null (axis=1 tests across each row)

diabetes_df.isnull().all(axis=1).any()

False

In [12]:
# Check for missing values

print(diabetes_df.isnull().values.any())

diabetes_df.isnull().sum()

False


pregnancies                 0
glucose                     0
bloodpressure               0
skinthickness               0
insulin                     0
bmi                         0
diabetespedigreefunction    0
age                         0
outcome                     0
dtype: int64

In [13]:
# Obtain summary statistics for the dataset

diabetes_df.describe()

,pregnancies,glucose,bloodpressure,skinthickness,insulin,bmi,diabetespedigreefunction,age,outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


## 5. Solution Implementation

### Data Preparation

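Glucose, blood pressure, skin thickness, insulin and BMI contain 0 as a data value, which is implausible for these measurements and most likely stands for a missing reading. We replace these zeros with the rounded mean of each variable, taken from the summary statistics above (these means were computed with the zeros included).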

In [14]:
# Replacing 0 values with the mean value.
diabetes_df.loc[diabetes_df['glucose'] == 0, 'glucose'] = 120.89
diabetes_df.loc[diabetes_df['bloodpressure'] == 0, 'bloodpressure'] = 69.11
diabetes_df.loc[diabetes_df['skinthickness'] == 0, 'skinthickness'] = 20.54
diabetes_df.loc[diabetes_df['insulin'] == 0, 'insulin'] = 79.80
diabetes_df.loc[diabetes_df['bmi'] == 0, 'bmi'] = 32.0

diabetes_df.describe()


,pregnancies,glucose,bloodpressure,skinthickness,insulin,bmi,diabetespedigreefunction,age,outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,121.681576,72.255013,26.607526,118.660417,32.450911,0.471876,33.240885,0.348958
std,3.369578,30.436016,12.115878,9.63058,93.080252,6.875366,0.331329,11.760232,0.476951
min,0.0,44.0,24.0,7.0,14.0,18.2,0.078,21.0,0.0
25%,1.0,99.75,64.0,20.54,79.8,27.5,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,79.8,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0
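
The constants used above are hard-coded and come from means that still include the zeros. As a hedged alternative, not what this notebook does, the sketch below marks zeros as missing and fills them with the mean of the remaining non-zero values. The frame `toy_df` and the list `zero_as_missing` are made-up names and values for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up frame that mimics the zero-as-missing pattern (not the real data)
toy_df = pd.DataFrame({
    "glucose": [148, 85, 0, 89],
    "bmi": [33.6, 0.0, 23.3, 28.1],
    "pregnancies": [6, 1, 0, 1],  # 0 is a valid value here, so it is left alone
})

# Columns where a 0 cannot be a real reading (assumption for this sketch)
zero_as_missing = ["glucose", "bmi"]

# Treat 0 as missing, then fill with the mean of the remaining non-zero values
imputed = toy_df.copy()
imputed[zero_as_missing] = imputed[zero_as_missing].replace(0, np.nan)
imputed[zero_as_missing] = imputed[zero_as_missing].fillna(
    imputed[zero_as_missing].mean()
)

print(imputed)
```

Computing the fill values in code also keeps them in sync with the data if the dataset is refreshed.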


### Defining and training the model

#### Datasets, features and target selection

In [15]:
# Split the dataset into training and validation datasets

# Import train_test_split from the sklearn.model_selection module
from sklearn.model_selection import train_test_split

# Split the dataset into train data and validation data
df_train, df_valid = train_test_split(diabetes_df, test_size=0.25, random_state=12345)

# Declare features and target variables
features_train = df_train.drop(['outcome'], axis=1)
target_train = df_train['outcome']
features_valid = df_valid.drop(['outcome'], axis=1)
target_valid = df_valid['outcome']
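
About 35% of the patients have `outcome` = 1 (mean 0.349 in the summary statistics), so a purely random split can leave the training and validation sets with slightly different class ratios. As an optional variation that this notebook does not use, `train_test_split` accepts a `stratify` argument that preserves the class ratio in both parts. The sketch below uses a made-up frame, `toy`, rather than the diabetes data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 8 positives and 12 negatives (40% positive)
toy = pd.DataFrame({"x": range(20), "outcome": [1] * 8 + [0] * 12})

# stratify keeps the share of each class the same in both splits
train, valid = train_test_split(
    toy, test_size=0.25, random_state=12345, stratify=toy["outcome"]
)

print(train["outcome"].mean(), valid["outcome"].mean())
```

Both parts keep the 40% positive share: 6 of 15 training rows and 2 of 5 validation rows.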

### Hyperparameter tuning

#### Decision tree classifier model

In [16]:
# import decision tree from the sklearn library
from sklearn.tree import DecisionTreeClassifier

# Loop over the 'max_depth' hyperparameter to find the value with the highest accuracy score
for depth in range(1, 11):
    model1 = DecisionTreeClassifier(random_state=12345, max_depth=depth) # Define the model

    model1.fit(features_train, target_train) # Train the model

    print("max_depth =", depth, ": ", end='') # Label the result with the current depth
    print(model1.score(features_valid, target_valid)) # Display the validation accuracy

max_depth = 1 : 0.7708333333333334
max_depth = 2 : 0.7708333333333334
max_depth = 3 : 0.7604166666666666
max_depth = 4 : 0.7395833333333334
max_depth = 5 : 0.7708333333333334
max_depth = 6 : 0.78125
max_depth = 7 : 0.7708333333333334
max_depth = 8 : 0.7552083333333334
max_depth = 9 : 0.7552083333333334
max_depth = 10 : 0.7552083333333334
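
The loop above prints every score and leaves the comparison to the reader. A minimal sketch of selecting the best depth in code is shown below. It runs on synthetic data from `make_classification` as a stand-in (an assumption, not the diabetes dataset), and the names `scores` and `best_depth` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 8 features (not the diabetes dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Record the validation accuracy for each candidate depth
scores = {}
for depth in range(1, 11):
    tree = DecisionTreeClassifier(random_state=0, max_depth=depth)
    tree.fit(X_train, y_train)
    scores[depth] = tree.score(X_valid, y_valid)

# max() returns the first key with the top score, so ties go to the smallest depth
best_depth = max(scores, key=scores.get)
print(best_depth, scores[best_depth])
```

Preferring the smallest depth on ties favours the simpler tree, which is a design choice rather than a requirement.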


#### Random Forest

In [17]:
# import random forest classifier from the sklearn library
from sklearn.ensemble import RandomForestClassifier

# Loop over the 'n_estimators' hyperparameter to find the value with the highest accuracy score
for estimator in range(1, 11):
    model2 = RandomForestClassifier(random_state=12345, n_estimators=estimator) # Define the model

    model2.fit(features_train, target_train) # Train the model

    print("n_estimators =", estimator, ": ", end='') # Label the result with the current number of estimators
    print(model2.score(features_valid, target_valid)) # Display the validation accuracy

n_estimators = 1 : 0.7447916666666666
n_estimators = 2 : 0.734375
n_estimators = 3 : 0.7083333333333334
n_estimators = 4 : 0.7395833333333334
n_estimators = 5 : 0.7291666666666666
n_estimators = 6 : 0.765625
n_estimators = 7 : 0.7552083333333334
n_estimators = 8 : 0.7760416666666666
n_estimators = 9 : 0.7760416666666666
n_estimators = 10 : 0.8072916666666666


#### Logistic Regression

In [18]:
# import logistic regression from the sklearn library
from sklearn.linear_model import LogisticRegression

model3 = LogisticRegression(random_state=12345, solver='liblinear') # Define the model

model3.fit(features_train, target_train) # Train the model

print(model3.score(features_valid, target_valid)) # Display model accuracy

0.7916666666666666


#### Model selection

The random forest classifier with n_estimators = 10 attained the highest accuracy on the validation dataset (0.81). Refit on the whole dataset and scored on those same rows, it reaches 0.98; this is a training accuracy and should not be read as performance on unseen patients. This model will therefore be adopted as the final model.

In [21]:
# Refitting the best performing model on the whole dataset (training and validation rows combined)

# Declare features and target variables
features = diabetes_df.drop(['outcome'], axis=1)
target = diabetes_df['outcome']

model = RandomForestClassifier(random_state=12345, n_estimators=10) # Define the model

model.fit(features, target) # Train the model

print(model.score(features, target)) # Display accuracy on the data the model was trained on

0.9817708333333334
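
The score above is measured on the same rows the model was fitted on, so it tends to be optimistic. As a hedged illustration, the sketch below compares a training accuracy with a 5-fold cross-validated accuracy from `cross_val_score`. It uses noisy synthetic data from `make_classification` (an assumption, not the diabetes dataset), so the numbers it prints say nothing about this project's model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with label noise (not the diabetes dataset)
X, y = make_classification(
    n_samples=400, n_features=8, flip_y=0.2, random_state=0
)

forest = RandomForestClassifier(random_state=0, n_estimators=10)

# Accuracy on the rows used for fitting
train_accuracy = forest.fit(X, y).score(X, y)

# Mean accuracy over 5 folds, each scored on rows held out from fitting
cv_accuracy = cross_val_score(forest, X, y, cv=5).mean()

print(f"training accuracy:        {train_accuracy:.3f}")
print(f"cross-validated accuracy: {cv_accuracy:.3f}")
```

The gap between the two numbers is a rough indication of how much the training accuracy overstates performance on new data.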


### Prediction using the trained model

In [22]:
# Preview the dataset
diabetes_df.sample()

,pregnancies,glucose,bloodpressure,skinthickness,insulin,bmi,diabetespedigreefunction,age,outcome
167,4,120.0,68.0,20.54,79.8,29.6,0.709,34,0


In [27]:
# Create new observations to be used in the model
new_features = pd.DataFrame(
    [
        [1, 100, 80, 21, 85, 30, 0.65, 28],
        [3, 115, 85, 23, 75, 25, 0.68, 35],
        [0, 140, 78, 22, 80, 35, 0.72, 25],
     
    ],
    columns=features.columns,
)

new_features

,pregnancies,glucose,bloodpressure,skinthickness,insulin,bmi,diabetespedigreefunction,age
0,1,100,80,21,85,30,0.65,28
1,3,115,85,23,75,25,0.68,35
2,0,140,78,22,80,35,0.72,25


In [28]:
# Predict the target variable using the new observations
answers = model.predict(new_features)

print(answers)

[0 0 1]
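
The predicted labels hide how confident the model is. As a sketch on synthetic data (an assumption, not the diabetes dataset), `predict_proba` returns the estimated class probabilities behind each label, with columns ordered as in `classes_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 8 features (not the diabetes dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
clf = RandomForestClassifier(random_state=0, n_estimators=10).fit(X, y)

new_rows = X[:3]  # stand-in for three new observations

labels = clf.predict(new_rows)       # hard 0/1 predictions
proba = clf.predict_proba(new_rows)  # one row per observation, one column per class

print(clf.classes_)
print(labels)
print(proba)
```

Probabilities close to 0.5 flag cases where the label could easily flip under a different model or preprocessing choice.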


### Recommendations

1. The company can adopt the model to predict whether a patient will be diagnosed with diabetes from the following patient information: 'pregnancies', 'glucose', 'bloodpressure', 'skinthickness', 'insulin', 'bmi', 'diabetespedigreefunction' and 'age'.
2. The company will need to collect and maintain accurate, up-to-date patient data to use the model. Nearly half of the records (376 of 768) contained a 0 in at least one of glucose, blood pressure, skin thickness, insulin or BMI; these values are probably erroneous and can affect model accuracy.
3. Since the data used to train the model only included females aged 21 and above, it is advisable that the company applies the model only to patients in this category. A more general model should be trained on more diverse data that covers all patients: male and female, young and old.


## 6. Challenging your Solution

What if, instead of imputing the 0 values with the mean of each variable, we drop those observations from the dataset? Will this improve the model accuracy?

In [36]:
# Import the dataset
diabetes_df2 = pd.read_csv('https://bit.ly/DiabetesDS')

# get names of indexes for which the variables have 0 as the data value
index_names = diabetes_df2[ (diabetes_df2['Glucose'] == 0) | (diabetes_df2['BloodPressure'] == 0) | 
                          (diabetes_df2['SkinThickness'] == 0) | (diabetes_df2['Insulin'] == 0) | (diabetes_df2['BMI'] == 0)].index
  
# drop these given row indices from dataFrame
diabetes_df2.drop(index_names, inplace = True)
  
diabetes_df2.shape

(392, 9)
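
The same filter can be written more compactly with a boolean mask over a list of columns. This is a sketch on a made-up frame; `toy_df` and `zero_as_missing` are hypothetical names, and the values are not from the dataset.

```python
import pandas as pd

# Made-up rows for illustration (not the real data)
toy_df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [94, 120, 0, 168],
    "Pregnancies": [0, 1, 8, 1],  # 0 is valid here, so this column is not checked
})

# Columns where 0 is treated as a missing reading (assumption for this sketch)
zero_as_missing = ["Glucose", "Insulin"]

# True for every row that has a 0 in at least one of the listed columns
has_zero = (toy_df[zero_as_missing] == 0).any(axis=1)

# Keep only the rows without such zeros
cleaned = toy_df[~has_zero]
print(cleaned)
```

Rows 1 and 2 are removed, while row 0 is kept because its only zero is in `Pregnancies`.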

In [37]:
diabetes_df2.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0
mean,3.30102,122.627551,70.663265,29.145408,156.056122,33.086224,0.523046,30.864796,0.331633
std,3.211424,30.860781,12.496092,10.516424,118.84169,7.027659,0.345488,10.200777,0.471401
min,0.0,56.0,24.0,7.0,14.0,18.2,0.085,21.0,0.0
25%,1.0,99.0,62.0,21.0,76.75,28.4,0.26975,23.0,0.0
50%,2.0,119.0,70.0,29.0,125.5,33.2,0.4495,27.0,0.0
75%,5.0,143.0,78.0,37.0,190.0,37.1,0.687,36.0,1.0
max,17.0,198.0,110.0,63.0,846.0,67.1,2.42,81.0,1.0


Define datasets, features and the target variable

In [39]:
# Import train_test_split from the sklearn.model_selection module
from sklearn.model_selection import train_test_split

# Split the dataset into train data and validation data
df_train2, df_valid2 = train_test_split(diabetes_df2, test_size=0.25, random_state=12345)

# Declare features and target variables
features_train2 = df_train2.drop(['Outcome'], axis=1)
target_train2 = df_train2['Outcome']
features_valid2 = df_valid2.drop(['Outcome'], axis=1)
target_valid2 = df_valid2['Outcome']

### New model - Hyperparameter tuning and evaluation

#### Decision tree model

In [40]:
for depth in range(1, 11):
    model11 = DecisionTreeClassifier(random_state=12345, max_depth=depth) # Define the model

    model11.fit(features_train2, target_train2) # Train the model

    print("max_depth =", depth, ": ", end='') # Label the result with the current depth
    print(model11.score(features_valid2, target_valid2)) # Display the validation accuracy

max_depth = 1 : 0.7346938775510204
max_depth = 2 : 0.7653061224489796
max_depth = 3 : 0.7857142857142857
max_depth = 4 : 0.7142857142857143
max_depth = 5 : 0.8061224489795918
max_depth = 6 : 0.7448979591836735
max_depth = 7 : 0.7755102040816326
max_depth = 8 : 0.7551020408163265
max_depth = 9 : 0.7857142857142857
max_depth = 10 : 0.7653061224489796


#### Random forest classifier model

In [41]:
for estimator in range(1, 11):
    model22 = RandomForestClassifier(random_state=12345, n_estimators=estimator) # Define the model

    model22.fit(features_train2, target_train2) # Train the model

    print("n_estimators =", estimator, ": ", end='') # Label the result with the current number of estimators
    print(model22.score(features_valid2, target_valid2)) # Display the validation accuracy

n_estimators = 1 : 0.7653061224489796
n_estimators = 2 : 0.8571428571428571
n_estimators = 3 : 0.8061224489795918
n_estimators = 4 : 0.8367346938775511
n_estimators = 5 : 0.8163265306122449
n_estimators = 6 : 0.8367346938775511
n_estimators = 7 : 0.7959183673469388
n_estimators = 8 : 0.8469387755102041
n_estimators = 9 : 0.8367346938775511
n_estimators = 10 : 0.826530612244898


#### Logistic regression model

In [42]:
model33 = LogisticRegression(random_state=12345, solver='liblinear') # Define the model

model33.fit(features_train2, target_train2) # Train the model

print(model33.score(features_valid2, target_valid2)) # Display model accuracy

0.7755102040816326


### Findings

* Dropping the rows raised the best validation accuracy of the random forest classifier (from 0.81 to 0.86) and of the decision tree model (from 0.78 to 0.81), while lowering that of the logistic regression model (from 0.79 to 0.78)
* The best performing model is the random forest classifier with n_estimators = 2

Let us now train the best performing model on the whole reduced dataset and see how its predictions for the new observations change. Will they be similar to the previous findings?

In [48]:
# Refitting the new best performing model on the whole reduced dataset

# Declare features and target variables
features2 = diabetes_df2.drop(['Outcome'], axis=1)
target2 = diabetes_df2['Outcome']

model2 = RandomForestClassifier(random_state=12345, n_estimators=2) # Define the model

model2.fit(features2, target2) # Train the model

print(model2.score(features2, target2)) # Display accuracy on the data the model was trained on

0.8903061224489796


In [49]:
# Predict the target variable using the new observations
answers2 = model2.predict(new_features)

print(answers)
print(answers2)

[0 0 1]
[0 0 0]


### Summary
* While the accuracy on the validation dataset has increased from 0.81 to 0.86, just above the 0.85 target, the accuracy on the whole dataset (the same rows each final model was trained on) has dropped from 0.98 to 0.89. The predictions also differ: the new model predicts that none of the 3 new cases will be classified as diabetic, whereas the first model flagged the third case

* The company can provide guidance on which approach is more appropriate: treating the variable means as an accurate stand-in for the missing patient data, or excluding those records altogether
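
To put the two validation accuracies in context, they can be converted back into counts of correctly classified rows. The sketch below assumes the validation sizes implied by `test_size=0.25` on 768 and 392 rows (192 and 98) and uses the best random forest scores reported earlier; the variable names are hypothetical.

```python
# Validation-set sizes implied by test_size=0.25 on 768 and 392 rows
n_valid_imputed = 768 // 4  # 192 rows
n_valid_dropped = 392 // 4  # 98 rows

# Best random forest validation accuracies reported in the notebook
acc_imputed = 0.8072916666666666
acc_dropped = 0.8571428571428571

# Convert each accuracy back into a number of correct predictions
correct_imputed = round(acc_imputed * n_valid_imputed)
correct_dropped = round(acc_dropped * n_valid_dropped)

print(f"mean imputation: {correct_imputed}/{n_valid_imputed} correct")
print(f"rows dropped:    {correct_dropped}/{n_valid_dropped} correct")
print(f"one row in the smaller set is worth {1 / n_valid_dropped:.3f} accuracy")
```

The scores correspond to 155 of 192 and 84 of 98 correct predictions. Because the two validation sets contain different patients and differ in size, the comparison between the two approaches should be read with some caution.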

## 7. Follow-up questions

### a). Did we have the right data?

Yes

### b). Do we need other data to answer our question?

The data provided had only 768 observations. When some of these observations have to be modified or dropped because of invalid values, model performance suffers because the model is trained on less, or less reliable, data.
More accurate data is needed to maintain model quality.

Additionally, the data provided belonged only to female patients, which introduces bias and may not generalize well to the whole population. For instance, the variable 'pregnancies' will not be available for men, children or the older population.

### c). Did we have the right question?

Yes, we did.