# Pima Indians Diabetes Database
# Inspiration
# Can you build a machine learning model to accurately predict whether the patients in the dataset have diabetes?

# Wondering who the Pima are?
The Pima are a group of Native Americans living in an area consisting of what is now central and southern Arizona, as well as northwestern Mexico in the states of Sonora and Chihuahua. The majority population of the surviving two bands of the Akimel O'odham are based in two reservations: the Keli Akimel Oʼotham on the Gila River Indian Community and the On'k Akimel O'odham on the Salt River Pima-Maricopa Indian Community.

Source: Wikipedia

# 1)Let's import all the required Packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data=pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')

In [None]:
data.head()

# 2)Split the dataset into features and target variable

Features are the independent variables

Outcome is the Class label : 0 - No Diabetes , 1 - Diabetes Present

In [None]:
Features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness','Insulin','BMI','DiabetesPedigreeFunction', 'Age']
X = data[Features]   
y = data.Outcome 

# 3)Data Munging
Check the descriptive information of the dataset using the pandas describe and info methods
# pandas describe - This is an important step to understand the distribution of the data

In [None]:
data.describe() 

# pandas info - This is an important step as it gives you information about the data types

In [None]:
data.info() 

# Check for nulls in the dataset

In [None]:
data.isnull()

In [None]:
data.isnull().sum()

# We have no null values in this dataset, so let's get going!!
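A zero null count does not necessarily mean the data is complete. In this dataset, columns such as Glucose, BloodPressure, SkinThickness, Insulin and BMI can contain 0, which is not a plausible measurement and usually stands in for a missing value. Below is a minimal sketch of how one might count those zeros. The small `demo` DataFrame is made up and only stands in for `data`, and the list of columns where 0 means "missing" is an assumption:

```python
import pandas as pd

# Made-up rows standing in for `data`; use the real DataFrame in the notebook.
demo = pd.DataFrame({
    'Glucose':       [148, 85, 0, 89],
    'BloodPressure': [72, 66, 64, 0],
    'Insulin':       [0, 0, 0, 94],
    'BMI':           [33.6, 26.6, 23.3, 0.0],
})

# Assumption: a value of 0 in these columns means "not recorded".
cols_where_zero_is_missing = ['Glucose', 'BloodPressure', 'Insulin', 'BMI']

# Count the zeros per column.
zero_counts = (demo[cols_where_zero_is_missing] == 0).sum()
print(zero_counts)

# Optionally turn the zeros into NaN so that isnull() can see them.
demo_nan = demo.copy()
demo_nan[cols_where_zero_is_missing] = demo[cols_where_zero_is_missing].mask(
    demo[cols_where_zero_is_missing] == 0
)
print(demo_nan.isnull().sum())
```

Whether to then drop or impute those rows is a separate modelling decision that this sketch does not make.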

# 4)PLOTS for Visualization and Insights
Let us plot a suitable graph to look for multicollinearity, i.e. relationships between the independent features in the given dataset.

# Pair Plot is a good idea to represent the relationship between the independent features 
e.g. Glucose vs Insulin or Age vs BMI

We have already imported the matplotlib and seaborn libraries, so we are good to go ahead and draw the pair plot!!

In [None]:
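plt.style.use('dark_background')

# pairplot creates its own figure, so its size is controlled by `height`
# (inches per panel, 2.5 is the default) rather than plt.rcParams['figure.figsize']
g = sns.pairplot(data, hue='Outcome', palette='husl', height=2.5)
# Title the whole grid; plt.title would only label the last panel
g.fig.suptitle('Pair plot for the data', fontsize=40, y=1.02)
plt.show()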

# Pair Plot analysis from above graph :
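The plots that appear along the diagonal of the graph are there because each feature is paired with itself (Pregnancies vs Pregnancies, Glucose vs Glucose, BloodPressure vs BloodPressure, etc.), so they show the distribution of that single feature instead of a scatter.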

So, now focus on the actual pairplots above to draw some insights :

# 0 - pink color scatter indicates No Diabetes
# 1 - blue color scatter indicates Has Diabetes
So, look at -

# Age vs Glucose
We can see from the plot that when the glucose level is below approximately 100, whatever the age on the Y axis, most points are classified as not having diabetes. As the glucose level rises from 100 towards 200, more and more of the points are classified as having diabetes.

Similarly look at the pair plot -

# Glucose vs BMI
For BMI in the range 20 to 40, when the glucose level is above approximately 100, people are mostly classified as having diabetes.

Such intuitions can be inferred using pair plots.

# 5)Correlation Matrix for Visualization and Insights into the correlation between the Independent Features and the Dependent Feature, using a Heatmap.

The color scale on the right-hand side shows the correlation value, with the top of the scale corresponding to 1 (the maximum positive correlation).

In [None]:
plt.rcParams['figure.figsize'] = (15, 15)

sns.heatmap(data.corr(), annot = True)
plt.title('Correlation Plot')
plt.show()

Below can be the insights from the Correlation Matrix :

# No feature is negatively correlated with Outcome

# Positively Correlated features :

# Glucose
It has a correlation of .47 with Outcome, which is the highest of any feature, so on this measure it is the strongest single indicator of diabetes in the dataset. The correlation is positive, so we can infer that as the glucose level increases, a patient is more likely to have diabetes. Keep in mind that correlation only captures a linear relationship; it is a hint about feature importance, not something the model itself has learnt.

# BMI, Age, and Pregnancy
They have values of .29, .24 and .22 respectively, so they can also be useful features for predicting the presence of diabetes.

# Low contributors to the correlation
Pedigree and Insulin have correlations of only .17 and .13 with Outcome.

On the other hand, BP and Skin thickness have low values of .065 and .075, so on their own they are weak linear predictors of a diabetes diagnosis.
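To read these numbers off without squinting at the heatmap, the Outcome column of the correlation matrix can be sorted directly. Here is a small sketch of that idea. The `demo` frame is randomly generated and only stands in for `data`, so its correlation values have nothing to do with the real ones quoted above:

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in for `data` (NOT the real dataset).
rng = np.random.default_rng(0)
n = 200
glucose = rng.normal(120, 30, n)
bmi = rng.normal(32, 7, n)
blood_pressure = rng.normal(70, 12, n)
# Made-up label that depends mostly on glucose, a little on BMI, plus noise.
score = (glucose - 120) / 30 + 0.3 * (bmi - 32) / 7 + rng.normal(0, 1, n)
demo = pd.DataFrame({
    'Glucose': glucose,
    'BMI': bmi,
    'BloodPressure': blood_pressure,
    'Outcome': (score > 0.5).astype(int),
})

# Correlation of every feature with Outcome, strongest first.
corr_with_outcome = (
    demo.corr()['Outcome']
    .drop('Outcome')               # the target's correlation with itself is always 1
    .sort_values(ascending=False)
)
print(corr_with_outcome)
```

In the notebook the same three chained calls can be applied to `data.corr()`.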

# 6)Apply Feature scaling to standardize the data columns to a common scale so that no feature biases the model.
All the features are in different units of measurement, so this is a very important preprocessing step to get good results!

Hence we use the StandardScaler class to fit the data and then transform the original values into standardized values (mean 0 and standard deviation 1 per column), so that features with large numeric ranges do not dominate the others.

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_scaled = sc.fit_transform(X)
X_scaled

In case you would like to see the standardized data as a DataFrame, do the step below:

In [None]:
# Pass the column names so the scaled columns stay labelled
pd.DataFrame(X_scaled, columns=Features)
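For intuition, StandardScaler subtracts each column's mean and divides by its standard deviation, so every column ends up with mean 0 and standard deviation 1. The sketch below checks that against a hand computation. The numbers in `X_demo` are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values on very different scales (a Glucose-like and a pedigree-like column).
X_demo = np.array([[148.0, 0.627],
                   [ 85.0, 0.351],
                   [183.0, 0.672],
                   [ 89.0, 0.167]])

scaled = StandardScaler().fit_transform(X_demo)

# The same thing by hand: z = (x - mean) / std, using the population std (ddof=0).
by_hand = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled, by_hand))   # do the two computations agree?
print(scaled.mean(axis=0).round(6))   # column means after scaling
print(scaled.std(axis=0).round(6))    # column standard deviations after scaling
```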

# 7)Splitting the data
To understand the model's performance, we divide the dataset into a training set and a test set. Let's split the dataset using the train_test_split() function from sklearn. The three arguments to pass are the features, the outcome, and the test set size. Additionally, you can set random_state so that the random split is reproducible from run to run.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.33,random_state=42) 
x_train.shape
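One thing to be aware of: in step 6 the scaler was fitted on the full dataset before splitting, so the test rows influenced the mean and standard deviation used for scaling. A common alternative is to split first and let a Pipeline fit the scaler on the training rows only. Below is a hedged sketch of that arrangement on synthetic data from make_classification (a stand-in for X and y, not the real dataset). The `stratify` argument is an optional extra that keeps the class ratio similar in both splits:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same number of rows and features as the dataset.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=42,
)

# Split FIRST, so the test rows never touch the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo,
)

# The pipeline fits StandardScaler on the training rows only,
# then reuses those training statistics to scale the test rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

print(X_tr.shape, X_te.shape)
print(pipe.score(X_te, y_te))  # test accuracy on the synthetic data
```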

# 8)Trying to fit the model
# I am using a simple Logistic Regression model for this Binary classification problem!!
I have tried to tune the hyperparameters, however there was not much difference in the accuracy results, so I am using the default parameters only. I have also tried dropping the poorly correlated features, however the accuracy remained the same. I am a newbie to Machine Learning, so I am still learning other algorithms to experiment with in future for better accuracy results!!

In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()

clf.fit(x_train,y_train)
y_pred=clf.predict(x_test)

In [None]:
print(y_pred)
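For anyone who wants to repeat the hyperparameter tuning experiment mentioned above, one possible way is a small grid search over the regularization strength `C` with cross-validation. This is only a sketch on synthetic stand-in data, and the grid values are arbitrary illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data (NOT the real dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Arbitrary illustrative grid: smaller C means stronger regularization.
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,                 # 5-fold cross-validation for every value of C
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated accuracy of the best C
```

If the best score barely moves across the grid, that matches the experience described above: for a simple linear model the defaults are often already close to the best it can do.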

# 9)Evaluating the model
Since this is a medical diagnosis, always evaluate the model with the confusion matrix, precision, recall and F1 score alongside the accuracy score, so that you really understand the sensitivity and specificity of the predictions. You want as few False Negatives as possible, because each one is a patient who truly has diabetes but is missed, which gets in the way of effective identification and treatment.

In [None]:
from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred)
print(accuracy_score(y_train, clf.predict(x_train)))  # training accuracy
print(accuracy_score(y_test, y_pred))                 # testing accuracy
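A single train/test split gives one accuracy number that depends on which rows happen to land in the test set. Cross-validation repeats the split several times and reports the spread. Here is a minimal sketch on synthetic stand-in data; the choice of 5 folds is just a common default:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (NOT the real dataset).
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=42)

# Scaling sits inside the pipeline so it is re-fitted on each training fold.
model = make_pipeline(StandardScaler(), LogisticRegression())

scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
print(scores.round(3))               # one accuracy per fold
print(scores.mean(), scores.std())   # average and spread across folds
```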

# The Training Accuracy is 78 %
...this to me is okay on the training set, and I am not going to push it further as that can lead to overfitting; our focus is more on how the model is going to perform on the test data!!

# The Testing Accuracy is 74 %
..We surely need to improve the testing accuracy score. I tried to tune the model hyperparameters and to drop some of the poorly correlated features, however the score did not change much.

I have been told Neural Networks can yield better results in such cases of binary classification

# 10)Confusion Matrix
For a medical diagnosis the confusion matrix matters as much as the accuracy score: False Negatives are patients who truly have diabetes but are missed by the model, so we want as few of them as possible.

In [None]:
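from sklearn.metrics import confusion_matrix 
# confusion_matrix expects (y_true, y_pred): rows are actual classes, columns are predicted classes
confusion_matrix(y_test, y_pred)

With the true labels passed first, the False Negatives (actual 1, predicted 0) are in the bottom-left cell. There are 32 False Negatives, which means 32 people have diabetes but my model classified them as not having diabetes, so I definitely need to improve model performance!! I have been told Neural Networks can yield better results in such cases of binary classification, so I am still coming up that learning curve!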
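If reducing False Negatives is the priority, one common option that does not require a different model is to lower the decision threshold. predict() effectively cuts the predicted probability at 0.5, while predict_proba() lets you choose your own cut-off. Below is a sketch on synthetic stand-in data. The 0.3 threshold is an arbitrary illustration, not a recommended value, and `false_negatives` is a hypothetical helper written for this example:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and labels.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35],
    flip_y=0.1, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42,
)

clf_demo = LogisticRegression().fit(X_tr, y_tr)
proba = clf_demo.predict_proba(X_te)[:, 1]  # predicted probability of class 1

def false_negatives(threshold):
    """Count actual-1 rows predicted as 0 at the given probability threshold."""
    y_hat = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, y_hat, labels=[0, 1]).ravel()
    return fn

fn_default = false_negatives(0.5)
fn_lower = false_negatives(0.3)
print(fn_default, fn_lower)
```

Lowering the threshold can only keep or reduce the number of False Negatives, but it does so at the cost of more False Positives, so the right cut-off depends on how costly each kind of mistake is.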

# 11)Precision/Recall/F1score to really understand the sensitivity and specificity of the predictions

In [None]:
from sklearn.metrics import classification_report
# classification_report expects (y_true, y_pred); print() renders the line breaks properly
print(classification_report(y_test, y_pred))
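To see where the numbers in the report come from, precision, recall (sensitivity), specificity and F1 can be computed by hand from the four cells of the confusion matrix. The labels below are made up for illustration and are not predictions from this model:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels: 4 actual positives, 6 actual negatives.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)      # of those predicted diabetic, how many really are
recall = tp / (tp + fn)         # sensitivity: of the real diabetics, how many were caught
specificity = tn / (tn + fp)    # of the real non-diabetics, how many were cleared
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(precision, recall, specificity, f1)

# Cross-check the hand computation against sklearn's own functions.
print(precision_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      f1_score(y_true_demo, y_pred_demo))
```

Note that classification_report does not print specificity under that name; for binary labels it is the recall of class 0.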

# I hope what I have learnt and shared is helpful to someone in some way!

# PASSION FOR TECHNOLOGY