## PIMA INDIANS DIABETES | EXPLORATORY DATA ANALYSIS | MODEL BUILDING

## What is Diabetes?
Diabetes is a medical condition that affects a significant share of the population. It impairs the body’s ability to process blood glucose, also known as blood sugar. Without careful management, diabetes can increase the risk of serious health problems such as stroke and heart disease.

## Dataset and Objective
The dataset consists of several medical predictor variables and one target variable indicating whether the patient was diagnosed with diabetes. Our objective is to predict whether or not a patient has diabetes, based on the diagnostic measurements included in the dataset.

The columns of dataset are:
1. Pregnancies - Number of times pregnant
2. Glucose - Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. BloodPressure - Diastolic blood pressure (mm Hg)
4. SkinThickness - Triceps skin fold thickness (mm)
5. Insulin - 2 Hour serum insulin (mu U/ml)
6. BMI - Body mass index (weight in kg/(height in m)^2)
7. DiabetesPedigreeFunction - Diabetes pedigree function (indicates likelihood of diabetes based on family history)
8. Age - Age (years)
9. Outcome - Class variable (0 or 1), 1 for diabetic and 0 for non-diabetic

## Let's Start

### **1. Import Required Libraries and Load Dataset**

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load dataset
diabetes_df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")

### **2. Data Preparation and Cleaning**

Real-world datasets normally require preparation and cleaning before any analysis, and that is what we do in this section. As a first step, we will have a quick look at the data as a pandas dataframe.

In [None]:
# View the data as dataframe
diabetes_df

So this is our data: it has 768 rows and 9 columns. Next we need more information about the dataframe, including the column names, data types and non-null counts, which the following method provides.

In [None]:
# To check more information of data
diabetes_df.info()

All 768 rows of the dataset have non-null entries, and every column holds either integer or float values. Now we will check the distribution of the Outcome column.

In [None]:
# Check the distribution of the target class (passing x= explicitly works across seaborn versions)
sns.countplot(x='Outcome', data=diabetes_df);
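
The count plot can be complemented with exact numbers. The sketch below uses a small made-up series in place of `diabetes_df['Outcome']` (the values are illustrative, not taken from the dataset) and shows how `value_counts` gives both the counts and the share of each class.

```python
import pandas as pd

# Toy stand-in for diabetes_df['Outcome']; the values are illustrative only
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name="Outcome")

# Absolute number of rows per class
class_counts = outcome.value_counts()

# Share of each class as a fraction of all rows
class_share = outcome.value_counts(normalize=True)

print(class_counts)
print(class_share.round(2))
```

On the real dataframe the same two calls would be applied to `diabetes_df['Outcome']`.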

There is a noticeable difference between the number of positive and negative examples, although the classes are not severely skewed. We will account for this later when splitting the examples into training and test sets. It's time to look at the individual columns and get an idea of their central tendency and dispersion.

In [None]:
# To get information regarding distributions of values in columns
diabetes_df.describe()

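We can see that most columns other than Age have a minimum value of 0. A zero is valid for Pregnancies and Outcome, but it is not physiologically plausible for measurements such as Glucose, BloodPressure or BMI, so these entries need to be handled. To better understand how the values are distributed, we can plot histograms of the columns.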

In [None]:
# Getting names of columns into cols, will be useful in plotting
cols = diabetes_df.columns
print(cols)

In [None]:
#Plotting histograms for different column values:
fig, axes = plt.subplots(3,3, figsize=(10,10), gridspec_kw = dict(hspace=0.5, wspace=0.6))
fig.suptitle('Frequency plot for different column values')
for col, az in zip(cols, axes.flat):
    sns.histplot(diabetes_df[col], ax = az)

Some observations from the plots above:
* Many '0' values appear in the columns Glucose, BloodPressure, SkinThickness, Insulin and BMI.
* There is a SkinThickness value close to 100, which is unlikely.
* The columns Pregnancies, DiabetesPedigreeFunction and Age look fine.
* However, pregnancy counts above 10 are also observed; these should be cross-checked against age.

To get a clearer picture of the zero entries, we will count the number of zeros in each column.

In [None]:
#To check the zero entries for each column
(diabetes_df[cols]== 0).sum()
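
Raw counts are easier to judge as a share of all rows. The following is a minimal sketch on a made-up frame (column names borrowed from the dataset, values invented for illustration) that relies on the fact that the mean of a boolean mask is the fraction of `True` entries.

```python
import pandas as pd

# Toy frame reusing three column names; the values are made up for illustration
toy_df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Mean of the boolean mask = fraction of zeros; multiply by 100 for a percentage
zero_share = (toy_df == 0).mean() * 100
print(zero_share)
```

Applied to `diabetes_df[cols]`, the same expression would report the percentage of zero entries per column.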

Considering the large share of zeros, we can assume that missing values were recorded as zeros during data collection.

Dropping the affected rows is not a good idea: SkinThickness and Insulin contain a very large number of zeros, and we have only 768 rows of data.

So we will carry out the following data handling tasks.
1. Deal with entries equal to zero in columns Glucose, BloodPressure, SkinThickness, Insulin and BMI.
2. Replace the unlikely SkinThickness value close to 100.
3. Check and correct wrong entries in Pregnancies column if any.

### Task 1
First we replace all zero entries in the columns mentioned above with np.nan. We then fill them with the column mean or median, guided by the histograms plotted earlier.

In [None]:
# Replace zero values with np.nan (the np.NaN alias was removed in NumPy 2.0)
zero_cols = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
diabetes_df[zero_cols] = diabetes_df[zero_cols].replace(0, np.nan)
# Re-count zeros to confirm the replacement
(diabetes_df[cols] == 0).sum()

The zero values are now replaced in the required columns. The zeros that remain in Pregnancies and Outcome are valid values.

By looking at the histograms, one can conclude that BloodPressure and BMI are roughly symmetric, while Glucose, SkinThickness and Insulin are skewed. Since the median is less sensitive to skew and outliers than the mean, we will replace the missing values for BloodPressure and BMI with the **mean** and those for Glucose, SkinThickness and Insulin with the **median**.


In [None]:
#Replacing missing values
diabetes_df['BloodPressure'] = diabetes_df['BloodPressure'].fillna(diabetes_df['BloodPressure'].mean())
diabetes_df['BMI'] = diabetes_df['BMI'].fillna(diabetes_df['BMI'].mean())
diabetes_df['Glucose'] = diabetes_df['Glucose'].fillna(diabetes_df['Glucose'].median())
diabetes_df['SkinThickness'] = diabetes_df['SkinThickness'].fillna(diabetes_df['SkinThickness'].median())
diabetes_df['Insulin'] = diabetes_df['Insulin'].fillna(diabetes_df['Insulin'].median())
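
The same mean/median filling can also be expressed with scikit-learn's `SimpleImputer`. This is only an alternative sketch, not the approach used in this notebook, and the toy values below are invented for illustration. One possible advantage is that a fitted imputer can later be reused on new data with the statistics it learned.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column; the numbers are made up
toy_df = pd.DataFrame({
    "BMI": [30.0, np.nan, 34.0, 32.0],        # roughly symmetric -> mean
    "Insulin": [10.0, 20.0, np.nan, 600.0],   # skewed by a large value -> median
})

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

# fit_transform learns the statistic from the column and fills the NaN entries
toy_df[["BMI"]] = mean_imputer.fit_transform(toy_df[["BMI"]])
toy_df[["Insulin"]] = median_imputer.fit_transform(toy_df[["Insulin"]])

print(toy_df)
```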

### Task 2

Let's find the index of the unusually high SkinThickness entry and replace it with the median.

In [None]:
# To find the corresponding row
diabetes_df[diabetes_df['SkinThickness'] > 90]

In [None]:
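# Replace the implausible entry with the median
# (a single .loc[row, column] call avoids chained assignment, which may not update the dataframe)
diabetes_df.loc[579, 'SkinThickness'] = diabetes_df['SkinThickness'].median()
diabetes_df.loc[579]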

### Task 3

Let's count the rows with more than 10 pregnancies and an age below 30.

In [None]:
# Checking for number of pregnancies vs age for more than 10 values
diabetes_df[(diabetes_df['Pregnancies'] > 10) & (diabetes_df['Age'] < 30)].shape

There are no entries with more than 10 pregnancies for an age below 30, so this check reveals no implausible values in the Pregnancies column. We have no task to complete here.

Once again we will look at the distribution of values in the columns to see the effect of our updates.

In [None]:
# To get information regarding distributions of values in columns
diabetes_df.describe()

We can see that the column minimum values have changed.

### **3. Exploratory Data Analysis**

Now we will look at pair plots and try to get some insights.

In [None]:
#Pair plots
sns.pairplot(diabetes_df, hue = 'Outcome', height = 2);

Looking at the pair plots we can observe a few things:
* Higher glucose level, higher BMI and greater age are more strongly associated with a positive diabetes diagnosis.
* The effect is most evident for glucose level.
* DiabetesPedigreeFunction does not show any strong influence on the chance of diabetes.

Now let's look at the correlations.

In [None]:
#plotting heatmap for correlation
plt.figure(figsize=(10,10))
sns.heatmap(diabetes_df.corr(), square=True, linewidths=.5, annot=True, cbar=False);

From the plot, the maximum correlation observed is only 0.56. Looking at the values overall, we can conclude that the variables are weakly correlated.
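
A heatmap can be hard to scan for the strongest relationships. The sketch below, on synthetic data generated for illustration (the columns `a`, `b`, `c` are hypothetical), ranks each pair of columns once by absolute correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy_df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=200),  # built to correlate with a
    "c": rng.normal(size=200),                 # independent noise
})

corr = toy_df.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Sort pairs by the magnitude of their correlation, strongest first
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```

Replacing `toy_df` with `diabetes_df` would list the feature pairs behind the heatmap in ranked order.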

### **4. Building Models**

### 4.1. Data Splitting

The examples are to be split into training and test sets while keeping their relative class frequencies approximately the same. For this we need to set `stratify` to `y` in `train_test_split`. The test data is used to evaluate the performance of our machine learning models. Here 25% of the data is held out as test examples.

In [None]:
# To split data into training and test sets
from sklearn.model_selection import train_test_split

X = diabetes_df.drop('Outcome', axis = 1) #drop target column to get X
y = diabetes_df['Outcome'] #target column is y

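# random_state is an arbitrary seed, set so that the split is reproducible between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify = y, random_state = 42)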

In [None]:
print('X_train shape : ', X_train.shape)
print('y_train shape : ', y_train.shape)
print('X_test shape  : ', X_test.shape)
print('y_test shape  : ', y_test.shape)
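
To see what `stratify` does, the share of positive labels can be compared across the full data and both splits. This is a minimal sketch on toy labels (65 negatives and 35 positives, chosen for illustration and not the real class counts).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 65 negatives and 35 positives (illustrative only)
y_toy = np.array([0] * 65 + [1] * 35)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# With stratification, all three shares should be close to each other
print("positive share, full :", y_toy.mean())
print("positive share, train:", y_tr.mean())
print("positive share, test :", y_te.mean())
```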

### 4.2. Feature Scaling 

Since our eight features have quite different ranges, we need to do feature scaling. This ensures that features with larger values are not given more weight during training.
Here we will use StandardScaler, which standardizes features by removing the mean and scaling them to unit variance,
i.e. it rescales each feature to a distribution with mean 0 and standard deviation 1.
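
Written as a formula, this is the usual z-score. The symbols $x$, $\mu$, $\sigma$ and $z$ are introduced here for illustration and are not names used elsewhere in the notebook:

$$
z = \frac{x - \mu}{\sigma}
$$

where $x$ is a feature value, and $\mu$ and $\sigma$ are the mean and standard deviation of that feature over the data the scaler was fitted on.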

In [None]:
# Feature scaling
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

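# Fit the scaler on the training set only, then reuse its statistics on the test set
# (refitting on the test set would scale the two sets inconsistently)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)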
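
A common convention is to learn the scaling statistics from the training split only and to apply those same statistics to the test split. The sketch below illustrates this on tiny made-up arrays. Note that the scaled test data is then not expected to have mean 0.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up splits with a single feature
X_tr = np.array([[1.0], [2.0], [3.0]])
X_te = np.array([[2.0], [4.0]])

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learns mean and std from the training rows
X_te_scaled = scaler.transform(X_te)      # reuses the training mean and std

print("learned mean:", scaler.mean_)
print("scaled train:", X_tr_scaled.ravel())
print("scaled test :", X_te_scaled.ravel())
```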

### 4.3. Logistic Regression Model

In [None]:
# Import metrics to check performance of models
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

In [None]:
#Logistic Regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

print('Accuracy : ' + '{:.2f}'.format(accuracy_score(y_test, y_pred)*100) +" %")
print('F1 score : ' + '{:.2f}'.format(f1_score(y_test, y_pred)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

### 4.4. Support Vector Machines Model

Support vector machines (SVMs) are a particularly powerful and flexible class of supervised algorithms for classification. They work well with high-dimensional data in finding a decision boundary. Now we will use it on our data.

In [None]:
#Support vector machines
from sklearn.svm import SVC

svc = SVC(kernel = 'rbf')

svc.fit(X_train, y_train)

y_pred = svc.predict(X_test)

print('Accuracy : ' + '{:.2f}'.format(accuracy_score(y_test, y_pred)*100) +" %")
print('F1 score : ' + '{:.2f}'.format(f1_score(y_test, y_pred)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

### 4.5. Random Forest Classifier Model

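Random Forest uses multiple decision trees for prediction. Each tree is trained on a randomly selected subset of the training set and makes its own prediction on the test set. Random Forest then combines these predictions, by majority vote in the classic formulation, to give the model's prediction (scikit-learn's implementation averages the trees' predicted class probabilities rather than counting hard votes).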

In [None]:
#Random Forest
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()

rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_test)

print('Accuracy : ' + '{:.2f}'.format(accuracy_score(y_test, y_pred)*100) +" %")
print('F1 score : ' + '{:.2f}'.format(f1_score(y_test, y_pred)))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
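
A single train/test split gives one score per model, which can vary with the split. As a hedged sketch of a more stable comparison, the code below runs 5-fold cross-validation for the same three model families on synthetic data from `make_classification` (not the diabetes data). The `n_estimators` and `random_state` values are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data with eight features (illustrative only)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

# Putting the scaler in a pipeline refits it inside each fold on that fold's training part
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "SVM (rbf)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

cv_f1 = {}
for name, model in models.items():
    scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="f1")
    cv_f1[name] = scores.mean()
    print(f"{name:20s} mean F1 = {scores.mean():.3f}")
```

With the notebook's data, `X_syn` and `y_syn` would be replaced by the unscaled `X` and `y`.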

### **5. Conclusion**

* From the pair plot we found that higher glucose levels are much more likely among women with diabetes.
* The influence of family history (DiabetesPedigreeFunction) on diabetes was not evident from the plots.
* The performance of three algorithms on the dataset, namely Logistic Regression, Support Vector Machines and Random Forest, is shown.
* The performance metric should be chosen based on the requirement. For example, if we do not want to miss any positive diagnosis, we need high recall on the positive class, since recall measures the share of actual positives that the model detects.
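
Precision and recall answer different questions about the positive class, and a small worked example makes the difference concrete. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up ground truth and predictions: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Precision: of the predicted positives, how many are truly positive -> tp / (tp + fp)
precision = precision_score(y_true, y_hat)

# Recall: of the actual positives, how many were detected -> tp / (tp + fn)
recall = recall_score(y_true, y_hat)

print("missed positives (fn):", fn)
print("precision:", round(precision, 3))
print("recall   :", round(recall, 3))
```

Here two of the four actual positives are missed, which shows up as a recall of 0.5 even though two of the three positive predictions are correct.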