# Diabetes Prediction


This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

## Introduction

So what is diabetes and what leads to someone getting it? The following is from the Centers for Disease Control and Prevention (CDC).  

Insulin is a hormone made by your pancreas that acts like a key to let blood sugar into the cells in your body for use as energy. If you have type 2 diabetes, cells don’t respond normally to insulin; this is called insulin resistance. Your pancreas makes more insulin to try to get cells to respond. Eventually your pancreas can’t keep up, and your blood sugar rises, setting the stage for prediabetes and type 2 diabetes. High blood sugar is damaging to the body and can cause other serious health problems, such as heart disease, vision loss, and kidney disease.

Type 2 diabetes symptoms often develop over several years and can go on for a long time without being noticed (sometimes there aren’t any noticeable symptoms at all). Because symptoms can be hard to spot, it’s important to know the risk factors and to see your doctor to get your blood sugar tested if you have any of them.

The data set we will be using is the Pima Indians Diabetes data set. The Pima Indians are a tribe in Arizona, and more about their history can be found at https://en.wikipedia.org/wiki/Pima_people

The data set consists of females at least 21 years old. There are 9 columns in total: 8 features plus Outcome, which is what we will be trying to predict.

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function (a function that scores the likelihood of diabetes based on family history)
8. Age (years)
9. Outcome (0 or 1)

# Importing libraries

In [None]:
#Importing the necessary python libraries
import numpy as np
import pandas as pd

# Creating the dataset

In [None]:
#Creating the data
diabetes_data = pd.read_csv(r'diabetes.csv')

In [None]:
#Reading the dataset
diabetes_data.head()

In [None]:
#Observing the shape of dataframe

print(diabetes_data.shape)

As observed above, we have 768 rows and 9 columns.  
The first 8 columns represent the features and the last column represents the target/label.

# Basic EDA & statistical analysis

Exploratory Data Analysis, or EDA, is an important step in data science projects.
EDA is generally the process of summarising and visualising a dataset to find patterns, spot anomalies, and form assumptions or hypotheses based on an understanding of the data.

In [None]:
# describe() gives descriptive statistics (count, mean, std, min, quartiles, max) for each column
diabetes_data.describe()

In [None]:
#For complete information about the data
diabetes_data.info()

In [None]:
#for datatypes in the data
diabetes_data.dtypes

In [None]:
#Finding missing values
print(diabetes_data.isnull().sum())

In [None]:
# List the unique values per column to check for placeholder or non-numeric entries
for i in diabetes_data.columns:
    print({i:diabetes_data[i].unique()})

There are no missing values or unexpected placeholder values in the data, but some values are recorded as zero (0).  
The following columns:
1. Glucose
2. BloodPressure
3. SkinThickness
4. Insulin
5. BMI

contain zeros, which do not make sense because these measurements cannot be 0.  
So, we will treat these zero values as missing values.  

It is better to replace the zeros with NaN: counting them becomes easier, and they can then be filled with suitable values.


In [None]:
import warnings
warnings.filterwarnings("ignore")

# Columns where a zero is not a plausible measurement and is treated as missing
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

diabetes_data_df = diabetes_data.copy(deep=True)
diabetes_data_df[zero_as_missing] = diabetes_data_df[zero_as_missing].replace(0, np.nan)

# Percentage of NaN values per column
print(diabetes_data_df.isnull().sum() / len(diabetes_data_df) * 100)

To fill in these NaN values, we first need to understand how the data is distributed.

In [None]:
x = diabetes_data_df.hist(figsize = (20,20))
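
One way to turn the histograms into a concrete choice is to look at skewness: a roughly symmetric column can be filled with its mean, while a skewed column is usually better served by its median. The sketch below illustrates this on a small made-up frame. The `choose_fill_value` helper, the 0.5 cut-off, and the toy values are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

def choose_fill_value(series, skew_threshold=0.5):
    """Return ('mean' | 'median', value) depending on the sample skewness.

    skew_threshold is an illustrative cut-off, not a fixed rule.
    """
    skew = series.skew()  # NaN values are skipped by default
    if abs(skew) < skew_threshold:
        return "mean", series.mean()
    return "median", series.median()

# Toy data: one symmetric column and one with a large outlier
toy = pd.DataFrame({
    "symmetric": [60.0, 70.0, np.nan, 80.0, 70.0, 60.0, 80.0],
    "skewed": [10.0, 12.0, 11.0, np.nan, 13.0, 12.0, 300.0],
})

for col in toy.columns:
    method, value = choose_fill_value(toy[col])
    toy[col] = toy[col].fillna(value)  # assign back rather than using inplace
    print(col, method, round(value, 2))
```

The median is less sensitive to extreme values, which is why it is the safer choice for long-tailed columns.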

# Imputing NaN values

In [None]:
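# Assign the result back instead of calling fillna(..., inplace=True) on a single column;
# newer pandas versions warn about that pattern and may not update the DataFrame
diabetes_data_df['Glucose'] = diabetes_data_df['Glucose'].fillna(diabetes_data_df['Glucose'].mean())

diabetes_data_df['BloodPressure'] = diabetes_data_df['BloodPressure'].fillna(diabetes_data_df['BloodPressure'].mean())

diabetes_data_df['SkinThickness'] = diabetes_data_df['SkinThickness'].fillna(diabetes_data_df['SkinThickness'].median())

diabetes_data_df['Insulin'] = diabetes_data_df['Insulin'].fillna(diabetes_data_df['Insulin'].median())

diabetes_data_df['BMI'] = diabetes_data_df['BMI'].fillna(diabetes_data_df['BMI'].median())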

In [None]:
diabetes_data_df.isnull().sum()

Plotting after imputing the NaN values.

In [None]:
x = diabetes_data_df.hist(figsize = (20,20))

#### Heatmap for unclean data

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

corr = diabetes_data.corr()
plt.figure(figsize=(8,8))
sns.heatmap(corr, vmin=-1.0,vmax=1.0,annot=True)
plt.yticks(rotation=0)
plt.show()

#### Heatmap for clean data

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

corr = diabetes_data_df.corr()
plt.figure(figsize=(8,8))
sns.heatmap(corr, vmin=-1.0,vmax=1.0,annot=True)
plt.yticks(rotation=0)
plt.show()

In [None]:
sns.set()
outcome_plot=diabetes_data_df['Outcome'].value_counts().plot(kind='bar')
plt.xlabel('Outcome')
plt.ylabel('Count')
plt.show()

### The graph above shows that the data is imbalanced towards outcome 0, which means that non-diabetic patients outnumber diabetic patients.
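
The imbalance visible in the bar chart can also be expressed as numbers. The following sketch uses a made-up label series in place of the `Outcome` column; the values are illustrative only.

```python
import pandas as pd

# Toy labels standing in for the Outcome column (made-up values)
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="Outcome")

counts = outcome.value_counts()                 # absolute class counts
shares = outcome.value_counts(normalize=True)   # class proportions
imbalance_ratio = counts.max() / counts.min()   # majority size / minority size

print(shares)
print("majority/minority ratio:", round(imbalance_ratio, 2))
```

A ratio close to 1 indicates balanced classes; the further it is from 1, the more plain accuracy can mislead.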

# Creating X & Y variables for predictions

In [None]:
#Create X & Y 
X = diabetes_data_df.values[:,0:-1]
Y = diabetes_data_df.values[:,-1]

In [None]:
print(X.shape)
print(Y.shape)

# Scaling the X variables

Because the feature variables in X have different ranges, a model may give more importance to variables with a larger range and less importance to variables with a smaller range, which is not desirable.  
To overcome this problem, scaling is applied to all variables in X so that they are on a comparable scale.  
This also helps when distance metrics are used.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()

scaler.fit(X)

X = scaler.transform(X)
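
Note that the cell above fits the scaler on all rows, so statistics from rows that later end up in the test set influence the transform. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that variant on synthetic data; the `X_demo`/`y_demo` arrays and all values in them are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data with very different ranges (made-up values)
rng = np.random.default_rng(10)
X_demo = rng.normal(loc=[100.0, 30.0], scale=[25.0, 5.0], size=(200, 2))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = demo_scaler.transform(X_te)      # reuse the training statistics

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

After scaling, each training column has mean 0 and standard deviation 1; the test columns are close to that but not exactly, which is expected.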

# Train_test_splitting

In [None]:
from sklearn.model_selection import train_test_split  # rule of thumb: about 80-20 for fewer than 1000 rows, about 70-30 for more

#Split the data into test and train
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=10)
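
Because the classes are imbalanced, a plain random split can leave the train and test sets with slightly different class ratios. `train_test_split` accepts a `stratify` argument that keeps the ratio the same in both parts. The sketch below demonstrates this on made-up labels; it is an optional variation and not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up rows with 35% positives
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 65 + [1] * 35)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)

print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```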

In [None]:
print(X_train.shape)
print(Y_train.shape)
print(X_test.shape)
print(Y_test.shape)

In [None]:
from sklearn.linear_model import LogisticRegression
#create a model
classifier=LogisticRegression()
#fitting training data to the model
classifier.fit(X_train,Y_train)

Y_pred=classifier.predict(X_test)
print(Y_pred)

In [None]:
np.set_printoptions(suppress= True)
Y_pred_prob=classifier.predict_proba(X_test)
Y_pred_prob
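
`predict_proba` returns one column per class, and `predict` corresponds to a 0.5 cut-off on the class-1 column. When missing a diabetic patient is the costlier mistake, the cut-off can be lowered. The sketch below shows the idea on synthetic data from `make_classification`; the 0.3 threshold is an arbitrary illustrative value, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the diabetes data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=10)
clf = LogisticRegression().fit(X_demo, y_demo)

proba_pos = clf.predict_proba(X_demo)[:, 1]     # column 1 = P(class 1)
default_pred = (proba_pos >= 0.5).astype(int)   # same decision rule as clf.predict
lenient_pred = (proba_pos >= 0.3).astype(int)   # lower cut-off flags more positives

print("positives at 0.5:", default_pred.sum())
print("positives at 0.3:", lenient_pred.sum())
```

Lowering the threshold trades more false positives for fewer false negatives.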

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

cfm=confusion_matrix(Y_test,Y_pred)
sns.heatmap(cfm, annot=True, fmt='g', cbar=False, cmap='BuPu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()


print("Classification report:")

print(classification_report(Y_test,Y_pred))

acc=accuracy_score(Y_test,Y_pred)
print("Accuracy of the model: ",acc)

In [None]:
print("Train Score(Log):",classifier.score(X_train,Y_train))
print("Test Score(Log):",classifier.score(X_test,Y_test))


In [None]:
#predicting using the Decision_Tree_Classifier
from sklearn.tree import DecisionTreeClassifier

model_DecisionTree = DecisionTreeClassifier(criterion="gini",random_state=10)

#fit the model on the data and predict the values
model_DecisionTree.fit(X_train,Y_train)

Y_pred = model_DecisionTree.predict(X_test)
#print(Y_pred)
#print(list(zip(Y_test,Y_pred)))

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

cfm=confusion_matrix(Y_test,Y_pred)
sns.heatmap(cfm, annot=True, fmt='g', cbar=False, cmap='BuPu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()


print("Classification report:")

print(classification_report(Y_test,Y_pred))

acc=accuracy_score(Y_test,Y_pred)
print("Accuracy of the model: ",acc)

In [None]:
print("Train Score(DT):",model_DecisionTree.score(X_train,Y_train))
print("Test Score(DT):",model_DecisionTree.score(X_test,Y_test))


Type II errors, i.e. patients who were diabetic but were predicted as non-diabetic, are still present, and recall remains noticeably higher for class "0", i.e. the non-diabetic patients.
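
To make the link between Type II errors and recall explicit, the sketch below derives both from a confusion matrix. The labels are made up for illustration (1 = diabetic, 0 = non-diabetic) and are not model output from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up true labels and predictions
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
recall_pos = tp / (tp + fn)   # share of diabetic patients that were caught
recall_neg = tn / (tn + fp)   # share of non-diabetic patients correctly cleared

print("false negatives (Type II errors):", fn)
print("recall for class 1:", recall_pos)
print("recall for class 0:", recall_neg)
```

Each false negative lowers the recall of class 1, so a model can show good overall accuracy while still missing many diabetic patients.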

# Optimization techniques - Handling Imbalanced data

In [None]:
diabetes_data_df.Outcome.value_counts()

### Oversampling

In [None]:
from sklearn.utils import resample
# Separate majority and minority classes
df_majority = diabetes_data_df[diabetes_data_df.Outcome==0]
df_minority = diabetes_data_df[diabetes_data_df.Outcome==1]

# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,     # sample with replacement
                                 n_samples=len(df_majority),    # to match majority class
                                 random_state=10) # reproducible results

# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Display new class counts
print(df_upsampled.Outcome.value_counts())
df_upsampled.Outcome.value_counts().plot(kind="pie")

In [None]:
X=df_upsampled.values[:,:-1]
Y=df_upsampled.values[:,-1]

In [None]:
from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
#X=scaler.fit_transform(X)
#print(X)

In [None]:
from sklearn.model_selection import train_test_split

#split the data into test and train
X_train, X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=10)
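
One caveat with resampling before the split: copies of the same minority row can land in both the training and the test set, which tends to make test scores look better than they are. A sketch of the alternative order, split first and then oversample only the training part, is shown below on a synthetic frame. The column values are made up; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic frame standing in for the cleaned data (values are made up)
rng = np.random.default_rng(10)
demo = pd.DataFrame({
    "Glucose": rng.normal(120, 30, size=100),
    "BMI": rng.normal(32, 6, size=100),
    "Outcome": [0] * 65 + [1] * 35,
})

# 1) Hold out the test rows first so they are never duplicated into training
train_df, test_df = train_test_split(
    demo, test_size=0.2, random_state=10, stratify=demo["Outcome"]
)

# 2) Oversample the minority class inside the training part only
train_major = train_df[train_df["Outcome"] == 0]
train_minor = train_df[train_df["Outcome"] == 1]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=10
)
train_balanced = pd.concat([train_major, train_minor_up])

print(train_balanced["Outcome"].value_counts())
print("rows shared with test set:",
      len(train_balanced.index.intersection(test_df.index)))
```

The test set keeps its original class ratio, so the reported metrics reflect how the model would behave on unseen, imbalanced data.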

In [None]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
#create a model object
lr=LogisticRegression()
#fitting training data to the model
lr.fit(X_train,Y_train)

Y_pred=lr.predict(X_test)
print(Y_pred)
#print(list(zip(Y_test,Y_pred)))

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

cfm=confusion_matrix(Y_test,Y_pred)
sns.heatmap(cfm, annot=True, fmt='g', cbar=False, cmap='BuPu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()


print("Classification report:")

print(classification_report(Y_test,Y_pred))

acc=accuracy_score(Y_test,Y_pred)
print("Accuracy of the model: ",acc)

In [None]:
#predicting using the Decision Tree_Classifier
from sklearn.tree import DecisionTreeClassifier

model_DecisionTree=DecisionTreeClassifier(criterion="gini",random_state=10)


#fit the model on the data and predict the values
model_DecisionTree.fit(X_train,Y_train)
Y_pred=model_DecisionTree.predict(X_test)
#print(Y_pred)
#print(list(zip(Y_test,Y_pred)))

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

cfm=confusion_matrix(Y_test,Y_pred)
sns.heatmap(cfm, annot=True, fmt='g', cbar=False, cmap='BuPu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()


print("Classification report:")

print(classification_report(Y_test,Y_pred))

acc=accuracy_score(Y_test,Y_pred)
print("Accuracy of the model: ",acc)

### Undersampling

In [None]:
from sklearn.utils import resample
# Separate majority and minority classes
df_majority = diabetes_data_df[diabetes_data_df.Outcome==0]
df_minority = diabetes_data_df[diabetes_data_df.Outcome==1]

# Downsample majority class
df_majority_downsampled = resample(df_majority,
                                 replace=False,    # sample without replacement
                                 n_samples=len(df_minority),    # to match minority class
                                 random_state=10) # reproducible results

# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_minority, df_majority_downsampled])

# Display new class counts
print(df_downsampled.Outcome.value_counts())
df_downsampled.Outcome.value_counts().plot(kind="pie")

In [None]:
X=df_downsampled.values[:,:-1]
Y=df_downsampled.values[:,-1]

In [None]:
from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
#X=scaler.fit_transform(X)
#print(X)

In [None]:
from sklearn.model_selection import train_test_split

#split the data into test and train
X_train, X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=10)

In [None]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
#create a model object
lr=LogisticRegression()
#fitting training data to the model
lr.fit(X_train,Y_train)

Y_pred=lr.predict(X_test)
print(Y_pred)
#print(list(zip(Y_test,Y_pred)))

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

cfm=confusion_matrix(Y_test,Y_pred)
sns.heatmap(cfm, annot=True, fmt='g', cbar=False, cmap='BuPu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()


print("Classification report:")

print(classification_report(Y_test,Y_pred))

acc=accuracy_score(Y_test,Y_pred)
print("Accuracy of the model: ",acc)

In [None]:
#predicting using the Decision Tree_Classifier
from sklearn.tree import DecisionTreeClassifier

model_DecisionTree=DecisionTreeClassifier(criterion="gini",random_state=10)


#fit the model on the data and predict the values
model_DecisionTree.fit(X_train,Y_train)
Y_pred=model_DecisionTree.predict(X_test)
#print(Y_pred)
#print(list(zip(Y_test,Y_pred)))

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

cfm=confusion_matrix(Y_test,Y_pred)
sns.heatmap(cfm, annot=True, fmt='g', cbar=False, cmap='BuPu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()


print("Classification report:")

print(classification_report(Y_test,Y_pred))

acc=accuracy_score(Y_test,Y_pred)
print("Accuracy of the model: ",acc)

### SMOTE

In [None]:
import imblearn

In [None]:
X=diabetes_data_df.values[:,:-1]      
Y=diabetes_data_df.values[:,-1]

In [None]:
from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()
scaler.fit(X)
X=scaler.transform(X)
#X=scaler.fit_transform(X)
#print(X)

In [None]:
from sklearn.model_selection import train_test_split

#split the data into test and train
X_train, X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,random_state=10)

In [None]:
print(len(Y_train[Y_train==1])) # minority
print(len(Y_train[Y_train==0])) #majority


In [None]:
print("Before OverSampling, counts of label '1': ", (sum(Y_train == 1)))
print("Before OverSampling, counts of label '0': ", (sum(Y_train == 0)))
  
# import SMOTE from imblearn library
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state = 10)
X_train_res, Y_train_res = sm.fit_resample(X_train, Y_train)
  
print('After OverSampling, the shape of train_X: ', (X_train_res.shape))
print('After OverSampling, the shape of train_y: ', (Y_train_res.shape))
  
print("After OverSampling, counts of label '1': ", (sum(Y_train_res == 1)))
print("After OverSampling, counts of label '0': ", (sum(Y_train_res == 0)))
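
SMOTE does not duplicate rows; it creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. The following NumPy sketch illustrates only that interpolation step. The `smote_like_samples` helper is hypothetical and heavily simplified; it is not the imblearn implementation, and the toy points are made up.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=10):
    """Create n_new synthetic rows by interpolating towards a random
    one of the k nearest minority neighbours (simplified illustration)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority rows
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)             # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base_idx = rng.integers(0, len(X_minority), size=n_new)
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))                # interpolation weight in [0, 1)
    return X_minority[base_idx] + lam * (X_minority[nb_idx] - X_minority[base_idx])

# Five made-up minority points in two dimensions
X_min = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.8], [1.8, 1.5]])
synthetic = smote_like_samples(X_min, n_new=4)
print(synthetic)
```

Every synthetic point lies on the line segment between two existing minority points, which is why SMOTE is normally applied to the training data only, as in the cell above.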

In [None]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
#create a model object
lr=LogisticRegression()
#fitting training data to the model
lr.fit(X_train_res,Y_train_res)

Y_pred=lr.predict(X_test)
print(Y_pred)
#print(list(zip(Y_test,Y_pred)))

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

cfm=confusion_matrix(Y_test,Y_pred)
sns.heatmap(cfm, annot=True, fmt='g', cbar=False, cmap='BuPu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()


print("Classification report:")

print(classification_report(Y_test,Y_pred))

acc=accuracy_score(Y_test,Y_pred)
print("Accuracy of the model: ",acc)

In [None]:
#predicting using the Decision Tree_Classifier
from sklearn.tree import DecisionTreeClassifier

model_DecisionTree=DecisionTreeClassifier(criterion="gini",random_state=10)


#fit the model on the data and predict the values
model_DecisionTree.fit(X_train_res,Y_train_res)
Y_pred=model_DecisionTree.predict(X_test)
#print(Y_pred)
#print(list(zip(Y_test,Y_pred)))

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

cfm=confusion_matrix(Y_test,Y_pred)
sns.heatmap(cfm, annot=True, fmt='g', cbar=False, cmap='BuPu')
plt.xlabel('Predicted Values')
plt.ylabel('Actual Values')
plt.show()


print("Classification report:")

print(classification_report(Y_test,Y_pred))

acc=accuracy_score(Y_test,Y_pred)
print("Accuracy of the model: ",acc)