# Introduction
Diabetes is a serious, long-term condition with a major impact on the lives and well-being of individuals, families, and societies worldwide. It is among the top 10 causes of death in adults, and was estimated to have caused four million deaths globally in 2017 [1].

The global diabetes prevalence in 2019 is estimated to be 9.3% (463 million people), rising to 10.2% (578 million) by 2030 and 10.9% (700 million) by 2045. The prevalence is higher in urban (10.8%) than rural (7.2%) areas, and in high-income (10.4%) than low-income countries (4.0%). One in two (50.1%) people living with diabetes do not know that they have diabetes. The global prevalence of impaired glucose tolerance is estimated to be 7.5% (374 million) in 2019 and projected to reach 8.0% (454 million) by 2030 and 8.6% (548 million) by 2045.

For reference

https://www.diabetesresearchclinicalpractice.com/article/S0168-8227(19)31230-6/fulltext

# IDF Diabetes Atlas Ninth Edition 2019
![image.png](attachment:image.png)

The global fact sheet summarizes these figures on a single page: https://diabetesatlas.org/upload/resources/material/20191218_144459_2019_global_factsheet.pdf

For reference
https://diabetesatlas.org/en/resources/

First we import the main libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Import the dataset

In [None]:
diabetes_data = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')

#Print the first 5 rows of the dataframe.
diabetes_data.head()

In [None]:
diabetes_data.shape

The DataFrame.describe() method generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values. In a DataFrame with mixed column types it summarizes only the numeric columns by default: categorical columns are ignored unless the parameter include="all" is passed.

Now, let's understand the statistics that are generated by the describe() method:

* count tells us the number of non-empty (non-NaN) rows in a feature.
* mean tells us the mean value of that feature.
* std tells us the standard deviation of that feature.
* min tells us the minimum value of that feature.
* 25%, 50%, and 75% are the percentiles (quartiles) of each feature. This quartile information helps us detect outliers.
* max tells us the maximum value of that feature.
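
As a minimal sketch of the behaviour described above, the toy DataFrame below (hypothetical values, not the diabetes dataset) has one numeric column with a missing value and one categorical column. It shows that describe() skips the categorical column by default and that count excludes NaN.

```python
import pandas as pd

# Hypothetical toy data: a numeric column with one missing value and a categorical column.
toy = pd.DataFrame({
    "glucose": [85.0, 120.0, None, 150.0],
    "group": ["a", "b", "a", "a"],
})

# Default call: only the numeric column is summarized, and count ignores the NaN.
numeric_summary = toy.describe()

# include="all" adds the categorical column with count/unique/top/freq rows.
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```

Rows that do not apply to a column type (for example mean for the categorical column) are shown as NaN in the full summary.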

In [None]:
diabetes_data.describe()


#  Basic EDA and statistical analysis

In [None]:
diabetes_data.info(verbose=True)

In [None]:
sns.countplot(x='Outcome',data=diabetes_data)
plt.show()

The above graph shows that the dataset is imbalanced towards an Outcome value of 0, which means that diabetes was not present. The number of non-diabetics is almost twice the number of diabetic patients.
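
The imbalance can also be quantified instead of being read off the plot. The sketch below uses a short hypothetical label series; in this notebook the same calls would be applied to diabetes_data['Outcome'].

```python
import pandas as pd

# Hypothetical labels standing in for diabetes_data['Outcome'].
outcome = pd.Series([0, 0, 1, 0, 1, 0], name="Outcome")

counts = outcome.value_counts()                 # absolute class counts
shares = outcome.value_counts(normalize=True)   # class proportions
imbalance_ratio = counts.loc[0] / counts.loc[1] # negatives per positive

print(counts)
print(shares)
print(imbalance_ratio)
```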

/////////////////////////////////////////////////////////////////////////////////////////////////////

In some columns, a value of zero does not make sense and therefore indicates a missing value.

The following columns have invalid zero values:

1. Glucose
2. BloodPressure
3. SkinThickness
4. Insulin
5. BMI

It is better to replace these zeros with NaN: the missing values are then easy to count, and they can later be replaced with suitable values.

In [None]:
diabetes_data_copy = diabetes_data.copy(deep = True)
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
diabetes_data_copy[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] = diabetes_data_copy[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0,np.nan)

## showing the count of NaNs
print(diabetes_data_copy.isnull().sum())


In [None]:
plt.style.use('classic')
plot = diabetes_data.hist(figsize = (20,20))

We impute the NaN values of each column according to its distribution: the mean for Glucose and BloodPressure, and the median for SkinThickness, Insulin, and BMI.
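
One way to make this choice explicit is sketched below. The helper impute_by_skew and its 0.5 skewness threshold are hypothetical, chosen only for illustration: a roughly symmetric column is filled with its mean, a skewed one with its median, which is less sensitive to extreme values.

```python
import numpy as np
import pandas as pd

def impute_by_skew(series: pd.Series, skew_threshold: float = 0.5) -> pd.Series:
    """Fill NaNs with the mean if the column is roughly symmetric, else with the median.

    Hypothetical helper; the threshold is an assumption, not a standard value.
    """
    if abs(series.skew()) < skew_threshold:
        fill_value = series.mean()
    else:
        fill_value = series.median()
    return series.fillna(fill_value)

# Toy columns: one symmetric, one with a large outlier that skews it to the right.
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 50.0, np.nan])

filled_symmetric = impute_by_skew(symmetric)  # NaN replaced by the mean
filled_skewed = impute_by_skew(skewed)        # NaN replaced by the median

print(filled_symmetric.iloc[-1], filled_skewed.iloc[-1])
```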

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which does not reliably modify the DataFrame in recent pandas versions.
for col in ['Glucose', 'BloodPressure']:
    diabetes_data_copy[col] = diabetes_data_copy[col].fillna(diabetes_data_copy[col].mean())
for col in ['SkinThickness', 'Insulin', 'BMI']:
    diabetes_data_copy[col] = diabetes_data_copy[col].fillna(diabetes_data_copy[col].median())

In [None]:
## verify that no NaNs remain after imputation
print(diabetes_data_copy.isnull().sum())

Histograms after imputing the NaN values

In [None]:
plot = diabetes_data_copy.hist(figsize = (20,20))


Scatter matrix of uncleaned data

In [None]:
sns.pairplot(diabetes_data)

Pair plot for clean data

In [None]:
sns.pairplot(data=diabetes_data_copy,hue='Outcome',diag_kind='kde', kind="reg")
plt.show()

Correlation heatmap of the uncleaned data

In [None]:
plt.figure(figsize=(12,10))  # figure size of 12 by 10 inches
# xticklabels=2 draws every second column label; yticklabels=False hides the row labels
ax = sns.heatmap(diabetes_data.corr(), xticklabels=2, annot=True ,yticklabels=False)

Here I would like to clarify how to read the heatmap.

Pearson's Correlation Coefficient measures the strength and direction of the linear association between two variables. Its value lies between -1 and +1: +1 means a perfect positive linear relationship, -1 means a perfect negative linear relationship, and 0 means no linear correlation.
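
To make the definition concrete, the sketch below computes the coefficient by hand as the covariance divided by the product of the standard deviations. The function name pearson_r and the toy lists are illustrative only; diabetes_data.corr() computes the same quantity for every pair of columns.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (variances up to the same constant factor).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # y rises linearly with x
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # y falls linearly with x
r_none = pearson_r(x, [2, 1, 0, 1, 2])       # symmetric V shape: no linear trend

print(r_positive, r_negative, r_none)
```

The third case shows why a coefficient near 0 only rules out a linear relationship: y clearly depends on x there, but not along a straight line.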

A heat map is a two-dimensional representation of information with the help of colors. Heat maps can help the user visualize simple or complex information.

///////////////////////////////////////////////////////////////////////////////////////////////////////

The pandas_profiling library generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

* Type inference: detect the types of columns in a dataframe
* Essentials: type, unique values, missing values
* Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range
* Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
* Most frequent values
* Histogram
* Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
* Missing values: matrix, count, heatmap and dendrogram of missing values
* Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data

In [None]:
from pandas_profiling import ProfileReport

# Profile the dataset itself rather than its correlation matrix
profile = ProfileReport(diabetes_data, title='Pandas profiling report', html={'style':{'full_width':True}})

profile.to_notebook_iframe()

Scaling the data

Each feature is rescaled to z-scores so that it has mean μ = 0 and standard deviation σ = 1. This is done through the following formula:


![image.png](attachment:image.png)
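
Assuming the formula pictured is the usual z-score, z = (x - μ) / σ, the sketch below applies it by hand to a small hypothetical array. StandardScaler performs the same computation per column, using the population standard deviation (ddof=0), which is also the NumPy default.

```python
import numpy as np

# Hypothetical feature values, not taken from the diabetes dataset.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = values.mean()    # mean of the feature
sigma = values.std()  # population standard deviation (ddof=0)

# z-score: subtract the mean, then divide by the standard deviation.
z = (values - mu) / sigma

print(mu, sigma)
print(z)
print(z.mean(), z.std())
```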

In [None]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
features = diabetes_data_copy.drop(columns=["Outcome"])
# Standardize every feature column and keep the original column names
X = pd.DataFrame(sc_X.fit_transform(features), columns=features.columns)
X.head()

In [None]:
y = diabetes_data_copy.Outcome

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=1/3,random_state=42, stratify=y)

In [None]:
# Import libraries
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Base models for the voting classifier
# (multi_class is omitted: it is deprecated in recent scikit-learn releases)
LRModel_ = LogisticRegression(solver='lbfgs', random_state=33)
RFModel_ = RandomForestClassifier(n_estimators=100, criterion='gini',max_depth=5, random_state=33)
KNNModel_ = KNeighborsClassifier(n_neighbors= 10, weights ='uniform', algorithm='auto')
NNModel_ = MLPClassifier(solver='lbfgs',hidden_layer_sizes=(1000, 20),learning_rate='constant',activation='relu', power_t=0.4, max_iter=250)

# Voting classifier: voting='soft' averages the predicted class probabilities of the base models
VotingClassifierModel = VotingClassifier(estimators=[('LRModel',LRModel_),('RFModel',RFModel_),('KNNModel',KNNModel_),('NNModel',NNModel_)], voting= 'soft')
VotingClassifierModel.fit(X_train, y_train)

#Calculating Details
print('VotingClassifierModel Train Score is : ' , VotingClassifierModel.score(X_train, y_train))
print('VotingClassifierModel Test Score is : ' , VotingClassifierModel.score(X_test, y_test))
print('----------------------------------------------------')
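
The difference between voting='soft' and voting='hard' can be sketched with hypothetical probabilities from three models for a single sample. The numbers are made up for illustration and assume equal weights, as in the classifier above.

```python
import numpy as np

# Hypothetical [P(class 0), P(class 1)] predictions of three models for one sample.
probas = np.array([
    [0.90, 0.10],  # model A is very confident in class 0
    [0.40, 0.60],  # model B leans towards class 1
    [0.45, 0.55],  # model C leans towards class 1
])

# Soft voting: average the probabilities, then pick the most probable class.
soft_avg = probas.mean(axis=0)
soft_vote = int(np.argmax(soft_avg))

# Hard voting: each model casts one vote for its own top class; the majority wins.
hard_votes = probas.argmax(axis=1)
hard_vote = int(np.bincount(hard_votes).argmax())

print(soft_avg, soft_vote, hard_vote)
```

Here the two schemes disagree: the single confident model outweighs the two hesitant ones under soft voting, while the majority decides under hard voting.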


###### Calculating Prediction


In [None]:
y_pred = VotingClassifierModel.predict(X_test)
print('Predicted Value for VotingClassifierModel is : ' , y_pred[:10])

In [None]:

#Calculating Confusion Matrix

from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="BuPu" ,fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

In [None]:
# Import libraries
from sklearn.metrics import accuracy_score

# Accuracy score: (TP + TN) / (TP + TN + FP + FN)
# The default normalize=True returns this fraction; normalize=False would return
# the number of correctly classified samples instead.
AccScore = accuracy_score(y_test, y_pred)
print('Accuracy Score is : ', AccScore)
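
The following sketch reproduces by hand what accuracy_score returns with normalize=False (a count of correct predictions) and with normalize=True (the fraction). The label lists are hypothetical.

```python
# Hypothetical true labels and predictions for eight samples.
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_hat = [0, 1, 1, 0, 0, 1, 0, 0]

# Number of correct predictions: what normalize=False reports.
n_correct = sum(t == p for t, p in zip(y_true, y_hat))

# Fraction of correct predictions, (TP + TN) / total: what normalize=True reports.
accuracy = n_correct / len(y_true)

print(n_correct, accuracy)
```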

In [None]:
# Import libraries
from sklearn.metrics import f1_score

# F1 score: 2 * (precision * recall) / (precision + recall)
# average='binary' scores the positive class (Outcome = 1), matching the formula above.
# With average='micro' the result would simply equal the accuracy for this single-label problem.
F1Score = f1_score(y_test, y_pred, average='binary') # it can also be: micro, macro, weighted, samples
print('F1 Score is : ', F1Score)
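
As a worked example of the formula, the sketch below derives precision, recall, and F1 for the positive class from hypothetical labels, and compares the result with the plain accuracy on the same data.

```python
# Hypothetical true labels and predictions for eight samples.
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_hat = [0, 1, 1, 0, 0, 1, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

# Plain accuracy on the same data, for comparison.
accuracy = sum(t == p for t, p in pairs) / len(pairs)

print(precision, recall, f1, accuracy)
```

On imbalanced data the positive-class F1 is usually the more informative of the two numbers, because accuracy is dominated by the majority class.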

In [None]:
from sklearn.metrics import roc_curve
y_pred_proba = VotingClassifierModel.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
plt.plot([0,1],[0,1],'k--')  # chance-level diagonal
plt.plot(fpr,tpr, label='VotingClassifier')
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.title('VotingClassifierModel ROC curve')
plt.legend()
plt.show()
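
Each point on the ROC curve is a (false positive rate, true positive rate) pair obtained by thresholding the predicted probabilities. The sketch below computes a few such points by hand; the helper roc_point, the labels, and the scores are hypothetical.

```python
def roc_point(y_true, scores, threshold):
    """Return (fpr, tpr) when samples with score >= threshold are predicted positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos

# Hypothetical labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Lowering the threshold moves the point from the lower left to the upper right.
points = [roc_point(y_true, scores, thr) for thr in (0.8, 0.4, 0.35, 0.1)]
print(points)
```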

https://seaborn.pydata.org/search.html?q=cmap&check_keywords=yes&area=default
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166
https://www.kaggle.com/shrutimechlearn/step-by-step-diabetes-classification-knn-detailed