# Intro
Welcome to the popular [Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

![](https://storage.googleapis.com/kaggle-datasets-images/228/482/a520351269b547c89afe790820a1087e/dataset-cover.jpeg)

The following EDA is based on this [book](https://www.packtpub.com/product/feature-engineering-made-easy/9781787287600).

We focus on working with missing values in this dataset.

<span style="color: royalblue;">Please vote the notebook up if it helps you. Thank you. </span>

# Libraries

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt

In [None]:
from sklearn.impute import KNNImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler

# Path

In [None]:
path = '/kaggle/input/pima-indians-diabetes-database/'
os.listdir(path)

# Functions
We define some simple helper functions for the analysis and visualisation.

In [None]:
def plot_histograms():
    """ Plot histograms of all 8 features, split by Outcome (reads the global `data`) """
    
    fig, axs = plt.subplots(4, 2, figsize=(22, 18))
    fig.subplots_adjust(hspace = 0.5, wspace=0.2)
    axs = axs.ravel()
    for i in range(8):
        axs[i].hist(data[data['Outcome']==0][data.columns[i]],
                    10, alpha=0.5, label='non-diabetes')
        axs[i].hist(data[data['Outcome']==1][data.columns[i]],
                    10, alpha=0.5, label='diabetes')
        axs[i].set_title(data.columns[i])
        axs[i].legend(loc='upper right')
        axs[i].set_ylabel('Frequency')
        axs[i].grid()

In [None]:
def plot_bar(data, feature, text='', rotation=False):
    """ Bar plot of a feature """
    
    fig = plt.figure(figsize=(10, 5))
    x = data.index
    y = data[feature]
    plt.bar(x, y)
    plt.title(text, loc='left')
    plt.xlabel('Category')
    if rotation:
        plt.xticks(rotation='vertical')
    plt.grid()
    plt.show()

# Load Data

In [None]:
data = pd.read_csv(path+'diabetes.csv')

In [None]:
data.head()

# Overview

In [None]:
print('number of samples:', len(data.index))
print('number of features (target included):', len(data.columns))

First of all we want to describe the features of the dataset. There are 8 features and 1 target. The target is the column Outcome.

|Feature|Description| Measurement|
|---|---|---|
|Pregnancies|Number of times pregnant|Number|
|Glucose|[Plasma glucose concentration](https://www.ncbi.nlm.nih.gov/books/NBK541081/#:~:text=Normal%20plasma%20glucose%20levels%20are,individuals%20can%20vary%20with%20age.) at 2 hours in an oral glucose tolerance test|mg/dL|
|BloodPressure|Diastolic blood pressure|mm Hg|
|SkinThickness|Triceps skinfold thickness|mm|
|Insulin|2-hour serum insulin|µU/mL|
|BMI|Body mass index|weight in kg/(height in m)²|
|DiabetesPedigreeFunction|Diabetes pedigree function||
|Age|Age|Number|

The target is 1 if the patient developed diabetes and 0 otherwise.

Distribution of the target:

In [None]:
data['Outcome'].value_counts(normalize=True)

# EDA
## Histograms
We plot the histograms of all features and visualize the difference between diabetes and non-diabetes. As we can see, there are some obvious differences in the values.

In [None]:
plot_histograms()

## Correlation Matrix
With the correlation matrix we visualize the linear relationships between the features. We can see a comparatively strong correlation between the target label and the feature Glucose, so this feature seems to be important.

In [None]:
corr = data.corr()
# Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it
corr.style.background_gradient(cmap='coolwarm', axis=None).format(precision=2)
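
To read the matrix more quickly, the column for the target can be pulled out and sorted. The following is only a sketch on a small synthetic frame: the names `toy` and `target_corr` and all values are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's dataframe (synthetic values, assumption).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Glucose': rng.normal(120, 30, 200),
    'Age': rng.integers(21, 80, 200),
})
# The synthetic target depends on Glucose only.
toy['Outcome'] = (toy['Glucose'] + rng.normal(0, 20, 200) > 130).astype(int)

# Pearson correlation of every feature with the target,
# ordered by absolute value while keeping the sign.
target_corr = (
    toy.corr()['Outcome']
       .drop('Outcome')
       .sort_values(key=np.abs, ascending=False)
)
print(target_corr)
```

On the real data the same expression with `data` in place of `toy` should list the features most linearly related to Outcome first.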

## Missing/Implausible Values
### Analysis
Fortunately there are no missing values to handle ...

In [None]:
data.isnull().sum()

... but a look at the dataframe description shows some implausible values:

In [None]:
data.describe()

The body mass index of a person cannot be 0. The features glucose, blood pressure, skin thickness and insulin cannot be equal to zero either. So it seems that missing values were filled with 0. This is of course one way to handle missing values, but it is possible to find a better solution.

In [None]:
features_with_missing_values = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

In [None]:
def label_missing_values(s):
    """ Label missing values (=0) with None """
    if s == 0:
        return None
    else:
        return s

In [None]:
for feature in features_with_missing_values:
    data[feature] = data[feature].apply(label_missing_values)
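
The same labelling can also be done without a helper function. This is a minimal sketch, assuming that a plain vectorised `replace` is acceptable here; the frame `toy` and the list `zero_means_missing` are hypothetical names with synthetic values.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame; the column names mirror the dataset, the values do not.
toy = pd.DataFrame({
    'Pregnancies': [0, 2, 1],
    'Glucose': [148, 0, 183],
    'Insulin': [0, 94, 0],
})
zero_means_missing = ['Glucose', 'Insulin']

# Replace 0 with NaN only in the listed columns.
# Pregnancies keeps its zeros, because 0 is a valid count there.
toy[zero_means_missing] = toy[zero_means_missing].replace(0, np.nan)
print(toy.isnull().sum())
```

Restricting the replacement to an explicit column list matters: a zero in Pregnancies or Outcome is real information, not a gap.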

Now the missing values are visible. Of course they influence the histograms and the correlation matrix shown before, as well as the prediction model we define later. There are several options to deal with missing values. We could drop all rows with missing data. However, there are 374 samples with missing values for the feature Insulin alone. If we drop all of them we lose about 49% of all samples.

In [None]:
data.isnull().sum()
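
The share of missing entries (for Insulin, 374 of 768 samples, roughly 0.49) can be computed directly instead of by hand. A small sketch on synthetic values; `toy`, `missing_share` and `rows_lost` are illustrative names only.

```python
import numpy as np
import pandas as pd

# Synthetic example with gaps in both columns.
toy = pd.DataFrame({
    'Glucose': [148.0, np.nan, 183.0, 89.0],
    'Insulin': [np.nan, 94.0, np.nan, 168.0],
})

# Share of missing entries per column (mean of a boolean mask).
missing_share = toy.isnull().mean()
# Share of rows that dropna() would remove (at least one missing value).
rows_lost = toy.isnull().any(axis=1).mean()
print(missing_share)
print(rows_lost)
```

The second number is the one that matters for dropping rows, since a row is lost as soon as any single feature is missing.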

### Drop Rows With Missing Values
We define a new dataframe by dropping the rows with missing values from the original dataframe. Then we analyse the impact.

In [None]:
data_dropped = data.dropna()

We compare the mean values of all features for both dataframes.

In [None]:
compare = pd.DataFrame()
compare['origin'] = data.mean()
compare['dropped'] = data_dropped.mean()
compare['delta'] = (compare['dropped']-compare['origin'])/compare['origin']

The absolute differences are apparently not significant, but the relative differences are.

In [None]:
compare

In [None]:
plot_bar(compare, 'delta', 'Relative difference', rotation=True)

### Imputing Missing Values

In [None]:
def calculate(X, y):
    """ Calculate the best score of grid search """
    
    knn = KNeighborsClassifier()
    knn_params = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7]}
    grid = GridSearchCV(knn, knn_params)
    grid.fit(X, y)
    print(grid.best_score_, grid.best_params_)
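
For reference, `GridSearchCV` without further arguments uses 5-fold cross-validation and, for a classifier, accuracy as the score. The sketch below writes both out and returns the result instead of printing it. `best_knn_score` is a hypothetical variant of the helper above, and the data come from `make_classification`, not from the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


def best_knn_score(X, y):
    """Hypothetical variant of calculate() that returns its result."""
    grid = GridSearchCV(
        KNeighborsClassifier(),
        {'n_neighbors': [1, 2, 3, 4, 5, 6, 7]},
        cv=5,                # written out; this is also the default
        scoring='accuracy',  # written out; the default score of a classifier
    )
    grid.fit(X, y)
    return grid.best_score_, grid.best_params_


# Synthetic classification problem with 8 features (assumption for the demo).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
score, params = best_knn_score(X_toy, y_toy)
print(score, params)
```

Returning the values makes it easier to collect the scores of several preprocessing variants in one table afterwards.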

First we calculate the best score of the grid search for the original data set (the missing data are filled with 0 again):

In [None]:
X_origin = data[data.columns[:-1]].fillna(0)
y_origin = data['Outcome']
calculate(X_origin, y_origin)

We calculate the best score of grid search for the dropped data:

In [None]:
X_dropped = data_dropped[data_dropped.columns[:-1]]
y_dropped = data_dropped['Outcome']
calculate(X_dropped, y_dropped)

We fill missing data with the column means (computed here from the complete rows in `X_dropped`) and calculate the best score of grid search:

In [None]:
X_mean = data[data.columns[:-1]].fillna(X_dropped.mean(axis=0))
y_mean = data['Outcome']
calculate(X_mean, y_mean)
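
scikit-learn also ships `SimpleImputer` for mean imputation. Note that it is not identical to the cell above: it takes each column mean over all observed values of that column, not only over the complete rows. A minimal sketch on synthetic numbers (`X_toy` and `X_filled` are illustrative names):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Synthetic matrix with one gap per column.
X_toy = np.array([
    [148.0, np.nan],
    [np.nan, 94.0],
    [183.0, 168.0],
])

# Each NaN is replaced by the mean of the observed values in its column.
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X_toy)
print(X_filled)
```

Here the first column is filled with (148 + 183) / 2 and the second with (94 + 168) / 2.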

We fill missing data with the KNN imputer (n_neighbors=2) and calculate the best score of grid search:

In [None]:
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(data[data.columns[:-1]])
y_imputed = data['Outcome']
calculate(X_imputed, y_imputed)

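We then scale the KNN-imputed data (n_neighbors=2) to the range [0, 1] and calculate the best score of grid search:

In [None]: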
min_max = MinMaxScaler()
X_scaled = min_max.fit_transform(X_imputed)
y_scaled = y_imputed
calculate(X_scaled, y_scaled)

Finally we repeat the imputation with n_neighbors=1, then scale the imputed data, and calculate the best score of grid search for each:

In [None]:
imputer = KNNImputer(n_neighbors=1)
X_imputed = imputer.fit_transform(data[data.columns[:-1]])
y_imputed = data['Outcome']
calculate(X_imputed, y_imputed)

In [None]:
min_max = MinMaxScaler()
X_scaled = min_max.fit_transform(X_imputed)
y_scaled = y_imputed
calculate(X_scaled, y_scaled)
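
In the cells above the imputer and the scaler are fitted on all rows before cross-validation, so every validation fold has already influenced the preprocessing. One possible way to avoid this is to put all steps into a `Pipeline` and let the grid search refit them per fold. This is a sketch under assumptions, not the method used for the scores in this notebook: the data are synthetic, and the step names `impute`, `scale` and `knn` are chosen freely.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with about 10% of the entries knocked out (assumption).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

# Imputation and scaling are refitted on the training part of every fold.
pipe = Pipeline([
    ('impute', KNNImputer()),
    ('scale', MinMaxScaler()),
    ('knn', KNeighborsClassifier()),
])
param_grid = {
    'impute__n_neighbors': [1, 2],
    'knn__n_neighbors': [1, 2, 3, 4, 5, 6, 7],
}
grid = GridSearchCV(pipe, param_grid)
grid.fit(X_toy, y_toy)
print(grid.best_score_, grid.best_params_)
```

With this layout the number of neighbours of the imputer becomes one more hyperparameter of the search instead of a separate manual experiment.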

# Summary
There are several ways of handling missing values. We put all results together in a table:

|Description|Rows|Cross-validated accuracy|
|:---|---|---|
|origin (missing values filled with 0)|768|0.7357|
|dropped missing values|392|0.7348|
|Impute values with mean of columns|768|0.7305|
|Impute with knn (n_neighbors=2)|768|0.7448|
|Impute with knn (n_neighbors=2) and scaled|768|0.7513|
|Impute with knn (n_neighbors=1)|768|0.7136|
|Impute with knn (n_neighbors=1) and scaled|768|0.7617|