----------------------------------
# Pima Indians Diabetes Analysis
----------------------------------


#####  Perform Exploratory Data Analysis to identify the impact of various attributes on the diabetes rates of the Pima Indians.

- This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.
- Through the dataset we aim to get insights about the patterns in the disease, based on certain diagnostic measurements included in the dataset.
- Several constraints were placed on the selection of these instances from a larger database.
- In particular, all patients here are females at least 21 years old of Pima Indian heritage.
- The dataset consists of several medical predictor variables and one target variable.
- Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.


--------------------------
### Attribute Information:
--------------------------

1. **Number of times pregnant**
2. **Plasma glucose concentration at 2 hours in an oral glucose tolerance test**
3. **Diastolic blood pressure (mm Hg)**
4. **Triceps skin fold thickness (mm):** Its thickness gives information about the fat reserves of the body
5. **2-Hour serum insulin (mu U/ml)**
6. **Body mass index (weight in kg/(height in m)^2)**
7. **Diabetes pedigree function**
8. **Age (years)**
9. **Class variable (0 or 1)**

------------------------
# Concepts to Cover
------------------------
- 1. <a href = #link1>Overview of the data</a>
- 2. <a href = #link2>Univariate and Bivariate Analysis</a>
- 3. <a href = #link3>Data Preprocessing</a>
- 4. <a href = #link4>Pandas Profiling</a> 

# Let's start coding!

<a id='link'></a>
### Import libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import necessary libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# To enable plotting graphs in Jupyter notebook
%matplotlib inline
import seaborn as sns

from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler, MinMaxScaler

### Think about it:

- Why do we use StandardScaler and MinMaxScaler from the sklearn library?

In [3]:
# Adjust pandas display and formatting settings

# Remove scientific notations and display numbers with 2 decimal points instead
pd.options.display.float_format = '{:,.2f}'.format        

# Increase cell width
# If you don't want to change your default settings, and you only want to change the width of 
# the current notebook you're working on, you can enter the following into a cell:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# Update default style
# Set the aesthetic style of the plots.
# This affects things like the color of the axes, whether a grid is enabled by default, and other aesthetic elements.
sns.set_style(style='darkgrid')

### Load and explore the data

In [4]:
# Load the data into pandas dataframe
df = pd.read_csv("/content/pima-indians-diabetes.csv")           # Make changes to the path depending on where your data file is stored.


# <a id='link1'>Overview of the data</a>

In [None]:
df.head()

## Think about it:

- What do you infer from the first 5 rows of the data?
- Do you see that each column is on a different scale?
    - e.g. the "Pregnancies" column ranges from 0 to 8, as we can see.
    - "Glucose" ranges from 85 to 183.

So, consider two cases:
- Without scaling the data
- With scaling, using StandardScaler or MinMaxScaler

Will the result differ between the two cases, or will it be the same?
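
To make the two cases concrete, here is a minimal sketch on made-up values (not taken from the dataset). It assumes a distance-based comparison of rows, which is only one of many situations where scale matters; the array `X` and the helper `dist` are hypothetical names used for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical toy data: column 0 is on a small scale (like Pregnancies),
# column 1 is on a large scale (like Glucose). Values are made up.
X = np.array([[0.0, 100.0],
              [8.0, 102.0],
              [1.0, 180.0]])

def dist(data, i, j):
    """Euclidean distance between rows i and j of a 2-D array."""
    return np.linalg.norm(data[i] - data[j])

X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
X_mm = MinMaxScaler().fit_transform(X)     # each column rescaled to [0, 1]

# Unscaled: the large-scale column dominates, so row 0 looks far closer
# to row 1 than to row 2.
raw_01, raw_02 = dist(X, 0, 1), dist(X, 0, 2)

# Scaled: both columns contribute comparably to the distance.
std_01, std_02 = dist(X_std, 0, 1), dist(X_std, 0, 2)
mm_01, mm_02 = dist(X_mm, 0, 1), dist(X_mm, 0, 2)

print(f"raw:      d(0,1)={raw_01:.2f}  d(0,2)={raw_02:.2f}")
print(f"standard: d(0,1)={std_01:.2f}  d(0,2)={std_02:.2f}")
print(f"min-max:  d(0,1)={mm_01:.2f}  d(0,2)={mm_02:.2f}")
```

In this toy case the answer changes with scaling, which suggests that methods relying on distances or magnitudes may behave differently on raw and scaled data.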

In [None]:
# Check number of rows and columns
df.shape

In [None]:
# Check column types and missing values
df.info()

In [None]:
# Check missing values via heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(df.isna())
plt.show()


## Think about it:

- What are some other ways to check the missing values?
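
As one possible answer, the sketch below shows a few tabular alternatives to the heatmap. It runs on a small made-up frame (`toy` is a hypothetical stand-in for `df`), and the zero count at the end rests on the assumption that a zero can be a disguised missing value for some measurements.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; values are made up.
toy = pd.DataFrame({
    "Glucose": [148, 85, np.nan, 89],
    "Insulin": [0, 0, 94, np.nan],
    "BMI": [33.6, 26.6, 28.1, 0.0],
})

# Count of missing values per column
na_counts = toy.isna().sum()

# Share of missing values per column, in percent
na_pct = toy.isna().mean() * 100

# True if any value anywhere in the frame is missing
any_missing = toy.isna().any().any()

# Assumption: zeros may act as placeholders for missing measurements,
# so counting them per column is a useful complement.
zero_counts = (toy == 0).sum()

print(na_counts, na_pct, zero_counts, sep="\n\n")
```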

# <a id = "link2">Univariate and Bivariate Analysis</a>

- How do we interpret the average of a binary field?
- What can we tell about the shape of the distributions from the summary statistics below?

In [None]:
# Example of a binary array
a = [1,1,1,1,1,1,0,0,0,0]

# Find the average of the binary array
np.mean(a)

- Is the average above or below the median in the distribution above?
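
A short sketch that works through both questions for the binary array used above. The variable names are illustrative only.

```python
import numpy as np

# Same binary array as in the cell above: six 1s and four 0s
a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

mean_a = np.mean(a)              # average of a 0/1 field
share_of_ones = sum(a) / len(a)  # proportion of 1s, computed directly
median_a = np.median(a)          # middle value of the sorted array

print(mean_a, share_of_ones, median_a)
```

The mean of a 0/1 field equals the proportion of 1s, and here it falls below the median because more than half of the values are 1.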

In [None]:
# Create summary statistics for numeric fields
df.describe().T

#### Skewness of the variables

In [None]:
df.skew()

**If the skewness value is zero, the distribution is not skewed.**

**If it is less than zero, the distribution is negatively skewed (left tail); if it is greater than zero, it is positively skewed (right tail).**
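
The sign convention can be checked on small made-up series. This is a sketch with hypothetical values, chosen only so that the tails are obvious; it also prints mean and median to show how they move with the tail.

```python
import pandas as pd

# Hypothetical toy series; values are made up for illustration.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 1, 2, 2, 3, 50])  # one large value on the right
left_tailed = pd.Series([-50, 3, 4, 4, 5, 5])  # one large negative value on the left

examples = [("symmetric", symmetric),
            ("right_tailed", right_tailed),
            ("left_tailed", left_tailed)]

for name, s in examples:
    # Series.skew() returns the sample skewness of the values
    print(f"{name}: skew={s.skew():.2f}, mean={s.mean():.2f}, median={s.median():.2f}")
```

In the right-tailed series the mean sits above the median, and in the left-tailed series it sits below.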

Some insights about data:

- The data for all the attributes are skewed, especially for the variable "Insulin".

- The mean for "Insulin" is about 80, while the median is 30.5, which clearly indicates an extremely long tail on the right.

In [None]:
# Example of a skewed distribution (right tail)

# histplot with kde=True is the replacement for the deprecated distplot
sns.histplot(df['Insulin'], kde=True, stat='density')
plt.show()

#### Incorrect Imputations

In [None]:
# Check whether any column contains non-numeric values, i.e. whether the data is corrupted
# by entries such as "?" in place of a number.

# np.isreal is a NumPy function that returns True where the input element is a real number.
# applymap is a pandas DataFrame method that applies np.isreal element-wise
# (it was renamed to DataFrame.map in pandas 2.1).

# .all(1) is True for rows in which every value is real; the ~ operator negates this,
# so the line below selects the rows that have a non-numeric value in at least one column.

df[~df.applymap(np.isreal).all(1)]

# This check works only on continuous (numeric) columns.
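
An alternative check, sketched on a made-up frame, is to coerce every column with `pd.to_numeric` and look for cells that turn into NaN. This is a swapped-in technique, not the one used above; `toy`, `coerced`, `bad_mask` and `bad_rows` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy frame with one corrupted entry ("?"); values are made up.
toy = pd.DataFrame({
    "Glucose": [148, 85, "?", 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
})

# Coerce every column to numeric; anything unparseable becomes NaN
coerced = toy.apply(pd.to_numeric, errors="coerce")

# A cell is suspicious if it became NaN only after coercion
bad_mask = coerced.isna() & toy.notna()

# Rows containing at least one non-numeric entry
bad_rows = toy[bad_mask.any(axis=1)]
print(bad_rows)
```

Unlike the `np.isreal` check, this variant accepts numbers stored as strings (for example `"85"`) and flags only values that cannot be parsed at all.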

In [None]:
df['Outcome'].value_counts()
# See distribution of target variable

In [None]:
df['Outcome'].value_counts(normalize=True)*100
# See percentage distribution of target variable

## Bivariate

In [None]:
# Let us look at the target column 'Outcome' to understand how the data is distributed amongst the various fields
df.groupby(["Outcome"]).mean()

In [None]:
# Repeat the comparison by 'Outcome' using the median, which is less sensitive to extreme values
df.groupby(["Outcome"]).median()

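**All the features have a higher mean for people with diabetes.**

**All the features also have a higher median for people with diabetes, except "Insulin". One reading is that diabetes patients produce little insulin; however, many "Insulin" values are recorded as 0, which more likely indicates a missing measurement than a true reading, so this median should be interpreted with caution.**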
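
Before reading too much into the "Insulin" medians, it may help to check how many zeros each group contains, since a median is pulled toward zero when zeros are frequent. The sketch below uses a made-up frame rather than the real data, and treating zeros as missing is an assumption, not something the source confirms.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
    "Insulin": [40, 60, 0, 90, 0, 0, 0, 200],
})

# Share of zero readings within each Outcome group
zero_share = toy["Insulin"].eq(0).groupby(toy["Outcome"]).mean()

# Median with zeros kept as they are
median_raw = toy.groupby("Outcome")["Insulin"].median()

# Median after treating zeros as missing (assumption for illustration)
median_nonzero = toy["Insulin"].replace(0, np.nan).groupby(toy["Outcome"]).median()

print(zero_share, median_raw, median_nonzero, sep="\n\n")
```

In this toy example the ordering of the group medians flips once zeros are excluded, which is why the zero share is worth inspecting first.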

#### Pair plot

In [None]:
# Check distributions by Outcome
sns.pairplot(df, hue = 'Outcome')
plt.show()

### No clear relationship is visible for any single feature.

### The number of people with diabetes is higher among those with higher "Age", "BMI" and "Glucose".

#### Correlation with the target variable is meaningful mainly when the target variable is continuous

Since "Outcome" is binary, here we instead check whether any of the features are correlated with each other.

In [None]:
corr = df.drop('Outcome',axis=1).corr()

plt.figure(figsize=(12,8))
sns.heatmap(corr, annot = True)
plt.show()

### No high correlation between features
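
Reading a heatmap by eye can miss pairs, so a tabular check may be a useful complement. The sketch below runs on synthetic data (columns `A`, `B`, `C` are hypothetical) and uses an arbitrary cut-off of 0.8, which is an assumption rather than a rule.

```python
import numpy as np
import pandas as pd

# Synthetic data: B is built from A, C is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "A": x,
    "B": 2 * x + rng.normal(scale=0.1, size=200),  # strongly related to A
    "C": rng.normal(size=200),                      # unrelated noise
})

corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) to avoid duplicate pairs
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one entry per feature pair, sorted by absolute correlation
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)

threshold = 0.8  # hypothetical cut-off for "high" correlation
high_pairs = pairs[pairs.abs() > threshold]
print(high_pairs)
```

Applied to the feature correlation matrix, an empty `high_pairs` would be consistent with the conclusion drawn from the heatmap.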

# <a id='link3'>Data Preprocessing</a>

#### Examples of data standardization using mean and standard deviation

In [None]:
# Multiple ways to implement Z score standardization

# Standardization of entire data set using "zscore" function from scipy.stats package 
df_z = df.apply(zscore)

# Manual standardization of individual fields
df['Age_Z_Manual'] = (df['Age']-np.mean(df['Age']))/np.std(df['Age'])

# Using "zscore" function from scipy.stats package 
df['Age_Z_Scipy'] = df[['Age']].apply(zscore)

# Using "StandardScaler" function from sklearn.preprocessing package - useful for machine learning models  
df['Age_Z_Sklearn'] = StandardScaler().fit_transform(df[['Age']])

In [None]:
# View the new data set with all standardized fields
df_z.head()

In [None]:
# View existing data set with new Age standardized fields
df.head()

### zscore and StandardScaler give the same result.
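
This can be verified numerically. The sketch below uses made-up ages and compares the three routes with `np.allclose`; it also shows that the results stop matching if the sample standard deviation (`ddof=1`) is used instead.

```python
import numpy as np
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler

# Hypothetical ages, made up for illustration
age = np.array([21.0, 25.0, 30.0, 42.0, 50.0, 63.0])

z_manual = (age - age.mean()) / age.std()  # np.std defaults to ddof=0
z_scipy = zscore(age)                       # zscore also defaults to ddof=0
z_sklearn = StandardScaler().fit_transform(age.reshape(-1, 1)).ravel()

# Same formula, but with the sample standard deviation
z_sample = zscore(age, ddof=1)

print(np.allclose(z_manual, z_scipy), np.allclose(z_scipy, z_sklearn))
print(np.allclose(z_scipy, z_sample))
```

All three default to the population standard deviation. Note that the pandas method `Series.std()` defaults to `ddof=1`, so a manual version written with it would give slightly different values.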

#### Examples of data normalization using min and range

In [None]:
# Manual normalization
df['Age_Norm_Manual'] = (df['Age']-np.min(df['Age']))/(np.max(df['Age'])-np.min(df['Age']))

# Using "MinMaxScaler" class from sklearn.preprocessing package - useful for machine learning models  
df['Age_Norm_Sklearn'] = MinMaxScaler().fit_transform(df[['Age']])
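
The two normalization routes in the cell above should agree. A minimal check on made-up ages (hypothetical values, not from the dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages, made up for illustration
age = np.array([21.0, 25.0, 30.0, 42.0, 50.0, 63.0])

# Manual min-max normalization: (x - min) / (max - min)
norm_manual = (age - age.min()) / (age.max() - age.min())

# MinMaxScaler expects a 2-D array, hence the reshape
norm_sklearn = MinMaxScaler().fit_transform(age.reshape(-1, 1)).ravel()

print(np.allclose(norm_manual, norm_sklearn))
print(norm_manual.min(), norm_manual.max())
```

After normalization the smallest value maps to 0 and the largest to 1, while the shape of the distribution is unchanged.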

In [None]:
df.head()

In [None]:
df.describe().T

#### Example of data transformation using natural logarithm

In [None]:
# Log transformation of a skewed field using numpy log function
df['Age_Log'] = np.log(df['Age'])

# Log transformation of a skewed field with zeros
df['BloodPressure_Log'] = np.log(df['BloodPressure']+0.5)
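
The shift is needed because the logarithm of zero is not finite. The sketch below, on made-up values, compares the plain log, the `+ 0.5` shift used above, and `np.log1p`, a common alternative that is shown here only as an option.

```python
import numpy as np

# Hypothetical values including a zero, made up for illustration
x = np.array([0.0, 1.0, 10.0, 100.0])

# Plain log: log(0) evaluates to -inf (the warning is silenced for this demo)
with np.errstate(divide="ignore"):
    plain = np.log(x)

shifted = np.log(x + 0.5)  # the shift used in the cell above
log1p = np.log1p(x)        # log(1 + x), which maps 0 to 0

print(plain)
print(shifted)
print(log1p)
```

The size of the shift is a modelling choice: smaller shifts push the transformed zeros further from the rest of the values.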

In [None]:
# Let's plot original, standardized, normalized, and log transformed Age fields
fig, axs = plt.subplots(ncols = 4, figsize = (30, 7))

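sns.histplot(df['Age'], kde=True, stat='density', ax = axs[0])
sns.histplot(df['Age_Z_Manual'], kde=True, stat='density', ax = axs[1])
sns.histplot(df['Age_Norm_Manual'], kde=True, stat='density', ax = axs[2])
sns.histplot(df['Age_Log'], kde=True, stat='density', ax = axs[3]);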

In [None]:
# Let's replace "Outcome" field in the standardized dataset with the original "Outcome" field
df_z['Outcome'] = df['Outcome']

In [None]:
# Create a list with features for plotting boxplots
features = [col for col in df_z.columns if col != 'Outcome']

In [None]:
# Create boxplots to understand differences in distributions among people with and without diabetes
fig, axs = plt.subplots(ncols = len(features), figsize = (40, 15))

for idx, field in enumerate(features):
    sns.boxplot(x = 'Outcome', 
                y = field, 
                data = df_z,
                ax = axs[idx])

# <a id='link4'>Pandas Profiling</a>
#### Automated data profiling

In [None]:
# Let us try pandas-profiling now and see how it simplifies the EDA
# (newer releases of this package are published under the name ydata-profiling)
!pip install pandas-profiling==2.8.0

In [None]:
# Loading dataframe again, so that original features are considered
data = pd.read_csv("/content/pima-indians-diabetes.csv")

In [None]:
from pandas_profiling import ProfileReport
prof = ProfileReport(data)
prof
# to view report created by pandas profile

In [None]:
prof.to_file(output_file='output.html')
# to save report obtained via pandas profiling

# Appendix



- **warnings.filterwarnings("ignore")** : Never print matching warnings
- **Pandas** : Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- **Numpy** : The fundamental package for scientific computing with Python.
- **Matplotlib** : Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
- **Seaborn** : Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- **scipy.stats** : This module contains a large number of probability distributions as well as a growing library of statistical functions.
- **sklearn.preprocessing** : This package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.