
# EDA of Pima diabetes data
- Exploratory Data Analysis
- Reference: https://medium.com/@soumen.atta/analyzing-pima-indians-diabetes-data-using-python-89a021b5f4eb

## 1. Importing Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

## 2. Data Collection

In [None]:
# Get the data from github
url = "https://github.com/Redwoods/Py/raw/master/pdm2020/my-note/py-pandas/data/diabetes.csv"
df = pd.read_csv(url)

In [None]:
df.shape

In [None]:
df.head()

### Check and clean the data
- Check the NaN or missing values
- Clean the null data

In [None]:
#CHECK FOR NULL VALUES
df.isnull().values.any(), df.isna().sum()

In [None]:
# Keep all columns, and drop rows with any missing values.
print(df.shape)
cols = df.columns  # named cols to avoid shadowing the built-in vars()
print(cols)
df = df[cols].dropna()
df.shape
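
A caveat on the null check above: `isnull()` only finds values stored as NaN. In this dataset some measurement columns may use 0 as a placeholder for "not measured"; that is an assumption to verify against `df.describe()`, not something the check above confirms. A minimal sketch on a made-up toy frame (the values and the `suspect_cols` list are hypothetical) shows how such zeros could be counted:

```python
import pandas as pd

# Toy frame standing in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1],
})

# Columns where a zero is assumed to mean "not measured" (hypothetical list).
suspect_cols = ["Glucose", "BloodPressure", "BMI"]

# isnull() reports nothing here, because zeros are ordinary numbers.
has_nan = toy.isnull().values.any()

# Count zeros per column instead.
zero_counts = (toy[suspect_cols] == 0).sum()

print(has_nan)
print(zero_counts)
```

Running the same count on `df` would show whether `dropna()` alone is enough for this data.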

## 3. Explore Data

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
df.info()

### Check the balance of classes in the data through plot

In [None]:
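# Check the balance of the classes with a count plot
classes = df.Outcome
ax = sns.countplot(x=classes)  # pass the Series as x by keyword
counts = classes.value_counts()
nDB, DB = counts.loc[0], counts.loc[1]  # look up by label, not by sort order
print('False: non-diabetes', nDB)
print('True: diabetes', DB)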

In [None]:
classes.value_counts(), type(classes)
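
Raw counts are easier to compare as proportions. The sketch below uses a made-up label series (not the real `Outcome` column) to show how `value_counts(normalize=True)` gives the class shares, with `sort_index()` fixing the order to label 0 and then label 1:

```python
import pandas as pd

# Made-up labels for illustration: 0 = non-diabetes, 1 = diabetes.
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1], name="Outcome")

# Absolute counts, ordered by label rather than by frequency.
counts = outcome.value_counts().sort_index()

# Relative shares of each class (they sum to 1).
shares = outcome.value_counts(normalize=True).sort_index()

n_neg, n_pos = counts.loc[0], counts.loc[1]
print(counts)
print(shares)
```

On the real data, `df.Outcome.value_counts(normalize=True)` would report the same kind of shares.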

***

### **Univariate plots:** 

* Histograms.
* Density Plots.
* Box and Whisker Plots.  

**Histograms** 

The distribution of each attribute can be visualized by plotting histograms.

In [None]:
plt.rcParams['figure.figsize'] = [12, 10]; # set the figure size 

# Draw histograms for all attributes 
df.hist()
plt.show()

**Density Plots**

Another way to visualize the distribution of each attribute is density plots. 

In [None]:
# Density plots for all attributes
df.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
plt.show()

**Box and Whisker Plots** 

Box and whisker plots (or simply, boxplots) are another way to visualize the distribution of each attribute.

In [None]:
# Draw box and whisker plots for all attributes 
df.plot(kind= 'box', subplots=True, layout=(3,3), sharex=False, sharey=False)
plt.show()
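
The points drawn beyond the whiskers follow, by matplotlib's default, the 1.5 × IQR rule. A minimal sketch of that rule on a made-up series (the helper name `iqr_outliers` and the values are hypothetical) makes the flagged points explicit:

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return s[(s < lower) | (s > upper)]

# Made-up values for illustration; one value is clearly extreme.
toy = pd.Series([10, 12, 11, 13, 12, 11, 95], name="Insulin")

outliers = iqr_outliers(toy)
print(outliers)
```

Applying the helper column by column, for example `df.apply(lambda s: len(iqr_outliers(s)))`, would give a rough count of the points each boxplot marks as outliers.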

***

## **Multivariate Plots:**

* Correlation Matrix Plot
* Scatter Plot Matrix 

### Correlation plot

In [None]:
# Correlation plot
cormat = df.corr()
plt.figure(figsize=(12,10))
g = sns.heatmap(cormat, annot=True, cmap='coolwarm',  # alternative cmap: "RdYlGn"
                vmin=-1, vmax=1)

### Correlation analysis results
* Age vs. Pregnancies : 0.54
* Glucose vs. Outcome : 0.47
* SkinThickness vs. Insulin : 0.44
* SkinThickness vs. BMI : 0.39

> ### **[DIY-1] The highly correlated variables need a more detailed visualization.**
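
Reading the strongest pairs off a heatmap by eye is error-prone. One possible way to list them programmatically is sketched below on a made-up frame; the helper name `top_correlations` is hypothetical and not part of the original notebook:

```python
import numpy as np
import pandas as pd

def top_correlations(frame: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n column pairs with the largest absolute correlation."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Rank by magnitude while keeping the sign of each coefficient.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Made-up data for illustration: b is exactly 2 * a.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})

top = top_correlations(toy, n=2)
print(top)
```

Calling `top_correlations(df, n=4)` on the diabetes data would produce a ranked list to compare against the heatmap.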

**Scatter Plot Matrix**

A scatter plot shows the relationship between two variables as dots in two dimensions, one axis for each attribute.

In [None]:
# Import required package 
from pandas.plotting import scatter_matrix
plt.rcParams['figure.figsize'] = [12, 12]

# Plotting Scatterplot Matrix
scatter_matrix(df)
plt.show()

In [None]:
df.columns

In [None]:
df.info()

In [None]:
# Scatter plot
# import seaborn as sns
sns.pairplot(df)

In [None]:
# Scatter plot
# import seaborn as sns
df_temp = df.copy()
df_temp['Outcome'] = df_temp['Outcome'].replace([0, 1],['noDM', 'DM'])
sns.pairplot(df_temp, hue='Outcome')

In [None]:
df_temp['Outcome'].value_counts()

In [None]:
df.columns

### [DIY] More detailed visualization of the highly correlated variables

#### Scatter plots of the six highly correlated features

In [None]:
high_corr = ['Pregnancies', 'Glucose', 'SkinThickness', 'Insulin', 'BMI','Age', 'Outcome']

In [None]:
# Scatter plot
# import seaborn as sns
df_temp = df.copy()
df_temp['Outcome'] = df_temp['Outcome'].replace([0, 1],['noDM', 'DM'])
sns.pairplot(df_temp[high_corr], hue='Outcome') 

In [None]:
highest_corr = ['Pregnancies', 'Age', 'Outcome']

In [None]:
# Scatter plot
# import seaborn as sns
df_temp = df.copy()
df_temp['Outcome'] = df_temp['Outcome'].replace([0, 1],['noDM', 'DM'])
sns.pairplot(df_temp[highest_corr], hue='Outcome') 

***

## Advanced plots
### Violin plot by grouping data of columns
* Standardization of data (z-score normalization)

In [None]:
df_n=(df-df.mean())/df.std()
df_n

In [None]:
# Violin plot by grouping data of columns
# Standardization of data (z-score normalization)
df_n=(df-df.mean())/df.std()
y=df.Outcome
# Combine the original labels with the 8 standardized feature columns
df2=pd.concat([y, df_n.iloc[:,0:8]], axis=1)
y.shape,df2.shape
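
The standardization above subtracts each column's mean and divides by its standard deviation, so every feature should end up with mean 0 and standard deviation 1. A quick sanity check on a made-up frame (values are hypothetical) can confirm this; note that pandas `std()` uses the sample standard deviation (`ddof=1`) by default:

```python
import numpy as np
import pandas as pd

# Made-up feature values for illustration.
toy = pd.DataFrame({
    "Glucose": [85.0, 120.0, 150.0, 99.0],
    "BMI": [22.0, 31.5, 40.0, 27.3],
})

# Same z-score formula as in the cell above.
z = (toy - toy.mean()) / toy.std()

# After standardization each column has mean ~0 and sample std ~1.
means_ok = np.allclose(z.mean(), 0.0)
stds_ok = np.allclose(z.std(), 1.0)
print(means_ok, stds_ok)
```

Putting all features on one scale is what lets them share a single y-axis in the violin and swarm plots.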

In [None]:
df3=pd.melt(df2,id_vars='Outcome', var_name='features',value_name='values')
df3.head(), df3.shape

### pandas melt() 
- https://rfriend.tistory.com/278
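
As a small illustration of what `pd.melt()` does in the cell above, the sketch below reshapes a made-up two-row frame from wide to long format. Each feature column becomes a block of rows, with the column name stored in `features` and the cell value in `values`:

```python
import pandas as pd

# Made-up wide frame: one row per patient, one column per feature.
wide = pd.DataFrame({
    "Outcome": [0, 1],
    "Glucose": [-0.5, 0.5],
    "BMI": [0.7, -0.7],
})

# Long format: one row per (patient, feature) combination.
long_df = pd.melt(wide, id_vars="Outcome",
                  var_name="features", value_name="values")
print(long_df)
```

With 2 rows and 2 feature columns the result has 2 × 2 = 4 rows. The same rule gives `df3` one row per patient per feature, which is the layout seaborn expects for `x='features'`, `y='values'`.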

In [None]:
plt.figure(figsize=(10,10))
sns.violinplot(x='features', y='values', hue='Outcome', data=df3, split=True, inner='quart')
plt.xticks(rotation=45);

In [None]:
# Customize the seaborn plot style
sns.set(style='whitegrid', palette='muted')
# Reuse the long-format df3 dataframe built above
plt.figure(figsize=(10,10))
sns.swarmplot(x='features', y='values', hue='Outcome', data=df3)
plt.xticks(rotation=45);