# EDA of Pima diabetes data
- Exploratory Data Analysis
- > https://medium.com/@soumen.atta/analyzing-pima-indians-diabetes-data-using-python-89a021b5f4eb

## 1. Importing Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

## 2. Data Collection

In [None]:
# Get the data from github
url = "https://github.com/Redwoods/Py/raw/master/pdm2020/my-note/py-pandas/data/diabetes.csv"
df = pd.read_csv(url)

In [None]:
df.shape

In [None]:
df.head()

### Check & clean data
- Check for NaN or missing values
- Remove rows that contain null data

In [None]:
#CHECK FOR NULL VALUES
df.isnull().values.any(), df.isna().sum()
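
A per-column summary can make the missing-value check easier to read than a bare tuple. The following is a minimal sketch on a made-up toy frame (the values and the names `toy` and `missing` are assumptions for illustration, not the real diabetes data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the diabetes data
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "BMI": [33.6, 26.6, np.nan, np.nan],
    "Outcome": [1, 0, 1, 0],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})
print(missing)
```

The same two expressions should work on `df` directly, since `isna()` returns a boolean frame of the same shape.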

In [None]:
# No columns are dropped here; keep every column and drop rows with any missing values.
print(df.shape)
cols = df.columns   # named cols to avoid shadowing the built-in vars()
print(cols)
df = df[cols].dropna()   # equivalent to df.dropna(axis=0)
df.shape

In [None]:
# Remove duplicate samples
df.drop_duplicates(subset=df.columns, inplace=True) # rows that are identical across all columns are dropped

# After removing duplicates, check the total number of samples.
print('Total number of samples:', len(df))

## 3. Explore Data

In [None]:
df.describe()

In [None]:
df.describe().T

In [None]:
df.info()

### Check the balance of classes in the data through plot

In [None]:
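# Check the balance of the data through plot
classes = df.Outcome
# Pass the column by keyword; newer seaborn releases no longer accept it positionally as x
ax = sns.countplot(x=classes, label='count')
counts = classes.value_counts()
noDB, DB = counts.loc[0], counts.loc[1]  # look up by label instead of relying on sort order
print('False: non-diabetes', noDB)
print('True: diabetes', DB)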

In [None]:
classes.value_counts(), type(classes)
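
Raw counts are easier to judge next to their proportions. A minimal sketch, assuming a made-up `outcome` series (not the real labels) and using `value_counts(normalize=True)`:

```python
import pandas as pd

# Made-up labels for illustration: 0 = non-diabetes, 1 = diabetes
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0], name="Outcome")

# sort_index() keeps the rows in label order (0, then 1)
counts = outcome.value_counts().sort_index()
shares = outcome.value_counts(normalize=True).sort_index()

balance = pd.DataFrame({"count": counts, "share": shares})
balance.index = balance.index.map({0: "non-diabetes", 1: "diabetes"})
print(balance)
```

On the notebook's data the same pattern would apply to `df.Outcome`.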

***

### **Univariate plots:** 

* Histograms.
* Density Plots.
* Box and Whisker Plots.  

**Histograms** 

The distribution of each attribute can easily be visualized by plotting histograms.

In [None]:
plt.rcParams['figure.figsize'] = [12, 10]  # set the figure size

# Draw histograms for all attributes 
df.hist()
plt.show()

**Density Plots**

Another way to visualize the distribution of each attribute is density plots. 

In [None]:
# Density plots for all attributes
df.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
plt.show()

**Box and Whisker Plots** 

Box and whisker plots (or simply, boxplots) summarize the distribution of each attribute through its median, quartiles, and outliers.

In [None]:
# Draw box and whisker plots for all attributes 
df.plot(kind= 'box', subplots=True, layout=(3,3), sharex=False, sharey=False)
plt.show()
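
The points drawn beyond the whiskers can also be counted numerically. The sketch below assumes the common 1.5 × IQR rule (matplotlib's default whisker setting) and uses a made-up frame; `count_iqr_outliers` is a hypothetical helper, not part of pandas:

```python
import pandas as pd

def count_iqr_outliers(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] in each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()

# Made-up values: column "A" has one extreme point, column "B" has none
toy = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 100],
    "B": [10, 11, 12, 13, 14, 15, 16, 17, 18],
})
outliers = count_iqr_outliers(toy)
print(outliers)
```

Calling `count_iqr_outliers(df.drop(columns='Outcome'))` would give a per-feature count to read alongside the boxplots.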

***

## **Multivariate Plots:**

* Correlation Matrix Plot
* Scatter Plot Matrix 

### Correlation plot

In [None]:
#correlation plot
# cormat=df.corr()
plt.figure(figsize=(12,10))
g=sns.heatmap(df.corr(),annot=True,cmap='coolwarm', #cmap= "RdYlGn",
             vmin=-1, vmax=1)

### Correlation analysis results
* Age vs. Pregnancies : 0.54
* Glucose vs. Outcome : 0.47
* SkinThickness vs. Insulin : 0.44
* SkinThickness vs. BMI : 0.39

> ### **[DIY-1] The highly correlated variables need a more detailed visualization.**
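
Rather than reading the strongest pairs off the heatmap by eye, they can be ranked programmatically. This is a minimal sketch on made-up columns `x`, `y`, `z`; `top_correlations` is a hypothetical helper written for this note:

```python
from itertools import combinations

import pandas as pd

def top_correlations(frame, n=5):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=["feature_a", "feature_b", "r"])
    # Order by |r| so strong negative correlations are ranked too
    order = pairs["r"].abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n).reset_index(drop=True)

# Made-up data: y is an exact multiple of x, z trends in the opposite direction
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
})
top = top_correlations(toy, n=3)
print(top)
```

Applied to `df`, the first rows should correspond to the pairs listed above, assuming the heatmap values were read correctly.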

**Scatter Plot Matrix**

A scatter plot shows the relationship between two variables as dots in two dimensions, one axis for each attribute.

In [None]:
# Import required package 
from pandas.plotting import scatter_matrix
plt.rcParams['figure.figsize'] = [12, 12]

# Plotting Scatterplot Matrix
scatter_matrix(df)
plt.show()

In [None]:
# Scatter plot
# import seaborn as sns
sns.pairplot(df) #, hue="Outcome") #, markers=["o", "s"]) #,palette="husl")

In [None]:
# Scatter plot
# import seaborn as sns
df_temp = df.copy()
df_temp['Outcome'] = df_temp['Outcome'].replace([0, 1],['noDM', 'DM'])
sns.pairplot(df_temp, hue='Outcome')  # optional: markers=["o", "s"], palette="husl"

In [None]:
df_temp['Outcome'].value_counts()

### [DIY] More detailed visualization of the highly correlated variables

#### Scatter plots of the six highly correlated features

In [None]:
high_corr = ['Pregnancies', 'Glucose', 'SkinThickness', 'Insulin', 'BMI','Age', 'Outcome']

In [None]:
# Scatter plot
# import seaborn as sns
df_temp = df.copy()
df_temp['Outcome'] = df_temp['Outcome'].replace([0, 1],['noDM', 'DM'])
sns.pairplot(df_temp[high_corr], hue='Outcome') 

In [None]:
highest_corr = ['Pregnancies', 'Age', 'Outcome']

In [None]:
# Scatter plot
# import seaborn as sns
df_temp = df.copy()
df_temp['Outcome'] = df_temp['Outcome'].replace([0, 1],['noDM', 'DM'])
sns.pairplot(df_temp[highest_corr], hue='Outcome') 

***

# Checking data more!

In [None]:
df.head(10)

### Find the problems in the data above.
- Are there features for which a value of 0 is not valid?
- What value should those zeros be replaced with?

In [None]:
df.info()

In [None]:
# Pregnancies and Outcome can legitimately be 0, so exclude them ([1:-1]) and find the other columns that contain zeros
columns_with_zero = df.columns[(df==0).sum() > 0][1:-1]
columns_with_zero
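
The positional slice `[1:-1]` works only while `Pregnancies` and `Outcome` happen to be the first and last columns containing zeros. A sketch of a name-based alternative, on a made-up frame (the values and the `zero_allowed` list are assumptions for illustration):

```python
import pandas as pd

# Made-up rows using a subset of the dataset's column names
toy = pd.DataFrame({
    "Pregnancies": [6, 1, 0, 2],
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "Age": [50, 31, 32, 21],
    "Outcome": [1, 0, 1, 0],
})

# Columns where 0 is a valid value are excluded by name, not by position
zero_allowed = ["Pregnancies", "Outcome"]
zero_counts = (toy.drop(columns=zero_allowed) == 0).sum()
columns_with_zero = zero_counts[zero_counts > 0].index.tolist()
print(zero_counts)
print(columns_with_zero)
```

This also reports how many zeros each column holds, which helps decide whether imputation is reasonable.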

In [None]:
# Clean the data : zero2median()
# 1. Check zeros in features with Pregnancies, Outcome excluded.
# 2. Replace zero with NaN
# 3. Replace NaN with the median of the corresponding feature
def zero2median(df):
    df = df.copy()  # work on a copy so the caller's DataFrame is left unchanged
    # [1:-1] drops the first and last matches (Pregnancies, Outcome), where 0 is valid
    columns_with_zero = df.columns[(df==0).sum() > 0][1:-1]
    # Index(['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI'], dtype='object')
    df[columns_with_zero] = df[columns_with_zero].replace(0, np.nan)
    for feature in columns_with_zero:
        # The median is less sensitive to extreme values than the mean
        df[feature] = df[feature].fillna(df[feature].median())

    return df

# Make a clean DataFrame, df2, from df
df2 = zero2median(df)

In [None]:
df2.head(10)
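
Whether zeros are filled with the mean or the median can matter for skewed features. The sketch below uses made-up insulin readings (not values from the dataset) to show how far the two fill values can drift apart:

```python
import pandas as pd

# Made-up readings; 0 marks a missing measurement
insulin = pd.Series([0, 100, 120, 0, 800], name="Insulin")

# Turn zeros into NaN so they are ignored by mean() and median()
observed = insulin.mask(insulin == 0)

mean_filled = observed.fillna(observed.mean())
median_filled = observed.fillna(observed.median())

print("mean fill value:  ", observed.mean())
print("median fill value:", observed.median())
```

With one large reading in the sample, the mean is pulled well above the typical value while the median stays close to it.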

In [None]:
# Scatter plot
# import seaborn as sns
df2_temp = df2.copy()
df2_temp['Outcome'] = df2_temp['Outcome'].replace([0, 1],['noDM', 'DM'])
sns.pairplot(df2_temp[high_corr], hue='Outcome') 

---

# Use df2 in a Streamlit web app.