
# **Analyzing Pima-Indians-Diabetes-Data** 

> https://medium.com/@soumen.atta/analyzing-pima-indians-diabetes-data-using-python-89a021b5f4eb

In [None]:
# Load the required packages 
import numpy as np 
import pandas as pd 
from pandas import read_csv
import matplotlib.pyplot as plt

**Load CSV file using Pandas**

In [None]:
# Specify the file name 
url = "https://github.com/Redwoods/Py/raw/master/pdm2020/my-note/py-pandas/data/diabetes.csv"
filename = url
# filename = 'diabetes.csv'  # access to local file

# Read the data 
data = read_csv(filename) 

# Print the shape 
data.shape

In [None]:
# Print the first 5 rows 
data.head()


In [None]:
# Show the type of 'data'
type(data) 

In [None]:
# Get the column names 
col_idx = data.columns
col_idx

In [None]:
# Get row indices 
row_idx = data.index
print(row_idx)

In [None]:
# Find data type for each attribute 
print("Data type of each attribute:")
data.dtypes

In [None]:
data.info()

## Check data
- null
- NaN
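
As a minimal sketch of what the check in the next cell reports, the toy frame below (made-up values and column names chosen only for illustration, not read from diabetes.csv) shows that `isna().sum()` counts NaN/None markers per column. A value such as 0 is treated as ordinary data, so a column can pass this check and still need a separate look.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values (hypothetical, not read from diabetes.csv)
toy = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 0.0],
    "BMI": [33.6, 26.6, None, 28.1],
})

# isna() flags NaN/None cell by cell; sum() counts the flags per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# isnull() is an alias of isna(), so both give the same counts
assert toy.isnull().sum().equals(missing_per_column)

# Zeros are not counted as missing; they need their own check
zeros_per_column = (toy == 0).sum()
print(zeros_per_column)
```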

In [None]:
# Check NaN
data.isna().sum()
# data.isnull().sum()

**Descriptive Statistics using Pandas**  

The *describe()* function on the Pandas DataFrame lists 8 statistical properties of each attribute. They are:

- Count
- Mean
- Standard Deviation
- Minimum Value
- 25th Percentile
- 50th Percentile (Median)
- 75th Percentile
- Maximum Value
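
To see how these eight properties appear in the output, here is a small sketch on a single toy column (made-up values, not taken from the dataset). The row labels of the result are the names pandas gives to the eight statistics.

```python
import pandas as pd

# Toy column with made-up values (hypothetical, not read from diabetes.csv)
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})

summary = toy.describe()
print(summary)

# The index holds the names of the eight statistics
print(list(summary.index))

# 'std' is the sample standard deviation (ddof=1)
assert abs(summary.loc["std", "x"] - toy["x"].std(ddof=1)) < 1e-12
```

Individual values can be read with `.loc`, for example `summary.loc["50%", "x"]` for the median.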

In [None]:
# Generate statistical summary 
description = data.describe()
print("Statistical summary of the data:\n")
description

# EDA (Exploratory Data Analysis)
- Exploratory data analysis
- Reference (in Korean): https://medium.com/mighty-data-science-bootcamp/eda-%EB%8D%B0%EC%9D%B4%ED%84%B0-%EC%84%A4%EB%AA%85%EC%84%9C%EC%97%90%EC%84%9C-%EC%8B%9C%EC%9E%91%ED%95%98%EA%B8%B0-230060b9fc17

**Distribution of the *Outcome* attribute**

The data considered here is a classification dataset, so it is worth checking how the values of the *Outcome* attribute are distributed. Pandas can compute this breakdown with a single group-by.

In [None]:
class_counts = data.groupby('Outcome').size() 
print("Class breakdown of the data:\n")
print(class_counts)

In [None]:
v,c=np.unique(data['Outcome'], return_counts=True)
v,c

The dataset has 768 entries in total. *Outcome* is 1 for 268 of them and 0 for the remaining 500.
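
The same breakdown can be expressed as proportions. The sketch below rebuilds the labels from the counts stated here (768 entries, 268 of them with *Outcome* 1) instead of reading the file; on the real data, `data['Outcome'].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Labels rebuilt from the stated counts: 500 zeros and 268 ones (768 in total)
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

counts = outcome.value_counts().sort_index()
shares = outcome.value_counts(normalize=True).sort_index()

print(counts)
print(shares.round(3))
```

Roughly 35% of the entries belong to class 1, so the two classes are moderately imbalanced.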

**Correlation between all pairs of attributes:** 

We can use the *corr()* function on the Pandas DataFrame to calculate a correlation matrix. *Pearson’s Correlation Coefficient* is used here. It measures the strength of the linear relationship between two attributes and is most informative when the attributes are roughly normally distributed. A correlation of -1 or 1 indicates a perfect negative or positive linear correlation, respectively, while a value of 0 indicates no linear correlation at all.
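
As a sketch of what the coefficient measures, the toy example below (made-up values, not from the dataset) computes Pearson's r directly from its definition, the covariance of two columns divided by the product of their standard deviations, and compares it with the value returned by *corr()*.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values (hypothetical, not read from diabetes.csv)
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.1, 5.9, 8.2, 9.8],
})

# Pearson's r = covariance(x, y) / (std(x) * std(y)),
# written here with deviations from the mean
dx = toy["x"] - toy["x"].mean()
dy = toy["y"] - toy["y"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity as computed by pandas
r_pandas = toy.corr(method="pearson").loc["x", "y"]
print(r_manual, r_pandas)
```

The two numbers agree up to floating-point rounding, and the value is close to 1 because `y` is almost a straight-line function of `x`.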

In [None]:
# Compute correlation matrix 
correlations = data.corr(method = 'pearson') 
print("Correlations of attributes in the data:\n") 
correlations

In [None]:
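# Largest and smallest entries of the whole matrix (the diagonal makes the maximum 1.0)
correlations.to_numpy().max(), correlations.to_numpy().min()

In [None]:
# Largest and smallest correlation in each column
correlations.max(), correlations.min()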

In [None]:
data.columns

In [None]:
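fig, ax = plt.subplots(1, 1, figsize=(10, 8))
img = ax.imshow(correlations, cmap='hot', interpolation='nearest')
# Fix the tick positions first so that each label lines up with its row/column
ticks = np.arange(len(correlations.columns))
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(correlations.columns, rotation=90)
ax.set_yticklabels(correlations.columns)
fig.colorbar(img)
plt.show()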

**Skew of attribute distributions** 

The skew of each attribute can be calculated using the *skew()* function on the Pandas DataFrame. 

In [None]:
skew = data.skew() 
print("Skew of attribute distributions in the data:\n") 
skew

A positive value represents a right-skewed distribution, and a negative value denotes a left-skewed distribution. Values closer to zero correspond to a less skewed distribution.
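
The sign convention can be checked on three small toy samples (made-up values, not from the dataset): a symmetric one, one with a long right tail, and its mirror image.

```python
import pandas as pd

# Toy samples with made-up values (hypothetical, not read from diabetes.csv)
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tail = pd.Series([1, 1, 2, 2, 3, 20])  # one large value stretches the right tail
left_tail = -right_tail                      # mirror image of the sample above

print(symmetric.skew())   # zero for a symmetric sample
print(right_tail.skew())  # positive
print(left_tail.skew())   # negative, same magnitude as right_tail
```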

# Visualizing data using Pandas

We now visualize the data with the plotting functions built into Pandas, covering both univariate and multivariate plots.

**Univariate plots:** 

* Histograms.
* Density Plots.
* Box and Whisker Plots.  

**Histograms** 

The distribution of each attribute can easily be visualized by plotting histograms.
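
A histogram is a count of values per bin. The sketch below shows those counts for a toy array (made-up values, not from the dataset) using `np.histogram` with 10 equal-width bins, which is also the default number of bins in `DataFrame.hist`.

```python
import numpy as np

# Toy values (hypothetical, not read from diabetes.csv)
values = np.array([21, 22, 22, 25, 29, 31, 33, 41, 50, 81])

# 10 equal-width bins between the minimum and the maximum
counts, edges = np.histogram(values, bins=10)

print(counts)  # number of values falling in each bin
print(edges)   # 11 bin edges for 10 bins
```

Each bar drawn by the plotting call corresponds to one entry of `counts`, placed between two consecutive entries of `edges`.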

In [None]:
plt.rcParams['figure.figsize'] = [12, 10]  # set the figure size

# Draw histograms for all attributes 
data.hist()
plt.show()

**Density Plots**

Another way to visualize the distribution of each attribute is density plots. 
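
A density plot is a smoothed version of a histogram. Pandas delegates the estimate to SciPy, so the following is only a hand-rolled approximation in numpy, assuming a Gaussian kernel and a rule-of-thumb bandwidth. The sample values and the bandwidth choice are assumptions made for this sketch.

```python
import numpy as np

# Toy sample (hypothetical, not read from diabetes.csv)
sample = np.array([21.0, 22.0, 22.0, 25.0, 29.0, 31.0, 33.0, 41.0, 50.0, 81.0])

# Bandwidth from Scott's rule of thumb (an assumption for this sketch)
n = sample.size
bandwidth = sample.std(ddof=1) * n ** (-1 / 5)

# Evaluate the estimate on a grid wide enough to cover both tails
grid = np.linspace(sample.min() - 5 * bandwidth, sample.max() + 5 * bandwidth, 2000)

# The estimate is the average of one Gaussian bump per observation
z = (grid[:, None] - sample[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# A density integrates to roughly 1 over the grid
area = (density * (grid[1] - grid[0])).sum()
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual observations more closely.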

In [None]:
# Density plots for all attributes
data.plot(kind='density', subplots=True, layout=(3,3), sharex=False)
plt.show()

**Box and Whisker Plots** 

A box and whisker plot (or simply, boxplot) summarizes the distribution of each attribute by drawing a line at the median (middle value) and a box from the 25th to the 75th percentile (the middle 50% of the data). The whiskers give an idea of the spread of the data, and dots outside the whiskers mark candidate outliers: values lying more than 1.5 times the interquartile range (the height of the box) beyond the edges of the box.
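
The outlier rule can be sketched numerically. With the interquartile range (IQR) defined as the 75th percentile minus the 25th percentile, a point is flagged when it lies more than 1.5 × IQR below the 25th or above the 75th percentile; 1.5 is also the default whisker factor in matplotlib. The values below are made up for illustration.

```python
import numpy as np

# Toy values (hypothetical, not read from diabetes.csv)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# Quartiles and interquartile range
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences beyond which a point is drawn as a separate dot
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3)
print(lower_fence, upper_fence)
print(outliers)
```

Only the value 50 falls outside the fences here, so it is the single candidate outlier.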

In [None]:
# Draw box and whisker plots for all attributes 
data.plot(kind= 'box', subplots=True, layout=(3,3), sharex=False, sharey=False)
plt.show()

**Multivariate Plots:**

* Correlation Matrix Plot
* Scatter Plot Matrix 

**Correlation Matrix Plot**

The correlation matrix returned by the *corr()* function on the Pandas DataFrame can also be drawn as an image, which makes strongly related pairs of attributes easier to spot than in a table of numbers. *Pearson’s Correlation Coefficient* is used here, so each entry measures the strength of the linear relationship between two attributes. A correlation of -1 or 1 indicates a perfect negative or positive linear correlation, respectively, while a value of 0 indicates no linear correlation at all.

In [None]:
# Compute the correlation matrix 
correlations = data.corr(method='pearson')  # correlations between all pairs of attributes
# Print the datatype (a bare expression is only displayed when it is the last line of a cell)
print(type(correlations))
# Show the correlation matrix
correlations

In [None]:
# Plot the correlation matrix
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)  # optionally add cmap='coolwarm'
fig.colorbar(cax)
names = correlations.columns
ticks = np.arange(len(names))  # one tick per attribute
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names, rotation=90)  # rotate x-tick labels so they do not overlap
ax.set_yticklabels(names)
plt.show()

**Scatter Plot Matrix**

A scatter plot shows the relationship between two variables as dots in two dimensions, one axis for each attribute. Drawing the scatter plots for all pairs of attributes together gives a scatter plot matrix. Scatter plots are useful for spotting structured relationships between variables. Attributes with a structured relationship are often correlated as well, in which case one attribute of the pair may be a candidate for removal from the dataset.
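
One possible way to turn this visual check into a list of candidate pairs is to filter the correlation matrix by absolute value. The sketch below uses synthetic columns and an arbitrary threshold of 0.9; both are assumptions made for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic columns (hypothetical, not read from diabetes.csv)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # nearly a linear copy of 'a'
    "c": rng.normal(size=200),                     # unrelated noise
})

threshold = 0.9  # arbitrary cut-off chosen for this sketch
corr = toy.corr().abs()

# Keep each pair once by masking the diagonal and the lower triangle
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Pairs whose absolute correlation exceeds the cut-off
strong_pairs = pairs[pairs > threshold]
print(strong_pairs)
```

Only the pair (`a`, `b`) is reported here. Whether one attribute of such a pair should actually be dropped depends on the model and the goal of the analysis.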

In [None]:
# Import required package 
from pandas.plotting import scatter_matrix
plt.rcParams['figure.figsize'] = [12, 12]

# Plotting Scatterplot Matrix
scatter_matrix(data)
plt.show()

## seaborn
> A statistical visualization library built on top of matplotlib

In [None]:
import seaborn as sns

In [None]:
# Correlation plot: draw the correlation matrix as a heatmap
# plt.figure(figsize=(10, 10))  # uncomment to enlarge the figure
sns.heatmap(correlations,
            xticklabels=correlations.columns,
            yticklabels=correlations.columns,
            vmin=-1, vmax=1)
plt.show()

In [None]:
# Scatter plot
# import seaborn as sns
sns.pairplot(data)