# Exploratory Data Analysis 

This notebook collects building blocks of the exploratory data analysis (EDA) process for a first look at your data. It is not a complete EDA, but a preliminary examination. Use the Table of Contents to jump to a specific area.
***

##### **Input:** .csv file with the entire dataset. If the dataset contains NaN values, interpolate or remove them before using unsupervised learning
##### **Output:** Figures for EDA
##### **Dependencies:** pandas, seaborn, matplotlib, scikit-learn, scipy, missingno
***

##### Format of input: 
.csv file with entire dataset 
***

**Check:** 
* Will need to interpolate data/remove NaN before doing any unsupervised learning for EDA
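
A minimal sketch of that check, assuming numeric columns where linear interpolation between neighbouring rows is reasonable. The `demo` frame and its column names are made up for illustration and stand in for `data`; whether to interpolate or drop rows depends on your dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame standing in for `data`
demo = pd.DataFrame({
    'hr': [60.0, np.nan, 64.0, 66.0],
    'steps': [10.0, 12.0, np.nan, 16.0],
})

# Option 1: linearly interpolate gaps between known values
demo_interp = demo.interpolate(method='linear', limit_direction='both')

# Option 2: drop every row that contains a NaN
demo_dropped = demo.dropna()

print(demo_interp)
print(demo_dropped)
```

Interpolation keeps every row but invents values, while dropping rows keeps only observed values at the cost of sample size.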

**Sources:**

***
***

## Table of Contents

#### Exploratory Data Analysis
* [Cleaning and Filtering Data](#read)
* [Correlation Plots](#corr)
* [Covariance Matrix](#cov)
* [Missing Data Analysis](#miss)
* [Outlier Analysis](#out)
* [Histograms of Features](#hist)

#### Unsupervised Learning
* [Clustering](#cluster)
    * [K-means Clustering](#knn)
    * [Hierarchical Clustering](#hic)
* [Principal Component Analysis (PCA)](#pca)


***

## Read data:
<a id="read"></a>

In [None]:
import pandas as pd
data = pd.read_csv('/Users/joey/Desktop/DBDP/STEP-data/deidentified_data.csv')  #Change filename

## Preliminary Exploratory Data Analysis:

https://github.com/dformoso/sklearn-classification/blob/master/Data%20Science%20Workbook%20-%20Census%20Income%20Dataset.ipynb

In [None]:
len(data)

In [None]:
data.describe()

### Correlation Plots
<a id="corr" ></a>


In [None]:
%matplotlib inline

In [None]:
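import seaborn as sns

# Correlation plot (numeric columns only; correlation is undefined for non-numeric data)
corr = data.select_dtypes(include='number').corr()
sns.heatmap(corr,
        xticklabels=corr.columns,
        yticklabels=corr.columns)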
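
A heatmap can be hard to read with many features, so it may help to list the most strongly correlated pairs as a table. The following is a sketch using only pandas and NumPy; the `demo` frame and its column names are hypothetical and stand in for `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame standing in for `data`
demo = pd.DataFrame({
    'a': np.arange(10.0),
    'b': 2 * np.arange(10.0) + 1,  # perfectly correlated with 'a'
    'c': [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0],
})

corr = demo.corr()

# Keep only the upper triangle (excluding the diagonal) so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Order the pairs by absolute correlation, strongest first
top_pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(top_pairs)
```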

## Covariance Matrix
<a id="cov"></a>

Compute the pairwise covariance of columns.

Covariance measures how much two random variables vary together. Where variance describes how a single variable varies, covariance describes how two variables vary together.
*Covariance depends on the units of each variable, so standardize the features before comparing values.*

- Python: https://www.geeksforgeeks.org/python-pandas-dataframe-cov/
- Math/Interpretation: https://www.statisticshowto.datasciencecentral.com/covariance/
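
To illustrate why scaling matters: raw covariance grows with the units of each column, but once every column is standardized to zero mean and unit variance, the covariance matrix equals the correlation matrix. A small sketch with made-up data (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical demo data in different units
demo = pd.DataFrame({
    'height_cm': [150.0, 160.0, 170.0, 180.0, 190.0],
    'weight_kg': [52.0, 58.0, 69.0, 77.0, 90.0],
})

# Raw covariance: values depend on the units of each column
raw_cov = demo.cov()

# Standardize with the sample standard deviation (ddof=1),
# the same convention DataFrame.cov uses
z = (demo - demo.mean()) / demo.std()
scaled_cov = z.cov()

# After standardizing, covariance and correlation coincide
same_as_corr = np.allclose(scaled_cov, demo.corr())
print(raw_cov, scaled_cov, same_as_corr, sep='\n\n')
```

Note that scikit-learn's `StandardScaler` divides by the population standard deviation (ddof=0), so a covariance matrix computed from its output differs from the correlation matrix by a factor of n/(n-1).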

In [None]:
from sklearn.preprocessing import StandardScaler

# Covariance is scale-dependent, so standardize the features first
cv_df = data.drop(columns=[])  # list all non-numeric columns to drop here
sc = StandardScaler()
cv_np = sc.fit_transform(cv_df)
cv_df = pd.DataFrame(cv_np, columns=cv_df.columns)  # keep the original column names

# Pairwise covariance of the standardized columns
cv_df.cov()

## Check for missing values
<a id="miss"></a>
#### Very cool package for missing values (includes heatmaps, bar graphs, and matrices of missing values):
https://github.com/ResidentMario/missingno

In [None]:
import missingno as msno

#Check for missing data
msno.matrix(data)
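
Alongside the matrix plot, a plain numeric summary of missingness per column can be useful. A minimal sketch using only pandas; the `demo` frame is hypothetical and stands in for `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical demo frame standing in for `data`
demo = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': [np.nan, np.nan, 1.0, 2.0],
    'c': [1.0, 2.0, 3.0, 4.0],
})

# Count and percentage of missing values per column, worst column first
missing_summary = pd.DataFrame({
    'n_missing': demo.isna().sum(),
    'pct_missing': demo.isna().mean() * 100,
}).sort_values('pct_missing', ascending=False)
print(missing_summary)
```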

## Plot Distribution of Each Feature
<a id="dist"></a>

### Outcome variable 

To look at how the outcome variable is balanced:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

outcome_col = 'outcome'  # placeholder: replace with the name of your outcome column

sns.set_style('whitegrid')
fig = plt.figure(figsize=(20, 1))
sns.countplot(y=outcome_col, data=data);

### Plot distributions by outcome class 
<a id="dist-class"></a>

In [None]:
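import seaborn as sns
import matplotlib.pyplot as plt

outcome_col = 'outcome'  # placeholder: name of your outcome column
feature = 'X1'           # placeholder: name of the feature to plot

# Split the dataframe by outcome class and overlay one density curve per class
# (kdeplot + rugplot replace the deprecated sns.distplot)
for outcome_label, group in data.groupby(outcome_col):
    sns.kdeplot(group[feature], label=str(outcome_label))
    sns.rugplot(group[feature])

plt.title('Distribution of ' + feature + ' by ' + outcome_col)
plt.xlabel(feature)
plt.legend(title=outcome_col)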

Plot all variables at once:
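
One possible way to do this is pandas' built-in `DataFrame.hist`, which draws one histogram per numeric column on a grid of subplots. The sketch below uses a synthetic `demo` frame (hypothetical column names) in place of `data`:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic demo frame standing in for `data`
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'x1': rng.normal(size=200),
    'x2': rng.exponential(size=200),
    'x3': rng.uniform(size=200),
})

# One histogram per numeric column, arranged on a grid of subplots
axes = demo.hist(bins=20, figsize=(12, 6))
plt.tight_layout()
```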

### Outlier Analysis
<a id="out"></a>

In [None]:
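y_col = 'variablehere'  # placeholder: numeric column to examine
x_col = 'variablehere'  # placeholder: grouping (categorical) column
sns.boxplot(y=y_col, x=x_col, data=data, palette="Set1")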
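
A box plot marks points beyond 1.5 times the interquartile range (IQR) from the quartiles. The same rule can be applied numerically to list the flagged values. This is a sketch with a made-up series; the 1.5 multiplier is the conventional default, not a requirement.

```python
import pandas as pd

# Hypothetical values with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 50], name='demo')

# Quartiles and interquartile range
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```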

## Plot histograms of all variables in data
<a id="hist"></a>

In [None]:
def makehist(datainput, label, color, filesource=''):
    # filesource: folder/prefix for the saved figure (default: working directory)
    fig = plt.figure(figsize=(16, 4))
    mean = datainput.mean()
    plt.hist(datainput.dropna(), bins=20, align='mid', color=color, alpha=0.5)
    plt.axvline(x=mean, color=color, linestyle='-')  # vertical line at the mean
    plt.xlabel(label)
    plt.ylabel('Frequency')
    plt.title(label + ' Histogram')
    plt.tight_layout()
    plt.savefig(filesource + label + '_hist.png', dpi=100)
    print('Saved plot of ' + label)

In [None]:
# makehist(data['Variable'], 'Variable', 'green')
#Repeat above command for each numeric Variable in data

## Plot boxplot of all variables in data

In [None]:
def makebox(datainput, label, filesource=''):
    # filesource: folder/prefix for the saved figure (default: working directory)
    fig = plt.figure(figsize=(16, 4))
    plt.boxplot(datainput.dropna())
    plt.title(label + ' Box Plot')
    plt.savefig(filesource + label + '_box.png', dpi=100)
    print('Saved plot of ' + label)

In [None]:
# makebox(data['Variable'], 'Variable')
#Repeat above command for each numeric Variable in data

## Plot leafplot of all variables in data

In [None]:
def makeleaf(datainput, label, filesource=''):
    # Note: plt.stem draws a stem (lollipop) plot of the values in row order
    fig = plt.figure(figsize=(16, 4))
    plt.stem(datainput.to_numpy())
    plt.title(label + ' Leaf Plot')
    plt.savefig(filesource + label + '_leaf.png', dpi=100)
    print('Saved plot of ' + label)

In [None]:
# makeleaf(data['Variable'], 'Variable')
#Repeat above command for each numeric Variable in data

## Plot bubble chart of all variables in data

In [None]:
def makebubble(x, y, s, label, filesource=''):
    # x, y: pandas Series to plot; s: Series controlling the marker sizes
    fig = plt.figure(figsize=(16, 4))
    plt.scatter(x=x, y=y, s=s)
    plt.title(label + ' Bubble plot')
    plt.xlabel(x.name)  # defaults to the Series name; replace with custom text if desired
    plt.ylabel(y.name)
    plt.savefig(filesource + label + '_bubble.png', dpi=100)
    print('Saved plot of ' + label)
    
# fig = px.scatter(df.query(""), x="statistics", y = "Medical Methods", size = "pop", color="corr.columns")  # need to change for color's names

In [None]:
# makebubble(data['Variable 1'], data['Variable 2'], data['Size Variable'], 'Variable 1 vs Variable 2')
#Repeat above command for each numeric Variable in data

## Plot run chart of all variables in data

In [None]:
def makerun(xAxis, yAxis, label, filesource=''):
    # xAxis, yAxis: pandas Series; label: string used in the title and file name
    fig = plt.figure(figsize=(16, 4))
    plt.plot(xAxis, yAxis)
    plt.title(label + ' Run Chart')
    plt.xlabel(xAxis.name)  # defaults to the Series name; replace with custom text if desired
    plt.ylabel(yAxis.name)
    plt.savefig(filesource + label + '_run.png', dpi=100)
    print('Saved Run Chart of ' + label)

In [None]:
# makerun(data['Variable'], data['Variable'], 'Variable')
#Repeat above command for each numeric Variable in data

## Plot multivariate chart of all variables in data

In [None]:
def makemultivariate(var1, var2, label, filesource=''):
    # var1, var2: pandas Series; label: string used in the title and file name
    fig = plt.figure(figsize=(16, 4))
    plt.plot(var1, var2)
    plt.title(label + ' Multivariate Chart')
    plt.xlabel(var1.name)  # defaults to the Series name; replace with custom text if desired
    plt.ylabel(var2.name)
    plt.savefig(filesource + label + '_multivariate.png', dpi=100)
    print('Saved Multivariate Chart of ' + label)

In [None]:
# makemultivariate(data['Variable'], data['Variable'], 'Variable')
#Repeat above command for each numeric Variable in data

## Plot scatterplot of all variables in data

In [None]:
def makescatter(x, y, label, filesource=''):
    # x, y: pandas Series; label: string used in the title and file name
    fig = plt.figure(figsize=(16, 4))
    plt.scatter(x=x, y=y)
    plt.title(label + ' Scatter plot')
    plt.xlabel(x.name)  # defaults to the Series name; replace with custom text if desired
    plt.ylabel(y.name)
    plt.savefig(filesource + label + '_scatter.png', dpi=100)
    print('Saved plot of ' + label)

In [None]:
# makescatter(data['Variable'], data['Variable'], 'Variable')
#Repeat above command for each numeric Variable in data

## Examples of Visualizations
To try the examples below, first run the cells that define the plotting helpers above (`makebox`, `makeleaf`, `makebubble`, `makerun`, `makemultivariate`, `makescatter`), uncommenting their `def` lines if they are commented out.

### Box plot

In [None]:
makebox(data['Skin Tone'], 'Skin Tone')

### Leaf plot

In [None]:
makeleaf(data['Apple Watch'], 'Apple Watch')

### Bubble Chart

In [None]:
makebubble(data["ECG"], data['Apple Watch'], data['Skin Tone'], "ECG")

### Run Chart

In [None]:
makerun(data['Apple Watch'], data['ECG'], "Apple Watch vs ECG Data")

### Multivariate Chart

In [None]:
makemultivariate(data['ECG'], data['Apple Watch'], 'Apple Watch vs ECG Data')

### Scatter Plot

In [None]:
makescatter(data['ECG'], data['Apple Watch'], 'ECG Data vs Apple Watch')

## Clustering
<a id="cluster"></a>

https://www.neuroelectrics.com/blog/clustering-methods-in-exploratory-analysis/

In [None]:
dfc = data.drop(columns=[]) # drop all non-numeric columns
dfc.head()

### K-means Clustering:
<a id="knn"></a>

In [None]:
from sklearn.cluster import KMeans

# Create the k-means object (n_clusters=3 is a starting guess; adjust for your data)
kmeans = KMeans(n_clusters=3)

# Fit k-means to the data once and save the cluster label of each row for charting
y_km = kmeans.fit_predict(dfc)

# Print the location of the cluster centers learned by the k-means object
# print(kmeans.cluster_centers_)

In [None]:
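labels = kmeans.labels_
# Store the labels on a copy so dfc stays feature-only for the hierarchical clustering below
dfc_kmeans = dfc.copy()
dfc_kmeans['clusters'] = labels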
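
The cell above fixes the number of clusters at 3. One common heuristic for choosing it is the elbow method: fit k-means for several values of k and look for the point where the within-cluster sum of squares (`inertia_`) stops dropping sharply. A sketch on synthetic data, where `X` is illustrative only and stands in for `dfc`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic, well-separated 2-D blobs (illustrative data only)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# Fit k-means for k = 1..6 and record the within-cluster sum of squares
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

for k, inertia in zip(range(1, 7), inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against k makes the elbow easier to see; for these blobs the curve should flatten after k=3.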

### Hierarchical Clustering
<a id="hic"></a>

Agglomerative clustering: data points are clustered using a bottom-up approach, starting with individual data points.


https://stackabuse.com/hierarchical-clustering-with-python-and-scikit-learn/

In [None]:
import scipy.cluster.hierarchy as shc
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 7))
plt.title("Data Dendrograms")  # show the dendrogram first to judge how many clusters to use
dend = shc.dendrogram(shc.linkage(dfc, method='ward'))
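
Cluster labels can also be read directly from the linkage matrix behind the dendrogram: SciPy's `fcluster` cuts the tree into a requested number of clusters. A sketch on synthetic data, where `X` is illustrative only and stands in for `dfc`:

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Three synthetic, well-separated 2-D blobs (illustrative data only)
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# Build the Ward linkage matrix (the same input the dendrogram uses)
Z = shc.linkage(X, method='ward')

# Cut the tree so that at most 3 flat clusters remain
labels = shc.fcluster(Z, t=3, criterion='maxclust')
print(np.unique(labels, return_counts=True))
```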

In [None]:
# begin clustering
from sklearn.cluster import AgglomerativeClustering

# Ward linkage always uses Euclidean distance, so no distance argument is needed
cluster = AgglomerativeClustering(n_clusters=3, linkage='ward')
cluster.fit_predict(dfc)  # returns the cluster label of every row

In [None]:
plt.figure(figsize=(10,7))
plt.scatter(dfc.iloc[:, 0], dfc.iloc[:, 1], c=cluster.labels_, cmap='rainbow')  # plot the first two features, colored by hierarchical cluster