%%html
<style>
li{
  margin: 10px 0;
}
</style>
<h2>EDA Analysis Pipeline</h2>
Authors: Stephan, Yijoon, Hans, Frank

<h3>Index</h3>
<ol type="I">
<li><a href='#H1'>Defining the goal/problem</a></li>  
<li><a href='#H2'>Fetching and data sanitation</a></li>  
<li><a href='#section3'>Understand and visualize the data</a></li>
<li><a href='#section4'>Analyze the data</a></li>
<li><a href='#section5'>Interpret results</a></li>
<li><a href='#section6'>Iterate and refine</a></li>
<li><a href='#section7'>Save the data of your analysis</a></li>
</ol>

---
<a id='H1'></a>
## 1. Defining the goal/problem
```What is the purpose of this analysis?  ```

The goal is to build a model that accurately predicts lifespan from the features in the data, and to use those predictions to set life insurance premiums in an ethical way.

In [None]:
# all libraries required for the entire EDA

import requests
import json
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns # Samuel Norman "Sam" Seaborn is a fictional character portrayed by Rob Lowe on the television serial drama The West Wing. Hence: sns
from seaborn_qqplot import pplot

import datetime as dt
from scipy import stats
from sklearn import linear_model 
from sklearn.model_selection import train_test_split
from sklearn.cluster import DBSCAN

%matplotlib inline
sns.set(color_codes=True)

---
<a id='H2'></a>
## 2. Fetching and data sanitation
```
- a. collect the data 
- b. check for: errors, missing values (NaN), data types (object, float, int), duplicates and other inconsistencies
- c. clean the data: remove duplicates and remove irrelevant information/columns
```
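
The checks in step b can be gathered into one small helper. The sketch below is only an illustration and not part of the original pipeline: the function name `quality_report` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, missing values and distinct values per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isnull().sum(),
        "n_unique": frame.nunique(),
    })


# Hypothetical toy data: one NaN and one fully duplicated row.
toy = pd.DataFrame({
    "smoking": [10.0, 10.0, np.nan, 3.0],
    "lifespan": [70.1, 70.1, 82.4, 79.0],
})

report = quality_report(toy)
n_duplicates = int(toy.duplicated().sum())
print(report)
print("duplicate rows:", n_duplicates)
```

Running the same helper on `df` right after loading gives a quick view of which columns need attention before any cleaning decision is taken.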

In [None]:
# for collecting data in csv form:

df = pd.read_csv('../data_csv', skipinitialspace=True)
df = pd.read_csv('data/new2.csv', index_col = 0)
df.head()

In [None]:
# for collecting data from REST API

# Make a request to a URL
response = requests.get('http://localhost:8080/medish_centrum_randstad/api/netlify?page=1')

file_contents= response.json()  #dictionary
print(type(file_contents))
print(len(file_contents))

df = pd.DataFrame.from_dict(file_contents['data']) #all the needed info was condensed into one data column called 'data'
display(df.head())
display(df.shape)

In [None]:
# ... select value to be looked at.

key = df.keys()

pipe = input('Please type the name of the column you want to look at: ')
if pipe in key:
    print(f'We will be looking at "{pipe}" this time.')
else:
    print('Please rerun this cell and supply a valid column key.')

In [None]:
# # ... get a loc  not used atm
# dx=df.columns.get_loc(pipe)
# print (dx)

In [None]:
df = df.copy()

Duplication Check

In [None]:
# show the fully duplicated rows, then drop them (the first occurrence is kept)
duplicateRows = df[df.duplicated()]
print(duplicateRows)
df = df.drop_duplicates()

NaN-check

In [None]:
# Why is a value missing, and is it missing at random? Decide whether to impute or delete (some decisions come later, during outlier analysis, but some can be taken now)
df.isnull().sum() 
#df.drop(indexes_list, inplace=True)
#df = df.dropna()

Check for special characters ??/>, delete and convert to int/float etc.
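
One possible way to do this is sketched below, assuming the polluted values arrive as strings. The column name `sugar` is borrowed from the data set, but the raw values and the `sugar_clean` column are made up for illustration.

```python
import pandas as pd

# Hypothetical raw column: numbers polluted with stray characters.
raw = pd.DataFrame({"sugar": ["12.5", "??7", "3/>", "abc", None]})

# Keep only digits, the decimal point and the minus sign.
stripped = raw["sugar"].str.replace(r"[^0-9.\-]", "", regex=True)

# errors="coerce" turns anything that is still not a number into NaN,
# so those rows show up in the NaN check instead of raising an error.
raw["sugar_clean"] = pd.to_numeric(stripped, errors="coerce")

print(raw)
```

Values that contain no digits at all end up as NaN, so they can be handled together with the other missing values.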

---
<a id='section3'></a>
## 3. Understand and visualize the data
```
- a. Examine structure and content: size,shape and type of variables
- b. There is great value in simply looking at the data: interquartile range, mean, median etc.
- c. Visualize the data with plots: histograms, box plots, scatter plots and heatmaps
- d. Identify any outliers, patterns, relationships or trends
- e. Decide to impute or delete the outliers
- f. Identify new features

```

Structure

In [None]:
print(df.head())

In [None]:
print(df.dtypes)
print(df.describe()) # count, mean, std, min, quartiles, max

Plots

In [None]:
#Quick Overall Graphical Overview (!warning, takes ~2min or more)
g = sns.PairGrid(df)
g.map(sns.lineplot)

In [None]:
#jointplot with distribution and regression line
sns.jointplot(data=df, x="alcohol", y="lifespan",marginal_kws=dict(bins=35), kind='reg')

In [None]:
#Show lineplot and/or jointplot, check for correlation (linear, positive/negative)
sns.lineplot(data=df, x='alcohol', y='lifespan')

In [None]:
# Create a 2x2 grid for 4 images
import cv2

# cv2 loads images as BGR; convert to RGB so matplotlib shows the right colours
pImg = cv2.cvtColor(cv2.imread("pics/aiHealth_01.jpg"), cv2.COLOR_BGR2RGB)

fig, axs = plt.subplots(2, 2, sharey=False, figsize=(18,15))

axs[0][0].imshow(pImg)

axs[0][1].scatter(df['sugar'], df['lifespan'])
axs[0][1].set_xlabel('sugar')
axs[0][1].set_ylabel('lifespan')

sns.lineplot(x=df['sugar'], y=df['lifespan'], ax=axs[1,0])
# catplot is figure-level and cannot draw into an existing axis, so use boxplot here
sns.boxplot(x=df['sugar'], ax=axs[1,1])

plt.show()

# pplot is figure-level as well: the Q-Q plot appears in its own figure, not in the grid
pplot(df, x="lifespan", y="alcohol", kind='qq', height=5)


__[Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)__

$ \rho_{x,y} = \frac{\operatorname{cov}(x,y)}{\sigma_{x}\sigma_{y}}$
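
As a quick check of the formula, the coefficient can be computed from the covariance and the standard deviations and compared with NumPy's built-in result. The data below is synthetic and only serves as an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)  # hypothetical linear relation plus noise

# rho = cov(x, y) / (sigma_x * sigma_y), with the same ddof in every term
cov_xy = np.cov(x, y, ddof=1)[0, 1]
rho_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

rho_numpy = np.corrcoef(x, y)[0, 1]
print(rho_manual, rho_numpy)
```

Both numbers agree up to floating-point error, which is also what `DataFrame.corr()` reports by default.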


In [None]:
#colorful matrix, showing correlations
df_corr = df[['lifespan', 'genetic', 
      #'length', 'mass', 
      'exercise', 'smoking',
       'alcohol', 'sugar', 'BMI']].dropna().corr()
df_corr.style.background_gradient(cmap='RdBu')

In [None]:
#graphical correlation matrix view
sns.heatmap(df_corr,annot=True)

Outliers: impute or delete?

In [None]:
#Show stripplot, boxplot 
fig, axes = plt.subplots(1,2,figsize=(15,5))
sns.stripplot(data=df, x='alcohol', y='lifespan', size=1, color=".8", ax=axes[0])
sns.boxplot(data=df[['alcohol']], ax=axes[1])

In [None]:
sns.boxplot(y=df['exercise'])

In [None]:
#Q1=df['exercise'].quantile(0.25)
#print("Q1:", Q1)
#Q3=df['exercise'].quantile(0.75)
#print("Q3:", Q3)
#IQR=Q3-Q1
#print("IQR: ", IQR)
#lower_bound = Q1 - 1.5*IQR
#print("Lower Bound:", lower_bound)
#upper_bound = Q3 + 1.5*IQR
#print("Upper Bound:", upper_bound)

In [None]:
#df_clean = df[(df['exercise']>lower_bound)&(df['exercise']<upper_bound)]
#sns.boxplot(y = df_clean['exercise'])

* Clean up outliers.

The boxplots reveal the outliers. I tried two ways to remove them: the IQR rule and DBSCAN clustering. The correlation matrix showed no difference in correlation before and after removal, so the outliers do not affect the result and I decided to keep them.
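
The comparison can be sketched as follows. The helper `iqr_filter`, the factor `k=1.5` and the synthetic data are assumptions made for illustration; they are not results from this data set.

```python
import numpy as np
import pandas as pd


def iqr_filter(frame, column, k=1.5):
    """Keep rows whose `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return frame[frame[column].between(q1 - k * iqr, q3 + k * iqr)]


# Hypothetical data: a linear trend plus two extreme 'exercise' values.
rng = np.random.default_rng(1)
exercise = np.append(rng.normal(3.0, 1.0, 300), [15.0, 18.0])
lifespan = 70 + 2 * exercise + rng.normal(0, 3, 302)
demo = pd.DataFrame({"exercise": exercise, "lifespan": lifespan})

demo_clean = iqr_filter(demo, "exercise")
print("rows removed:", len(demo) - len(demo_clean))
print("corr before:", demo["exercise"].corr(demo["lifespan"]))
print("corr after: ", demo_clean["exercise"].corr(demo_clean["lifespan"]))
```

If the two correlations are close, as reported above for the real data, removing the outliers changes little and keeping them is defensible.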

DBSCAN, which stands for density-based spatial clustering of applications with noise, is an unsupervised clustering algorithm. This approach identifies any points that are loosely packed or sit alone outside of densely packed clusters as outliers.
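
A self-contained sketch on synthetic data shows how such points receive the label `-1`. Scaling with `StandardScaler` and the values `eps=0.5` and `min_samples=5` are assumptions for this toy example, not tuned settings for the data set.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one dense cloud of 50 points plus one isolated point.
rng = np.random.default_rng(0)
cloud = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
points = np.vstack([cloud, [[20.0, 20.0]]])

# DBSCAN is distance based, so bring both features onto the same scale first.
scaled = StandardScaler().fit_transform(points)
labels = DBSCAN(eps=0.5, min_samples=5).fit(scaled).labels_

print("label of the isolated point:", labels[-1])
print("points flagged as noise:", int((labels == -1).sum()))
```

The result depends strongly on `eps` and `min_samples`, so it is worth trying a few values before deciding which points to treat as outliers.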

In [None]:
X_train = df[['exercise','lifespan']]
model = DBSCAN()
model.fit(X_train)

In [None]:
cluster_labels = model.labels_
plt.scatter(df["exercise"], df["lifespan"], c = cluster_labels)
plt.show()

In [None]:
df['labels'] = cluster_labels
df_cluster_clean = df[df['labels'] != -1]

* Clean up outliers.

In [None]:
df_corr = df[['genetic',
      'length', 'mass', 
      'exercise', 'smoking',
       'alcohol', 'sugar', 'BMI','lifespan']].dropna().corr()
df_corr.style.background_gradient(cmap='RdBu')

Lifespan is highly correlated with exercise. The IQR method and the DBSCAN clustering method gave the same result.
Therefore we will remove outliers for the variable 'exercise'.

In [None]:
sns.boxplot(data=df, y= 'lifespan', x='exercise')

In [None]:
# ... test plot not used atm

# got any outliers?
# sns.boxplot(x=df['lifespan'])

# sns.boxplot(df[pipe])


#... everything outside the lines (whiskers) is an outlier.

New features:


In [None]:
# New Feature BMI (kg/m^2)
# df['bmi'] = df['mass']/(df['length']/100)**2
# df.head()

# bmi_cats = [0, 18.5, 25, np.inf]
# labels_bmi_cats=['underweight','normal_range','overweight']
# df['bmi_cat']= pd.cut(df['bmi'], bins=bmi_cats, labels=labels_bmi_cats)

# bmi_subcats = [0, 16, 17, 18.5, 25, 30, 35, 40, np.inf]
# labels_bmi_subcats=['severe_thinness','moderate_thinness','mild_thinness','normal', 'pre_obese', 'obese_class_I', 'obese_class_II', 'obese_class_III']
# df['bmi_subcat']= pd.cut(df['bmi'], bins=bmi_subcats, labels=labels_bmi_subcats)


# df.head(12)

---
<a id='section4'></a>
## 4. Analyze the data
```
- a. Apply statistical analysis tools: mean, median, mode, standard deviation
- b. Go through Checklist for linear regression: normal distribution, continuous variable, correlation and p-value
- c. Apply regression
```

<li>Calculate P-values</li>

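The p-value is the probability of observing a correlation at least this strong in the sample if the true correlation in the population were zero. If p < 0.05 (often it is almost 0), the correlation is unlikely to be a chance result and we treat it as statistically significant. The p-value alone does not say how strong the relationship is; for that, look at the correlation coefficient.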
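
The difference between a related and an unrelated feature can be illustrated with synthetic data. The variables below are made up and only show how `stats.pearsonr` behaves.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 500
x = rng.normal(size=n)
related = 0.5 * x + rng.normal(size=n)  # depends on x
unrelated = rng.normal(size=n)          # independent of x

r_rel, p_rel = stats.pearsonr(x, related)
r_unr, p_unr = stats.pearsonr(x, unrelated)

print(f"related:   r={r_rel:.3f}, p={p_rel:.3g}")
print(f"unrelated: r={r_unr:.3f}, p={p_unr:.3g}")
```

For the related pair the p-value is far below 0.05, while for the independent pair the coefficient stays close to zero.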


In [None]:
r,p = stats.pearsonr(df.lifespan,df.smoking)
print('smoking corr:',round(r,4))
print('smoking p-val:',round(p,4))

In [None]:
#Show Q-Q plot to compare the distributions and judge whether linear regression is applicable
from seaborn_qqplot import pplot
myplot = pplot(df, x="lifespan", y="alcohol", kind='qq', height=5)

If the feature satisfies:
 - it is a continuous variable
 - it is correlated with the target and the relation is roughly linear
 - it is approximately normally distributed

we can try to apply a linear regression method <br>
(e.g. in the form of
 $ y=\alpha \cdot x_{smoking}+\beta \cdot x_{exercise}+c $)


In [None]:
train, test = train_test_split(df, test_size=0.2, random_state=0)

X = train[['smoking', 'exercise']]
y = train.lifespan
regr = linear_model.LinearRegression()
regr.fit(X, y) 

In [None]:
# The coefficients
print('Coefficients: \n', regr.coef_)
print('c would be:', regr.intercept_)

In regression, the $R^{2}$ coefficient of determination is a statistical measure of how well the regression predictions approximate the real data points.  <br>
A value close to 1 means the model explains most of the variance in the dependent variable lifespan using the independent variables smoking and exercise.
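
For reference, $R^{2}$ can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The numbers below are invented for illustration and are not model output.

```python
import numpy as np

# Hypothetical true and predicted lifespans.
y_true = np.array([70.0, 75.0, 80.0, 85.0])
y_pred = np.array([72.0, 74.0, 79.0, 86.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
r_squared = 1 - ss_res / ss_tot

print(r_squared)
```

A model that always predicts the mean of `y_true` would score 0, and a model worse than that scores below 0.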

In [None]:
print(train.shape, test.shape)
score = regr.score(test[['smoking', 'exercise']],test.lifespan)
print(f'coefficient of determination(R\N{SUPERSCRIPT TWO}):', score)

---
<a id='section5'></a>
## 5. Interpret the results
```
- Draw conclusions, make insights and communicate in a clear, concise and unbiased manner
```

---
<a id='section6'></a>
## 6. Iterate and refine
```
- Explore alternative approaches e.g. test assumptions and update your conclusions based on feedback and new insights 
```

---
<a id='section7'></a>
## 7. Save the data of your analysis

In [None]:
# save the data from this notebook as a csv in the output folder

df.to_csv('../data/output_data/data_{}_{}.csv'.format(pipe,dt.datetime.now().strftime("%Y-%m-%d %H-%M")), index=False,sep=';')