# Advanced Data Science Capstone

## Correlation of air pollution and Prevalence of Asthma bronchiale in Germany  

## Model definition: Cluster approaches

### The deliverables
The deliverables of the current stage:

 - The models: at least one deep learning and at least one non-deep learning algorithm
 - Compare and document model performance
 - At least one additional iteration of the process model that involves at least the feature creation task, recording its impact on model performance (e.g. data normalization, PCA, …)
 
### Architectural Decisions Document (ADD)

 - The choice of specific technologies / frameworks 
 - All decisions should be documented in the ADD
 
### Result of the stage

 - Save the notebook according to the process model’s naming convention
 - Proceed to the model training task 
 
### Model definition
Since the linear regression approach was unsuccessful, classification approaches are used instead.
The target class is a binary feature that indicates whether a county is at high risk of the disease.
This **feature** can be constructed, for example, as **the county lying above the 75th percentile** of disease prevalence.

First, the necessary libraries and the feature matrices created at the Feature Engineering stage are loaded:

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn import decomposition
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm


FeatureSetDenseMean = pd.read_csv('Capstone.FeatureEng/Capstone.feature_eng.DenseMean.1.0.csv', index_col=None)
FeatureSetLongMean = pd.read_csv('Capstone.FeatureEng/Capstone.feature_eng.LongMean.1.0.csv', index_col=None)

FeatureSetDensePerc50 = pd.read_csv('Capstone.FeatureEng/Capstone.feature_eng.DensePerc50.1.0.csv', index_col=None)
FeatureSetDensePerc75 = pd.read_csv('Capstone.FeatureEng/Capstone.feature_eng.DensePerc75.1.0.csv', index_col=None)

FeatureSetLongPerc50 = pd.read_csv('Capstone.FeatureEng/Capstone.feature_eng.LongPerc50.1.0.csv', index_col=None)
FeatureSetLongPerc75 = pd.read_csv('Capstone.FeatureEng/Capstone.feature_eng.LongPerc75.1.0.csv', index_col=None)

### Additional feature creation

An additional binary feature is derived from the disease prevalence: it flags whether a county lies **above the Nth percentile** of the disease prevalence:

In [31]:
def DiseaseFeaturePercentile(FeatureSetDF, Percentile):
    """Return a copy of FeatureSetDF in which 'DiseaseR' is replaced by the
    boolean column 'DiseaseRFeat' (True if above the given percentile)."""
    # Threshold: value of 'DiseaseR' at the requested percentile (0-100)
    DiseasePercentile = FeatureSetDF['DiseaseR'].quantile(Percentile / 100.0)
    # Work on an explicit copy so that the input frame is left unchanged
    dfFeatureOut = FeatureSetDF.copy()
    # Flag counties whose prevalence lies strictly above the threshold
    dfFeatureOut['DiseaseR'] = dfFeatureOut['DiseaseR'] > DiseasePercentile
    # Rename the column: DiseaseR -> DiseaseRFeat
    dfFeatureOut = dfFeatureOut.rename(columns={'DiseaseR': 'DiseaseRFeat'})
    return dfFeatureOut

dfPolMeanLongDisease50perc = DiseaseFeaturePercentile(FeatureSetLongMean, 50.0)
dfPolMeanLongDisease75perc = DiseaseFeaturePercentile(FeatureSetLongMean, 75.0)
dfPolMeanLongDisease95perc = DiseaseFeaturePercentile(FeatureSetLongMean, 95.0)

dfPolLongPerc75Disease50perc = DiseaseFeaturePercentile(FeatureSetLongPerc75, 50.0)
dfPolLongPerc75Disease75perc = DiseaseFeaturePercentile(FeatureSetLongPerc75, 75.0)
dfPolLongPerc75Disease95perc = DiseaseFeaturePercentile(FeatureSetLongPerc75, 95.0)

dfPolMeanLongDisease50perc.to_csv('Capstone.FeatureEng/Capstone.feature_eng.PolMeanLongDisease50perc.1.0.csv', index=False)
dfPolMeanLongDisease75perc.to_csv('Capstone.FeatureEng/Capstone.feature_eng.PolMeanLongDisease75perc.1.0.csv', index=False)
dfPolMeanLongDisease95perc.to_csv('Capstone.FeatureEng/Capstone.feature_eng.PolMeanLongDisease95perc.1.0.csv', index=False)

dfPolLongPerc75Disease50perc.to_csv('Capstone.FeatureEng/Capstone.feature_eng.PolLongPerc75Disease50perc.1.0.csv', index=False)
dfPolLongPerc75Disease75perc.to_csv('Capstone.FeatureEng/Capstone.feature_eng.PolLongPerc75Disease75perc.1.0.csv', index=False)
dfPolLongPerc75Disease95perc.to_csv('Capstone.FeatureEng/Capstone.feature_eng.PolLongPerc75Disease95perc.1.0.csv', index=False)

dfPolLongPerc75Disease95perc.head()

|   | CountyID | DiseaseRFeat | NO | NO2 | PM1 |
|---|---|---|---|---|---|
| 0 | 16072 | False | 7.0 | 21.0 | 275.0 |
| 1 | 16073 | False | 747.0 | 538.0 | 1456.0 |
| 2 | 8128 | False | 701.0 | 511.0 | 1559.0 |
| 3 | 6413 | False | 4617.0 | 4931.0 | 3215.0 |
| 4 | 1001 | False | 5491.0 | 3638.0 | 2488.0 |
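
Because the label is defined by a percentile threshold, the share of positive (high-risk) counties is fixed by construction: roughly 50 %, 25 % and 5 % for the three thresholds used above. A minimal sketch of this check follows. The synthetic `DiseaseR` values and the helper name `positive_share` are illustrative assumptions, not part of the capstone data or code:

```python
import numpy as np
import pandas as pd

# Synthetic prevalence values for 400 hypothetical counties (not the real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "CountyID": np.arange(400),
    "DiseaseR": rng.normal(loc=5.0, scale=1.0, size=400),
})

def positive_share(df, percentile):
    """Share of rows whose 'DiseaseR' lies strictly above the given percentile."""
    threshold = df["DiseaseR"].quantile(percentile / 100.0)
    return float((df["DiseaseR"] > threshold).mean())

# Same three thresholds as in the cell above
shares = {p: positive_share(toy, p) for p in (50.0, 75.0, 95.0)}
for p, share in shares.items():
    print(f"{p:.0f}th percentile -> {share:.1%} of counties flagged")
```

With the 95th percentile only about one county in twenty is labelled positive, so plain accuracy can be misleading for that label. Class-aware metrics or stratified splits may be worth considering.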


### Model Definition
Several models are defined, each based on a different feature set. Since the pollutant data sets are incomplete (they do not cover *all* German counties), the train/test sets are cropped for each pollutant feature set to the counties it contains:

In [26]:
dfAsthma75perc = DiseaseFeaturePercentile(dfAsthma, 75.0)
dfAsthma75perc.tail()

|   | CountyID | DiseaseRFeat |
|---|---|---|
| 397 | 8417 | False |
| 398 | 9374 | False |
| 399 | 9272 | False |
| 400 | 8327 | False |
| 401 | 8127 | False |
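
The cropping described above can be sketched as an inner join on `CountyID`, which keeps only the counties present in both the label frame and a pollutant feature set, followed by a train/test split. The toy frames, the 25 % test share, the fixed `random_state` and the stratification are assumptions for illustration; the notebook's actual split may differ:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy label frame: all (hypothetical) counties with a binary risk flag
labels = pd.DataFrame({
    "CountyID": range(1, 21),
    "DiseaseRFeat": [True, False, False, False] * 5,
})

# Toy pollutant frame: measurements exist only for a subset of the counties
pollutants = pd.DataFrame({
    "CountyID": range(1, 13),
    "NO2": [float(10 + i) for i in range(12)],
})

# The inner join crops the data to counties present in both frames
cropped = labels.merge(pollutants, on="CountyID", how="inner")

X = cropped[["NO2"]]
y = cropped["DiseaseRFeat"]

# Stratify so that both parts keep a similar share of high-risk counties
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
print(len(cropped), len(X_train), len(X_test))
```

An inner join silently drops counties that have no pollutant measurements, so it may be useful to record how many rows are lost for each feature set.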
