# Advanced Data Science Capstone

## Correlation of Air Pollution and Prevalence of Asthma Bronchiale in Germany

## Model evaluation: Gradient-Boosted Trees (GBTs) in SparkML, 6 feature sets

## Loading Feature sets from COS
Importing necessary libraries:

In [1]:
from pyspark.sql import SparkSession
import ibmos2spark

from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import Normalizer, StringIndexer, VectorAssembler
from pyspark.ml.linalg import Vectors

from pyspark.sql.functions import isnan, when, count, col


Starting Spark session and loading Feature Sets from COS:

In [2]:
# The code was removed by Watson Studio for sharing.

## The Model: Gradient-Boosted Trees

Setting up the pipeline stages: indexing the disease label `DiseaseRFeat` (1 if the county is in the Nth percentile of disease prevalence, 0 otherwise), assembling the three pollutant columns into a feature vector, normalizing that vector, and defining the classifier (*GBT*). The stages are then combined into a single pipeline:

In [3]:
indexer = StringIndexer(inputCol="DiseaseRFeat", outputCol="label")
vectorAssembler = VectorAssembler(inputCols=["NO", "NO2", "PM1"], outputCol="featuresOO")
normalizer = Normalizer(inputCol="featuresOO", outputCol="features", p=1.0)
classifier = GBTClassifier(maxIter=10)
classifier.setLabelCol('label')
pipeline = Pipeline(stages=[indexer, vectorAssembler, normalizer, classifier])
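
To make the normalization stage concrete: `Normalizer` with `p=1.0` rescales each row vector to unit L1 norm, so each county's three pollutant values are divided by the sum of their absolute values. The pure-Python sketch below mimics that per-row operation without Spark. The function name `l1_normalize` and the sample concentrations are made up for illustration and are not taken from the feature sets.

```python
def l1_normalize(row):
    """Scale a feature row so that its absolute values sum to 1 (unit L1 norm)."""
    norm = sum(abs(x) for x in row)
    if norm == 0:
        # Assumption for this sketch: leave an all-zero row unchanged
        return list(row)
    return [x / norm for x in row]

# Hypothetical (NO, NO2, PM1) values for two counties;
# county_b has exactly twice the concentrations of county_a
county_a = [10.0, 20.0, 10.0]
county_b = [20.0, 40.0, 20.0]

norm_a = l1_normalize(county_a)
norm_b = l1_normalize(county_b)
print(norm_a)
print(norm_b)
```

Both hypothetical counties map to the same vector although the second has twice the concentrations. Row-wise normalization keeps only the relative mix of the pollutants, not their absolute level, which is worth keeping in mind when interpreting the results.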

### Evaluating the model on the available Feature Sets

#### Mean or 75th percentile for pollutant concentration; 50th, 75th or 95th percentile for disease prevalence; three pollutants
Training the model and estimating in-sample and out-of-sample accuracy for each of the six available feature sets:

In [4]:
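df = dfPolMeanLongDisease50perc

# Random 80/20 split; no seed is set, so the split and the accuracies vary between runs
splits = df.randomSplit([0.8, 0.2])
df_train = splits[0]
df_test  = splits[1]
model = pipeline.fit(df_train)
# In-sample predictions (on the training split)
prediction = model.transform(df_train)
# Note: predictions are compared with the raw "DiseaseRFeat" column,
# not with the indexed "label" column the classifier is trained on
binEval = MulticlassClassificationEvaluator().setMetricName("accuracy").setPredictionCol("prediction").setLabelCol("DiseaseRFeat")
InSampleAcc = binEval.evaluate(prediction)
# Out-of-sample predictions (on the held-out test split)
predictionTest = model.transform(df_test)
OutOfSampleAcc = binEval.evaluate(predictionTest)
print("for dfPolMeanLongDisease50perc feature set in-sample accuracy is ", InSampleAcc, ", out-of-sample accuracy is ", OutOfSampleAcc)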

for dfPolMeanLongDisease50perc feature set in-sample accuracy is  0.0 , out-of-sample accuracy is  0.5555555555555556
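
An in-sample accuracy of exactly 0.0 from a boosted-tree model usually points to a labelling mismatch rather than a bad fit. One plausible explanation, stated here as an assumption rather than a verified diagnosis: `StringIndexer` orders labels by descending frequency by default, so if class 1 is the more frequent class in the training split, it is mapped to index 0.0 and class 0 to index 1.0. The classifier then predicts indices, while the evaluator compares them with the raw `DiseaseRFeat` values. The pure-Python sketch below imitates that ordering; `frequency_index`, `accuracy` and the toy labels are hypothetical and do not use Spark.

```python
from collections import Counter

def frequency_index(labels):
    """Map each distinct label to an index, most frequent label first
    (imitates a frequency-descending string indexer)."""
    counts = Counter(labels)
    # Sort by descending count; break ties by label text for determinism
    ordered = sorted(counts, key=lambda lab: (-counts[lab], str(lab)))
    return {lab: float(i) for i, lab in enumerate(ordered)}

def accuracy(predictions, truth):
    """Fraction of positions where prediction and truth agree."""
    hits = sum(1 for p, t in zip(predictions, truth) if p == t)
    return hits / len(truth)

# Toy training labels in which class 1 is slightly more frequent than class 0
raw = [1, 1, 1, 0, 0]
mapping = frequency_index(raw)
indexed = [mapping[r] for r in raw]

# Assume a model that reproduces the indexed training labels perfectly
predicted = list(indexed)

print(mapping)
print("accuracy vs indexed label:", accuracy(predicted, indexed))
print("accuracy vs raw label:    ", accuracy(predicted, [float(r) for r in raw]))
```

If this is what happened, pointing the evaluator at the indexed column with `setLabelCol("label")` would make the comparison consistent. For a fully flipped binary label, the accuracy against the indexed label is one minus the reported value, so the figures for the two 50th-percentile feature sets should be read with caution.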


In [5]:
df = dfPolMeanLongDisease75perc

splits = df.randomSplit([0.8, 0.2])
df_train = splits[0]
df_test  = splits[1]
model = pipeline.fit(df_train)
prediction = model.transform(df_train)
binEval = MulticlassClassificationEvaluator().setMetricName("accuracy").setPredictionCol("prediction").setLabelCol("DiseaseRFeat")
InSampleAcc = binEval.evaluate(prediction)
predictionTest = model.transform(df_test)
OutOfSampleAcc = binEval.evaluate(predictionTest)
print("for dfPolMeanLongDisease75perc feature set in-sample accuracy is ", InSampleAcc, ", out-of-sample accuracy is ", OutOfSampleAcc)

for dfPolMeanLongDisease75perc feature set in-sample accuracy is  1.0 , out-of-sample accuracy is  0.6923076923076923


In [6]:
df = dfPolMeanLongDisease95perc

splits = df.randomSplit([0.8, 0.2])
df_train = splits[0]
df_test  = splits[1]
model = pipeline.fit(df_train)
prediction = model.transform(df_train)
binEval = MulticlassClassificationEvaluator().setMetricName("accuracy").setPredictionCol("prediction").setLabelCol("DiseaseRFeat")
InSampleAcc = binEval.evaluate(prediction)
predictionTest = model.transform(df_test)
OutOfSampleAcc = binEval.evaluate(predictionTest)
print("for dfPolMeanLongDisease95perc feature set in-sample accuracy is ", InSampleAcc, ", out-of-sample accuracy is ", OutOfSampleAcc)

for dfPolMeanLongDisease95perc feature set in-sample accuracy is  1.0 , out-of-sample accuracy is  0.8181818181818182


In [7]:
df = dfPolLongPerc75Disease50perc

splits = df.randomSplit([0.8, 0.2])
df_train = splits[0]
df_test  = splits[1]
model = pipeline.fit(df_train)
prediction = model.transform(df_train)
binEval = MulticlassClassificationEvaluator().setMetricName("accuracy").setPredictionCol("prediction").setLabelCol("DiseaseRFeat")
InSampleAcc = binEval.evaluate(prediction)
predictionTest = model.transform(df_test)
OutOfSampleAcc = binEval.evaluate(predictionTest)
print("for dfPolLongPerc75Disease50perc feature set in-sample accuracy is ", InSampleAcc, ", out-of-sample accuracy is ", OutOfSampleAcc)

for dfPolLongPerc75Disease50perc feature set in-sample accuracy is  0.0 , out-of-sample accuracy is  0.7142857142857143


In [8]:
df = dfPolLongPerc75Disease75perc

splits = df.randomSplit([0.8, 0.2])
df_train = splits[0]
df_test  = splits[1]
model = pipeline.fit(df_train)
prediction = model.transform(df_train)
binEval = MulticlassClassificationEvaluator().setMetricName("accuracy").setPredictionCol("prediction").setLabelCol("DiseaseRFeat")
InSampleAcc = binEval.evaluate(prediction)
predictionTest = model.transform(df_test)
OutOfSampleAcc = binEval.evaluate(predictionTest)
print("for dfPolLongPerc75Disease75perc feature set in-sample accuracy is ", InSampleAcc, ", out-of-sample accuracy is ", OutOfSampleAcc)

for dfPolLongPerc75Disease75perc feature set in-sample accuracy is  1.0 , out-of-sample accuracy is  0.6666666666666666


In [9]:
df = dfPolLongPerc75Disease95perc

splits = df.randomSplit([0.8, 0.2])
df_train = splits[0]
df_test  = splits[1]
model = pipeline.fit(df_train)
prediction = model.transform(df_train)
binEval = MulticlassClassificationEvaluator().setMetricName("accuracy").setPredictionCol("prediction").setLabelCol("DiseaseRFeat")
InSampleAcc = binEval.evaluate(prediction)
predictionTest = model.transform(df_test)
OutOfSampleAcc = binEval.evaluate(predictionTest)
print("for dfPolLongPerc75Disease95perc feature set in-sample accuracy is ", InSampleAcc, ", out-of-sample accuracy is ", OutOfSampleAcc)

for dfPolLongPerc75Disease95perc feature set in-sample accuracy is  1.0 , out-of-sample accuracy is  0.75


## GBT Summary

According to the out-of-sample accuracies above, each obtained from a single random 80/20 split, the best feature set (at least for the *GBT* classifier) is **dfPolMeanLongDisease95perc** (accuracy 0.82), which combines the mean pollutant concentration over the year with the 95th percentile of asthma bronchiale prevalence in a county. The second-best feature set is **dfPolLongPerc75Disease95perc** (accuracy 0.75), which combines the 75th percentile of the pollutant concentration over the year with the same 95th-percentile prevalence threshold.
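
Because each test split is small, the ranking above rests on a handful of counties. As a rough illustration, the sketch below computes a 95% Wilson score interval for an accuracy given as correct/total. The counts 9/11 and 3/4 are an assumption: they are the smallest fractions that reproduce the reported 0.818 and 0.75, and the true test-set sizes are not shown in the notebook. `wilson_interval` is a helper written for this example.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Assumed counts: smallest fractions matching the reported accuracies
assumed_counts = [
    ("dfPolMeanLongDisease95perc", 9, 11),
    ("dfPolLongPerc75Disease95perc", 3, 4),
]

for name, correct, total in assumed_counts:
    low, high = wilson_interval(correct, total)
    print(f"{name}: {correct}/{total} = {correct / total:.3f}, "
          f"95% interval [{low:.2f}, {high:.2f}]")
```

Under these assumed counts the two intervals overlap substantially, so the gap between the best and second-best feature set may not be meaningful. Repeating the split with several fixed seeds, or using cross-validation, would give a steadier comparison.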