# Advanced Data Science Capstone

## Correlation of Air Pollution and Prevalence of Asthma Bronchiale in Germany

## Model definition: Gradient-Boosted Trees (GBTs) in SparkML, polynomial feature set

## Loading Feature sets from COS
Importing necessary libraries:

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import isnan, when, count, col
import ibmos2spark

from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler, Normalizer
from pyspark.ml.linalg import Vectors

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20190823083749-0001
KERNEL_ID = 07e08e17-f2f7-448c-ae70-3c42ff5da55c


Starting Spark session and loading Feature Sets from COS:

In [2]:
# The code was removed by Watson Studio for sharing.

## The Model: Gradient-Boosted Trees

The next cell prepares the pipeline stages: indexing the disease label `DiseaseRFeat` (1 if the county is in the Nth percentile of disease prevalence, 0 otherwise), assembling the pollutant columns into a feature vector, normalizing that vector, and defining the classifier (*GBT*). The stages are then assembled into a pipeline:

In [3]:
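# Map the DiseaseRFeat values to a numeric "label" column
indexer = StringIndexer(inputCol="DiseaseRFeat", outputCol="label")
# Combine the three pollutant columns into a single feature vector
vectorAssembler = VectorAssembler(inputCols=["NO", "NO2", "PM1"], outputCol="featuresOO")
# Rescale each feature vector to unit L1 norm (p=1.0)
normalizer = Normalizer(inputCol="featuresOO", outputCol="features", p=1.0)
# Gradient-boosted trees with 10 boosting iterations, trained on the indexed label
classifier = GBTClassifier(labelCol="label", maxIter=10)
pipeline = Pipeline(stages=[indexer, vectorAssembler, normalizer, classifier])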
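
To make the normalization stage concrete: `Normalizer` with `p=1.0` rescales each row's feature vector to unit L1 norm, that is, it divides every component by the sum of the absolute values in that row. The plain-Python sketch below mirrors that computation outside Spark. The function name `l1_normalize` and the sample values are hypothetical and chosen only for illustration.

```python
def l1_normalize(vector):
    """Scale a feature vector so that its absolute values sum to 1."""
    norm = sum(abs(x) for x in vector)
    if norm == 0:
        # In this sketch an all-zero row is returned unchanged
        return list(vector)
    return [x / norm for x in vector]

# Hypothetical row of (NO, NO2, PM1) values, not taken from the feature set
row = [2.0, 6.0, 2.0]
normalized = l1_normalize(row)
print(normalized)  # [0.2, 0.6, 0.2]
```

Because the rescaling is done per row, the normalized vector keeps only the relative proportions of the three pollutants; the absolute concentration level of a county is no longer visible to the classifier.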

### Training the model on the available Feature Sets

#### 75th percentile for pollutant concentration, 95th percentile for disease prevalence, three pollutants (dfPolLongPerc75Disease95perc)

In [4]:
df = dfPolLongPerc75Disease95perc

Checking the schema of the loaded Feature set, counting *NaN* values per column to confirm that there are none, and checking that both disease labels exist in the data:

In [5]:
df.printSchema()
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
indexed = indexer.fit(df).transform(df)
indexed.select('label').distinct().show()

root
 |-- CountyID: long (nullable = true)
 |-- DiseaseRFeat: integer (nullable = true)
 |-- NO: double (nullable = true)
 |-- NO2: double (nullable = true)
 |-- PM1: double (nullable = true)

+--------+------------+---+---+---+
|CountyID|DiseaseRFeat| NO|NO2|PM1|
+--------+------------+---+---+---+
|       0|           0|  0|  0|  0|
+--------+------------+---+---+---+

+-----+
|label|
+-----+
|  0.0|
|  1.0|
+-----+
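
One caveat on the check above: `isnan` flags only floating-point NaN, while a SQL null is a different kind of missing value and needs a separate `isNull()` test. The plain-Python sketch below illustrates the distinction on hypothetical rows, with `None` standing in for a Spark null. The helper `count_missing` and the sample data are assumptions made for illustration and are not part of the notebook.

```python
import math

def count_missing(rows, columns):
    """Count NaN and None entries separately for each column."""
    counts = {c: {"nan": 0, "null": 0} for c in columns}
    for row in rows:
        for c in columns:
            value = row.get(c)
            if value is None:
                counts[c]["null"] += 1
            elif isinstance(value, float) and math.isnan(value):
                counts[c]["nan"] += 1
    return counts

# Hypothetical rows; None plays the role of a Spark null
rows = [
    {"NO": 1.2, "NO2": float("nan")},
    {"NO": None, "NO2": 3.4},
    {"NO": 0.7, "NO2": 5.1},
]
missing = count_missing(rows, ["NO", "NO2"])
print(missing)
```

A NaN-only count would report zero missing values for the `NO` column in this example, even though one entry is absent.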



Splitting the data set into *train* and *test* sets:

In [6]:
splits = df.randomSplit([0.8, 0.2])
df_train = splits[0]
df_test  = splits[1]
print("Size of the train set is ", df_train.count(), ", size of the test set is ", df_test.count())

Size of the train set is  43 , size of the test set is  6
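
The sizes above (43 and 6 rather than an exact 80/20 division of the 49 rows) reflect that `randomSplit` assigns rows at random according to the weights, so the resulting sizes are only approximate, and without a seed they change from run to run. The sketch below imitates such a per-row assignment in plain Python with a fixed seed. `random_split` is a hypothetical helper and does not reproduce Spark's actual sampler; in Spark itself, `randomSplit` accepts a `seed` argument for the same purpose.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds in [0, 1]; the last one is pinned to 1.0
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if draw < upper:
                splits[i].append(row)
                break
    return splits

rows = list(range(49))  # stand-in for the 49 rows of the feature set
train, test = random_split(rows, [0.8, 0.2], seed=42)
print(len(train), len(test))
```

Calling the helper again with the same seed returns the same split, which makes accuracy figures comparable across runs.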


Training the model and estimating in-sample and out-of-sample performance via the accuracy metric:

In [7]:
# Fit all pipeline stages on the training split only
model = pipeline.fit(df_train)
prediction = model.transform(df_train)
# Accuracy evaluator comparing the predicted class with the original 0/1 disease flag
binEval = MulticlassClassificationEvaluator().setMetricName("accuracy").setPredictionCol("prediction").setLabelCol("DiseaseRFeat")
InSampleAcc = binEval.evaluate(prediction)
# Score the held-out rows with the already fitted model
predictionTest = model.transform(df_test)
OutOfSampleAcc = binEval.evaluate(predictionTest)
print("in-sample accuracy is ", InSampleAcc, ", out-of-sample accuracy is ", OutOfSampleAcc)

in-sample accuracy is  1.0 , out-of-sample accuracy is  0.8333333333333334
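
For reference, accuracy is the fraction of rows whose predicted label equals the true label. With only 6 test rows it can change only in steps of 1/6, and 0.8333... corresponds to 5 correct predictions out of 6. The sketch below computes accuracy in plain Python together with a majority-class baseline, which can serve as a yardstick when one label is much rarer than the other. The label lists are hypothetical and the helper names are assumptions; they are not the notebook's actual predictions.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def majority_baseline(y_true):
    """Accuracy of always predicting the most frequent true label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Hypothetical test labels and predictions for 6 rows
y_true = [0, 0, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 0, 0]
print(accuracy(y_true, y_pred))    # 0.8333333333333334
print(majority_baseline(y_true))   # 0.8333333333333334
```

In this made-up example a model that never predicts the rare class already matches the baseline, so an accuracy figure is most informative when read next to that baseline.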


## GBT Summary
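In the run shown above, the GBT classifier (`maxIter=10`, otherwise default settings) reaches an in-sample accuracy of 100% and an out-of-sample accuracy of about 83% (5 of 6 test rows) on the dfPolLongPerc75Disease95perc feature set, which pairs the 75th percentile of pollutant concentration with the 95th percentile of Asthma Bronchiale prevalence. Because the train/test split is unseeded and the test set is small, the out-of-sample figure varies between runs.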