
# Advanced Data Science Capstone

## Correlation of Air Pollution and Prevalence of Asthma bronchiale in Germany

## Feature Creation and Feature Engineering

### The deliverables
The deliverables of the current stage:

 - Spark DataFrames containing the county ID, a disease prevalence column, and features extracted from the air pollution time series of the sensors located in the corresponding county

###  Feature creation
The basic features for air pollution levels are:

 - Number of hours in which the pollutant concentration exceeded a given threshold
 - Mean or median concentration of the pollutant
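
As a rough illustration of these two feature types, the sketch below computes the exceedance hours, the mean, and the median for a short series of hourly readings in plain Python. The readings and the threshold of 40 are made up for the example and are not taken from the data set; the notebook itself computes its features with Spark SQL.

```python
from statistics import mean, median

# Hypothetical hourly concentrations of one pollutant in one county
hourly_conc = [12.0, 35.5, 48.0, 51.2, 39.9, 60.1, 22.3, 41.0]

# Assumed threshold, chosen for illustration only
THRESHOLD = 40.0

# Number of hours in which the concentration exceeded the threshold
hours_above = sum(1 for c in hourly_conc if c > THRESHOLD)

# Central tendency of the concentration over the whole series
mean_conc = mean(hourly_conc)
median_conc = median(hourly_conc)

print(hours_above, round(mean_conc, 2), round(median_conc, 2))
```

The median is less sensitive than the mean to short pollution spikes, which is one reason to keep both as candidate features.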
 
###  Feature quality check

 - Feature variance
 - Feature cross-correlation matrix
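
A minimal sketch of both checks follows, written in plain Python so that the arithmetic is visible. The feature values are invented for the example, and the helper `pearson` is a hypothetical name, not something defined in this notebook.

```python
from math import sqrt
from statistics import mean, pvariance

# Hypothetical per-county feature values (one list entry per county)
features = {
    "NO": [10.0, 20.0, 30.0, 40.0, 50.0],
    "NO2": [15.0, 25.0, 33.0, 47.0, 52.0],
    "PM1": [30.0, 10.0, 25.0, 5.0, 20.0],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Population variance per feature; a value near zero flags an uninformative feature
variances = {name: pvariance(vals) for name, vals in features.items()}

# Symmetric cross-correlation matrix stored as a nested dict
names = list(features)
corr = {a: {b: pearson(features[a], features[b]) for b in names} for a in names}

for a in names:
    print(a, round(variances[a], 1), [round(corr[a][b], 2) for b in names])
```

Off-diagonal entries close to 1 or -1 indicate nearly redundant features, one of which could be dropped before modelling.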
 
The necessary libraries and the data sets preprocessed at the ETL stage are loaded:

In [1]:
##### Libraries:
#import pandas as pd
#import numpy as np
#import matplotlib.pyplot as plt
#import re

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20190823064850-0000
KERNEL_ID = 4dcec509-272c-4f5b-8d07-f31924dbc2a9


In [2]:
# The code was removed by Watson Studio for sharing.

###  Feature creation

Now let's create some basic features that summarize air pollution over the year.
To start, the following features are generated:
 - Average concentration of each pollutant over the year (averaged over all sensors within the county)
 - 50th and 75th percentiles of each pollutant's concentration over the year, which serve as a proxy for the number of hours in which the concentration exceeded a given value
 - The disease prevalence feature is a binary flag that marks whether the county's prevalence is at or above the Nth (50th, 75th or 95th) percentile of disease prevalence across all counties
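
The binary disease feature can be sketched in plain Python as follows. The prevalence values and county IDs are invented, the helper `percentile` is a hypothetical name, and its linear interpolation between ranks is an assumption; the `percentile` function in Spark SQL may differ in detail.

```python
def percentile(values, q):
    """Percentile with linear interpolation between closest ranks (0 <= q <= 1)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


# Hypothetical disease prevalence per county ID
prevalence = {101: 4.1, 102: 5.6, 103: 6.3, 104: 3.8, 105: 7.2}

# Threshold at the 75th percentile of prevalence across all counties
threshold = percentile(prevalence.values(), 0.75)

# 1 if the county is at or above the threshold, else 0
disease_feat = {county: int(rate >= threshold) for county, rate in prevalence.items()}

print(threshold, disease_feat)
```

Raising the percentile from 50 to 95 makes the positive class progressively rarer, so the three label variants differ strongly in class balance.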

In [3]:
#sql view generation:
dfAllLongSpark.createOrReplaceTempView("SensorsHour")
dfAsthmaSpark.createOrReplaceTempView("DiseaseCounty")

#spark.sql("select * from SensorsHour limit 10").show()
#spark.sql("select * from DiseaseCounty limit 10").show() 
#dfAllLongSpark.printSchema()
#dfAsthmaSpark.printSchema()

In [4]:

#dfPollutantPercentilesSpark = spark.sql("""
#SELECT 
# distinct 
#     Pollutant, CountyID,
#     AVG(PollutantConc) over(PARTITION BY Pollutant, CountyID) AS Mean,
#     percentile_approx(PollutantConc,  0.5) over(PARTITION BY Pollutant, CountyID) as Percentile50
#--     ,
#--     percentile_approx(PollutantConc, 0.75) over(PARTITION BY Pollutant, CountyID) as Percentile75
#FROM SensorsHour
#""") 


In [5]:
#dfPollutantPercentilesSpark.createOrReplaceTempView("PollutantPercentiles")
#spark.sql("select * from PollutantPercentiles limit 10").show()

In [6]:
#spark.sql("""
#       SELECT 
#         CountyID,
#         (case 
#            WHEN DiseaseR >=
#             (select percentile(DiseaseR, 0.75) from DiseaseCounty limit 1) 
#          THEN 1
#          ELSE 0
#         END 
#        ) as DiseaseRFeat
#     FROM DiseaseCounty
#""").show(290)

In [7]:
#spark.sql("select * from PollutantPercentiles limit 10").show()

In [8]:
#spark.sql("""
#SELECT distinct 
#         CountyID,
#         AVG(PollutantConc) over(PARTITION BY CountyID) AS NO
#         FROM SensorsHour
#         WHERE Pollutant='NO'
#""").show(5)

In [9]:
# Feature matrices to be created:
#dfPolMeanLongDisease50perc = DiseaseFeaturePercentile(FeatureSetLongMean, 50.0)
#dfPolMeanLongDisease75perc = DiseaseFeaturePercentile(FeatureSetLongMean, 75.0)
#dfPolMeanLongDisease95perc = DiseaseFeaturePercentile(FeatureSetLongMean, 95.0)
#
#dfPolLongPerc75Disease50perc = DiseaseFeaturePercentile(FeatureSetLongPerc75, 50.0)
#dfPolLongPerc75Disease75perc = DiseaseFeaturePercentile(FeatureSetLongPerc75, 75.0)
#dfPolLongPerc75Disease95perc = DiseaseFeaturePercentile(FeatureSetLongPerc75, 95.0)

# ListOfPollutantsLong = ['NO','NO2','PM1']

In [10]:
# PolMeanLongDisease50perc : Pollutant mean; Disease level at or above 50th percentile
# pollutant list: ListOfPollutantsLong = ['NO','NO2','PM1']

dfPolMeanLongDisease50perc = spark.sql("""    
SELECT t1.CountyID, t1.DiseaseRFeat, t2.NO, t3.NO2, t4.PM1 
FROM
       (SELECT 
         CountyID,
         (case 
            WHEN DiseaseR >=
             (select percentile(DiseaseR, 0.5) from DiseaseCounty limit 1) 
          THEN 1
          ELSE 0
         END 
        ) as DiseaseRFeat
     FROM DiseaseCounty) t1
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS NO
          FROM SensorsHour
          WHERE Pollutant='NO') t2
     ON t1.CountyID = t2.CountyID
          JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS NO2
          FROM SensorsHour
          WHERE Pollutant='NO2') t3
     ON t2.CountyID = t3.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS PM1
          FROM SensorsHour
          WHERE Pollutant='PM1') t4
     ON t3.CountyID = t4.CountyID

""")

In [11]:
# PolMeanLongDisease75perc : Pollutant mean; Disease level at or above 75th percentile
# pollutant list: ListOfPollutantsLong = ['NO','NO2','PM1']

dfPolMeanLongDisease75perc = spark.sql("""    
SELECT t1.CountyID, t1.DiseaseRFeat, t2.NO, t3.NO2, t4.PM1 
FROM
       (SELECT 
         CountyID,
         (case 
            WHEN DiseaseR >=
             (select percentile(DiseaseR, 0.75) from DiseaseCounty limit 1) 
          THEN 1
          ELSE 0
         END 
        ) as DiseaseRFeat
     FROM DiseaseCounty) t1
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS NO
          FROM SensorsHour
          WHERE Pollutant='NO') t2
     ON t1.CountyID = t2.CountyID
          JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS NO2
          FROM SensorsHour
          WHERE Pollutant='NO2') t3
     ON t2.CountyID = t3.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS PM1
          FROM SensorsHour
          WHERE Pollutant='PM1') t4
     ON t3.CountyID = t4.CountyID

""")

In [12]:
# PolMeanLongDisease95perc : Pollutant mean; Disease level at or above 95th percentile
# pollutant list: ListOfPollutantsLong = ['NO','NO2','PM1']

dfPolMeanLongDisease95perc = spark.sql("""    
SELECT t1.CountyID, t1.DiseaseRFeat, t2.NO, t3.NO2, t4.PM1 
FROM
       (SELECT 
         CountyID,
         (case 
            WHEN DiseaseR >=
             (select percentile(DiseaseR, 0.95) from DiseaseCounty limit 1) 
          THEN 1
          ELSE 0
         END 
        ) as DiseaseRFeat
     FROM DiseaseCounty) t1
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS NO
          FROM SensorsHour
          WHERE Pollutant='NO') t2
     ON t1.CountyID = t2.CountyID
          JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS NO2
          FROM SensorsHour
          WHERE Pollutant='NO2') t3
     ON t2.CountyID = t3.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS PM1
          FROM SensorsHour
          WHERE Pollutant='PM1') t4
     ON t3.CountyID = t4.CountyID

""")

In [13]:
# PolLongPerc75Disease50perc : 75th percentile of pollutant concentration; Disease level at or above 50th percentile
# pollutant list: ListOfPollutantsLong = ['NO','NO2','PM1']

dfPolLongPerc75Disease50perc = spark.sql("""    
SELECT t1.CountyID, t1.DiseaseRFeat, t2.NO, t3.NO2, t4.PM1 
FROM
       (SELECT 
         CountyID,
         (case 
            WHEN DiseaseR >=
             (select percentile(DiseaseR, 0.5) from DiseaseCounty limit 1) 
          THEN 1
          ELSE 0
         END 
        ) as DiseaseRFeat
     FROM DiseaseCounty) t1
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS NO
          FROM SensorsHour
          WHERE Pollutant='NO') t2
     ON t1.CountyID = t2.CountyID
          JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS NO2
          FROM SensorsHour
          WHERE Pollutant='NO2') t3
     ON t2.CountyID = t3.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS PM1
          FROM SensorsHour
          WHERE Pollutant='PM1') t4
     ON t3.CountyID = t4.CountyID

""")

In [14]:
# PolLongPerc75Disease75perc : 75th percentile of pollutant concentration; Disease level at or above 75th percentile
# pollutant list: ListOfPollutantsLong = ['NO','NO2','PM1']

dfPolLongPerc75Disease75perc = spark.sql("""    
SELECT t1.CountyID, t1.DiseaseRFeat, t2.NO, t3.NO2, t4.PM1 
FROM
       (SELECT 
         CountyID,
         (case 
            WHEN DiseaseR >=
             (select percentile(DiseaseR, 0.75) from DiseaseCounty limit 1) 
          THEN 1
          ELSE 0
         END 
        ) as DiseaseRFeat
     FROM DiseaseCounty) t1
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS NO
          FROM SensorsHour
          WHERE Pollutant='NO') t2
     ON t1.CountyID = t2.CountyID
          JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS NO2
          FROM SensorsHour
          WHERE Pollutant='NO2') t3
     ON t2.CountyID = t3.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS PM1
          FROM SensorsHour
          WHERE Pollutant='PM1') t4
     ON t3.CountyID = t4.CountyID

""")

In [15]:
# PolLongPerc75Disease95perc : 75th percentile of pollutant concentration; Disease level at or above 95th percentile
# pollutant list: ListOfPollutantsLong = ['NO','NO2','PM1']

dfPolLongPerc75Disease95perc = spark.sql("""    
SELECT t1.CountyID, t1.DiseaseRFeat, t2.NO, t3.NO2, t4.PM1 
FROM
       (SELECT 
         CountyID,
         (case 
            WHEN DiseaseR >=
             (select percentile(DiseaseR, 0.95) from DiseaseCounty limit 1) 
          THEN 1
          ELSE 0
         END 
        ) as DiseaseRFeat
     FROM DiseaseCounty) t1
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS NO
          FROM SensorsHour
          WHERE Pollutant='NO') t2
     ON t1.CountyID = t2.CountyID
          JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS NO2
          FROM SensorsHour
          WHERE Pollutant='NO2') t3
     ON t2.CountyID = t3.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS PM1
          FROM SensorsHour
          WHERE Pollutant='PM1') t4
     ON t3.CountyID = t4.CountyID

""")

In [18]:
from pyspark.sql.functions import isnan, when, count, col

dftmp = dfPolLongPerc75Disease75perc

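# Count NaN values in each column; note that isnan() does not flag NULLs
dftmp.select([count(when(isnan(c), c)).alias(c) for c in dftmp.columns]).show()

# Preview the first rows of the feature matrix
dftmp.createOrReplaceTempView("PolLongPerc75Disease75perc")
spark.sql("select * from PolLongPerc75Disease75perc limit 10").show()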

+--------+------------+---+---+---+
|CountyID|DiseaseRFeat| NO|NO2|PM1|
+--------+------------+---+---+---+
|       0|           0|  0|  0|  0|
+--------+------------+---+---+---+

+--------+------------+------------+-----------+------------------+
|CountyID|DiseaseRFeat|          NO|        NO2|               PM1|
+--------+------------+------------+-----------+------------------+
|   14626|           0|       7.785|     20.616|23.673000000000002|
|   14521|           0|     288.893|     92.927|              64.8|
|   16051|           1| 27.11467361|38.37479019|       22.60592461|
|   14730|           0|      19.833|     54.754|            130.99|
|    1002|           0|      74.347|       58.3|           126.083|
|    1051|           1|     188.201|     80.934|            95.755|
|   15002|           0|    34.32875|   44.47335|          28.09545|
|   16076|           0|250.02636719|74.93490601|      209.64552307|
|   15001|           0|     281.628|    88.5767|          759.8985|
|  
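
The six feature matrices are built from queries that differ only in the pollutant aggregate and the disease percentile. One possible way to avoid the repetition is to generate the SQL text from those two parameters, as sketched below. The helper `build_feature_query` is a hypothetical name and not part of the notebook. The sketch also swaps in `GROUP BY` for the `DISTINCT` plus window function form used in the cells and joins every pollutant subquery directly to the label subquery; for inner joins on `CountyID` this is expected to return the same rows, but that has not been verified against the data. Pollutant names are interpolated into the SQL text as-is, so they are assumed to come from a trusted list.

```python
def build_feature_query(pollutants, agg_expr, disease_quantile):
    """Build Spark SQL text for one feature matrix.

    pollutants       -- pollutant names, each becomes one output column
    agg_expr         -- aggregate over PollutantConc, e.g. 'AVG(PollutantConc)'
    disease_quantile -- quantile in [0, 1] used to threshold DiseaseR
    """
    # Binary label: 1 if the county is at or above the prevalence quantile
    label_sql = (
        "(SELECT CountyID,\n"
        "        CASE WHEN DiseaseR >= "
        f"(SELECT percentile(DiseaseR, {disease_quantile}) FROM DiseaseCounty)\n"
        "             THEN 1 ELSE 0 END AS DiseaseRFeat\n"
        " FROM DiseaseCounty) d"
    )
    columns = ["d.CountyID", "d.DiseaseRFeat"]
    joins = []
    for i, name in enumerate(pollutants):
        alias = f"p{i}"
        columns.append(f"{alias}.{name}")
        # One aggregated value per county for this pollutant
        joins.append(
            f"JOIN (SELECT CountyID, {agg_expr} AS {name}\n"
            f"      FROM SensorsHour WHERE Pollutant = '{name}'\n"
            f"      GROUP BY CountyID) {alias}\n"
            f"  ON d.CountyID = {alias}.CountyID"
        )
    return "SELECT " + ", ".join(columns) + "\nFROM " + label_sql + "\n" + "\n".join(joins)


# Mean pollutant concentration, disease label at the 75th percentile
query = build_feature_query(["NO", "NO2", "PM1"], "AVG(PollutantConc)", 0.75)
print(query)
# In the notebook, this text could then be passed to spark.sql(query)
```

Passing `'percentile_approx(PollutantConc, 0.75)'` as `agg_expr` would produce the percentile-based variants in the same way.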

### Writing Spark DataFrames as Parquet to COS

In [20]:
# Write every feature matrix to COS as a Parquet file named after its DataFrame
bucket = 'capstone-donotdelete-pr-zpykcz8f0kxuad'
featureFrames = {
    'dfPolMeanLongDisease50perc': dfPolMeanLongDisease50perc,
    'dfPolMeanLongDisease75perc': dfPolMeanLongDisease75perc,
    'dfPolMeanLongDisease95perc': dfPolMeanLongDisease95perc,
    'dfPolLongPerc75Disease50perc': dfPolLongPerc75Disease50perc,
    'dfPolLongPerc75Disease75perc': dfPolLongPerc75Disease75perc,
    'dfPolLongPerc75Disease95perc': dfPolLongPerc75Disease95perc,
}
for name, df in featureFrames.items():
    df.write.parquet(cos.url(name + '.parquet', bucket))