
# Advanced Data Science Capstone

## Air pollution and prevalence of bronchial asthma in Germany  

## Feature Creation and Feature Engineering with Apache Spark SQL

### The deliverables
The deliverables of the current stage:

 - Current notebook as the process documentation
 - Spark DataFrames containing the county ID, a disease prevalence column, and features extracted from the air pollution data series of the sensors located in the corresponding county

### Feature creation
The basic features for air pollution levels are

 - Number of hours in which the pollutant concentration exceeded a given threshold
 - Mean concentration of the pollutant
 - Median or another quantile of the pollutant concentration
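
As a rough illustration of these three basic features, the following sketch computes them in plain Python for a single series of hourly readings. The function name `basic_features`, the sample readings and the threshold of 25.0 are made up for the example and are not taken from the project data.

```python
from statistics import mean, median


def basic_features(hourly_conc, threshold):
    """Return (hours above threshold, mean, median) for one series of hourly readings."""
    # Each reading stands for one hour, so counting readings counts hours.
    hours_above = sum(1 for c in hourly_conc if c > threshold)
    return hours_above, mean(hourly_conc), median(hourly_conc)


# Hypothetical hourly concentrations for a single sensor and pollutant.
readings = [10.0, 20.0, 30.0, 40.0, 50.0]
features = basic_features(readings, threshold=25.0)
print(features)
```

In the notebook itself the same quantities are computed per county and pollutant with Spark SQL aggregates rather than in plain Python.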
 
### Loading Apache Spark DataFrames from Cloud Object Storage (COS)

In [2]:
# The code was removed by Watson Studio for sharing.

In [None]:
import ibmos2spark
from pyspark.sql import SparkSession

# sc, credentials and configuration_name are presumably defined in the hidden cell above.
cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')

spark = SparkSession.builder.getOrCreate()

# Sensor readings in long format, and asthma prevalence per county.
dfAllLongSpark = spark.read.parquet(cos.url('dffAllLong.parquet', 'capstone-donotdelete-pr-zpykcz8f0kxuad'))
dfAsthmaSpark = spark.read.parquet(cos.url('Asthma.parquet', 'capstone-donotdelete-pr-zpykcz8f0kxuad'))

### Feature creation

The following features will be generated:
 - Pollutant concentration features
   - Average concentration of each pollutant over the year (averaged over all sensors within the county)
   - 75th percentile of the concentration of each pollutant over the year, which also serves as a proxy for the number of hours in which the concentration exceeded a given threshold
 - Health indicator features
   - Flag indicating whether the county's bronchial asthma prevalence is at or above the 50th, 75th or 95th percentile across all counties
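
The health indicator can be pictured as a simple threshold test. The sketch below, in plain Python, derives the cut-off from all counties and sets the flag to 1 when a county's prevalence is at or above it. The helper names, the county labels and the prevalence values are hypothetical, and the linear-interpolation percentile is an assumed convention, so the cut-off may differ slightly from what Spark returns.

```python
import math


def percentile_linear(values, q):
    """Percentile with linear interpolation between neighbouring ranks (assumed convention)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def high_risk_flags(prevalence_by_county, q):
    """Map each county to 1 if its prevalence is at or above the q-th percentile, else 0."""
    cutoff = percentile_linear(prevalence_by_county.values(), q)
    return {county: int(value >= cutoff) for county, value in prevalence_by_county.items()}


# Hypothetical prevalence values for five counties.
prevalence = {"A": 4.0, "B": 5.0, "C": 6.0, "D": 7.0, "E": 8.0}
flags = high_risk_flags(prevalence, 0.75)
print(flags)
```

Raising `q` from 0.5 to 0.95 shrinks the positive class, which is worth keeping in mind when the flag is later used as a classification target.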
 
Starting from the "long" Apache Spark DataFrame **dfAllLongSpark**, these quantities can be calculated directly with Apache Spark SQL:

In [3]:
# Register both DataFrames as temporary SQL views:
dfAllLongSpark.createOrReplaceTempView("SensorsHour")
dfAsthmaSpark.createOrReplaceTempView("DiseaseCounty")

#spark.sql("select * from SensorsHour limit 10").show()
#spark.sql("select * from DiseaseCounty limit 10").show() 
#dfAllLongSpark.printSchema()
#dfAsthmaSpark.printSchema()

Commented-out cell with quantile calculation examples:

In [4]:
#dfPollutantPercentilesSpark = spark.sql("""
#SELECT 
# distinct 
#     Pollutant, CountyID,
#     AVG(PollutantConc) over(PARTITION BY Pollutant, CountyID) AS Mean,
#     percentile_approx(PollutantConc,  0.5) over(PARTITION BY Pollutant, CountyID) as Percentile50
#--     ,
#--     percentile_approx(PollutantConc, 0.75) over(PARTITION BY Pollutant, CountyID) as Percentile75
#FROM SensorsHour
#""") 

#dfPollutantPercentilesSpark.createOrReplaceTempView("PollutantPercentiles")
#spark.sql("select * from PollutantPercentiles limit 10").show()

#spark.sql("""
#       SELECT 
#         CountyID,
#         (case 
#            WHEN DiseaseR >=
#             (select percentile(DiseaseR, 0.75) from DiseaseCounty limit 1) 
#          THEN 1
#          ELSE 0
#         END 
#        ) as DiseaseRFeat
#     FROM DiseaseCounty
#""").show(290)

Commented-out cell with a mean calculation example:

In [8]:
#spark.sql("""
#SELECT distinct 
#         CountyID,
#         AVG(PollutantConc) over(PARTITION BY CountyID) AS NO
#         FROM SensorsHour
#         WHERE Pollutant='NO'
#""").show(5)

### Calculation of feature matrices by means of Apache Spark SQL
Selected feature sets (for the full set see the Capstone.feature_eng.Pandas.X.X files) are generated with Apache Spark SQL.
The feature set DataFrames are named

dfPol**XXXX**Disease**YY**perc, where

**XXXX** = **MeanLong** denotes the mean concentration of each pollutant in the limited pollutant set (*NO, NO2, PM1*)

**XXXX** = **LongPerc75** denotes the 75th percentile of the concentration of each pollutant in the limited pollutant set (*NO, NO2, PM1*)

**YY** = 50, 75 or 95 denotes that the high-risk county flag is set when the county's bronchial asthma prevalence is at or above the 50th, 75th or 95th percentile across all counties
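
To make the naming scheme concrete, the short sketch below enumerates the DataFrame names it produces. The helper `feature_set_name` is hypothetical and only assembles strings; it does not create any DataFrame.

```python
def feature_set_name(pollutant_stat, disease_percentile):
    """Assemble a feature set name following the dfPolXXXXDiseaseYYperc pattern."""
    return f"dfPol{pollutant_stat}Disease{disease_percentile}perc"


# All combinations of pollutant statistic and disease percentile.
names = [
    feature_set_name(stat, perc)
    for stat in ("MeanLong", "LongPerc75")
    for perc in (50, 75, 95)
]
print(names)
```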


In [10]:
# PolMeanLongDisease50perc : Pollutant mean; Disease level at or above 50th percentile
# pollutant list: ListOfPollutantsLong = ['NO','NO2','PM1']

dfPolMeanLongDisease50perc = spark.sql("""    
SELECT t1.CountyID, t1.DiseaseRFeat, t2.NO, t3.NO2, t4.PM1 
FROM
       (SELECT 
         CountyID,
         (case 
            WHEN DiseaseR >=
             (select percentile(DiseaseR, 0.5) from DiseaseCounty limit 1) 
          THEN 1
          ELSE 0
         END 
        ) as DiseaseRFeat
     FROM DiseaseCounty) t1
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS NO
          FROM SensorsHour
          WHERE Pollutant='NO') t2
     ON t1.CountyID = t2.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS NO2
          FROM SensorsHour
          WHERE Pollutant='NO2') t3
     ON t2.CountyID = t3.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS PM1
          FROM SensorsHour
          WHERE Pollutant='PM1') t4
     ON t3.CountyID = t4.CountyID

""")

In [11]:
# PolMeanLongDisease75perc : Pollutant mean; Disease level at or above 75th percentile
# pollutant list: ListOfPollutantsLong = ['NO','NO2','PM1']

dfPolMeanLongDisease75perc = spark.sql("""    
SELECT t1.CountyID, t1.DiseaseRFeat, t2.NO, t3.NO2, t4.PM1 
FROM
       (SELECT 
         CountyID,
         (case 
            WHEN DiseaseR >=
             (select percentile(DiseaseR, 0.75) from DiseaseCounty limit 1) 
          THEN 1
          ELSE 0
         END 
        ) as DiseaseRFeat
     FROM DiseaseCounty) t1
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS NO
          FROM SensorsHour
          WHERE Pollutant='NO') t2
     ON t1.CountyID = t2.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS NO2
          FROM SensorsHour
          WHERE Pollutant='NO2') t3
     ON t2.CountyID = t3.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS PM1
          FROM SensorsHour
          WHERE Pollutant='PM1') t4
     ON t3.CountyID = t4.CountyID

""")

In [12]:
# PolMeanLongDisease95perc : Pollutant mean; Disease level at or above 95th percentile
# pollutant list: ListOfPollutantsLong = ['NO','NO2','PM1']

dfPolMeanLongDisease95perc = spark.sql("""    
SELECT t1.CountyID, t1.DiseaseRFeat, t2.NO, t3.NO2, t4.PM1 
FROM
       (SELECT 
         CountyID,
         (case 
            WHEN DiseaseR >=
             (select percentile(DiseaseR, 0.95) from DiseaseCounty limit 1) 
          THEN 1
          ELSE 0
         END 
        ) as DiseaseRFeat
     FROM DiseaseCounty) t1
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS NO
          FROM SensorsHour
          WHERE Pollutant='NO') t2
     ON t1.CountyID = t2.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS NO2
          FROM SensorsHour
          WHERE Pollutant='NO2') t3
     ON t2.CountyID = t3.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          AVG(PollutantConc) over(PARTITION BY CountyID, Pollutant) AS PM1
          FROM SensorsHour
          WHERE Pollutant='PM1') t4
     ON t3.CountyID = t4.CountyID

""")

In [13]:
# PolLongPerc75Disease50perc : 75th percentile of pollutant concentration; Disease level at or above 50th percentile
# pollutant list: ListOfPollutantsLong = ['NO','NO2','PM1']

dfPolLongPerc75Disease50perc = spark.sql("""    
SELECT t1.CountyID, t1.DiseaseRFeat, t2.NO, t3.NO2, t4.PM1 
FROM
       (SELECT 
         CountyID,
         (case 
            WHEN DiseaseR >=
             (select percentile(DiseaseR, 0.5) from DiseaseCounty limit 1) 
          THEN 1
          ELSE 0
         END 
        ) as DiseaseRFeat
     FROM DiseaseCounty) t1
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS NO
          FROM SensorsHour
          WHERE Pollutant='NO') t2
     ON t1.CountyID = t2.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS NO2
          FROM SensorsHour
          WHERE Pollutant='NO2') t3
     ON t2.CountyID = t3.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS PM1
          FROM SensorsHour
          WHERE Pollutant='PM1') t4
     ON t3.CountyID = t4.CountyID

""")

In [14]:
# PolLongPerc75Disease75perc : 75th percentile of pollutant concentration; Disease level at or above 75th percentile
# pollutant list: ListOfPollutantsLong = ['NO','NO2','PM1']

dfPolLongPerc75Disease75perc = spark.sql("""    
SELECT t1.CountyID, t1.DiseaseRFeat, t2.NO, t3.NO2, t4.PM1 
FROM
       (SELECT 
         CountyID,
         (case 
            WHEN DiseaseR >=
             (select percentile(DiseaseR, 0.75) from DiseaseCounty limit 1) 
          THEN 1
          ELSE 0
         END 
        ) as DiseaseRFeat
     FROM DiseaseCounty) t1
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS NO
          FROM SensorsHour
          WHERE Pollutant='NO') t2
     ON t1.CountyID = t2.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS NO2
          FROM SensorsHour
          WHERE Pollutant='NO2') t3
     ON t2.CountyID = t3.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS PM1
          FROM SensorsHour
          WHERE Pollutant='PM1') t4
     ON t3.CountyID = t4.CountyID

""")

In [15]:
# PolLongPerc75Disease95perc : 75th percentile of pollutant concentration; Disease level at or above 95th percentile
# pollutant list: ListOfPollutantsLong = ['NO','NO2','PM1']

dfPolLongPerc75Disease95perc = spark.sql("""    
SELECT t1.CountyID, t1.DiseaseRFeat, t2.NO, t3.NO2, t4.PM1 
FROM
       (SELECT 
         CountyID,
         (case 
            WHEN DiseaseR >=
             (select percentile(DiseaseR, 0.95) from DiseaseCounty limit 1) 
          THEN 1
          ELSE 0
         END 
        ) as DiseaseRFeat
     FROM DiseaseCounty) t1
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS NO
          FROM SensorsHour
          WHERE Pollutant='NO') t2
     ON t1.CountyID = t2.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS NO2
          FROM SensorsHour
          WHERE Pollutant='NO2') t3
     ON t2.CountyID = t3.CountyID
     JOIN
        (SELECT distinct 
          CountyID, Pollutant,
          percentile_approx(PollutantConc,  0.75) over(PARTITION BY Pollutant, CountyID) AS PM1
          FROM SensorsHour
          WHERE Pollutant='PM1') t4
     ON t3.CountyID = t4.CountyID

""")
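
The six queries above differ only in the pollutant aggregate and the disease percentile, so the SQL text could be generated instead of repeated. The following is a minimal sketch of that idea, not the code used in the notebook: `build_feature_query` is a hypothetical helper, and it uses `GROUP BY CountyID` in place of the `DISTINCT` plus window form, a substitution that is expected to give the same per-county aggregates but has not been run against the project data.

```python
def build_feature_query(agg_expr, disease_q, pollutants=("NO", "NO2", "PM1")):
    """Return the SQL text for one feature set (hypothetical helper).

    agg_expr  -- aggregate over PollutantConc, e.g. "AVG(PollutantConc)"
    disease_q -- prevalence quantile for the high-risk flag, e.g. 0.75
    """
    # Subquery with the binary high-risk flag per county.
    blocks = [
        "(SELECT CountyID,\n"
        "        CASE WHEN DiseaseR >= "
        f"(SELECT percentile(DiseaseR, {disease_q}) FROM DiseaseCounty)\n"
        "             THEN 1 ELSE 0 END AS DiseaseRFeat\n"
        " FROM DiseaseCounty) d"
    ]
    # One aggregated subquery per pollutant, inner-joined on the county id.
    for i, pollutant in enumerate(pollutants):
        blocks.append(
            f"JOIN (SELECT CountyID, {agg_expr} AS {pollutant}\n"
            f"      FROM SensorsHour\n"
            f"      WHERE Pollutant = '{pollutant}'\n"
            f"      GROUP BY CountyID) p{i}\n"
            f"  ON d.CountyID = p{i}.CountyID"
        )
    columns = ", ".join(f"p{i}.{p}" for i, p in enumerate(pollutants))
    return f"SELECT d.CountyID, d.DiseaseRFeat, {columns}\nFROM\n" + "\n".join(blocks)


query_mean_75 = build_feature_query("AVG(PollutantConc)", 0.75)
query_perc75_95 = build_feature_query("percentile_approx(PollutantConc, 0.75)", 0.95)
print(query_mean_75)
```

With the `SensorsHour` and `DiseaseCounty` views registered, the generated text could be passed to `spark.sql(...)` in the same way as the hand-written queries. Because the joins are inner joins, only counties with readings for all three pollutants appear in the result, as in the original queries.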

In [18]:
from pyspark.sql.functions import isnan, when, count, col

dftmp = dfPolLongPerc75Disease75perc

# Count NaN values per column (isnan does not flag SQL NULLs).
dftmp.select([count(when(isnan(c), c)).alias(c) for c in dftmp.columns]).show()

# Preview ten rows of the feature set.
dftmp.createOrReplaceTempView("PolLongPerc75Disease75perc")
spark.sql("select * from PolLongPerc75Disease75perc limit 10").show()

+--------+------------+---+---+---+
|CountyID|DiseaseRFeat| NO|NO2|PM1|
+--------+------------+---+---+---+
|       0|           0|  0|  0|  0|
+--------+------------+---+---+---+

+--------+------------+------------+-----------+------------------+
|CountyID|DiseaseRFeat|          NO|        NO2|               PM1|
+--------+------------+------------+-----------+------------------+
|   14626|           0|       7.785|     20.616|23.673000000000002|
|   14521|           0|     288.893|     92.927|              64.8|
|   16051|           1| 27.11467361|38.37479019|       22.60592461|
|   14730|           0|      19.833|     54.754|            130.99|
|    1002|           0|      74.347|       58.3|           126.083|
|    1051|           1|     188.201|     80.934|            95.755|
|   15002|           0|    34.32875|   44.47335|          28.09545|
|   16076|           0|250.02636719|74.93490601|      209.64552307|
|   15001|           0|     281.628|    88.5767|          759.8985|
|  

### Writing Spark DataFrames as Parquet to COS

In [20]:
dfPolMeanLongDisease50perc.write.parquet(cos.url('dfPolMeanLongDisease50perc.parquet', 'capstone-donotdelete-pr-zpykcz8f0kxuad'))
dfPolMeanLongDisease75perc.write.parquet(cos.url('dfPolMeanLongDisease75perc.parquet', 'capstone-donotdelete-pr-zpykcz8f0kxuad'))
dfPolMeanLongDisease95perc.write.parquet(cos.url('dfPolMeanLongDisease95perc.parquet', 'capstone-donotdelete-pr-zpykcz8f0kxuad'))

dfPolLongPerc75Disease50perc.write.parquet(cos.url('dfPolLongPerc75Disease50perc.parquet', 'capstone-donotdelete-pr-zpykcz8f0kxuad'))
dfPolLongPerc75Disease75perc.write.parquet(cos.url('dfPolLongPerc75Disease75perc.parquet', 'capstone-donotdelete-pr-zpykcz8f0kxuad'))
dfPolLongPerc75Disease95perc.write.parquet(cos.url('dfPolLongPerc75Disease95perc.parquet', 'capstone-donotdelete-pr-zpykcz8f0kxuad'))