
# Advanced Data Science Capstone

## Correlation of air pollution and Prevalence of Asthma bronchiale in Germany  

## Data cleansing

### The deliverables
The deliverables of the current stage:

 - current notebook as the process documentation
 - Pandas data frame of the "wide" type, containing the time series of pollutant concentrations, the unique sensor ID and a county id
 - Pandas data frame with disease prevalence column(s) (heart failures,...) and a county id

### Data cleansing
 - The air quality data sets are claimed to be "validated", so most of the cleansing work is already done.
 - The incomplete files from the datasets (those without "hour" in the name) are ignored.
 - The few missing values, which appear in the time series as negative pollutant concentrations, will be imputed.
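
The imputation idea from the last bullet can be tried on a toy table first. The following is only a minimal sketch: the column names `sensor_a` and `sensor_b`, the values and the small `limit` are made up for illustration. It shows that a short gap is filled completely, while a gap longer than `limit` is filled only partly, so the column still contains `NaN` and can be dropped.

```python
import pandas as pd

# Toy data: -999.0 marks a missing value, as in the air quality files
raw = pd.DataFrame({"sensor_a": [10.0, -999.0, 14.0, 16.0, 18.0],
                    "sensor_b": [5.0, -999.0, -999.0, -999.0, 9.0]})

# A concentration cannot be negative, so every negative value becomes NaN
clean = raw.mask(raw < 0.0)

# Linear interpolation; `limit` caps how many consecutive NaNs are filled
filled = clean.interpolate(method="linear", axis=0, limit=1)

# Columns that still contain NaN are considered too corrupted and are dropped
kept = filled.dropna(axis=1)

print(filled)
print("kept columns:", list(kept.columns))
```

With `limit=1` only the first missing value of the long gap in `sensor_b` is interpolated, so that column is removed by `dropna(axis=1)`.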

In [1]:
import urllib.request
import xml.etree.ElementTree as ET
from lxml import etree
import pandas as pd
import numpy as np

import re, collections
from io import StringIO
import os, fnmatch
#, fastparquet

import matplotlib.pyplot as plt

def SelectAllXMLsensorID():
    varFull = [s for s in AllTags if 'value' in s][0]
    return([re.sub(r'[^a-zA-Z0-9:]*\'{http(.*)$', r'', re.sub(r'^.*AQD\/SPO.DE_', r'', str(varr.attrib))) for varr in Eroot.iter(varFull) if 'AQD' in str(varr.attrib)]) 



Now the files with the pollutant concentration time series for the given year will be loaded into the **dffAll** Pandas data frame of the **wide** format. During the load procedure the **consistency** of **file** and **column** names will be checked.
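
The name check compares fixed character positions of the file name and of the column name. A minimal sketch of that comparison follows; the helper `names_consistent` and the file name `DE_ST_2016_NOx_hour.xml` are hypothetical and only chosen so that the character positions line up, while the column names follow the pattern seen in the data.

```python
def names_consistent(file_name: str, col_name: str) -> bool:
    """Compare country, state and pollutant codes by character position."""
    return (col_name[0:2] == file_name[0:2]          # country, e.g. "DE"
            and col_name[2:4] == file_name[3:5]      # state, e.g. "ST"
            and col_name[8:11] == file_name[11:14])  # pollutant, e.g. "NOx"

# Hypothetical file name; it is an assumption, not taken from the dataset
demo_file = "DE_ST_2016_NOx_hour.xml"

ok = names_consistent(demo_file, "DEST002_NOx_dataGroup1")
bad = names_consistent(demo_file, "DESN020_CHT_dataGroup1")
print(ok, bad)
```

A position-based check is brittle if the naming scheme ever changes, so a mismatch should be read as "inspect this file" rather than as proof of corrupted data.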

In [2]:
AirE1aDir='Capstone.rawData/AQD_DE_E1a_2016/'

#!ls Capstone.rawData/AQD_DE_E1a_2017/*hour*
FilesHour=[]

for file in os.listdir(AirE1aDir):
    if fnmatch.fnmatch(file, '*hour*'):
        FilesHour.append(file)
print("Number of files in the dataset", len(FilesHour))

# shortening the process for debugging purposes
#FilesHour=FilesHour[0:3]        

dffAll=pd.DataFrame(index=range(0,8784))  # 2016 is a leap year: 8784 hours

# add First column with Observation Times:
dff=[]  # Temporary list for DataFrames

file=FilesHour[0]
Etree = ET.parse(AirE1aDir+file)
Eroot = Etree.getroot()
Eroot.tag
Eroot.attrib
AllTags = [elem.tag for elem in Eroot.iter()]
varFull = [s for s in AllTags if 'values' in s][0]
for varr in Eroot.iter(varFull):
    dff.append(pd.read_csv(StringIO((varr.text).replace("@@","\n")), sep=",", header=None))
dffAll=pd.concat([dffAll, dff[0][[0]]], axis=1)
dffAll.columns=['observation_period']


# get all tags in xml file; Note, that the actual data is kept as a TEXT of *values* tags 
for file in FilesHour:
    Etree = ET.parse(AirE1aDir+file)
    Eroot = Etree.getroot()
    Eroot.tag
    Eroot.attrib
    AllTags = [elem.tag for elem in Eroot.iter()]
    
    ColNamesExp=SelectAllXMLsensorID()
# Compare column names with file names, they should encode same country, state and pollutant
    for ColName in ColNamesExp:
        if ((ColName[0:2]!=file[0:2]) or (ColName[2:4]!=file[3:5]) or (ColName[8:11]!=file[11:14])):
            # stop with an exception instead of exit(), which would shut down the notebook kernel
            raise ValueError(f"Inconsistency in file and column names: {file} {ColName}")
    
    varFull = [s for s in AllTags if 'values' in s][0]
    
    dff=[] # Temporary list for DataFrames
# reading the actual pollutant data from the text field:    
    for varr in Eroot.iter(varFull):
        dff.append(pd.read_csv(StringIO((varr.text).replace("@@","\n")), sep=",", header=None))

# checking that the measurement timestamps are identical in all files read
    for s in range(0,len(dff)):
        if not (dffAll['observation_period']).equals(dff[s][0]):
            # stop with an exception instead of exit(), which would shut down the notebook kernel
            raise ValueError(f"Inconsistency of observation times in the following files: {file} {FilesHour[0]}")

        
# select column 4 - pollutant concentration:
    dff=pd.concat([dff[s][4] for s in range(0,len(dff))], axis=1)
    dff.columns=ColNamesExp
   
    dffAll=pd.concat([dffAll, dff], axis=1)

Number of files in the dataset 51
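
The parsing step used in the loop above can be checked on a small synthetic string. This is a sketch under assumptions: the record content and the column name `DEXX000_NOx_dataGroup1` are invented, and only the facts used by the notebook are relied on (records separated by `@@`, fields by `,`, time in column 0 and concentration in column 4).

```python
from io import StringIO
import pandas as pd

# Synthetic stand-in for the text of one *values* element (made-up records)
text = ("2016-01-01T00:00:00+01:00,2016-01-01T01:00:00+01:00,1,1,12.5@@"
        "2016-01-01T01:00:00+01:00,2016-01-01T02:00:00+01:00,1,1,-999.0")

# "@@" separates the records, so it is turned into line breaks before parsing
block = pd.read_csv(StringIO(text.replace("@@", "\n")), sep=",", header=None)

# Column 0 holds the observation time, column 4 the concentration
wide = pd.concat([block[0], block[4]], axis=1)
wide.columns = ["observation_period", "DEXX000_NOx_dataGroup1"]  # made-up sensor column

print(wide)
```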


In [3]:
print("Memory usage: ", (dffAll.memory_usage(index=True).sum()/1048576.0), " MB")
dffAll.describe()

Memory usage:  35.25080871582031  MB


|  | DETH043_CHT_dataGroup1 | DEST002_NOx_dataGroup1 | DEST011_NOx_dataGroup1 | DEST015_NOx_dataGroup1 | DEST029_NOx_dataGroup1 | DEST039_NOx_dataGroup1 | DEST044_NOx_dataGroup1 | DEST050_NOx_dataGroup1 | DEST066_NOx_dataGroup1 | DEST075_NOx_dataGroup1 | ... | DEST090_O3_dataGroup1 | DEST098_O3_dataGroup1 | DEST104_O3_dataGroup1 | DEST105_O3_dataGroup1 | DEST106_O3_dataGroup1 | DESN020_CHT_dataGroup1 | DESN024_CHT_dataGroup1 | DESN025_CHT_dataGroup1 | DESN061_CHT_dataGroup1 | DESN074_CHT_dataGroup1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 8784.0 | 8784.0 | 8784.0 | 8784.0 | 8784.0 | 8784.0 | 8784.0 | 8784.0 | 8784.0 | 8784.0 | ... | 8784.0 | 8784.0 | 8784.0 | 8784.0 | 8784.0 | 8784.0 | 8784.0 | 8784.0 | 8784.0 | 8784.0 |
| mean | -83.778039 | 10.888974 | 14.562116 | 14.047963 | 27.748674 | 0.157481 | 15.33936 | 9.18396 | 5.545178 | 65.788163 | ... | 29.227681 | 42.774789 | 44.960701 | 44.889605 | 34.416207 | -107.805551 | -92.952179 | -106.665963 | -101.362931 | -87.667275 |
| std | 280.881482 | 85.650998 | 78.668791 | 81.596964 | 92.385789 | 67.737829 | 80.399452 | 127.564855 | 104.781496 | 119.676127 | ... | 141.960489 | 84.657592 | 84.167235 | 84.283806 | 105.943664 | 313.654667 | 291.921653 | 312.228331 | 304.616098 | 283.306356 |
| min | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | ... | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 |
| 25% | 0.705821 | 5.831979 | 8.993116 | 7.741056 | 14.106168 | 2.249845 | 7.341586 | 8.23764 | 6.473262 | 27.148387 | ... | 22.969537 | 28.12215 | 28.968387 | 27.787012 | 18.3109 | 0.845 | 0.431 | 0.92675 | 0.908 | 0.175 |
| 50% | 1.484682 | 10.978419 | 13.728892 | 13.064362 | 23.832317 | 3.773348 | 12.684082 | 15.456937 | 10.754043 | 54.808825 | ... | 45.49845 | 48.135175 | 47.872975 | 50.023275 | 41.9794 | 1.816 | 0.745 | 1.879 | 1.553 | 0.275 |
| 75% | 2.895999 | 20.749995 | 23.07227 | 24.123445 | 41.255151 | 5.823507 | 24.139927 | 29.928729 | 18.807399 | 99.925038 | ... | 66.178 | 66.632163 | 68.577738 | 69.561425 | 64.79405 | 3.08325 | 1.23025 | 3.16575 | 2.377 | 0.447 |
| max | 41.215481 | 232.7519 | 256.3176 | 244.3704 | 553.55265 | 81.001855 | 393.3323 | 373.88295 | 227.1963 | 675.5735 | ... | 184.551 | 151.138 | 181.4925 | 169.217 | 159.999 | 37.382 | 29.41 | 19.661 | 27.843 | 22.108 |


Now we have a **wide** data frame containing the time series of all pollutant concentrations for all sensors. The pollutant type and the sensor ID are encoded in the column names. The minimal concentration value *-999.0* is equivalent to *NA* and will be imputed, as will all other negative values (a concentration cannot be negative). The interpolation limit is set to 876 consecutive values, i.e. about 10% of the year, so longer *NA* sequences are not fully imputed. Columns that still contain *NA* afterwards count as heavily corrupted; since they make up less than 3% of all columns, they are dropped in favor of information quality:

In [4]:
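# every negative concentration (including the -999.0 marker) is treated as missing
valueCols = dffAll.columns[dffAll.columns != 'observation_period']
dffAll[valueCols] = dffAll[valueCols].mask(dffAll[valueCols] < 0.0)
# linear interpolation of the numeric columns only; at most 876 consecutive NaNs are filled from each side
dffAll[valueCols] = dffAll[valueCols].interpolate(method='linear', axis=0, limit=876, limit_direction='both')
print('The number of corrupted columns is ', dffAll.isna().any().sum(), ' of ', len(dffAll.columns))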
dffAll = dffAll.dropna(axis=1)
dffAll['observation_period']=pd.to_datetime(dffAll['observation_period'])
dffAll['observation_period']=dffAll['observation_period'].dt.to_period('H')
#dffAll['observation_period'][0].end_time
dffAll.tail(3)



The number of corrupted columns is  10  of  526




|  | observation_period | DETH043_CHT_dataGroup1 | DEST002_NOx_dataGroup1 | DEST011_NOx_dataGroup1 | DEST015_NOx_dataGroup1 | DEST029_NOx_dataGroup1 | DEST039_NOx_dataGroup1 | DEST044_NOx_dataGroup1 | DEST050_NOx_dataGroup1 | DEST066_NOx_dataGroup1 | ... | DEST090_O3_dataGroup1 | DEST098_O3_dataGroup1 | DEST104_O3_dataGroup1 | DEST105_O3_dataGroup1 | DEST106_O3_dataGroup1 | DESN020_CHT_dataGroup1 | DESN024_CHT_dataGroup1 | DESN025_CHT_dataGroup1 | DESN061_CHT_dataGroup1 | DESN074_CHT_dataGroup1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8781 | 2016-12-31 21:00 | 2.266842 | 15.716239 | 25.39692 | 40.19583 | 51.71306 | 1.505132 | 8.738043 | 43.08516 | 32.09499 | ... | 11.1866 | 38.85265 | 28.47085 | 41.19935 | 13.16545 | 1.958 | 3.287 | 2.478 | 2.213 | 0.206 |
| 8782 | 2016-12-31 22:00 | 1.897258 | 14.546514 | 8.653047 | 33.80841 | 41.815865 | 1.392986 | 7.927009 | 39.024955 | 37.50866 | ... | 14.38625 | 50.39245 | 27.4998 | 41.0519 | 11.6742 | 2.185 | 3.593 | 1.985 | 2.168 | 0.164 |
| 8783 | 2016-12-31 23:00 | 1.639371 | 9.365071 | 12.488082 | 31.645165 | 26.7691 | 1.217337 | 9.191041 | 35.06317 | 34.358935 | ... | 10.41294 | 64.19715 | 24.2934 | 41.12065 | 11.3646 | 1.837 | 2.575 | 2.56 | 2.235 | 0.125 |
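
Because the sensor ID and the pollutant are encoded in the column names, they can be recovered by splitting the names at the underscore. The sketch below is an illustration on made-up values, not a step of this notebook; the names `long_table`, `sensor_id` and `pollutant` are assumptions.

```python
import pandas as pd

# Small made-up excerpt in the same wide layout
wide = pd.DataFrame({
    "observation_period": ["2016-12-31 22:00", "2016-12-31 23:00"],
    "DEST002_NOx_dataGroup1": [1.0, 2.0],
    "DEST090_O3_dataGroup1": [3.0, 4.0],
})

# Wide -> long: one row per (time, sensor column)
long_table = wide.melt(id_vars="observation_period",
                       var_name="column", value_name="concentration")

# "DEST002_NOx_dataGroup1" -> sensor id "DEST002", pollutant "NOx"
parts = long_table["column"].str.split("_", expand=True)
long_table["sensor_id"] = parts[0]
long_table["pollutant"] = parts[1]

print(long_table[["observation_period", "sensor_id", "pollutant", "concentration"]])
```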


Now we can save the resulting dataset for further use:


In [5]:
###dffAll.dtypes
###fastparquet.write('Capstone.ETL/Capstone.etl.wide.1.0.parquet', dffAll)
os.makedirs('Capstone.ETL', exist_ok=True)  # create the output directory if it is missing
dffAll.to_csv('Capstone.ETL/Capstone.etl.wideCSV.1.0.gzip', compression='gzip', index=False)
###pd.Period(pd.to_datetime("2017-01-01T00:00:00+01:00"), freq='H').end_time
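
When the file is read back later, the compression probably has to be named explicitly, because to my knowledge pandas infers gzip compression only from extensions such as `.gz`, not from `.gzip`. A minimal round-trip sketch with a temporary, hypothetical file name and made-up values:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"observation_period": ["2016-12-31 23:00"],
                   "DEXX000_NOx_dataGroup1": [9.5]})  # made-up column and value

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.wideCSV.gzip")  # hypothetical file name
    df.to_csv(path, compression="gzip", index=False)
    # the ".gzip" suffix is assumed not to be auto-detected, so gzip is stated explicitly
    restored = pd.read_csv(path, compression="gzip")

print(restored)
```

Note that a CSV file stores the `observation_period` column as plain text, so it has to be converted to periods again after loading.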

### Sensor Locations
In order to use the spatial data, one needs the coordinates of the air pollution measurement sensors.
For the current study, the county index of every individual sensor is needed. First, all measurement station IDs and the town names of the sensor locations are read into the **SensorLocation** dataframe:

In [6]:
# pick all tags from the XML file
Etree = etree.parse("Capstone.rawData/AQD_DE_D_2017/DE_D_allInOne_metaMeasurements_2017.xml")
Eroot = Etree.getroot()
Eroot.tag
Eroot.attrib
AllTags = [elem.tag for elem in Eroot.iter()]

# get correct tag names for 'municipality', 'EUStationCode' and 'featureMember':
varMUN = [s for s in AllTags if 'municipality' in s][0]
varID  = [s for s in AllTags if 'EUStationCode' in s][0]
varFeatMem = [s for s in AllTags if 'featureMember' in s][0]

IDs=[]
MUNs=[]
# read 'municipality' and 'EUStationCode' to SensorLocation dataframe:
for varr in Eroot.iter(varFeatMem):
    for child in varr.iter(varMUN):
        MUNs.append(child.text)
        for child2 in varr.iter(varID):
            IDs.append(child2.text)
SensorLocation=pd.DataFrame({'SensorID': IDs, 'SensorTown': MUNs})
SensorLocation.tail(5)

|  | SensorID | SensorTown |
| --- | --- | --- |
| 803 | DEUB005 | Lüder |
| 804 | DEUB028 | Zingst |
| 805 | DEUB029 | Suhl |
| 806 | DEUB030 | Stechlin |
| 807 | DEUB044 | Garmisch-Partenkirchen |
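
The extraction above appends IDs and town names to two separate lists, which stay aligned only if every station entry carries both fields. A slightly more defensive variant is sketched below on a synthetic document; the namespace URIs, the element nesting and the helper `local` are placeholders, not the real structure of the metadata file.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Synthetic document; namespaces and nesting are placeholders
xml_text = """<root xmlns:g="urn:example:gml" xmlns:a="urn:example:aqd">
  <g:featureMember><a:Station>
    <a:EUStationCode>DEXX001</a:EUStationCode>
    <a:municipality>Town A</a:municipality>
  </a:Station></g:featureMember>
  <g:featureMember><a:Network><a:name>no station here</a:name></a:Network></g:featureMember>
  <g:featureMember><a:Station>
    <a:EUStationCode>DEXX002</a:EUStationCode>
    <a:municipality>Town B</a:municipality>
  </a:Station></g:featureMember>
</root>"""

root = ET.fromstring(xml_text)

def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]

rows = []
for member in root.iter():
    if local(member.tag) != "featureMember":
        continue
    found = {local(e.tag): e.text for e in member.iter()}
    # keep only members that carry both fields, so IDs and towns stay paired
    if "EUStationCode" in found and "municipality" in found:
        rows.append({"SensorID": found["EUStationCode"],
                     "SensorTown": found["municipality"]})

sensors = pd.DataFrame(rows)
print(sensors)
```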


In order to map the town names to the county names used in the health-related datasets, the town-county table **dfCT** will be created. It contains a 5-digit county id (not unique, but characterizing counties in some vicinity), the name of the town and the name of the county:

In [7]:
columns = [(10, 15), (22, 71), (72, 121)]
dfCT = pd.read_fwf("Capstone.rawData/GV100AD3107/GV100AD_310719.ASC", 
                     colspecs=columns, names=['countyid','town','county'],
                     encoding="iso8859_1")
dfCT=dfCT.ffill()  # forward fill; fillna(method='ffill') is deprecated in recent pandas versions

dfCT['town'] = dfCT['town'].str.split(",").str[0]
dfCT.tail(5)

|  | countyid | town | county |
| --- | --- | --- | --- |
| 16116 | 16077 | Starkenberg | Schmölln/Thür. |
| 16117 | 16077 | Thonhausen | Schmölln/Thür. |
| 16118 | 16077 | Treben | Schmölln/Thür. |
| 16119 | 16077 | Vollmershain | Schmölln/Thür. |
| 16120 | 16077 | Windischleuba | Schmölln/Thür. |
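
How the fixed-width read, the forward fill and the name clean-up interact can be seen on a few synthetic records. This is a sketch with an invented layout: the column positions, ids and names below are made up and do not match the real `GV100AD` file.

```python
from io import StringIO
import pandas as pd

# Made-up records: (county id, town, county); the second one has no county entry
records = [("99001", "Town A", "County One"),
           ("99001", "Town B, Stadt", ""),
           ("99002", "Town C", "County Two")]

# Build fixed-width lines: id in columns 0-4, town in 6-19, county in 21-32
text = "\n".join(f"{cid:<5} {town:<14} {county:<12}" for cid, town, county in records)

demo = pd.read_fwf(StringIO(text), colspecs=[(0, 5), (6, 20), (21, 33)],
                   names=["countyid", "town", "county"])

# An empty county field is read as NaN and inherits the value from the row above
demo = demo.ffill()

# Drop suffixes such as ", Stadt" from the town name
demo["town"] = demo["town"].str.split(",").str[0]
print(demo)
```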


### Prevalence of Heart failures
The central data frame of the model will contain the list of counties, the prevalence of disease(s) in these counties, and the set of air-pollution-based features. Let's load the *Prevalence of Heart failures* dataset:

In [8]:
xlsx_file = pd.ExcelFile("Capstone.rawData/Heart_2017/data_id_97_kreis11_2_j_1483228800.xlsx")
print("xls sheet names: ",xlsx_file.sheet_names)
dfHeart = xlsx_file.parse('Daten', header=3, decimal=",") 
print(dfHeart.head(3))
print("Number of duplicates in Regions-ID column: ", dfHeart.duplicated(['Regions-ID']).sum())

xls sheet names:  ['Hintergrundinformationen', 'Daten']
             Region  Regions-ID  KV           Kreistyp  Wert  Bundeswert
0            Lk.Hof        9475  BY  Ländliches Umland  6.43        3.11
1  Mansfeld-Südharz       15087  ST    Ländlicher Raum  6.37        3.11
2               Hof        9464  BY  Ländliches Umland  6.36        3.11
Number of duplicates in Regions-ID column:  0


The mapping starts by assigning a **countyid** to every **SensorID** in the **SensorLocation** dataframe:


In [9]:
SensorLocation = (SensorLocation.join(dfCT[['countyid','town']].set_index('town'),
                                      on='SensorTown')).drop_duplicates(subset=['SensorID'])

Checking the resulting table shows that 30 of 808 entries have no resolved **countyid**:

In [10]:
print("Total number of sensors: ", SensorLocation.count())
print("Number of sensors with unresolved countyid: ", SensorLocation[SensorLocation.isna().any(axis=1)].count())
#print("List of unresolved sensors:")
#SensorLocation[SensorLocation.isna().any(axis=1)]
#print("Number of duplicates in SensorID column: ", SensorLocation.duplicated(['SensorID']).sum())
#SensorLocation.loc[SensorLocation.duplicated(['SensorID'])==True]

Total number of sensors:  SensorID      808
SensorTown    808
countyid      778
dtype: int64
Number of sensors with unresolved countyid:  SensorID      30
SensorTown    30
countyid       0
dtype: int64
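
The join and the count of unresolved sensors can be reproduced on toy data. The sketch below uses made-up sensor IDs, town names and county ids; it also shows why `drop_duplicates` is needed when a town name occurs more than once in the town-county table.

```python
import pandas as pd

sensors = pd.DataFrame({"SensorID": ["DEXX001", "DEXX002", "DEXX003"],
                        "SensorTown": ["Town A", "Town B", "Town Z"]})
# "Town A" is listed twice, "Town Z" is missing from the town-county table
towns = pd.DataFrame({"countyid": [99001, 99001, 99002],
                      "town": ["Town A", "Town A", "Town B"]})

located = (sensors.join(towns[["countyid", "town"]].set_index("town"), on="SensorTown")
                  .drop_duplicates(subset=["SensorID"]))

n_total = len(located)
n_unresolved = int(located["countyid"].isna().sum())
print(f"{n_unresolved} of {n_total} sensors have no countyid")
```

Counting with `len()` and `isna().sum()` yields single numbers, which are easier to read than the per-column output of `count()`.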


At the moment it is easier to drop the data of these 4% of sensors. Alternatively, this table could be corrected manually, since it has a reasonable size and its contents (sensor labels/county codes) hardly change over time.
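
A manual correction could look roughly like the following sketch. The dictionary `manual_countyid` is hypothetical, and its town name and county id are placeholders rather than real values.

```python
import pandas as pd

located = pd.DataFrame({"SensorID": ["DEXX001", "DEXX003"],
                        "SensorTown": ["Town A", "Town Z"],
                        "countyid": [99001.0, None]})

# Hypothetical hand-maintained corrections: town name -> county id
manual_countyid = {"Town Z": 99003}

# Fill only the missing ids; already resolved entries are left untouched
located["countyid"] = located["countyid"].fillna(located["SensorTown"].map(manual_countyid))
located = located.astype({"countyid": int})
print(located)
```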

In [11]:
SensorLocation=SensorLocation.dropna()
SensorLocation=SensorLocation.astype({'countyid':int})
SensorLocation.head(5)

|  | SensorID | SensorTown | countyid |
| --- | --- | --- | --- |
| 0 | DEBB007 | Elsterwerda | 12062 |
| 1 | DEBB021 | Potsdam | 12054 |
| 2 | DEBB026 | Spremberg | 12071 |
| 3 | DEBB029 | Schwedt/Oder | 12073 |
| 4 | DEBB032 | Eisenhüttenstadt | 12067 |


Finally, the sensor location and heart failure data frames will be saved for further use:

In [12]:
SensorLocation.to_csv('Capstone.ETL/Capstone.etl.SensorLocationCSV.1.0.csv', index=False)
dfHeart.to_csv('Capstone.ETL/Capstone.etl.dfHeartCSV.1.0.csv', index=False)

### Prevalence of Asthma bronchiale
The central data frame of the model will contain the list of counties, the prevalence of disease(s) in these counties, and the set of air-pollution-based features. Let's load the *Prevalence of Asthma bronchiale* dataset:

In [13]:
xlsx_file = pd.ExcelFile("Capstone.rawData/Asthma_2016/data_id_92_kreis11_1_j_1451606400.xlsx")
print("xls sheet names: ",xlsx_file.sheet_names)
dfAsthma = xlsx_file.parse('Daten', header=3, decimal=",") 
print(dfAsthma.head(3))
print("Number of duplicates in Regions-ID column: ", dfAsthma.duplicated(['Regions-ID']).sum())
dfAsthma.to_csv('Capstone.ETL/Capstone.etl.dfAsthmaCSV.1.0.csv', index=False)

xls sheet names:  ['Hintergrundinformationen', 'Daten']
      Region  Regions-ID  KV             Kreistyp  Wert  Bundeswert
0   Eisenach       16056  TH    Ländliches Umland   8.9         5.7
1  Sonneberg       16072  TH      Ländlicher Raum   8.7         5.7
2  Ammerland        3451  NI  Verdichtetes Umland   8.5         5.7
Number of duplicates in Regions-ID column:  0
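
One possible way to assemble the central data frame from the saved pieces is sketched below on toy data. The annual mean as the pollution feature is an assumption made for illustration only, not the feature set of this study, and all IDs and values are made up.

```python
import pandas as pd

# Toy stand-ins for the three saved tables
wide = pd.DataFrame({"DEXX001_NOx_dataGroup1": [10.0, 20.0],
                     "DEXX002_NOx_dataGroup1": [30.0, 50.0]})
sensor_location = pd.DataFrame({"SensorID": ["DEXX001", "DEXX002"],
                                "countyid": [99001, 99002]})
prevalence = pd.DataFrame({"Regions-ID": [99001, 99002], "Wert": [8.9, 8.7]})

# Annual mean per sensor column; the sensor id is recovered from the column name
annual = (wide.mean().rename("annual_mean").reset_index()
              .rename(columns={"index": "column"}))
annual["SensorID"] = annual["column"].str.split("_").str[0]

# Sensor -> county, then the average over all sensors of a county
per_county = (annual.merge(sensor_location, on="SensorID")
                    .groupby("countyid", as_index=False)["annual_mean"].mean())

# County id is the common key between prevalence and pollution features
model_table = prevalence.merge(per_county, left_on="Regions-ID",
                               right_on="countyid", how="inner")
print(model_table)
```

The inner merge keeps only counties that have both a prevalence value and at least one located sensor.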
