
# Advanced Data Science Capstone

## Correlation of Air Pollution and Prevalence of Heart Failures in Germany

## Data cleansing

### The deliverables
The deliverables of the current stage are:

 - the current notebook, which documents the process
 - a Pandas dataframe in "wide" format, containing the time series of pollutant concentrations together with the unique sensor ID and a county ID
 - a Pandas dataframe with disease prevalence column(s) (heart failures, ...) and a county ID

### Data cleansing
 - The air quality datasets are described as "validated", so most of the cleansing work is already done.
 - Incomplete files in the datasets (those without "hour" in the file name) are ignored.
 - Missing values, which appear in the time series as negative pollutant concentrations (-999.0), will be imputed.

In [1]:
import urllib.request
import xml.etree.ElementTree as ET
from lxml import etree
import pandas as pd
import numpy as np

import re, collections
from io import StringIO
import os, fnmatch, fastparquet

import matplotlib.pyplot as plt

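def SelectAllXMLsensorID():
    """Return column names (sensor ID and pollutant) parsed from the globals Eroot and AllTags."""
    varFull = [s for s in AllTags if 'value' in s][0]
    # drop everything up to 'AQD/SPO.DE_' and cut the trailing namespace residue of the attribute string
    return [re.sub(r'[^a-zA-Z0-9:]*\'{http(.*)$', r'', re.sub(r'^.*AQD\/SPO.DE_', r'', str(varr.attrib))) for varr in Eroot.iter(varFull) if 'AQD' in str(varr.attrib)]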



Now the files with the pollutant concentration time series for the given year are loaded into the **dffAll** Pandas dataframe in **wide** format. During loading, the **consistency** of **file** and **column** names is checked.

In [2]:
AirE1aDir='Capstone.rawData/AQD_DE_E1a_2017/'

#!ls Capstone.rawData/AQD_DE_E1a_2017/*hour*
FilesHour=[]

for file in os.listdir(AirE1aDir):
    if fnmatch.fnmatch(file, '*hour*'):
        FilesHour.append(file)
print("Number of files in the dataset", len(FilesHour))

# shortening the process for debugging purposes
#FilesHour=FilesHour[0:3]        

dffAll=pd.DataFrame(index=range(0,8760))  # 8760 hours in the year

# add First column with Observation Times:
dff=[]  # Temporary list for DataFrames

file=FilesHour[0]
Etree = ET.parse(AirE1aDir+file)
Eroot = Etree.getroot()
Eroot.tag
Eroot.attrib
AllTags = [elem.tag for elem in Eroot.iter()]
varFull = [s for s in AllTags if 'values' in s][0]
for varr in Eroot.iter(varFull):
    dff.append(pd.read_csv(StringIO((varr.text).replace("@@","\n")), sep=",", header=None))
dffAll=pd.concat([dffAll, dff[0][[0]]], axis=1)
dffAll.columns=['observation_period']


# get all tags in xml file; Note, that the actual data is kept as a TEXT of *values* tags 
for file in FilesHour:
    Etree = ET.parse(AirE1aDir+file)
    Eroot = Etree.getroot()
    Eroot.tag
    Eroot.attrib
    AllTags = [elem.tag for elem in Eroot.iter()]
    
    ColNamesExp=SelectAllXMLsensorID()
# Compare column names with file names; they should encode the same country, state and pollutant
    for ColName in ColNamesExp:
        if ((ColName[0:2]!=file[0:2]) or (ColName[2:4]!=file[3:5]) or (ColName[8:11]!=file[11:14])):
            # abort the load on the first mismatch
            raise ValueError("Inconsistency in file and column names: " + file + " " + ColName)
    
    varFull = [s for s in AllTags if 'values' in s][0]
    
    dff=[] # Temporary list for DataFrames
# reading the actual pollutant data from the text field:
    for varr in Eroot.iter(varFull):
        dff.append(pd.read_csv(StringIO((varr.text).replace("@@","\n")), sep=",", header=None))

# checking that the measurement timestamps are identical in the files read
    for s in range(0,len(dff)):
        if not (dffAll['observation_period']).equals(dff[s][0]):
            # abort the load on the first mismatch
            raise ValueError("Inconsistency of observation times in the following files: " + file + " " + FilesHour[0])

        
# select column 4 - pollutant concentration:
    dff=pd.concat([dff[s][4] for s in range(0,len(dff))], axis=1)
    dff.columns=ColNamesExp
   
    dffAll=pd.concat([dffAll, dff], axis=1)

Number of files in the dataset 156
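
The parsing step used in the loop above can be illustrated on a tiny synthetic string. This is only a sketch: the two records are invented, and the meaning of columns 1 to 3 is an assumption. The notebook itself only relies on column 0 (observation time) and column 4 (pollutant concentration). `parse_values_block` is a hypothetical helper name.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-record sample; the real files hold 8760 hourly records per sensor.
# Columns 1-3 (end time and two flags) are assumed for illustration only.
sample_text = (
    "2017-01-01T00:00:00+01:00,2017-01-01T01:00:00+01:00,1,1,45.3@@"
    "2017-01-01T01:00:00+01:00,2017-01-01T02:00:00+01:00,1,1,-999.0"
)


def parse_values_block(text):
    """Turn an '@@'-separated block of comma-separated records into a DataFrame."""
    return pd.read_csv(StringIO(text.replace("@@", "\n")), sep=",", header=None)


records = parse_values_block(sample_text)
concentration = records[4]  # column 4 holds the pollutant concentration
print(records.shape)
print(concentration.tolist())
```

Replacing `@@` with a newline turns the text field into ordinary CSV, so `pd.read_csv` can do the rest without a custom parser.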


In [3]:
print("Memory usage: ", (dffAll.memory_usage(index=True).sum()/1048576.0), " MB")
dffAll.describe()

Memory usage:  151.5784454345703  MB


Unnamed: 0,DESH001_O3_dataGroup1,DESH008_O3_dataGroup1,DESH013_O3_dataGroup1,DESH014_O3_dataGroup1,DESH015_O3_dataGroup1,DESH016_O3_dataGroup1,DESH023_O3_dataGroup1,DESH033_O3_dataGroup1,DESH035_O3_dataGroup1,DESH056_O3_dataGroup1,...,DEHH015_PM1_dataGroup1,DEHH016_PM1_dataGroup1,DEHH026_PM1_dataGroup1,DEHH033_PM1_dataGroup1,DEHH059_PM1_dataGroup1,DEHH068_PM1_dataGroup1,DEHH070_PM1_dataGroup1,DEHH072_PM1_dataGroup1,DEHH079_PM1_dataGroup1,DEHH081_PM1_dataGroup1
count,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,...,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0
mean,45.374908,48.709722,37.286848,60.931687,42.575316,-397.154575,39.948094,-511.639592,37.8675,-28.696564,...,-28.273549,-8.032516,-55.883309,-12.692748,-28.061636,16.24762,3.343157,-28.703924,-0.194743,-10.264745
std,50.596504,73.491147,149.589802,33.518547,85.25826,511.246865,104.336547,525.860031,70.716164,281.257366,...,211.948821,155.066454,269.333205,170.167351,208.851818,87.331396,125.651302,209.301279,139.33224,174.797689
min,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,...,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
25%,29.51725,37.53475,44.86225,46.69175,32.35075,-999.0,32.58025,-999.0,26.05625,29.8335,...,8.66775,7.45475,10.366,8.229,7.656,11.96075,10.45875,6.42725,11.1325,10.7225
50%,48.1995,53.9885,59.3845,65.548,49.974,8.216,50.5045,-999.0,42.929,52.865,...,13.801,12.7235,16.2905,13.302,12.8885,19.36,15.7815,11.7485,15.707,16.1135
75%,64.562,68.6775,70.98,77.2265,64.91525,41.18025,66.478,55.73,56.65325,68.088,...,21.44475,20.19275,25.06225,20.4035,20.664,30.738,23.23725,20.78725,22.993,23.94925
max,145.66,131.607,138.455,154.401,143.936,142.818,144.126,127.519,124.948,132.191,...,275.973,190.385,266.052,152.603,196.997,469.227,277.583,113.427,170.415,537.024


Now we have a **wide** dataframe containing the time series of all pollutant concentrations for all sensors. The pollutant type and the sensor ID are encoded in the column names. The minimum value of the pollutant concentrations, *-999.0*, stands for *NA* and will be imputed:

In [4]:
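# -999.0 marks a missing measurement: replace it with NaN, then fill the gaps by linear interpolation
dffAll[dffAll == -999.0] = np.nan
dffAll.interpolate(method='linear', inplace=True, axis=0)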
dffAll['observation_period']=pd.to_datetime(dffAll['observation_period'])
dffAll['observation_period']=dffAll['observation_period'].dt.to_period('H')
dffAll.describe()
#dffAll['observation_period'][0].end_time
dffAll.tail(3)



Unnamed: 0,observation_period,DESH001_O3_dataGroup1,DESH008_O3_dataGroup1,DESH013_O3_dataGroup1,DESH014_O3_dataGroup1,DESH015_O3_dataGroup1,DESH016_O3_dataGroup1,DESH023_O3_dataGroup1,DESH033_O3_dataGroup1,DESH035_O3_dataGroup1,...,DEHH015_PM1_dataGroup1,DEHH016_PM1_dataGroup1,DEHH026_PM1_dataGroup1,DEHH033_PM1_dataGroup1,DEHH059_PM1_dataGroup1,DEHH068_PM1_dataGroup1,DEHH070_PM1_dataGroup1,DEHH072_PM1_dataGroup1,DEHH079_PM1_dataGroup1,DEHH081_PM1_dataGroup1
8757,2017-12-31 21:00,63.301,60.667,67.917,59.453,64.209,63.144,62.95,17.042,58.144,...,18.904,5.162,33.793,16.538,29.383,63.372,27.154,8.836,12.413,8.112
8758,2017-12-31 22:00,70.473,62.127,65.207,60.882,68.517,67.69,61.775,17.042,63.06,...,22.25,10.204,47.489,26.634,32.953,74.021,24.785,8.177,15.343,6.805
8759,2017-12-31 23:00,69.297,71.514,68.926,69.438,76.856,70.269,68.019,17.042,74.892,...,16.465,11.866,54.099,25.612,24.207,70.609,25.648,6.642,15.602,5.61
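
The summary statistics printed before the imputation show that for some sensors the median is -999.0, so at least half of their values are missing, and linear interpolation bridges such long gaps with straight lines. A possible additional check, sketched below on invented data, is to compute the share of sentinel values per column and to interpolate only the columns below a threshold. The 50% threshold, the toy column names and the helper `missing_share` are assumptions and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names only mimic the naming pattern used above.
toy = pd.DataFrame({
    "DEXX001_O3_dataGroup1": [10.0, -999.0, 30.0, 40.0],
    "DEXX002_O3_dataGroup1": [-999.0, -999.0, -999.0, 5.0],
})


def missing_share(df, sentinel=-999.0):
    """Fraction of sentinel-coded (missing) values per column."""
    return (df == sentinel).mean()


share = missing_share(toy)
max_share = 0.5  # assumed threshold, not taken from the notebook
keep = share[share <= max_share].index

# Interpolate only the columns that have enough real measurements.
cleaned = toy[keep].replace(-999.0, np.nan).interpolate(method="linear", axis=0)
print(share.to_dict())
print(cleaned)
```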


Now we can save the resulting dataset for further use:


In [5]:
#dffAll.dtypes
#fastparquet.write('Capstone.ETL/Capstone.etl.wide.1.0.parquet', dffAll)
dffAll.to_csv('Capstone.ETL/Capstone.etl.wideCSV.1.0.gzip', compression='gzip')
#pd.Period(pd.to_datetime("2017-01-01T00:00:00+01:00"), freq='H').end_time
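
For later stages, the saved file has to be read back. A minimal round-trip sketch on a toy frame follows; the file name is hypothetical. To my understanding, pandas infers the compression only from suffixes such as `.gz`, so with a `.gzip` suffix the compression is passed explicitly here.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"a": [1.5, 2.5], "b": [3.0, 4.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.wideCSV.gzip")  # hypothetical file name
    toy.to_csv(path, compression="gzip")
    # Pass the compression explicitly; index_col=0 restores the saved index column.
    restored = pd.read_csv(path, compression="gzip", index_col=0)

print(restored)
```

Note that CSV stores no type information, so a period column such as `observation_period` would come back as plain text and need to be converted again.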

### Sensor Locations
To use the spatial data, one needs the coordinates of the air pollution measurement sensors.
For the current study, the county index of every individual sensor is needed. First, all measurement station IDs and the town names of the sensor locations are read into the **SensorLocation** dataframe:

In [6]:
# pick all tags from the XML file
Etree = etree.parse("Capstone.rawData/AQD_DE_D_2017/DE_D_allInOne_metaMeasurements_2017.xml")
Eroot = Etree.getroot()
Eroot.tag
Eroot.attrib
AllTags = [elem.tag for elem in Eroot.iter()]

# get correct tag names for 'municipality', 'EUStationCode' and 'featureMember':
varMUN = [s for s in AllTags if 'municipality' in s][0]
varID  = [s for s in AllTags if 'EUStationCode' in s][0]
varFeatMem = [s for s in AllTags if 'featureMember' in s][0]

IDs=[]
MUNs=[]
# read 'municipality' and 'EUStationCode' to SensorLocation dataframe:
for varr in Eroot.iter(varFeatMem):
    for child in varr.iter(varMUN):
        MUNs.append(child.text)
        for child2 in varr.iter(varID):
            IDs.append(child2.text)
SensorLocation=pd.DataFrame({'SensorID': IDs, 'SensorTown': MUNs})
SensorLocation.tail(5)

Unnamed: 0,SensorID,SensorTown
803,DEUB005,Lüder
804,DEUB028,Zingst
805,DEUB029,Suhl
806,DEUB030,Stechlin
807,DEUB044,Garmisch-Partenkirchen


To map town names to the county names used in the health-related datasets, the town-county table **dfCT** is created. It contains a 5-digit county ID (not unique across rows, but identifying the counties in a given vicinity), the town name and the county name:

In [7]:
columns = [(10, 15), (22, 71), (72, 121)]
dfCT = pd.read_fwf("Capstone.rawData/GV100AD3107/GV100AD_310719.ASC", 
                     colspecs=columns, names=['countyid','town','county'],
                     encoding="iso8859_1")
dfCT=dfCT.ffill()  # forward-fill blank fields from the preceding record

dfCT['town'] = dfCT['town'].str.split(",").str[0]
dfCT.tail(5)

Unnamed: 0,countyid,town,county
16116,16077,Starkenberg,Schmölln/Thür.
16117,16077,Thonhausen,Schmölln/Thür.
16118,16077,Treben,Schmölln/Thür.
16119,16077,Vollmershain,Schmölln/Thür.
16120,16077,Windischleuba,Schmölln/Thür.
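
The fixed-width read followed by a forward fill can be tried on a small synthetic sample. This is a sketch only: the rows, names and column offsets are invented and do not reproduce the layout of the GV100AD file. The `colspecs` intervals are half-open, i.e. `(6, 21)` covers character positions 6 to 20.

```python
from io import StringIO

import pandas as pd

# Hypothetical records: the second row has a blank county field.
rows = [
    ("11111", "Atown, Stadt", "Acounty"),
    ("11111", "Bvillage", ""),
    ("22222", "Ctown", "Ccounty"),
]
# Build fixed-width lines: id in columns 0-5, town in 6-20, county in 21-35.
sample = "\n".join(f"{cid:<6}{town:<15}{county:<15}" for cid, town, county in rows)

toy = pd.read_fwf(
    StringIO(sample),
    colspecs=[(0, 5), (6, 21), (21, 36)],
    names=["countyid", "town", "county"],
)
toy = toy.ffill()  # blank fields inherit the value of the preceding record
toy["town"] = toy["town"].str.split(",").str[0]  # drop suffixes such as ", Stadt"
print(toy)
```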


### Prevalence of Heart failures
The central dataframe of the model will contain the list of counties, the prevalence of disease(s) in these counties, and the set of air-pollution-based features. Let's load the *Prevalence of Heart failures* dataset:

In [8]:
xlsx_file = pd.ExcelFile("Capstone.rawData/Heart_2017/data_id_97_kreis11_2_j_1483228800.xlsx")
print("xls sheet names: ",xlsx_file.sheet_names)
dfHeart = xlsx_file.parse('Daten', header=3, decimal=",") 
print(dfHeart.head(3))
print("Number of duplicates in Regions-ID column: ", dfHeart.duplicated(['Regions-ID']).sum())

xls sheet names:  ['Hintergrundinformationen', 'Daten']
             Region  Regions-ID  KV           Kreistyp  Wert  Bundeswert
0            Lk.Hof        9475  BY  Ländliches Umland  6.43        3.11
1  Mansfeld-Südharz       15087  ST    Ländlicher Raum  6.37        3.11
2               Hof        9464  BY  Ländliches Umland  6.36        3.11
Number of duplicates in Regions-ID column:  0


The mapping starts by assigning a **countyid** to every **SensorID** in the **SensorLocation** dataframe:


In [9]:
SensorLocation = (SensorLocation.join(dfCT[['countyid','town']].set_index('town'),
                                      on='SensorTown')).drop_duplicates(subset=['SensorID'])

Checking the resulting table shows that 30 of the 808 entries have an unresolved **countyid**:

In [10]:
print("Total number of sensors: ", SensorLocation.count())
print("Number of sensors with unresolved countyid: ", SensorLocation[SensorLocation.isna().any(axis=1)].count())
#print("List of unresolved sensors:")
#SensorLocation[SensorLocation.isna().any(axis=1)]
#print("Number of duplicates in SensorID column: ", SensorLocation.duplicated(['SensorID']).sum())
#SensorLocation.loc[SensorLocation.duplicated(['SensorID'])==True]

Total number of sensors:  SensorID      808
SensorTown    808
countyid      778
dtype: int64
Number of sensors with unresolved countyid:  SensorID      30
SensorTown    30
countyid       0
dtype: int64
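
A join on town names can also match one sensor to several counties when a town name is not unique, which is why the duplicates were dropped above. One way to inspect both cases, sketched on invented data, is a left merge with `indicator=True`. All sensor IDs, town names and county IDs below are hypothetical.

```python
import pandas as pd

# Hypothetical sensors and town-county table.
sensors = pd.DataFrame({
    "SensorID": ["S1", "S2", "S3"],
    "SensorTown": ["Atown", "Btown", "Nowhere"],
})
towns = pd.DataFrame({
    "countyid": [11111, 22222, 33333],
    "town": ["Atown", "Btown", "Btown"],  # 'Btown' exists in two counties
})

merged = sensors.merge(
    towns, how="left", left_on="SensorTown", right_on="town", indicator=True
)

# Sensors matched to more than one county (ambiguous town name).
ambiguous = merged.loc[merged.duplicated("SensorID", keep=False), "SensorID"].unique()
# Sensors whose town was not found in the town-county table.
unresolved = merged.loc[merged["_merge"] == "left_only", "SensorID"].tolist()

print("ambiguous:", list(ambiguous))
print("unresolved:", unresolved)
```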


For now, it is easier to drop the data of these 4% of sensors. Alternatively, this table could be corrected manually, since it has a reasonable size and its contents (sensor labels/county codes) hardly change over time.

In [11]:
SensorLocation=SensorLocation.dropna()
SensorLocation=SensorLocation.astype({'countyid':int})
SensorLocation.head(5)

Unnamed: 0,SensorID,SensorTown,countyid
0,DEBB007,Elsterwerda,12062
1,DEBB021,Potsdam,12054
2,DEBB026,Spremberg,12071
3,DEBB029,Schwedt/Oder,12073
4,DEBB032,Eisenhüttenstadt,12067


Finally this dataframe will be saved for further use:

In [12]:
SensorLocation.to_csv('Capstone.ETL/Capstone.etl.SensorLocationCSV.1.0.csv')
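
As a possible next step, the two saved tables could be linked through the sensor ID encoded in the column names of the wide dataframe. The sketch below uses invented data and assumes that every column name follows the pattern `<SensorID>_<pollutant>_<group>` seen in the outputs above. The per-county mean is only one conceivable aggregation and is not taken from the notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the wide table and the sensor-location table.
wide = pd.DataFrame({
    "observation_period": ["2017-01-01 00:00", "2017-01-01 01:00"],
    "DEXX001_O3_dataGroup1": [40.0, 42.0],
    "DEXX002_PM1_dataGroup1": [10.0, 12.0],
})
locations = pd.DataFrame({"SensorID": ["DEXX001", "DEXX002"], "countyid": [11111, 22222]})

# Wide -> long: one row per (time, sensor column).
long = wide.melt(id_vars="observation_period", var_name="column", value_name="concentration")

# Split the column name into sensor ID and pollutant (assumed naming pattern).
parts = long["column"].str.split("_", expand=True)
long["SensorID"] = parts[0]
long["pollutant"] = parts[1]

# Attach the county ID and aggregate per county and pollutant.
long = long.merge(locations, on="SensorID", how="left")
county_mean = long.groupby(["countyid", "pollutant"])["concentration"].mean()
print(county_mean)
```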