
# Advanced Data Science Capstone

## Correlation of air pollution and groups of diseases in Germany  

### Data sources
As data sources I use data sets officially published by the Geschäfts- und Koordinierungsstelle GovData; its search engine is available at  
https://www.govdata.de/web/guest/suchen .
The relevant data sets are listed below:

Air quality:

Data stream E1a comprises measured (link to data stream D) individual values of gaseous pollutants (e.g. ozone, nitrogen dioxide, sulphur dioxide, carbon monoxide), of particulate pollutants (e.g. fine particulate matter, soot, total suspended particulates) and of particulate constituents (e.g. heavy metals, PAH in PM10, PM2.5, TSP), as well as total deposition (BULK), wet deposition and meteorological variables (e.g. temperature, wind speed, air pressure) that are subject to a data reporting obligation. The report also covers the data quality objectives (measurement uncertainty, minimum time coverage met yes/no, minimum data capture met yes/no) and information on concentration values attributable to natural sources and to the application of road sand and salt (concentration values without any correction deductions).

 - Luftqualitätsdaten (Datenstrom E1a) - Validierte Einzelwerte 2018 (Datensatz).
   - Metadata: https://www.govdata.de/ckan/dataset/cdadb71f-5571-4c95-af2a-d926efb4e3a0.rdf .
   - ATOM: Luftqualitätsdaten (INSPIRE Download/Atom Feed): https://datahub.uba.de/server/rest/directories/arcgisforinspire/INSPIRE/aqd_MapServer/Service_5e7bb800-c1db-4343-ac3e-a63204b0f6b7.atom.xml (N/A)
   - Data with geotagging: https://datahub.uba.de/server/rest/directories/arcgisforinspire/INSPIRE/aqd_MapServer/Daten/AQD_DE_E1a_2018.zip
   - XML metadata: Luftqualitätsdaten (Datenstrom E1a) - Validierte Einzelwerte 2018 (Datensatz): http://www.geoportal.de/gds/xml.php?uuid=cdadb71f-5571-4c95-af2a-d926efb4e3a0
   
 - Disease information
   - by regions?...
   - find integral?..

### Initial Data Exploration
 - Load the data set and check its structure, size and data quality (non-informative entries, data variability)
 - If the data set is big, draw a sample subset
 - Make exploratory plots
 - Possibly apply clustering (e.g. hierarchical or k-means) to find patterns
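
The first two steps of this list could be sketched with the standard library alone. This is a minimal illustration, assuming the observations have already been parsed into a list of dictionaries; the helper names `summarize` and `sample_subset` and the field name used in the usage line are hypothetical and not part of the data set.

```python
import random
import statistics


def summarize(records, field):
    """Report size, missing entries and variability for one numeric field."""
    values = [r.get(field) for r in records]
    present = [v for v in values if v is not None]
    return {
        "rows": len(records),
        "missing": len(values) - len(present),
        "mean": statistics.fmean(present) if present else None,
        # The sample standard deviation needs at least two values
        "stdev": statistics.stdev(present) if len(present) > 1 else None,
    }


def sample_subset(records, max_rows, seed=0):
    """Return all records if the set is small, else a reproducible random sample."""
    if len(records) <= max_rows:
        return list(records)
    return random.Random(seed).sample(records, max_rows)


# Hypothetical usage once the XML files have been parsed into `records`:
# print(summarize(records, "value"))
# subset = sample_subset(records, 10_000)
```

A fixed seed keeps the subset stable between notebook runs, so plots made on the sample stay comparable.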
 
### File naming conventions:

  *project_name*.**data_exp**.*technology.version.extension*  
  *project_name*.**etl**.*technology.version.extension*    
  *project_name*.**feature_eng**.*technology.version.extension*  
  *project_name*.**model_def**.*technology.version.extension*  
  *project_name*.**model_train**.*technology.version.extension*   
  *project_name*.**model_evaluate**.*technology.version.extension*   
  *project_name*.**model_deployment**.*technology.version.extension*   
    
Raw data files are placed in ./*project_name*.**rawData/**    
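
The naming scheme above can be made mechanical with a small helper. This is only a sketch: the function names `artifact_name` and `raw_data_dir`, and the example values `python`, `v1` and `ipynb`, are assumptions chosen for illustration.

```python
# Stage names taken from the naming convention above
STAGES = (
    "data_exp", "etl", "feature_eng", "model_def",
    "model_train", "model_evaluate", "model_deployment",
)


def artifact_name(project_name, stage, technology, version, extension):
    """Build project_name.stage.technology.version.extension."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return ".".join([project_name, stage, technology, version, extension])


def raw_data_dir(project_name):
    """Return the directory that holds the raw data files."""
    return f"./{project_name}.rawData/"


# Hypothetical example values:
print(artifact_name("Capstone", "etl", "python", "v1", "ipynb"))
print(raw_data_dir("Capstone"))
```

Because the fields are separated by dots, a version string that itself contains dots would make the name ambiguous to split again.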

In [84]:
###import rdfpandas as pd
#!pip install rdflib
#!mkdir Capstone.rawData

import urllib.request
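
A download step that can be re-run safely might look like the following sketch. The helper name `download_if_missing` is an assumption; the URL and target path in the commented usage line are the ones used elsewhere in this notebook.

```python
import os
import urllib.request


def download_if_missing(url, target_path):
    """Download url to target_path unless the file already exists; return the path."""
    target_dir = os.path.dirname(target_path)
    if target_dir:
        # Create e.g. Capstone.rawData/ on first use
        os.makedirs(target_dir, exist_ok=True)
    if not os.path.exists(target_path):
        urllib.request.urlretrieve(url, target_path)
    return target_path


# Usage (left commented out so that running the cell does not hit the network):
# download_if_missing(
#     "http://www.geoportal.de/gds/xml.php?uuid=cdadb71f-5571-4c95-af2a-d926efb4e3a0",
#     "Capstone.rawData/E1a2018_meta_xml.xml",
# )
```

Skipping existing files avoids re-downloading the large archive every time the notebook is executed from the top.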

In [88]:
import re, collections
from lxml import etree

def PrintXML(XMLfileName):
    """Print the distinct element paths of an XML file as an indented tree with attribute names."""
    xml_root = etree.parse(XMLfileName).getroot()
    raw_tree = etree.ElementTree(xml_root)
    nice_tree = collections.OrderedDict()

    for tag in xml_root.iter():
        # Drop positional indices such as [3] so that repeated siblings collapse into one path
        path = re.sub(r'\[[0-9]+\]', '', raw_tree.getpath(tag))
        if path not in nice_tree:
            nice_tree[path] = []
        # Record each attribute name once per path
        nice_tree[path].extend(attrib for attrib in tag.keys() if attrib not in nice_tree[path])

    for path, attribs in nice_tree.items():
        # The nesting depth is the number of '/' separators minus one
        indent = path.count('/') - 1
        print('{0}{1}: {2} [{3}]'.format('    ' * indent, indent, path.split('/')[-1], ', '.join(attribs) if len(attribs) > 0 else '-'))

In [89]:
PrintXML("Capstone.rawData/E1a2018_meta_xml.xml")



#### The metadata XML also contains information on the relations between datasets, models and organizations; it is not very informative.
#import rdflib
#from rdflib import Graph
#from rdflib.extras.external_graph_libs import rdflib_to_networkx_multidigraph
#import networkx as nx
#import matplotlib.pyplot as plt
#import xml.etree.ElementTree as ET
#import xml.dom.minidom
##urllib.request.urlretrieve("http://www.geoportal.de/gds/xml.php?uuid=cdadb71f-5571-4c95-af2a-d926efb4e3a0", "Capstone.rawData/E1a2018_meta_xml.xml")
#with open('Capstone.rawData/E1a2018_meta_xml.xml', encoding='utf-8') as xmldata:
#    dom = xml.dom.minidom.parseString(xmldata.read())  # separate name, so the xml module is not shadowed
#    xml_pretty_str = dom.toprettyxml()
#print(xml_pretty_str)

    
#### The metadata graph shows the relations between datasets, models and organizations; it is not very informative.
#E1a2018_meta_graph = Graph()
#E1a2018_meta_graph.parse("https://www.govdata.de/ckan/dataset/cdadb71f-5571-4c95-af2a-d926efb4e3a0.rdf")
#E1a2018_meta_network = rdflib_to_networkx_multidigraph(E1a2018_meta_graph)
## Plot Networkx instance of RDF Graph
#plt.figure(figsize=(36,18)) 
#pos = nx.spring_layout(E1a2018_meta_network, scale=10)
#edge_labels = nx.get_edge_attributes(E1a2018_meta_network, 'r')
#nx.draw_networkx_edge_labels(E1a2018_meta_network, pos, edge_labels=edge_labels)
#nx.draw(E1a2018_meta_network, with_labels=True)
#plt.show()


0: gmd:MD_Metadata [{http://www.w3.org/2001/XMLSchema-instance}schemaLocation]
    1: gmd:fileIdentifier [-]
        2: gco:CharacterString [-]
    1: gmd:language [-]
        2: * [codeList, codeListValue]
    1: gmd:characterSet [-]
        2: * [codeList, codeListValue]
    1: gmd:hierarchyLevel [-]
        2: * [codeList, codeListValue]
    1: gmd:hierarchyLevelName [-]
        2: gco:CharacterString [-]
    1: gmd:contact [-]
        2: gmd:CI_ResponsibleParty [-]
            3: gmd:individualName [-]
                4: gco:CharacterString [-]
            3: gmd:organisationName [-]
                4: gco:CharacterString [-]
            3: gmd:positionName [-]
                4: gco:CharacterString [-]
            3: gmd:contactInfo [-]
                4: gmd:CI_Contact [-]
                    5: gmd:phone [-]
                        6: gmd:CI_Telephone [-]
                            7: gmd:voice [-]
                                8: gco:CharacterString [-]
                     

In [79]:
# import the dataset itself
#!ls -l Capstone.rawData/
#urllib.request.urlretrieve("https://datahub.uba.de/server/rest/directories/arcgisforinspire/INSPIRE/aqd_MapServer/Daten/AQD_DE_E1a_2018.zip", "Capstone.rawData/AQD_DE_E1a_2018.zip")
#!mv Capstone.rawData/AQD_DE_E1a_2018.zip Capstone.rawData/AQD_DE_E1a_2018/
#!unzip Capstone.rawData/AQD_DE_E1a_2018/AQD_DE_E1a_2018.zip
#!rm Capstone.rawData/AQD_DE_E1a_2018/AQD_DE_E1a_2018.zip
#!mv ~/E1a Capstone.rawData/AQD_DE_E1a_2018/
#!mv Capstone.rawData/AQD_DE_E1a_2018/E1a/* Capstone.rawData/AQD_DE_E1a_2018/
#!rm -rf Capstone.rawData/AQD_DE_E1a_2018/E1a
!ls -la Capstone.rawData/AQD_DE_E1a_2018

total 1124732
drwxr-xr-x 2 spark spark    12288 Jul  4 11:05 .
drwxr-xr-x 3 spark spark     4096 Jul  4 11:00 ..
-rw-r--r-- 1 spark spark  2737854 Feb  4 15:11 DE_BB_2018_CO_hour.xml
-rw-r--r-- 1 spark spark 13254333 Feb  4 15:11 DE_BB_2018_NO2_hour.xml
-rw-r--r-- 1 spark spark 13169219 Feb  4 15:11 DE_BB_2018_NO_hour.xml
-rw-r--r-- 1 spark spark 13290213 Feb  4 15:11 DE_BB_2018_NOx_hour.xml
-rw-r--r-- 1 spark spark  8897254 Feb  4 15:11 DE_BB_2018_O3_hour.xml
-rw-r--r-- 1 spark spark 13292670 Feb  4 15:11 DE_BB_2018_PM1_hour.xml
-rw-r--r-- 1 spark spark 13250141 Feb  4 15:12 DE_BB_2018_PM2_hour.xml
-rw-r--r-- 1 spark spark  2738411 Feb  4 15:12 DE_BB_2018_SO2_hour.xml
-rw-r--r-- 1 spark spark  1654796 Feb  4 15:12 DE_BE_2018_CHB_hour.xml
-rw-r--r-- 1 spark spark  1104344 Feb  4 15:12 DE_BE_2018_CO_hour.xml
-rw-r--r-- 1 spark spark  8917510 Feb  4 15:12 DE_BE_2018_NO2_hour.xml
-rw-r--r-- 1 spark spark  8799792 Feb  4 15:12 DE_BE_2018_NO_hour.xml
-rw-r--r-- 1 spark spark 
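
The file names in this listing appear to follow the pattern `DE_<region>_<year>_<component>_<resolution>.xml`. The sketch below splits a name into those parts and groups files by component. The pattern is inferred only from the names visible above, and the field names (`state`, `component`, `resolution`) are assumptions.

```python
import re
from collections import defaultdict

# Pattern inferred from names such as DE_BB_2018_NOx_hour.xml (an assumption)
FILE_PATTERN = re.compile(
    r"^DE_(?P<state>[A-Z]{2})_(?P<year>\d{4})_"
    r"(?P<component>[A-Za-z0-9]+)_(?P<resolution>[a-z]+)\.xml$"
)


def parse_file_name(name):
    """Return the parts of a data file name, or None if it does not match."""
    match = FILE_PATTERN.match(name)
    return match.groupdict() if match else None


def group_by_component(names):
    """Map each component to the list of region codes that provide it."""
    groups = defaultdict(list)
    for name in names:
        info = parse_file_name(name)
        if info is not None:
            groups[info["component"]].append(info["state"])
    return dict(groups)


print(parse_file_name("DE_BB_2018_NOx_hour.xml"))
```

Grouping by component makes it easy to see which pollutants are reported by every region before any of the large XML files are opened.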

In [90]:

PrintXML("Capstone.rawData/AQD_DE_E1a_2018/DE_BB_2018_NOx_hour.xml")


0: gml:FeatureCollection [{http://www.w3.org/2001/XMLSchema-instance}schemaLocation, {http://www.opengis.net/gml/3.2}id]
    1: gml:featureMember [-]
        2: om:OM_Observation [{http://www.opengis.net/gml/3.2}id]
            3: om:phenomenonTime [-]
                4: gml:TimePeriod [{http://www.opengis.net/gml/3.2}id]
                    5: gml:beginPosition [-]
                    5: gml:endPosition [-]
            3: om:resultTime [-]
                4: gml:TimeInstant [{http://www.opengis.net/gml/3.2}id]
                    5: gml:timePosition [-]
            3: om:procedure [{http://www.w3.org/1999/xlink}href]
            3: om:parameter [-]
                4: om:NamedValue [-]
                    5: om:name [{http://www.w3.org/1999/xlink}href]
                    5: om:value [{http://www.w3.org/1999/xlink}href, {http://www.w3.org/2001/XMLSchema-instance}type]
            3: om:observedProperty [{http://www.w3.org/1999/xlink}href]
            3: om:featureOfInterest [{http://ww
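
Based on the part of the structure printed above, the observation period and the observed property could be extracted as in the following sketch. It uses the standard library's `xml.etree.ElementTree` instead of `lxml` and matches elements by local name, so no namespace URI beyond those shown in the output has to be assumed. The function name `iter_observations` and the keys of the returned dictionaries are hypothetical, and elements below the truncated part of the listing are not touched.

```python
import xml.etree.ElementTree as ET

# Namespaces as they appear in the PrintXML output above
GML = "{http://www.opengis.net/gml/3.2}"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def local_name(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def iter_observations(source):
    """Yield one dict per OM_Observation element; source is a path or a file object."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "OM_Observation":
            continue
        record = {
            "id": elem.get(GML + "id"),
            "begin": None,
            "end": None,
            "observed_property": None,
        }
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "beginPosition":
                record["begin"] = child.text
            elif name == "endPosition":
                record["end"] = child.text
            elif name == "observedProperty":
                record["observed_property"] = child.get(XLINK_HREF)
        yield record
        # Release the children of the processed observation to limit memory use
        elem.clear()


# Usage with one of the downloaded files:
# for obs in iter_observations("Capstone.rawData/AQD_DE_E1a_2018/DE_BB_2018_NOx_hour.xml"):
#     print(obs)
```

Streaming with `iterparse` avoids loading a whole file of roughly 13 MB into one tree, which matters when all files of the archive are processed in a loop.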

In [78]:
!ls -la /home/spark/shared/E1a/

ls: cannot access '/home/spark/shared/E1a/': No such file or directory
