
# Advanced Data Science Capstone

## Air pollution and prevalence of bronchial asthma in Germany  

## ETL, Data cleansing

### The deliverables
The deliverables of the current stage:

 - the current notebook as the process documentation
 - a Spark data frame of the "wide" type, containing time series of pollutant concentrations for every available sensor
 - a Spark data frame of the "long" type, containing time series of pollutant concentrations, a county id and a pollutant label
 - a Spark data frame with a disease prevalence column (bronchial asthma) and a county id
 
### ETL
 #### Data Sources
  -  The officially published data sets by **Geschäfts- und Koordinierungsstelle GovData**, the search engine is available at https://www.govdata.de/web/guest/suchen.
  - Data stream **E1a** contains measured values (linked to data stream **D**) of gas-phase pollutants (e.g. ozone, NO2, SO2, CO), particle pollutants (e.g. dust) and dust constituents (e.g. heavy metals, PAH in PM10, PM2.5, TSP), as well as total deposition (BULK), wet deposition and meteorological data (e.g. temperature, wind, pressure) for every measurement location.
  - The data for the years 2013 - 2018 are currently available. For this project I limit myself to the 2016 data (because of the limited availability of health-related data sets); however, the method and the model are easily extended to the data for other years.
  - Compressed dataset is available at https://datahub.uba.de/server/rest/directories/arcgisforinspire/INSPIRE/aqd_MapServer/Daten/AQD_DE_E1a_2016.zip .
 #### Data cleansing
  - The air quality data sets are claimed to be "validated", so most work for cleansing the data is already done.
  - The incomplete files from the datasets (not having "hour" in the name) are ignored.
  - Sparse missing values (below 10%), which appear in the time series as negative pollutant concentrations, will be imputed.
  - Sensors with heavily corrupted data (above 10% of measurements) will be dropped.
 
 #### Enterprise data storage
  - Saving Spark data frames to the Cloud Object Storage (COS) in the Parquet format.

In [1]:
import urllib.request
import xml.etree.ElementTree as ET
from lxml import etree
import pandas as pd
import numpy as np

import re, collections
from io import StringIO
import os, fnmatch
#, fastparquet

import matplotlib.pyplot as plt

def SelectAllXMLsensorID():
    varFull = [s for s in AllTags if 'value' in s][0]
    return([re.sub(r'[^a-zA-Z0-9:]*\'{http(.*)$', r'', re.sub(r'^.*AQD\/SPO.DE_', r'', str(varr.attrib))) for varr in Eroot.iter(varFull) if 'AQD' in str(varr.attrib)]) 



Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20190827152924-0001
KERNEL_ID = b1fc56da-8f75-4007-a7f2-1952aea375c7


Now the files with pollutant concentration time series for the given year will be loaded into the **dffAll** Pandas data frame in the **wide** format. During loading, the **consistency** of **file** and **column** names is checked.

First, all the necessary files are downloaded from the web:

In [2]:
!rm -rf ./Capstone.rawData
## Download and decompress the dataset itself:
!mkdir Capstone.rawData
#!ls -l Capstone.rawData/

##### Pollution 2016
!mkdir Capstone.rawData/AQD_DE_E1a_2016
urllib.request.urlretrieve("https://datahub.uba.de/server/rest/directories/arcgisforinspire/INSPIRE/aqd_MapServer/Daten/AQD_DE_E1a_2016.zip", "Capstone.rawData/AQD_DE_E1a_2016.zip")
!mv Capstone.rawData/AQD_DE_E1a_2016.zip Capstone.rawData/AQD_DE_E1a_2016/
!unzip Capstone.rawData/AQD_DE_E1a_2016/AQD_DE_E1a_2016.zip -d Capstone.rawData/
!rm Capstone.rawData/AQD_DE_E1a_2016/AQD_DE_E1a_2016.zip

##### Sensor locations 2016
urllib.request.urlretrieve("https://datahub.uba.de/server/rest/directories/arcgisforinspire/INSPIRE/aqd_MapServer/Daten/AQD_DE_D_2016.zip", "Capstone.rawData/AQD_DE_D_2016.zip")
!unzip Capstone.rawData/AQD_DE_D_2016.zip -d Capstone.rawData/
!rm Capstone.rawData/AQD_DE_D_2016.zip

##### Prevalence of Asthma bronchiale 2016 
!mkdir Capstone.rawData/Asthma_2016
urllib.request.urlretrieve("https://www.versorgungsatlas.de/fileadmin/excel/data_id_92_kreis11_1_j_1451606400.xlsx", "Capstone.rawData/Asthma_2016/data_id_92_kreis11_1_j_1451606400.xlsx")

##### Town-county dataset:
urllib.request.urlretrieve("https://www.destatis.de/DE/Themen/Laender-Regionen/Regionales/Gemeindeverzeichnis/Administrativ/Archiv/GV100ADQ/GV100AD3107.zip?__blob=publicationFile",
                           "Capstone.rawData/GV100AD3107.zip")
!mkdir Capstone.rawData/GV100AD3107
!unzip Capstone.rawData/GV100AD3107.zip -d Capstone.rawData/GV100AD3107/
!rm Capstone.rawData/GV100AD3107.zip

Archive:  Capstone.rawData/AQD_DE_E1a_2016/AQD_DE_E1a_2016.zip
  inflating: Capstone.rawData/AQD_DE_E1a_2016/DE_SH_2016_NO2_hour.xml  
  inflating: Capstone.rawData/AQD_DE_E1a_2016/DE_SH_2016_NOx_hour.xml  
  inflating: Capstone.rawData/AQD_DE_E1a_2016/DE_SH_2016_NO_hour.xml  
  inflating: Capstone.rawData/AQD_DE_E1a_2016/DE_SH_2016_O3_hour.xml  
  inflating: Capstone.rawData/AQD_DE_E1a_2016/DE_SH_2016_PM1_day.xml  
  inflating: Capstone.rawData/AQD_DE_E1a_2016/DE_SH_2016_PM1_hour.xml  
  inflating: Capstone.rawData/AQD_DE_E1a_2016/DE_SH_2016_PM2_day.xml  
  inflating: Capstone.rawData/AQD_DE_E1a_2016/DE_SH_2016_PM2_hour.xml  
  inflating: Capstone.rawData/AQD_DE_E1a_2016/DE_SH_2016_SO2_hour.xml  
  inflating: Capstone.rawData/AQD_DE_E1a_2016/DE_SL_2016_CO_hour.xml  
  inflating: Capstone.rawData/AQD_DE_E1a_2016/DE_SL_2016_NO2_hour.xml  
  inflating: Capstone.rawData/AQD_DE_E1a_2016/DE_SL_2016_NO_hour.xml  
  inflating: Capstone.rawData/AQD_DE_E1a_2016/DE_SL_2016_O3_hour.xml  
  inflat

Then the pollutant concentration *xml* files are parsed and the hourly averaged pollutant concentrations are stored in the **dffAll** Pandas DataFrame:
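The parsing step can be illustrated on a tiny, made-up document. The sketch below assumes only what the notebook's code relies on: element tags carry a namespace in braces (hence the substring search for `values`), records are separated by `@@`, fields by commas, and the concentration sits in column 4. The namespace URI, the root element and the two sample records are hypothetical, and the standard `csv` module stands in for `pd.read_csv` to keep the example dependency-free.

```python
import csv
import xml.etree.ElementTree as ET
from io import StringIO

# Hypothetical miniature of an E1a file: namespace URI and records are invented.
SAMPLE_XML = """<root xmlns:swe="http://example.org/swe">
  <swe:values>2016-01-01T00:00,2016-01-01T01:00,1,1,35.767@@2016-01-01T01:00,2016-01-01T02:00,1,1,34.95</swe:values>
</root>"""

root = ET.fromstring(SAMPLE_XML)

# ElementTree expands prefixes to '{uri}name', so the full tag is found by substring.
all_tags = [elem.tag for elem in root.iter()]
values_tag = [t for t in all_tags if 'values' in t][0]

concentrations = []
for node in root.iter(values_tag):
    # '@@' separates records; after the replacement the text is ordinary CSV.
    rows = csv.reader(StringIO(node.text.replace("@@", "\n")))
    concentrations.extend(float(row[4]) for row in rows)

print(values_tag)        # {http://example.org/swe}values
print(concentrations)    # [35.767, 34.95]
```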

In [3]:
AirE1aDir='Capstone.rawData/AQD_DE_E1a_2016/'

#!ls Capstone.rawData/AQD_DE_E1a_2016/*hour*
FilesHour=[]

for file in os.listdir(AirE1aDir):
    if fnmatch.fnmatch(file, '*hour*'):
        FilesHour.append(file)
print("Number of files in the dataset", len(FilesHour))

# shortening the process for debugging purposes
#FilesHour=FilesHour[0:3]        

# 2016 is a leap year: 366 days * 24 hours = 8784 (a non-leap year has 8760)
NumHoursInYear=8784

dffAll=pd.DataFrame(index=range(0,NumHoursInYear))  

# add First column with Observation Times:
dff=[]  # Temporary list for DataFrames

file=FilesHour[0]
Etree = ET.parse(AirE1aDir+file)
Eroot = Etree.getroot()
Eroot.tag
Eroot.attrib
AllTags = [elem.tag for elem in Eroot.iter()]
varFull = [s for s in AllTags if 'values' in s][0]
for varr in Eroot.iter(varFull):
    dff.append(pd.read_csv(StringIO((varr.text).replace("@@","\n")), sep=",", header=None))
dffAll=pd.concat([dffAll, dff[0][[0]]], axis=1)
dffAll.columns=['observation_period']


# get all tags in xml file; Note, that the actual data is kept as a TEXT of *values* tags 
for file in FilesHour:
    Etree = ET.parse(AirE1aDir+file)
    Eroot = Etree.getroot()
    Eroot.tag
    Eroot.attrib
    AllTags = [elem.tag for elem in Eroot.iter()]
    
    ColNamesExp=SelectAllXMLsensorID()
# Compare column names with file names; they should encode the same country, state and pollutant
    for ColName in ColNamesExp:
        if ((ColName[0:2]!=file[0:2]) or (ColName[2:4]!=file[3:5]) or (ColName[8:11]!=file[11:14])):
            # raise instead of exit(), which would shut down the notebook kernel
            raise ValueError("Inconsistency in file and column names: " + file + " " + ColName)
    
    varFull = [s for s in AllTags if 'values' in s][0]
    
    dff=[] # Temporary list for DataFrames
# reading actual pollutant data from the text field:
    for varr in Eroot.iter(varFull):
        dff.append(pd.read_csv(StringIO((varr.text).replace("@@","\n")), sep=",", header=None))

# checking that measurement timestamps are identical in the files read
    for s in range(0,len(dff)):
        if not (dffAll['observation_period']).equals(dff[s][0]):
            # raise instead of exit(), which would shut down the notebook kernel
            raise ValueError("Inconsistency of observation times in the following files: " + file + " " + FilesHour[0])

        
# select column 4 - pollutant concentration:
    dff=pd.concat([dff[s][4] for s in range(0,len(dff))], axis=1)
    dff.columns=ColNamesExp
   
    dffAll=pd.concat([dffAll, dff], axis=1)

Number of files in the dataset 51
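The slice positions used in the consistency check are easier to follow on a concrete pair of names. The sketch below applies the same three comparisons to a file name and a column name that both appear in this notebook's outputs; the helper name `names_consistent` is introduced here only for illustration.

```python
def names_consistent(file_name: str, col_name: str) -> bool:
    """Return True if the country, state and pollutant codes agree."""
    return (col_name[0:2] == file_name[0:2]          # country, e.g. 'DE'
            and col_name[2:4] == file_name[3:5]      # state, e.g. 'SH'
            and col_name[8:11] == file_name[11:14])  # pollutant, e.g. 'NO2'

print(names_consistent("DE_SH_2016_NO2_hour.xml", "DESH008_NO2_dataGroup1"))  # True
print(names_consistent("DE_SH_2016_O3_hour.xml", "DESH008_NO2_dataGroup1"))   # False
```

The pollutant slice is three characters wide, so a two-character code such as `O3` is compared together with its trailing underscore (`O3_`), which is present in both naming schemes.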


Now check the data set size and print a summary:

In [4]:
print("Memory usage: ", (dffAll.memory_usage(index=True).sum()/1048576.0), " MB")
dffAll.describe()

Memory usage:  35.25080871582031  MB


Unnamed: 0,DESH008_NO2_dataGroup1,DESH022_NO2_dataGroup1,DESH023_NO2_dataGroup1,DESH025_NO2_dataGroup1,DESH027_NO2_dataGroup1,DESH028_NO2_dataGroup1,DESH030_NO2_dataGroup1,DESH033_NO2_dataGroup1,DESH035_NO2_dataGroup1,DESH052_NO2_dataGroup1,...,DEUB029_PM1_dataGroup1,DEUB030_PM1_dataGroup1,DEUB005_PM2_dataGroup1,DEUB001_SO2_dataGroup1,DEUB004_SO2_dataGroup1,DEUB005_SO2_dataGroup1,DEUB028_SO2_dataGroup1,DEUB029_SO2_dataGroup1,DEUB030_SO2_dataGroup1,DEUB046_SO2_dataGroup1
count,8784.0,8784.0,8784.0,8784.0,8784.0,8784.0,8784.0,8784.0,8784.0,8784.0,...,8784.0,8784.0,8784.0,8784.0,8784.0,8784.0,8784.0,8784.0,8784.0,8784.0
mean,7.517775,32.406144,12.276538,34.838776,38.864838,32.474424,39.616351,15.206867,21.245121,63.749169,...,-87.290716,-25.283687,-28.934937,-55.997823,-64.135568,-49.471064,-118.446669,-51.785409,-57.245443,-41.084183
std,50.050559,44.066237,48.330659,51.156795,65.114026,30.816056,71.091519,50.981938,51.424199,55.293691,...,298.946758,192.00998,191.831127,230.456606,245.82016,218.321263,323.767133,222.940043,234.069962,201.453804
min,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,...,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0,-999.0
25%,4.7205,19.265,6.8685,23.44575,23.618,18.3965,24.093,9.33775,14.42175,33.47875,...,2.33,5.78,3.96,0.19,0.34,0.28,0.25,0.26,0.5,0.9275
50%,7.3145,30.6725,11.3645,36.1525,38.806,30.1055,39.9915,14.422,21.076,57.6295,...,7.29,9.875,6.22,0.31,0.42,0.39,0.37,0.44,0.63,1.17
75%,12.12125,46.19025,18.5795,48.97525,57.58,44.63125,58.782,22.207,30.2825,91.86175,...,14.22,15.4825,10.7,0.36,0.54,0.67,0.54,0.69,0.85,1.46
max,68.899,126.621,83.155,113.51,152.249,137.767,160.504,83.625,80.934,228.703,...,79.21,88.69,95.22,4.39,3.73,14.76,12.02,32.68,13.65,10.74


Now we have a **wide** data frame containing the time series of all pollutant concentrations for all sensors. The pollutant type and the sensor ID are encoded in the column names. The minimal concentration value *-999.0* is equivalent to *NA* and will be imputed, as will all other negative values (a concentration cannot be negative). The interpolation limit is set to 876 hours (about 10% of the year), counted from each end of a gap, so long *NA* runs are not fully imputed and leave missing values in the column. Since fewer than 2% of the columns are corrupted this heavily, they will be dropped for the sake of data quality:

In [5]:
dffAll[dffAll.loc[:, dffAll.columns != 'observation_period'] < 0.0] = np.nan
dffAll.interpolate(method='linear', inplace=True, axis=0, limit=876, limit_direction='both')
# count the columns that still contain NaN after interpolation
print('The number of corrupted columns is ', dffAll.isna().any().sum(), ' of ', len(dffAll.columns))
dffAll = dffAll.dropna(axis=1)
dffAll['observation_period']=pd.to_datetime(dffAll['observation_period'])
dffAll['observation_period']=dffAll['observation_period'].dt.to_period('H')
#dffAll['observation_period'][0].end_time
dffAll.tail(3)



The number of corrupted columns is  10  of  526




Unnamed: 0,observation_period,DESH008_NO2_dataGroup1,DESH022_NO2_dataGroup1,DESH023_NO2_dataGroup1,DESH025_NO2_dataGroup1,DESH027_NO2_dataGroup1,DESH028_NO2_dataGroup1,DESH030_NO2_dataGroup1,DESH033_NO2_dataGroup1,DESH035_NO2_dataGroup1,...,DEUB029_PM1_dataGroup1,DEUB030_PM1_dataGroup1,DEUB005_PM2_dataGroup1,DEUB001_SO2_dataGroup1,DEUB004_SO2_dataGroup1,DEUB005_SO2_dataGroup1,DEUB028_SO2_dataGroup1,DEUB029_SO2_dataGroup1,DEUB030_SO2_dataGroup1,DEUB046_SO2_dataGroup1
8781,2016-12-31 21:00,24.274,22.78,33.092,28.64,29.738,28.722,35.438,25.233,26.793,...,3.5,8.87,11.67,0.21,0.46,1.16,0.33,0.35,0.47,1.07
8782,2016-12-31 22:00,22.592,21.262,33.049,27.886,29.738,29.114,37.843,23.267,28.229,...,7.58,13.4,12.11,0.2,0.47,1.52,0.29,0.35,0.62,1.21
8783,2016-12-31 23:00,21.737,20.361,32.63,28.721,29.738,30.624,36.119,20.998,31.436,...,18.37,12.87,11.88,0.19,0.44,1.69,0.32,0.39,0.8,1.25
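How the `limit` argument interacts with `limit_direction='both'` can be seen on a toy series. This is a minimal sketch with invented values and small limits (1 and 3) in place of the notebook's 876; it assumes pandas' documented behaviour that, with `'both'`, up to `limit` consecutive missing values are filled from each end of a gap.

```python
import pandas as pd

# Toy series: -999.0 marks missing values, as in the data set.
s = pd.Series([1.0, -999.0, -999.0, -999.0, -999.0, -999.0, 7.0])
s = s.mask(s < 0.0)   # negative concentrations become NaN

# A gap of five values: limit=1 fills one value from each end only.
short_limit = s.interpolate(method='linear', limit=1, limit_direction='both')
# limit=3 reaches the middle of the gap from both ends, so nothing stays missing.
long_limit = s.interpolate(method='linear', limit=3, limit_direction='both')

print(short_limit.tolist())     # [1.0, 2.0, nan, nan, nan, 6.0, 7.0]
print(long_limit.isna().sum())  # 0
```

A column that keeps any `NaN` after this step is the kind of column that `dropna(axis=1)` removes.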


### Saving Air Pollution DataFrame to the COS
Now we can save the resulting dataset to the Cloud Object Storage for further use:


In [6]:
# The code was removed by Watson Studio for sharing.

In [7]:
from pyspark.sql import SparkSession

cos = ibmos2spark.CloudObjectStorage(sc, credentials, configuration_name, 'bluemix_cos')
spark = SparkSession.builder.getOrCreate()

dfAllSpark = spark.createDataFrame(dffAll.drop('observation_period', axis = 1))
dfAllSpark.write.parquet(cos.url('dffAll.parquet', 'capstone-donotdelete-pr-zpykcz8f0kxuad'))
#dfAllSpark = spark.read.parquet(cos.url('dffAll.parquet', 'capstone-donotdelete-pr-zpykcz8f0kxuad'))
#dfAllSpark.printSchema()
#dfAllSpark.show()

### Sensor Locations
In order to use the spatial data, one needs the coordinates of the air pollution sensors.
For the current study, the county index of every individual sensor is needed. 

All sensor IDs and the town names of the sensor locations are read into the **SensorLocation** DataFrame:

In [8]:
# pick all tags from the XML file
Etree = etree.parse("Capstone.rawData/DE_D_allInOne_metaMeasurements_2016.xml")
Eroot = Etree.getroot()
Eroot.tag
Eroot.attrib
AllTags = [elem.tag for elem in Eroot.iter()]

# get correct tag names for 'municipality', 'EUStationCode' and 'featureMember':
varMUN = [s for s in AllTags if 'municipality' in s][0]
varID  = [s for s in AllTags if 'EUStationCode' in s][0]
varFeatMem = [s for s in AllTags if 'featureMember' in s][0]

IDs=[]
MUNs=[]
# read 'municipality' and 'EUStationCode' to SensorLocation dataframe:
for varr in Eroot.iter(varFeatMem):
    for child in varr.iter(varMUN):
        MUNs.append(child.text)
        for child2 in varr.iter(varID):
            IDs.append(child2.text)
SensorLocation=pd.DataFrame({'SensorID': IDs, 'SensorTown': MUNs})
SensorLocation.tail(5)

Unnamed: 0,SensorID,SensorTown
762,DEUB005,Lüder
763,DEUB028,Zingst
764,DEUB029,Suhl
765,DEUB030,Stechlin
766,DEUB044,Garmisch-Partenkirchen


To map the town names to the counties used in the health indicator data sets, the town-county DataFrame **dfCT** is created. It contains a 5-digit county ID (not unique, but characterizing the counties in some vicinity), the town name and the county name:

In [9]:
columns = [(10, 15), (22, 71), (72, 121)]
dfCT = pd.read_fwf("Capstone.rawData/GV100AD3107/GV100AD_310719.ASC", 
                     colspecs=columns, names=['CountyID','town','county'],
                     encoding="iso8859_1")
dfCT=dfCT.ffill()  # forward-fill empty fields from the preceding record

dfCT['town'] = dfCT['town'].str.split(",").str[0]
dfCT.tail(5)

Unnamed: 0,CountyID,town,county
16116,16077,Starkenberg,Schmölln/Thür.
16117,16077,Thonhausen,Schmölln/Thür.
16118,16077,Treben,Schmölln/Thür.
16119,16077,Vollmershain,Schmölln/Thür.
16120,16077,Windischleuba,Schmölln/Thür.
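The column specification passed to `read_fwf` consists of half-open character intervals. The sketch below applies the same intervals to a single hand-built record using plain string slicing. The record layout outside the three fields and the `, Gemeinde` suffix are hypothetical and serve only to show the slicing and the comma split; the field values are taken from the table above.

```python
# Same half-open column intervals as in the read_fwf call.
COLSPECS = [(10, 15), (22, 71), (72, 121)]

# Hypothetical record: only the three fields of interest are filled in.
line = (" " * 10 + "16077" + " " * 7
        + "Treben, Gemeinde".ljust(49) + " "
        + "Schmölln/Thür.".ljust(49))

county_id, town, county = (line[start:stop].strip() for start, stop in COLSPECS)
town = town.split(",")[0]   # keep only the part before the first comma

print(county_id, town, county)   # 16077 Treben Schmölln/Thür.
```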


### Prevalence of bronchial asthma
The central data frame of the model will contain a list of counties, the prevalence of disease(s) in these counties, and a set of air-pollution-based features. 
First, the *Prevalence of bronchial asthma* dataset is loaded: 

In [10]:
xlsx_file = pd.ExcelFile("Capstone.rawData/Asthma_2016/data_id_92_kreis11_1_j_1451606400.xlsx")
print("xls sheet names: ",xlsx_file.sheet_names)
dfAsthma = xlsx_file.parse('Daten', header=3, decimal=",") 
print(dfAsthma.head(3))
print("Number of duplicates in Regions-ID column: ", dfAsthma.duplicated(['Regions-ID']).sum())

xls sheet names:  ['Hintergrundinformationen', 'Daten']
      Region  Regions-ID  KV             Kreistyp  Wert  Bundeswert
0   Eisenach       16056  TH    Ländliches Umland   8.9         5.7
1  Sonneberg       16072  TH      Ländlicher Raum   8.7         5.7
2  Ammerland        3451  NI  Verdichtetes Umland   8.5         5.7
Number of duplicates in Regions-ID column:  0


In [11]:
dfAsthma = dfAsthma.drop(['Region', 'KV', 'Kreistyp', 'Bundeswert'], axis=1)
dfAsthma.columns=['CountyID','DiseaseR']
dfAsthma.head(5)

Unnamed: 0,CountyID,DiseaseR
0,16056,8.9
1,16072,8.7
2,3451,8.5
3,16073,8.3
4,3151,8.2


Sensor positions are mapped to counties by assigning a **CountyID** to every **SensorID** in the **SensorLocation** DataFrame:


In [12]:
SensorLocation = (SensorLocation.join(dfCT[['CountyID','town']].set_index('town'),
                                      on='SensorTown')).drop_duplicates(subset=['SensorID'])

Checking the resulting table, one can see that 23 of 767 entries have an unresolved **CountyID**:

In [13]:
print("Total number of sensors: ", SensorLocation.count())
print("Number of sensors with unresolved CountyID: ", SensorLocation[SensorLocation.isna().any(axis=1)].count())
#print("List of unresolved sensors:")
#SensorLocation[SensorLocation.isna().any(axis=1)]
#print("Number of duplicates in SensorID column: ", SensorLocation.duplicated(['SensorID']).sum())
#SensorLocation.loc[SensorLocation.duplicated(['SensorID'])==True]

Total number of sensors:  SensorID      767
SensorTown    767
CountyID      744
dtype: int64
Number of sensors with unresolved CountyID:  SensorID      23
SensorTown    23
CountyID       0
dtype: int64
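Matching on town names has two consequences that a small example makes visible: a town name that occurs in several counties yields several candidate rows, and a town name missing from the table yields an unresolved entry. The sketch below uses invented sensor IDs, town names and county codes; it mirrors the join followed by `drop_duplicates`, which keeps the first match.

```python
import pandas as pd

# Hypothetical sensors and town-county table; all IDs and codes are invented.
sensors = pd.DataFrame({'SensorID': ['S1', 'S2', 'S3'],
                        'SensorTown': ['Neustadt', 'Altdorf', 'Nowhere']})
towns = pd.DataFrame({'CountyID': [11111, 22222, 33333],
                      'town': ['Neustadt', 'Neustadt', 'Altdorf']})

joined = sensors.join(towns.set_index('town'), on='SensorTown')
print(len(joined))   # 4: 'Neustadt' matches two counties, so S1 appears twice

# Keep one row per sensor; the first matching county wins.
resolved = joined.drop_duplicates(subset=['SensorID'])
print(resolved['CountyID'].tolist())   # [11111.0, 33333.0, nan]
```

Which county is kept for an ambiguous town depends on the row order of the town-county table, so such matches may deserve a manual check.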


At the moment it is easier to drop the data of these 3% of sensors. Alternatively, the table could be corrected manually, since it has a reasonable size and its contents (sensor labels/county codes) hardly change over time. 

In [14]:
SensorLocation=SensorLocation.dropna()
SensorLocation=SensorLocation.astype({'CountyID':int})
SensorLocation.head(5)

Unnamed: 0,SensorID,SensorTown,CountyID
0,DEBB007,Elsterwerda,12062
1,DEBB021,Potsdam,12054
2,DEBB026,Spremberg,12071
3,DEBB029,Schwedt/Oder,12073
4,DEBB032,Eisenhüttenstadt,12067


### Saving Bronchial Asthma Prevalence DataFrames to COS
Now we can save the resulting data set for further use:

In [15]:
AsthmaSpark = spark.createDataFrame(dfAsthma)
AsthmaSpark.write.parquet(cos.url('Asthma.parquet', 'capstone-donotdelete-pr-zpykcz8f0kxuad'))

### Constructing "Long" DataFrame
For further use with the SparkSQL tools for feature generation, it is worth creating a "long" Spark DataFrame,
containing three columns (pollutant concentration, pollutant name and the county id) and many rows, one for each hourly averaged measurement.

First the initial **dffAll** DataFrame is converted to the "long" shape:

In [16]:
dffAllLong = pd.melt(dffAll, id_vars=['observation_period'], var_name='SensorPollID', value_name='PollutantConc')
dffAllLong.head()

Unnamed: 0,observation_period,SensorPollID,PollutantConc
0,2016-01-01 00:00,DESH008_NO2_dataGroup1,35.767
1,2016-01-01 01:00,DESH008_NO2_dataGroup1,34.95
2,2016-01-01 02:00,DESH008_NO2_dataGroup1,34.773
3,2016-01-01 03:00,DESH008_NO2_dataGroup1,36.938
4,2016-01-01 04:00,DESH008_NO2_dataGroup1,36.09


Then a dictionary translating *all* sensor IDs to county IDs is created:

In [17]:
import gc
del dffAll
del dff
gc.collect()

SensorCountyDict = dict(zip(SensorLocation.SensorID, SensorLocation.CountyID))

In [18]:
#dffAllLong['CountyID'] = dffAllLong.apply(lambda row: re.search('(^.{7})', row['SensorPollID']).group(1), axis=1)
ColumnCountyID = pd.DataFrame()
ColumnSensorID = pd.DataFrame()
ColumnSensorID['SensorID'] = dffAllLong.apply(lambda row: re.search('(^.{7})', row['SensorPollID']).group(1), axis=1)
#print("Memory usage: ", (ColumnSensorID.memory_usage(index=True).sum()/1048576.0), " MB")

In [19]:
#ColumnCountyID.replace({'CountyID': SensorCountyDict})
ColumnCountyID['CountyID'] = ColumnSensorID['SensorID'].map(SensorCountyDict)
#.dropna().astype('int64')
#dffAllLong.replace({'CountyID': SensorCountyDict})
#ColumnCountyID.head()

In [20]:
print("Total number of sensors is ", (ColumnCountyID.count()/NumHoursInYear)," , ", (ColumnCountyID.isna().sum()/NumHoursInYear), " of them have unresolved CountyID")

Total number of sensors is  CountyID    507.0
dtype: float64  ,  CountyID    8.0
dtype: float64  of them have unresolved CountyID
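The dictionary lookup and the notion of an unresolved sensor can be sketched without pandas. The two sensor-county pairs are taken from the **SensorLocation** output above; `DEXX999` is a hypothetical sensor ID with no dictionary entry, and `None` plays the role that `NaN` plays in the pandas `map` call.

```python
# Sensor-to-county pairs taken from the SensorLocation output.
sensor_county = {'DEBB007': 12062, 'DEBB021': 12054}

# 'DEXX999' is a hypothetical sensor ID that is absent from the dictionary.
sensor_ids = ['DEBB007', 'DEBB021', 'DEXX999']

county_ids = [sensor_county.get(sid) for sid in sensor_ids]   # None marks a failed lookup
unresolved = [sid for sid, cid in zip(sensor_ids, county_ids) if cid is None]

print(county_ids)   # [12062, 12054, None]
print(unresolved)   # ['DEXX999']
```

In the long data frame every sensor column contributes one row per hour, which is why the counts in the cell above are divided by `NumHoursInYear`.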


Now the **SensorPollID** column of the melted DataFrame is resolved into a *string* **Pollutant** column and an *integer* **CountyID** column; the *double* **PollutantConc** column is kept:

In [21]:
dffAllLong['Pollutant'] = dffAllLong.apply(lambda row: re.search('^.{8}(.*)_', row['SensorPollID']).group(1), axis=1)
dffAllLong['CountyID'] = ColumnCountyID['CountyID']
dffAllLong = dffAllLong.dropna().drop(['observation_period','SensorPollID'], axis=1)
#dffAllLong.iloc[888555]
dffAllLong['CountyID'] = dffAllLong['CountyID'].astype('int64')
dffAllLong.tail()
#print("Memory usage: ", (dffAllLong.memory_usage(index=True).sum()/1048576.0), " MB")
#SensorLocation.loc[SensorLocation['SensorID']=='DESL002']

Unnamed: 0,PollutantConc,Pollutant,CountyID
4514971,0.5,SO2,12065
4514972,0.43,SO2,12065
4514973,0.47,SO2,12065
4514974,0.62,SO2,12065
4514975,0.8,SO2,12065
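The label convention behind the **SensorPollID** column can be checked on two column names that appear in the outputs above. The sketch below assumes the pattern seen there: a 7-character sensor ID, an underscore, the pollutant code, and a trailing `_dataGroup1`. The helper name `split_sensor_poll_id` is introduced only for illustration.

```python
import re

def split_sensor_poll_id(label: str):
    """Split e.g. 'DESH008_NO2_dataGroup1' into sensor ID and pollutant."""
    sensor_id = re.search(r'(^.{7})', label).group(1)      # first 7 characters
    pollutant = re.search(r'^.{8}(.*)_', label).group(1)   # from character 9 up to the last '_'
    return sensor_id, pollutant

print(split_sensor_poll_id('DESH008_NO2_dataGroup1'))   # ('DESH008', 'NO2')
print(split_sensor_poll_id('DEUB005_PM2_dataGroup1'))   # ('DEUB005', 'PM2')
```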


### Saving "Long" DataFrame to the COS
Now we can save the resulting dataset to the Cloud Object Storage for further use:

In [22]:
dffAllLongSpark = spark.createDataFrame(dffAllLong)
dffAllLongSpark.write.parquet(cos.url('dffAllLong.parquet', 'capstone-donotdelete-pr-zpykcz8f0kxuad'))