# Loading Emissions Data

## Introduction

This notebook documents how the datasets used in the project's principal exploratory data analysis were loaded. It is by no means an exhaustive cleaning process; rather, it condenses many files into single files that others can access. The 'cleaning', if we can call it that, consists of removing data that is not relevant, namely data describing areas outside of the United States.

In [3]:
import pandas as pd
import xarray as xr
import os

## CH$_4$ Data

The following dataset, like all subsequent datasets on atmospheric greenhouse gas emissions, comes from the Emissions Database for Global Atmospheric Research (EDGAR), spearheaded by the European Commission. The data is laid out in the same way in every case, so the layout described in this section also applies to the other gases (CO$_2$ and N$_2$O). The data is gridded at a resolution of $0.1^\circ \times 0.1^\circ$ in latitude and longitude and is provided as .nc files, one per year. Each .nc file contains the following information:

|Column|Description|
|:--|:--|
|lat|Latitude|
|lon|Longitude|
|emi_*substance*|The amount of the respective substance released. Measured in kg substance/m$^2$|

We will read the file for each individual year into a DataFrame and concatenate these into one overarching DataFrame. Each individual DataFrame also gets a `year` column so that this information is readily available in the final DataFrame.

In [25]:
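ch4_df = pd.DataFrame()

for file in os.listdir('EDGAR Emissions/CH4_TOTALS_nc'):
    # Only process the NetCDF files in the directory.
    if file.endswith('.nc'):
        # The four-digit year sits at a fixed position in the filename.
        year = file[9:13]
        path = f'EDGAR Emissions/CH4_TOTALS_nc/{file}'
        ds = xr.open_dataset(path)
        # to_dataframe() yields a frame indexed by (lat, lon).
        tmp_df = ds.to_dataframe()
        # Note: unstack() returns a new object; the result is discarded here,
        # so tmp_df itself is unchanged.
        tmp_df.unstack()
        tmp_df['year'] = int(year)
        ch4_df = pd.concat([ch4_df, tmp_df])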
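
The slice `file[9:13]` relies on the year sitting at a fixed character offset in every filename. As a hedged alternative, the year could be located with a regular expression instead. The filenames below are made-up stand-ins, not the actual EDGAR names, and `extract_year` is a hypothetical helper.

```python
import re

# Hypothetical pattern: a standalone four-digit year between 1900 and 2099.
YEAR_PATTERN = re.compile(r'(?<!\d)(?:19|20)\d{2}(?!\d)')

def extract_year(filename):
    """Return the first four-digit year found in `filename` as an int."""
    match = YEAR_PATTERN.search(filename)
    if match is None:
        raise ValueError(f'no year found in {filename!r}')
    return int(match.group(0))

# Made-up filenames, for illustration only.
print(extract_year('gas_CH4_2002_grid.nc'))
print(extract_year('totals_1992.nc'))
```

This only works if no other four-digit number starting with 19 or 20 appears earlier in the name, so it is worth checking against the real filenames before relying on it.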

The result is a DataFrame with a (`lat`, `lon`) MultiIndex. The `unstack()` call does not change this, because its result is never assigned, so `lat` and `lon` remain index levels rather than regular columns. Consequently, I saved the DataFrame to a .csv file that we can read back later; reading it back turns the index levels into ordinary columns.

In [1]:
ch4_df.to_csv('EDGAR Emissions/ch4.csv')

In [34]:
ch4_df.head()

lat,lon,emi_ch4,year
-89.949997,0.05,0.0,2002
-89.949997,0.15,0.0,2002
-89.949997,0.25,0.0,2002
-89.949997,0.35,0.0,2002
-89.949997,0.45,0.0,2002
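
In the output above, `lat` and `lon` live in the index rather than in regular columns. One possible alternative to the CSV round trip is `reset_index()`, which moves index levels into columns. The sketch below uses a tiny made-up frame; `demo` and its values are not part of the notebook.

```python
import pandas as pd

# Tiny made-up frame with the same (lat, lon) MultiIndex layout as ch4_df.
index = pd.MultiIndex.from_tuples(
    [(-89.95, 0.05), (-89.95, 0.15), (-89.85, 0.05)],
    names=['lat', 'lon'],
)
demo = pd.DataFrame({'emi_ch4': [0.0, 0.0, 1.5e-14], 'year': 2002}, index=index)

# reset_index() turns the index levels into ordinary columns.
flat = demo.reset_index()
print(flat.columns.tolist())
```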


If we read the csv into a DataFrame, we will see how the data is laid out.

In [5]:
ch4_df = pd.read_csv('EDGAR Emissions/ch4.csv')

In [5]:
ch4_df.head()

Unnamed: 0,lat,lon,emi_ch4,year
0,-89.949997,0.05,0.0,2002
1,-89.949997,0.15,0.0,2002
2,-89.949997,0.25,0.0,2002
3,-89.949997,0.35,0.0,2002
4,-89.949997,0.45,0.0,2002


In order to use this data productively, we need to ensure that the `lat` and `lon` values are written in a way that is consistent with our wildfire data.

In [12]:
print('The longitude values range from:')
print(ch4_df['lon'].min(), ch4_df['lon'].max())

print('The latitude values range from:')
print(ch4_df['lat'].min(), ch4_df['lat'].max())

The longitude values range from:
0.050000000745058 359.95001220703125
The latitude values range from:
-89.94999694824219 89.94999694824219


The longitude values are all positive, running from 0$^{\circ}$ to 360$^{\circ}$, while the latitude values range from -90$^{\circ}$ to 90$^{\circ}$. The boundaries that we have defined for the area of North America, however, use longitudes in the range -180$^{\circ}$ to 180$^{\circ}$. We need to convert these boundary longitudes to the convention used in the DataFrame, which means adding 360$^{\circ}$ to each negative value. The table below gives each longitude in both conventions.

|Westernmost Longitude|Easternmost Longitude|Northernmost Latitude|Southernmost Latitude|
|:--|:--|:--|:--|
|-178.1333 / 181.8667|-53.0567 / 306.9433|82.9143|14.0749|
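
The paired values in the table follow from a modulo operation. As a minimal sketch (the function names are illustrative, not from the notebook), the conversion in both directions can be written as:

```python
def to_0_360(lon):
    """Map a longitude from the [-180, 180] convention onto [0, 360)."""
    return lon % 360

def to_pm180(lon):
    """Map a longitude from the [0, 360) convention onto [-180, 180)."""
    return (lon + 180) % 360 - 180

# Boundary longitudes from the table, converted to the 0-360 convention.
west = to_0_360(-178.1333)
east = to_0_360(-53.0567)
print(round(west, 4), round(east, 4))
```

Converting the two boundary values once is cheaper than converting the `lon` column of the full DataFrame, which is why the filter is applied in the 0-360 convention.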

With these boundaries we can drop the rows that don't fall within the geographical region.

In [18]:
latitudes = (ch4_df['lat'] >= 14.0749) & (ch4_df['lat'] <= 82.9143)
longitudes = (ch4_df['lon'] >= 181.8667) & (ch4_df['lon'] <= 306.9433)
ch4_df = ch4_df[latitudes & longitudes]

Having removed the rows outside the region, we are left with a DataFrame that contains only the relevant latitudes and longitudes for the years 1992-2015.
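
As a sketch of the same bounding-box filter on a small made-up frame, `Series.between` expresses each inclusive range in a single call. The `BOUNDS` dictionary and the `demo` frame are illustrative names; the numeric limits are the ones from the table above.

```python
import pandas as pd

# Bounding box from the table above, with longitudes in the 0-360 convention.
BOUNDS = {'lat': (14.0749, 82.9143), 'lon': (181.8667, 306.9433)}

# Made-up grid points: only the second and third fall inside the box.
demo = pd.DataFrame({
    'lat': [-89.95, 14.15, 37.05, 37.05],
    'lon': [0.05, 181.95, 224.15, 350.05],
    'emi_ch4': [0.0, 2.0e-16, 1.6e-14, 3.0e-15],
})

# between() is inclusive on both ends by default, matching >= and <=.
in_lat = demo['lat'].between(*BOUNDS['lat'])
in_lon = demo['lon'].between(*BOUNDS['lon'])
subset = demo[in_lat & in_lon]
print(subset)
```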

In [20]:
ch4_df.head()

Unnamed: 0,lat,lon,emi_ch4,year
3749419,14.15,181.949997,2.025535e-16,2002
3749420,14.15,182.050003,8.164502e-16,2002
3749421,14.15,182.149994,2.210639e-15,2002
3749422,14.15,182.25,2.04898e-15,2002
3749423,14.15,182.350006,1.972801e-15,2002


We notice, however, that the DataFrame is not sorted. Sorting it by year makes it easier to read.

In [22]:
ch4_sorted = ch4_df.sort_values(by='year')
ch4_sorted.head()

Unnamed: 0,lat,lon,emi_ch4,year
12703868,82.849998,306.850006,0.0,1992
11054241,37.049999,224.149994,1.641249e-14,1992
11054240,37.049999,224.050003,1.154963e-14,1992
11054239,37.049999,223.949997,1.350558e-14,1992
11054238,37.049999,223.850006,1.25344e-14,1992
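
Sorting on `year` alone with the default algorithm does not guarantee any particular order within a year, which is consistent with the rows above not following a clear latitude or longitude order. A hedged variant, shown on a made-up frame, sorts on several keys so that the result is deterministic.

```python
import pandas as pd

# Made-up rows in arbitrary order.
demo = pd.DataFrame({
    'lat': [37.05, 14.15, 37.05, 14.15],
    'lon': [224.15, 182.05, 223.95, 181.95],
    'year': [2002, 1992, 1992, 2002],
})

# Sort by year first, then latitude, then longitude, and renumber the rows.
ordered = demo.sort_values(by=['year', 'lat', 'lon']).reset_index(drop=True)
print(ordered)
```

Dropping the old index with `reset_index(drop=True)` also avoids carrying the original row numbers into a saved file.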


In [23]:
ch4_sorted.to_csv('EDGAR Emissions/ch4.csv')

## CO$_2$ Data

Since this dataset has the same structure as the previous one, we will follow the same process to create our final .csv file.

In [28]:
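co2_df = pd.DataFrame()

for file in os.listdir('EDGAR Emissions/CO2_excl_short-cycle_TOTALS_nc'):
    # Only process the NetCDF files in the directory.
    if file.endswith('.nc'):
        # The year is counted from the end because these filenames are longer.
        year = file[-22:-18]
        path = f'EDGAR Emissions/CO2_excl_short-cycle_TOTALS_nc/{file}'
        ds = xr.open_dataset(path)
        # to_dataframe() yields a frame indexed by (lat, lon).
        tmp_df = ds.to_dataframe()
        # Note: unstack() returns a new object; the result is discarded here,
        # so tmp_df itself is unchanged.
        tmp_df.unstack()
        tmp_df['year'] = int(year)
        co2_df = pd.concat([co2_df, tmp_df])

co2_df.head()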

lat,lon,emi_co2,year
-89.949997,0.05,0.0,2015
-89.949997,0.15,0.0,2015
-89.949997,0.25,0.0,2015
-89.949997,0.35,0.0,2015
-89.949997,0.45,0.0,2015


In [30]:
co2_df.to_csv('EDGAR Emissions/co2.csv')

In [31]:
co2_df = pd.read_csv('EDGAR Emissions/co2.csv')

In [32]:
co2_df.head()

Unnamed: 0,lat,lon,emi_co2,year
0,-89.949997,0.05,0.0,2015
1,-89.949997,0.15,0.0,2015
2,-89.949997,0.25,0.0,2015
3,-89.949997,0.35,0.0,2015
4,-89.949997,0.45,0.0,2015


In [35]:
co2_df.shape

(155520000, 4)
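
With roughly 155 million rows, reading the whole CSV into memory at once can be demanding. One possible approach, sketched here with an in-memory stand-in for the file, is to read it in chunks and keep only the rows inside the bounding box. The chunk size and the sample rows are assumptions for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for a large CSV such as co2.csv (values are made up).
csv_text = """lat,lon,emi_co2,year
-89.95,0.05,0.0,2015
59.95,264.55,0.0,2015
59.95,264.75,1.0e-10,2015
-10.05,120.05,2.0e-10,2015
"""

kept = []
# chunksize makes read_csv yield DataFrames of at most that many rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    in_lat = chunk['lat'].between(14.0749, 82.9143)
    in_lon = chunk['lon'].between(181.8667, 306.9433)
    kept.append(chunk[in_lat & in_lon])

# Combine the filtered pieces into one frame with a fresh index.
co2_subset = pd.concat(kept, ignore_index=True)
print(co2_subset.shape)
```

For a real file, a much larger `chunksize` (for example a few million rows) would likely be a more sensible starting point.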

In [33]:
co2_df.isnull().sum()

lat        0
lon        0
emi_co2    0
year       0
dtype: int64

In [34]:
print('The longitude values range from:')
print(co2_df['lon'].min(), co2_df['lon'].max())

print('The latitude values range from:')
print(co2_df['lat'].min(), co2_df['lat'].max())

The longitude values range from:
0.050000000745058 359.95001220703125
The latitude values range from:
-89.94999694824219 89.94999694824219


Keep only the rows that fall within the latitude and longitude boundaries.

In [36]:
latitudes = (co2_df['lat'] >= 14.0749) & (co2_df['lat'] <= 82.9143)
longitudes = (co2_df['lon'] >= 181.8667) & (co2_df['lon'] <= 306.9433)
co2_df = co2_df[latitudes & longitudes]

In [37]:
co2_df.shape

(20640000, 4)
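
Both row counts can be cross-checked by hand. The sketch assumes a global 0.1° grid with cell centres from 0.05° to 359.95° in longitude and -89.95° to 89.95° in latitude, and 24 yearly files (1992-2015). The in-box centre ranges, 14.15° to 82.85° and 181.95° to 306.85°, are read off the outputs shown in this notebook.

```python
# Number of yearly files, 1992 through 2015 inclusive.
years = 2015 - 1992 + 1

# Global grid: 360 / 0.1 longitude cells and 180 / 0.1 latitude cells.
lon_cells = 3600
lat_cells = 1800
total_rows = lon_cells * lat_cells * years

# Cell centres inside the bounding box, counted in hundredths of a degree
# with integers to avoid floating-point rounding (step = 10 hundredths).
lat_in_box = (8285 - 1415) // 10 + 1      # 14.15 ... 82.85
lon_in_box = (30685 - 18195) // 10 + 1    # 181.95 ... 306.85
filtered_rows = lat_in_box * lon_in_box * years

print(total_rows, filtered_rows)  # prints 155520000 20640000
```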

In [38]:
co2_sorted = co2_df.sort_values(by='year')
co2_sorted.head()

Unnamed: 0,lat,lon,emi_co2,year
109903868,82.849998,306.850006,0.0,1992
109079045,59.950001,264.549988,0.0,1992
109079046,59.950001,264.649994,0.0,1992
109079047,59.950001,264.75,1.04613e-10,1992
109079048,59.950001,264.850006,1.280826e-10,1992


In [39]:
co2_sorted.to_csv('EDGAR Emissions/co2.csv')

## N$_2$O Data

In [45]:
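n2o_df = pd.DataFrame()

for file in os.listdir('EDGAR Emissions/N2O_TOTALS_nc'):
    # Only process the NetCDF files in the directory.
    if file.endswith('.nc'):
        # The four-digit year sits at a fixed position in the filename.
        year = file[9:13]
        path = f'EDGAR Emissions/N2O_TOTALS_nc/{file}'
        ds = xr.open_dataset(path)
        # to_dataframe() yields a frame indexed by (lat, lon).
        tmp_df = ds.to_dataframe()
        # Note: unstack() returns a new object; the result is discarded here,
        # so tmp_df itself is unchanged.
        tmp_df.unstack()
        tmp_df['year'] = int(year)
        n2o_df = pd.concat([n2o_df, tmp_df])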
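
Because the grid is regular, the region could in principle be selected in xarray before converting to a DataFrame, which would keep the intermediate frame much smaller. The sketch below uses a coarse, made-up 10° grid rather than an EDGAR file; `demo_ds` and its values are assumptions.

```python
import numpy as np
import xarray as xr

# Coarse made-up grid (10-degree cells) standing in for one yearly file.
lats = np.arange(-85.0, 90.0, 10.0)    # -85, -75, ..., 85
lons = np.arange(5.0, 360.0, 10.0)     # 5, 15, ..., 355
demo_ds = xr.Dataset(
    {'emi_n2o': (('lat', 'lon'), np.zeros((lats.size, lons.size)))},
    coords={'lat': lats, 'lon': lons},
)

# Label-based slices include both end points and need sorted coordinates.
region = demo_ds.sel(lat=slice(14.0749, 82.9143), lon=slice(181.8667, 306.9433))

# Convert only the selected cells and flatten the (lat, lon) index.
region_df = region.to_dataframe().reset_index()
region_df['year'] = 2010
print(region_df.shape)
```

This assumes the coordinates are stored in ascending order, as the outputs in this notebook suggest; a descending latitude axis would need the slice bounds reversed.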

In [46]:
n2o_df.head()

lat,lon,emi_n2o,year
-89.949997,0.05,4.642007e-13,2010
-89.949997,0.15,7.027698e-14,2010
-89.949997,0.25,7.02767e-14,2010
-89.949997,0.35,4.641762e-13,2010
-89.949997,0.45,7.027825e-14,2010


In [47]:
n2o_df.shape

(155520000, 2)

In [48]:
n2o_df.to_csv('EDGAR Emissions/n2o.csv')

In [49]:
n2o_df = pd.read_csv('EDGAR Emissions/n2o.csv')

In [50]:
n2o_df.head()

Unnamed: 0,lat,lon,emi_n2o,year
0,-89.949997,0.05,4.642007e-13,2010
1,-89.949997,0.15,7.027698e-14,2010
2,-89.949997,0.25,7.02767e-14,2010
3,-89.949997,0.35,4.641762e-13,2010
4,-89.949997,0.45,7.027825e-14,2010


In [52]:
n2o_df.shape

(155520000, 4)

In [54]:
n2o_df.isnull().sum()

lat        0
lon        0
emi_n2o    0
year       0
dtype: int64

In [55]:
print('The longitude values range from:')
print(n2o_df['lon'].min(), n2o_df['lon'].max())

print('The latitude values range from:')
print(n2o_df['lat'].min(), n2o_df['lat'].max())

The longitude values range from:
0.050000000745058 359.95001220703125
The latitude values range from:
-89.94999694824219 89.94999694824219


In [56]:
latitudes = (n2o_df['lat'] >= 14.0749) & (n2o_df['lat'] <= 82.9143)
longitudes = (n2o_df['lon'] >= 181.8667) & (n2o_df['lon'] <= 306.9433)
n2o_df = n2o_df[latitudes & longitudes]
n2o_df.head()

Unnamed: 0,lat,lon,emi_n2o,year
3749419,14.15,181.949997,4.072134e-15,2010
3749420,14.15,182.050003,4.072134e-15,2010
3749421,14.15,182.149994,4.087264e-15,2010
3749422,14.15,182.25,4.065503e-15,2010
3749423,14.15,182.350006,4.068405e-15,2010


In [57]:
n2o_df.shape

(20640000, 4)

In [59]:
n2o_df = n2o_df.sort_values(by='year')
n2o_df.head()

Unnamed: 0,lat,lon,emi_n2o,year
103423868,82.849998,306.850006,3.002973e-15,1992
101774241,37.049999,224.149994,3.022328e-15,1992
101774240,37.049999,224.050003,3.170282e-15,1992
101774239,37.049999,223.949997,3.231255e-15,1992
101774238,37.049999,223.850006,3.166844e-15,1992


In [60]:
n2o_df.to_csv('EDGAR Emissions/n2o.csv')