## Emissions by Country ETL

This dataset provides information on global fossil CO2 emissions by country. Although the dataset title says 2002-2022, the file we loaded spans 1750 through 2021, as the `Year` column in the summary statistics shows. The data gives us a better understanding of how much each country contributes to global warming and climate change in general. Our first step was to extract the dataset, which is in CSV format, from Kaggle: https://www.kaggle.com/datasets/thedevastator/global-fossil-co2-emissions-by-country-2002-2022

In [1]:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from azure.storage.blob import BlobClient

from config import account
from config import container
from config import credential

blob = BlobClient(account_url=f"https://{account}.blob.core.windows.net",
                  container_name=container,
                  blob_name="GCB2022v27_MtCO2_flat.csv",
                  credential=credential)


# Save the blob under the same file name that pd.read_csv loads later
with open("GCB2022v27_MtCO2_flat.csv", "wb") as f:
    data = blob.download_blob()
    data.readinto(f)

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
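
The two `set_option` calls above lift the display limits for the whole session, so displaying a long frame in full prints every row. As a minimal sketch of an alternative, `pd.option_context` applies the same settings only inside a `with` block and restores the previous values afterwards. The `frame` variable here is a hypothetical stand-in, not the emissions data.

```python
import pandas as pd

# Hypothetical 100-row frame, used only to demonstrate the display settings.
frame = pd.DataFrame({"n": range(100)})

before = pd.get_option("display.max_rows")

# Inside the block, row and column limits are removed.
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    inside = pd.get_option("display.max_rows")
    full_text = repr(frame)  # rendered without truncation

# After the block, the earlier setting is back in effect.
after = pd.get_option("display.max_rows")
```

This keeps full output available where it is needed without changing how every later cell is rendered.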

In [3]:
emissions = pd.read_csv('GCB2022v27_MtCO2_flat.csv')

We imported the packages needed to build data frames and create visualizations. We also set the display maximums for columns and rows to `None` so that pandas shows every row and column instead of truncating the output. Below, we looked at the data frame and noticed a lot of missing values.

In [4]:
emissions.head()

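|   | Country | ISO 3166-1 alpha-3 | Year | Total | Coal | Oil | Gas | Cement | Flaring | Other | Per Capita |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | AFG | 1750 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Afghanistan | AFG | 1751 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Afghanistan | AFG | 1752 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Afghanistan | AFG | 1753 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Afghanistan | AFG | 1754 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |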


After noticing the missing values, we checked which columns contained them. Most columns did, and we address those later. Our main concern was the ISO column (`ISO 3166-1 alpha-3`), which holds the abbreviated country codes. Leaving missing values in that column could cause problems later on.

In [5]:
emissions.isna().any()

Country               False
ISO 3166-1 alpha-3     True
Year                  False
Total                  True
Coal                   True
Oil                    True
Gas                    True
Cement                 True
Flaring                True
Other                  True
Per Capita             True
dtype: bool
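
`isna().any()` only says whether a column has gaps, not how many. As a minimal sketch, the same check can be extended to a count and a share per column. The `toy` frame below is a hypothetical stand-in with illustrative values, not rows from the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the emissions frame; values are illustrative only.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "ISO 3166-1 alpha-3": ["AAA", "AAA", None, None],
    "Year": [1750, 1751, 1750, 1751],
    "Total": [0.0, 1.5, 0.0, np.nan],
    "Coal": [np.nan, 1.5, np.nan, np.nan],
})

missing_count = toy.isna().sum()   # number of missing cells per column
missing_share = toy.isna().mean()  # fraction of missing cells per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary)
```

Seeing the share per column makes it easier to judge whether dropping rows, dropping a column, or filling values is the least lossy option.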

We then looked at the rows with a missing ISO code. Because each country repeats once per year, we listed the unique country names in the cell below. We decided to drop these six entries, then filled the remaining NaN values with zero rather than keeping them as NaN. Dropping the columns that contain missing values would not have helped us, because a lot of data would be lost.

In [6]:
# Check for the countries with a missing ISO code
emissions[emissions['ISO 3166-1 alpha-3'].isna()].head()

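|   | Country | ISO 3166-1 alpha-3 | Year | Total | Coal | Oil | Gas | Cement | Flaring | Other | Per Capita |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 19312 | French Equatorial Africa | NaN | 1750 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19313 | French Equatorial Africa | NaN | 1751 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19314 | French Equatorial Africa | NaN | 1752 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19315 | French Equatorial Africa | NaN | 1753 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19316 | French Equatorial Africa | NaN | 1754 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |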


In [7]:
# Unique names of the countries whose ISO code is missing
countries = emissions.loc[emissions['ISO 3166-1 alpha-3'].isna(), 'Country'].unique()
countries

array(['French Equatorial Africa', 'French West Africa',
       'Kuwaiti Oil Fires', 'Leeward Islands', 'Pacific Islands (Palau)',
       'Ryukyu Islands'], dtype=object)

In [8]:
emissions = emissions[~emissions['Country'].isin(countries)]

In [10]:
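# Reassign instead of using inplace=True: emissions is a filtered selection,
# and an in-place fill on it can trigger a SettingWithCopyWarning
emissions = emissions.fillna(0)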

In [11]:
# Check for missing data again; no missing values remain
emissions.isna().sum()

Country               0
ISO 3166-1 alpha-3    0
Year                  0
Total                 0
Coal                  0
Oil                   0
Gas                   0
Cement                0
Flaring               0
Other                 0
Per Capita            0
dtype: int64
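
The cleaning steps above (find the countries without an ISO code, drop all of their rows, zero-fill what is left) can be sketched as one reusable function. This is a minimal sketch, not the notebook's own code: `clean_emissions` and the `toy` frame are hypothetical names, and the values are illustrative.

```python
import numpy as np
import pandas as pd

def clean_emissions(df, iso_col="ISO 3166-1 alpha-3"):
    """Drop every country that lacks an ISO code, then zero-fill remaining gaps."""
    # Countries with at least one row whose ISO code is missing
    no_iso = df.loc[df[iso_col].isna(), "Country"].unique()
    # Keep only the rows of the other countries
    kept = df[~df["Country"].isin(no_iso)]
    # fillna returns a new frame, so the input is left untouched
    return kept.fillna(0)

# Hypothetical stand-in rows; values are illustrative only.
toy = pd.DataFrame({
    "Country": ["Afghanistan", "Afghanistan", "Ryukyu Islands"],
    "ISO 3166-1 alpha-3": ["AFG", "AFG", None],
    "Year": [1750, 1751, 1750],
    "Total": [0.0, np.nan, 0.0],
})

cleaned = clean_emissions(toy)
print(cleaned)
```

One caveat with this approach is that a zero-filled cell is indistinguishable from a true zero afterwards, which affects averages and medians computed on the result.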

In the next few cells, we looked at the cleaned data frame and explored what analysis it could support.

In [12]:
emissions.describe()

|   | Year | Total | Coal | Oil | Gas | Cement | Flaring | Other | Per Capita |
|---|---|---|---|---|---|---|---|---|---|
| count | 61472.0 | 61472.0 | 61472.0 | 61472.0 | 61472.0 | 61472.0 | 61472.0 | 61472.0 | 61472.0 |
| mean | 1885.5 | 56.502866 | 26.164406 | 19.691362 | 8.265806 | 1.466239 | 0.599976 | 0.288607 | 1.360931 |
| std | 78.519745 | 834.352507 | 357.990683 | 309.641471 | 147.301985 | 29.343489 | 9.936814 | 6.5732 | 9.897285 |
| min | 1750.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 1817.75 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 1885.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 1953.25 | 0.626498 | 0.0 | 0.153888 | 0.0 | 0.0 | 0.0 | 0.0 | 0.103928 |
| max | 2021.0 | 37123.850352 | 15051.51277 | 12345.653374 | 7921.829472 | 1672.592372 | 439.253991 | 306.638573 | 834.192642 |
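
The summary covers every year from 1750 to 2021, and the zero medians likely reflect the many early years with no recorded emissions together with the zero-filled gaps. As a minimal sketch, the same statistics can be restricted to a window of years. The `describe_window` helper, the `toy` frame, and the 2002 to 2021 window are assumptions chosen for illustration, not part of the original analysis.

```python
import pandas as pd

def describe_window(df, start, end):
    """Summary statistics for rows whose Year lies in [start, end]."""
    window = df[df["Year"].between(start, end)]  # between() is inclusive on both ends
    return window.describe()

# Hypothetical stand-in rows; values are illustrative only.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "A"],
    "Year": [1750, 1751, 2002, 2021],
    "Total": [0.0, 0.0, 10.0, 20.0],
})

all_years = toy.describe()
recent = describe_window(toy, 2002, 2021)
print(recent)
```

Comparing `all_years` with `recent` shows how much the long run of early zero rows pulls the averages down.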
