# A Comparative Analysis of CO2 Emissions across Countries 

Every year we hear about the rise of global warming and the world-ending theories that come with it. Today I want to see how bad things have been getting over the last few decades. Along with a time series since 1960, I also want to find out the differences between countries and look into the reasons why they may exist.

These are some of the questions we will be trying to answer throughout this notebook:
- How have CO2 Emissions fared in different countries over the last few decades?
- What are the differences in Emissions in each country and what might be the reasons for this (Could we use additional data sources for this)? 
- Can we overlay some important dates and events (for example, the signing of the Paris Agreement) and see what differences may have occurred around them?

# Imports

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import folium
import geopandas as gpd


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import warnings
warnings.filterwarnings("ignore")

In [2]:
# The df we will use for CO2 emission trends from 1960-2018
df = pd.read_csv('../input/co2-emissions-1960-2018/CO2_Emissions_1960-2018.csv').T
df = df.rename(columns=df.iloc[0])
df = df.drop(['Country Name'], axis=0)

# Data Cleaning

In [3]:
df.head(5)

In [4]:
print('Total number of countries:', len(df.columns))
na_values = df.isna().sum()
display(na_values.values)
df = df.dropna(axis=1, how='all')
print('Total number of countries with no NaN values:', len(df.columns))

## Further Cleaning

As you saw, we had 16 countries with no recorded data. While we did get rid of those, we still have some countries where almost half the data is missing. We will ignore this missing data for now but we can always revisit this later.
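Before setting the missing data aside, it may help to quantify it per country. The following is a minimal sketch on a toy frame laid out like `df` (years as rows, countries as columns). The country names `A`, `B`, `C`, the values and the 50% cut-off are all invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's df: rows are years, columns are countries.
# Names and values are invented for illustration only.
toy = pd.DataFrame(
    {
        "A": [1.0, 1.2, 1.4, 1.6],
        "B": [np.nan, np.nan, 2.0, 2.1],
        "C": [np.nan, np.nan, np.nan, 0.5],
    },
    index=["1960", "1961", "1962", "1963"],
)

# Share of missing years per country, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Countries where at least half of the years are missing (assumed cut-off).
sparse_countries = missing_share[missing_share >= 0.5].index.tolist()

print(missing_share)
print("Countries missing at least half their data:", sparse_countries)
```

Running the same two lines on the real `df` would give a list of countries to treat with care when we revisit this later.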

Another point to note is that emissions in Aruba generally sit at around 200, which is roughly 20-50 times the emissions of the rest of the world. This leads me to believe that the numbers for Aruba must be a mistake, and I will be removing them from the data. But first, let's see if there are any other such outliers.

In [5]:
df = df.astype(float)

In [6]:
df.describe()

In [7]:
# The mean of CO2 Emissions in each country:
mean_list = df.describe().loc['mean'].values
plt.hist(mean_list)
plt.show()
print('Mean of total emissions across the world:', np.mean(mean_list))
print('Number of countries with a mean emission > 50:', len([round(x) for x in mean_list if x > 50]))

Aruba seems to be the only country with such extremely high numbers. If no other countries (not even neighbouring ones) show similar values, they are probably the result of a data-recording error. Let's remove Aruba now.

In [8]:
df = df.drop('Aruba', axis=1)
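The threshold of 50 used above was picked by eye. As an alternative, here is a rough sketch that flags outliers with Tukey's fence (values above Q3 + k × IQR). This is a swapped-in technique, not what the notebook does, and the country labels, means and `k = 3` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical per-country mean emissions (invented numbers, not from the dataset).
country_means = pd.Series(
    {"A": 2.0, "B": 4.5, "C": 0.8, "D": 7.1, "E": 3.3, "F": 210.0}
)

# Tukey's rule: anything above Q3 + k * IQR is flagged.
# k = 3 is an assumed, fairly conservative multiplier.
q1, q3 = country_means.quantile([0.25, 0.75])
upper_fence = q3 + 3 * (q3 - q1)

outliers = country_means[country_means > upper_fence].index.tolist()
print("Upper fence:", round(upper_fence, 3))
print("Flagged as outliers:", outliers)
```

On the real data, `df.mean()` could take the place of `country_means`. A flagged country is only a candidate for removal and still deserves a manual check.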

# Exploratory Data Analysis

In [10]:
n_cols, n_rows = 5, 2
fig = plt.figure(figsize=(18, 3 * n_rows))
spec = gridspec.GridSpec(ncols=n_cols, nrows=n_rows, figure=fig)
sns.set_style("white")

# Plot the first ten countries, one small panel each
countries = df.columns.tolist()[:n_cols * n_rows]

for row in range(n_rows):
    for col in range(n_cols):
        country = countries[row * n_cols + col]
        ax = fig.add_subplot(spec[row, col])
        ax.set_title(country, size=12, fontname='monospace')
        ax.plot(df[country].index, df[country].values, color='#1a5d57')
        # Set tick positions first, then their labels, so the two stay aligned
        ax.set_xticks(df[country].index[::6])
        ax.set_xticklabels(df[country].index[::6], rotation=70)
        # Hide the y-axis values: these panels show the shape of the trend only
        ax.set_yticklabels([])
        for s in ['top', 'right', 'bottom', 'left']:
            ax.spines[s].set_visible(False)

fig.tight_layout(h_pad=3)
plt.show()
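One of the questions at the top was whether important events, such as the signing of the Paris Agreement, line up with changes in the trend. A minimal sketch of such an overlay follows. It uses a synthetic series rather than the real data, and the `events` dictionary is a hypothetical helper. The year index in `df` holds strings, so it would need converting with something like `df.index.astype(int)` before an overlay like this.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic emissions series for illustration only (not real data).
years = np.arange(1960, 2019)
series = pd.Series(np.linspace(1.0, 5.0, len(years)), index=years)

# Hypothetical event list; the Paris Agreement opened for signature in 2016.
events = {"Paris Agreement signed": 2016}

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(series.index, series.values, color="#1a5d57")

# Draw one dashed vertical line per event and label it.
for label, year in events.items():
    ax.axvline(year, color="grey", linestyle="--", linewidth=1)
    ax.annotate(label, xy=(year, series.max()), rotation=90,
                va="top", ha="right", fontsize=8)

ax.set_xlabel("Year")
ax.set_ylabel("CO2 emissions (synthetic)")
fig.tight_layout()
```

Since the data ends in 2018, only a couple of points fall after 2016, so any visible effect of that particular event would be weak evidence at best.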

# Comparative Analysis

However, these figures only show us how CO2 emissions have progressed over nearly 60 years. They don't show us how the level of emissions compares between countries, because each panel hides its own y-axis scale. For example, as we saw earlier, emissions in Aruba were crossing 200, which is roughly 20-50 times that of most other countries, and we wouldn't have seen that gap in these individual plots.

This is why we need a comparative analysis of some countries (it's hard to do this for all 249 countries we're considering). So, let's set up the dataframe we will need for our next steps. Going forward, my idea is to compare the CO2 emissions in 1960, 1980, 2000 and 2018 pictured on a heatmap of the world. This would allow us to get a better idea of the differences between the countries.
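As a rough illustration of the snapshot idea, the sketch below pulls the four years out of a toy frame shaped like `df` and ranks countries by their change between the first and last snapshot. The countries `A`, `B`, `C`, their values and the `change_1960_2018` column are invented for this example.

```python
import pandas as pd

# Toy frame in the notebook's layout: year strings as the index, countries as columns.
# All values are invented for illustration.
toy = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, 2.5],
        "B": [0.5, 0.7, 1.5, 4.0],
        "C": [6.0, 9.0, 7.0, 5.0],
    },
    index=["1960", "1980", "2000", "2018"],
)

snapshot_years = ["1960", "1980", "2000", "2018"]

# One row per country, one column per snapshot year.
snapshots = toy.loc[snapshot_years].T

# Absolute change between the first and last snapshot (hypothetical helper column).
snapshots["change_1960_2018"] = snapshots["2018"] - snapshots["1960"]

ranked = snapshots.sort_values("change_1960_2018", ascending=False)
print(ranked)
```

A table like this complements the maps: the maps show where emissions are high, while the ranking shows where they moved the most.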

In [17]:
# To use a world heat map, we need to combine our data with country codes which can be read by the geopandas library.
# For this, we use another dataset from Kaggle - "Iso Country Codes Global"

country_df = df.T.reset_index()
country_code = pd.read_csv('../input/iso-country-codes-global/wikipedia-iso-country-codes.csv')

country_df = country_df.rename(columns={'index': 'country'})
country_code = country_code.rename(columns={'English short name lower case': 'country'})

df_global = country_df.merge(country_code, how='left', left_on=['country'], right_on=['country'])
df_global = df_global.rename(columns={'Alpha-3 code': 'iso_a3'})

In [16]:
df_global.head(5)
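Merging on country names is fragile, because the two datasets may spell some names differently and a left merge silently leaves those rows without a code. A small sketch of how this could be checked with the `indicator` option of `merge` is shown here. The two toy frames, the emission values and the naming mismatch are assumptions for illustration, not rows from the actual files.

```python
import pandas as pd

# Toy emissions table; the values are invented and the "Bahamas, The" spelling
# is a hypothetical example of a name that differs between sources.
emissions = pd.DataFrame(
    {"country": ["France", "Bahamas, The", "Japan"], "2018": [4.6, 5.0, 8.7]}
)

# Toy country-code table with ISO alpha-3 codes.
codes = pd.DataFrame(
    {"country": ["France", "Bahamas", "Japan"], "iso_a3": ["FRA", "BHS", "JPN"]}
)

# indicator=True adds a "_merge" column recording where each row was found.
checked = emissions.merge(codes, how="left", on="country", indicator=True)

# Rows that exist only on the left side got no ISO code.
unmatched = checked.loc[checked["_merge"] == "left_only", "country"].tolist()
print("Countries without a matching code:", unmatched)
```

Any names listed in `unmatched` would need a manual rename before the merge, otherwise those countries drop out of the maps.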

In [18]:
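# Note: gpd.datasets was deprecated in geopandas 0.14 and removed in 1.0,
# so this call needs an older geopandas version
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))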
world.head(5)

In [22]:
mapped = world.merge(df_global, how='left', left_on='iso_a3', right_on='iso_a3')
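# Note: filling with 0 makes countries without emissions data look the same as true zeros on the maps
mapped = mapped.fillna(0)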
mapped = mapped.drop(['Alpha-2 code', 'Numeric code','ISO 3166-2', 'pop_est', 'continent', 'country'], axis=1)
mapped.head(5)

In [57]:
import geoplot
import mapclassify

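# Fig Setup
# Use plt.figure rather than plt.subplots so no stray empty axes sits behind the grid
fig = plt.figure(figsize=(18,12))
# Rows 0 and 2 hold the maps; the middle row is left free for the text
spec = gridspec.GridSpec(ncols=2, nrows=3, figure=fig)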

def create_map(year, posx, posy):
    ax = fig.add_subplot(spec[posx,posy])
    mapped.plot(column=year, cmap='Blues', linewidth=0.8, ax=ax, edgecolors='0.8', legend=True)
    ax.set_title('CO2 Emissions in {}'.format(year), fontdict={'fontsize':15})
    ax.set_axis_off()

create_map('1960', 0, 0)
create_map('1980', 0, 1)
create_map('2000', 2, 0)
create_map('2018', 2, 1)


fig.text(0.25, 0.62, 'Comparative Study of CO2 Emissions across Countries', fontsize=17, fontweight='bold', fontfamily='sans-serif')
fig.text(0.25, 0.37, 
'''Right off the bat, we notice one consistent theme - CO2 emissions are generally higher in the West
across all 4 time periods considered. From these visualisations alone, it seems to correlate with the 
development of the economy. In other words, developed economies (like the US, Australia, UK) seem to 
have higher CO2 emissions.
As we approach the 21st century, we see a shift from high rates in the West. They start to even out a 
little and increase in the Middle East and Australia, followed closely by China and some European
countries.
Another important point to note is the scale on the right. There's a jump in emissions to values of 
around 50 in 1980 and it drops back to the range of 10-30 in 2000. This is consistent with what we saw 
in our initial plots. Most countries experienced a significant peak followed by a drop in emissions in
the 1980s.
'''
, fontsize=14, fontweight='light', fontfamily='sans-serif')


fig.tight_layout()
plt.show()