# Visualising the CO2 Emission

## Introduction

#### This notebook performs exploratory data analysis on multiple datasets and visualises the consolidated data.

The main objective is to visualise global CO2 emissions and track how they have changed over time.

Plotly is used to visualise the data.

#### Used datasets are:

- CO2_GHG_emissions-data
- Country Mapping - ISO, Continent, Region
- Population by Country - 2020

#### Visualisations answer some simple questions:

- How have CO2 emissions changed since the year 1750?
- Which countries and regions cause the most CO2 emissions?
- Is there a correlation between a country's CO2 emission and its population?

![GettyImages-155141288-polar-bear-hero.jpg](attachment:GettyImages-155141288-polar-bear-hero.jpg)

## 1. Import necessary libraries

In [2]:
!pip install cufflinks



In [3]:
import pandas as pd
import numpy as np
import math
import cufflinks as cf
import plotly
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot 
init_notebook_mode(connected=True)
cf.go_offline()

## 2. Data Manipulation on the CO2 Emission Dataset

### Read the CO2 data frame with necessary columns & parsing the date column

In [4]:
df_co = pd.read_csv('co2_emission.csv', parse_dates = ['Year'])

### Inspect the CO2 emission data frame

In [5]:
df_co.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20853 entries, 0 to 20852
Data columns (total 4 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   Entity                          20853 non-null  object        
 1   Code                            18646 non-null  object        
 2   Year                            20853 non-null  datetime64[ns]
 3   Annual CO₂ emissions (tonnes )  20853 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 651.8+ KB


### See which rows do not have ISO Code

In [6]:
df_co_na = df_co[df_co['Code'].isnull()]

df_co_na['Entity'].unique()

array(['Africa', 'Americas (other)', 'Antarctic Fisheries',
       'Asia and Pacific (other)', 'EU-28', 'Europe (other)',
       'International transport', 'Kyrgysztan', 'Middle East',
       'Statistical differences', 'Wallis and Futuna Islands'],
      dtype=object)

#### It looks like only two actual countries are missing their ISO ALPHA-3 code: Kyrgysztan and Wallis and Futuna Islands. The other entities are regions or statistical groupings. We will add the two codes, KGZ and WLF, manually.

### Filling missing code values

In [7]:
# Copy the Kyrgysztan and Wallis and Futuna Islands rows into new data frames and fill their Code column
# .copy() gives independent frames, so the assignment below does not touch df_co

df_kgz = df_co[df_co["Entity"] == "Kyrgysztan"].copy()
df_kgz["Code"] = df_kgz["Code"].fillna("KGZ")

df_wlf = df_co[df_co["Entity"] == "Wallis and Futuna Islands"].copy()
df_wlf["Code"] = df_wlf["Code"].fillna("WLF")
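
The same fill can be expressed with a lookup table, which scales better if more entities ever need a manual code. This is a minimal sketch on toy data; `df_toy` and `missing_codes` are hypothetical names that do not appear elsewhere in the notebook.

```python
import pandas as pd

# Toy stand-in for df_co; the real frame is read from co2_emission.csv.
df_toy = pd.DataFrame({
    "Entity": ["Kyrgysztan", "Wallis and Futuna Islands", "Africa", "Albania"],
    "Code": [None, None, None, "ALB"],
})

# Hypothetical lookup: entity name -> ISO alpha-3 code to fill in.
missing_codes = {"Kyrgysztan": "KGZ", "Wallis and Futuna Islands": "WLF"}

# Fill only where Code is missing; entities absent from the lookup stay NaN
# and can still be dropped afterwards.
df_toy["Code"] = df_toy["Code"].fillna(df_toy["Entity"].map(missing_codes))
```

With this approach there is no need to split the rows out and append them again, although the end result should be the same.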

### Now, drop all rows containing NaN in their Code column
#### There is no problem with dropping Kyrgysztan and Wallis and Futuna Islands rows since we are going to append their data frames to the main one.

In [8]:
# Drop the rows containing NaN values in their Code column (including the Kyrgysztan and Wallis and Futuna Islands rows copied above)
# Since the Code column is the only column containing NaN, we can directly apply the dropna method

df_co.dropna(inplace=True)
df_co.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18646 entries, 0 to 20852
Data columns (total 4 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   Entity                          18646 non-null  object        
 1   Code                            18646 non-null  object        
 2   Year                            18646 non-null  datetime64[ns]
 3   Annual CO₂ emissions (tonnes )  18646 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(2)
memory usage: 728.4+ KB
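
A plain `dropna()` is safe here only because `Code` is the single column with missing values. The sketch below, on made-up toy data, shows how `subset` restricts the check to one column if that ever stops being true.

```python
import pandas as pd

# Toy frame with a missing Code and, separately, a missing emission value.
df_toy = pd.DataFrame({
    "Entity": ["Africa", "Albania", "Aruba"],
    "Code": [None, "ALB", "ABW"],
    "Annual CO₂ emissions (tonnes )": [1.0, None, 3.0],
})

# Drop rows only when Code is missing; a NaN in another column is kept.
by_code = df_toy.dropna(subset=["Code"])

# Plain dropna() removes a row if any column is NaN.
any_nan = df_toy.dropna()
```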


### Append all three data frames together to have the final country dataframe

In [9]:
# Append all 3 data frames
df = pd.concat([df_co, df_kgz, df_wlf])

# Sort values by country and year (sort_values returns a new frame, so assign it back)
# and rebuild a clean 0..n-1 index
df = df.sort_values(['Entity', 'Year']).reset_index(drop=True)

df.head()

Unnamed: 0,Entity,Code,Year,Annual CO₂ emissions (tonnes )
0,Afghanistan,AFG,1949-01-01,14656.0
1,Afghanistan,AFG,1950-01-01,84272.0
2,Afghanistan,AFG,1951-01-01,91600.0
3,Afghanistan,AFG,1952-01-01,91600.0
4,Afghanistan,AFG,1953-01-01,106256.0


### To get the region names for the countries, read the Country Codes dataset into a dataframe

In [11]:
# Read the dataset into a data frame
df_inf = pd.read_csv('continents2.csv', usecols=["alpha-3", "region", "sub-region"])

# Rename the column of data frame
df_inf.rename(columns={'alpha-3':'Code', 'region':'Region', 'sub-region':'Sub-Region'}, inplace=True)

# Have a look at the data frame
df_inf.head()

Unnamed: 0,Code,Region,Sub-Region
0,AFG,Asia,Southern Asia
1,ALA,Europe,Northern Europe
2,ALB,Europe,Southern Europe
3,DZA,Africa,Northern Africa
4,ASM,Oceania,Polynesia


### Merge two datasets to have extended region information for all countries

In [12]:
df_merged = pd.merge(df, df_inf, how='left', on='Code')

df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18696 entries, 0 to 18695
Data columns (total 6 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   Entity                          18696 non-null  object        
 1   Code                            18696 non-null  object        
 2   Year                            18696 non-null  datetime64[ns]
 3   Annual CO₂ emissions (tonnes )  18696 non-null  float64       
 4   Region                          18297 non-null  object        
 5   Sub-Region                      18297 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 1022.4+ KB


### Check which rows do not have region (and sub-region)

In [13]:
df_merged_na = df_merged[df_merged['Region'].isnull()]

df_merged_na['Entity'].unique()

array(['Czechoslovakia', 'World'], dtype=object)
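
Another way to spot rows that found no partner in a left merge is the `indicator` option of `pd.merge`. This is a small sketch on toy frames; the codes for World and Czechoslovakia below are placeholders, not the values in the dataset.

```python
import pandas as pd

# Toy emission-side frame; "XX1" and "XX2" are placeholder codes.
left = pd.DataFrame({
    "Entity": ["Albania", "World", "Czechoslovakia"],
    "Code": ["ALB", "XX1", "XX2"],
})

# Toy region lookup.
right = pd.DataFrame({"Code": ["ALB", "DZA"], "Region": ["Europe", "Africa"]})

# indicator=True adds a _merge column: "both" or "left_only" for a left merge.
merged = left.merge(right, how="left", on="Code", indicator=True)

# Entities whose code was not found in the lookup.
unmatched = merged.loc[merged["_merge"] == "left_only", "Entity"].tolist()
```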

### Drop the rows that contain World and check again

In [14]:
df_merged = df_merged[df_merged['Entity'] != 'World']

df_merged_na = df_merged[df_merged['Region'].isnull()]

df_merged_na['Entity'].unique()

array(['Czechoslovakia'], dtype=object)

### Fill the NaN values in the rows that contain Czechoslovakia in their Entity column

In [15]:
# Fill both columns in one call and assign the result back,
# which avoids an in-place fill on a column of a filtered frame
df_merged = df_merged.fillna({'Region': 'Europe', 'Sub-Region': 'Eastern Europe'})

df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18429 entries, 0 to 18695
Data columns (total 6 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   Entity                          18429 non-null  object        
 1   Code                            18429 non-null  object        
 2   Year                            18429 non-null  datetime64[ns]
 3   Annual CO₂ emissions (tonnes )  18429 non-null  float64       
 4   Region                          18429 non-null  object        
 5   Sub-Region                      18429 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 1007.8+ KB


#### We dealt with all the NaN values in our merged data frame

### Have a look at the final data frame

In [16]:
df_merged.head()

Unnamed: 0,Entity,Code,Year,Annual CO₂ emissions (tonnes ),Region,Sub-Region
0,Afghanistan,AFG,1949-01-01,14656.0,Asia,Southern Asia
1,Afghanistan,AFG,1950-01-01,84272.0,Asia,Southern Asia
2,Afghanistan,AFG,1951-01-01,91600.0,Asia,Southern Asia
3,Afghanistan,AFG,1952-01-01,91600.0,Asia,Southern Asia
4,Afghanistan,AFG,1953-01-01,106256.0,Asia,Southern Asia


## 3. Visualising the CO2 Emission

### 3.1 CO2 Emission Change in Time

#### Let's have a look at the CO2 emission trend through years for every country

In [17]:
# Plot the line chart
fig = px.line(df_merged,
              x="Year",
              y="Annual CO₂ emissions (tonnes )",
              hover_name = 'Entity',
              hover_data=['Entity','Annual CO₂ emissions (tonnes )'],
              color='Entity',
              labels={'Entity':'Country','Annual CO₂ emissions (tonnes )':'CO₂ Emission'},
              height=600)

# Update the title and adjust its location
fig.update_layout(title="Change in CO₂ Emission Between Years 1750 and 2017 - Countries",
                  title_x=0.50)

# Remove the legend
fig.update_layout(showlegend = False)

# Make background transparent
# fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)', 'paper_bgcolor': 'rgba(0, 0, 0, 0)'})

# Show color scale axis
fig.update(layout_coloraxis_showscale = True)

# Show the figure
fig.show()

#### Now observe the CO2 emission of different regions
For this, we need to create an aggregated data frame by Region and Year

In [18]:
# Group the data frame by Region and Year columns and sum the CO2 emission
total_reg = df_merged.groupby(["Region", "Year"])["Annual CO₂ emissions (tonnes )"].sum()

# Create a data frame from the resulting series
df_reg = pd.DataFrame(total_reg)

# Resulting data frame will have 2 index columns: Region and Year
# We should reset the index to convert them into columns
df_reg.reset_index(level=0, inplace=True)
df_reg.reset_index(level=0, inplace=True)

df_reg.head()

Unnamed: 0,Year,Region,Annual CO₂ emissions (tonnes )
0,1884-01-01,Africa,21984.0
1,1885-01-01,Africa,36640.0
2,1886-01-01,Africa,47632.0
3,1887-01-01,Africa,47632.0
4,1888-01-01,Africa,80608.0
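
The group, sum and reset-index steps can also be written in one call with `as_index=False`. A minimal sketch on toy data follows; the column name `Emissions` and the integer years are simplifications for illustration.

```python
import pandas as pd

# Toy rows: two African entries in 1884 that should be summed together.
df_toy = pd.DataFrame({
    "Region": ["Africa", "Africa", "Asia", "Asia"],
    "Year": [1884, 1884, 1884, 1885],
    "Emissions": [10.0, 5.0, 7.0, 8.0],
})

# as_index=False keeps the grouping keys as ordinary columns,
# so no reset_index call is needed afterwards.
df_reg_toy = df_toy.groupby(["Region", "Year"], as_index=False)["Emissions"].sum()
```

Note that the columns come out as Region, Year, value, whereas resetting the index level by level as above puts Year first.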


#### Plot the area chart for every region on the same y-axis

In [19]:
fig = px.area(df_reg,
              x="Year",
              y="Annual CO₂ emissions (tonnes )",
              color="Region",
              facet_col="Region",
              facet_col_wrap=5,
              labels={'Entity':'Country','Annual CO₂ emissions (tonnes )':'CO₂ Emission'},
              height=350)

# Update the title and adjust its location
fig.update_layout(title="Change in CO₂ Emission Between Years 1750 and 2017 - Regions",
                  title_x=0.50)

# Remove the legend
fig.update_layout(showlegend = False)

# Make background transparent
# fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)', 'paper_bgcolor': 'rgba(0, 0, 0, 0)'})

# Show the figure
fig.show()

Although Europe and the Americas account for most of the CO2 emitted in total, Asia's CO2 emissions have skyrocketed in the last two decades.

#### Finally, let's look at the total CO2 emission from all countries in the world.

For this, we should aggregate the data frame by year column

In [20]:
# Group the data frame by Year column and sum the CO2 emission
total_year = df_merged.groupby("Year")["Annual CO₂ emissions (tonnes )"].sum()

# Create a data frame from the resulting series
df_total_year = pd.DataFrame(total_year)

# Resulting data frame will have Year column as index
# We should reset the index to convert it into columns
df_total_year.reset_index(level=0, inplace=True)

df_total_year.head()

Unnamed: 0,Year,Annual CO₂ emissions (tonnes )
0,1751-01-01,9350528.0
1,1752-01-01,9354192.0
2,1753-01-01,9354192.0
3,1754-01-01,9357856.0
4,1755-01-01,9361520.0


#### Now we can plot the area chart to see the total change in CO2 emission in the world from the year 1750

In [21]:
# Plot the area chart
fig = px.area(df_total_year,
              x="Year",
              y="Annual CO₂ emissions (tonnes )",
              hover_name = 'Year',
              hover_data=['Year','Annual CO₂ emissions (tonnes )'],
              #color='Entity',
              labels={'Year':'Year','Annual CO₂ emissions (tonnes )':'CO₂ Emission'},
              height=600)

# Update the title and adjust its location
fig.update_layout(title="Change in CO₂ Emission Between Years 1750 and 2017",
                  title_x=0.50)

# Remove the legend
fig.update_layout(showlegend = False)

# Make background transparent
# fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)', 'paper_bgcolor': 'rgba(0, 0, 0, 0)'})

# Show color scale axis
fig.update(layout_coloraxis_showscale = True)

# Show the figure
fig.show()

The sharp increase in CO2 emissions started around 1950, and emissions have continued to rise almost every year since, apart from a few exceptions such as the years 1981, 1992 and 2009.
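
Years in which the total fell can be picked out programmatically rather than by eye. The sketch below uses made-up numbers in a toy frame; on the real data the same idea would apply to `df_total_year`.

```python
import pandas as pd

# Made-up yearly totals for illustration only.
toy = pd.DataFrame({
    "Year": [2006, 2007, 2008, 2009, 2010],
    "Total": [100.0, 103.0, 105.0, 104.0, 109.0],
})

# Year-over-year change; the first row has no predecessor, so it is NaN.
toy["Change"] = toy["Total"].diff()

# Years in which the total fell compared with the previous year.
decline_years = toy.loc[toy["Change"] < 0, "Year"].tolist()
```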

### 3.2 Total CO2 Emission and Its Distribution

We have looked at how CO2 emissions changed over time since the mid-18th century. Now let's see where we ended up in terms of countries and regions, and how much each contributed to the total CO2 emitted over more than 250 years.

#### Since our final data frame contains yearly CO2 emission trend for all countries, we should group the data frame by the countries to have the total CO2 emission

In [22]:
# Group the data frame by Code and Entity columns and sum the CO2 emission
total = df_merged.groupby(["Code","Entity"])["Annual CO₂ emissions (tonnes )"].sum()

# Create a data frame from the resulting series
df_total = pd.DataFrame(total)

# Resulting data frame will have 2 index columns: Code and Entity
# We should reset the index to convert them into columns
df_total.reset_index(level=0, inplace=True)
df_total.reset_index(level=0, inplace=True)

df_total.head()

Unnamed: 0,Entity,Code,Annual CO₂ emissions (tonnes )
0,Aruba,ABW,74231830.0
1,Afghanistan,AFG,178502900.0
2,Angola,AGO,623762300.0
3,Anguilla,AIA,3040078.0
4,Albania,ALB,277278200.0


#### Now we have an aggregated data frame and we can visualise the total CO2 emission values for all countries

In [23]:
# Plot the choropleth map figure
fig = px.choropleth(df_total,
                    locations="Code", 
                    locationmode='ISO-3',
                    color="Annual CO₂ emissions (tonnes )", 
                    hover_name="Entity", 
                    hover_data=['Entity','Annual CO₂ emissions (tonnes )'],
                    color_continuous_scale="Peach",
                    labels={'Entity':'Country','Annual CO₂ emissions (tonnes )':'Total CO₂ Emission'})

# Update the title and adjust its location
fig.update_layout(title="Total CO₂ Emission Between Years 1750 and 2017 - Countries",
                  title_x=0.47)

# Make background transparent
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)', 'paper_bgcolor': 'rgba(0, 0, 0, 0)'})

# Show color scale axis
fig.update(layout_coloraxis_showscale=True)

# Show the figure
fig.show()

#### Not a surprise!
Just as expected, the most powerful countries in the world are shining orange: they have emitted most of the world's CO2 so far. As we can observe from the map, these countries are the US, China, Russia, Germany and the UK. Let's have a closer look and see a crystal-clear ranking.

#### Finally, have a look at the animated choropleth map to see how countries' CO2 emissions have changed through the years
For this, we will first create a data frame that contains cumulative CO2 emission values for every country.
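
The running total per country is a grouped cumulative sum. Here is a minimal sketch on toy data with hypothetical column names, showing why the rows are sorted by country and year first.

```python
import pandas as pd

# Toy rows, deliberately out of order.
toy = pd.DataFrame({
    "Entity": ["B", "A", "A", "B", "A"],
    "Year": [1950, 1950, 1951, 1951, 1952],
    "Annual": [5.0, 1.0, 2.0, 6.0, 3.0],
})

# Sort so that each country's rows are in time order before accumulating.
toy = toy.sort_values(["Entity", "Year"])

# cumsum restarts for every Entity group.
toy["Cumulative"] = toy.groupby("Entity")["Annual"].cumsum()
```

Country A accumulates 1, 3, 6 and country B accumulates 5, 11, independently of each other.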

In [24]:
# Sort the data frame by country and year so that the running total accumulates in time order
df_merged_y = df_merged.sort_values(['Entity','Year'], ascending = [True, True])

# Create an integer Year Only column to use as the animation frame
# (.year already returns integers, so no further type conversion is needed)
df_merged_y['Year Only'] = pd.DatetimeIndex(df_merged_y['Year']).year

# Create a cumulative column for CO2 emission with cumsum()
df_merged_y['Cumulative CO2 Emission'] = df_merged_y.groupby('Entity')['Annual CO₂ emissions (tonnes )'].cumsum()

# Sort the data frame again for the animated map plot
df_merged_ys = df_merged_y.sort_values('Year', ascending = True)

# Plot the animated choropleth map figure
fig = px.choropleth(df_merged_ys,
                    locations="Code", 
                    locationmode='ISO-3',
                    color="Cumulative CO2 Emission", 
                    hover_name="Entity", 
                    hover_data=['Cumulative CO2 Emission'],
                    color_continuous_scale="Peach",
                    animation_frame="Year Only"
                   )

# Update the title and adjust its location
fig.update_layout(title="Change in CO₂ Emission Between Years 1750 and 2017 - Countries",
                  title_x=0.5)

# Show the figure
fig.show()

#### Put the same data into a bar chart to see top 20 countries

In [25]:
# Sort data frame by total CO2 emission
df_total_sorted = df_total.sort_values('Annual CO₂ emissions (tonnes )', ascending = False)

# Select first 20 rows (the end of an iloc slice is exclusive)
df_total_sorted = df_total_sorted.iloc[0:20,:]

# Plot the bar figure
fig = px.bar(df_total_sorted,
              x = 'Entity',
              y = 'Annual CO₂ emissions (tonnes )',
              color='Annual CO₂ emissions (tonnes )',
              hover_name = 'Entity',
              hover_data = ['Annual CO₂ emissions (tonnes )'],
              color_continuous_scale = 'Peach',
              labels={'Entity':'Country','Annual CO₂ emissions (tonnes )':'Total CO₂ Emission'},
              height=500)

# Adjust text label size & angle and the title
fig.update_layout(uniformtext_minsize = 15,
                  xaxis_tickangle = -45,
                  title = 'Total CO₂ Emission Between Years 1750 and 2017 - Top 20 Countries',
                  title_x = 0.5)

# Make background transparent
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)', 'paper_bgcolor': 'rgba(0, 0, 0, 0)'})

# Hide color scale axis
fig.update(layout_coloraxis_showscale=False)

# Show the figure
fig.show()
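
Sorting and slicing can be combined with `nlargest`, which takes the number of rows directly and so avoids off-by-one mistakes with slice ends. A toy sketch with hypothetical names follows.

```python
import pandas as pd

# Toy totals for five hypothetical countries.
toy = pd.DataFrame({
    "Entity": list("ABCDE"),
    "Total": [3.0, 9.0, 1.0, 7.0, 5.0],
})

# Top 3 rows by Total, already in descending order.
top3 = toy.nlargest(3, "Total")

# Equivalent with sort + slice: iloc[0:3] returns 3 rows because the end is exclusive.
sliced = toy.sort_values("Total", ascending=False).iloc[0:3]
```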

#### Guessed it!
I honestly wrote the countries above before plotting this one, and it looks like I have eagle eyes. The top 5 comprises the US, China, Russia, Germany and the UK (in exactly the same order).

#### In the top 20, there are countries from every part of the world. Let's have a look at the Region distribution to understand which region has emitted the most CO2 so far.

In [26]:
# Group the data frame by Region and Entity columns and sum the CO2 emission
total_r = df_merged.groupby(['Region','Entity'])["Annual CO₂ emissions (tonnes )"].sum()

# Create a data frame from the resulting series
df_total_r = pd.DataFrame(total_r)

# Sort the dataframe
df_total_r = df_total_r.sort_values('Annual CO₂ emissions (tonnes )', ascending = False)

# Resulting data frame will have 2 index columns: Entity and Region
# We should reset the index to convert them into columns
df_total_r.reset_index(level=0, inplace=True)
df_total_r.reset_index(level=0, inplace=True)

df_total_r.head(5)

Unnamed: 0,Entity,Region,Annual CO₂ emissions (tonnes )
0,United States,Americas,399378300000.0
1,China,Asia,200136500000.0
2,Russia,Europe,100589100000.0
3,Germany,Europe,90565630000.0
4,United Kingdom,Europe,77071060000.0


#### Plot the same bar chart for regions with the breakdown of countries

In [27]:
# Plot the bar chart
fig = px.bar(df_total_r,
              x = 'Region',
              y = 'Annual CO₂ emissions (tonnes )',
              color='Annual CO₂ emissions (tonnes )',
              hover_name = 'Entity',
              hover_data = ['Annual CO₂ emissions (tonnes )'],
              color_continuous_scale = 'Peach',
              labels={'Entity':'Country','Annual CO₂ emissions (tonnes )':'Total CO₂ Emission'},
              height=500)

# Adjust text label size & angle and the title
fig.update_layout(uniformtext_minsize = 15,
                  xaxis_tickangle = -45,
                  title = 'Total CO₂ Emission Between Years 1750 and 2017 - Regions',
                  title_x = 0.5)

# Make background transparent
fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)',
                   'paper_bgcolor': 'rgba(0, 0, 0, 0)'})

# Show color scale axis
fig.update(layout_coloraxis_showscale = True)

# Show the figure
fig.show()

#### A better view for comparison
Although it is obvious that the US has the highest share of CO2 emissions, Europe's total CO2 emission is, surprisingly, slightly higher than the total for the countries in the Americas region.

#### What about Asia?
In terms of distribution, Asia is somewhere between the Americas and Europe. The superpowers of Asia (China, Japan and India) generate more than half of the region's CO2 emissions, which rank third after those of Europe and the Americas.
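
Statements about regional ranking and a country's share within its region can be checked numerically. The sketch below uses made-up totals and hypothetical entity names; on the real data the same operations would apply to `df_total_r`.

```python
import pandas as pd

# Made-up totals for illustration only.
toy = pd.DataFrame({
    "Region": ["Asia", "Asia", "Asia", "Europe", "Europe"],
    "Entity": ["A1", "A2", "A3", "E1", "E2"],
    "Total": [60.0, 30.0, 10.0, 50.0, 30.0],
})

# Total per region, largest first.
region_totals = toy.groupby("Region")["Total"].sum().sort_values(ascending=False)

# Each country's share of its own region's total.
toy["Share of region"] = toy["Total"] / toy.groupby("Region")["Total"].transform("sum")
```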

## 4. Relationship Between CO2 Emission and Other Country Info

So far we only played with CO2 emission data, saw its trend in time and distribution among different countries and regions.

### A new dataset for country information 
Now we will introduce a new dataset containing other useful information about countries such as population, density, land area etc.

In [29]:
# Read the dataset into a data frame
df_pop = pd.read_csv('population_by_country_2020.csv',
                     usecols=["Country (or dependency)",
                              "Population (2020)",
                              "Density (P/Km²)",
                              "Land Area (Km²)"])

# Rename the column of data frame
df_pop.rename(columns={'Country (or dependency)':'Entity'}, inplace=True)

# Have a look at the data frame
df_pop.head()

Unnamed: 0,Entity,Population (2020),Density (P/Km²),Land Area (Km²)
0,China,1440297825,153,9388211
1,India,1382345085,464,2973190
2,United States,331341050,36,9147420
3,Indonesia,274021604,151,1811570
4,Pakistan,221612785,287,770880


In [30]:
df_pop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235 entries, 0 to 234
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Entity             235 non-null    object
 1   Population (2020)  235 non-null    int64 
 2   Density (P/Km²)    235 non-null    int64 
 3   Land Area (Km²)    235 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 7.5+ KB


### Merge the population data frame to the main data frame we used in Part 3.2

Now we will merge the aggregated data frame from Part 3.2, df_total_r, with the population data frame. The result is the data frame we are going to use to see correlations between different features.

In [31]:
df_all = pd.merge(df_total_r, df_pop, how='left', on='Entity')

df_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 223 entries, 0 to 222
Data columns (total 6 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Entity                          223 non-null    object 
 1   Region                          223 non-null    object 
 2   Annual CO₂ emissions (tonnes )  223 non-null    float64
 3   Population (2020)               200 non-null    float64
 4   Density (P/Km²)                 200 non-null    float64
 5   Land Area (Km²)                 200 non-null    float64
dtypes: float64(4), object(2)
memory usage: 12.2+ KB


### Investigate which 23 countries have nulls in the newly merged columns

In [32]:
df_co_na = df_all[df_all['Population (2020)'].isnull()]

df_co_na['Entity'].unique()

array(['Czechoslovakia', 'Czech Republic', 'Macedonia', 'Curacao',
       "Cote d'Ivoire", 'Democratic Republic of Republic of the Congo',
       'Reunion', 'Republic of the Congo', 'Palestine',
       'Sint Maarten (Dutch part)', 'Swaziland',
       'Bonaire Sint Eustatius and Saba', 'Cape Verde',
       'Saint Vincent and the Grenadines', 'Saint Kitts and Nevis',
       'Timor', 'Saint Pierre and Miquelon', 'Micronesia (country)',
       'Turks and Caicos Islands', 'Sao Tome and Principe',
       'Christmas Island', 'Wallis and Futuna Islands', 'Kyrgysztan'],
      dtype=object)

### We should replace the name of the countries not matching in two data frames

For this, we will have a look at all the countries in the population data frame and try to find the countries above manually (Ctrl+F should help in that step).
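
As a complement to searching by hand, the standard library's `difflib` can suggest likely candidates for names that differ only slightly. This is a sketch rather than the method used in this notebook: the name lists are small samples, and the 0.6 cutoff is an assumption that may need tuning. Names that changed completely will not be found this way.

```python
import difflib

# Small samples of names as spelled in the CO2 data and in the population data.
co2_names = ["Cote d'Ivoire", "Cape Verde", "Kyrgysztan", "Christmas Island"]
pop_names = ["Côte d'Ivoire", "Cabo Verde", "Kyrgyzstan", "China", "India"]

# For each CO2-side name, keep the single closest population-side name, if any
# candidate reaches the similarity cutoff.
suggestions = {
    name: difflib.get_close_matches(name, pop_names, n=1, cutoff=0.6)
    for name in co2_names
}
```

An empty list means no candidate was similar enough, so that name still needs a manual lookup.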

In [33]:
df_pop['Entity'].unique()

array(['China', 'India', 'United States', 'Indonesia', 'Pakistan',
       'Brazil', 'Nigeria', 'Bangladesh', 'Russia', 'Mexico', 'Japan',
       'Ethiopia', 'Philippines', 'Egypt', 'Vietnam', 'DR Congo',
       'Turkey', 'Iran', 'Germany', 'Thailand', 'United Kingdom',
       'France', 'Italy', 'Tanzania', 'South Africa', 'Myanmar', 'Kenya',
       'South Korea', 'Colombia', 'Spain', 'Uganda', 'Argentina',
       'Algeria', 'Sudan', 'Ukraine', 'Iraq', 'Afghanistan', 'Poland',
       'Canada', 'Morocco', 'Saudi Arabia', 'Uzbekistan', 'Peru',
       'Angola', 'Malaysia', 'Mozambique', 'Ghana', 'Yemen', 'Nepal',
       'Venezuela', 'Madagascar', 'Cameroon', "Côte d'Ivoire",
       'North Korea', 'Australia', 'Niger', 'Taiwan', 'Sri Lanka',
       'Burkina Faso', 'Mali', 'Romania', 'Malawi', 'Chile', 'Kazakhstan',
       'Zambia', 'Guatemala', 'Ecuador', 'Syria', 'Netherlands',
       'Senegal', 'Cambodia', 'Chad', 'Somalia', 'Zimbabwe', 'Guinea',
       'Rwanda', 'Benin', 'Burundi', 'Tuni

The 'matches' dictionary below maps each unmatched country name in the population data frame to its equivalent in the CO2 data frame.

### Use replace method to correct the country names

We will replace the unmatched country names in the population data frame and perform the merge again.

In [34]:
# Define the dictionary
matches = {"Côte d'Ivoire":"Cote d'Ivoire",
           "DR Congo":'Democratic Republic of Republic of the Congo',
           "Congo":'Republic of the Congo', 
           "Cabo Verde":'Cape Verde', 
           "Curaçao":'Curacao', 
           "Czech Republic (Czechia)":'Czech Republic', 
           "Micronesia":'Micronesia (country)',
           "Kyrgyzstan":'Kyrgysztan', 
           "Saint Kitts & Nevis":'Saint Kitts and Nevis', 
           "North Macedonia":'Macedonia',
           "State of Palestine":'Palestine', 
           "Réunion":'Reunion',
           "Saint Pierre & Miquelon":'Saint Pierre and Miquelon', 
           "Sao Tome & Principe":'Sao Tome and Principe', 
           "Sint Maarten":'Sint Maarten (Dutch part)', 
           "Turks and Caicos":'Turks and Caicos Islands',
           "Timor-Leste":'Timor',
           "St. Vincent & Grenadines":'Saint Vincent and the Grenadines', 
           "Wallis & Futuna":'Wallis and Futuna Islands'
          } 


# Replace the country names in the population data frame
df_pop.replace({"Entity": matches}, inplace=True)

# Perform the merge again
df_all = pd.merge(df_total_r, df_pop, how='left', on='Entity')

# We could not find population information for 4 countries, and Czechoslovakia is
# already covered by the Czech Republic, so we can drop these 5 rows
df_all.dropna(inplace=True)

# Check if they are matched
df_all.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 218 entries, 0 to 222
Data columns (total 6 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Entity                          218 non-null    object 
 1   Region                          218 non-null    object 
 2   Annual CO₂ emissions (tonnes )  218 non-null    float64
 3   Population (2020)               218 non-null    float64
 4   Density (P/Km²)                 218 non-null    float64
 5   Land Area (Km²)                 218 non-null    float64
dtypes: float64(4), object(2)
memory usage: 11.9+ KB


In [35]:
df_all.head()

Unnamed: 0,Entity,Region,Annual CO₂ emissions (tonnes ),Population (2020),Density (P/Km²),Land Area (Km²)
0,United States,Americas,399378300000.0,331341000.0,36.0,9147420.0
1,China,Asia,200136500000.0,1440298000.0,153.0,9388211.0
2,Russia,Europe,100589100000.0,145945500.0,9.0,16376870.0
3,Germany,Europe,90565630000.0,83830970.0,240.0,348560.0
4,United Kingdom,Europe,77071060000.0,67948280.0,281.0,241930.0


### Visualising the correlation of CO2 emission and other country features

Let's see how the total CO2 emission of a country correlates with its population.

In [36]:
# Create a Logarithmic Scale for Density column so that we can have a smoother transition in colors
df_all['Log Scale'] = df_all['Density (P/Km²)'].apply(lambda x : math.log2(x+1))

# Plot the scatter chart
fig = px.scatter(df_all, 
                 x="Annual CO₂ emissions (tonnes )", 
                 y="Population (2020)", 
                 size="Land Area (Km²)", 
                 color="Log Scale",
                 color_continuous_scale="Temps",
                 hover_name="Entity", 
                 hover_data=['Entity','Annual CO₂ emissions (tonnes )'],
                 labels={'Entity':'Country',
                         'Annual CO₂ emissions (tonnes )':'CO₂ Emission',
                         'Population (2020)':'Population',
                         'Log Scale':'Density'},
                 log_x=True,
                 log_y=True, 
                 size_max=40)

# Update the title and adjust its location
fig.update_layout(title="Population v. CO2 Emission, 2020", 
                  title_x=0.5)

# Show the figure
fig.show()

### A multi-functional bubble chart

In this bubble chart, the two axes hold the two features whose correlation we want to see. The direct correlation between the population and the CO2 emission of the countries can be clearly seen: as population increases, CO2 emission increases as well.
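
The visual impression can be backed by a number. Since both axes of the chart are logarithmic, it makes sense to correlate the log-transformed values. The figures below are made up for illustration and are not taken from the datasets; on the real data the same calls would apply to the corresponding columns of `df_all`.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Population": [1e6, 1e7, 5e7, 1e8, 1e9],
    "Emissions": [2e7, 1e9, 3e9, 2e10, 1e11],
})

# Pearson correlation of the log10 values, matching the log-log axes.
log_toy = np.log10(toy)
pearson_log = log_toy["Population"].corr(log_toy["Emissions"])

# Spearman rank correlation, computed as the Pearson correlation of the ranks;
# it is unaffected by monotonic transforms such as the logarithm.
spearman = toy["Population"].rank().corr(toy["Emissions"].rank())
```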

#### Use all functions of the bubble chart
Another dimension that can be easily observed from the bubble chart is the size of the bubbles, which represents the land area of every country. It is not easy to say how it correlates to CO2 emission because some small countries, such as Japan, Germany and the UK, are among the top emitters.

#### It is colorful as well!
Color functionality allows us to see another dimension in the same chart: the density of every country. Again, not to see the correlation, but just to observe the country density along with all other features in just one visual.
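
The colour axis uses `log2(density + 1)` rather than the raw density. A short sketch with illustrative density values shows the effect: the `+ 1` keeps a density of 0 defined, and the logarithm compresses very large values so that they do not dominate the colour scale.

```python
import math

# Illustrative densities in people per km²; not taken from the dataset.
densities = [0, 1, 3, 153, 20000]

# Same transform as the Log Scale column: log2(x + 1).
log_scale = [math.log2(d + 1) for d in densities]
```

A range spanning 0 to 20,000 is squeezed into roughly 0 to 14, which gives a smoother colour transition.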

## 5. Conclusion

It is clear that the amount of CO2 released into the atmosphere has increased significantly over the last 70 years, causing a variety of unwanted consequences all over the world, from natural disasters such as fires and tornadoes to glacial melt.

### It's never too late!
Some say that we have hit the point of no return for global warming, yet we know that mother nature can heal anything. If we act together now, before losing all the forests, icebergs and the countless species living on earth, we can still heal our world.

## Before closing...

### If you've made it this far in the notebook, first, congratulations! Second, please let me know your thoughts!

- LinkedIn: https://www.linkedin.com/in/emrecanokten/
- Kaggle: https://www.kaggle.com/emrecano

### And do not forget to upvote the notebook or like the post (whichever is applicable)!

# Thanks a lot!