<a href="https://www.kaggle.com/code/ifeanyichukwunwobodo/exploratory-analysis-of-co2-emission-using-plotly?scriptVersionId=135061634" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Introduction

Carbon dioxide emissions are the primary driver of global climate change. It’s widely recognised that the world needs to reduce emissions urgently to avoid the worst impacts of climate change. How this responsibility should be shared between regions, countries, and individuals, however, has been an endless point of contention in international discussions. This project investigates the distribution of CO2 emissions around the world.

            

# Data Assessment & Cleaning

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.templates.default = "plotly_dark"

### About the Data

The data consists of various emission indicators for countries around the world from 1850 to 2021. This analysis focuses on CO2 emissions between 2000 and 2021.

In [2]:
#Let's import the dataset
df = pd.read_csv('/kaggle/input/2022-complete-co2-emissions/owid-co2-data.csv', low_memory=False)
#Let's have a look at some of the rows of the dataset
df.head()

Unnamed: 0,country,year,iso_code,population,gdp,cement_co2,cement_co2_per_capita,co2,co2_growth_abs,co2_growth_prct,...,share_global_cumulative_other_co2,share_global_flaring_co2,share_global_gas_co2,share_global_luc_co2,share_global_oil_co2,share_global_other_co2,total_ghg,total_ghg_excluding_lucf,trade_co2,trade_co2_share
0,Afghanistan,1850,AFG,3752993.0,,,,,,,...,,,,0.121,,,,,,
1,Afghanistan,1851,AFG,3769828.0,,,,,,,...,,,,0.118,,,,,,
2,Afghanistan,1852,AFG,3787706.0,,,,,,,...,,,,0.116,,,,,,
3,Afghanistan,1853,AFG,3806634.0,,,,,,,...,,,,0.115,,,,,,
4,Afghanistan,1854,AFG,3825655.0,,,,,,,...,,,,0.114,,,,,,


In [3]:
# Let's look at the shape of the data
print('The data has', df.shape[0], 'rows and', df.shape[1], 'columns')

The data has 46523 rows and 74 columns


The dataset contains 46523 rows and 74 columns. Not all columns are needed for the analysis.
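Since only a handful of columns are needed, one option is to load just those columns in the first place. The sketch below is not what this notebook does; it uses a tiny in-memory CSV with made-up values (the `Aland` rows and `csv_text` are hypothetical) to show how the `usecols` argument of `pd.read_csv` limits what is parsed.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV file (hypothetical values, not taken from the dataset)
csv_text = io.StringIO(
    "country,year,iso_code,population,gdp,co2,co2_per_capita,cement_co2\n"
    "Aland,2000,ALA,1000,5000.0,1.5,1.5,0.1\n"
    "Aland,2001,ALA,1100,5500.0,1.6,1.45,0.1\n"
)

wanted = ['year', 'iso_code', 'country', 'population', 'gdp', 'co2', 'co2_per_capita']

# usecols tells read_csv to parse only the listed columns
small = pd.read_csv(csv_text, usecols=wanted)
print(small.shape)
print(small.columns.tolist())
```

The resulting frame keeps the column order of the file rather than the order of the `wanted` list, so reindex with `small[wanted]` if the order matters.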

### Data Cleaning

The dataset has more columns than this analysis needs. A few steps have to be taken before descriptive statistics and exploratory data analysis:
1) Select the necessary columns.

2) Deal with missing values.

3) Select the relevant time frame.

4) Create a new column for GDP per capita.

5) Create a separate dataset for the continents in the dataset.

In [4]:
# Keep the important columns
important_col = ['year', 'iso_code', 'country', 'population', 'gdp', 'co2', 'co2_per_capita']
df = df[important_col]

# Select the years within the 21st century; .copy() avoids a SettingWithCopyWarning below
df = df[df['year'] >= 2000].copy()

# Create a new gdp_per_capita column
df['gdp_per_capita'] = df['gdp'] / df['population']

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5720 entries, 150 to 46522
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   year            5720 non-null   int64  
 1   iso_code        5068 non-null   object 
 2   country         5720 non-null   object 
 3   population      5286 non-null   float64
 4   gdp             3140 non-null   float64
 5   co2             5454 non-null   float64
 6   co2_per_capita  5182 non-null   float64
 7   gdp_per_capita  3140 non-null   float64
dtypes: float64(5), int64(1), object(2)
memory usage: 402.2+ KB


In [5]:
# Use series.astype() method to convert integers to datetime format
#df['year'] = pd.to_datetime(df['year'].astype(str))
print(df.dtypes)


year                int64
iso_code           object
country            object
population        float64
gdp               float64
co2               float64
co2_per_capita    float64
gdp_per_capita    float64
dtype: object


`iso_code` and `country` are of object datatype. The other columns are floats, except the `year` column, which is an integer. Although a datetime type would usually suit a year column, the integer type is kept because it is more efficient for this analysis than the datetime format.
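If a datetime column is ever needed (for example, for date-based resampling), the conversion can be sketched as below. This is an illustration on a small made-up Series named `years`, not a step this notebook performs.

```python
import pandas as pd

# Illustrative year values standing in for df['year']
years = pd.Series([2000, 2001, 2021], name='year')

# format='%Y' makes pandas read each value as a calendar year (1 January of that year)
as_dates = pd.to_datetime(years.astype(str), format='%Y')

print(as_dates.dtype)
print(as_dates.dt.year.tolist())
```

The integer year can always be recovered with `.dt.year`, so the conversion loses nothing.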

In [6]:
fig=px.bar(df.nunique().sort_values(ascending=False), text_auto=True,color_discrete_sequence = ['#03DAC5'], title='Unique Values in the Important Columns of the Dataset')
fig.update_layout(showlegend=False)

The analysis covers 22 years (2000-2021) and 231 countries (`iso_code`). There is a discrepancy between the number of unique values in `country` and in `iso_code`, which shows that some entries in `country` do not have an `iso_code`. This will be explored later.
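One quick way to see which entries lack a code is to filter on missing `iso_code` values. The sketch below uses a small toy frame with hypothetical rows (`toy` and `no_code` are illustrative names); the same expression should work on `df`, which has the same two columns.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical rows; the real df has the same two columns
toy = pd.DataFrame({
    'country': ['Africa', 'Africa', 'Nigeria', 'Ghana', 'Asia'],
    'iso_code': [np.nan, np.nan, 'NGA', 'GHA', np.nan],
})

# Names of the entries whose iso_code is missing
no_code = sorted(toy.loc[toy['iso_code'].isna(), 'country'].unique())
print(no_code)
```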

## Missing Values

In [7]:
df.isnull().sum()

year                 0
iso_code           652
country              0
population         434
gdp               2580
co2                266
co2_per_capita     538
gdp_per_capita    2580
dtype: int64

The null values in the `iso_code` column look like data entry issues at this point, and it does not make sense to fill them using any of the popular techniques for dealing with missing values. The missing values in the numerical columns are filled by interpolation, as this retains the time-series character of the dataset.

In [8]:
df = df.interpolate()

In [9]:
df.isnull().sum()

year                0
iso_code          652
country             0
population          0
gdp                 0
co2                 0
co2_per_capita      0
gdp_per_capita      0
dtype: int64
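One caveat about the interpolation step above: `interpolate()` called on the whole frame works down the rows without regard to country, so a gap at the edge of one country's block can be filled using a neighbouring country's values. A possible alternative, not used in this notebook, is to interpolate within each country. The sketch below compares the two on a toy frame with made-up numbers (`toy`, `whole` and `per_country` are illustrative names).

```python
import numpy as np
import pandas as pd

# Two hypothetical countries stacked one after the other, as in the real data
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year':    [2000, 2001, 2002, 2000, 2001, 2002],
    'co2':     [1.0, np.nan, 3.0, np.nan, 200.0, np.nan],
})

# Frame-wide interpolation: B's missing 2000 value is filled from A's 2002 value
whole = toy['co2'].interpolate()

# Per-country interpolation: gaps are filled only from the same country's values
per_country = toy.groupby('country')['co2'].transform(lambda s: s.interpolate())

print(pd.DataFrame({'whole': whole, 'per_country': per_country}))
```

With the per-country version, a leading gap (country B in 2000) stays missing instead of borrowing from the previous country, so some nulls would remain and need a separate decision.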

Continents are listed as countries in the dataset. A separate dataframe is created for the continents. Let's have a look.

In [10]:
# Separate the continents
continent = ['Europe', 'Africa', 'North America', 'South America', 'Antartica', 'Australia','Asia']
continents= df.loc[df.country.isin(continent)]
continents

Unnamed: 0,year,iso_code,country,population,gdp,co2,co2_per_capita,gdp_per_capita
422,2000,,Africa,818952374.0,6.460179e+10,886.403,1.083,2010.420871
423,2001,,Africa,839464127.0,6.385349e+10,884.000,1.053,2052.404072
424,2002,,Africa,860611762.0,6.310519e+10,892.559,1.037,2094.387273
425,2003,,Africa,882349569.0,6.235690e+10,967.170,1.097,2136.370474
426,2004,,Africa,904781595.0,6.160860e+10,1036.637,1.146,2178.353675
...,...,...,...,...,...,...,...,...
38912,2017,,South America,420982650.0,8.539551e+11,1125.113,2.673,16761.945689
38913,2018,,South America,424740741.0,8.625590e+11,1064.608,2.507,17000.998852
38914,2019,,South America,428318218.0,8.711630e+11,1073.178,2.506,17240.052016
38915,2020,,South America,431530105.0,8.797669e+11,981.190,2.274,17479.105180


In [11]:
continents.country.unique()

array(['Africa', 'Asia', 'Australia', 'Europe', 'North America',
       'South America'], dtype=object)

The rows with a missing `iso_code` include aggregate entries such as continents rather than individual countries.

### Descriptive statistics on the data

The original dataframe can now be restricted to individual countries (entries with an `iso_code`).

In [12]:
print('Descriptive Statistics of the Numerical Columns for the Country DataFrame')
#Remove the continents 
df = df[df['iso_code'].notnull()]
df.describe().T #Describe the Numerical Columns

Descriptive Statistics of the Numerical Columns for the Country DataFrame


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
year,5068.0,2010.489,6.346647,2000.0,2005.0,2010.0,2016.0,2021.0
population,5068.0,32085790.0,128059900.0,1833.0,451206.5,5291736.0,19844570.0,1425894000.0
gdp,5068.0,810447500000.0,3681570000000.0,312853600.0,26040070000.0,101387800000.0,355623300000.0,58633220000000.0
co2,5068.0,226.5908,1244.725,0.004,1.05975,7.917,57.12575,24346.94
co2_per_capita,5068.0,5.148844,6.495849,0.015,0.86175,3.1085,7.1085,62.259
gdp_per_capita,5068.0,15968.04,16550.61,438.6102,4632.471,10829.0,22074.5,166150.5


In [13]:
df.query('co2 == @df.co2.max()')

Unnamed: 0,year,iso_code,country,population,gdp,co2,co2_per_capita,gdp_per_capita
45734,2021,ESH,Western Sahara,565590.0,58633220000000.0,24346.944957,4.063783,9678.517079


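The maximum belongs to Western Sahara, whose territorial status is heavily disputed. Its figures are also implausible for a territory of about 566,000 people, which suggests they are an artifact of the interpolation step rather than reported values. The territory is therefore removed from the analysis.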

In [14]:
df = df.query('iso_code != "ESH"')

In [15]:
# Select the co2 column before averaging so the text columns are never aggregated
df.groupby('country')['co2'].mean().sort_values(ascending=False)[:5]

country
China                           8059.320636
United States Virgin Islands    6383.075500
United States                   5614.913591
India                           1780.094727
Russia                          1613.785591
Name: co2, dtype: float64

In [16]:
continents.describe().T #let's look at our continents

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
year,132.0,2010.5,6.368458,2000.0,2005.0,2010.5,2016.0,2021.0
population,132.0,1169813000.0,1421360000.0,19017970.0,396121000.0,662484700.0,1062094000.0,4693332000.0
gdp,132.0,469252500000.0,330378500000.0,48887570000.0,194052500000.0,405671400000.0,759355000000.0,1238560000000.0
co2,132.0,5246.301,5721.576,349.635,960.5732,3239.22,6766.172,21688.99
co2_per_capita,132.0,7.653258,6.021409,1.019,2.49825,5.664,12.37625,19.213
gdp_per_capita,132.0,15025.94,14834.25,2010.421,3211.475,8955.622,21447.95,49583.58


# Exploratory Data Analysis

## What is the Distribution and Trend of CO2 Emission Around the World?

In [17]:
co2_map = px.choropleth(df, locations='iso_code', #determines the points on the map
                        color='co2', 
                        hover_name='country', #label to be hovered (tooltip)
                        title= 'Co2 Emission Around the World', #Map heading
                        color_continuous_scale='RdYlGn_r', #color scale
                        projection='natural earth')#projection
co2_map

CO2 emission is not distributed evenly around the world. China has the highest CO2 emission level, and the United States is the second-highest emitter. Russia and India also have high emission levels, but theirs are relatively small compared to those of China and the United States. Africa and South America have mild levels of emission.
Has this always been the case?

In [18]:
line = px.line(continents, 'year', 'co2', color='country', log_y=True,
               title='What is the Trend of Co2 Emission in Each Continent Over the Years?')
line.update_yaxes(showgrid=False, showline=True)
line.update_xaxes(showgrid=False, showline=True)
line.update_layout(margin=dict(t=100, b=0, l=70, r=40),
                   hovermode='closest',
                  xaxis_title='', yaxis_title='CO2 Emission',
                  legend = dict(orientation='h', yanchor='auto', y=1.02, xanchor='right', x=1, title='Continents'))



With the exception of Europe and North America, which show a slight downward trend, CO2 emissions in most continents have risen slightly, at a nearly constant rate. Asia stands out: its CO2 emissions have followed a clear upward trend since the start of the 21st century.
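A trend read off a line chart can be backed up with a simple number, such as the percentage change between the first and last year for each region. The sketch below shows the idea on a toy frame with made-up values (regions `X` and `Y` are hypothetical); the same steps should apply to the `continents` dataframe.

```python
import pandas as pd

# Hypothetical emissions for two regions, one rising and one falling
toy = pd.DataFrame({
    'country': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'year':    [2000, 2010, 2021] * 2,
    'co2':     [100.0, 150.0, 200.0, 80.0, 70.0, 60.0],
})

# Sort by year so first() and last() pick the earliest and latest observation
ordered = toy.sort_values('year')
first = ordered.groupby('country')['co2'].first()
last = ordered.groupby('country')['co2'].last()

# Percentage change between the first and last year for each region
pct_change = (last - first) / first * 100
print(pct_change.round(1).to_dict())
```

A positive value indicates an upward trend over the period and a negative value a downward one, although it says nothing about what happened in between.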

In [19]:
co2_map_anim = px.choropleth(df, locations='iso_code',
                             color='co2',
                             hover_name='country',  # country name shown in the tooltip
                             animation_frame='year',
                             animation_group='country',
                             title='Was China Always the Highest Emitter?',
                             projection='natural earth',
                             color_continuous_scale='RdYlGn_r')


co2_map_anim


In [20]:
# numeric_only=True skips the text columns, which newer pandas versions refuse to average
con = continents.groupby('country').mean(numeric_only=True).reset_index()

bar = px.bar(con, 'co2', 'country', log_x=True,
             title='Which Continent has the Highest Level of CO2 Emission?',
             color_discrete_sequence=['#03DAC5'])
bar.update_yaxes(showgrid=False, categoryorder='total ascending')
bar.update_xaxes(showgrid=False)
# CO2 is on the x-axis of this horizontal bar chart, so the label belongs there
bar.update_layout(margin=dict(t=100, b=0, l=70, r=40),
                  xaxis_title='CO2 Emission', yaxis_title='')


                                                      

Asia has the highest average emission.

In [21]:
colors = ['asia' if c == 'Asia' else 'not_asia' for c in con['country']]
# Compute Asia's share from the data so the annotation always matches the chart
asia_share = con.loc[con['country'] == 'Asia', 'co2'].iloc[0] / con['co2'].sum()

pie = px.pie(con, values='co2', names='country',
             title="Asia Emits More than Half of the World's CO2",
             hole=0.7,
             color=colors,
             category_orders={'country': ['North America', 'Europe', 'Africa', 'South America', 'Australia', 'Asia']},
             color_discrete_map={'asia': 'red', 'not_asia': '#03DAC5'})
pie.update_traces(textinfo='percent+label', textposition='outside', rotation=50, hovertemplate=None)
pie.update_layout(margin=dict(t=100, b=0, l=70, r=40), showlegend=False)
# Caption and headline figure placed in the hole of the donut
pie.add_annotation(x=0.5, y=0.3, align='center', xref='paper', yref='paper',
                   showarrow=False, text=' of Global CO2 is Emitted by Asia',
                   font=dict(family='Times New Roman', size=12))
pie.add_annotation(x=0.5, y=0.5, align='center', xref='paper', yref='paper',
                   showarrow=False, text=f' {asia_share:.1%} ', yanchor='auto',
                   font=dict(family='Times New Roman', size=100))

                                                      

Asia accounts for more than half of the world's CO2 emission. That is, 51.6% of global CO2 emission comes from Asia.

In [22]:
# numeric_only=True keeps the text columns out of the sum
country = df.groupby('country').sum(numeric_only=True).reset_index().sort_values('co2', ascending=False)
bar0 = px.bar(country.head(10), 'co2', 'country', log_x=True,
              text_auto=True,
              orientation='h', title='Which Countries Emit the Most CO2?',
              color_discrete_sequence=['#03DAC5'])
bar0.update_yaxes(showgrid=False, categoryorder='total ascending', showline=True)
bar0.update_xaxes(showgrid=False, showline=True, showticklabels=False, ticks='')
bar0.update_layout(margin=dict(t=100, b=0, l=70, r=40),
                  xaxis_title='', yaxis_title='')



China has the highest level of CO2 emission. This is in line with what we discovered in the first choropleth map. It is also in line with what we saw at the continental level, as six of the top ten countries are Asian.

In [23]:
re =px.scatter(df,  'co2', 
                'population',
                log_y=True, log_x=True,
                color_discrete_sequence = ['#03DAC5'],
                title='Population has a Positive Relationship with Co2 Emission')
re.update_traces(textposition='top right')
re.update_yaxes(showgrid=False, showline=True)
re.update_xaxes(showgrid=False, showline=True)
re.update_layout(margin=dict(t=100, b=0, l=70, r=40),
                   hovermode='closest')

Population is also seen as a determinant of CO2 emission, since more people means more demand for polluting products.

Let's explore population around the world over the years.

In [24]:
line1 = px.line(continents, 'year', 'population', color='country',
                title='Asia Has Had a Significantly Higher Population Over the Years',
                log_y=True)
line1.update_yaxes(showgrid=False, showline=True)
line1.update_xaxes(showgrid=False, showline=True)
# The y-axis shows population here, not CO2
line1.update_layout(margin=dict(t=100, b=0, l=70, r=40),
                    hovermode='closest',
                    xaxis_title='', yaxis_title='Population',
                    legend=dict(orientation='h', yanchor='auto', y=1.02, xanchor='right', x=1))



The line chart above shows that Asia has a much larger population than the other continents, so population has to be accounted for. Let's explore whether there truly is a relationship between population and CO2 emission.

In [25]:
rel0 =px.scatter(df,  'co2', 
                'population', 
                log_y=True, log_x=True,
                color_discrete_sequence = ['#03DAC5'],
               title='The Relationship Between CO2 Emission and Population')
rel0.update_traces(textposition='top right')
rel0.update_yaxes(showgrid=False, showline=True)
rel0.update_xaxes(showgrid=False, showline=True)
rel0.update_layout(margin=dict(t=100, b=0, l=70, r=40),
                   hovermode='closest',
                  xaxis_title='CO2 Emission', yaxis_title='Population'
                  )


The scatterplot shows a positive relationship between CO2 emission and population. From now on we focus on a measure that adjusts for population: CO2 per capita, that is, CO2 emission per unit of population.
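To make the per capita measure concrete, the sketch below recomputes it for one row. It assumes that `co2` is recorded in million tonnes and `co2_per_capita` in tonnes per person; that unit assumption is not stated in this notebook, but the Africa row for 2000 in the continents table is consistent with it.

```python
# Values for Africa in 2000, copied from the continents table above
co2_million_tonnes = 886.403
population = 818_952_374

# Assumed units: co2 in million tonnes, per capita result in tonnes per person
per_capita_tonnes = co2_million_tonnes * 1_000_000 / population
print(per_capita_tonnes)
```

The result is close to the 1.083 listed for `co2_per_capita` in that row; a small difference is expected if the published figure was computed from slightly different population estimates.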

In [26]:
line2 = px.line(continents, 'year','co2_per_capita', 
                color='country', 
               title='Co2 Emission Per Capita Over the Years')
line2.update_yaxes(showgrid=False, showline=True)
line2.update_xaxes(showgrid=False, showline=True)
line2.update_layout(margin=dict(t=100, b=0, l=70, r=40),
                   hovermode='closest', 
                  xaxis_title='', yaxis_title='Co2 Emission per Capita',
                  legend = dict(orientation='h', yanchor='bottom', y=1.02, xanchor='right', x=1, title='Continents'))

By this measure, CO2 per capita is higher in Australia than in any other continent in the comparison, and Asia becomes the continent with the third-lowest emission level.

In [27]:
bar0 =px.bar(country.head(10), 'co2_per_capita', 'country', log_x=True, orientation='h', 
             title='Which of the Top Ten Emitters Emits the Most CO2 per Capita?', 
             text_auto=True,
             color_discrete_sequence = ['#03DAC5'])
bar0.update_yaxes(showgrid=False, categoryorder='total ascending', showline=True)
bar0.update_xaxes(showgrid=False, showline=False, ticks='',showticklabels=False)
bar0.update_layout(margin=dict(t=100, b=0, l=70, r=40),
                  xaxis_title='', yaxis_title='')

Among the ten largest emitters, the United States has the highest CO2 emission per head, followed by Saudi Arabia and Canada.

In [28]:
rel =px.scatter(df,  'co2', 
                'gdp', 
                size= 'population',
                log_y=True, log_x=True,
                hover_name='country',
                color_discrete_sequence = ['#03DAC5'],
                title='What is the Relationship Between Co2 Emission and GDP?'
               )
rel.update_traces(textposition='top right')
rel.update_yaxes(showgrid=False, showline=True)
rel.update_xaxes(showgrid=False, showline=True)
rel.update_layout(margin=dict(t=100, b=0, l=70, r=40),
                   hovermode='closest',
                  xaxis_title='CO2 Emission', yaxis_title='GDP'
                 )

There is a positive relationship between GDP and CO2 emission. Looking at the graph, both CO2 emission and GDP increase as population increases: the more populated countries produce high levels of CO2 and have a high GDP.

#### Does the relationship remain the same at a per capita level?

In [29]:
rel =px.scatter(df,  'co2_per_capita', 
                'gdp_per_capita', 
                size= 'population',
                log_y=True, log_x=True,
                hover_name='country',
                color_discrete_sequence = ['#03DAC5'],
                title='What is the Relationship Between Co2 per Capita and GDP per Capita?')
rel.update_traces(textposition='top right')
rel.update_yaxes(showgrid=False, showline=True)
rel.update_xaxes(showgrid=False, showline=True)
rel.update_layout(margin=dict(t=100, b=0, l=70, r=40),
                   hovermode='closest',
                  xaxis_title='CO2 Emission Per Capita', yaxis_title='GDP Per Capita')

GDP still shows a positive relationship with CO2 on a per capita basis. The key difference is that, contrary to what was seen before, the richest countries (those with the highest GDP per capita) are high polluters with small populations. The more populated countries are middle-income countries that emit at a near-average level.

In [30]:
anim_rel = px.scatter(df, "co2_per_capita", 
                   y="gdp_per_capita",
                   animation_frame="year",
                   animation_group="country",
                   trendline='ols',
                   trendline_color_override='white',
                   size="population",
                   title='Does the relationship Between GDP Per Capita and CO2 per Capita change with time?',
                   hover_name='country',
                   color_discrete_sequence = ['#03DAC5'])
anim_rel.update_yaxes(showgrid=False)
anim_rel.update_xaxes(showgrid=False)
anim_rel.update_layout(margin=dict(t=100, b=0, l=70, r=40),
                   hovermode='closest',
                  xaxis_title='CO2 Emission Per Capita', yaxis_title='GDP Per Capita',
                  title_font = dict(size=25, color='#a5a7ab', family='Muli, sans-serif'),
                  font = dict(color='#8a8d93'),
                  legend = dict(orientation='v', yanchor='bottom', y=1.02, xanchor='right', x=1, title='Continents'))

The relationship between GDP per capita and CO2 per capita has remained positive throughout the 21st century so far. The countries with low CO2 per capita and low GDP per capita tend to have large populations.
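The claim that the relationship stays positive over time can also be checked numerically by computing a correlation for each year. The sketch below does this on a toy frame with made-up values and on a log scale; the log transform is a choice made for this illustration, not something taken from the animated chart above.

```python
import numpy as np
import pandas as pd

# Hypothetical per capita values for four countries in two different years
toy = pd.DataFrame({
    'year': [2000] * 4 + [2021] * 4,
    'co2_per_capita': [0.5, 2.0, 8.0, 16.0, 0.6, 2.5, 7.0, 15.0],
    'gdp_per_capita': [1000.0, 5000.0, 20000.0, 45000.0,
                       1500.0, 7000.0, 25000.0, 50000.0],
})

# Work on a log scale so that a few very large values do not dominate
logs = toy.assign(
    log_co2=np.log10(toy['co2_per_capita']),
    log_gdp=np.log10(toy['gdp_per_capita']),
)

# Pearson correlation between the two log columns, computed separately per year
corr_by_year = pd.Series({
    year: group['log_co2'].corr(group['log_gdp'])
    for year, group in logs.groupby('year')
})
print(corr_by_year.round(3))
```

A coefficient above zero in every year would support the reading that the relationship has stayed positive.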

### Findings

Asia has the highest level of CO2 emission, contributing approximately 52% of global CO2 emission. China and the United States of America produce the highest levels of CO2 emission. This is compounded by the fact that the highest emitters of CO2 are also very populous countries. To account for this, another measure was adopted, CO2 per capita, which revealed that Australia is the highest emitter of CO2 per head among the continents and that the USA is the highest emitter per head among the ten largest emitting countries.
It was also shown that CO2 per capita has a positive relationship with GDP per capita and that this relationship has remained positive over the years.

### Conclusion

Plotly is an efficient library for creating incredible visuals, and the visuals can be tweaked to meet your personal preference. Who knows, it might be the tool you use for your next project. Be sure to leave a comment if you have any questions about the tools and concepts adopted for this project, or if you have any recommendations or corrections to make. If you found this notebook helpful, be sure to leave an upvote.