# **Notebook 1: Data Extraction, Transformation and Loading**

## Objectives

* Import raw data from Kaggle into a dataframe
* Clean data to remove duplicate values and remove outliers
* Identify and handle missing data

## Inputs

* Raw data files from [CO2 Emissions Dataset](https://www.kaggle.com/datasets/shreyanshdangi/co-emissions-across-countries-regions-and-sectors/data)

## Outputs

* Generates clean_data.csv for use in hypothesis testing and visualisations




---

# Import Packages

Import packages needed to run the notebook

In [None]:
import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns

# Import Data

Import raw data into dataframe, ready for processing

In [2]:
# set path to data file
path = "../raw_data/data.csv"

# assign data to dataframe
df_raw = pd.read_csv(path)

# display dataframe
df_raw.head()

Unnamed: 0,Description,Name,year,iso_code,population,gdp,cement_co2,cement_co2_per_capita,co2,co2_growth_abs,...,share_global_other_co2,share_of_temperature_change_from_ghg,temperature_change_from_ch4,temperature_change_from_co2,temperature_change_from_ghg,temperature_change_from_n2o,total_ghg,total_ghg_excluding_lucf,trade_co2,trade_co2_share
0,Country,Afghanistan,1850,AFG,3752993.0,,0.0,0.0,,,...,,,,,,,7.436,0.629,,
1,Country,Afghanistan,1851,AFG,3767956.0,,0.0,0.0,,,...,,0.156,0.0,0.0,0.0,0.0,7.5,0.633,,
2,Country,Afghanistan,1852,AFG,3783940.0,,0.0,0.0,,,...,,0.155,0.0,0.0,0.0,0.0,7.56,0.637,,
3,Country,Afghanistan,1853,AFG,3800954.0,,0.0,0.0,,,...,,0.155,0.0,0.0,0.0,0.0,7.62,0.641,,
4,Country,Afghanistan,1854,AFG,3818038.0,,0.0,0.0,,,...,,0.155,0.0,0.0,0.0,0.0,7.678,0.644,,


# Initial Data Cleaning

First I will carry out some initial data cleaning steps:
1. Check for duplicate rows and remove if found
2. Limit data to country information to fit the requirements of the analysis
3. Limit data to last 50 years to make analysis easier to manage and findings more relevant to current times

In [3]:
# drop any duplicate rows
df_raw.drop_duplicates(inplace=True)

# select rows where Description is equal to Country
df_raw = df_raw.loc[df_raw['Description'] == "Country"]

# limit data to last 50 years
df_raw = df_raw.loc[df_raw["year"] >= 1975]

df_raw.shape

(9702, 80)
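
As an optional check on the steps above, the sketch below counts duplicate rows before dropping them and derives the start of the year window from the data instead of hard-coding it. It runs on a toy dataframe (`df_demo`, `window` and `first_year` are illustrative names, not part of the notebook). With a latest year of 2023, an inclusive 50-year window would start in 1974, one year earlier than the fixed cutoff of 1975 used above.

```python
import pandas as pd

# toy stand-in for df_raw; values are illustrative only
df_demo = pd.DataFrame({
    "Description": ["Country", "Country", "Country", "Region", "Country"],
    "Name": ["A", "A", "A", "World", "B"],
    "year": [1970, 2023, 2023, 2023, 1974],
})

# count fully duplicated rows before dropping them
n_duplicates = int(df_demo.duplicated().sum())

# same cleaning steps as the notebook cell, without inplace modification
df_demo = df_demo.drop_duplicates()
df_demo = df_demo.loc[df_demo["Description"] == "Country"]

# derive the first year of an inclusive 50-year window from the latest year present
window = 50
first_year = int(df_demo["year"].max()) - window + 1
df_demo = df_demo.loc[df_demo["year"] >= first_year]

print(n_duplicates, first_year, df_demo.shape)
```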

The next step is to drop columns which aren't needed for analysis. This reduces the size of the data file, making processing more efficient and the data easier to work with.<br>
(GPT-5 was used to format column names)

In [75]:
# create a list of columns names to keep in the dataset
req_columns = ["Name", "iso_code", "year", "population", "gdp", "primary_energy_consumption", "co2", "co2_including_luc", "consumption_co2", "total_ghg", "co2_growth_abs", "co2_growth_prct",
               "co2_per_capita", "co2_per_gdp", "consumption_co2_per_capita", "consumption_co2_per_gdp", "energy_per_capita", "energy_per_gdp", "cement_co2", "coal_co2", "flaring_co2", "gas_co2", "land_use_change_co2", "oil_co2", "trade_co2",
               "share_global_cement_co2", "share_global_co2", "share_global_co2_including_luc", "share_global_coal_co2", "share_global_flaring_co2", "share_global_gas_co2", "share_global_luc_co2", "share_global_oil_co2", "trade_co2_share",
               "cumulative_co2", "cumulative_co2_including_luc", "share_global_cumulative_co2", "share_of_temperature_change_from_ghg", "temperature_change_from_co2", "temperature_change_from_ghg"]

# create a new dataframe with only the required columns
df_trimmed = df_raw[req_columns].copy()
df_trimmed.head()

Unnamed: 0,Name,iso_code,year,population,gdp,primary_energy_consumption,co2,co2_including_luc,consumption_co2,total_ghg,...,share_global_gas_co2,share_global_luc_co2,share_global_oil_co2,trade_co2_share,cumulative_co2,cumulative_co2_including_luc,share_global_cumulative_co2,share_of_temperature_change_from_ghg,temperature_change_from_co2,temperature_change_from_ghg
125,Afghanistan,AFG,1975,12773967.0,15177770000.0,,2.121,5.732,,21.901,...,0.021,0.071,0.011,,21.287,280.878,0.004,0.117,0.0,0.001
126,Afghanistan,AFG,1976,13059861.0,16023610000.0,,1.981,5.286,,21.624,...,0.013,0.065,0.01,,23.267,286.164,0.004,0.116,0.0,0.001
127,Afghanistan,AFG,1977,13340758.0,15207360000.0,,2.384,5.391,,21.517,...,0.021,0.055,0.011,,25.652,291.555,0.005,0.115,0.0,0.001
128,Afghanistan,AFG,1978,13611445.0,16337830000.0,,2.153,4.786,,20.887,...,0.012,0.049,0.01,,27.805,296.34,0.005,0.113,0.0,0.001
129,Afghanistan,AFG,1979,13655572.0,15913790000.0,,2.233,4.99,,20.564,...,0.014,0.059,0.013,,30.038,301.33,0.005,0.112,0.0,0.001


---

# Initial Data Exploration

Next I will use ydata-profiling to learn more about the dataset and find out if any further transformation is required to prepare it for use in visualisations.<br>
The preview from the table above shows some columns that will need to be rescaled (e.g. population and gdp) and some missing data to be handled (e.g. primary_energy_consumption and trade_co2_share).
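
The missing values visible in the preview can also be quantified directly with pandas. The following is a minimal sketch on a toy dataframe (`df_demo` and `missing_pct` are illustrative names); the same expression could be applied to `df_trimmed`.

```python
import numpy as np
import pandas as pd

# toy stand-in for df_trimmed; values are illustrative only
df_demo = pd.DataFrame({
    "Name": ["A", "A", "B", "B"],
    "gdp": [1.5e10, np.nan, 2.0e9, 2.1e9],
    "trade_co2_share": [np.nan, np.nan, np.nan, 4.2],
})

# percentage of missing values per column, highest first
missing_pct = (
    df_demo.isna()
    .mean()
    .mul(100)
    .round(1)
    .sort_values(ascending=False)
)

print(missing_pct)
```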

In [5]:
profile = ProfileReport(df=df_trimmed, minimal=True)
profile.to_notebook_iframe()


The report highlights missing data in most columns, which will need to be handled.<br>
The minimum population is 501. We might want to remove countries with very small populations, as these are unlikely to have high emission levels. Alternatively, these countries could be combined and treated as a single category.<br>
GDP is largely reported in scientific notation, which is not intuitive for most people to interpret, so it could be beneficial to rescale this column. It is currently reported in $, but expressing it in thousands or millions may make more sense.
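
One way the GDP rescaling could look is sketched below, assuming millions of dollars as the target unit. The column name `gdp(m)` is a hypothetical choice, and the toy dataframe `df_demo` stands in for `df_trimmed`.

```python
import numpy as np
import pandas as pd

# toy stand-in for df_trimmed; the first value is taken from the preview above
df_demo = pd.DataFrame({"Name": ["A", "B"], "gdp": [15177770000.0, np.nan]})

# hypothetical column name: GDP in millions of dollars, rounded to 2 decimal places
df_demo["gdp(m)"] = df_demo["gdp"].div(1_000_000).round(2)

print(df_demo)
```

Missing GDP values stay missing after the division, so this step does not interfere with any later handling of missing data.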

---

# Basic Descriptives and Visualisations

This section includes basic data descriptives and visualisations which are used to inform final data transformations.<br>

## Population
First I will look at population for the most recent year (2023) to identify countries with particularly low population numbers which could be targets for removal from the dataset.<br>
It is assumed that population will tend to have increased over time and that no country has had a sudden and severe drop in population figures in recent years.<br>
(GPT-5 was used to refactor the dataframe creation code and https://stackoverflow.com/questions/40347689/dataframe-describe-suppress-scientific-notation was used to reformat the descriptives output)

In [71]:
# create dataframe with population data for 2023
df_temp = df_trimmed.loc[df_trimmed['year'] == 2023, ['Name','population', 'share_global_co2_including_luc']]
# reset dataframe index (reset_index returns a new dataframe, so assign the result)
df_temp = df_temp.reset_index(drop=True)

# create descriptives for population and global co2 share, making sure format doesn't include scientific notation
df_temp.describe().apply(lambda s: s.apply('{0:.2f}'.format))

Unnamed: 0,population,share_global_co2_including_luc
count,198.0,193.0
mean,40785592.49,0.51
std,149011328.0,2.31
min,501.0,-0.01
25%,1856267.0,0.02
50%,9123057.0,0.07
75%,30797891.75,0.19
max,1438069597.0,28.02
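
The assumption that no country has had a sudden, severe drop in population in recent years could be checked with a sketch like the one below. It uses a toy dataframe, and the 10% threshold is a hypothetical value chosen only for illustration.

```python
import pandas as pd

# toy stand-in for df_trimmed; values are illustrative only
df_demo = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B", "B"],
    "year": [2021, 2022, 2023, 2021, 2022, 2023],
    "population": [1000.0, 1010.0, 1020.0, 2000.0, 1500.0, 1490.0],
})

# sort so that each country's years are in order before computing changes
df_demo = df_demo.sort_values(["Name", "year"])

# year-on-year percentage change within each country
df_demo["pop_change_pct"] = (
    df_demo.groupby("Name")["population"].pct_change().mul(100)
)

# hypothetical threshold: flag falls of more than 10% in a single year
threshold = -10.0
flagged = df_demo.loc[
    df_demo["pop_change_pct"] < threshold, ["Name", "year", "pop_change_pct"]
]

print(flagged)
```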


Some countries have a very low population. I want to apply impact-based filtering to remove countries with a low population, as long as their overall contribution to emissions is also low.<br>
A value of 1 million people is commonly used in global climate analysis as a cutoff point. We will combine this with a minimum threshold of 0.1% CO2 share to remove low-impact, low-population countries.<br>
(Method and steps for data filtering suggested by copilot)

In [73]:
# keep countries with a population of at least 1m or a global co2 share of at least 0.1% in the year 2023
df_filtered = df_temp[(df_temp['population'] >= 1000000) | (df_temp['share_global_co2_including_luc'] >= 0.1)]

# remove countries which do not meet the thresholds from the main dataset
df_trimmed = df_trimmed[df_trimmed['Name'].isin(df_filtered['Name'])].copy()
df_trimmed['population'].describe().apply(lambda x: format(x, '.0f'))


count          7791
mean       38132121
std       132481020
min          193100
25%         3840494
50%         9307346
75%        25943786
max      1438069597
Name: population, dtype: object
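
To make the effect of the combined condition easier to inspect, the sketch below applies the same rule to a toy dataframe and lists the names that would be removed. The country names and values are invented for illustration. Note that a missing CO2 share compares as False, so a small country with no share value is treated as falling below the threshold.

```python
import numpy as np
import pandas as pd

# toy stand-in for df_temp; names and values are illustrative only
df_demo = pd.DataFrame({
    "Name": ["Large", "SmallHighShare", "SmallLowShare", "SmallNoShare"],
    "population": [5_000_000, 400_000, 300_000, 501],
    "share_global_co2_including_luc": [0.05, 0.25, 0.01, np.nan],
})

# same rule as the notebook cell: keep a country if either threshold is met
keep = (df_demo["population"] >= 1_000_000) | (
    df_demo["share_global_co2_including_luc"] >= 0.1
)

# names that fail both thresholds and would be dropped
removed = df_demo.loc[~keep, "Name"].tolist()

print(removed)
```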

The dataset has reduced in size to 7791 rows and our minimum population count is much higher.<br>
Finally, I will rescale the population to millions so that it is easier to read.<br>
(Code adapted from https://stackoverflow.com/questions/43675014/dividing-a-dataframe-column-and-then-rounding)

In [76]:
# create a new population column by dividing population values by 1m and round to 2 decimal places
df_trimmed['pop(m)'] = df_trimmed['population'].div(1000000).round(2)
df_trimmed.head()

Unnamed: 0,Name,iso_code,year,population,gdp,primary_energy_consumption,co2,co2_including_luc,consumption_co2,total_ghg,...,share_global_luc_co2,share_global_oil_co2,trade_co2_share,cumulative_co2,cumulative_co2_including_luc,share_global_cumulative_co2,share_of_temperature_change_from_ghg,temperature_change_from_co2,temperature_change_from_ghg,pop(m)
125,Afghanistan,AFG,1975,12773967.0,15177770000.0,,2.121,5.732,,21.901,...,0.071,0.011,,21.287,280.878,0.004,0.117,0.0,0.001,12.77
126,Afghanistan,AFG,1976,13059861.0,16023610000.0,,1.981,5.286,,21.624,...,0.065,0.01,,23.267,286.164,0.004,0.116,0.0,0.001,13.06
127,Afghanistan,AFG,1977,13340758.0,15207360000.0,,2.384,5.391,,21.517,...,0.055,0.011,,25.652,291.555,0.005,0.115,0.0,0.001,13.34
128,Afghanistan,AFG,1978,13611445.0,16337830000.0,,2.153,4.786,,20.887,...,0.049,0.01,,27.805,296.34,0.005,0.113,0.0,0.001,13.61
129,Afghanistan,AFG,1979,13655572.0,15913790000.0,,2.233,4.99,,20.564,...,0.059,0.013,,30.038,301.33,0.005,0.112,0.0,0.001,13.66


---

# Output Cleaned Data

Save cleaned data to a csv file ready to be used for visualisations, hypothesis testing, model building and dashboarding.
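
A minimal sketch of this export step is shown below. The file name `clean_data.csv` comes from the Outputs section, but the output folder is an assumption: a temporary directory is used here so the example is self-contained, and a toy dataframe stands in for `df_trimmed`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# toy stand-in for df_trimmed; values are illustrative only
df_demo = pd.DataFrame({"Name": ["A"], "year": [2023], "pop(m)": [12.77]})

# hypothetical output folder; replace with the project's own data folder
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "clean_data.csv"

# index=False avoids writing the row index as an extra unnamed column
df_demo.to_csv(out_path, index=False)

# read the file back to confirm the columns survive the round trip
df_check = pd.read_csv(out_path)
print(df_check.columns.tolist())
```

Writing without the index means the next notebook can read the file with `pd.read_csv` and get only the cleaned columns.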

In [None]:
# export cleaned data to csv


This is the end of Notebook 1. To continue the analysis, open 02_statistics_and_visualisations.ipynb.