# **Notebook 1: Data Extraction, Transformation and Loading**

## Objectives

* Import raw data from Kaggle into a dataframe
* Clean data to remove duplicate values and remove outliers
* Identify and handle missing data

## Inputs

* Raw data files from [CO2 Emissions Dataset](https://www.kaggle.com/datasets/shreyanshdangi/co-emissions-across-countries-regions-and-sectors/data)

## Outputs

* Generates clean_data.csv for use in hypothesis testing and visualisations



---

# Import Packages

Import packages needed to run the notebook

In [1]:
import numpy as np
import pandas as pd

# Import Data

Import raw data into dataframe, ready for processing

In [2]:
# set path to data file
path = "../raw_data/data.csv"

# assign data to dataframe
df_raw = pd.read_csv(path)

# display dataframe
df_raw.head()

Unnamed: 0,Description,Name,year,iso_code,population,gdp,cement_co2,cement_co2_per_capita,co2,co2_growth_abs,...,share_global_other_co2,share_of_temperature_change_from_ghg,temperature_change_from_ch4,temperature_change_from_co2,temperature_change_from_ghg,temperature_change_from_n2o,total_ghg,total_ghg_excluding_lucf,trade_co2,trade_co2_share
0,Country,Afghanistan,1850,AFG,3752993.0,,0.0,0.0,,,...,,,,,,,7.436,0.629,,
1,Country,Afghanistan,1851,AFG,3767956.0,,0.0,0.0,,,...,,0.156,0.0,0.0,0.0,0.0,7.5,0.633,,
2,Country,Afghanistan,1852,AFG,3783940.0,,0.0,0.0,,,...,,0.155,0.0,0.0,0.0,0.0,7.56,0.637,,
3,Country,Afghanistan,1853,AFG,3800954.0,,0.0,0.0,,,...,,0.155,0.0,0.0,0.0,0.0,7.62,0.641,,
4,Country,Afghanistan,1854,AFG,3818038.0,,0.0,0.0,,,...,,0.155,0.0,0.0,0.0,0.0,7.678,0.644,,
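
The preview above already shows many empty cells (for example in `gdp`), so it can help to measure how much of each column is missing before deciding how to handle it. The following is a minimal sketch on a toy dataframe; `df_demo` and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_raw; the values are illustrative, not taken from the dataset
df_demo = pd.DataFrame({
    "Name": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "gdp": [np.nan, np.nan, 1.0e9, np.nan],
    "co2": [1.2, np.nan, 3.4, 3.6],
})

# Fraction of missing values per column, largest first
missing_share = df_demo.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression should work on `df_raw`; columns with a very high missing share are natural candidates to drop or to treat separately.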


# Data Cleaning

First, I will carry out some initial data cleaning steps:
1. Check for duplicate rows and remove any that are found
2. Limit the data to country-level rows to fit the requirements of the analysis
3. Limit the data to the last 50 years (1975 onwards) to make the analysis easier to manage and the findings more relevant to the present day

In [3]:
# drop any duplicate rows
df_raw.drop_duplicates(inplace=True)

# select rows where Description is equal to Country
df_raw = df_raw.loc[df_raw['Description'] == "Country"]

# limit data to last 50 years
df_raw = df_raw.loc[df_raw["year"] >= 1975]

df_raw.shape

(9702, 80)
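
The three cleaning steps can also be sketched on a toy dataframe, which makes it easy to see how many duplicates were removed and to derive the cutoff year from the data rather than hard-coding it. Here `df_demo`, `n_years` and `cutoff` are hypothetical names used for illustration, and the inclusive 50-year window is an assumption about how "last 50 years" is counted.

```python
import pandas as pd

# Toy stand-in for df_raw; values are illustrative only
df_demo = pd.DataFrame({
    "Description": ["Country", "Country", "Country", "Region", "Country"],
    "Name": ["A", "A", "A", "World", "A"],
    "year": [1970, 1980, 2024, 2024, 2024],
})

n_years = 50  # hypothetical window length, matching the "last 50 years" step

# Step 1: count exact duplicate rows, then drop them
n_duplicates = int(df_demo.duplicated().sum())
df_clean = df_demo.drop_duplicates()

# Step 2: keep only country-level rows
df_clean = df_clean.loc[df_clean["Description"] == "Country"]

# Step 3: derive the cutoff from the latest year present (inclusive window)
cutoff = df_clean["year"].max() - n_years + 1
df_clean = df_clean.loc[df_clean["year"] >= cutoff]

print(n_duplicates, cutoff, df_clean.shape)
```

Deriving the cutoff this way keeps the window at 50 years if the dataset is later updated with newer data, whereas a fixed year would need to be edited by hand.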

The next step is to drop columns that aren't needed for the analysis. This reduces the size of the data file, making processing more efficient and the data easier to work with.<br>
(GPT-5 was used to format column names)

In [None]:
# create a list of column names to keep in the dataset
req_columns = ["Name", "iso_code", "year", "population", "gdp", "primary_energy_consumption", "co2", "co2_including_luc", "consumption_co2", "total_ghg", "co2_growth_abs", "co2_growth_prct",
               "co2_per_capita", "co2_per_gdp", "consumption_co2_per_capita", "consumption_co2_per_gdp", "energy_per_capita", "energy_per_gdp", "cement_co2", "coal_co2", "flaring_co2", "gas_co2", "land_use_change_co2", "oil_co2", "trade_co2",
               "share_global_cement_co2", "share_global_co2", "share_global_co2_including_luc", "share_global_coal_co2", "share_global_flaring_co2", "share_global_gas_co2", "share_global_luc_co2", "share_global_oil_co2", "trade_co2_share",
               "cumulative_co2", "cumulative_co2_including_luc", "share_global_cumulative_co2", "share_of_temperature_change_from_ghg", "temperature_change_from_co2", "temperature_change_from_ghg"]

# create a new dataframe with only the required columns
# (.copy() makes it independent of df_raw, avoiding SettingWithCopyWarning on later edits)
df_trimmed = df_raw[req_columns].copy()
df_trimmed.head()

Unnamed: 0,Name,iso_code,year,population,gdp,primary_energy_consumption,co2,co2_including_luc,consumption_co2,total_ghg,...,share_global_gas_co2,share_global_luc_co2,share_global_oil_co2,trade_co2_share,cumulative_co2,cumulative_co2_including_luc,share_global_cumulative_co2,share_of_temperature_change_from_ghg,temperature_change_from_co2,temperature_change_from_ghg
125,Afghanistan,AFG,1975,12773967.0,15177770000.0,,2.121,5.732,,21.901,...,0.021,0.071,0.011,,21.287,280.878,0.004,0.117,0.0,0.001
126,Afghanistan,AFG,1976,13059861.0,16023610000.0,,1.981,5.286,,21.624,...,0.013,0.065,0.01,,23.267,286.164,0.004,0.116,0.0,0.001
127,Afghanistan,AFG,1977,13340758.0,15207360000.0,,2.384,5.391,,21.517,...,0.021,0.055,0.011,,25.652,291.555,0.005,0.115,0.0,0.001
128,Afghanistan,AFG,1978,13611445.0,16337830000.0,,2.153,4.786,,20.887,...,0.012,0.049,0.01,,27.805,296.34,0.005,0.113,0.0,0.001
129,Afghanistan,AFG,1979,13655572.0,15913790000.0,,2.233,4.99,,20.564,...,0.014,0.059,0.013,,30.038,301.33,0.005,0.112,0.0,0.001
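
Selecting columns by a list raises a `KeyError` if any name is misspelled or absent, so a quick check before selecting can save debugging time. This is a minimal sketch on a toy dataframe; `df_demo`, `wanted`, and the `methane` column are hypothetical and only stand in for `df_raw` and `req_columns`.

```python
import pandas as pd

# Toy stand-in for df_raw; values are illustrative only
df_demo = pd.DataFrame({
    "Name": ["A", "B"],
    "iso_code": ["AAA", "BBB"],
    "year": [2000, 2000],
    "co2": [1.0, 2.0],
    "methane": [0.1, 0.2],
})

# Shortened stand-in for req_columns; "gdp" is deliberately absent from df_demo
wanted = ["Name", "iso_code", "year", "gdp", "co2"]

# Report any requested columns that the dataframe does not contain
missing = [col for col in wanted if col not in df_demo.columns]
present = [col for col in wanted if col in df_demo.columns]
print("Missing columns:", missing)

# Select only the columns that exist; .copy() gives an independent dataframe
df_subset = df_demo[present].copy()
```

Whether to stop on a missing column or continue without it is a design choice; for a fixed analysis it is arguably safer to stop and correct the list.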


---


# Push files to Repo


In [None]:
import os

try:
    # create your folder here
    # os.makedirs(name='')
    pass  # placeholder so the try block is valid until a folder name is set
except Exception as e:
    print(e)
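
The Outputs section lists `clean_data.csv`, so one possible way to finish the notebook is to create the output folder and write the cleaned dataframe there. The sketch below is an assumption about how that could look, not the notebook's actual save step: `save_clean_data` is a hypothetical helper, the folder name is a placeholder, and the demo writes to a temporary directory so nothing is added to the repo.

```python
import os
import tempfile

import pandas as pd


def save_clean_data(df, out_dir, filename="clean_data.csv"):
    """Write df to out_dir/filename, creating out_dir if it does not exist."""
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, filename)
    # index=False avoids writing the row index as an extra unnamed column
    df.to_csv(out_path, index=False)
    return out_path


# Demonstration with toy data in a temporary directory
df_demo = pd.DataFrame({"Name": ["A", "B"], "year": [2000, 2001], "co2": [1.0, 2.0]})
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_clean_data(df_demo, os.path.join(tmp_dir, "clean_data_demo"))
    df_check = pd.read_csv(saved_path)

print(df_check.shape)
```

Writing with `index=False` means the file can be read back without an extra `Unnamed: 0` column appearing.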
