# **ETL Process**

## Objectives

* Extract data from the raw CSV dataset into a pandas DataFrame
* Clean and transform the extracted DataFrame
* Load the processed pandas DataFrame into a CSV file

## Inputs

* The raw dataset, downloaded from Kaggle

## Outputs

* The processed dataset, saved as a CSV file

## Additional Comments

* Dataset source: https://www.kaggle.com/datasets/mackness/global-gdp-and-co-emissions-dataset-19602022?resource=download 
* The source dataset is already cleaned, but a cleaning step will still be carried out
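
As a minimal sketch of what that cleaning step could start with, the checks below count missing values and duplicate rows on a small sample. The sample reuses a few values from the Afghanistan rows of the dataset; the repeated 1962 row is added deliberately for illustration and is not in the source data, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Tiny illustrative sample. The third row repeats the second on purpose
# (hypothetical duplicate, not present in the real dataset).
sample = pd.DataFrame({
    "Country Name": ["Afghanistan", "Afghanistan", "Afghanistan"],
    "Year": [1961, 1962, 1962],
    "CO2": [0.491, 0.689, 0.689],
    "CO2 %": [np.nan, 40.325866, 40.325866],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(sample.duplicated().sum())

# Drop the repeated rows and rebuild a clean 0..n-1 index
cleaned = sample.drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```

Running the same three checks on the full DataFrame would show whether the "already cleaned" assumption holds before any transformation is applied.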



---

# Change working directory

* The notebooks are assumed to be stored in a subfolder of the project. When a notebook is run in the editor, the working directory is therefore that subfolder and needs to be changed.

We need to change the working directory from the notebook's folder to its parent folder
* os.getcwd() returns the current working directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\fanxi\\OneDrive\\Documents\\hackathon\\hackathon1\\global-gdp-and-co2-emissions\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() returns the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\fanxi\\OneDrive\\Documents\\hackathon\\hackathon1\\global-gdp-and-co2-emissions'
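
Running the `os.chdir` cell a second time moves one level further up. A minimal sketch of a re-run-safe variant is shown below; it assumes the notebooks live in a folder named `jupyter_notebooks` (as in the path printed above), and the helper name `resolve_project_root` is hypothetical.

```python
import os

def resolve_project_root(cwd, notebooks_folder="jupyter_notebooks"):
    """Return the parent of cwd if cwd is the notebooks folder, else cwd unchanged."""
    normalized = os.path.normpath(cwd)
    if os.path.basename(normalized) == notebooks_folder:
        return os.path.dirname(normalized)
    return cwd

# Only moves up when the notebook is still inside the notebooks folder,
# so re-running this cell does not climb further up the tree.
project_root = resolve_project_root(os.getcwd())
os.chdir(project_root)
print("Working directory:", os.getcwd())
```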

# Initial setup

### Prerequisites
Ensure that the required packages listed in requirements.txt are installed in the kernel.
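
One way to check this before running the rest of the notebook is sketched below. It only tests whether the two packages imported in this notebook (numpy and pandas) can be found; the full contents of requirements.txt are not assumed, and the helper name `missing_packages` is hypothetical.

```python
import importlib.util

def missing_packages(package_names):
    """Return the names that cannot be found by the import system."""
    return [name for name in package_names
            if importlib.util.find_spec(name) is None]

# Packages imported later in this notebook
required = ["numpy", "pandas"]
missing = missing_packages(required)

if missing:
    print(f"Missing packages: {missing}. Try: pip install -r requirements.txt")
else:
    print("All required packages are importable")
```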

### Import libraries
Import the numpy and pandas libraries

In [4]:
import numpy as np
import pandas as pd

---

# Data extraction

### Extracting Dataset 
Create a pandas DataFrame from the CSV file.

In [13]:
df = pd.read_csv('dataset/raw/gdp_co2_by_country.csv') ## Path to the raw dataset, relative to the project root
df

Unnamed: 0,Country Name,Country Code,Year,Population,Pop Log,Pop Outliers,Pop Category,CO2,CO2 %,Per Capita CO2,...,CO2 Log,CO2 Outliers,Emissions Category,GDP USD,GDP USD Log,GDP %,GDP % Winsor,GDP Per Capita,GDP Category,CO2 Per GDP
0,Afghanistan,AFG,1961,9214082.0,16.036244,not outlier,1M-10M,0.491,,5.328800e-08,...,0.399447,False,Moderate,308.318270,5.734371,-10.119484,-10.119484,0.000033,Low GDP,0.001593
1,Afghanistan,AFG,1962,9404411.0,16.056689,not outlier,1M-10M,0.689,40.325866,7.326349e-08,...,0.524137,False,Moderate,308.318270,5.734371,-10.119484,-10.119484,0.000033,Low GDP,0.002235
2,Afghanistan,AFG,1963,9604491.0,16.077741,not outlier,1M-10M,0.707,2.612482,7.361140e-08,...,0.534737,False,Moderate,308.318270,5.734371,-10.119484,-10.119484,0.000032,Low GDP,0.002293
3,Afghanistan,AFG,1964,9814318.0,16.099353,not outlier,1M-10M,0.839,18.670438,8.548735e-08,...,0.609222,False,Moderate,308.318270,5.734371,-10.119484,-10.119484,0.000031,Low GDP,0.002721
4,Afghanistan,AFG,1965,10036003.0,16.121690,not outlier,10M-100M,1.007,20.023838,1.003388e-07,...,0.696641,False,Moderate,308.318270,5.734371,-10.119484,-10.119484,0.000031,Low GDP,0.003266
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12502,Zimbabwe,ZWE,2019,15271377.0,16.541491,not outlier,10M-100M,10.263,-8.406961,6.720416e-07,...,2.421523,False,High,1350.309851,7.208830,-7.785580,-7.785580,0.000088,Low GDP,0.007600
12503,Zimbabwe,ZWE,2020,15526888.0,16.558084,not outlier,10M-100M,8.495,-17.226932,5.471154e-07,...,2.250765,False,High,1224.272314,7.110918,-9.333971,-9.333971,0.000079,Low GDP,0.006939
12504,Zimbabwe,ZWE,2021,15797220.0,16.575345,not outlier,10M-100M,10.204,20.117716,6.459364e-07,...,2.416271,False,High,1305.220113,7.174893,6.611911,6.611911,0.000083,Low GDP,0.007818
12505,Zimbabwe,ZWE,2022,16069061.0,16.592406,not outlier,10M-100M,10.425,2.165817,6.487622e-07,...,2.435804,False,High,1361.914530,7.217381,4.343667,4.343667,0.000085,Low GDP,0.007655
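
If the working directory was not changed correctly, `pd.read_csv` fails because the relative path cannot be resolved. The sketch below wraps the read in a hypothetical helper, `load_raw_dataset`, that reports the current working directory when the file is missing. It is demonstrated on a temporary two-row file so it runs without the real dataset.

```python
from pathlib import Path
import tempfile

import pandas as pd

def load_raw_dataset(csv_path):
    """Read a CSV into a DataFrame, failing early with a readable message."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path} not found relative to {Path.cwd()}; "
            "check that the working directory is the project root"
        )
    return pd.read_csv(csv_path)

# Demonstration on a throwaway file (illustrative content only)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.csv"
    demo_path.write_text("Country Name,Year\nAfghanistan,1961\nAfghanistan,1962\n")
    demo_df = load_raw_dataset(demo_path)

print(demo_df)
```

In the notebook itself the call would be `load_raw_dataset('dataset/raw/gdp_co2_by_country.csv')`.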


#### DataFrame info
Below is some information about the DataFrame, including:
* Shape of DataFrame
* Features of DataFrame
* Data type of values

In [None]:
## The dimensions of the DataFrame (rows, columns)
df.shape
## This gives the number of samples (rows) and the number of features (columns) in the DataFrame

(12507, 21)

In [None]:
## The features of the DataFrame (Names of columns)
df.columns

Index(['Country Name', 'Country Code', 'Year', 'Population', 'Pop Log',
       'Pop Outliers', 'Pop Category', 'CO2', 'CO2 %', 'Per Capita CO2',
       'Cumulative CO2', 'CO2 Log', 'CO2 Outliers', 'Emissions Category',
       'GDP USD', 'GDP USD Log', 'GDP %', 'GDP % Winsor', 'GDP Per Capita',
       'GDP Category', 'CO2 Per GDP'],
      dtype='object')

In [None]:
## Data type of values
df.dtypes

Country Name           object
Country Code           object
Year                    int64
Population            float64
Pop Log               float64
Pop Outliers           object
Pop Category           object
CO2                   float64
CO2 %                 float64
Per Capita CO2        float64
Cumulative CO2        float64
CO2 Log               float64
CO2 Outliers           object
Emissions Category     object
GDP USD               float64
GDP USD Log           float64
GDP %                 float64
GDP % Winsor          float64
GDP Per Capita        float64
GDP Category           object
CO2 Per GDP           float64
dtype: object
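
The dtypes above mix numeric columns with text columns such as `Country Name` and `Pop Category`. A minimal sketch of splitting the two groups with `select_dtypes` follows; it uses a two-row sample built from values shown in the table above, and the variable names are assumptions.

```python
import pandas as pd

# Two-row sample with a subset of the dataset's columns
sample = pd.DataFrame({
    "Country Name": ["Afghanistan", "Zimbabwe"],
    "Year": [1961, 2022],
    "Population": [9214082.0, 16069061.0],
    "Pop Category": ["1M-10M", "10M-100M"],
})

# Columns holding integers or floats
numeric_columns = sample.select_dtypes(include="number").columns.tolist()

# Everything else (text / categorical labels)
text_columns = sample.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_columns)
print("Text:", text_columns)
```

Keeping the two lists separate makes it easier to apply numeric checks and label checks to the right columns during cleaning.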

---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down. Avoid any workflow in which you have to go back to an earlier cell to complete a task, such as re-running a cell to refresh the contents of a variable.

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid Python until a folder is specified
except Exception as e:
  print(e)
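
For the load step listed in the objectives, a minimal sketch of creating an output folder and writing the processed DataFrame to CSV is given below. The helper `save_processed`, the folder, and the file name are hypothetical placeholders; the demonstration writes to a temporary directory so nothing is added to the repository.

```python
import os
import tempfile

import pandas as pd

def save_processed(df, output_dir, filename):
    """Create output_dir if needed and write df to a CSV file inside it."""
    os.makedirs(output_dir, exist_ok=True)  # no error if the folder already exists
    output_path = os.path.join(output_dir, filename)
    df.to_csv(output_path, index=False)  # index=False: do not write the row index
    return output_path

# Demonstration with illustrative values and a throwaway folder
demo_df = pd.DataFrame({"Country Code": ["ZWE", "ZWE"], "Year": [2021, 2022]})
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_processed(demo_df, os.path.join(tmp_dir, "processed"), "demo.csv")
    reloaded = pd.read_csv(saved_path)

print(reloaded)
```

Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra unnamed column.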
