<h1 style = "text-align:center; text-decoration: underline"> Preprocessing of environmental data </h1>

<i>Libraries</i>

In [63]:
import pandas as pd
import os

<hr>
<i>Variables and Paths<i>

In [64]:
###### Variables
co2_data_path = os.path.join("../Data_raw/annual-co2-emissions-per-country.csv")
energy_data_path = os.path.join("../Data_raw/per-capita-energy-use.csv")
air_data_path = os.path.join("../Data_raw/long-run-air-pollution.csv")

# Reading in data
co2_data = pd.read_csv(co2_data_path)
energy_data = pd.read_csv(energy_data_path)
air_data = pd.read_csv(air_data_path)

<hr>
<i>Function definitions</i>

In [65]:
#Function for inspecting the datafiles
def inspect_data(input_file: pd.DataFrame) -> None:
    '''
    The function inspect_data expects a pandas DataFrame as input and prints the 
    dimensions, structure and variables of the DataFrame.

    Parameters:
    - input_file: pd.DataFrame

    Returns:
    - None

    >>> inspect_data(co2_data)
    Structure of co2_data dataframe
    '''

    #prints dimensions of dataframe
    print(f"Shape of {input_file} dataframe")
    input_file.shape

    #prints structure of dataframe
    print(f"Info of {input_file} dataframe")
    input_file.info()

    #prints the first 5 rows of dataframe
    print(f"Head of {input_file} dataframe")
    input_file.head(n = 5)

<hr>
<b>Inspection<b>

In [66]:
inspect_data(co2_data)

Shape of           Entity Code  Year  Annual CO₂ emissions  time
0    Afghanistan  AFG  2000            1047127.94  2000
1    Afghanistan  AFG  2023           11020218.00  2023
2        Albania  ALB  2000            3024926.00  2000
3        Albania  ALB  2023            5144279.00  2023
4        Algeria  DZA  2000           85398600.00  2000
..           ...  ...   ...                   ...   ...
389        Yemen  YEM  2023           10034892.00  2023
390       Zambia  ZMB  2000            1784113.00  2000
391       Zambia  ZMB  2023            7748922.00  2023
392     Zimbabwe  ZWE  2000           13814538.00  2000
393     Zimbabwe  ZWE  2023           11164030.00  2023

[394 rows x 5 columns] dataframe
Info of           Entity Code  Year  Annual CO₂ emissions  time
0    Afghanistan  AFG  2000            1047127.94  2000
1    Afghanistan  AFG  2023           11020218.00  2023
2        Albania  ALB  2000            3024926.00  2000
3        Albania  ALB  2023            5144279.00  20

In [67]:
inspect_data(air_data)

Shape of            Entity Code  Year  Nitrogen oxides emissions from all sectors  \
0     Afghanistan  AFG  2000                                  120223.375   
1     Afghanistan  AFG  2001                                   90231.766   
2     Afghanistan  AFG  2002                                   86566.555   
3     Afghanistan  AFG  2003                                   89515.340   
4     Afghanistan  AFG  2004                                   95819.650   
...           ...  ...   ...                                         ...   
5078     Zimbabwe  ZWE  2018                                   83017.195   
5079     Zimbabwe  ZWE  2019                                   80245.730   
5080     Zimbabwe  ZWE  2020                                   68585.510   
5081     Zimbabwe  ZWE  2021                                   71677.440   
5082     Zimbabwe  ZWE  2022                                   74016.900   

      Sulfur dioxide emissions from all sectors  \
0                          

In [68]:
inspect_data(energy_data)

Shape of           Entity Code    Year  \
0    Afghanistan  AFG  2020.0   
1    Afghanistan  NaN     NaN   
2        Albania  ALB  2020.0   
3        Albania  NaN     NaN   
4        Algeria  DZA  2020.0   
..           ...  ...     ...   
385        Yemen  NaN     NaN   
386       Zambia  ZMB  2020.0   
387       Zambia  NaN     NaN   
388     Zimbabwe  ZWE  2020.0   
389     Zimbabwe  NaN     NaN   

     Primary energy consumption per capita (kWh/person)  time  
0                                           1200.74160   2020  
1                                            990.46967   2023  
2                                           7352.82100   2020  
3                                           8032.07030   2023  
4                                          14733.60450   2020  
..                                                 ...    ...  
385                                          875.51770   2023  
386                                         2019.72830   2020  
387               

I noticed that in **co2_data** and **energy_data** there are duplicate columns namely **time** and **year**.

I will remove on of them, to keep the dataframe simpler

Although in **energy_data** the decision becomes clearer because **year_column** is the same as **time** but with missing values for 2023.

In [69]:
co2_data.columns

Index(['Entity', 'Code', 'Year', 'Annual CO₂ emissions', 'time'], dtype='object')

In [70]:
# Dropping duplicate columns
co2_data = co2_data.drop(columns = ["time"])

energy_data = energy_data.drop(columns = ["Year"])
# Rename time column to year
energy_data = energy_data.rename(columns = {"time": "Year"})


In [71]:
co2_data.head(n = 5)

Unnamed: 0,Entity,Code,Year,Annual CO₂ emissions
0,Afghanistan,AFG,2000,1047127.94
1,Afghanistan,AFG,2023,11020218.0
2,Albania,ALB,2000,3024926.0
3,Albania,ALB,2023,5144279.0
4,Algeria,DZA,2000,85398600.0


In [72]:
energy_data.head(n = 5)

Unnamed: 0,Entity,Code,Primary energy consumption per capita (kWh/person),Year
0,Afghanistan,AFG,1200.7416,2020
1,Afghanistan,,990.46967,2023
2,Albania,ALB,7352.821,2020
3,Albania,,8032.0703,2023
4,Algeria,DZA,14733.6045,2020


In [73]:
!jupyter nbconvert --ClearOutputPreprocessor.enabled=True --inplace preprocessing.ipynb
!jupyter nbconvert --to script preprocessing.ipynb




[NbConvertApp] Converting notebook preprocessing.ipynb to notebook
[NbConvertApp] Writing 5186 bytes to preprocessing.ipynb
[NbConvertApp] Converting notebook preprocessing.ipynb to script
[NbConvertApp] Writing 2308 bytes to preprocessing.py
