## Question 1: Comprehensive Data Acquisition and Preprocessing 
**Task:** 
Download and preprocess CO2 emissions data along with a wide range of socio-economic 
and environmental indicators from the World Bank’s Climate Change database. 
 
**Instructions:** 
1. Access the World Bank database using Python, R, or MATLAB. 
2. Download CO2 emissions data and as many relevant socio-economic and environmental 
indicators as possible (e.g., GDP, population, energy consumption, urbanization rate, 
education level, etc.). 
3. Clean and preprocess the data, addressing missing values, outliers, and ensuring 
consistency across indicators. 
4. Provide a detailed summary of the dataset, including key statistics, correlations between 
variables, and any notable patterns or anomalies.

**How to use**
- The API calls are currently failing: the World Bank Database server returns a 502 error code.
- You can run all cells; if this is your first run, make sure to uncomment cell 5 first.
- Please refer to the deliverables folder and the [q1.md](deliverables/q1.md) file for an overview of the results of this first question.
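
Because the World Bank server can be unavailable (as with the 502 error mentioned above), one option is to fall back to the CSV that the download cell writes to `raw_data/wbdata_raw.csv`. The following is only a minimal sketch of that idea: `load_or_download` and its `download` argument are hypothetical names and are not part of `helpers_v2`.

```python
from pathlib import Path
from typing import Callable

import pandas as pd


def load_or_download(csv_path: str, download: Callable[[], pd.DataFrame]) -> pd.DataFrame:
    """Try the API first; if it fails, reuse the cached CSV (hypothetical helper)."""
    path = Path(csv_path)
    try:
        df = download()
    except Exception as err:  # e.g. an HTTP 502 from the server
        if not path.exists():
            raise RuntimeError(f"Download failed and no cache at {path}") from err
        return pd.read_csv(path)
    # Download succeeded: refresh the cache for later runs.
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return df
```

In this notebook it could be called as `load_or_download('raw_data/wbdata_raw.csv', lambda: manipulate_data.get_wb_data(indicators, date_range))`, assuming `get_wb_data` raises an exception when the request fails.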

In [1]:
import wbdata
import pandas as pd
from datetime import datetime
from helpers_v2 import *

Key '2572116086130111078' not in persistent cache.
Key '-8432638068931429607' not in persistent cache.
Key '6966304242584891041' not in persistent cache.
Key '-8930835760588398345' not in persistent cache.
Key '3439344468497918485' not in persistent cache.
Key '-5411766246040514076' not in persistent cache.
Key '2487318062359677408' not in persistent cache.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
manipulate_data = DataManipulation()

In [4]:
# Uncomment this if you want to download the data from the API.
indicators = {
    'EN.ATM.CO2E.KT': 'CO2_emissions',        # CO2 emissions (kt)
    'NY.GDP.MKTP.CD': 'GDP',                  # GDP (current US$)
    'SP.POP.TOTL': 'Population',              # Population, total
    'EG.USE.PCAP.KG.OE': 'Energy_use',        # Energy use (kg of oil equivalent per capita)
    'SP.URB.TOTL.IN.ZS': 'Urbanization_rate', # Urban population (% of total population)
    'EG.ELC.RNEW.ZS' : 'Renewable_elec_output' # Renewable electricity output (% of total electricity output)
}

date_range = ('1990', '2020') # Dates where CO2 data is available

df = manipulate_data.get_wb_data(indicators, date_range)

df.to_csv('raw_data/wbdata_raw.csv', index=False)

# Note: we didn't include any education-related factor since they have a lot of missing data

Downloading data from 1990 to 2020


In [5]:
df.head()

Unnamed: 0,country,date,CO2_emissions,GDP,Population,Energy_use,Urbanization_rate,Renewable_elec_output
0,Africa Eastern and Southern,2020,544952.503,929074100000.0,685112979.0,,36.828302,
1,Africa Eastern and Southern,2019,610723.5,1006527000000.0,667242986.0,,36.336259,
2,Africa Eastern and Southern,2018,598720.9575,1012719000000.0,649757148.0,,35.847598,
3,Africa Eastern and Southern,2017,590905.482,940105500000.0,632746570.0,,35.358901,
4,Africa Eastern and Southern,2016,580219.242,829830000000.0,616377605.0,,34.894753,


In [6]:
df.info()

<class 'wbdata.client.DataFrame'>
RangeIndex: 8246 entries, 0 to 8245
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   country                8246 non-null   object 
 1   date                   8246 non-null   object 
 2   CO2_emissions          7408 non-null   float64
 3   GDP                    7839 non-null   float64
 4   Population             8215 non-null   float64
 5   Energy_use             4740 non-null   float64
 6   Urbanization_rate      8153 non-null   float64
 7   Renewable_elec_output  6894 non-null   float64
dtypes: float64(6), object(2)
memory usage: 515.5+ KB


In [7]:
# Eliminate non-country rows (e.g. regional aggregates)
df = manipulate_data.eliminate_non_country_data(df)
print(f'# of countries in the dataset: {len(df.country.unique())}')


# of countries in the dataset: 186
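
The World Bank returns regional and income-group aggregates (such as "Africa Eastern and Southern" in the first rows of `df.head()`) alongside individual countries. The implementation of `eliminate_non_country_data` is not shown here, so the following is only a sketch of the idea. It assumes a hand-maintained and deliberately incomplete set of aggregate names; `AGGREGATES` and `drop_aggregates` are hypothetical.

```python
import pandas as pd

# Illustrative, incomplete list of World Bank aggregate names (assumption).
AGGREGATES = {"Africa Eastern and Southern", "Euro area", "World"}


def drop_aggregates(df: pd.DataFrame, aggregates=AGGREGATES) -> pd.DataFrame:
    """Keep only rows whose 'country' is not a known aggregate."""
    return df[~df["country"].isin(aggregates)].reset_index(drop=True)


sample = pd.DataFrame({
    "country": ["World", "Fiji", "Euro area", "Guyana"],
    "date": ["2014"] * 4,
})
countries_only = drop_aggregates(sample)
```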


In [8]:
manipulate_data.missing_data_percentage(df)

country                   0.000000
date                      0.000000
CO2_emissions             9.157128
GDP                       3.520638
Population                0.000000
Energy_use               44.779743
Urbanization_rate         0.000000
Renewable_elec_output    16.059660
dtype: float64
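
A per-column missing-data percentage like the one above can be computed directly with pandas. This is a minimal sketch, not necessarily how `missing_data_percentage` is written; `missing_pct` is a hypothetical name and the sample frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def missing_pct(df: pd.DataFrame) -> pd.Series:
    """Percentage of NaN values in each column."""
    return df.isna().mean() * 100


sample = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "CO2_emissions": [1.0, np.nan, 3.0, 4.0],
    "GDP": [np.nan, np.nan, 1.0, 2.0],
})
pct = missing_pct(sample)
```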


In [9]:
# Checking the number of years in the current df
print(f'# of years in the dataset: {len(df.date.unique())}')

# of years in the dataset: 31


In [10]:
# Count missing values per country
missing_values_by_country = manipulate_data.get_missing_value_groupby(df, groupby_column='country', sort_by='CO2_emissions')
missing_values_by_country.head()


Unnamed: 0,country,CO2_emissions,GDP,Energy_use,Renewable_elec_output
138,Puerto Rico,31,0,31,5
110,Monaco,31,0,31,5
33,Cayman Islands,31,16,31,5
120,New Caledonia,31,0,31,5
69,Guam,31,12,31,5
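
Counting missing values per group, as in the table above, amounts to a `groupby` over the NaN mask. The sketch below assumes that the helper keeps only columns with at least one missing value and sorts in descending order, which matches the output shown but is not confirmed by the source. `missing_by_group` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def missing_by_group(df: pd.DataFrame, groupby_column: str, sort_by: str) -> pd.DataFrame:
    """Count NaNs per group and column, most incomplete groups first."""
    value_cols = [c for c in df.columns if c != groupby_column]
    counts = df[value_cols].isna().groupby(df[groupby_column]).sum()
    # Drop columns that have no missing values at all.
    counts = counts.loc[:, counts.sum() > 0]
    return counts.sort_values(sort_by, ascending=False).reset_index()


sample = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "date": ["2013", "2014", "2013", "2014"],
    "CO2_emissions": [np.nan, np.nan, 1.0, np.nan],
    "GDP": [1.0, 2.0, np.nan, 4.0],
})
by_country = missing_by_group(sample, "country", "CO2_emissions")
```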


In [12]:
countries_to_remove = manipulate_data.get_items_to_remove(df, missing_values_by_country, target_col_name='date', groupby_col = 'country')
list(countries_to_remove)

31
15.5


138       Puerto Rico
110            Monaco
33     Cayman Islands
120     New Caledonia
69               Guam
            ...      
57               Fiji
73             Guyana
68            Grenada
71             Guinea
72      Guinea-Bissau
Name: country, Length: 67, dtype: object
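
The two printed values (31 and 15.5) suggest that the helper compares missing counts against half the number of years, but its exact rule is not shown. The sketch below assumes a group is dropped when any indicator is missing in more than half of the periods; `groups_to_remove` is a hypothetical stand-in for `get_items_to_remove`.

```python
import pandas as pd


def groups_to_remove(df, missing_counts, target_col_name, groupby_col):
    """Return the groups where some indicator is missing in over half the periods."""
    n_periods = df[target_col_name].nunique()
    threshold = n_periods / 2  # assumed cut-off: half of the available periods
    indicator_cols = [c for c in missing_counts.columns if c != groupby_col]
    too_sparse = (missing_counts[indicator_cols] > threshold).any(axis=1)
    return missing_counts.loc[too_sparse, groupby_col]


df_demo = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "date": ["2011", "2012", "2013", "2014"] * 2,
})
counts = pd.DataFrame({
    "country": ["A", "B"],
    "CO2_emissions": [3, 1],
    "GDP": [0, 2],
})
to_remove = groups_to_remove(df_demo, counts, "date", "country")
```

With four years the assumed threshold is 2, so only country A (3 missing CO2 values) is flagged; B sits exactly at the threshold and is kept.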

In [13]:
df = df[~df.country.isin(countries_to_remove)].reset_index(drop=True)
df.info()

<class 'wbdata.client.DataFrame'>
RangeIndex: 3689 entries, 0 to 3688
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   country                3689 non-null   object 
 1   date                   3689 non-null   object 
 2   CO2_emissions          3688 non-null   float64
 3   GDP                    3656 non-null   float64
 4   Population             3689 non-null   float64
 5   Energy_use             2994 non-null   float64
 6   Urbanization_rate      3689 non-null   float64
 7   Renewable_elec_output  3098 non-null   float64
dtypes: float64(6), object(2)
memory usage: 230.7+ KB


In [14]:
# Print current number of countries
print(f'# of countries in the dataset: {len(df.country.unique())}')

# of countries in the dataset: 119


In [15]:
# Count missing values per year
missing_values_by_year = manipulate_data.get_missing_value_groupby(df, groupby_column='date', sort_by='Energy_use')
missing_values_by_year.head()

Unnamed: 0,date,CO2_emissions,GDP,Energy_use,Renewable_elec_output
30,2020,0,1,119,119
29,2019,0,1,119,118
28,2018,0,1,119,118
27,2017,0,1,119,118
26,2016,0,1,119,118


In [16]:
years_to_remove = manipulate_data.get_items_to_remove(df, missing_values_by_year, target_col_name='country', groupby_col='date')
years_to_remove.sort_values()

25    2015
26    2016
27    2017
28    2018
29    2019
30    2020
Name: date, dtype: object

In [17]:
df = df[~df.date.isin(years_to_remove)].reset_index(drop=True)
df.info()

<class 'wbdata.client.DataFrame'>
RangeIndex: 2975 entries, 0 to 2974
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   country                2975 non-null   object 
 1   date                   2975 non-null   object 
 2   CO2_emissions          2974 non-null   float64
 3   GDP                    2948 non-null   float64
 4   Population             2975 non-null   float64
 5   Energy_use             2963 non-null   float64
 6   Urbanization_rate      2975 non-null   float64
 7   Renewable_elec_output  2975 non-null   float64
dtypes: float64(6), object(2)
memory usage: 186.1+ KB


In [18]:
df.date.unique()

array(['2014', '2013', '2012', '2011', '2010', '2009', '2008', '2007',
       '2006', '2005', '2004', '2003', '2002', '2001', '2000', '1999',
       '1998', '1997', '1996', '1995', '1994', '1993', '1992', '1991',
       '1990'], dtype=object)

In [19]:
manipulate_data.missing_data_percentage(df)

country                  0.000000
date                     0.000000
CO2_emissions            0.033613
GDP                      0.907563
Population               0.000000
Energy_use               0.403361
Urbanization_rate        0.000000
Renewable_elec_output    0.000000
dtype: float64


In [None]:
# Imputing data
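
The imputation cell is empty in the source, so the method the author intended is unknown. One common choice for the small gaps that remain (under 1% per column) is linear interpolation over time within each country. The sketch below assumes that approach; `interpolate_by_country` is a hypothetical name and the sample data is made up.

```python
import numpy as np
import pandas as pd


def interpolate_by_country(df: pd.DataFrame, value_cols: list) -> pd.DataFrame:
    """Fill gaps per country by linear interpolation along the year axis."""
    out = df.copy()
    out["date"] = out["date"].astype(int)  # years arrive as strings
    out = out.sort_values(["country", "date"]).reset_index(drop=True)
    # limit_direction="both" also fills gaps at the start or end of a series
    # by carrying the nearest valid value.
    out[value_cols] = out.groupby("country")[value_cols].transform(
        lambda s: s.interpolate(limit_direction="both")
    )
    return out


sample = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "date": ["2014", "2013", "2012", "2013", "2012"],
    "GDP": [30.0, np.nan, 10.0, np.nan, 5.0],
})
filled = interpolate_by_country(sample, ["GDP"])
```

Interpolating within each country avoids borrowing values across countries with very different scales; a country whose column is entirely missing would stay missing under this approach.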