## Question 1: Comprehensive Data Acquisition and Preprocessing 
**Task:** 
Download and preprocess CO2 emissions data along with a wide range of socio-economic 
and environmental indicators from the World Bank’s Climate Change database. 
 
**Instructions:** 
1. Access the World Bank database using Python, R, or MATLAB. 
2. Download CO2 emissions data and as many relevant socio-economic and environmental 
indicators as possible (e.g., GDP, population, energy consumption, urbanization rate, 
education level, etc.). 
3. Clean and preprocess the data, addressing missing values, outliers, and ensuring 
consistency across indicators. 
4. Provide a detailed summary of the dataset, including key statistics, correlations between 
variables, and any notable patterns or anomalies.

In [1]:
from helpers import GetData, DataFrameAnalyzer, ManipulateData
import pandas as pd

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
data_getter = GetData()

In [4]:
indicators_id_ls = ['EG.USE.PCAP.KG.OE', 'EG.USE.ELEC.KH.PC', 'NY.GDP.MKTP.CD', 'EN.ATM.CO2E.KT', 'SP.POP.TOTL', 'SP.URB.TOTL', 'SE.PRM.CMPT.ZS', 'EG.ELC.RNEW.ZS']
indicators_name_ls = ['energy_use', 'elect_power_consm', 'gdp', 'cotwo_emissions', 'popu_total', 'urban_popu', 'primary_completion_rate', 'renewable_elec_output']
# data_getter.download_data(indicators_id_ls, 'raw_data')

In [5]:
# Unzip and rename the CSV files. Uncomment if it's the first time you run the notebook
# data_getter.unzip_downloaded_files()
# data_getter.rename_csv_files(indicator_ids=indicators_id_ls)

In [6]:
# Map each indicator name to its CSV path, then load every file into a dataframe
csv_files = {}
for indicator_name, indicator_id in zip(indicators_name_ls, indicators_id_ls):
    print(indicator_name, indicator_id)
    csv_files[indicator_name] = f'raw_data/csv_files/{indicator_id}.csv'

print(csv_files)

data_frames = {var_name: pd.read_csv(file_path, skiprows=4) for var_name, file_path in csv_files.items()}

energy_use EG.USE.PCAP.KG.OE
elect_power_consm EG.USE.ELEC.KH.PC
gdp NY.GDP.MKTP.CD
cotwo_emissions EN.ATM.CO2E.KT
popu_total SP.POP.TOTL
urban_popu SP.URB.TOTL
primary_completion_rate SE.PRM.CMPT.ZS
renewable_elec_output EG.ELC.RNEW.ZS
{'energy_use': 'raw_data/csv_files/EG.USE.PCAP.KG.OE.csv', 'elect_power_consm': 'raw_data/csv_files/EG.USE.ELEC.KH.PC.csv', 'gdp': 'raw_data/csv_files/NY.GDP.MKTP.CD.csv', 'cotwo_emissions': 'raw_data/csv_files/EN.ATM.CO2E.KT.csv', 'popu_total': 'raw_data/csv_files/SP.POP.TOTL.csv', 'urban_popu': 'raw_data/csv_files/SP.URB.TOTL.csv', 'primary_completion_rate': 'raw_data/csv_files/SE.PRM.CMPT.ZS.csv', 'renewable_elec_output': 'raw_data/csv_files/EG.ELC.RNEW.ZS.csv'}
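A note on `skiprows=4`: the World Bank bulk CSV files are assumed here to begin with two metadata lines ("Data Source" and "Last Updated Date"), each followed by a blank line, before the real header row. The sketch below reproduces that layout with a made-up two-row file (the country names, codes, and values are illustrative, not real data) and shows that the trailing comma on each line yields an extra, empty `Unnamed` column. If the notebook's files share this layout, one of their columns could be such an artifact, although the printed output does not confirm it.

```python
import io
import pandas as pd

# Made-up miniature of a World Bank bulk CSV (assumed layout): two metadata
# lines, each followed by a blank line, then the header. Values are invented.
raw = (
    '"Data Source","World Development Indicators",\n'
    '\n'
    '"Last Updated Date","2024-01-01",\n'
    '\n'
    '"Country Name","Country Code","Indicator Name","Indicator Code","1990","1991",\n'
    '"Aland","AAA","Population, total","SP.POP.TOTL","100","110",\n'
    '"Bland","BBB","Population, total","SP.POP.TOTL","200","220",\n'
)

# skiprows=4 skips the first four physical lines, so line 5 becomes the header.
df_raw = pd.read_csv(io.StringIO(raw), skiprows=4)

# The trailing comma on every line produces an empty "Unnamed: N" column.
unnamed_cols = [col for col in df_raw.columns if col.startswith("Unnamed")]
df = df_raw.drop(columns=unnamed_cols)

print(df_raw.shape, df.shape)
```

Dropping all-empty columns with `dropna(axis=1, how='all')` would remove such a column as well, provided no row has a value in it.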


In [7]:
analyzer = DataFrameAnalyzer()

In [8]:
analyzer.print_shape(data_frames)

Shape of energy_use: (266, 69)
Shape of elect_power_consm: (266, 69)
Shape of gdp: (266, 69)
Shape of cotwo_emissions: (266, 69)
Shape of popu_total: (266, 69)
Shape of urban_popu: (266, 69)
Shape of primary_completion_rate: (266, 69)
Shape of renewable_elec_output: (266, 69)


In [9]:
analyzer.print_info(data_frames)

Info of energy_use:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 69 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country Name    266 non-null    object 
 1   Country Code    266 non-null    object 
 2   Indicator Name  266 non-null    object 
 3   Indicator Code  266 non-null    object 
 4   1960            31 non-null     float64
 5   1961            31 non-null     float64
 6   1962            31 non-null     float64
 7   1963            31 non-null     float64
 8   1964            31 non-null     float64
 9   1965            32 non-null     float64
 10  1966            32 non-null     float64
 11  1967            32 non-null     float64
 12  1968            32 non-null     float64
 13  1969            32 non-null     float64
 14  1970            32 non-null     float64
 15  1971            151 non-null    float64
 16  1972            151 non-null    float64
 17  1973           

### Data Cleaning

In [13]:
# Remove all the non-country rows in each dataframe
manipulate_data = ManipulateData()
data_frames = manipulate_data.eliminate_non_country_data(dataframe_dict=data_frames)
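
The implementation of `eliminate_non_country_data` lives in `helpers` and is not shown in this notebook. As a minimal sketch of one way such a filter could work, the hypothetical function below drops every row whose `Country Code` appears in a caller-supplied set of non-country codes. The function name, its parameters, and the demo data are assumptions for illustration; the criterion the real helper uses may differ.

```python
import pandas as pd

def drop_non_country_rows(frames, non_country_codes):
    """Hypothetical helper: return a new dict of dataframes without the
    rows whose 'Country Code' is in non_country_codes."""
    return {
        name: df[~df["Country Code"].isin(non_country_codes)].copy()
        for name, df in frames.items()
    }

# Illustrative input with invented values. WLD (World) and EUU (European
# Union) are aggregate codes rather than individual countries.
demo_frames = {
    "gdp": pd.DataFrame({
        "Country Name": ["Brazil", "World", "Kenya", "European Union"],
        "Country Code": ["BRA", "WLD", "KEN", "EUU"],
        "1990": [1.0, 2.0, 3.0, 4.0],
    })
}

filtered = drop_non_country_rows(demo_frames, non_country_codes={"WLD", "EUU"})
print(filtered["gdp"])
```

Boolean filtering keeps the original row labels, so the index of the result is no longer contiguous. That matches the `Index: 169 entries, 2 to 265` line in the later `info()` output.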

In [14]:
# Checking new shapes
analyzer.print_shape(data_frames)


Shape of energy_use: (186, 69)
Shape of elect_power_consm: (186, 69)
Shape of gdp: (186, 69)
Shape of cotwo_emissions: (186, 69)
Shape of popu_total: (186, 69)
Shape of urban_popu: (186, 69)
Shape of primary_completion_rate: (186, 69)
Shape of renewable_elec_output: (186, 69)


In [15]:
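# Drop columns that are entirely empty
cotwo_clean_df = data_frames['cotwo_emissions'].dropna(axis=1, how='all')
# Keep rows with at least 10 non-null cells (the four metadata columns count toward this)
cotwo_clean_df = cotwo_clean_df.dropna(thresh=10)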
cotwo_clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 169 entries, 2 to 265
Data columns (total 35 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country Name    169 non-null    object 
 1   Country Code    169 non-null    object 
 2   Indicator Name  169 non-null    object 
 3   Indicator Code  169 non-null    object 
 4   1990            168 non-null    float64
 5   1991            169 non-null    float64
 6   1992            169 non-null    float64
 7   1993            169 non-null    float64
 8   1994            169 non-null    float64
 9   1995            169 non-null    float64
 10  1996            169 non-null    float64
 11  1997            169 non-null    float64
 12  1998            169 non-null    float64
 13  1999            169 non-null    float64
 14  2000            169 non-null    float64
 15  2001            169 non-null    float64
 16  2002            169 non-null    float64
 17  2003            169 non-null    float64


In [16]:
# Align all other dataframes to the same years and countries as the CO2 emissions dataframe
data_frames['cotwo_emissions'] = cotwo_clean_df
analyzer.check_column_names_equal(dataframes=data_frames)
data_frames = manipulate_data.modify_dataframes_based_on_a_target_dataframe(dataframe_dict=data_frames)

The dataframe energy_use has different column names
The dataframe elect_power_consm has different column names
The dataframe gdp has different column names
- All columns are equal -
The dataframe popu_total has different column names
The dataframe urban_popu has different column names
The dataframe primary_completion_rate has different column names
The dataframe renewable_elec_output has different column names
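
The body of `modify_dataframes_based_on_a_target_dataframe` is not shown here. A minimal sketch of the alignment step, assuming it restricts every frame to the target's countries and columns, could look like the following. The function name `align_to_target`, its parameters, and the demo frames are hypothetical.

```python
import pandas as pd

def align_to_target(frames, target_key, key_col="Country Code"):
    """Hypothetical helper: keep only the target's countries and columns
    in every dataframe of the dict."""
    target = frames[target_key]
    aligned = {}
    for name, df in frames.items():
        # Keep rows whose country code also appears in the target frame.
        rows = df[df[key_col].isin(target[key_col])]
        # Use the target's columns; columns absent from df are filled with NaN.
        aligned[name] = rows.reindex(columns=target.columns)
    return aligned

# Illustrative frames with invented values.
demo_frames = {
    "cotwo_emissions": pd.DataFrame(
        {"Country Code": ["AAA", "BBB"], "1990": [1.0, 2.0], "1991": [1.5, 2.5]}
    ),
    "gdp": pd.DataFrame(
        {
            "Country Code": ["AAA", "BBB", "CCC"],
            "1989": [9.0, 9.0, 9.0],
            "1990": [10.0, 20.0, 30.0],
            "1991": [11.0, 21.0, 31.0],
        }
    ),
}

aligned = align_to_target(demo_frames, target_key="cotwo_emissions")
print(aligned["gdp"])
```

This sketch matches rows by membership only and does not reorder them, so it relies on all frames listing countries in the same order, which the identical shapes above suggest but do not guarantee.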


In [17]:
analyzer.check_column_names_equal(data_frames)

- All columns are equal -
- All columns are equal -
- All columns are equal -
- All columns are equal -
- All columns are equal -
- All columns are equal -
- All columns are equal -
- All columns are equal -


In [19]:
analyzer.print_shape(data_frames)

Shape of energy_use: (169, 35)
Shape of elect_power_consm: (169, 35)
Shape of gdp: (169, 35)
Shape of cotwo_emissions: (169, 35)
Shape of popu_total: (169, 35)
Shape of urban_popu: (169, 35)
Shape of primary_completion_rate: (169, 35)
Shape of renewable_elec_output: (169, 35)


In [20]:
analyzer.print_info(data_frames)

Info of energy_use:
<class 'pandas.core.frame.DataFrame'>
Index: 169 entries, 2 to 265
Data columns (total 35 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Country Name    169 non-null    object 
 1   Country Code    169 non-null    object 
 2   Indicator Name  169 non-null    object 
 3   Indicator Code  169 non-null    object 
 4   1990            138 non-null    float64
 5   1991            117 non-null    float64
 6   1992            118 non-null    float64
 7   1993            118 non-null    float64
 8   1994            118 non-null    float64
 9   1995            119 non-null    float64
 10  1996            119 non-null    float64
 11  1997            119 non-null    float64
 12  1998            119 non-null    float64
 13  1999            119 non-null    float64
 14  2000            121 non-null    float64
 15  2001            121 non-null    float64
 16  2002            121 non-null    float64
 17  2003            121 