# Prediction of CO2 emissions from country-specific data
# A Machine Learning project 

***

# Stage 1: Data cleaning and preparation

***

### Notebook Contents:
0. Introduction - project and notebook summary, notes on the data source
1. Notebook setup - libraries and data import
2. Global data overview
3. Definition of the initial project goals
4. Data cleaning
    - dealing with missing values
    - transformation of the columns into a numerical data type
    - renaming of features
    - removing empty columns and rows
5. Data frame transformation
    - melting of the data for each variable
    - integration of the data into a suitable data frame format
6. Removal of missing values
    - detection of missing values
    - removal of missing values by filtering the columns and rows, so that a minimal number of features and rows is lost
7. Export the clean data frame to a file

***

## 0. Introduction

### Project summary
**Aim of the project**:
Analysis of country-specific data and development of machine learning models in order to predict CO2 emissions from country parameters. The project uses the publicly available dataset Environment, Social and Governance Data from the World Bank Group, which provides data on the vast majority of countries over a range of years for parameters such as:

* country: the vast majority of countries worldwide
* year: ranging from 1960 to 2023
* CO2 emissions
* population-specific parameter: Population density
* country economic indicators: GDP, GNI, Unemployment, etc.
* land-related parameters: Food production index, Agricultural land, Terrestrial and marine protected areas, Tree Cover Loss, etc.
* climate data: Nitrous oxide emissions, Cooling Degree Days, Heat Index 35, etc.
* energy use: Electricity production from coal sources, Renewable electricity output, Energy use, etc.
* certain types of medical data: Life expectancy at birth, etc.
* etc.



The project is divided into three stages:

1. Data cleaning and preparation
2. Data exploration and visualization
3. Predictive analysis with various machine learning algorithms

Each of the stages is described in a separate Jupyter Notebook (.ipynb file).

***

### Notebook summary - Stage 1: Data cleaning and preparation
**Aim of this notebook**: This notebook covers the first stage of the project: cleaning and transforming the available data to prepare it for the visualization and analysis described in the subsequent notebooks.

**Input**: comma separated values (CSV) file from the original online source

**Output**: comma separated values (CSV) file containing the cleaned data ready for visualization and analysis

**Programming language**: Python 3.8

**Libraries used in this notebook**: pandas, numpy

***

### Data source

The data used comes from the Environment, Social and Governance Data of the World Bank Group, which provides country-specific data on parameters such as CO2 emissions, energy use, Population density, Agricultural land, GDP, GNI, etc.


The dataset is publicly available at https://datacatalog.worldbank.org/search/dataset/0037651/Environment--Social-and-Governance-Data and licensed under the <a href="https://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International license</a>.

***


## 1. Notebook Setup
Import all needed libraries:

In [None]:
import pandas as pd
import numpy as np

In [3]:
# adjusting the path to the file is necessary, because the script is not executed in the same directory as the file
# import os
# print(os.getcwd()) # get current working directory to see where the script is executed

# Load the data into a DataFrame
file_path = 'Desktop/1.Semester/MachineLearing/Project/CO2/data/ESGEXCEL.csv' # please adjust the path to the file

try:
    data = pd.read_csv(file_path, delimiter=';', on_bad_lines='warn')
except pd.errors.ParserError as e:
    print(f"Error parsing CSV: {e}")
    raise  # re-raise, because all following cells depend on `data`


***

## 2. Global data overview

A global overview of the imported data 

In [12]:
print("Shape of the original dataset:")
data.shape

Shape of the original dataset:


(16969, 68)

In [13]:
# Inspect the columns of the DataFrame
print("Columns in the DataFrame:", data.columns)
print("Column data types:")
data.dtypes

Columns in the DataFrame: Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022',
       '2023'],
      dtype='object')
Column data types:


Country Name      object
Country Code      object
Indicator Name    object
Indicator Code    object
1960              object
                   ...  
2019              object
2020              object
2021              object
2022              object
2023              object
Length: 68, dtype: object

In [12]:
# Display the first few rows of the raw DataFrame
print("Overview of the first 5 rows in the DataFrame:")
data.head()

Overview of the first 5 rows in the DataFrame:


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,Arab World,ARB,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,,,,,,,...,8670571737,8694277798,8722870518,8739085552,8761786211,8779873981,8794826443,8809253583.0,,
1,Arab World,ARB,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,...,8883227593,8905385159,8953901606,9066275374,8917693883,9035280229,9063505001,9084566129.0,,
2,Arab World,ARB,Adjusted savings: natural resources depletion ...,NY.ADJ.DRES.GN.ZS,,,,,,,...,1005055446,6130654513,5265859073,6245422284,8187713883,7234435527,4598505988,,,
3,Arab World,ARB,Adjusted savings: net forest depletion (% of GNI),NY.ADJ.DFOR.GN.ZS,,,,,,,...,84360637,96672323,92911395,102683973,57123056,64515779,75685583,,,
4,Arab World,ARB,Agricultural land (% of land area),AG.LND.AGRI.ZS,,309814141.0,3098266305.0,3100705428.0,3101800095.0,3104246564.0,...,398344215,3987257452,3993781393,3998445203,3996973753,3990703091,3997329048,3997074236.0,,


In [4]:
print("Descriptive statistics of the columns:")
data.describe()

Descriptive statistics of the columns:


Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
count,16969,16969,16969,16969,1623,2156,2165,2170,2175,2212,...,12509,12390,11783,11577,11710,11440,10731,7215,3640,453
unique,239,239,71,71,1541,2078,2090,2089,2095,2126,...,10724,10621,10136,10025,9972,9988,9467,6334,3543,445
top,Arab World,ARB,Access to clean fuels and technologies for coo...,EG.CFT.ACCS.ZS,0,0,0,0,0,0,...,0,0,0,0,0,0,0,100,100,6235160271
freq,71,71,239,239,27,17,17,16,15,21,...,372,367,266,271,280,265,268,178,15,3


In [28]:
# get more information about the columns
data['Indicator Name'].unique() 

array(['Access to clean fuels and technologies for cooking (% of population)',
       'Access to electricity (% of population)',
       'Adjusted savings: natural resources depletion (% of GNI)',
       'Adjusted savings: net forest depletion (% of GNI)',
       'Agricultural land (% of land area)',
       'Agriculture, forestry, and fishing, value added (% of GDP)',
       'Annual freshwater withdrawals, total (% of internal resources)',
       'Annualized average growth rate in per capita real survey mean consumption or income, total population (%)',
       'Cause of death, by communicable diseases and maternal, prenatal and nutrition conditions (% of total)',
       'Children in employment, total (% of children ages 7-14)',
       'CO2 emissions (metric tons per capita)', 'Coastal protection',
       'Control of Corruption: Estimate', 'Cooling Degree Days',
       'Economic and Social Rights Performance Score',
       'Electricity production from coal sources (% of total)',
       '

In [6]:
data['Indicator Code'].unique()

array(['EG.CFT.ACCS.ZS', 'EG.ELC.ACCS.ZS', 'NY.ADJ.DRES.GN.ZS',
       'NY.ADJ.DFOR.GN.ZS', 'AG.LND.AGRI.ZS', 'NV.AGR.TOTL.ZS',
       'ER.H2O.FWTL.ZS', 'SI.SPR.PCAP.ZG', 'SH.DTH.COMM.ZS',
       'SL.TLF.0714.ZS', 'EN.ATM.CO2E.PC', 'EN.CLC.CSTP.ZS', 'CC.EST',
       'EN.CLC.CDDY.XD', 'SD.ESR.PERF.XQ', 'EG.ELC.COAL.ZS',
       'EG.IMP.CONS.ZS', 'EG.EGY.PRIM.PP.KD', 'EG.USE.PCAP.KG.OE',
       'SP.DYN.TFRT.IN', 'AG.PRD.FOOD.XD', 'AG.LND.FRST.ZS',
       'EG.USE.COMM.FO.ZS', 'NY.GDP.MKTP.KD.ZG', 'EN.CLC.GHGR.MT.CE',
       'SI.POV.GINI', 'GE.EST', 'SE.XPD.TOTL.GB.ZS', 'EN.CLC.HEAT.XD',
       'EN.CLC.HDDY.XD', 'SH.MED.BEDS.ZS', 'SI.DST.FRST.20',
       'IT.NET.USER.ZS', 'SL.TLF.ACTI.ZS', 'EN.LND.LTMP.DC',
       'ER.H2O.FWST.ZS', 'SP.DYN.LE00.IN', 'SE.ADT.LITR.ZS',
       'EN.MAM.THRD.NO', 'EN.ATM.METH.PC', 'SH.DYN.MORT', 'SM.POP.NETM',
       'EN.ATM.NOXE.PC', 'IP.PAT.RESD', 'SH.H2O.SMDW.ZS',
       'SH.STA.SMSS.ZS', 'EN.ATM.PM25.MC.M3', 'PV.EST',
       'SP.POP.65UP.TO.ZS', 'EN.POP.DNST

In [7]:
data['Country Name'].unique()

array(['Arab World', 'Caribbean small states',
       'Central Europe and the Baltics', 'Early-demographic dividend',
       'East Asia & Pacific',
       'East Asia & Pacific (excluding high income)',
       'East Asia & Pacific (IDA & IBRD)', 'Euro area',
       'Europe & Central Asia',
       'Europe & Central Asia (excluding high income)',
       'Europe & Central Asia (IDA & IBRD)', 'European Union',
       'Fragile and conflict affected situations',
       'Heavily indebted poor countries (HIPC)', 'High income',
       'IBRD only', 'IDA & IBRD total', 'IDA blend', 'IDA only',
       'IDA total', 'Late-demographic dividend',
       'Latin America & Caribbean',
       'Latin America & Caribbean (excluding high income)',
       'Latin America & Caribbean (IDA & IBRD)',
       'Least developed countries: UN classification',
       'Low & middle income', 'Low income', 'Lower middle income',
       'Middle East & North Africa',
       'Middle East & North Africa (excluding high income)


### Findings from the global overview

This global overview reveals the following facts about the available data:

- shape: 68 columns, 16969 rows
- all columns are of type "object", i.e. the year columns have not been parsed as numeric values
- a considerable number of missing values, denoted as NaN (not a number)
- The columns represent key values such as country, but also the corresponding years and the indicator code/name
- The columns 'Country Code' and 'Indicator Code' add no information beyond 'Country Name' and 'Indicator Name' and are therefore redundant
- The column 'Indicator Name' contains the country-specific features required for the analysis
- The names of the features in the column 'Indicator Name' are clear but too long

*** 

## 3. Define the initial project goals

The first overview of the raw data allows us to define the initial goals and objectives of the machine learning project. These will be refined as more insight is gained from the data. However, this initial goal definition helps to develop a strategy and to organize the data cleaning, transformation and visualization.

The data series available can be summarized into the following country-specific parameter/feature categories:

* target variable: CO2 emissions 

Features:

* country: the vast majority of countries worldwide
* year: ranging from 1960 to 2023
* population-specific parameter: Population density
* country economic indicators: GDP, GNI, Unemployment, etc.
* land-related parameters: Food production index, Agricultural land, Terrestrial and marine protected areas, Tree Cover Loss, etc.
* climate data: Nitrous oxide emissions, Cooling Degree Days, Heat Index 35, etc.
* energy use: Electricity production from coal sources, Renewable electricity output, Energy use, etc.
* certain types of medical data: Life expectancy at birth, etc.
* etc.

Such a dataset suggests investigating the influence of country-specific parameters such as economic parameters, population, energy use, land use and others on climate-related data or on factors affecting the climate, such as emissions, precipitation, etc.

**Initial goal of the machine learning project:** Analyze the relationships among these variable categories and evaluate the contribution of factors like country economy, energy use, land use, etc. to greenhouse gas emissions, precipitation, etc. Finally, develop a machine learning model capable of predicting climate-related data or emissions from the other country-specific parameters.

As more insight into the data is gained over the course of the project, these goals will be refined in more detail.

***

## 4. Data cleaning

### Organization of the data cleaning and transformation

The main aim of the data cleaning and transformation is to represent the features (the country parameters contained in the column *'Indicator Name'*) as separate columns and to make each row identifiable by a country and a year. At the same time, it would make sense to transform the years into a single column.

Additionally, it is necessary to get rid of empty rows or columns and deal with the remaining cells with missing values.

For these purposes, the following tasks have to be undertaken:

1. Remove the unnecessary columns "Country Code", "Indicator Code"
2. Transform the year columns into a numerical data type
3. Rename the features in column "Indicator Name"
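
Empty rows and columns, as mentioned above, can be removed with `dropna(how='all')`. The following is a minimal sketch on a small made-up frame (the values and the `year_cols` helper list are invented for illustration), not the exact step used later in this notebook:

```python
import numpy as np
import pandas as pd

# toy frame with made-up values
toy = pd.DataFrame({
    'Country Name': ['A', 'B', 'C'],
    '2001': [1.0, np.nan, np.nan],
    '2002': [np.nan, np.nan, np.nan],  # completely empty column
})

year_cols = ['2001', '2002']  # hypothetical list of the year columns

# drop the columns that contain no values at all
no_empty_cols = toy.dropna(axis='columns', how='all')

# drop the rows whose remaining year columns are all empty
remaining_years = [c for c in year_cols if c in no_empty_cols.columns]
no_empty_rows = no_empty_cols.dropna(axis='index', how='all', subset=remaining_years)

print(no_empty_rows)
```

The `subset` argument restricts the row check to the year columns, so that a row is not kept merely because its country name is filled in.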

### 4.1 Removing the unnecessary columns "Country Code", "Indicator Code"

In [5]:
# copy the data into a new DataFrame, which will be modified
data_clean = data.copy()

print("Original number of columns:")
print(data_clean.shape[1])

data_clean = data_clean.drop(['Country Code', 'Indicator Code'], axis='columns')

print("Current number of columns:")
print(data_clean.shape[1])

Original number of columns:
68
Current number of columns:
66


### 4.2 Transform the year columns into a numerical data type

In [6]:
# Convert only the year columns to numeric types
numeric_columns = data_clean.columns[2:]  # the first two columns ('Country Name', 'Indicator Name') are non-numeric
data_clean[numeric_columns] = data_clean[numeric_columns].apply(pd.to_numeric, errors='coerce')

# Print the column data types after transformation
print("Column data types after transformation:")
print(data_clean.dtypes)

Column data types after transformation:
Country Name       object
Indicator Name     object
1960              float64
1961              float64
1962              float64
                   ...   
2019              float64
2020              float64
2021              float64
2022              float64
2023              float64
Length: 66, dtype: object
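
One caveat on this conversion: `errors='coerce'` silently turns every string that cannot be parsed into NaN. If the raw file uses a decimal comma (an assumption here, suggested by the `;` delimiter and the `object` dtype of the year columns, but not verified against the file), values such as `'86,7'` would be lost. A minimal sketch on made-up values, comparing plain coercion with a conversion that replaces the comma first:

```python
import pandas as pd

# toy column with made-up values; the decimal comma is an assumption about the raw file
raw = pd.Series(['86,7', '100', None, '0'])

# plain coercion: strings containing a comma cannot be parsed and become NaN
coerced = pd.to_numeric(raw, errors='coerce')

# replacing the decimal comma first keeps those values
converted = pd.to_numeric(raw.str.replace(',', '.', regex=False), errors='coerce')

print("NaN values after plain coercion:", coerced.isna().sum())
print("NaN values after replacing the comma:", converted.isna().sum())
```

If the assumption holds, passing `decimal=','` to `pd.read_csv` would be an alternative that handles the decimal separator at import time.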


***

## 5. Data frame transformation

This is what the current data frame looks like:

In [14]:
data_clean.head()

Unnamed: 0,Country Name,Indicator Name,1960,1961,1962,1963,1964,1965,1966,1967,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,Arab World,Access to clean fuels and technologies for coo...,,,,,,,,,...,,,,,,,,,,
1,Arab World,Access to electricity (% of population),,,,,,,,,...,,,,,,,,,,
2,Arab World,Adjusted savings: natural resources depletion ...,,,,,,,,,...,,,,,,,,,,
3,Arab World,Adjusted savings: net forest depletion (% of GNI),,,,,,,,,...,,,,,,,,,,
4,Arab World,Agricultural land (% of land area),,,,,,,,,...,,,,,,,,,,


In [8]:
# save the feature names into a list of strings
chosen_cols = data_clean['Indicator Name'].unique()

# define an empty list, where sub-dataframes for each feature will be saved
frame_list = []

# iterate over all chosen features
for variable in chosen_cols:
    
    # pick only rows corresponding to the current feature
    frame = data_clean[data_clean['Indicator Name'] == variable]
    
    # melt the values of all years into one column named 'year' and name the value column after the feature
    frame = (
        frame.melt(id_vars=['Country Name', 'Indicator Name'], var_name='year', value_name=variable)
        .rename(columns={'Country Name': 'country'})
        .drop(['Indicator Name'], axis='columns')
    )
    
    # add the melted dataframe for the current feature into the list
    frame_list.append(frame)


# merge all sub-frames into a single dataframe with an outer join on the key columns 'country','year'
from functools import reduce
all_vars = reduce(lambda left, right: pd.merge(left, right, on=['country','year'], how='outer'), frame_list)

In [9]:
all_vars.head()

Unnamed: 0,country,year,Access to clean fuels and technologies for cooking (% of population),Access to electricity (% of population),Adjusted savings: natural resources depletion (% of GNI),Adjusted savings: net forest depletion (% of GNI),Agricultural land (% of land area),"Agriculture, forestry, and fishing, value added (% of GDP)","Annual freshwater withdrawals, total (% of internal resources)","Annualized average growth rate in per capita real survey mean consumption or income, total population (%)",...,"School enrollment, primary (% gross)","School enrollment, primary and secondary (gross), gender parity index (GPI)",Scientific and technical journal articles,Standardised Precipitation-Evapotranspiration Index,Strength of legal rights index (0=weak to 12=strong),Terrestrial and marine protected areas (% of total territorial area),Tree Cover Loss (hectares),"Unemployment, total (% of total labor force) (modeled ILO estimate)",Unmet need for contraception (% of married women ages 15-49),Voice and Accountability: Estimate
0,Afghanistan,1960,,,,,,,,,...,,,,,,,,,,
1,Afghanistan,1961,,,,,,,,,...,,,,,,,,,,
2,Afghanistan,1962,,,,,,,,,...,,,,,,,,,,
3,Afghanistan,1963,,,,,,,,,...,,,,,,,,,,
4,Afghanistan,1964,,,,,,,,,...,,,,,,,,,,
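
The melt-and-merge steps above can also be expressed as a single melt followed by a pivot. The following is a sketch on a small made-up frame with the same layout (the values and the shortened indicator names `co2` and `gdp` are invented for illustration); it has not been run against the full dataset:

```python
import pandas as pd

# toy frame in the same wide layout: one row per country and indicator, one column per year
toy = pd.DataFrame({
    'Country Name': ['A', 'A', 'B', 'B'],
    'Indicator Name': ['co2', 'gdp', 'co2', 'gdp'],
    '2001': [1.0, 10.0, 2.0, 20.0],
    '2002': [1.5, 11.0, None, 21.0],
})

# melt the year columns into a single 'year' column (long format)
long_format = toy.melt(id_vars=['Country Name', 'Indicator Name'],
                       var_name='year', value_name='value')

# pivot the indicators into columns, keeping one row per country and year
wide = (long_format
        .pivot(index=['Country Name', 'year'], columns='Indicator Name', values='value')
        .reset_index()
        .rename(columns={'Country Name': 'country'}))
wide.columns.name = None  # drop the leftover 'Indicator Name' label of the column index

print(wide)
```

Note that `pivot` raises an error if a combination of country, year and indicator occurs more than once, whereas the merge-based approach would silently duplicate rows in that case.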


***

## 6. Remove the remaining missing values in an optimal way

After the transformation, the data frame still contains a large number of missing values:

In [15]:
print("check the amount of missing values in each column")
all_vars.isnull().sum()

check the amount of missing values in each column


country                                                                     0
year                                                                        0
Access to clean fuels and technologies for cooking (% of population)    13654
Access to electricity (% of population)                                 13202
Adjusted savings: natural resources depletion (% of GNI)                15155
                                                                        ...  
Terrestrial and marine protected areas (% of total territorial area)    15290
Tree Cover Loss (hectares)                                              11536
Unemployment, total (% of total labor force) (modeled ILO estimate)     15233
Unmet need for contraception (% of married women ages 15-49)            15219
Voice and Accountability: Estimate                                      15296
Length: 73, dtype: int64

### 6.1 Filtering the years by missing values

Checking the amount of missing values for each year:

In [16]:
all_vars_clean = all_vars.copy()

# define a dictionary with the unique year values as keys and a starting count of 0
years_count_missing = dict.fromkeys(all_vars_clean['year'].unique(), 0)
for ind, row in all_vars_clean.iterrows():
    years_count_missing[row['year']] += row.isnull().sum()

# sort the years by missing values
years_missing_sorted = dict(sorted(years_count_missing.items(), key=lambda item: item[1]))

# print the missing values for each year
print("missing values by year:")
for key, val in years_missing_sorted.items():
    print(key, ":", val)

missing values by year:
2018 : 15452
2014 : 15571
2013 : 15585
2015 : 15607
2017 : 15694
2016 : 15712
2019 : 15728
2012 : 15752
2009 : 15765
2010 : 15767
2007 : 15769
2011 : 15771
2006 : 15773
2004 : 15778
2005 : 15788
2008 : 15790
2003 : 15808
2002 : 15821
2020 : 15886
1990 : 15920
1993 : 15980
1991 : 15995
1992 : 16001
2000 : 16005
2001 : 16008
1994 : 16028
1995 : 16059
1999 : 16082
1997 : 16086
1996 : 16100
1998 : 16121
1980 : 16310
1982 : 16311
1981 : 16314
2021 : 16317
1989 : 16320
1985 : 16322
1984 : 16326
1986 : 16332
1987 : 16333
1983 : 16341
1988 : 16342
1975 : 16376
1976 : 16386
1977 : 16401
1978 : 16401
1979 : 16402
1971 : 16411
1972 : 16415
1974 : 16416
1973 : 16435
1970 : 16465
1967 : 16703
1960 : 16706
1966 : 16706
1964 : 16707
1963 : 16708
1965 : 16708
1968 : 16708
1962 : 16709
1969 : 16709
1961 : 16711
2022 : 16924
2023 : 16969
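
Iterating with `iterrows()` becomes slow on larger frames. The same per-year count can be obtained with vectorized operations. A minimal sketch on a made-up frame (the feature columns `f1` and `f2` are placeholders):

```python
import numpy as np
import pandas as pd

# toy frame with made-up values
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'year': ['2001', '2002', '2001', '2002'],
    'f1': [1.0, np.nan, np.nan, 4.0],
    'f2': [np.nan, np.nan, np.nan, 4.0],
})

# count the NaN cells in each row, then sum these counts per year
missing_per_row = toy.isnull().sum(axis=1)
missing_by_year = missing_per_row.groupby(toy['year']).sum().sort_values()

print("missing values by year:")
print(missing_by_year)
```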


The purpose of the filtering is to delete the rows of years with a large number of missing values without removing too many years. It is therefore important to choose a suitable limit for the NaN values allowed per year. The previous output suggests keeping the years from 2002 to 2020 for the further analysis, since they form a contiguous range in which every year has fewer than 16,000 NaN values:

In [21]:
# check the current type of the year column
print("Type of the year column:", all_vars_clean['year'].dtype)

# transform the year column to integer
all_vars_clean['year'] = all_vars_clean['year'].astype(int)

Type of the year column: object


In [22]:
print("number of missing values in the whole dataset before filtering the years:")
print(all_vars_clean.isnull().sum().sum())
print("number of rows before filtering the years:")
print(all_vars_clean.shape[0])

# keep only rows for years between 2002 and 2020 (having fewer missing values)
all_vars_clean = all_vars_clean[(all_vars_clean['year'] >= 2002) & (all_vars_clean['year'] <= 2020)]

print("number of missing values in the whole dataset after filtering the years:")
print(all_vars_clean.isnull().sum().sum())
print("number of rows after filtering the years:")
print(all_vars_clean.shape[0])

number of missing values in the whole dataset before filtering the years:
1035846
number of rows before filtering the years:
15296
number of missing values in the whole dataset after filtering the years:
298817
number of rows after filtering the years:
4541


### 6.2 Filtering the countries by missing values

The same procedure is applied to the filtering of countries with missing values. The following snippet shows the number of NaNs for each country.

In [23]:
# check the amount of missing values by country

# define a dictionary with the unique country values as keys and a starting count of 0
countries_count_missing = dict.fromkeys(all_vars_clean['country'].unique(), 0)

# iterate through all rows and count the amount of NaN values for each country
for ind, row in all_vars_clean.iterrows():
    countries_count_missing[row['country']] += row.isnull().sum()

# sort the countries by missing values
countries_missing_sorted = dict(sorted(countries_count_missing.items(), key=lambda item: item[1]))

# print the missing values for each country
print("missing values by country:")
for key, val in countries_missing_sorted.items():
    print(key, ":", val)

missing values by country:
Singapore : 1141
Monaco : 1157
Kuwait : 1163
Nauru : 1176
Latvia : 1179
Brunei Darussalam : 1181
Netherlands : 1181
Iceland : 1182
Germany : 1185
Switzerland : 1187
Estonia : 1189
Qatar : 1189
Antigua and Barbuda : 1191
Bahrain : 1191
Malta : 1191
Trinidad and Tobago : 1191
Finland : 1192
Spain : 1192
Lithuania : 1193
Barbados : 1194
France : 1194
Slovak Republic : 1195
Cyprus : 1196
Hungary : 1196
Solomon Islands : 1197
Canada : 1198
New Zealand : 1198
Oman : 1198
Sweden : 1198
Luxembourg : 1199
Palau : 1199
San Marino : 1199
United Kingdom : 1199
Norway : 1200
Chile : 1201
Poland : 1201
United Arab Emirates : 1201
Czechia : 1206
Israel : 1206
Belgium : 1207
Portugal : 1207
Greece : 1208
Bahamas, The : 1209
Denmark : 1209
St. Kitts and Nevis : 1209
Seychelles : 1210
Korea, Rep. : 1211
Romania : 1211
Saudi Arabia : 1211
Moldova : 1213
Tuvalu : 1213
Ireland : 1214
Italy : 1214
United States : 1214
Australia : 1215
Austria : 1216
Belarus : 1217
Uruguay : 1217
K

This output suggests removing the rows of countries with 1,300 or more missing values, which is also meant to get rid of the entries in which several countries are combined.

In [24]:
print("number of missing values in the whole dataset before filtering the countries:")
print(all_vars_clean.isnull().sum().sum())
print("number of rows before filtering the countries:")
print(all_vars_clean.shape[0])


# keep only rows for countries with fewer than 1300 missing values
countries_filter = []
for key, val in countries_missing_sorted.items():
    if val<1300:
        countries_filter.append(key)

all_vars_clean = all_vars_clean[all_vars_clean['country'].isin(countries_filter)]

print("number of missing values in the whole dataset after filtering the countries:")
print(all_vars_clean.isnull().sum().sum())
print("number of rows after filtering the countries:")
print(all_vars_clean.shape[0])

number of missing values in the whole dataset before filtering the countries:
298817
number of rows before filtering the countries:
4541
number of missing values in the whole dataset after filtering the countries:
239338
number of rows after filtering the countries:
3686
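
The threshold filter above can be written more compactly with a grouped count and `isin`. A sketch on invented data, where the threshold `max_missing = 2` is chosen only for the toy example:

```python
import numpy as np
import pandas as pd

# toy frame with made-up values
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'year': [2002, 2003, 2002, 2003],
    'f1': [1.0, 2.0, np.nan, np.nan],
    'f2': [1.0, np.nan, np.nan, 4.0],
})

max_missing = 2  # hypothetical threshold for the toy data

# total number of NaN cells per country
missing_by_country = toy.isnull().sum(axis=1).groupby(toy['country']).sum()

# keep only the countries below the threshold
keep = missing_by_country[missing_by_country < max_missing].index
filtered = toy[toy['country'].isin(keep)]

print(filtered)
```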


### 6.3 Checking the features (columns) for missing values

The number of NaN values in each column is:

In [30]:
all_vars_clean.isnull().sum().sort_values(ascending=False)

Land Surface Temperature                                                                                     3686
Economic and Social Rights Performance Score                                                                 3686
Standardised Precipitation-Evapotranspiration Index                                                          3686
School enrollment, primary and secondary (gross), gender parity index (GPI)                                  3686
Rule of Law: Estimate                                                                                        3686
Research and development expenditure (% of GDP)                                                              3686
Regulatory Quality: Estimate                                                                                 3686
Ratio of female to male labor force participation rate (%) (modeled ILO estimate)                            3686
Population ages 65 and above (% of total population)                                    

In [34]:
# check which countries have missing CO2 emissions
countries_cco2_missing = all_vars_clean[all_vars_clean['CO2 emissions (metric tons per capita)'].isnull()]['country'].unique()
# delete all rows with missing CO2 emissions
all_vars_clean = all_vars_clean.dropna(subset=['CO2 emissions (metric tons per capita)'])

all_vars_clean.isnull().sum().sort_values(ascending=False)

country                                                                                                      0
Level of water stress: freshwater withdrawal as a proportion of available freshwater resources               0
Prevalence of overweight (% of adults)                                                                       0
Poverty headcount ratio at national poverty lines (% of population)                                          0
Population density (people per sq. km of land area)                                                          0
Population ages 65 and above (% of total population)                                                         0
Political Stability and Absence of Violence/Terrorism: Estimate                                              0
PM2.5 air pollution, mean annual exposure (micrograms per cubic meter)                                       0
People using safely managed sanitation services (% of population)                                            0
P