In [1]:
# importing libraries
import pandas as pd
import numpy as np

In [2]:
# importing and validating dataset

data = pd.read_csv('../data/raw/assembled_data.csv')
data.drop(['Unnamed: 0'], axis=1, inplace=True)
print(data.shape)
print(data.head())


(154, 35)
      Country Country Code  Net Trade Status  Price Control in 2021  \
0    Albania           ALB                 1                      0   
1    Algeria           DZA                 2                      1   
2     Angola           AGO                 1                      1   
3  Argentina           ARG                 1                      1   
4    Armenia           ARM                 1                      0   

   Subsidy in 2021  Tax Reduction in 2021  Price Control in 2022  \
0                1                      0                      1   
1                2                      0                      1   
2                2                      0                      1   
3                0                      0                      1   
4                0                      0                      0   

   Subsidy in 2022  Tax Reduction in 2022  Price Control in 2023  ...  \
0                1                      0                      1  ...   
1       

In the last notebook, we found that there were some missing values in the fundamental data. Let's show them again here.

In [3]:
for name, column in data.items():
    if name[0:7] != 'Country':
        print(name)
        print(np.sum(column.isnull()))

Net Trade Status
0
Price Control in 2021
0
Subsidy in 2021
0
Tax Reduction in 2021
0
Price Control in 2022
0
Subsidy in 2022
0
Tax Reduction in 2022
0
Price Control in 2023
0
Subsidy in 2023
0
Tax Reduction in 2023
0
Price Control in 2024
0
Subsidy in 2024
0
Tax Reduction in 2024
0
gdp data in 2020
3
gdp data in 2021
3
gdp data in 2022
3
gdp data in 2023
4
gdp data in 2024
10
population data in 2020
2
population data in 2021
2
population data in 2022
2
population data in 2023
2
population data in 2024
2
land area data in 2020
2
land area data in 2021
2
land area data in 2022
2
land area data in 2023
2
land area data in 2024
154
democracy index data in 2020
12
democracy index data in 2021
12
democracy index data in 2022
12
democracy index data in 2023
12
democracy index data in 2024
12
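
The per-column counts above can also be collapsed into one short table that lists only the columns with gaps. The following is a minimal sketch on a small made-up frame; the `toy` data and its values are illustrative assumptions, not the real dataset. `isnull().sum()` gives one count per column, and `isnull().all()` flags columns that are empty in every row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'gdp data in 2020': [1.0, np.nan, 3.0],
    'land area data in 2024': [np.nan, np.nan, np.nan],
    'Subsidy in 2021': [0, 1, 2],
})

# Count nulls per column, then keep only the columns that have any.
null_counts = toy.isnull().sum()
missing_summary = null_counts[null_counts > 0].sort_values(ascending=False)
print(missing_summary)

# Columns that are empty in every row are candidates for removal.
all_missing = toy.columns[toy.isnull().all()].tolist()
print(all_missing)
```

On the real data, the same two expressions could be applied to `data` directly.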


Now, we will take a look at which rows contain those missing values. First, we will remove the 2024 land area data, since this is missing in all rows. This will allow for a much cleaner analysis of the missing values in the remaining rows.

In [4]:
pd.set_option('display.max_columns', None)
data.drop(['land area data in 2024'], axis=1, inplace=True)
countries_with_missing_values = data[data.isnull().any(axis=1)]
print(countries_with_missing_values)

                               Country Country Code  Net Trade Status  \
8                        Bahamas, The           BHS                 1   
11                           Barbados           BRB                 1   
13                             Belize           BLZ                 1   
15                            Bermuda           BMU                 1   
16                             Bhutan           BTN                 1   
28                     Cayman Islands           CYM                 1   
39                             Curacao          CUW                 1   
43                           Dominica           DMA                 1   
50                           Ethiopia           ETH                 1   
54                      French Guiana           GUF                 1   
61                          Greenland           GRL                 1   
62                            Grenada           GRD                 1   
80                       Korea, South           KOR

That is our full list of 17 countries with at least 1 null value. We will work through them column by column, sometimes filling in additional columns along the way, depending on the method used to treat the missing values.

To start, we will look at French Guiana and Martinique. Both are missing all fundamental data! A quick search on Britannica shows that both are overseas departments of France rather than sovereign states. While both have some local government, they ultimately answer to the government of France, so their ability to set their own policies in this area is limited compared to sovereign nations. Since the dataset contains very few such territories, the best way to handle them is to exclude them.

Next, Venezuela is the only other country missing GDP data for every year. Given the current political situation there, this is unsurprising. However, the data is relatively easy to find from other sources. Here, we will use Statista (https://www.statista.com/statistics/370937/gross-domestic-product-gdp-in-venezuela/?srsltid=AfmBOoqJLUsuZrpsNU1CcOt4JMvIF0saBfbmhjNFr6XHO7E-lfnTdY9H) to get the 5 data points we need. Note that this is total GDP, not per capita, so we will need to divide by the population.

Ethiopia is the only remaining country missing GDP data for 2023. The figure used here comes from https://tradingeconomics.com/ethiopia/gdp. It is also total GDP, so it is divided by the population as well.

In [5]:

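# Total GDP figures for Venezuela (row label 150), converted to per capita below.
Venezuela_GDP = {2020: 42.84 * (10 ** 9), 2021: 56.62 * (10 ** 9), 2022: 89.01 * (10 ** 9), 2023: 102.38 * (10 ** 9), 2024: 119.81 * (10 ** 9)}
for year, total_gdp in Venezuela_GDP.items():
    data.at[150, f'gdp data in {year}'] = total_gdp / data.at[150, f'population data in {year}']
# Ethiopia (row label 50): total GDP for 2023, also converted to per capita.
data.at[50, 'gdp data in 2023'] = 163.7 * (10 ** 9) / data.at[50, 'population data in 2023']
# Exclude French Guiana and Martinique (row labels 54 and 93).
data.drop([54, 93], axis=0, inplace=True)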
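
Row labels such as 150 and 50 depend on the row order of the CSV, so they can silently point at the wrong country if the file changes. A hedged alternative is to select rows by country code instead. The sketch below uses a made-up two-row frame and made-up totals (`toy` and `total_gdp_2023` are hypothetical names and values), and only illustrates the lookup pattern.

```python
import pandas as pd

# Hypothetical toy frame; the numbers are made up for illustration.
toy = pd.DataFrame({
    'Country Code': ['ETH', 'VEN'],
    'gdp data in 2023': [float('nan'), float('nan')],
    'population data in 2023': [100.0, 25.0],
})

# Total GDP figures keyed by country code (illustrative values only).
total_gdp_2023 = {'ETH': 2000.0, 'VEN': 500.0}

for code, total in total_gdp_2023.items():
    # The boolean mask selects the row by country code instead of by position.
    mask = toy['Country Code'] == code
    # Dividing total GDP by population gives GDP per capita.
    toy.loc[mask, 'gdp data in 2023'] = total / toy.loc[mask, 'population data in 2023']

print(toy)
```

The same mask-based selection would work for dropping rows, for example by filtering on a list of codes to exclude.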


In [6]:
pd.set_option('display.max_columns', None)
countries_with_missing_values = data[data.isnull().any(axis=1)]
print(countries_with_missing_values)

                               Country Country Code  Net Trade Status  \
8                        Bahamas, The           BHS                 1   
11                           Barbados           BRB                 1   
13                             Belize           BLZ                 1   
15                            Bermuda           BMU                 1   
16                             Bhutan           BTN                 1   
28                     Cayman Islands           CYM                 1   
39                             Curacao          CUW                 1   
43                           Dominica           DMA                 1   
50                           Ethiopia           ETH                 1   
61                          Greenland           GRL                 1   
62                            Grenada           GRD                 1   
80                       Korea, South           KOR                 2   
84                            Lebanon           LBN

With all that done, the only missing values left are in the 2024 GDP column and the democracy index columns. Recall that the goal is to predict which policies will be implemented in a given year before they are actually implemented. Therefore, when making the prediction, we will not have fundamental data from that year. Since the oil subsidy data only goes up to 2024, we have no use for the 2024 fundamental data. We can therefore drop those columns, and all that's left to look at is the democracy indexes.
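
To make the year alignment concrete, here is a minimal sketch of how features and targets might be paired. It assumes a one-year lag (policies in year t predicted from fundamentals in year t-1), which is one possible reading of the setup and not necessarily the scheme used in later notebooks. The helper `split_for_year` and the `toy` frame are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame using the same column naming pattern as the dataset.
toy = pd.DataFrame({
    'Subsidy in 2021': [1, 0],
    'Subsidy in 2022': [2, 0],
    'gdp data in 2020': [5.0, 7.0],
    'gdp data in 2021': [5.5, 7.2],
    'gdp data in 2022': [5.8, 7.1],
})

def split_for_year(frame, policy_year):
    """Return (features, target), using fundamentals from the year before."""
    suffix = f'data in {policy_year - 1}'
    feature_cols = [c for c in frame.columns if c.endswith(suffix)]
    target_col = f'Subsidy in {policy_year}'
    return frame[feature_cols], frame[target_col]

X, y = split_for_year(toy, 2022)
print(list(X.columns))
print(y.name)
```

Under this assumed lag, the latest policy year (2024) only needs fundamentals up to 2023, which is why the 2024 fundamental columns are never read.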

In [7]:
data.drop(['gdp data in 2024', 'population data in 2024', 'democracy index data in 2024'], axis=1, inplace=True)
countries_with_missing_values = data[data.isnull().any(axis=1)]
print(countries_with_missing_values)

                               Country Country Code  Net Trade Status  \
8                        Bahamas, The           BHS                 1   
11                           Barbados           BRB                 1   
13                             Belize           BLZ                 1   
15                            Bermuda           BMU                 1   
28                     Cayman Islands           CYM                 1   
39                             Curacao          CUW                 1   
43                           Dominica           DMA                 1   
61                          Greenland           GRL                 1   
62                            Grenada           GRD                 1   
133  Saint Vincent and the Grenadines           VCT                 1   

     Price Control in 2021  Subsidy in 2021  Tax Reduction in 2021  \
8                        1                0                      0   
11                       1                0             

Just 10 countries are left with missing democracy indexes!

Of these, 4 are not actually countries, but territories: Bermuda and the Cayman Islands are British territories, Curacao is a Dutch territory, and Greenland is Denmark's. For consistency with the earlier exclusions, these will be excluded. The other 6 are independent countries. A similar index can be found on Freedom House's website here: https://freedomhouse.org/country. While the index is not exactly the same, it is sufficient for the purposes of this analysis, since we are only looking to capture large differences between democracies and authoritarian regimes. In addition, since democracy indexes don't change quickly, when a country has no value for a given year we will reuse the value from the nearest year that does.

In [8]:
data.set_index('Country Code', inplace=True)
# Values pulled from Freedom House and moved onto a 1-10 scale.
# When the data does not go back as far as 2020, the nearest available year is used.
data_dict = {
    'Country Code': ['BHS', 'BRB', 'BLZ', 'DMA', 'GRD', 'VCT'],
    2020: [9.1, 9.5, 8.7, 9.3, 8.9, 9.1],
    2021: [9.1, 9.5, 8.7, 9.3, 8.9, 9.1],
    2022: [9.1, 9.5, 8.7, 9.3, 8.9, 9.1],
    2023: [9.1, 9.4, 8.7, 9.3, 8.9, 9.1],
}
democracy_indexes = pd.DataFrame(data_dict)
democracy_indexes.set_index('Country Code', inplace=True)
democracy_indexes.rename(columns=lambda x: 'democracy index data in ' + str(x), inplace=True)
data.loc[['BHS', 'BRB', 'BLZ', 'DMA', 'GRD', 'VCT'], democracy_indexes.columns] = democracy_indexes
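
The values above were entered by hand. As a rough illustration of the same idea in code, the sketch below rescales raw scores and fills missing years from neighboring years. It assumes the raw scores are on a 0-100 scale and that dividing by 10 is an acceptable conversion; the `raw` frame, its country codes, and its numbers are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical raw scores on a 0-100 scale; NaN marks years with no value.
raw = pd.DataFrame(
    {2020: [np.nan, 95.0], 2021: [np.nan, 95.0], 2022: [91.0, 95.0], 2023: [91.0, 94.0]},
    index=pd.Index(['AAA', 'BBB'], name='Country Code'),
)

# Rescale to a 10-point scale (assumed conversion: divide by 10).
scaled = raw / 10

# Fill gaps along each row: carry values forward first, then backward, so
# years before the first available value reuse that first value.
filled = scaled.ffill(axis=1).bfill(axis=1)

# Match the dataset's column naming pattern.
filled = filled.rename(columns=lambda year: f'democracy index data in {year}')
print(filled)
```

A frame built this way has the same index and column layout as `democracy_indexes`, so it could be assigned into `data` with the same `.loc` pattern.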

In [9]:
for name, column in data.items():
    if name != 'Country':
        print(name)
        print(np.sum(column.isnull()))

Net Trade Status
0
Price Control in 2021
0
Subsidy in 2021
0
Tax Reduction in 2021
0
Price Control in 2022
0
Subsidy in 2022
0
Tax Reduction in 2022
0
Price Control in 2023
0
Subsidy in 2023
0
Tax Reduction in 2023
0
Price Control in 2024
0
Subsidy in 2024
0
Tax Reduction in 2024
0
gdp data in 2020
0
gdp data in 2021
0
gdp data in 2022
0
gdp data in 2023
0
population data in 2020
0
population data in 2021
0
population data in 2022
0
population data in 2023
0
land area data in 2020
0
land area data in 2021
0
land area data in 2022
0
land area data in 2023
0
democracy index data in 2020
4
democracy index data in 2021
4
democracy index data in 2022
4
democracy index data in 2023
4


We still need to remove the 4 rows for territories that are not actually countries, and then export the data.

In [10]:
data.drop(['BMU', 'CYM', 'CUW', 'GRL'], axis=0, inplace=True)
for name, column in data.items():
    if name != 'Country':
        print(name)
        print(np.sum(column.isnull()))

Net Trade Status
0
Price Control in 2021
0
Subsidy in 2021
0
Tax Reduction in 2021
0
Price Control in 2022
0
Subsidy in 2022
0
Tax Reduction in 2022
0
Price Control in 2023
0
Subsidy in 2023
0
Tax Reduction in 2023
0
Price Control in 2024
0
Subsidy in 2024
0
Tax Reduction in 2024
0
gdp data in 2020
0
gdp data in 2021
0
gdp data in 2022
0
gdp data in 2023
0
population data in 2020
0
population data in 2021
0
population data in 2022
0
population data in 2023
0
land area data in 2020
0
land area data in 2021
0
land area data in 2022
0
land area data in 2023
0
democracy index data in 2020
0
democracy index data in 2021
0
democracy index data in 2022
0
democracy index data in 2023
0


In [11]:
data.to_csv('../data/interim/preprocessed_data.csv')
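
A small guard run before an export like the one above can catch missing values that slip back in if the upstream data changes. This is a minimal sketch; `assert_complete` is a hypothetical helper, and the `toy` frame is made up.

```python
import numpy as np
import pandas as pd

def assert_complete(frame):
    """Raise ValueError if any column still contains null values."""
    null_counts = frame.isnull().sum()
    incomplete = null_counts[null_counts > 0]
    if not incomplete.empty:
        raise ValueError(f'Columns with missing values: {incomplete.to_dict()}')
    return frame

# Hypothetical toy frame with no gaps: the check passes and returns the frame.
toy = pd.DataFrame(
    {'gdp data in 2023': [1.0, 2.0]},
    index=pd.Index(['AAA', 'BBB'], name='Country Code'),
)
checked = assert_complete(toy)

# A frame with a gap is rejected before anything is written to disk.
broken = toy.copy()
broken.loc['AAA', 'gdp data in 2023'] = np.nan
try:
    assert_complete(broken)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Because the helper returns the frame unchanged, it could be chained directly in front of `to_csv`.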

That completes this notebook. In it, we dealt with all the missing values in our master dataframe by removing a few rows that didn't make sense to include, pulling in additional data sources, and extrapolating from existing data. In the next notebook, we will do some reformatting and create some new columns and dataframes to get the data fully ready for machine learning models.