# **Understanding and Predicting Economic Development: A Machine Learning Approach**

This Jupyter notebook presents the code, with explanations, for data acquisition, cleaning and preprocessing, and the subsequent fitting of both supervised and unsupervised machine learning models.

A markdown cell above each code cell explains the functions, inputs and outputs of the code being executed.

Please do not change any line of code in this file without understanding what it does; otherwise the code **might** break.

# **CELL 1**

The cell below installs the wbdata library, a client for the World Bank API that gives us access to the World Bank Development Indicators database. It also installs the fancyimpute library, which is used later in the code to fill missing values in the spooled dataset, and openpyxl, which pandas uses to read and write Excel files.
This cell is executed independently because Python distributions and Integrated Development Environments (IDEs) such as Anaconda and PyCharm may not have these libraries installed.

In [1]:
#Install Required Libraries
!pip install wbdata
!pip install fancyimpute
!pip install openpyxl

# **CELL 2**

The cell below imports the libraries required by the functions and dataframes in the code below. These include pandas for working with DataFrames, scikit-learn for the machine learning models, and seaborn and Matplotlib for plotting graphs.
!!If any of these imports raises an error, install the missing library with *!pip install XXX* in your Python environment.
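
As a rough aid for the note above, the sketch below lists which of the notebook's imports cannot be found in the current environment. The helper name `missing_modules` and the `REQUIRED` list are illustrative and not part of the original notebook. Keep in mind that the import name and the pip package name can differ (`sklearn` installs as `scikit-learn`, `imblearn` as `imbalanced-learn`).

```python
import importlib.util

# Import names used in the import cell of this notebook (illustrative list).
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "wbdata", "fancyimpute",
            "sklearn", "statsmodels", "imblearn", "xgboost"]

def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Missing modules:", missing_modules(REQUIRED))
```

Any name printed here would need to be installed before the import cell is run.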

In [2]:
#Import Required Modules and Libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import wbdata    
from fancyimpute import KNN
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
sns.set()
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
import sklearn.utils
from sklearn.model_selection import train_test_split
import imblearn
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# **CELL 2**

The cell below initializes five variables in memory:
* indicators: A dictionary that maps the World Bank code of each variable we are interested in using for our model to the associated name of that variable.
* countries_list: A list of all 217 countries (economies) whose data the code will request from the World Bank API.
* developing_economies: A list that contains the names of all developing economies as classified by the World Economic Situation and Prospects 2022 report by the United Nations.
* emerging_economies: A list that contains the names of all transitioning economies as classified by the World Economic Situation and Prospects 2022 report by the United Nations.
* developed_economies: A list that contains the names of all developed economies as classified by the World Economic Situation and Prospects 2022 report by the United Nations.


In [3]:
indicators = {"NE.IMP.GNFS.KD" : "Imports of goods and services (constant 2015 US$)",
              "NE.EXP.GNFS.KD" : "Exports of goods and services (constant 2015 US$)",
              "NY.GDP.MKTP.KD" : "GDP (constant 2015 US$)",
              "SP.POP.TOTL" : "Population, total",
              "SH.XPD.GHED.GD.ZS" : "Domestic general government health expenditure (% of GDP)",
              "SH.XPD.OOPC.CH.ZS" : "Out-of-pocket expenditure (% of current health expenditure)",
              "SE.XPD.TOTL.GB.ZS" : "Government expenditure on education, total (% of government expenditure)",
              "EG.USE.ELEC.KH.PC" : "Electric power consumption (kWh per capita)",
              "NE.GDI.TOTL.ZS" : "Gross capital formation (% of GDP)",
              "BN.KLT.DINV.CD" : "Foreign direct investment, net (BoP, current US$)",
              "MS.MIL.XPND.GD.ZS" : "Military expenditure (% of GDP)",
              "NY.GDP.PCAP.KD" : "GDP per capita (constant 2015 US$)", 
              "SL.UEM.TOTL.ZS" : "Unemployment, total (% of total labor force) (modeled ILO estimate)",
              "SP.DYN.TFRT.IN" : "Fertility rate, total (births per woman)",
              "SP.DYN.IMRT.IN" : "Mortality rate, infant (per 1,000 live births)",
              "SP.DYN.LE00.IN" : "Life expectancy at birth, total (years)",
              "SE.ENR.PRIM.FM.ZS" : "School enrollment, primary (gross), gender parity index (GPI)",
              "NV.IND.TOTL.ZS" : "Industry (including construction), value added (% of GDP)",
              "NV.IND.MANF.ZS" : "Manufacturing, value added (% of GDP)",
              "IS.AIR.PSGR" : "Air transport, passengers carried",
              "EN.ATM.CO2E.KT" : "CO2 emissions (kt)",
              "NY.GDP.MINR.RT.ZS" : "Mineral rents (% of GDP)",
              "TM.VAL.FOOD.ZS.UN" : "Food imports (% of merchandise imports)",
              "IQ.CPA.TRAN.XQ" : "CPIA transparency, accountability, and corruption in the public sector rating (1=low to 6=high)",
              "IQ.CPA.FINQ.XQ" : "CPIA quality of budgetary and financial management rating (1=low to 6=high)",
              "IQ.CPA.PROP.XQ" : "CPIA property rights and rule-based governance rating (1=low to 6=high)",     
              "IQ.CPA.TRAD.XQ" : "CPIA trade rating (1=low to 6=high)"}
countries_list = ["Afghanistan","Albania","Algeria","American Samoa","Andorra","Angola","Antigua and Barbuda","Argentina","Armenia","Aruba","Australia","Austria","Azerbaijan","Bahamas, The","Bahrain","Bangladesh","Barbados","Belarus","Belgium","Belize","Benin","Bermuda","Bhutan","Bolivia","Bosnia and Herzegovina","Botswana","Brazil","British Virgin Islands","Brunei Darussalam","Bulgaria","Burkina Faso","Burundi",
"Cabo Verde","Cambodia","Cameroon","Canada","Cayman Islands","Central African Republic","Chad","Channel Islands","Chile","China","Colombia","Comoros","Congo, Dem. Rep.","Congo, Rep.","Costa Rica","Cote d'Ivoire","Croatia","Cuba","Curacao","Cyprus","Czech Republic","Denmark","Djibouti","Dominica","Dominican Republic","Ecuador","Egypt, Arab Rep.","El Salvador","Equatorial Guinea","Eritrea","Estonia","Eswatini","Ethiopia",
"Faroe Islands","Fiji","Finland","France","French Polynesia","Gabon","Gambia, The","Georgia","Germany","Ghana","Gibraltar","Greece","Greenland","Grenada","Guam","Guatemala","Guinea","Guinea-Bissau","Guyana","Haiti","Honduras","Hong Kong SAR, China","Hungary","Iceland","India","Indonesia","Iran, Islamic Rep.","Iraq","Ireland","Isle of Man","Israel","Italy",
"Jamaica","Japan","Jordan","Kazakhstan","Kenya","Kiribati","Korea, Dem. People's Rep.","Korea, Rep.","Kosovo","Kuwait","Kyrgyz Republic","Lao PDR","Latvia","Lebanon","Lesotho","Liberia","Libya","Liechtenstein","Lithuania","Luxembourg","Macao SAR, China","Madagascar","Malawi","Malaysia","Maldives","Mali","Malta","Marshall Islands","Mauritania","Mauritius","Mexico","Micronesia, Fed. Sts.",
"Moldova","Monaco","Mongolia","Montenegro","Morocco","Mozambique","Myanmar","Namibia","Nauru","Nepal","Netherlands","New Caledonia","New Zealand","Nicaragua","Niger","Nigeria","North Macedonia","Northern Mariana Islands","Norway","Oman","Pakistan","Palau","Panama","Papua New Guinea","Paraguay","Peru","Philippines","Poland","Portugal","Puerto Rico","Qatar","Romania","Russian Federation","Rwanda",
"Samoa","San Marino","Sao Tome and Principe","Saudi Arabia","Senegal","Serbia","Seychelles","Sierra Leone","Singapore","Sint Maarten (Dutch part)","Slovak Republic","Slovenia","Solomon Islands","Somalia","South Africa","South Sudan","Spain","Sri Lanka","St. Kitts and Nevis","St. Lucia","St. Martin (French part)","St. Vincent and the Grenadines","Sudan","Suriname","Sweden","Switzerland","Syrian Arab Republic",
"Tajikistan","Tanzania","Thailand","Timor-Leste","Togo","Tonga","Trinidad and Tobago","Tunisia","Turkey","Turkmenistan","Turks and Caicos Islands","Tuvalu","Uganda","Ukraine","United Arab Emirates","United Kingdom","United States","Uruguay","Uzbekistan","Vanuatu","Venezuela, RB","Vietnam","Virgin Islands (U.S.)","West Bank and Gaza","Yemen, Rep.","Zambia","Zimbabwe"]
developing_economies = ["Algeria", "Egypt, Arab Rep.", "Libya", "Mauritania", "Morocco", "Sudan", "Tunisia", "Cameroon", "Central African Republic", "Chad", "Congo, Rep.", "Equatorial Guinea", 
"Sao Tome and Principe", "Gabon", "Burundi", "Comoros", "Congo, Dem. Rep.", "Djibouti", "Eritrea", "Ethiopia", "Kenya", "Madagascar", "Rwanda", "Somalia", "South Sudan", "Uganda", "Tanzania", "Angola", 
"Botswana", "Eswatini", "Lesotho", "Malawi", "Mauritius", "Mozambique", "Namibia", "South Africa", "Zambia", "Zimbabwe", "Benin", "Burkina Faso", "Cabo Verde", "Cote d'Ivoire", "Gambia, The", "Ghana", "Guinea", 
"Guinea-Bissau", "Liberia", "Mali", "Niger", "Nigeria", "Senegal", "Sierra Leone", "Togo", "Brunei Darussalam", "Cambodia", "China", "Korea, Dem. People's Rep.", "Fiji", "Hong Kong SAR, China", "Indonesia", "Kiribati", "Lao PDR", "Malaysia", "Mongolia", 
"Myanmar", "Papua New Guinea", "Philippines", "Korea, Rep.", "Samoa", "Singapore", "Solomon Islands", "Taiwan", "Thailand", "Timor-Leste", "Vanuatu", "Vietnam", "Afghanistan", "Bangladesh", "Bhutan", "India", "Iran, Islamic Rep.", "Maldives", "Nepal", "Pakistan", "Sri Lanka", "Bahrain", "Iraq", "Israel", "Jordan", 
"Kuwait", "Lebanon", "Oman", "Qatar", "Saudi Arabia", "State of Palestine", "Syrian Arab Republic", "Turkey", "United Arab Emirates", "Yemen, Rep.", "Bahamas, The", "Barbados", "Belize", "Guyana", 
"Jamaica", "Suriname", "Trinidad and Tobago", "Costa Rica", "Cuba", "Dominican Republic", "El Salvador", "Guatemala", "Haiti", "Honduras", "Mexico", "Nicaragua", "Panama", "Argentina", "Bolivia", "Brazil", 
"Chile", "Colombia", "Ecuador", "Paraguay", "Peru", "Uruguay", "Venezuela", "Venezuela, RB", 'Seychelles', 'Tonga']
emerging_economies = ["Albania", "Bosnia and Herzegovina", "Montenegro", "North Macedonia", "Serbia", "Armenia", "Azerbaijan", "Belarus", "Georgia", "Kazakhstan", "Kyrgyz Republic", "Moldova", "Russian Federation", "Tajikistan", "Turkmenistan", "Ukraine", "Uzbekistan"]
developed_economies = ["Canada", "United States", "Australia", "Japan", "New Zealand", "Austria", "Belgium", "Denmark", "Finland", "France", "Germany", "Greece", "Ireland", "Italy", "Luxembourg", "Netherlands", "Portugal", "Spain", "Sweden", "Bulgaria", "Croatia", "Cyprus", "Czech Republic", "Estonia", "Hungary", "Latvia", "Lithuania", "Malta", "Poland", "Romania", "Slovak Republic", "Slovenia", "Iceland", "Norway", "Switzerland", "United Kingdom"]


# **CELL 3**

The cell below contains the functions used to execute the different stages of data spooling, cleaning and model fitting. The code is organised this way to keep the procedure clear and to give the user some flexibility in changing the inputs to some functions. The purpose of each function is explained below:

**The spool_values function**: This function is the first call to be made in this code structure. It programmatically accesses the World Bank API and downloads the relevant data for our specified variables and countries. Its inputs are the dictionary called "indicators" and the list called "countries_list" defined above. It returns a pandas dataframe containing values from 1960 to 2021 (or the most recent year in the database) for all countries in "countries_list" and all variables in "indicators".

**The clean_data function**: This function is called second and takes the dataframe returned by the "spool_values" function as its input. Its purpose is to systematically reduce missing values in the dataset. It does this as follows:
* It takes in the dataset spooled from the WDI by the spool_values function above.
* If the NaN_drop variable is True, rows with too many missing values are dropped. The NaN_perc variable is the minimum percentage of values that must be present in a row for it to be kept. For example, if NaN_drop is True and NaN_perc is 60%, then every row in our dataset that has more than roughly 40% of its values missing is deleted. If Japan in 1972 has 41% of the variables missing, that year is dropped from our dataset.
* If the country_drop variable is True, countries with too few years of data are dropped. The cut_off variable is the minimum percentage of annual data a country must have to stay in the dataset. For example, if country_drop is True and cut_off is 60%, then if Japan does not have at least 37 years of annual data in the dataset, it is dropped entirely. (The span from 1960 to 2021 is 61 years, and 60% of 61 rounds to 37.)
The issue of missing data is significant given that we are working with many countries and will eventually fill in missing data in our dataset. The idea is to preserve the structure of the data as much as possible, so annual rows and countries that have a lot of missing values are dropped entirely. In the original report we used a threshold of 50% for the values that must be present in a row and 50% for the number of rows (years) which a country must have to remain in the data.
!!Any subsequent users can change the cutoff for rows and countries should they wish to vary the criteria with which missing values are treated in the context explained above. The values can range from 1 to 100.
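
The two thresholds described above can be illustrated on a toy dataframe. This is a minimal sketch using made-up values, not the notebook's real data. Note that, as in `clean_data`, the `country` and `date` columns count toward the number of values present in a row.

```python
import numpy as np
import pandas as pd

# Toy panel with three indicator columns (a, b, c); values are invented.
toy = pd.DataFrame({
    "country": ["Japan", "Japan", "Japan"],
    "date": [1970, 1971, 1972],
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 2.5, np.nan],
    "c": [3.0, 3.5, np.nan],
})

# Row threshold: with NaN_perc = 50, a row needs at least min_count non-missing values.
nan_perc = 50
min_count = int((nan_perc / 100) * toy.shape[1] + 1)   # int(0.5 * 5 + 1) = 3
kept = toy.dropna(axis=0, thresh=min_count)             # the 1972 row has only 2 values

# Country threshold: with cutoff = 60 and data spanning 1960 to 2021.
cutoff_value = int(round((60 / 100) * (2021 - 1960)))   # 60% of 61, rounded

print(kept)
print("Minimum number of yearly rows per country:", cutoff_value)
```

Here the 1972 row is dropped because only `country` and `date` are present, while the 1971 row survives with four of its five values.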

**The fill_values function**: The fill_values function is executed on a country-by-country basis. It uses the dataset returned by the clean_data function above. The function uses the K-nearest neighbor model to predict the missing values in our dataset and fill all missing data. If the neighbors argument is specified as 3, the model will use the three nearest annual observations to a missing data point to predict the missing value.
For example, in our dataset, if the birth rate variable was missing for Japan in 1972, the model could use the 1969, 1970, and 1971 birth rate data points for Japan, if those are the nearest observations, to predict the missing 1972 value. Please note that this imputation is done on a country-by-country basis. The output of this function is a dataframe with all missing values predicted.
!!Any subsequent users can change the neighbors argument to other values should they wish to change how the KNN model predicts the missing values. A value of 3 or 5 is usually recommended.
Please see this [link](https://towardsdatascience.com/the-use-of-knn-for-missing-values-cf33d935c637) for more explanation of the KNN imputation technique.
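
To make the idea concrete, here is a minimal sketch of KNN imputation on a toy single-country table. It uses scikit-learn's `KNNImputer` as a stand-in, not the fancyimpute `KNN` class that this notebook calls, so the exact distance and weighting rules may differ from the notebook's results. All values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# One country, four years; the 1972 birth rate is missing.
toy = pd.DataFrame({
    "date": [1969, 1970, 1971, 1972],
    "birth_rate": [2.1, 2.0, 1.9, np.nan],
    "life_expectancy": [71.0, 71.5, 72.0, 72.5],
})

# With three neighbours and uniform weights, the missing value becomes the
# average birth rate of the three nearest rows that have one.
imputer = KNNImputer(n_neighbors=3)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

Because only three rows have a birth rate, all three act as neighbours here and the imputed 1972 value is their mean.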

**The visualise_missing function**: The visualise_missing function is used to create a graphical representation of missing values in our dataset. This is simply an informative graph to help us understand which variables in the dataset have the most missing values. This function is applied to the output of the spool_values, clean_data and fill_values functions to see the share of missing values in the dataframe.

**The standardization function**: This function standardizes the dataframe returned by the "fill_values" function. Standardizing a dataset involves rescaling the distribution of values so that the mean of observed values is 0 and the standard deviation is 1. This process is essential primarily when the different variables used in ML models have different scales (Brownlee, 2016). It prevents variables with larger magnitudes from biasing the model. The output of this function is the same dataframe, rescaled so that each variable has a mean of 0 and a standard deviation of 1.
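
The rescaling can be written out by hand on a small array. This sketch uses invented numbers and plain NumPy; it divides by the population standard deviation, which is also what scikit-learn's `StandardScaler` does.

```python
import numpy as np

# Two variables on very different scales (invented values).
values = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0]])

# z = (x - column mean) / column standard deviation
z = (values - values.mean(axis=0)) / values.std(axis=0)

print(z)
print("means:", z.mean(axis=0), "standard deviations:", z.std(axis=0))
```

After rescaling, both columns hold the same standardized values even though the raw magnitudes differed by a factor of 100.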

**The feature_engineering function**: This function performs feature engineering by creating three new variables from the variables already in the dataset: trade openness ((imports + exports) / GDP), CO2 emissions per capita (CO2 emissions (kt) / total population) and air transportation per capita (air transport passengers carried / total population). The import, export, total population, CO2 emissions (kt) and air passengers columns are then dropped, and GDP is left out when the columns are reordered, because they are not among the variables which we employ in our modelling.
!!Any subsequent users should delete this function if they do not intend to have these features (variables) in the dataset which they use to fit their own models. An error will be thrown if the variables used in these calculations were not part of the originally spooled variables from the WB database.
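
The ratios above are simple column arithmetic. The sketch below applies them to a one-row toy dataframe with invented numbers and shortened, hypothetical column names; the notebook itself uses the full World Bank indicator names.

```python
import pandas as pd

# Invented values for a single country-year.
toy = pd.DataFrame({
    "imports": [30.0],
    "exports": [20.0],
    "gdp": [100.0],
    "co2_kt": [500.0],
    "air_passengers": [250.0],
    "population": [1000.0],
})

toy["trade_openness"] = (toy["imports"] + toy["exports"]) / toy["gdp"]
toy["co2_kt_per_capita"] = toy["co2_kt"] / toy["population"]
toy["air_passengers_per_capita"] = toy["air_passengers"] / toy["population"]

# Drop the raw inputs once the derived features exist.
toy = toy.drop(columns=["imports", "exports", "co2_kt", "air_passengers", "population"])
print(toy)
```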

**The economy_classification function**: This function adds a column to the output of the "feature_engineering" function in preparation for the supervised machine learning. The new column is the dependent variable, featuring the three classes necessary for the analysis. Developing economies take the class 0, transition economies take the class 1 and developed economies take the class 2. This class assignment is based on the developing_economies, emerging_economies and developed_economies lists specified above, which are drawn from the World Economic Situation and Prospects 2022 report. Countries that appear in none of the three lists are dropped.
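
One compact way to express this labelling is a dictionary lookup. The sketch below is an alternative formulation with tiny stand-in lists, not the row-by-row function used in this notebook; the names `label_of`, `developing`, `transition` and `developed` are illustrative.

```python
import pandas as pd

# Tiny stand-ins for the three economy lists defined earlier.
developing = ["Nigeria"]
transition = ["Ukraine"]
developed = ["Japan"]

# Map each country name to its class: 0 developing, 1 transition, 2 developed.
label_of = {name: label
            for label, group in enumerate([developing, transition, developed])
            for name in group}

toy = pd.DataFrame({"country": ["Nigeria", "Ukraine", "Japan", "Aruba"]})
toy["Economy"] = toy["country"].map(label_of)   # unlisted countries become NaN

# Drop countries that are in none of the lists, then store the labels as integers.
toy = toy.dropna(subset=["Economy"]).astype({"Economy": int})
print(toy)
```

Aruba is in none of the three toy lists, so it is removed, mirroring how unclassified countries are dropped.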

**The unsupervised_ML function**: This function fits an unsupervised ML model to produce clusters for the dataset. Its arguments (inputs) include:
* dataframe: The dataset
* year: The specific year of data on which to run the clustering algorithm
* clusters: The number of clusters to include in our dataset. The report form of this work uses five clusters for the economies in the dataset.
* model: Either the KMeans or GMM model.
The output of this function is a dataset with a list of countries and the cluster assignment for each country, along with two principal component analysis variables should the user wish to visualise the dimensionality reduction of the variables supplied to the model. For the GMM, the probability of each country belonging to each cluster is also returned.
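
The clustering and two-component PCA steps can be sketched on synthetic data. Everything here is invented for illustration: two well-separated groups of hypothetical economies with four features each, and two clusters instead of the five used in the report.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Two synthetic groups of 10 "economies", centred at 0 and at 5.
features = np.vstack([rng.normal(0, 0.5, (10, 4)),
                      rng.normal(5, 0.5, (10, 4))])
countries = pd.DataFrame({"country": [f"economy_{i}" for i in range(20)]})

# Two principal components for plotting; clusters are fitted on the full feature set.
pca = pd.DataFrame(PCA(n_components=2).fit_transform(features), columns=["pca1", "pca2"])
pca["clusters"] = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(features)

# Both frames share the same 0..19 index, so the join lines up row for row.
result = countries.join(pca)
print(result.head())
```

The join only pairs the right country with the right cluster when both frames share the same index, which is why the index is kept aligned before joining.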

**The supervised_ml function**: This function fits a supervised ML model to understand and predict the three classes supplied to the model. Its arguments include:
* dataframe: The dataset
* split_size: The size of the randomly selected test dataset (the size used for the report is 30%)
* model: Either the RFC, SVMC, or XGBC
* excluded_country: The country to be excluded should the user wish to perform an out-of-sample prediction. This country is excluded from the model fitting process and then its class is predicted with the fitted model.
There are two outputs for this function: the fitted model, from which the feature importance listing is derived, and the out-of-sample test data, should the user choose to exclude a country. While the model is being fitted, the confusion matrix is printed.
!!If the user does not want to exclude any country for out-of-sample testing, then this argument should take the value "None".
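
The out-of-sample idea can be sketched on synthetic data: hold one economy out entirely, fit on the rest, then predict the held-out rows. This is a simplified illustration with a random forest only. It omits the SMOTE rebalancing that the notebook applies to the training split, and the countries, features and values are all invented.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Hypothetical toy panel: six economies, ten yearly rows each, two economies per class.
labels = np.repeat([0, 0, 1, 1, 2, 2], 10)
toy = pd.DataFrame({
    "country": np.repeat(["A", "B", "C", "D", "E", "F"], 10),
    "x1": rng.normal(loc=labels * 10.0, scale=1.0),   # informative feature
    "x2": rng.normal(size=labels.size),                # pure noise
    "Economy": labels,
})

# Exclude one economy from fitting so it can be predicted afterwards.
excluded_country = "F"
held_out = toy[toy["country"] == excluded_country]
train_pool = toy[toy["country"] != excluded_country]

X = train_pool.drop(columns=["country", "Economy"])
y = train_pool["Economy"]
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = RandomForestClassifier(random_state=1).fit(x_train, y_train)
test_accuracy = model.score(x_test, y_test)
held_out_pred = model.predict(held_out.drop(columns=["country", "Economy"]))

print("Test accuracy:", test_accuracy)
print("Most common predicted class for", excluded_country, ":", pd.Series(held_out_pred).mode()[0])
```

The held-out economy never appears in either the training or the test split, so its predictions are a genuine out-of-sample check.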

**The feature_importance_graph function**: This function returns a graph of the feature importance for the variables used in any of the three models in the supervised_ml function. It takes the following arguments:
* fitted_model: The variable in which the model returned by supervised_ml was saved
* model_name: Either the RFC, SVMC, or XGBC (this must match the model that was fitted by supervised_ml.)
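
As a rough illustration of where such a ranking comes from, the sketch below reads the `feature_importances_` attribute of a fitted random forest on synthetic data. It covers the random forest case only and is not the notebook's feature_importance_graph function; the feature names are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the two informative features are the
# first two columns and the remaining three are noise.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=1)
names = [f"feature_{i}" for i in range(5)]

model = RandomForestClassifier(random_state=1).fit(X, y)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(model.feature_importances_, index=names).sort_values(ascending=False)
print(importances)
```

Calling `importances.plot.barh()` on the resulting series would draw the ranking as a horizontal bar chart.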

In [34]:
def spool_values(variables, countries):
    """Return the values of the listed indicators for the listed countries.
        Args:
            variables: a dictionary containing the required indicator codes and names
            countries: a list of countries which we require for our dataset
        Returns: a dataframe of data queried from the World Bank Development Indicators database
        """
    wbdata.get_source()
    wbdata.get_indicator(source=2)
    print("Spooling data for " + str(len(variables)) + " variables, across " + str(len(countries)) + " economies.")
    df = wbdata.get_dataframe(variables, country="all", cache=True)
    df = df.reset_index(level=0)
    df = df.reset_index(level=0)
    final = df[df['country'].isin(countries)]
    return final


def clean_data(dataframe, NaN_drop, NaN_perc, country_drop, cutoff):
    """Clean a dataframe based on the number of missing values in each row, and drop countries that have fewer years of data than the cutoff.
        Args:
            dataframe: A dataframe containing the panel data
            NaN_drop: True or False; whether rows that do not meet the NaN_perc threshold should be dropped.
            NaN_perc: The minimum percentage of values that must be present in a row for it to be kept. For example, with NaN_perc = 75, rows in which more than roughly 25% of the values are NaN are deleted. Range: 0 - 100
            country_drop: True or False; whether countries with annual data below the cutoff should be dropped.
            cutoff: The minimum percentage of years each country should have. A country is dropped if it has less. Range: 0 - 100
        Returns: a dataframe with fewer incomplete rows and countries
        """
    print("The original shape of our dataset is: " + str(dataframe.shape) + " with " + str(dataframe['country'].nunique()) + " countries")
    if NaN_drop:
        # Minimum number of non-missing values a row needs in order to be kept
        min_count = int((NaN_perc / 100) * dataframe.shape[1] + 1)
        print("Each row in our dataset must have at least " + str(NaN_perc) + "% of the values present")
        print("We will drop rows that do not meet this condition")
        print("")
        # Use the function argument rather than a global variable named df
        mod_df = dataframe.dropna(axis=0, thresh=min_count)
        cleaned_df = mod_df
    else:
        cleaned_df = dataframe
        mod_df = dataframe
    if country_drop == True:
        last_year = int(mod_df['date'].max())
        first_year = int(mod_df['date'].min())
        cutoff_value = int(round((cutoff/100)*(last_year - first_year)))
        print("Each country in our dataset must have at least " + str(cutoff) + "% of the years present")
        print("We will completely drop countries that do not meet this condition")
        print('')
        # Number of remaining annual rows per country (index: country name)
        counts = mod_df['country'].value_counts()
        # Keep only countries with at least cutoff_value rows
        qualified_countries = sorted(counts[counts >= cutoff_value].index)
        cleaned_df = mod_df[mod_df['country'].isin(qualified_countries)]
        print("The new shape of our dataset is: " + str(cleaned_df.shape) + " with " + str(cleaned_df['country'].nunique()) + " countries")
        print("")
    else:
        print("The new shape of our dataset is: " + str(cleaned_df.shape) + " with " + str(cleaned_df['country'].nunique()) + " countries")
        print('')
    return cleaned_df


def fill_values(dataframe, neighbors):
    """A K-Nearest Neighbor model which fills the NaNs in our dataframe
        Args:
            dataframe: The dataset
            neighbors: The number of neighbors the KNN model uses. Usually integer 3 or 5
        Returns: a dataframe with all missing values (NaNs) imputed
        """
    country_list = sorted(set(list(dataframe['country'])))
    dataset_columns = list(dataframe.columns)
    filled_data = pd.DataFrame()
    for country in country_list:
        print(" Now imputing missing variable for " + str(country))
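        # Select this country's rows and drop the text column so only numeric data reaches the imputer.
        # drop() returns a new dataframe, which avoids modifying a slice of the input in place.
        iteration_incomplete_dataset = dataframe.loc[dataframe['country'] == country].drop(columns=['country'])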
        filled_country_data = pd.DataFrame(KNN(k=neighbors).fit_transform(iteration_incomplete_dataset))
        complete_data = filled_country_data
        complete_data.insert(1, 'country', country)
        filled_data = pd.concat([filled_data, complete_data])
    filled_data.columns = dataset_columns
    filled_data.date = filled_data.date.astype(int)
    return filled_data

def visualise_missing(dataframe):
    """This function will visualise the percentage of missing variables in a dataframe
        Args:
            dataframe; The dataset
        Returns: A graphical representation of missing variable across all variables in the inputted dataframe
        """
    
    # Visualise missing data with seaborn. displot is a figure-level function that
    # creates its own figure, so no separate plt.figure() call is needed.
    sns.displot(data=dataframe.isna().melt(value_name="Variable Missing"), y="variable", hue="Variable Missing", multiple="fill", aspect=1.25)
    plt.savefig("visualizing_missing_data_with_barplot_Seaborn_distplot.png", dpi=100)
    return

def standardization(dataframe):
    """This function will standardize the variables in our dataset so that each has a mean of zero and a standard deviation of 1.
        Args:
            dataframe; The dataset
        Returns: A dataset with the list of countries with all variables standardized
        """
    scaler= StandardScaler()
    standardized = dataframe.copy()
    standardized = standardized.reset_index(drop=True)
    country_column = standardized[['country', 'date']]
    standardized.drop("country", axis=1, inplace=True)
    standardized.drop("date", axis=1, inplace=True)
    standardized_data = scaler.fit_transform(standardized)
    standardized_df = pd.DataFrame(standardized_data, columns=standardized.columns)
    standardized_df = country_column.join(standardized_df)
    return standardized_df

def feature_engineering(filled_data):
    """This function creates the derived variables, such as trade openness, from the complete dataset
        Args:
            filled_data: The dataset
        Returns: A dataset with the complete set of columns which the analysis requires
        """
    # Work on a copy so the caller's dataframe is not modified and the cell can be re-run
    filled_data = filled_data.copy()
    # Feature engineering for trade openness, CO2 emissions (kt per capita) and air passengers per capita
    filled_data["Trade Openness"] = (filled_data["Imports of goods and services (constant 2015 US$)"] + filled_data["Exports of goods and services (constant 2015 US$)"]) / filled_data["GDP (constant 2015 US$)"]
    filled_data["CO2 emissions (kt per capita)"] = filled_data["CO2 emissions (kt)"]/ filled_data["Population, total"]
    filled_data["Air transport, passengers carried per capita"] = filled_data["Air transport, passengers carried"]/ filled_data["Population, total"]
    filled_data.drop("Imports of goods and services (constant 2015 US$)", axis=1, inplace=True)
    filled_data.drop("Exports of goods and services (constant 2015 US$)", axis=1, inplace=True)
    filled_data.drop("Population, total", axis=1, inplace=True)
    filled_data.drop("CO2 emissions (kt)", axis=1, inplace=True)
    filled_data.drop("Air transport, passengers carried", axis=1, inplace=True)
    ##reindex columns
    full_data = filled_data[['date', 'country', "Trade Openness", 'Domestic general government health expenditure (% of GDP)', 'Out-of-pocket expenditure (% of current health expenditure)', 'Government expenditure on education, total (% of government expenditure)', 'Electric power consumption (kWh per capita)', 'Gross capital formation (% of GDP)', 'Foreign direct investment, net (BoP, current US$)', 'Military expenditure (% of GDP)', 'GDP per capita (constant 2015 US$)', 'Unemployment, total (% of total labor force) (modeled ILO estimate)', 'Fertility rate, total (births per woman)', 'Mortality rate, infant (per 1,000 live births)', 'Life expectancy at birth, total (years)', 'School enrollment, primary (gross), gender parity index (GPI)', 'Industry (including construction), value added (% of GDP)', 'Manufacturing, value added (% of GDP)', "Air transport, passengers carried per capita", "CO2 emissions (kt per capita)", 'Mineral rents (% of GDP)', 'Food imports (% of merchandise imports)', 'CPIA transparency, accountability, and corruption in the public sector rating (1=low to 6=high)', 'CPIA quality of budgetary and financial management rating (1=low to 6=high)', 'CPIA property rights and rule-based governance rating (1=low to 6=high)', 'CPIA trade rating (1=low to 6=high)']]
    return full_data

def economy_classification(dataset):
    """This function creates an extra column, the dependent variable with three classes, in preparation for the supervised ML model.
        Args:
            dataset: The dataset
        Returns: A dataset with an extra column which is the economic classification: developing (0), transitioning (1), developed (2)
         """
    ml_data = dataset.copy()
    def dictionary_function(dataframe):
        # Called once per row: look up the row's country in the three economy lists
        if dataframe['country'] in developing_economies:
            return 0
        elif dataframe['country'] in emerging_economies:
            return 1
        elif dataframe['country'] in developed_economies:
            return 2
        else:
            # Countries in none of the lists are flagged and dropped below
            return "missing"
    
    ml_data['Economy'] = ml_data.apply(dictionary_function, axis=1)
    ml_data.drop(ml_data.index[ml_data['Economy'] == "missing"], inplace=True)
    return ml_data

def unsupervised_ML(dataframe, year, clusters, model):
    """This function will run an unsupervised machine learning model with the specified number of clusters for a specific year in our dataset.
        Args:
            dataframe: The dataset
            year: The year to run the clustering on
            clusters: The number of clusters to include in our dataset
            model: Either "KMeans" or "GMM" (Gaussian Mixture Model)
        Returns: A dataset with the list of countries, the Principal Component Analysis 2-dimensional representation and the relevant clusters.
        """
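    data = dataframe.copy()
    data = data.loc[data['date'] == int(year)]
    # Drop incomplete rows first, then reset the index so that the country column
    # lines up row for row with the PCA output it is joined to below
    data = data.dropna().reset_index(drop=True)
    country_subset_column = data[['country']]
    # Keep only the feature columns (everything after date and country)
    data = data.iloc[:,2:]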
    
    if model == "KMeans":
        clustering_kmeans = KMeans(n_clusters=clusters, random_state=1)
        pca_num_components = 2
        reduced_data = PCA(n_components=pca_num_components).fit_transform(data)
        pca = pd.DataFrame(reduced_data,columns=['pca1','pca2'])
        pca['clusters'] = clustering_kmeans.fit_predict(data)
        result = country_subset_column.join(pca)
        return result
        
    elif model == "GMM":
        gmm = GaussianMixture(n_components=(int(clusters)), random_state=1)
        pca_num_components = 2
        reduced_data = PCA(n_components=pca_num_components).fit_transform(data)
        pca = pd.DataFrame(reduced_data,columns=['pca1','pca2'])
        # Fit once and take the cluster labels and probabilities from the same fitted model
        pca['clusters'] = gmm.fit_predict(data)
        # One probability column per cluster, so any number of clusters is supported
        prob_columns = ['Cluster ' + str(i) + ' Prob' for i in range(int(clusters))]
        probs = pd.DataFrame(gmm.predict_proba(data), columns=prob_columns).round(3)
        result = country_subset_column.join(pca)
        result = result.join(probs)
        return result
        
    else:
        print("You have not selected either a KMeans model or a Gaussian Mixture Model(GMM). Please reinput the correct function parameters")
        
        
def supervised_ml(dataframe, split_size, model, excluded_country):
    """This function will run a supervised machine learning model with a split ratio for the size of the test dataset.
        Args:
            dataframe: The dataset
            split_size: The size of the randomly selected test dataset
            model: Either the RFC, SVMC, or XGBC
            excluded_country: The country to be excluded should the user wish to perform an out-of-sample prediction. This country is excluded from the model fitting process and then its class is predicted with the fitted model. Use "None" to exclude no country.
        Returns: A fitted Machine Learning Model, together with the out-of-sample data for the excluded country (an empty dataframe if no country is excluded)
        """
    input_df= dataframe.copy()
    test_country = pd.DataFrame()
    
    if excluded_country == "None":
        print("We have not excluded any country to test the prediction accuracy.")
    else:
        print("We will exclude " +str(excluded_country) + " from the learning process of the various ML models, then predict its classification later")
        excluded_country_df = input_df.loc[input_df['country'] == str(excluded_country)]
        #DataFrame.append was removed in pandas 2.0, so copy the selected rows directly
        test_country = excluded_country_df.copy().reset_index(drop= True)
    
    input_df.drop(input_df.index[input_df['country'] == excluded_country], inplace = True)
    input_df.drop(columns = ["country","date"], inplace=True) 
    input_df = input_df.dropna()    
    input_df = sklearn.utils.shuffle(input_df)
    input_df = input_df.reset_index(drop= True)
    input_df["Economy"] = input_df['Economy'].astype('int')
    
    X = input_df.drop(columns= 'Economy')
    y = input_df['Economy'] 
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size= split_size, random_state= 1 )
    smote = SMOTE(random_state=1)
    x_train_balanced, y_balanced = smote.fit_resample(x_train, y_train)
    
    if model == "RFC":
        clf_forest=RandomForestClassifier(random_state=1)
        clf_forest.fit(x_train_balanced, y_balanced)
        y_pred_forest=clf_forest.predict(x_test)
        print("Accuracy for Random Forest Classifier:",metrics.accuracy_score(y_test, y_pred_forest))
        print(classification_report(y_test, y_pred_forest))
        if excluded_country == "None":  
            return clf_forest, test_country
        else:
            test_country_variables = test_country.drop(['Economy', 'country', 'date'], axis=1)
            test_country_prediction = clf_forest.predict(test_country_variables)
            result = pd.DataFrame(test_country_prediction,columns=['Classification'])
            test_country["Classification"] = result["Classification"]  
            return clf_forest, test_country
        
    elif model == "SVMC":
        svmc=SVC() 
        svmc.fit(x_train_balanced, y_balanced)
        y_pred_svmc = svmc.predict(x_test)
        print("Accuracy for Support Vector Machine Classifier:",metrics.accuracy_score(y_test, y_pred_svmc))
        print(classification_report(y_test, y_pred_svmc))
        if excluded_country == "None":
            return svmc, test_country
        else:
            test_country_variables = test_country.drop(['Economy', 'country', 'date'], axis=1)
            test_country_prediction = svmc.predict(test_country_variables)
            result = pd.DataFrame(test_country_prediction,columns=['Classification'])
            test_country["Classification"] = result["Classification"]      
            return svmc, test_country
    
    elif model == "XGBC":
        clf_xgb = XGBClassifier(random_state=1) 
        clf_xgb.fit(x_train_balanced, y_balanced)
        y_pred_xgb = clf_xgb.predict(x_test)
        accuracy = metrics.accuracy_score(y_test, y_pred_xgb)
        print("Accuracy for XGBoost Classifier: %.4f%%" % (accuracy *100))
        print(classification_report(y_test, y_pred_xgb))
        if excluded_country == "None":
            return clf_xgb, test_country
        else:
            test_country_variables = test_country.drop(['Economy', 'country', 'date'], axis=1)
            test_country_prediction = clf_xgb.predict(test_country_variables)
            result = pd.DataFrame(test_country_prediction,columns=['Classification'])
            test_country["Classification"] = result["Classification"]      
            return clf_xgb, test_country
        
    else:
        print("You have not selected one of RFC, SVMC or XGBC. Please call the function again with a valid model parameter.")


def feature_importance_graph(fitted_model, model_name):
    """This function will return a graph showing the feature importance for the fitted ML model
        Args:
            fitted_model: The fitted model returned by supervised_ml
            model_name: Either "RFC", "SVMC", or "XGBC"; it must match the model that was actually fitted
        Returns: A DataFrame with the features (variables) and their weighted importance for the model's classification goal. It also plots a bar chart.
        """
    if model_name in ("RFC", "XGBC"):
        #Note: the feature names are read from the global ml_data created in CELL 10, so that cell must be run first
        features = ml_data.drop(columns= ["Economy", "country", 'date'])
        predictors = [x for x in features.columns]
        feat_imp = pd.Series(fitted_model.feature_importances_, predictors).sort_values(ascending=False)
        feat_imp = feat_imp[0:50]
        plt.rcParams['figure.figsize'] = 20, 5
        feat_imp.plot(kind='bar', title='Feature Importance')
        plt.ylabel('Feature Importance Score')
        feat_imp = feat_imp.copy()
        feature_importance_df = pd.DataFrame({'Feature': feat_imp.index, str(model_name): feat_imp.values})
        return feature_importance_df
        
    elif model_name == "SVMC":
        print("The Support Vector Machine Model was fitted using the default kernel: rbf. Thus it cannot output a feature importance listing")

    else:
        print("You have not selected one of RFC, SVMC or XGBC. Please call the function again with a valid model parameter.")

# **CELL 4**
The cell below executes the function that pulls the required dataset from the World Bank (WB) database and then visualises the ratio of missing values.
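
As a rough illustration of what a missing-value ratio is, the sketch below computes the percentage of missing entries per column on a small made-up DataFrame. The column names and values are hypothetical, and this is not the notebook's `visualise_missing` function, which is defined in an earlier cell.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real dataset comes from spool_values()
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gdp": [1.0, np.nan, 3.0, 4.0],
    "co2": [np.nan, np.nan, 0.5, 0.7],
})

# Percentage of missing entries in each column
missing_ratio = toy.isna().mean() * 100
print(missing_ratio)
```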

In [5]:
#This code will pull the relevant datasets from the World Bank repository
df = spool_values(indicators, countries_list)
visualise_missing(df)

# **CELL 5**
The cell below executes the function that cleans missing values based on the criteria passed in, and then visualises the missing values that remain.
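
The idea of dropping countries with too many missing values can be sketched as follows. This is only an illustration on made-up data: it assumes the threshold is a percentage of missing cells per country, which may differ from how `clean_data` (defined in an earlier cell) interprets its arguments. All names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two countries and two indicators
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "gdp": [1.0, 2.0, np.nan, np.nan],
    "co2": [0.3, np.nan, np.nan, np.nan],
})
threshold = 50  # assumed: maximum tolerated percentage of missing values

# Percentage of missing cells per country across all indicator columns
indicators_only = toy.drop(columns="country")
missing_by_country = (
    indicators_only.isna().astype(float).groupby(toy["country"]).mean().mean(axis=1) * 100
)

# Keep only the countries at or below the threshold
keep = missing_by_country[missing_by_country <= threshold].index
cleaned = toy[toy["country"].isin(keep)]
```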

In [6]:
#This code will delimit the dataset by dropping countries with too many missing variables
cleaned_df = clean_data(df, True, 50, True, 50)
visualise_missing(cleaned_df)

# **CELL 6**
The cell below executes the function that fills missing values in our dataset using the K-Nearest Neighbours (KNN) imputation technique.
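
For readers unfamiliar with KNN imputation, here is a minimal sketch on made-up numbers. It uses scikit-learn's `KNNImputer` as a stand-in for the fancyimpute `KNN` class imported in CELL 2, and it assumes that the second argument of `fill_values` is the number of neighbours; neither assumption is confirmed by the notebook.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix with one missing entry
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [4.0, 8.0],
])

# Each missing entry is replaced by the mean of that feature over the nearest rows
imputer = KNNImputer(n_neighbors=3)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Observed values are left untouched; only the missing cell is estimated from its neighbours.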

In [7]:
#This code will fill the missing variables in the dataset
filled_data = fill_values(cleaned_df, 3)
#visualise_missing(filled_data)

# **CELL 7**
The cell below will simply execute the function to perform feature engineering and create the variables which were specified before.
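
As a hedged illustration of this kind of feature engineering, the sketch below derives two ratios on made-up data. It assumes the common definition of trade openness as (exports + imports) / GDP and of per-capita emissions as emissions / population; the actual formulas and column names used by `feature_engineering` are defined in an earlier cell and may differ.

```python
import pandas as pd

# Hypothetical columns and values, for illustration only
toy = pd.DataFrame({
    "exports": [20.0, 50.0],
    "imports": [30.0, 50.0],
    "gdp": [100.0, 400.0],
    "co2_kt": [10.0, 40.0],
    "population": [5.0, 10.0],
})

# Assumed definitions of the two engineered variables
toy["trade_openness"] = (toy["exports"] + toy["imports"]) / toy["gdp"]
toy["co2_kt_per_capita"] = toy["co2_kt"] / toy["population"]
print(toy[["trade_openness", "co2_kt_per_capita"]])
```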

In [8]:
#This code performs the feature engineering for trade openness and CO2 emissions (kt per capita)
full_data = feature_engineering(filled_data)

# **CELL 8**
The cell below will simply execute the function to standardize all the data in the dataset.
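
Standardization rescales each variable to zero mean and unit variance, i.e. z = (x - mean) / standard deviation. A minimal sketch with the `StandardScaler` imported in CELL 2 is shown below; the toy columns are hypothetical and the notebook's own `standardization` function may handle the country and date columns differently.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy indicators on very different scales
toy = pd.DataFrame({"gdp": [1.0, 2.0, 3.0], "co2": [10.0, 20.0, 60.0]})

# Fit the scaler and rebuild a DataFrame with the original column names
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)
print(scaled)
```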

In [9]:
#This executes the standardisation functions
standardized_df = standardization(full_data)

# **CELL 9**
The cell below executes the function that performs the unsupervised ML on a specific year (2000 here) using 5 clusters and a specified model: KMeans or GMM.
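
The sketch below illustrates the two clustering options on synthetic data: KMeans assigns each row to exactly one cluster, while a Gaussian Mixture Model also returns a membership probability for every cluster. PCA is used only to obtain two coordinates for plotting. The data and the choice of two clusters are made up for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two well-separated synthetic groups with 4 features each
X = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(8, 1, (20, 4))])

# Hard cluster labels from KMeans
labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

# Soft membership probabilities from a Gaussian Mixture Model
gmm = GaussianMixture(n_components=2, random_state=1).fit(X)
probs = gmm.predict_proba(X)

# Two principal components, used only as plotting coordinates
coords = PCA(n_components=2).fit_transform(X)
```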

In [10]:
#This executes the specified unsupervised ML model
KMeans_unsupervised = unsupervised_ML(standardized_df, 2000, 5, "KMeans")

# **CELL 10**
The cell below will simply execute the function to create the 3 classes needed for the dependent variable, in preparation for the supervised ML.

In [11]:
#Label economies: developed, developing and transitioning in preparation for the supervised ML model
ml_data = economy_classification(standardized_df)

# **CELL 11**
The cell below will simply execute the function to perform the supervised ML on our data using a 30% test data split for validation and a specified model: RFC, SVMC or XGBC.
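
A simplified sketch of a 30% hold-out evaluation is given below on synthetic data. It leaves out the SMOTE balancing step that `supervised_ml` applies to the training set, and the generated dataset is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic three-class dataset standing in for the labelled economies
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=1
)

# Hold out 30% of the rows for validation
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Fit on the training rows only, then score on the held-out rows
clf = RandomForestClassifier(random_state=1).fit(x_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(x_test))
print("Hold-out accuracy:", accuracy)
```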

In [61]:
#This executes the specified supervised ML model
model_output, test_country = supervised_ml(ml_data, 0.3, "SVMC", "None")

In [60]:
#This cell will output the prediction of the selected model against the World Bank classification for a selected out-of-sample country (if any has been selected)
test_country.head(35)

# **CELL 12**
The cell below will simply execute the function to visualise the variable importance of the specific ML model executed above. The output can also be analysed as a dataframe.
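
Tree-based models such as Random Forest expose a `feature_importances_` attribute after fitting. The sketch below shows how such scores can be paired with feature names and sorted; the data and the feature names are synthetic and hypothetical, not the notebook's indicators.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic dataset with hypothetical feature names
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=1
)
names = [f"feature_{i}" for i in range(5)]

clf = RandomForestClassifier(random_state=1).fit(X, y)

# Pair each importance score with its feature name, most important first
importance = pd.Series(clf.feature_importances_, index=names).sort_values(ascending=False)
print(importance)
```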

In [33]:
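#This will visualise the feature importance of the model fitted above (model_output must come from an RFC or XGBC fit, so rerun CELL 11 with "RFC" before running this cell)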
RFC = feature_importance_graph(model_output, "RFC")

In [29]:
RFC.to_excel("RFC Output.xlsx")