<h1 style="text-align: center;" markdown="1">Machine Learning Algorithms for Digital Wage Payment Prediction</h1> 
<h2 style="text-align: center;" markdown="2">An AIMS Masters project in Collaboration with the Global Centre on Digital Wages for Decent Work (ILO)</h2>


> *Digital payments are becoming increasingly widespread and important. This project investigates the probability that an individual in Africa receives wages digitally. To that end, it carries out a series of empirical comparative assessments of machine learning classification algorithms to determine how effectively they predict digital wage payments. This notebook forms part of a larger project that explores the potential of machine learning in addressing financial inclusion issues in Africa.*

<h1 style="text-align: center;" markdown="3"> Preliminary Data Exploration and Preparation</h1> 

# Table of Contents
[Load the Original Data File](#load-data)   
[Save CSV Files](#save-data)

## Load the Original Data File <a class="anchor" id="load-data"></a>

First, we load a few essential modules used in this notebook. For convenience, several utility functions used throughout the project are defined in the `load_data.py` file in the `src/data` directory.

In [1]:
%matplotlib inline

import os
import sys
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

# Add local functions to the path
sys.path.append(os.path.join(os.pardir, 'src'))
from data import load_data

We first define a function that drops every column in which more than 25% of the values are missing.

In [2]:
def drop_cols_by_na(df, threshold=0.75):
    """
    Drop columns whose proportion of non-NA values is below `threshold`
    
    Parameters:
    -----------
    df : pd.DataFrame
        The dataframe to process
        
    threshold : float, default 0.75
        The minimum proportion of non-NA values to keep the column
        
    Returns:
    --------
    pd.DataFrame
        The processed dataframe with dropped columns
    """
    num_rows = df.shape[0]
    non_na_counts = df.notna().sum()
    keep_cols = non_na_counts[non_na_counts >= num_rows * threshold].index
    return df[keep_cols]
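
As a quick illustration of the thresholding rule, the sketch below applies the same "keep columns with at least 75% non-NA values" criterion to a small made-up frame. It uses pandas' built-in `DataFrame.dropna(thresh=...)` rather than the function above; the helper name `drop_cols_by_na_builtin` and the toy data are assumptions introduced only for this example.

```python
import math

import numpy as np
import pandas as pd


def drop_cols_by_na_builtin(df, threshold=0.75):
    """Hypothetical variant of drop_cols_by_na built on DataFrame.dropna."""
    # dropna(thresh=k) keeps columns with at least k non-NA values,
    # so the proportion is converted to an integer count first.
    min_non_na = math.ceil(df.shape[0] * threshold)
    return df.dropna(axis="columns", thresh=min_non_na)


# Toy data (made up): 'a' is complete, 'b' is 75% complete, 'c' is 50% complete
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, 3.0, 4.0],
})

kept = drop_cols_by_na_builtin(toy, threshold=0.75)
print(list(kept.columns))  # ['a', 'b']
```

Column `c` is dropped because only half of its values are present, while `b` sits exactly on the 75% boundary and is kept.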


We create another function to load the data we require. Several data cleaning procedures have also been implemented within this function to ensure that the final output is partially clean and ready for the next step of preparation.

In [3]:
def get_african_data(filepath):
    """
    Load the original Global Findex Database CSV file, keep the respondents
    from African countries, and return them in a cleaned format
    
    Parameters:
    -----------
    filepath : str
        Path to the original CSV file
    
    Returns:
    --------
    pd.DataFrame
        The processed dataframe 
    """
    # load data from the path supplied by the caller
    global_data_2021 = pd.read_csv(filepath, encoding='latin-1')
     
        
    #subset rows for African countries
    africa_and_middle_east_data_2021 = global_data_2021[global_data_2021['regionwb'].isin(
        ['Middle East & North Africa (excluding high income)', 'Sub-Saharan Africa (excluding high income)'])]
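    # take an explicit copy so that the new columns added below are written to
    # an independent dataframe rather than to a slice of the global data
    africa_data_2021 = africa_and_middle_east_data_2021[~africa_and_middle_east_data_2021['economy'].isin(
        ['Iran, Islamic Rep.', 'Jordan', 'Iraq', 'Lebanon', 'West Bank and Gaza' ])].copy()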
    
    
    #recalibrate the weights to account for population size differences between the countries
    #get the sum of unscaled weights for each country
    sum_weights_by_country = africa_data_2021.groupby('economy')['wgt'].sum()
    # compute the population-scaled weight for each observation:
    # wgt * pop_adult / (sum of wgt within the respondent's country)
    africa_data_2021['pop_scaled_wgt'] = (africa_data_2021['wgt'] * africa_data_2021['pop_adult'] / 
                                          africa_data_2021['economy'].map(sum_weights_by_country))
    
    
    #reorder columns
    columns = africa_data_2021.columns.to_list()
    new_column_order = columns[0:6] + [columns[len(columns)-1]] + columns[6:len(columns)-1]
    africa_data_2021 = africa_data_2021.reindex(new_column_order, axis='columns')
    
    
    #encode the regions
    #define function to encode the regions
    def encode_regions(data):
        data.loc[(data['regionwb'] == 'Middle East & North Africa (excluding high income)'), 'regionwb'] = 1
        data.loc[(data['regionwb'] == 'Sub-Saharan Africa (excluding high income)'), 'regionwb'] = 2
        return data
    # call the function and assign the returned value back
    africa_data_2021 = encode_regions(africa_data_2021)
    
    
    #Select only individuals who receive wages
    africa_data_2021 = africa_data_2021[(africa_data_2021['fin32']==1)]
    africa_data_2021.drop(['fin32'], axis=1, inplace = True)
    
    #drop columns with less than 75% non-NA values of total values
    africa_data_2021 = drop_cols_by_na(africa_data_2021, threshold=0.75)
    #drop columns irrelevant to the analysis
    africa_data_2021.drop(['economycode', 'wpid_random', 'wgt', 'pop_adult', 'fin45_1', 
                           'anydigpayment', 'fin20', 'receive_agriculture', 'fin45'], axis=1, inplace = True)
    
    
    #drop columns with duplicate information (same info in other columns)
    '''
    --------------------------------------------------------------
    | Remove                          | Retain                    |
    |-------------------------------------------------------------|
    | account_fin, account_mob, fin2  | account                   |
    | saved                           | fin16, fin17a, fin17b     |
    | borrowed                        | fin20, fin22a, fin22b     |
    | fin24a, fin24b                  | fin24                     |
    | fin26, fin28                    | remittances               |
    | fin34a, fin34b, fin34d, fin34e  | receive_wages             |
    | fin37                           | receive_transfers         |
    | fin38                           | receive_pension           |
    | fin42                           | receive_agriculture       |
    | fin44a, fin44b, fin44c, fin44d  | fin45                     |
    | fin30                           | pay_utilities             |
    |_____________________________________________________________|
    
    '''
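    # 'fin34d' and 'fin34e' are omitted here because drop_cols_by_na() already removed them;
    # 'fin20', 'fin45' and 'receive_agriculture' in the "Retain" column were dropped above as irrelevant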
    africa_data_2021.drop(['account_fin', 'account_mob', 'fin2', 'saved', 'borrowed', 'fin24a', 'fin24b', 
                           'fin26', 'fin28', 'fin34a', 'fin34b', 'fin37', 'fin38', 'fin42', 
                           'fin44a', 'fin44b', 'fin44c', 'fin44d', 'fin30'], axis=1, inplace = True)
    
    
    #fold "don't know" (dk) and "refused" (rf) codes into an existing category (usually "no")
    def replace_values(df):
        replacements = {
            'educ': {4: 1, 5: 1},
#             'fin2': {3: 2, 4: 2},
            'fin14_1': {3: 2, 4: 2},
            'fin14a': {3: 2, 4: 2},
            'fin14a1': {3: 2, 4: 2},
            'fin14b': {3: 2, 4: 2},
            'fin16': {3: 2, 4: 2},
            'fin17a': {3: 2, 4: 2},
            'fin17b': {3: 2, 4: 2},
#             'fin20': {3: 2, 4: 2},
            'fin22a': {3: 2, 4: 2},
            'fin22b': {3: 2, 4: 2},
            'fin24': {8: 7, 9: 7},
#             'fin26': {3: 2, 4: 2},
#             'fin28': {3: 2, 4: 2},
#             'fin32': {3: 2, 4: 2},
            'fin33': {3: 2, 4: 2},
            'fin45': {6: 5},
#             'fin45_1': {5: 3, 4: 3},
            'receive_transfers': {5: 4},
            'receive_pension': {5: 4},
#             'receive_agriculture': {5: 4},
            'pay_utilities': {5: 4},
            'remittances': {6: 5},
            'mobileowner': {3: 2, 4: 2},
            'internetaccess': {3: 2, 4: 2}
        }
        return df.replace(replacements)
    africa_data_2021 = replace_values(africa_data_2021)
    
    # merge receive transfers and receive pension to one feature receive welfare payments
    africa_data_2021["receive_welfare_payments"] = 1
    africa_data_2021.loc[(africa_data_2021['receive_transfers'] == 4) | (africa_data_2021['receive_pension'] == 4), 
                     'receive_welfare_payments'] = 2
    africa_data_2021.drop(['receive_transfers', 'receive_pension'], axis=1, inplace = True)
    
    #merge fin14a, fin14a1 and fin14b into one feature internet_fin_transc (used internet for financial transactions)
    africa_data_2021["internet_fin_transc"] = 2
    africa_data_2021.loc[(africa_data_2021['fin14a'] == 1) | (africa_data_2021['fin14a1'] == 1) | 
                         (africa_data_2021['fin14b'] == 1), 'internet_fin_transc'] = 1
    africa_data_2021.drop(['fin14a', 'fin14a1', 'fin14b'], axis=1, inplace = True)
    
    return africa_data_2021
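
The weight recalibration inside `get_african_data` can be checked on a toy example. The sketch below is only an illustration with made-up numbers: it computes the same quantity, `wgt * pop_adult / (sum of wgt within the country)`, using `groupby(...).transform('sum')`, and shows the property that motivates the rescaling, namely that the scaled weights of each country add up to its adult population.

```python
import pandas as pd

# Made-up survey rows: two economies with different adult populations
toy = pd.DataFrame({
    "economy": ["A", "A", "B", "B", "B"],
    "wgt": [1.0, 3.0, 2.0, 2.0, 4.0],
    "pop_adult": [100.0, 100.0, 800.0, 800.0, 800.0],
})

# Sum of unscaled weights per country, broadcast back to every row
country_wgt_sum = toy.groupby("economy")["wgt"].transform("sum")

# Population-scaled weight for each observation
toy["pop_scaled_wgt"] = toy["wgt"] * toy["pop_adult"] / country_wgt_sum

# Within each country the scaled weights sum to the adult population
totals = toy.groupby("economy")["pop_scaled_wgt"].sum()
print(totals.to_dict())  # {'A': 100.0, 'B': 800.0}
```

With weights scaled this way, respondents from a more populous country contribute proportionally more to any pooled, weighted statistic.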

We now apply `get_african_data`, which calls `drop_cols_by_na` internally, to the file downloaded from the Global Findex Database.

In [4]:
filepath = load_data.ORG
africa_data_2021 = get_african_data(filepath)
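
To make the recoding and merging steps performed inside `get_african_data` easier to follow, here is a minimal sketch on made-up rows. It assumes, following the comments in the function, that codes 3 and 4 stand for "don't know" and "refused", that 1 means "yes" and 2 means "no"; the toy values are not taken from the Findex data.

```python
import numpy as np
import pandas as pd

# Made-up responses for the three internet-related questions
toy = pd.DataFrame({
    "fin14a": [1, 2, 3, 2],
    "fin14a1": [2, 2, 4, 1],
    "fin14b": [2, 4, 2, 2],
})

# Column-specific recoding: map codes 3 and 4 to 2 ("no")
recode = {col: {3: 2, 4: 2} for col in ["fin14a", "fin14a1", "fin14b"]}
toy = toy.replace(recode)

# Merged indicator: 1 if any of the three answers is "yes", otherwise 2
answered_yes = toy[["fin14a", "fin14a1", "fin14b"]].eq(1).any(axis=1)
toy["internet_fin_transc"] = np.where(answered_yes, 1, 2)

print(toy["internet_fin_transc"].tolist())  # [1, 2, 2, 1]
```

The third row shows the effect of the recoding: its "don't know" and "refused" answers count as "no", so the merged feature is 2.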

## Save CSV Files <a class="anchor" id="save-data"></a>

We first save the full feature set as a CSV file at the path given by `load_data.FULL_MERGED`.

In [5]:
filepath = load_data.FULL_MERGED
africa_data_2021.to_csv(filepath, encoding='utf-8-sig')

We then save a second CSV file containing only the demographic-related features.

In [6]:
DEMOGRAPHIC_FEATURES = ['economy', 'regionwb', 'pop_scaled_wgt', 'female', 'age', 'educ', 'inc_q', 'emp_in', 
                        'urbanicity_f2f', 'account', 'fin33', 'receive_wages', 'mobileowner', 
                        'internetaccess', 'receive_welfare_payments']
africa_data_2021 = africa_data_2021[DEMOGRAPHIC_FEATURES]

In [7]:
filepath = load_data.DEM_MERGED
africa_data_2021.to_csv(filepath, encoding='utf-8-sig')
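
Because `to_csv` writes the dataframe index as the first column by default, a later notebook that reads these files back would typically pass `index_col=0` to recover the original row labels. The round trip below is a small sketch on made-up data written to a temporary directory; the file name `toy.csv` is an assumption for illustration and is unrelated to the project paths in `load_data`.

```python
import os
import tempfile

import pandas as pd

# Made-up rows with a non-default index, as left behind by row filtering
toy = pd.DataFrame({"economy": ["Kenya", "Ghana"], "age": [31, 45]}, index=[10, 42])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    # Same call pattern as in the cells above: the index is written as the first column
    toy.to_csv(path, encoding="utf-8-sig")
    # index_col=0 restores the index instead of creating an extra unnamed column
    restored = pd.read_csv(path, encoding="utf-8-sig", index_col=0)

print(restored.index.tolist())    # [10, 42]
print(restored.columns.tolist())  # ['economy', 'age']
```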