<h1 style="text-align: center;" markdown="1">Machine Learning Algorithms for Digital Wage Payment Prediction</h1> 
<h2 style="text-align: center;" markdown="2">An AIMS Masters project in Collaboration with the Global Centre on Digital Wages for Decent Work (ILO)</h2>


> *The widespread adoption of digital payments has become increasingly important. In light of this, the aim of this project is to investigate the probability of an individual receiving digital wages in Africa. To achieve this, a series of empirical comparative assessments of machine learning classification algorithms will be conducted. The objective is to determine the effectiveness of these algorithms in predicting digital wage payments. Therefore, this notebook forms part of a larger project that seeks to explore the potential of machine learning in addressing issues related to financial inclusion in Africa.*

<h1 style="text-align: center;" markdown="3"> Final Data Preparation</h1> 

# Table of Contents
[Load Raw Data](#load-data)  
[Inspect and Clean Data](#inspect)      
[Create Test/Train Split and Save Data](#save-data)

## Load the Raw Data <a class="anchor" id="load-data"></a>

First, we load a few essential modules used in this notebook. Several utility functions developed for this project live in the `load_data.py` file in the `src/data` directory and are used throughout for convenience.

In [1]:
%matplotlib inline

import os
import sys
import numpy as np
import pandas as pd


from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import KNNImputer
from sklearn.impute import IterativeImputer

from sklearn.model_selection import train_test_split


# Add local functions to the path
sys.path.append(os.path.join(os.pardir, 'src'))
from data import load_data
from features import process_features

## Inspect and Clean Data <a class="anchor" id="inspect"></a>

To begin, we will load the data. The data is in a uniform format, allowing us to develop useful functions that can be applied to all datasets.

For this project, we are utilizing survey data that has been mostly cleaned and formatted into CSV files. However, there are a few outstanding issues that we will address as we create the data loading function.


1. To ensure that the categorical variables are not read as numeric, we convert all variables to their appropriate data types after reading them using `pandas.read_csv()`.

1. Our dataset's target variable is `receive_wages`, and since this is a binary classification task, we only need two categories. However, this variable has three categories. To simplify this, we can merge categories 1 (`received payments into an account`) and 3 (`received payments using other methods`) into a single category, which will represent those who received their wages digitally. This category is coded as 1 thereafter.

1. The `age` variable contains single-year values, which we can group into more meaningful categories based on the [UN's recommended standard international age classifications](https://unstats.un.org/unsd/publication/seriesm/seriesm_74e.pdf). Since the minimum age in the dataset is 15, we group the ages into four categories: 15-24 (youth), 25-44 (young adulthood), 45-64 (middle adulthood), and 65+ (older adulthood).

1. The first column in the dataset is unnamed and represents the row position of each observation in the original uncleaned dataset (`data/raw/GLOBAL/global_data_2021.csv`). We rename it as `id` and set it as the index.
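
The age grouping in step 3 can also be expressed with `pandas.cut`. The snippet below is a minimal sketch on made-up ages, not the project's loader; the bin edges are chosen so that the right-inclusive intervals match the four groups listed above, and the labels 1 to 4 mirror the codes used in this notebook.

```python
import pandas as pd

# Toy ages for illustration only; the real values come from the survey CSV files.
ages = pd.Series([15, 24, 25, 44, 45, 64, 65, 90, None])

# Right-inclusive bins: (14, 24], (24, 44], (44, 64], (64, inf]
age_group = pd.cut(
    ages,
    bins=[14, 24, 44, 64, float("inf")],
    labels=[1, 2, 3, 4],
)

# Missing ages stay missing instead of being forced into a group.
print(age_group.value_counts(sort=False))
```

Unlike repeated `.loc` assignments on the same column, `pd.cut` maps every age in a single pass and returns a categorical directly.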

In [2]:
def load_csv_file(filepath):
    
    """ 
    Load data in correct format from CSV file
    
    Parameters:
    -----------
    filepath : a filepath to the file to be loaded
        
    Returns:
    --------
    data: the processed dataframe 
    
    """
    data = pd.read_csv(filepath)
    
    # drop the regionwb and receive_welfare_payments columns
    data.drop(['regionwb', 'receive_welfare_payments'], axis=1, inplace = True)
    
    #define function to categorize age
    def categorize_age(data):
        data.loc[(data['age'] >= 15) & (data['age'] <= 24), 'age'] = 1
        data.loc[(data['age'] >= 25) & (data['age'] <= 44), 'age'] = 2
        data.loc[(data['age'] >= 45) & (data['age'] <= 64), 'age'] = 3
        data.loc[data['age'] >= 65, 'age'] = 4
        data['age'] = data['age'].astype('category')
        return data
    # call the function and assign the returned value back to data
    data = categorize_age(data)
    
    # treat respondents who received wages through other methods (code 3) as digital wage recipients (code 1)
    # and rename the variable from receive_wages to receive_digital_wages
    data.loc[data.receive_wages == 3, 'receive_wages'] = 1
    data.rename( columns={'receive_wages':'receive_digital_wages'}, inplace=True )
    
    # rename the first column to be id (referencing row position in original uncleaned data) and setting it as index
    data.rename( columns={'Unnamed: 0':'id'}, inplace=True )
    data.set_index('id', inplace = True)
    
    numeric_variables = ['pop_scaled_wgt']
    # convert the categorical variables into the category type
    for c in data.columns:
        if c not in numeric_variables:
            data[c] = data[c].astype('category')
    
    # Drop rows with missing urbanicity_f2f (Algeria, Gabon, Mauritius and Morocco),
    # because these values are missing not at random
    data.dropna(subset = ['urbanicity_f2f'], inplace=True)
    
    data['economy'] = data['economy'].cat.remove_unused_categories()
      
    return data

In [3]:
# def knn_imputation(df, k=3, impute_type='F'):
#     '''
#     Performs K-Nearest Neighbors imputation on a pandas DataFrame with missing categorical values. 
#     The function takes in a pandas DataFrame with missing values in categorical columns. 
#     The function uses KNNImputer from scikit-learn to impute the missing values by computing 
#     distances between the samples in the dataset and their nearest neighbors. 
    
#     Parameters: 
#     --------------
#     df: pandas DataFrame with missing categorical values. 
#     k: number of nearest neighbors to use when performing imputation (default=3).
#     impute_type: string selecting which categorical columns are imputed (default='F').
#                 - if F, the function will impute missing values in the 'fin45' and 'age' columns.
#                 - if D, the function will only impute missing values in the 'age' column.
#                 - if the value is neither F nor D, a ValueError is raised.
                
#     Returns:
#     -------------
#     df: a pandas DataFrame with missing categorical values imputed using KNNImputer
    
#     '''
    
#     if impute_type not in ['F', 'D']:
#         raise ValueError("Invalid value for impute_type. Expected 'F' or 'D', but got {}".format(impute_type))

#     if impute_type == 'F':
#         categorical_cols = ['fin45', 'age', 'economy']
#     else:
#         categorical_cols = ['age', 'economy']

#     data = pd.get_dummies(df, columns=categorical_cols)

#     imputer = KNNImputer(n_neighbors=k)
#     imputed_data = imputer.fit_transform(data)
#     imputed_data = pd.DataFrame(imputed_data, columns=data.columns)

#     if impute_type == 'F':
#         imputed_data['fin45'] = np.argmax(imputed_data[['fin45_1.0', 'fin45_2.0', 'fin45_3.0', 
#                                                         'fin45_4.0', 'fin45_5.0']].values, axis=1)

#     imputed_data['age'] = np.argmax(imputed_data[['age_1.0', 'age_2.0', 'age_3.0', 'age_4.0']].values, axis=1)

#     if impute_type == 'F':
#         df = imputed_data.drop(['fin45_1.0', 'fin45_2.0', 'fin45_3.0','fin45_4.0', 'fin45_5.0'], axis=1)
#     else:
#         df = imputed_data.copy()

#     df = df.drop(['age_1.0', 'age_2.0', 'age_3.0', 'age_4.0'], axis=1)
    
#     numeric_variables = ['pop_scaled_wgt']
    
#     # convert the categorical variables into the category type
#     for c in df.columns:
#         if c not in numeric_variables:
#             df[c] = df[c].astype('category')
            
#     return df

We load the datasets using the `load_csv_file` function defined above.

In [3]:
# Load full feature data
filepath = load_data.FULL_MERGED
full_merged = load_csv_file(filepath)

from sklearn.preprocessing import LabelEncoder

categorical_cols =['economy']

for col in categorical_cols:
    encoder = LabelEncoder()
    full_merged[col] = encoder.fit_transform(full_merged[col])
    
full_merged.economy.unique()
full_merged["economy"] = full_merged["economy"].astype("category")
full_merged.info()


# full_merged = knn_imputation(full_merged, k=3, impute_type='D')

s = 'The full feature dataset has {:,} rows and {:,} columns'
print(s.format(full_merged.shape[0], full_merged.shape[1]))
s = 'Percentage Receiving Digital Wages: {:0.1%} \tPercentage Not Receiving Digital Wages: {:0.1%}'
print(s.format(full_merged.receive_digital_wages.value_counts(normalize=True).loc[1], 
               full_merged.receive_digital_wages.value_counts(normalize=True).loc[2]))
full_merged.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5806 entries, 9023 to 127848
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   economy                5806 non-null   category
 1   pop_scaled_wgt         5806 non-null   float64 
 2   female                 5806 non-null   category
 3   age                    5797 non-null   category
 4   educ                   5806 non-null   category
 5   inc_q                  5806 non-null   category
 6   emp_in                 5806 non-null   category
 7   urbanicity_f2f         5806 non-null   category
 8   account                5806 non-null   category
 9   fin14_1                5806 non-null   category
 10  fin16                  5806 non-null   category
 11  fin17a                 5806 non-null   category
 12  fin17b                 5806 non-null   category
 13  fin22a                 5806 non-null   category
 14  fin22b                 5806 non-nul

| id | economy | pop_scaled_wgt | female | age | educ | inc_q | emp_in | urbanicity_f2f | account | fin14_1 | ... | fin22b | fin24 | fin33 | receive_digital_wages | pay_utilities | remittances | mobileowner | internetaccess | merchantpay_dig | internet_fin_transc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9023 | 0 | 4731.949144 | 2 | 1.0 | 2 | 5 | 1.0 | 1.0 | 1 | 2 | ... | 2 | 2 | 2.0 | 2 | 4 | 1.0 | 1 | 1 | 0.0 | 1 |
| 9026 | 0 | 4469.830001 | 1 | 1.0 | 2 | 3 | 1.0 | 2.0 | 1 | 2 | ... | 2 | 1 | 2.0 | 1 | 2 | 1.0 | 1 | 2 | 0.0 | 2 |
| 9027 | 0 | 17001.509861 | 1 | 3.0 | 2 | 5 | 1.0 | 2.0 | 1 | 2 | ... | 1 | 2 | 1.0 | 1 | 1 | 1.0 | 1 | 1 | 0.0 | 1 |
| 9028 | 0 | 9407.418933 | 2 | 1.0 | 2 | 2 | 1.0 | 1.0 | 1 | 2 | ... | 2 | 3 | 2.0 | 2 | 4 | 1.0 | 1 | 1 | 0.0 | 1 |
| 9034 | 0 | 4830.758167 | 2 | 3.0 | 2 | 5 | 1.0 | 1.0 | 1 | 2 | ... | 1 | 2 | 1.0 | 1 | 4 | 1.0 | 1 | 2 | 0.0 | 1 |


In [70]:
g = full_merged[full_merged['receive_digital_wages']==2]
g.inc_q.value_counts()/g.inc_q.value_counts().sum()

5    0.285243
4    0.225365
3    0.206978
2    0.158416
1    0.123998
Name: inc_q, dtype: float64

In [6]:
# Load data with demographic-related features only
filepath = load_data.DEM_MERGED
dem_merged = load_csv_file(filepath)

# dem_merged = knn_imputation(dem_merged, k=3, impute_type='D')

s = 'The demographic-related feature set data has {:,} rows and {:,} columns'
print(s.format(dem_merged.shape[0], dem_merged.shape[1]))
s = 'Percentage Receiving Digital Wages: {:0.1%} \tPercentage Not Receiving Digital Wages: {:0.1%}'
print(s.format(dem_merged.receive_digital_wages.value_counts(normalize=True).loc[1], 
               dem_merged.receive_digital_wages.value_counts(normalize=True).loc[2]))
dem_merged.head()

The demographic-related feature set data has 5,806 rows and 13 columns
Percentage Receiving Digital Wages: 63.5% 	Percentage Not Receiving Digital Wages: 36.5%


| id | economy | pop_scaled_wgt | female | age | educ | inc_q | emp_in | urbanicity_f2f | account | fin33 | receive_digital_wages | mobileowner | internetaccess |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9023 | Benin | 4731.949144 | 2 | 1.0 | 2 | 5 | 1.0 | 1.0 | 1 | 2.0 | 2 | 1 | 1 |
| 9026 | Benin | 4469.830001 | 1 | 1.0 | 2 | 3 | 1.0 | 2.0 | 1 | 2.0 | 1 | 1 | 2 |
| 9027 | Benin | 17001.509861 | 1 | 3.0 | 2 | 5 | 1.0 | 2.0 | 1 | 1.0 | 1 | 1 | 1 |
| 9028 | Benin | 9407.418933 | 2 | 1.0 | 2 | 2 | 1.0 | 1.0 | 1 | 2.0 | 2 | 1 | 1 |
| 9034 | Benin | 4830.758167 | 2 | 3.0 | 2 | 5 | 1.0 | 1.0 | 1 | 1.0 | 1 | 1 | 2 |


### Convert target variable, `receive_digital_wages`, to a Boolean

This is a binary classification problem, so the target must be a single column, `receive_digital_wages`, that is `True` if the individual receives their wages digitally and `False` otherwise. We obtain it by comparing the existing coded variable with 1, the code for digital wage recipients.

In [7]:
full_merged.receive_digital_wages = (full_merged.receive_digital_wages == 1)
dem_merged.receive_digital_wages = (dem_merged.receive_digital_wages == 1)
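
To make the recoding concrete, here is a small sketch of the full path from the three raw codes to the Boolean target. The series `raw` is invented for illustration; the code meanings follow the description earlier in this notebook (1 and 3 are digital, 2 is not).

```python
import pandas as pd

# Hypothetical raw answers: 1 = paid into an account, 2 = not paid digitally,
# 3 = paid using other (digital) methods.
raw = pd.Series([1, 2, 3, 2, 1], name="receive_wages")

# Merge code 3 into code 1, as done in load_csv_file, then store as a category.
merged = raw.replace({3: 1}).astype("category")

# Comparing with 1 yields a plain Boolean column suitable as a binary target.
target = (merged == 1).rename("receive_digital_wages")

print(target.tolist())
print("share receiving digital wages:", target.mean())
```

The comparison works on a categorical column as well as on an integer one, which is why no explicit type cast is needed before it.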

## Create Train/Test Split and Save Data <a class="anchor" id="save-data"></a>

As a final step, we want to split the data into training and test sets which will be used by all of the algorithms. We'll reserve 25% of the data as a test set. 

Each dataset to be analysed will have two associated pickle files: train and test.

We will create two sets of training and test data for both the full feature set and the demographic-related feature set. In one set, the categorical variables are kept as they are. In the other, they are replaced by dummy variables.

In [8]:
# Split the data into Train and Test Sets
full_merged_train, full_merged_test = train_test_split(full_merged, 
                                       test_size=0.25,
                                       random_state=1500,
                                       stratify=full_merged.receive_digital_wages)

dem_merged_train, dem_merged_test = train_test_split(dem_merged, 
                                       test_size=0.25,
                                       random_state=1500,
                                       stratify=dem_merged.receive_digital_wages)

# Save data to files
TRAIN_PATH, TEST_PATH = load_data.get_data_filepaths('full_merged')
full_merged_train.to_pickle(TRAIN_PATH)
full_merged_test.to_pickle(TEST_PATH)

TRAIN_PATH, TEST_PATH = load_data.get_data_filepaths('dem_merged')
dem_merged_train.to_pickle(TRAIN_PATH)
dem_merged_test.to_pickle(TEST_PATH)
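
The effect of `stratify` can be checked on a toy frame. This is only a sketch: the frame `toy` and its 64/36 class balance are made up, while `test_size` and `random_state` copy the values used above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a 64% / 36% class balance (illustrative, not the survey data).
toy = pd.DataFrame({
    "x": range(200),
    "receive_digital_wages": [True] * 128 + [False] * 72,
})

train, test = train_test_split(
    toy,
    test_size=0.25,
    random_state=1500,
    stratify=toy["receive_digital_wages"],
)

# With stratification, both parts keep (almost exactly) the original class share.
print(len(train), len(test))
print(train["receive_digital_wages"].mean(), test["receive_digital_wages"].mean())
```

Without `stratify`, the class share in a test set of this size could drift by a few percentage points from one random seed to the next.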


### Dummy variables <a class="anchor" id="dummy-variables"></a>

In the context of classification algorithms, categorical variables may need to be converted to numeric inputs. One popular method is to create dummy variables, where a binary column is created for each unique category of a categorical feature. However, it is not necessary to create a dummy variable for every category, as this leads to a multicollinearity issue known as the dummy variable trap: the n dummy columns built from n categories always sum to one, so each column is a linear combination of the others and the intercept. To avoid this, we drop the first dummy variable of each categorical variable. Pandas provides a useful function, `get_dummies()`, that simplifies this process.
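
The dummy variable trap can be seen directly from the rank of the design matrix. The following sketch uses a made-up three-level `educ` column and adds an explicit intercept column; it is an illustration of the idea, not part of the project pipeline.

```python
import numpy as np
import pandas as pd

# Toy categorical feature with three levels.
educ = pd.DataFrame({"educ": pd.Categorical([1, 2, 3, 1, 2, 3])})

full = pd.get_dummies(educ, prefix_sep="__").astype(float)
reduced = pd.get_dummies(educ, drop_first=True, prefix_sep="__").astype(float)

# Add an intercept column, as a linear model would.
X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
X_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])

# The full encoding has more columns than its rank: it is perfectly collinear.
print(list(full.columns), X_full.shape[1], np.linalg.matrix_rank(X_full))
# Dropping the first level restores full column rank.
print(list(reduced.columns), X_reduced.shape[1], np.linalg.matrix_rank(X_reduced))
```

The dropped level becomes the reference category, so the remaining dummies are read as contrasts against it.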

In addition to creating dummy variables, we may also need to remove features that are not useful for classification problems, such as empty or constant columns, as well as duplicate columns.

In [9]:
# # Dropped due to multicollinearity issues (addressed in the next section)
# full_merged.drop(['regionwb', 'receive_welfare_payments'], axis=1, inplace = True)
# dem_merged.drop(['regionwb', 'receive_welfare_payments'], axis=1, inplace = True)

In [10]:
# create dummy variables for categoricals
full_merged = pd.get_dummies(full_merged, drop_first=True, dummy_na=True, prefix_sep='__')
dem_merged = pd.get_dummies(dem_merged, drop_first=True, dummy_na=True, prefix_sep='__')

print("Full feature dataset with dummy variables added has shape", full_merged.shape)
print("Demographic-related feature set data with dummy variables added has shape", dem_merged.shape)

Full feature dataset with dummy variables added has shape (5806, 85)
Demographic-related feature set data with dummy variables added has shape (5806, 53)


In [11]:
# remove columns with only one unique value (all nan dummies from columns with no missing values)
full_merged = full_merged.loc[:, full_merged.nunique(axis=0) > 1]
dem_merged = dem_merged.loc[:, dem_merged.nunique(axis=0) > 1]

print("Full feature dataset with constant columns dropped has shape", full_merged.shape)
print("Demographic-related feature set data with constant columns dropped has shape", dem_merged.shape)

Full feature dataset with constant columns dropped has shape (5806, 64)
Demographic-related feature set data with constant columns dropped has shape (5806, 43)


In [12]:
# remove duplicate columns - these end up being all from nan or Not Applicable dummies 
process_features.drop_duplicate_columns(full_merged, ignore=['pop_scaled_wgt'], inplace=True)
process_features.drop_duplicate_columns(dem_merged, ignore=['pop_scaled_wgt'], inplace=True)

print("Full feature dataset with duplicate columns dropped has shape", full_merged.shape)
print("Demographic-related feature set data with duplicate columns dropped has shape", dem_merged.shape)

Full feature dataset with duplicate columns dropped has shape (5806, 64)
Demographic-related feature set data with duplicate columns dropped has shape (5806, 43)
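
The implementation of `process_features.drop_duplicate_columns` is not shown in this notebook. As a rough idea of what such a cleanup involves, the sketch below removes constant and duplicate columns from a toy frame using plain pandas; the frame and the use of `DataFrame.T.duplicated()` are assumptions made for illustration and may differ from the project helper (for instance, it has no `ignore` argument).

```python
import pandas as pd

# Toy frame: one duplicated column and one constant column.
df = pd.DataFrame({
    "a": [1, 0, 1, 0],
    "a_copy": [1, 0, 1, 0],
    "const": [0, 0, 0, 0],
    "pop_scaled_wgt": [1.5, 2.0, 0.7, 3.1],
})

# Drop columns with a single unique value.
df = df.loc[:, df.nunique(axis=0) > 1]

# Drop columns whose values repeat an earlier column, keeping the first one.
is_duplicate = df.T.duplicated().to_numpy()
df = df.loc[:, ~is_duplicate]

print(list(df.columns))
```

Transposing is convenient for small frames like this one; on very wide data a hash-based comparison of columns would be cheaper.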


### Multicollinearity <a class="anchor" id="colinearity"></a>

Collinearity is the presence of two or more features that are linear combinations of one another or are highly correlated. This can destabilise the estimation of model coefficients. Once such features are identified, they can be eliminated step by step, leaving only useful, non-redundant features. One way to determine whether a feature is multicollinear is to compute its variance inflation factor (VIF). This is done with the `get_vif` function, which returns a pandas Series indexed by feature so that we can examine each one. A VIF of 1 denotes the absence of collinearity, and a VIF greater than 1 indicates some collinearity with the other features. As a rule of thumb, the model may have issues if the VIF is greater than 5 or 10.
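
For intuition, the VIF of feature j equals 1 / (1 - R_j^2), where R_j^2 is the coefficient of determination from regressing feature j on all other features. The sketch below computes this with NumPy on synthetic data. The helper `vif` is hypothetical and written for illustration; unlike the statsmodels function referenced in the next cell, it always includes an intercept in each auxiliary regression.

```python
import numpy as np

def vif(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        y = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)                 # independent of x1
x3 = x1 + 0.1 * rng.normal(size=500)      # nearly a copy of x1

factors = vif(np.column_stack([x1, x2, x3]))
print(factors.round(1))
```

In this synthetic example the independent feature has a VIF close to 1, while the two nearly identical features have VIFs far above the usual thresholds of 5 or 10.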

In [13]:
# from statsmodels.stats.outliers_influence import variance_inflation_factor

# def get_vif(X, intercept_col='intercept'):
#     if intercept_col is not None and intercept_col in X.columns:
#         X = X.copy().drop(intercept_col, axis=1)
    
#     vi_factors = [variance_inflation_factor(X.values, i)
#                              for i in range(X.shape[1])]
    
#     return pd.Series(vi_factors,
#                      index=X.columns,
#                      name='variance_inflation_factor')

# def standardize(df, numeric_only=True):
#     if numeric_only is True:
#     # find non-boolean columns
#         cols = df.loc[:, df.dtypes != 'uint8'].columns
#     else:
#         cols = df.columns
#     for field in cols:
#         mean, std = df[field].mean(), df[field].std()
#         # skip constant columns, whose standard deviation is zero
#         if std != 0:
#             df.loc[:, field] = (df[field] - mean) / std
    
#     return df

# get_vif(standardize(full_merged))[40:64,]

In [14]:
# get_vif(standardize(dem_merged))

### Create Train/Test Split and Save Data with Dummy Variables<a class="anchor" id="save-data"></a>

We now repeat the split for the dummy-encoded datasets, again reserving 25% of the data as a test set and using the same random seed.

Each dataset to be analysed will have two associated pickle files: train and test.

In [15]:
# Split the data into Train and Test Sets
full_merged_train, full_merged_test = train_test_split(full_merged, 
                                       test_size=0.25,
                                       random_state=1500,
                                       stratify=full_merged.receive_digital_wages)
dem_merged_train, dem_merged_test = train_test_split(dem_merged, 
                                       test_size=0.25,
                                       random_state=1500,
                                       stratify=dem_merged.receive_digital_wages)

# Save data to files
TRAIN_PATH, TEST_PATH = load_data.get_data_filepaths('full_merged_dumvar')
full_merged_train.to_pickle(TRAIN_PATH)
full_merged_test.to_pickle(TEST_PATH)



TRAIN_PATH, TEST_PATH = load_data.get_data_filepaths('dem_merged_dumvar')
dem_merged_train.to_pickle(TRAIN_PATH)
dem_merged_test.to_pickle(TEST_PATH)

    