# **Financial Inclusion Prediction in Africa**

Here, I work with the 'Financial Inclusion in Africa' dataset, provided as part of the Financial Inclusion in Africa Challenge hosted on the Zindi platform.

Financial inclusion here means that individuals and businesses have access to useful and affordable financial products and services that meet their needs (transactions, payments, savings, credit and insurance) and that are delivered in a responsible and sustainable way.

The challenge data contains demographic information and the financial services used by approximately 33,600 individuals across East Africa. The file loaded in this notebook has 23,524 rows.

The goal is to build machine learning models that predict which individuals are most likely to have or use a bank account.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [3]:
# Loading the dataset
df = pd.read_csv("Financial_inclusion_dataset.csv")
df.head()

| | country | year | uniqueid | bank_account | location_type | cellphone_access | household_size | age_of_respondent | gender_of_respondent | relationship_with_head | marital_status | education_level | job_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kenya | 2018 | uniqueid_1 | Yes | Rural | Yes | 3 | 24 | Female | Spouse | Married/Living together | Secondary education | Self employed |
| 1 | Kenya | 2018 | uniqueid_2 | No | Rural | No | 5 | 70 | Female | Head of Household | Widowed | No formal education | Government Dependent |
| 2 | Kenya | 2018 | uniqueid_3 | Yes | Urban | Yes | 5 | 26 | Male | Other relative | Single/Never Married | Vocational/Specialised training | Self employed |
| 3 | Kenya | 2018 | uniqueid_4 | No | Rural | Yes | 5 | 34 | Female | Head of Household | Married/Living together | Primary education | Formally employed Private |
| 4 | Kenya | 2018 | uniqueid_5 | No | Urban | No | 8 | 26 | Male | Child | Single/Never Married | Primary education | Informally employed |


In [4]:
# Basic Data Exploration
print(df.shape)

# info() prints its summary itself and returns None, so it is not wrapped in print()
df.info()

(23524, 13)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23524 entries, 0 to 23523
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   country                 23524 non-null  object
 1   year                    23524 non-null  int64 
 2   uniqueid                23524 non-null  object
 3   bank_account            23524 non-null  object
 4   location_type           23524 non-null  object
 5   cellphone_access        23524 non-null  object
 6   household_size          23524 non-null  int64 
 7   age_of_respondent       23524 non-null  int64 
 8   gender_of_respondent    23524 non-null  object
 9   relationship_with_head  23524 non-null  object
 10  marital_status          23524 non-null  object
 11  education_level         23524 non-null  object
 12  job_type                23524 non-null  object
dtypes: int64(3), object(10)
memory usage: 2.3+ MB


In [5]:
df.describe()

| | year | household_size | age_of_respondent |
|---|---|---|---|
| count | 23524.0 | 23524.0 | 23524.0 |
| mean | 2016.975939 | 3.797483 | 38.80522 |
| std | 0.847371 | 2.227613 | 16.520569 |
| min | 2016.0 | 1.0 | 16.0 |
| 25% | 2016.0 | 2.0 | 26.0 |
| 50% | 2017.0 | 3.0 | 35.0 |
| 75% | 2018.0 | 5.0 | 49.0 |
| max | 2018.0 | 21.0 | 100.0 |


In [6]:
df.isnull().sum()

country                   0
year                      0
uniqueid                  0
bank_account              0
location_type             0
cellphone_access          0
household_size            0
age_of_respondent         0
gender_of_respondent      0
relationship_with_head    0
marital_status            0
education_level           0
job_type                  0
dtype: int64

In [7]:
df.duplicated().sum()

0

**There are no missing values or duplicate rows in the dataset.**

In [8]:
# Encoding binary categorical variables
df['bank_account'] = df['bank_account'].map({'Yes': 1, 'No': 0})
df['cellphone_access'] = df['cellphone_access'].map({'Yes': 1, 'No': 0})
df['gender_of_respondent'] = df['gender_of_respondent'].map({'Female': 0, 'Male': 1})
df['location_type'] = df['location_type'].map({'Rural': 0, 'Urban': 1})
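
One caveat with this encoding step: `Series.map` with a dictionary returns `NaN` for any value that is not a key, so an unexpected category or a typo would quietly reintroduce missing values after the null check. Below is a minimal sketch of a guard, run on a small made-up frame rather than the real dataset. The helper name `map_binary` is hypothetical and not part of the notebook.

```python
import pandas as pd

def map_binary(series: pd.Series, mapping: dict) -> pd.Series:
    """Map a two-category column to integers and fail loudly on unmapped values."""
    mapped = series.map(mapping)
    # Values that were not keys of `mapping` come back as NaN
    unmapped = series[mapped.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped values in {series.name!r}: {list(unmapped)}")
    return mapped.astype(int)

# Toy data for illustration only; not taken from the dataset
toy = pd.DataFrame({"cellphone_access": ["Yes", "No", "Yes"]})
toy["cellphone_access"] = map_binary(toy["cellphone_access"], {"Yes": 1, "No": 0})
print(toy["cellphone_access"].tolist())
```

The same helper could be applied to `bank_account`, `gender_of_respondent` and `location_type` with their respective dictionaries.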

In [9]:
# One-Hot Encoding for multi-category variables
df = pd.get_dummies(df, columns=['job_type'], drop_first=True)


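# Feature engineering
# has_income is 1 when the respondent's job type is one of the six categories listed below, 0 otherwise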
df['has_income'] = df[['job_type_Farming and Fishing', 'job_type_Formally employed Government', 
                       'job_type_Formally employed Private', 'job_type_Informally employed', 
                       'job_type_Other Income', 'job_type_Self employed']].sum(axis=1)

df['is_married'] = df['marital_status'].apply(lambda x: 1 if x == 'Married/Living together' else 0)
df['is_single'] = df['marital_status'].apply(lambda x: 1 if x == 'Single/Never Married' else 0)


# One 0/1 indicator column per education level
edu_levels = ['Primary education', 'No formal education', 'Secondary education', 
              'Tertiary education', 'Vocational/Specialised training', 'Other/Dont know/RTA']

for level in edu_levels:
    # Column name = first word of the level, lowercased, plus "_education"
    col_name = level.split()[0].lower() + "_education"
    df[col_name] = (df['education_level'] == level).astype(int)
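
The column names produced by the naming rule in the loop are not all obvious, and they matter later because anything that feeds the saved model has to supply exactly the same feature names. This short sketch repeats only the naming rule, without touching `df`, to show what each level becomes:

```python
edu_levels = ['Primary education', 'No formal education', 'Secondary education',
              'Tertiary education', 'Vocational/Specialised training', 'Other/Dont know/RTA']

# Same rule as in the loop: first whitespace-separated token, lowercased, plus a suffix
col_names = [level.split()[0].lower() + "_education" for level in edu_levels]

for level, name in zip(edu_levels, col_names):
    print(f"{level!r} -> {name!r}")
```

Note that 'No formal education' becomes `no_education`, and the last two levels keep a slash in their names (`vocational/specialised_education` and `other/dont_education`).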


In [10]:
# Removing unused columns
df.drop(['uniqueid', 'country', 'year', 'relationship_with_head', 
         'marital_status', 'education_level'], axis=1, inplace=True)

In [11]:
# Feature & Target splitting
X = df.drop('bank_account', axis=1)
y = df['bank_account']

In [12]:
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
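
The split above is purely random. Because account holders are a small minority (642 of the 4,705 test rows in the reports further down), one possible variation is a stratified split, which keeps the class proportions the same in both parts. The sketch below uses synthetic data with an assumed 14% positive share. It is not what this notebook ran, and using it would change the reported scores.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target (assumed 14% positives)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 140 + [0] * 860)

# stratify=y_toy preserves the class ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("positive share, train:", y_tr.mean())
print("positive share, test: ", y_te.mean())
```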

In [13]:
# Training Logistic Regression with L1 (Lasso)
model = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.8845908607863975
[[3970   93]
 [ 450  192]]
              precision    recall  f1-score   support

           0       0.90      0.98      0.94      4063
           1       0.67      0.30      0.41       642

    accuracy                           0.88      4705
   macro avg       0.79      0.64      0.68      4705
weighted avg       0.87      0.88      0.86      4705



In [14]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

rf_pred = rf_model.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))
print(confusion_matrix(y_test, rf_pred))
print(classification_report(y_test, rf_pred))

Random Forest Accuracy: 0.8609989373007438
[[3801  262]
 [ 392  250]]
              precision    recall  f1-score   support

           0       0.91      0.94      0.92      4063
           1       0.49      0.39      0.43       642

    accuracy                           0.86      4705
   macro avg       0.70      0.66      0.68      4705
weighted avg       0.85      0.86      0.85      4705



---

For my Streamlit deployment, I will be using the **Random Forest Classifier** model.

Although its accuracy is slightly lower than that of the Logistic Regression model (0.86 versus 0.88):

- It performs better on the minority class (bank account holders), which is my target: recall is 0.39 versus 0.30, and F1-score is 0.43 versus 0.41.

- Its recall and precision are more balanced (0.39 and 0.49, versus 0.30 and 0.67), so it identifies more of the actual account holders, at the cost of more false positives.
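
The figures behind this comparison can be recomputed from the two confusion matrices printed above. The sketch below assumes scikit-learn's layout, with rows as true classes and columns as predicted classes. The helper name `minority_metrics` is my own and not part of any library.

```python
def minority_metrics(cm):
    """Return accuracy plus precision, recall and F1 for class 1.

    cm is laid out as scikit-learn prints it: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Confusion matrices copied from the model outputs above
log_reg_cm = [[3970, 93], [450, 192]]
rf_cm = [[3801, 262], [392, 250]]

for name, cm in [("Logistic Regression", log_reg_cm), ("Random Forest", rf_cm)]:
    acc, prec, rec, f1 = minority_metrics(cm)
    print(f"{name}: accuracy={acc:.3f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

In absolute terms, the Random Forest correctly flags 250 of the 642 account holders in the test set, against 192 for Logistic Regression, while raising 262 false positives against 93.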

---

In [15]:
import joblib

# Save with compression
joblib.dump(rf_model, 'financial_inclusion_model.pkl', compress=3)

['financial_inclusion_model.pkl']