# Financial Inclusion in Africa


Limited financial inclusion remains one of the main obstacles to economic and human development in Africa. For example, across Kenya, Rwanda, Tanzania, and Uganda only 9.1 million adults (or 14% of adults) have access to or use a commercial bank account.

## Objectives

The objective of this competition is to create a machine learning model to predict which individuals are most likely to have or use a bank account. The models and solutions developed can provide an indication of the state of financial inclusion in Kenya, Rwanda, Tanzania and Uganda, while providing insights into some of the key factors driving individuals’ financial security.

## Data set

We are asked to predict the likelihood that a person has a bank account (Yes = 1, No = 0) for each unique id in the test dataset. We train our model on 70% of the data and test it on the remaining 30%, across four East African countries: Kenya, Rwanda, Tanzania, and Uganda.

The main dataset contains demographic information and what financial services are used by approximately 33,600 individuals across East Africa. This data was extracted from various Finscope surveys ranging from 2016 to 2018, and more information about these surveys can be found here:

### Country	
Country interviewee is in.
### Year	
Year survey was done in.
### Uniqueid	
Unique identifier for each interviewee.
### Location_type	
Type of location: Rural, Urban
### Cellphone_access	
If interviewee has access to a cellphone: Yes, No
### Household_size	
Number of people living in one house
### Age_of_respondent	
The age of the interviewee
### Gender_of_respondent	
Gender of interviewee: Male, Female
### Relationship_with_head	
The interviewee’s relationship with the head of the house: Head of Household, Spouse, Child, Parent, Other relative, Other non-relatives, Dont know
### Marital_status	
The marital status of the interviewee: Married/Living together, Divorced/Seperated, Widowed, Single/Never Married, Don’t know
### Education_level	
Highest level of education: No formal education, Primary education, Secondary education, Vocational/Specialised training, Tertiary education, Other/Dont know/RTA
### Job_type	
Type of job interviewee has: Farming and Fishing, Self employed, Formally employed Government, Formally employed Private, Informally employed, Remittance Dependent, Government Dependent, Other Income, No Income, Dont Know/Refuse to answer


## Approach

### Importing the dataset

In [262]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as stats


In [263]:
df = pd.read_csv("./train.csv")
# inspecting the dataset 
df.head()

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,Kenya,2018,uniqueid_1,Yes,Rural,Yes,3,24,Female,Spouse,Married/Living together,Secondary education,Self employed
1,Kenya,2018,uniqueid_2,No,Rural,No,5,70,Female,Head of Household,Widowed,No formal education,Government Dependent
2,Kenya,2018,uniqueid_3,Yes,Urban,Yes,5,26,Male,Other relative,Single/Never Married,Vocational/Specialised training,Self employed
3,Kenya,2018,uniqueid_4,No,Rural,Yes,5,34,Female,Head of Household,Married/Living together,Primary education,Formally employed Private
4,Kenya,2018,uniqueid_5,No,Urban,No,8,26,Male,Child,Single/Never Married,Primary education,Informally employed


In [264]:
df.nunique()

country                      4
year                         3
uniqueid                  8735
bank_account                 2
location_type                2
cellphone_access             2
household_size              20
age_of_respondent           85
gender_of_respondent         2
relationship_with_head       6
marital_status               5
education_level              6
job_type                    10
dtype: int64

In [265]:
df.dtypes

country                   object
year                       int64
uniqueid                  object
bank_account              object
location_type             object
cellphone_access          object
household_size             int64
age_of_respondent          int64
gender_of_respondent      object
relationship_with_head    object
marital_status            object
education_level           object
job_type                  object
dtype: object

### Checking for Null Data

In [266]:
df.isna().sum()

country                   0
year                      0
uniqueid                  0
bank_account              0
location_type             0
cellphone_access          0
household_size            0
age_of_respondent         0
gender_of_respondent      0
relationship_with_head    0
marital_status            0
education_level           0
job_type                  0
dtype: int64

### Checking for duplicate data

In [267]:
df.duplicated().sum()

0

### Removing unnecessary data

In [268]:
df.drop("uniqueid",axis=1,inplace=True)

### Checking for correlation between variables

To include the categorical variables in the correlation analysis, they must first be converted to numeric columns with one-hot encoding.

In [269]:
# df1 is expected to hold the one-hot encoded copy of df (built as in the encoding section below)
df1.corr().style.background_gradient("coolwarm")

Unnamed: 0,country_Rwanda,country_Tanzania,country_Uganda,bank_account,location_type_Urban,cellphone_access_Yes,gender_of_respondent_Male,relationship_with_head_Head of Household,relationship_with_head_Other non-relatives,relationship_with_head_Other relative,relationship_with_head_Parent,relationship_with_head_Spouse,marital_status_Dont know,marital_status_Married/Living together,marital_status_Single/Never Married,marital_status_Widowed,education_level_Other/Dont know/RTA,education_level_Primary education,education_level_Secondary education,education_level_Tertiary education,education_level_Vocational/Specialised training,job_type_Farming and Fishing,job_type_Formally employed Government,job_type_Formally employed Private,job_type_Government Dependent,job_type_Informally employed,job_type_No Income,job_type_Other Income,job_type_Remittance Dependent,job_type_Self employed,year,household_size,age_of_respondent
country_Rwanda,1.0,-0.480946,-0.240677,-0.057378,-0.389062,0.154415,-0.00915,-0.033474,0.015186,-0.084774,-0.072224,0.023583,-0.014175,0.212768,-0.203893,0.053055,0.011422,0.015264,-0.017674,-0.10278,-0.100385,0.378012,-0.011552,-0.09809,-0.000619,0.165238,-0.127176,-0.032802,-0.078517,-0.352309,-0.885158,0.235996,0.02269
country_Tanzania,-0.480946,1.0,-0.195978,-0.088345,0.431626,-0.206499,0.019064,0.016359,-0.006832,0.027895,0.148833,-0.00651,-0.011542,-0.447459,0.395795,-0.035267,-0.016799,0.11851,-0.186068,0.144431,-0.028621,-0.343272,-0.080935,0.033391,0.013439,-0.061068,0.119465,-0.121468,-0.010421,0.416131,0.01777,-0.408641,-0.012334
country_Uganda,-0.240677,-0.195978,1.0,-0.0492,-0.069356,-0.032318,-0.044125,-0.036513,0.005047,0.029917,0.066774,0.049845,-0.005776,0.120579,-0.105139,-0.023285,0.011114,-0.003412,0.053143,-0.049169,0.022395,-0.171782,-0.040502,0.00992,-0.022023,-0.174983,0.176754,0.335828,-0.108642,0.176895,0.378473,0.167451,-0.0634
bank_account,-0.057378,-0.088345,-0.0492,1.0,0.087288,0.209669,0.117234,0.114506,-0.009218,-0.020639,-0.051197,-0.060884,0.005791,0.086518,-0.040771,-0.052565,0.019255,-0.173702,0.123702,0.241958,0.232187,-0.037986,0.2359,0.249478,0.018255,-0.098456,-0.057121,0.025663,-0.045701,-0.015978,0.112318,-0.028326,0.019429
location_type_Urban,-0.389062,0.431626,-0.069356,0.087288,1.0,-0.085238,0.012924,0.017202,0.031979,0.028483,0.074399,-0.026213,0.008875,-0.23619,0.268959,-0.052119,0.005291,-0.017513,0.043332,0.084803,0.050682,-0.347103,0.047243,0.067894,-0.002052,-0.071098,0.045064,0.011032,0.054247,0.296278,0.214621,-0.257284,-0.047373
cellphone_access_Yes,0.154415,-0.206499,-0.032318,0.209669,-0.085238,1.0,0.10237,0.055966,0.023908,-0.0302,-0.049435,-0.030304,0.005608,0.15829,-0.065682,-0.12427,-0.007483,-0.009652,0.120163,0.099981,0.099093,0.111969,0.071687,0.105723,-0.058395,-0.008392,-0.103875,-0.006649,-0.083128,-0.056012,-0.066505,0.09136,-0.103611
gender_of_respondent_Male,-0.00915,0.019064,-0.044125,0.117234,0.012924,0.10237,1.0,0.413996,0.001045,0.011998,0.010972,-0.498336,0.017434,0.056201,0.086199,-0.220843,-0.003034,0.01935,0.057692,0.041775,0.025083,-0.001908,0.029416,0.062368,-0.02399,0.052309,-0.073582,-0.011521,-0.120942,0.037074,0.000317,0.014576,0.012745
relationship_with_head_Head of Household,-0.033474,0.016359,-0.036513,0.114506,0.017202,0.055966,0.413996,1.0,-0.098847,-0.18727,-0.240992,-0.678311,0.007577,-0.013705,-0.187566,0.289141,0.002014,-0.012471,-0.115764,-0.002794,0.013639,0.025965,0.037527,0.057979,0.05886,0.01567,-0.110244,-0.012677,-0.152559,0.060701,0.029492,-0.264926,0.420023
relationship_with_head_Other non-relatives,0.015186,-0.006832,0.005047,-0.009218,0.031979,0.023908,0.001045,-0.098847,1.0,-0.015427,-0.019852,-0.055877,-0.001664,-0.071332,0.064718,-0.026595,-0.003483,0.015913,0.004816,0.003635,0.003961,-0.039361,-0.004203,0.083707,-0.009295,0.037692,0.005708,0.007436,-0.000629,-0.03834,-0.013692,0.026966,-0.078885
relationship_with_head_Other relative,-0.084774,0.027895,0.029917,-0.020639,0.028483,-0.0302,0.011998,-0.18727,-0.015427,1.0,-0.037611,-0.105861,0.010727,-0.11212,0.073136,-0.034398,4.1e-05,-0.039163,0.070739,0.030942,0.032696,-0.066464,-0.016074,0.023546,-0.005057,-0.004168,0.0448,0.022416,0.092768,-0.028581,0.081872,0.027031,-0.134945
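
The full matrix is hard to scan, so one option is to look only at the column for the target. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the survey data) and sorts the correlations with `bank_account` by absolute value.

```python
import pandas as pd

# Toy stand-in for the one-hot encoded frame; values are invented for illustration.
toy = pd.DataFrame({
    "bank_account":         [1, 0, 1, 0, 1, 0, 0, 1],
    "cellphone_access_Yes": [1, 0, 1, 1, 1, 0, 0, 1],
    "location_type_Urban":  [1, 0, 0, 0, 1, 1, 0, 1],
    "household_size":       [3, 5, 5, 5, 8, 2, 6, 4],
})

# Correlation of every feature with the target, strongest (by magnitude) first.
target_corr = (
    toy.corr()["bank_account"]
    .drop("bank_account")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Applied to the encoded survey frame, the same pattern would give a ranked list instead of a 33 x 33 heat map.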


### Converting Categorical Data to Numerical Data using One-Hot Encoding

In [270]:
# One Hot Encoding
dummy = pd.get_dummies(df.select_dtypes(include="object"),drop_first=True)
dummy = dummy.astype(int)

In [271]:
df = pd.concat([dummy, df.select_dtypes(exclude="object")], axis=1)
# Rename the encoded target column back to its original name
df = df.rename(columns={"bank_account_Yes": "bank_account"})
df.columns

Index(['country_Rwanda', 'country_Tanzania', 'country_Uganda', 'bank_account',
       'location_type_Urban', 'cellphone_access_Yes',
       'gender_of_respondent_Male', 'relationship_with_head_Head of Household',
       'relationship_with_head_Other non-relatives',
       'relationship_with_head_Other relative',
       'relationship_with_head_Parent', 'relationship_with_head_Spouse',
       'marital_status_Dont know', 'marital_status_Married/Living together',
       'marital_status_Single/Never Married', 'marital_status_Widowed',
       'education_level_Other/Dont know/RTA',
       'education_level_Primary education',
       'education_level_Secondary education',
       'education_level_Tertiary education',
       'education_level_Vocational/Specialised training',
       'job_type_Farming and Fishing', 'job_type_Formally employed Government',
       'job_type_Formally employed Private', 'job_type_Government Dependent',
       'job_type_Informally employed', 'job_type_No Income',
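
To make the effect of `drop_first=True` concrete, here is a small sketch on an invented three-row frame (`toy` is hypothetical). The first category of each column, in sorted order, becomes the baseline and gets no column of its own, which is why `country_Kenya` does not appear in the index above.

```python
import pandas as pd

# Invented rows; only the column names mirror the survey data.
toy = pd.DataFrame({
    "country": ["Kenya", "Rwanda", "Uganda"],
    "location_type": ["Rural", "Urban", "Rural"],
})

full = pd.get_dummies(toy).astype(int)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True).astype(int)  # baseline category dropped

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level per variable avoids a set of dummy columns that always sums to 1, which would be perfectly collinear with the regression constant.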
      

### Test/Train Split

In [272]:
train_df=df.sample(frac=0.7, random_state=99) #random state is a seed value
test_df=df.drop(train_df.index)
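
As a quick sanity check on this pattern, the sketch below applies the same `sample`/`drop` split to an invented ten-row frame (`toy`, `train_part` and `test_part` are illustrative names) and confirms that the two parts are disjoint and together cover every row. Note that `drop(train_df.index)` relies on the index labels being unique.

```python
import pandas as pd

# Ten invented rows with a default unique integer index.
toy = pd.DataFrame({"x": range(10)})

train_part = toy.sample(frac=0.7, random_state=99)  # 70% of the rows, reproducible
test_part = toy.drop(train_part.index)              # everything not sampled

print(len(train_part), len(test_part))
# No index label appears in both parts.
print(train_part.index.intersection(test_part.index).empty)
```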

### Starting Analysis by using Linear Regression

#### Define our dependent and independent variables

In [273]:
Y_train = train_df["bank_account"]
X_train = stats.add_constant(train_df.drop(columns=["bank_account"]))

#### Create our model

In [274]:
model = stats.OLS(Y_train,X_train)
result = model.fit()
result.summary()

0,1,2,3
Dep. Variable:,bank_account,R-squared:,0.264
Model:,OLS,Adj. R-squared:,0.262
Method:,Least Squares,F-statistic:,189.9
Date:,"Mon, 04 Mar 2024",Prob (F-statistic):,0.0
Time:,10:37:16,Log-Likelihood:,-3440.8
No. Observations:,16467,AIC:,6946.0
Df Residuals:,16435,BIC:,7192.0
Df Model:,31,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-0.0001,8.74e-06,-12.628,0.000,-0.000,-9.33e-05
country_Rwanda,-0.0456,0.006,-7.050,0.000,-0.058,-0.033
country_Tanzania,-0.1315,0.008,-15.519,0.000,-0.148,-0.115
country_Uganda,-0.1596,0.011,-14.672,0.000,-0.181,-0.138
location_type_Urban,0.0417,0.006,7.288,0.000,0.030,0.053
cellphone_access_Yes,0.0739,0.006,12.730,0.000,0.063,0.085
gender_of_respondent_Male,0.0375,0.006,5.998,0.000,0.025,0.050
relationship_with_head_Head of Household,0.0777,0.011,7.339,0.000,0.057,0.098
relationship_with_head_Other non-relatives,-0.0350,0.027,-1.282,0.200,-0.089,0.019

0,1,2,3
Omnibus:,4636.041,Durbin-Watson:,1.994
Prob(Omnibus):,0.0,Jarque-Bera (JB):,11588.333
Skew:,1.555,Prob(JB):,0.0
Kurtosis:,5.687,Cond. No.,3.01e+18


In [275]:
print('The sum of square residuals is {:.1f}'.format(result.ssr))

The sum of square residuals is 1464.3
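
For reference, the sum of squared residuals is the sum over all rows of (y - y_hat)^2, and R-squared is 1 - SSR / SST, where SST is the total sum of squares around the mean of y. The sketch below computes both by hand with NumPy on invented data (`x` and `y` are illustrative). It uses `np.linalg.lstsq` rather than statsmodels, so it is a check of the formulas, not a reproduction of the fit above.

```python
import numpy as np

# Invented binary outcome and a single feature.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

# Design matrix with an intercept column, similar to what add_constant produces.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta
ssr = np.sum((y - y_hat) ** 2)     # sum of squared residuals
sst = np.sum((y - y.mean()) ** 2)  # total sum of squares
r_squared = 1 - ssr / sst
print(f"SSR = {ssr:.3f}, R-squared = {r_squared:.3f}")
```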


In [276]:
Y_test = test_df["bank_account"]
X_test = stats.add_constant(test_df.drop(columns=["bank_account"]))

In [278]:
test_predictions = result.predict(X_test)
test_predictions

4        0.019594
8        0.313907
11       0.444112
13       0.106880
14       0.197063
           ...   
23512   -0.026173
23514    0.055130
23518   -0.105984
23520    0.169835
23521   -0.001175
Length: 7057, dtype: float64
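
Because ordinary least squares on a 0/1 target is a linear probability model, some fitted values fall below 0 (as in the output above) or above 1. One hedged way to turn such scores into the Yes = 1 / No = 0 labels the task asks for is to clip them to [0, 1] and apply a cutoff. The 0.5 threshold and the arrays below are assumptions for illustration, not values from this notebook's evaluation.

```python
import numpy as np

# Invented scores in the style of the OLS output, plus invented true labels.
scores = np.array([0.02, 0.31, 0.44, 0.62, -0.03, 0.81])
y_true = np.array([0, 0, 1, 1, 0, 1])

# Clip to the valid probability range, then apply an assumed 0.5 cutoff.
probs = np.clip(scores, 0.0, 1.0)
labels = (probs >= 0.5).astype(int)

# Share of rows where the predicted label matches the true label.
accuracy = (labels == y_true).mean()
print(labels, accuracy)
```

With an imbalanced target, a cutoff other than 0.5 may work better; that choice would need to be tuned on held-out data.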

In [284]:
# plt.scatter(test_predictions, Y_test)
# plt.plot([5000, 50000], [5000, 50000], c='k', ls='--')
# plt.xlabel('Predicted ')
# plt.ylabel('Observed')
# plt.show()
test_predictions 
# Y_test


## Actual Test Data 

In [None]:
df_test = pd.read_csv("test.csv")
df_test.drop("uniqueid",axis=1,inplace=True)

In [None]:
# One Hot Encoding
dummy = pd.get_dummies(df_test.select_dtypes(include="object"),drop_first=True)
dummy = dummy.astype(int)
list1 = [x for x in list(dummy.columns) if "bank" in x]

In [None]:
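# Combine the dummy columns with the numeric columns of df_test
df2 = pd.concat([dummy, df_test.select_dtypes(exclude="object")], axis=1)
# Rename the encoded target column back to its original name, if it is present
df2 = df2.rename(columns={"bank_account_Yes": "bank_account"})
df2 = df2.drop(columns=["country_Tanzania", "country_Rwanda"])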


In [None]:
df1_test = pd.concat([dummy,df_test.select_dtypes(exclude=object)],axis=1)
df1_test


Unnamed: 0,country_Rwanda,country_Tanzania,country_Uganda,location_type_Urban,cellphone_access_Yes,gender_of_respondent_Male,relationship_with_head_Head of Household,relationship_with_head_Other non-relatives,relationship_with_head_Other relative,relationship_with_head_Parent,...,job_type_Formally employed Private,job_type_Government Dependent,job_type_Informally employed,job_type_No Income,job_type_Other Income,job_type_Remittance Dependent,job_type_Self employed,year,household_size,age_of_respondent
0,0,0,0,1,1,1,1,0,0,0,...,0,0,0,0,0,0,0,2018,3,30
1,0,0,0,1,1,1,1,0,0,0,...,1,0,0,0,0,0,0,2018,7,51
2,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,2018,3,77
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,2018,6,39
4,0,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,1,0,2018,3,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10081,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,2018,2,62
10082,0,0,1,1,1,1,1,0,0,0,...,0,0,0,0,0,0,1,2018,8,42
10083,0,0,1,1,1,1,1,0,0,0,...,0,0,0,0,1,0,0,2018,1,39
10084,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,2018,6,28
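
One thing to watch for when encoding the test file separately is that `get_dummies` only creates columns for categories that actually occur, so the test frame can end up with fewer or differently ordered columns than the training frame. A minimal sketch of one way to guard against this, using `DataFrame.reindex` on invented frames (`train_X`, `test_raw`, `test_X` and `test_aligned` are hypothetical names):

```python
import pandas as pd

# Training features, encoded the same way as in the notebook.
train_X = pd.get_dummies(
    pd.DataFrame({"job_type": ["Self employed", "No Income", "Other Income"],
                  "age": [24, 70, 26]}),
    drop_first=True,
).astype(int)

# The test sample happens to lack the "Other Income" category.
test_raw = pd.DataFrame({"job_type": ["Self employed", "No Income"], "age": [30, 51]})

# Encode without drop_first, so the dropped baseline is decided by the training
# columns rather than by whichever category sorts first in the test sample.
test_X = pd.get_dummies(test_raw).astype(int)

# Force the training column set and order; categories unseen in test become 0.
test_aligned = test_X.reindex(columns=train_X.columns, fill_value=0)
print(list(test_aligned.columns))
```

After this step the test frame has exactly the columns the fitted model was trained on, in the same order.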


In [None]:
X_axis_test = stats.add_constant(df1_test)
X_axis_test

Unnamed: 0,const,country_Rwanda,country_Tanzania,country_Uganda,location_type_Urban,cellphone_access_Yes,gender_of_respondent_Male,relationship_with_head_Head of Household,relationship_with_head_Other non-relatives,relationship_with_head_Other relative,...,job_type_Formally employed Private,job_type_Government Dependent,job_type_Informally employed,job_type_No Income,job_type_Other Income,job_type_Remittance Dependent,job_type_Self employed,year,household_size,age_of_respondent
0,1.0,0,0,0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,2018,3,30
1,1.0,0,0,0,1,1,1,1,0,0,...,1,0,0,0,0,0,0,2018,7,51
2,1.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,2018,3,77
3,1.0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,2018,6,39
4,1.0,0,0,0,1,0,1,0,0,0,...,0,0,0,0,0,1,0,2018,3,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10081,1.0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,2018,2,62
10082,1.0,0,0,1,1,1,1,1,0,0,...,0,0,0,0,0,0,1,2018,8,42
10083,1.0,0,0,1,1,1,1,1,0,0,...,0,0,0,0,1,0,0,2018,1,39
10084,1.0,0,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,1,2018,6,28


In [None]:
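# Predict with the fitted results object (result), not the unfitted model:
# model.predict(X_axis_test) interprets its first argument as the parameter vector and fails with
# "ValueError: shapes (16467,33) and (10086,33) not aligned: 33 (dim 1) != 10086 (dim 0)"
Y_axis_test = result.predict(X_axis_test)
Y_axis_test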