## Financial Inclusion In Africa 

The objective of this competition is to create a machine learning model to predict which individuals are most likely to have or use a bank account. The models and solutions developed can provide an indication of the state of financial inclusion in Kenya, Rwanda, Tanzania and Uganda, while providing insights into some of the key factors driving individuals’ financial security.



In [2]:
# importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# data preprocessing 
from sklearn.preprocessing import LabelEncoder


In [3]:
# import our training and testing data 
df_train = pd.read_csv('Train.csv')
df_test = pd.read_csv('Test.csv')

df_train.head()

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,Kenya,2018,uniqueid_1,Yes,Rural,Yes,3,24,Female,Spouse,Married/Living together,Secondary education,Self employed
1,Kenya,2018,uniqueid_2,No,Rural,No,5,70,Female,Head of Household,Widowed,No formal education,Government Dependent
2,Kenya,2018,uniqueid_3,Yes,Urban,Yes,5,26,Male,Other relative,Single/Never Married,Vocational/Specialised training,Self employed
3,Kenya,2018,uniqueid_4,No,Rural,Yes,5,34,Female,Head of Household,Married/Living together,Primary education,Formally employed Private
4,Kenya,2018,uniqueid_5,No,Urban,No,8,26,Male,Child,Single/Never Married,Primary education,Informally employed


In [4]:
df_train.shape

(23524, 13)

In [5]:
df_test.shape

(10086, 12)

In [6]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23524 entries, 0 to 23523
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   country                 23524 non-null  object
 1   year                    23524 non-null  int64 
 2   uniqueid                23524 non-null  object
 3   bank_account            23524 non-null  object
 4   location_type           23524 non-null  object
 5   cellphone_access        23524 non-null  object
 6   household_size          23524 non-null  int64 
 7   age_of_respondent       23524 non-null  int64 
 8   gender_of_respondent    23524 non-null  object
 9   relationship_with_head  23524 non-null  object
 10  marital_status          23524 non-null  object
 11  education_level         23524 non-null  object
 12  job_type                23524 non-null  object
dtypes: int64(3), object(10)
memory usage: 2.3+ MB


In [12]:
df_train.describe()

Unnamed: 0,year,household_size,age_of_respondent
count,23524.0,23524.0,23524.0
mean,2016.975939,3.797483,38.80522
std,0.847371,2.227613,16.520569
min,2016.0,1.0,16.0
25%,2016.0,2.0,26.0
50%,2017.0,3.0,35.0
75%,2018.0,5.0,49.0
max,2018.0,21.0,100.0


In [11]:
exclude = ['uniqued']

for cols in df_train.select_dtypes(include='object').columns:
    if cols not in exclude:
        print(df_train[cols].value_counts())
        print('\n')

country
Rwanda      8735
Tanzania    6620
Kenya       6068
Uganda      2101
Name: count, dtype: int64


uniqueid
uniqueid_1424    4
uniqueid_1423    4
uniqueid_1422    4
uniqueid_1421    4
uniqueid_1420    4
                ..
uniqueid_7264    1
uniqueid_7265    1
uniqueid_7266    1
uniqueid_7267    1
uniqueid_7275    1
Name: count, Length: 8735, dtype: int64


bank_account
No     20212
Yes     3312
Name: count, dtype: int64


location_type
Rural    14343
Urban     9181
Name: count, dtype: int64


cellphone_access
Yes    17454
No      6070
Name: count, dtype: int64


gender_of_respondent
Female    13877
Male       9647
Name: count, dtype: int64


relationship_with_head
Head of Household      12831
Spouse                  6520
Child                   2229
Parent                  1086
Other relative           668
Other non-relatives      190
Name: count, dtype: int64


marital_status
Married/Living together    10749
Single/Never Married        7983
Widowed                     2708
Divorc

In [7]:

# check for missing values
df_train.isnull().sum()
# the results show that we dont have missing values

country                   0
year                      0
uniqueid                  0
bank_account              0
location_type             0
cellphone_access          0
household_size            0
age_of_respondent         0
gender_of_respondent      0
relationship_with_head    0
marital_status            0
education_level           0
job_type                  0
dtype: int64

In [8]:
df_test.isnull().sum()


country                   0
year                      0
uniqueid                  0
location_type             0
cellphone_access          0
household_size            0
age_of_respondent         0
gender_of_respondent      0
relationship_with_head    0
marital_status            0
education_level           0
job_type                  0
dtype: int64

## Encoding

Since we have a number of categorical columns and our machine learning will operate with only numerical value swe do encoding 

In [None]:
df_train.columns
# from the list below this is how I decide to encode my columns 

#  - one hot encoding: 
            # 1. country 
            # 2. Relationship_with_head 
            # 3. marital_status 
            # 4. job_type

#  - label encoding: 
            # 1.location_type 
            # 2. cellphone_access 
            # 3. gender_of_respondent
            # 4. education_level 

#  Ordinal encoding:
            # 1. education_level 
            

Index(['country', 'year', 'uniqueid', 'bank_account', 'location_type',
       'cellphone_access', 'household_size', 'age_of_respondent',
       'gender_of_respondent', 'relationship_with_head', 'marital_status',
       'education_level', 'job_type'],
      dtype='object')

In [None]:
# we 1st create a copy of our data to avoid changing the original data


# here I do some one hot encoding 
df_encoded = df_train.copy()
one_hot_cols = ['country', 'relationship_with_head', 'marital_status', 'job_type']
df_encoded = pd.get_dummies(df_encoded , columns=one_hot_cols, drop_first=True)

# Label encoding
le = LabelEncoder()
label_cols = ['location_type', 'cellphone_access' , 'gender_of_respondent' ,'education_level', 'bank_account' ]
for cols in label_cols:
    df_encoded[cols] = le.fit_transform(df_encoded[cols])


In [22]:
df_encoded.head(20)

Unnamed: 0,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,education_level,country_Rwanda,...,marital_status_Widowed,job_type_Farming and Fishing,job_type_Formally employed Government,job_type_Formally employed Private,job_type_Government Dependent,job_type_Informally employed,job_type_No Income,job_type_Other Income,job_type_Remittance Dependent,job_type_Self employed
0,2018,uniqueid_1,Yes,0,1,3,24,0,3,False,...,False,False,False,False,False,False,False,False,False,True
1,2018,uniqueid_2,No,0,0,5,70,0,0,False,...,True,False,False,False,True,False,False,False,False,False
2,2018,uniqueid_3,Yes,1,1,5,26,1,5,False,...,False,False,False,False,False,False,False,False,False,True
3,2018,uniqueid_4,No,0,1,5,34,0,2,False,...,False,False,False,True,False,False,False,False,False,False
4,2018,uniqueid_5,No,1,0,8,26,1,2,False,...,False,False,False,False,False,True,False,False,False,False
5,2018,uniqueid_6,No,0,0,7,26,0,2,False,...,False,False,False,False,False,True,False,False,False,False
6,2018,uniqueid_7,No,0,1,7,32,0,2,False,...,False,False,False,False,False,False,False,False,False,True
7,2018,uniqueid_8,No,0,1,1,42,0,4,False,...,False,False,True,False,False,False,False,False,False,False
8,2018,uniqueid_9,Yes,0,1,3,54,1,3,False,...,False,True,False,False,False,False,False,False,False,False
9,2018,uniqueid_10,No,1,1,3,76,0,0,False,...,False,False,False,False,False,False,False,False,True,False


In [23]:
df_encoded.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23524 entries, 0 to 23523
Data columns (total 30 columns):
 #   Column                                      Non-Null Count  Dtype 
---  ------                                      --------------  ----- 
 0   year                                        23524 non-null  int64 
 1   uniqueid                                    23524 non-null  object
 2   bank_account                                23524 non-null  object
 3   location_type                               23524 non-null  int64 
 4   cellphone_access                            23524 non-null  int64 
 5   household_size                              23524 non-null  int64 
 6   age_of_respondent                           23524 non-null  int64 
 7   gender_of_respondent                        23524 non-null  int64 
 8   education_level                             23524 non-null  int64 
 9   country_Rwanda                              23524 non-null  bool  
 10  country_Tanzania      