## Financial Inclusion in Africa

Who is more likely to have a bank account?

## 1. Defining the problem statement
The objective is to build a machine learning model that predicts
which individuals are most likely to have a bank account.

## 2. Collecting data
The data comes from the Zindi competition: https://zindi.africa/competitions/financial-inclusion-in-africa

Load the train and test datasets using pandas.

In [1]:
import pandas as pd

train = pd.read_csv('Train_v2.csv')
test = pd.read_csv('Test_v2.csv')

## 3. Exploratory data analysis

Print the first 5 rows of the train dataset.


In [2]:
train.head()

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,Kenya,2018,uniqueid_1,Yes,Rural,Yes,3,24,Female,Spouse,Married/Living together,Secondary education,Self employed
1,Kenya,2018,uniqueid_2,No,Rural,No,5,70,Female,Head of Household,Widowed,No formal education,Government Dependent
2,Kenya,2018,uniqueid_3,Yes,Urban,Yes,5,26,Male,Other relative,Single/Never Married,Vocational/Specialised training,Self employed
3,Kenya,2018,uniqueid_4,No,Rural,Yes,5,34,Female,Head of Household,Married/Living together,Primary education,Formally employed Private
4,Kenya,2018,uniqueid_5,No,Urban,No,8,26,Male,Child,Single/Never Married,Primary education,Informally employed


## Data dictionary

- country: country of the individual
- year: year of the survey record
- uniqueid: id of the individual
- bank_account: Yes = has a bank account, No = does not have a bank account
- location_type: location of the individual (urban or rural)
- cellphone_access: Yes = has cellphone access, No = no cellphone access
- household_size: family size
- age_of_respondent: age of the individual
- gender_of_respondent: male or female
- relationship_with_head: head of household, spouse, child, parent or other relative
- marital_status: single, married, divorced, widowed or don't know
- education_level: level of education attained
- job_type: nature of job

## Total rows and columns

- The train dataset has 23524 rows and 13 columns.
- The test dataset has 10086 rows and 12 columns (it does not contain the bank_account column).

In [3]:
test.head()

Unnamed: 0,country,year,uniqueid,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,Kenya,2018,uniqueid_6056,Urban,Yes,3,30,Male,Head of Household,Married/Living together,Secondary education,Formally employed Government
1,Kenya,2018,uniqueid_6060,Urban,Yes,7,51,Male,Head of Household,Married/Living together,Vocational/Specialised training,Formally employed Private
2,Kenya,2018,uniqueid_6065,Rural,No,3,77,Female,Parent,Married/Living together,No formal education,Remittance Dependent
3,Kenya,2018,uniqueid_6072,Rural,No,6,39,Female,Head of Household,Married/Living together,Primary education,Remittance Dependent
4,Kenya,2018,uniqueid_6073,Urban,No,3,16,Male,Child,Single/Never Married,Secondary education,Remittance Dependent


In [4]:
train.shape

(23524, 13)

In [5]:
test.shape

(10086, 12)

In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23524 entries, 0 to 23523
Data columns (total 13 columns):
country                   23524 non-null object
year                      23524 non-null int64
uniqueid                  23524 non-null object
bank_account              23524 non-null object
location_type             23524 non-null object
cellphone_access          23524 non-null object
household_size            23524 non-null int64
age_of_respondent         23524 non-null int64
gender_of_respondent      23524 non-null object
relationship_with_head    23524 non-null object
marital_status            23524 non-null object
education_level           23524 non-null object
job_type                  23524 non-null object
dtypes: int64(3), object(10)
memory usage: 2.3+ MB


In [7]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10086 entries, 0 to 10085
Data columns (total 12 columns):
country                   10086 non-null object
year                      10086 non-null int64
uniqueid                  10086 non-null object
location_type             10086 non-null object
cellphone_access          10086 non-null object
household_size            10086 non-null int64
age_of_respondent         10086 non-null int64
gender_of_respondent      10086 non-null object
relationship_with_head    10086 non-null object
marital_status            10086 non-null object
education_level           10086 non-null object
job_type                  10086 non-null object
dtypes: int64(3), object(9)
memory usage: 945.6+ KB


- The dataset is mostly categorical: 10 of the 13 train columns have dtype object.
- Most machine learning algorithms expect numerical input, so we'll convert the categorical variables to numerical ones.

In [18]:
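# select the columns to label-encode by position: every column except
# uniqueid (2), bank_account (3), household_size (6) and age_of_respondent (7)
train1 = train.iloc[:,[0,1,4,5,8,9,10,11,12]]
columns = list(train1.columns)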

We pick out the columns to encode (the categorical columns plus year), then use LabelEncoder to convert them to numerical variables.
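Selecting columns by position is brittle if the column order ever changes. As a possible alternative, the non-numeric columns could be picked by dtype. The sketch below runs on a small made-up frame (`toy` and `categorical_columns` are illustrative names, not part of the notebook). Note that, unlike the positional selection, it would not pick up the integer `year` column.

```python
import pandas as pd

# A tiny stand-in for the train data; values are illustrative only.
toy = pd.DataFrame({
    "country": ["Kenya", "Kenya"],
    "year": [2018, 2018],
    "household_size": [3, 5],
    "job_type": ["Self employed", "Informally employed"],
})

# Keep every column whose dtype is not numeric.
categorical_columns = list(toy.select_dtypes(exclude="number").columns)
print(categorical_columns)
```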

In [9]:
# import the label encoder
from sklearn.preprocessing import LabelEncoder

le=LabelEncoder()

In [12]:
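def Label(feature):
    # fit the encoder on the train column, then reuse the same mapping for test
    train[feature]=le.fit_transform(train[feature])
    # transform raises a ValueError if test contains a category never seen in train
    test[feature]=le.transform(test[feature])
    return train[feature]

The Label function encodes a categorical column as integers, applying the same mapping to both train and test.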
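To make the mapping concrete, here is a minimal sketch of how scikit-learn's `LabelEncoder` assigns integers: classes are sorted, then numbered from 0. The lists and the names `encoder`, `train_codes`, `test_codes` and `mapping` are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values taken from the location_type column.
train_values = ["Rural", "Urban", "Rural"]
test_values = ["Urban", "Rural"]

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_values)  # learn the mapping on train
test_codes = encoder.transform(test_values)        # reuse it on test

# classes_ holds the sorted labels; the position of each label is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Rural': 0, 'Urban': 1}
```

This matches the encoded output shown in the notebook, where Rural becomes 0 and Urban becomes 1.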

In [19]:
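# encode the selected columns in both train and test
for item in columns:
    Label(item)

# bank_account exists only in train, so encode the target separately (No -> 0, Yes -> 1)
train['bank_account'] = le.fit_transform(train['bank_account'])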

In [20]:
train.head()

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,0,2,uniqueid_1,1,0,1,3,24,0,5,2,3,9
1,0,2,uniqueid_2,0,0,0,5,70,0,1,4,0,4
2,0,2,uniqueid_3,1,1,1,5,26,1,3,3,5,9
3,0,2,uniqueid_4,0,0,1,5,34,0,1,2,2,3
4,0,2,uniqueid_5,0,1,0,8,26,1,0,3,2,5


In [21]:
test.head()

Unnamed: 0,country,year,uniqueid,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,0,2,uniqueid_6056,1,1,3,30,1,1,2,3,2
1,0,2,uniqueid_6060,1,1,7,51,1,1,2,5,3
2,0,2,uniqueid_6065,0,0,3,77,0,4,2,0,8
3,0,2,uniqueid_6072,0,0,6,39,0,1,2,2,8
4,0,2,uniqueid_6073,1,0,3,16,1,0,3,3,8


Now let's convert age_of_respondent into a categorical variable and then encode it as a number.
We use four age brackets:
- 16-37: youth (encoded as 0)
- 38-58: adult (encoded as 1)
- 59-79: elder (encoded as 2)
- 80-100: elder statesman (encoded as 3)

In [22]:
train['age_of_respondent'].min()

16

In [23]:
train['age_of_respondent'].max()

100

The youngest respondent is 16 years old and the oldest is 100.

In [24]:
train_test_data = [train,test]

# replace each age with its bracket code; the codes (0-3) are all below 16,
# so rows that are already recoded never match a later condition
for dataset in train_test_data:
    dataset.loc[(dataset['age_of_respondent'] >= 16) & (dataset['age_of_respondent'] <= 37), 'age_of_respondent'] = 0
    dataset.loc[(dataset['age_of_respondent'] > 37) & (dataset['age_of_respondent'] <= 58), 'age_of_respondent'] = 1
    dataset.loc[(dataset['age_of_respondent'] > 58) & (dataset['age_of_respondent'] <= 79), 'age_of_respondent'] = 2
    dataset.loc[(dataset['age_of_respondent'] > 79) & (dataset['age_of_respondent'] <= 100), 'age_of_respondent'] = 3
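The same brackets could also be expressed with `pd.cut`, which bins the column in a single pass. This is a sketch on made-up ages, not the notebook's method; `ages`, `bins` and `age_group` are illustrative names, and the lower edge of 15 is chosen so that age 16 falls inside the first right-inclusive interval.

```python
import pandas as pd

# Boundary ages for each bracket, to check the edges behave as intended.
ages = pd.Series([16, 37, 38, 58, 59, 79, 80, 100])

# Intervals are right-inclusive: (15, 37], (37, 58], (58, 79], (79, 100].
bins = [15, 37, 58, 79, 100]
age_group = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3]).astype(int)

print(age_group.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```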

In [25]:
train.head()

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,0,2,uniqueid_1,1,0,1,3,0,0,5,2,3,9
1,0,2,uniqueid_2,0,0,0,5,2,0,1,4,0,4
2,0,2,uniqueid_3,1,1,1,5,0,1,3,3,5,9
3,0,2,uniqueid_4,0,0,1,5,0,0,1,2,2,3
4,0,2,uniqueid_5,0,1,0,8,0,1,0,3,2,5


In [26]:
test.head()

Unnamed: 0,country,year,uniqueid,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,0,2,uniqueid_6056,1,1,3,0,1,1,2,3,2
1,0,2,uniqueid_6060,1,1,7,1,1,1,2,5,3
2,0,2,uniqueid_6065,0,0,3,2,0,4,2,0,8
3,0,2,uniqueid_6072,0,0,6,1,0,1,2,2,8
4,0,2,uniqueid_6073,1,0,3,0,1,0,3,3,8


## 4. Who has a bank account?
Our target is bank_account; the other columns are the features used to predict whether a person has a bank account.

In [27]:
target = train['bank_account']

In [29]:
features = train.drop('bank_account', axis=1)

The 'uniqueid' column is only an identifier; it has no bearing on whether an individual has a bank account, so we drop it.

In [30]:
X = features.drop('uniqueid', axis=1) 

## 5. Modeling

In [32]:
# import the classifiers and the Pipeline wrapper
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

# Cross Validation

In [33]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
k_fold = KFold(n_splits=10, shuffle=True, random_state=0)

In [34]:
pipeline_dt = Pipeline([('clf',DecisionTreeClassifier())])
pipeline_rf = Pipeline([('clf',RandomForestClassifier(n_estimators=13))])
pipeline_nb = Pipeline([('clf',GaussianNB())])
pipeline_svm = Pipeline([('clf',SVC())])

In [36]:
score_dt = cross_val_score(pipeline_dt, X,target,cv=k_fold, n_jobs=-1,scoring = 'accuracy')
score_rf = cross_val_score(pipeline_rf, X,target,cv=k_fold, n_jobs=-1,scoring = 'accuracy')
score_nb = cross_val_score(pipeline_nb, X,target,cv=k_fold, n_jobs=-1,scoring = 'accuracy')
score_svm = cross_val_score(pipeline_svm, X,target,cv=k_fold, n_jobs=-1,scoring = 'accuracy')
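To see what `cross_val_score` returns, here is a small self-contained sketch on synthetic data from `make_classification` (the data and the names `X_toy`, `y_toy`, `k_fold_toy` and `scores` are illustrative). With 10 splits it yields one accuracy per fold, which is why the notebook averages the scores afterwards.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, unrelated to the competition data.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Same fold setup as in the notebook: 10 shuffled folds with a fixed seed.
k_fold_toy = KFold(n_splits=10, shuffle=True, random_state=0)

# Each fold is held out once while the model is trained on the other nine.
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0),
    X_toy, y_toy, cv=k_fold_toy, scoring="accuracy",
)
print(len(scores), scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean gives a rough sense of how much the accuracy varies from fold to fold.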

In [37]:
import numpy as np

print('Decision tree:',np.mean(score_dt)*100)
print('Random Forest:',np.mean(score_rf)*100)
print('naive bayes:',np.mean(score_nb)*100)
print('Support vector:',np.mean(score_svm)*100)

Decision tree: 85.7209785741751
Random Forest: 86.6859357427629
naive bayes: 83.30636674559327
Support vector: 88.27154002272393


The support vector classifier has the highest mean cross-validated accuracy (about 88.3%), so we'll go with it.
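Accuracy on its own can be hard to interpret when one class is much more common than the other. The notebook does not report the class balance, so the sketch below uses made-up labels (80% class 0, 20% class 1) purely to illustrate how a majority-class baseline from scikit-learn's `DummyClassifier` could be scored in the same way. All names here are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import KFold, cross_val_score

# Made-up, imbalanced labels; the features are irrelevant to this baseline.
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((100, 1))

# Always predicts the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
k_fold_toy = KFold(n_splits=5, shuffle=True, random_state=0)

baseline_scores = cross_val_score(
    baseline, X_toy, y_toy, cv=k_fold_toy, scoring="accuracy"
)
baseline_accuracy = baseline_scores.mean()
print(baseline_accuracy)
```

On these toy labels the baseline reaches about 80% accuracy without learning anything, so a real model's accuracy is best read relative to such a baseline computed on the actual target.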

In [38]:
# refit an SVC with default parameters on the full training set
clf = SVC()
clf.fit(X,target)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [40]:
test_data = test.drop('uniqueid', axis=1)
prediction = clf.predict(test_data)

In [41]:
prediction

array([1, 1, 0, ..., 0, 0, 0])

Let's convert the 1s and 0s back to Yes and No.

In [42]:
predict=['Yes' if e==1 else 'No' for e in prediction]
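The same mapping could be done in a vectorised way with NumPy instead of a list comprehension. A minimal sketch on a made-up prediction array (`toy_prediction` and `toy_labels` are illustrative names):

```python
import numpy as np

# Stand-in for the classifier output.
toy_prediction = np.array([1, 1, 0, 0])

# Pick "Yes" where the prediction is 1 and "No" everywhere else.
toy_labels = np.where(toy_prediction == 1, "Yes", "No")
print(toy_labels.tolist())  # ['Yes', 'Yes', 'No', 'No']
```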

## Prediction data

In [43]:
# reload the raw test set to recover the original (unencoded) country names
test = pd.read_csv('Test_v2.csv')

In [44]:
submission = pd.DataFrame({
    'uniqueid':test['uniqueid'],
    'country':test['country'],
    'bank_account':predict
})

In [45]:
submission.to_csv('submit.csv', index=False)

In [46]:
submit = pd.read_csv('submit.csv')
submit.head()

Unnamed: 0,uniqueid,country,bank_account
0,uniqueid_6056,Kenya,Yes
1,uniqueid_6060,Kenya,Yes
2,uniqueid_6065,Kenya,No
3,uniqueid_6072,Kenya,No
4,uniqueid_6073,Kenya,No
