### Problem Statement

Financial inclusion is a challenge in East Africa, where many individuals remain unbanked. This project aims to predict whether a person has a bank account (Yes = 1, No = 0) using demographic and financial service data from Kenya, Rwanda, Tanzania, and Uganda. A machine learning model will be trained on 70% of the data and tested on the remaining 30% to support targeted financial inclusion strategies.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline

In [4]:
df = pd.read_csv('Train.csv')

In [5]:
df.head()

|   | country | year | uniqueid | bank_account | location_type | cellphone_access | household_size | age_of_respondent | gender_of_respondent | relationship_with_head | marital_status | education_level | job_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kenya | 2018 | uniqueid_1 | Yes | Rural | Yes | 3 | 24 | Female | Spouse | Married/Living together | Secondary education | Self employed |
| 1 | Kenya | 2018 | uniqueid_2 | No | Rural | No | 5 | 70 | Female | Head of Household | Widowed | No formal education | Government Dependent |
| 2 | Kenya | 2018 | uniqueid_3 | Yes | Urban | Yes | 5 | 26 | Male | Other relative | Single/Never Married | Vocational/Specialised training | Self employed |
| 3 | Kenya | 2018 | uniqueid_4 | No | Rural | Yes | 5 | 34 | Female | Head of Household | Married/Living together | Primary education | Formally employed Private |
| 4 | Kenya | 2018 | uniqueid_5 | No | Urban | No | 8 | 26 | Male | Child | Single/Never Married | Primary education | Informally employed |


### Data Understanding/Profiling
The dataset contains personal information on individuals in four countries: Kenya, Rwanda, Tanzania, and Uganda.

In [6]:
df.shape

(23524, 13)

In [7]:
df.columns

Index(['country', 'year', 'uniqueid', 'bank_account', 'location_type',
       'cellphone_access', 'household_size', 'age_of_respondent',
       'gender_of_respondent', 'relationship_with_head', 'marital_status',
       'education_level', 'job_type'],
      dtype='object')

### Data Dictionary

| Column Name               | Description                                                                 | Data Type     |
|---------------------------|-----------------------------------------------------------------------------|---------------|
| `country`                 | Country where the individual resides (Kenya, Rwanda, Tanzania, Uganda)     | *Categorical* |
| `year`                    | Year the data was collected (2016–2018)                                     | *Integer*     |
| `uniqueid`                | Unique identifier for each individual                                       | *String*      |
| `bank_account`            | Whether the individual has a bank account (Yes = 1, No = 0)                 | *Binary* / *Integer* |
| `location_type`           | Type of residence (Urban or Rural)                                          | *Categorical* |
| `cellphone_access`        | Whether the individual has access to a cellphone (Yes/No)                   | *Categorical* |
| `household_size`          | Number of people living in the individual’s household                       | *Integer*     |
| `age_of_respondent`       | Age of the respondent (in years)                                            | *Integer*     |
| `gender_of_respondent`    | Gender of the respondent (Male/Female)                                      | *Categorical* |
| `relationship_with_head`  | Relationship of the respondent to the head of the household                 | *Categorical* |
| `marital_status`          | Marital status of the respondent (e.g., Married, Single, Divorced)          | *Categorical* |
| `education_level`         | Highest level of education attained by the respondent                       | *Categorical* |
| `job_type`                | Type of job or employment the respondent is engaged in                      | *Categorical* |

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23524 entries, 0 to 23523
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   country                 23524 non-null  object
 1   year                    23524 non-null  int64 
 2   uniqueid                23524 non-null  object
 3   bank_account            23524 non-null  object
 4   location_type           23524 non-null  object
 5   cellphone_access        23524 non-null  object
 6   household_size          23524 non-null  int64 
 7   age_of_respondent       23524 non-null  int64 
 8   gender_of_respondent    23524 non-null  object
 9   relationship_with_head  23524 non-null  object
 10  marital_status          23524 non-null  object
 11  education_level         23524 non-null  object
 12  job_type                23524 non-null  object
dtypes: int64(3), object(10)
memory usage: 2.3+ MB


### Data Preparation

In [9]:
df.isnull().sum()

country                   0
year                      0
uniqueid                  0
bank_account              0
location_type             0
cellphone_access          0
household_size            0
age_of_respondent         0
gender_of_respondent      0
relationship_with_head    0
marital_status            0
education_level           0
job_type                  0
dtype: int64

In [10]:
df['bank_account'].value_counts()

bank_account
No     20212
Yes     3312
Name: count, dtype: int64
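
The counts above show a strong class imbalance, so plain accuracy can look good even for a model that learns nothing. As a quick worked check that uses only the two counts printed above, the sketch below computes the share of respondents with an account and the accuracy of a trivial model that always predicts "No". The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {"No": 20212, "Yes": 3312}

total = sum(counts.values())
positive_rate = counts["Yes"] / total

# Accuracy of a trivial model that always predicts the majority class ("No").
majority_baseline = max(counts.values()) / total

print(f"total rows:         {total}")
print(f"share with account: {positive_rate:.3f}")
print(f"majority baseline:  {majority_baseline:.3f}")
```

Any trained model should be judged against this baseline, and metrics such as precision and recall for the "Yes" class are likely to be more informative than accuracy alone.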

In [11]:
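# Encode the target as integers (No -> 0, Yes -> 1).
# map() avoids the downcasting FutureWarning that replace() emits in recent pandas versions.
df['bank_account'] = df['bank_account'].map({'No': 0, 'Yes': 1})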


In [12]:
df.head()

|   | country | year | uniqueid | bank_account | location_type | cellphone_access | household_size | age_of_respondent | gender_of_respondent | relationship_with_head | marital_status | education_level | job_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kenya | 2018 | uniqueid_1 | 1 | Rural | Yes | 3 | 24 | Female | Spouse | Married/Living together | Secondary education | Self employed |
| 1 | Kenya | 2018 | uniqueid_2 | 0 | Rural | No | 5 | 70 | Female | Head of Household | Widowed | No formal education | Government Dependent |
| 2 | Kenya | 2018 | uniqueid_3 | 1 | Urban | Yes | 5 | 26 | Male | Other relative | Single/Never Married | Vocational/Specialised training | Self employed |
| 3 | Kenya | 2018 | uniqueid_4 | 0 | Rural | Yes | 5 | 34 | Female | Head of Household | Married/Living together | Primary education | Formally employed Private |
| 4 | Kenya | 2018 | uniqueid_5 | 0 | Urban | No | 8 | 26 | Male | Child | Single/Never Married | Primary education | Informally employed |


In [13]:
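# Separate the features from the target.
# Note: 'uniqueid' is still among the features at this point.
x_train = df.drop('bank_account', axis=1)
y_train = df['bank_account']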

In [14]:
x_train

|   | country | year | uniqueid | location_type | cellphone_access | household_size | age_of_respondent | gender_of_respondent | relationship_with_head | marital_status | education_level | job_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kenya | 2018 | uniqueid_1 | Rural | Yes | 3 | 24 | Female | Spouse | Married/Living together | Secondary education | Self employed |
| 1 | Kenya | 2018 | uniqueid_2 | Rural | No | 5 | 70 | Female | Head of Household | Widowed | No formal education | Government Dependent |
| 2 | Kenya | 2018 | uniqueid_3 | Urban | Yes | 5 | 26 | Male | Other relative | Single/Never Married | Vocational/Specialised training | Self employed |
| 3 | Kenya | 2018 | uniqueid_4 | Rural | Yes | 5 | 34 | Female | Head of Household | Married/Living together | Primary education | Formally employed Private |
| 4 | Kenya | 2018 | uniqueid_5 | Urban | No | 8 | 26 | Male | Child | Single/Never Married | Primary education | Informally employed |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23519 | Uganda | 2018 | uniqueid_2113 | Rural | Yes | 4 | 48 | Female | Head of Household | Divorced/Seperated | No formal education | Other Income |
| 23520 | Uganda | 2018 | uniqueid_2114 | Rural | Yes | 2 | 27 | Female | Head of Household | Single/Never Married | Secondary education | Other Income |
| 23521 | Uganda | 2018 | uniqueid_2115 | Rural | Yes | 5 | 27 | Female | Parent | Widowed | Primary education | Other Income |
| 23522 | Uganda | 2018 | uniqueid_2116 | Urban | Yes | 7 | 30 | Female | Parent | Divorced/Seperated | Secondary education | Self employed |


In [15]:
y_train

0        1
1        0
2        1
3        0
4        0
        ..
23519    0
23520    0
23521    0
23522    0
23523    0
Name: bank_account, Length: 23524, dtype: int64

In [16]:
x_train = pd.get_dummies(x_train, drop_first=True)
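
At this point `x_train` still contains `uniqueid`, so `pd.get_dummies` creates one indicator column per distinct ID. That would explain the 8766 columns reported further down for the aligned test features. The sketch below reproduces the effect on a toy frame with made-up values (only the column names mirror the dataset) and shows one possible remedy, dropping the identifier before encoding. It is a suggestion under those assumptions, not what the notebook does.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "uniqueid": ["uniqueid_1", "uniqueid_2", "uniqueid_3", "uniqueid_4"],
    "location_type": ["Rural", "Urban", "Rural", "Urban"],
    "household_size": [3, 5, 5, 8],
})

# Encoding with the identifier included: one dummy per distinct ID (minus the first).
with_id = pd.get_dummies(toy, drop_first=True)

# Encoding after dropping the identifier: only genuine features remain.
without_id = pd.get_dummies(toy.drop(columns=["uniqueid"]), drop_first=True)

print(with_id.shape[1], list(with_id.columns))
print(without_id.shape[1], list(without_id.columns))
```

Because the ID dummies carry no information that generalises to unseen respondents, removing them should shrink the feature matrix considerably without hurting the model's ability to predict on new data.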

### Model Building

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

In [18]:
model = make_pipeline(StandardScaler(),
        LogisticRegression(max_iter=1000, solver='lbfgs'))

model.fit(x_train, y_train)
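
The model above is fitted on every row of `Train.csv`, and `Test.csv` has no `bank_account` column, so the notebook has no labelled data left to score against. One way to get the 70/30 evaluation described in the problem statement is to hold out 30% of the labelled rows. The sketch below shows the idea on synthetic data from `make_classification` (a stand-in, not the survey data); the `random_state=42` seed and the use of `stratify` are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded features, with roughly 14% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.86, 0.14], random_state=42
)

# 70/30 split; stratify keeps the class ratio similar in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Same pipeline shape as in the notebook: scaling followed by logistic regression.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, solver="lbfgs"))
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_val)

# Per-class precision and recall are more telling than accuracy on imbalanced data.
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred, zero_division=0))
```

With the real data, `X` and `y` would be the encoded `x_train` and `y_train`, and the final model could still be refitted on all labelled rows before predicting on `Test.csv`.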

### Model Evaluation

In [31]:
test = pd.read_csv('Test.csv')
test.head()

|   | country | year | uniqueid | location_type | cellphone_access | household_size | age_of_respondent | gender_of_respondent | relationship_with_head | marital_status | education_level | job_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kenya | 2018 | uniqueid_6056 | Urban | Yes | 3 | 30 | Male | Head of Household | Married/Living together | Secondary education | Formally employed Government |
| 1 | Kenya | 2018 | uniqueid_6060 | Urban | Yes | 7 | 51 | Male | Head of Household | Married/Living together | Vocational/Specialised training | Formally employed Private |
| 2 | Kenya | 2018 | uniqueid_6065 | Rural | No | 3 | 77 | Female | Parent | Married/Living together | No formal education | Remittance Dependent |
| 3 | Kenya | 2018 | uniqueid_6072 | Rural | No | 6 | 39 | Female | Head of Household | Married/Living together | Primary education | Remittance Dependent |
| 4 | Kenya | 2018 | uniqueid_6073 | Urban | No | 3 | 16 | Male | Child | Single/Never Married | Secondary education | Remittance Dependent |


In [42]:
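# Keep the IDs for the output file, then drop them from the features.
ids = test['uniqueid'].copy()
x_test = test.drop(['uniqueid'], axis=1)

# One-hot encode the test features the same way as the training features.
x_test = pd.get_dummies(x_test, drop_first=True)

# Align the test columns with the training columns; columns missing from the test set are filled with 0.
x_test = x_test.reindex(columns=x_train.columns, fill_value=0)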
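
The `reindex` step matters because `pd.get_dummies` only creates columns for categories that actually occur in the frame it is given, so separately encoded train and test sets can end up with different columns. The toy sketch below (made-up values, hypothetical frame names) shows what the alignment does: columns missing from the test set are added as zeros, and columns the model never saw are dropped.

```python
import pandas as pd

# Made-up rows: the test set lacks Rwanda/Uganda and contains an unseen country.
train_raw = pd.DataFrame({"country": ["Kenya", "Rwanda", "Uganda"], "household_size": [3, 5, 2]})
test_raw = pd.DataFrame({"country": ["Kenya", "Tanzania"], "household_size": [4, 6]})

train_enc = pd.get_dummies(train_raw, drop_first=True)
test_enc = pd.get_dummies(test_raw, drop_first=True)

# Force the test frame onto the training columns, in the same order.
aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
print(list(aligned.columns))
```

One caveat: a category that appears only in the test set loses its column, so those rows look like the dropped baseline category to the model. Fitting a single encoder on the training data and reusing it for the test data is a common alternative.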

In [43]:
x_test.isnull().sum()

year                             0
household_size                   0
age_of_respondent                0
country_Rwanda                   0
country_Tanzania                 0
                                ..
job_type_Informally employed     0
job_type_No Income               0
job_type_Other Income            0
job_type_Remittance Dependent    0
job_type_Self employed           0
Length: 8766, dtype: int64

In [44]:
predictions = model.predict(x_test)

results = pd.DataFrame({'uniqueid': ids, 'bank_account': predictions})

In [45]:
results

|   | uniqueid | bank_account |
|---|---|---|
| 0 | uniqueid_6056 | 1 |
| 1 | uniqueid_6060 | 1 |
| 2 | uniqueid_6065 | 0 |
| 3 | uniqueid_6072 | 0 |
| 4 | uniqueid_6073 | 0 |
| ... | ... | ... |
| 10081 | uniqueid_2998 | 0 |
| 10082 | uniqueid_2999 | 0 |
| 10083 | uniqueid_3000 | 0 |
| 10084 | uniqueid_3001 | 0 |


In [47]:
results.to_csv('bank_account_predictions.csv', index=False)