# Financial Inclusion in Africa

##### This project predicts the likelihood that a person has a bank account (Yes = 1, No = 0) for each unique id in the test dataset.
##### The model was trained on 70% of the data and tested on the final 30% of the data. The data covers four East African countries: Kenya, Rwanda, Tanzania, and Uganda.

##### This project involves:
##### 1. Importing the libraries needed and loading the data.
##### 2. Performing Exploratory Data Analysis.
##### 3. Data Preprocessing and Data Wrangling.
##### 4. Creating a Model for the Prediction.
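
##### The target encoding and 70/30 split described above can be sketched roughly as follows. This is an illustration, not the notebook's modelling code: the `demo` frame and its values are made up, and the use of `train_test_split` with `stratify` and `random_state=42` is an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for Train.csv (values are made up).
demo = pd.DataFrame({
    "age_of_respondent": [24, 70, 26, 34, 26, 45, 52, 31, 29, 60],
    "household_size":    [3, 5, 5, 5, 8, 2, 4, 6, 1, 3],
    "bank_account":      ["Yes", "No", "Yes", "No", "No",
                          "No", "Yes", "No", "No", "Yes"],
})

# Encode the target as described in the text: Yes = 1, No = 0.
y = demo["bank_account"].map({"Yes": 1, "No": 0})
X = demo.drop(columns="bank_account")

# Hold out 30% of the rows. Stratifying (an assumption) keeps the
# Yes/No ratio similar in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(X_train.shape, X_valid.shape)
```

##### If "the final 30%" means a positional split instead, slicing with `iloc` would be the closer equivalent.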

# 1. Importing Libraries

In [1]:
# Importing necessary libraries for data handling, and visualization.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Importing libraries for machine learning 
#from lightgbm import LGBMClassifier
#from sklearn.model_selection import train_test_split
#from sklearn.metrics import accuracy_score

# Suppress warnings for better readability.
#import warnings
#warnings.filterwarnings('ignore')


#print("Libraries imported successfully.")

# Loading the Dataset

In [2]:
# Load Datasets
train = pd.read_csv(r"C:\Users\LORDINA\Downloads\Train.csv")
test = pd.read_csv(r"C:\Users\LORDINA\Downloads\Test.csv")
ss = pd.read_csv(r"C:\Users\LORDINA\Downloads\SampleSubmission.csv")
variables = pd.read_csv(r"C:\Users\LORDINA\Downloads\VariableDefinitions.csv")



# 2. Performing Exploratory Data Analysis

In [3]:
# Display first few rows to understand structure.
print("Training Data Preview:")
display(train.head(5))

print("Variables Definition Preview: ")
display(variables.head(5))

Training Data Preview:


Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,Kenya,2018,uniqueid_1,Yes,Rural,Yes,3,24,Female,Spouse,Married/Living together,Secondary education,Self employed
1,Kenya,2018,uniqueid_2,No,Rural,No,5,70,Female,Head of Household,Widowed,No formal education,Government Dependent
2,Kenya,2018,uniqueid_3,Yes,Urban,Yes,5,26,Male,Other relative,Single/Never Married,Vocational/Specialised training,Self employed
3,Kenya,2018,uniqueid_4,No,Rural,Yes,5,34,Female,Head of Household,Married/Living together,Primary education,Formally employed Private
4,Kenya,2018,uniqueid_5,No,Urban,No,8,26,Male,Child,Single/Never Married,Primary education,Informally employed


Variables Definition Preview: 


Unnamed: 0,Variable Definitions,Unnamed: 1
0,country,Country interviewee is in.
1,year,Year survey was done in.
2,uniqueid,Unique identifier for each interviewee
3,location_type,"Type of location: Rural, Urban"
4,cellphone_access,"If interviewee has access to a cellphone: Yes, No"


In [4]:
train.columns

Index(['country', 'year', 'uniqueid', 'bank_account', 'location_type',
       'cellphone_access', 'household_size', 'age_of_respondent',
       'gender_of_respondent', 'relationship_with_head', 'marital_status',
       'education_level', 'job_type'],
      dtype='object')

In [None]:
# Checking Dataset Shapes
print(f"Train dataset: {train.shape[0]} rows, {train.shape[1]} columns")
print(f"Test dataset: {test.shape[0]} rows, {test.shape[1]} columns")

##### The output above shows the number of rows and columns in the train and test datasets. The train dataset has 13 variables: 12 independent variables and 1 dependent variable (`bank_account`). The test dataset has the 12 independent variables only.
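
##### One way to confirm that the target is the only column missing from the test set is to compare the column indexes. The miniature `train_demo` and `test_demo` frames below are hypothetical stand-ins with made-up values, not the real files.

```python
import pandas as pd

# Hypothetical miniature frames mirroring part of the column layout
# (values are made up for illustration).
train_demo = pd.DataFrame({
    "uniqueid": ["uniqueid_1", "uniqueid_2"],
    "country": ["Kenya", "Kenya"],
    "bank_account": ["Yes", "No"],
})
test_demo = pd.DataFrame({
    "uniqueid": ["uniqueid_3"],
    "country": ["Kenya"],
})

# Columns present in train but absent from test; only the target is expected.
extra_cols = train_demo.columns.difference(test_demo.columns).tolist()
print("Only in train:", extra_cols)
```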


In [None]:
# Checking for missing values.
print("missing values:", train.isnull().sum())

##### We do not have any missing values in our dataset.
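
##### For datasets that do contain gaps, a compact summary of counts and percentages is easier to read than the raw `isnull().sum()` output. The sketch below uses a made-up `demo` frame with deliberately injected missing values. It does not describe the real training data, which has none.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with injected gaps (values are made up).
demo = pd.DataFrame({
    "household_size": [3, 5, np.nan, 5],
    "job_type": ["Self employed", None, "Informally employed", "Self employed"],
    "country": ["Kenya", "Kenya", "Rwanda", "Uganda"],
})

# Count and percentage of missing values per column.
missing = pd.DataFrame({
    "n_missing": demo.isnull().sum(),
    "pct_missing": demo.isnull().mean() * 100,
})
print(missing[missing["n_missing"] > 0])

# A single boolean answers "is anything missing at all?"
has_missing = bool(demo.isnull().values.any())
print("Any missing values:", has_missing)
```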

In [None]:
# Showing some information about our dataset.
print(train.info())

##### The output lists each variable/feature with its non-null count and data type, along with the size of the dataset. The dataset has no missing values; 3 features have an integer data type and 10 have the object data type.
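
##### The split between numeric and categorical features can also be obtained programmatically with `select_dtypes`. The `demo` frame is a made-up subset of the columns and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical subset of the columns (values are made up).
demo = pd.DataFrame({
    "year": [2018, 2018],
    "household_size": [3, 5],
    "age_of_respondent": [24, 70],
    "country": ["Kenya", "Kenya"],
    "bank_account": ["Yes", "No"],
})

# Numeric columns versus everything else (text/object columns here).
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```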

### Univariate Analysis
##### Here we analyze the data by examining each variable individually.

In [None]:
# Exploring Target Distribution
sns.catplot(x="bank_account", hue="bank_account", kind="count", data=train)

##### The plot shows that the "No" class is larger than the "Yes" class in our target variable, so the majority of respondents don't have bank accounts.
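
##### The imbalance can be quantified with `value_counts`. The `target` series below is made up for illustration and its proportions are not those of the real data. The share of the majority class is also the accuracy of a model that always predicts "No", which is a useful baseline when reading accuracy scores.

```python
import pandas as pd

# Hypothetical target column (values are made up).
target = pd.Series(
    ["No", "No", "Yes", "No", "No", "Yes", "No", "No"],
    name="bank_account",
)

counts = target.value_counts()                 # absolute class counts
shares = target.value_counts(normalize=True)   # class proportions

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = shares.max()

print(counts)
print(shares.round(2))
print("Majority-class baseline accuracy:", majority_baseline)
```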

In [None]:
# Exploring Country Distribution
sns.catplot(x="country", hue="country", kind="count", data=train, palette="tab10")


#### The country plot above shows that most of the data was collected in Rwanda and the least in Uganda.

In [None]:
# Exploring the distributions of the numerical features.
num_cols = ['household_size', 'age_of_respondent']
train[num_cols].hist(bins=25, figsize=(12, 6))
plt.tight_layout()

#### household_size is not normally distributed, and the most common household size is 2 people.
#### For age_of_respondent, most respondents are between 25 and 35 years old.
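
#### Observations read off a histogram can be cross-checked numerically. The sketch below uses a made-up `demo` frame, and the age band edges are an arbitrary assumption; it shows the mode, the skewness, and counts per age band.

```python
import pandas as pd

# Hypothetical numeric columns (values are made up).
demo = pd.DataFrame({
    "household_size": [2, 2, 3, 5, 2, 8, 4, 1],
    "age_of_respondent": [24, 70, 26, 34, 26, 45, 30, 19],
})

# Most frequent household size.
most_common_size = demo["household_size"].mode().iloc[0]

# Positive skewness indicates a long right tail (not a symmetric bell shape).
skewness = demo["household_size"].skew()

# Respondents per age band; the band edges are an arbitrary choice.
age_bands = (
    pd.cut(demo["age_of_respondent"], bins=[15, 25, 35, 45, 100])
    .value_counts()
    .sort_index()
)

print("Most common household size:", most_common_size)
print("Skewness:", round(skewness, 2))
print(age_bands)
```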

### Bivariate Analysis

#### Here we explore the relationship between our target variable and each of the independent variables.

In [None]:
# Exploring location type with bank account.

plt.figure(figsize=(16, 6))
sns.countplot(x='location_type', hue= 'bank_account', data=train)
plt.xticks(
    fontweight='light',
    fontsize='x-large'  
)

##### The plot above shows that the majority of people living in rural areas don't have bank accounts.
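
##### Count plots compare group sizes, so a row-normalized cross-tabulation can make the within-group proportions explicit. The `demo` frame below is made up for illustration; its proportions are not the real survey results.

```python
import pandas as pd

# Hypothetical data (values are made up).
demo = pd.DataFrame({
    "location_type": ["Rural", "Rural", "Rural", "Rural",
                      "Urban", "Urban", "Urban", "Urban"],
    "bank_account":  ["No", "No", "No", "Yes",
                      "No", "No", "Yes", "Yes"],
})

# normalize="index" makes each row sum to 1, giving the share of
# Yes/No within each location type.
rates = pd.crosstab(
    demo["location_type"], demo["bank_account"], normalize="index"
)
print(rates.round(2))
```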

In [None]:
# Exploring gender_of_respondent with bank account.
plt.figure(figsize=(16, 6))
sns.countplot(x='gender_of_respondent', hue= 'bank_account', data=train)
plt.xticks(
    fontweight='light',
    fontsize='x-large'  
)

#### The plot above shows a small difference between the numbers of males and females who have bank accounts; however, more males than females have bank accounts.

In [None]:
# Exploring cellphone_access with bank account.
plt.figure(figsize=(16, 6))
sns.countplot(x='cellphone_access', hue= 'bank_account', data=train)
plt.xticks(
    fontweight='light',
    fontsize='x-large'  
)

#### The plot above shows that the majority of people who have cellphone access don't have bank accounts.
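
#### To compare the groups directly, the share of account holders can be computed for respondents with and without cellphone access. This is a sketch on made-up data; the `demo` frame and the resulting rates are illustrative only.

```python
import pandas as pd

# Hypothetical data (values are made up).
demo = pd.DataFrame({
    "cellphone_access": ["Yes", "Yes", "Yes", "Yes", "Yes", "No", "No", "No"],
    "bank_account":     ["Yes", "No", "No", "Yes", "No", "No", "No", "No"],
})

# Boolean target: True where the respondent has a bank account.
has_account = demo["bank_account"].eq("Yes")

# The mean of a boolean series is the share of True values per group.
rate_by_access = has_account.groupby(demo["cellphone_access"]).mean()
print(rate_by_access)
```

#### In this toy example most people with cellphone access still have no account, yet their account rate is higher than that of people without access. The count plot alone does not make that comparison visible.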

In [None]:
# Exploring education_level with bank account 
plt.figure(figsize=(16, 6))
sns.countplot(x='education_level', hue= 'bank_account', data=train)
plt.xticks(
    rotation=45, 
    horizontalalignment='right',
    fontweight='light',
    fontsize='x-large'  
)

#### The education_level plot shows that the majority of people have primary education and most of them don't have bank accounts.

In [None]:
# Exploring job_type with bank account 
plt.figure(figsize=(16, 6))
sns.countplot(x='job_type', hue= 'bank_account', data=train)
plt.xticks(
    rotation=45, 
    horizontalalignment='right',
    fontweight='light',
    fontsize='x-large'  
)

#### The job_type plot shows that self-employed people make up the largest group without bank accounts, followed by the informally employed and those in farming and fishing.
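
#### Because the same comparison is repeated for every categorical feature, a small helper can report the group size and account rate per category. The function name `account_rate_by` and the `demo` data are hypothetical and only meant as a sketch.

```python
import pandas as pd

def account_rate_by(df, column, target="bank_account"):
    """Return group size and share of 'Yes' for each category of `column`."""
    return (
        df.assign(has_account=df[target].eq("Yes"))
          .groupby(column)["has_account"]
          .agg(n="size", account_rate="mean")
          .sort_values("account_rate", ascending=False)
    )

# Hypothetical data (values are made up).
demo = pd.DataFrame({
    "job_type": ["Self employed", "Self employed", "Self employed",
                 "Informally employed", "Informally employed",
                 "Formally employed Private", "Formally employed Private"],
    "bank_account": ["No", "No", "Yes", "No", "No", "Yes", "Yes"],
})

job_summary = account_rate_by(demo, "job_type")
print(job_summary)
```

#### Reporting `n` next to the rate helps separate large groups with low account rates from small groups that merely look prominent in a count plot.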