# Zindi Project - Financial Inclusion in Africa

A lack of financial inclusion remains one of the main obstacles to economic and human development in Africa. For example, across Kenya, Rwanda, Tanzania, and Uganda only 9.1 million adults (or 14% of adults) have access to or use a commercial bank account.

Traditionally, access to bank accounts has been regarded as an indicator of financial inclusion. Despite the proliferation of mobile money in Africa, and the growth of innovative fintech solutions, banks still play a pivotal role in facilitating access to financial services. Access to bank accounts enables households to save and make payments while also helping businesses build up their credit-worthiness and improve their access to loans, insurance, and related services. Therefore, access to bank accounts is an essential contributor to long-term economic growth.

__The objective of this project is to create a machine learning model to predict which individuals are most likely to have or use a bank account.__ The models and solutions developed can provide an indication of the state of financial inclusion in Kenya, Rwanda, Tanzania and Uganda, while providing insights into some of the key factors driving individuals’ financial security.

# Our Goal

Our goal is to predict the missing values (NaNs) in our target column bank_account.

## Data Overview

| column | additional information |
|--------|------------------------|
| country | Country interviewee is in |
| year | Year survey was done in  |
| uniqueid | Unique identifier for each interviewee | 
| location_type | Type of location: Rural, Urban |
| cellphone_access | If interviewee has access to a cellphone: Yes, No |
| household_size | Number of people living in one house |
| age_of_respondent | The age of the interviewee |
| gender_of_respondent | Gender of interviewee: Male, Female | 
| relationship_with_head | The interviewee’s relationship with the head of the house: Head of Household, Spouse, Child, Parent, Other relative, Other non-relatives, Dont know |
| marital_status | The marital status of the interviewee: Married/Living together, Divorced/Seperated, Widowed, Single/Never Married, Don’t know |
| education_level | Highest level of education: No formal education, Primary education, Secondary education, Vocational/Specialised training, Tertiary education, Other/Dont know/RTA |
| job_type | Type of job interviewee has: Farming and Fishing, Self employed, Formally employed Government, Formally employed Private, Informally employed, Remittance Dependent, Government Dependent, Other Income, No Income, Dont Know/Refuse to answer |

Importing Modules

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy.stats import chi2_contingency
from matplotlib.ticker import PercentFormatter

Importing the data

In [None]:
# Import the data into a dataframe
test = pd.read_csv('data/Test.csv')
train = pd.read_csv('data/train.csv')

# Make a new Dataframe with all the data
df = pd.concat([test, train])

Exporting Dataframe into a csv

In [None]:
# Export the dataframe into a csv:

# Define the path to the folder in your repository
folder_path = 'data/'

# Define the file name and extension
file_name = 'data.csv'

# Join the folder path and file name (folder_path already ends with '/')
file_path = f'{folder_path}{file_name}'

# Export the DataFrame to the specified folder
df.to_csv(file_path, index=False)


## EDA: Exploring the data

In this part of the notebook we look at and analyze the financial inclusion data we got from Zindi.

In [None]:
# Print the shape of the data
print('Financial Inclusion dataset')
print('==================')
print('# observations: {}'.format(df.shape[0]))
print('# features:     {}'.format(df.shape[1]-1))

In [None]:
# Display first 5 rows
df.head()

In [None]:
# Print a concise summary of a DataFrame
df.info()

In [None]:
# Generate descriptive statistics
df.describe()

In [None]:
# The column labels of the DataFrame.
df.columns

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# check for unique values in column bank_account
df['bank_account'].unique()

In [None]:
# Checking for data imbalance
df['bank_account'].value_counts()

We see that we have NaNs in the column bank_account. Our goal is to fill these NaN values with the values our model predicts (hopefully correctly).

What we need to do is:
* Create a data frame without NaNs. This is the data we will then split into train and test data.
* Create a data frame with all the NaN values. This is the data we will then have our model predict on.
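
The split described above can be sketched on a small toy frame. This is only an illustration: the frame `toy` and its values are made up, and only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two labelled rows and two rows with a missing target
toy = pd.DataFrame({
    "age_of_respondent": [24, 70, 26, 34],
    "bank_account": ["Yes", "No", np.nan, np.nan],
})

# Boolean mask that is True where the target is missing
target_missing = toy["bank_account"].isna()

# Rows with a known target: these can be split into train and test data
labelled = toy[~target_missing].reset_index(drop=True)

# Rows with a missing target: the model predicts these later
unlabelled = toy[target_missing].reset_index(drop=True)

print(len(labelled), len(unlabelled))
```

Filtering with a boolean mask keeps only the matching rows, so both frames together contain every original row exactly once.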

In [None]:
# Create a new data frame without the NaN in our target feature
df_wo_target_nan = df.dropna(subset=['bank_account'])

# set new index for our dataframe without the NaNs
df_wo_target_nan = df_wo_target_nan.reset_index(drop=True)
df_wo_target_nan.isnull().value_counts()
df_wo_target_nan.head()

In [None]:
# create a new data frame with only the NaN in our target feature
# (boolean indexing keeps only these rows; df.where would keep every row and blank out the rest)
df_with_target_nan = df[df['bank_account'].isnull()]
df_with_target_nan.head()

# Set a new index for our dataframe with the NaNs
df_with_target_nan = df_with_target_nan.reset_index(drop=True)
df_with_target_nan.head()

From here on we focus on our dataframe without the NaNs.
A quick look at the dataframe:

In [None]:
df_wo_target_nan.info()

In [None]:
df_wo_target_nan.describe()

In [None]:
df_wo_target_nan.head().T

In [None]:
# Show overview of all the unique values of the dataframe:
for column in df_wo_target_nan.columns:
    unique_values = df_wo_target_nan[column].unique()
    print(f"Column '{column}' has {len(unique_values)} unique value(s):")
    print(unique_values)
    print()

In [None]:
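# Checking for duplicate rows
n_duplicates = df_wo_target_nan.duplicated().sum()

display(df_wo_target_nan.duplicated().value_counts())

# Only report "no duplicates" if that is actually the case
if n_duplicates == 0:
    print('No duplicates found.')
else:
    print(f'{n_duplicates} duplicate rows found.')
print("______"*30)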

In [None]:
# Plot distribution of features 
'''
features = df_wo_target_nan.columns.tolist()
features.remove('bank_account')

fig,ax = plt.subplots(4,3,figsize=(34,30))
count = 0
for item in features:
    sns.histplot(df_wo_target_nan[item], kde=True, ax=ax[int(count/3)][count%3], color='#33658A').set(title=item, xlabel='')
    count += 1
ax.flat[-1].set_visible(False)
fig.tight_layout(pad=3)
'''

In [None]:
# Plotting correlation between numeric columns

import plotly.graph_objects as go

numeric_df = df_wo_target_nan.select_dtypes(include='number')
correlation_matrix = numeric_df.corr()

fig = go.Figure(data=go.Heatmap(z=correlation_matrix.values, x=correlation_matrix.columns, y=correlation_matrix.index))

fig.show()

To find out whether the categorical (object-typed) features are related to our target feature, we use Cramér's V:

* Small effect:
Cramér's V values below about 0.1 indicate a weak or negligible association between the categorical variables.

* Medium effect:
Cramér's V values between about 0.1 and 0.3 suggest a moderate association: the variables depend on each other to some degree, but not strongly.

* Large effect:
Cramér's V values of about 0.3 or higher indicate a relatively strong association, that is, a notable dependency between the variables.
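
To get a feeling for these thresholds, here is a minimal sketch that computes Cramér's V for two made-up 2x2 contingency tables. The helper name `cramers_v_from_table` and all counts are hypothetical. The sketch passes `correction=False` so that the result matches the plain formula sqrt(chi2 / (n * (min(rows, cols) - 1))); SciPy's default applies Yates' continuity correction to 2x2 tables, which gives slightly smaller values.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v_from_table(table: pd.DataFrame) -> float:
    """Cramér's V for a contingency table (hypothetical helper)."""
    # Plain chi-square statistic, without Yates' continuity correction
    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()          # total number of observations
    k = min(table.shape) - 1            # smaller table dimension minus one
    return float(np.sqrt(chi2 / (n * k)))


# Made-up counts: rows are a binary feature, columns are the target
independent = pd.DataFrame([[50, 50], [50, 50]],
                           index=["Rural", "Urban"], columns=["No", "Yes"])
dependent = pd.DataFrame([[90, 10], [10, 90]],
                         index=["Rural", "Urban"], columns=["No", "Yes"])

v_indep = cramers_v_from_table(independent)
v_dep = cramers_v_from_table(dependent)
print(round(v_indep, 3), round(v_dep, 3))
```

The first table has identical rows, so the feature carries no information about the target and V is 0. In the second table the rows are almost mirror images, which gives a V of 0.8, well inside the "large effect" range.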


In [None]:
# Check each column vs. the target column if there is a correlation by creating a function using the Cramér's V:

# make a list with each column name 
column_names = df_wo_target_nan.columns.tolist()
# delete bank_account from the list
column_names.remove('bank_account')
# create target value
target_column = 'bank_account'

def cramers_v(columns):
    
    for name in columns:
        # Create a contingency table
        contingency_table = pd.crosstab(df_wo_target_nan[name], df_wo_target_nan[target_column])

        # Perform chi-square test
        chi2, p, *_ = chi2_contingency(contingency_table)

        # Calculate Cramér's V
        n = len(df_wo_target_nan)
        v = np.sqrt(chi2 / (n * (min(contingency_table.shape) - 1)))

        # only print output if Cramér's V is at least 0.1
        if v >= 0.1:

            print('-----------------------------')
            print(f'{name} vs. {target_column}')
            # print("Chi-square:", chi2)
            # print("p-value:", p)
            print("Cramér's V:", round(v, 3))
        

cramers_v(column_names)

## Cleaning the data

We now drop columns if they:

* are an ID
* have no or only negligible correlation to the target feature


In [None]:
df_wo_target_nan.columns

Renaming the columns for better readability:

In [None]:
# Renaming the column names
df_wo_target_nan.rename(columns = {'country': 'country',
        'year': 'year',
        'uniqueid': 'id',
        'location_type': 'location',
        'cellphone_access': 'cellphone',
        'household_size': 'household_size',
        'age_of_respondent': 'age',
        'gender_of_respondent': 'gender',
        'relationship_with_head': 'relationship_with_head', 
        'marital_status': 'marital_status', 
        'education_level': 'education',
        'job_type': 'job',
        'bank_account': 'bank_account'},
        inplace = True)

df_wo_target_nan.head().T

In [None]:
# drop id column
df_wo_target_nan = df_wo_target_nan.drop('id', axis=1)
df_wo_target_nan.head().T

In [None]:
df_wo_target_nan.nunique()

In [None]:
# Plotting the target variable
plt.title('Bank Account Count')
sns.countplot(x=df_wo_target_nan.bank_account);

In [None]:
# plot which shows percentage of people with and without a bank account
data = df_wo_target_nan['bank_account']

plt.hist(data, weights=np.ones(len(data)) / len(data))

plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
plt.show()

In the two plots we can see that the data isn't well balanced. Out of 23,524 people in our dataset only 3,312 (~14%) have a bank account; 20,212 don't.
<br>
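
The class shares can be recomputed directly from the two counts quoted above. This is a small sketch; the Series `counts` is typed in by hand here rather than read from the dataframe.

```python
import pandas as pd

# Class counts as quoted in the text above
counts = pd.Series({"No": 20212, "Yes": 3312}, name="bank_account")

# Share of each class in the labelled data
shares = counts / counts.sum()
print(shares.round(3))
```

With roughly 86% of respondents in the "No" class, a model that always predicts "No" would already reach about 86% accuracy, so accuracy alone is probably not a sufficient metric for this target.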

In [None]:
# Checking for Outliers without bank_account column
column_names = df_wo_target_nan.columns.tolist()
column_names.remove('bank_account')
print(column_names)
df_outliers = df_wo_target_nan[column_names].copy()



df_outliers.plot(kind='box', subplots=True, layout=(4,3), figsize=(34,30))
plt.show() 

In this plot we can see that we have outliers in the household_size column and also in the age column.
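
Box plots in matplotlib draw their whiskers at 1.5 times the interquartile range (IQR) by default, and points beyond them are shown as outliers. The same rule can be applied numerically to count the outliers. The sketch below uses a hypothetical helper `iqr_outlier_mask` and made-up household sizes, not the real data.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, factor: float = 1.5) -> pd.Series:
    """Mark values outside [Q1 - factor*IQR, Q3 + factor*IQR] (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - factor * iqr) | (values > q3 + factor * iqr)


# Made-up household sizes with one unusually large household
household_size = pd.Series([1, 2, 3, 3, 4, 4, 5, 5, 6, 21])

mask = iqr_outlier_mask(household_size)
print(household_size[mask].tolist())
```

Here Q1 is 3 and Q3 is 5, so everything above 8 is flagged, which is only the household of 21.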

In [None]:
df_wo_target_nan['bank_account'].value_counts()

In [None]:
# Plotting a pairplot to see how the variables differ depending on our target variable - 'bank_account'
sns.pairplot(df_wo_target_nan, hue='bank_account', height=2);

In [None]:
# Correlation heatmap

fig, ax = plt.subplots(figsize=(12,8))

# Create a new DataFrame that only includes the numerical variables
df_numeric = df_wo_target_nan.select_dtypes(include=['float64', 'int64'])

# Compute correlations
correlations = df_numeric.corr()

# Generate a boolean mask for the upper triangle
mask = np.triu(np.ones_like(correlations, dtype=bool))

# Draw the heatmap with the mask and the viridis colormap
sns.heatmap(correlations, mask=mask, cmap="viridis", vmax=1, annot=True,
            linewidths=1, cbar_kws={"shrink": .7}, ax=ax);


In [None]:
%config InlineBackend.figure_format = 'retina'
sns.set_context('notebook')
sns.set(rc={'figure.figsize':(15, 6)})


Columns with categorical values are:

* country
* year
* location
* cellphone
* gender
* relationship_with_head
* marital_status
* education
* job
* bank_account

Because most of our columns are categorical, we use one-hot encoding.
We now create our dummy variables.

In [None]:
# Create dummy variables for our categorical columns

cat_feats = ['country', 'year', 'location', 'cellphone', 'gender', 'relationship_with_head', 'marital_status', 'education', 'job', 'bank_account']
df_wo_target_nan = pd.get_dummies(df_wo_target_nan, columns=cat_feats, drop_first=True)

df_wo_target_nan.columns
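
To see what `drop_first=True` does, here is a small sketch on a made-up frame with two of the categorical columns. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "location": ["Rural", "Urban", "Rural"],
    "bank_account": ["No", "Yes", "No"],
})

# One dummy column per category, minus the first category of each column
encoded = pd.get_dummies(toy, columns=["location", "bank_account"], drop_first=True)
print(encoded.columns.tolist())
```

Each two-valued column collapses into a single indicator column (`location_Urban`, `bank_account_Yes`), because the dropped first category is implied whenever the indicator is false. This also turns the target into one binary column.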