# Loan Default Prediction for African Mobile Lenders

## Business Problem

Mobile lenders in Nigeria's growing fintech industry provide quick, accessible loans to millions of customers (the bank branches in the sample data, such as Dugbe in Ibadan, are Nigerian). Loan default, however, is a significant financial risk for these lenders: it causes substantial monetary losses and limits their ability to extend credit to new customers.

The goal of this project is to build a predictive model that classifies whether a borrower will default on their loan or repay it on time, based on their demographic information, historical loan performance, and current loan details. By accurately identifying high-risk borrowers before loan disbursement, the lender can make more informed decisions, reduce default rates, and improve overall profitability.

This classification model will support stakeholders such as loan officers, risk managers, and product teams in:
- Assessing borrower creditworthiness,
- Designing targeted risk mitigation strategies,
- Allocating credit more efficiently,
- Ultimately enhancing the financial sustainability of mobile lending platforms.

The Zindi loan default dataset is well suited to this problem of predicting whether a loan will default. It combines three key data sources:

- **Demographic data** helps us understand the background and socioeconomic profile of the borrower.
- **Performance data** gives us insight into the current loan we want to classify, including its amount, term, and repayment expectations.
- **Previous loan history** allows us to identify behavioral trends such as past defaults, frequency of borrowing, and repayment patterns.

By merging and analyzing these datasets, we can build a classification model that learns from past borrower behavior and demographic patterns to predict future loan performance. This enables the lender to make more informed, data-driven decisions and reduce the risk of default.


In [None]:
import pandas as pd

In [None]:
# Load the datasets
demographics_df = pd.read_csv('./Data/traindemographics.csv')
performance_df = pd.read_csv('./Data/trainperf.csv')
previous_loans_df = pd.read_csv('./Data/trainprevloans.csv')
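
Before merging, it can help to check whether `customerid` is unique in each table, because a left merge against a table with repeated keys silently repeats loans. If `final_df` ends up with more rows than `performance_df`, repeated keys in the demographics table are the usual cause. The sketch below uses small toy frames rather than the real files; `key_report` is a hypothetical helper, and keeping the first duplicate is an assumption, not a rule taken from the dataset.

```python
import pandas as pd


def key_report(df: pd.DataFrame, key: str = "customerid") -> dict:
    """Return the row count, distinct key count, and number of repeated-key rows."""
    n_rows = len(df)
    n_unique = df[key].nunique()
    return {"rows": n_rows, "unique_keys": n_unique, "duplicate_rows": n_rows - n_unique}


# Toy stand-ins for the real tables (values are illustrative, not from the dataset)
toy_perf = pd.DataFrame({"customerid": ["a", "b", "c"], "loanamount": [10000, 20000, 15000]})
toy_demo = pd.DataFrame({
    "customerid": ["a", "b", "b"],  # customer "b" appears twice
    "employment_status_clients": ["Permanent", "Student", "Student"],
})

print(key_report(toy_demo))

# A plain left merge repeats the loan of customer "b" once per matching row
inflated = pd.merge(toy_perf, toy_demo, on="customerid", how="left")

# Dropping repeated keys first keeps one row per loan; validate="many_to_one"
# raises pandas.errors.MergeError if the right-hand keys are still not unique
toy_demo_unique = toy_demo.drop_duplicates(subset="customerid", keep="first")
safe = pd.merge(toy_perf, toy_demo_unique, on="customerid", how="left", validate="many_to_one")

print(len(toy_perf), len(inflated), len(safe))
```

The same pattern could be applied to `demographics_df` before the merge in the next cell; whether to keep the first or the last duplicate depends on how the repeated rows differ.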

In [14]:
# Merge demographic and performance data on 'customerid'
merged_df = pd.merge(performance_df, demographics_df, on='customerid', how='left')

# The previous-loans table has one row per past loan, so aggregate it to one
# row per customer before merging it into the data above. The summary features
# are: number of previous loans, total/mean/max amount borrowed,
# total/mean amount due, and average loan term.

# Create historical summary features
prev_loan_summary = previous_loans_df.groupby('customerid').agg({
    'systemloanid': 'count',  # Number of previous loans
    'loanamount': ['sum', 'mean', 'max'],  # Aggregated loan amounts
    'totaldue': ['sum', 'mean'],  # Aggregated repayment amounts
    'termdays': 'mean'  # Average loan term
})

# Flatten the multi-index columns
prev_loan_summary.columns = ['_'.join(col) for col in prev_loan_summary.columns]
prev_loan_summary.reset_index(inplace=True)

# Merge the summarized previous loans data with the merged dataset
final_df = pd.merge(merged_df, prev_loan_summary, on='customerid', how='left')

# Display shape and a few rows
print("Final merged dataset shape:", final_df.shape)
final_df.head()


Final merged dataset shape: (4376, 25)


| | customerid | systemloanid | loannumber | approveddate | creationdate | loanamount | totaldue | termdays | referredby | good_bad_flag | ... | bank_branch_clients | employment_status_clients | level_of_education_clients | systemloanid_count | loanamount_sum | loanamount_mean | loanamount_max | totaldue_sum | totaldue_mean | termdays_mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 8a2a81a74ce8c05d014cfb32a0da1049 | 301994762 | 12 | 2017-07-25 08:22:56.000000 | 2017-07-25 07:22:47.000000 | 30000.0 | 34500.0 | 30 | NaN | Good | ... | NaN | Permanent | Post-Graduate | 11.0 | 200000.0 | 18181.818182 | 30000.0 | 242900.0 | 22081.818182 | 30.0 |
| 1 | 8a85886e54beabf90154c0a29ae757c0 | 301965204 | 2 | 2017-07-05 17:04:41.000000 | 2017-07-05 16:04:18.000000 | 15000.0 | 17250.0 | 30 | NaN | Good | ... | DUGBE,IBADAN | Permanent | Graduate | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 8a8588f35438fe12015444567666018e | 301966580 | 7 | 2017-07-06 14:52:57.000000 | 2017-07-06 13:52:51.000000 | 20000.0 | 22250.0 | 15 | NaN | Good | ... | NaN | Permanent | NaN | 6.0 | 60000.0 | 10000.0 | 10000.0 | 70500.0 | 11750.0 | 17.5 |
| 3 | 8a85890754145ace015429211b513e16 | 301999343 | 3 | 2017-07-27 19:00:41.000000 | 2017-07-27 18:00:35.000000 | 10000.0 | 11500.0 | 15 | NaN | Good | ... | NaN | Permanent | NaN | 2.0 | 20000.0 | 10000.0 | 10000.0 | 24500.0 | 12250.0 | 22.5 |
| 4 | 8a858970548359cc0154883481981866 | 301962360 | 9 | 2017-07-03 23:42:45.000000 | 2017-07-03 22:42:39.000000 | 40000.0 | 44000.0 | 30 | NaN | Good | ... | NaN | Permanent | Primary | 8.0 | 150000.0 | 18750.0 | 30000.0 | 188400.0 | 23550.0 | 37.5 |
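
In the preview, the customer in row 1 has no values in the aggregated history columns. A left merge produces this whenever a `customerid` has no row in `prev_loan_summary`, so the gap means "no recorded previous loans" rather than lost data. One possible treatment, sketched here on a toy frame, is to record that fact in a flag and fill the additive features (counts and sums) with zero. The `has_prev_loans` column and the `fill_missing_history` helper are hypothetical names, and leaving the mean and max features unfilled is an assumption about what is sensible, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few aggregated history columns (values are illustrative)
toy = pd.DataFrame({
    "customerid": ["a", "b", "c"],
    "systemloanid_count": [11.0, np.nan, 2.0],
    "loanamount_sum": [200000.0, np.nan, 20000.0],
    "loanamount_mean": [18181.8, np.nan, 10000.0],
})


def fill_missing_history(df: pd.DataFrame) -> pd.DataFrame:
    """Flag customers without history and zero-fill their count and sum features."""
    out = df.copy()
    # 1 if the customer had at least one row in the previous-loans summary
    out["has_prev_loans"] = out["systemloanid_count"].notna().astype(int)
    # Counts and sums are naturally zero for a customer with no past loans;
    # means and maxima are left missing because zero would be misleading
    additive = [c for c in out.columns if c.endswith(("_count", "_sum"))]
    out[additive] = out[additive].fillna(0)
    return out


filled = fill_missing_history(toy)
print(filled)
```

The remaining missing means would still need an imputation strategy, or a model that tolerates missing values, before training.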


In [16]:
# Calculate missing value percentages
missing_percent = (final_df.isnull().sum() / final_df.shape[0]) * 100

# Filter and sort columns with missing values
missing_percent = missing_percent[missing_percent > 0].sort_values(ascending=False)

# Display
missing_percent


bank_branch_clients           99.245887
level_of_education_clients    89.876600
referredby                    86.540219
employment_status_clients     36.380256
longitude_gps                 25.114260
bank_account_type             25.114260
bank_name_clients             25.114260
birthdate                     25.114260
latitude_gps                  25.114260
systemloanid_count             0.205667
loanamount_sum                 0.205667
loanamount_mean                0.205667
loanamount_max                 0.205667
totaldue_sum                   0.205667
totaldue_mean                  0.205667
termdays_mean                  0.205667
dtype: float64
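
Columns that are almost entirely empty carry little signal on their own. A common next step, shown here only as a sketch, is to drop columns whose missing share exceeds a cutoff. The 80% cutoff and the `drop_sparse_columns` helper are assumptions for illustration; with that cutoff, the listing above implies that `bank_branch_clients`, `level_of_education_clients`, and `referredby` would be removed from `final_df`.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, threshold_pct: float = 80.0):
    """Drop columns whose percentage of missing values exceeds threshold_pct.

    Returns the reduced frame and the list of dropped column names.
    """
    missing_pct = df.isnull().mean() * 100
    to_drop = missing_pct[missing_pct > threshold_pct].index.tolist()
    return df.drop(columns=to_drop), to_drop


# Toy frame with 0%, 30%, and 90% missing values (illustrative only)
toy = pd.DataFrame({
    "loanamount": [10000.0] * 10,
    "employment_status_clients": ["Permanent"] * 7 + [np.nan] * 3,
    "bank_branch_clients": [np.nan] * 9 + ["DUGBE,IBADAN"],
})

reduced, dropped = drop_sparse_columns(toy)
print(dropped)
print(reduced.columns.tolist())
```

Dropping is not the only option: for a column such as `referredby`, a binary "was referred" indicator might preserve useful information that would otherwise be discarded.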