# **M1 Exam Submission part 2**

## **Table of Contents**

#### I. Import Libraries and Data
#### II. Data Preparation
#### III. Model Development
#### IV. Model Evaluation

### Funding Duration Prediction:
### Target: funding_duration_days (Time to Secure Loan)

### Problem Statement:

The objective is to predict how long it will take for a loan request to be fully funded, based on factors such as the borrower's country, sector, activity type, loan amount, and number of lenders. By understanding these relationships, the model can provide insights into which factors lead to faster or slower funding times.
### Type of Model:
The model will use regression, as the target variable (funding_duration_days) represents a continuous numerical value. The goal is to predict the number of days required to secure full funding for a loan based on the given features.
### Objective:
This model aims to assist lending platforms and financial institutions in forecasting how long a loan will take to be fully funded after a request is made. This can help in optimizing loan approval strategies, better managing borrower expectations, and allocating resources efficiently to improve funding success rates.
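
The target `funding_duration_days` is not a raw column, so it has to be derived from timestamps. The sketch below shows one way to do that on a toy frame. It assumes the posting timestamp lives in a column called `posted_time` (a hypothetical name here; only `funded_time` appears in this notebook), and it is an illustration rather than the exact derivation used for the exam.

```python
import pandas as pd

# Toy data. `posted_time` is an assumed column name; `funded_time` is used later in this notebook.
toy = pd.DataFrame({
    "posted_time": ["2014-01-01 06:12:39+00:00", "2014-01-02 09:00:00+00:00"],
    "funded_time": ["2014-01-02 10:06:32+00:00", "2014-01-16 09:00:00+00:00"],
})

# Parse both columns as timezone-aware timestamps
posted = pd.to_datetime(toy["posted_time"], utc=True)
funded = pd.to_datetime(toy["funded_time"], utc=True)

# Fractional number of days between posting and full funding
toy["funding_duration_days"] = (funded - posted).dt.total_seconds() / 86400

print(toy["funding_duration_days"])
```

Using fractional days keeps the target continuous, which matches the regression framing above.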

In [5]:
# Install all required libraries
!pip install -r requirements.txt





In [6]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore
import geopandas as gpd
import altair as alt
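# Imported under an alias because the name `data` is reused for the loans DataFrame below
from vega_datasets import data as vega_data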

# library used for gender mapping - section 4
import re

In [7]:
!wget -nc "https://github.com/aaubs/ds-master/raw/main/data/assignments_datasets/KIVA/kiva_loans_part_0.csv.zip"
!wget -nc "https://github.com/aaubs/ds-master/raw/main/data/assignments_datasets/KIVA/kiva_loans_part_1.csv.zip"
!wget -nc "https://github.com/aaubs/ds-master/raw/main/data/assignments_datasets/KIVA/kiva_loans_part_2.csv.zip"

# Unzip to csv
!unzip -o kiva_loans_part_0.csv.zip
!unzip -o kiva_loans_part_1.csv.zip
!unzip -o kiva_loans_part_2.csv.zip

# Loading datasets
data_part1 = pd.read_csv("kiva_loans_part_0.csv")
data_part2 = pd.read_csv("kiva_loans_part_1.csv")
data_part3 = pd.read_csv("kiva_loans_part_2.csv")

File ‘kiva_loans_part_0.csv.zip’ already there; not retrieving.

File ‘kiva_loans_part_1.csv.zip’ already there; not retrieving.

File ‘kiva_loans_part_2.csv.zip’ already there; not retrieving.

Archive:  kiva_loans_part_0.csv.zip
  inflating: kiva_loans_part_0.csv   
  inflating: __MACOSX/._kiva_loans_part_0.csv  
Archive:  kiva_loans_part_1.csv.zip
  inflating: kiva_loans_part_1.csv   
  inflating: __MACOSX/._kiva_loans_part_1.csv  
Archive:  kiva_loans_part_2.csv.zip
  inflating: kiva_loans_part_2.csv   
  inflating: __MACOSX/._kiva_loans_part_2.csv  


In [8]:
# The loan dataset is split across three files; combine them into one DataFrame.
# ignore_index=True gives the combined frame a fresh, unique row index.
data = pd.concat([data_part1, data_part2, data_part3], ignore_index=True)

In [9]:
data = data.drop(['tags', 'use', 'currency', 'country_code'], axis=1)

In [10]:
# Store the row count before dropping missing values, for comparison
data_rows = len(data)

# Drop every row that has a missing value in any column
data.dropna(inplace=True)

# Store the row count of the cleaned dataset
cleaned_rows = len(data)

# Number of rows removed by dropna
drops = data_rows - cleaned_rows

print(f"Number of dropped rows: {drops}")
print(f'In percentage {(drops / data_rows) * 100:.2f} % of the data was removed')

Number of dropped rows: 97078
In percentage 14.46 % of the data was removed
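
Before dropping 14.46 % of the rows, it can help to see which columns are responsible for the missing values. The sketch below does this on a toy frame; the column `partner_id` is a hypothetical example, not a claim about which columns are incomplete in the Kiva data.

```python
import numpy as np
import pandas as pd

# Toy data; `partner_id` is an assumed column used only for illustration
toy = pd.DataFrame({
    "loan_amount": [300, 500, 250, 1000],
    "funded_time": ["2014-01-02", None, "2014-01-05", None],
    "partner_id": [247.0, np.nan, 334.0, 151.0],
})

# Missing values per column, largest first
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# Rows lost by a blanket dropna() versus dropping only on the column the target needs
rows_dropped_all = len(toy) - len(toy.dropna())
rows_dropped_subset = len(toy) - len(toy.dropna(subset=["funded_time"]))
print(rows_dropped_all, rows_dropped_subset)
```

Restricting `dropna` with `subset=` is one option when only a few columns are actually required by the model.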


In [11]:
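# Loans that were never funded have no funded_time.
# Note: the dropna() above already removed every row with a missing funded_time,
# so this replacement leaves the column unchanged at this point.
data['funded_time'] = data['funded_time'].where(data['funded_time'].notna(), None)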

In [12]:
# Calculate z-scores of the loan amount
z_scores = zscore(data['loan_amount'])

# Flag loans whose amount lies more than 2 standard deviations from the mean.
# About 95% of values fall inside this band only if loan_amount is roughly normally distributed.
data['outlier_loan_amount'] = (z_scores > 2) | (z_scores < -2)


# Remove the flagged outliers
data_clean = data[~data['outlier_loan_amount']]

# Count the flagged outliers
data['outlier_loan_amount'].sum()

np.int64(23129)
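
To make the z-score rule more concrete, the sketch below translates the ±2 thresholds back into loan amounts on a small, made-up, right-skewed sample. It uses NumPy's population standard deviation, which matches the default of `scipy.stats.zscore` (`ddof=0`). The numbers are invented for illustration only.

```python
import numpy as np

# Made-up, right-skewed loan amounts
amounts = np.array([100, 200, 200, 300, 300, 400, 500, 600, 800, 10000], dtype=float)

mean = amounts.mean()
std = amounts.std()  # ddof=0, same default as scipy.stats.zscore

z = (amounts - mean) / std
mask = (z > 2) | (z < -2)

# The same thresholds expressed in the original unit
lower_cutoff = mean - 2 * std
upper_cutoff = mean + 2 * std

print("flagged:", int(mask.sum()))
print("cutoffs:", lower_cutoff, upper_cutoff)
```

On skewed, strictly positive data like this, the lower cutoff falls below zero, so only large loans can be flagged. That is worth keeping in mind when reading the share of removed rows.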

In [13]:
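# Exact matches only: rows whose borrower_genders value is anything other than
# 'male' or 'female' are not counted in either group
male = data_clean[data_clean['borrower_genders'] == 'male']
female = data_clean[data_clean['borrower_genders'] == 'female']
print('Total Male loans', male.shape[0])
print('Total Female loans', female.shape[0])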

Total Male loans 101344
Total Female loans 374745
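
The two counts above cover only exact `'male'` and `'female'` values. A minimal sketch of a gender mapping with `re` follows, under the assumption that loans with several borrowers store a comma-separated list such as `"female, male"`. The helper `gender_group` and the `"mixed"` label are hypothetical names introduced for this example.

```python
import re
import pandas as pd

def gender_group(value: str) -> str:
    """Collapse a comma-separated gender list into 'female', 'male' or 'mixed'."""
    tokens = set(re.split(r"\s*,\s*", value.strip().lower()))
    if tokens == {"female"}:
        return "female"
    if tokens == {"male"}:
        return "male"
    return "mixed"

# Toy values in the assumed format
toy = pd.Series(["female", "male", "female, female", "male, female", "female, male, female"])
groups = toy.map(gender_group)
print(groups.value_counts())
```

Splitting into tokens avoids a substring pitfall: the string `"female"` contains `"male"`, so a plain `in` check would misclassify it.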


## Supervised Machine Learning (SML) Preparation

In [14]:
# Libraries needed for section 7 (some already imported)
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.compose import ColumnTransformer

# Import the confusion matrix plotter module
from mlxtend.plotting import plot_confusion_matrix

#  Model selection & Regression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import cross_val_score

# pipeline for the different models
from sklearn.pipeline import Pipeline

# decision Tree
from sklearn.tree import DecisionTreeRegressor

# Install the explainability packages (LIME, SHAP, PDPbox) before importing them
!pip install lime shap pdpbox -qqq

# Tabular data explanation with LIME
import lime.lime_tabular
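
As a rough illustration of how the imports above fit together for a regression target, the sketch below wires a `ColumnTransformer` and a `RandomForestRegressor` into one `Pipeline` on synthetic data. The feature names (`sector`, `loan_amount`, `lender_count`), the synthetic target, and the hyperparameters are assumptions chosen for the example, not the configuration used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the loan data (assumed column names)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "sector": rng.choice(["Agriculture", "Retail", "Food"], size=n),
    "loan_amount": rng.integers(100, 2000, size=n),
    "lender_count": rng.integers(1, 50, size=n),
})
# Invented relationship between features and funding duration in days
y = 5 + X["loan_amount"] / 200 - X["lender_count"] / 10 + rng.normal(0, 1, size=n)

# Scale numeric features, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["loan_amount", "lender_count"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sector"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
print(f"MAE on held-out toy data: {mae:.2f} days")
```

Keeping preprocessing inside the pipeline means the scaler and encoder are fitted on the training split only, which avoids leaking information from the test split.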

In [None]:
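# It is a good idea to inspect the correlation matrix to check whether relevant columns correlate too much with each other
# Assumption: `loans` is not defined earlier in this notebook, so it is taken to be the cleaned DataFrame
loans = data_clean
numeric_columns = loans.select_dtypes(include=[np.number]).columns

# Compute the correlation matrix
corr_matrix = loans[numeric_columns].corr()

# Plot the correlation matrix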
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Correlation Matrix for Numeric Columns in Loans Dataset')
plt.tight_layout()
plt.show()
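
A heatmap can be hard to read when there are many columns, so it may help to list the strongly correlated pairs directly. The sketch below does this on a toy frame; the column names `funded_amount` and `term_in_months` and the 0.9 threshold are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy numeric data with assumed column names
toy = pd.DataFrame({
    "loan_amount": [100, 200, 300, 400, 500],
    "funded_amount": [100, 200, 300, 400, 450],
    "term_in_months": [12, 8, 14, 6, 10],
})

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (column_a, column_b) -> correlation and keep the strong pairs
pairs = upper.stack()
high = pairs[pairs > 0.9]
print(high)
```

Pairs that show up here are candidates for dropping one of the two columns before fitting a linear model.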