# Standard Bank Tech Impact Project

## 1. Exploratory data analysis 

### Understanding the problem

According to the Uganda FinScope 2018 survey findings, 46% of adults (8.5 million) borrowed money during the 12-month
period. The majority borrowed to cover regular living expenses (such as education) during low-income periods.

Further, the largest sources of borrowing are informal lenders:

1. Savings groups

2. Burial societies

3. Community-based money lenders, etc.
        
Xente is a Ugandan e-commerce startup that makes it easy for consumers to make payments, get loans, and shop using
just a mobile phone.

The objective of this project is to create a machine learning model that predicts which individuals are most
likely to default on their loans, based on their loan repayment behaviour and e-commerce transaction activity.

### Type of the problem

This is a binary classification problem: the goal is to predict which individuals are most likely to default on their loans,
based on their loan repayment behaviour and e-commerce transaction activity.



### Hypothesis Generation

1. Few people paid back their loans within the specified time.


2. Most of the money they had borrowed was not used for business purposes.


3. The people who use their loans for non-business activities are unlikely to pay back their loans.


4. The people who use their loans wisely are able to pay back their loans within the specified time.


5. The banks are unable to improve individual loans because of failure to meet the rules.

In [None]:
# Loading important packages

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams["axes.labelsize"] = 18
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline 

In [None]:
# importing the dataset

data = pd.read_csv('/home/valentineiscoming/Pictures/lateness/lateness assignment/Train.csv')

In [None]:
# viewing first five rows in the dataset

data.head()

In [None]:
# Understanding the columns from variable definitions

variable_definitions = pd.read_csv('/home/valentineiscoming/Pictures/lateness/lateness assignment/VariableDefinitions.csv')

variable_definitions

In [None]:
# view its shape

print(f'the shape is: {data.shape}')

In [None]:
# viewing all columns

list(data.columns)

In [None]:
# more information about columns - datatypes

data.info()

In [None]:
# checking missing values

data.isna().sum()
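
Raw missing-value counts are easier to compare when expressed as a share of the rows. The sketch below is a minimal illustration on a made-up frame (`toy` and its values are hypothetical, not taken from `Train.csv`); the same expression could be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the training data
toy = pd.DataFrame({
    "Amount": [100.0, np.nan, 250.0, 400.0],
    "InvestorId": ["InvestorId_1", None, None, "InvestorId_2"],
    "IsDefaulted": [0.0, np.nan, np.nan, 1.0],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Columns where a large share of rows is missing may deserve a different treatment than columns with only a few gaps.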

### Univariate analysis

### Dealing with Categorical features

In [None]:
# view the target

data['IsDefaulted'].value_counts()

In [None]:
# Explore target distribution

sns.catplot(x='IsDefaulted', kind='count', data=data)

where:

    1.0 - defaulted on the agreed payback time

    0.0 - did not default on the agreed payback time

Few people pay back their loans within the specified time.
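
To put numbers on the bar chart, the class shares can be printed directly. This is a minimal sketch on a made-up series (`toy_target` and its values are illustrative, not taken from the dataset); on the real data the same call would be made on `data['IsDefaulted']`.

```python
import pandas as pd

# Hypothetical target column: 1.0 = defaulted, 0.0 = did not default
toy_target = pd.Series([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0], name="IsDefaulted")

# Share of each class instead of raw counts
class_share = toy_target.value_counts(normalize=True)
print(class_share)
```

A strongly uneven split is a hint that plain accuracy may be misleading and that the split and the metrics should account for the imbalance.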
    

In [None]:
# Explore IsThirdPartyConfirmed distribution

sns.catplot(x='IsThirdPartyConfirmed', kind='count', data=data)

where:

1.0 - loan order succeeded on the platform

0.0 - loan order did not succeed on the platform

Most loan orders succeeded on the platform.

In [None]:
# Explore SubscriptionId distribution

sns.catplot(x='SubscriptionId', kind='count', data=data)

plt.xticks(rotation = 90)

SubscriptionId_7 recorded the highest number of subscriptions compared to SubscriptionId_2.


In [None]:
# Explore CurrencyCode Distribution

sns.catplot(x='CurrencyCode', kind='count', data=data)

The people surveyed used Ugandan Shillings (UGX) to pay back their loans.

In [None]:
# Explore CountryCode distribution

sns.catplot(x='CountryCode', kind='count', data=data)

The people surveyed use the Ugandan country code to access mobile services.

In [None]:
# Explore ProviderId distribution

sns.catplot(x='ProviderId', kind='count', data=data)

There is only one provider of the services.

In [None]:
# Explore ProductId distribution

sns.catplot(x='ProductId', kind='count', data=data)

plt.xticks(rotation=90)

Product_3 was purchased more often than the other products.

In [None]:
# Explore ProductCategory distribution

sns.catplot(x='ProductCategory', kind='count', data=data)

plt.xticks(rotation=90)

The surveyed people mostly accessed airtime services.

In [None]:
# Explore ChannelId distribution

sns.catplot(x='ChannelId', kind='count', data=data)

The people surveyed used Xente Paylater more than any other channel.

In [None]:
# Explore TransactionStatus distribution

sns.catplot(x='TransactionStatus', kind='count', data=data)

where:

    1 - loan accepted

    0 - loan rejected

Most people were granted loans.

In [None]:
# Explore Currency distribution

sns.catplot(x='Currency', kind='count', data=data)

It shows Ugandan Shillings Denominations.

In [None]:
# Explore IsFinalPayBack distribution

sns.catplot(x='IsFinalPayBack', kind='count', data=data)

where:

    1 - have made their last payback installment

    0 - have not made their last payback installment

Most people had made their last payback installment.

In [None]:
# Explore InvestorId distribution

sns.catplot(x='InvestorId', kind='count', data=data)

InvestorId_1 has issued loans to many customers.

### Dealing with Numerical Features

In [None]:
# Explore AmountLoan distribution

plt.figure(figsize=(10, 6))
data.AmountLoan.hist() 
plt.xlabel('AmountLoan')

Most people borrowed between 0 and 0.25e6 (250,000) Ugandan Shillings.

In [None]:
# Explore Value distribution 

plt.figure(figsize=(10, 6))
data.Value.hist() 
plt.xlabel('Value')

Most recorded transaction values were between 0 and 0.25e6 Ugandan Shillings.

In [None]:
# Explore Amount distribution 

plt.figure(figsize=(10, 6))
data.Amount.hist() 
plt.xlabel('Amount')

Most recorded values of transactions with charges were between -0.25e6 and 0 Ugandan Shillings.
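
A histogram with the default ten bins can hide detail when most values sit close to zero, so quantiles give a useful complementary view. The sketch below uses made-up loan amounts (`toy_amounts` is hypothetical, and the values are illustrative only).

```python
import pandas as pd

# Hypothetical loan amounts in UGX, with one very large value
toy_amounts = pd.Series([500, 1000, 1000, 2000, 5000, 10000, 250000], name="AmountLoan")

# Quartiles are far less sensitive to the single large value than the histogram range
quartiles = toy_amounts.quantile([0.25, 0.5, 0.75])
print(quartiles)
```

On the real data the same call could be made on `data['AmountLoan']`, `data['Value']`, or `data['Amount']`.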

### Bivariate Analysis

In [None]:
# IsThirdPartyConfirmed  vs IsDefaulted

plt.figure(figsize=(16, 6))
sns.countplot(x='IsThirdPartyConfirmed', hue='IsDefaulted', data=data)
plt.xticks(
    fontweight='light',
    fontsize='x-large'  
)


where:

    1 - loan order succeeded on the platform

    0 - loan order did not succeed on the platform

Few people defaulted on the agreed payback time after their loan order succeeded on the platform.

In [None]:
#   SubscriptionId vs IsDefaulted

plt.figure(figsize=(16, 6))
sns.countplot(x='SubscriptionId', hue='IsDefaulted', data=data)
plt.xticks(
    fontweight='light',
    fontsize='x-large'
)


SubscriptionId_6 recorded many people who defaulted on the agreed payback time.

In [None]:
#   ProductId vs IsDefaulted

plt.figure(figsize=(16, 6))
sns.countplot(x='ProductId', hue='IsDefaulted', data=data)
plt.xticks(
    rotation=90,
    fontweight='light',
    fontsize='x-large'
)

The people who bought ProductId_18 defaulted on the agreed payback time.

In [None]:
# ProductCategory vs IsDefaulted

plt.figure(figsize=(16, 6))
sns.countplot(x='ProductCategory', hue='IsDefaulted', data=data)
plt.xticks(
    fontweight='light',
    fontsize='x-large'
)

The people who purchased retail products defaulted on the agreed payback time.

In [None]:
#   TransactionStatus vs IsDefaulted

plt.figure(figsize=(16, 6))
sns.countplot(x='TransactionStatus', hue='IsDefaulted', data=data)
plt.xticks(
    fontweight='light',
    fontsize='x-large'
)

Few people who received a loan defaulted on the agreed payback time.

In [None]:
#   IsFinalPayBack vs IsDefaulted

plt.figure(figsize=(16, 6))
sns.countplot(x='IsFinalPayBack', hue='IsDefaulted', data=data)
plt.xticks(
    fontweight='light',
    fontsize='x-large'
)

where:

0 - not the last payment installment

1 - last payment installment

Very few had made their final payment installment.

In [None]:
#   InvestorId vs IsDefaulted

plt.figure(figsize=(16, 6))
sns.countplot(x='InvestorId', hue='IsDefaulted', data=data)
plt.xticks(
    fontweight='light',
    fontsize='x-large'
)

InvestorId_2 recorded the highest number of people who defaulted on the agreed payback time.
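
Count plots compare absolute numbers, so a large group can show many defaults simply because it has many loans. A default rate per group is a complementary view. The sketch below is a minimal illustration on made-up rows (`toy` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical rows: both investors have one default, but different numbers of loans
toy = pd.DataFrame({
    "InvestorId": ["InvestorId_1", "InvestorId_1", "InvestorId_1", "InvestorId_1",
                   "InvestorId_2", "InvestorId_2"],
    "IsDefaulted": [0, 0, 0, 1, 1, 0],
})

# Mean of a 0/1 target per group equals the share of defaults in that group
default_rate = toy.groupby("InvestorId")["IsDefaulted"].mean()
print(default_rate)
```

The same pattern could be applied to `SubscriptionId`, `ProductId`, or `ProductCategory` on the real data.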

## Results

1. Few people paid back their loans within the specified time - True


2. Most of the money they had borrowed was not used for business purposes - True


3. The people who use their loans for non-business activities are unlikely to pay back their loans - True


4. The people who use their loans wisely are able to pay back their loans within the specified time - True


5. The banks are unable to improve individual loans because of failure to meet the rules - True

## Feature Engineering 

### Handling missing values

In [None]:
# mode fill for categorical columns: replace missing values with the most frequent value

categorical_columns = ['IssuedDateLoan', 'Currency', 'LoanId', 'PaidOnDate', 'InvestorId', 'DueDate',
                       'LoanApplicationId', 'PayBackId', 'ThirdPartyId', 'IsThirdPartyConfirmed',
                       'IsDefaulted', 'IsFinalPayBack']

for column in categorical_columns:
    data[column] = data[column].fillna(data[column].value_counts().idxmax())

# fill missing values in each numerical column with that column's mean

numeric_columns = data.select_dtypes(include='number').columns
data[numeric_columns] = data[numeric_columns].fillna(data[numeric_columns].mean())



In [None]:
# checking if missing values have remained

data.isna().sum().sum()
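
The same two imputation strategies (most frequent value for categorical columns, mean for numerical columns) can also be expressed with scikit-learn's `SimpleImputer`. The sketch below is a minimal illustration on a made-up frame (`toy` and its values are hypothetical), not the notebook's own pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with one categorical and one numerical column
toy = pd.DataFrame({
    "InvestorId": ["InvestorId_1", np.nan, "InvestorId_1", "InvestorId_2"],
    "AmountLoan": [1000.0, np.nan, 3000.0, 5000.0],
})

mode_imputer = SimpleImputer(strategy="most_frequent")  # for categorical columns
mean_imputer = SimpleImputer(strategy="mean")           # for numerical columns

# fit_transform returns a 2D array, so flatten it back into a single column
toy["InvestorId"] = mode_imputer.fit_transform(toy[["InvestorId"]]).ravel()
toy["AmountLoan"] = mean_imputer.fit_transform(toy[["AmountLoan"]]).ravel()

print(toy)
```

A fitted imputer can later be reused with `transform` on unseen data, which keeps the fill values consistent between training and prediction.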

In [None]:
data.head()

In [None]:
# dropping features

columns_to_drop = ['ChannelId', 'Currency', 'CurrencyCode', 'CountryCode', 'ProviderId', 'CustomerId',
                   'TransactionId', 'TransactionStartTime', 'BatchId', 'IssuedDateLoan', 'LoanId',
                   'PaidOnDate', 'DueDate', 'PayBackId', 'LoanApplicationId', 'ThirdPartyId']

data = data.drop(columns=columns_to_drop)

In [None]:
data.shape

In [None]:
# convert categorical features to numerical features


categorical_features = ['SubscriptionId','ProductId',
                        'ProductCategory','InvestorId']

# One Hot Encoding conversion
data = pd.get_dummies(data, prefix_sep='_', columns = categorical_features)


#show the shape of the data
data.shape

In [None]:
list(data.columns)

In [None]:
data

In [None]:
# import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# feature scaling using the MinMaxScaler method
scaler = MinMaxScaler(feature_range=(0, 1))

# scale each numerical column to the [0, 1] range (each column is scaled independently)
numerical_features = ['Value', 'Amount', 'AmountLoan']
data[numerical_features] = scaler.fit_transform(data[numerical_features])

#show shape 
data.shape  
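
Min-max scaling maps each value to `(x - min) / (max - min)`. One common variation, sketched below under the assumption that a train/test split already exists, is to learn `min` and `max` from the training rows only and reuse them on the test rows. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training and test values for a single numerical column
train_values = np.array([[0.0], [50.0], [100.0]])
test_values = np.array([[25.0], [150.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train_values)  # learns min=0 and max=100 from the training rows only

scaled_train = scaler.transform(train_values)
scaled_test = scaler.transform(test_values)  # test values may fall outside [0, 1]

print(scaled_train.ravel())
print(scaled_test.ravel())
```

Fitting on the training rows alone keeps information about the test rows out of the preprocessing step.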

In [None]:
#Checking first five rows

data.head()

In [None]:
# Checking the first row

data[:1].values

## Feature Selection

### Univariate Selection

In [None]:
# import packages 
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [None]:
#split dataset into features and target
target = data['IsDefaulted']
features = data.drop('IsDefaulted', axis =1)

target
features

In [None]:
list(data.columns)

In [None]:
#apply SelectKBest class to extract top 20 best features
bestfeatures = SelectKBest(score_func=chi2, k=20)

#train to find best features
fit = bestfeatures.fit(features,target)

#save in the dataframe 
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(features.columns)

#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)

#naming the dataframe columns
featureScores.columns = ['Specs','Score'] 

#print 20 best features 
print(featureScores.nlargest(20,'Score'))  

In [None]:
# fit and transform into the 20 best features
transformer = SelectKBest(chi2, k=20)

#transform from 41 features into top 20 features
top_20_features = transformer.fit_transform(features, target)

#show the shape 
top_20_features.shape 
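
`fit_transform` returns a plain array, so the column names of the selected features are lost. They can be recovered with `get_support()`. The sketch below is a minimal illustration on a made-up frame (`toy_features`, `toy_target`, and `k=2` are hypothetical choices for the example).

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative features and a binary target
toy_features = pd.DataFrame({
    "flag_a": [1, 1, 1, 0, 0, 0],
    "flag_b": [1, 0, 1, 0, 1, 0],
    "flag_c": [0, 0, 1, 1, 1, 1],
})
toy_target = pd.Series([1, 1, 1, 0, 0, 0])

selector = SelectKBest(score_func=chi2, k=2).fit(toy_features, toy_target)

# get_support() returns a boolean mask over the input columns
selected_columns = toy_features.columns[selector.get_support()]
print(list(selected_columns))
```

With the names in hand, a reduced frame can be built with `toy_features[selected_columns]`, which keeps the result readable in later steps.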

### Feature Importance

In [None]:
#import package 
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
#create model for training 
model = ExtraTreesClassifier()
model.fit(features,target)

#use the built-in feature_importances_ attribute of tree-based classifiers
print(model.feature_importances_) 

#plot graph of feature importances for better visualization
feature_importances = pd.Series(model.feature_importances_, index=features.columns)

# show the first 30 important features 

fig= plt.figure(figsize=(25,25))
sns.set(font_scale = 3)
feature_importances.nlargest(30).plot(kind='barh')
plt.show() 

### Correlation Matrix with Heatmap

In [None]:
#get correlations of each features in dataset
plt.figure(figsize=(30,30))

#plot heat map
sns.set(font_scale = 3)
# to show number set annot=True
d = sns.heatmap(data.corr(),annot=False, cmap="RdYlGn")

#save the figure 
figure = d.get_figure()
figure.savefig("heatmap_output.png")

# show the heatmap graph
d   

In [None]:
# SHOW CORRELATION OF DATA TO THE TARGET COLUMN 
features_corr = pd.DataFrame(abs(data.corr()['IsDefaulted']).sort_values(ascending = False)) 

features_corr 

## Machine Learning Model

### Random Forest Classifier

In [None]:
# splitting the dataset

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(features,target,stratify=target,test_size=0.25,random_state=42)
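
The `stratify` argument keeps the class ratio the same in both splits, which matters when one class is rare. The sketch below checks this on made-up arrays (`X` and `y` are hypothetical, with 20% positives).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 4 of them in the positive class
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42
)

# Both splits keep the 20% share of positives
print(y_tr.mean(), y_te.mean())
```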

In [None]:
# importing RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier


In [None]:
# creating a RF Classifier

r_Classifier= RandomForestClassifier(n_estimators = 100)


# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters

r_Classifier.fit(x_train,y_train)

# performing prediction on the test dataset
y_predicts = r_Classifier.predict(x_test)


# metrics are used to evaluate the model
from sklearn import metrics

# ROC AUC is computed from the predicted probability of the positive class (default)
y_scores = r_Classifier.predict_proba(x_test)[:, 1]
print("ROC AUC of the model: ", metrics.roc_auc_score(y_test, y_scores))
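
ROC AUC measures how well scores rank positives above negatives, so it is usually computed from predicted probabilities rather than hard 0/1 labels. The sketch below compares the two on made-up values (`y_true` and `y_prob` are hypothetical, and the 0.5 threshold is an assumption).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted probabilities of default
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.4, 0.6, 0.45, 0.7, 0.9])

# Hard labels at a 0.5 threshold discard the ranking information
y_label = (y_prob >= 0.5).astype(int)

auc_from_prob = roc_auc_score(y_true, y_prob)
auc_from_label = roc_auc_score(y_true, y_label)
print(auc_from_prob, auc_from_label)
```

With a fitted classifier, the probabilities would typically come from `predict_proba(x_test)[:, 1]`.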

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test,y_predicts))

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(r_Classifier, x_test, y_test)
plt.show()

## Since the dataset is imbalanced, we employ SMOTE

In [None]:
# import SMOTE to balance

from imblearn.over_sampling import SMOTE

In [None]:
# Creating an oversampler

oversample = SMOTE()

In [None]:
# fitting and resampling

re_features,re_target = oversample.fit_resample(features,target)

In [None]:
# Split the dataset

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(re_features,re_target,stratify=re_target,test_size=0.25,random_state=42)

# importing RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier
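
An alternative ordering worth considering is to split first and oversample only the training rows, so the test set contains only original observations. The sketch below illustrates that ordering on made-up data. To stay self-contained it uses plain random oversampling with `sklearn.utils.resample` as a stand-in for SMOTE, which is a different (simpler) technique; `X`, `y`, and all sizes are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced data: 80 negatives, 20 positives
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = np.array([0] * 80 + [1] * 20)

# 1. Split first, so the test set keeps the original class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42
)

# 2. Oversample the minority class in the training split only
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)

X_tr_bal = np.vstack([majority, minority_up])
y_tr_bal = np.array([0] * len(majority) + [1] * len(minority_up))

print(y_tr_bal.mean(), y_te.mean())
```

Scores measured on an untouched test set are closer to what the model would see on new loans than scores measured on resampled rows.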


In [None]:
# creating a RF Classifier

r_Classifier= RandomForestClassifier(n_estimators = 100)


# Training the model on the training dataset
# fit function is used to train the model using the training sets as parameters

r_Classifier.fit(x_train,y_train)

# performing prediction on the test dataset
y_predicts = r_Classifier.predict(x_test)


# metrics are used to evaluate the model
from sklearn import metrics

# ROC AUC is computed from the predicted probability of the positive class (default)
y_scores = r_Classifier.predict_proba(x_test)[:, 1]
print("ROC AUC of the model: ", metrics.roc_auc_score(y_test, y_scores))


In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test,y_predicts))

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(r_Classifier, x_test, y_test)
plt.show()

The ROC AUC improved after oversampling: the closer the AUC is to 1, the better the model.
The classes are now balanced.
