# LTFS Top-up Loan Up-sell Prediction

A loan is money received from a financial institution in exchange for future repayment of the principal plus interest. Financial institutions lend to industries, corporates, and individuals, and the interest earned on these loans is one of their main sources of income.

A top-up loan, true to its name, is a facility for availing further funds on an existing loan. If you have a loan that has already been disbursed and is under repayment, and you need more funds, you can simply avail additional funding on the same loan, minimizing the time, effort, and cost of applying again.

LTFS provides loan services to its customers and wants to sell more of its Top-up loan services to existing customers, so it has decided to identify when to pitch a Top-up during the original loan tenure. Correctly identifying the most suitable time to offer a top-up should lead to more disbursals and can also help LTFS beat competing offerings from other institutions.

To understand this behaviour, LTFS has provided customer data indicating whether a particular customer took the Top-up service and, if so, when. This is represented by the target variable Top-up Month.


You are provided with two types of information: 


1. Customer demographics: Along with the target variable and demographic information, the demography table contains variables related to the Frequency of the loan, Tenure of the loan, Disbursal Amount for a loan, and LTV.

2. Bureau data: Bureau data contains the behavioural and transactional attributes of the customers, such as current balance, Loan Amount, and Overdue, for the various tradelines of a given customer.

As a data scientist, you have been tasked by LTFS with building a model: given the Top-up loan bucket of 128655 customers along with demographic and bureau data, predict the right bucket/period for 14745 customers in the test data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from sklearn.preprocessing import OneHotEncoder
from catboost import CatBoostClassifier
from sklearn.metrics import classification_report
import datetime
sns.set()
pd.options.display.max_columns =200

In [2]:
train_data_path = Path(Path.cwd(),'Train','train_Data.xlsx')
train_bureau_path = Path(Path.cwd(),'Train','train_bureau.xlsx')

In [3]:
data = pd.read_excel(train_data_path)
data.head(3)

Unnamed: 0,ID,Frequency,InstlmentMode,LoanStatus,PaymentMode,BranchID,Area,Tenure,AssetCost,AmountFinance,DisbursalAmount,EMI,DisbursalDate,MaturityDAte,AuthDate,AssetID,ManufacturerID,SupplierID,LTV,SEX,AGE,MonthlyIncome,City,State,ZiPCODE,Top-up Month
0,1,Monthly,Arrear,Closed,PDC_E,1,,48,450000,275000.0,275000.0,24000.0,2012-02-10,2016-01-15,2012-02-10,4022465,1568,21946,61.11,M,49.0,35833.33,RAISEN,MADHYA PRADESH,464993.0,> 48 Months
1,2,Monthly,Advance,Closed,PDC,333,BHOPAL,47,485000,350000.0,350000.0,10500.0,2012-03-31,2016-02-15,2012-03-31,4681175,1062,34802,70.0,M,23.0,666.67,SEHORE,MADHYA PRADESH,466001.0,No Top-up Service
2,3,Quatrly,Arrear,Active,Direct Debit,1,,68,690000,519728.0,519728.0,38300.0,2017-06-17,2023-02-10,2017-06-17,25328146,1060,127335,69.77,M,39.0,45257.0,BHOPAL,MADHYA PRADESH,462030.0,12-18 Months


In [4]:
bureau = pd.read_excel(train_bureau_path)
bureau.head(3)

Unnamed: 0,ID,SELF-INDICATOR,MATCH-TYPE,ACCT-TYPE,CONTRIBUTOR-TYPE,DATE-REPORTED,OWNERSHIP-IND,ACCOUNT-STATUS,DISBURSED-DT,CLOSE-DT,LAST-PAYMENT-DATE,CREDIT-LIMIT/SANC AMT,DISBURSED-AMT/HIGH CREDIT,INSTALLMENT-AMT,CURRENT-BAL,INSTALLMENT-FREQUENCY,OVERDUE-AMT,WRITE-OFF-AMT,ASSET_CLASS,REPORTED DATE - HIST,DPD - HIST,CUR BAL - HIST,AMT OVERDUE - HIST,AMT PAID - HIST,TENURE
0,1,False,PRIMARY,Overdraft,NAB,2018-04-30,Individual,Delinquent,2015-10-05,,2018-02-27,,37352,,37873,,37873.0,0.0,Standard,2018043020180331,030000,3787312820,"37873,,",",,",
1,1,False,PRIMARY,Auto Loan (Personal),NAB,2019-12-31,Individual,Active,2018-03-19,,2019-12-19,,44000,"1,405/Monthly",20797,F03,,0.0,Standard,"20191231,20191130,20191031,20190930,20190831,2...",0000000000000000000000000000000000000000000000...,"20797,21988,23174,24341,25504,26648,27780,2891...",",,,,,,,,,,,,,,,,,,,,1452,,",",,,,,,,,,,,,,,,,,,,,,,",36.0
2,1,True,PRIMARY,Tractor Loan,NBF,2020-01-31,Individual,Active,2019-08-30,,NaT,,145000,,116087,,0.0,0.0,,"20200131,20191231,20191130,20191031,20190930,2...",000000000000000000,116087116087145000145000145000145000,000000,",,,,,,",


In [5]:
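def clean_bureau(df):
    # Amount columns are expected to hold strings with thousands separators;
    # strip the commas and cast to float.
    clean_col = ['CREDIT-LIMIT/SANC AMT','DISBURSED-AMT/HIGH CREDIT','CURRENT-BAL','OVERDUE-AMT']
    for col in clean_col:
        df[col] = df[col].str.replace(",","",regex=False).astype(float)
    # Use the Unix epoch as a sentinel for missing disbursal dates so these rows
    # are not dropped when grouping on DISBURSED-DT.
    df['DISBURSED-DT'] = df['DISBURSED-DT'].fillna(datetime.datetime(1970,1,1))
    return df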
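
One thing to watch in the cleaning step: the `.str` accessor returns NaN for any element that is not a string, so a column that mixes text such as "1,405" with plain numbers can silently lose the numeric entries. The sketch below uses made-up values (not taken from the bureau file) to show the effect and one possible workaround; whether the real columns are mixed is an assumption that should be checked against the data.

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing comma-formatted text, a plain integer and a missing value.
mixed = pd.Series(["1,405", 37352, np.nan], dtype=object)

# Same pattern as clean_bureau: the non-string element 37352 becomes NaN.
naive = mixed.str.replace(",", "", regex=False).astype(float)

# Possible workaround: cast to str first, then let to_numeric coerce "nan" back to NaN.
safe = pd.to_numeric(
    mixed.astype(str).str.replace(",", "", regex=False), errors="coerce"
)

print("missing after naive parse:", int(naive.isna().sum()))
print("missing after safe parse: ", int(safe.isna().sum()))
```

If the two counts differ on a real column, the naive parse is discarding values.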

In [6]:
def summarise_bureau(data,data_bureau,train_bureau):
    
    sorted_bureau = data_bureau.sort_values(['ID','DISBURSED-DT'])
    sorted_bureau.drop_duplicates(subset=['ID','DISBURSED-DT'],keep='first',inplace=True)
    
    count_columns = ['SELF-INDICATOR','MATCH-TYPE','OWNERSHIP-IND','ACCOUNT-STATUS']
    select_col = ['ID','DISBURSED-DT','SELF-INDICATOR','MATCH-TYPE','ACCT-TYPE','CONTRIBUTOR-TYPE','OWNERSHIP-IND','ACCOUNT-STATUS',
                 'ASSET_CLASS','CURRENT-BAL','OVERDUE-AMT','WRITE-OFF-AMT']
    
    enc = OneHotEncoder(handle_unknown='ignore')
    enc.fit(train_bureau[count_columns])
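    # Name each one-hot column hist_<category>. The names are built from enc.categories_
    # because OneHotEncoder.get_feature_names() was removed in scikit-learn 1.2.
    hist = pd.DataFrame(enc.transform(sorted_bureau[count_columns]).toarray(),
                        columns=[f"hist_{cat}" for cats in enc.categories_ for cat in cats])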
        
    hist_features = hist.columns
    
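    # hist has a fresh 0..n-1 index while sorted_bureau keeps its original row labels,
    # so assign by position rather than letting pandas align on the index.
    hist['ID'] = sorted_bureau['ID'].to_numpy()
    hist['DISBURSED-DT'] = sorted_bureau['DISBURSED-DT'].to_numpy()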
    
    hist_summary = hist.groupby(['ID','DISBURSED-DT']).sum().groupby(level=0).shift().groupby(level=0).cumsum().reset_index().fillna(0)
    loan_summary = hist.groupby(['ID','DISBURSED-DT']).size().groupby(level=0).shift().groupby(level=0).cumsum().reset_index().fillna(0)
    loan_summary.rename({0:'hist_loan'},axis=1,inplace=True)
    
    bureau_merge = sorted_bureau[select_col].merge(hist_summary,on=['ID','DISBURSED-DT'],how='left')
    bureau_merge = bureau_merge.merge(loan_summary,on=['ID','DISBURSED-DT'],how='left')
    
    data['DISBURSED-DT'] = data['DisbursalDate']
    bureau_merge = data.merge(bureau_merge,on=['ID','DISBURSED-DT'],how='left')
        
    return bureau_merge
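
The `groupby(...).shift().groupby(...).cumsum()` chain in `summarise_bureau` is what turns per-loan indicators into "history before this loan" counts: the shift drops the current loan from its own total, and the cumulative sum accumulates everything earlier for the same customer. A minimal sketch on a toy frame follows; the `is_active` column and all values are hypothetical.

```python
import pandas as pd

# Toy bureau extract: two customers, loans ordered by disbursal date.
toy = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "DISBURSED-DT": pd.to_datetime(
        ["2015-01-01", "2016-06-01", "2018-03-01", "2017-02-01", "2019-09-01"]
    ),
    "is_active": [0, 1, 1, 1, 0],
})

# One value per (customer, loan), indexed by ID and date.
per_loan = toy.groupby(["ID", "DISBURSED-DT"])["is_active"].sum()

prior_active = (
    per_loan.groupby(level=0).shift()    # exclude the current loan
            .groupby(level=0).cumsum()   # running total of earlier loans per customer
            .fillna(0)                   # first loan of each customer has no history
)
print(prior_active.reset_index())
```

For customer 1 the third loan sees one earlier active loan, while each customer's first loan sees zero, which mirrors the all-zero `hist_*` columns in rows with no earlier tradelines.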

In [7]:
bureau = clean_bureau(bureau)
bureau_merge = summarise_bureau(data,bureau,bureau)
bureau_merge.head(3)

Unnamed: 0,ID,Frequency,InstlmentMode,LoanStatus,PaymentMode,BranchID,Area,Tenure,AssetCost,AmountFinance,DisbursalAmount,EMI,DisbursalDate,MaturityDAte,AuthDate,AssetID,ManufacturerID,SupplierID,LTV,SEX,AGE,MonthlyIncome,City,State,ZiPCODE,Top-up Month,DISBURSED-DT,SELF-INDICATOR,MATCH-TYPE,ACCT-TYPE,CONTRIBUTOR-TYPE,OWNERSHIP-IND,ACCOUNT-STATUS,ASSET_CLASS,CURRENT-BAL,OVERDUE-AMT,WRITE-OFF-AMT,hist_False,hist_True,hist_PRIMARY,hist_SECONDARY,hist_Guarantor,hist_Individual,hist_Joint,hist_Primary,hist_Supl Card Holder,hist_Active,hist_Cancelled,hist_Closed,hist_Delinquent,hist_Restructured,hist_SUIT FILED (WILFUL DEFAULT),hist_Settled,hist_Sold/Purchased,hist_Suit Filed,hist_WILFUL DEFAULT,hist_Written Off,hist_loan
0,1,Monthly,Arrear,Closed,PDC_E,1,,48,450000,275000.0,275000.0,24000.0,2012-02-10,2016-01-15,2012-02-10,4022465,1568,21946,61.11,M,49.0,35833.33,RAISEN,MADHYA PRADESH,464993.0,> 48 Months,2012-02-10,True,PRIMARY,Tractor Loan,NBF,Individual,Closed,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Monthly,Advance,Closed,PDC,333,BHOPAL,47,485000,350000.0,350000.0,10500.0,2012-03-31,2016-02-15,2012-03-31,4681175,1062,34802,70.0,M,23.0,666.67,SEHORE,MADHYA PRADESH,466001.0,No Top-up Service,2012-03-31,True,PRIMARY,Tractor Loan,NBF,Individual,Closed,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Quatrly,Arrear,Active,Direct Debit,1,,68,690000,519728.0,519728.0,38300.0,2017-06-17,2023-02-10,2017-06-17,25328146,1060,127335,69.77,M,39.0,45257.0,BHOPAL,MADHYA PRADESH,462030.0,12-18 Months,2017-06-17,True,PRIMARY,Tractor Loan,NBF,Individual,Active,,37637.0,0.0,0.0,7.0,1.0,8.0,0.0,0.0,7.0,1.0,0.0,0.0,2.0,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0


In [1]:
def compute_features(df):
    df['disburse_diff']= (df['AmountFinance'] != df['DisbursalAmount']).astype(int)
    df['per_loan'] = df['DisbursalAmount']/df['AssetCost']
    df['total_int'] = (df['Tenure'] * df['EMI']) - df['DisbursalAmount']
    df['emi_salary_per'] = df['EMI']/df['MonthlyIncome']
    df['asset_value'] = df['DisbursalAmount']/ df['LTV']
    df['emi_asset'] = df['EMI']/df['asset_value']
    return df
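
As a quick worked check of these ratios, the sketch below applies three of them to the first training row shown earlier (ID 1). For that row `DisbursalAmount / AssetCost` times 100 comes out very close to the reported LTV of 61.11, which suggests LTV is stored as a percentage; that reading is an inference from one row, not something the data description states. Note also that `Tenure * EMI` treats every tenure unit as one instalment, which is an assumption when `Frequency` is not monthly.

```python
import pandas as pd

# Values copied from the first row of the training table printed above.
row = pd.DataFrame({
    "AmountFinance": [275000.0],
    "DisbursalAmount": [275000.0],
    "AssetCost": [450000],
    "Tenure": [48],
    "EMI": [24000.0],
    "MonthlyIncome": [35833.33],
    "LTV": [61.11],
})

per_loan = row["DisbursalAmount"] / row["AssetCost"]             # share of asset cost financed
total_int = row["Tenure"] * row["EMI"] - row["DisbursalAmount"]  # total repaid minus principal
emi_salary_per = row["EMI"] / row["MonthlyIncome"]               # instalment relative to income

print(round(per_loan.iloc[0] * 100, 2), row["LTV"].iloc[0])
print(total_int.iloc[0], round(emi_salary_per.iloc[0], 3))
```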

In [8]:
bureau_merge.shape,data.shape

((128655, 58), (128655, 27))

In [9]:
features = ['Frequency', 'InstlmentMode', 'LoanStatus', 'PaymentMode','Tenure', 'AssetCost', 'AmountFinance',
            'AssetID', 'ManufacturerID', 'SupplierID', 'LTV', 'SEX', 'AGE','MonthlyIncome', 'City', 'State',
            'SELF-INDICATOR', 'MATCH-TYPE', 'ACCT-TYPE','CONTRIBUTOR-TYPE', 'OWNERSHIP-IND', 'ACCOUNT-STATUS', 
            'ASSET_CLASS','CURRENT-BAL', 'OVERDUE-AMT', 'WRITE-OFF-AMT', 'hist_False', 'hist_True', 
            'hist_PRIMARY', 'hist_SECONDARY', 'hist_Guarantor','hist_Individual', 'hist_Joint', 'hist_Primary',
            'hist_Supl Card Holder', 'hist_Active', 'hist_Cancelled', 'hist_Closed','hist_Delinquent', 
            'hist_Restructured','hist_SUIT FILED (WILFUL DEFAULT)', 'hist_Settled','hist_Sold/Purchased', 
            'hist_Suit Filed', 'hist_WILFUL DEFAULT','hist_Written Off', 'hist_loan']

x_train = bureau_merge[features].copy()
y_train = bureau_merge['Top-up Month']
categorical_var = x_train.columns[x_train.dtypes != np.float64].tolist()  # np.float was removed in NumPy 1.24
x_train[categorical_var] = x_train[categorical_var].fillna('na')
x_train.sample(3)

Unnamed: 0,Frequency,InstlmentMode,LoanStatus,PaymentMode,Tenure,AssetCost,AmountFinance,AssetID,ManufacturerID,SupplierID,LTV,SEX,AGE,MonthlyIncome,City,State,SELF-INDICATOR,MATCH-TYPE,ACCT-TYPE,CONTRIBUTOR-TYPE,OWNERSHIP-IND,ACCOUNT-STATUS,ASSET_CLASS,CURRENT-BAL,OVERDUE-AMT,WRITE-OFF-AMT,hist_False,hist_True,hist_PRIMARY,hist_SECONDARY,hist_Guarantor,hist_Individual,hist_Joint,hist_Primary,hist_Supl Card Holder,hist_Active,hist_Cancelled,hist_Closed,hist_Delinquent,hist_Restructured,hist_SUIT FILED (WILFUL DEFAULT),hist_Settled,hist_Sold/Purchased,hist_Suit Filed,hist_WILFUL DEFAULT,hist_Written Off,hist_loan
14904,Half Yearly,Arrear,Closed,PDC_E,24,560000,301797.0,12969359,1060,23173,37.73,M,45.0,25000.0,TONK,RAJASTHAN,True,PRIMARY,Tractor Loan,NBF,Individual,Closed,na,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24243,Half Yearly,Arrear,Closed,Billed,60,550000,539000.0,15882285,1046,60896,98.0,M,54.0,33333.33,KARIMNAGAR,ANDHRA PRADESH,True,PRIMARY,Tractor Loan,NBF,Individual,Active,na,78339.0,0.0,0.0,2.0,0.0,2.0,0.0,0.0,2.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
24927,Monthly,Arrear,Closed,Billed,36,745295,350000.0,2530757,1187,25749,46.96,M,37.0,62500.0,CHAMRAJNAGAR,KARNATAKA,True,PRIMARY,Tractor Loan,NBF,Individual,Closed,na,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [16]:
model = CatBoostClassifier(iterations=100,auto_class_weights="Balanced",custom_metric='F1',depth=7,rsm=1.0,
                           eval_metric='AUC',random_seed=424)
model.fit(x_train,y_train,cat_features = categorical_var)

Learning rate set to 0.5
0:	total: 2.42s	remaining: 4m
1:	total: 3.43s	remaining: 2m 48s
2:	total: 6.05s	remaining: 3m 15s
3:	total: 8.62s	remaining: 3m 26s
4:	total: 11.6s	remaining: 3m 40s
5:	total: 15.1s	remaining: 3m 57s
6:	total: 16.8s	remaining: 3m 43s
7:	total: 20.7s	remaining: 3m 57s
8:	total: 23.7s	remaining: 3m 59s
9:	total: 27.3s	remaining: 4m 5s
10:	total: 30.7s	remaining: 4m 8s
11:	total: 32.7s	remaining: 3m 59s
12:	total: 34.1s	remaining: 3m 48s
13:	total: 36s	remaining: 3m 41s
14:	total: 39.6s	remaining: 3m 44s
15:	total: 40.6s	remaining: 3m 33s
16:	total: 41.6s	remaining: 3m 22s
17:	total: 45s	remaining: 3m 25s
18:	total: 46.1s	remaining: 3m 16s
19:	total: 51.3s	remaining: 3m 25s
20:	total: 54.9s	remaining: 3m 26s
21:	total: 59.1s	remaining: 3m 29s
22:	total: 1m 3s	remaining: 3m 31s
23:	total: 1m 7s	remaining: 3m 32s
24:	total: 1m 11s	remaining: 3m 35s
25:	total: 1m 15s	remaining: 3m 35s
26:	total: 1m 20s	remaining: 3m 36s
27:	total: 1m 24s	remaining: 3m 36s
28:	total: 

<catboost.core.CatBoostClassifier at 0x244e7b6dbe0>

In [11]:
train_predict = model.predict(x_train)
train_predict = np.squeeze(train_predict)
print(classification_report(y_train, train_predict))

                   precision    recall  f1-score   support

      > 48 Months       0.16      0.65      0.25      8366
     12-18 Months       0.06      0.59      0.11      1034
     18-24 Months       0.12      0.57      0.20      2368
     24-30 Months       0.16      0.44      0.24      3492
     30-36 Months       0.12      0.43      0.19      3062
     36-48 Months       0.09      0.40      0.15      3656
No Top-up Service       0.94      0.33      0.48    106677

         accuracy                           0.36    128655
        macro avg       0.24      0.49      0.23    128655
     weighted avg       0.81      0.36      0.44    128655
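
The report above is computed on the same rows the model was fitted on, so it says little about performance on unseen customers. One common option is a stratified hold-out split, sketched below on synthetic data. `DummyClassifier` is only a stand-in so the example runs without CatBoost, and the class labels and proportions are invented for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced three-class problem (not the LTFS data).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = rng.choice(
    ["No Top-up Service", "12-18 Months", "> 48 Months"],
    size=1000,
    p=[0.80, 0.05, 0.15],
)

# stratify=y keeps the class mix roughly equal in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stand-in model; in the notebook this would be the CatBoostClassifier.
clf = DummyClassifier(strategy="most_frequent").fit(X_fit, y_fit)

# Macro F1 weights every class equally, so rare buckets count as much as the majority.
macro_f1 = f1_score(y_val, clf.predict(X_val), average="macro", zero_division=0)
print(f"hold-out macro F1: {macro_f1:.3f}")
```

Scoring the held-out part with the fitted model gives an estimate that is not inflated by rows seen during training, and the majority-class baseline gives a floor to compare against.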



In [12]:
test_data_path = Path(Path.cwd(),'Test','test_Data.xlsx')
test_bureau_path = Path(Path.cwd(),'Test','test_bureau.xlsx')

test_data = pd.read_excel(test_data_path)
test_bureau = pd.read_excel(test_bureau_path)

test_bureau = clean_bureau(test_bureau)
test_bureau_merge = summarise_bureau(test_data,test_bureau,bureau)

x_test = test_bureau_merge[features].copy()
x_test[categorical_var] = x_test[categorical_var].fillna('na')

In [13]:
test_data.shape

(14745, 26)

In [14]:
x_train.shape,data.shape

((128655, 47), (128655, 27))

In [15]:
test_predict = model.predict(x_test)
test_predict = np.squeeze(test_predict)
test_data['Top-up Month'] = test_predict
test_data[['ID','Top-up Month']].to_csv('submit_1.csv',index=False)