#  LTFS Top-up loan Up-sell prediction

A loan is when you receive the money from a financial institution in exchange for future repayment of the principal, plus interest. Financial institutions provide loans to the industries, corporates and individuals. The interest received on these loans is one among the main sources of income for the financial institutions.

A top-up loan, true to its name, is a facility of availing further funds on an existing loan. When you have a loan that has already been disbursed and under repayment and if you need more funds then, you can simply avail additional funding on the same loan thereby minimizing time, effort and cost related to applying again.

LTFS provides it’s loan services to its customers and is interested in selling more of its Top-up loan services to its existing customers so they have decided to identify when to pitch a Top-up during the original loan tenure.  If they correctly identify the most suitable time to offer a top-up, this will ultimately lead to more disbursals and can also help them beat competing offerings from other institutions.

To understand this behaviour, LTFS has provided data for its customers containing the information whether that particular customer took the Top-up service and when he took such Top-up service, represented by the target variable Top-up Month.


You are provided with two types of information: 


1. Customer’s Demographics: The demography table along with the target variable & demographic information contains variables related to Frequency of the loan, Tenure of the loan, Disbursal Amount for a loan & LTV.

2. Bureau data:  Bureau data contains the behavioural and transactional attributes of the customers like current balance, Loan Amount, Overdue etc. for various tradelines of a given customer

As a data scientist, LTFS  has tasked you with building a model given the Top-up loan bucket of 128655 customers along with demographic and bureau data, predict the right bucket/period for 14745 customers in the test data.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
from sklearn.preprocessing import OneHotEncoder
# explicitly require this experimental feature
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
# now you can import normally from ensemble
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.utils import class_weight
from xgboost import plot_importance
import datetime
sns.set()
pd.options.display.max_columns =200

In [2]:
train_data_path = Path(Path.cwd(),'Train','train_Data.xlsx')
train_bureau_path = Path(Path.cwd(),'Train','train_bureau.xlsx')

In [3]:
data = pd.read_excel(train_data_path)
data.head(3)

Unnamed: 0,ID,Frequency,InstlmentMode,LoanStatus,PaymentMode,BranchID,Area,Tenure,AssetCost,AmountFinance,DisbursalAmount,EMI,DisbursalDate,MaturityDAte,AuthDate,AssetID,ManufacturerID,SupplierID,LTV,SEX,AGE,MonthlyIncome,City,State,ZiPCODE,Top-up Month
0,1,Monthly,Arrear,Closed,PDC_E,1,,48,450000,275000.0,275000.0,24000.0,2012-02-10,2016-01-15,2012-02-10,4022465,1568,21946,61.11,M,49.0,35833.33,RAISEN,MADHYA PRADESH,464993.0,> 48 Months
1,2,Monthly,Advance,Closed,PDC,333,BHOPAL,47,485000,350000.0,350000.0,10500.0,2012-03-31,2016-02-15,2012-03-31,4681175,1062,34802,70.0,M,23.0,666.67,SEHORE,MADHYA PRADESH,466001.0,No Top-up Service
2,3,Quatrly,Arrear,Active,Direct Debit,1,,68,690000,519728.0,519728.0,38300.0,2017-06-17,2023-02-10,2017-06-17,25328146,1060,127335,69.77,M,39.0,45257.0,BHOPAL,MADHYA PRADESH,462030.0,12-18 Months


In [4]:
bureau = pd.read_excel(train_bureau_path)
bureau.head(3)

Unnamed: 0,ID,SELF-INDICATOR,MATCH-TYPE,ACCT-TYPE,CONTRIBUTOR-TYPE,DATE-REPORTED,OWNERSHIP-IND,ACCOUNT-STATUS,DISBURSED-DT,CLOSE-DT,LAST-PAYMENT-DATE,CREDIT-LIMIT/SANC AMT,DISBURSED-AMT/HIGH CREDIT,INSTALLMENT-AMT,CURRENT-BAL,INSTALLMENT-FREQUENCY,OVERDUE-AMT,WRITE-OFF-AMT,ASSET_CLASS,REPORTED DATE - HIST,DPD - HIST,CUR BAL - HIST,AMT OVERDUE - HIST,AMT PAID - HIST,TENURE
0,1,False,PRIMARY,Overdraft,NAB,2018-04-30,Individual,Delinquent,2015-10-05,,2018-02-27,,37352,,37873,,37873.0,0.0,Standard,2018043020180331,030000,3787312820,"37873,,",",,",
1,1,False,PRIMARY,Auto Loan (Personal),NAB,2019-12-31,Individual,Active,2018-03-19,,2019-12-19,,44000,"1,405/Monthly",20797,F03,,0.0,Standard,"20191231,20191130,20191031,20190930,20190831,2...",0000000000000000000000000000000000000000000000...,"20797,21988,23174,24341,25504,26648,27780,2891...",",,,,,,,,,,,,,,,,,,,,1452,,",",,,,,,,,,,,,,,,,,,,,,,",36.0
2,1,True,PRIMARY,Tractor Loan,NBF,2020-01-31,Individual,Active,2019-08-30,,NaT,,145000,,116087,,0.0,0.0,,"20200131,20191231,20191130,20191031,20190930,2...",000000000000000000,116087116087145000145000145000145000,000000,",,,,,,",


In [5]:
copy_bureau = bureau.copy()

In [6]:
def dpd_hist(x):
    out = []
    for i in range(0,len(x),3):
        try:
            out.append(float(x[i:i + 3]))
        except:
               pass
    if out:
        ret = np.mean(out)
    else:
        ret = 0
    return ret


def clean_bureau(df):
    clean_col = ['CREDIT-LIMIT/SANC AMT','DISBURSED-AMT/HIGH CREDIT','CURRENT-BAL','OVERDUE-AMT']
    for col in clean_col:
        df[col] = df[col].str.replace(",","").astype(float).fillna(0)
    
    df['WRITE-OFF-AMT'] = df['WRITE-OFF-AMT'].fillna(0)
    df['DISBURSED-DT'] = df['DISBURSED-DT'].fillna(datetime.datetime(1970,1,1))
    df['instal_freq'] = df['INSTALLMENT-AMT'].fillna('0').apply(lambda x: x.split('/')[-1] if '/' in x else x)
    df['instal_amt'] =  df['INSTALLMENT-AMT'].fillna('0').apply(lambda x: x.split('/')[0].replace(',','') if '/' in x else x.replace(',',''))
    df['instal_amt'] = df['instal_amt'].astype(float)
    df['n_report'] = df['REPORTED DATE - HIST'].fillna('').apply(lambda x: len(x.split(',')))
    df['dpd_hist'] = df['DPD - HIST'].fillna('0').apply(dpd_hist)
    df['curr_bal_hist'] = df['CUR BAL - HIST'].fillna('0').apply(lambda x: np.mean([float(i) if i != '' else 0 for i in x.split(',')]))
    df['amt_overdue_hist'] =  df['AMT OVERDUE - HIST'].fillna('0').apply(lambda x: np.mean([float(i) if i != '' else 0 for i in x.split(',')]))
    df['amt_paid_hist'] = df['AMT PAID - HIST'].fillna('0').apply(lambda x: np.mean([float(i) if i != '' else 0 for i in x.split(',')]))
    df['TENURE'] = df['TENURE'].fillna(0)
    df['INSTALLMENT-FREQUENCY'] = df['INSTALLMENT-FREQUENCY'].fillna('na')
    df['ASSET_CLASS'] = df['ASSET_CLASS'].fillna('na')
    df['credit_dt'] = (df['DATE-REPORTED']-df['DISBURSED-DT']).dt.days.fillna(0)
    df['closed_dt'] = (df['DATE-REPORTED']-pd.to_datetime(df['CLOSE-DT'],errors='coerce')).dt.days.fillna(0)
    df['pay_dt'] = (df['LAST-PAYMENT-DATE'] - df['DISBURSED-DT']).dt.days.fillna(0)
    df['bal_due_hist'] = df['amt_overdue_hist']/df['curr_bal_hist']
    
    df['bal_credit'] =   df['CURRENT-BAL']/df['DISBURSED-AMT/HIGH CREDIT']
    df['credit_dis'] =   df['CREDIT-LIMIT/SANC AMT']/df['DISBURSED-AMT/HIGH CREDIT']
    df['credit_bal'] =   df['CREDIT-LIMIT/SANC AMT']/df['CURRENT-BAL']
    df['over_bal']  =    df['OVERDUE-AMT']/df['CURRENT-BAL']
    df['over_credit'] = df['OVERDUE-AMT']/df['DISBURSED-AMT/HIGH CREDIT']
    df['write_credit'] = df['WRITE-OFF-AMT']/df['DISBURSED-AMT/HIGH CREDIT']
    df['write_bal']  =   df['WRITE-OFF-AMT']/df['CURRENT-BAL']
    df['instal_credit'] =    df['instal_amt']/df['DISBURSED-AMT/HIGH CREDIT']
    df['instal_bal'] =    df['instal_amt']/df['CURRENT-BAL']
    
    return df

In [7]:
def compute_features(df):
    df['disburse_diff']= (df['AmountFinance'] != df['DisbursalAmount']).astype(int)
    df['per_loan'] = df['DisbursalAmount']/df['AssetCost']
    df['total_int']= (df['Tenure']* df['EMI']) - df['DisbursalAmount']
    df['emi_salary_per'] = df['EMI']/df['MonthlyIncome']
    df['asset_value'] = df['DisbursalAmount']/ df['LTV']
    df['emi_asset'] = df['EMI']/df['asset_value']
    
    df['curr_post_TEN'] = df['post_TENURE'] - df['curr_TENURE']
    df['curr_post_BAL'] = df['post_curr_bal_hist'] - df['post_curr_bal_hist']
    df['curr_post_rep'] = df['post_n_report'] - df['curr_n_report']

    return df

def hot_encode(data,cat_col,enc):
    data = data.copy()

    df= pd.DataFrame(enc.transform(data[cat_col]).toarray(),columns=enc.get_feature_names())
    
    for idx,col in enumerate(cat_col):
        df.columns = df.columns.str.replace(f"x{idx}",col)
    
    for col in df.columns:
        data[col] = df[col]
    
    data.drop(cat_col,axis=1,inplace=True)
    
    return data

In [8]:
def merge_bureau(data,data_bureau,cat_columns,num_col,enc):
    data = data.copy()
    
    pre = pd.DataFrame(enc.transform(data_bureau[cat_columns]).toarray(),columns=enc.get_feature_names())
    for i in range(len(cat_columns)):
            pre.columns = pre.columns.str.replace(f"x{i}",f"pre_{cat_columns[i]}")


    pre['ID'] = data_bureau['ID']
    pre['DISBURSED-DT'] = data_bureau['DISBURSED-DT']
    pre_cat = pre.groupby(['ID','DISBURSED-DT']).mean().groupby(level=0).shift().reset_index().fillna(0)

    pre_num = data_bureau.groupby(['ID','DISBURSED-DT'])[num_col].mean().groupby(level=0).shift().reset_index().fillna(0)
    pre_num = pre_num.add_prefix('pre_')
    pre_num = pre_num.rename({'pre_ID':'ID','pre_DISBURSED-DT':'DISBURSED-DT'},axis=1)

    pre_summary = pd.merge(pre_cat,pre_num,how='outer',on=['ID','DISBURSED-DT'])

    max_dt = pre_summary.groupby(['ID'])['DISBURSED-DT'].max().reset_index()
    post_summary = max_dt.merge(pre_summary,how='left',on=['ID','DISBURSED-DT'])
    post_summary.columns = post_summary.columns.str.replace('pre','post')
    post_summary.drop(['DISBURSED-DT'],axis=1,inplace=True)


    curr_cat = pre.groupby(['ID','DISBURSED-DT']).mean().reset_index().fillna(0)
    curr_cat.columns = curr_cat.columns.str.replace('pre','curr')

    curr_num = data_bureau.groupby(['ID','DISBURSED-DT'])[num_col].mean().reset_index().fillna(0)
    curr_num = curr_num.add_prefix('curr_')
    curr_num = curr_num.rename({'curr_ID':'ID','curr_DISBURSED-DT':'DISBURSED-DT'},axis=1)

    curr_summary = pd.merge(curr_cat,curr_num,how='outer',on=['ID','DISBURSED-DT'])
    
    
    
    overall = pre.groupby(['ID']).mean().reset_index().fillna(0)
    overall.columns = overall.columns.str.replace('pre','overall')

    overall_num = data_bureau.groupby(['ID'])[num_col].mean().reset_index().fillna(0)
    overall_num = overall_num.add_prefix('overall_')
    overall_num = overall_num.rename({'overall_ID':'ID'},axis=1)

    overall_summary = pd.merge(overall,overall_num,how='outer',on=['ID'])
    
    data['DISBURSED-DT'] = data['DisbursalDate']
    data = data.merge(pre_summary,how='left',on=['ID','DISBURSED-DT'])
    data=data.merge(curr_summary,how='left',on=['ID','DISBURSED-DT'])
    data=data.merge(post_summary,how='left',on=['ID'])
    data=data.merge(overall_summary,how='left',on=['ID'])
    
    data['mat_days'] = (data['MaturityDAte']-data['DisbursalDate']).dt.days.fillna(0)
    data['mat_month'] = data['MaturityDAte'].dt.month.fillna(0)
    data['db_month'] = data['DisbursalDate'].dt.month.fillna(0)

    return data

In [9]:
bureau = copy_bureau.copy()

In [10]:
cat_columns = ['SELF-INDICATOR','MATCH-TYPE','OWNERSHIP-IND','ACCOUNT-STATUS','INSTALLMENT-FREQUENCY','ASSET_CLASS']

num_col = ['CREDIT-LIMIT/SANC AMT','DISBURSED-AMT/HIGH CREDIT','CURRENT-BAL','OVERDUE-AMT','WRITE-OFF-AMT',
           'instal_amt','n_report','dpd_hist','curr_bal_hist','amt_overdue_hist','amt_paid_hist','TENURE',
          'credit_dt','closed_dt','pay_dt','bal_due_hist','bal_credit','credit_dis','credit_bal','over_bal',
          'over_credit','write_credit','write_bal','instal_credit','instal_bal']

bureau = clean_bureau(bureau)
bureau[cat_columns] = bureau[cat_columns].fillna('na')

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(bureau[cat_columns])

merged_data = merge_bureau(data,bureau,cat_columns,num_col,enc)

In [11]:
merged_data.shape,data.shape

((128655, 286), (128655, 26))

In [12]:
merged_data.head(1)

Unnamed: 0,ID,Frequency,InstlmentMode,LoanStatus,PaymentMode,BranchID,Area,Tenure,AssetCost,AmountFinance,DisbursalAmount,EMI,DisbursalDate,MaturityDAte,AuthDate,AssetID,ManufacturerID,SupplierID,LTV,SEX,AGE,MonthlyIncome,City,State,ZiPCODE,Top-up Month,DISBURSED-DT,pre_SELF-INDICATOR_False,pre_SELF-INDICATOR_True,pre_MATCH-TYPE_PRIMARY,pre_MATCH-TYPE_SECONDARY,pre_OWNERSHIP-IND_Guarantor,pre_OWNERSHIP-IND_Individual,pre_OWNERSHIP-IND_Joint,pre_OWNERSHIP-IND_Primary,pre_OWNERSHIP-IND_Supl Card Holder,pre_ACCOUNT-STATUS_Active,pre_ACCOUNT-STATUS_Cancelled,pre_ACCOUNT-STATUS_Closed,pre_ACCOUNT-STATUS_Delinquent,pre_ACCOUNT-STATUS_Restructured,pre_ACCOUNT-STATUS_SUIT FILED (WILFUL DEFAULT),pre_ACCOUNT-STATUS_Settled,pre_ACCOUNT-STATUS_Sold/Purchased,pre_ACCOUNT-STATUS_Suit Filed,pre_ACCOUNT-STATUS_WILFUL DEFAULT,pre_ACCOUNT-STATUS_Written Off,pre_INSTALLMENT-FREQUENCY_F01,pre_INSTALLMENT-FREQUENCY_F02,pre_INSTALLMENT-FREQUENCY_F03,pre_INSTALLMENT-FREQUENCY_F04,pre_INSTALLMENT-FREQUENCY_F05,pre_INSTALLMENT-FREQUENCY_F06,pre_INSTALLMENT-FREQUENCY_F07,pre_INSTALLMENT-FREQUENCY_F08,pre_INSTALLMENT-FREQUENCY_F10,pre_INSTALLMENT-FREQUENCY_na,pre_ASSET_CLASS_01,pre_ASSET_CLASS_1,pre_ASSET_CLASS_2,pre_ASSET_CLASS_Doubtful,pre_ASSET_CLASS_Loss,pre_ASSET_CLASS_Special Mention Account,pre_ASSET_CLASS_Standard,pre_ASSET_CLASS_SubStandard,pre_ASSET_CLASS_na,pre_CREDIT-LIMIT/SANC AMT,pre_DISBURSED-AMT/HIGH CREDIT,pre_CURRENT-BAL,pre_OVERDUE-AMT,pre_WRITE-OFF-AMT,pre_instal_amt,pre_n_report,pre_dpd_hist,pre_curr_bal_hist,pre_amt_overdue_hist,pre_amt_paid_hist,pre_TENURE,pre_credit_dt,pre_closed_dt,pre_pay_dt,pre_bal_due_hist,pre_bal_credit,pre_credit_dis,pre_credit_bal,pre_over_bal,pre_over_credit,pre_write_credit,pre_write_bal,pre_instal_credit,pre_instal_bal,curr_SELF-INDICATOR_False,curr_SELF-INDICATOR_True,curr_MATCH-TYPE_PRIMARY,curr_MATCH-TYPE_SECONDARY,curr_OWNERSHIP-IND_Guarantor,curr_OWNERSHIP-IND_Individual,curr_OWNERSHIP-IND_Joint,curr_OWNERSHIP-IND_Primary,curr_OWNERSHIP-IND_Supl Card Holder,...,post_ASSET_CLASS_1,post_ASSET_CLASS_2,post_ASSET_CLASS_Doubtful,post_ASSET_CLASS_Loss,post_ASSET_CLASS_Special Mention Account,post_ASSET_CLASS_Standard,post_ASSET_CLASS_SubStandard,post_ASSET_CLASS_na,post_CREDIT-LIMIT/SANC AMT,post_DISBURSED-AMT/HIGH CREDIT,post_CURRENT-BAL,post_OVERDUE-AMT,post_WRITE-OFF-AMT,post_instal_amt,post_n_report,post_dpd_hist,post_curr_bal_hist,post_amt_overdue_hist,post_amt_paid_hist,post_TENURE,post_credit_dt,post_closed_dt,post_pay_dt,post_bal_due_hist,post_bal_credit,post_credit_dis,post_credit_bal,post_over_bal,post_over_credit,post_write_credit,post_write_bal,post_instal_credit,post_instal_bal,overall_SELF-INDICATOR_False,overall_SELF-INDICATOR_True,overall_MATCH-TYPE_PRIMARY,overall_MATCH-TYPE_SECONDARY,overall_OWNERSHIP-IND_Guarantor,overall_OWNERSHIP-IND_Individual,overall_OWNERSHIP-IND_Joint,overall_OWNERSHIP-IND_Primary,overall_OWNERSHIP-IND_Supl Card Holder,overall_ACCOUNT-STATUS_Active,overall_ACCOUNT-STATUS_Cancelled,overall_ACCOUNT-STATUS_Closed,overall_ACCOUNT-STATUS_Delinquent,overall_ACCOUNT-STATUS_Restructured,overall_ACCOUNT-STATUS_SUIT FILED (WILFUL DEFAULT),overall_ACCOUNT-STATUS_Settled,overall_ACCOUNT-STATUS_Sold/Purchased,overall_ACCOUNT-STATUS_Suit Filed,overall_ACCOUNT-STATUS_WILFUL DEFAULT,overall_ACCOUNT-STATUS_Written Off,overall_INSTALLMENT-FREQUENCY_F01,overall_INSTALLMENT-FREQUENCY_F02,overall_INSTALLMENT-FREQUENCY_F03,overall_INSTALLMENT-FREQUENCY_F04,overall_INSTALLMENT-FREQUENCY_F05,overall_INSTALLMENT-FREQUENCY_F06,overall_INSTALLMENT-FREQUENCY_F07,overall_INSTALLMENT-FREQUENCY_F08,overall_INSTALLMENT-FREQUENCY_F10,overall_INSTALLMENT-FREQUENCY_na,overall_ASSET_CLASS_01,overall_ASSET_CLASS_1,overall_ASSET_CLASS_2,overall_ASSET_CLASS_Doubtful,overall_ASSET_CLASS_Loss,overall_ASSET_CLASS_Special Mention Account,overall_ASSET_CLASS_Standard,overall_ASSET_CLASS_SubStandard,overall_ASSET_CLASS_na,overall_CREDIT-LIMIT/SANC AMT,overall_DISBURSED-AMT/HIGH CREDIT,overall_CURRENT-BAL,overall_OVERDUE-AMT,overall_WRITE-OFF-AMT,overall_instal_amt,overall_n_report,overall_dpd_hist,overall_curr_bal_hist,overall_amt_overdue_hist,overall_amt_paid_hist,overall_TENURE,overall_credit_dt,overall_closed_dt,overall_pay_dt,overall_bal_due_hist,overall_bal_credit,overall_credit_dis,overall_credit_bal,overall_over_bal,overall_over_credit,overall_write_credit,overall_write_bal,overall_instal_credit,overall_instal_bal,mat_days,mat_month,db_month
0,1,Monthly,Arrear,Closed,PDC_E,1,,48,450000,275000.0,275000.0,24000.0,2012-02-10,2016-01-15,2012-02-10,4022465,1568,21946,61.11,M,49.0,35833.33,RAISEN,MADHYA PRADESH,464993.0,> 48 Months,2012-02-10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,500000.0,443769.0,0.0,0.0,7934.0,15.0,0.0,441747.066667,0.0,0.0,84.0,411.0,0.0,395.0,0.0,0.887538,0.0,0.0,0.0,0.0,0.0,0.0,0.015868,0.017879,0.444444,0.555556,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.333333,0.0,0.555556,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.222222,0.0,0.0,0.0,0.0,0.0,0.0,0.777778,0.0,0.0,0.0,0.0,0.0,0.0,0.444444,0.0,0.555556,5555.555556,244594.666667,68725.111111,4208.111111,0.0,1037.666667,22.222222,7.738095,123417.273925,1700.477456,0.0,13.333333,935.888889,13.777778,212.333333,0.085538,0.396843,inf,inf,0.25,0.126744,0.0,0.0,0.005975,0.021359,1435.0,1.0,2


In [13]:
merged_data['BranchID'].nunique()

189

In [14]:
drop_col = ['ID','DisbursalDate','MaturityDAte','AuthDate','AssetID','DISBURSED-DT','ZiPCODE','Top-up Month']
features  = list(set(merged_data.columns) - set(drop_col) )

In [15]:
x_train = merged_data[features].copy()
x_train.head(3)

Unnamed: 0,overall_ASSET_CLASS_2,post_over_bal,SEX,post_instal_bal,overall_CREDIT-LIMIT/SANC AMT,mat_days,post_INSTALLMENT-FREQUENCY_F02,overall_bal_due_hist,PaymentMode,post_OWNERSHIP-IND_Primary,curr_OWNERSHIP-IND_Individual,pre_ACCOUNT-STATUS_SUIT FILED (WILFUL DEFAULT),post_write_credit,overall_OWNERSHIP-IND_Primary,overall_INSTALLMENT-FREQUENCY_F04,curr_ACCOUNT-STATUS_Closed,pre_bal_due_hist,curr_dpd_hist,curr_WRITE-OFF-AMT,post_DISBURSED-AMT/HIGH CREDIT,curr_ASSET_CLASS_Special Mention Account,pre_ACCOUNT-STATUS_Closed,pre_curr_bal_hist,curr_DISBURSED-AMT/HIGH CREDIT,curr_ACCOUNT-STATUS_Suit Filed,post_credit_bal,pre_over_bal,overall_INSTALLMENT-FREQUENCY_F06,overall_ACCOUNT-STATUS_Settled,overall_ACCOUNT-STATUS_Cancelled,curr_bal_due_hist,overall_OWNERSHIP-IND_Supl Card Holder,curr_INSTALLMENT-FREQUENCY_F04,curr_n_report,overall_INSTALLMENT-FREQUENCY_na,overall_WRITE-OFF-AMT,LTV,overall_ACCOUNT-STATUS_Written Off,overall_ASSET_CLASS_Doubtful,overall_write_bal,overall_INSTALLMENT-FREQUENCY_F07,curr_CURRENT-BAL,curr_OVERDUE-AMT,post_pay_dt,curr_ASSET_CLASS_1,pre_ASSET_CLASS_1,curr_instal_bal,post_INSTALLMENT-FREQUENCY_F01,post_TENURE,pre_ASSET_CLASS_na,overall_ASSET_CLASS_Special Mention Account,overall_MATCH-TYPE_SECONDARY,overall_INSTALLMENT-FREQUENCY_F05,curr_instal_credit,pre_ASSET_CLASS_01,overall_MATCH-TYPE_PRIMARY,overall_pay_dt,pre_INSTALLMENT-FREQUENCY_F08,curr_TENURE,post_bal_credit,curr_ACCOUNT-STATUS_Settled,overall_ACCOUNT-STATUS_Suit Filed,curr_ASSET_CLASS_Doubtful,pre_credit_dis,curr_INSTALLMENT-FREQUENCY_F03,curr_bal_credit,curr_ACCOUNT-STATUS_Delinquent,pre_amt_paid_hist,overall_OWNERSHIP-IND_Individual,overall_SELF-INDICATOR_True,post_OWNERSHIP-IND_Joint,post_ACCOUNT-STATUS_Written Off,post_SELF-INDICATOR_False,curr_ACCOUNT-STATUS_Active,post_OVERDUE-AMT,overall_OWNERSHIP-IND_Guarantor,post_INSTALLMENT-FREQUENCY_F07,post_INSTALLMENT-FREQUENCY_F10,post_CURRENT-BAL,pre_DISBURSED-AMT/HIGH CREDIT,curr_ASSET_CLASS_2,pre_dpd_hist,post_ASSET_CLASS_01,curr_amt_paid_hist,curr_ACCOUNT-STATUS_Written Off,post_ACCOUNT-STATUS_Closed,curr_SELF-INDICATOR_False,post_ASSET_CLASS_2,overall_amt_paid_hist,post_ACCOUNT-STATUS_Cancelled,pre_ASSET_CLASS_Standard,overall_ACCOUNT-STATUS_Closed,post_ASSET_CLASS_SubStandard,overall_INSTALLMENT-FREQUENCY_F01,post_write_bal,pre_ASSET_CLASS_2,overall_CURRENT-BAL,overall_write_credit,post_ASSET_CLASS_Standard,post_dpd_hist,...,post_ACCOUNT-STATUS_SUIT FILED (WILFUL DEFAULT),pre_ACCOUNT-STATUS_WILFUL DEFAULT,overall_instal_amt,pre_write_credit,post_ACCOUNT-STATUS_WILFUL DEFAULT,overall_instal_bal,overall_credit_bal,curr_write_bal,pre_bal_credit,overall_ACCOUNT-STATUS_WILFUL DEFAULT,overall_over_bal,post_ASSET_CLASS_Loss,pre_CREDIT-LIMIT/SANC AMT,pre_ACCOUNT-STATUS_Suit Filed,curr_over_bal,MonthlyIncome,pre_ACCOUNT-STATUS_Written Off,curr_over_credit,post_OWNERSHIP-IND_Guarantor,post_ACCOUNT-STATUS_Settled,overall_DISBURSED-AMT/HIGH CREDIT,overall_bal_credit,curr_CREDIT-LIMIT/SANC AMT,overall_ASSET_CLASS_01,post_WRITE-OFF-AMT,overall_OVERDUE-AMT,overall_INSTALLMENT-FREQUENCY_F02,pre_ACCOUNT-STATUS_Restructured,post_credit_dt,post_ASSET_CLASS_na,overall_OWNERSHIP-IND_Joint,curr_MATCH-TYPE_SECONDARY,curr_ASSET_CLASS_01,pre_SELF-INDICATOR_False,overall_ACCOUNT-STATUS_SUIT FILED (WILFUL DEFAULT),curr_ASSET_CLASS_Loss,db_month,post_curr_bal_hist,curr_ASSET_CLASS_SubStandard,pre_INSTALLMENT-FREQUENCY_F02,post_ASSET_CLASS_Doubtful,pre_instal_credit,overall_INSTALLMENT-FREQUENCY_F10,overall_n_report,EMI,pre_amt_overdue_hist,pre_OWNERSHIP-IND_Joint,Frequency,curr_ACCOUNT-STATUS_Restructured,pre_write_bal,curr_credit_dt,overall_dpd_hist,pre_OWNERSHIP-IND_Primary,curr_INSTALLMENT-FREQUENCY_F07,pre_ACCOUNT-STATUS_Cancelled,curr_INSTALLMENT-FREQUENCY_F02,pre_OWNERSHIP-IND_Individual,post_INSTALLMENT-FREQUENCY_na,post_INSTALLMENT-FREQUENCY_F04,pre_closed_dt,overall_INSTALLMENT-FREQUENCY_F08,pre_ASSET_CLASS_SubStandard,post_ACCOUNT-STATUS_Delinquent,pre_ACCOUNT-STATUS_Sold/Purchased,curr_pay_dt,pre_INSTALLMENT-FREQUENCY_F07,post_instal_amt,Area,State,overall_ACCOUNT-STATUS_Sold/Purchased,AmountFinance,pre_INSTALLMENT-FREQUENCY_na,curr_ACCOUNT-STATUS_Cancelled,post_ASSET_CLASS_1,overall_ASSET_CLASS_1,Tenure,post_credit_dis,overall_ASSET_CLASS_Loss,pre_ASSET_CLASS_Doubtful,BranchID,pre_TENURE,post_OWNERSHIP-IND_Supl Card Holder,pre_INSTALLMENT-FREQUENCY_F03,pre_over_credit,pre_ASSET_CLASS_Special Mention Account,curr_INSTALLMENT-FREQUENCY_na,overall_curr_bal_hist,overall_TENURE,pre_SELF-INDICATOR_True,overall_ASSET_CLASS_na,City,curr_write_credit,curr_credit_bal,curr_INSTALLMENT-FREQUENCY_F01,post_INSTALLMENT-FREQUENCY_F08,post_amt_paid_hist,curr_INSTALLMENT-FREQUENCY_F08,pre_credit_bal,post_OWNERSHIP-IND_Individual,post_over_credit
0,0.0,0.0,M,0.017879,5555.555556,1435.0,0.0,0.085538,PDC_E,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,500000.0,0.0,0.0,0.0,275000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.0,0.777778,0.0,61.11,0.0,0.0,0.0,0.0,0.0,0.0,395.0,0.0,0.0,0.0,0.0,84.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,212.333333,0.0,0.0,0.887538,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.555556,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,443769.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.555556,0.0,0.0,0.0,0.0,68725.111111,0.0,1.0,0.0,...,0.0,0.0,1037.666667,0.0,0.0,0.021359,inf,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,35833.33,0.0,0.0,0.0,0.0,244594.7,0.396843,0.0,0.0,0.0,4208.111111,0.0,0.0,411.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,441747.1,0.0,0.0,0.0,0.0,0.0,22.222222,24000.0,0.0,0.0,Monthly,0.0,0.0,1480.0,7.738095,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7934.0,,MADHYA PRADESH,0.0,275000.0,0.0,0.0,0.0,0.0,48,0.0,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,1.0,123417.273925,13.333333,0.0,0.555556,RAISEN,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,M,0.008422,0.0,1416.0,0.0,0.000998,PDC,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,15.722222,0.0,3800000.0,0.0,0.0,0.0,350000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01197,0.0,0.0,37.0,0.615385,0.0,70.0,0.0,0.0,0.0,0.0,0.0,0.0,122.0,0.0,0.0,0.0,0.0,300.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,638.153846,0.0,0.0,0.996473,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.692308,0.076923,0.0,0.0,1.0,0.0,0.0,0.307692,0.0,0.0,3786598.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,123143.986097,0.0,0.0,0.461538,0.0,0.0,0.0,0.0,796112.076923,0.0,1.0,0.0,...,0.0,0.0,13404.846154,0.0,0.0,0.106392,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,666.67,0.0,0.0,1.0,0.0,1393622.0,0.338578,0.0,0.0,0.0,0.0,0.0,0.0,123.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3,3160948.0,0.0,0.0,0.0,0.0,0.0,18.769231,10500.0,0.0,0.0,Monthly,0.0,0.0,1430.0,1.209402,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,31889.0,BHOPAL,MADHYA PRADESH,0.0,350000.0,0.0,0.0,0.0,0.0,47,0.0,0.0,0.0,333,0.0,0.0,0.0,0.0,0.0,1.0,843753.761673,43.461538,0.0,0.538462,SEHORE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,M,0.0,3290.322581,2064.0,0.0,0.003648,Direct Debit,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,9.066667,0.0,8703.0,0.0,1.0,0.0,519728.0,0.0,50.0,0.0,0.0,0.0,0.0,0.006267,0.0,0.0,33.0,0.806452,0.0,69.77,0.0,0.0,0.0,0.0,37637.0,0.0,170.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,3009.16129,0.0,0.0,0.022981,0.0,0.0,0.0,0.0,0.0,0.072417,0.0,0.0,0.83871,0.064516,0.0,0.0,1.0,1.0,0.0,0.032258,0.0,0.0,200.0,35700.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3971.695485,0.0,0.0,0.612903,0.0,0.0,0.0,0.0,66618.225806,0.0,0.0,0.0,...,0.0,0.0,3078.580645,0.0,0.0,inf,5.112978,0.0,0.0,0.0,inf,0.0,0.0,0.0,0.0,45257.0,0.0,0.0,0.0,0.0,119624.8,0.935873,0.0,0.0,0.0,3338.387097,0.0,0.0,203.0,1.0,0.129032,0.0,0.0,1.0,0.0,0.0,6,3412.333,0.0,0.0,0.0,0.0,0.0,12.806452,38300.0,0.0,0.0,Quatrly,0.0,0.0,958.0,4.58327,0.0,0.0,0.0,0.0,1.0,1.0,0.0,714.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,MADHYA PRADESH,0.0,519728.0,1.0,0.0,0.0,0.0,68,1.149029,0.0,0.0,1,0.0,0.0,0.0,0.0,0.0,1.0,134674.009949,14.645161,0.0,0.741935,BHOPAL,0.0,0.0,0.0,0.0,4769.166667,0.0,0.0,1.0,0.0


In [16]:
y_train = merged_data['Top-up Month']

categorical_var = ['Frequency', 'InstlmentMode', 'LoanStatus', 'PaymentMode','SEX', 'State','Area','BranchID','City']

x_train[categorical_var] = x_train[categorical_var].fillna('na')
x_train = compute_features(x_train)
x_train_copy = x_train.copy()


cat_enc = OneHotEncoder(handle_unknown='ignore')
cat_enc.fit(x_train[categorical_var])

x_train = hot_encode(x_train,categorical_var,cat_enc)

x_train.fillna(0,inplace=True)
x_train.sample(3)

Unnamed: 0,overall_ASSET_CLASS_2,post_over_bal,post_instal_bal,overall_CREDIT-LIMIT/SANC AMT,mat_days,post_INSTALLMENT-FREQUENCY_F02,overall_bal_due_hist,post_OWNERSHIP-IND_Primary,curr_OWNERSHIP-IND_Individual,pre_ACCOUNT-STATUS_SUIT FILED (WILFUL DEFAULT),post_write_credit,overall_OWNERSHIP-IND_Primary,overall_INSTALLMENT-FREQUENCY_F04,curr_ACCOUNT-STATUS_Closed,pre_bal_due_hist,curr_dpd_hist,curr_WRITE-OFF-AMT,post_DISBURSED-AMT/HIGH CREDIT,curr_ASSET_CLASS_Special Mention Account,pre_ACCOUNT-STATUS_Closed,pre_curr_bal_hist,curr_DISBURSED-AMT/HIGH CREDIT,curr_ACCOUNT-STATUS_Suit Filed,post_credit_bal,pre_over_bal,overall_INSTALLMENT-FREQUENCY_F06,overall_ACCOUNT-STATUS_Settled,overall_ACCOUNT-STATUS_Cancelled,curr_bal_due_hist,overall_OWNERSHIP-IND_Supl Card Holder,curr_INSTALLMENT-FREQUENCY_F04,curr_n_report,overall_INSTALLMENT-FREQUENCY_na,overall_WRITE-OFF-AMT,LTV,overall_ACCOUNT-STATUS_Written Off,overall_ASSET_CLASS_Doubtful,overall_write_bal,overall_INSTALLMENT-FREQUENCY_F07,curr_CURRENT-BAL,curr_OVERDUE-AMT,post_pay_dt,curr_ASSET_CLASS_1,pre_ASSET_CLASS_1,curr_instal_bal,post_INSTALLMENT-FREQUENCY_F01,post_TENURE,pre_ASSET_CLASS_na,overall_ASSET_CLASS_Special Mention Account,overall_MATCH-TYPE_SECONDARY,overall_INSTALLMENT-FREQUENCY_F05,curr_instal_credit,pre_ASSET_CLASS_01,overall_MATCH-TYPE_PRIMARY,overall_pay_dt,pre_INSTALLMENT-FREQUENCY_F08,curr_TENURE,post_bal_credit,curr_ACCOUNT-STATUS_Settled,overall_ACCOUNT-STATUS_Suit Filed,curr_ASSET_CLASS_Doubtful,pre_credit_dis,curr_INSTALLMENT-FREQUENCY_F03,curr_bal_credit,curr_ACCOUNT-STATUS_Delinquent,pre_amt_paid_hist,overall_OWNERSHIP-IND_Individual,overall_SELF-INDICATOR_True,post_OWNERSHIP-IND_Joint,post_ACCOUNT-STATUS_Written Off,post_SELF-INDICATOR_False,curr_ACCOUNT-STATUS_Active,post_OVERDUE-AMT,overall_OWNERSHIP-IND_Guarantor,post_INSTALLMENT-FREQUENCY_F07,post_INSTALLMENT-FREQUENCY_F10,post_CURRENT-BAL,pre_DISBURSED-AMT/HIGH CREDIT,curr_ASSET_CLASS_2,pre_dpd_hist,post_ASSET_CLASS_01,curr_amt_paid_hist,curr_ACCOUNT-STATUS_Written Off,post_ACCOUNT-STATUS_Closed,curr_SELF-INDICATOR_False,post_ASSET_CLASS_2,overall_amt_paid_hist,post_ACCOUNT-STATUS_Cancelled,pre_ASSET_CLASS_Standard,overall_ACCOUNT-STATUS_Closed,post_ASSET_CLASS_SubStandard,overall_INSTALLMENT-FREQUENCY_F01,post_write_bal,pre_ASSET_CLASS_2,overall_CURRENT-BAL,overall_write_credit,post_ASSET_CLASS_Standard,post_dpd_hist,curr_OWNERSHIP-IND_Joint,overall_ASSET_CLASS_SubStandard,...,City_RATLAM,City_RAVER,City_RAYAGADA,City_REPALLE,City_REWA,City_REWARI,City_ROHTAK,City_ROHTAS,City_ROPAR,City_RUDRAPRAYAG,City_RUPNAGAR,City_SABARKANTHA,City_SAGAR,City_SAGAR CANTT,City_SAGARDHIGHI,City_SAHARANPUR,City_SAHARSA,City_SAHASPUR,City_SAMASTIPUR,City_SAMBALPUR,City_SANGLI,City_SANGRUR,City_SANT KABIR NAGAR,City_SANT RAVIDAS NAGAR,City_SARAN,City_SATARA,City_SATNA,City_SAVANUR,City_SAWAI MADHOPUR,City_SECUNDARABAD,City_SEHORE,City_SEONI,City_SERAIKELA-KHARSAWAN,City_SHAHADOL,City_SHAHDOL,City_SHAHJAHANPUR,City_SHAJAPUR,City_SHEIKHPURA,City_SHEOPUR,City_SHIMOGA,City_SHIVPURI,City_SHRAWASTI,City_SIDDHARTHNAGAR,City_SIDHI,City_SIKAR,City_SIRMAUR,City_SIRSA,City_SITAMARHI,City_SITAPUR,City_SIVAGANGA,City_SIWAN,City_SOLAN,City_SOLAPUR,City_SONAPUR,City_SONBHADRA,City_SONEPAT,City_SONEPUR,City_SONIPAT,City_SOTAKANAL,City_SOUTH DINAJPUR,City_SRIKAKULAM,City_SULTANPUR,City_SUPAUL,City_SURAT,City_SURENDRA NAGAR,City_SURGUJA,City_SURI,City_TARANTARN,City_THANE,City_THANESAR,City_TIKAMGARH,City_TIRUPATI,City_TONK,City_TUMKUR,City_UDHAM SINGH NAGAR,City_UJJAIN,City_UMARIA,City_UNNAO,City_UTTAR DINAJPUR,City_UTTARA KANNADA,City_VADODARA,City_VAISHALI,City_VALSAD,City_VARANASI,City_VIDISHA,City_VIJAYWADA,City_VISAKHAPATNAM,City_VISHAKAPATNAM,City_VIZIANAGARAM,City_WARANGAL,City_WARDHA,City_WASHIM,City_WEST CHAMPARAN,City_WEST GODAVARI,City_WEST MIDNAPORE,City_WEST SINGHBHUM,City_YADGIR,City_YAMUNA NAGAR,City_YAVATMAL,City_na
119866,0.0,0.0,0.0,0.0,761.0,0.0,0.000373,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.416667,0.0,421000.0,0.0,0.0,0.0,421000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000745,0.0,0.0,37.0,1.0,0.0,73.22,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,66925.5,0.0,0.0,1.416667,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
123465,0.0,0.0,0.0,0.0,1369.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,265000.0,0.0,0.0,180005.459459,402770.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,16.0,0.428571,0.0,61.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.000306,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10595.108108,1.0,0.142857,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,265081.0,265000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2084.186873,0.0,1.0,0.857143,0.0,0.0,0.0,0.0,37868.714286,0.0,1.0,0.0,0.0,0.142857,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
93150,0.0,0.0,0.0,0.0,1629.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.0,1.0,0.0,74.47,0.0,0.0,0.0,0.0,161993.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.049825,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,161993.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
log_col = ['AssetCost', 'AmountFinance',
           'DisbursalAmount','MonthlyIncome','EMI','pre_CREDIT-LIMIT/SANC AMT','pre_DISBURSED-AMT/HIGH CREDIT',
           'pre_CURRENT-BAL','pre_OVERDUE-AMT','pre_WRITE-OFF-AMT','pre_curr_bal_hist','pre_amt_overdue_hist','pre_amt_paid_hist',
           'curr_CURRENT-BAL','curr_OVERDUE-AMT','curr_WRITE-OFF-AMT','curr_curr_bal_hist','curr_amt_overdue_hist','curr_amt_paid_hist',
           'post_CURRENT-BAL','post_OVERDUE-AMT','post_WRITE-OFF-AMT','post_curr_bal_hist','post_amt_overdue_hist','post_amt_paid_hist',
           'overall_CURRENT-BAL','overall_OVERDUE-AMT','overall_WRITE-OFF-AMT','overall_curr_bal_hist','overall_amt_overdue_hist','overall_amt_paid_hist',
           'disburse_diff','total_int','emi_salary_per','asset_value','emi_asset']

x_train[log_col] = np.log(x_train[log_col]+1)
x_train[log_col] = x_train[log_col].replace([-np.inf,np.inf,np.nan],[0,0,0])

  if __name__ == '__main__':
  if __name__ == '__main__':


In [18]:
# k_train = class_weight.compute_class_weight("balanced", np.unique(y_train), y_train)
# wt = dict(zip(np.unique(y_train), k_train))
# w_array = y_train.map(wt)
# w_array = w_array.values

In [26]:
xgb = HistGradientBoostingClassifier(max_iter=10,random_state=45,validation_fraction=0.2,scoring='f1_macro')
xgb.fit(x_train, y_train)

HistGradientBoostingClassifier(max_iter=10, random_state=45, scoring='f1_macro',
                               validation_fraction=0.2)

In [27]:
train_predict = xgb.predict(x_train)
train_predict = np.squeeze(train_predict)
print(classification_report(y_train, train_predict))

                   precision    recall  f1-score   support

      > 48 Months       0.74      0.15      0.25      8366
     12-18 Months       0.55      0.22      0.31      1034
     18-24 Months       0.70      0.22      0.33      2368
     24-30 Months       0.82      0.16      0.27      3492
     30-36 Months       0.78      0.18      0.29      3062
     36-48 Months       0.78      0.08      0.15      3656
No Top-up Service       0.86      0.99      0.92    106677

         accuracy                           0.85    128655
        macro avg       0.75      0.29      0.36    128655
     weighted avg       0.84      0.85      0.81    128655



In [28]:
test_data_path = Path(Path.cwd(),'Test','test_Data.xlsx')
test_bureau_path = Path(Path.cwd(),'Test','test_bureau.xlsx')

test_data = pd.read_excel(test_data_path)
test_bureau = pd.read_excel(test_bureau_path)

test_bureau_copy = test_bureau.copy()

In [22]:
test_bureau = clean_bureau(test_bureau)

test_bureau[cat_columns] = test_bureau[cat_columns].fillna('na')


test_merged_data = merge_bureau(test_data,test_bureau,cat_columns,num_col,enc)

x_test = test_merged_data[features].copy()
x_test[categorical_var] = x_test[categorical_var].fillna('na')

x_test = compute_features(x_test)
x_test = hot_encode(x_test,categorical_var,cat_enc)

x_test.fillna(0,inplace=True)

x_test[log_col] = np.log(x_test[log_col]+1)
x_test[log_col] = x_test[log_col].replace([-np.inf,np.inf,np.nan],[0,0,0])

  app.launch_new_instance()
  app.launch_new_instance()


In [23]:
x_test.shape,test_data.shape

((14745, 1093), (14745, 25))

In [24]:
test_predict = xgb.predict(x_test)
test_predict = np.squeeze(test_predict)
test_data['Top-up Month'] = test_predict
test_data[['ID','Top-up Month']].to_csv('submit_5.csv',index=False)

In [25]:
xgb.validation_score_

array([0.12951418, 0.21707285, 0.23032   , 0.25133456, 0.26687683,
       0.27999189, 0.29070583, 0.30258111, 0.31079294, 0.31765174,
       0.32224022, 0.32731643, 0.32944174, 0.333377  , 0.33482488,
       0.34057666, 0.34226479, 0.34454481, 0.3456263 , 0.34733465,
       0.34990548, 0.35467874, 0.35791615, 0.35765678, 0.35871955,
       0.36047232, 0.36007776, 0.36157483, 0.36164926, 0.3632492 ,
       0.36514727, 0.36652636, 0.36707025, 0.3668752 , 0.36817284,
       0.36847742, 0.36850019, 0.36954691, 0.36929015, 0.36950396,
       0.3697151 , 0.37148404, 0.3729017 , 0.37383034, 0.37527434,
       0.37593635, 0.37683498, 0.37656349, 0.37744777, 0.37712599,
       0.37692011, 0.37739583, 0.37744523, 0.3763338 , 0.37747355,
       0.37799112, 0.37773218, 0.37716315, 0.37740502, 0.37815698,
       0.37815902, 0.37831815, 0.37922966, 0.37961123, 0.37987855,
       0.38023299, 0.38166318, 0.38225901, 0.38248677, 0.38260954,
       0.38203096, 0.38131513, 0.38265814, 0.38342303, 0.38385