# Problem statement:
We have to predict whether to give a loan or not to a customer based on the dataset.

# Terminologies: With respect to bank
- Asset: Loan Product
  - Housing Loan
  - Personal Loan
  - Vehicle Loan
  - Group Loan
  - Educational Loan
  - Credit Card
    
- Liability:
  - Current Account
  - Savings Account
  - Fixed Deposit
  - Recurring Deposit

## NPA (Non Performing Asset)
Loan that is defaulted

## Disbursed Amount
Loan amount given to the customer.

## OSP (Outstanding Principal)
Balance amount of loan that is left and has to be paid to the bank by the customer.

## Amortization
Amortization is the process of paying off a loan over time through regular, fixed-amount payments that include both principal and interest. A bank's amortization schedule details how each payment is split, with early payments favoring interest and later payments increasing principal repayment, gradually reducing the loan's outstanding balance until it's fully paid. This system allows both borrowers and lenders to manage and track debt repayment over a fixed period.
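
To make the split between interest and principal concrete, here is a minimal sketch of an amortization schedule. It assumes a fixed annual rate greater than zero, monthly compounding, and the standard equal-instalment (EMI) formula; the loan figures are hypothetical and the function name is only illustrative.

```python
def amortization_schedule(principal, annual_rate, months):
    """Return (emi, rows) for a fixed-rate loan repaid in equal monthly instalments."""
    r = annual_rate / 12  # monthly interest rate (assumed > 0)
    # Standard EMI formula: P * r * (1 + r)^n / ((1 + r)^n - 1)
    factor = (1 + r) ** months
    emi = principal * r * factor / (factor - 1)

    rows = []
    balance = principal
    for month in range(1, months + 1):
        interest = balance * r           # interest charged on the outstanding balance
        principal_part = emi - interest  # the rest of the EMI reduces the principal
        balance -= principal_part
        rows.append((month, interest, principal_part, max(balance, 0.0)))
    return emi, rows


# Hypothetical loan: 100,000 at 12% per year, repaid over 12 months
emi, rows = amortization_schedule(principal=100_000, annual_rate=0.12, months=12)
print(f"EMI: {emi:.2f}")
for month, interest, principal_part, balance in rows:
    print(f"{month:>2}  interest={interest:8.2f}  principal={principal_part:8.2f}  balance={balance:10.2f}")
```

The interest column shrinks and the principal column grows from one month to the next, while the EMI itself stays constant.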

## DPD (Days past due)
If an EMI is paid 2 days after its due date, DPD is 2, which means the payment is overdue.
Ideally DPD should be zero.

## PAR (Portfolio At Risk)
OSP when DPD>0

## NPA
A loan account becomes an NPA account once its DPD exceeds 90.

# Credit Risk Types in Banking

- DPD (zero) : NDA (Non-Delinquent Account)
- DPD (1 to 30) : SMA1 (Special Mention Account)
- DPD (31 to 60) : SMA2 (Special Mention Account)
- DPD (61 to 90) : SMA3 (Special Mention Account)
- DPD (91 to 180) : NPA
- DPD (>180) : Written-off (loan removed from the bank's books)
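
The buckets above, together with the PAR definition, can be sketched as a small helper. This is only an illustration: the function name and the loan figures are hypothetical, and a DPD of exactly 90 is treated as the last SMA bucket because an account becomes NPA only when DPD exceeds 90.

```python
def dpd_bucket(dpd):
    """Map days past due to the risk bucket used in these notes."""
    if dpd <= 0:
        return "NDA"
    if dpd <= 30:
        return "SMA1"
    if dpd <= 60:
        return "SMA2"
    if dpd <= 90:
        return "SMA3"
    if dpd <= 180:
        return "NPA"
    return "Written-off"


# Hypothetical loan book: (outstanding principal, days past due)
loans = [(50_000, 0), (20_000, 12), (80_000, 45), (10_000, 95), (5_000, 200)]

buckets = [dpd_bucket(dpd) for _, dpd in loans]
# Portfolio at risk: outstanding principal on every account with DPD > 0
par = sum(osp for osp, dpd in loans if dpd > 0)
total_osp = sum(osp for osp, _ in loans)
print(buckets)
print(f"PAR = {par} ({par / total_osp:.1%} of the portfolio)")
```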

Improve NPA -> Loan Portfolio quality of the bank will be better -> Market Sentiment will be good -> Stock price will improve

## GNPA (3-5%): 
GNPA stands for gross non-performing assets. GNPA is an absolute amount. It tells you the total value of gross non-performing assets for the bank in a particular quarter or financial year, as the case may be.

## NNPA (0.01 - 0.06%): 
NNPA stands for net non-performing assets. NNPA subtracts the provisions made by the bank from the gross NPA. Therefore net NPA gives you the exact value of non-performing assets after the bank has made specific provisions.

- Always look at GNPA when assessing a bank's asset quality

## Target
- P1 is too good
- P2 is good
- P3 is bad
- P4 is too bad

## Does a Credit Card Affect CIBIL Score? How Does It Do So?

Credit cards affect your credit score in the following ways:


- Card repayment history:

While using a credit card, you must be mindful of how and when to repay the borrowed amount. For instance, your credit score will be positively impacted if you pay the entire amount due on time. However, your credit score will deteriorate if you consistently pay only the minimum amount due or tend to miss payments. Defaulting on your credit card payments will negatively impact your credit score more than late payments. Remember that your repayment history contributes significantly towards your credit score, so repay what you borrow on time.


- Credit Utilisation Ratio:

Another factor considered while calculating your credit score is the credit utilisation ratio. But what is it? Essentially, your credit utilisation ratio is calculated by taking into account your total outstanding debt and dividing it by the total credit available to you. The resultant value is presented as a percentage. How does credit utilisation impact credit score? Usually, it is advisable that you keep your credit utilisation ratio under 30%. Failing to do so has a negative impact on your credit score.


- Length of Credit History:

If you have not been using a credit card and are considering closing it, you might affect your credit score. Your credit score is impacted by the length of your credit history. Since an old credit card is instrumental in building your credit history, it can help your credit score. It can help a lender gauge how your creditworthiness has evolved over the course of holding the card.


- The Number of Credit Cards:

On the one hand, having multiple cards is advantageous. It helps extend the credit available to you, thereby reducing your credit utilisation ratio. On the other hand, having too many credit cards has an adverse effect on your credit score. As a rule of thumb, you should try to avoid having more than three active credit cards at any given time. Too many cards can cause difficulty in repayments, causing your credit score to drop. Plus, it may also indicate that you require too much credit to get by.
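
The credit utilisation ratio described above is simple arithmetic, and a small sketch also shows why an extra card lowers it. The helper name and all amounts are hypothetical; the 30% guideline is the one quoted in the notes.

```python
def credit_utilisation(outstanding, limits):
    """Total outstanding balance divided by total credit limit, as a percentage."""
    total_limit = sum(limits)
    if total_limit == 0:
        raise ValueError("total credit limit must be positive")
    return 100 * sum(outstanding) / total_limit


# Hypothetical borrower: 45,000 outstanding on a single card with a 100,000 limit
one_card = credit_utilisation([45_000], [100_000])
# Same spending, with a second unused card that adds another 100,000 of limit
two_cards = credit_utilisation([45_000, 0], [100_000, 100_000])
print(f"{one_card:.1f}% vs {two_cards:.1f}%")
```

With one card the borrower is above the 30% guideline; with the second card the same spending falls below it.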

In [1]:
import numpy as np
import pandas as pd

In [2]:
a1 = pd.read_csv('Datasets/case_study1.csv')
a2 = pd.read_csv('Datasets/case_study2.csv')

In [3]:
a1.head()

Unnamed: 0,PROSPECTID,Total_TL,Tot_Closed_TL,Tot_Active_TL,Total_TL_opened_L6M,Tot_TL_closed_L6M,pct_tl_open_L6M,pct_tl_closed_L6M,pct_active_tl,pct_closed_tl,...,CC_TL,Consumer_TL,Gold_TL,Home_TL,PL_TL,Secured_TL,Unsecured_TL,Other_TL,Age_Oldest_TL,Age_Newest_TL
0,1,5,4,1,0,0,0.0,0.0,0.2,0.8,...,0,0,1,0,4,1,4,0,72,18
1,2,1,0,1,0,0,0.0,0.0,1.0,0.0,...,0,1,0,0,0,0,1,0,7,7
2,3,8,0,8,1,0,0.125,0.0,1.0,0.0,...,0,6,1,0,0,2,6,0,47,2
3,4,1,0,1,1,0,1.0,0.0,1.0,0.0,...,0,0,0,0,0,0,1,1,5,5
4,5,3,2,1,0,0,0.0,0.0,0.333,0.667,...,0,0,0,0,0,3,0,2,131,32


In [4]:
a2.head()

Unnamed: 0,PROSPECTID,time_since_recent_payment,time_since_first_deliquency,time_since_recent_deliquency,num_times_delinquent,max_delinquency_level,max_recent_level_of_deliq,num_deliq_6mts,num_deliq_12mts,num_deliq_6_12mts,...,pct_CC_enq_L6m_of_L12m,pct_PL_enq_L6m_of_ever,pct_CC_enq_L6m_of_ever,max_unsec_exposure_inPct,HL_Flag,GL_Flag,last_prod_enq2,first_prod_enq2,Credit_Score,Approved_Flag
0,1,549,35,15,11,29,29,0,0,0,...,0.0,0.0,0.0,13.333,1,0,PL,PL,696,P2
1,2,47,-99999,-99999,0,-99999,0,0,0,0,...,0.0,0.0,0.0,0.86,0,0,ConsumerLoan,ConsumerLoan,685,P2
2,3,302,11,3,9,25,25,1,9,8,...,0.0,0.0,0.0,5741.667,1,0,ConsumerLoan,others,693,P2
3,4,-99999,-99999,-99999,0,-99999,0,0,0,0,...,0.0,0.0,0.0,9.9,0,0,others,others,673,P2
4,5,583,-99999,-99999,0,-99999,0,0,0,0,...,0.0,0.0,0.0,-99999.0,0,0,AL,AL,753,P1


In [5]:
a1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51336 entries, 0 to 51335
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   PROSPECTID            51336 non-null  int64  
 1   Total_TL              51336 non-null  int64  
 2   Tot_Closed_TL         51336 non-null  int64  
 3   Tot_Active_TL         51336 non-null  int64  
 4   Total_TL_opened_L6M   51336 non-null  int64  
 5   Tot_TL_closed_L6M     51336 non-null  int64  
 6   pct_tl_open_L6M       51336 non-null  float64
 7   pct_tl_closed_L6M     51336 non-null  float64
 8   pct_active_tl         51336 non-null  float64
 9   pct_closed_tl         51336 non-null  float64
 10  Total_TL_opened_L12M  51336 non-null  int64  
 11  Tot_TL_closed_L12M    51336 non-null  int64  
 12  pct_tl_open_L12M      51336 non-null  float64
 13  pct_tl_closed_L12M    51336 non-null  float64
 14  Tot_Missed_Pmnt       51336 non-null  int64  
 15  Auto_TL            

In [6]:
a2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51336 entries, 0 to 51335
Data columns (total 62 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   PROSPECTID                    51336 non-null  int64  
 1   time_since_recent_payment     51336 non-null  int64  
 2   time_since_first_deliquency   51336 non-null  int64  
 3   time_since_recent_deliquency  51336 non-null  int64  
 4   num_times_delinquent          51336 non-null  int64  
 5   max_delinquency_level         51336 non-null  int64  
 6   max_recent_level_of_deliq     51336 non-null  int64  
 7   num_deliq_6mts                51336 non-null  int64  
 8   num_deliq_12mts               51336 non-null  int64  
 9   num_deliq_6_12mts             51336 non-null  int64  
 10  max_deliq_6mts                51336 non-null  int64  
 11  max_deliq_12mts               51336 non-null  int64  
 12  num_times_30p_dpd             51336 non-null  int64  
 13  n

In [7]:
a1.describe()

Unnamed: 0,PROSPECTID,Total_TL,Tot_Closed_TL,Tot_Active_TL,Total_TL_opened_L6M,Tot_TL_closed_L6M,pct_tl_open_L6M,pct_tl_closed_L6M,pct_active_tl,pct_closed_tl,...,CC_TL,Consumer_TL,Gold_TL,Home_TL,PL_TL,Secured_TL,Unsecured_TL,Other_TL,Age_Oldest_TL,Age_Newest_TL
count,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,...,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0
mean,25668.5,4.858598,2.770415,2.088184,0.736754,0.428919,0.184574,0.089095,0.577542,0.422458,...,0.124981,1.136084,1.561847,0.070146,0.282511,2.844904,2.013694,1.089762,-32.575639,-62.149525
std,14819.571046,7.177116,5.94168,2.290774,1.296717,0.989972,0.297414,0.205635,0.379867,0.379867,...,0.505201,2.227997,5.376434,0.340861,0.858168,6.187177,3.198322,2.417496,2791.869609,2790.818622
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-99999.0,-99999.0
25%,12834.75,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.25,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14.0,4.0
50%,25668.5,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.556,0.444,...,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,33.0,8.0
75%,38502.25,5.0,3.0,3.0,1.0,1.0,0.308,0.053,1.0,0.75,...,0.0,1.0,1.0,0.0,0.0,3.0,2.0,1.0,64.0,17.0
max,51336.0,235.0,216.0,47.0,27.0,19.0,1.0,1.0,1.0,1.0,...,27.0,41.0,235.0,10.0,29.0,235.0,55.0,80.0,392.0,392.0


In [8]:
a2.describe()

Unnamed: 0,PROSPECTID,time_since_recent_payment,time_since_first_deliquency,time_since_recent_deliquency,num_times_delinquent,max_delinquency_level,max_recent_level_of_deliq,num_deliq_6mts,num_deliq_12mts,num_deliq_6_12mts,...,PL_utilization,PL_Flag,pct_PL_enq_L6m_of_L12m,pct_CC_enq_L6m_of_L12m,pct_PL_enq_L6m_of_ever,pct_CC_enq_L6m_of_ever,max_unsec_exposure_inPct,HL_Flag,GL_Flag,Credit_Score
count,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,...,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0,51336.0
mean,25668.5,-8129.961314,-70020.09132,-70022.375838,1.573749,-70003.987085,13.521953,0.184977,0.480053,0.295076,...,-86556.225194,0.167874,0.190414,0.065182,0.170492,0.056302,-45127.943635,0.271116,0.052887,679.859222
std,14819.571046,27749.328514,45823.312757,45819.820741,4.165012,45847.9761,53.336976,0.71024,1.52221,1.027471,...,34111.41475,0.373758,0.376218,0.235706,0.350209,0.213506,49795.784556,0.44454,0.22381,20.502764
min,1.0,-99999.0,-99999.0,-99999.0,0.0,-99999.0,0.0,0.0,0.0,0.0,...,-99999.0,0.0,0.0,0.0,0.0,0.0,-99999.0,0.0,0.0,469.0
25%,12834.75,46.0,-99999.0,-99999.0,0.0,-99999.0,0.0,0.0,0.0,0.0,...,-99999.0,0.0,0.0,0.0,0.0,0.0,-99999.0,0.0,0.0,669.0
50%,25668.5,70.0,-99999.0,-99999.0,0.0,-99999.0,0.0,0.0,0.0,0.0,...,-99999.0,0.0,0.0,0.0,0.0,0.0,0.333,0.0,0.0,680.0
75%,38502.25,161.0,8.0,3.0,1.0,15.0,10.0,0.0,0.0,0.0,...,-99999.0,0.0,0.0,0.0,0.0,0.0,2.16425,1.0,0.0,691.0
max,51336.0,6065.0,35.0,35.0,74.0,900.0,900.0,12.0,28.0,20.0,...,1.708,1.0,1.0,1.0,1.0,1.0,173800.0,1.0,1.0,811.0


In [9]:
df1 = a1.copy()

In [10]:
df2 = a2.copy()

### -99999 is the placeholder for missing (NaN) values in the datasets

In [11]:
df1.columns

Index(['PROSPECTID', 'Total_TL', 'Tot_Closed_TL', 'Tot_Active_TL',
       'Total_TL_opened_L6M', 'Tot_TL_closed_L6M', 'pct_tl_open_L6M',
       'pct_tl_closed_L6M', 'pct_active_tl', 'pct_closed_tl',
       'Total_TL_opened_L12M', 'Tot_TL_closed_L12M', 'pct_tl_open_L12M',
       'pct_tl_closed_L12M', 'Tot_Missed_Pmnt', 'Auto_TL', 'CC_TL',
       'Consumer_TL', 'Gold_TL', 'Home_TL', 'PL_TL', 'Secured_TL',
       'Unsecured_TL', 'Other_TL', 'Age_Oldest_TL', 'Age_Newest_TL'],
      dtype='object')

In [12]:
df2.columns

Index(['PROSPECTID', 'time_since_recent_payment',
       'time_since_first_deliquency', 'time_since_recent_deliquency',
       'num_times_delinquent', 'max_delinquency_level',
       'max_recent_level_of_deliq', 'num_deliq_6mts', 'num_deliq_12mts',
       'num_deliq_6_12mts', 'max_deliq_6mts', 'max_deliq_12mts',
       'num_times_30p_dpd', 'num_times_60p_dpd', 'num_std', 'num_std_6mts',
       'num_std_12mts', 'num_sub', 'num_sub_6mts', 'num_sub_12mts', 'num_dbt',
       'num_dbt_6mts', 'num_dbt_12mts', 'num_lss', 'num_lss_6mts',
       'num_lss_12mts', 'recent_level_of_deliq', 'tot_enq', 'CC_enq',
       'CC_enq_L6m', 'CC_enq_L12m', 'PL_enq', 'PL_enq_L6m', 'PL_enq_L12m',
       'time_since_recent_enq', 'enq_L12m', 'enq_L6m', 'enq_L3m',
       'MARITALSTATUS', 'EDUCATION', 'AGE', 'GENDER', 'NETMONTHLYINCOME',
       'Time_With_Curr_Empr', 'pct_of_active_TLs_ever',
       'pct_opened_TLs_L6m_of_L12m', 'pct_currentBal_all_TL', 'CC_utilization',
       'CC_Flag', 'PL_utilization', 'PL_Fla

In [13]:
df1.shape

(51336, 26)

In [14]:
df2.shape

(51336, 62)

In [15]:
df1['Age_Oldest_TL'].unique()

array([    72,      7,     47,      5,    131,    150,     17,     36,
           16,     66,     64,     96,     49,     38,      9,      6,
          110,    138,      8,     92,     40,     11,     51,     59,
           37,    159,     10,     20,      4,     26,     19,     41,
           73,     45,     32,     33,     48,     18,     60,     14,
           83,     44,     24,     42,     39,     12,     27,     70,
           76,    120,    115,     46,     93,     56,     61,    113,
           67,     74,     22,    191,      3,     65,    192,     43,
           13,     29,    193,     98,     63,     58,     30,     23,
           69,     53,    145,     31,     77,    104,     87,     15,
           62,     21,     97,     34,     28,    137,     86,    124,
          129,     50,     35,      2,    102,    154,    148,    128,
           94,    107,    135,     81,     68,     78,    130,     91,
           71,     89,    123,    213,     88,     52,    175,      1,
      

In [16]:
df1.replace(-99999, np.nan)['Age_Oldest_TL'].unique()

array([ 72.,   7.,  47.,   5., 131., 150.,  17.,  36.,  16.,  66.,  64.,
        96.,  49.,  38.,   9.,   6., 110., 138.,   8.,  92.,  40.,  11.,
        51.,  59.,  37., 159.,  10.,  20.,   4.,  26.,  19.,  41.,  73.,
        45.,  32.,  33.,  48.,  18.,  60.,  14.,  83.,  44.,  24.,  42.,
        39.,  12.,  27.,  70.,  76., 120., 115.,  46.,  93.,  56.,  61.,
       113.,  67.,  74.,  22., 191.,   3.,  65., 192.,  43.,  13.,  29.,
       193.,  98.,  63.,  58.,  30.,  23.,  69.,  53., 145.,  31.,  77.,
       104.,  87.,  15.,  62.,  21.,  97.,  34.,  28., 137.,  86., 124.,
       129.,  50.,  35.,   2., 102., 154., 148., 128.,  94., 107., 135.,
        81.,  68.,  78., 130.,  91.,  71.,  89., 123., 213.,  88.,  52.,
       175.,   1., 174., 111.,  90.,  57., 183., 108.,  54.,  79., 166.,
        55., 160., 146.,  84., 167.,  75.,  80., 116., 100.,  nan, 197.,
       181., 155., 126.,  25., 168.,  95., 149., 136., 172., 134., 133.,
       179., 184., 144., 177., 117., 176., 114., 18

In [17]:
df1 = df1.replace(-99999, np.nan)

In [18]:
df2 = df2.replace(-99999, np.nan)

In [19]:
(df1.isna().sum() / df1.shape[0])*100

PROSPECTID              0.000000
Total_TL                0.000000
Tot_Closed_TL           0.000000
Tot_Active_TL           0.000000
Total_TL_opened_L6M     0.000000
Tot_TL_closed_L6M       0.000000
pct_tl_open_L6M         0.000000
pct_tl_closed_L6M       0.000000
pct_active_tl           0.000000
pct_closed_tl           0.000000
Total_TL_opened_L12M    0.000000
Tot_TL_closed_L12M      0.000000
pct_tl_open_L12M        0.000000
pct_tl_closed_L12M      0.000000
Tot_Missed_Pmnt         0.000000
Auto_TL                 0.000000
CC_TL                   0.000000
Consumer_TL             0.000000
Gold_TL                 0.000000
Home_TL                 0.000000
PL_TL                   0.000000
Secured_TL              0.000000
Unsecured_TL            0.000000
Other_TL                0.000000
Age_Oldest_TL           0.077918
Age_Newest_TL           0.077918
dtype: float64

In [20]:
(df2.isna().sum() / df2.shape[0])*100

PROSPECTID                       0.000000
time_since_recent_payment        8.358657
time_since_first_deliquency     70.026882
time_since_recent_deliquency    70.026882
num_times_delinquent             0.000000
                                  ...    
GL_Flag                          0.000000
last_prod_enq2                   0.000000
first_prod_enq2                  0.000000
Credit_Score                     0.000000
Approved_Flag                    0.000000
Length: 62, dtype: float64

In [21]:
# Very few rows in df1 contain NaN (about 0.08%), so we drop those rows directly.
df1 = df1.dropna()

In [22]:
nan_percent = (df2.isna().sum() / df2.shape[0]) * 100
nan_cols = nan_percent[nan_percent > 10].index

In [23]:
df2 = df2.drop(columns=nan_cols)

In [24]:
nan_cols

Index(['time_since_first_deliquency', 'time_since_recent_deliquency',
       'max_delinquency_level', 'max_deliq_6mts', 'max_deliq_12mts', 'tot_enq',
       'CC_enq', 'CC_enq_L6m', 'CC_enq_L12m', 'PL_enq', 'PL_enq_L6m',
       'PL_enq_L12m', 'time_since_recent_enq', 'enq_L12m', 'enq_L6m',
       'enq_L3m', 'CC_utilization', 'PL_utilization',
       'max_unsec_exposure_inPct'],
      dtype='object')

In [25]:
(df2.isna().sum() / df2.shape[0])*100

PROSPECTID                    0.000000
time_since_recent_payment     8.358657
num_times_delinquent          0.000000
max_recent_level_of_deliq     0.000000
num_deliq_6mts                0.000000
num_deliq_12mts               0.000000
num_deliq_6_12mts             0.000000
num_times_30p_dpd             0.000000
num_times_60p_dpd             0.000000
num_std                       0.000000
num_std_6mts                  0.000000
num_std_12mts                 0.000000
num_sub                       0.000000
num_sub_6mts                  0.000000
num_sub_12mts                 0.000000
num_dbt                       0.000000
num_dbt_6mts                  0.000000
num_dbt_12mts                 0.000000
num_lss                       0.000000
num_lss_6mts                  0.000000
num_lss_12mts                 0.000000
recent_level_of_deliq         0.000000
MARITALSTATUS                 0.000000
EDUCATION                     0.000000
AGE                           0.000000
GENDER                   

In [26]:
df2 = df2.dropna()

In [27]:
df2.shape

(46981, 43)

In [28]:
df1.shape

(51296, 26)

In [29]:
df = pd.merge(df1, df2, on="PROSPECTID", how="inner")

In [30]:
# Inner merge: rows = PROSPECTIDs present in both df1 and df2; columns = 26 + 43 - 1 = 68 (the key is shared)
df.shape

(46978, 68)

In [31]:
df.to_csv('Datasets/processed_data.csv')

In [32]:
df.isna().sum().sum()

np.int64(0)

In [33]:
df = df.drop(columns=['PROSPECTID'])

In [34]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46978 entries, 0 to 46977
Data columns (total 67 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Total_TL                    46978 non-null  int64  
 1   Tot_Closed_TL               46978 non-null  int64  
 2   Tot_Active_TL               46978 non-null  int64  
 3   Total_TL_opened_L6M         46978 non-null  int64  
 4   Tot_TL_closed_L6M           46978 non-null  int64  
 5   pct_tl_open_L6M             46978 non-null  float64
 6   pct_tl_closed_L6M           46978 non-null  float64
 7   pct_active_tl               46978 non-null  float64
 8   pct_closed_tl               46978 non-null  float64
 9   Total_TL_opened_L12M        46978 non-null  int64  
 10  Tot_TL_closed_L12M          46978 non-null  int64  
 11  pct_tl_open_L12M            46978 non-null  float64
 12  pct_tl_closed_L12M          46978 non-null  float64
 13  Tot_Missed_Pmnt             469

In [35]:
y = df['Approved_Flag']
X = df.drop(columns=['Approved_Flag'])

In [36]:
object_cols = X.select_dtypes(include=['object']).columns.tolist()

## Hypothesis Testing
---
- H0 : Null Hypothesis
    - NOT Associated
- H1 : Alternate Hypothesis
    - Associated
---
- Alpha: Significance Level
    - Less risky projects: higher alpha
    - More risky projects: lower alpha
- Confidence level:
    - 1 - alpha
---
- Calculate the evidence against H0:
    - p - value
        - Calculated using tests:
            - T - test: cat( 2 classes) vs numerical
            - Chi square: cat vs cat
            - Anova: cat(>2classes) vs numerical
---
- p - value <= alpha:
    - Reject H0
- otherwise:
    - Fail to reject H0
---

- MARITALSTATUS vs Approved_Flag -> chi square
- Age vs Approved_Flag -> Anova
- Age vs Gender -> T-test

## Are MARITALSTATUS and Approved_Flag associated or NOT?
- Can we say 'Married'=0 and 'Single'=1?
    - No, because there is no order of preference in them.
- The chi-square test is used to check for association between categorical columns.
---
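
To show what the chi-square test computes, here is a hand-rolled sketch on a small, made-up contingency table (the counts are hypothetical, not taken from the dataset). It uses a 2 x 3 table on purpose: with exactly 2 degrees of freedom the chi-square tail probability has the closed form exp(-x / 2), so no statistics library is needed.

```python
import math

# Hypothetical contingency table: rows = marital status, columns = P1, P2, P3
observed = [
    [30, 50, 20],  # Married
    [10, 50, 40],  # Single
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under H0 (no association): row total * column total / grand total
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)
# With exactly 2 degrees of freedom the chi-square tail probability is exp(-x / 2)
p_value = math.exp(-chi2 / 2)

alpha = 0.05
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.5f}")
print("Reject H0" if p_value <= alpha else "Fail to reject H0")
```

For tables of any other shape the p-value needs the general chi-square distribution, which is what `chi2_contingency` provides.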

In [37]:
from scipy.stats import chi2_contingency 

In [38]:
object_cols

['MARITALSTATUS', 'EDUCATION', 'GENDER', 'last_prod_enq2', 'first_prod_enq2']

In [39]:
for col in object_cols:
    chi2, pval, _, _ = chi2_contingency(pd.crosstab(X[col], y))
    print(col, ":", pval)

MARITALSTATUS : 4.79135689965816e-246
EDUCATION : 2.9557177109133494e-35
GENDER : 0.00014018705965762203
last_prod_enq2 : 0.0
first_prod_enq2 : 0.0


---
### Since every categorical feature has pval <= 0.05, we reject H0 for all of them: each one is associated with Approved_Flag, so all are kept
---

In [40]:
num_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

In [41]:
len(num_cols) + len(object_cols)

66

In [42]:
X.shape

(46978, 66)

---
## Correlation VS Multicollinearity
1. Correlation

Definition: Measures the strength and direction of a linear relationship between two variables.

Metric: Pearson’s r (–1 to +1).

Example: Income and credit limit → r = 0.75 (strong positive correlation).

Can exist between independent variables, dependent variables, or both.

2. Multicollinearity

Definition: When two or more independent variables in a regression model are highly correlated with each other.

Problem: Makes it difficult to separate out the individual effect of predictors → unstable coefficients, inflated standard errors.

It’s not just pairwise correlation — it can be a combination effect (e.g., X₁ is explained by X₂ + X₃ together).

Detected using:

Variance Inflation Factor (VIF)
VIF = 1 / (1-R^2)

Condition number of X matrix

---
### Multicollinearity: Predictability of each feature using other features
### Correlation: It is specific to linear relationships between cols
- For non-linear relationships (for example a U-shaped, convex curve), correlation gives misleading values.
---
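
A minimal sketch of both ideas, using made-up numbers. It relies on one assumption worth stating: when a predictor is regressed on a single other predictor (with an intercept), R^2 equals the squared Pearson correlation, so the VIF formula can be evaluated by hand for that two-predictor case. With more predictors a regression is needed, as in the statsmodels-based cells below.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Two hypothetical predictors that move almost together
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

r = pearson_r(x1, x2)
# With one other predictor (and an intercept), R^2 of x1 regressed on x2 is r^2
vif = 1 / (1 - r ** 2)
print(f"r={r:.4f}, VIF={vif:.1f}")

# A perfectly U-shaped dependence still has zero linear correlation
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]
print(f"r for y = x^2: {pearson_r(x, y):.4f}")
```

The first pair gives a VIF far above the threshold of 6 used in this notebook, while the second pair shows a feature that fully determines another yet has no linear correlation with it.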

In [43]:
vif_data = X[num_cols]
total_cols = vif_data.shape[1]
cols_to_be_kept = []
col_index = 0

In [44]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [45]:
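for i in range(total_cols):
    # VIF of the column currently at position col_index, measured against
    # the columns that are still in vif_data
    vif_value = variance_inflation_factor(vif_data, col_index)
    print(col_index, ' : ', vif_value)

    if vif_value <= 6:
        # Low enough: keep the column and move on to the next position
        cols_to_be_kept.append(num_cols[i])
        col_index += 1
    else:
        # Too collinear: drop it, so the next column slides into col_index
        vif_data = vif_data.drop(columns=[num_cols[i]])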

  vif = 1. / (1. - r_squared_i)


0  :  inf


  vif = 1. / (1. - r_squared_i)


0  :  inf
0  :  10.769812791073365
0  :  8.056281194701139
0  :  6.394370818790994
0  :  4.963326839610698
1  :  2.528881596862871


  vif = 1. / (1. - r_squared_i)


2  :  inf
2  :  896.101144246411
2  :  7.702117484626237
2  :  3.6725507364876733
3  :  5.031432985277633
4  :  5.228932196819313
5  :  1.9436253777775145


  vif = 1. / (1. - r_squared_i)


6  :  inf
6  :  3.723607309597259
7  :  21.324303275274634
7  :  26.27682915798789
7  :  4.380454497311235
8  :  2.6988115681289653
9  :  2.818566431017511
10  :  3.6456992505283896
11  :  2.1822512980561686
12  :  4.974990288702898
13  :  5.471823087680818
14  :  3.636352332412496
15  :  7.84574911550266
15  :  5.367126269807443


  vif = 1. / (1. - r_squared_i)


16  :  inf
16  :  7.292450531094948
16  :  1.4013805528399852
17  :  8.355742999843102
17  :  1.630557047046741
18  :  7.108536914261136
18  :  15.651326928784401
18  :  1.8254868553896395
19  :  1.5823241338363074
20  :  2.550243603183263
21  :  3.1235309746350097
22  :  2.253304119094641
23  :  6.972476818755195
23  :  2.121811199551098
24  :  2.740411299101044
25  :  6.293724893485941
25  :  2.7238832013930905
26  :  4.874649089486998
27  :  21.959962166849063
27  :  2.84316419265331
28  :  3.414575442349153
29  :  9.78641243659599
29  :  6.609435998831425
29  :  1.0009437131684922
30  :  3.0404082204167855
31  :  2.744207469342665
32  :  20.121702888873564
32  :  15.492980488114862
32  :  1.4354946487687503
33  :  1.2190139349690248
34  :  2.0075813400113023
35  :  4.34464163208555
36  :  13.103383592760453


In [46]:
len(cols_to_be_kept)

36

## VIF Parallel vs VIF Sequential
- In parallel, compute the VIF of every feature against all the others in one pass, then drop every feature above the threshold at once. **Wrong method**, because dropping one feature changes the VIF of those that remain.
- In sequential, compute the VIF of one feature against the features still present, keep or drop it based on that value, then repeat for each remaining feature.

## Anova Test

In [47]:
from scipy.stats import f_oneway

In [48]:
cols_to_be_kept_after_anova = []

for col in cols_to_be_kept:
    # Split the feature's values into one group per target class
    groups = [X.loc[y == label, col] for label in ['P1', 'P2', 'P3', 'P4']]

    # One-way ANOVA: H0 = the feature has the same mean in every class
    f_statistic, p_value = f_oneway(*groups)

    # Keep the feature only if its mean differs significantly across classes
    if p_value <= 0.05:
        cols_to_be_kept_after_anova.append(col)

In [49]:
len(cols_to_be_kept_after_anova)

33

In [113]:
print(len(cols_to_be_kept_after_anova))
print(len(X.columns))

33
66


In [50]:
y.value_counts()

Approved_Flag
P2    29143
P3     6771
P1     5657
P4     5407
Name: count, dtype: int64

In [51]:
for col in object_cols:
    print(col)
    print(X[col].unique())

MARITALSTATUS
['Married' 'Single']
EDUCATION
['12TH' 'GRADUATE' 'SSC' 'POST-GRADUATE' 'UNDER GRADUATE' 'OTHERS'
 'PROFESSIONAL']
GENDER
['M' 'F']
last_prod_enq2
['PL' 'ConsumerLoan' 'AL' 'others' 'CC' 'HL']
first_prod_enq2
['PL' 'ConsumerLoan' 'others' 'AL' 'HL' 'CC']


In [52]:
nominal_cols = ['MARITALSTATUS', 'GENDER', 'last_prod_enq2', 'first_prod_enq2']

In [54]:
ordinal_cols = ['EDUCATION']
order = {'OTHERS':1, 'SSC':2, '12TH':3, 'UNDER GRADUATE':4, 'GRADUATE':5, 'POST-GRADUATE':6, 'PROFESSIONAL':7}

In [55]:
X['EDUCATION'].value_counts()

EDUCATION
GRADUATE          15394
12TH              13184
SSC                8381
UNDER GRADUATE     5029
OTHERS             2666
POST-GRADUATE      2079
PROFESSIONAL        245
Name: count, dtype: int64

In [56]:
# Map each education level to its ordinal rank
X['EDUCATION'] = X['EDUCATION'].map(order)


In [57]:
X['EDUCATION'].value_counts()

EDUCATION
5    15394
3    13184
2     8381
4     5029
1     2666
6     2079
7      245
Name: count, dtype: int64

In [58]:
X_encoded = pd.get_dummies(X, columns=nominal_cols, drop_first=True)

In [59]:
X_encoded

Unnamed: 0,Total_TL,Tot_Closed_TL,Tot_Active_TL,Total_TL_opened_L6M,Tot_TL_closed_L6M,pct_tl_open_L6M,pct_tl_closed_L6M,pct_active_tl,pct_closed_tl,Total_TL_opened_L12M,...,last_prod_enq2_CC,last_prod_enq2_ConsumerLoan,last_prod_enq2_HL,last_prod_enq2_PL,last_prod_enq2_others,first_prod_enq2_CC,first_prod_enq2_ConsumerLoan,first_prod_enq2_HL,first_prod_enq2_PL,first_prod_enq2_others
0,5,4,1,0,0,0.000,0.00,0.200,0.800,0,...,False,False,False,True,False,False,False,False,True,False
1,1,0,1,0,0,0.000,0.00,1.000,0.000,1,...,False,True,False,False,False,False,True,False,False,False
2,8,0,8,1,0,0.125,0.00,1.000,0.000,2,...,False,True,False,False,False,False,False,False,False,True
3,3,2,1,0,0,0.000,0.00,0.333,0.667,0,...,False,False,False,False,False,False,False,False,False,False
4,6,5,1,0,0,0.000,0.00,0.167,0.833,0,...,False,True,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46973,3,0,3,1,0,0.333,0.00,1.000,0.000,1,...,False,True,False,False,False,False,True,False,False,False
46974,4,2,2,0,1,0.000,0.25,0.500,0.500,2,...,False,False,False,False,True,False,False,False,False,True
46975,2,1,1,1,1,0.500,0.50,0.500,0.500,2,...,False,True,False,False,False,False,False,False,False,True
46976,2,1,1,0,0,0.000,0.00,0.500,0.500,1,...,False,True,False,False,False,False,False,False,False,True


In [60]:
X_encoded.columns

Index(['Total_TL', 'Tot_Closed_TL', 'Tot_Active_TL', 'Total_TL_opened_L6M',
       'Tot_TL_closed_L6M', 'pct_tl_open_L6M', 'pct_tl_closed_L6M',
       'pct_active_tl', 'pct_closed_tl', 'Total_TL_opened_L12M',
       'Tot_TL_closed_L12M', 'pct_tl_open_L12M', 'pct_tl_closed_L12M',
       'Tot_Missed_Pmnt', 'Auto_TL', 'CC_TL', 'Consumer_TL', 'Gold_TL',
       'Home_TL', 'PL_TL', 'Secured_TL', 'Unsecured_TL', 'Other_TL',
       'Age_Oldest_TL', 'Age_Newest_TL', 'time_since_recent_payment',
       'num_times_delinquent', 'max_recent_level_of_deliq', 'num_deliq_6mts',
       'num_deliq_12mts', 'num_deliq_6_12mts', 'num_times_30p_dpd',
       'num_times_60p_dpd', 'num_std', 'num_std_6mts', 'num_std_12mts',
       'num_sub', 'num_sub_6mts', 'num_sub_12mts', 'num_dbt', 'num_dbt_6mts',
       'num_dbt_12mts', 'num_lss', 'num_lss_6mts', 'num_lss_12mts',
       'recent_level_of_deliq', 'EDUCATION', 'AGE', 'NETMONTHLYINCOME',
       'Time_With_Curr_Empr', 'pct_of_active_TLs_ever',
       'pct_opene

In [61]:
X_encoded['NETMONTHLYINCOME'].max()

np.int64(2500000)

In [62]:
X_encoded['NETMONTHLYINCOME'].min()

np.int64(0)

In [63]:
classes = y.unique()

In [64]:
classes.sort()

In [65]:
classes

array(['P1', 'P2', 'P3', 'P4'], dtype=object)

### Train Test Split

In [66]:
from sklearn.model_selection import train_test_split

In [67]:
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, random_state=42, test_size=0.2)

In [69]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train = le.fit_transform(y_train)
y_test = le.transform(y_test)

---
## Random Forest Classifier

In [70]:
from sklearn.ensemble import RandomForestClassifier

In [71]:
rfc = RandomForestClassifier(n_estimators=200, random_state=42)

In [72]:
rfc.fit(X_train, y_train)

In [73]:
y_pred = rfc.predict(X_test)

In [74]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

In [75]:
accuracy_score(y_test, y_pred)

0.9891443167305236

In [76]:
precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, y_pred)

In [77]:
for i,v in enumerate(classes):
    print(f"Class: {v}")
    print(f"Precision: {precision[i]}")
    print(f"recall: {recall[i]}")
    print(f"f1_score: {f1_score[i]}")
    print()

Class: P1
Precision: 0.9354838709677419
recall: 0.9953617810760668
f1_score: 0.964494382022472

Class: P2
Precision: 0.9962825278810409
recall: 0.999830422248601
f1_score: 0.9980533220482437

Class: P3
Precision: 0.9953051643192489
recall: 0.9325513196480938
f1_score: 0.9629068887206662

Class: P4
Precision: 1.0
recall: 0.9962157048249763
f1_score: 0.9981042654028436



---
## Xgboost

In [78]:
from xgboost import XGBClassifier

In [79]:
xgb = XGBClassifier(objective='multi:softmax', num_class=4)

In [80]:
xgb.fit(X_train, y_train)

In [81]:
y_pred = xgb.predict(X_test)

In [82]:
accuracy_score(y_test, y_pred)

0.9953171562366965

In [83]:
precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, y_pred)

In [84]:
for i,v in enumerate(classes):
    print(f"Class: {v}")
    print(f"Precision: {precision[i]}")
    print(f"recall: {recall[i]}")
    print(f"f1_score: {f1_score[i]}")
    print()

Class: P1
Precision: 0.9877474081055608
recall: 0.9721706864564007
f1_score: 0.9798971482000935

Class: P2
Precision: 1.0
recall: 1.0
f1_score: 1.0

Class: P3
Precision: 0.9775687409551375
recall: 0.9904692082111437
f1_score: 0.9839766933721777

Class: P4
Precision: 1.0
recall: 0.9990539262062441
f1_score: 0.9995267392333176



---
## Decision Trees

In [85]:
from sklearn.tree import DecisionTreeClassifier

In [86]:
dt = DecisionTreeClassifier(max_depth=20, min_samples_split=10)

In [87]:
dt.fit(X_train, y_train)

In [88]:
y_pred = dt.predict(X_test)

In [89]:
accuracy_score(y_test, y_pred)

0.9932950191570882

In [90]:
precision, recall, f1_score, _ = precision_recall_fscore_support(y_test, y_pred)
for i,v in enumerate(classes):
    print(f"Class: {v}")
    print(f"Precision: {precision[i]}")
    print(f"recall: {recall[i]}")
    print(f"f1_score: {f1_score[i]}")
    print()

Class: P1
Precision: 0.9669117647058824
recall: 0.9758812615955473
f1_score: 0.971375807940905

Class: P2
Precision: 1.0
recall: 1.0
f1_score: 1.0

Class: P3
Precision: 0.9800738007380074
recall: 0.9736070381231672
f1_score: 0.9768297168076499

Class: P4
Precision: 1.0
recall: 0.9990539262062441
f1_score: 0.9995267392333176



---
## So far XGBoost performs best of the three models, so we will tune its hyperparameters to try to improve the results further.
---

## Accuracy
- Out of all predictions, how many are correct?
- Accuracy = Correct predictions / total predictions
## Recall
- How much of a class does the model recover?
- Out of all samples that actually belong to class P1, how many did I predict as P1?
- Recall(P1) = Correctly predicted P1 / total actual P1
## Precision
- Out of all samples predicted as P1, how many actually belong to P1?
- Precision(P1) = Correctly predicted P1 / total predicted P1
## F1 score
- F1 score = 2 X Precision X Recall / (Precision + Recall) 
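
The per-class definitions above can be checked by hand on a tiny example. The sketch below uses made-up labels (not the notebook's data), and `per_class_metrics` is a hypothetical helper written only for illustration. In the notebook itself these values come from `precision_recall_fscore_support`.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1), treating `label` as the positive class."""
    # Correctly predicted `label`
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # total predicted as `label`
    actual = sum(1 for t in y_true if t == label)     # total actually `label`
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up labels, for illustration only
y_true = ["P1", "P1", "P1", "P2", "P2", "P3"]
y_pred = ["P1", "P1", "P2", "P2", "P1", "P3"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
p1_precision, p1_recall, p1_f1 = per_class_metrics(y_true, y_pred, "P1")

print(f"accuracy: {accuracy:.3f}")
print(f"P1 precision: {p1_precision:.3f}, recall: {p1_recall:.3f}, f1: {p1_f1:.3f}")
```

Here two of the three actual P1 samples are found (recall 2/3), and two of the three P1 predictions are right (precision 2/3).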

## Balanced VS Imbalanced
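- Use accuracy when the classes are roughly balanced
- Use F1 score when the classes are imbalanced, because accuracy can look high even if the minority class is never predicted correctly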
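
A small sketch of why accuracy can mislead on imbalanced data. The labels are made up (95 negatives, 5 positives) and the "model" simply predicts the majority class every time. This is an illustration under those assumptions, not a result from the loan dataset.

```python
# Made-up imbalanced labels: 95 of class 0, 5 of class 1
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts the majority class
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Precision, recall and F1 for the minority class (label 1)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy: {accuracy:.2f}")   # high, although no positive is ever found
print(f"minority-class F1: {f1:.2f}")
```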

## Metrics
### Error
- Error = Sum (y_test - y_pred) / n
### MAE (Mean Absolute Error)
- MAE = Sum (Abs(y_test - y_pred)) / n
- Robust to outliers
### MSE (Mean Squared Error)
- MSE = Sum ((y_test - y_pred)**2) / n
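- Hard to interpret, because the result is in squared units of the target
- Not robust to outliers
### RMSE (Root Mean Squared Error)
- RMSE = Sqrt ( Sum ((y_test - y_pred)**2)/n )
- Same units as the target, so easier to interpret than MSE (PRO)
- Not robust to outliers, since errors are squared before averaging (CONS)
- Scale dependent (CONS)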
### MAPE (Mean Absolute Percentage Error)
- MAPE = Sum( Abs(y_test - y_pred)/Abs(y_test) ) / n
- Undefined (division by zero) if any y_test is zero (CONS)
- Scale independent (PRO)
### R square
- SSR (Sum of Squared Residuals) = Sum ((y_test - y_pred)**2)
- SST (Total Sum of Squares) = Sum ((y_test - Mean(y_test))**2), where Mean(y_test) = Sum(y_test)/n
- R square = 1 - SSR/SST
- Scale Independent (PRO)
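
The regression metrics listed above can be computed in a few lines. This is a minimal sketch using only the standard library and made-up numbers. `regression_metrics` is a hypothetical helper name, and SST is taken around the mean of the actual values.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute the metrics listed above for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    mean_true = sum(y_true) / n
    ssr = sum(e ** 2 for e in errors)                # sum of squared residuals
    sst = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "error": sum(errors) / n,                    # signed errors can cancel out
        "mae": sum(abs(e) for e in errors) / n,
        "mse": mse,
        "rmse": math.sqrt(mse),
        # MAPE assumes no actual value is zero
        "mape": sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
        "r2": 1 - ssr / sst,
    }

# Made-up values, for illustration only
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this example the signed errors cancel to a mean error of zero even though every prediction is off by 10, which is why the plain error is rarely used on its own.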

### Deployment
ngrok

---
### Now we try to tune the hyperparameters
---

## Decision Tree:
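- Gini impurity: measures how mixed the classes in a node are; used to choose splits
- Max depth: maximum number of levels the tree may grow
- Min samples split: minimum number of samples a node needs before it can be split
### Motive:
- Max depth and min samples split decide how soon the tree stops growing, i.e. how quickly training finishes (converges)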
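
As a rough illustration of the Gini impurity mentioned above, the sketch below computes it for a node and for a candidate split. The helper names `gini_impurity` and `weighted_gini` are hypothetical and the labels are made up. scikit-learn's `DecisionTreeClassifier` performs an equivalent calculation internally when `criterion="gini"` (its default).

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of one node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

# Made-up node contents, for illustration only
pure = gini_impurity(["P1", "P1", "P1", "P1"])   # a single class in the node
mixed = gini_impurity(["P1", "P1", "P2", "P2"])  # two classes, evenly mixed
split = weighted_gini(["P1", "P1"], ["P2", "P2"])  # split that separates the classes

print(pure, mixed, split)
```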
---

In [91]:
params_grid = {
    'colsample_bytree': [0.1, 0.3, 0.5, 0.7, 0.9],
    'learning_rate': [0.001, 0.01, 0.1, 1],
    'max_depth': [3,5,8,10],
    'alpha': [1, 10, 100],
    'n_estimators': [10, 50, 100]
}

In [97]:
index = 0

In [98]:
answer_grid = {
    'combination' : [],
    'train_accuracy' : [],
    'test_accuracy' : [],
    'colsample_bytree': [],
    'learning_rate':[],
    'max_depth':[],
    'alpha':[],
    'n_estimators':[]
}

In [99]:
for colsample_bytree in params_grid['colsample_bytree']:
    for learning_rate in params_grid['learning_rate']:
        for max_depth in params_grid['max_depth']:
            for alpha in params_grid['alpha']:
                for n_estimators in params_grid['n_estimators']:
                    index += 1
                    model = XGBClassifier(
                        objective='multi:softmax', 
                        num_class=4,
                        colsample_bytree=colsample_bytree,
                        learning_rate=learning_rate,
                        max_depth=max_depth,
                        alpha=alpha,
                        n_estimators=n_estimators
                    )
                    model.fit(X_train, y_train)
                    y_pred_train = model.predict(X_train)
                    y_pred_test = model.predict(X_test)
                    train_accuracy = accuracy_score(y_train, y_pred_train)
                    test_accuracy = accuracy_score(y_test, y_pred_test)
                    
                    answer_grid['combination'].append(index)
                    answer_grid['train_accuracy'].append(train_accuracy)
                    answer_grid['test_accuracy'].append(test_accuracy)
                    answer_grid['colsample_bytree'].append(colsample_bytree)
                    answer_grid['learning_rate'].append(learning_rate)
                    answer_grid['max_depth'].append(max_depth)
                    answer_grid['alpha'].append(alpha)
                    answer_grid['n_estimators'].append(n_estimators)

                    print(f"Combination: {index}")
                    print(f"colsample_bytree: {colsample_bytree}, learning_rate: {learning_rate}, max_depth:{max_depth}, alpha: {alpha}, n_estimators: {n_estimators}")
                    print(f"train_accuracy: {train_accuracy}")
                    print(f"test_accuracy: {test_accuracy}")
                    print("-"*30)

Combination: 1
colsample_bytree: 0.1, learning_rate: 0.001, max_depth:3, alpha: 1, n_estimators: 10
train_accuracy: 0.6227449310840296
test_accuracy: 0.6313324819071946
------------------------------
Combination: 2
colsample_bytree: 0.1, learning_rate: 0.001, max_depth:3, alpha: 1, n_estimators: 50
train_accuracy: 0.6306742589537545
test_accuracy: 0.6383567475521499
------------------------------
Combination: 3
colsample_bytree: 0.1, learning_rate: 0.001, max_depth:3, alpha: 1, n_estimators: 100
train_accuracy: 0.6381246341333617
test_accuracy: 0.6453810131971052
------------------------------
Combination: 4
colsample_bytree: 0.1, learning_rate: 0.001, max_depth:3, alpha: 10, n_estimators: 10
train_accuracy: 0.6216539832898728
test_accuracy: 0.6303746275010643
------------------------------
Combination: 5
colsample_bytree: 0.1, learning_rate: 0.001, max_depth:3, alpha: 10, n_estimators: 50
train_accuracy: 0.6304347826086957
test_accuracy: 0.6380374627501064
----------------------------
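
The five nested loops can also be flattened with `itertools.product`. The sketch below only builds the list of parameter combinations and does not train anything. The grid is copied from the notebook cell above, and `names` and `combinations` are illustrative variable names.

```python
from itertools import product

# Same grid as in the notebook cell above
params_grid = {
    'colsample_bytree': [0.1, 0.3, 0.5, 0.7, 0.9],
    'learning_rate': [0.001, 0.01, 0.1, 1],
    'max_depth': [3, 5, 8, 10],
    'alpha': [1, 10, 100],
    'n_estimators': [10, 50, 100],
}

names = list(params_grid)
# product() varies the last list fastest, like the innermost for loop
combinations = [dict(zip(names, values)) for values in product(*params_grid.values())]

print(len(combinations))  # 720
print(combinations[0])
# Each dict could then be unpacked into the model, e.g. XGBClassifier(**params),
# since the keys match the keyword names used in the loop above.
```

The order of `combinations` matches the order of the printed results, so `combinations[index - 1]` corresponds to "Combination: index".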

In [102]:
df = pd.DataFrame(answer_grid)

In [103]:
df

| | combination | train_accuracy | test_accuracy | colsample_bytree | learning_rate | max_depth | alpha | n_estimators |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.622745 | 0.631332 | 0.1 | 0.001 | 3 | 1 | 10 |
| 1 | 2 | 0.630674 | 0.638357 | 0.1 | 0.001 | 3 | 1 | 50 |
| 2 | 3 | 0.638125 | 0.645381 | 0.1 | 0.001 | 3 | 1 | 100 |
| 3 | 4 | 0.621654 | 0.630375 | 0.1 | 0.001 | 3 | 10 | 10 |
| 4 | 5 | 0.630435 | 0.638037 | 0.1 | 0.001 | 3 | 10 | 50 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 715 | 716 | 0.997100 | 0.994253 | 0.9 | 1.000 | 10 | 10 | 50 |
| 716 | 717 | 0.997100 | 0.994253 | 0.9 | 1.000 | 10 | 10 | 100 |
| 717 | 718 | 0.994678 | 0.995104 | 0.9 | 1.000 | 10 | 100 | 10 |
| 718 | 719 | 0.994678 | 0.995104 | 0.9 | 1.000 | 10 | 100 | 50 |


In [105]:
df.to_csv('Datasets/hyperparameters.csv')

## Results
- The best combination reached a train accuracy of 0.996753765100314 and a test accuracy of 0.996062154108131.
- Hyperparameters of that combination:
  - colsample_bytree = 0.3
  - learning_rate = 0.1
  - max_depth = 5
  - alpha = 1
  - n_estimators = 100
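
One way to pick such a combination out of the recorded results is sketched below. The rows are a small subset copied from the table printed earlier, so the winner here is only the best of this subset, not the overall best reported above. Ranking by `test_accuracy` is an assumption about how the best combination was chosen.

```python
# A few rows copied from the results table printed earlier (subset, for illustration)
rows = [
    {"combination": 1, "train_accuracy": 0.622745, "test_accuracy": 0.631332},
    {"combination": 3, "train_accuracy": 0.638125, "test_accuracy": 0.645381},
    {"combination": 716, "train_accuracy": 0.997100, "test_accuracy": 0.994253},
    {"combination": 718, "train_accuracy": 0.994678, "test_accuracy": 0.995104},
]

# Rank by test accuracy (assumed selection rule)
best = max(rows, key=lambda r: r["test_accuracy"])
# A large positive gap between train and test accuracy would hint at overfitting
gap = best["train_accuracy"] - best["test_accuracy"]

print(f"best combination in this subset: {best['combination']}")
print(f"train - test gap: {gap:.6f}")
```

With the full DataFrame, `df.loc[df["test_accuracy"].idxmax()]` returns the corresponding row. Choosing hyperparameters on the test set tends to give an optimistic estimate, which is one reason to prefer a cross-validated search.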

In [108]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score

params_grid = {
    "colsample_bytree": [0.3, 0.5, 0.8, 1.0],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7],
    "alpha": [0.0, 0.1, 1.0],
    "n_estimators": [10, 50, 100],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

base = XGBClassifier(
    objective="multi:softmax",
    num_class=4,
    tree_method="hist",        # faster
    eval_metric="mlogloss",    # avoids warnings
    random_state=42,
    n_jobs=-1
)

grid = GridSearchCV(
    estimator=base,
    param_grid=params_grid,
    scoring="accuracy",
    cv=cv,
    n_jobs=-1,
    verbose=1,
    return_train_score=True
)

grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)
print("CV best mean accuracy:", grid.best_score_)

# Evaluate the best model on the holdout test set
best_model = grid.best_estimator_
y_pred_test = best_model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred_test)
print("Holdout test accuracy:", test_acc)

# Build a tidy results table similar to answer_grid above
cvres = pd.DataFrame(grid.cv_results_)
param_cols = [c for c in cvres.columns if c.startswith("param_")]
summary = (cvres[param_cols + ["mean_train_score","mean_test_score","std_test_score","rank_test_score"]]
           .sort_values("mean_test_score", ascending=False)
           .reset_index(drop=True))
print(summary.head(10))


Fitting 5 folds for each of 324 candidates, totalling 1620 fits
Best params: {'alpha': 1.0, 'colsample_bytree': 0.8, 'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 50}
CV best mean accuracy: 0.9951040759624998
Holdout test accuracy: 0.9959557258407833
   param_alpha  param_colsample_bytree  param_learning_rate  param_max_depth  \
0          1.0                     0.8                 0.05                5   
1          1.0                     1.0                 0.05                5   
2          1.0                     0.5                 0.01                5   
3          1.0                     0.8                 0.05                3   
4          1.0                     0.5                 0.05                3   
5          1.0                     0.8                 0.05                3   
6          1.0                     0.8                 0.10                3   
7          1.0                     0.8                 0.01                3   
8          1.0      