<h1>1. Classifying Loan Status Using Decision Trees</h1>
<h3><b>Preprocessing Steps:</b></h3>
<ul>
    <li>Handle missing values if any.</li>
    <li>Encode categorical variables (e.g., one-hot encoding for loan grade, sub-grade, etc.).</li>
    <li>Standardize numerical features.</li>
</ul>
<h3><b> Task: </b>Implement a decision tree classifier to classify loan status and evaluate the model using accuracy and ROC-AUC.</h3>

In [27]:
# Importing Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler, normalize
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

In [2]:
# Loading the dataset
loan_dataset = pd.read_csv('..\\..\\Datasets\\LendingClubLoan.csv')
print(loan_dataset.shape, '\n')
loan_dataset.head()

(10000, 56) 



Unnamed: 0.1,Unnamed: 0,emp_title,emp_length,state,homeownership,annual_income,verified_income,debt_to_income,annual_income_joint,verification_income_joint,...,sub_grade,issue_month,loan_status,initial_listing_status,disbursement_method,balance,paid_total,paid_principal,paid_interest,paid_late_fees
0,1,global config engineer,3.0,NJ,MORTGAGE,90000.0,Verified,18.01,,,...,C3,Mar-2018,Current,whole,Cash,27015.86,1999.33,984.14,1015.19,0.0
1,2,warehouse office clerk,10.0,HI,RENT,40000.0,Not Verified,5.04,,,...,C1,Feb-2018,Current,whole,Cash,4651.37,499.12,348.63,150.49,0.0
2,3,assembly,3.0,WI,RENT,40000.0,Source Verified,21.15,,,...,D1,Feb-2018,Current,fractional,Cash,1824.63,281.8,175.37,106.43,0.0
3,4,customer service,1.0,PA,RENT,30000.0,Not Verified,10.16,,,...,A3,Jan-2018,Current,whole,Cash,18853.26,3312.89,2746.74,566.15,0.0
4,5,security supervisor,10.0,CA,RENT,35000.0,Verified,57.96,57000.0,Verified,...,C3,Mar-2018,Current,whole,Cash,21430.15,2324.65,1569.85,754.8,0.0


In [3]:
# Printing basic statistics of the dataset
print(loan_dataset.describe().to_string())

        Unnamed: 0   emp_length  annual_income  debt_to_income  annual_income_joint  debt_to_income_joint    delinq_2y  months_since_last_delinq  earliest_credit_line  inquiries_last_12m  total_credit_lines  open_credit_lines  total_credit_limit  total_credit_utilized  num_collections_last_12m  num_historical_failed_to_pay  months_since_90d_late  current_accounts_delinq  total_collection_amount_ever  current_installment_accounts  accounts_opened_24m  months_since_last_credit_inquiry  num_satisfactory_accounts  num_accounts_120d_past_due  num_accounts_30d_past_due  num_active_debit_accounts  total_debit_limit  num_total_cc_accounts  num_open_cc_accounts  num_cc_carrying_balance  num_mort_accounts  account_never_delinq_percent     tax_liens  public_record_bankrupt   loan_amount          term  interest_rate   installment       balance    paid_total  paid_principal  paid_interest  paid_late_fees
count  10000.00000  9183.000000   1.000000e+04     9976.000000         1.495000e+03           1

In [4]:
# Printing information of dataset
loan_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 56 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Unnamed: 0                        10000 non-null  int64  
 1   emp_title                         9167 non-null   object 
 2   emp_length                        9183 non-null   float64
 3   state                             10000 non-null  object 
 4   homeownership                     10000 non-null  object 
 5   annual_income                     10000 non-null  float64
 6   verified_income                   10000 non-null  object 
 7   debt_to_income                    9976 non-null   float64
 8   annual_income_joint               1495 non-null   float64
 9   verification_income_joint         1455 non-null   object 
 10  debt_to_income_joint              1495 non-null   float64
 11  delinq_2y                         10000 non-null  int64  
 12  month

<h2>Data Preprocessing</h2>

<h3>1. Handling Missing Values</h3>

In [5]:
# Checking for the missing values in the dataset and printing only features with their missing values
for feature in loan_dataset.columns:
    if loan_dataset[feature].isnull().sum() > 0:
        print(feature,": ", loan_dataset[feature].isnull().sum())

emp_title :  833
emp_length :  817
debt_to_income :  24
annual_income_joint :  8505
verification_income_joint :  8545
debt_to_income_joint :  8505
months_since_last_delinq :  5658
months_since_90d_late :  7715
months_since_last_credit_inquiry :  1271
num_accounts_120d_past_due :  318


In [6]:
# Removing those columns having missing values of more than 55%.
loan_dataset.drop(['annual_income_joint', 'verification_income_joint', 'debt_to_income_joint', 'months_since_last_delinq', 'months_since_90d_late'], axis=1, inplace=True)

# Using mode imputation on 'object' features
loan_dataset.fillna({'emp_title': loan_dataset['emp_title'].mode()[0]}, inplace=True)

# Using mean imputation on the remaining features with missing values. All of them are numerical (float), so the mean can be applied directly.
for feature in loan_dataset[['emp_length', 'debt_to_income', 'months_since_last_credit_inquiry', 'num_accounts_120d_past_due']]:
    loan_dataset.fillna({feature: loan_dataset[feature].mean()}, inplace=True)
    
# Checking for the missing values in the dataset
print('Missing values in the dataset:', loan_dataset.isnull().sum().sum())

Missing values in the dataset: 0


-> All missing values have now been handled: the sparsest columns were dropped and the remaining gaps were imputed.
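
-> The drop-then-impute logic used here can be written as one reusable helper. The sketch below is only an illustration on a small toy frame: the function name `impute_missing`, the `drop_threshold` parameter, and the toy values are assumptions and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def impute_missing(df, drop_threshold=0.55):
    """Drop columns whose missing fraction exceeds drop_threshold, then impute the rest.

    Hypothetical helper: numeric columns get the mean, all other columns get the mode.
    """
    missing_fraction = df.isnull().mean()
    too_sparse = missing_fraction[missing_fraction > drop_threshold].index
    cleaned = df.drop(columns=too_sparse)
    for column in cleaned.columns:
        if not cleaned[column].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(cleaned[column]):
            fill_value = cleaned[column].mean()
        else:
            fill_value = cleaned[column].mode()[0]
        cleaned[column] = cleaned[column].fillna(fill_value)
    return cleaned

# Toy data (made up for illustration only)
toy = pd.DataFrame({
    'emp_title': ['teacher', None, 'teacher', 'nurse'],
    'emp_length': [2.0, np.nan, 4.0, 6.0],
    'annual_income_joint': [np.nan, np.nan, np.nan, 50000.0],
})
toy_cleaned = impute_missing(toy)
print(toy_cleaned)
```

In this toy frame, `annual_income_joint` is 75% missing and is dropped, while the gaps in `emp_title` and `emp_length` are filled with the mode and the mean respectively.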

<h3>2. Encode Categorical Variables</h3>

In [7]:
# Selecting the features with datatype 'object'.
categorical_features = loan_dataset.select_dtypes('object').columns
print(categorical_features, ': ', len(categorical_features))

# Selecting the numerical features
numerical_features = loan_dataset.drop(columns=categorical_features)
numerical_features.drop('Unnamed: 0', axis=1, inplace=True)   # Row-index column, not a real feature
print('\n', numerical_features.columns, ': ', numerical_features.shape[1])

Index(['emp_title', 'state', 'homeownership', 'verified_income',
       'loan_purpose', 'application_type', 'grade', 'sub_grade', 'issue_month',
       'loan_status', 'initial_listing_status', 'disbursement_method'],
      dtype='object') :  12

 Index(['emp_length', 'annual_income', 'debt_to_income', 'delinq_2y',
       'earliest_credit_line', 'inquiries_last_12m', 'total_credit_lines',
       'open_credit_lines', 'total_credit_limit', 'total_credit_utilized',
       'num_collections_last_12m', 'num_historical_failed_to_pay',
       'current_accounts_delinq', 'total_collection_amount_ever',
       'current_installment_accounts', 'accounts_opened_24m',
       'months_since_last_credit_inquiry', 'num_satisfactory_accounts',
       'num_accounts_120d_past_due', 'num_accounts_30d_past_due',
       'num_active_debit_accounts', 'total_debit_limit',
       'num_total_cc_accounts', 'num_open_cc_accounts',
       'num_cc_carrying_balance', 'num_mort_accounts',
       'account_never_delinq_perc

In [8]:
# Printing the categories in each categorical column
for category in categorical_features:
    print(category, ': ', loan_dataset[category].nunique(), ': ', loan_dataset[category].unique(), '\n')

emp_title :  4741 :  ['global config engineer ' 'warehouse office clerk' 'assembly' ...
 'inspector/packer' 'da coordinator ' 'toolmaker'] 

state :  50 :  ['NJ' 'HI' 'WI' 'PA' 'CA' 'KY' 'MI' 'AZ' 'NV' 'IL' 'FL' 'SC' 'CO' 'TN'
 'TX' 'VA' 'NY' 'GA' 'MO' 'AR' 'MD' 'NC' 'NE' 'WV' 'NH' 'UT' 'DE' 'MA'
 'OR' 'OH' 'OK' 'SD' 'MN' 'AL' 'WY' 'LA' 'IN' 'KS' 'MS' 'WA' 'ME' 'VT'
 'CT' 'NM' 'AK' 'MT' 'RI' 'ND' 'DC' 'ID'] 

homeownership :  3 :  ['MORTGAGE' 'RENT' 'OWN'] 

verified_income :  3 :  ['Verified' 'Not Verified' 'Source Verified'] 

loan_purpose :  12 :  ['moving' 'debt_consolidation' 'other' 'credit_card' 'home_improvement'
 'medical' 'house' 'small_business' 'car' 'major_purchase' 'vacation'
 'renewable_energy'] 

application_type :  2 :  ['individual' 'joint'] 

grade :  7 :  ['C' 'D' 'A' 'B' 'F' 'E' 'G'] 

sub_grade :  32 :  ['C3' 'C1' 'D1' 'A3' 'C2' 'B5' 'C4' 'B2' 'B1' 'D3' 'F1' 'E5' 'A2' 'A5'
 'A4' 'A1' 'D4' 'D5' 'B3' 'D2' 'E1' 'G1' 'B4' 'C5' 'E2' 'E4' 'F3' 'E3'
 'F5' 'F2' 'F4' 'G4']

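-> Several categorical features have many categories (for example, emp_title has 4741 and state has 50), so one-hot encoding would add thousands of columns. Label encoding is used instead because it keeps a single column per feature. Note that the integer codes impose an arbitrary order on the categories; a decision tree can usually work around this by splitting on thresholds.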
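
-> To make the size argument concrete, the sketch below compares one-hot encoding with integer encoding on a toy frame. The toy values are made up, and `OrdinalEncoder` is used here as an assumption: it is scikit-learn's encoder for feature columns and assigns integer codes in sorted category order, much like applying `LabelEncoder` column by column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical features (illustrative values only)
toy = pd.DataFrame({
    'grade': ['C', 'A', 'B', 'A'],
    'homeownership': ['RENT', 'OWN', 'MORTGAGE', 'RENT'],
})

one_hot = pd.get_dummies(toy)                    # one column per category
ordinal = OrdinalEncoder().fit_transform(toy)    # one column per feature

print('One-hot shape:', one_hot.shape)
print('Ordinal shape:', ordinal.shape)
```

With two features of three categories each, one-hot encoding produces six columns while integer encoding keeps two. The gap grows with the number of categories.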

In [9]:
# Applying label encoder to the categorical features
encoder = LabelEncoder()

for feature in categorical_features:
    loan_dataset[feature] = encoder.fit_transform(loan_dataset[feature])

In [10]:
# Printing categories after encoding
for category in categorical_features:
    print(category, ': ', loan_dataset[category].nunique(), ': ', loan_dataset[category].unique(), '\n')

emp_title :  4741 :  [1777 4675  207 ... 2006 1066 4484] 

state :  50 :  [30 11 47 37  4 16 21  3 32 13  9 39  5 41 42 44 33 10 23  2 19 26 28 48
 29 43  8 18 36 34 35 40 22  1 49 17 14 15 24 46 20 45  6 31  0 25 38 27
  7 12] 

homeownership :  3 :  [0 2 1] 

verified_income :  3 :  [2 0 1] 

loan_purpose :  12 :  [ 7  2  8  1  3  6  4 10  0  5 11  9] 

application_type :  2 :  [0 1] 

grade :  7 :  [2 3 0 1 5 4 6] 

sub_grade :  32 :  [12 10 15  2 11  9 13  6  5 17 25 24  1  4  3  0 18 19  7 16 20 30  8 14
 21 23 27 22 29 26 28 31] 

issue_month :  3 :  [2 0 1] 

loan_status :  6 :  [1 2 3 5 0 4] 

initial_listing_status :  2 :  [1 0] 

disbursement_method :  2 :  [0 1] 



In [11]:
# Printing info of the dataset
loan_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 51 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Unnamed: 0                        10000 non-null  int64  
 1   emp_title                         10000 non-null  int64  
 2   emp_length                        10000 non-null  float64
 3   state                             10000 non-null  int64  
 4   homeownership                     10000 non-null  int64  
 5   annual_income                     10000 non-null  float64
 6   verified_income                   10000 non-null  int64  
 7   debt_to_income                    10000 non-null  float64
 8   delinq_2y                         10000 non-null  int64  
 9   earliest_credit_line              10000 non-null  int64  
 10  inquiries_last_12m                10000 non-null  int64  
 11  total_credit_lines                10000 non-null  int64  
 12  open_

-> Hence, all the features are converted to int and float. No 'object' / categorical feature remains.

<h3>3. Standardizing Numerical Features</h3>

In [12]:
# Applying Standardization
scaler = StandardScaler()

for feature in numerical_features.columns:
    loan_dataset[feature] = scaler.fit_transform(loan_dataset[feature].values.reshape(-1, 1))

In [13]:
# Printing the basic statistics of the features after standardization
statistics = loan_dataset.describe().round(2)
print(statistics.to_string())

       Unnamed: 0  emp_title  emp_length     state  homeownership  annual_income  verified_income  debt_to_income  delinq_2y  earliest_credit_line  inquiries_last_12m  total_credit_lines  open_credit_lines  total_credit_limit  total_credit_utilized  num_collections_last_12m  num_historical_failed_to_pay  current_accounts_delinq  total_collection_amount_ever  current_installment_accounts  accounts_opened_24m  months_since_last_credit_inquiry  num_satisfactory_accounts  num_accounts_120d_past_due  num_accounts_30d_past_due  num_active_debit_accounts  total_debit_limit  num_total_cc_accounts  num_open_cc_accounts  num_cc_carrying_balance  num_mort_accounts  account_never_delinq_percent  tax_liens  public_record_bankrupt  loan_purpose  application_type  loan_amount      term  interest_rate  installment     grade  sub_grade  issue_month  loan_status  initial_listing_status  disbursement_method   balance  paid_total  paid_principal  paid_interest  paid_late_fees
count    10000.00   10000.00 

-> So all the numerical features have been standardized and all the categorical features have been encoded. The dataset is now ready for model training.
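
-> The standardization above is fitted on the full dataset before the train/test split. A common alternative, sketched below on synthetic data, is to fit the scaler on the training rows only and reuse its statistics on the test rows, so that no test information enters preprocessing. The array names and the synthetic values are assumptions for illustration. Decision trees split on thresholds, so scaling mainly matters if other model types are tried later.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))   # stand-in for numerical features

X_train_toy, X_test_toy = train_test_split(X_toy, test_size=0.2, random_state=42)

toy_scaler = StandardScaler()
X_train_scaled = toy_scaler.fit_transform(X_train_toy)   # learn mean/std from the training rows only
X_test_scaled = toy_scaler.transform(X_test_toy)         # reuse those statistics on the test rows

print(X_train_scaled.mean(axis=0).round(2), X_train_scaled.std(axis=0).round(2))
```

`StandardScaler` also scales all columns in a single call, so no per-column loop is needed.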

<h2>Model Training</h2>

In [14]:
# Separating the features and the target variable
X = loan_dataset.drop('loan_status', axis=1)
Y = loan_dataset['loan_status']

# Splitting the dataset into training and testing sets in 80/20 ratio
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
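
-> With a plain random split, a rare class can end up entirely in the training set. The sketch below shows the `stratify` argument of `train_test_split` on a made-up imbalanced target; the class counts are hypothetical. Stratifying requires at least two samples of every class.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: class 1 dominates, classes 0 and 2 are rare (hypothetical counts)
y_toy = np.array([1] * 90 + [0] * 5 + [2] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Each class keeps roughly the same share in both splits
print(np.unique(y_tr, return_counts=True))
print(np.unique(y_te, return_counts=True))
```

In the notebook this would correspond to passing `stratify=Y` to the split above, assuming every loan status occurs often enough.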

In [15]:
# Implementing the model
dtc_model = DecisionTreeClassifier()
dtc_model.fit(X_train, Y_train)

In [16]:
# Predicting the target variable
Y_pred = dtc_model.predict(X_test)
Y_pred

array([1, 1, 1, ..., 1, 1, 1])

In [17]:
# Predicting the class probabilities (one column per class seen during training)
Y_pred_proba = dtc_model.predict_proba(X_test)
Y_pred_proba

array([[0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       ...,
       [0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.]])
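
-> The probabilities above are all exactly 0 or 1. This is expected from a fully grown tree: each leaf usually holds a single class, so `predict_proba` returns hard values. The sketch below, on synthetic data, contrasts a default tree with one that uses `min_samples_leaf`. The value 50 and the data are arbitrary assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 2))
# Noisy binary target, so large leaves contain a mix of both classes
y_toy = (X_toy[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
limited_tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=0).fit(X_toy, y_toy)

full_proba = full_tree.predict_proba(X_toy)
limited_proba = limited_tree.predict_proba(X_toy)

print(np.unique(full_proba))               # hard probabilities from pure leaves
print(np.unique(limited_proba).round(2))   # class fractions of the larger leaves
```

Graded probabilities give ranking-based metrics such as ROC-AUC more to work with than hard 0/1 values.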

In [18]:
print(Y_test.dtype)
print(Y_pred.dtype)

int64
int64


In [19]:
num_classes = len(np.unique(Y_test))
print('Number of unique classes in Y_test:', num_classes)

Number of unique classes in Y_test: 5


In [20]:
print('Shape of Y_pred_proba:', Y_pred_proba.shape)

Shape of Y_pred_proba: (2000, 6)


<h2>Model Evaluation</h2>

<h3>1. Accuracy Score</h3>

In [21]:
# Calculating the accuracy of the model
accuracy = accuracy_score(Y_test, Y_pred)
print("Accuracy of the model:", accuracy)

Accuracy of the model: 0.97
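
-> Accuracy can look high simply because one class dominates. The sketch below, on made-up labels, compares accuracy with balanced accuracy (the mean of per-class recall) for a baseline that always predicts the majority class. The label counts are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true_toy = np.array([1] * 95 + [0] * 5)    # hypothetical imbalanced labels
y_majority = np.full_like(y_true_toy, 1)     # always predict the majority class

toy_accuracy = accuracy_score(y_true_toy, y_majority)
toy_balanced = balanced_accuracy_score(y_true_toy, y_majority)

print('Accuracy:', toy_accuracy)             # 0.95
print('Balanced accuracy:', toy_balanced)    # 0.5
```

Comparing the model's accuracy against such a majority-class baseline on the real `Y_test` would show how much the tree adds beyond predicting the most common loan status.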


<h3>2. ROC-AUC Score</h3>

In [22]:
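# Renormalizing each row of the predicted probabilities to sum to 1 (L1 norm); the number of columns stays the same
Y_pred_proba_filtered = normalize(Y_pred_proba, norm='l1', axis=1)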

In [23]:
# Checking for rows in which every class probability is zero
all_zero_rows = np.where(~Y_pred_proba_filtered.any(axis=1))[0]

if len(all_zero_rows) > 0:
    print(f'Found {len(all_zero_rows)} rows with all zeros in Y_pred_proba_filtered.')
else:
    print('No rows with all zeros found in Y_pred_proba_filtered.')

No rows with all zeros found in Y_pred_proba_filtered.


In [24]:
# Dropping rows whose probabilities are all zero and keeping the true labels aligned with the remaining rows
mask = ~np.all(Y_pred_proba_filtered == 0, axis=1)
Y_pred_proba_filtered = Y_pred_proba_filtered[mask]
Y_test_np = Y_test.values[mask]

In [None]:
# Calculating the ROC-AUC score of the model
roc_auc = roc_auc_score(Y_test_np, Y_pred_proba_filtered, multi_class='ovr', average='macro')
print("ROC-AUC score of the model:", roc_auc)

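-> The model reaches an accuracy of 0.97 on the test set, meaning it classifies 97% of the test instances correctly. This figure should be read with care: every prediction displayed above is class 1, so a dominant class may account for much of the score, and accuracy alone does not show how well the rarer loan statuses are classified.
<br>
-> The ROC-AUC score could not be calculated. The call raises an error that I was not able to fix. The most likely cause is the mismatch shown above: Y_test contains only 5 classes, while Y_pred_proba has 6 columns, and roc_auc_score expects one probability column per class present in the true labels.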
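
-> One possible workaround, sketched below on toy data, is to compute a one-vs-rest ROC-AUC per class and average only over the classes that actually occur in the true labels. The helper name `macro_ovr_auc_present_classes` and the toy arrays are assumptions for illustration. The helper also assumes every label in `y_true` is one of the model's classes.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def macro_ovr_auc_present_classes(y_true, y_score, model_classes):
    """Macro-average one-vs-rest ROC-AUC over the classes that occur in y_true.

    Hypothetical helper: y_score has one column per entry of model_classes.
    """
    y_true = np.asarray(y_true)
    model_classes = np.asarray(model_classes)
    aucs = []
    for label in np.unique(y_true):
        column = np.flatnonzero(model_classes == label)[0]   # probability column of this class
        aucs.append(roc_auc_score(y_true == label, y_score[:, column]))
    return float(np.mean(aucs))

# Toy example: the model knows 3 classes, but class 2 never occurs in y_true
toy_classes = np.array([0, 1, 2])
toy_y_true = np.array([0, 1, 1, 0, 1, 1])
toy_scores = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],
    [0.3, 0.3, 0.4],
    [0.1, 0.9, 0.0],
])
toy_auc = macro_ovr_auc_present_classes(toy_y_true, toy_scores, toy_classes)
print('Macro OvR ROC-AUC over present classes:', toy_auc)
```

In the notebook, the equivalent call would be `macro_ovr_auc_present_classes(Y_test, Y_pred_proba, dtc_model.classes_)`. The result ignores the class that is missing from the test set, so it is not directly comparable with a score computed over all six classes.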

<hr>