
In [None]:
import pandas as pd
import numpy as np

DECISION TREE ALGORITHM :
A general example showing how a decision tree works:
                 Credit Score ≤ 700?
                 /              \
           Yes /                \ No
             /                  \
  Income ≤ $50,000?         Employment = Employed?
   /            \           /                \
Yes/              \No   Yes/                  \No
 /                 \     /                     \
Approved       Not Approved                   Approved


A decision tree is a type of algorithm used for making decisions or predictions based on a set of rules. It's like a flowchart where you start at the top with a question, and based on the answer, you follow different branches until you reach a final decision or prediction. Decision trees are commonly used for classification and regression tasks.
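
As a rough illustration of the flowchart idea, the sketch below fits a shallow tree on a handful of made-up applicants and prints the rules it learned. The feature names `credit_score` and `income`, and every value in `X_toy` and `y_toy`, are hypothetical and are not taken from the loan dataset used later.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy applicants: [credit_score, income]; 1 = approved, 0 = rejected
X_toy = [
    [650, 40000],
    [720, 60000],
    [580, 30000],
    [800, 90000],
    [690, 80000],
    [710, 20000],
]
y_toy = [0, 1, 0, 1, 0, 1]

# Keep the tree shallow so the printed rules stay readable
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(X_toy, y_toy)

# export_text renders the fitted tree as nested question/answer rules
rules = export_text(toy_tree, feature_names=["credit_score", "income"])
print(rules)

# Follow the branches for one new (hypothetical) applicant
print(toy_tree.predict([[750, 50000]]))
```

Reading the printed rules from top to bottom is the same as walking the flowchart from the root question to a leaf.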


Import the necessary libraries: pandas for tabular data and NumPy for numerical operations.

In [None]:
df=pd.read_csv('loan_approval_dataset.csv')
df.head()

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,status
0,1,2,Graduate,No,9600000,29900000,12,778,2400000,17600000,22700000,8000000,Approved
1,2,0,Not Graduate,Yes,4100000,12200000,8,417,2700000,2200000,8800000,3300000,Rejected
2,3,3,Graduate,No,9100000,29700000,20,506,7100000,4500000,33300000,12800000,Rejected
3,4,3,Graduate,No,8200000,30700000,8,467,18200000,3300000,23300000,7900000,Rejected
4,5,5,Not Graduate,Yes,9800000,24200000,20,382,12400000,8200000,29400000,5000000,Rejected


Read the data from the CSV file into Google Colab and display the first 5 rows of the DataFrame.
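
The column names used later in this notebook (for example ' status') start with a space, which suggests the CSV has a space after each comma. The sketch below reproduces that on a tiny in-memory sample and shows one possible way to strip the whitespace. The sample text is an assumption about the file layout, and the notebook itself keeps the original spaced names.

```python
import io
import pandas as pd

# Assumed layout: a space follows every comma, as the spaced column names suggest
sample_csv = (
    "loan_id, education, self_employed, status\n"
    "1, Graduate, No, Approved\n"
    "2, Not Graduate, Yes, Rejected\n"
)

raw = pd.read_csv(io.StringIO(sample_csv))
print(list(raw.columns))  # every name after the first keeps its leading space

# Optional cleanup: strip whitespace from the column names and the text values
clean = raw.copy()
clean.columns = clean.columns.str.strip()
for col in ["education", "self_employed", "status"]:
    clean[col] = clean[col].str.strip()
print(clean)
```

If the names were cleaned this way, the later cells would need to refer to `status` rather than `' status'`.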

INPUT EXPLANATION :

In the context of approving a loan, each column in the dataset plays a specific role in assessing the borrower's creditworthiness and determining whether the loan should be approved. Each column and its relevance is explained below:

loan_id: This is likely a unique identifier for each loan application and helps keep track of different applications.

no_of_dependents: The number of dependents the applicant has can influence their ability to repay the loan. More dependents may indicate higher financial obligations, potentially impacting the borrower's disposable income.

education: The education level of the borrower can be an indicator of their potential earning capacity and stability, which affects their ability to repay the loan.

self_employed: This indicates whether the applicant is self-employed or not. Self-employed individuals may have more variable income streams, which can affect their repayment capacity.

income_annum: The annual income of the borrower is a crucial factor. It helps assess whether the borrower has a steady income source and whether it's sufficient to cover the loan payments.

loan_amount: The amount of money the borrower is requesting. This helps determine the risk associated with the loan and whether the borrower can manage the requested amount.

loan_term: The duration of the loan in months. Shorter terms may mean higher monthly payments, but less risk overall. Longer terms may have lower monthly payments but potentially higher total interest costs.

cibil_score: The credit score of the borrower reflects their credit history and how well they've managed previous debts. A higher credit score indicates a lower risk of default.

residential_assets_value: The value of the borrower's residential assets provides collateral in case of default. Higher collateral value can reduce the lender's risk.

commercial_assets_value: The value of the borrower's commercial assets, if applicable. Similar to residential assets, these can provide additional collateral.

luxury_assets_value: The value of any luxury assets owned by the borrower, which could also serve as collateral.

bank_asset_value: The value of assets the borrower has in a bank. This could be seen as a positive factor, indicating financial stability.

status: This likely indicates the outcome of the loan application, whether it was approved or not. The other factors contribute to determining this status.

When assessing loan applications, lenders typically use a combination of these factors to determine the risk associated with lending to a particular borrower. Different lenders may assign different weights to each factor based on their own risk assessment models. By analyzing these factors, lenders aim to make informed decisions that minimize the risk of default and ensure that borrowers can comfortably repay the loan according to the agreed terms.

In [None]:
df[' status'] = df[' status'].apply(lambda x: 1 if x == ' Approved' else 0)
df[' education'] = df[' education'].apply(lambda x: 1 if x == ' Graduate' else 0)
df[' self_employed'] = df[' self_employed'].apply(lambda x: 1 if x == ' Yes' else 0)

Most machine learning algorithms accept only numerical input, so the status, education, and self_employed columns are converted to 0s and 1s. Note that the column names and string values in this dataset carry a leading space (for example ' status' and ' Approved'), which is why the code matches on them exactly.
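
One caveat of the `else 0` pattern is that any unexpected spelling is silently encoded as 0. A minimal alternative sketch, assuming the same leading-space values, uses an explicit mapping so that unknown values surface as missing instead. The frame `toy` and the names `status_map` and `status_encoded` are hypothetical.

```python
import pandas as pd

# Toy frame mimicking the leading-space names and values seen in the notebook
toy = pd.DataFrame({" status": [" Approved", " Rejected", " Approved"]})

# Explicit mapping: anything not listed here becomes NaN rather than 0
status_map = {"Approved": 1, "Rejected": 0}
toy["status_encoded"] = toy[" status"].str.strip().map(status_map)

# Count values that did not match the mapping
unmapped = toy["status_encoded"].isna().sum()
print(toy)
print("unmapped values:", unmapped)
```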

In [None]:
df

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,status
0,1,2,1,0,9600000,29900000,12,778,2400000,17600000,22700000,8000000,1
1,2,0,0,1,4100000,12200000,8,417,2700000,2200000,8800000,3300000,0
2,3,3,1,0,9100000,29700000,20,506,7100000,4500000,33300000,12800000,0
3,4,3,1,0,8200000,30700000,8,467,18200000,3300000,23300000,7900000,0
4,5,5,0,1,9800000,24200000,20,382,12400000,8200000,29400000,5000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4264,4265,5,1,1,1000000,2300000,12,317,2800000,500000,3300000,800000,0
4265,4266,0,0,1,3300000,11300000,20,559,4200000,2900000,11000000,1900000,1
4266,4267,2,0,0,6500000,23900000,18,457,1200000,12400000,18100000,7300000,0
4267,4268,1,0,0,4100000,12800000,8,780,8200000,700000,14100000,5800000,1


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4269 entries, 0 to 4268
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   loan_id                    4269 non-null   int64
 1    no_of_dependents          4269 non-null   int64
 2    education                 4269 non-null   int64
 3    self_employed             4269 non-null   int64
 4    income_annum              4269 non-null   int64
 5    loan_amount               4269 non-null   int64
 6    loan_term                 4269 non-null   int64
 7    cibil_score               4269 non-null   int64
 8    residential_assets_value  4269 non-null   int64
 9    commercial_assets_value   4269 non-null   int64
 10   luxury_assets_value       4269 non-null   int64
 11   bank_asset_value          4269 non-null   int64
 12   status                    4269 non-null   int64
dtypes: int64(13)
memory usage: 433.7 KB


In [None]:
df.shape

(4269, 13)

In [None]:
X = df.drop(' status', axis=1)
y = df[' status']

Separate the input columns (X) from the output column (y): drop() removes the status column from the features, and the status column alone becomes the target.
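
The notebook keeps loan_id among the inputs. Since it is described above as a unique identifier, one might also drop it so the model cannot split on row identity. The sketch below shows that variation on a three-row toy frame; dropping loan_id is a suggestion and not what the notebook does.

```python
import pandas as pd

# Three-row toy frame using the notebook's column naming (leading spaces kept)
toy = pd.DataFrame({
    "loan_id": [1, 2, 3],
    " cibil_score": [778, 417, 506],
    " status": [1, 0, 0],
})

# Drop the target and, as an assumption, the identifier column as well
X_toy = toy.drop(columns=["loan_id", " status"])
y_toy = toy[" status"]

print(list(X_toy.columns))
print(y_toy.tolist())
```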

In [None]:
X

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value
0,1,2,1,0,9600000,29900000,12,778,2400000,17600000,22700000,8000000
1,2,0,0,1,4100000,12200000,8,417,2700000,2200000,8800000,3300000
2,3,3,1,0,9100000,29700000,20,506,7100000,4500000,33300000,12800000
3,4,3,1,0,8200000,30700000,8,467,18200000,3300000,23300000,7900000
4,5,5,0,1,9800000,24200000,20,382,12400000,8200000,29400000,5000000
...,...,...,...,...,...,...,...,...,...,...,...,...
4264,4265,5,1,1,1000000,2300000,12,317,2800000,500000,3300000,800000
4265,4266,0,0,1,3300000,11300000,20,559,4200000,2900000,11000000,1900000
4266,4267,2,0,0,6500000,23900000,18,457,1200000,12400000,18100000,7300000
4267,4268,1,0,0,4100000,12800000,8,780,8200000,700000,14100000,5800000


In [None]:
y

0       1
1       0
2       0
3       0
4       0
       ..
4264    0
4265    1
4266    0
4267    1
4268    1
Name:  status, Length: 4269, dtype: int64

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3)


The train_test_split function splits the data into training and testing sets.
Here, 70% of the rows are used for training and 30% for testing (test_size=0.3).
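
Because no seed is passed above, each run produces a different split. A minimal sketch on synthetic arrays shows how `random_state` makes the split repeatable and how `stratify` keeps the class ratio similar in both parts. The arrays `X_demo` and `y_demo` and the seed value 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 2 features, 60/40 class balance
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Repeating the call with the same seed yields the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```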

In [None]:
X_train

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value
80,81,4,1,0,3500000,8900000,4,470,1300000,2200000,12300000,2000000
4092,4093,1,1,0,700000,1400000,6,759,1000000,500000,2600000,700000
1936,1937,4,0,0,1000000,3800000,2,611,1700000,1800000,2500000,600000
3763,3764,3,1,0,9600000,34500000,20,523,26100000,2600000,24300000,5300000
341,342,5,1,0,8600000,21000000,8,465,11700000,14900000,19500000,9900000
...,...,...,...,...,...,...,...,...,...,...,...,...
323,324,3,0,1,9500000,24200000,8,879,3100000,17200000,26400000,12700000
245,246,1,0,1,8000000,28100000,18,725,1900000,12800000,23000000,7500000
947,948,1,0,0,2300000,6600000,18,513,4600000,900000,9000000,1600000
1807,1808,4,0,1,400000,1100000,6,644,1000000,300000,1600000,400000


In [None]:
X_test

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value
3890,3891,4,1,1,6200000,21300000,16,882,11800000,3500000,12900000,4100000
572,573,2,0,1,1300000,2700000,8,627,100000,300000,2600000,1300000
2926,2927,3,1,1,4000000,14700000,8,376,3000000,6200000,8900000,4200000
1718,1719,4,0,1,7600000,20000000,18,691,5000000,8700000,23400000,11200000
3553,3554,5,1,0,1800000,6000000,4,548,1400000,2400000,6500000,2600000
...,...,...,...,...,...,...,...,...,...,...,...,...
460,461,1,1,1,5400000,21400000,8,350,2600000,1000000,19700000,6800000
1053,1054,5,1,1,6000000,11900000,2,629,2500000,0,19200000,4200000
383,384,3,0,0,2300000,7400000,2,682,6500000,2400000,7600000,2100000
2151,2152,2,0,0,7000000,25700000,18,374,3400000,13700000,14900000,4500000


Feature scaling can improve model performance and generalization for some algorithms.
Decision trees and random forests are largely insensitive to feature scale, but algorithms such as support vector machines, k-means clustering, and linear models trained with regularization or gradient descent can benefit significantly from scaling.
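
For a scale-sensitive model, scaling is usually placed inside a pipeline so that the scaler is fitted on the training rows only. The sketch below uses synthetic data and a support vector classifier purely as an example. It is not part of this notebook's workflow, and all names in it are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data; inflate one feature to mimic amounts in the millions
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_demo[:, 0] *= 1_000_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The pipeline fits the scaler on the training rows only, then reuses it on test rows
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_tr, y_tr)

test_acc = scaled_svm.score(X_te, y_te)
print(test_acc)
```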

In [None]:
from sklearn import tree
model=tree.DecisionTreeClassifier()



In [None]:
model.fit(X,y)  # fits on the full dataset (X, y), which also contains the rows in X_test

In [None]:
model.score(X,y)  # accuracy on the same data the model was fitted on; an unpruned tree memorizes it, so 1.0 is expected here

1.0

In [None]:
y_test_predict=model.predict(X_test)  # predict labels for the 30% test split; these are compared with y_test below
y_test_predict

array([1, 1, 0, ..., 1, 0, 1])

In [None]:
y_test

3890    1
572     1
2926    0
1718    1
3553    1
       ..
460     0
1053    1
383     1
2151    0
126     1
Name:  status, Length: 1281, dtype: int64

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_test_predict)  # accuracy on the 30% test split; the model was fitted on all of X, which includes these rows, so this is not a held-out estimate

1.0
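
Accuracy alone does not show which kind of mistake a classifier makes. As a small sketch, a confusion matrix breaks predictions down by true and predicted class. The label arrays below are made up for illustration and are not the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 = approved, 0 = rejected
y_true = [1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Rows are true classes, columns are predicted classes, ordered as [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```

In this toy case the matrix has one wrongly approved and one wrongly rejected application, which a single accuracy figure would not distinguish.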

In [None]:
y_train_predict=model.predict(X_train)  # predict labels for the 70% training split; these are compared with y_train below

In [None]:
accuracy_score(y_train,y_train_predict)

1.0
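
All three scores above are 1.0 because the model was fitted on the full X and y, so the test rows were seen during training. A minimal sketch of a held-out evaluation, on synthetic data with some label noise, fits the tree on the training split only and then compares the two accuracies. The data, names, and seeds here are assumptions for illustration, and the numbers it prints say nothing about the loan dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% of labels flipped, standing in for a real dataset
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Fit on the training split only, so the test rows stay unseen
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, clf.predict(X_tr))
test_acc = accuracy_score(y_te, clf.predict(X_te))

print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

A gap between the two numbers is the usual sign of overfitting; limiting `max_depth` or `min_samples_leaf` is one common way to narrow it.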