# **Loan Approval predictor**

### **Project Aim:**
The Loan Approval Predictor project aims to develop a machine learning model capable of accurately predicting the approval status of a loan application based on a variety of applicant and financial factors. The model utilizes the following key columns in the dataset:

- loan_id: Unique identifier for each loan application.
- no_of_dependents: Number of dependents the applicant supports.
- education: Education level of the applicant (e.g., graduate, undergraduate).
- self_employed: Indicates whether the applicant is self-employed.
- income_annum: Annual income of the applicant.
- loan_amount: Requested loan amount.
- loan_term: Duration of the loan in months.
- cibil_score: Credit score of the applicant indicating financial - reliability.
- residential_assets_value: Value of the applicant's residential assets.
- commercial_assets_value: Value of the applicant's commercial assets.
- luxury_assets_value: Value of the applicant's luxury assets.
- bank_asset_value: Total value of assets held by the bank as collateral.
- loan_status: Target variable indicating whether the loan application is approved or rejected.

### Objectives:
Build a robust data preprocessing pipeline to handle missing values, outliers, and inconsistencies in the dataset.
Explore and visualize the relationships between independent variables and the target variable to gain insights.
Develop and train a machine learning model using suitable algorithms to predict the loan_status.
Optimize the model for accuracy, precision, and recall to ensure reliable loan predictions.
Deploy the trained model for real-world applications and provide insights to financial institutions to improve decision-making.
### Significance:
This project will help financial institutions streamline the loan approval process, reduce manual effort, and minimize risks associated with lending by making data-driven decisions.

# **1. Importing the libraries**

In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# **2. Load the dataset**

In [22]:
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/loan_approval_dataset.csv')
data.head()

Unnamed: 0,loan_id,no_of_dependents,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
0,1,2,Graduate,No,9600000,29900000,12,778,2400000,17600000,22700000,8000000,Approved
1,2,0,Not Graduate,Yes,4100000,12200000,8,417,2700000,2200000,8800000,3300000,Rejected
2,3,3,Graduate,No,9100000,29700000,20,506,7100000,4500000,33300000,12800000,Rejected
3,4,3,Graduate,No,8200000,30700000,8,467,18200000,3300000,23300000,7900000,Rejected
4,5,5,Not Graduate,Yes,9800000,24200000,20,382,12400000,8200000,29400000,5000000,Rejected


# ***About Dataset***

In [23]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4269 entries, 0 to 4268
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   loan_id                    4269 non-null   int64 
 1    no_of_dependents          4269 non-null   int64 
 2    education                 4269 non-null   object
 3    self_employed             4269 non-null   object
 4    income_annum              4269 non-null   int64 
 5    loan_amount               4269 non-null   int64 
 6    loan_term                 4269 non-null   int64 
 7    cibil_score               4269 non-null   int64 
 8    residential_assets_value  4269 non-null   int64 
 9    commercial_assets_value   4269 non-null   int64 
 10   luxury_assets_value       4269 non-null   int64 
 11   bank_asset_value          4269 non-null   int64 
 12   loan_status               4269 non-null   object
dtypes: int64(10), object(3)
memory usage: 433.7+ KB


In [24]:
data.isnull().sum()

Unnamed: 0,0
loan_id,0
no_of_dependents,0
education,0
self_employed,0
income_annum,0
loan_amount,0
loan_term,0
cibil_score,0
residential_assets_value,0
commercial_assets_value,0


# **3. Feature Selection**

In [25]:
data = data.iloc[:, 2:]  # Select all rows and columns starting from the third column
data

Unnamed: 0,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
0,Graduate,No,9600000,29900000,12,778,2400000,17600000,22700000,8000000,Approved
1,Not Graduate,Yes,4100000,12200000,8,417,2700000,2200000,8800000,3300000,Rejected
2,Graduate,No,9100000,29700000,20,506,7100000,4500000,33300000,12800000,Rejected
3,Graduate,No,8200000,30700000,8,467,18200000,3300000,23300000,7900000,Rejected
4,Not Graduate,Yes,9800000,24200000,20,382,12400000,8200000,29400000,5000000,Rejected
...,...,...,...,...,...,...,...,...,...,...,...
4264,Graduate,Yes,1000000,2300000,12,317,2800000,500000,3300000,800000,Rejected
4265,Not Graduate,Yes,3300000,11300000,20,559,4200000,2900000,11000000,1900000,Approved
4266,Not Graduate,No,6500000,23900000,18,457,1200000,12400000,18100000,7300000,Rejected
4267,Not Graduate,No,4100000,12800000,8,780,8200000,700000,14100000,5800000,Approved


# **4. Apply LabelEncoder in Last column**

In [26]:
# Encoding the last column if it's of object type
last_column = data.columns[-1]
if data[last_column].dtype == 'object':
    data[last_column] = LabelEncoder().fit_transform(data[last_column])
# or
# data[last_column] = data[last_column].map({'Approved': 1, 'Rejected': 0})
# or
# data[last_column] = data[last_column].replace({'Approved': 1, 'Rejected': 0})
# or
# data[data.columns[-1]] = LabelEncoder().fit_transform(data[data.columns[-1]])

In [27]:
data.sample(5)

Unnamed: 0,education,self_employed,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status
1869,Graduate,Yes,3900000,9000000,8,340,4900000,6800000,9000000,5200000,1
1190,Graduate,Yes,8100000,25700000,8,455,22300000,11500000,19800000,9500000,1
1641,Not Graduate,No,9600000,31200000,12,578,9900000,2200000,26500000,13500000,0
617,Not Graduate,No,5400000,15100000,12,522,5200000,4700000,13900000,5700000,1
2183,Graduate,Yes,7300000,18300000,8,434,10000000,2900000,26800000,6700000,1


# **5. Apply OneHotEncoder in First Two column**

In [28]:
# Select columns 1 and 2 (index 0 and 1) for one-hot encoding
cols_to_encode = data.columns[[0, 1]]

# Create a OneHotEncoder object
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

# Fit and transform the selected columns
encoded_data = ohe.fit_transform(data[cols_to_encode])


# Create a new DataFrame from the encoded data
encoded_df = pd.DataFrame(encoded_data, columns=ohe.get_feature_names_out(cols_to_encode))

# Concatenate the encoded DataFrame with the original DataFrame, excluding the original columns
data = pd.concat([data.drop(cols_to_encode, axis=1), encoded_df], axis=1)

# Now the 'data' DataFrame has one-hot encoded columns and the original columns have been removed.

data.head()

Unnamed: 0,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status,education_ Graduate,education_ Not Graduate,self_employed_ No,self_employed_ Yes
0,9600000,29900000,12,778,2400000,17600000,22700000,8000000,0,1.0,0.0,1.0,0.0
1,4100000,12200000,8,417,2700000,2200000,8800000,3300000,1,0.0,1.0,0.0,1.0
2,9100000,29700000,20,506,7100000,4500000,33300000,12800000,1,1.0,0.0,1.0,0.0
3,8200000,30700000,8,467,18200000,3300000,23300000,7900000,1,1.0,0.0,1.0,0.0
4,9800000,24200000,20,382,12400000,8200000,29400000,5000000,1,0.0,1.0,0.0,1.0


# **6. Splitting the data into features and target variable**

In [29]:
X = data.iloc[:, :-1]  # Features (all columns except the last one)
y = data.iloc[:, -1]   # Target variable (the last column)

In [30]:
X

Unnamed: 0,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status,education_ Graduate,education_ Not Graduate,self_employed_ No
0,9600000,29900000,12,778,2400000,17600000,22700000,8000000,0,1.0,0.0,1.0
1,4100000,12200000,8,417,2700000,2200000,8800000,3300000,1,0.0,1.0,0.0
2,9100000,29700000,20,506,7100000,4500000,33300000,12800000,1,1.0,0.0,1.0
3,8200000,30700000,8,467,18200000,3300000,23300000,7900000,1,1.0,0.0,1.0
4,9800000,24200000,20,382,12400000,8200000,29400000,5000000,1,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
4264,1000000,2300000,12,317,2800000,500000,3300000,800000,1,1.0,0.0,0.0
4265,3300000,11300000,20,559,4200000,2900000,11000000,1900000,0,0.0,1.0,0.0
4266,6500000,23900000,18,457,1200000,12400000,18100000,7300000,1,0.0,1.0,1.0
4267,4100000,12800000,8,780,8200000,700000,14100000,5800000,0,0.0,1.0,1.0


In [None]:
y

Unnamed: 0,loan_status
0,0
1,1
2,1
3,1
4,1
...,...
4264,1
4265,0
4266,1
4267,0


# **7. Splitting the data into training and testing sets**

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [32]:
X_train

Unnamed: 0,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status,education_ Graduate,education_ Not Graduate,self_employed_ No
1675,7900000,29900000,6,568,5800000,13900000,15900000,8700000,0,0.0,1.0,1.0
1164,9600000,34000000,12,710,23800000,10300000,38100000,7800000,0,0.0,1.0,0.0
192,800000,2900000,8,682,2200000,1100000,2900000,700000,0,1.0,0.0,1.0
910,4900000,13100000,18,754,8200000,3300000,16500000,7200000,0,1.0,0.0,0.0
567,3000000,11100000,12,441,8500000,2500000,7300000,2000000,1,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
3444,1300000,4700000,16,530,3200000,1000000,3800000,800000,1,1.0,0.0,0.0
466,500000,1800000,18,411,1000000,500000,1300000,200000,1,0.0,1.0,0.0
3092,8600000,20600000,16,449,10800000,10600000,28700000,5400000,1,0.0,1.0,0.0
3772,7000000,21400000,12,541,3700000,1300000,19800000,9600000,1,0.0,1.0,0.0


In [33]:
X_train

Unnamed: 0,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,loan_status,education_ Graduate,education_ Not Graduate,self_employed_ No
1675,7900000,29900000,6,568,5800000,13900000,15900000,8700000,0,0.0,1.0,1.0
1164,9600000,34000000,12,710,23800000,10300000,38100000,7800000,0,0.0,1.0,0.0
192,800000,2900000,8,682,2200000,1100000,2900000,700000,0,1.0,0.0,1.0
910,4900000,13100000,18,754,8200000,3300000,16500000,7200000,0,1.0,0.0,0.0
567,3000000,11100000,12,441,8500000,2500000,7300000,2000000,1,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
3444,1300000,4700000,16,530,3200000,1000000,3800000,800000,1,1.0,0.0,0.0
466,500000,1800000,18,411,1000000,500000,1300000,200000,1,0.0,1.0,0.0
3092,8600000,20600000,16,449,10800000,10600000,28700000,5400000,1,0.0,1.0,0.0
3772,7000000,21400000,12,541,3700000,1300000,19800000,9600000,1,0.0,1.0,0.0


In [None]:
X_test

Unnamed: 0,income_annum,loan_amount,loan_term,cibil_score,residential_assets_value,commercial_assets_value,luxury_assets_value,bank_asset_value,education_ Graduate,education_ Not Graduate,self_employed_ No,self_employed_ Yes
1703,5400000,19700000,20,423,6500000,10000000,15700000,7300000,1.0,0.0,1.0,0.0
1173,5900000,14000000,8,599,4700000,9500000,17800000,6700000,1.0,0.0,1.0,0.0
308,9600000,19900000,14,452,4200000,16200000,28500000,6600000,1.0,0.0,1.0,0.0
1322,6200000,23400000,8,605,10000000,10800000,21800000,9200000,1.0,0.0,1.0,0.0
3271,5800000,14100000,12,738,11700000,4400000,15400000,8400000,0.0,1.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...
912,2500000,5400000,10,592,3400000,1500000,9900000,2900000,1.0,0.0,1.0,0.0
443,7700000,16700000,6,555,12900000,2900000,18100000,8500000,1.0,0.0,1.0,0.0
1483,5600000,11500000,4,695,9500000,7100000,11700000,7800000,0.0,1.0,0.0,1.0
668,2200000,8600000,20,373,4100000,1300000,5900000,1400000,0.0,1.0,1.0,0.0


# **8. Training the model**

In [34]:
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# **9. Making predictions**

In [35]:
y_pred = model.predict(X_test)

# **10. Evaluating the model**

In [36]:
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print('Confusion Matrix:')
print(conf_matrix)
print('Classification Report:')
print(class_report)

Accuracy: 1.0
Confusion Matrix:
[[437   0]
 [  0 417]]
Classification Report:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00       437
         1.0       1.00      1.00      1.00       417

    accuracy                           1.00       854
   macro avg       1.00      1.00      1.00       854
weighted avg       1.00      1.00      1.00       854



# **11. Prediction With New Data**

# *************************************************************************

# About Me:-
## Name - Aatish Kumar Baitha
  - M.Tech(Data Science)
- YouTube
  - https://www.youtube.com/@EngineeringWithAatish/playlists
- My Linkedin Profile
  - https://www.linkedin.com/in/aatish-kumar-baitha-ba9523191
- My Blog
  - https://computersciencedatascience.blogspot.com/
- My Github Profile
  - https://github.com/Aatishkb

# Thank you!

# *************************************************************************