# Loan Approval Prediction

## Step 1: Importing Libraries

We begin by importing:

- **Pandas & NumPy** for data manipulation and numerical operations.  
- **Seaborn & Matplotlib** for plotting distributions and relationships.  
- **scikit-learn** modules for train/test split, encoding, model building (Random Forest), and evaluation metrics.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from google.colab import files
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

sns.set(style="whitegrid")
%matplotlib inline


## Step 2: Uploading & Loading the Data

We upload both `train.csv` and `test.csv`.  
- The **training set** contains features and the target `Loan_Status`.  
- The **test set** has the same features but no target (used later for final predictions/submission).


In [2]:
train = pd.read_csv("train.csv")
test  = pd.read_csv("test.csv")

print("Original train shape:", train.shape)
print("Original test  shape:", test.shape)


Original train shape: (614, 13)
Original test  shape: (367, 12)
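
A quick way to confirm that the two files differ only by the target column is to compare their column sets. The sketch below uses empty toy frames with a shortened, made-up column list (`toy_train` and `toy_test` are illustrative names, not the notebook's data):

```python
import pandas as pd

# Toy stand-ins with a shortened column list (illustrative only).
toy_train = pd.DataFrame(columns=["Loan_ID", "Gender", "LoanAmount", "Loan_Status"])
toy_test = pd.DataFrame(columns=["Loan_ID", "Gender", "LoanAmount"])

# Columns present in one frame but not the other.
only_in_train = set(toy_train.columns) - set(toy_test.columns)
only_in_test = set(toy_test.columns) - set(toy_train.columns)

print("Only in train:", only_in_train)
print("Only in test :", only_in_test)
```

On the real frames, the same two lines should report `Loan_Status` as the only column missing from the test set.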


## Step 3: Dropping Rows with Missing Target

Any row where `Loan_Status` is missing cannot teach the model, so we remove them:

```python
train = train.dropna(subset=["Loan_Status"])
```

In [3]:
print("Missing Loan_Status before drop:", train["Loan_Status"].isnull().sum())

train = train.dropna(subset=["Loan_Status"])

print("Shape after drop:", train.shape)
print("Unique Loan_Status values:", train["Loan_Status"].unique())


Missing Loan_Status before drop: 0
Shape after drop: (614, 13)
Unique Loan_Status values: ['Y' 'N']


This ensures our target vector `y` will have no nulls.

## Step 4: Handling Missing Feature Values

Missing values in input features can bias or break models if left untreated. We:

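- Fill **categorical** columns (`Gender`, `Married`, `Dependents`, `Self_Employed`) and the discrete numeric columns `Credit_History` and `Loan_Amount_Term` with their **mode** (most common value).  
- Fill the continuous numeric column `LoanAmount` with its **median** (robust to outliers).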


In [4]:
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection; pandas warns that the chained inplace form stops working in 3.0.
for col in ["Gender","Married","Dependents","Self_Employed","Credit_History","Loan_Amount_Term"]:
    train[col] = train[col].fillna(train[col].mode()[0])

train["LoanAmount"] = train["LoanAmount"].fillna(train["LoanAmount"].median())

print("Remaining nulls:\n", train.isnull().sum())


Remaining nulls:
 Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
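
The test set will need the same treatment before prediction. One common approach, sketched here on made-up toy frames, is to compute the fill values from the training data only and reuse them for both frames. The names `toy_train`, `toy_test`, and `fill_values` are illustrative assumptions, not part of the notebook:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real train/test frames (illustrative values only).
toy_train = pd.DataFrame({
    "Gender": ["Male", "Male", None, "Female"],
    "LoanAmount": [100.0, np.nan, 120.0, 500.0],
})
toy_test = pd.DataFrame({
    "Gender": [None, "Female"],
    "LoanAmount": [np.nan, 80.0],
})

# Compute the fill values once, from the training data only.
fill_values = {
    "Gender": toy_train["Gender"].mode()[0],
    "LoanAmount": toy_train["LoanAmount"].median(),
}

# DataFrame.fillna accepts a {column: value} dict and returns a new frame.
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)

print(fill_values)
print(toy_test)
```

Reusing training statistics keeps the test rows from influencing the imputation, which matters once the test set is treated as unseen data.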


## Step 5: Encoding Categorical Features

Machine-learning models require numeric inputs. We convert each string column (except IDs and the original target) into integers via **LabelEncoder**, which assigns a unique integer to each category.


In [5]:
encoder = LabelEncoder()
for col in train.select_dtypes(include="object").columns:
    if col not in ["Loan_ID","Loan_Status"]:
        train[col] = encoder.fit_transform(train[col].astype(str))

print("Data types after encoding:\n", train.dtypes)


Data types after encoding:
 Loan_ID               object
Gender                 int64
Married                int64
Dependents             int64
Education              int64
Self_Employed          int64
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area          int64
Loan_Status           object
dtype: object
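
Because the cell above refits a single `encoder` on every column, only the mapping for the last column is retained afterwards. If the same mappings are needed later for `test.csv`, one option is to keep a separate fitted encoder per column. The sketch below uses toy frames, and the `encoders` dictionary is a hypothetical helper:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for two of the string columns (illustrative values only).
toy_train = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
    "Education": ["Graduate", "Not Graduate", "Graduate", "Graduate"],
})
toy_test = pd.DataFrame({
    "Property_Area": ["Rural", "Urban"],
    "Education": ["Not Graduate", "Graduate"],
})

# Fit one encoder per column on the training data and keep it for reuse.
encoders = {}
for col in list(toy_train.columns):
    enc = LabelEncoder()
    toy_train[col] = enc.fit_transform(toy_train[col].astype(str))
    encoders[col] = enc

# Apply the stored mappings to the test data (transform only, no refit).
for col, enc in encoders.items():
    toy_test[col] = enc.transform(toy_test[col].astype(str))

print(toy_test)
```

`LabelEncoder.transform` raises a `ValueError` for a category it did not see during fitting, so unseen test categories would need separate handling.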


## Step 6: Encoding the Target

Rather than manual mapping, we use **LabelEncoder** on `Loan_Status`:

- `'N'` → 0  
- `'Y'` → 1  

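Because rows with a missing target were dropped in Step 3, the encoded column contains no nulls, and the cell below confirms this. Note that `.astype(str)` would turn any remaining `NaN` into the string `'nan'` and encode it as a third class, so the printed class list is worth checking.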

In [6]:
le_target = LabelEncoder()
train["Loan_Status_enc"] = le_target.fit_transform(train["Loan_Status"].astype(str))

print("Classes:", list(le_target.classes_))              # ['N','Y']
print("Counts:\n", train["Loan_Status_enc"].value_counts())
print("Nulls in y    :", train["Loan_Status_enc"].isnull().sum())  # should be 0


Classes: ['N', 'Y']
Counts:
 Loan_Status_enc
1    422
0    192
Name: count, dtype: int64
Nulls in y    : 0
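
A fitted encoder can also map numeric predictions back to the original labels, which is handy when writing a submission file. A minimal sketch on made-up labels and predictions (`le` and `preds` are illustrative names):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts classes alphabetically, so 'N' becomes 0 and 'Y' becomes 1.
le = LabelEncoder()
encoded = le.fit_transform(["Y", "N", "Y", "Y"])

# Pretend these are model predictions (made-up values).
preds = np.array([1, 0, 1])

# inverse_transform maps the integers back to the original string labels.
labels = le.inverse_transform(preds)

print(list(encoded))
print(list(labels))
```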


## Step 7: Feature Engineering

We add two derived features:

- **TotalIncome** = ApplicantIncome + CoapplicantIncome  
- **LogLoanAmount** = log(LoanAmount) (reduces skew)

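Combined income reflects the household's total repayment capacity, and the log transform compresses the skewed distribution of loan amounts. Tree-based models such as Random Forest are largely insensitive to monotonic transforms of a single feature, so `LogLoanAmount` is mainly useful for plots and for linear models.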

In [7]:

train["TotalIncome"]   = train["ApplicantIncome"] + train["CoapplicantIncome"]
train["LogLoanAmount"] = np.log(train["LoanAmount"])
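
One caveat with `np.log` is that it returns `-inf` for a zero input. Whether `LoanAmount` ever contains zeros is not shown here. If it could, `np.log1p` (which computes log(1 + x)) is one possible alternative. A small sketch on made-up amounts:

```python
import numpy as np
import pandas as pd

# Made-up loan amounts, including a zero to show the edge case.
amounts = pd.Series([0.0, 50.0, 128.0, 700.0])

# np.log(0) is -inf; suppress the divide-by-zero warning for the demo.
with np.errstate(divide="ignore"):
    plain_log = np.log(amounts)

# np.log1p(0) is 0, so every value stays finite.
safe_log = np.log1p(amounts)

print(plain_log.tolist())
print(safe_log.tolist())
```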


## Step 8: Preparing Features (X) & Target (y)

- **X**: All columns except `Loan_ID`, `Loan_Status`, and our new encoded target.  
- **y**: The encoded `Loan_Status_enc` vector (0/1).


In [8]:
X = train.drop(["Loan_ID","Loan_Status","Loan_Status_enc"], axis=1)
y = train["Loan_Status_enc"]

print("Shape X:", X.shape)
print("Shape y:", y.shape)


Shape X: (614, 13)
Shape y: (614,)


## Step 9: Train/Validation Split

We split into **80% train** / **20% validation** to evaluate model generalization before touching the test set.


In [9]:
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=42
)

print("X_train:", X_train.shape, "y_train:", y_train.shape)
print("X_val  :", X_val.shape,   "y_val  :", y_val.shape)


X_train: (491, 13) y_train: (491,)
X_val  : (123, 13) y_val  : (123,)
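
The split above is not stratified, so the class ratio in the validation fold can drift from the overall ratio. Passing `stratify=y` is one way to keep the proportions aligned. The sketch below uses a toy target with the same class counts as reported in Step 6 (192 zeros, 422 ones), while the feature matrix `X_toy` is a made-up placeholder:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target with the class counts from Step 6; features are placeholders.
y_toy = np.array([0] * 192 + [1] * 422)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions similar in both folds.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

print("Overall share of class 1   :", y_toy.mean())
print("Validation share of class 1:", y_va.mean())
```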


## Step 10: Training & Evaluating the Random Forest

We fit a **RandomForestClassifier** on the train fold and predict on the validation fold.  
Key metrics printed:

- **Accuracy**: Overall correctness of predictions.  
- **Precision**: Of all ‘approve’ predictions, how many were correct.  
- **Recall**: Of all truly approved loans, how many we caught.  
- **F1-score**: Harmonic mean of precision & recall (balance of both).


In [10]:
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_val)

print("► Accuracy :", accuracy_score(y_val, y_pred))
print("\n► Classification Report:\n", classification_report(y_val, y_pred))


► Accuracy : 0.7479674796747967

► Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.42      0.54        43
           1       0.75      0.93      0.83        80

    accuracy                           0.75       123
   macro avg       0.75      0.67      0.68       123
weighted avg       0.75      0.75      0.73       123
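
The figures in the report can be reproduced by hand from a confusion matrix. The notebook does not print one, so the counts below are inferred from the reported support, precision, and recall values. Treat them as an assumption consistent with the rounded report, not as notebook output:

```python
import numpy as np

# Rows = true class, columns = predicted class (0 = 'N', 1 = 'Y').
# Inferred counts (assumption): chosen to match the rounded report above.
cm = np.array([[18, 25],
               [ 6, 74]])

accuracy = np.trace(cm) / cm.sum()            # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)      # correct / predicted, per class
recall = np.diag(cm) / cm.sum(axis=1)         # correct / actual, per class
f1 = 2 * precision * recall / (precision + recall)

print("accuracy :", accuracy)
print("precision:", precision.round(2))
print("recall   :", recall.round(2))
print("f1       :", f1.round(2))
```

Read this way, the low class-0 recall corresponds to 25 of the 43 non-approved validation cases being predicted as approved.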



## Interpreting Results

► **Accuracy**: 0.748  
   - The model correctly predicts loan approval status about 75% of the time. For reference, always predicting “Approved” would score 80/123 ≈ 65% on this validation fold.

► **Class 0 (Not Approved)**  
   - Precision 0.75: When we predict “Not Approved,” we’re right 75% of the time.  
   - Recall 0.42: We only catch 42% of the actual “Not Approved” cases (miss many).

► **Class 1 (Approved)**  
   - Precision 0.75: When we predict “Approved,” we’re right 75% of the time.  
   - Recall 0.93: We identify 93% of all true “Approved” cases (strong coverage).

**Takeaway:** The model is much better at identifying approved loans (recall 0.93) than at identifying loans that should be rejected (recall 0.42), so many “Not Approved” cases are predicted as approved. We might tune class weights, gather more data, or engineer additional features to boost recall on the minority class.
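
As a starting point for the class-weight idea, `RandomForestClassifier` accepts `class_weight="balanced"`, which reweights classes inversely to their frequency. The sketch below compares minority-class recall with and without it on synthetic data from `make_classification`. The data, the 30/70 class ratio, and the `results` dictionary are assumptions for illustration, so the numbers say nothing about the loan dataset itself:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (roughly 30% class 0, 70% class 1).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, weights=[0.3, 0.7], random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.20, random_state=42, stratify=y_syn
)

# Fit the same model with and without class weighting and record class-0 recall.
results = {}
for cw in (None, "balanced"):
    clf = RandomForestClassifier(random_state=42, class_weight=cw)
    clf.fit(X_tr, y_tr)
    results[str(cw)] = recall_score(y_va, clf.predict(X_va), pos_label=0)

print(results)
```

Whether weighting helps on the real data would need to be checked on the validation fold, since it typically trades some majority-class recall for minority-class recall.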
