# Loan Approval Prediction Model

This notebook builds an end-to-end machine learning pipeline that predicts loan approval decisions with a Random Forest classifier, using Logistic Regression as a baseline for comparison.

## 📊 Dataset Overview
- **Features**: 11 input variables including personal, financial, and asset information
- **Target**: Binary classification (Approved/Rejected)
- **Algorithm**: Random Forest Classifier with hyperparameter tuning

## 🔄 Pipeline Steps
1. **Data Preprocessing & Feature Engineering** 
2. **Model Training & Evaluation**
3. **Model Persistence & Testing**

---

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Load the dataset
df = pd.read_csv(r"loan_approval_predicter\DATASET\loan_approval_dataset.csv")

# 2. Clean up whitespace from all column names
df.columns = df.columns.str.strip()

# 3. Strip whitespace from all string values in the dataframe
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].str.strip()

# 4. Drop the irrelevant 'loan_id' column
df = df.drop('loan_id', axis=1)

# 5. Encode categorical columns as integers with map()
# Any value missing from a mapping becomes NaN, so the whitespace stripping in steps 2-3 must run first
df['education'] = df['education'].map({'Graduate': 1, 'Not Graduate': 0})
df['self_employed'] = df['self_employed'].map({'Yes': 1, 'No': 0})
df['loan_status'] = df['loan_status'].map({'Approved': 1, 'Rejected': 0})

# 6. Separate features (X) and target (y)
X = df.drop('loan_status', axis=1)
y = df['loan_status']

# 7. Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# 8. Apply Feature Scaling
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

print("--- Data successfully preprocessed and scaled! ---")
print("\nShape of scaled training data:", x_train_scaled.shape)
print("Shape of scaled testing data:", x_test_scaled.shape)
print("\nFirst few rows of processed data:")
print(df.head())

--- Data successfully preprocessed and scaled! ---

Shape of scaled training data: (3415, 11)
Shape of scaled testing data: (854, 11)

First few rows of processed data:
   no_of_dependents  education  self_employed  income_annum  loan_amount  \
0                 2          1              0       9600000     29900000   
1                 0          0              1       4100000     12200000   
2                 3          1              0       9100000     29700000   
3                 3          1              0       8200000     30700000   
4                 5          0              1       9800000     24200000   

   loan_term  cibil_score  residential_assets_value  commercial_assets_value  \
0         12          778                   2400000                 17600000   
1          8          417                   2700000                  2200000   
2         20          506                   7100000                  4500000   
3          8          467                  18200000   

## 1. Complete Data Preprocessing Pipeline

### Comprehensive Data Cleaning & Preprocessing
The preprocessing cell performs the following steps:

1. **Data Loading**: Import the dataset from CSV
2. **Data Cleaning**: Strip whitespace from column names and string values
3. **Column Removal**: Drop the identifier column (loan_id), which carries no predictive information
4. **Categorical Encoding**: Convert text labels to 0/1 using map()
5. **Feature-Target Separation**: Split into X (features) and y (loan_status)
6. **Train-Test Split**: Hold out 20% of the rows for testing, stratified on the target
7. **Feature Scaling**: Standardize features to zero mean and unit variance with StandardScaler, fitted on the training set only and then applied to the test set
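
One caveat about the encoding step: `Series.map` turns any value that is missing from the mapping into `NaN` without raising an error, so a stray label would silently corrupt a column. The sketch below shows one possible stricter encoder. The helper name `encode_binary` is hypothetical and not part of the notebook.

```python
import pandas as pd


def encode_binary(series: pd.Series, mapping: dict) -> pd.Series:
    """Map labels to integers, failing loudly on anything unmapped."""
    encoded = series.map(mapping)
    # map() yields NaN for values that are not keys of `mapping`
    unmapped = series[encoded.isna()].unique().tolist()
    if unmapped:
        raise ValueError(f"Unmapped values: {unmapped}")
    return encoded.astype("int64")


# Clean labels encode as expected
print(encode_binary(pd.Series(["Yes", "No", "Yes"]), {"Yes": 1, "No": 0}).tolist())

# A label with leftover whitespace is caught instead of becoming NaN
try:
    encode_binary(pd.Series(["Yes", " No"]), {"Yes": 1, "No": 0})
except ValueError as err:
    print(err)
```

A check like this is most useful right after loading, before the rows reach `train_test_split`.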

In [19]:
from sklearn.linear_model import LogisticRegression

# Fit a Logistic Regression baseline on the scaled training data
model = LogisticRegression(random_state=42)
model.fit(x_train_scaled, y_train)

## 2. Model Training & Evaluation

### Logistic Regression Baseline Model
Training a Logistic Regression classifier as a baseline:
- **Algorithm**: Linear classifier using sigmoid function
- **Advantages**: Fast, interpretable, provides probability scores
- **Use Case**: Baseline comparison for Random Forest model
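
To make the sigmoid bullet concrete, the sketch below shows how a logistic regression turns a linear score into a probability. The weights, bias, and feature values are made-up illustrative numbers, not coefficients from the fitted model.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # avoids overflow for large negative z
    return ez / (1.0 + ez)


def approval_probability(features, weights, bias):
    """P(class = 1) = sigmoid(w . x + b)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)


# Hypothetical scaled features and coefficients (illustration only)
features = [1.2, -0.4]
weights = [2.0, 0.5]
bias = -0.3

p = approval_probability(features, weights, bias)
label = int(p >= 0.5)  # 0.5 cutoff for the binary decision
print(round(p, 3), label)
```

Because each weight multiplies one feature directly, the sign and size of a coefficient indicate how that feature pushes the score, which is where the interpretability of the baseline comes from.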

In [20]:
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Make predictions
y_pred = model.predict(x_test_scaled)

# Evaluate
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nROC AUC Score:")
print(roc_auc_score(y_test, model.predict_proba(x_test_scaled)[:, 1]))

Confusion Matrix:
[[280  43]
 [ 31 500]]

Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.87      0.88       323
           1       0.92      0.94      0.93       531

    accuracy                           0.91       854
   macro avg       0.91      0.90      0.91       854
weighted avg       0.91      0.91      0.91       854


ROC AUC Score:
0.9725501857002093


### Logistic Regression Model Evaluation
Evaluating the Logistic Regression model on the test set with three metrics:
- **Confusion Matrix**: Counts of true vs. predicted classes (rows are true labels, columns are predicted labels)
- **Classification Report**: Precision, recall, and F1-score for each class
- **ROC AUC Score**: Area under the ROC curve, computed from the predicted probabilities (higher is better; 0.5 corresponds to chance)
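
As a worked check, the per-class figures in the classification report can be recomputed by hand from the Logistic Regression confusion matrix printed above. This is only an arithmetic sketch; the variable names are chosen for illustration.

```python
# Confusion matrix from the Logistic Regression output.
# scikit-learn convention: rows = true class, columns = predicted class.
cm = [[280, 43],
      [31, 500]]
tn, fp = cm[0]  # true class 0 (Rejected)
fn, tp = cm[1]  # true class 1 (Approved)

precision_1 = tp / (tp + fp)  # of predicted approvals, share that were correct
recall_1 = tp / (tp + fn)     # of actual approvals, share that were found
precision_0 = tn / (tn + fn)  # of predicted rejections, share that were correct
recall_0 = tn / (tn + fp)     # of actual rejections, share that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

for name, value in [
    ("precision (class 1)", precision_1),
    ("recall (class 1)", recall_1),
    ("precision (class 0)", precision_0),
    ("recall (class 0)", recall_0),
    ("accuracy", accuracy),
]:
    print(f"{name}: {value:.2f}")
```

The rounded values agree with the report: 0.92 and 0.94 for class 1, 0.90 and 0.87 for class 0, and 0.91 accuracy.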

In [21]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the model and parameters to search
rfc = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [5, 10, None],
    'min_samples_leaf': [1, 2, 4]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5, n_jobs=-1, verbose=2, scoring='f1')
grid_search.fit(x_train_scaled, y_train)

# Get the best model
best_model = grid_search.best_estimator_
print(f"Best parameters found: {grid_search.best_params_}")

# Evaluate the best model
y_pred_best = best_model.predict(x_test_scaled)
print(classification_report(y_test, y_pred_best))

Fitting 5 folds for each of 18 candidates, totalling 90 fits
Best parameters found: {'max_depth': None, 'min_samples_leaf': 1, 'n_estimators': 100}
              precision    recall  f1-score   support

           0       0.98      0.97      0.97       323
           1       0.98      0.98      0.98       531

    accuracy                           0.98       854
   macro avg       0.98      0.98      0.98       854
weighted avg       0.98      0.98      0.98       854



### Random Forest with Hyperparameter Tuning
Training a Random Forest model with Grid Search for optimal hyperparameters:

**Parameters being tuned:**
- **n_estimators**: Number of trees (100, 200)
- **max_depth**: Maximum depth of trees (5, 10, None)
- **min_samples_leaf**: Minimum samples at leaf node (1, 2, 4)

**Grid Search**: Tests all 18 parameter combinations with 5-fold cross-validation (90 fits in total) and keeps the combination with the best mean F1 score.
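
The candidate and fit counts in the grid search log follow directly from the size of the grid. The sketch below enumerates the combinations with the standard library rather than scikit-learn, purely to show the arithmetic.

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 5

# Every combination of one value per hyperparameter
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)     # 2 * 3 * 3
n_fits = n_candidates * cv_folds   # one fit per candidate per fold
print(n_candidates, n_fits)
```

With its default `refit=True`, `GridSearchCV` performs one additional fit of the winning combination on the full training set, which is what `best_estimator_` returns.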

In [25]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(x_train_scaled, y_train)

print("--- Random Forest model trained successfully! ---")

--- Random Forest model trained successfully! ---


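### Final Random Forest Model Training
Training the final Random Forest model with the settings selected by the grid search, which coincide with scikit-learn's defaults:
- **n_estimators**: 100 trees
- **max_depth** / **min_samples_leaf**: left at their defaults (`None` and `1`), matching the best parameters reported by the grid search
- **random_state**: 42 (for reproducibility)

This model will be saved and used in production.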

In [None]:
import joblib

# Save the trained Random Forest model
joblib.dump(rf_model, r"loan_approval_predicter\MODEL FILE\loan_approval_model.joblib")
print("Model saved successfully as 'loan_approval_model.joblib'")

Model saved successfully as 'loan_approval_model.joblib'


## 3. Model Persistence

### Saving the Trained Model
Saving the trained Random Forest model using joblib for production use:
- **File Format**: .joblib (efficient for sklearn models)
- **Location**: MODEL FILE directory
- **Purpose**: Load in a Flask API for real-time predictions (the fitted scaler is saved separately and must be applied to incoming data before prediction)
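
At prediction time the saved model expects inputs that were scaled exactly as in training, with the columns in the same order. Here is a minimal sketch of a helper that enforces this, assuming objects with scikit-learn style `transform` and `predict` methods. The name `predict_approval` and its parameters are hypothetical and do not come from the notebook or any API.

```python
def predict_approval(model, scaler, applicant: dict, feature_order: list) -> int:
    """Return 1 (approved) or 0 (rejected) for a single applicant.

    `feature_order` must list the columns exactly as they were ordered
    when the scaler and model were fitted.
    """
    missing = [name for name in feature_order if name not in applicant]
    if missing:
        raise KeyError(f"Missing features: {missing}")

    # Build one row with shape (1, n_features) in training column order
    row = [[applicant[name] for name in feature_order]]

    # Apply the same scaling that was fitted on the training data
    scaled = scaler.transform(row)
    return int(model.predict(scaled)[0])
```

In practice `model` and `scaler` would be the two objects restored with `joblib.load`, and categorical fields would need the same 0/1 encoding used during preprocessing. A scaler fitted on a DataFrame may warn about missing feature names when given a plain list.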

In [27]:
import joblib
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Load the saved model (this rebinds `model`, replacing the Logistic Regression baseline)
model = joblib.load(r"loan_approval_predicter\MODEL FILE\loan_approval_model.joblib")

# Make predictions on the held-out test data
y_pred = model.predict(x_test_scaled)

# Evaluate the reloaded model

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nROC AUC Score:")
print(roc_auc_score(y_test, model.predict_proba(x_test_scaled)[:, 1]))

Confusion Matrix:
[[314   9]
 [  8 523]]

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.97      0.97       323
           1       0.98      0.98      0.98       531

    accuracy                           0.98       854
   macro avg       0.98      0.98      0.98       854
weighted avg       0.98      0.98      0.98       854


ROC AUC Score:
0.9988747208666399


## 4. Model Testing & Validation

### Loading and Testing the Saved Model
Testing the saved model to ensure it works correctly:
1. **Load Model**: Import the saved .joblib file
2. **Make Predictions**: Predict on the held-out test set
3. **Evaluate Performance**: Verify that the reloaded model reproduces the expected metrics

This validates that the model serialization/deserialization process works properly.
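
Matching metrics are a good sign, but a stricter round-trip check compares the predictions themselves before and after reloading. The sketch below does this on a small synthetic dataset in a temporary directory. The data, model size, and file name are stand-ins for illustration, not the notebook's actual artifacts.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 11 features (illustration only)
X, y = make_classification(n_samples=200, n_features=11, random_state=42)
original = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

# Save and reload inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"
    joblib.dump(original, path)
    reloaded = joblib.load(path)

# The reloaded model should behave identically to the in-memory one
same_labels = np.array_equal(original.predict(X), reloaded.predict(X))
same_probs = np.allclose(original.predict_proba(X), reloaded.predict_proba(X))
print(same_labels, same_probs)
```

The same comparison could be applied to `rf_model` and the model loaded from disk using `x_test_scaled`.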

In [None]:
import joblib

# Save the fitted scaler so the same transformation can be applied at prediction time
joblib.dump(scaler, r"loan_approval_predicter\MODEL FILE\scaler.joblib")

['c:\\Users\\adhin\\OneDrive\\Desktop\\VS CODE\\project\\CML project\\loan_approval_predicter\\MODEL FILE\\scaler.joblib']