# Task 4: Loan Approval Prediction
Description:
- Dataset (Recommended): Loan-Approval-Prediction-Dataset (Kaggle)
- Build a model to predict whether a loan application will be approved
- Handle missing values and encode categorical features
- Train a classification model and evaluate performance on imbalanced data
- Focus on precision, recall, and F1-score

Tools & Libraries:
 - Python
 - Pandas
 - Scikit-learn

Covered Topics:
 - Binary classification
 - Imbalanced data

Bonus:
- Use SMOTE or other techniques to address class imbalance
- Try logistic regression vs. decision tree

# Import Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt 
import seaborn as sns 
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier  # random forest model
from sklearn.linear_model import LogisticRegression  # logistic regression model
from sklearn.tree import DecisionTreeClassifier  # decision tree model
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

import os

import warnings
warnings.filterwarnings("ignore")

# Load the data

In [None]:
df = pd.read_csv('data.csv')
df.head()

# Exploratory Data Analysis

In [None]:
df.shape 

In [None]:
df.info()

In [None]:
print(f'Missing value = {df.isnull().sum()}') # => missing value 


In [None]:
print(f' Duplicated = {df.duplicated().sum()}') # =>duplicated


In [None]:
#columns name
df.columns

In [None]:
df.isna().sum()

In [None]:
#show NULL value
data_null = round(df.isna().sum() / df.shape[0] * 100, 2)
data_null.to_frame(name = 'percent NULL data (%)')

In [None]:
df.describe().T

In [None]:
# Distribution of CIBIL scores (column names still have a leading space at this point)
plt.figure(figsize=(10, 6))
plt.hist(df[' cibil_score'], bins=50, color='#B07AA1', edgecolor='white')
plt.title('CIBIL Score Distribution of Customers', fontsize=14, color='white')
plt.xlabel('cibil_score', color='white')
plt.ylabel('Frequency', color='white')
plt.grid(True, linestyle='--', alpha=0.3)
plt.show()

**Observation**
- This histogram shows an approximately uniform distribution of scores spanning 300 to 900.
- There is no pronounced peak (mode): the frequencies across bins are relatively consistent.
- The data points are spread quite evenly, with most bins containing between 70 and 105 observations.
- This pattern suggests that any value within the 300-900 range is roughly equally likely to occur.

In [None]:
# Target column analysis (column names still have a leading space at this point)
status_counts = df[' loan_status'].value_counts()
consistent_colors = ['#B07AA1', '#FF9DA7']
plt.figure(figsize=(10, 6))
explode = (0, 0.03)  # slightly offset the second slice
plt.pie(status_counts.values,
        labels=status_counts.index,
        colors=consistent_colors[:len(status_counts)],
        explode=explode,
        autopct="%1.2f%%",
        )
plt.title('Loan Status Distribution')
plt.legend()
plt.show()

**Observation**
- This pie chart shows a moderately imbalanced target variable.
- The "Approved" class is the clear majority, accounting for 62.22% of the instances. 
- Conversely, the "Rejected" class represents the minority, making up the remaining 37.78%. 
- This class imbalance is a crucial consideration for predictive modeling, as it could lead to a model that is biased towards the majority "Approved" outcome.
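
One way to see why this matters: a classifier that always predicts the majority class already scores the majority share as accuracy, while never identifying a single rejection. The sketch below uses the proportions from the pie chart scaled to a hypothetical 10,000 applications (the counts are illustrative, not the real dataset size).

```python
# Illustrative counts only: 62.22% / 37.78% scaled to a hypothetical 10,000 rows.
counts = {"Approved": 6222, "Rejected": 3778}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a "model" that predicts the majority class for every applicant.
baseline_accuracy = counts[majority_class] / total

# Recall on the minority class for that same model: it never predicts "Rejected",
# so it finds 0 of the rejected applications.
minority_recall = 0 / counts["Rejected"]

print(f"Majority class       : {majority_class}")
print(f"Baseline accuracy    : {baseline_accuracy:.4f}")
print(f"Minority-class recall: {minority_recall:.4f}")
```

Any trained model should be judged against this baseline, which is why the task asks for precision, recall, and F1-score rather than accuracy alone.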

In [None]:
# Select numerical columns
numerical = df.select_dtypes(include=['int64', 'float64']).columns

# Set seaborn theme
sns.set(style="whitegrid", palette="Spectral", font_scale=1.1)

# Define color palette (vivid & unique)
palette = sns.color_palette("Spectral", len(numerical))

# Plot
plt.figure(figsize=(16, 10))
for i, col in enumerate(numerical, 1):
    plt.subplot(4, 4, i)
    sns.boxplot(
        x=df[col],
        color=palette[i-1],
        width=0.6,
        fliersize=2,  
        linewidth=1.2
    )
    plt.title(f"Boxplot of {col}", fontsize=11, fontweight="bold", color="#333333")
    plt.xlabel("")  
plt.suptitle("Outlier Detection for Numerical Features", fontsize=16, fontweight="bold", color="#2c3e50")
plt.tight_layout(rect=[0, 0, 1, 0.96])  # space for suptitle
plt.show()


**Observations and Insights from Outlier Detection Plots**

* **Significant Right-Skewness in Asset Features**: The boxplots for `residential_assets_value`, `commercial_assets_value`, `luxury_assets_value`, and `bank_asset_value` all reveal **highly right-skewed distributions**. This is indicated by the medians being close to the bottom of the boxes and numerous data points lying far beyond the upper whisker, highlighting a significant presence of **high-value outliers**. This suggests that while most applicants have modest asset values, a small subset possesses exceptionally high-value assets.

* **Varied Distributions for Other Features**: The `cibil_score` boxplot shows no obvious outliers, which is consistent with the approximately uniform histogram above rather than with a strongly skewed distribution. Features like `income_annum` and `loan_term` also appear relatively **symmetrical** with no obvious outliers. The `loan_amount` shows a slight right skew.

* **Implications for Data Preprocessing**: The numerous outliers in the asset-related features can disproportionately influence scale-sensitive models such as logistic regression, while tree-based models are generally less affected. It is therefore worth applying a transformation before model building, such as a **log transformation** to reduce the skewness or **capping/winsorization** to limit the extreme values.
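
As a rough illustration of the two options named above, the sketch below applies a log transform and percentile capping to synthetic right-skewed values and compares the skewness before and after. The data is a log-normal sample standing in for an asset column, not the real dataset, and the 1st/99th percentile cut-offs are an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for an asset column (not real data).
rng = np.random.default_rng(42)
assets = pd.Series(rng.lognormal(mean=15, sigma=1.0, size=5_000), name="asset_value")

# Option 1: log transform. log1p computes log(1 + x); it is defined at x = 0
# but not for x <= -1, so negative values would need separate handling.
assets_log = np.log1p(assets)

# Option 2: capping (winsorization) at the 1st and 99th percentiles.
lower, upper = assets.quantile([0.01, 0.99])
assets_capped = assets.clip(lower=lower, upper=upper)

print(f"Skew (raw)   : {assets.skew():.2f}")
print(f"Skew (log1p) : {assets_log.skew():.2f}")
print(f"Skew (capped): {assets_capped.skew():.2f}")
```

The log transform changes the shape of the whole distribution, whereas capping only pulls in the tails, so the two are not interchangeable.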

In [None]:
# Distribution of all numerical columns
palette = sns.color_palette("husl", len(numerical))

plt.figure(figsize=(15, 8))

# One histogram with a KDE overlay per numerical column
for i, col in enumerate(numerical, 1):
    plt.subplot(4, 3, i)
    sns.histplot(df[col], kde=True, color=palette[i-1], bins=30)
    plt.title(f'Distribution of {col}')

plt.tight_layout()
plt.show()


**Observations and Insights from Distribution Plots**
* **Prevalence of Skewed Distributions**: These histograms confirm and provide more detail on the data's skewness. The distributions for all asset-related features (`residential`, `commercial`, `luxury`, `bank_asset_value`) and `loan_amount` are heavily **right-skewed**. This is evident from the concentration of data on the left side and a long tail extending to the right, which is typical for financial value data where most values are low and a few are exceptionally high. 💰

* **Confirmation of Other Distribution Shapes**: The plot for `cibil_score` is consistent with the earlier histogram: scores are spread fairly evenly across the 300-900 range with no strong skew. `income_annum` likewise exhibits a relatively **uniform distribution**, where different income levels within the range appear almost equally frequently. The `loan_id` is also uniformly distributed, as expected for an identifier.

* **Discrete vs. Continuous Data**: The plots clearly distinguish between continuous and discrete features. For discrete variables like `no_of_dependents` and `loan_term`, the bars are more informative than the Kernel Density Estimate (KDE) curve, which can be misleading. The `no_of_dependents` follows a discrete uniform distribution.

* **Modeling Implications**: The pronounced skew in key financial predictors reinforces the need for preprocessing. Transformations such as a **logarithm or square root** can reduce the skew of the right-skewed features. This tends to help linear models and other algorithms that are sensitive to feature scale, whereas tree-based models are largely unaffected by monotonic transformations. ⚙️

## Relation between Target column with Categorical columns

In [None]:
df.columns = df.columns.str.strip()

target = "loan_status"

# Categorical
categorical = ["education", "self_employed", "no_of_dependents"]
for col in categorical:
    plt.figure(figsize=(6,4))
    sns.countplot(x=col, hue=target, data=df, palette="Set2")
    plt.title(f"{col} vs {target}")
    plt.show()


🎓 Education vs Loan Status
- Loan approval rates are **almost identical** for Graduates (62.45%) and Non-Graduates (61.98%).  
- Education level **does not appear to be a strong differentiator** in loan approval decisions.  
- The two groups are **similar in size** (≈50% each), so the comparison is not distorted by a very small group.

---

💼 Self Employed vs Loan Status
- Approval rate is **virtually the same** for self-employed (62.23%) and non–self-employed (62.20%) applicants.  
- Employment type **does not significantly influence** loan approval.  
- The data is evenly distributed (≈50% Yes / No), so both approval rates are estimated from similarly sized groups.

---

👨‍👩‍👧 Number of Dependents vs Loan Status
- Applicants with **no dependents** have the **highest approval rate** (64.19%), which may reflect a lower financial burden.  
- As dependents increase, the approval rate **decreases slightly**, from 64.19% (0 dependents) to 60.33% (5 dependents).  
- Most applicants fall within 0-4 dependents, and the overall gap is under four percentage points, indicating a **mild negative association** between dependents and approval rather than a strong effect.

---

✅ **Overall Insight:**  
Loan approval seems **largely unaffected by education or employment type**, but a **slight decline with more dependents** indicates lenders may prefer applicants with fewer financial responsibilities.
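
The approval rates quoted above can be computed with a row-normalised cross-tabulation. This is a minimal sketch on a made-up frame: `toy` is hypothetical, and in the notebook the same call would use `df`. If the label values carry leading spaces like the column names did, they may need `.str.strip()` first.

```python
import pandas as pd

# Tiny made-up sample, used only to show the mechanics.
toy = pd.DataFrame({
    "education": ["Graduate"] * 4 + ["Not Graduate"] * 4,
    "loan_status": ["Approved", "Approved", "Approved", "Rejected",
                    "Approved", "Rejected", "Rejected", "Approved"],
})

# normalize="index" makes each row sum to 1, so the "Approved" column
# is the approval rate within that category.
rates = pd.crosstab(toy["education"], toy["loan_status"], normalize="index")
approval_pct = (rates["Approved"] * 100).round(2)

print(approval_pct)
```

Looping this over `education`, `self_employed`, and `no_of_dependents` would give one approval-rate table per count plot.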


## Relation between Target column with Numerical columns

In [None]:
numerical = [
    "income_annum", "loan_amount", "loan_term", "cibil_score",
    "residential_assets_value", "commercial_assets_value",
    "luxury_assets_value", "bank_asset_value"
]

fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(18, 10))
axes = axes.flatten()  

for i, col in enumerate(numerical):
    sns.boxplot(
        data=df, 
        x=target, 
        y=col, 
        ax=axes[i], 
        palette="coolwarm",
        fliersize=2, 
        linewidth=1.2
    )
    axes[i].set_title(f"{col} vs {target}", fontsize=11, fontweight="bold")
    axes[i].set_xlabel("")  
    axes[i].set_ylabel(col, fontsize=10)

plt.suptitle("Numerical Features vs Loan Status", fontsize=16, fontweight="bold", color="#2c3e50")
plt.tight_layout(rect=[0, 0, 1, 0.96])  
plt.show()

📊 **Numerical Features vs Loan Status**

💰 Income vs Loan Status
- The **median annual income (`income_annum`) is slightly higher** for **Approved** loans compared to Rejected loans.
- The **range and spread** of income are very similar for both groups, with Approved loans having slightly more high-end outliers.
- Income is **not a sharp differentiator**, but higher income applicants show a marginal advantage.

---

💵 Loan Amount vs Loan Status
- The **median loan amount (`loan_amount`) is lower** for **Approved** loans than for Rejected loans.
- This suggests that lenders may be **more cautious** or approve smaller loan amounts more readily, potentially viewing larger requests as riskier.

---

⏳ Loan Term vs Loan Status
- The **median loan term (`loan_term`) is slightly shorter** for **Approved** loans than for Rejected loans.
- This aligns with the loan amount insight, indicating a potential preference for **shorter-term, lower-risk** loans among approved applications.

---

💳 CIBIL Score vs Loan Status
- The **median CIBIL score (`cibil_score`) is substantially higher** for **Approved** loans (approx. 700) compared to Rejected loans (approx. 500).
- This feature shows the **clearest separation**, with a high CIBIL score being a **critical factor** for loan approval.

---

🏠 Asset Values vs Loan Status
- The **median values** for **Residential, Commercial, and Luxury assets** are **uniformly lower** for **Approved** loans compared to Rejected loans.
- This counter-intuitive trend may mean that applicants with **very high asset values** are also requesting **disproportionately large loans**, or that they differ in risk factors not captured here; the boxplots alone cannot confirm either explanation.
- The **Bank Assets (`bank_asset_value`)** distributions are **almost identical** for Approved and Rejected, suggesting they are **not a strong differentiator** in the decision.

---

✅ **Overall Insight:**
**CIBIL Score is the strongest predictor** of loan status, with approved applicants having a significantly better score. While approved loans tend to be **smaller in amount and shorter in term**, high asset values surprisingly show a **mild negative correlation** with approval, possibly indicating rejection for complex or overly large loan applications from high-net-worth individuals.
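
The median comparisons read off the boxplots can also be tabulated directly. This is a small sketch with invented values: the frame `toy` is hypothetical and does not reproduce the dataset's numbers.

```python
import pandas as pd

# Invented values, used only to show the groupby mechanics.
toy = pd.DataFrame({
    "loan_status": ["Approved", "Approved", "Approved",
                    "Rejected", "Rejected", "Rejected"],
    "cibil_score": [720, 680, 810, 450, 520, 390],
    "loan_term": [4, 8, 6, 12, 10, 14],
})

# Median of each numerical feature within each target class.
medians = toy.groupby("loan_status")[["cibil_score", "loan_term"]].median()

# Positive gap: the median is higher among approved applications.
gap = medians.loc["Approved"] - medians.loc["Rejected"]

print(medians)
print(gap)
```

Because the features live on very different scales, the raw gaps are not directly comparable across columns; dividing by each feature's IQR would be one way to put them on a common footing.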

# Outlier Handling
- We capped outliers (rather than dropping rows) so the model learns from realistic applicant patterns instead of being influenced by extreme values. 
- Outliers can distort feature scales, reduce model accuracy, and bias predictions, particularly for scale-sensitive models.
- We chose the IQR (Interquartile Range) method because it is simple, robust, and non-parametric: it does not assume any specific data distribution. 
- This suits financial data, where income or asset values often vary widely but shouldn't dominate the model.

In [None]:
# Cap (winsorize) outliers with the IQR rule; no rows are dropped.
# Note: the fences are computed on the full dataset, before the train/test split.
def cap_outliers_iqr(df, column):
    """Clip the values of `column` to [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    # Values outside the fences are replaced by the nearest fence
    df[column] = df[column].clip(lower, upper)
    return df


num_cols = ['no_of_dependents','income_annum','loan_amount','loan_term',
            'cibil_score','residential_assets_value',
            'commercial_assets_value','luxury_assets_value','bank_asset_value']

for col in num_cols:
    df = cap_outliers_iqr(df, col)
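
The capping rule is easy to check by hand on a small example. The sketch below also shows one possible variation that the notebook does not use: computing the fences on training rows only and reusing them on held-out rows, so the test data does not influence preprocessing. The helper `iqr_bounds` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) Tukey fences for a numeric Series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values: the training part contains one extreme point (1000).
train = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 1000.0])
test = pd.Series([11.0, 15.0, 500.0])

lower, upper = iqr_bounds(train)        # fences come from training data only
train_capped = train.clip(lower, upper)
test_capped = test.clip(lower, upper)   # same fences reused on held-out data

print(f"Fences: [{lower}, {upper}]")
print("Train after capping:", train_capped.tolist())
print("Test after capping :", test_capped.tolist())
```

Both series keep their original length: extreme values are replaced by the nearest fence rather than removed.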

In [None]:
# Drop the identifier column: it carries no predictive signal
df = df.drop(columns=['loan_id'])

# Encoding Categorical columns

- We used **Label Encoding** to convert the categorical variables `education`, `self_employed`, and `loan_status` into numerical form for model training.  
- Most scikit-learn estimators **cannot process text labels directly**, so encoding lets them treat categories as numeric values.  
- We chose **Label Encoding** because these features have **only two categories (binary)**, making it a **simple and compact** alternative to One-Hot Encoding; with more than two unordered categories, integer codes would impose an artificial order.  
- It helps keep the dataset compact and ready for ML models that expect numerical input.


In [None]:
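# Encode binary categorical columns as integers (0/1).
# LabelEncoder assigns codes in sorted label order and is refit for each column,
# so `le` only remembers the mapping of the last column it was fitted on.
le = LabelEncoder()
df['education'] = le.fit_transform(df['education'])
df['self_employed'] = le.fit_transform(df['self_employed'])
df['loan_status'] = le.fit_transform(df['loan_status'])  # => target (sorted order: Approved -> 0, Rejected -> 1)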
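
`LabelEncoder` sorts the distinct labels and encodes each one as its position in that sorted list. The small sketch below uses stand-in labels, assuming the target contains only the two strings shown in the pie chart, to make the resulting mapping explicit. This matters later because scikit-learn's `precision_score`, `recall_score`, and `f1_score` treat the class encoded as 1 as the positive class by default. The encoder name `le_status` is hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Stand-in labels; the real column may carry leading whitespace,
# which does not change the relative sort order of these two values.
labels = ["Rejected", "Approved", "Approved", "Rejected", "Approved"]

le_status = LabelEncoder()
encoded = le_status.fit_transform(labels)

# classes_ is sorted, and each class is encoded as its index in that array.
mapping = {cls: int(code) for code, cls in enumerate(le_status.classes_)}

print(mapping)
print(le_status.inverse_transform([0, 1]))
```

Keeping a dedicated encoder for the target preserves `inverse_transform`, which is handy for turning predictions back into readable labels.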


# Split Data into Train & Test

In [None]:
X = df.drop(columns=['loan_status'])
y = df['loan_status']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
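
`stratify=y` keeps the class proportions nearly identical in both splits, which matters when one class is smaller. A quick check on synthetic labels follows; the arrays `X_demo` and `y_demo` are made up, with a 62/38 split that mirrors the pie chart.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same class split as the pie chart (illustrative only).
y_demo = np.array([0] * 622 + [1] * 378)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Share of class 1 overall : {y_demo.mean():.3f}")
print(f"Share of class 1 in train: {y_tr.mean():.3f}")
print(f"Share of class 1 in test : {y_te.mean():.3f}")
```

Without `stratify`, the class share in the test set can drift by chance, which makes minority-class metrics noisier.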

## Random Forest model

In [None]:
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluation 

In [None]:
# 🎯 Predict on test data
y_pred = model.predict(X_test)

# 📊 Evaluate model performance
# Precision, recall and F1 use the default pos_label=1, i.e. the class encoded as 1
print(f"Accuracy  : {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision : {precision_score(y_test, y_pred):.4f}")
print(f"Recall    : {recall_score(y_test, y_pred):.4f}")
print(f"F1-Score  : {f1_score(y_test, y_pred):.4f}")

# 🔍 Detailed metrics per class
print("\nClassification Report:\n", classification_report(y_test, y_pred))





**Model Performance:**
- ✅ **Accuracy:** 98.2% — predicts loan approvals very well.
- 🎯 **Precision & Recall:** High for both approved and rejected loans. Recall for approved loans is slightly lower (0.96), meaning a few approvals are missed but false approvals are rare.
- ⚖️ **F1-Score:** 0.98–0.99, indicating balanced performance across classes.

**Business Insight:**
- The model is **conservative**, minimizing risky loan approvals.

**Next Steps:**
- Analyze misclassified approvals (false negatives) to further improve recall.


In [None]:

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Plot heatmap
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix - Loan Approval Model')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
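
The reported precision, recall, and F1 can all be derived from the four cells of the confusion matrix. The sketch below uses made-up counts (not the model's actual results), laid out the way `confusion_matrix` returns them: rows are actual classes and columns are predicted classes, in label order `[0, 1]`.

```python
# Hypothetical counts, NOT the notebook's results.
cm_demo = [[50, 10],   # actual 0: 50 predicted as 0, 10 predicted as 1
           [5, 35]]    # actual 1:  5 predicted as 0, 35 predicted as 1

tn, fp = cm_demo[0]
fn, tp = cm_demo[1]

precision = tp / (tp + fp)   # of everything predicted 1, how much was truly 1
recall = tp / (tp + fn)      # of everything truly 1, how much was found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Precision: {precision:.4f}")
print(f"Recall   : {recall:.4f}")
print(f"F1-score : {f1:.4f}")
print(f"Accuracy : {accuracy:.4f}")
```

These formulas describe the class encoded as 1; swapping the roles of the two rows and columns gives the metrics for the other class, as listed per class in `classification_report`.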

In [None]:
# Feature importance as a DataFrame
feat_imp = pd.DataFrame({'Feature': X_train.columns, 'Importance': model.feature_importances_})
feat_imp = feat_imp.sort_values(by='Importance', ascending=False)

# Plot using seaborn
plt.figure(figsize=(10,6))
sns.barplot(x='Importance', y='Feature', data=feat_imp, palette='viridis')
plt.title('Random Forest Feature Importance')
plt.show()


🌲 **Random Forest Feature Importance Insights**

- 🥇 **Top Predictor:** `cibil_score` dominates the model, contributing ~80% of total importance.  
- 🔹 **Moderate Predictors:** `loan_term` and `loan_amount` have some impact but are far behind `cibil_score`.  
- 📉 **Low-Impact Features:** Asset-related features (`luxury_assets_value`, `commercial_assets_value`, etc.) and categorical features (`self_employed`, `education`) contribute very little.  
- ✅ **Key Insight:** Model decisions rely heavily on `cibil_score`; other features have minimal effect, suggesting opportunities for **feature selection** or model simplification.
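
One simple way to act on this, assuming a cumulative-importance cut-off is acceptable, is to keep the top features until their importances add up to a chosen threshold. The importances below are hypothetical values shaped loosely like the plot, and the 0.90 threshold is arbitrary.

```python
import pandas as pd

# Hypothetical importances (they sum to 1); not the model's actual values.
imp = pd.Series({
    "cibil_score": 0.80, "loan_term": 0.06, "loan_amount": 0.035,
    "income_annum": 0.025, "luxury_assets_value": 0.02,
    "residential_assets_value": 0.018, "commercial_assets_value": 0.016,
    "bank_asset_value": 0.012, "no_of_dependents": 0.008,
    "self_employed": 0.004, "education": 0.002,
})

threshold = 0.90
imp_sorted = imp.sort_values(ascending=False)
cumulative = imp_sorted.cumsum()

# Keep the shortest prefix whose cumulative importance reaches the threshold.
n_keep = int((cumulative < threshold).sum()) + 1
selected = imp_sorted.index[:n_keep].tolist()

print(f"Keeping {n_keep} of {len(imp)} features: {selected}")
```

Impurity-based importances can be biased toward high-cardinality numerical features, so any reduced feature set should be re-evaluated on held-out data before it is adopted.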


# Class Imbalance
- ⚖️ **Balance Classes:** When one class dominates (e.g., approved loans), the model may become biased toward it. SMOTE balances the dataset by generating synthetic samples for the minority class.  
- 🔄 **Improve Model Performance:** Balancing helps the model better learn patterns of the minority class, improving recall and F1-score.  
- 🧪 **Prevent Data Bias:** Using SMOTE on training data only avoids leakage and ensures fair learning.  
- 🧐 **Better Predictions:** After resampling, the model sees as many minority-class as majority-class examples, which can improve minority-class recall, sometimes at the cost of some precision.  
- ✅ **Reproducibility:** Setting `random_state` ensures consistent results across runs.


In [None]:
from imblearn.over_sampling import SMOTE

# 🎯 Tackling Imbalanced Data:
# When one class (e.g., 'Approved' loans) has many more samples than the other ('Rejected' loans), 
# our model might become biased and struggle to predict the minority class correctly.

# 🧪 Initialize the SMOTE (Synthetic Minority Over-sampling Technique)
# SMOTE creates *synthetic* new samples for the minority class to balance the dataset.
# setting random_state ensures we get the same results every time (reproducibility!).
sm = SMOTE(random_state=42)

# 🔄 Apply SMOTE to the training data
# We fit and resample the data in one step. SMOTE only uses the training data 
# (X_train, y_train) to prevent data leakage from the test set.
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)

# 🧐 Check the results (the crucial step!)
# Before SMOTE: the original imbalance (one class has noticeably more rows than the other).
print("Before SMOTE:", y_train.value_counts().to_dict())

# After SMOTE: both classes have the same number of training rows.
# This balanced data (X_resampled, y_resampled) is intended for training the model that follows.
print("After SMOTE:", y_resampled.value_counts().to_dict())
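
To make "synthetic samples" concrete, here is a simplified, NumPy-only sketch of the interpolation idea behind SMOTE. It is not the `imblearn` implementation; the function name `smote_like_sample`, its parameters, and the sample points are hypothetical and chosen for illustration.

```python
import numpy as np

def smote_like_sample(X_min, k=3, n_new=5, seed=0):
    """Create n_new synthetic points, each interpolated between a random minority
    point and one of its k nearest minority neighbours (simplified SMOTE idea)."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))          # pick a minority sample
        neigh = rng.choice(neighbours[base])     # pick one of its neighbours
        gap = rng.random()                       # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[neigh] - X_min[base])
    return synthetic

X_minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
X_new = smote_like_sample(X_minority)

print(X_new)
```

Every synthetic point lies on a line segment between two existing minority samples, which is why SMOTE adds no information from outside the region the minority class already occupies.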

# Logistic Regression

In [None]:
# Logistic Regression, trained on the SMOTE-balanced training data
log_reg = LogisticRegression(max_iter=1000, random_state=42)
log_reg.fit(X_resampled, y_resampled)
y_pred_log = log_reg.predict(X_test)  # the test set is left untouched

print("=== Logistic Regression ===")
print("Accuracy:", accuracy_score(y_test, y_pred_log))
print(classification_report(y_test, y_pred_log))

🔍 Model Comparison: Logistic Regression vs Random Forest

| Metric                | Logistic Regression (After SMOTE) | Random Forest |
|-----------------------|---------------------------------|---------------|
| **Accuracy**          | 81%                              | 98.2%         |
| **Precision (Approved)** | 0.83                            | 0.99          |
| **Recall (Approved)**    | 0.63                            | 0.96          |
| **F1-Score (Approved)**  | 0.71                            | 0.98          |
| **Precision & Recall (Rejected)** | 0.80 / 0.92              | 0.98 / 0.99   |

### 🔹 Insights
- **Random Forest clearly outperforms Logistic Regression** across all metrics, especially for predicting approved loans.  
- Logistic Regression struggles with **recall for approved loans** (0.63), even after SMOTE, whereas Random Forest maintains high recall (0.96).  
- **Business Impact:** RF is safer for loan decisions, minimizing risky false approvals while accurately identifying approvals.  
- **Next Steps:** Logistic Regression may still be useful for interpretability, but for best predictive performance, Random Forest is preferred.
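
The task's bonus item suggests comparing logistic regression with a decision tree, and `StandardScaler` is imported at the top but not applied. The following sketch shows how such a comparison could be set up, with scaling wrapped in a pipeline so that it is fitted on training rows only. It runs on synthetic data rather than the loan dataset, so its scores say nothing about the results above; all names ending in `_demo` and the hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with roughly a 62/38 class split (not the loan dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.62, 0.38], random_state=42,
)
X_demo[:, 0] *= 1_000_000  # mimic one feature on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

candidates = {
    # Scaling lives inside the pipeline, so it is fitted on training rows only.
    "logreg (scaled)": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, clf.predict(X_te))  # F1 for the class labelled 1
    print(f"{name:16s} F1 = {scores[name]:.3f}")
```

Evaluating every candidate on the same held-out split, with the same metric, keeps the comparison like-for-like.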


# 🎯 Conclusion & Key Takeaways

- ✅ **Random Forest is the best performer** for this dataset, achieving 98.2% accuracy and strong precision & recall across both approved and rejected loans.  
- 💡 **Feature Insights:** `cibil_score` is the dominant predictor, while loan amount and term contribute moderately; demographic and asset-related features have minimal impact.  
- ⚖️ **Imbalanced Data Handling:** SMOTE was used to oversample the minority class (rejected loans) in the training data for Logistic Regression; Random Forest, trained on the original data, still outperforms it.  
- 🔍 **Business Perspective:** The model is conservative in approving loans, minimizing risk and supporting data-driven decision-making.  
- 🚀 **Next Steps:** Further improvements could include hyperparameter tuning, trying the decision tree suggested in the task, exploring other ensemble methods such as gradient boosting, or adding new features.

