<a href="https://colab.research.google.com/github/AaryanPriyadarshi/Paisabazaar-Project/blob/main/Paisabazaar_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Paisabazaar Project



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Aaryan Priyadarshi

# **Project Summary -**

The financial sector relies heavily on accurate risk assessment to ensure that loans and credit are issued responsibly. As digital lending platforms like Paisabazaar expand, the challenge of evaluating customer creditworthiness has become increasingly critical. One of the key indicators of financial health is the credit score, which reflects a customer’s ability to manage debt, repay loans on time, and maintain financial stability. This project focuses on performing an Exploratory Data Analysis (EDA) on a large dataset of customers to understand the factors that influence credit scores and identify patterns that can help improve loan approval decisions while reducing the risk of fraud or default.

The dataset provided contains approximately 100,000 records with 28 features. These features include demographic attributes such as Age and Occupation, financial details like Annual Income, Monthly Inhand Salary, and Number of Loans, and behavioral indicators such as Credit Utilization Ratio, Outstanding Debt, and Delayed Payments. The target variable is the Credit Score, classified into three categories: Good, Standard, and Poor. To prepare the data for analysis, irrelevant identifier fields such as ID, Name, SSN, and Customer_ID were removed. In addition, placeholder values such as "No Data" in the loan type column were treated as missing values. Fortunately, the dataset did not contain significant null values, making preprocessing relatively straightforward.

The primary objective of this analysis was to uncover meaningful insights about the relationship between customer attributes and credit scores. Several visual and statistical techniques were applied to achieve this. First, the distribution of credit scores revealed that the majority of customers fell under the Standard category, with more Poor scorers than Good. This highlights a potential financial risk, as a large portion of customers are either average or below-average in terms of credit health.

Further analysis of age distribution showed that most customers were between 20 and 45 years old, with relatively fewer older clients. This suggests that credit risk is concentrated in younger to middle-aged groups, who are often in the process of building or stabilizing their financial lives.

The relationship between annual income and credit score revealed a clear trend: individuals with higher incomes were more likely to have Good credit scores, while those with lower incomes were disproportionately represented in the Poor category. This finding emphasizes the importance of income stability in maintaining financial credibility.

A correlation heatmap provided deeper insights into how numeric features interact. Strong positive correlations were observed among income, salary, and monthly balance, while negative associations emerged between credit score and high credit utilization, outstanding debt, and delayed payments. These indicators are particularly important for lenders, as they directly reflect repayment behavior and financial discipline.

Finally, the occupation versus credit score analysis highlighted profession-based trends. Stable and high-paying professions such as Engineers, Lawyers, and Doctors were more likely to have Good credit scores. In contrast, individuals in less stable or irregular-income professions, including Writers, Musicians, and Mechanics, tended toward Standard or Poor scores.

In conclusion, this project demonstrates how exploratory data analysis can uncover valuable insights into customer behavior and credit health. The findings suggest that income stability, repayment discipline, and credit utilization are the most significant factors influencing credit scores. By leveraging these insights, Paisabazaar can refine its credit evaluation processes, reduce fraud risks, and provide tailored financial solutions to customers. This project not only highlights the importance of data-driven decision-making in financial services but also lays the groundwork for predictive modeling in future phases.

# **GitHub Link -**

https://github.com/AaryanPriyadarshi/Paisabazaar-Project

# **Problem Statement**


**The core problem addressed in this project is to understand the factors that influence credit score categories (Good, Standard, Poor) by performing an Exploratory Data Analysis (EDA) on customer data. By examining patterns in features such as income, salary, occupation, delayed payments, outstanding debt, and credit utilization ratio, this project aims to identify which variables most strongly affect credit health.**

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
*   Will the gained insights help create a positive business impact? Are there any insights that lead to negative growth? Justify with a specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.


# ***Let's Begin !***

## ***Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# Data handling and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics, preprocessing, models, and metrics
from scipy import stats
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)
from xgboost import XGBClassifier

# Mount Google Drive (this notebook is written for Google Colab)
from google.colab import drive
drive.mount('/content/drive')

# Suppress library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset

df = pd.read_csv('/content/drive/MyDrive/Paisabazaar/dataset-2.csv')


### Dataset First View

In [None]:
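# Basic information and missing values
# A bare `df.shape` is only displayed when it is the last expression in a cell, so print it
print("Shape (rows, columns):", df.shape)

df.info()
print("\nMissing Values:")
print(df.isnull().sum())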


The dataset contains customer demographic and financial details such as:
- **Age, Occupation, Annual Income, Monthly Salary**  
- **Outstanding Debt, Number of Delayed Payments, Credit Utilization Ratio**  
- **Type of Loan, Number of Loans, Monthly Balance**  
- **Credit Score** (target variable: Good, Standard, Poor)  

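### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

# Summary statistics for the numeric columns
df.describe()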

Data Dictionary

| Column Name              | Description |
|---------------------------|-------------|
| Age                      | Age of the customer |
| Occupation               | Customer's profession |
| Annual_Income            | Total annual income |
| Monthly_Inhand_Salary    | Average monthly take-home salary |
| Num_Bank_Accounts        | Number of active bank accounts |
| Num_Credit_Card          | Number of credit cards held |
| Interest_Rate            | Average interest rate on loans |
| Num_of_Loan              | Number of loans taken |
| Type_of_Loan             | Types of loans held (Home, Auto, etc.) |
| Delay_from_due_date      | Average payment delay in days |
| Num_of_Delayed_Payment   | Number of delayed payments |
| Outstanding_Debt         | Total outstanding debt |
| Credit_Utilization_Ratio | Ratio of used credit to available credit |
| Monthly_Balance          | Average monthly account balance |
| Credit_Score             | Target variable: Good / Standard / Poor |
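
A quick schema check can confirm that the columns listed in the data dictionary above are actually present before the analysis starts. The sketch below runs on a made-up two-column frame; the helper name `missing_columns` is an assumption introduced for illustration and is not part of the project code.

```python
import pandas as pd

# Column names taken from the data dictionary above
EXPECTED_COLUMNS = [
    "Age", "Occupation", "Annual_Income", "Monthly_Inhand_Salary",
    "Num_Bank_Accounts", "Num_Credit_Card", "Interest_Rate", "Num_of_Loan",
    "Type_of_Loan", "Delay_from_due_date", "Num_of_Delayed_Payment",
    "Outstanding_Debt", "Credit_Utilization_Ratio", "Monthly_Balance",
    "Credit_Score",
]

def missing_columns(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from `frame` (hypothetical helper)."""
    return [col for col in expected if col not in frame.columns]

# Hypothetical frame with only two of the expected columns
demo = pd.DataFrame({"Age": [30], "Credit_Score": ["Good"]})
absent = missing_columns(demo)
print(f"{len(absent)} expected columns are missing")
```

On the real dataset, calling `missing_columns(df)` right after loading would be expected to return an empty list if the file matches the dictionary.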

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(df.isnull().sum())

# Drop identifier columns that carry no predictive value
df = df.drop(columns=["ID", "SSN", "Name", "Customer_ID"], errors="ignore")

# Treat the "No Data" placeholder in Type_of_Loan as missing, then forward-fill the gaps
# (`fillna(method="ffill")` is deprecated in recent pandas releases; `ffill()` is the replacement)
df["Type_of_Loan"] = df["Type_of_Loan"].replace("No Data", pd.NA)
df = df.ffill()


# Feature Engineering
# Adding 'Quarter' derived from 'Month' to capture seasonal spending or behavioural trends.
# The mapping assumes 'Month' holds integers 1-12.

if 'Month' in df.columns:
    quarter_map = {m: f"Q{(m - 1) // 3 + 1}" for m in range(1, 13)}
    df['Quarter'] = df['Month'].map(quarter_map)

Handling Missing / Irrelevant Data

Steps taken:
- Dropped identifier columns such as **ID, SSN, Customer_ID, and Name**, as they do not add predictive value.  
- Replaced placeholder entries like "No Data" in "Type_of_Loan" with NaN values.  
- Checked for missing values in all columns using `.isnull().sum()`.  
- Forward-filled the remaining missing values (each gap takes the value from the previous row) to keep the dataset consistent.
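
The cleaning steps above can be sketched on a tiny, made-up frame. The helper name `clean_credit_frame` and the sample rows are illustrative assumptions, not part of the project data.

```python
import pandas as pd

def clean_credit_frame(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop identifier columns, turn the 'No Data' placeholder into NaN, then forward-fill."""
    out = frame.drop(columns=["ID", "SSN", "Name", "Customer_ID"], errors="ignore")
    if "Type_of_Loan" in out.columns:
        out["Type_of_Loan"] = out["Type_of_Loan"].replace("No Data", pd.NA)
    return out.ffill()

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["A", "B", "C"],
    "Type_of_Loan": ["Home Loan", "No Data", "Auto Loan"],
    "Age": [25, 31, 40],
})
cleaned = clean_credit_frame(sample)
print(cleaned)
```

Forward-fill copies the previous row's value, so a placeholder in the very first row would stay missing, and a filled value may come from a different customer if the rows are not grouped by customer.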

## ***Data Visualization***

### Credit Score Distribution

- Most customers fall in the Standard category.
- There are more Poor scores than Good scores, meaning many customers are financially at risk.

In [None]:
sns.countplot(data=df, x="Credit_Score", palette="Set2")
plt.title("Distribution of Credit Score")
plt.show()
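
The count plot shows the ordering of the classes; the exact shares can be read off with `value_counts(normalize=True)`. A minimal sketch, using made-up labels in place of `df["Credit_Score"]`:

```python
import pandas as pd

# Hypothetical labels standing in for df["Credit_Score"]
scores = pd.Series(
    ["Standard", "Poor", "Standard", "Good", "Poor", "Standard"],
    name="Credit_Score",
)

# Share of each class as a percentage, rounded to one decimal place
class_share = scores.value_counts(normalize=True).mul(100).round(1)
print(class_share)
```

Knowing these shares matters later for modeling, because with imbalanced classes a high accuracy can be reached by favouring the majority class.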

### Age Distribution

- Customers are mostly between 20 and 45 years old.
- There are very few older customers, so credit risk is concentrated among younger to middle-aged individuals.

In [None]:
sns.histplot(df["Age"], bins=30, kde=True, color="skyblue")
plt.title("Distribution of Age")
plt.show()

### Annual Income by Credit Score

- Good credit scorers generally earn higher annual incomes.
- Poor scorers have noticeably lower income, with many outliers.
- This suggests income is positively related to credit health.

In [None]:
sns.boxplot(data=df, x="Credit_Score", y="Annual_Income", palette="Set3")
plt.title("Annual Income by Credit Score")
plt.show()
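
Because the box plot shows many outliers, a per-group median is a more robust summary than the mean. The sketch below uses invented incomes (including one deliberate outlier) to show how the two can diverge; on the real data the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Hypothetical incomes; the last Poor value is a deliberate outlier
toy = pd.DataFrame({
    "Credit_Score": ["Good", "Good", "Standard", "Standard", "Poor", "Poor", "Poor"],
    "Annual_Income": [90000, 110000, 60000, 70000, 30000, 40000, 500000],
})

# Median, mean, and group size of income for each credit score class
income_summary = toy.groupby("Credit_Score")["Annual_Income"].agg(["median", "mean", "count"])
print(income_summary)
```

In this toy example the Poor group has the lowest median but the highest mean, which is why the median is the safer number to quote alongside the box plot.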


### Correlation Heatmap

- Annual Income, Monthly Inhand Salary, and Monthly Balance are strongly correlated.
- Credit Utilization Ratio is negatively correlated with credit health → higher utilization increases risk.
- Outstanding Debt and Num of Delayed Payments also align with poorer scores.

In [None]:
corr = df.select_dtypes(include=["float64","int64"]).corr()
plt.figure(figsize=(12,8))
sns.heatmap(corr, cmap="coolwarm", annot=False)
plt.title("Correlation Heatmap of Numeric Features")
plt.show()
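
The heatmap only covers numeric columns, so a text-valued `Credit_Score` column does not appear in it. One way to relate numeric features to the score is to map the categories to an ordered rank and compute a Spearman correlation, which equals the Pearson correlation of the ranks. The sketch below does this on invented values; the ordering Poor < Standard < Good is an assumption about how the categories should be ranked.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "Credit_Score": ["Poor", "Poor", "Standard", "Standard", "Good", "Good"],
    "Outstanding_Debt": [4000, 3500, 2000, 1800, 600, 400],
    "Annual_Income": [20000, 25000, 40000, 45000, 80000, 90000],
})

# Assumed ordering of the categories: Poor < Standard < Good
score_rank = toy["Credit_Score"].map({"Poor": 0, "Standard": 1, "Good": 2})

# Spearman correlation = Pearson correlation computed on ranks
numeric_ranks = toy[["Outstanding_Debt", "Annual_Income"]].rank()
rank_corr = numeric_ranks.corrwith(score_rank.rank())
print(rank_corr)
```

A negative value for a feature means that higher values of it go with poorer scores, which is the pattern the bullets above describe for debt and delayed payments.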


### Occupation vs Credit Score

- Some professions (Engineer, Lawyer, Doctor) show better credit scores.
- Jobs with irregular income (Writer, Mechanic, Musician) lean toward Poor or Standard scores.

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(data=df, x="Occupation", hue="Credit_Score")
plt.title("Occupation vs Credit Score")
plt.xticks(rotation=45)
plt.show()
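
Raw counts per occupation depend on how many customers each occupation has, so comparing professions is easier with row-normalized shares. A minimal sketch with made-up rows, using `pd.crosstab(..., normalize="index")`:

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "Occupation": ["Engineer", "Engineer", "Engineer", "Engineer", "Writer", "Writer"],
    "Credit_Score": ["Good", "Good", "Standard", "Poor", "Poor", "Standard"],
})

# Each row sums to 1: the share of Good / Poor / Standard within an occupation
share_by_occupation = pd.crosstab(toy["Occupation"], toy["Credit_Score"], normalize="index")
print(share_by_occupation.round(2))
```

With shares instead of counts, an occupation with few customers can be compared directly against a large one.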


### Final Insight:
Accurate classification of credit scores depends strongly on **income stability, repayment history, and credit utilization patterns**.  
This analysis can help Paisabazaar improve its **loan approval process**, reduce defaults, and give **personalized financial recommendations**.

## ***Hypothesis Testing***

H₀ (Null): Mean annual income is the same across Good, Standard, and Poor credit scores.

H₁ (Alternative): At least one credit score group has a different mean annual income.

In [None]:
# Group incomes by Credit Score
income_good = df[df["Credit_Score"] == "Good"]["Annual_Income"].dropna()
income_standard = df[df["Credit_Score"] == "Standard"]["Annual_Income"].dropna()
income_poor = df[df["Credit_Score"] == "Poor"]["Annual_Income"].dropna()

# Perform one-way ANOVA
f_stat, p_value = stats.f_oneway(income_good, income_standard, income_poor)

print("F-statistic:", f_stat)
print("p-value:", p_value)

# Interpretation
alpha = 0.05
if p_value <= alpha:
    print("Reject Null Hypothesis: Income differs significantly across credit score groups.")
else:
    print("Fail to Reject Null Hypothesis: No significant difference in income across groups.")
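
With a dataset of roughly 100,000 rows, even small differences in mean income can produce a very small p-value, so it helps to report an effect size next to the ANOVA result. The sketch below computes eta squared (between-group sum of squares divided by total sum of squares) on made-up numbers; the function name `eta_squared` is a hypothetical helper, not part of SciPy.

```python
import numpy as np

def eta_squared(*groups):
    """Share of total variance explained by group membership (SS_between / SS_total)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Between-group sum of squares: group size times squared distance of group mean from grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Total sum of squares around the grand mean
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Hypothetical incomes (in thousands) for the three credit score groups
good = [90, 100, 110]
standard = [60, 70, 80]
poor = [30, 40, 50]

effect = eta_squared(good, standard, poor)
print(f"eta squared: {effect:.2f}")
```

On the project data the same function could be called as `eta_squared(income_good, income_standard, income_poor)`; values near 0 indicate that group membership explains little of the income variance, even when the p-value is small.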


## ***Feature Engineering & Data Pre-processing***

In [None]:
# Reloading the dataset
df = pd.read_csv('/content/drive/MyDrive/Paisabazaar/dataset-2.csv')


# Encode categorical variables
# Label Encoding converts categorical data into numerical format for ML model compatibility
cat_cols = ['Occupation', 'Credit_Mix', 'Payment_of_Min_Amount', 'Payment_Behaviour']
le = LabelEncoder()
for col in cat_cols:
    # Check if the column exists in the DataFrame
    if col in df.columns:
        df[col] = le.fit_transform(df[col])
    else:
        print(f"Warning: Column '{col}' not found in DataFrame.")

# Encode 'Quarter' if it exists. The reload above discards the earlier derived column,
# so this branch only runs if the CSV itself contains 'Quarter'.
# Assign back to the column; assigning to `df` would replace the whole DataFrame with an array.
if 'Quarter' in df.columns:
    df['Quarter'] = le.fit_transform(df['Quarter'])

# Dropping irrelevant data and setting 'Credit_Score' as the target variable
X = df.drop(['ID', 'Customer_ID', 'Name', 'SSN', 'Credit_Score', 'Type_of_Loan'], axis=1)
y = le.fit_transform(df['Credit_Score'])  # classes are encoded in alphabetical order


# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling
# Tree-based models such as Random Forest and XGBoost are largely insensitive to feature
# scaling; StandardScaler is applied here to keep the inputs on a common scale.
# The scaler is fit on the training split only, so no test-set statistics leak into training.

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In this step, we get the dataset ready for modeling:  

- Reload the dataset into a DataFrame.  
- Convert text columns (`Occupation`, `Credit_Mix`, etc.) into numbers using Label Encoding.  
- Drop irrelevant columns (`ID`, `Customer_ID`, `Name`, `SSN`, etc.) that don’t help in prediction.  
- Set **Credit Score** as our target variable (`y`).  
- Split the data into training (80%) and testing (20%) sets.  
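- Scale the features using **StandardScaler**, fit on the training split only. Tree-based models such as Random Forest and XGBoost do not require scaling, but it keeps all inputs on a common scale.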
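
When a single `LabelEncoder` is refit on column after column, only the mapping of the last column survives. Keeping one encoder per column makes it possible to translate encoded values back to their original labels, for example when reading a confusion matrix. The sketch below shows this on made-up rows; the `encoders` dictionary is an illustrative choice, not the notebook's approach.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "Occupation": ["Engineer", "Writer", "Doctor", "Writer"],
    "Credit_Score": ["Good", "Poor", "Standard", "Poor"],
})

# One fitted encoder per column, so every mapping stays available
encoders = {}
for col in ["Occupation", "Credit_Score"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# classes_ lists the original labels in the order of their integer codes
label_names = list(encoders["Credit_Score"].classes_)
decoded = encoders["Credit_Score"].inverse_transform(toy["Credit_Score"])
print(label_names, list(decoded))
```

`LabelEncoder` assigns codes in sorted order, so here Good, Poor, and Standard become 0, 1, and 2. The codes are identifiers only and do not express a ranking of credit quality.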


## ***ML Model Implementation***

In [None]:
# Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=200, random_state=42)
rf_clf.fit(X_train_scaled, y_train)
y_pred_rf = rf_clf.predict(X_test_scaled)

print("Random Forest Results")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf))

# Confusion Matrix Visualization
plt.figure(figsize=(6,4))
sns.heatmap(confusion_matrix(y_test, y_pred_rf), annot=True, fmt='d', cmap='Blues')
plt.title("Random Forest Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


# XGBoost Classifier
# `use_label_encoder` is left out: it is deprecated in recent XGBoost releases and has no effect there
xgb_clf = XGBClassifier(n_estimators=300, learning_rate=0.1, random_state=42,
                        eval_metric='mlogloss')
xgb_clf.fit(X_train_scaled, y_train)
y_pred_xgb = xgb_clf.predict(X_test_scaled)

print("\n XGBoost Results")
print("Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("\nClassification Report:\n", classification_report(y_test, y_pred_xgb))

# Confusion Matrix Visualization
plt.figure(figsize=(6,4))
sns.heatmap(confusion_matrix(y_test, y_pred_xgb), annot=True, fmt='d', cmap='Greens')
plt.title("XGBoost Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# Compare Evaluation Metrics

# Collect metrics
results = {
    "Model": ["Random Forest", "XGBoost"],
    "Accuracy": [
        accuracy_score(y_test, y_pred_rf),
        accuracy_score(y_test, y_pred_xgb)
    ],
    "Precision": [
        precision_score(y_test, y_pred_rf, average="weighted"),
        precision_score(y_test, y_pred_xgb, average="weighted")
    ],
    "Recall": [
        recall_score(y_test, y_pred_rf, average="weighted"),
        recall_score(y_test, y_pred_xgb, average="weighted")
    ],
    "F1-Score": [
        f1_score(y_test, y_pred_rf, average="weighted"),
        f1_score(y_test, y_pred_xgb, average="weighted")
    ]
}

# Convert to DataFrame
metrics_df = pd.DataFrame(results)
display(metrics_df)

# Plot comparison
metrics_df.set_index("Model")[["Accuracy","Precision","Recall","F1-Score"]].plot(
    kind="bar", figsize=(8,6), colormap="Set2", rot=0
)
plt.title("Model Performance Comparison")
plt.ylabel("Score")
plt.ylim(0,1)
plt.legend(loc="lower right")
plt.show()

# Feature Importance

# Random Forest Feature Importance
importances_rf = rf_clf.feature_importances_
features = X.columns

plt.figure(figsize=(10,5))
sns.barplot(x=importances_rf, y=features, palette="Blues_r")
plt.title("Random Forest - Feature Importance")
plt.xlabel("Importance")
plt.ylabel("Features")
plt.show()

# XGBoost Feature Importance
importances_xgb = xgb_clf.feature_importances_

plt.figure(figsize=(10,5))
sns.barplot(x=importances_xgb, y=features, palette="Greens_r")
plt.title("XGBoost - Feature Importance")
plt.xlabel("Importance")
plt.ylabel("Features")
plt.show()




###  ML Model Implementation  

I implemented two machine learning models to predict **Credit Score**:  

1. **Random Forest Classifier**  
   - An ensemble method that builds multiple decision trees and combines their predictions.  
   - Helps reduce overfitting and improves accuracy.  

2. **XGBoost Classifier**  
   - A boosting algorithm that builds models sequentially, correcting errors step by step.  
   - Known for strong performance in classification problems.  

 For each model, I:  
- Trained on the **training set (80%)**.  
- Tested on the **test set (20%)**.  
- Evaluated with **Accuracy, Classification Report, and Confusion Matrix**.  
- Used **heatmaps** to visualize classification performance.  
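
A single 80/20 split gives one estimate of performance. Stratified k-fold cross-validation, which the guidelines also ask for, averages over several splits while keeping the class proportions in each fold. The sketch below runs on synthetic data from `make_classification`, not on the project dataset, and the parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic three-class data standing in for the scaled training features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Five folds, each preserving the class proportions of y_demo
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# One weighted F1 score per fold
cv_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_weighted")
print("Fold scores:", cv_scores.round(3))
print("Mean:", cv_scores.mean().round(3), "Std:", cv_scores.std().round(3))
```

The spread of the fold scores indicates how much the reported metrics depend on the particular split.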


### Model Evaluation Metrics Comparison  

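To evaluate the models, I compared **Accuracy, Precision, Recall, and F1-Score**. Precision, Recall, and F1-Score are computed per class and then averaged, weighted by the number of samples in each class:  

- **Accuracy**: Percentage of all predictions that are correct.  
- **Precision**: Of the samples predicted as a given class, the share that truly belong to it.  
- **Recall**: Of the samples that truly belong to a given class, the share that were predicted correctly.  
- **F1-Score**: Harmonic mean of Precision and Recall.  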
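
A small worked example makes the weighted averages concrete. The labels below are invented; the calls mirror the `average="weighted"` setting used in the comparison code.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true and predicted classes (0, 1, 2)
y_true_demo = [0, 0, 1, 1, 1, 2]
y_pred_demo = [0, 1, 1, 1, 2, 2]

accuracy = accuracy_score(y_true_demo, y_pred_demo)
precision = precision_score(y_true_demo, y_pred_demo, average="weighted")
recall = recall_score(y_true_demo, y_pred_demo, average="weighted")
f1 = f1_score(y_true_demo, y_pred_demo, average="weighted")

# Per-class precision is 1, 2/3, and 1/2 with supports 2, 3, and 1,
# so the weighted precision is (2*1 + 3*(2/3) + 1*(1/2)) / 6 = 0.75
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Support-weighted recall is always equal to accuracy, so the Recall and Accuracy bars in the comparison chart carry the same information; per-class values from the classification report are more informative for the Poor class.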

 The comparison table and bar chart clearly show which model performs better overall.  


### Feature Importance

I also checked which features matter the most for predicting credit scores.  

- Random Forest and XGBoost both rank features by importance.  
- Features like income, payment delays, and credit inquiries turn out to be strong predictors.  
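
Importances are easier to compare when they are labeled and sorted. The sketch below fits a small Random Forest on synthetic data and wraps `feature_importances_` in a sorted pandas Series; the feature names are placeholders, not columns from the project dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class data with placeholder feature names
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(5)]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances sum to 1; sort them so the strongest features come first
importance = pd.Series(model.feature_importances_, index=feature_names).sort_values(ascending=False)
print(importance.round(3))
```

In the notebook, `pd.Series(rf_clf.feature_importances_, index=X.columns)` would give the same kind of ranking, and passing the sorted values to the bar plot makes the chart easier to read.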


# **Conclusion**

The project explores customer financial data through EDA and uncovers valuable insights:

- **Income stability, repayment history, and credit utilization** are the most influential factors for credit scores.  
- The majority of customers are of **Standard** credit risk, requiring closer evaluation before loan approvals.  
- Occupation and income significantly affect credit health, with stable professions having stronger financial discipline.  
- Correlation analysis shows strong relationships among income, salary, and balance, while utilization and delays negatively impact scores.  

**Business Value:**  
These findings can help Paisabazaar refine loan approval processes, reduce fraud risk, and deliver tailored financial recommendations to customers.  
