<a href="https://colab.research.google.com/github/Prianka-Mukhopadhyay/Project_PaisaBazaar/blob/main/PaisaBazaar.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** Prianka Mukhopadhyay


# **Project Summary -**

Paisabazaar is a financial services company that assists customers in finding and applying for various banking and credit products. An integral part of their service is assessing the creditworthiness of individuals, which is crucial for both loan approval and risk management. The credit score of a person is a significant metric used by financial institutions to determine the likelihood that an individual will repay their loans or credit balances. Accurate classification of credit scores can help Paisabazaar enhance their credit assessment processes, reduce the risk of loan defaults, and offer personalized financial advice to their customers. In this context, analyzing and classifying credit scores based on customer data can improve decision-making processes and contribute to better financial product recommendations. This case study aims to develop a model that predicts the credit score of individuals based on various features, such as income, credit card usage, and payment behavior.

# **GitHub Link -**

https://github.com/Prianka-Mukhopadhyay/Project_PaisaBazaar

# **Problem Statement**


Paisabazaar is a financial services company that assists customers in finding and applying for various banking and credit products. An integral part of their service is assessing the creditworthiness of individuals, which is crucial for both loan approval and risk management. The credit score of a person is a significant metric used by financial institutions to determine the likelihood that an individual will repay their loans or credit balances. Accurate classification of credit scores can help Paisabazaar enhance their credit assessment processes, reduce the risk of loan defaults, and offer personalized financial advice to their customers. In this context, analyzing and classifying credit scores based on customer data can improve decision-making processes and contribute to better financial product recommendations. This case study aims to develop a model that predicts the credit score of individuals based on various features, such as income, credit card usage, and payment behavior.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure that each algorithm answers the following questions.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

# utilities
import os, io, sys, warnings, textwrap, re
warnings.filterwarnings("ignore")

# display settings
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 120)
pd.set_option("display.float_format", lambda x: f"{x:,.3f}")

# reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("✅ Libraries imported. Pandas version:", pd.__version__)


### Dataset Loading

In [None]:
# from google.colab import files
# import pandas as pd

# # Upload file manually
# uploaded = files.upload()

# Load dataset
# import io
# df = pd.read_csv(io.BytesIO(uploaded['dataset-2.csv']))

# print("✅ Dataset uploaded successfully!")
# print("Shape:", df.shape)


In [None]:
# Read dataset from the default Colab working directory
df = pd.read_csv("/content/dataset-2.csv")

print("✅ Dataset loaded successfully!")
print("Shape:", df.shape)


### Dataset First View

In [None]:
# Dataset First Look
# A first view of the dataset gives a better understanding of the data,
# and of the rows and columns we will be dealing with.
# df.head(10) is kept as the last expression so the notebook displays it.
df.head(10)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Number of Rows:", df.shape[0])
print("Number of Columns:", df.shape[1])

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print("Number of duplicate rows:", duplicate_count)

#### Missing Values/Null Values

In [None]:
# ===== Missing Values =====
print("🔎 Checking missing values in the dataset...\n")

# Count of missing values per column
missing_values = df.isnull().sum()

# Only show columns with missing values
missing_values = missing_values[missing_values > 0]

if missing_values.empty:
    print("✅ No missing values found in the dataset!")
else:
    print("Columns with missing values:\n")
    print(missing_values)


In [None]:
# Visualizing the missing values

plt.figure(figsize=(8,4))
sns.heatmap(df.isnull(), cbar=False, cmap="viridis")
plt.title("Missing Values Heatmap", fontsize=12)
plt.show()


In [None]:
missing = df.isnull().sum()
missing = missing[missing > 0]

if missing.empty:
    print("✅ No missing values to visualize!")
else:
    missing.plot(kind="bar", figsize=(10,5), color="salmon")
    plt.title("Missing Values per Column")
    plt.ylabel("Count")
    plt.show()


### What did you know about your dataset?

*   The dataset contains 100,000 records and 28 columns.
*   It includes a mix of numerical (float64, int64) and categorical (object) variables.
*   No missing values were found in any of the columns.
*   No duplicate rows are present.

**Key customer attributes captured:**

**Demographics:** Age, Occupation, SSN, Name

**Financial indicators:** Annual_Income, Monthly_Inhand_Salary, Outstanding_Debt, Monthly_Balance

**Credit behavior:** Num_Bank_Accounts, Num_Credit_Card, Num_of_Loan, Delay_from_due_date, Num_of_Delayed_Payment, Credit_Utilization_Ratio

**Categorical insights:** Type_of_Loan, Credit_Mix, Payment_of_Min_Amount, Payment_Behaviour

The target variable for prediction is **Credit_Score**, which is categorical.

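Overall, the dataset looks clean (no NaN values or duplicate rows) and well-structured for further analysis. Note that placeholder strings such as "No Data" in Type_of_Loan are not counted as missing by `isnull()`, so they have to be handled separately during data wrangling.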

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Dataset Columns:\n")
print(df.columns.tolist())

In [None]:
# Dataset Describe
df.describe(include="all").transpose()

### Variables Description

**Identifiers:**

**ID, Customer_ID, SSN, Name** → Identifier columns with high cardinality. These do not carry predictive power for credit scoring, so they should be dropped later.

**Demographics:**

**Age** → Numeric, range 14–56, 43 unique values.

**Occupation** → 15 unique categories (e.g., Lawyer, Engineer).

**Financial attributes:**

**Annual_Income, Monthly_Inhand_Salary** → Continuous, wide ranges, strong indicators of financial health.

**Outstanding_Debt, Monthly_Balance, Total_EMI_per_month, Amount_invested_monthly** → Numeric measures of debt, balance, liabilities, and investments.

**Credit behavior:**

**Num_Bank_Accounts →** 0–11.

**Num_Credit_Card →** 0–11.

**Interest_Rate →** 1–34.

**Num_of_Loan →** 0–9.

**Delay_from_due_date →** 0–62 days.

**Num_of_Delayed_Payment →** 0–25.

**Changed_Credit_Limit →** continuous variable (0.5–29.98).

**Num_Credit_Inquiries →** 0–17.

**Credit_History_Age →** ranges from 1–404 (probably months).

**Categorical credit attributes:**

**Type_of_Loan →** very high unique count (6261), may need cleaning/simplification.

**Credit_Mix →** 3 categories (e.g., Standard, Good, Bad).

**Payment_of_Min_Amount →** 3 categories (Yes/No/Not Specified).

**Payment_Behaviour →** 6 categories (e.g., Low_spent_Small_value_payments).

**Target variable:**

**Credit_Score →** 3 classes (e.g., Standard, Good, Poor).
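
The very high unique count of `Type_of_Loan` is typical of a free-text field that packs several loan types into one string. The sketch below shows one possible way to simplify such a column. It assumes, without confirmation from this notebook, that each value is a comma-separated list such as "Auto Loan, Personal Loan, and Home Equity Loan". The sample strings and the helper name `split_loan_types` are hypothetical and used only for illustration.

```python
import re
from collections import Counter


def split_loan_types(raw):
    """Split one Type_of_Loan string into a list of individual loan types.

    Assumption: loan types are separated by commas, optionally with a
    leading "and" before the last item. Non-string values (e.g. NaN) and
    the "No Data" placeholder are treated as "no loans listed".
    """
    if not isinstance(raw, str) or raw.strip() in ("", "No Data"):
        return []
    cleaned = []
    for part in re.split(r",\s*", raw):
        # Drop a leading "and " left over from "..., and Home Equity Loan"
        part = re.sub(r"^and\s+", "", part.strip())
        if part:
            cleaned.append(part)
    return cleaned


# Hypothetical sample values, not taken from the real dataset
samples = [
    "Auto Loan, Personal Loan, and Home Equity Loan",
    "Personal Loan",
    "No Data",
    "Auto Loan, and Personal Loan",
]

# Count how often each individual loan type appears across all rows
loan_counts = Counter(t for s in samples for t in split_loan_types(s))
print(loan_counts)
```

If the real column follows this format, the same helper could be applied with `df["Type_of_Loan"].apply(split_loan_types)` to obtain a small set of distinct loan types instead of thousands of combined strings.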


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for col in df.columns:
    unique_vals = df[col].nunique()
    print(f"{col}: {unique_vals} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# ===== Data Wrangling =====

# 1. Drop identifier columns
df_clean = df.drop(columns=["ID", "Customer_ID", "SSN", "Name"])

# 2. Handle 'No Data' / inconsistent values in categorical variables
df_clean["Type_of_Loan"] = df_clean["Type_of_Loan"].replace("No Data", np.nan)
df_clean["Payment_of_Min_Amount"] = df_clean["Payment_of_Min_Amount"].replace("NM", "Not Specified")

# 3. Check target variable distribution
print("Target variable (Credit_Score) distribution:")
print(df_clean["Credit_Score"].value_counts())

# Show shape after cleaning
print("\nShape after cleaning:", df_clean.shape)


### What all manipulations have you done and insights you found?

**Dropped identifier columns (ID, Customer_ID, SSN, Name)** as they don’t contribute to predictive modeling and only serve as unique identifiers.

### **Handled categorical anomalies:**

Replaced "No Data" in Type_of_Loan with NaN for clarity.

Standardized "Payment_of_Min_Amount" by renaming "NM" to "Not Specified".

Dataset shape changed from (100,000, 28) to (100,000, 24) after removing identifiers.

### **Target variable (Credit_Score) distribution:**
**Standard:** 53,174 records (≈53%)

**Poor:** 28,998 records (≈29%)

**Good:** 17,828 records (≈18%)

The dataset is imbalanced, with more “Standard” credit scores. This imbalance will need to be addressed during model training.
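
One common way to address such an imbalance is to weight each class inversely to its frequency. The sketch below applies the "balanced" heuristic, `n_samples / (n_classes * class_count)`, to the class counts reported above. It is only an illustration of the arithmetic; whether class weighting, resampling, or another approach is appropriate depends on the model chosen later.

```python
# Class counts reported above for the Credit_Score target
class_counts = {"Standard": 53174, "Poor": 28998, "Good": 17828}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" heuristic: rarer classes receive proportionally larger weights
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    share = class_counts[label] / n_samples
    print(f"{label}: share={share:.1%}, weight={weight:.3f}")
```

With these weights, a misclassified "Good" record counts roughly three times as much as a misclassified "Standard" record, which offsets the difference in class sizes.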

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(6,4))
sns.countplot(data=df_clean, x="Credit_Score", palette="viridis")
plt.title("Distribution of Credit Scores")
plt.ylabel("Count")
plt.show()


##### 1. Why did you pick the specific chart?

To visualize class imbalance in Credit_Score.

##### 2. What is/are the insight(s) found from the chart?

Standard dominates → dataset imbalance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business impact: Helps identify need for strategies to focus on "Poor" & "Good" groups for targeted financial advice.

Positive: Tells PaisaBazaar that a large portion of their customers are “Standard,” which could be the focus for upselling loans with stricter monitoring.

Negative: Class imbalance may make ML models biased toward the majority class.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(8,5))
sns.boxplot(data=df_clean, x="Credit_Score", y="Age", palette="Set2")
plt.title("Age distribution across Credit Score categories")
plt.show()


##### 1. Why did you pick the specific chart?

To check if age distribution differs across credit score categories.

##### 2. What is/are the insight(s) found from the chart?

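Younger customers may cluster in “Poor” scores if their credit history is short.

The remaining observations concern income rather than age, so they should be verified against the Annual_Income boxplot (Chart 4) rather than this chart:

“Good” credit score customers generally have higher median incomes.

“Poor” credit scores cluster around lower incomes, but with many outliers.

Income appears to play a role in determining creditworthiness.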


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Business impact: Lenders may design beginner-friendly credit products for younger customers with short credit histories.

Positive: If the income pattern noted above holds, high-income customers are less risky and can be offered premium financial products.

Negative: Low-income groups need risk-adjusted products; if ignored, PaisaBazaar could face higher loan defaults.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10,6))
sns.countplot(data=df_clean, x="Occupation", hue="Credit_Score", palette="husl")
plt.xticks(rotation=45)
plt.title("Occupation vs Credit Score")
plt.show()


##### 1. Why did you pick the specific chart?

Occupation influences income and repayment ability.

##### 2. What is/are the insight(s) found from the chart?

**Some jobs (e.g., lawyers/engineers) may skew toward better credit scores.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business impact: Helps Paisabazaar recommend occupation-specific credit products.**

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(8,5))
sns.boxplot(data=df_clean, x="Credit_Score", y="Annual_Income", palette="coolwarm")
plt.ylim(0, 200000)  # keep axis readable
plt.title("Annual Income vs Credit Score")
plt.show()


##### 1. Why did you pick the specific chart?

**To check how income levels correlate with creditworthiness.**

##### 2. What is/are the insight(s) found from the chart?

**Higher incomes may not always guarantee “Good” scores if debts/behavior are poor.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Business impact: Identifies risk even among high-income customers.**

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(8,5))
sns.violinplot(data=df_clean, x="Credit_Score", y="Num_of_Loan", palette="muted")
plt.title("Number of Loans vs Credit Score")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# ===== Correlation Heatmap (Numeric Features Only) =====

# Select only numeric columns
numeric_df = df_clean.select_dtypes(include=["int64", "float64"])

plt.figure(figsize=(14,8))
corr = numeric_df.corr()

sns.heatmap(corr, annot=False, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap of Numerical Features", fontsize=16)
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df_clean[["Age","Annual_Income","Outstanding_Debt","Num_of_Loan","Credit_Score"]], hue="Credit_Score", palette="husl")
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

Answer Here.

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Research question:** Does Annual Income differ significantly across different Credit Score categories?

**Null Hypothesis (H0):** There is no significant difference in annual income among customers with different credit scores.

**Alternate Hypothesis (H1):** There is a significant difference in annual income among customers with different credit scores.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
from scipy.stats import f_oneway

# Group data by Credit Score
poor_income = df_clean[df_clean["Credit_Score"]=="Poor"]["Annual_Income"]
standard_income = df_clean[df_clean["Credit_Score"]=="Standard"]["Annual_Income"]
good_income = df_clean[df_clean["Credit_Score"]=="Good"]["Annual_Income"]

# Perform ANOVA
f_stat, p_value = f_oneway(poor_income, standard_income, good_income)

print("ANOVA test for Annual Income across Credit Scores")
print("F-statistic:", f_stat)
print("P-value:", p_value)


##### Which statistical test have you done to obtain P-Value?

**I used a One-Way ANOVA test.**

##### Why did you choose the specific statistical test?

Because the independent variable (Credit_Score) is categorical with three groups (Poor, Standard, Good) and the dependent variable (Annual_Income) is continuous. ANOVA is appropriate to test whether there is a significant difference in mean income across multiple groups.
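
To make the test statistic concrete, the sketch below computes the one-way ANOVA F-statistic by hand as the ratio of between-group to within-group variance. The three small groups are hypothetical toy values, not income figures from the dataset, and the helper name `one_way_f` is introduced only for illustration. On real data, `scipy.stats.f_oneway` reports this F-statistic together with its p-value.

```python
def one_way_f(groups):
    """Return (F-statistic, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n

    # Variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Variation of the observations around their own group mean
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups for x in g
    )

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical toy groups standing in for Poor / Standard / Good
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
toy_f, df_b, df_w = one_way_f(toy_groups)
print(f"F = {toy_f:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

A large F means the group means are far apart relative to the spread inside each group. The null hypothesis is rejected when the corresponding p-value falls below the chosen significance level (commonly 0.05).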

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Research question:** Is there an association between the number of delayed payments and credit score category?

**Null Hypothesis (H0):** There is no association between Num_of_Delayed_Payment and Credit_Score.

**Alternate Hypothesis (H1):** There is an association between Num_of_Delayed_Payment and Credit_Score.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import pandas as pd
from scipy.stats import chi2_contingency

# Bin delayed payments into categories
df_clean["Delayed_Pay_Category"] = pd.cut(
    df_clean["Num_of_Delayed_Payment"],
    bins=[-1, 5, 15, np.inf],  # open-ended top bin; edges stay increasing even if the max is <= 15
    labels=["Low", "Medium", "High"]
)

# Create contingency table
contingency = pd.crosstab(df_clean["Delayed_Pay_Category"], df_clean["Credit_Score"])

# Perform chi-square test
chi2, p, dof, expected = chi2_contingency(contingency)

print("Chi-Square Test for Delayed Payments vs Credit Score")
print("Chi2:", chi2)
print("P-value:", p)


##### Which statistical test have you done to obtain P-Value?

**I used a Chi-Square Test of Independence.**

##### Why did you choose the specific statistical test?

Both variables (Num_of_Delayed_Payment after binning and Credit_Score) are categorical. The Chi-square test is the best choice to check whether two categorical variables are independent or associated.
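
The sketch below shows how the chi-square statistic is built from a contingency table: each observed count is compared with the count expected under independence, `row_total * column_total / n`. The 3×3 table is a hypothetical toy example, not the crosstab from this dataset, and `chi_square_statistic` is an illustrative helper name. For tables larger than 2×2, `scipy.stats.chi2_contingency` computes this statistic and also returns the p-value.

```python
def chi_square_statistic(table):
    """Return (chi2, degrees_of_freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


# Hypothetical counts: rows = Low / Medium / High delay category,
# columns = three credit score classes
toy_table = [
    [20, 10, 10],
    [10, 20, 10],
    [10, 10, 20],
]
toy_chi2, toy_dof = chi_square_statistic(toy_table)
print(f"chi2 = {toy_chi2:.2f}, dof = {toy_dof}")
```

The larger the gap between observed and expected counts, the larger the statistic and the stronger the evidence against independence.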

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Research question:** Do customers with higher outstanding debt have lower credit scores?

**Null Hypothesis (H0):** There is no significant difference in outstanding debt among different credit score groups.

**Alternate Hypothesis (H1):** There is a significant difference in outstanding debt among different credit score groups.

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
# Group data by Credit Score
poor_debt = df_clean[df_clean["Credit_Score"]=="Poor"]["Outstanding_Debt"]
standard_debt = df_clean[df_clean["Credit_Score"]=="Standard"]["Outstanding_Debt"]
good_debt = df_clean[df_clean["Credit_Score"]=="Good"]["Outstanding_Debt"]

# Perform ANOVA
f_stat, p_value = f_oneway(poor_debt, standard_debt, good_debt)

print("ANOVA test for Outstanding Debt across Credit Scores")
print("F-statistic:", f_stat)
print("P-value:", p_value)


##### Which statistical test have you done to obtain P-Value?

**I used a One-Way ANOVA test.**

##### Why did you choose the specific statistical test?

The independent variable (Credit_Score) is categorical with three groups, while the dependent variable (Outstanding_Debt) is continuous. ANOVA is suitable to test for mean differences across multiple groups.
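
With 100,000 records, even small differences in group means can produce a very small p-value, so it can help to report an effect size next to the ANOVA result. The sketch below computes eta squared, the share of total variance explained by group membership. The toy groups are hypothetical values rather than debt figures from the dataset, and `eta_squared` is an illustrative helper name.

```python
def eta_squared(groups):
    """Return eta squared = SS_between / SS_total for a list of groups."""
    values = [x for g in groups for x in g]
    grand_mean = sum(values) / len(values)

    # Total variation of all observations around the grand mean
    ss_total = sum((x - grand_mean) ** 2 for x in values)
    # Part of that variation explained by differences between group means
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total


# Hypothetical toy groups standing in for Poor / Standard / Good
toy_debt_groups = [
    [10.0, 12.0, 14.0],
    [11.0, 13.0, 15.0],
    [12.0, 14.0, 16.0],
]
toy_eta = eta_squared(toy_debt_groups)
print(f"eta squared = {toy_eta:.2f}")
```

Values close to 0 indicate that credit score group explains little of the variation, while values close to 1 indicate that it explains most of it.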

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***