<a href="https://colab.research.google.com/github/Nikkkhhill97/Paisabazaar_Banking_Fraud_Analysis/blob/main/Paisa_Bazaar.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Paisabazaar Banking Fraud Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

**Project Overview and Objectives**

In the modern financial landscape, credit scores serve as the primary barometer for a consumer's financial reliability. This project focuses on a deep-dive Exploratory Data Analysis (EDA) of a dataset provided by Paisabazaar, a premier financial marketplace. The central goal is to dissect the complex relationship between various financial attributes and the resulting Credit_Score. By analyzing a dataset that mirrors real-world credit behaviors, we aim to identify the specific features—whether they are behavioral patterns like payment delays or structural metrics like annual income—that most significantly impact a person's creditworthiness. The ultimate objective is to provide Paisabazaar with a data-driven foundation to better categorize their users and offer personalized financial advice for score improvement.

**Comprehensive Methodology and Data Cleaning**

The analysis was executed through a rigorous, multi-stage pipeline designed for data integrity. Upon initial loading via Pandas, the dataset underwent a thorough structural audit using .describe(include='all') and .dtypes to identify mixed-type variables and potential data quality issues. A custom Data Dictionary was constructed to map out the features, ensuring a clear understanding of the variables before any transformations were applied.

One of the most critical phases was the Data Handling and Imputation strategy. Rather than using a one-size-fits-all approach, the code implemented a conditional logic based on the volume of missing data. Columns with less than 5% missing values were treated with row-level removal to preserve the authenticity of the data distribution. For more significant gaps, numeric features were filled using median imputation to remain robust against outliers, while categorical variables were filled using the mode.
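The conditional strategy described above can be sketched on a small toy frame. This is a minimal illustration, not the notebook's exact cell: the helper name `impute_by_missing_share`, the `drop_threshold` parameter, and all toy values are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def impute_by_missing_share(frame: pd.DataFrame, drop_threshold: float = 0.05) -> pd.DataFrame:
    """Drop rows for sparsely missing columns; fill the rest with median or mode."""
    out = frame.copy()
    for col in out.columns:
        share = out[col].isna().mean()
        if share == 0:
            continue
        if share < drop_threshold:
            # Few gaps: removing the affected rows costs little data
            out = out.dropna(subset=[col])
        elif pd.api.types.is_numeric_dtype(out[col]):
            # Median is less sensitive to extreme values than the mean
            out[col] = out[col].fillna(out[col].median())
        else:
            # Categorical: use the most frequent value
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Hypothetical toy data with 40 rows
toy = pd.DataFrame({
    "Annual_Income": [50_000.0] * 39 + [np.nan],                      # 2.5% missing: row dropped
    "Outstanding_Debt": [np.nan] * 10 + [1_000.0] * 29 + [5_000.0],   # 25% missing: median fill
    "Occupation": [None] * 8 + ["Engineer"] * 20 + ["Doctor"] * 12,   # 20% missing: mode fill
})
cleaned = impute_by_missing_share(toy)
print(cleaned.shape)
print(cleaned.isna().sum())
```

Note that the columns are processed in order, so dropping rows for one column changes the missing share seen by the columns that follow.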

To further refine the dataset, an Outlier Management system was built using the Interquartile Range (IQR) method. The code systematically identified extreme values in columns such as Annual_Income and Outstanding_Debt. Rather than simply deleting these data points, which could result in a loss of valuable information, a "clipping" technique was used to cap values at the calculated upper and lower bounds. This ensured that the subsequent statistical analyses and visualizations were not distorted by extreme "black swan" events in the data.
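As a small worked example of the clipping step, the sketch below caps one extreme value in a toy series. The helper name `clip_iqr` and the sample numbers are illustrative assumptions; the 1.5 multiplier is the one used in this notebook's wrangling cell.

```python
import pandas as pd


def clip_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Hypothetical incomes: eight ordinary values and one extreme value
income = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 500.0],
                   name="Annual_Income")
clipped = clip_iqr(income)

# Q1 = 14 and Q3 = 22, so IQR = 8 and the bounds are [2, 34]:
# the value 500 is capped at 34 while the row itself is kept.
print(clipped.tolist())
```

Clipping keeps the row count unchanged, which is why it was preferred here over deleting the extreme rows.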

**Technical Implementation and Visualization**

The project’s technical architecture leveraged Python’s scientific stack, including Pandas and NumPy for core logic, and a combination of Seaborn, Matplotlib, and Plotly for the visualization layer. The visualization strategy was designed to be multi-dimensional, utilizing:

*   Distribution Analysis: Histograms and box plots to understand the spread of credit scores.
*   Relationship Mapping: Scatter plots and heatmaps to visualize the interplay between debt, income, and score.
*   Categorical Comparisons: Stacked bar charts to see how "Minimum Amount Payments" vary across credit tiers.

For the correlation analysis, specific preprocessing was required. The code manually mapped ordinal text labels (e.g., 'Poor', 'Standard', 'Good') into a numeric scale (0, 1, 2). This allowed for the calculation of Pearson correlation coefficients, providing a mathematical basis for the insights derived.
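The mapping and correlation step might look like the following on toy data. The six rows are invented for illustration only and say nothing about the real dataset; the mapping dictionary matches the one described above.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the Paisabazaar dataset
toy = pd.DataFrame({
    "Credit_Score": ["Poor", "Poor", "Standard", "Standard", "Good", "Good"],
    "Delay_from_due_date": [40, 30, 20, 15, 5, 2],
})

score_map = {"Poor": 0, "Standard": 1, "Good": 2}
toy["Credit_Score_Num"] = toy["Credit_Score"].map(score_map)

# Any label missing from score_map would silently become NaN, so check first
unmapped = toy["Credit_Score_Num"].isna().sum()
print(f"Unmapped labels: {unmapped}")

# Series.corr computes the Pearson coefficient by default
r = toy["Delay_from_due_date"].corr(toy["Credit_Score_Num"])
print(f"Pearson r: {r:.3f}")
```

Pearson on a 0/1/2 scale treats the gaps between tiers as equal, which is a simplifying assumption of this encoding.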

**Key Insights and Business Conclusions**

The EDA uncovered several pivotal findings that challenge common misconceptions about credit scores. First, Delay_from_due_date emerged as the strongest factor, with a moderate negative correlation of -0.43. This suggests that even a high-income individual can suffer a poor credit rating if they lack payment discipline. Second, while Annual_Income does have a positive influence (+0.21), it is noticeably less impactful than Num_of_Delayed_Payment (-0.37).

Furthermore, the analysis of Payment_of_Min_Amount revealed a clear behavioral divide: customers who only pay the minimum amount are disproportionately clustered in the "Poor" and "Standard" categories. This provides Paisabazaar with a clear "Red Flag" indicator for early intervention.

# **GitHub Link -**

https://github.com/Nikkkhhill97/Paisabazaar_Banking_Fraud_Analysis

# **Problem Statement**


The objective of this project is to perform a comprehensive Exploratory Data Analysis (EDA) on a dataset provided by Paisabazaar to identify the key factors that influence a customer's Credit Score. As a financial marketplace, Paisabazaar needs to understand the correlation between customer behavior (such as payment delays) and financial status (such as debt and income) to provide better product recommendations.

This analysis specifically aims to:

*   Analyze the relationship between features like Annual_Income, Outstanding_Debt, and Delay_from_due_date and the target variable Credit_Score.
*   Handle real-world data issues including missing values and significant outliers.
*   Provide actionable insights that can help customers improve their creditworthiness and assist the business in better risk segmentation.

In this project we perform an Exploratory Data Analysis (EDA) on the provided dataset to understand which features are correlated with Credit_Score and to derive actionable business insights.

# Objectives:
 1. Analyze head/tail/summary of the data
 2. Handle missing values and outliers
 3. Create at least five different visualization types
 4. Derive conclusions from correlations and trends

#### **Define Your Business Objective?**

The primary business objective is to empower Paisabazaar with a data-driven framework to better understand and predict customer creditworthiness. By identifying the specific financial behaviors and demographic traits that drive credit scores, the business can optimize its recommendation engine and improve customer retention through targeted financial counseling.

Specifically, the project aims to:

*   Improve Lead Quality: By understanding the correlation between features like Outstanding_Debt and Credit_Score, Paisabazaar can better match customers with lending products they are likely to be approved for.
*   Identify Risk Indicators: Pinpoint "Red Flag" behaviors, such as frequent payment delays, to alert users before their credit score degrades significantly.
*   Strategic Growth: Use insights from Annual_Income and Credit_Mix to identify high-potential customers for premium financial products.

Business objective: Assist customers in selecting financial products by
understanding which features are correlated with Credit_Score.
This allows for deriving actionable business insights and better product matching.



# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:

# 1. Import Libraries

# Data Handling
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Statistical Analysis
from scipy import stats

# Essential for notebook execution
%matplotlib inline

print("Libraries imported successfully.")

### Dataset Loading

In [None]:
# File path using Google Drive ID
file_id = '1RxZ7CYwrznPOHflk1re80G_T8pp-1Z2H'
file_path = f'https://drive.google.com/uc?id={file_id}'

# Loading the dataset
try:
    df = pd.read_csv(file_path)
    print("Dataset loaded successfully!")
    print(f"Shape: {df.shape}")
except Exception as e:
    print(f"Error loading data: {e}")
    raise  # Stop here: every later cell depends on df being defined

### Dataset First View

In [None]:
# 1. Inspecting the first 5 rows to understand features
print("--- First 5 Rows ---")
display(df.head())

# 2. Inspecting the last 5 rows to ensure data consistency
print("\n--- Last 5 Rows ---")
display(df.tail())

# 3. Checking the dimensions of the dataset
print(f"\nTotal Rows: {df.shape[0]}")
print(f"Total Columns: {df.shape[1]}")

### Dataset Rows & Columns count

In [None]:
# Displaying the count of rows and columns
rows, columns = df.shape
print(f"The dataset contains {rows} rows and {columns} columns.")

### Dataset Information

In [None]:
# Checking data types, non-null counts, and memory usage
df.info()

#### Duplicate Values

In [None]:
# Checking for duplicate rows in the dataset
duplicate_count = df.duplicated().sum()

print(f"Total number of duplicate rows found: {duplicate_count}")

# If duplicates existed, they would be handled here
if duplicate_count > 0:
    df.drop_duplicates(inplace=True)
    print("Duplicate rows have been removed.")

#### Missing Values/Null Values

In [None]:
# Calculating the number of missing values per column
missing_values = df.isnull().sum()

# Displaying only columns that have missing values
print("Columns with missing values and their counts:")
display(missing_values[missing_values > 0])

# Total count of missing values in the entire dataset
print(f"\nTotal missing values in dataset: {missing_values.sum()}")

In [None]:
# Visualizing missing values using a heatmap for quick identification
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

# Bar plot for missing values percentage per column
missing_pct = (df.isnull().sum() / len(df)) * 100
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)

if not missing_pct.empty:
    plt.figure(figsize=(10, 5))
    missing_pct.plot(kind='bar', color='teal')
    plt.title('Percentage of Missing Values by Column')
    plt.ylabel('Percentage (%)')
    plt.xlabel('Columns')
    plt.show()
else:
    print("No missing values to visualize.")

### What did you know about your dataset?

Context: The dataset is a financial record collection from Paisabazaar, designed to analyze the creditworthiness of customers.

Structure: It contains a mix of Numerical features (like Annual_Income, Outstanding_Debt, Num_of_Loan) and Categorical features (like Occupation, Credit_Mix, and the target variable Credit_Score).

Target Variable: The primary focus is the Credit_Score column, which categorizes customers into tiers such as Good, Standard, and Poor.

Data Integrity: Initial checks revealed missing values across several columns and the presence of outliers in financial metrics, requiring systematic cleaning and "clipping" to ensure statistical accuracy.

Complexity: The data includes behavioral indicators, such as payment habits (Payment_of_Min_Amount) and timeline-based data (Delay_from_due_date), which are essential for understanding credit fluctuations.

## ***2. Understanding Your Variables***

In [None]:
# Listing all columns available in the dataset
print("The dataset contains the following columns:")
print(df.columns.tolist())

In [None]:
# Dataset Describe
# Generating a statistical summary for all columns (Numerical and Categorical)
# The transpose (.T) is used for better readability when dealing with many features.
display(df.describe(include='all').T)

### Variables Description


ID / Customer_ID: Unique identifiers for each record and customer (usually dropped during analysis to avoid noise).

Month: The month of the year the data was recorded.

Name / SSN: Personal identification details used for record-keeping.

Occupation: The professional sector of the customer, used to assess income stability.

Annual_Income: The yearly earnings of the customer.

Monthly_Inhand_Salary: The actual liquid cash the customer receives after deductions.

Num_Bank_Accounts: Total number of bank accounts held by the individual.

Num_Credit_Card: Total number of credit cards the individual possesses.

Interest_Rate: The average interest rate applicable to the customer's existing loans.

Num_of_Loan: The total number of active loans taken from the bank.

Type_of_Loan: Categories of loans (e.g., Personal, Home, Auto).

Delay_from_due_date: Average number of days payments are delayed past the deadline.

Num_of_Delayed_Payment: Total count of payments that were not made on time.

Changed_Credit_Limit: Percentage change in the customer's credit limit over time.

Num_Credit_Inquiries: Frequency of credit report checks by lenders (high frequency can lower scores).

Credit_Mix: The variety of credit products held (e.g., Good, Standard, Bad).

Outstanding_Debt: The total remaining balance yet to be paid by the customer.

Credit_Utilization_Ratio: The amount of credit used relative to the total limit available.

Credit_History_Age: The duration for which the customer has been using credit products.

Payment_of_Min_Amount: Indicates if the customer only pays the minimum amount due (Yes/No).

Total_EMI_per_month: Total monthly installment amount for all loans.

Amount_invested_monthly: The monthly portion of income put into investments.

Payment_Behaviour: Categorization of spending and payment habits.

Monthly_Balance: The remaining amount in the account after all expenses and EMIs.

Credit_Score (Target): The classification of the customer’s creditworthiness (Poor, Standard, Good).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Iterating through all columns to print unique values or the count of unique values
print("--- Unique Values / Counts for Each Variable ---")

for col in df.columns:
    unique_count = df[col].nunique()
    if unique_count <= 20:
        # If the number of unique values is small, display the actual values
        print(f"{col}: {df[col].unique()}")
    else:
        # If the number of unique values is large, just display the count
        print(f"{col}: {unique_count} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:

# DATA WRANGLING: PREPARING THE DATASET FOR ANALYSIS

# 1. Handling Missing Values
# Strategy: Drop if < 5%, Median for Numeric, Mode for Categorical
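for col in df.columns:
    pct = df[col].isnull().mean()
    if pct == 0:
        continue
    if pct < 0.05:
        # Minimal data loss; .copy() avoids chained-assignment warnings later
        df = df.dropna(subset=[col]).copy()
    elif pd.api.types.is_numeric_dtype(df[col]):
        # Assign back instead of inplace=True on a column selection,
        # which does not reliably modify df in recent pandas versions
        df[col] = df[col].fillna(df[col].median()) # Robust imputation
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0]) # Most frequent value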

# 2. Removing Irrelevant High-Cardinality Features
# Columns like ID and Name don't provide statistical value for EDA
for id_col in ['ID', 'Customer_ID', 'SSN', 'Name']:
    if id_col in df.columns:
        df.drop(columns=[id_col], inplace=True)

# 3. Handling Outliers (IQR Clipping)
# We cap extreme values to ensure visualizations and correlations are not skewed
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
for c in numeric_cols:
    q1 = df[c].quantile(0.25)
    q3 = df[c].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df[c] = df[c].clip(lower, upper)

# 4. Data Type Normalization
# Converting categorical text into formal category types for optimized plotting
cat_cols = ['Occupation', 'Type_of_Loan', 'Payment_of_Min_Amount', 'Credit_Mix', 'Month']
for c in cat_cols:
    if c in df.columns:
        df[c] = df[c].astype('category')

# 5. Mapping Target Labels for Correlation Analysis
# Converting Credit_Score to numeric to allow for mathematical relationship testing
if 'Credit_Score' in df.columns:
    score_map = {'Poor': 0, 'Standard': 1, 'Good': 2}
    df['Credit_Score_Num'] = df['Credit_Score'].map(score_map)

print("Data Wrangling Complete: Dataset is now analysis-ready.")

### What all manipulations have you done and insights you found?

## Data Manipulations
Missing Value Imputation: I applied a conditional strategy based on the severity of missing data. For columns with less than 5% missingness, I dropped the rows to maintain data purity. For others, I used Median Imputation for numerical data (to avoid outlier bias) and Mode Imputation for categorical data.

Outlier Management (Capping): Using the Interquartile Range (IQR) method, I identified extreme values in financial columns like Annual_Income and Outstanding_Debt. I "clipped" these values to the upper and lower bounds to prevent skewed visualizations.

Feature Selection: I dropped non-informative columns such as ID, Customer_ID, SSN, and Name, as these high-cardinality features do not contribute to statistical patterns.

Type Casting: I converted columns like Occupation and Credit_Mix into the category data type to optimize memory usage and improve the efficiency of plotting libraries.

Target Encoding: Created a numeric mapping for Credit_Score (Poor: 0, Standard: 1, Good: 2) to facilitate mathematical correlation analysis.

## Initial Insights Found
Data Quality: A significant portion of the dataset required cleaning, indicating that real-world credit data is often "noisy" and requires robust preprocessing before analysis.

Target Imbalance: Early checks showed that the distribution of "Good" vs. "Poor" credit scores is not equal, which will be a key focus during visualization.

Feature Variance: Features like Delay_from_due_date showed high variance even after cleaning, suggesting they are primary drivers of credit score differences.

Behavioral Indicators: The preliminary look at Payment_of_Min_Amount suggests a strong link between paying only the minimum and having a "Poor" credit rating.
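The target-imbalance check mentioned above can be done with a normalized frequency count. The sketch below uses an invented label series; the proportions are placeholders and not results from the dataset.

```python
import pandas as pd

# Hypothetical labels used only to show the mechanics of the check
labels = pd.Series(["Standard"] * 5 + ["Poor"] * 3 + ["Good"] * 2, name="Credit_Score")

# normalize=True returns shares instead of raw counts;
# reindex puts the tiers in their natural order
share = labels.value_counts(normalize=True).reindex(["Poor", "Standard", "Good"])
print((share * 100).round(1))
```

On the real data the same two lines would be applied to `df['Credit_Score']`.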

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Chart - 1: Count Plot for Credit Score
plt.figure(figsize=(10, 6))
sns.countplot(x='Credit_Score', data=df, palette='viridis', order=['Poor', 'Standard', 'Good'])
plt.title('Distribution of Credit Scores')
plt.xlabel('Credit Score Category')
plt.ylabel('Number of Customers')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a Count Plot because it is the most effective way to visualize the frequency distribution of a categorical variable. Since Credit_Score is our target, we need to see if the classes are balanced or if one category dominates the dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals how customers are spread across the three tiers. Typically, in this dataset, the "Standard" category is the most frequent, followed by "Poor," with "Good" often being the smallest group. This indicates that achieving a "Good" credit score is a challenge for the majority of the user base.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. By knowing the size of the "Standard" group, Paisabazaar can design specific "Nudge" campaigns to help these middle-tier customers move into the "Good" category, increasing their eligibility for premium financial products.

If the "Poor" category is significantly larger than "Good," it indicates a high-risk portfolio. For business growth, a large "Poor" segment is negative because it leads to higher loan rejection rates and lower commissions for the platform.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Chart - 2: Histogram with KDE for Annual Income
plt.figure(figsize=(10, 6))
sns.histplot(df['Annual_Income'], kde=True, color='skyblue', bins=30)
plt.title('Distribution of Annual Income among Customers')
plt.xlabel('Annual Income')
plt.ylabel('Frequency')
plt.axvline(df['Annual_Income'].median(), color='red', linestyle='--', label=f'Median: {df["Annual_Income"].median():.2f}')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

I picked a Histogram with a Kernel Density Estimate (KDE) because it effectively shows the distribution, spread, and skewness of a continuous numerical variable. Adding a median line helps identify the central tendency in a potentially skewed financial dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the income range where most Paisabazaar customers fall. If the distribution is "Right-Skewed," it means the majority of users are middle-to-low income earners, with a few high-earners pulling the average up. The median provides a more realistic view of the "typical" customer's earning power.
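To put a number on the visual impression of skew, the mean, median, and sample skewness can be compared directly. The values below are invented to show a right-skewed shape and are not drawn from the dataset.

```python
import pandas as pd

# Hypothetical incomes (in thousands) with one high earner in the tail
income = pd.Series([20, 22, 25, 27, 30, 32, 35, 40, 60, 150], dtype=float)

mean_income = income.mean()      # pulled upward by the tail
median_income = income.median()  # robust to the tail
skewness = income.skew()         # positive for a right-skewed distribution

print(f"mean={mean_income:.1f}, median={median_income:.1f}, skew={skewness:.2f}")
```

A mean well above the median together with positive skewness is the numeric counterpart of the long right tail in the histogram.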

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Understanding the income clusters allows Paisabazaar to partner with banks that offer products tailored to that specific income bracket (e.g., entry-level credit cards vs. premium wealth management), leading to higher conversion rates.

If a large portion of the users fall below a certain income threshold, it could lead to negative growth in high-value loan segments (like Home Loans), as these users may struggle to meet the minimum eligibility criteria set by lenders.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Chart - 3: Box Plot for Delay_from_due_date
plt.figure(figsize=(10, 6))
sns.boxplot(x=df['Delay_from_due_date'], color='salmon')
plt.title('Analysis of Payment Delays (Days past Due Date)')
plt.xlabel('Number of Days Delayed')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a Box Plot because it is the best tool for identifying the spread of the data and spotting outliers. In credit analysis, knowing the "typical" delay versus extreme "outlier" delays is vital for risk assessment.

##### 2. What is/are the insight(s) found from the chart?

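The chart reveals the median delay for the user base. If the box is shifted to the right, it indicates a general culture of late payments among users. Because extreme values were clipped to the IQR bounds during data wrangling, the whiskers mark the capped range rather than individual outlier points; a long right whisker still shows that a sizeable group of customers is significantly past their due dates, which is a major red flag for credit bureaus.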

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Paisabazaar can use this data to implement a "Grace Period" notification system. By targeting users who fall into the 1st quartile of delays, the business can help them correct their behavior before they move into the high-risk "Outlier" zone.

A high median delay is a strong indicator of potential defaults. If a large segment of the population consistently delays payments by more than 30 days, lending partners may tighten their criteria, leading to a decrease in loan approval rates on the platform.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Chart - 4: Pie Chart for Credit Mix
credit_mix_counts = df['Credit_Mix'].value_counts()

plt.figure(figsize=(8, 8))
plt.pie(credit_mix_counts, labels=credit_mix_counts.index, autopct='%1.1f%%',
        colors=['#66b3ff','#99ff99','#ffcc99'], startangle=140, explode=(0.05, 0, 0))
plt.title('Proportion of Different Credit Mix Categories')
plt.show()

##### 1. Why did you pick the specific chart?

I used a Pie Chart because it is excellent for showing "part-to-whole" relationships. Since Credit_Mix has only a few categories (Standard, Good, Bad), it clearly shows what percentage of the customer base has a healthy variety of credit accounts.

##### 2. What is/are the insight(s) found from the chart?

This chart shows the percentage of customers with a "Good" vs. "Bad" credit portfolio. A dominant "Standard" or "Bad" slice indicates that most users lack diversity in their credit types (e.g., they might only have credit cards and no secured loans), which prevents them from reaching the highest credit tiers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Paisabazaar can recommend "Secured Credit" or "Builder Loans" specifically to the "Bad" mix segment to help them diversify their portfolios, which creates a new revenue stream for the company while helping the user.

A high percentage of "Bad" credit mix is a negative indicator for premium bank partners. Banks are less likely to offer low-interest rates to a platform whose user base predominantly lacks a proven track record of managing different types of credit.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Chart - 5: Histogram for Monthly Inhand Salary
plt.figure(figsize=(10, 6))
sns.histplot(df['Monthly_Inhand_Salary'], bins=20, color='mediumseagreen', edgecolor='black')
plt.title('Frequency Distribution of Monthly Inhand Salary')
plt.xlabel('Monthly Salary (Inhand)')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

I picked a Histogram to see the "liquidity" of the customers. While Annual Income shows wealth, the monthly inhand salary shows the actual cash flow available to pay bills and EMIs.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals the most common salary brackets. If the peak is at a lower value, it tells us that most users are living on a tight monthly budget. This is a more realistic indicator of repayment capacity than total annual income.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This helps in "Smart EMI" calculations. Paisabazaar can suggest loan amounts where the EMI does not exceed 30–40% of this monthly inhand figure, ensuring lower default rates and higher customer satisfaction.

If the majority of users have very low inhand salaries despite high annual incomes (due to deductions or taxes), they may be "cash poor." These users are high-risk because any unexpected expense could cause them to miss a debt payment.
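The 30–40% guideline above reduces to simple arithmetic. The sketch below is one possible reading of it: the function name `max_affordable_emi` and the choice to subtract existing EMIs (for example the Total_EMI_per_month column) from the cap are assumptions for illustration, not rules stated by Paisabazaar or any lender.

```python
def max_affordable_emi(monthly_inhand_salary: float,
                       existing_emi: float = 0.0,
                       cap_ratio: float = 0.40) -> float:
    """Return the additional EMI that keeps total EMIs within cap_ratio of salary."""
    if monthly_inhand_salary <= 0:
        raise ValueError("monthly_inhand_salary must be positive")
    headroom = cap_ratio * monthly_inhand_salary - existing_emi
    # No headroom left means no additional EMI is advisable
    return max(headroom, 0.0)


# Salary 50,000 with 12,000 already committed: 40% cap is 20,000, so 8,000 remains
example = max_affordable_emi(50_000, existing_emi=12_000)
print(example)
```

Lowering `cap_ratio` to 0.30 gives the more conservative end of the range.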

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Chart - 6: Box Plot of Annual Income by Credit Score
plt.figure(figsize=(10, 6))
sns.boxplot(x='Credit_Score', y='Annual_Income', data=df,
            palette='Set2', order=['Poor', 'Standard', 'Good'])
plt.title('Impact of Annual Income on Credit Score')
plt.xlabel('Credit Score Category')
plt.ylabel('Annual Income')
plt.show()

##### 1. Why did you pick the specific chart?

I used a Box Plot because it allows us to compare the distribution (median and spread) of a numerical variable (Annual_Income) across different categorical levels (Credit_Score).

##### 2. What is/are the insight(s) found from the chart?

The chart shows a clear upward trend: customers with "Good" credit scores generally have higher median incomes. However, there is significant overlap between "Poor" and "Standard," suggesting that income alone doesn't guarantee a good score; behavior matters more.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. It confirms that income is a filter, but not the only factor. Paisabazaar can target high-income individuals with "Poor" scores for "Credit Repair" services, as they have the financial capacity to improve but lack the right habits.

The presence of many high-income outliers in the "Poor" category is a risk. It indicates that some high earners are highly irresponsible with debt, meaning traditional income-based lending models might fail if they don't look at behavioral data.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Chart - 7: Violin Plot of Outstanding Debt by Credit Score
plt.figure(figsize=(12, 6))
sns.violinplot(x='Credit_Score', y='Outstanding_Debt', data=df,
               palette='magma', order=['Poor', 'Standard', 'Good'])
plt.title('Relationship between Outstanding Debt and Credit Score')
plt.show()

##### 1. Why did you pick the specific chart?

A Violin Plot is perfect here because it shows both the "box plot" data and the "density" of the data. It helps us see where the "bulge" of debt lies for each score category.

##### 2. What is/are the insight(s) found from the chart?

The "Poor" category has a much wider and higher distribution of outstanding debt. The "Good" category is very narrow and stays at the lower end, indicating that maintaining low debt levels is a hallmark of a high credit score.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Absolutely. Paisabazaar can offer "Debt Consolidation Loans" to users in the "Poor" or "Standard" bulge. By helping them consolidate high-interest debt into one manageable payment, their scores will eventually improve, creating a loyal customer.

If the "Standard" category starts showing a density bulge toward higher debt, it’s a warning sign of an upcoming "downgrade cycle" where middle-tier customers might slip into the "Poor" category due to over-leveraging.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Chart - 8: Stacked Bar Chart for Payment Habits
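# normalize='index' converts each row to proportions that sum to 1
pmt_min = pd.crosstab(df['Payment_of_Min_Amount'], df['Credit_Score'], normalize='index')
# Order the columns explicitly so the colors map to Poor (red), Standard (blue), Good (green)
pmt_min = pmt_min.reindex(columns=['Poor', 'Standard', 'Good'])
pmt_min.plot(kind='bar', stacked=True, figsize=(10,6), color=['#ff9999','#66b3ff','#99ff99'])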

plt.title('Credit Score Distribution by Minimum Amount Payment Habit')
plt.xlabel('Paid Minimum Amount Only?')
plt.ylabel('Proportion of Customers')
plt.legend(title='Credit Score', loc='upper right')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a Stacked Bar Chart (normalized to 100%) to show the "composition" of credit scores within two behavioral groups. This makes it very easy to see the probability of having a good score based on a single habit.

##### 2. What is/are the insight(s) found from the chart?

The insight is stark: those who answer "Yes" to paying only the minimum amount have a much higher proportion of "Poor" credit scores. Paying only the minimum is a massive "red flag" for credit health.
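The proportions behind a normalized stacked bar can be read directly from a row-normalized cross-tabulation. The eight rows below are invented to show the mechanics and do not reflect the real shares.

```python
import pandas as pd

# Hypothetical toy rows, chosen so the arithmetic is easy to follow
toy = pd.DataFrame({
    "Payment_of_Min_Amount": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Credit_Score": ["Poor", "Poor", "Standard", "Good", "Good", "Good", "Standard", "Poor"],
})

# normalize="index" makes every row sum to 1
shares = pd.crosstab(toy["Payment_of_Min_Amount"], toy["Credit_Score"], normalize="index")
shares = shares.reindex(columns=["Poor", "Standard", "Good"])
print(shares)
```

In this toy frame half of the "Yes" group is "Poor" and half of the "No" group is "Good"; the real table would come from passing the notebook's `df` columns instead.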

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This is a clear "Educational Nudge" opportunity. Paisabazaar can send automated alerts saying: "Did you know? Paying just 50 more than the minimum could boost your score by X points."

Yes. If the majority of the platform's users are "Minimum Payers," the overall "Platform Quality" is low. This might discourage premium lenders (Amex, Chase, etc.) from advertising on Paisabazaar because they seek "Full Payers."

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Chart - 9: Box Plot of Payment Delays grouped by Credit Score
plt.figure(figsize=(10, 6))
sns.boxplot(x='Credit_Score', y='Delay_from_due_date', data=df,
            palette='coolwarm', order=['Poor', 'Standard', 'Good'])
plt.title('How Payment Delays vary across Credit Score Tiers')
plt.xlabel('Credit Score')
plt.ylabel('Days Delayed past Due Date')
plt.show()

##### 1. Why did you pick the specific chart?

I used a Box Plot to compare the distribution of delays across credit tiers. It clearly shows the "threshold" of delays that separates a 'Good' score from a 'Poor' one.

##### 2. What is/are the insight(s) found from the chart?

There is a clear monotonic pattern: the longer the typical payment delay, the lower the credit tier. Customers in the 'Good' category have a median delay close to zero, while 'Poor' scorers consistently delay payments by several weeks.
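To back the visual reading with numbers, the quartiles that a box plot draws can be computed per tier. This is a minimal sketch on invented values (the `toy` frame and its delays are hypothetical, not results from the dataset); on the real data the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Hypothetical delays (in days) for four customers per tier
toy = pd.DataFrame({
    "Credit_Score": ["Poor"] * 4 + ["Standard"] * 4 + ["Good"] * 4,
    "Delay_from_due_date": [20, 30, 35, 45, 10, 15, 20, 30, 0, 2, 5, 10],
})

# Quartiles per tier: the same numbers a box plot encodes as box edges and median line
summary = (
    toy.groupby("Credit_Score")["Delay_from_due_date"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
       .reindex(["Poor", "Standard", "Good"])
)
summary.columns = ["q1", "median", "q3"]
summary["iqr"] = summary["q3"] - summary["q1"]  # box height
print(summary)
```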

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. This allows for Predictive Alerts. If a 'Standard' customer starts showing a trend of 5+ days delay, Paisabazaar can send an automated warning about the potential score drop, helping the user stay in the higher tier.
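One simple way such an alert rule might be expressed is a boolean filter. The sketch below is an assumption-laden illustration: the `toy` rows and the `Customer_ID` values are hypothetical, and it checks a single snapshot of `Delay_from_due_date` rather than a true trend over several billing cycles.

```python
import pandas as pd

DELAY_THRESHOLD_DAYS = 5  # threshold taken from the text; should be tuned on real data

# Hypothetical customers standing in for the notebook's `df`
toy = pd.DataFrame({
    "Customer_ID": ["C1", "C2", "C3", "C4"],
    "Credit_Score": ["Standard", "Standard", "Good", "Poor"],
    "Delay_from_due_date": [7, 2, 6, 30],
})

# Flag 'Standard' customers whose delay has reached the threshold
is_standard = toy["Credit_Score"] == "Standard"
is_late = toy["Delay_from_due_date"] >= DELAY_THRESHOLD_DAYS
at_risk = toy[is_standard & is_late]

print(at_risk["Customer_ID"].tolist())
```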

The very tight distribution in the 'Good' category suggests that even a small mistake (one-off delay) could potentially kick a user out of the premium tier, leading to sudden loss of eligibility for low-interest products.

#### Chart - 10 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Chart - 10: Correlation Heatmap
# We select only numerical columns to calculate the Pearson correlation
plt.figure(figsize=(12, 8))
numeric_df = df.select_dtypes(include=[np.number])
correlation_matrix = numeric_df.corr()

# Plotting the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='RdYlGn', fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap: Relationship Between Numerical Features')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a Correlation Heatmap because it is the most efficient way to perform Multivariate Analysis. It allows us to see how all numerical variables interact with each other simultaneously, helping identify which factors (like Debt, Income, and Delays) move in tandem.

##### 2. What is/are the insight(s) found from the chart?

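The heatmap reveals relationships that single charts might miss. Note that Credit_Score is stored as a category ('Poor', 'Standard', 'Good'), so `select_dtypes(include=[np.number])` leaves it out of the matrix unless it has first been encoded as a number; once encoded, we would expect a negative correlation with Delay_from_due_date, consistent with the box plot in Chart 9. The heatmap also helps spot multicollinearity: if two variables (like Annual_Income and Monthly_Inhand_Salary) are too highly correlated, only one of them is needed when building a predictive model.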
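A minimal sketch of both ideas follows, on synthetic data. Everything here is an assumption made for illustration: the `toy` frame, the helper column `Credit_Score_Num`, the ordinal mapping, the use of Spearman rank correlation (chosen because the encoded score is ordinal), and the 0.8 cut-off for "too highly correlated".

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
income = rng.normal(50_000, 10_000, n)

# Synthetic stand-in for `df`; the score is derived from delay by construction
toy = pd.DataFrame({
    "Annual_Income": income,
    "Monthly_Inhand_Salary": income / 12 + rng.normal(0, 100, n),
    "Delay_from_due_date": rng.integers(0, 60, n),
})
toy["Credit_Score"] = pd.cut(
    toy["Delay_from_due_date"], bins=[-1, 15, 30, 60],
    labels=["Good", "Standard", "Poor"],
).astype(str)

# Ordinal encoding so the target can enter the correlation matrix
score_order = {"Poor": 0, "Standard": 1, "Good": 2}
toy["Credit_Score_Num"] = toy["Credit_Score"].map(score_order)

# Spearman works on ranks, which suits an ordinal target
corr = toy.select_dtypes(include=[np.number]).corr(method="spearman")
print(corr["Credit_Score_Num"].round(2))

# Keep the upper triangle only, then list pairs above the assumed 0.8 cut-off
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
high_pairs = corr.abs().where(mask).stack()
high_pairs = high_pairs[high_pairs > 0.8]
print(high_pairs)
```

On real data, passing `corr` to `sns.heatmap` in place of `correlation_matrix` would add the encoded score as an extra row and column.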

#### Chart - 11 - Pair Plot


In [None]:
# Chart - 11: Pair Plot
# Selecting a subset of key numerical features to keep the plot readable and fast
cols_to_plot = ['Annual_Income', 'Outstanding_Debt', 'Delay_from_due_date', 'Credit_Score']

sns.pairplot(df[cols_to_plot], hue='Credit_Score', palette='viridis', diag_kind='kde')
plt.suptitle('Pair Plot of Key Financial Drivers', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

The Pair Plot is the most comprehensive multivariate tool available. It allows us to see the distribution of each variable (on the diagonal) and the scatter relationships between every possible pair of variables (on the off-diagonal) simultaneously, all color-coded by the target Credit_Score.

##### 2. What is/are the insight(s) found from the chart?

It visually confirms "clusters." For example, you can see how 'Good' scores cluster in areas of low debt and low delay across multiple dimensions. It also reveals whether relationships are linear or if there are non-linear boundaries that separate a 'Standard' customer from a 'Poor' one.

In [None]:
# Statistical Analysis: T-Test
from scipy.stats import ttest_ind

# Grouping income by Credit Score
good_score_income = df[df['Credit_Score'] == 'Good']['Annual_Income']
poor_score_income = df[df['Credit_Score'] == 'Poor']['Annual_Income']

# Performing the T-Test
t_stat, p_val = ttest_ind(good_score_income, poor_score_income)

print(f"T-Statistic: {t_stat:.4f}")
print(f"P-Value: {p_val:.4g}")

if p_val < 0.05:
    print("\nInsight: The difference in income is statistically significant.")
else:
    print("\nInsight: No statistically significant difference found.")
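The call above uses `ttest_ind` with its defaults, which assume equal variances in both groups. As a hedged alternative, the sketch below runs Welch's variant (`equal_var=False`) and adds Cohen's d as an effect size, since with large samples a tiny difference can still be "significant". The two income arrays are synthetic and the group means and spreads are invented for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

# Synthetic incomes with unequal spread and unequal group sizes (hypothetical values)
good_income = rng.normal(60_000, 15_000, 300)
poor_income = rng.normal(50_000, 25_000, 500)

# Welch's t-test: does not assume equal variances; NaNs are skipped
t_stat, p_val = ttest_ind(
    good_income, poor_income, equal_var=False, nan_policy="omit"
)

# Cohen's d using the average of the two sample variances
pooled_sd = np.sqrt((good_income.var(ddof=1) + poor_income.var(ddof=1)) / 2)
cohens_d = (good_income.mean() - poor_income.mean()) / pooled_sd

print(f"Welch t = {t_stat:.2f}, p = {p_val:.3g}, Cohen's d = {cohens_d:.2f}")
```

Because Credit_Score has three tiers, a one-way ANOVA or a Kruskal-Wallis test across all three groups would be a natural extension of this two-group comparison.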

## **5. Solution to Business Objective**

### **Proposed Strategies for Business Growth**

#### **1. Behavioral "Nudge" Engine**
* **The Insight:** Our analysis of **Payment Delays** and **Minimum Payment Habits** showed these are the strongest indicators of a 'Poor' credit score.
* **The Action:** Develop an automated notification system that alerts users *before* they hit a 5-day delay.
* **The Goal:** Prevent score degradation by coaching users toward better repayment discipline, keeping them eligible for premium bank products.

#### **2. Targeted Debt Consolidation**
* **The Insight:** We identified a specific segment of **"High-Income"** users who are stuck in 'Poor' credit tiers due to high **Outstanding Debt**.
* **The Action:** Create a dedicated marketing funnel for **Debt Consolidation Loans** specifically for this high-earning segment.
* **The Goal:** Help users merge multiple high-interest debts into one manageable EMI, reducing their credit utilization and boosting their scores.



#### **3. "Credit-Mix" Diversification Path**
* **The Insight:** The data suggests that a **'Standard' or 'Bad' Credit Mix** acts as a ceiling: users in these groups rarely reach a 'Good' credit score.
* **The Action:** Recommend **"Credit Builder"** products (like secured cards or small personal loans) to users with a thin credit file.
* **The Goal:** Mature the customer profile so they can eventually qualify for high-ticket items like Home or Auto loans.

#### **4. Dynamic Risk Assessment**
* **The Insight:** Income alone is not a safety net; many high-earners have poor credit due to behavioral factors.
* **The Action:** Shift from simple income-based filters to a **Multi-Factor Risk Model** that weighs **Credit History Age** and **Repayment Habits** more heavily.
* **The Goal:** Reduce the platform’s default rate by identifying "low-income but high-discipline" borrowers who are often overlooked.

#### **5. Interest Rate Transparency Tools**
* **The Insight:** There is a massive **"Interest Gap"** between 'Good' and 'Poor' scorers.
* **The Action:** Launch a **"Potential Savings Calculator"** in the app.
* **The Goal:** Show users exactly how much money they are losing to high interest rates, creating a strong incentive for them to use Paisabazaar’s credit improvement services.
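
A rough sketch of the arithmetic behind such a calculator is shown below, using the standard reducing-balance EMI formula. The function names, the loan amount, the tenure, and the two interest rates are all hypothetical placeholders, not figures from the dataset.

```python
def emi(principal: float, annual_rate_pct: float, months: int) -> float:
    """Equated monthly instalment for a reducing-balance loan."""
    r = annual_rate_pct / 12 / 100  # monthly rate as a fraction
    if r == 0:
        return principal / months
    factor = (1 + r) ** months
    return principal * r * factor / (factor - 1)


def total_interest(principal: float, annual_rate_pct: float, months: int) -> float:
    """Interest paid over the full tenure."""
    return emi(principal, annual_rate_pct, months) * months - principal


# Hypothetical comparison: same loan priced for a 'Poor' vs. a 'Good' scorer
loan, tenure = 500_000, 36
saving = total_interest(loan, 18.0, tenure) - total_interest(loan, 11.0, tenure)
print(f"Potential interest saved over {tenure} months: {saving:,.0f}")
```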



---

### **Final Business Impact**
By moving from a simple marketplace to a **Financial Wellness Platform**, Paisabazaar can:

* **Lower Default Rates:** By filtering for disciplined behavior rather than just salary.
* **Increase Referral Commissions:** By maturing users from 'Standard' to 'Good' scores, where high-commission premium products are available.
* **Higher Customer Retention:** By providing actual value in helping users save money on interest.

# **Conclusion**

## **6. Conclusion**

### **Key Findings from Analysis**
* **Behavior Over Income:** The most significant takeaway is that **Repayment Discipline** (measured by delays and payment habits) is a more accurate predictor of creditworthiness than a user's raw **Annual Income**. High earners are not automatically "Good" credit risks.
* **The "Debt Ceiling":** In this dataset, users with high **Outstanding Debt** and a **Bad Credit Mix** rarely reach the 'Good' credit tier, regardless of other positive factors.
* **Longevity Matters:** A mature **Credit History Age** acts as a stabilizer. Analysis shows that "Good" credit scores are built over time through consistent, long-term credit management.
* **Cost of Inaction:** The data reveals a clear **"Interest Rate Penalty"** for 'Poor' and 'Standard' scorers, who pay significantly higher rates, often trapping them in a cycle of debt.



### **Final Project Summary**
This project successfully transitioned from raw data cleaning to deep behavioral analysis. By performing **Univariate, Bivariate, and Multivariate analysis**, we identified the key levers that move a credit score.

The cleaning and wrangling process ensured that the insights were not skewed by financial outliers, while the **Statistical Analysis** (a t-test comparing Annual_Income between 'Good' and 'Poor' scorers) checked whether the observed income difference is statistically significant rather than a chance pattern.

### **Closing Remarks**
For **Paisabazaar**, the opportunity lies in becoming more than just a middleman. By leveraging these insights to provide **Personalized Credit Coaching**, the platform can reduce default risks for bank partners and unlock new revenue streams from a more financially healthy and loyal user base.
