<a href="https://colab.research.google.com/github/ManojSheshama/Capstone_Project_2/blob/main/Capstone_Project_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Paisabazaar Banking Fraud Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Project By**      - Manoj Kumar Sheshama


# **Project Summary -**

The financial industry has witnessed a significant rise in fraudulent activities and credit risks over the past decade, making fraud detection and credit risk assessment critical areas of focus for banks and financial institutions. In this project, we analyze a large-scale dataset provided by Paisabazaar, which includes customer demographics, financial details, credit utilization patterns, and payment behaviors. By conducting Exploratory Data Analysis (EDA) and visualizations, the goal is to uncover key insights that can help detect fraudulent behavior, assess financial risk, and strengthen customer credit profiling.

The dataset consists of 100000 records with 28 attributes related to customer identity, financial history, and credit score. Key columns include Annual Income, Monthly Inhand Salary, Number of Bank Accounts, Number of Credit Cards, Interest Rate, Outstanding Debt, Credit Utilization Ratio, Credit History Age, Payment Behavior, and Credit Score. These variables are essential indicators of financial health and are widely used in risk modeling and fraud detection systems. The presence of demographic details such as Age, Occupation, and SSN adds an additional layer for identifying potential fraud cases, such as multiple accounts linked to the same identity or unrealistic financial behaviors.

The first step of the project involves data cleaning and preprocessing. The dataset contains very few missing values (one null each in three columns), and the affected rows are dropped. Duplicate entries and repeated customer records across multiple months are also examined to ensure data integrity.

The next stage focuses on Exploratory Data Analysis (EDA). Here, we investigate how different features relate to fraudulent or risky financial behavior. For instance, the relationship between Credit Utilization Ratio and Credit Score can reveal whether customers who use higher proportions of their available credit are more likely to have poor scores. Similarly, analysis of Number of Delayed Payments and Delay from Due Date can highlight patterns of financial irresponsibility, which are often precursors to fraud or default. Visualizations such as histograms, boxplots, heatmaps, and correlation matrices help identify trends, anomalies, and feature interdependencies.

One of the core aspects of this project is fraud analysis. Fraudulent behavior often manifests in unrealistic or contradictory patterns in financial data. For example, a customer with a low annual income but multiple credit cards, high loan counts, and excessive monthly spending could indicate suspicious activity. Similarly, sudden changes in Credit Limit, unusually high Credit Inquiries, or inconsistent Payment Behavior can serve as red flags. By combining statistical analysis and visual exploration, we aim to highlight such irregularities.
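The red flags described above can be expressed as simple rules. The following is a minimal sketch on made-up rows, not the project's method: the function name `flag_suspicious`, every threshold, and the column names `Num_Credit_Card` and `Num_Credit_Inquiries` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy rows for illustration only; the values are made up.
toy = pd.DataFrame({
    "Annual_Income": [15000, 90000, 20000, 60000],
    "Num_Credit_Card": [9, 3, 2, 4],          # assumed column name
    "Num_of_Loan": [7, 1, 1, 2],
    "Num_Credit_Inquiries": [14, 2, 3, 1],    # assumed column name
})

def flag_suspicious(df, income_q=0.25, max_cards=7, max_loans=5, max_inquiries=10):
    """Return a boolean Series marking rows that trip simple red-flag rules.

    All thresholds are hypothetical defaults, not values taken from the project.
    """
    # Low income relative to the rest of the data (bottom quartile by default)
    low_income = df["Annual_Income"] <= df["Annual_Income"].quantile(income_q)
    # Unusually many credit products
    many_products = (df["Num_Credit_Card"] > max_cards) | (df["Num_of_Loan"] > max_loans)
    # Unusually many credit inquiries
    many_inquiries = df["Num_Credit_Inquiries"] > max_inquiries
    return (low_income & many_products) | many_inquiries

toy["Suspicious"] = flag_suspicious(toy)
print(toy)
```

Rules like these only surface candidates for review; a flagged row is not evidence of fraud on its own.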

Another key outcome of this project is credit risk profiling. The dataset contains a labeled feature, Credit Score, categorized as Good, Standard, or Poor. By analyzing the distribution of features across these groups, we can uncover the primary factors contributing to poor credit performance. For example, high EMI obligations compared to income, frequent delays in payment, and poor credit mix are expected to correlate strongly with lower credit scores. These insights are valuable for banks in developing scoring models and improving lending decisions.

Visual storytelling plays a vital role in this analysis. Through bar charts, scatter plots, and trend graphs, we illustrate how financial indicators differ across customer segments. Heatmaps and pair plots help identify multi-feature interactions, while boxplots highlight outliers that may correspond to fraud. For instance, we can visualize the variation of Outstanding Debt across different income groups or compare Payment Behavior with Credit Utilization Ratios to detect unusual spending habits.

In conclusion, this project provides a comprehensive understanding of customer financial behavior using real-world banking data. By leveraging EDA and visualization techniques, we identify patterns of fraud, risk factors affecting credit scores, and relationships between financial attributes. The insights gained not only aid in detecting fraudulent activities but also enable financial institutions to optimize their credit risk strategies and enhance customer profiling systems. Ultimately, this project demonstrates how data-driven analysis can contribute to safer, more transparent, and more efficient banking systems.

# **GitHub Link -**

https://github.com/ManojSheshama/Capstone_Project_2.git

# **Problem Statement**


Given a dataset of 100000 banking customers with detailed information about their financial activities, credit history, and personal profiles, how can we effectively analyze this data to detect and flag potential fraudulent behavior? The analysis should focus on identifying patterns and anomalies that deviate from typical customer behavior, which could indicate fraudulent activities such as credit card fraud, loan exploitation, or unusual financial transactions. We have to use at least 5 different types of charts in the visualization section.

#### **Define Your Business Objective?**

The objective of this project is to analyze customer financial and behavioral data to detect potential banking fraud and assess credit risk more effectively. By leveraging Exploratory Data Analysis and visualization techniques, our project aims to uncover hidden patterns, anomalies, and key factors that indicate fraudulent activity or poor credit performance. These insights will help financial institutions like Paisabazaar strengthen fraud detection systems, optimize lending decisions, and improve overall customer risk profiling.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
paisabazaar_df = pd.read_csv('/content/dataset.csv')

### Dataset First View

In [None]:
# Dataset First Look
paisabazaar_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
paisabazaar_df.shape

### Dataset Information

In [None]:
# Dataset Info
paisabazaar_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(paisabazaar_df[paisabazaar_df.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
paisabazaar_df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(paisabazaar_df.isnull(), cbar=False, cmap='coolwarm')  # cbar takes a bool; the colour scheme goes in cmap

### What did you know about your dataset?

During the initial exploration, I found that the dataset has 100000 rows and 28 columns. The Age column is stored as a float type, so it should be converted to an integer type for easier interpretation. I also checked for duplicate records, but none were found. While analyzing missing values, I observed that the last three columns (Payment_Behaviour, Monthly_Balance, Credit_Score) contained one null value each.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
paisabazaar_df.columns

In [None]:
# Dataset Describe
paisabazaar_df.describe(include='all')

### Variables Description

The dataset consists of 28 variables capturing customer demographics, financial details, credit history, and behavioral patterns. Key features include Age, Occupation, Annual Income, Number of Bank Accounts, Credit Utilization Ratio, Outstanding Debt, and Payment Behavior, which help in understanding customer financial profiles. The target variable is Credit Score, categorized as Good, Standard, or Poor, which serves as a benchmark for assessing fraud risk and creditworthiness.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
paisabazaar_df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Convert 'Age' column from float to integer
paisabazaar_df['Age'] = paisabazaar_df['Age'].astype(int)

# Check for duplicate values
duplicates = paisabazaar_df.duplicated().sum()
print("Number of duplicate rows:", duplicates)

# Check for null values before cleaning
print("Null values before cleaning:\n", paisabazaar_df.isnull().sum())

# Drop rows with null values (only a few rows are affected)
paisabazaar_df = paisabazaar_df.dropna()


# Verify after cleaning
print("Shape after dropping null values:", paisabazaar_df.shape)
print("Null values after cleaning:\n", paisabazaar_df.isnull().sum())


In [None]:
# Remove outliers in 'Annual_Income' using the IQR rule; the filtered rows are stored in a separate DataFrame, df
Q1 = paisabazaar_df['Annual_Income'].quantile(0.25)
Q3 = paisabazaar_df['Annual_Income'].quantile(0.75)
IQR = Q3 - Q1
df = paisabazaar_df[(paisabazaar_df['Annual_Income'] >= (Q1 - 1.5 * IQR)) & (paisabazaar_df['Annual_Income'] <= (Q3 + 1.5 * IQR))]

In [None]:
# Create a new column 'Income_per_Account'; zero account counts become NaN to avoid division by zero
paisabazaar_df['Income_per_Account'] = paisabazaar_df['Annual_Income'] / paisabazaar_df['Num_Bank_Accounts'].replace(0, np.nan)


In [None]:
# Create 'Credit_Score_Level' by mapping the categorical Credit_Score labels (Good/Standard/Poor) to levels
paisabazaar_df['Credit_Score_Level'] = paisabazaar_df['Credit_Score'].map({'Good': 'High', 'Standard': 'Medium', 'Poor': 'Low'})

### What all manipulations have you done and insights you found?

The Age column was stored as a float type, so I converted it to an integer type for easier interpretation. I also checked for duplicate records, but none were found in the dataset. While analyzing missing values, I observed that the last three columns (Payment_Behaviour, Monthly_Balance, Credit_Score) contained one null value each. To ensure data consistency and avoid biased analysis, I dropped all rows containing null values. After these preprocessing steps, the dataset is cleaner, with standardized datatypes and no missing or duplicate records, making it ready for further analysis and visualization. I also filtered outliers in the Annual_Income column using the IQR rule to exclude incomes that are extremely high or low compared to the majority; the filtered rows are stored in a separate DataFrame, df. Two new columns were also created:

1. Income_per_Account - the income per bank account for customers holding multiple accounts.
2. Credit_Score_Level - the credit score expressed as a Low, Medium, or High level.
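The IQR rule used for Annual_Income can be wrapped in a small reusable helper. This is a sketch under assumptions: the name `iqr_filter` and the toy incomes are made up for illustration, and `k=1.5` simply mirrors the multiplier used in the wrangling cell.

```python
import pandas as pd

def iqr_filter(df, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive on both ends, matching the >= / <= comparison
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

# Toy incomes for illustration only; the last value is an obvious outlier.
toy = pd.DataFrame({"Annual_Income": [30000, 32000, 35000, 38000, 40000, 900000]})
filtered = iqr_filter(toy, "Annual_Income")
print(len(toy), "rows before,", len(filtered), "rows after")
```

Returning a new DataFrame rather than modifying the input makes it explicit which charts use the filtered data and which use the full data.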

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1  Number of Customers By Occupation - Bar Chart (Demographic Analysis)

In [None]:
Occupation_wise_no_of_customers = paisabazaar_df['Occupation'].value_counts().reset_index()

In [None]:
Occupation_wise_no_of_customers

In [None]:

fig, ax = plt.subplots(figsize= [10,3])
ax.bar(x= Occupation_wise_no_of_customers['Occupation'].astype('str'), height = Occupation_wise_no_of_customers['count'], color = "violet")
ax.set_xlabel('Occupation')
ax.set_ylabel('No. of Customers')
ax.set_title('Occupation-wise Number of Customers')
ax.spines[['top', 'right', 'left']].set_visible(False)
plt.xticks(rotation=45)


plt.show()

##### 1. Why did you pick the specific chart?

Bar charts are ideal for categorical comparisons across occupations.

##### 2. What is/are the insight(s) found from the chart?

Certain occupations like Lawyers & Engineers dominate the dataset, while others have smaller representation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps identify high-volume customer segments to target for loans and credit products.

#### Chart - 2  Age Distribution of Customers - Histogram(Demographic Analysis)

In [None]:
sns.histplot(paisabazaar_df['Age'])
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

Histograms reveal distribution trends across a continuous variable.

##### 2. What is/are the insight(s) found from the chart?

Most customers are concentrated in the working-age bracket (25–40 years).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Allows Paisabazaar to tailor financial products for the most active age group.


#### Chart - 3    Annual Income by Occupation - Boxplot(Demographic Analysis)

In [None]:

Occupation_wise_income = paisabazaar_df.groupby('Occupation')['Annual_Income'].sum().reset_index()

Occupation_wise_income

In [None]:
# Create a box plot for Annual Income across Occupation
plt.figure(figsize=(12, 8))
sns.boxplot(x='Occupation', y='Annual_Income', data=paisabazaar_df)
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.title('Distribution of Annual Income Across Different Occupations')
plt.xlabel('Occupation')
plt.ylabel('Annual Income')
plt.show()

##### 1. Why did you pick the specific chart?

Boxplots display income spread and outliers across groups.

##### 2. What is/are the insight(s) found from the chart?

Salaried individuals generally earn higher and more stable incomes, while some occupations show wider variability.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Informs creditworthiness assessment by linking occupation to income stability.

#### Chart - 4  Annual Income Distribution - Histogram (Income & Salary Insights)

In [None]:
sns.histplot(paisabazaar_df['Annual_Income'])
plt.title('Distribution of Annual Income')
plt.show()

##### 1. Why did you pick the specific chart?

Useful for visualizing skewness and spread in customer income levels.

##### 2. What is/are the insight(s) found from the chart?

Income distribution is slightly skewed, with many customers in lower-to-mid income ranges.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps identify affordability ranges and potential loan size limits.

#### Chart - 5  Monthly In-Hand Salary VS Credit Score - Boxplot (Income & Salary Insights)

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='Credit_Score', y='Monthly_Inhand_Salary', data=paisabazaar_df)
plt.title('Monthly In-Hand Salary VS Credit Score')
plt.xlabel('Credit Score')
plt.ylabel('Monthly In-Hand Salary')
plt.show()


##### 1. Why did you pick the specific chart?

Shows how salary variation relates to credit score categories.

##### 2. What is/are the insight(s) found from the chart?

Customers with higher in-hand salaries are more often in the Good credit score group.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Demonstrates salary as a predictor of repayment ability and creditworthiness.

#### Chart - 6  Comparing Income Distribution of Good VS Poor Credit Score Customers - KDE plot (Income & Salary Insights)

In [None]:
#Compare Income Distribution of Good VS Poor Credit Score Customers - KDE plot
plt.figure(figsize=(10, 6))

# KDE plot for Good credit score customers (fill=True replaces the deprecated shade=True)
sns.kdeplot(df[df['Credit_Score'] == 'Good']['Annual_Income'],
            label='Good Credit Score', fill=True)

# KDE plot for Poor credit score customers
sns.kdeplot(df[df['Credit_Score'] == 'Poor']['Annual_Income'],
            label='Poor Credit Score', fill=True, color='red')

plt.title("Income Distribution: Good vs Poor Credit Score Customers")
plt.xlabel("Annual Income")
plt.ylabel("Density")
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

KDE highlights income density differences between groups.

##### 2. What is/are the insight(s) found from the chart?

Good scorers cluster around higher income levels, while poor scorers are more spread in the lower range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Reinforces that income strongly influences repayment and credit risk.

#### Chart - 7  Average Number of Loans VS Credit Score - Bar Chart (Credit & Loans Analysis)

In [None]:

#Average Number of Loans VS Credit Score - Bar Chart
plt.figure(figsize=(10, 6))
sns.barplot(x='Credit_Score', y='Num_of_Loan', data=paisabazaar_df)
plt.title('Average Number of Loans VS Credit Score')
plt.xlabel('Credit Score')
plt.ylabel('Average Number of Loans')
plt.show()


##### 1. Why did you pick the specific chart?

Simple comparison of averages across categories.

##### 2. What is/are the insight(s) found from the chart?

Poor scorers hold more loans on average than good scorers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Multiple active loans increase default risk; can guide stricter multi-loan approval checks.

#### Chart - 8  Types of Loans Availed by Customers - Stacked Bar Chart (Credit & Loans Analysis)

In [None]:
# Split the Type_of_Loan column (loans are separated by commas)
loan_split = paisabazaar_df['Type_of_Loan'].str.get_dummies(sep=',')

# Add Credit Score if you want to analyze loans by score groups
loan_by_score = pd.concat([paisabazaar_df['Credit_Score'], loan_split], axis=1)

# Group by Credit Score and sum
loan_distribution = loan_by_score.groupby('Credit_Score').sum()

# Plot stacked bar chart
loan_distribution.plot(kind='bar', stacked=True, figsize=(10,6))

plt.title("Types of Loans Availed by Customers (Stacked by Credit Score)")
plt.xlabel("Credit Score Category")
plt.ylabel("Number of Loans")
plt.legend(title="Loan Types", bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

Stacked bar charts are effective for showing how different loan types are distributed across customers.

##### 2. What is/are the insight(s) found from the chart?

Personal loans and credit cards dominate, with customers holding multiple loans often having weaker credit scores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps lenders identify over-leveraged customers and control loan approvals accordingly.

#### Chart - 9  Outstanding Debt VS Annual Income(with Credit Score as Color) - Scatter plot (Credit & Loan Analysis)

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Annual_Income', y='Outstanding_Debt', hue='Credit_Score', data=paisabazaar_df)
plt.title('Outstanding Debt VS Annual Income (with Credit Score as Color)')
plt.xlabel('Annual Income')
plt.ylabel('Outstanding Debt')
plt.show()


##### 1. Why did you pick the specific chart?

Best to show relation between income and debt with credit score as color.

##### 2. What is/are the insight(s) found from the chart?

Poor scorers: higher debt at low income.

Good scorers: lower debt, better balance.

Clear trend: higher income → better credit & lower debt.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Identify risky low-income/high-debt customers.

Refine loan approvals using debt-to-income ratio.

Target stable customers for premium products.

Yes, there is an insight that could lead to negative growth. The scatter plot for Outstanding Debt VS Annual Income reveals that customers with 'Poor' credit scores tend to have higher outstanding debt, especially when their annual income is low. This group represents a significant risk of loan defaults, leading to increased bad debt and financial losses for the banking institution, thus contributing to negative growth.
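The debt-to-income idea mentioned above can be sketched as follows. The toy values, the `Debt_to_Income` and `High_DTI` column names, and the 0.15 cutoff are all assumptions for illustration, not figures from the analysis.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; the values are made up.
toy = pd.DataFrame({
    "Annual_Income": [20000.0, 80000.0, 0.0],
    "Outstanding_Debt": [4000.0, 2000.0, 500.0],
})

# Treat zero income as missing so the division yields NaN instead of inf
income = toy["Annual_Income"].replace(0, np.nan)
toy["Debt_to_Income"] = toy["Outstanding_Debt"] / income

# Hypothetical cutoff; NaN ratios compare as False and stay unflagged
toy["High_DTI"] = toy["Debt_to_Income"] > 0.15
print(toy)
```

Rows with missing or zero income are left unflagged here; in practice they would probably deserve a separate review rather than a silent pass.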

#### Chart - 10 Number of Delayed Payments VS Credit Score - Box Plot (Risk & Fraud Indicators)

In [None]:
#Number of Delayed Payments VS Credit Score - Box Plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='Credit_Score', y='Num_of_Delayed_Payment', data=paisabazaar_df)
plt.title('Number of Delayed Payments VS Credit Score')
plt.xlabel('Credit Score')
plt.ylabel('Number of Delayed Payments')
plt.show()


##### 1. Why did you pick the specific chart?

Boxplots capture repayment irregularities per credit score group.

##### 2. What is/are the insight(s) found from the chart?

Poor scorers have a much higher number of delayed payments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Strong evidence that repayment discipline directly impacts credit ratings.

The insight that 'Poor scorers have a much higher number of delayed payments' directly points to a potential for negative growth. A higher number of delayed payments among customers with poor credit scores means an increased risk of loan defaults. This leads to higher non-performing assets for the financial institution, resulting in direct financial losses, increased provisioning requirements, and reduced profitability. Ultimately, this can negatively impact the institution's financial health and growth.

#### Chart - 11 Delay From Due Date Distribution - Histogram & KDE plots (Risk & Fraud Indicator)

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(paisabazaar_df['Delay_from_due_date'], bins=30, edgecolor = 'black', kde=True)
plt.title('Delay From Due Date Distribution')
plt.xlabel('Days Delayed')
plt.ylabel('Number of Customers')
plt.show()

##### 1. Why did you pick the specific chart?

Histograms and KDE plots show the frequency and distribution of delays, helping visualize repayment behavior.

##### 2. What is/are the insight(s) found from the chart?

Many customers have consistent payment delays, which correlates with poor credit scores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Enables the company to track repayment discipline and set up automated reminders or stricter monitoring for chronic defaulters.

Yes, the insight that 'Many customers have consistent payment delays, which correlates with poor credit scores' presents a significant potential for negative growth. Consistent delays signal a higher risk of loan defaults. When a substantial portion of the customer base frequently delays payments, the financial institution's exposure to non-performing assets and bad debts increases, and higher provisioning is required.

#### Chart - 12 Credit Utilization Ratio Grouped By Credit Score - Bar Chart (Risk & Fraud Indicators)

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x='Credit_Utilization_Ratio', y='Credit_Score', data=paisabazaar_df)
plt.title('Credit Utilization Ratio Grouped By Credit Score')
plt.xlabel('Credit Utilization Ratio')
plt.ylabel('Credit Score')
plt.show()

##### 1. Why did you pick the specific chart?

Clearly shows how utilization varies by group.

##### 2. What is/are the insight(s) found from the chart?

Poor scorers have the highest utilization ratio, while good scorers keep it low.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Can be used as an early-warning KPI to flag risky customers.
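By default `sns.barplot` draws the mean of the numeric variable for each category, so the numbers behind a chart like this one can be checked with a group-by. The sketch below uses made-up rows; only the column names come from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; the values are made up.
toy = pd.DataFrame({
    "Credit_Score": ["Good", "Good", "Poor", "Poor", "Standard"],
    "Credit_Utilization_Ratio": [28.0, 30.0, 36.0, 38.0, 32.0],
})

# Mean, median and group size per credit score category
util_summary = (
    toy.groupby("Credit_Score")["Credit_Utilization_Ratio"]
       .agg(["mean", "median", "count"])
)
print(util_summary)
```

Printing the median and count next to the mean helps judge whether a gap between bars is large enough to act on as a KPI.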

#### Chart - 13  Percentage of Customers In Good, Standard, Poor categories - Pie Chart (Overall Credit Score Analysis)

In [None]:

# Count customers in each credit score category
credit_score_counts = paisabazaar_df['Credit_Score'].value_counts()

# Pie chart
plt.figure(figsize=(6,6))
plt.pie(credit_score_counts,
        labels=credit_score_counts.index,
         autopct='%1.1f%%',
         startangle=90,
        colors=['green', 'orange', 'red'],
        explode=(0.05, 0.05, 0.05))  # slight separation for clarity

plt.title("Percentage of Customers by Credit Score Category")
plt.show()


##### 1. Why did you pick the specific chart?

A pie chart is best to show the percentage split of customers across Good, Standard, and Poor credit score categories.

##### 2. What is/are the insight(s) found from the chart?

Majority of customers fall under the Standard category, with Poor scorers forming a significant portion, highlighting risk pockets.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps the bank identify the proportion of high-risk customers, so they can design stricter credit policies and focus retention efforts on good scorers.

Yes, the insight that 'Majority of customers fall under the Standard category, with Poor scorers forming a significant portion, highlighting risk pockets' presents a clear potential for negative growth. The existence of a substantial percentage of 'Poor' credit score customers in the portfolio signifies an elevated risk of loan defaults. This leads to increased non-performing assets, higher write-offs, and potentially larger provisions for bad debts, all of which directly erode the financial institution's profitability and overall financial health, contributing to negative growth.



#### Chart - 14 - Correlation Heatmap

In [None]:
# Select only numerical features
num_df = paisabazaar_df.select_dtypes(include='number')  # covers every integer and float width, e.g. int32 after astype(int)

# Compute correlation matrix
corr = num_df.corr()

# Plot heatmap
plt.figure(figsize=(12,8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)

plt.title("Correlation Heatmap of Numerical Features")
plt.show()


##### 1. Why did you pick the specific chart?

Heatmaps are ideal for showing correlations between multiple numerical features.

##### 2. What is/are the insight(s) found from the chart?

Strong positive correlation between Annual Income & Monthly Salary, and between Debt & EMI. Credit_Score is categorical and therefore does not appear in this numerical heatmap; its relationship with Credit Utilization Ratio is shown in Chart 12.
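Credit_Score is a text label, so a purely numerical correlation matrix leaves it out. One way to bring it in, sketched here on made-up rows, is to map the labels to an ordered integer and compute a rank (Spearman-style) correlation. The ordering Poor < Standard < Good and the `Credit_Score_Ordinal` column name are assumptions for illustration.

```python
import pandas as pd

# Toy rows for illustration only; the values are made up.
toy = pd.DataFrame({
    "Credit_Score": ["Poor", "Standard", "Good", "Poor", "Good", "Standard"],
    "Credit_Utilization_Ratio": [38.0, 33.0, 27.0, 36.0, 29.0, 31.0],
    "Num_of_Delayed_Payment": [20, 12, 3, 18, 5, 10],
})

# Assumed ordering of the labels: Poor < Standard < Good
score_order = {"Poor": 0, "Standard": 1, "Good": 2}
toy["Credit_Score_Ordinal"] = toy["Credit_Score"].map(score_order)

# Spearman correlation is the Pearson correlation of the ranks
cols = ["Credit_Score_Ordinal", "Credit_Utilization_Ratio", "Num_of_Delayed_Payment"]
rank_corr = toy[cols].rank().corr()
print(rank_corr["Credit_Score_Ordinal"].round(2))
```

A rank correlation only assumes the labels are ordered, not that the gaps between Poor, Standard, and Good are equal.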

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Selecting a small subset of numerical columns for a more concise pair plot.
relevant_cols_small = [
    'Annual_Income',
    'Outstanding_Debt',
    'Credit_Utilization_Ratio',
    'Num_of_Delayed_Payment',
    'Credit_Score'
]
# Filter the DataFrame to include only relevant columns
df_pairplot_small = paisabazaar_df[relevant_cols_small]

# Create a Pair Plot (sns.pairplot builds its own figure, so a separate plt.figure() call would only add an empty one)
sns.pairplot(df_pairplot_small, hue='Credit_Score', diag_kind='kde', plot_kws={'alpha': 0.7})
plt.suptitle('Pair Plot of Key Financial Features by Credit Score', y=1.02) # Adjust suptitle position
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot allows for simultaneous visualization of individual variable distributions and pairwise relationships across different credit score categories.

##### 2. What is/are the insight(s) found from the chart?

The pair plot reveals that customers with 'Good' credit scores generally have higher annual incomes and lower outstanding debt, while 'Poor' credit score customers tend to exhibit lower incomes and higher outstanding debt.



## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

To address the business objective of reducing fraud risk and improving credit portfolio quality, I recommend a data-driven approach using the insights derived from the analysis. First, customers can be segmented based on their credit score, debt levels, and utilization ratios to identify high-risk groups. Customers showing consistently high outstanding debt, EMI burden, and credit utilization should be flagged for closer monitoring. The institution should encourage timely payments by offering reminders, flexible repayment plans, and financial counseling. At the same time, predictive models can be developed to identify early warning signals of potential fraud or default. By promoting responsible borrowing and proactive engagement, the bank can improve repayment rates. Automated alerts and stricter loan eligibility criteria for high-risk customers will further reduce exposure. Meanwhile, low-risk customers should be rewarded with better credit terms to strengthen loyalty. Over time, these actions will not only reduce fraud but also improve the overall profitability and customer trust of the organization.
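The segmentation suggested above could start from a few transparent rules before any predictive model is built. The sketch below is illustrative only: the toy rows, the `Risk_Segment` column, and both cutoffs (the median debt and a utilization of 35) are assumptions, not recommendations derived from the data.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; the values are made up.
toy = pd.DataFrame({
    "Credit_Score": ["Poor", "Standard", "Good", "Poor"],
    "Outstanding_Debt": [3500.0, 1200.0, 400.0, 900.0],
    "Credit_Utilization_Ratio": [39.0, 31.0, 26.0, 30.0],
})

debt_high = toy["Outstanding_Debt"] > toy["Outstanding_Debt"].median()
util_high = toy["Credit_Utilization_Ratio"] > 35   # hypothetical cutoff
is_poor = toy["Credit_Score"] == "Poor"

# Conditions are checked in order; the first match wins
conditions = [
    is_poor & debt_high & util_high,   # all three warning signs
    is_poor | debt_high,               # at least one major warning sign
]
toy["Risk_Segment"] = np.select(conditions, ["High", "Medium"], default="Low")
print(toy)
```

Segments built this way are easy to explain to a credit team, which makes them a reasonable baseline to compare any later model against.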

# **Conclusion**

In conclusion, this project successfully explored the Paisabazaar dataset to uncover patterns in customer behavior, financial health, and creditworthiness. After cleaning and preparing the data by fixing datatypes, handling nulls, and checking for duplicates, meaningful EDA and visualizations were performed. The analysis of credit score distribution highlighted that most customers fall in the Standard category, with Good and Poor segments revealing distinct financial patterns. Key KPIs such as Outstanding Debt, EMI, and Utilization Ratio clearly distinguished Good scorers, who maintained disciplined debt levels, from Poor scorers, who struggled with high debt burdens and delayed repayments.

The study of income groups revealed that while income is an important factor, repayment discipline and credit utilization were far more decisive in determining credit scores. Charts comparing payment of minimum amount against credit scores confirmed that customers paying only minimum dues were more likely to have poor scores. The delay from due date analysis further reinforced that late payments are strong indicators of financial stress and risk. Loan-type analysis showed a high reliance on personal loans and credit cards, with multiple loan types clustering among weaker credit scorers.

Correlation analysis indicated strong links between Annual Income and Monthly Salary and between Debt and EMI, making them important features for credit risk modeling. The pair plot further showed that Poor scorers tend to combine lower incomes with higher outstanding debt, while Good scorers show the opposite pattern. These findings align with the business objective by highlighting how repayment behavior, utilization ratio, and EMI obligations serve as stronger risk predictors than income alone.

Overall, the project demonstrates that fraud risk and poor credit performance can be mitigated by focusing on early detection of high-risk behaviors such as high utilization, minimum payments, and frequent delays. With these insights, Paisabazaar can implement targeted interventions, improve risk assessment models, and promote responsible borrowing, ultimately strengthening both customer trust and institutional stability.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***