# **Project Name** -   **Paisabazaar Banking Fraud Analysis**


##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

In this project, I analyzed Paisabazaar's credit score data to uncover key drivers of creditworthiness and identify major risk indicators. Using a dataset of 100,000 records with demographic, financial, and behavioral features, the objective was to explore the relationships that influence credit scores and thereby support improved risk management and loan approval processes.

The analysis included essential data preprocessing steps such as handling numeric and categorical variables, detecting and treating outliers, and creating new features like Debt-to-Income ratio, EMI-to-Salary ratio, Delay Ratio, and a composite Risk Score. These derived metrics capture customer financial stress and suggest repayment behavior.

I then performed univariate, bivariate, and multivariate analysis with visualizations to extract insights and relationships from the data. This revealed that income, credit utilization, payment delays, outstanding debt, and loan counts are strongly associated with credit scores and with the derived Risk_Flag.

The project demonstrated a thorough, business-aligned approach to data analytics leveraging statistical techniques, domain knowledge, and visualization best practices. The insights gained empower Paisabazaar to develop predictive credit scoring models, personalize financial products, and proactively manage portfolio risk.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


 **Business Context**

Paisabazaar is a financial services company that assists customers in finding and applying for various banking and credit products. An integral part of their service is assessing the creditworthiness of individuals, which is crucial for both loan approval and risk management. The credit score of a person is a significant metric used by financial institutions to determine the likelihood that an individual will repay their loans or credit balances. Accurate classification of credit scores can help Paisabazaar enhance their credit assessment processes, reduce the risk of loan defaults, and offer personalized financial advice to their customers.

In this context, analyzing and classifying credit scores based on customer data can improve decision-making processes and contribute to better financial product recommendations. This case study aims to develop a model that predicts the credit score of individuals based on various features, such as income, credit card usage, and payment behavior.

#### **Define Your Business Objective?**

The main objective of this study is to improve Paisabazaar's credit assessment process by exploring and analyzing financial and demographic factors influencing credit scores of the customers. By performing Exploratory Data Analysis (EDA), we aim to:

* **Faster Loan Decisions:** Enhance the credit score checking process to cut down on manual work and speed up approvals.

* **Smarter Customer Segmentation:** Use customers' financial and behavioral data to design credit products, interest rates, and repayment plans that suit different customer groups.

* **Better Risk Control:** Apply advanced credit scoring to accurately measure customer risk, reduce loan defaults, and keep the loan portfolio healthy.

* **Personalized Financial Solutions:** Offer customized product recommendations (such as loans, credit cards, or insurance) based on each customer’s financial habits and goals.

* **Build Customer Trust and Loyalty:** Share clear, data-driven credit insights and personalized advice to strengthen customer confidence and encourage long-term relationships.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import missingno
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
df=pd.read_csv('/content/drive/MyDrive/dataset.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head(10)

In [None]:
# last 10 rows view
df.tail(10)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("rows:",len(df))
print("columns:",len(df.columns))

### Dataset Information

In [None]:
# Dataset Info
df.info()

Here, the columns - Occupation, Type_of_Loan, Credit_Mix, Payment_of_Min_Amount, Payment_Behaviour, Credit_Score are categorical. Hence, we modify the datatypes of these columns to category.

In [None]:
# object -> category: saves memory and is faster to retrieve.
df.Occupation = df.Occupation.astype('category')
df.Type_of_Loan = df.Type_of_Loan.astype('category')
df.Credit_Mix = df.Credit_Mix.astype('category')
df.Payment_of_Min_Amount = df.Payment_of_Min_Amount.astype('category')
df.Payment_Behaviour = df.Payment_Behaviour.astype('category')
df.Credit_Score = df.Credit_Score.astype('category')

In [None]:
df.info()
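
The memory effect of the `category` conversion can be checked directly. The sketch below is only an illustration: it uses a small hypothetical frame (`demo`, with made-up values) rather than the project dataset, and it converts the listed columns in a loop instead of one assignment per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the project data: two low-cardinality text columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Occupation": rng.choice(["Engineer", "Doctor", "Teacher"], size=10_000),
    "Credit_Score": rng.choice(["Poor", "Standard", "Good"], size=10_000),
})

# Memory used while the columns still hold plain strings.
bytes_before = demo.memory_usage(deep=True).sum()

# Convert each listed column to the category dtype.
categorical_columns = ["Occupation", "Credit_Score"]
for column in categorical_columns:
    demo[column] = demo[column].astype("category")

bytes_after = demo.memory_usage(deep=True).sum()
print(f"before: {bytes_before:,} bytes, after: {bytes_after:,} bytes")
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it is largest for columns with few unique values.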

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count (number of fully duplicated rows)
df.duplicated().sum()


In [None]:
#removing repeated rows with all same values
df.drop_duplicates(inplace=True)


In [None]:
len(df)  # length is unchanged, so there were no duplicate rows

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)
plt.show()

### What did you know about your dataset?

The dataset is a credit risk classification dataset with 100k records covering 12.5k unique customers. It combines demographic, financial, and behavioral variables to predict a customer’s Credit Score. There were no duplicate rows or null values. A few columns had incorrect data types, which I corrected. This dataset provides a strong foundation for building a credit scoring model that can help financial institutions assess creditworthiness, reduce loan defaults, and improve risk management strategies.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe().T

In [None]:
#Check statistical values for fields with other than numerical datatype
df.describe(exclude=np.number).T

### Variables Description

| **Variable Name**              | **Description**                                                            |
| ------------------------------ | -------------------------------------------------------------------------- |
| **ID**                         | Unique ID of the record                                                    |
| **Customer\_ID**               | Unique ID of the customer                                                  |
| **Month**                      | Month of the year                                                          |
| **Name**                       | The name of the person                                                     |
| **Age**                        | The age of the person                                                      |
| **SSN**                        | Social Security Number of the person                                       |
| **Occupation**                 | Occupation of the person                                                   |
| **Annual\_Income**             | Annual income of the person                                                |
| **Monthly\_Inhand\_Salary**    | Monthly in-hand salary of the person                                       |
| **Num\_Bank\_Accounts**        | Number of bank accounts of the person                                      |
| **Num\_Credit\_Card**          | Number of credit cards the person has                                      |
| **Interest\_Rate**             | Interest rate on the credit card of the person                             |
| **Num\_of\_Loan**              | Number of loans taken by the person from the bank                          |
| **Type\_of\_Loan**             | Types of loans taken by the person from the bank                           |
| **Delay\_from\_due\_date**     | Average number of days delayed by the person from the due date of payment  |
| **Num\_of\_Delayed\_Payment**  | Total number of payments delayed by the person                             |
| **Changed\_Credit\_Card**      | Percentage change in the credit card limit of the person                   |
| **Num\_Credit\_Inquiries**     | Number of credit card inquiries by the person                              |
| **Credit\_Mix**                | Classification of credit mix of the customer                               |
| **Outstanding\_Debt**          | Total outstanding balance of the person                                    |
| **Credit\_Utilization\_Ratio** | Credit utilization ratio of the credit card of the customer                |
| **Credit\_History\_Age**       | Age of the credit history of the person                                    |
| **Payment\_of\_Min\_Amount**   | Whether the person paid the minimum amount due (Yes = paid, No = not paid) |
| **Total\_EMI\_per\_month**     | Total EMI per month of the person                                          |
| **Amount\_invested\_monthly**  | Monthly amount invested by the person                                      |


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print(i,":  ",df[i].nunique())

Credit score values:


In [None]:
credit_score_values= df['Credit_Score'].unique()
print(credit_score_values)

Month has only 8 unique values. Better to analyse further which months are present.


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# removing columns which are not of use
df.drop(['ID','Customer_ID','Name','SSN'], axis=1, inplace=True)

#selecting numerical columns
numerical_cols = df.select_dtypes(include=[np.number])
print("numeric columns: ",numerical_cols.columns)

#selecting categorical columns
categorical_cols = df.select_dtypes(include=['object', 'category'])
print("categorical columns: ",categorical_cols.columns)


In [None]:
# detecting outliers in numeric continuous columns
columns_to_check = ['Annual_Income', 'Outstanding_Debt', 'Monthly_Balance',
                    'Credit_Utilization_Ratio', 'Delay_from_due_date', 'Total_EMI_per_month']
# visually checking outliers using boxplot
print("Before outlier treatment")

plt.figure(figsize=(12,7))
for i, col in enumerate(columns_to_check, 1):
    plt.subplot(2, 3, i)
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()

# IQR calculation
outlier_indices = dict()

for col in columns_to_check:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)].index
    outlier_indices[col] = outliers
    print(f"{col} has {len(outliers)} outliers")
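
The IQR rule used in the loop above can be wrapped in a small helper so the same bounds are reused for detection and for filtering. This is a minimal sketch: `iqr_bounds` is a hypothetical helper name, and the sample values are made up to show one obvious outlier.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical values: eight typical observations and one extreme one.
sample = pd.Series([10, 12, 11, 13, 12, 11, 10, 13, 100])
lower, upper = iqr_bounds(sample)

# Rows outside the fences are treated as outliers.
outlier_mask = (sample < lower) | (sample > upper)
print(f"bounds: [{lower}, {upper}]")
print("outliers:", sample[outlier_mask].tolist())
```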



In [None]:
#checking summary stats for high outlier col
cols = ['Monthly_Balance', 'Outstanding_Debt', 'Total_EMI_per_month']
print(df[cols].describe())

**Insights:**

* Monthly_Balance:

  Skewed high values- The maximum is about 6x the mean, indicating some customers have much higher balances than typical.

* Outstanding_Debt:

   High spread- The maximum is over 3.5x the mean, can indicate financial risk or potential errors.

* Total_EMI_per_month:

  Very high outliers- The max is more than 16x the mean, suggesting a few customers have much larger EMI obligations.

In [None]:
#checking outliers
cols = ['Monthly_Balance', 'Outstanding_Debt', 'Total_EMI_per_month']
for col in cols:
    print(f"Top outliers for {col}:")
    print(df[[col,'Age', 'Annual_Income', 'Occupation']].sort_values(by=col, ascending=False).head(5))


**Insights:**

* Monthly_Balance: The combination of high income and high balance looks plausible and consistent for a professional.

* Outstanding_Debt: Moderate income with high debt may indicate financial stress but is not necessarily an error.
* Total_EMI_per_month: Given a high annual income, this large EMI is likely plausible.

In [None]:
#treating outliers
print(f"Rows before dropping outliers :{len(df)}")
# Define columns to cap
columns_to_cap = ['Annual_Income', 'Outstanding_Debt', 'Monthly_Balance',
                  'Delay_from_due_date', 'Total_EMI_per_month']

for col in columns_to_cap:
    lower_cap = df[col].quantile(0.01)  # 1st percentile
    upper_cap = df[col].quantile(0.99)  # 99th percentile
    df[col] = df[col].clip(lower=lower_cap, upper=upper_cap)
    print(f"{col}: capped at [{lower_cap:.2f}, {upper_cap:.2f}]")

# Define columns to remove outliers
col = 'Credit_Utilization_Ratio'

Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter to keep only rows within bounds for Credit_Utilization_Ratio
df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]


print(f"Rows remaining after dropping outliers in {col}: {len(df)}")

In [None]:
# visually checking outliers after handling them using boxplot
plt.figure(figsize=(12,7))
for i, col in enumerate(columns_to_check, 1):
    plt.subplot(2, 3, i)
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()


In [None]:
#checking stats of rows after removing outliers
cols = ['Monthly_Balance', 'Outstanding_Debt', 'Total_EMI_per_month']
print(df[cols].describe())

**Insights:**

* Maximums are capped:

           Monthly_Balance: 1054.67

           Outstanding_Debt: 4806.97

           Total_EMI_per_month: 592.72

* Standard deviations decrease notably, which means extremes no longer distort overall distribution.

* Means and medians stay quite similar: This ensures your main data characteristics have not shifted unnaturally.
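
The effect of percentile capping on the summary statistics can be reproduced on synthetic data. The sketch below assumes a hypothetical right-skewed series (lognormal draws, not the project data) and compares median, standard deviation, and maximum before and after clipping at the 1st and 99th percentiles.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical right-skewed values standing in for a column such as Total_EMI_per_month.
values = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=5_000))

# Cap at the 1st and 99th percentiles, mirroring the treatment used in this notebook.
lower_cap, upper_cap = values.quantile([0.01, 0.99])
capped = values.clip(lower=lower_cap, upper=upper_cap)

# Side-by-side comparison of the statistics discussed in the insights.
summary = pd.DataFrame({
    "raw": values.agg(["median", "std", "max"]),
    "capped": capped.agg(["median", "std", "max"]),
})
print(summary.round(2))
```

Capping leaves the median untouched because only the extreme 1% on each side is moved, while the standard deviation and maximum shrink.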

In [None]:
#Creating new columns for analysis,
# to avoid division by zero, replacing zero or null denominators with NaN
df['Annual_Income'] = df['Annual_Income'].replace(0, np.nan)
df['Monthly_Inhand_Salary'] = df['Monthly_Inhand_Salary'].replace(0, np.nan)
df['Credit_History_Age'] = df['Credit_History_Age'].fillna(0)

# Debt to Income Ratio
df['Debt_to_Income'] = df['Outstanding_Debt'] / df['Annual_Income']
df['Debt_to_Income'] = df['Debt_to_Income'].fillna(0)

# EMI to Salary Ratio
df['EMI_to_Salary'] = df['Total_EMI_per_month'] / df['Monthly_Inhand_Salary']
df['EMI_to_Salary']=df['EMI_to_Salary'].fillna(0)

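# High Utilization Flag: 1 if Credit Utilization exceeds the threshold, else 0.
# NOTE: the 0.8 threshold assumes a 0-1 scale; if Credit_Utilization_Ratio is stored
# as a percentage (0-100), this flags almost every row and 80 would be the matching cutoff.
df['High_Utilization_Flag'] = (df['Credit_Utilization_Ratio'] > 0.8).astype(int)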

# Delay Ratio: Number of Delayed Payments normalized by credit history age + 1
df['Delay_Ratio'] = df['Num_of_Delayed_Payment'] / (df['Credit_History_Age'] + 1)

# Negative Balance Flag: 1 if Monthly Balance is negative, else 0
df['Negative_Balance_Flag'] = (df['Monthly_Balance'] < 0).astype(int)


# Risk Score composite index with weights assigned
df['Risk_Score'] = (
    0.4 * df['Debt_to_Income'] +
    0.3 * df['EMI_to_Salary'] +
    0.2 * df['High_Utilization_Flag'] +
    0.1 * df['Negative_Balance_Flag']
)


# Summary to check statistics
print(df[['Debt_to_Income', 'EMI_to_Salary', 'Delay_Ratio','Risk_Score']].describe())

**Insights:**

Most customers have modest debt and EMI relative to income. High outliers in these ratios could flag customers with financial strain or potential fraud risk.

Delay behaviors show some customers have frequent payment delays, which is critical for credit risk.

Risk_Score combines key factors and can be used directly in modeling to classify or predict fraud/credit issues.
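
A threshold flag only separates customers if its cutoff matches the scale of the column. The sketch below uses hypothetical utilization values on a 0-100 scale (an assumption, not confirmed from the raw data) and compares the share of rows flagged at 0.8 and at 80.

```python
import pandas as pd

# Hypothetical utilization values on a percentage (0-100) scale.
utilization = pd.Series([22.5, 27.1, 31.8, 33.4, 35.0, 41.2, 48.9])

# Share of rows flagged under each candidate threshold.
share_at_0_8 = (utilization > 0.8).mean()
share_at_80 = (utilization > 80).mean()

print(f"flagged with > 0.8: {share_at_0_8:.0%}")
print(f"flagged with > 80:  {share_at_80:.0%}")
```

A flag that is constant across rows adds the same amount to every Risk_Score and carries no ranking information, so `df['High_Utilization_Flag'].value_counts()` is a quick sanity check on the real data.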

In [None]:
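# Creating a binary risk flag: 1 if Credit_Score is 'Poor', else 0
df['Risk_Flag'] = (df['Credit_Score'] == 'Poor').astype(int)

# Age Groups (bins)
# right=True with include_lowest=True gives intervals [18, 25], (25, 35], ...,
# which match the labels below; ages outside 18-100 become NaN.
bins = [18, 25, 35, 45, 60, 100]
labels = ['18-25', '26-35', '36-45', '46-60', '60+']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=True, include_lowest=True)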
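
How `pd.cut` assigns ages that sit exactly on a bin edge depends on the `right` argument, which matters when the labels are written as inclusive ranges. The sketch below uses a few hypothetical boundary ages to show both settings side by side.

```python
import pandas as pd

ages = pd.Series([18, 25, 35, 60])  # hypothetical ages sitting on bin edges
bins = [18, 25, 35, 45, 60, 100]
labels = ['18-25', '26-35', '36-45', '46-60', '60+']

# right=False: intervals are [18, 25), [25, 35), ... so age 25 lands in the second bin.
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

# right=True with include_lowest=True: intervals are [18, 25], (25, 35], ...
# so age 25 stays in the first bin and age 18 is still included.
right_closed = pd.cut(ages, bins=bins, labels=labels, right=True, include_lowest=True)

print(pd.DataFrame({'age': ages, 'right=False': left_closed, 'right=True': right_closed}))
```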



### What all manipulations have you done and insights you found?
# **Steps Done**

* **Removed Unwanted Columns**

  Identifier columns (ID, Customer_ID, Name, SSN) are not useful for pattern finding, so they were dropped. Constant or redundant columns would be similarly uninformative.
  

* **Data Segregation**

  Separated the categorical and numerical columns.


* **Outlier Detection and Treatment**

  Detected outliers in key continuous numeric columns (Annual_Income, Outstanding_Debt, Monthly_Balance, Credit_Utilization_Ratio, Delay_from_due_date, Total_EMI_per_month) using the IQR method and visualized them with boxplots.

  Winsorized (capped) outliers in Annual_Income, Outstanding_Debt, Monthly_Balance, Delay_from_due_date, and Total_EMI_per_month at the 1st and 99th percentiles to reduce skew and the impact of extreme values.

  Dropped extreme outliers only in Credit_Utilization_Ratio to keep the dataset balanced.

* **Feature Engineering**

  Created new financial risk-related variables: Debt_to_Income, EMI_to_Salary, Delay_Ratio.

  Created flag variables: High_Utilization_Flag, Negative_Balance_Flag.

  Calculated a composite Risk_Score as a weighted sum of Debt_to_Income, EMI_to_Salary, High_Utilization_Flag, and Negative_Balance_Flag.

  Handled zeros and missing values in denominators to avoid division errors in these new variables.

* **Binary Target Variable Creation**

  Derived a binary Risk_Flag from Credit_Score (“Poor” labeled as fraud) for modeling.


# Insights:

Outlier Patterns:
Certain customers have significantly higher debt and EMI relative to income, likely signaling financial strain or fraud risk.

Distributions Suggest Skew:
The majority of customers maintain low Debt_to_Income and EMI_to_Salary ratios, but a long tail of high values exists.

Delay Behavior:
Many customers have few delayed payments relative to credit history, but some show frequent delays, important for risk profiling.


Risk Score Utility:
A composite Risk_Score that integrates multiple financial stress indicators offers a meaningful single metric for further analysis and fraud prediction.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# **Univariate Analysis**
* Single variable analysis
* Individual report

Graphs:

Histogram: Shows the frequency distribution.

Box Plot: Highlights the spread and outliers.

Count plot: Shows the count of each category.

Pie chart: To find proportion of categories.

#### Chart - 1

What is the distribution of Annual Income?

In [None]:
# Chart - 1 visualization code

# Histogram for Annual Income
plt.figure(figsize=(10,6))
sns.histplot(df['Annual_Income'], bins=30, kde=True)
plt.title('Distribution of Annual Income')
plt.xlabel('Annual Income')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?

A histogram with a KDE overlay shows the shape, skew, and peaks of a continuous variable. Annual income distribution is critical for financial platforms because it directly influences loan eligibility, risk segmentation, return policy, and marketing strategies for various financial products.

##### 2. What is/are the insight(s) found from the chart?

The distribution is right-skewed, with most customers clustered in lower annual income brackets (under 40,000). Highest frequency is observed around 18,000-20,000, and then frequency tapers off as income rises.

There are several small peaks at higher income ranges, hinting at potential sub-populations such as salaried professionals, business owners, or self-employed individuals with variable incomes.

The number of customers with annual income above 120,000 is very low compared to the lower income segments, indicating that focus is towards middle and lower-income customers.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Designing products tailored for middle-income groups can drive positive growth by increasing product take-up.

These insights are valuable for optimizing loan products, designing tiered interest rates, and targeting credit card promotions for the most dominant customer segments. Understanding that most customers belong to lower income brackets will allow Paisabazaar to prioritize affordable loan offerings and devise appropriate risk profiles.

Recognizing minority segments at higher income ranges can lead to development of premium financial products or exclusive services, enhancing customer retention among more affluent users.


#### Chart - 2

How is Age distributed among customers?

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(10,6))
sns.histplot(df['Age'], bins=40, kde=False, color='skyblue')
plt.title('Age Distribution of Customers')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

Age distribution provides insights into the demographic profile that influences borrowing behavior, credit risk, and marketing strategies.



##### 2. What is/are the insight(s) found from the chart?

The chart shows a relatively uniform distribution of customer ages between 18 and 45, with no major skews except one sharp spike around age 35, where customer count jumps significantly. This suggests that a large portion of the customer base is in their mid-30s, which could be either due to marketing focus on this age group, or product eligibility restrictions.

There are noticeably fewer customers in the higher age brackets (45+), with a gradual decline after age 45. Young adults (18-30) also form a substantial segment but without the pronounced spike seen at age 35.

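The spike at 35 may indicate a heavy marketing push toward this age group, but it may also be a binning artifact: with integer ages and 40 equal-width bins, a single bin can cover two adjacent ages, so the spike should be verified before acting on it.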
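
A spike in a histogram of integer ages can come from the binning rather than from the data. The sketch below uses hypothetical ages in which every integer from 18 to 59 appears equally often, and shows that 40 equal-width bins still produce bars of different heights.

```python
import numpy as np

# Hypothetical ages: every integer age from 18 to 59 appears exactly 100 times.
ages = np.repeat(np.arange(18, 60), 100)

# 40 equal-width bins over 42 distinct integer ages: some bins must hold two ages.
counts_40_bins, _ = np.histogram(ages, bins=40)

# One count per integer age gives the true, flat distribution.
counts_per_age = np.bincount(ages)[18:]

print("40 bins -> min", counts_40_bins.min(), "max", counts_40_bins.max())
print("per age -> min", counts_per_age.min(), "max", counts_per_age.max())
```

In seaborn, `sns.histplot(df['Age'], discrete=True)` draws one bar per integer value, which is one way to rule this effect out before interpreting a spike.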

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights are important for targeting and optimizing campaigns. The spike at age 35, if valid, signals a strong opportunity for targeted marketing or tailored loan products.

The heavier middle age segments suggest these customers may be more active financially and present less risk compared to younger or older groups, guiding policymakers and marketers on priority.

The drop-off in older age groups could inspire new financial products or outreach programs for older, underrepresented segments, improving inclusion and unlocking new revenue streams.

#### Chart - 3

What is the range and variation of Credit Utilization Ratio?

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(8,5))
sns.boxplot(x=df['Credit_Utilization_Ratio'])
plt.title('Boxplot of Credit Utilization Ratio')
plt.xlabel('Credit Utilization Ratio')
plt.show()

##### 1. Why did you pick the specific chart?

Boxplots illustrate the range, median, and outliers of credit utilization, indicating how customers use credit lines relative to limits, a direct risk factor.


##### 2. What is/are the insight(s) found from the chart?

Most customers show moderate credit utilization; the IQR outliers in this column were already dropped during data wrangling. High utilization would indicate potential stress on credit capacity and higher default risk.

The median credit utilization ratio appears to be around 33–35, with the interquartile range (the main box) stretching from roughly 27 to 36.

The whiskers extend from approximately 21 to 49, suggesting relatively spread-out values but with no indication of extreme outliers or problematic values in this sample.

Most customers cluster around a moderate credit usage rate, meaning they are not excessively over-leveraged, which is generally a favorable sign for lenders evaluating credit risk.
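
The values read off the boxplot can also be computed, which avoids estimating them by eye. This is a minimal sketch: `boxplot_stats` is a hypothetical helper, and the utilization values are synthetic placeholders for `df['Credit_Utilization_Ratio']`.

```python
import numpy as np
import pandas as pd

def boxplot_stats(series: pd.Series) -> dict:
    """Quartiles and whisker ends as drawn by a default boxplot (whis=1.5)."""
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Whiskers stop at the most extreme data points inside the 1.5 * IQR fences.
    inside = series[(series >= q1 - 1.5 * iqr) & (series <= q3 + 1.5 * iqr)]
    return {"q1": q1, "median": median, "q3": q3,
            "whisker_low": inside.min(), "whisker_high": inside.max()}

# Hypothetical utilization values; replace with df['Credit_Utilization_Ratio'].
rng = np.random.default_rng(1)
utilization = pd.Series(rng.normal(loc=32, scale=5, size=2_000))

stats = boxplot_stats(utilization)
print({name: round(value, 2) for name, value in stats.items()})
```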

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Monitoring and managing customers with high utilization allows proactive risk management and fraud detection, improving portfolio health. Encouraging optimal utilization can enhance creditworthiness and customer satisfaction.

Finance teams can use this boxplot for benchmarking, early warning, and developing targeted strategies for users at the higher or lower ends of utilization.

The absence of extreme outliers means fewer cases for immediate fraud or credit distress intervention, letting Paisabazaar focus resources on marginal risk improvements or personalized recommendations, ultimately supporting better portfolio health and user engagement.

#### Chart - 4

What is the distribution of the Delay Payment Ratio among customers?

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10,6))
sns.histplot(df['Delay_Ratio'], bins=25)
plt.title('Distribution of Delay Ratio')
plt.xlabel('Delay Ratio')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

The histogram of delay ratios helps analyze the frequency and extent of payment delays—a critical indicator of credit risk.
The delay ratio is a key behavioral metric here, defined as the number of delayed payments divided by credit history age plus one.

##### 2. What is/are the insight(s) found from the chart?


The distribution is extremely right-skewed, with a massive concentration at or near zero and a very fast drop-off as the delay ratio increases.

The overwhelming majority of customers have a delay ratio close to zero, meaning few delayed payments relative to the length of their credit history.

A very small proportion of customers have delay ratios above 1, and extremely few cases stretch into higher values (out to around 12). These are potentially high-risk borrowers, but too few to substantially skew the overall portfolio.
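
The size of the tail can be quantified instead of being read from the histogram. The sketch below assumes a hypothetical right-skewed series (exponential draws standing in for `df['Delay_Ratio']`) and reports the share above 1 together with the sample skewness.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Hypothetical right-skewed ratios standing in for df['Delay_Ratio'].
delay_ratio = pd.Series(rng.exponential(scale=0.3, size=10_000))

share_above_one = (delay_ratio > 1).mean()  # fraction of rows in the tail
skewness = delay_ratio.skew()               # positive means a longer right tail

print(f"share with ratio > 1: {share_above_one:.1%}")
print(f"skewness: {skewness:.2f}")
```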

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Early identification of high-delay customers supports tailored interventions (e.g., payment reminders, restructuring), reducing loan defaults and losses—a clear revenue safeguard.

The high incidence of on-time payments signals good repayment discipline, which can be leveraged in risk modeling, underwriting, and product pricing.

Since most customers pay on time, Paisabazaar could consider incentivizing prompt payment with loyalty rewards, discounts, or beneficial product features for such loyal customers.

The small minority with very high delay ratios warrants targeted remediation—such as customized collection efforts or stricter credit terms—helping further reduce losses and enhance profitability.

#### Chart - 5

What is the distribution of Risk Score?

In [None]:
# Chart - 5 visualization code

# Histogram for Risk Score


plt.figure(figsize=(8,5))
sns.histplot(df['Risk_Score'], bins=30, kde=True)
plt.title('Distribution of Risk Score')
plt.xlabel('Risk Score')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

The histogram of risk scores helps analyze the frequency and extent of credit risk. It indicates whether extending a loan to a customer in the future is favorable.

##### 2. What is/are the insight(s) found from the chart?

Most customers have low to moderate risk scores: The bulk of the population falls in the 0.20–0.25 range, indicating a generally healthy portfolio with lower predicted credit risk.

Tail risk exists: There are some customers with risk scores exceeding 0.30, and a very small number above 0.40. These represent higher credit risk segments and may merit special attention for additional risk controls or monitoring.

Skewness suggests predictive opportunity: The shape of the distribution allows for effective segmentation. Stratifying customers into low, moderate, and high risk score bands can drive targeted policy, product design, and interventions.
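
Stratifying customers into risk bands can be sketched with `pd.cut`. The scores and the cut points (0.25 and 0.30) below are illustrative assumptions chosen to match the ranges mentioned above; they are not calibrated thresholds.

```python
import pandas as pd

# Hypothetical risk scores; replace with df['Risk_Score'].
risk_score = pd.Series([0.05, 0.21, 0.24, 0.29, 0.33, 0.42])

# Illustrative band edges: up to 0.25 is Low, up to 0.30 is Moderate, above is High.
band_edges = [-float("inf"), 0.25, 0.30, float("inf")]
band_labels = ["Low", "Moderate", "High"]
risk_band = pd.cut(risk_score, bins=band_edges, labels=band_labels)

# Count customers per band, in Low -> High order.
print(risk_band.value_counts().reindex(band_labels))
```

Quantile-based bands (`pd.qcut`) are an alternative when equal-sized groups are preferred over fixed cut points.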


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, it has a positive impact: the low-risk segment can be prioritized for credit offers and streamlined approval, while medium- to high-risk segments can be monitored with stricter loan terms, more documentation, or enhanced fraud checks.

The distribution serves as a foundation for automated decision rules and risk-adjusted recommendations in the lending workflow.

#### Chart - 6

What is the distribution of risky vs. non-risky cases?

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(8,5))
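# hue is set to the x variable so that palette applies without a seaborn deprecation warning
sns.countplot(x='Risk_Flag', hue='Risk_Flag', data=df, palette='Blues', legend=False)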
plt.title('Count of Risky vs Non-Risky Cases')
plt.xlabel('Risk Flag (0=No Risk, 1=Risk)')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?


A count plot succinctly quantifies the number of risky vs. non-risky cases, measuring the prevalence of flagged cases, which is essential for risk modeling and mitigation.


##### 2. What is/are the insight(s) found from the chart?

Risk cases constitute a smaller, yet significant minority revealing the scope for robust risk detection mechanisms.

The count of non-risky cases (Risk Flag = 0) is much higher than risky cases (Risk Flag = 1), with non-risky cases at around 71,000 and risky ones at approximately 29,000 out of the total sample.

Roughly 29% of cases are flagged as risky, which is quite substantial, indicating the presence of significant vulnerabilities in loan applications or customer onboarding processes.
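
The class shares behind a count plot can be computed with `value_counts(normalize=True)`. The labels below are hypothetical and only illustrate the calculation; on the real data the same call on `df['Risk_Flag']` gives the actual proportions.

```python
import pandas as pd

# Hypothetical labels standing in for df['Credit_Score'].
credit_score = pd.Series([
    "Poor", "Standard", "Good", "Standard", "Poor",
    "Standard", "Good", "Standard", "Standard", "Poor",
])

# Same rule as the notebook's Risk_Flag: 1 for 'Poor', 0 otherwise.
risk_flag = (credit_score == "Poor").astype(int)

# Fraction of rows in each class, indexed by flag value.
class_share = risk_flag.value_counts(normalize=True).sort_index()
print(class_share)
```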

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights provide :-

Improved Risk Assessment: The ability to predict credit scores and flag risky customers helps Paisabazaar optimize loan approvals, reduce defaults, and safeguard the loan portfolio.

Personalized Customer Engagement: Customer segmentation based on financial behavior and risk allows tailored product offers, improving customer satisfaction and product conversion.

But some insights point towards potential negative growth risks, such as the sizable proportion of customers flagged as high risk. If Paisabazaar extends credit indiscriminately to these customers without stringent controls, it could lead to increased defaults and financial losses. Furthermore, any modeling biases due to class imbalances could result in undetected risky customers.

# **Bivariate Analysis**

Compares two variables to find patterns between them.
* Numerical vs. Numerical: Scatter Plot
* Numerical vs. Date: Line Chart
* Categorical vs. Numerical: Bar Chart, Violin Plot, Box Plot
* Categorical vs. Categorical: Count Plot

#### Chart - 7

Noticeable trends in credit scores based on the month of the year



In [None]:
# Chart-7 visualization code

#converting credit score category to numeric
df['CreditScoreNumeric'] = df['Credit_Score'].map({'Poor':1, 'Standard':2, 'Good':3}).astype(int)
#getting average score
monthly_avg_score = df.groupby('Month')['CreditScoreNumeric'].mean().reset_index()

plt.figure(figsize=(10,6))
sns.lineplot(data=monthly_avg_score, x='Month', y='CreditScoreNumeric', marker='o')
plt.title('Average Credit Score by Month')
plt.xlabel('Month')
plt.ylabel('Average Credit Score (Numeric)')
plt.xticks(range(1, 13))
plt.show()

##### 1. Why did you pick the specific chart?

I have chosen a line chart because it provides a temporal perspective, visualizing changes in the average credit score from January to August. Time-based trends are critical for observing seasonal, policy, or business-driven impacts in the dataset, and they help reveal improvement or deterioration over the months.

##### 2. What is/are the insight(s) found from the chart?



We can observe an upward trend in average credit scores from January to August, rising from around 1.871 to 1.905.

Occasional small dips (notably in March and May) are visible, but the general pattern is consistent improvement throughout the period.

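Such a trend may reflect business success in customer acquisition, credit improvement efforts, or external economic factors positively affecting customer profiles. The rise is small (about 0.03 on a 1 to 3 scale), so it should be read as a mild improvement rather than a sharp shift.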
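Because the month-to-month differences are small, it can help to look at the uncertainty around each monthly average before reading much into the trend. The sketch below is one rough way to do that, using made-up rows; `Month` and `CreditScoreNumeric` follow the columns used above, and the 1.96 multiplier assumes an approximate normal interval.

```python
import pandas as pd

# Made-up rows; column names follow the notebook.
toy = pd.DataFrame({
    "Month": [1, 1, 1, 1, 2, 2, 2, 2],
    "CreditScoreNumeric": [1, 2, 2, 3, 2, 2, 3, 3],
})

# Mean, standard error of the mean, and row count per month.
monthly = toy.groupby("Month")["CreditScoreNumeric"].agg(["mean", "sem", "count"])

# Rough 95% interval: mean +/- 1.96 * standard error.
monthly["ci_low"] = monthly["mean"] - 1.96 * monthly["sem"]
monthly["ci_high"] = monthly["mean"] + 1.96 * monthly["sem"]
print(monthly.round(3))
```

If the intervals of neighbouring months overlap heavily, the dips and rises between them are probably not meaningful on their own.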



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

By monitoring upward credit score trends we can validate the effectiveness of credit-building programs, targeted marketing, or risk-reduction strategies.

Paisabazaar can use this information to communicate improvements in customer financial health to partners, investors, and customers, boosting trust and stimulating business growth.

#### Chart - 8

Are there seasonal shifts in the distribution of credit scores?

In [None]:
# Chart - 8 visualization code
monthly_credit_counts = pd.crosstab(df['Month'], df['Credit_Score'])

# Plot stacked bar chart
monthly_credit_counts.plot(kind='bar', stacked=True, figsize=(12,7), colormap='Set2')
plt.title('Monthly Distribution of Credit Score Categories')
plt.xlabel('Month')
plt.ylabel('Number of Customers')
plt.xticks(rotation=0)
plt.legend(title='Credit Score')
plt.show()


##### 1. Why did you pick the specific chart?

This graph helps to show how the composition of credit score categories varies monthly, revealing if any category has seasonal peaks.


##### 2. What is/are the insight(s) found from the chart?

The Standard credit score category consistently comprises the largest segment every month, followed by Poor and then Good.

There is a slight increase in the count of "Good" and "Poor" scores from month 4 to month 8, indicating possible improvement or deterioration among specific customer segments during these periods.

The overall distribution is stable across months. While absolute numbers change, the relative composition of scores remains steady, and no sudden shifts in risk segmentation are visible.
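The statement about relative composition is easier to check with row-wise percentages than with raw counts. A minimal sketch with made-up rows is shown below; on the real data the same `pd.crosstab` call would take `df['Month']` and `df['Credit_Score']`.

```python
import pandas as pd

# Made-up rows; column names follow the notebook.
toy = pd.DataFrame({
    "Month": [1, 1, 1, 1, 2, 2, 2, 2],
    "Credit_Score": ["Poor", "Standard", "Standard", "Good",
                     "Poor", "Poor", "Standard", "Good"],
})

# normalize="index" turns each month's row into shares that sum to 1.
shares = pd.crosstab(toy["Month"], toy["Credit_Score"], normalize="index") * 100
shares = shares[["Poor", "Standard", "Good"]]  # worst-to-best column order
print(shares.round(1))
```

Plotting `shares` as a stacked bar chart gives a 100% stacked view, where changes in composition are not hidden by changes in the monthly totals.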

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The monthly stability (or slow change) in credit score composition allows Paisabazaar to monitor trends and intervene early if shifts in risk appear.

Business teams can leverage periods of rising "Good" scores for targeted promotions, while periods of increased "Poor" scores might prompt tighter risk management or enhanced customer support.

#### Chart - 9

Does higher income correlate with better credit scores?

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10,6))
sns.boxplot(x='Credit_Score', y='Annual_Income', data=df, order=['Poor', 'Standard', 'Good'], palette='Blues')
plt.title('Annual Income by Credit Score Category')
plt.xlabel('Credit Score Category')
plt.ylabel('Annual Income')
plt.show()


##### 1. Why did you pick the specific chart?

I have chosen a box plot because it compares annual income across the credit score categories (Poor, Standard, Good), which is highly relevant for understanding financial stratification in the dataset. It enables a quick assessment of whether income is associated with creditworthiness and helps in segmenting potential customers for targeted products.

##### 2. What is/are the insight(s) found from the chart?

Median and interquartile range for annual income increase as credit score improves: individuals with “Good” scores have higher median incomes than those classified as “Poor” or “Standard”.

All three categories show a wide spread, but the “Good” credit score group also has the most high-income outliers, suggesting a positive association between higher earning power and better creditworthiness.

Outliers in the “Poor” category reveal some high earners with unexpectedly low credit scores, which may indicate unusual financial behavior or recent negative events.
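The medians and interquartile ranges that the box plot draws can also be tabulated, which makes the comparison between categories explicit. The income values below are made up for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Made-up incomes, four per category.
toy = pd.DataFrame({
    "Credit_Score": ["Poor"] * 4 + ["Standard"] * 4 + ["Good"] * 4,
    "Annual_Income": [10, 20, 30, 40, 20, 30, 40, 50, 30, 50, 70, 90],
})

# Quartiles per category, one column per quantile.
q = (toy.groupby("Credit_Score")["Annual_Income"]
        .quantile([0.25, 0.5, 0.75])
        .unstack())
q.columns = ["q1", "median", "q3"]
q["iqr"] = q["q3"] - q["q1"]
q = q.loc[["Poor", "Standard", "Good"]]  # worst-to-best row order
print(q)
```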

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These findings can be used to enable Paisabazaar to prioritize high-income, good credit score individuals for premium financial offerings and cross-sell opportunities, ultimately driving profitability.

The presence of high-income outliers in “Poor” and “Standard” categories suggests hidden opportunities for credit rehabilitation products or services, benefiting both the company and customers.

This stratification aids in refining risk models, customizing credit limits, and developing more personalized customer outreach, supporting smarter portfolio management and more effective marketing.

#### Chart - 10

Does annual income vary with credit score ranges?

In [None]:
# Chart - 10 visualization code

plt.figure(figsize=(8,5))
sns.violinplot(x='Credit_Score', y='Annual_Income', data=df, inner='quartile')
plt.title("Annual Income Distribution Across Credit Score Ranges")
plt.xlabel("Credit Score")
plt.ylabel("Annual Income")
plt.show()

##### 1. Why did you pick the specific chart?

This violin plot clearly visualizes the entire distribution (not just the mean, median, or quartiles) of annual income for each credit score category ("Good," "Poor," and "Standard"). It is highly insightful for understanding how income varies within and between different risk groups in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The “Good” credit score group has the highest concentration of higher annual incomes, with a broader spread and more upper-tail values compared to “Poor” and “Standard” groups.

The “Poor” group’s income distribution is concentrated at lower values, though a small number of higher incomes are present, possibly indicating people with good financial capacity but problematic repayment histories or other risk-inducing factors.

The “Standard” group sits between the two, with moderate spread and density, reinforcing the idea that income positively correlates with better credit quality in this dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Higher-income customers in lower credit bands may represent opportunities for financial inclusion, targeted risk mitigation, or premium up-selling (with education or counseling).

The strong association between higher income and better credit scores supports validating or refining credit modeling, product design, and customer targeting strategies for conversion and retention.

#### Chart - 11

Is there a relationship between the number of loans taken and risk factor?

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(10,6))
sns.countplot(x='Num_of_Loan', hue='Risk_Flag', data=df, palette='Blues')
plt.title('Number of Loans by Risk Status')
plt.xlabel('Number of Loans')
plt.ylabel('Count')
plt.legend(title='Risk Flag')
plt.show()


##### 1. Why did you pick the specific chart?

This grouped bar chart visualizes the relationship between the number of loans taken and the risk flag, making it easy to assess whether loan count is associated with higher risk in the data.



##### 2. What is/are the insight(s) found from the chart?

For customers with 0 to 4 loans, non-risky cases (light blue) clearly predominate, but as the number of loans increases (especially at 5 and above), risky cases (blue) start to match or even exceed non-risky ones.

Customers with 5 or more loans are disproportionately likely to be flagged as risky, suggesting a positive association between loan count and risk.

We can say that as the number of loans increases, the associated risk also appears to increase.
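Since each loan count has a different number of customers, comparing bar heights can mislead. The share of risky cases within each loan count is a more direct measure. A minimal sketch with made-up rows:

```python
import pandas as pd

# Made-up rows; column names follow the notebook.
toy = pd.DataFrame({
    "Num_of_Loan": [1, 1, 1, 1, 5, 5, 5, 5],
    "Risk_Flag":   [0, 0, 0, 1, 0, 1, 1, 1],
})

# The mean of a 0/1 flag is the share of risky cases in that group.
rate = (toy.groupby("Num_of_Loan")["Risk_Flag"]
           .agg(risky_share="mean", customers="size"))
print(rate)
```

Keeping the `customers` column next to the rate shows how many rows each share is based on, which matters for loan counts with few customers.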

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can flag high-loan-count customers for additional scrutiny or dynamic risk scoring, possibly tightening credit policies for such segments.

These findings can be used to build business rules for proactive fraud prevention, optimize resource allocation for manual review, and refine customer eligibility checks.

This insight empowers the business to implement early warning systems for repeat borrowers, improving portfolio health while reducing bad debt and fraudulent loss exposure.

#### Chart - 12

Is there a negative correlation between the number of delayed payments and credit scores?

In [None]:
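# Chart - 12 visualization code
plt.figure(figsize=(10,6))
# Fix the category order so the boxes read from worst to best score
sns.boxplot(x='Credit_Score', y='Num_of_Delayed_Payment', data=df, order=['Poor', 'Standard', 'Good'])
plt.title('Number of Delayed Payments by Credit Score Category')
plt.show()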


##### 1. Why did you pick the specific chart?

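A box plot is well suited to analyzing how payment behavior aligns with the assigned credit score category in the Paisabazaar dataset, because it shows the median, spread, and outliers of delayed payments for each category side by side.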

##### 2. What is/are the insight(s) found from the chart?

Customers with "Poor" or "Standard" credit scores have higher median and interquartile ranges for delayed payments compared to those with "Good" scores.

The "Good" score group has both lower medians and a tighter spread, with only a few high outliers, indicating more disciplined repayment behavior.

The pattern matches expected business logic: more delayed payments associate with poorer credit scores, validating credit risk models and scoring principles.
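The question for this chart asks about a negative correlation, which the box plot only shows visually. One way to put a number on it is a rank correlation, which suits an ordinal score. The sketch below computes Spearman's rho as the Pearson correlation of the ranks, on made-up rows; the 1/2/3 coding mirrors the mapping used for `CreditScoreNumeric` earlier.

```python
import pandas as pd

# Made-up rows; column names follow the notebook.
toy = pd.DataFrame({
    "Credit_Score": ["Good", "Good", "Standard", "Standard", "Poor", "Poor"],
    "Num_of_Delayed_Payment": [2, 5, 9, 12, 18, 25],
})
score_num = toy["Credit_Score"].map({"Poor": 1, "Standard": 2, "Good": 3})

# Spearman's rho = Pearson correlation of the ranks (ties get average ranks).
rho = score_num.rank().corr(toy["Num_of_Delayed_Payment"].rank())
print(round(rho, 3))
```

A value close to −1 means that more delayed payments go with lower score categories, and a value near 0 means no monotonic relationship.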

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These findings can be used to refine recovery and NPA (non-performing asset) strategies, support loan approval automation, and improve prediction accuracy for future creditworthiness.

These results also support more personalized customer outreach (e.g., targeted education for Standard/Poor segments) and risk-based pricing.

#### Chart - 13

How is credit score distributed across different age groups?

In [None]:
# Chart - 13 visualization code

# Create a cross-tab for counts

cross_tab = pd.crosstab(df['Age_Group'], df['Credit_Score'])

# Plot stacked bar chart
cross_tab.plot(kind='bar', stacked=True, figsize=(8,5))


plt.title("Distribution of Credit Scores Across Age Groups")
plt.xlabel("Age Group")
plt.ylabel("Count")
plt.legend(title="Credit Score")
plt.show()

##### 1. Why did you pick the specific chart?

This stacked bar chart clearly visualizes the distribution of credit score categories (Good, Poor, Standard) within different age groups, making it ideal for understanding how age relates to creditworthiness and risk segmentation in the Paisabazaar dataset.

##### 2. What is/are the insight(s) found from the chart?

The middle age groups (26–35 and 36–45) have the highest overall counts and the largest proportion of Standard scores, followed by Poor and then Good.

The youngest (18–25) and oldest (46–60) age groups have lower counts overall, and Standard is again the dominant score, though the proportion of Good scores appears slightly higher in the oldest group.

Across all age groups, Standard scores consistently dominate, meaning most customers fall into this moderate risk segment regardless of age.
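The age bands above (18–25, 26–35, 36–45, 46–60) can be built with `pd.cut`, and the share of "Good" scores per band checks the remark about the oldest group. This is a sketch on made-up rows; the bin edges are an assumption based on the band labels quoted here, since the construction of `Age_Group` is not shown in this section.

```python
import pandas as pd

# Made-up rows; column names follow the notebook.
toy = pd.DataFrame({
    "Age": [18, 25, 26, 35, 36, 45, 46, 60],
    "Credit_Score": ["Poor", "Standard", "Standard", "Good",
                     "Standard", "Poor", "Good", "Good"],
})

# Assumed edges: intervals are right-inclusive, so (17, 25] covers ages 18-25.
bins = [17, 25, 35, 45, 60]
labels = ["18-25", "26-35", "36-45", "46-60"]
toy["Age_Group"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Percentage of "Good" scores within each age band.
good_share = (toy["Credit_Score"].eq("Good")
              .groupby(toy["Age_Group"], observed=False).mean() * 100)
print(good_share)
```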

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Paisabazaar can tailor products and interventions by age and risk segment, focusing campaigns and risk management on the high-volume Standard-score groups in the middle age bands while customizing offerings for older, better-scoring customers.

Targeting across age groups can support credit improvement programs, financial education, and retention strategies.


# **Multivariate Analysis**
Analyzes the relationships among more than two variables.

Graphs:

* Heatmap: Shows correlations between variables (range −1 to 1).
* Pair Plot: Visualizes relationships across multiple numerical variables.

#### Chart - 14

How does credit score vary with the number of credit cards and the credit utilization ratio?

In [None]:
#checking range of utilization ratio
df['Credit_Utilization_Ratio'].unique()

In [None]:
# Chart - 14 visualization code

df_filtered = df[df['Credit_Utilization_Ratio'] < 50]  # keep only rows with utilization below 50%

plt.figure(figsize=(8,5))
sns.stripplot(
    data=df_filtered,
    x='Num_Credit_Card',
    y='CreditScoreNumeric',
    hue='Credit_Utilization_Ratio',
    palette='ocean',
    size=7,
    jitter=True  # Enabled jitter to spread points horizontally
)
plt.title("Credit Score vs Number of Credit Cards ")
plt.xlabel("Number of Credit Cards")
plt.ylabel("Credit Score (Numeric)")
plt.legend(title='Credit Utilization (%)', bbox_to_anchor=(1.05,1), loc='upper left')
plt.show()

##### 1. Why did you pick the specific chart?

This strip plot (a jittered categorical scatter plot) is well suited to analyzing the relationship between the number of credit cards and a customer's credit score, with color coding for credit utilization adding a third dimension. This multi-feature view helps Paisabazaar identify complex associations important for credit risk modeling and product targeting.

##### 2. What is/are the insight(s) found from the chart?

The chart shows distinct clusters of credit scores (appearing as three major discrete values), spread across the range of credit card counts (0–11).

Color encoding by credit utilization ratio (from 20% to 45%) adds a third dimension. There is no prominent concentration of higher or lower utilization at any specific score or card count.

The distribution appears uniform, with no obvious trend indicating that more credit cards directly improve or worsen the credit score for customers within the utilization criteria shown.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This visualization suggests that, for customers with utilization under 50%, credit score does not vary strongly with the number of credit cards. In other words, holding more cards is not by itself associated with a worse score in the segment visualized.

Credit scoring and risk models should focus on utilization and other behavioral factors over simple card counts, helping refine lending and marketing strategies.


#### Chart - 15 - Correlation Heatmap

In [None]:
print(numerical_cols.columns)

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(12,10))
corr = df[['Annual_Income', 'Monthly_Inhand_Salary','Age', 'Credit_Utilization_Ratio','Delay_Ratio', 'Outstanding_Debt','Num_of_Loan', 'Num_Credit_Inquiries','Risk_Score', 'Risk_Flag']].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='Blues', center=0)
plt.title('Correlation Heatmap of Numerical Features ')
plt.show()


##### 1. Why did you pick the specific chart?

A correlation heatmap enables understanding the pairwise linear relationships among multiple numerical features, including fraud and risk indicators. It helps identify which features move together and which are independent.

##### 2. What is/are the insight(s) found from the chart?

Risk Flag is most positively correlated with Number of Credit Inquiries (0.39), Outstanding Debt (0.36), Number of Loans (0.32), and Risk Score (0.30). Delay Ratio correlates only weakly (0.14) with the risk flag.

Risk Score is strongly correlated with Outstanding Debt (0.73), showing that debt levels are a core risk indicator. It also correlates with Number of Loans (0.59), Number of Credit Inquiries (0.50), and Delay Ratio (0.35).

Annual Income and Monthly Inhand Salary are almost perfectly correlated (1.00). Both are negatively associated with Risk Flag (−0.17 and −0.16) and with Risk Score (−0.49 for both), indicating that lower income is associated with higher risk in this dataset.
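Reading the strongest drivers off a 10×10 heatmap is error-prone, so it can help to pull out the `Risk_Flag` column of the correlation matrix and sort it by absolute value. The sketch below uses randomly generated data with a made-up relationship (the flag is more likely with higher debt and lower income); none of the numbers relate to the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data with a made-up relationship, for illustration only.
rng = np.random.default_rng(0)
n = 500
debt = rng.normal(size=n)
income = rng.normal(size=n)
risk_flag = ((debt - 0.5 * income + rng.normal(size=n)) > 0.5).astype(int)

toy = pd.DataFrame({
    "Outstanding_Debt": debt,
    "Annual_Income": income,
    "Risk_Flag": risk_flag,
})

# Pearson correlations with the flag, strongest first (by absolute value).
corr = toy.corr()
ranked = corr["Risk_Flag"].drop("Risk_Flag").sort_values(key=abs, ascending=False)
print(ranked.round(2))
```

Note that the Pearson correlation between a numeric feature and a 0/1 flag is the point-biserial correlation, so it captures only the linear part of the relationship.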

#### Chart - 16 - Pair Plot

In [None]:
# Pair Plot visualization code

cols_to_plot = ['Annual_Income', 'Age', 'Credit_Utilization_Ratio', 'Delay_Ratio', 'Num_of_Loan', 'Credit_Score']

sns.pairplot(df[cols_to_plot], hue='Credit_Score', diag_kind='kde', palette='Blues')
plt.suptitle('Pairplot of Features with respect to Credit Score', y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot is used for multivariate visualization: it shows the distributions and pairwise relationships of several key numerical features (annual income, age, credit utilization, delay ratio, and number of loans), colored by credit score. This helps to visually detect patterns and clusters in the Paisabazaar dataset.

##### 2. What is/are the insight(s) found from the chart?

The visuals above show that robust credit modeling at Paisabazaar will need to combine multiple features, because no single variable cleanly separates the credit score categories. A multivariate approach is essential.

The visual evidence supports focus on income and utilization as primary drivers, assisting product targeting, upsell, and risk segmentation.

The chart style itself is business-friendly for exploratory analysis sessions and communicating feature-power to data science and business stakeholders.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?



To meet Paisabazaar's business objectives effectively, I would like to recommend a strategic approach grounded in data-driven insights and operational precision:

1. Predictive Risk-Based Scoring:
Utilize the engineered features—annual income, credit utilization ratio, delay ratio, outstanding debt, number of credit inquiries, and composite risk score—within machine learning models to automate credit risk classification. This will allow Paisabazaar to rapidly and accurately assess applicants, leading to faster, more objective loan decisioning and reduced manual intervention.

2. Data-Driven Customer Segmentation:
Apply clustering and grouping methods on financial and behavioral variables (e.g., loan and credit card usage, payment history, delay ratios) to identify distinct customer profiles. Use these segments to design targeted product offerings and to replicate successful customer journeys.

3. Continuous Risk Monitoring and Early Intervention:
Integrate real-time risk flags and utilization/debt monitoring into loan servicing workflows. When risk indicators (like spikes in utilization or frequent delayed payments) are detected, trigger timely outreach, proactive credit limit management, or tailored repayment options, preventing defaults and preserving portfolio quality.

4. Personalized Financial Solutions and Communication:
Leverage the insights from risk scoring and segmentation to make personalized product recommendations for each customer—such as offering consolidation loans to high-debt users, specialized cards to disciplined spenders, or insurance add-ons for users with volatile income. Complement this with clear, data-driven advice to build trust and empower customers to improve their credit health.

5. Enable Data Transparency and Awareness:
Share summary risk profiles and actionable credit tips with customers to enhance engagement and loyalty. Internally, deploy visual dashboards tracking portfolio risk shifts, top risk drivers, and product uptake, ensuring management can steer strategy with agility.
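As a rough illustration of recommendation 3, the sketch below turns two risk indicators into a simple alert level. The thresholds, the `Customer_ID` column, and the action labels are all hypothetical and chosen only for the example; real cut-offs would need to be calibrated on the actual portfolio.

```python
import pandas as pd

# Hypothetical thresholds for illustration only.
UTILIZATION_LIMIT = 40.0  # percent
DELAY_LIMIT = 10          # number of delayed payments

# Made-up customers; Customer_ID is a hypothetical identifier column.
toy = pd.DataFrame({
    "Customer_ID": ["A", "B", "C", "D"],
    "Credit_Utilization_Ratio": [25.0, 45.0, 30.0, 42.0],
    "Num_of_Delayed_Payment": [2, 3, 15, 12],
})

high_util = toy["Credit_Utilization_Ratio"] > UTILIZATION_LIMIT
many_delays = toy["Num_of_Delayed_Payment"] > DELAY_LIMIT

# Count how many indicators are breached and map the count to an action.
toy["alerts"] = high_util.astype(int) + many_delays.astype(int)
toy["action"] = toy["alerts"].map({0: "none", 1: "review", 2: "outreach"})
print(toy[["Customer_ID", "alerts", "action"]])
```

A rule like this is easy to explain to business teams, and it could later be replaced or complemented by a trained model from recommendation 1.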

# **Conclusion**

After performing the required steps we can see that the customer data analysis underscores the importance of both financial and behavioral factors in assessing credit risk at Paisabazaar. Income, repayment patterns, loan exposure, and credit utilization emerged as the principal drivers of creditworthiness. Notably, while middle-income and working-age customers constitute the majority, repayment discipline and debt management, rather than earnings alone, serve as the clearest predictors of strong credit scores.

From visual explorations it is confirmed that customers prone to delayed payments or high utilization present elevated risk, supporting the need for robust risk monitoring and targeted policies for these segments. Conversely, customers with healthy financial ratios and reliable payment histories represent attractive prospects for premium products and cross-sell opportunities.

Overall, these insights provide Paisabazaar with actionable guidance for refining risk models, customizing product strategies, and prioritizing resource allocation to balance growth with prudent risk management.
