In [None]:
from google.colab import drive
drive.mount('/content/drive')

# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name  -**  Jayesh Ambaldahge



# **Project Summary -**

This project focuses on performing an in-depth Exploratory Data Analysis (EDA) on customer credit-related data from Paisabazaar, a financial services company that relies heavily on creditworthiness assessment to approve loans and reduce financial risk. The objective of this analysis was to identify the patterns, anomalies, and key drivers influencing Credit Score categories (Good, Standard, Poor) while generating actionable business insights that can guide both lending and customer engagement strategies.

The dataset contained detailed customer-level attributes such as Age, Occupation, Annual Income, Credit Utilization, Loan History, Repayment Behavior, Delayed Payments, Credit Mix, and Monthly Balances. With over 25 variables, the dataset provided a comprehensive view of customer financial behavior.

The first stage involved understanding the dataset structure, identifying missing values, duplicate entries, and inconsistencies in numeric and categorical variables. Standard cleaning procedures such as filling nulls, normalizing formats, and ensuring consistent data types were applied to make the dataset ready for visualization.

Through 15 well-chosen charts and visualizations, several strong insights were discovered:

Credit Utilization Ratio showed that customers with higher utilization (>70%) mostly fall in the “Poor” credit score group, whereas disciplined users (30–40% utilization) are more likely to have “Good” scores.

Occupation Distribution revealed that professions like Lawyers, Engineers, and Doctors dominate the customer base, with higher-income jobs correlating with better credit scores.

Income and Age Factors highlighted that customers with higher annual income and individuals above 35 years of age tend to maintain healthier credit scores, whereas younger or low-income groups often struggle with repayment discipline.

Repayment Behavior such as skipping minimum due payments or frequent delayed payments strongly aligned with “Poor” credit scores, confirming repayment history as one of the most crucial predictors of financial health.

Credit Mix and Loan Count showed that customers maintaining a balanced mix of loans and credit cards tend to perform better, while those overloaded with multiple loans mostly fall under “Poor.”

Correlation Analysis validated relationships such as debt being tied to EMI payments, and income closely linked with salary — essential for feature selection in future predictive modeling.

These findings have direct implications for Paisabazaar’s business objectives. Customers can be segmented strategically:

Good Credit Score: Offer premium loans, high-value credit cards, and investment products.

Standard Credit Score: Provide credit-building programs, smaller loan options, and personalized financial advice to help them graduate into the “Good” category.

Poor Credit Score: Implement stricter credit controls, limited credit exposure, and monitoring while also offering educational tools to improve financial habits.

To reduce fraud and defaults, the company can give more weight to repayment discipline, utilization ratio, and delay history alongside traditional factors like income. Proactive measures such as automated payment reminders, auto-debit facilities, and reward points for timely repayments can encourage better credit behavior.

In conclusion, the analysis highlighted that creditworthiness is a multi-dimensional outcome shaped by income levels, spending discipline, repayment patterns, and occupational stability. While a majority of customers currently fall under the “Standard” category, this represents an opportunity for Paisabazaar to convert them into “Good” through targeted interventions. By using these insights, the company can enhance risk management, increase lending profitability, and foster long-term customer relationships.

This EDA not only provides clarity on the current customer base but also builds a strong foundation for future predictive modeling and fraud detection systems. Ultimately, the project demonstrates how data-driven decision-making can make lending both safer and more profitable for Paisabazaar.

# **Problem Statement**


Paisabazaar, a leading financial services company, relies heavily on assessing the creditworthiness of individuals to support loan approvals and manage financial risks. Credit scores play a vital role in this process, as they indicate the likelihood of an individual repaying their debts.

To strengthen decision-making, Paisabazaar seeks to explore and analyze customer data that includes attributes such as income, spending behavior, credit card usage, and payment history. The goal is to perform Exploratory Data Analysis (EDA) to:

- Understand the distribution of credit scores (Good, Standard, Poor) across the customer base.
- Identify key factors that differentiate individuals with high and low credit scores.
- Detect patterns, correlations, and anomalies in the dataset that may influence credit score classification.
- Provide data-driven insights that can later support the development of predictive models and personalized financial recommendations.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset



### Dataset First View

In [None]:
# Dataset First Look

In [None]:
# Load the dataset from the mounted Drive and preview the first rows
df = pd.read_csv('/content/drive/MyDrive/dataset(2).csv')
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
# No missing values found
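
When a dataset does contain gaps, a per-column table is often easier to read than a plot. The snippet below is a minimal sketch of that idea. The toy frame and the helper name `missing_summary` are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; in the notebook you would pass the real `df` instead.
toy = pd.DataFrame({
    "Age": [23, 35, np.nan, 41],
    "Occupation": ["Lawyer", None, "Engineer", "Doctor"],
    "Annual_Income": [30000.0, 52000.0, 41000.0, 78000.0],
})

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the real data should return zeros everywhere if the "no missing values" observation holds.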

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
df.columns

In [None]:
# Drop identifier columns that are not required for the analysis: ID, Name, SSN
df = df.drop(columns=['ID', 'Name', 'SSN'])


In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.dtypes

In [None]:
df['Type_of_Loan'].value_counts()

In [None]:
# Replace the placeholders 'No Data' and 'Not Specified' with real missing values (np.nan)
df['Type_of_Loan'] = df['Type_of_Loan'].replace(['No Data', 'Not Specified'], np.nan)


In [None]:
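# Payment_of_Min_Amount: replace the placeholder 'NM' with a real missing value (np.nan)
df['Payment_of_Min_Amount'] = df['Payment_of_Min_Amount'].replace('NM', np.nan)
# Show the category counts after the replacement, including missing values
df['Payment_of_Min_Amount'].value_counts(dropna=False)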
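
One detail worth checking when recoding placeholders: the text `'NaN'` and the value `np.nan` behave differently. The sketch below uses a small made-up sample of the `Payment_of_Min_Amount` column (the values are assumptions, not taken from the dataset) to show that only `np.nan` is recognised by `isna()` and skipped by `value_counts()`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of the Payment_of_Min_Amount column.
payments = pd.Series(["Yes", "No", "NM", "Yes", "NM"], name="Payment_of_Min_Amount")

as_text = payments.replace({"NM": "NaN"})      # still an ordinary string category
as_missing = payments.replace({"NM": np.nan})  # a true missing value

print(as_text.isna().sum())       # counts real missing values only
print(as_missing.isna().sum())
print(as_missing.value_counts())  # missing entries are left out by default
```

The practical consequence is that a string `'NaN'` would show up as its own category in later charts, while a real missing value is excluded from `pd.crosstab` and counted by `df.isna().sum()`.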

In [None]:
#checking if there is any negative value in the age column
df[df['Age']<0]

In [None]:
#Annual_Income
df[df['Annual_Income']<0]


In [None]:
#Monthly_Inhand_Salary
df[df['Monthly_Inhand_Salary']<0]

In [None]:
#Num_Bank_Accounts
df[df['Num_Bank_Accounts']<0]


In [None]:
#Num_Credit_Card
df[df['Num_Credit_Card']<0]
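
The per-column checks above can also be run in one pass. This is a minimal sketch on a hypothetical toy frame that reuses column names from this notebook; the values are invented to make the output non-empty.

```python
import pandas as pd

# Hypothetical toy frame using column names from this notebook.
toy = pd.DataFrame({
    "Age": [23, -5, 41],
    "Annual_Income": [30000.0, 52000.0, 78000.0],
    "Num_Bank_Accounts": [2, 3, -1],
    "Occupation": ["Lawyer", "Engineer", "Doctor"],
})

# Count negative entries in every numeric column at once.
numeric = toy.select_dtypes(include="number")
negative_counts = (numeric < 0).sum()

# Show only the columns that actually contain negative values.
print(negative_counts[negative_counts > 0])
```

On the real data, replacing `toy` with `df` would list any numeric column that contains a negative value, which avoids repeating one cell per column.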

In [None]:
#checking for any duplicate values
df.duplicated().sum()

In [None]:
df.isna().sum()

In [None]:
df.info()

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts: Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set(style="whitegrid", palette="muted", font_scale=1.2)

# 1. Histogram of Age
plt.figure(figsize=(8,5))
sns.histplot(df["Age"], bins=30, kde=True, color="skyblue")
plt.title("Distribution of Customer Age")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

To see customer age distribution.

##### 2. What is/are the insight(s) found from the chart?

Most customers are between 20 and 45 years old.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps tailor financial products for dominant age groups.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# 2. Credit Score Distribution
plt.figure(figsize=(6,4))
sns.countplot(x='Credit_Score', data=df, order=df['Credit_Score'].value_counts().index)
plt.title("Credit Score Distribution")
plt.xlabel("Credit Score")
plt.ylabel("Count")
plt.show()



##### 1. Why did you pick the specific chart?

Shows overall customer segmentation.

##### 2. What is/are the insight(s) found from the chart?

Majority have Standard scores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps Paisabazaar identify focus groups (improve “Poor” or target “Good”).

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# 3. Annual Income Distribution
plt.figure(figsize=(6,4))
sns.histplot(df['Annual_Income'], bins=30, kde=True)
plt.title("Annual Income Distribution")
plt.xlabel("Annual Income")
plt.ylabel("Count")
plt.show()

plt.figure(figsize=(6,4))
sns.boxplot(x=df['Annual_Income'])
plt.title("Annual Income Spread (Boxplot)")
plt.show()

##### 1. Why did you pick the specific chart?

Income drives creditworthiness.

##### 2. What is/are the insight(s) found from the chart?

The distribution is right-skewed, with a small number of high earners.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identify middle vs premium customer clusters.
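
The skew and the high-earner tail can also be quantified rather than read off the plots. The sketch below uses invented income values (an assumption for illustration) and the conventional 1.5 × IQR boxplot rule to flag unusually high incomes.

```python
import pandas as pd

# Hypothetical incomes with one very high earner.
income = pd.Series(
    [20000, 25000, 30000, 32000, 35000, 40000, 45000, 300000],
    name="Annual_Income",
)

skewness = income.skew()                  # a positive value means a longer right tail
q1, q3 = income.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)        # the usual boxplot whisker rule
high_earners = income[income > upper_fence]

print(round(skewness, 2), upper_fence, high_earners.tolist())
```

Applied to `df['Annual_Income']`, the same three lines would give a skewness figure and a candidate threshold for separating the middle and premium clusters.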

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# 4. Outstanding Debt Distribution (Histogram)
plt.figure(figsize=(6,4))
sns.histplot(df['Outstanding_Debt'], bins=30, kde=True, color="orange")
plt.title("Outstanding Debt Distribution")
plt.xlabel("Outstanding Debt")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

Debt affects risk.

##### 2. What is/are the insight(s) found from the chart?

Many customers have moderate debt.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps refine loan approval rules.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# 5. Credit Utilization Ratio Distribution (Histogram)
plt.figure(figsize=(6,4))
sns.histplot(df['Credit_Utilization_Ratio'], bins=30, kde=True, color="green")
plt.title("Credit Utilization Ratio Distribution")
plt.xlabel("Credit Utilization Ratio")
plt.ylabel("Count")
plt.show()

##### 1. Why did you pick the specific chart?

A histogram shows how the credit utilization ratio is distributed across customers.

##### 2. What is/are the insight(s) found from the chart?

Many customers use more than 50% of their available credit.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Supports risk control and product recommendations.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# 6. Occupation Distribution (Bar Plot)
plt.figure(figsize=(10,5))
sns.countplot(y='Occupation', data=df, order=df['Occupation'].value_counts().index)
plt.title("Occupation Distribution of Customers")
plt.xlabel("Count")
plt.ylabel("Occupation")
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart makes it simple to compare how many customers belong to each occupation.

##### 2. What is/are the insight(s) found from the chart?

Lawyers and Engineers form the majority of the customer base. Other professions, such as doctors and business owners, are fewer in number.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# 7. Credit Score vs Annual Income (Boxplot)
plt.figure(figsize=(8,5))
sns.boxplot(x='Credit_Score', y='Annual_Income', data=df, order=["Poor", "Standard", "Good"])
plt.title("Credit Score vs Annual Income")
plt.xlabel("Credit Score")
plt.ylabel("Annual Income")
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot is great for showing income ranges across different credit score categories (Good, Standard, Poor).

##### 2. What is/are the insight(s) found from the chart?

Customers with a “Good” credit score tend to have higher annual incomes compared to those with “Poor” scores.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This means income is strongly linked to creditworthiness, so banks can give higher loan limits to high-income groups while being cautious with low-income customers.
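
A grouped summary puts numbers behind what the boxplot shows. This is a sketch on made-up data (the incomes are assumptions); only the column names and the category order come from the notebook.

```python
import pandas as pd

# Hypothetical toy data with the notebook's column names.
toy = pd.DataFrame({
    "Credit_Score": ["Poor", "Poor", "Standard", "Standard", "Good", "Good"],
    "Annual_Income": [18000, 22000, 30000, 34000, 50000, 70000],
})

order = ["Poor", "Standard", "Good"]
income_by_score = (
    toy.groupby("Credit_Score")["Annual_Income"]
       .agg(["count", "median", "mean"])
       .reindex(order)  # same category order as the chart
)
print(income_by_score)
```

The median is usually the safer figure to compare here, since a skewed income distribution pulls the mean upward.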

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# 8. Credit Score vs Age (Boxplot)
plt.figure(figsize=(8,5))
sns.boxplot(x='Credit_Score', y='Age', data=df, order=["Poor", "Standard", "Good"])
plt.title("Credit Score vs Age")
plt.xlabel("Credit Score")
plt.ylabel("Age")
plt.show()

##### 1. Why did you pick the specific chart?

Age may influence how responsibly people handle credit, and a boxplot highlights patterns across age groups.

##### 2. What is/are the insight(s) found from the chart?

Older customers (35+) generally have better credit scores than younger ones, who are more likely to be in the “Poor” or “Standard” categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Age can be used as an additional profiling factor in credit models. For example, offering credit-building products to younger customers can help grow long-term loyalty.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# 9. Credit Score vs Number of Loans (Bar Plot)
plt.figure(figsize=(8,5))
sns.barplot(x='Credit_Score', y='Num_of_Loan', data=df, order=["Poor", "Standard", "Good"], errorbar=None)
plt.title("Credit Score vs Number of Loans")
plt.xlabel("Credit Score")
plt.ylabel("Average Number of Loans")
plt.show()

##### 1. Why did you pick the specific chart?

Shows the relationship between loan count and credit scores.

##### 2. What is/are the insight(s) found from the chart?

Customers with multiple loans often fall into the “Poor” category, while those with fewer loans tend to score “Good.”

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps banks set loan eligibility rules: a customer who already has many loans carries higher credit risk, so stricter checks are needed.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# 10. Credit Score vs Payment of Minimum Amount (Stacked Bar)
payment_min = pd.crosstab(df['Payment_of_Min_Amount'], df['Credit_Score'], normalize='index') * 100
payment_min.plot(kind='bar', stacked=True, figsize=(8,5), colormap="Set2")
plt.title("Credit Score vs Payment of Minimum Amount")
plt.xlabel("Payment of Minimum Amount")
plt.ylabel("Percentage %")
plt.legend(title="Credit Score")
plt.show()

##### 1. Why did you pick the specific chart?

To check if customers are at least paying the minimum due on time.

##### 2. What is/are the insight(s) found from the chart?

Customers who skip minimum payments are far more likely to have “Poor” credit scores. Regular payers cluster in the “Good” score range.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This is an early warning signal—if someone stops paying minimum dues, the bank can take preventive steps like reminders or lowering their credit limit.
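
The stacked bars rely on `pd.crosstab(..., normalize='index')`, which converts each row of counts into shares of that row. The worked example below uses invented labels (an assumption, not the real distribution) to show the raw counts next to the row percentages.

```python
import pandas as pd

# Hypothetical toy data; the split between categories is made up.
toy = pd.DataFrame({
    "Payment_of_Min_Amount": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Credit_Score": ["Poor", "Poor", "Poor", "Standard", "Good", "Good", "Standard", "Poor"],
})

counts = pd.crosstab(toy["Payment_of_Min_Amount"], toy["Credit_Score"])
percent = pd.crosstab(
    toy["Payment_of_Min_Amount"], toy["Credit_Score"], normalize="index"
) * 100  # each row now sums to 100

print(counts)
print(percent)
```

Because every row sums to 100, the bars compare the mix of credit scores within each payment group and hide how large each group is, so it helps to look at the raw counts as well.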

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# 11. Credit Score vs Credit Mix (Stacked Bar)
credit_mix = pd.crosstab(df['Credit_Mix'], df['Credit_Score'], normalize='index') * 100
credit_mix.plot(kind='bar', stacked=True, figsize=(8,5), colormap="Paired")
plt.title("Credit Score vs Credit Mix")
plt.xlabel("Credit Mix")
plt.ylabel("Percentage %")
plt.legend(title="Credit Score")
plt.show()

##### 1. Why did you pick the specific chart?

A mix of credit types (loans, credit cards, mortgages) shows financial balance.

##### 2. What is/are the insight(s) found from the chart?

People with a balanced mix (e.g., some loans + credit cards) usually have better credit scores. Those relying heavily on one type often score lower.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Banks can encourage customers to diversify credit (e.g., offering small personal loans to card-only users) to improve financial stability.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
#12. Credit Score vs Number of Delayed Payments (Boxplot)
plt.figure(figsize=(8,5))
sns.boxplot(x='Credit_Score', y='Num_of_Delayed_Payment', data=df, order=["Poor", "Standard", "Good"])
plt.title("Credit Score vs Number of Delayed Payments")
plt.xlabel("Credit Score")
plt.ylabel("Number of Delayed Payments")
plt.show()

##### 1. Why did you pick the specific chart?

Payment delays are a direct red flag for credit health.

##### 2. What is/are the insight(s) found from the chart?

Customers with frequent delays overwhelmingly fall into the “Poor” category. Those with fewer or no delays mostly score “Good.”

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps banks enforce stricter lending rules for customers with repeated late payments.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# 13. Credit Score vs Credit Utilization Ratio (Boxplot)
plt.figure(figsize=(8,5))
sns.boxplot(x='Credit_Score', y='Credit_Utilization_Ratio', data=df, order=["Poor", "Standard", "Good"])
plt.title("Credit Score vs Credit Utilization Ratio")
plt.xlabel("Credit Score")
plt.ylabel("Credit Utilization Ratio")
plt.show()

##### 1. Why did you pick the specific chart?

Credit utilization (how much of your limit you use) strongly affects credit scores.

##### 2. What is/are the insight(s) found from the chart?

Customers with “Poor” scores typically use more than 70% of their available credit, while “Good” scorers keep usage below 30–40%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Useful for risk assessment—banks can set alerts when utilization is high, as it signals repayment trouble.
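
The utilization thresholds mentioned above (30%, 40%, 70%) can be turned into bands with `pd.cut` and then compared against credit score. This is a sketch under stated assumptions: the data is invented, and the band labels and the `Utilization_Band` column name are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; real ratios would come from df['Credit_Utilization_Ratio'].
toy = pd.DataFrame({
    "Credit_Utilization_Ratio": [22.0, 28.5, 33.0, 38.0, 45.0, 55.0, 72.0, 81.0],
    "Credit_Score": ["Good", "Good", "Good", "Standard",
                     "Standard", "Standard", "Poor", "Poor"],
})

# Band edges follow the thresholds discussed in the text; intervals are right-closed.
bins = [0, 30, 40, 70, 100]
labels = ["<=30%", "30-40%", "40-70%", ">70%"]
toy["Utilization_Band"] = pd.cut(toy["Credit_Utilization_Ratio"], bins=bins, labels=labels)

# Share of each credit score within every utilization band.
band_share = pd.crosstab(
    toy["Utilization_Band"], toy["Credit_Score"], normalize="index"
) * 100
print(band_share)
```

Running the same banding on the real column is a quick way to check whether the 70% and 30–40% figures quoted in the insight are supported by the data.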

#### Chart - 14

In [None]:

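# Chart - 14 visualization code
# 14. Correlation Heatmap
# Pearson correlations between the numeric columns only
corr = df.corr(numeric_only=True)
plt.figure(figsize=(14,10))
sns.heatmap(corr,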
            cmap="coolwarm",
            annot=True,
            fmt=".2f",       # 2 decimal places
            linewidths=0.5,
            annot_kws={"size":8})  # smaller text

plt.title("Correlation Heatmap (with values)", fontsize=14)
plt.show()



##### 1. Why did you pick the specific chart?

A heatmap makes it easy to see relationships between numbers (like income, EMI, debt, salary)

##### 2. What is/are the insight(s) found from the chart?

Some features are strongly linked, e.g., Debt and EMI are correlated, Income and Salary are correlated.
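
Strongly linked pairs can also be listed directly from the correlation matrix instead of being read off the heatmap. The sketch below builds synthetic data in which salary and income are related by construction (an assumption for illustration), and the 0.8 cut-off is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Synthetic data: income is derived from salary, age is independent.
rng = np.random.default_rng(0)
salary = rng.normal(4000, 800, size=200)
toy = pd.DataFrame({
    "Monthly_Inhand_Salary": salary,
    "Annual_Income": salary * 12 + rng.normal(0, 500, size=200),
    "Age": rng.integers(20, 60, size=200),
})

corr = toy.corr(numeric_only=True)

# Keep each pair once by masking everything except the upper triangle.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
strong_pairs = pairs[pairs.abs() > 0.8]  # 0.8 is an arbitrary cut-off
print(strong_pairs)
```

Pairs that pass the cut-off carry largely the same information, which is why they matter for feature selection in later modeling.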

#### Chart - 15

In [None]:
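# Chart - 15 visualization code
# 15. Occupation vs Annual Income by Credit Score (Violin Plot)
plt.figure(figsize=(12,6))
# split=True is dropped: older seaborn versions only allow it with exactly two hue levels
sns.violinplot(x="Occupation", y="Annual_Income", hue="Credit_Score",
               hue_order=["Poor", "Standard", "Good"], data=df, inner="quart", palette="Set2")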
plt.xticks(rotation=45)
plt.title("Occupation vs Annual Income by Credit Score")
plt.show()

##### 1. Why did you pick the specific chart?

A violin plot is chosen because it not only shows the spread of incomes in each occupation but also compares this spread across different credit score groups (Good, Standard, Poor). It gives a deeper view than a bar chart or boxplot.

##### 2. What is/are the insight(s) found from the chart?

High-income occupations like Doctors, Engineers, and Managers tend to have a larger portion of customers with Good credit scores.

Occupations such as Writers, Musicians, and Mechanics show a wider spread of incomes but also more customers falling in the Standard or Poor score range.

Some professions (e.g., Teachers, Accountants) show a balanced middle-income range, with many in the Standard category.

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain Briefly.

Segment customers clearly:

- Good → Premium offers, high-value loans.
- Standard → Credit improvement programs and personalized loan offers.
- Poor → Strict monitoring and limited credit exposure.

Improve risk scoring by giving more weight to utilization, repayment discipline, and delay history, alongside traditional factors like income.

Targeted marketing: Promote premium products to high-income, low-risk groups (Doctors, Engineers, Managers) while focusing on affordable, starter-level products for young/mid-income customers.

Encourage repayment discipline through auto-debit, SMS/email reminders, and reward points for timely payments.

By applying these insights, Paisabazaar can lower default rates, improve customer creditworthiness, and increase revenue by matching the right products with the right people. In short, this analysis shows how data-driven credit assessment can make lending both safer and more profitable.
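
The segmentation suggested above can be attached to the data with a simple lookup. This is only a sketch: the dictionary name `SEGMENT_ACTIONS`, the `Recommended_Action` column, and the toy rows are hypothetical, and the action text paraphrases the recommendations in this section.

```python
import pandas as pd

# Illustrative mapping from credit score segment to the suggested action.
SEGMENT_ACTIONS = {
    "Good": "Premium offers, high-value loans",
    "Standard": "Credit improvement programs, personalized loan offers",
    "Poor": "Strict monitoring, limited credit exposure",
}

# Hypothetical toy rows; in the notebook this would be the real `df`.
toy = pd.DataFrame({"Credit_Score": ["Good", "Poor", "Standard", "Good"]})
toy["Recommended_Action"] = toy["Credit_Score"].map(SEGMENT_ACTIONS)

# Segment sizes show how many customers each action would reach.
segment_sizes = toy["Credit_Score"].value_counts()
print(toy)
print(segment_sizes)
```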

# **Conclusion**

From the analysis, it’s clear that customer credit scores are shaped by multiple factors such as income, age, debt levels, repayment discipline, and even occupation. The data shows that most customers fall into the Standard credit score range, which creates both a challenge and an opportunity — these individuals can be guided towards better scores with the right financial products and awareness.

High credit utilization, delayed payments, and skipping minimum payments are the strongest indicators of poor creditworthiness. On the other hand, customers with steady income, balanced credit mix, and disciplined repayment behavior consistently achieve good credit scores.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***

