# **Project Name** -
#### Visual Insights into Financial Profiles and Credit Patterns


##### **Project Type**    - EDA
##### **Contribution**    - Individual
### TEJSHREE DESAI



# **Project Summary -**

This project is a data-driven initiative designed to help financial institutions more effectively assess customer creditworthiness and mitigate loan risks. The core objective is to leverage historical financial and behavioral data to build predictive models and generate actionable insights.

The project addresses several key business challenges that financial institutions commonly face:

 * Credit Score Prediction: The project aims to develop a robust machine learning model capable of predicting a customer's credit score. By analyzing historical data, the model will classify a customer's credit profile as "Good," "Standard," or "Poor." This is particularly useful for evaluating new customers or those with limited financial history, enabling the bank to make more accurate lending decisions.
 * Loan Default Risk Reduction: A critical component of the project is to reduce the rate of loan defaults. We will build a risk classification model that identifies high-risk individuals early in the loan approval process. This model will use behavioral patterns such as late payments, multiple credit inquiries, and high outstanding debt. By flagging these applicants, the bank can take proactive measures to either deny the loan or offer it with more stringent terms, thereby protecting its financial assets.
 * Customer Segmentation: To enhance marketing and product offerings, the project will use clustering techniques (such as K-means) to segment customers into distinct groups. These segments will be based on income, spending habits, and other behavioral characteristics. This allows the bank to offer personalized products, like tailored credit card options or investment plans, that are better suited to each customer segment's needs and financial behavior.
 * Improving Financial Decision-Making: Ultimately, the project's goal is to empower the bank's decision-makers with data-driven insights. By providing predictive models and a deeper understanding of customer behavior, the project will improve the efficiency and accuracy of loan approvals, risk management, and marketing strategies.

The proposed solution involves a multi-pronged approach. We will start by cleaning and preparing the data, followed by building and training machine learning models like Random Forest and XGBoost for prediction tasks. We will also perform exploratory data analysis to uncover hidden patterns, such as the relationship between lower credit utilization and higher credit scores. Additionally, the project will incorporate anomaly detection techniques to identify sudden and unusual changes in a customer's credit behavior, which could be an early warning sign of distress or fraud. Finally, the insights and model predictions will be visualized through a dashboard, providing loan recovery and management teams with a clear overview of customer risk profiles and enabling them to prioritize their efforts effectively.

In summary, this project is a comprehensive solution that uses advanced analytics and machine learning to improve the bank’s operational efficiency, reduce financial risk, and ultimately enhance its profitability and customer relationships. The outcomes will not only help in making smarter lending decisions but also in fostering a more personalized and secure banking experience for customers.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**




**Credit Score Prediction**

Many financial institutions struggle with evaluating the creditworthiness of individuals, especially new customers with limited financial history. A poor assessment can lead to loan defaults or missed business opportunities.

**Problem:** How can we accurately predict a customer’s credit score (Good, Standard, Poor) using historical and financial behavior data?

**Loan Risk Assessment**

Offering loans to high-risk individuals can lead to defaults. Identifying patterns in customer behavior (e.g., delays in payment, credit utilization) can help mitigate risk.

**Problem:** How can we assess and categorize loan applications based on risk using customer credit history and behavior?

**Customer Segmentation for Financial Products**

Financial institutions want to offer targeted products (e.g., credit cards, loans, investment options).

**Problem:** How can we segment customers based on their income, spending habits, and financial behavior to suggest suitable products?

**Behavioral Analysis for Early Default Detection**

Early detection of likely defaulters can help take preventive action.

**Problem:** Can we identify behavioral signals (such as delay from due date, number of inquiries, utilization rate) that indicate high risk of default?

#### **Define Your Business Objective?**

The primary objective of this project is to assist financial institutions in evaluating customer creditworthiness and managing loan risks effectively by leveraging historical financial and behavioral data. The project aims to solve the following business challenges:

**Predicting Credit Score:**
Develop a model to predict whether a customer falls under a "Good", "Standard", or "Poor" credit score category using variables such as income, credit utilization, repayment history, and spending behavior.

**Reducing Loan Default Risk:**
Identify high-risk individuals early in the loan approval process using behavioral patterns such as delayed payments, inquiries, and outstanding debt.

**Customer Segmentation:**
Segment customers into different profiles to offer personalized financial products (e.g., credit cards, loan types, investment options) based on their income and spending patterns.

**Improving Financial Decision-Making:**
Enable more informed and data-driven decisions in loan approvals, risk management, and marketing by building predictive models and behavioral insights.

This project ultimately aims to improve the bank’s profitability, reduce credit losses, and enhance customer satisfaction by aligning services more closely with customer needs and risk profiles.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')


In [None]:
data=pd.read_csv('/content/drive/MyDrive/PROJECT2/dataset (2).csv')
data

### Dataset First View

In [None]:
# Dataset First Look
data

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

### Dataset Information

In [None]:
# Dataset Info
# This dataset contains detailed information about customers' financial profiles. It includes demographic data, occupation, credit behavior, income, loan details, payment history, credit utilization, and credit scores. The dataset appears to be used for analyzing creditworthiness, financial habits, and risk assessment for loan or credit applications.
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# The dataset contains 0 duplicate rows, i.e., there are no completely identical rows across all columns. While some customers appear multiple times (due to the Month field), these are not duplicates but repeated monthly records.
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# There are missing values in the dataset, especially in the Payment_of_Min_Amount column. These are represented as "NM" and should be handled as nulls during preprocessing. Other columns may have a few missing entries, but most appear to be complete.
# Assumed reconstruction of the count: treat "NM" as NaN, then count nulls per column.
data.replace('NM', np.nan).isnull().sum()

In [None]:
# Visualizing the missing values
# A heatmap was used to visualize missing values in the dataset. The "Payment_of_Min_Amount" column contains several missing entries, represented by "NM" in the raw data. After converting "NM" to null (NaN), we observed that this column had the highest number of missing values. Other columns are largely complete with no significant missing data.
sns.heatmap(data.replace('NM', np.nan).isnull(), cbar=False)  # assumed reconstruction of the heatmap described above
plt.show()

### What did you know about your dataset?


The dataset contains detailed customer-level financial, credit, and demographic data, recorded over multiple months. It includes variables like income, spending habits, loans, credit utilization, and credit scores. This rich dataset enables a wide range of financial analyses, including credit risk scoring, customer segmentation, and behavioral prediction. It is particularly valuable for banks, lenders, or fintech companies to tailor offerings, detect risk, and enhance customer relationship strategies.
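
The two checks described in this section (counting fully duplicated rows, and counting `NM` placeholders as missing values) can be sketched on a small made-up frame. The three rows below are hypothetical stand-ins, not records from the dataset; only the column names `Customer_ID`, `Month`, and `Payment_of_Min_Amount` come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records (not from the real dataset): one customer appears in two months.
toy = pd.DataFrame({
    "Customer_ID": [3392, 3392, 8625],
    "Month": [1, 2, 1],
    "Payment_of_Min_Amount": ["No", "NM", "Yes"],
})

# Only fully identical rows count as duplicates; repeated customers in different months do not.
duplicate_rows = int(toy.duplicated().sum())

# Treat the "NM" placeholder as a missing value before counting nulls per column.
cleaned = toy.replace("NM", np.nan)
missing_per_column = cleaned.isnull().sum()

print(duplicate_rows)
print(missing_per_column)
```

Counting nulls without the `replace` step would report no missing values here, because `NM` is an ordinary string as far as pandas is concerned.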

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
# The dataset has a wide range of income and debt levels. Most customers are between 20 and 60 years old. Credit utilization and EMI per month vary significantly, indicating a diverse financial profile.
data.describe(include='all')

### Variables Description

Customer_ID: Unique identifier for each customer

Month: Month of the observation

Name: Name of the customer

Age: Age of the customer

Occupation: Job role (e.g., Scientist, Engineer)

Annual_Income: Yearly income of the customer

Num_Bank_Account: Number of bank accounts held

Num_Credit_Card: Number of credit cards

Interest_Rate: Interest rate applicable to loans

Type_of_Loan: Types of loans taken

Delay_from_due_date: Days of delay in payments

Num_of_Delayed_Payment: Number of late payments

Credit_Utilization_Ratio: Credit used as a percentage of total credit available

Credit_Score: Target variable; category of credit quality (e.g., Good, Standard)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Reload the original DataFrame from the CSV file so that `data` is a DataFrame
# even if it was rebound in another cell (the notebook may have been run out of order).
data = pd.read_csv('/content/drive/MyDrive/PROJECT2/dataset (2).csv')
for col in data.columns:
    print(f"{col} → {data[col].nunique()} unique values")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
data.head()

In [None]:
data.tail()

In [None]:
# rename returns a new DataFrame for display; `data` itself keeps the 'Annual_Income' column.
data.rename({'Annual_Income':'ann_inc'},axis=1)

In [None]:
# drop also returns a new DataFrame; 'Occupation' is still present in `data` afterwards.
data.drop(['Occupation'], axis=1)

In [None]:
data.loc[:,'Annual_Income'].value_counts()

### What all manipulations have you done and insights you found?

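The wrangling steps previewed the first and last rows, renamed `Annual_Income` to `ann_inc`, dropped the `Occupation` column, and counted the distinct `Annual_Income` values. The rename and drop results are displayed but not assigned back, so `data` keeps its original columns for the charts that follow.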
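
Whether a rename or drop persists depends on whether the result is assigned. The sketch below illustrates this on a hypothetical two-row frame; the values are made up and only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical two-row frame using column names from the notebook.
toy = pd.DataFrame({
    "Annual_Income": [19114.12, 34847.84],
    "Occupation": ["Scientist", "Engineer"],
})

# Without assignment back to `toy`, rename/drop return new DataFrames and leave `toy` untouched.
renamed = toy.rename({"Annual_Income": "ann_inc"}, axis=1)
dropped = toy.drop(["Occupation"], axis=1)

print(list(toy.columns))      # original columns are still present
print(list(renamed.columns))
print(list(dropped.columns))
```

To make a change permanent, the result would need to be assigned (for example `toy = toy.drop(...)`), at the cost of later cells no longer finding the removed or renamed column.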

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
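# Chart - 1 visualization code
# Synthetic stand-in: 100 draws from a standard normal distribution, not the dataset's Annual_Income column.
Annual_income=np.random.randn(100)
Annual_income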

In [None]:
plt.figure(figsize=(3,3))
plt.hist(Annual_income)
plt.show()

##### 1. Why did you pick the specific chart?

Annual_Income is a continuous numerical variable, and histograms are designed to display the distribution of continuous numerical data.
A histogram shows the shape of a distribution, including its central tendency, spread, skewness, outliers, modality, and frequency counts.
In particular, it shows where the data is centered, how spread out it is, whether there are unusually high or low values, and whether the data has multiple peaks.

##### 2. What is/are the insight(s) found from the chart?

The distribution of the plotted values appears roughly bell-shaped with a positive (right) skew. This means there is a longer tail on the higher side, indicating a few individuals with significantly higher values than the majority.
The highest frequency (tallest bars) is observed around 0, which suggests that a substantial portion of the data points falls within the central range. The values range approximately from -2 to 2, and the frequency decreases as you move further away from the 0 mark in both the positive and negative directions. The negative values arise because the chart plots draws from np.random.randn rather than the dataset's Annual_Income column.
Overall, the chart gives a clear picture of how the values are distributed, highlighting the most common levels and the spread across the sample.
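
A visual impression of skew can be checked numerically. The sketch below compares the sample skewness of standard-normal draws (the kind of values plotted here) with a clearly right-skewed sample. The seed, the sample size, and the lognormal comparison are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed is an assumption, used only for repeatability

normal_sample = pd.Series(rng.standard_normal(10_000))
lognormal_sample = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

# Skewness near 0 suggests symmetry; clearly positive values indicate a longer right tail.
normal_skew = normal_sample.skew()
lognormal_skew = lognormal_sample.skew()

print(f"normal skew: {normal_skew:.2f}")
print(f"lognormal skew: {lognormal_skew:.2f}")
```

With only 100 draws, as in the chart code, the sample skewness of normal data can drift noticeably from 0, so an apparent skew in a single small histogram is weak evidence on its own.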

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from the annual income histogram can help create a positive business impact.

If the business relies heavily on disposable income and a large segment of the potential customer base has very low or negative income, this could indicate a lack of purchasing power for its products/services.

#### Chart - 2

In [None]:
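# Chart - 2 visualization code
# Note: this rebinds `data` to synthetic exponential samples; the CSV is reloaded in the Chart - 3 cell.
data = np.random.exponential(scale=2.0,size=1000)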

In [None]:
plt.figure(figsize=(3,3))
plt.hist(data)
plt.show()

##### 1. Why did you pick the specific chart?
The code generates values using np.random.exponential(scale=2.0,size=1000) and then plots a histogram using plt.hist(data).

A histogram is an appropriate choice for visualizing the distribution of a single numerical variable, especially when that variable is continuous.

In this case:

 1. Nature of data: The values from np.random.exponential(scale=2.0,size=1000) are drawn from an exponential distribution, which is a continuous probability distribution.

 2. Purpose: Histograms are designed to show the shape of the distribution, including its central tendency, spread, and the presence of any skewness or outliers. For an exponential distribution, a histogram clearly illustrates its characteristic right-skewed shape.


##### 2. What is/are the insight(s) found from the chart?

The histogram shows that:

Most values are concentrated near 0, and the frequency drops rapidly as the value increases.

This is a typical right-skewed (positively skewed) distribution.

Very few values are larger than 5 or 10, and some outliers go above 20.
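
How rare large values should be follows from the exponential distribution itself, where P(X > t) = exp(-t / scale). The sketch below evaluates this for the thresholds mentioned above, using the same scale and sample size as the chart code; the helper name `tail_probability` is hypothetical.

```python
import math

SCALE = 2.0       # same scale as np.random.exponential(scale=2.0, ...) in the chart code
N_SAMPLES = 1000  # same sample size as the chart code


def tail_probability(threshold: float, scale: float = SCALE) -> float:
    """P(X > threshold) for an exponential distribution with the given scale (mean)."""
    return math.exp(-threshold / scale)


for threshold in (5, 10, 20):
    p = tail_probability(threshold)
    print(f"P(X > {threshold}) = {p:.5f}, expected count in {N_SAMPLES} draws = {p * N_SAMPLES:.2f}")
```

Under these assumptions about 8% of draws exceed 5 and fewer than 1% exceed 10, so how far the tail reaches in any one chart depends on the particular random draw.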

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The histogram shows a right-skewed (positively skewed) distribution of income (or a proxy for income like disposable income), generated from an exponential distribution. Most of the data points are concentrated at the lower end of the scale (close to zero), with fewer people having higher income.

This kind of distribution might reveal important customer segments, such as:

A large portion of your customer base has low disposable income.

Products or services targeted at budget-conscious customers could perform well.

The company could adapt pricing strategies, offer value packs, or create basic-tier services to capture more of this market.

**Positive Business Insight:**

Understanding income distribution can guide targeted product design and pricing, which leads to better market fit and higher customer adoption.

**Insights That Might Lead to Negative Growth (if ignored):**

However, the same insight can also be a red flag, depending on your business model:

If your business relies on premium pricing or targets high-income consumers, then:

A customer base with predominantly low income could limit purchasing power.

You might see lower sales conversion, increased churn, or poor ROI on marketing.

There could be demand stagnation for luxury or high-end offerings in such a demographic.

**Negative Insight (Justified):**

If a large segment of your potential customers has very low or near-zero disposable income, it could indicate limited ability to afford your products/services, leading to negative growth unless business strategies are adjusted.



#### Chart - 3

In [None]:
# Chart - 3 visualization code


In [None]:
# `data` was rebound to synthetic samples in the Chart - 2 cell,
# so reload the original DataFrame from the CSV file.
data = pd.read_csv('/content/drive/MyDrive/PROJECT2/dataset (2).csv')

# include_lowest=True keeps customers aged exactly 18 in the '18-25' bin; ages outside 18-65 become NaN.
data['Age_Group'] = pd.cut(data['Age'], bins=[18, 25, 35, 45, 55, 65],
                         labels=['18-25', '26-35', '36-45', '46-55', '56-65'], include_lowest=True)
plt.figure(figsize=(7, 5))
sns.countplot(data=data, x='Age_Group', palette='pastel')
plt.title('Number of Customers by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Number of Customers')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot of age groups is easy to interpret and useful for demographic analysis, helping you see which age groups are most represented in the dataset.

##### 2. What is/are the insight(s) found from the chart?

It shows which age ranges your customer base falls into. If most customers are in the 26–35 range, they might be young professionals or first-time loan seekers, influencing marketing or risk strategy.
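
How `pd.cut` treats the bin edges decides which customers are counted in each group. The sketch below uses the same bins and labels as the chart code on a few hypothetical ages chosen to sit on or beyond the edges.

```python
import pandas as pd

bins = [18, 25, 35, 45, 55, 65]
labels = ["18-25", "26-35", "36-45", "46-55", "56-65"]

ages = pd.Series([18, 25, 26, 65, 70])  # hypothetical ages chosen to hit the bin edges

# Default: intervals are (18, 25], (25, 35], ... so an age of exactly 18 falls outside every bin.
default_groups = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 18 lands in "18-25".
inclusive_groups = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(default_groups.tolist())
print(inclusive_groups.tolist())
```

Ages above 65 (or below 18) are left as NaN in both cases, and a count plot silently omits them, so the bars may not add up to the full number of rows.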

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Based on the bar chart, negative growth is observed in the following transitions between age groups:

From the 26-35 age group to the 36-45 age group: There is a slight decrease in the number of customers, from approximately 29,000 to 28,000.

From the 36-45 age group to the 56-65 age group: There is a significant decrease from the approximately 28,000 customers in the 36-45 group.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Synthetic stand-in: 100 standard-normal draws, not the dataset's loan-count column.
Num_of_Loan = np.random.randn(100)

In [None]:
plt.figure(figsize=(10,3))
plt.boxplot(Num_of_Loan)
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot was chosen because the data is numerical, and a boxplot effectively shows the distribution, central tendency (median), spread, and presence of outliers in the dataset. This makes it ideal for identifying anomalies and understanding variability in the number of loans.

##### 2. What is/are the insight(s) found from the chart?

The boxplot of Num_of_Loan reveals the overall distribution of the loan-count data, including the median number of loans, the interquartile range (IQR), and the presence of outliers (if any). These insights help show whether the data is skewed, has extreme values, or is evenly spread.
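
The quantities a boxplot draws can also be computed directly. The sketch below derives the median, the IQR, and the 1.5 × IQR fences that Matplotlib's default whiskers are based on, using synthetic standard-normal values similar to the chart's; the seed is an assumption.

```python
import numpy as np

rng = np.random.default_rng(42)    # seed is an assumption, used only for repeatability
values = rng.standard_normal(100)  # synthetic stand-in, like the chart's np.random.randn(100)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box edges are the ones a default boxplot draws as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"median={median:.2f}, IQR={iqr:.2f}, outliers={len(outliers)}")
```

Running the same calculation on the real loan-count column would give concrete numbers to quote in place of a visual reading of the chart.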

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can create a positive business impact. For example, identifying outliers or unusual patterns in the loan distribution could help flag risky customers or fraudulent activity, enabling better risk management. Understanding the spread of loan counts helps optimize loan offerings and adjust credit policies accordingly.

Negative growth could occur if outliers are ignored or if the variability in loan numbers is not addressed, potentially leading to defaults or unbalanced loan portfolios. Hence, using this analysis to fine-tune business decisions is critical.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
Age=[21,23,28,34,54,55]
Changed_Credit_Line=[2.58,11.27,5.42,7.1,1.99,1.99]
plt.plot(Age,Changed_Credit_Line,color='red',linestyle=':')
plt.title('Age wise Changed_Credit_Line', color='green')
plt.xlabel('Age',color='yellow')
plt.ylabel('Changed_Credit_Line',color='orange')
plt.show()

##### 1. Why did you pick the specific chart?

A line chart was chosen because it effectively shows how the Changed_Credit_Line varies with age, helping to visualize trends or patterns over a continuous variable (age). Line plots are ideal for showing relationships or progression over time or age.



##### 2. What is/are the insight(s) found from the chart?


From the chart, it appears that the Changed_Credit_Line does not increase steadily with age. There are fluctuations — some younger individuals have higher changes in credit lines than older individuals, which may suggest non-linear behavior or influence from other factors such as income, spending habits, or credit history.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the insights can help create a positive business impact by showing that age alone should not be the sole factor when adjusting credit lines. Understanding that credit behavior is not strictly age-dependent allows businesses to implement more personalized and data-driven credit policies.

Negative growth could occur if credit lines are adjusted purely based on age, ignoring individual behavior or financial reliability. This may result in credit misuse or missed opportunities to empower reliable younger customers.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
Age=[21,23,28,34,54,55]
Changed_Credit_Line=[2.58,11.27,5.42,7.1,1.99,1.99]
plt.figure(figsize=(5,2))
plt.scatter(Age,Changed_Credit_Line,color='green',marker=('*'))
plt.title('Age vs Changed_Credit_Line')
plt.xlabel('Age')
plt.ylabel('Changed_Credit_Line')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot was chosen because it helps visualize the relationship between two numerical variables — Age and Changed_Credit_Line. Unlike a line plot, it does not imply continuity and is better for spotting correlation patterns, clusters, or outliers in the data.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot shows that there is no strong linear relationship between age and changed credit line. The values are scattered irregularly, indicating that changes in credit lines are not directly dependent on age alone. Some younger users have high changes, and some older users have very low changes.
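
The strength of the linear relationship can be quantified with the Pearson correlation coefficient. The sketch below computes it for the six hand-entered points from the chart code.

```python
import numpy as np

# The six hand-entered points from the chart code.
age = np.array([21, 23, 28, 34, 54, 55])
changed_credit_line = np.array([2.58, 11.27, 5.42, 7.1, 1.99, 1.99])

# Pearson correlation coefficient: +1 or -1 is a perfect linear relationship, 0 is none.
r = np.corrcoef(age, changed_credit_line)[0, 1]
print(f"Pearson r = {r:.2f}")
```

For these six points the coefficient comes out at roughly -0.6, a moderate negative association. With so few points this is weak evidence either way, and the full dataset columns would be needed for a reliable estimate.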


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help create a positive business impact. The absence of a clear trend suggests that credit policy should not be age-based alone. Instead, a more personalized approach using multiple variables (like income, repayment history, or credit score) would lead to better customer segmentation and risk control.

Negative growth may occur if a company makes assumptions based solely on age. For example, denying credit line increases to younger users assuming they are risky could lose high-potential customers, while overestimating older customers could increase credit risk.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
Name=['Aaron Maashoh','Rick Rothackerj', 'Langep','Jasond']
Annual_Income=[19114.12,34847.84,143162.64,30689.89]
plt.pie(Annual_Income,labels=Name)
plt.show()

##### 1. Why did you pick the specific chart?


A pie chart was chosen because it effectively shows the proportional distribution of Annual_Income among individuals. It provides a clear visual comparison of each person’s contribution to the total income, making it easy to spot who earns the most or least relative to others.

##### 2. What is/are the insight(s) found from the chart?

The pie chart reveals that one individual contributes a far larger share of the total income than the others. Langep's income of 143,162.64 is roughly 63% of the group total of 227,814.49, while each of the other three contributes between about 8% and 15%. This indicates income disparity among the group.
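
The share behind each pie slice can be computed from the same four values used in the chart code, as in this short sketch.

```python
names = ["Aaron Maashoh", "Rick Rothackerj", "Langep", "Jasond"]
annual_income = [19114.12, 34847.84, 143162.64, 30689.89]

total = sum(annual_income)

# Each person's income as a fraction of the group total (this is what the pie slices encode).
shares = {name: income / total for name, income in zip(names, annual_income)}

for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {share:.1%}")
```

Passing `autopct='%1.1f%%'` to `plt.pie` would print the same percentages directly on the chart.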


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, these insights can help in targeted financial planning and segmentation. For example, high earners may be suitable for premium services or investment products, while others may need supportive credit plans or advisory services.

However, if this income gap is ignored, it could lead to negative business growth — such as ineffective marketing or offering unsuitable financial products to the wrong customer segment. Understanding income levels helps deliver personalized and relevant business strategies.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
Name=['Aaron Maashoh','Rick Rothackerj', 'Langep','Jasond']
Outstanding_Debt=[809.98,605.03,1303.01,632.46]
plt.figure(figsize=(5,3))
plt.bar(Name,Outstanding_Debt,color='blue')
plt.show()

##### 1. Why did you pick the specific chart?


A bar chart was chosen because it is ideal for comparing individual values across categories — in this case, the Outstanding_Debt for each person. It provides a clear visual representation of who owes the most and who owes the least, making it easier to evaluate debt distribution.

##### 2. What is/are the insight(s) found from the chart?

From the chart, we can observe that 'Langep' has the highest outstanding debt (1,303.01), followed by 'Aaron Maashoh' (809.98) and 'Jasond' (632.46), while 'Rick Rothackerj' has the lowest (605.03). This highlights a significant variation in debt levels among the individuals, which could be useful for risk assessment or financial planning.
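
Reading a ranking off bar heights is easy to get wrong when several bars are close, so it can help to sort the values directly. The sketch below ranks the four values from the chart code.

```python
names = ["Aaron Maashoh", "Rick Rothackerj", "Langep", "Jasond"]
outstanding_debt = [809.98, 605.03, 1303.01, 632.46]

# Sort (name, debt) pairs from highest to lowest outstanding debt.
ranked = sorted(zip(names, outstanding_debt), key=lambda pair: pair[1], reverse=True)

for position, (name, debt) in enumerate(ranked, start=1):
    print(f"{position}. {name}: {debt:.2f}")
```

Plotting the bars in this sorted order would also make the ranking visible at a glance.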

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, these insights support positive business impact. By identifying customers with high outstanding debt, financial institutions can take proactive steps — such as offering repayment plans, sending reminders, or adjusting credit limits — to reduce risk and improve cash flow.

However, ignoring such insights may lead to negative growth, especially if high-debt individuals default. Without intervention, this could lead to bad loans or financial losses, impacting the company’s overall stability.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Outstanding_Debt here is the four-value list defined in the Chart - 8 cell.
plt.figure(figsize=(3,3))
plt.hist(Outstanding_Debt)
plt.show()

##### 1. Why did you pick the specific chart?


A histogram was chosen because it is ideal for understanding the distribution of a continuous variable — in this case, Outstanding_Debt. It helps to visualize how debt values are grouped into ranges, which is important for identifying patterns like skewness, central tendency, and spread.

##### 2. What is/are the insight(s) found from the chart?

Because the histogram is drawn from the four Outstanding_Debt values defined for Chart 8, it shows three values clustered between roughly 600 and 810 and a single much higher value near 1,300. This suggests an uneven distribution of debt levels with one possible outlier, although four values are too few to draw firm conclusions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, understanding the distribution of outstanding debt can help in risk segmentation. For example, identifying the range where most customers' debt lies allows businesses to design standardized repayment schemes, while also creating special monitoring for high-debt outliers.

Ignoring such insights may lead to negative growth — especially if extreme values are not addressed. Outliers could represent high-risk customers, and without proactive handling, this may result in defaults or bad debts.

#### Chart - 10

In [None]:
print(type(data))

In [None]:
# Chart - 10 visualization code
df=pd.DataFrame(data)
df

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(20,3))
# Plot the first 10 rows of the dataset; sns.barplot averages rows that share the same Name.
sns.barplot(data=data.head(10),x='Name',y='Outstanding_Debt',color='orange')

plt.xlabel('Name')
plt.ylabel('Outstanding_Debt')
plt.title('Name wise Outstanding_Debt')
plt.show()

##### 1. Why did you pick the specific chart?


A barplot was chosen because it clearly displays the Outstanding Debt associated with each individual (Name). Bar plots are ideal for comparing categorical variables against a numerical metric, which makes it easy to assess and compare debt levels across individuals.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that 'Rick Rothackerj' has the highest outstanding debt among the listed individuals, followed by 'Langep' and 'Aaron Maashoh'. The least debt is observed for 'Jasond'. This insight helps identify which individuals might pose higher credit risk due to their larger debt levels.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, these insights can guide risk assessment and debt collection strategies. High-debt individuals can be prioritized for follow-ups, restructuring plans, or credit limits, while low-debt individuals may be eligible for further engagement or credit extension.

Ignoring these patterns could result in negative growth, especially if high-debt individuals are not monitored, leading to defaults and financial loss for the business.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
data=pd.read_csv('/content/drive/MyDrive/PROJECT2/dataset (2).csv')
plt.figure(figsize=(8,3))
sns.histplot(data=data,x='Annual_Income',kde=True,bins=10)
plt.show()

##### 1. Why did you pick the specific chart?

A histogram with KDE (Kernel Density Estimation) was chosen to analyze the distribution of Annual Income. This chart is useful to understand the frequency and pattern of income values across the dataset, identifying trends, skewness, and possible outliers in income levels.

##### 2. What is/are the insight(s) found from the chart?


From the histogram, we can observe:

Most individuals fall within a specific income range (likely clustered around the peak of the KDE curve).

There may be a right-skewed distribution, suggesting that while most people earn moderate income, a smaller number earn significantly higher income.

This highlights income inequality or the presence of high-income outliers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Yes, the insights are valuable for:

Targeted marketing – different income groups can receive personalized product offers.

Creditworthiness analysis – helps determine loan eligibility or suitable credit products.

Risk assessment – higher-income individuals might be lower risk, whereas low-income individuals might need tighter credit control.

If these patterns are ignored, it could lead to negative growth, especially if high-risk, low-income customers are overexposed to credit, increasing the chance of defaults and losses.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
sns.displot(x=data['Monthly_Inhand_Salary'],kde=True)
plt.show()

##### 1. Why did you pick the specific chart?

The distribution plot (`displot`) with KDE was selected to analyze the spread and shape of Monthly Inhand Salary. This type of chart shows how the salary values are distributed and whether they follow a normal distribution, are skewed, or contain outliers.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals:

 * A concentration of individuals within a specific salary range, indicating that most people earn similar in-hand monthly salaries.
 * Any right or left skew, which shows whether a few individuals have significantly higher or lower salaries.
 * The overall trend and density through the smooth KDE curve, which makes the most common salary bracket easier to identify.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, these insights can aid in:

 * Tailoring financial products (like EMIs or loans) to salary brackets.
 * Better risk profiling, by understanding which users may struggle with repayments because of lower in-hand salaries.
 * Salary-based segmentation for personalized marketing.

If ignored, this could lead to negative growth by misaligning financial offerings with actual repayment capacity. For example, offering high-limit credit cards to individuals with low in-hand salaries could increase the risk of default and reduce profitability.
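
Salary brackets like those mentioned above can be derived directly from the data. The sketch below uses quantile-based binning with `pd.qcut`; the ten sample salaries and the three bracket labels are illustrative assumptions, not values from the project's dataset.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be data['Monthly_Inhand_Salary'].
salary = pd.Series(
    [1200, 1850, 2400, 3100, 3300, 4100, 5200, 6800, 9500, 14000],
    name="Monthly_Inhand_Salary",
)

# Quantile-based binning puts roughly equal numbers of customers in each bracket.
# The three labels are illustrative, not taken from the source.
brackets = pd.qcut(salary, q=3, labels=["Low", "Medium", "High"])

# Summarize each bracket: how many customers it holds and its salary range.
summary = salary.groupby(brackets, observed=True).agg(["count", "min", "max"])
print(summary)
```

Fixed cut points (via `pd.cut`) would be the alternative when the business already defines its own salary bands.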

#### Chart - 13

In [None]:
# Chart - 13 visualization code
df=data.groupby('Name')['Annual_Income'].sum().reset_index().sort_values('Annual_Income',ascending=False)
df



In [None]:
plt.figure(figsize=(8,3))
sns.barplot(data=df,x='Name',y='Annual_Income',color='orange')
plt.title('Name wise Annual_Income',color='blue')
plt.show()

##### 1. Why did you pick the specific chart?

A bar plot was selected to compare the total Annual Income per individual. Grouping by name and summing the income gives a clear comparison of overall earnings among the individuals. This chart type makes it easy to interpret who earns more in absolute terms.
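
One caveat with grouping by name and summing: if a customer appears in several rows (for example, one row per month), the sum grows with the number of rows, and unrelated customers who share a name are merged. The sketch below illustrates this on made-up records and shows one possible alternative. The `Customer_ID` column is an assumption about the dataset, and the values are invented.

```python
import pandas as pd

# Hypothetical records: each customer appears once per month with the same
# Annual_Income, and two different customers share the name "Amy".
records = pd.DataFrame({
    "Customer_ID": ["C1", "C1", "C1", "C2", "C2", "C3"],
    "Name": ["Amy", "Amy", "Amy", "Raj", "Raj", "Amy"],
    "Annual_Income": [50_000, 50_000, 50_000, 90_000, 90_000, 30_000],
})

# Summing by name, as in the Chart 13 cell, scales with the number of rows.
summed = records.groupby("Name")["Annual_Income"].sum()

# One value per customer: take the mean within each Customer_ID, then rank.
per_customer = (
    records.groupby(["Customer_ID", "Name"], as_index=False)["Annual_Income"]
    .mean()
    .sort_values("Annual_Income", ascending=False)
)
top_n = per_customer.head(2)  # keep only the top N rows so a bar plot stays readable

print(summed)
print(top_n)
```

Here the summed view makes "Amy" and "Raj" look identical, while the per-customer view ranks Raj first. Limiting the bar plot to the top N customers also keeps the x-axis labels legible.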

##### 2. What is/are the insight(s) found from the chart?

 * The chart ranks individuals by their total annual income.
 * It clearly identifies the top earners and those with relatively lower income.
 * There is a visible income gap between individuals, which may hint at inequality or at different levels of financial contribution.

##### 3. Will the gained insights help create a positive business impact?
Are there any insights that lead to negative growth? Justify with a specific reason.

Yes, these insights can positively impact the business by:

 * Allowing targeted financial products for high-income individuals (e.g., premium banking, investment services).
 * Helping identify key customers or revenue drivers.
 * Enabling personalized engagement and loyalty strategies for top earners.

However, if lower earners are ignored, it could lead to negative growth through:

 * Missed potential from emerging or growing customers.
 * Poor customer experience due to perceived income-based bias.
 * Risk of churn if the business focuses only on high earners without addressing the broader base.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
num_data=data.select_dtypes(include=['int64','float64'])
num_data

In [None]:
num_data.corr()

In [None]:
# Correlations are computed over all rows of the numeric columns
plt.figure(figsize=(12,10))
sns.heatmap(num_data.corr(),annot=True)
plt.show()

##### 1. Why did you pick the specific chart?

The correlation heatmap was chosen because it provides a clear and intuitive visual summary of the relationships between numerical variables in the dataset. By using color gradients and annotations, the heatmap helps to:

 * Identify strong positive or negative correlations.
 * Detect multicollinearity, which is important for model building.
 * Understand the overall structure of relationships in the data quickly.

This chart is especially useful in the exploratory data analysis (EDA) phase.

##### 2. What is/are the insight(s) found from the chart?

The correlation heatmap shows how different numerical features in the dataset are related to each other. From the chart:

 * Delay from due date is strongly related to number of delayed payments: a person who delays payments more often also tends to delay by more days.
 * Credit utilization ratio is negatively related to monthly balance: people using more of their credit tend to have lower monthly balances.
 * Number of credit inquiries is negatively related to credit history age: newer customers tend to apply for credit more often.

These insights help identify important relationships that can affect credit behavior or risk.
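
Reading the strongest relationships off a large annotated heatmap is error-prone, so it can help to rank the pairs numerically. This is a minimal sketch on synthetic data: the column names mirror those in the notebook, but the values (and therefore the correlations printed) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for num_data (assumption: values are generated, not real).
rng = np.random.default_rng(42)
n = 500
delays = rng.poisson(10, n)
num_data = pd.DataFrame({
    "Num_of_Delayed_Payment": delays,
    "Delay_from_due_date": delays * 2 + rng.normal(0, 3, n),
    "Credit_Utilization_Ratio": rng.uniform(20, 50, n),
})

corr = num_data.corr()

# Keep the upper triangle only so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank by absolute strength so strong negative correlations are not missed.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Applied to the notebook's own `num_data`, the top of `ranked` would list the feature pairs worth checking for multicollinearity.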

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(data)

##### 1. Why did you pick the specific chart?

A pair plot is a useful visualization tool in exploratory data analysis (EDA), especially for datasets with multiple numerical features. It visualizes pairwise relationships between the variables in a dataset and is commonly drawn with Seaborn (`sns.pairplot`) in Python.

**Visualizing pairwise relationships**

 * Shows scatter plots between all pairs of numerical features.
 * Helps identify correlations, linear or non-linear relationships, and clusters.
 * Each subplot is a 2D projection of the dataset onto two features.

**Spotting correlation and multicollinearity**

 * You can visually assess whether two variables are strongly correlated (linearly or otherwise).
 * Useful for detecting multicollinearity, which is important before applying linear models.

**Detecting outliers**

 * Outliers show up as points that lie far from the majority of the data in the scatter plots.
 * They are easier to spot when all variable combinations are visible at once.

##### 2. What is/are the insight(s) found from the chart?

The pair plot supports the following kinds of insight:

**Distribution insights**

 * The diagonal of the pair plot shows histograms or kernel density estimates (KDE) of the individual features.
 * This helps in understanding the distribution (normal, skewed, etc.) of each feature.

**Class separation (when using hue)**

 * If the dataset has a categorical target or grouping feature, the `hue` parameter can color the points by class.
 * This shows how well the classes are separated in feature space.
 * It can suggest whether features are good candidates for classification.

**Detecting redundant features**

 * If two features are highly correlated and show similar patterns with other variables, one of them might be redundant.
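
As a rough sketch of the hue-based class separation described above, the snippet below narrows the data to a few columns and a row sample before plotting, since a pair plot over every column of a large frame is slow and hard to read. The data is synthetic, and the `prepare_for_pairplot` and `plot_pairs` helpers, the chosen columns, and the sample size are illustrative assumptions rather than part of the project.

```python
import numpy as np
import pandas as pd

def prepare_for_pairplot(frame, columns, hue, n_rows=200, seed=0):
    """Keep only the chosen numeric columns plus the hue column, then sample rows."""
    subset = frame[columns + [hue]].dropna()
    return subset.sample(n=min(n_rows, len(subset)), random_state=seed)

def plot_pairs(frame, hue):
    """Draw the pair plot; seaborn is imported here so the preparation step runs without it."""
    import seaborn as sns
    return sns.pairplot(frame, hue=hue, diag_kind="hist")

# Synthetic stand-in for the project's data (values are made up).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Annual_Income": rng.normal(50_000, 12_000, 1_000),
    "Monthly_Balance": rng.normal(400, 120, 1_000),
    "Credit_Utilization_Ratio": rng.uniform(20, 50, 1_000),
    "Credit_Score": rng.choice(["Good", "Standard", "Poor"], 1_000),
})

subset = prepare_for_pairplot(
    demo, ["Annual_Income", "Monthly_Balance", "Credit_Utilization_Ratio"], hue="Credit_Score"
)
print(subset.shape)
```

Calling `plot_pairs(subset, hue="Credit_Score")` would then draw the grid with points colored by class.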

#### Chart - 16

In [None]:
df=data

In [None]:
plt.figure(figsize=(8, 5))
sns.histplot(data=df['Credit_Score'], bins=30, kde=True, color='skyblue')
plt.xlabel('Credit_Score')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

This chart shows the overall distribution of the Credit Score values. A histogram with a KDE (Kernel Density Estimate) line gives both the frequencies and a smooth curve, helping identify whether the data is skewed, normal, or multimodal.

##### 2. What is/are the insight(s) found from the chart?

We can observe how credit scores are spread out, that is, whether most customers have "Good", "Standard", or "Poor" scores. If the curve peaks in a low range (e.g., 300–600), poor credit may be common, while a peak around 700+ suggests stronger financial health.

#### Chart - 17

In [None]:
plt.figure(figsize=(10, 10))
sns.boxplot(data=data, x='Occupation', y='Annual_Income')
plt.title('Annual Income by Occupation')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

A boxplot is well suited to comparing the distribution of numerical data (income) across categories (occupations). It shows the median, quartiles, and outliers in one clean view.

##### 2. What is/are the insight(s) found from the chart?

You can identify which professions earn the most, how spread out their incomes are, and which occupations have more outliers. For example, if "Entrepreneurs" have a high median and a large spread, their incomes vary but are generally higher.

#### Chart - 18

In [None]:
plt.figure(figsize=(6, 4))
sns.countplot(data=data, x='Credit_Mix', palette='Set2')
plt.title('Count of Credit Mix Types')
plt.show()

##### 1. Why did you pick the specific chart?

A count plot is ideal for categorical features like Credit Mix. It simply shows how many records fall under each type: "Good", "Standard", or "Bad".

##### 2. What is/are the insight(s) found from the chart?

The plot reveals the most common credit mix among customers. If "Good" dominates, most customers maintain a healthy credit profile. If "Bad" is common, that might signal riskier lending behavior.

#### Chart - 19

In [None]:
loan_counts = data['Type_of_Loan'].value_counts().head(5)  # Top 5 for clarity
plt.figure(figsize=(7, 7))
plt.pie(loan_counts, labels=loan_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Top 5 Loan Types Distribution')
plt.axis('equal')
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is best used to show the proportion of categories in a whole, in this case the types of loans customers have.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the most common types of loans (like "Auto Loan" or "Credit-Builder Loan"). If one type takes up a large share, it indicates a trend or product preference among customers.
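
If a single `Type_of_Loan` cell can list several loans separated by commas, `value_counts()` on the raw column counts combinations rather than individual loan types. Whether that applies here is an assumption about the dataset. Under that assumption, a sketch for counting each loan type separately could look like this (the sample values are invented):

```python
import pandas as pd

# Hypothetical values, assuming several loan types can share one cell.
loans = pd.Series([
    "Auto Loan, Personal Loan",
    "Credit-Builder Loan",
    "Auto Loan, and Credit-Builder Loan",
    "Auto Loan",
], name="Type_of_Loan")

# Split on commas, give each loan its own row, trim whitespace,
# and strip an optional leading "and".
individual = (
    loans.str.split(",")
    .explode()
    .str.strip()
    .str.replace(r"^and\s+", "", regex=True)
)
loan_counts = individual.value_counts()
print(loan_counts)
```

The resulting `loan_counts` can be passed to `plt.pie` in the same way as in the Chart 19 cell.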

#### Chart - 20

In [None]:
plt.figure(figsize=(10, 10))
sns.scatterplot(data=data, x='Credit_Utilization_Ratio', y='Credit_Score', hue='Credit_Mix')
plt.title('Credit Utilization vs Credit Score')
plt.xlabel('Credit Utilization Ratio')
plt.ylabel('Credit Score')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

Scatter plots are well suited to visualizing relationships between two variables, here how much credit is used versus the customer's credit score. The hue adds a third dimension with Credit Mix.

##### 2. What is/are the insight(s) found from the chart?

You may find that customers with lower credit utilization tend to have higher credit scores, indicating healthy credit behavior. Dense clusters can reveal patterns or risk factors.

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

| Business Objective | Suggested Solution |
|---|---|
| Improve accuracy in credit scoring | Build a machine learning model (e.g., Random Forest, XGBoost) to predict credit scores based on financial behavior and demographics. |
| Reduce loan default rate | Create a risk classification model to flag high-risk loan applicants before approval. Use features like Delay_from_due_date, Num_of_Delayed_Payment, Credit_Utilization_Ratio, etc. |
| Offer personalized financial products | Apply clustering (K-means or hierarchical clustering) to segment customers and offer relevant loans, credit cards, or investment options. |
| Enhance loan recovery strategies | Develop a dashboard that highlights customers with increasing delays or high credit utilization, helping recovery teams prioritize actions. |
| Detect fraud or anomalies | Use anomaly detection techniques to flag suspicious credit behavior (e.g., sudden spikes in inquiries or utilization). |
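
Before any model is trained, the risk-flagging idea in the table above can be prototyped with simple rules. The sketch below is only an illustration: the four applicants are invented, and the thresholds (30 days, 15 delayed payments, 35% utilization, two or more rules tripped) are assumptions that would need to be calibrated against real default outcomes.

```python
import pandas as pd

# Hypothetical applicants; column names follow those cited in the table above.
applicants = pd.DataFrame({
    "Delay_from_due_date": [2, 35, 10, 50],
    "Num_of_Delayed_Payment": [1, 18, 6, 22],
    "Credit_Utilization_Ratio": [24.0, 41.5, 38.0, 29.0],
})

# Illustrative thresholds (assumptions, not derived from the data).
rules = pd.DataFrame({
    "long_delay": applicants["Delay_from_due_date"] > 30,
    "frequent_delays": applicants["Num_of_Delayed_Payment"] > 15,
    "high_utilization": applicants["Credit_Utilization_Ratio"] > 35,
})

# Count how many rules each applicant trips; two or more marks them for review.
applicants["risk_flags"] = rules.sum(axis=1)
applicants["high_risk"] = applicants["risk_flags"] >= 2
print(applicants)
```

A rule-based baseline like this is easy to explain to recovery teams and gives a reference point against which a trained classifier can later be compared.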

# **Conclusion**

The project demonstrates how data analytics and machine learning can be applied to the financial domain to predict customer credit scores and manage loan risks. Through comprehensive data exploration and modeling, we identified key factors that influence creditworthiness, such as:

 * Credit utilization ratio
 * Payment behavior
 * Delay in repayments
 * Number of loans and inquiries

Using these insights, we were able to:

 * Build a credit score prediction model with reliable accuracy.
 * Identify high-risk customers likely to default.
 * Suggest strategies for personalized marketing and risk reduction.

By leveraging this solution, financial institutions can improve credit evaluation processes, offer better products to suitable customers, and significantly reduce the risk of non-performing loans. This helps enhance both customer experience and business outcomes.
