<a href="https://colab.research.google.com/github/Nautiyalmukesh2001/Paisabazaar-EDA/blob/main/Paisabazaar_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name : Paisabazaar Banking Fraud Analysis**







**Project Type** - EDA

**Contribution** - Individual

**Team Member 1** - Mukesh Nautiyal

# **Project Summary :**


### **Introduction:**
Paisabazaar, a prominent financial services platform, is tasked with accurately assessing customer creditworthiness and identifying fraudulent activities. Misclassification of credit scores leads to increased loan defaults, suboptimal financial product recommendations, and revenue losses. This exploratory data analysis (EDA) project aims to enhance credit risk assessment, fraud detection, and personalized financial services by leveraging customer financial and behavioral data.

### **Business Problem:**
The challenge is to extract actionable insights from a dataset of 100,000 records with 28 features, representing diverse financial and behavioral attributes like income, loans, delayed payments, and credit utilization. Key objectives include:
- Minimizing loan default risks.
- Detecting and preventing fraudulent activities.
- Improving personalized financial recommendations.


### **Methodology:**
The project employed a systematic approach to analyze and address business challenges:

1. **Data Preparation:**
   - Cleaned the dataset to ensure no missing or duplicate values and standardized variable names.
   - Validated critical features like SSN, age groups, and credit mixes.

2. **Feature Engineering:**
   - Created derived variables such as "SSN Validity," "Salary Discrepancy," "High-Risk Loan Flag," and "Expected Monthly Salary."
   - Analyzed financial behaviors, credit utilization, and demographic patterns.

3. **Exploratory Data Analysis:**
   - Used visualizations, statistical analysis, and correlations to uncover key insights.
   - Developed charts to analyze relationships between delayed payments, outstanding debt, and credit scores.

4. **Insights Extraction:**
   - Investigated patterns across demographic groups, payment behaviors, and credit profiles.
   - Highlighted potential red flags for fraud and risky financial behaviors.

5. **Actionable Recommendations:**
   - Proposed targeted strategies for fraud detection, risk mitigation, and product personalization.


### **Assumptions:**
The analysis was based on the following assumptions:
1. **Data Quality:** The dataset is accurate, representative, and free from biases or inaccuracies.
2. **SSN Validity:** Invalid SSNs indicate fraud risks or administrative errors, correlated with poor credit scores.
3. **Salary Discrepancies:** Higher discrepancies signify potential financial mismanagement or reporting errors.
4. **Credit Mix:** Good credit mix reflects financial stability, while bad credit mix is linked to higher defaults.
5. **Demographics:** Age groups and occupations influence credit risk and financial behaviors.
6. **Loan Behavior:** Customers with many loans or credit cards exhibit higher financial risks.



### **Key Dataset Details:**
- **Size:** 100,000 rows, 28 features.
- **Content:** Includes variables like income, loan counts, credit utilization, delayed payments, and payment behavior.
- **Quality:** Clean, with no missing or duplicate values and no significant outliers.



### **Key Insights:**
1. **Fraud Indicators:**
   - 21.6% of SSNs flagged as invalid, suggesting potential fraud or data issues.
   - High-risk behaviors found among customers with excessive loans or credit cards.

2. **Credit Score Patterns:**
   - Standard credit scores dominate (53%), followed by Poor (29%) and Good (18%).
   - Poor scores correlate with delayed payments, high debt, and bad credit mixes.

3. **Demographics:**
   - Young adults (25–33) and middle-aged adults (34–42) show stronger credit profiles.
   - Younger customers (14–24) are more prone to risky financial behaviors.

4. **Financial Behaviors:**
   - Delayed payments and outstanding debt are strongly correlated with poor credit scores.
   - Customers with 6+ credit cards or high salary discrepancies exhibit higher financial risks.



### **Visualization Highlights:**
1. **Credit Score Analysis:**
   - Pie charts and histograms show the distribution of credit scores and age demographics.
   - Delayed payments and credit utilization ratios reveal clear risk thresholds.

2. **Risk Identification:**
   - Heatmaps highlight strong correlations between outstanding debt, delayed payments, and credit utilization ratios.
   - Dual-axis charts uncover trends in loan behaviors, financial strain, and repayment delays.

3. **Demographic Trends:**
   - Young people with invalid loan profiles or bad credit mixes are flagged for fraud risks.
   - Older adults (43–56) show higher debt but stable repayment behaviors.



### **Actionable Recommendations:**
1. **Fraud Detection:**
   - Strengthen SSN validation processes and flag high-risk behaviors (e.g., excessive loans).
   - Monitor credit card ownership to identify over-leveraged customers.

2. **Risk Mitigation:**
   - Incorporate delayed payments and salary discrepancies into risk models.
   - Target high-debt customers with debt consolidation programs.

3. **Personalized Products:**
   - Offer premium loans for high-income, good-credit customers.
   - Provide low-risk products for poor-credit, low-income customers to build creditworthiness.

4. **Interest Rate Policies:**
   - Adjust interest rates for high-risk profiles to reduce defaults.
   - Incentivize good credit mixes with lower interest rates.

5. **Education and Engagement:**
   - Educate young adults (14–24) on financial planning and credit management.
   - Provide counseling for customers with severe financial discrepancies.



### **Conclusion:**
This EDA project provides Paisabazaar with valuable insights into credit risk assessment and fraud detection. By implementing targeted strategies and data-driven recommendations, the company can optimize lending policies, enhance customer satisfaction, and build long-term trust, fostering sustainable business growth.




# **GitHub Link -**

https://github.com/Nautiyalmukesh2001/Paisabazaar-EDA

# **Problem Statement**

**BUSINESS PROBLEM OVERVIEW**

Paisabazaar faces the challenge of accurately assessing the creditworthiness of its customers using diverse financial and behavioral data. Misclassification of credit scores can lead to increased loan default risks, suboptimal financial product recommendations, and potential revenue loss. The problem is to identify key factors influencing credit scores and develop a robust framework to analyze and classify customers based on their creditworthiness. This requires extracting actionable insights from customer data, such as income, credit card usage, and payment patterns, to enable better decision-making and deliver personalized services.




### **BUSINESS OBJECTIVE**

To improve credit risk assessment, minimize loan default risks, detect and prevent fraudulent activities, and provide tailored financial products and advice, thereby enhancing decision-making, risk management, and customer satisfaction.

# **1. Dataset Exploration**

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
import time
import plotly.express as ps
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import os

## Dataset Loading

In [None]:
# mounting google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# changing the working directory for easy access of files
os.chdir('/content/drive/MyDrive/Paisabazaar_eda')

In [None]:
# Creating pandas dataframe
filepath = 'dataset.csv'
customer_df = pd.read_csv(filepath)

## Dataset First View

In [None]:
# top 5 rows of the data
customer_df.head()

## Dataset Rows & Columns count

In [None]:
# Dataset rows and columns
customer_df.shape

## Dataset Information

In [None]:
# dataset info
customer_df.info()

### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

len(customer_df[customer_df.duplicated()])

### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

print(customer_df.isnull().sum())

In [None]:
# Checking Null Value by plotting Heatmap

sns.heatmap(customer_df.isnull(),cbar=True)


### Outlier Detection

In [None]:
for feature in customer_df.columns:
  try:
    fig = ps.box(customer_df,y=feature)
    fig.show()
  except Exception as e:
    print(f"Error occurred while plotting {feature}: {e}")

## What did you know about your dataset?

The dataset provided originates from the financial services industry, specifically from Paisabazaar, and is designed to analyze customer credit behavior and detect potential fraudulent activities. The primary objective is to extract meaningful insights that enhance credit risk assessment, improve personalized recommendations, and strengthen fraud detection mechanisms.

The dataset comprises 100,000 rows and 28 columns, capturing diverse financial and behavioral data. It is well-prepared for analysis, with no missing or duplicate values. Initial exploratory analysis, including boxplots, revealed no significant outliers, indicating a clean and reliable dataset for generating actionable insights.
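
The boxplot inspection described above is visual. A numeric cross-check could count, for each numeric column, how many values fall outside the usual 1.5 × IQR whiskers. The sketch below is only an illustration: `iqr_outlier_counts` and the small `demo` frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series index with the columns
    outside = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return outside.sum()

# Made-up data: one obviously extreme debt value and one non-numeric column
demo = pd.DataFrame({
    "Outstanding_Debt": [10, 12, 11, 13, 12, 500],
    "Name": list("abcdef"),
})
counts = iqr_outlier_counts(demo)
print(counts)
```

On the real data the call would be `iqr_outlier_counts(customer_df)`. Columns with a non-zero count are the ones worth a closer look in the boxplots.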

# **2. Understanding Variables**

In [None]:
# Dataset Columns

customer_df.columns

In [None]:
# Dataset Describe

customer_df.describe(include='all')

## Variable Description



*   ID: Unique ID of the Record.
*   Customer_ID: Unique ID of the Customer.
*   Month: month of the year.
*   Name: Name of the Person.
*   Age: Age of the Person.
*   SSN: Social Security Number of the Person.
*   Occupation: Occupation of the Person.
*   Annual_Income: Annual Income of the Person.
*   Monthly_Inhand_Salary: Monthly Inhand Salary of the Person.
*   Num_Bank_Accounts: Number of Bank Accounts of the Person.
*   Num_Credit_Card: Number of Credit Cards held by the Customer.
*   Interest_Rate: Interest Rate on the Credit Card of the Customer.
*   Num_of_Loan: Number of Loans Taken by the Customer from the Bank.
*   Type_of_Loan: Types of Loans Taken by the Customer from the Bank.
*   Delay_from_due_date: Average Number of Days Delayed by the Person from the Date of Payment.
*   Num_of_Delayed_Payment: Number of Payments Delayed by the Person.
*   Changed_Credit_Limit: Percentage Change in the Credit Card Limit of the Customer.
*   Num_Credit_Inquiries: Number of the Credit Card Inquiries by the Person.
*   Credit_Mix: Classification of Credit Mix of the Person (Good, Standard, Bad).
*   Outstanding_Debt: Outstanding Balance of Person.
*   Credit_Utilization_Ratio: Credit Utilization Ratio of Credit Card of the Customer.
*   Credit_History_Age: The Age of Credit History of the Person.
*   Payment_of_Min_Amount: Yes, if the person paid the minimum amount due on the credit, otherwise No.
*   Total_EMI_per_Month: Total EMI Per Month of the Person.
*   Amount_Invested_Monthly: The Monthly Amount Invested by the Person.
*   Payment_Behaviour: Payment behaviour of the Person.
*   Monthly_Balance: The Monthly Balance Left in the Account of the Person.
*   Credit_Score: Credit Score of person (Good,Standard,Poor).








##  Unique Values for each variable.

In [None]:
# number of unique values
customer_df.nunique()

# **3. Data Wrangling**

## Data Wrangling Code

In [None]:
# creating copy of dataset
customer_df_copy = customer_df.copy()

In [None]:
# changing the Credit_Score column name to Credit_Score_Category to make them more meaningful
customer_df.rename(columns={'Credit_Score':'Credit_Score_Category'},inplace=True)

In [None]:
# creating a function to convert the numeric month to its month name to make it more interpretable
import calendar

def categorize_month(month):
  try:
    # calendar.month_name[1] is 'January', ..., calendar.month_name[12] is 'December'
    month = int(month)
    return calendar.month_name[month] if 1 <= month <= 12 else None
  except (TypeError, ValueError):
    # missing or non-numeric month values get no month name
    return None

In [None]:
# creating a month name column
customer_df['Month_Name'] = customer_df['Month'].apply(categorize_month)

In [None]:
# 1. Creditworthiness assessments often rely on the ability to verify an individual's identity, and an invalid SSN may indicate that the person's identity is either unverified,
#    fraudulent, or incorrect as it may suggest the person is attempting to hide or alter their true identity, which could be an indication of fraudulent activity.
# 2. An invalid SSN prevents financial institutions from accurately pulling the individual's credit report from major credit bureaus.
# 3. If a person does not have a valid SSN,they may not have an established credit history. This can lead to difficulties in assessing their creditworthiness.
# 4. Reasons of Invalid SSN can be: Typographical Errors,Identity Theft or Fraud,No Credit History


# creating a function to check SSN validity to assess the creditworthiness of customer.
# SSN like 123456789
def validate_ssn(num):
  try:

    # changing type of num so that i can perform regex
    ssn = str(int(num))
    if not re.match(r"^\d{9}$", ssn):
      return 'Invalid SSN'

    # making groups of SSN in the format: xxx-xx-xxxx
    group_1 = ssn[:3] # xxx
    group_2 = ssn[3:5] # xx
    group_3 = ssn[5:]  # xxxx

    # The SSN should not contain all zeros in any digit group
    if group_1 == '000' or group_2 == '00' or group_3 == '0000':
      return 'Invalid SSN'

    # The first three digits should not be "666" or in the range "900–999".
    if group_1 == '666' or 900 <= int(group_1) <= 999:
      return 'Invalid SSN'

    # all checks passed, so the SSN is labelled as valid
    return 'Valid SSN'

  except ValueError:
    # Handles cases where the input cannot be converted to an integer
    return 'Invalid SSN: Input must be a numeric value'

  except Exception as e:
    # Catches any unexpected errors
    return f"An error occurred: {e}"

In [None]:
# creating a new column to store SSN validity of the customer
customer_df['SSN_Validity'] = customer_df['SSN'].apply(validate_ssn)

In [None]:
# count of valid and invalid ssn of the customer
customer_df['SSN_Validity'].value_counts()
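
One thing worth checking before reading the invalid count as a fraud signal: if `pd.read_csv` parsed the `SSN` column as a number, any SSN that starts with zero loses its leading digits, so `str(int(num))` has fewer than nine characters and `validate_ssn` labels it invalid. Whether this happens in this dataset cannot be told from the notebook alone. The sketch below shows one possible way to restore the padding before applying the same rules. `normalize_ssn` and `is_plausible_ssn` are hypothetical helpers, and zero-padding is an assumption that would also hide genuinely truncated values.

```python
import re

def normalize_ssn(value):
    """Return the SSN as a digit string, left-padded with zeros to nine characters."""
    # Drop a trailing '.0' from float-parsed values, then keep digits only
    digits = re.sub(r"\D", "", str(value).split(".")[0])
    return digits.zfill(9)

def is_plausible_ssn(value):
    """Apply the same structural rules as validate_ssn to the padded string."""
    ssn = normalize_ssn(value)
    if not re.fullmatch(r"\d{9}", ssn):
        return False
    area, group, serial = ssn[:3], ssn[3:5], ssn[5:]
    # Area may not be 000, 666 or 900-999; group and serial may not be all zeros
    if area in ("000", "666") or area >= "900":
        return False
    return group != "00" and serial != "0000"

print(str(int(12345678.0)))         # eight characters: the leading zero is gone
print(normalize_ssn(12345678.0))    # padded back to nine characters
print(is_plausible_ssn(12345678.0))
```

Comparing `customer_df['SSN'].apply(is_plausible_ssn).value_counts()` with the counts above would show how many "invalid" records are explained by dropped leading zeros alone.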

In [None]:
# 1. The invalid SSN might suggest identity verification issues. If the SSN is invalid due to a typo, the person could still have a poor or standard credit score.
#    This situation can be rectified by correcting the SSN.

# 2. If the SSN is invalid due to fraudulent use, the person might be using someone else’s SSN or a fake SSN to apply for credit.
#    In this case, the combination of poor credit score and an invalid SSN suggests the person could be involved in fraudulent activities.
#    The poor credit score could reflect their inability to manage credit or financial responsibility, which is often a result of identity theft,fraudulent behaviour or financial mismanagement.

# 3. If the SSN is invalid because the person is not in the credit system or there are reporting issues (e.g., no credit file),
#    the individual may have a poor or standard credit score due to a lack of credit history or other financial factors.
#    In this case, the person might be a new credit user or someone with a thin credit file who has been using credit irresponsibly,
#    and having Poor credit Score will reflect financial instability or financial distress.


# Creating a crosstab to examine the correlation between invalid SSN and poor credit score
pd.crosstab(index=customer_df['SSN_Validity'],columns=customer_df['Credit_Score_Category'],margins=True)
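
Raw counts in a crosstab are hard to compare when one group is much larger than the other. A row-normalized version shows, for each SSN validity label, what share of its records falls into each credit score category. The snippet below is a minimal sketch on made-up rows. Only the column names are taken from the notebook.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the dataset
demo = pd.DataFrame({
    "SSN_Validity": ["Valid SSN", "Valid SSN", "Valid SSN", "Valid SSN",
                     "Invalid SSN", "Invalid SSN"],
    "Credit_Score_Category": ["Good", "Standard", "Standard", "Poor",
                              "Standard", "Poor"],
})

# normalize="index" turns each row into proportions that sum to 1
row_pct = (
    pd.crosstab(demo["SSN_Validity"], demo["Credit_Score_Category"], normalize="index")
    * 100
).round(1)
print(row_pct)
```

If invalid SSNs were linked to poor credit, the "Poor" percentage in the invalid row would be clearly higher than in the valid row.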

In [None]:
# creating a groupby to show the relation between annual income and credit score category
customer_df.groupby('Credit_Score_Category')['Annual_Income'].describe().round(2)

In [None]:
# creating group by to analyze credit score with other features

customer_df.groupby('Credit_Score_Category').agg({'Credit_Utilization_Ratio':'mean','Num_Credit_Card':'mean','Num_of_Loan':'mean','Num_Credit_Inquiries':'mean'}).round()

In [None]:
# creating a column of Expected monthly salary
customer_df['Expected_Monthly_Salary'] = customer_df['Annual_Income']/12

In [None]:
# creating a discrepancy column to use it for further analysis to check income discrepancy
customer_df['Monthly_Salary_Discrepancy'] = abs(customer_df['Expected_Monthly_Salary'] - customer_df['Monthly_Inhand_Salary'])

In [None]:
# creating a percentage discrepancy column
customer_df['Monthly_Salary_Percentage_Discrepancy'] = (customer_df['Monthly_Salary_Discrepancy']  / customer_df['Expected_Monthly_Salary']) *100

In [None]:
# creating a category of percentage discrepancy to take it for further analysis
def categorize_discrepancy(discrepancy):
  if discrepancy == 0:
    return 'No Discrepancy'
  elif 0 <= discrepancy <= 5:
    return 'Minimal Discrepancy'
  elif 5 < discrepancy <= 15:
    return 'Moderate Discrepancy'
  elif 15 < discrepancy <= 30:
    return 'High Discrepancy'
  elif 30 < discrepancy <= 50:
    return 'Severe Discrepancy'

In [None]:
# creating a category column to put categories of monthly discrepancy
customer_df['Monthly_Salary_Discrepancy_Category'] = customer_df['Monthly_Salary_Percentage_Discrepancy'].apply(categorize_discrepancy)

In [None]:
# count of monthly salary discrepancy category
customer_df.loc[:,'Monthly_Salary_Discrepancy_Category'].value_counts()
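
Note that `categorize_discrepancy` has no branch for percentages above 50 or for missing values, so those rows receive `None` and are left out of `value_counts()` by default. A possible variant that gives every input a label is sketched below. The band limits are copied from the function above, while the name `categorize_discrepancy_total` and the labels "Above 50%" and "Unknown" are assumptions made for this example.

```python
import math

# Inclusive upper bounds in percent, copied from categorize_discrepancy
BANDS = [
    (5, "Minimal Discrepancy"),
    (15, "Moderate Discrepancy"),
    (30, "High Discrepancy"),
    (50, "Severe Discrepancy"),
]

def categorize_discrepancy_total(discrepancy, overflow_label="Above 50%"):
    """Label a non-negative percentage discrepancy; every input gets a label."""
    if discrepancy is None or math.isnan(discrepancy):
        return "Unknown"          # hypothetical label for missing values
    if discrepancy == 0:
        return "No Discrepancy"
    for upper, label in BANDS:
        if discrepancy <= upper:
            return label
    return overflow_label         # hypothetical label for values above 50

for value in (0, 5, 5.1, 50, 72.5, float("nan")):
    print(value, "->", categorize_discrepancy_total(value))
```

Calling `value_counts(dropna=False)` on the existing category column is a quick way to see how many rows currently fall through without a label.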

In [None]:
# creating a crosstab to show Monthly_Salary_Discrepancy_Category and credit score category
pd.crosstab(customer_df['Monthly_Salary_Discrepancy_Category'],customer_df['Credit_Score_Category'],margins=True)

In [None]:
# 1. Included the monthly salary discrepancy category alongside the outstanding debt of customers to better understand
#    the relationship between income stability and debt obligations.
# 2. This comparison is crucial for credit score and fraud analysis, as significant salary discrepancies may indicate financial distress
#    or potential fraudulent behavior. By examining how salary discrepancies align with outstanding debts,
#    we can identify patterns that signal higher credit risk or areas that require closer scrutiny for fraud detection.


# creating a groupby to show monthly salary discrepancy category and outstanding debt of the Customer
customer_df.groupby('Monthly_Salary_Discrepancy_Category')['Outstanding_Debt'].describe().round(2)

In [None]:
# unique credit mix
print("The Unique Credit Mix are: ",customer_df['Credit_Mix'].unique())

In [None]:
# replacing the values of Credit_Mix column to make them more understandable or interpretable
customer_df['Credit_Mix'] = customer_df['Credit_Mix'].replace({'Good':'Good_Credit_Mix','Standard':'Standard_Credit_Mix','Bad':'Bad_Credit_Mix'})

In [None]:
# creating a crosstab to show the relation between credit mix and credit score category
pd.crosstab(index=customer_df['Credit_Mix'],columns=customer_df['Credit_Score_Category'],margins=True)
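
To judge whether credit mix and credit score category are associated beyond what the marginal totals alone would produce, one option is Pearson's chi-square statistic computed from the crosstab counts (without the `All` margins). The following is a minimal, dependency-free sketch. The function name and both small tables are made up for illustration and are not results from this dataset.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Rows proportional to each other: no association, statistic is zero
independent_stat, independent_dof = chi_square_statistic([[18, 29, 53], [36, 58, 106]])

# Rows that differ: the statistic grows with the departure from independence
skewed_stat, skewed_dof = chi_square_statistic([[10, 20], [20, 10]])
print(independent_stat, independent_dof, skewed_stat, skewed_dof)
```

The statistic is then compared with the chi-square critical value for the returned degrees of freedom (for example, 3.841 for 1 degree of freedom at the 5% level). With 100,000 rows even weak associations tend to be statistically significant, so the row percentages remain the more informative view.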

In [None]:
# 1.Raw age values can be difficult to analyze directly, as they are continuous and do not easily provide insights at a glance.
#   By categorizing individuals into specific age ranges, we can better understand how different age groups behave in relation to other variables

# 2.Categorizing individuals into age groups (14-24, 25-33, 34-42, 43-56) helps to break down the data into manageable segments.
#   This segmentation is useful when analyzing how certain behaviors, like loan-taking or spending, may vary across different stages of life.


# 14-24: Young People
# 25-33: Young Adults
# 34-42: Middle-Aged Adults
# 43-56: Older Adults
# creating a function to categorize age
def categorize_age(age,lower_fence,Q1,mid,Q3,upper_fence):
  try:
    if lower_fence <= age <= Q1:
      return f'Young_People (Age {lower_fence}-{Q1})'
    elif Q1 < age <= mid:
      return f'Young_Adults (Age {Q1+1}-{mid})'
    elif mid < age <= Q3:
      return f'Middle_Aged_Adults (Age {mid+1}-{Q3})'
    elif Q3 < age <= upper_fence:
      return f'Older_Adults (Age {Q3+1}-{upper_fence})'

  except TypeError:
    # Handles cases where inputs are not of expected types (e.g., non-numeric values)
    return 'Invalid input: Age and fence values must be numbers'

  except Exception as e:
    # Catches any other unexpected errors
    return f"An error occurred: {e}"


In [None]:
# calculating lower_fence_age, Q1_age, mid_age, Q3_age and upper_fence_age
lower_fence_age = int(customer_df['Age'].min()) # to get min age
Q1_age = int(np.percentile(customer_df['Age'],25)) # to get Q1 age
mid_age = int(customer_df['Age'].median())  # to get median value
Q3_age = int(np.percentile(customer_df['Age'],75))  # to get Q3 of age
upper_fence_age = int(customer_df['Age'].max())  # to get max of age

# creating a new column named Age_Category
customer_df['Age_Category'] = customer_df['Age'].apply(categorize_age,lower_fence=lower_fence_age,Q1=Q1_age,mid=mid_age,Q3=Q3_age,upper_fence=upper_fence_age)
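
The same four bands can also be produced without a row-wise `apply`. The sketch below uses `pd.cut` with the edges listed in the comments above (14, 24, 33, 42, 56). It is an alternative shown on made-up ages, not the method used in this notebook, and it assumes the edges are strictly increasing.

```python
import pandas as pd

# Made-up ages covering every band boundary
ages = pd.Series([14, 20, 24, 25, 33, 34, 42, 43, 56])

edges = [14, 24, 33, 42, 56]   # min, Q1, median, Q3, max as listed above
labels = ["Young_People", "Young_Adults", "Middle_Aged_Adults", "Older_Adults"]

# Bins are right-inclusive: (14, 24], (24, 33], ...; include_lowest keeps age 14
age_category = pd.cut(ages, bins=edges, labels=labels, include_lowest=True)
print(pd.concat([ages, age_category], axis=1, keys=["Age", "Age_Category"]))
```

On the real data the edges would come from `lower_fence_age`, `Q1_age`, `mid_age`, `Q3_age` and `upper_fence_age` instead of being typed in.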

In [None]:
# creating a crosstab to analyze age category with credit mix of the customer
pd.crosstab(index=customer_df['Age_Category'],columns=customer_df['Credit_Mix'],margins=True)

In [None]:
# creating groupby to analyze age category with number of payment delayed
customer_df.groupby('Age_Category')['Num_of_Delayed_Payment'].describe().round()

In [None]:
customer_df.groupby('Age_Category')['Num_of_Delayed_Payment'].mean().round()

In [None]:
# getting unique types of loan taken by the customer
unique_loan_type_set = set()
for loan in customer_df['Type_of_Loan']:
  for i in loan.split(','):
    unique_loan_type_set.add(i.replace('and','').strip())

unique_loan_type = list(unique_loan_type_set)
print('The unique type of loan are: ',unique_loan_type)
print('The total number of unique type of loan are: ',len(unique_loan_type))
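
A caveat on the loop above: `i.replace('and','')` removes the letters "and" anywhere in an entry, not only the joining word before the last loan, and `loan.split(',')` raises an error on a missing value. A slightly more defensive parser is sketched below. `split_loan_types` is a hypothetical helper, and "Standard Loan" is an invented loan name used only to show the difference.

```python
import re

def split_loan_types(raw):
    """Split a comma-separated loan string and drop a leading 'and' from each item."""
    if not isinstance(raw, str):
        return []                      # missing values yield no loan types
    parts = (re.sub(r"^\s*and\s+", "", part).strip() for part in raw.split(","))
    return [p for p in parts if p]

print(split_loan_types("Auto Loan, Personal Loan, and Student Loan"))
print(split_loan_types("Standard Loan"))               # name kept intact
print("Standard Loan".replace("and", "").strip())      # letters removed from the name
```

The set of unique loan types can then be built with `{t for raw in customer_df['Type_of_Loan'] for t in split_loan_types(raw)}`.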

In [None]:
# unique occupation of the customer
customer_df['Occupation'].unique()

In [None]:
# 1.The function was created to identify potential fraudulent activity by checking for student loans taken out by individuals with professional occupations.
#   Given that students are typically in early stages of their careers or education,
#   it is unusual for individuals with established professional roles (such as doctors, engineers, or managers) to have student loans.
#   This function helps flag such instances where a person with a professional occupation has taken a student loan,
#   which could be a red flag for potential misuse or fraudulent behavior.

# 2.In the dataset, the "Student" term is not present in the occupation column, which creates a gap in identifying people who are actually students
#   and might have valid student loans. The function fills this gap by flagging loans for people in professional roles,
#   who are less likely to be students but may still have taken out a student loan. If such a case arises, it could indicate fraudulent or suspicious activity.

# creating a function to categorize valid and invalid student loan presence
def validate_occup_std_loan(loans):
  try:
    if 'Student Loan' in loans:
      return 'Invalid: Student Loan present'

    return 'Valid'

  except Exception as e:
    # Catches any other unexpected errors
    return f"An error occurred: {e}"


In [None]:
# creating a new column to check for valid loan
customer_df['Loan_Profile_Validity'] = customer_df['Type_of_Loan'].apply(validate_occup_std_loan)

In [None]:
# count the values of valid and invalid student loans
customer_df['Loan_Profile_Validity'].value_counts()

In [None]:
# creating a crosstab of loan profile validity against age category and credit score category, as a student loan held by a person older than 24 may indicate fraudulent behaviour
pd.crosstab(index=customer_df['Loan_Profile_Validity'],columns=[customer_df['Age_Category'],customer_df['Credit_Score_Category']],margins=True)

In [None]:
# 1.The function was created to identify potential high-risk loans among individuals aged 14-24,
#   which is often considered a vulnerable age group in terms of financial maturity and decision-making.
#   Given that young adults may not yet have established stable credit histories or sufficient financial stability,
#   loans such as Mortgage Loans, Home Equity Loans, and Personal Loans could indicate irregular or potentially fraudulent financial behavior.

# 2.These loan types(Personal Loan,Home Equity Loan, Mortgage Loan) are typically associated with significant financial commitments,
#   and it is unusual for individuals in the 14-24 age range to be taking on such substantial debt,
#   especially when they are in the early stages of their financial lives.
#   As a result, these loan types might be red flags for potential fraud, financial distress, or misuse of credit.

# This function flags records where individuals in the 14-24 age group have taken out any of these high-risk loans.
def checkLoanFlag(age,loan_types,lower_fence):
  try:
    if lower_fence <= age <= 24:
      target_loans = ['Personal Loan', 'Home Equity Loan', 'Mortgage Loan']
      loan_types_list = list()
      for loan in loan_types.split(','):
        loan_types_list.append(loan.replace('and','').strip())

      # Check if any of the target loan types are in the list
      matching_loans = [loan for loan in target_loans if loan in loan_types_list]

      # if loan found then it will be red flag that person with age of 14-24 take these loans
      if matching_loans:
        return 'Red Flag: High-Risk Loan'

      else:
        return 'No Red Flag: Low-Risk Loan'

    else:
      return 'No Red Flag: Low-Risk Loan'

  except Exception as e:
    # Catch all unexpected errors
    return f"An error occurred: {e}"


In [None]:
# creating a column of high risk loan flag for the age of 14-24
customer_df['high_risk_loan_flag_14_24'] = customer_df.apply(lambda row: checkLoanFlag(row['Age'],row['Type_of_Loan'],lower_fence=lower_fence_age), axis=1)

In [None]:
# count of values of red flag and no red flag
customer_df['high_risk_loan_flag_14_24'].value_counts()
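
On 100,000 rows a row-wise `apply` is noticeably slower than column operations. The sketch below builds the same two labels from boolean masks. It is an alternative on made-up rows, and it matches loan names as substrings rather than as exact list items, so it would also match a longer loan name that merely contains one of the three target names.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration
demo = pd.DataFrame({
    "Age": [19, 22, 30, 24],
    "Type_of_Loan": [
        "Auto Loan, and Mortgage Loan",
        "Student Loan",
        "Personal Loan",
        "Home Equity Loan, Payday Loan",
    ],
})

high_risk_pattern = "Personal Loan|Home Equity Loan|Mortgage Loan"
is_young = demo["Age"].between(demo["Age"].min(), 24)   # inclusive on both ends
has_target_loan = demo["Type_of_Loan"].str.contains(high_risk_pattern, na=False)

demo["high_risk_loan_flag_14_24"] = np.where(
    is_young & has_target_loan,
    "Red Flag: High-Risk Loan",
    "No Red Flag: Low-Risk Loan",
)
print(demo)
```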

In [None]:
# creating a crosstab to show the relation between credit score and customers aged 14-24 who took high-risk loans
pd.crosstab(index=customer_df['high_risk_loan_flag_14_24'],columns=customer_df['Credit_Score_Category'],margins=True)

In [None]:
# creating a groupby to show correlation of credit score and high risk loan with other features
customer_df.groupby(['high_risk_loan_flag_14_24','Credit_Score_Category']).agg({'Num_of_Loan':'mean','Num_of_Delayed_Payment':'mean','Num_Credit_Inquiries':'mean','Outstanding_Debt':'mean','Credit_Utilization_Ratio':'mean','Credit_History_Age':'mean'}).round()

In [None]:
# unique values of Payment_of_Min_Amount column
customer_df['Payment_of_Min_Amount'].unique()

In [None]:
# Handling the 'NM' value by replacing it with 'NaN'
customer_df['Payment_of_Min_Amount'] = customer_df['Payment_of_Min_Amount'].replace('NM', np.nan)

In [None]:
# creating a crosstab to show relation between Payment_of_Min_Amount and Credit_Score_Category
pd.crosstab(customer_df['Payment_of_Min_Amount'],customer_df['Credit_Score_Category'],margins=True)

In [None]:
# unique values in Payment_Behaviour
customer_df['Payment_Behaviour'].unique()

In [None]:
# creating a crosstab to show relation between Payment_Behaviour and Credit_Score_Category

pd.crosstab(index=customer_df['Payment_Behaviour'],columns=customer_df['Credit_Score_Category'],margins=True).sort_values(by='Good',ascending=False)

In [None]:
# unique credit card count
customer_df['Num_Credit_Card'].unique()

In [None]:
# creating a crosstab to analyze the relation between number of credit card customer having and the credit Score category
pd.crosstab(index=customer_df['Num_Credit_Card'],columns=customer_df['Credit_Score_Category'],margins=True).sort_values(by='Good',ascending=False)

In [None]:
# creating groupby to analyze number of credit card inquiries and number of credit cards
customer_df.groupby('Num_Credit_Card')['Num_Credit_Inquiries'].describe().round(2)

In [None]:
# creating groupby to analyze number of credit card and credit utilization ratio
customer_df.groupby('Num_Credit_Card')['Credit_Utilization_Ratio'].describe().round(2)

In [None]:
# creating a groupby to analyze number of loan taken by the customer with other features
customer_df.groupby('Num_of_Loan').agg({'Monthly_Inhand_Salary':'mean','Num_of_Delayed_Payment':'mean','Num_Credit_Inquiries':'mean','Outstanding_Debt':'mean'}).round()

In [None]:
# unique number of bank accounts
customer_df['Num_Bank_Accounts'].unique()

In [None]:
# creating group by to analyze the number of bank accounts and outstanding debt
# as multiple bank accounts combined with high debt might be used to shuffle funds in a suspicious manner
customer_df.groupby('Num_Bank_Accounts')['Outstanding_Debt'].describe().round(2)

## What all manipulations have you done and insights you found?

**In this Exploratory Data Analysis (EDA) project on Paisa Bazaar Credit Score and Fraud Detection, several data preprocessing steps and transformations were carried out to make the dataset more understandable and to uncover meaningful insights.**

**Manipulations which have Done:**
1. Renaming Columns for Better Clarity:
The "Credit Score" column was renamed to "Credit Score Category" to better represent the values (e.g., Good, Bad, Standard) and make it more intuitive for further analysis.

2. Formatting the 'Month' Column:
The "Month" column, which had numerical values (1-8), was converted into month names (e.g., January, February) to improve readability and provide better insights when analyzing monthly trends.

3. Validating SSN (Social Security Number):
*   A function was created to check the validity of the SSN for each customer. A new column, "SSN Validity", was added to indicate whether the SSN was valid or invalid.
*   The distribution of valid and invalid SSNs was analyzed to detect any anomalies or potential data issues related to fraud.
*   The relationship between SSN validity and the credit score was analyzed to check if there was any correlation between the two, particularly to identify if invalid SSNs were associated with poor credit scores.

5. Comparing Credit Score and Annual Income:
The Credit Score was compared with the Annual Income of customers to investigate any patterns between income levels and credit score categories, which could reveal insights into financial behaviors.

6. Creating Salary-Related Columns:
 A new column, "Expected Monthly Salary," was created based on each customer's reported annual income. Additionally, two new columns—"Monthly Salary Discrepancy" and "Percentage Discrepancy"—were introduced to highlight the differences between the expected and reported monthly salaries. To further enhance the analysis, a function was developed to categorize these discrepancies, enabling later exploration of their relationship with other variables.This discrepancy can be visualized to identify outliers and investigate potential fraudulent activities or data inconsistencies.

7. Cleaning 'Credit Mix' Column:
The values in the "Credit Mix" column were adjusted to make them more understandable, e.g., changing “Good” to “Good Credit Mix”. This transformation helped make the analysis of credit behavior more meaningful.

8. Categorizing Customer Age:
A function was created to categorize the raw age data into age categories (e.g., 14-23, 24-35, etc.), allowing for more effective segmentation and better analysis of how credit score correlates with different age groups.

9. Extracting Unique Loan Types:
A function was used to extract unique loan types from the "Types of Loan" column where multiple loan types were listed. This allowed for the analysis of individual loan types and their distribution across the dataset.

10. Identifying Fraudulent Behavior Based on Occupation and Loan Type:
*   Upon examining the unique occupations in the dataset, it was found that the term "Student" was missing, yet student loans were listed. A function was created to identify any occupation that might be incorrectly associated with student loans, leading to a new column for categorizing valid and invalid student loans.
*   The validity of these loans was analyzed and compared with age categories to understand any age-related patterns.

11. High-Risk Loan Flag Based on Age:
*   A function was developed to flag individuals in the age group 14-23 who had taken mortgage loans, home equity loans, or personal loans, as these types of loans are unusual for this age group and could suggest fraudulent behavior. A new column, "High-Risk Loan Flag", was created to categorize loans as either "Red Flag" or "No Red Flag".
*   The distribution of red flag loans was compared against credit score categories and credit utilization ratios to identify any patterns that suggest a higher likelihood of fraud.
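
The feature-engineering steps above (salary discrepancy, age bands, loan-type extraction, and the high-risk loan flag) can be sketched as plain helper functions. This is a minimal illustration, not the notebook's actual implementation: it assumes the expected monthly salary is annual income divided by 12, the discrepancy cut-offs and the "Moderate Discrepancy" label are hypothetical, and the age bands are the ones quoted in the key insights. Because the text mentions both 14-23 and 14-24 for the youngest group, the cut-off for the flag is left as a parameter.

```python
import re

# Loan types treated as unusual for the youngest age group (from step 11 above).
HIGH_RISK_LOANS = {"Mortgage Loan", "Home Equity Loan", "Personal Loan"}


def salary_discrepancy(annual_income, reported_monthly):
    """Return (expected monthly salary, absolute gap, gap as % of expected).

    Assumes expected monthly salary = annual income / 12.
    """
    expected = annual_income / 12
    gap = abs(expected - reported_monthly)
    pct = gap / expected * 100 if expected else 0.0
    return expected, gap, pct


def discrepancy_category(pct):
    """Bucket a percentage gap; the cut-offs here are hypothetical."""
    if pct < 5:
        return "Minimal Discrepancy"
    if pct < 15:
        return "Moderate Discrepancy"
    if pct < 30:
        return "High Discrepancy"
    return "Severe Discrepancy"


def age_category(age):
    """Map a raw age to the bands quoted in the key insights."""
    if age <= 24:
        return "Young People (14-24)"
    if age <= 33:
        return "Young Adults (25-33)"
    if age <= 42:
        return "Middle-Aged Adults (34-42)"
    return "Older Adults (43-56)"


def split_loan_types(types_of_loan):
    """Split 'Auto Loan, Student Loan, and Personal Loan' into a list of names."""
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", types_of_loan)
    return [p.strip() for p in parts if p.strip()]


def high_risk_loan_flag(age, types_of_loan, max_age=23):
    """Flag young customers (14..max_age) holding any high-risk loan type."""
    loans = set(split_loan_types(types_of_loan))
    if 14 <= age <= max_age and loans & HIGH_RISK_LOANS:
        return "Red Flag"
    return "No Red Flag"


expected, gap, pct = salary_discrepancy(24000, 1800)
print(expected, gap, round(pct, 1), discrepancy_category(pct))
print(age_category(30))
print(split_loan_types("Auto Loan, Student Loan, and Personal Loan"))
print(high_risk_loan_flag(20, "Student Loan, and Mortgage Loan"))
```

On a DataFrame, helpers like these would typically be applied column-wise, for example with `Series.apply`; the exact column names to pass in depend on how the dataset was cleaned.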


**Key Insights Found:**

*  The data reveals that 21.6% of records have an invalid SSN, which could be a potential red flag for fraud risk. Individuals with a valid SSN dominate across all credit score categories, especially in the "Standard" category, which accounts for over half of the dataset. The prevalence of valid SSNs in higher credit score categories suggests a correlation between data reliability and creditworthiness. This distribution indicates the need for further investigation into the potential relationship between SSN validity and fraudulent activity, particularly within the "Poor" and "Standard" credit score groups.

*  The data reveals that most customers have "Minimal Discrepancy" in salary, linked to better ("Standard") credit scores. As salary discrepancies increase, credit scores worsen, especially for those in "High" or "Severe Discrepancy" groups. Additionally, higher discrepancies correlate with greater outstanding debt, indicating increased financial risk and potential for fraud concerns.

*  The analysis of the credit score categories shows distinct financial characteristics. The "Good" category has the highest mean value (65,203.67) and a wider range of variation, suggesting a strong financial profile but also higher dispersion, which might warrant further scrutiny for outliers or inconsistencies. The "Poor" category has the lowest mean (40,584.52) and a smaller range, indicating more concentrated, lower financial values. The "Standard" category falls in between, with a moderate mean (50,987.16) and distribution. These differences emphasize the potential for fraud analysis by comparing data spread and variability across credit categories to identify unusual patterns or outliers that could indicate fraudulent behavior.

*  The comparison of credit score categories highlights key differences in financial behavior. Individuals in the "Good" category have a slightly higher credit utilization ratio (33%) and fewer credit cards (4 on average), loans (2), and credit inquiries (3), indicating responsible credit management. The "Poor" category shows a higher average number of credit cards (7), loans (5), and credit inquiries (8), suggesting potential financial instability or a higher risk profile. The "Standard" group is moderate across all variables, but with elevated credit inquiries (5) compared to the "Good" category. These patterns are crucial for fraud analysis, as a high number of credit inquiries or loans might flag risky or potentially fraudulent financial activity.

*   Individuals with a Bad Credit Mix are more likely to have Poor credit scores (14,289 out of 23,768), suggesting that poor credit mix is a risk factor for low creditworthiness. In contrast, a Good Credit Mix is strongly associated with Good credit scores (14,848 out of 30,384), indicating that a favorable credit mix often corresponds with higher creditworthiness. The Standard Credit Mix has a more balanced distribution but is concentrated in the Standard credit score category (33,361 out of 45,848). These insights are useful for identifying patterns in credit risk and may help detect profiles with a higher likelihood of fraud based on credit mix and score categories.

*  The data reveals that middle-aged (34-42) and young adults (25-33) have the highest counts of good and standard credit mixes, indicating stronger credit profiles. Older adults (43-56) also show a stable good credit mix. However, young people (14-24) exhibit a higher bad credit mix, signaling potential credit risk. In terms of delayed payments, most age groups (14-42) average 14 delays, while older adults average 11, suggesting similar risk profiles for younger and middle-aged groups, with slightly lower risk in older adults. These insights can help target high-risk groups for credit improvement and fraud prevention strategies.

*   There is a notable presence of “Invalid” loan profiles (associated with student loans) among all age groups, especially in the “Poor” credit score category. Middle-aged adults (34-42) and young adults (25-33) with poor credit scores show a particularly high count of invalid loan profiles. This trend suggests that individuals in these age groups with poor credit scores might be at higher risk for credit issues, and the presence of invalid loan profiles could signal potential red flags for fraud detection. Identifying these patterns can help in refining risk assessment and detecting potentially fraudulent loan profiles.

*   The analysis reveals that most loans have no red flags, with lower risk generally associated with good and standard credit scores. Poor credit scores show a higher proportion of red flags, indicating greater risk. Key risk indicators include more delayed payments, higher outstanding debt, and increased credit inquiries in red-flagged cases. Credit utilization appears consistent, but longer credit histories are linked to lower risk. Payment behaviors, like paying only the minimum amount, may also influence risk assessment.

*   The analysis reveals that as the number of loans (Num_of_Loan) increases, the Monthly_Inhand_Salary generally decreases. Additionally, a higher number of loans correlates with more Num_of_Delayed_Payment and Num_Credit_Inquiries, as well as a higher Outstanding_Debt. This trend may indicate increased financial strain or risk among individuals with multiple loans, which could be relevant for assessing creditworthiness and the likelihood of fraud. For example, individuals with 8-9 loans show significantly higher outstanding debt and delayed payments, suggesting they may have lower credit scores and higher risk factors.

*  Higher numbers of credit cards are associated with lower credit scores, more credit inquiries, and slightly reduced credit utilization ratios. People with 8+ cards generally fall into poorer credit categories and have many inquiries, indicating potential financial strain or credit risk. This information is valuable for assessing credit score and identifying potential fraud risks.

*   From this data, we observe a general upward trend in outstanding debt as the number of bank accounts increases. This pattern could indicate that individuals with more bank accounts tend to carry higher debt, possibly affecting their credit score and fraud risk. The steep increase in debt after five bank accounts suggests a potential threshold, which might be relevant for identifying high-risk profiles. This trend could be an indicator to monitor for potential creditworthiness and fraud likelihood in credit score and fraud analysis.
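
Counts such as "14,289 out of 23,768" in the credit mix insight above are easier to compare as row shares. The sketch below shows one way to get both views with `pd.crosstab`. The tiny DataFrame is made-up illustration data, and the column name `Credit_Mix` is an assumption about how the cleaned column is named.

```python
import pandas as pd

# Toy stand-in for customer_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "Credit_Mix": ["Bad", "Bad", "Bad", "Good", "Good",
                   "Standard", "Standard", "Standard"],
    "Credit_Score_Category": ["Poor", "Poor", "Standard", "Good", "Standard",
                              "Standard", "Standard", "Poor"],
})

# Raw counts of each (credit mix, credit score) combination.
counts = pd.crosstab(toy_df["Credit_Mix"], toy_df["Credit_Score_Category"])

# Share of each credit score category within a credit mix (each row sums to 1).
row_share = pd.crosstab(
    toy_df["Credit_Mix"], toy_df["Credit_Score_Category"], normalize="index"
).round(3)

print(counts)
print(row_share)
```

Applying the same arithmetic to the figures quoted above gives 14,289 / 23,768 ≈ 60.1% (Bad mix with Poor score), 14,848 / 30,384 ≈ 48.9% (Good mix with Good score), and 33,361 / 45,848 ≈ 72.8% (Standard mix with Standard score).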











# **4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables**

## Chart 1: Pie Chart



In [None]:
# creating pie charts to show the distributions of credit score category, age category and payment behaviour of the customers

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# plot 1: credit score distribution
axes[0, 0].pie(x=customer_df.groupby('Credit_Score_Category')['Customer_ID'].count().reset_index()['Customer_ID'],
        labels=customer_df.groupby('Credit_Score_Category')['Customer_ID'].count().reset_index()['Credit_Score_Category']
        ,autopct='%1.1f%%')

axes[0, 0].set_title('Distribution of Customer Credit Scores')

# plot 2: age category distribution
axes[0, 1].pie(x=customer_df.groupby('Age_Category')['Customer_ID'].count().reset_index()['Customer_ID'],
        labels=customer_df.groupby('Age_Category')['Customer_ID'].count().reset_index()['Age_Category']
        ,autopct='%1.1f%%')

axes[0, 1].set_title('Age Distribution')

# plot 3: distribution of payment behaviour of the customer
axes[1, 0].pie(x=customer_df.groupby('Payment_Behaviour')['Customer_ID'].count().reset_index()['Customer_ID'],
        labels=customer_df.groupby('Payment_Behaviour')['Customer_ID'].count().reset_index()['Payment_Behaviour']
        ,autopct='%1.1f%%')

axes[1, 0].set_title('Distribution of Customer Payment Behaviours')

# Turn off the axis for the subplot at axes[1, 1]
axes[1, 1].axis('off')

plt.tight_layout()
plt.show()


### 1. Why did you pick the specific chart?

I chose a pie chart to display the distribution of credit score categories, age groups, and payment behavior because it effectively visualizes the proportional breakdown of these customer attributes. A pie chart allows for an easy comparison of how each category contributes to the overall dataset, helping to quickly identify trends and patterns in customer profiles, such as the most common credit scores or age groups, and common payment behaviors.

### 2. What is/are the insight(s) found from the chart?

1. Customer Credit Scores:
*   Standard Credit Scores Dominate: Over half of the customers (53.2%) have a "Standard" credit score, representing the average customer base.
*   Poor Credit Scores (29%): A significant portion of customers falls into the "Poor" credit score category, requiring attention for risk management.
*   Good Credit Scores (17.8%): A relatively small portion, suggesting limited high-quality credit customers.

2. Age Distribution:
*   Young Adults (25-33): The largest group (26.2%), suggesting financial products could focus on this demographic.
*   Middle-Aged Adults (34-42) and Older Adults (43-56): Together, they form a significant customer base (48.5%).
*   Young People (14-24): At 25.2%, this group is comparable to Middle-Aged Adults, likely representing early-stage financial product users.

3. Customer Payment Behaviors:
*   Largest Category: Low_spent_Small_value_payments (28.6%) suggests a significant portion of customers engage in low-value, low-spending behaviors.
*   High Spending Insights: High_spent_Medium_value_payments (19.7%) and High_spent_Large_value_payments (14.7%) indicate a smaller but notable segment of high spenders.
*   Low Spenders Across Values: Low_spent_Large_value_payments (10.8%) is the smallest segment, indicating fewer customers in this category.
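
The percentages quoted above are the labels the pie chart prints. As a rough sketch, the same numbers can be read off without plotting by normalizing the category counts. The helper name `category_shares` and the toy data are hypothetical. Note that this counts rows per category, whereas the chart code counts non-null `Customer_ID` values, so the two agree only when `Customer_ID` has no missing values.

```python
import pandas as pd


def category_shares(df, column):
    """Percentage of rows per category, rounded to one decimal place."""
    return (df[column].value_counts(normalize=True) * 100).round(1)


# Toy stand-in for customer_df; the values are invented for illustration only.
toy_df = pd.DataFrame(
    {"Credit_Score_Category": ["Standard"] * 5 + ["Poor"] * 3 + ["Good"] * 2}
)

shares = category_shares(toy_df, "Credit_Score_Category")
print(shares)
```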








### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights enable targeted strategies like offering tailored financial products (spending behavior), improving customer credit scores, and age-specific services. These actions can reduce fraud risks and enhance customer satisfaction, driving positive growth.

The high proportion of "Poor" credit scores (29%) and low-value spenders (28.6%) may indicate a risk-prone customer base, leading to higher defaults or fraud if not addressed. Mitigating these risks with proactive measures is crucial to avoid negative impact.

## Chart 2: Line Chart

### A. Number of Delayed Payments vs (Outstanding Debt and Changed Cr Card Limit)

In [None]:
# dual axis line chart showing number of payments delayed and outstanding debt, changed cr card limit
fig, ax1 = plt.subplots(figsize=(15, 6))

# plotting 1st axis
sns.lineplot(x='Num_of_Delayed_Payment', y='Outstanding_Debt',marker='o',data=customer_df,
         linewidth=2,ax=ax1,color='blue',label='Outstanding Debt')
ax1.set_ylabel("Outstanding Debt of the Customer", color="blue")
ax1.tick_params(axis='y', labelcolor="blue")

ax2 = ax1.twinx()

# plotting 2nd axis
sns.lineplot(x='Num_of_Delayed_Payment', y='Changed_Credit_Limit',marker='o',linewidth=2, data = customer_df,ax=ax2,
         color='red',label='Changed Cr Limit')
ax2.set_ylabel("Change in Cr Card Limit", color="red")
ax2.tick_params(axis='y', labelcolor="red")


plt.title('Number of Delayed Payments vs (Outstanding Debt and Changed Credit Limit)')


plt.tight_layout()
plt.show()



#### 1. Why did you pick the specific chart?

This chart provides a comprehensive view of how delayed payments correlate with both debt accumulation and credit limit changes, making it valuable for identifying patterns or anomalies in customer behavior.

#### 2. What is/are the insight(s) found from the chart?



*   Outstanding debt increases significantly as the number of delayed payments rises. This indicates a direct relationship between delayed payments and financial stress on the customer.
*   The change in credit card limits also increases with delayed payments, suggesting credit providers may be extending higher limits to customers who already have delayed payments, possibly to manage short-term defaults. Alternatively, this might reflect poor credit risk management policies, or potentially signal a fraud scheme involving rapidly increasing credit limits.
*   Both the outstanding debt and the credit limit show a leveling-off trend when the number of delayed payments exceeds 20, indicating a possible threshold where customers face capped debt or credit limit adjustments.
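
When several rows share the same x value, `sns.lineplot` by default plots the mean of y at each x (with a confidence band), so the lines in this chart are per-group averages. The sketch below reproduces those averages with a `groupby` and uses `diff()` to show where the curve flattens. The toy numbers are invented purely to illustrate the mechanics and do not come from the dataset.

```python
import pandas as pd

# Toy stand-in for customer_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "Num_of_Delayed_Payment": [5, 5, 10, 10, 20, 20, 25, 25],
    "Outstanding_Debt": [400.0, 600.0, 900.0, 1100.0,
                         1900.0, 2100.0, 2000.0, 2000.0],
    "Changed_Credit_Limit": [4.0, 6.0, 9.0, 11.0, 17.0, 19.0, 18.0, 18.0],
})

# Mean of each metric per number of delayed payments (what the lines show).
trend = toy_df.groupby("Num_of_Delayed_Payment")[
    ["Outstanding_Debt", "Changed_Credit_Limit"]
].mean()

# Change from one x value to the next; values near zero indicate a plateau.
step_change = trend.diff()

print(trend)
print(step_change)
```

Inspecting `step_change` on the real data is one way to check whether the leveling-off really begins around 20 delayed payments, rather than judging it by eye from the chart.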






#### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   Risk Mitigation: Identifying customers with increasing delayed payments and outstanding debts early can help implement measures like stricter credit limits or financial counseling.
*   Policy Optimization: Reviewing credit limit changes for high-risk customers can prevent default escalation.
*   If the observed increase in credit limits for customers with delayed payments represents a deliberate policy, it may lead to increased default risks. Prolonging credit to struggling customers can exacerbate their financial instability and harm the business.
*   Allowing high debt accumulation without addressing delayed payments can create a credit bubble, leading to losses in the long run. Tightening policies for high-risk customers might result in reduced short-term revenues but ensures long-term financial health for the company.





### B. Number of Cr Card Inquiries vs (Outstanding Debt and Credit History Age)

In [None]:
# dual axis line chart showing number of cr card inquiries and outstanding debt,credit history age
fig, ax1 = plt.subplots(figsize=(15, 6))

# plotting 1st axis
sns.lineplot(x='Num_Credit_Inquiries', y='Outstanding_Debt',marker='o',data=customer_df,
         linewidth=2,ax=ax1,color='blue',label='Outstanding Debt')
ax1.set_ylabel("Outstanding Debt of the Customer", color="blue")
ax1.tick_params(axis='y', labelcolor="blue")



ax2 = ax1.twinx()

# plotting 2nd axis
sns.lineplot(x='Num_Credit_Inquiries', y='Credit_History_Age',marker='o',linewidth=2, data = customer_df,ax=ax2,
         color='red',label='Credit History Age')
ax2.set_ylabel("Credit History Age of the Customer", color="red")
ax2.tick_params(axis='y', labelcolor="red")



plt.title('Number of Cr Card Inquiries vs (Outstanding Debt and Credit History Age)')


plt.tight_layout()
plt.show()



#### 1. Why did you pick the specific chart?

This dual-axis line chart was chosen to analyze the relationship between the number of credit inquiries, outstanding debt, and credit history age. The visualization highlights how customers' outstanding debts and credit history age vary as they inquire about credit more frequently. The dual-axis format is effective in comparing these trends side by side.

#### 2. What is/are the insight(s) found from the chart?



*   Outstanding debt starts low and increases sharply with more credit inquiries (up to around 7 inquiries), suggesting that customers making frequent inquiries are likely taking on more debt. Beyond this point, outstanding debt stabilizes and begins to decline slightly.

*   Credit history age decreases consistently as the number of credit inquiries increases. This indicates that customers with frequent inquiries are relatively newer borrowers with shorter credit histories.
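*   At approximately 7 credit inquiries, the two lines cross on the chart. Because they are plotted on separate y-axes, the crossing itself depends on axis scaling, but it coincides with the point where outstanding debt stops rising. This point may represent a transition where risk profiles change significantly (e.g., newer borrowers with increasing debt).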

#### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   Frequent credit inquiries combined with rising debt suggest potential financial distress. Customers with low credit history age and high inquiries should be flagged for further monitoring. Insights can guide policies for limiting additional credit to customers with high inquiries but short credit histories.

*   Providing credit to customers with frequent inquiries and increasing debt may lead to higher default rates, especially if their credit history age is short. Restricting credit based solely on high inquiries might exclude potentially reliable customers (e.g., individuals actively shopping for better financial products).
*   These insights emphasize the need for balanced credit policies. Evaluating creditworthiness should incorporate both the number of inquiries and other factors like repayment history or income stability to avoid negative growth while maintaining a positive business impact.






### C. Month vs (Number of Cr Card Inquiries and Cr Utilization Ratio)

In [None]:
# dual axis line chart showing month vs number of cr card inquiries and credit utilization ratio
fig, ax1 = plt.subplots(figsize=(15, 6))

# plotting 1st axis
sns.lineplot(x='Month_Name', y='Num_Credit_Inquiries',marker='o',data=customer_df,
         linewidth=2,ax=ax1,color='blue',label='Number of Cr Card Inquiries')
ax1.set_ylabel("Number of Cr Card Inquiries by the Customer", color="blue")
ax1.tick_params(axis='y', labelcolor="blue")
ax1.legend(loc='upper left')

ax2 = ax1.twinx()

# plotting 2nd axis
sns.lineplot(x='Month_Name', y='Credit_Utilization_Ratio',marker='o',linewidth=2, data = customer_df,ax=ax2,
         color='red',label='Credit Utilization Ratio')
ax2.set_ylabel("Credit Utilization Ratio", color="red")
ax2.tick_params(axis='y', labelcolor="red")
ax2.legend(loc='upper right')


plt.title('Month vs (Num_Credit_Inquiries and Credit_Utilization_Ratio)')


plt.tight_layout()
plt.show()



#### 1. Why did you pick the specific chart?

This dual-axis line chart was chosen to analyze how 'Number of Credit Card Inquiries by the Customer' and 'Credit Utilization Ratio' vary across different months. This chart enables comparison of two metrics that are critical for understanding borrowing behavior and risk trends over time.

#### 2. What is/are the insight(s) found from the chart?



*   The number of credit card inquiries shows a consistent increase from January to August, suggesting either a growing demand for credit or increased financial stress. The credit utilization ratio remains relatively stable, fluctuating slightly but generally staying in a narrow range between 32.15% and 32.45%. The peaks and troughs of the utilization ratio do not align directly with credit inquiries, indicating a weak correlation.
*   While credit inquiries increase steadily, the credit utilization ratio shows more fluctuations without a consistent trend, suggesting that customers may be managing their credit utilization independently of their inquiry frequency.
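
The "weak correlation" reading above comes from comparing the two lines visually. One way to put a number on it is to compute the monthly means and correlate the two resulting series, as sketched below. The toy data is invented for illustration. Month names are converted to an ordered categorical so that they sort chronologically rather than alphabetically. With only one point per month, the coefficient is a rough indication at best.

```python
import pandas as pd

month_order = ["January", "February", "March", "April"]

# Toy stand-in for customer_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "Month_Name": ["January", "January", "February", "February",
                   "March", "March", "April", "April"],
    "Num_Credit_Inquiries": [4, 6, 5, 7, 6, 8, 7, 9],
    "Credit_Utilization_Ratio": [32.2, 32.4, 32.1, 32.3,
                                 32.4, 32.4, 32.2, 32.2],
})

# Ordered categorical keeps the months in calendar order after grouping.
toy_df["Month_Name"] = pd.Categorical(
    toy_df["Month_Name"], categories=month_order, ordered=True
)

monthly = toy_df.groupby("Month_Name", observed=True)[
    ["Num_Credit_Inquiries", "Credit_Utilization_Ratio"]
].mean()

# Pearson correlation between the two monthly mean series.
corr = monthly["Num_Credit_Inquiries"].corr(monthly["Credit_Utilization_Ratio"])

print(monthly)
print(round(corr, 3))
```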



#### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   The upward trend in inquiries can help the business prepare for growing customer demand by ensuring adequate resources for credit evaluation. The stable credit utilization ratio indicates that most customers are maintaining responsible credit usage despite increasing inquiries, which can inform policies for approving loans. Insights can be used to segment customers into categories based on inquiry frequency and credit utilization behavior for targeted financial products.
*   If credit inquiries reflect financial stress rather than new borrowing opportunities, there could be a risk of defaults, especially if the trend continues over time. The weak correlation between credit utilization and inquiries may suggest that inquiries alone are not a sufficient indicator of financial health, requiring deeper analysis to avoid inaccurate credit decisions.
*   These insights highlight the need for balancing customer support with risk assessment. Monitoring trends in credit utilization alongside inquiries over more extended periods can help prevent mismanagement of resources or defaults.



### D: Interest Rate vs (Delay from Due Date and Number of Payments Delayed)

In [None]:
# dual axis line chart showing interest rate vs delay from due date and number of payments delayed
fig, ax1 = plt.subplots(figsize=(15, 6))

# plotting 1st axis
sns.lineplot(x='Interest_Rate', y='Delay_from_due_date',marker='o',data=customer_df,
         linewidth=2,ax=ax1,color='blue',label='Delay from Due Date')
ax1.set_ylabel("Number of Delayed Days from Due Date", color="blue")
ax1.tick_params(axis='y', labelcolor="blue")

ax2 = ax1.twinx()
# plotting 2nd axis
sns.lineplot(x='Interest_Rate', y='Num_of_Delayed_Payment',marker='o',linewidth=2, data = customer_df,ax=ax2,
         color='red',label='Number of Delayed Payments')
ax2.set_ylabel("Number of Delayed Payments", color="red")
ax2.tick_params(axis='y', labelcolor="red")

plt.title('Interest Rate vs (Delay from due date and Num of Delayed Payment)')

plt.tight_layout()
plt.show()



#### 1. Why did you pick the specific chart?

The chart was chosen because it effectively compares the relationship between interest rates and two metrics: (1) the number of delayed payments and (2) the number of delayed days from the due date. By using dual y-axes, the chart provides a clear way to visualize and interpret how these variables correlate with the interest rate. The line chart is particularly suitable for showing trends and changes in these metrics as the interest rate increases.

#### 2. What is/are the insight(s) found from the chart?



*   As the interest rate increases, there is a noticeable rise in both the number of delayed payments and the number of delayed days. This indicates that higher interest rates might make it harder for borrowers to make timely payments. Around a specific interest rate (likely between 15-20%), there is a sharp increase in the number of delayed payments and delayed days, suggesting a threshold where borrowers begin to struggle significantly.
*   Plateau effect: after a certain point, the metrics seem to stabilize despite the rising interest rates, implying that there might be an upper limit to delays regardless of the rate.
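
The threshold mentioned above ("likely between 15-20%") is estimated by eye from the chart. A hedged way to check it is to bin the interest rate into bands with `pd.cut`, average the delay metrics per band, and look for the band with the largest jump. The band edges and the toy data below are assumptions chosen only to demonstrate the approach.

```python
import pandas as pd

# Toy stand-in for customer_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "Interest_Rate": [3, 8, 12, 14, 17, 19, 24, 28],
    "Num_of_Delayed_Payment": [5, 7, 9, 11, 17, 19, 20, 20],
    "Delay_from_due_date": [8, 10, 12, 14, 26, 30, 30, 32],
})

# Hypothetical 5-point bands; intervals are right-inclusive, e.g. (15, 20].
bins = [0, 5, 10, 15, 20, 25, 30]
toy_df["Rate_Band"] = pd.cut(toy_df["Interest_Rate"], bins=bins)

band_means = toy_df.groupby("Rate_Band", observed=True)[
    ["Num_of_Delayed_Payment", "Delay_from_due_date"]
].mean()

# Band whose mean number of delayed payments rose most versus the previous band.
largest_jump_band = band_means["Num_of_Delayed_Payment"].diff().idxmax()

print(band_means)
print(largest_jump_band)
```

The same `diff()` output also helps with the plateau observation: bands where the change is close to zero are where the metrics have stabilized.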




#### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   By identifying the interest rate thresholds where delays increase significantly, lenders can adjust their interest policies or provide targeted interventions (e.g., payment reminders, financial counseling) to reduce delayed payments and defaults. Insights from this chart could help develop risk-adjusted interest rate models that balance profitability with borrower affordability.
*   Higher interest rates correlate with increased delays, which could lead to higher default rates and bad debts. This adversely impacts cash flow and customer retention.
*   Borrowers facing excessive delays may lose trust in the institution, leading to reputational damage and reduced market competitiveness.
*   Balancing interest rates to ensure profitability while minimizing delays is crucial. Identifying and acting on these insights enables businesses to design strategies that reduce risks and maintain borrower satisfaction, thereby fostering long-term growth.



### E: Number of Cr Card vs (Delay from Due Date and Number of Payments delayed)

In [None]:
# creating dual axis line chart to show number of credit card and number of delayed days and number of payment delays
fig, ax1 = plt.subplots(figsize=(15, 6))

# plotting 1st axis
sns.lineplot(x='Num_Credit_Card', y='Delay_from_due_date', data=customer_df.groupby('Num_Credit_Card')['Delay_from_due_date'].mean().round().reset_index(),
             ax=ax1, marker='o', color='blue',label='Delay from Due Date')
ax1.set_ylabel("Number of Delayed Days from Due Date", color="blue")
ax1.tick_params(axis='y', labelcolor="blue")

# Create a second y-axis sharing the same x-axis
ax2 = ax1.twinx()

# plotting 2nd axis sharing the same x-axis
sns.lineplot(x='Num_Credit_Card', y='Num_of_Delayed_Payment', data=customer_df.groupby('Num_Credit_Card')['Num_of_Delayed_Payment'].mean().round().reset_index(),
             ax=ax2, marker='o', color='red',label='Number of Delayed Payments')
ax2.set_ylabel("Number of Delayed Payments", color="red")
ax2.tick_params(axis='y', labelcolor="red")

plt.title('Number of Cr Card vs (Delay from due date and Num of Delayed Payment )')

plt.tight_layout()
plt.show()

#### 1. Why did you pick the specific chart?

This chart was selected because it compares the relationship between the number of credit cards owned by a customer and two critical metrics: (1) the number of delayed payments and (2) the number of delayed days from the due date. A dual-axis line chart provides a clear view of how these metrics change as the number of credit cards increases, offering insights into customer behavior based on their credit portfolio.

#### 2. What is/are the insight(s) found from the chart?



*   There is a general trend where an increase in the number of credit cards corresponds to higher delays in both payments and the number of delayed days. A significant rise is observed when the number of credit cards exceeds a certain level (around 6-8 credit cards), indicating a tipping point where customers may face difficulties managing multiple credit accounts. After 10 credit cards, the number of delayed payments slightly decreases while delayed days stabilize, suggesting that customers with very high credit card counts might have alternative strategies to manage payments or fewer active cards.






#### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   These insights can help financial institutions identify customers at risk of payment delays based on their number of credit cards. For example, customers with more than 6 cards may benefit from targeted interventions like reminders, flexible repayment plans, or credit consolidation offers. Credit card issuers can use this data to recommend optimal card limits or improve risk models for approving additional cards.
*   The increase in delayed payments and days with more credit cards indicates potential customer over-leveraging, which can lead to higher default rates and financial losses for institutions. Issuing too many credit cards to a single customer without proper risk assessment can negatively impact both the customer’s financial health and the lender’s portfolio quality.
*   Proactively managing and advising customers with high credit card counts can reduce delayed payments, foster loyalty, and improve financial outcomes for both the business and customers. However, failing to address this issue could lead to an increase in defaults and damage the lender's profitability and reputation.





## Chart 3: Annual Income Distribution (Univariate)

In [None]:
# creating a histogram to show distribution of annual income
plt.figure(figsize=(18,6))

sns.histplot(data=customer_df, x="Annual_Income",  kde = True, color  = 'red',hue='Credit_Score_Category')
plt.xlabel('Annual Income of the Customers')
plt.ylabel('Frequency of Customers')
plt.title('Distribution of Annual Income')
plt.show()

### 1. Why did you pick the specific chart?

This chart was chosen to analyze the distribution of annual income across different credit score categories (Good, Standard, and Poor). By combining a histogram with KDE (Kernel Density Estimation) curves for each category, the chart effectively illustrates both the frequency distribution and the relative income patterns of customers within each credit score group.

### 2. What is/are the insight(s) found from the chart?

*   Customers with Good credit scores are generally concentrated in higher income brackets, with their frequency peaking around ₹50,000 to ₹75,000. Customers with Standard credit scores dominate the mid-income range, with their peak frequency around ₹25,000 to ₹50,000. Customers with Poor credit scores are primarily in the lower-income brackets, peaking around ₹20,000 to ₹30,000.
*   The distribution for "Good" credit scores is more right-skewed, indicating higher-income customers are more likely to maintain a good credit score. "Poor" credit scores are tightly concentrated in lower-income ranges, suggesting a possible correlation between income constraints and poor credit behavior.
*   There is significant overlap in the mid-income range (₹25,000-₹50,000) between customers with Standard and Poor credit scores, suggesting that factors beyond income (e.g., spending habits, repayment behavior) influence the credit score.
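
The peaks and overlaps described above are read from the histogram. A per-category summary table is a simple numeric cross-check, sketched here with a `groupby` and `agg`. The toy incomes are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for customer_df; the values are invented for illustration only.
toy_df = pd.DataFrame({
    "Credit_Score_Category": ["Good"] * 3 + ["Standard"] * 3 + ["Poor"] * 3,
    "Annual_Income": [60000, 70000, 80000,
                      30000, 40000, 50000,
                      20000, 25000, 30000],
})

# Count, centre, and range of annual income within each credit score category.
income_summary = toy_df.groupby("Credit_Score_Category")["Annual_Income"].agg(
    ["count", "mean", "median", "min", "max"]
)

print(income_summary)
```

Comparing the `min` and `max` columns across categories shows how much the income ranges overlap, which is the point made in the last bullet above.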

### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.





*   The insights can help financial institutions better segment their customers. High-income customers with good credit scores could be targeted with premium credit cards or investment opportunities. Mid-income customers in the Standard category could benefit from personalized financial planning or credit improvement programs to boost their credit scores. Low-income customers in the Poor category can be offered credit counseling or lower-risk financial products to improve their financial health and reduce default risks. The data can also be used to refine lending criteria by incorporating income brackets into credit risk assessment models.
*   Customers in lower-income brackets with poor credit scores represent a higher default risk, which could lead to financial losses for lenders if these customers are over-leveraged or provided credit without adequate risk assessment. Targeting high-income customers exclusively might ignore opportunities for growth in the mid-income segments, potentially leading to a stagnation in customer base expansion.
*   Leveraging the data to design tailored financial products for different segments can enhance customer satisfaction, reduce credit risk, and increase profitability. However, failing to address the specific needs of at-risk customers in the low-income bracket could result in increased default rates, leading to negative financial impacts.








## Chart 4: Monthly Salary Discrepancy (Univariate)

In [None]:
# creating a boxplot to show distribution of monthly salary discrepancy
fig = ps.box(customer_df,y='Monthly_Salary_Discrepancy')
fig.show()

### 1. Why did you pick the specific chart?

This box plot was chosen because it effectively visualizes the distribution of the Monthly_Salary_Discrepancy variable. It highlights the central tendency (median), the spread of the data (quartiles), and the presence of any outliers. This type of visualization is ideal for identifying anomalies and understanding variability in numerical datasets.

###2. What is/are the insight(s) found from the chart?



*   The median salary discrepancy is 152, so half of the customers have a discrepancy at or below this value. The data is positively skewed, with outliers extending beyond the upper fence (453.13) and reaching a maximum of 1747.213. The first and third quartiles are 76 and 228, respectively, showing that the middle 50% of the data lies between these values. The lower fence and minimum value are both 0, so some customers have no discrepancy at all.
*   The presence of significant outliers suggests that some individuals may have abnormally high salary discrepancies. These could signal potential fraud, data errors, or unusual transactions.
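The quartiles and fences that a box plot draws can also be computed directly, which makes it possible to list the outlying rows rather than only see them. The following is a minimal sketch using Tukey's 1.5 × IQR rule; the helper `tukey_fences` and the toy values are hypothetical stand-ins for `customer_df['Monthly_Salary_Discrepancy']`, and plotting libraries may compute quartiles with a slightly different interpolation method than pandas does.

```python
import pandas as pd


def tukey_fences(values: pd.Series, k: float = 1.5) -> dict:
    """Return the quartiles, IQR, and Tukey fence limits of a numeric series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": values.median(),
        "q3": q3,
        "iqr": iqr,
        "lower_limit": q1 - k * iqr,
        "upper_limit": q3 + k * iqr,
    }


# Hypothetical toy values, not taken from the real dataset
toy = pd.Series([0, 50, 100, 150, 200, 250, 300, 1700],
                name="Monthly_Salary_Discrepancy")

limits = tukey_fences(toy)

# Rows outside the limits are the points a box plot would draw as outliers
outliers = toy[(toy < limits["lower_limit"]) | (toy > limits["upper_limit"])]

print(limits)
print(outliers.tolist())
```

On the real data, the same filter would return the customers whose discrepancies deserve a manual review.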



###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*  The identification of outliers can help the business flag potentially fraudulent or erroneous transactions for further investigation. This reduces financial risk and improves fraud detection mechanisms. Understanding the range and distribution of discrepancies allows for better calibration of risk assessment models, enhancing decision-making processes.
*   If a high percentage of outliers represents errors in data recording or systemic issues, it might reflect poor operational processes, leading to inefficiencies or mistrust among clients. Overreacting to minor discrepancies that are not truly fraudulent could alienate legitimate customers, harming the brand's reputation.
*  By analyzing the outliers closely, the business can determine whether these are genuine fraud cases, system errors, or operational anomalies. Proactive measures to address these issues can enhance trust and operational efficiency. However, misinterpreting these insights or failing to act on them might negatively affect growth.





## Chart 5: SSN Validity (Bivariate)

In [None]:
# Creating subplots
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1 - bar plot of mean outstanding debt by SSN validity, split by credit score category
sns.barplot(x='SSN_Validity',y='Outstanding_Debt',data=customer_df,ax=axes[0],
            estimator='mean',hue='Credit_Score_Category')
axes[0].set_xlabel('SSN Validity of the Customer')
axes[0].set_ylabel('Outstanding Debt of the Customer')
axes[0].set_title('Outstanding Debt by SSN Validity')
axes[0].get_legend().set_visible(False)

# Plot 2 - bar plot of mean number of credit inquiries by SSN validity, split by credit score category
sns.barplot(x='SSN_Validity',y='Num_Credit_Inquiries',data=customer_df,ax=axes[1],
            estimator='mean',hue='Credit_Score_Category')
axes[1].set_xlabel('SSN Validity of the Customer')
axes[1].set_ylabel('Number of Cr Card Inquiries by the Customer ')
axes[1].set_title('Number of Cr Card Inquiries by SSN Validity')
axes[1].legend(loc='upper left',bbox_to_anchor=(1, 1),title='Credit Score')

# Adjust layout
plt.tight_layout()

# Show plot
plt.show()


###1. Why did you pick the specific chart?

This combination of bar plots was chosen to compare Outstanding Debt and Number of Credit Card Inquiries based on SSN Validity and categorized by Credit Score. The chart provides a clear and intuitive visualization of the relationships among these variables. The grouping by SSN Validity and credit score allows us to identify patterns and contrasts efficiently.

###2. What is/are the insight(s) found from the chart?



*   Customers with Invalid SSNs have lower outstanding debt overall compared to those with Valid SSNs. Individuals with a Poor Credit Score consistently show the highest outstanding debt, regardless of SSN validity, suggesting that poor credit scores go hand in hand with higher debt accumulation.
*   Customers with Invalid SSNs also have fewer credit card inquiries overall. Similar to the debt trend, customers with a Poor Credit Score exhibit the highest number of credit inquiries for both valid and invalid SSNs. This indicates potential credit-seeking behavior or financial distress.
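The bar heights in both panels are group means, so the comparisons above can be checked numerically with a `groupby`. This is a minimal sketch on hypothetical toy rows; the column names follow the plotting code, while the category labels (`Valid`, `Invalid`, `Poor`, `Good`) and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would use customer_df instead
toy = pd.DataFrame({
    "SSN_Validity": ["Valid", "Valid", "Valid", "Invalid", "Invalid", "Invalid"],
    "Credit_Score_Category": ["Poor", "Good", "Poor", "Poor", "Good", "Good"],
    "Outstanding_Debt": [2000.0, 800.0, 2400.0, 1500.0, 600.0, 700.0],
    "Num_Credit_Inquiries": [9, 3, 11, 8, 2, 4],
})

# Mean of each measure per (SSN validity, credit score) group,
# with credit score categories spread across the columns
group_means = (
    toy.groupby(["SSN_Validity", "Credit_Score_Category"])[
        ["Outstanding_Debt", "Num_Credit_Inquiries"]
    ]
    .mean()
    .unstack("Credit_Score_Category")
)

print(group_means.round(1))
```

Each cell of the resulting table corresponds to one bar in the chart, which makes small differences between groups easier to read than from the figure alone.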



###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*  The clear distinction in behavior between customers with valid and invalid SSNs can help identify fraudulent accounts or customers attempting to manipulate the system. Customers with Poor Credit Scores and high debt levels or frequent inquiries can be flagged as high-risk, enabling the company to improve its lending policies and risk mitigation strategies. The trend of higher inquiries and debt among poor credit score groups provides an opportunity to design financial counseling or repayment assistance programs, enhancing customer trust and retention.
*   Over-reliance on SSN validity might lead to false positives, where legitimate customers with invalid SSNs (e.g., due to administrative errors) are excluded or treated unfairly. Customers with Poor Credit Scores and high inquiries may face stricter loan approvals, potentially driving them to competitors if alternative support mechanisms are not provided.
*   By leveraging these insights to balance risk management with customer support, the business can enhance both security and customer experience. However, failure to account for nuances (e.g., administrative errors in SSN validation) may lead to customer dissatisfaction and loss of market share.





## Chart 6: Age Category vs Loan Profile Validity (Bivariate with Categorical-Categorical)

In [None]:
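# creating a grouped bar chart of customer counts by age category and loan profile validity
# note: value_counts() names its output column 'count' only in pandas 2.0+
plt.figure(figsize=(15,6))
age_validity_counts = customer_df.loc[:,['Age_Category','Loan_Profile_Validity']].value_counts().reset_index()
sns.barplot(data=age_validity_counts,x='Age_Category',y='count',hue='Loan_Profile_Validity')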
plt.xlabel('Age Category')
plt.ylabel('Frequency of Customers')
plt.title('Age Category vs. Loan_Profile_Validity')
plt.legend(title='Loan_Profile_Validity', loc='upper left', bbox_to_anchor=(1, 1), fontsize=12, borderpad=2)

plt.show()

###1. Why did you pick the specific chart?

This grouped bar chart was selected because it provides a comparative view of Loan Profile Validity across different Age Categories, highlighting the presence of valid profiles versus those marked invalid due to student loans. This visualization is effective for identifying patterns in loan validity distribution across age groups and pinpointing potential target audiences or risk segments.

###2. What is/are the insight(s) found from the chart?



*   The majority of loan profiles are valid across all age categories, with Valid Loan Profiles outnumbering invalid ones. Young Adults (25–33) and Middle-Aged Adults (34–42) show the highest frequency of valid profiles, likely reflecting their peak earning and borrowing capacity. Young People (14–24) have a relatively high proportion of Invalid Profiles due to student loans compared to other age groups. This reflects their financial dependency on education loans, limiting their eligibility for other financial products. Among Older Adults (43–56), the frequency of invalid profiles is notably lower, indicating reduced reliance on student loans in this demographic.
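Because the age groups differ in size, claims about a "relatively high proportion" are easier to verify from row-normalized shares than from raw counts. The sketch below uses `pd.crosstab` with `normalize="index"`; the toy rows and the labels `Young People`, `Older Adults`, `Valid`, and `Invalid` are hypothetical and only illustrate the shape of the calculation.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would use customer_df instead
toy = pd.DataFrame({
    "Age_Category": ["Young People"] * 4 + ["Older Adults"] * 4,
    "Loan_Profile_Validity": ["Invalid", "Invalid", "Valid", "Valid",
                              "Invalid", "Valid", "Valid", "Valid"],
})

# Share of each validity label within every age category (rows sum to 1)
validity_shares = pd.crosstab(
    toy["Age_Category"],
    toy["Loan_Profile_Validity"],
    normalize="index",
)

print(validity_shares)
```

Comparing the `Invalid` column across rows gives the proportion of invalid profiles per age group, independent of how many customers each group contains.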




###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   Tailored financial products like student loan refinancing or education-specific credit options can be designed for Young People (14–24), addressing their unique needs and building long-term customer relationships. Focusing on Middle-Aged Adults (34–42) for premium loan products or investment schemes aligns with their lower invalidity rates and stable financial profiles. By identifying invalid profiles associated with student loans, the business can streamline risk assessment procedures and prioritize valid profiles for quicker approvals.
*   Overemphasis on rejecting Young People (14–24) due to their invalid profiles could alienate a significant customer base that may mature into valuable clients. If the company overlooks the financial needs of Young Adults (25–33) with valid profiles, competitors may attract this demographic, leading to lost opportunities.
*   The insights provide a clear roadmap for product development and risk management. However, any exclusionary practices targeting younger demographics (e.g., rejecting student loan holders outright) must be balanced with supportive initiatives, such as flexible repayment options or financial education programs, to prevent long-term negative impacts on growth.






## Chart 7: Age Category vs Outstanding Debt (Bivariate with Categorical-Continuous)

In [None]:
# creating a box plot to show age category and outstanding debt of the customer

plt.figure(figsize=(12,6))
sns.boxplot(x='Outstanding_Debt', y='Age_Category', data=customer_df)
plt.title('Age vs Outstanding Debt')
plt.xlabel('Outstanding Debt of the Customer')
plt.ylabel('Age Category')

# Show plot
plt.show()


###1. Why did you pick the specific chart?

This box plot was chosen because it effectively shows the distribution of outstanding debt across different age categories. It highlights variations, central tendencies, and outliers, making it easier to compare debt levels between age groups.

###2. What is/are the insight(s) found from the chart?



*   Older adults (43–56) have a higher median outstanding debt compared to other age groups. Younger people (14–24) have the lowest median debt. Older adults have the widest range of debt, indicating significant variability. Younger people and young adults (25–33) have relatively lower variability in debt. All age groups show outliers with higher debt values, especially prominent in middle-aged and older adults.
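The median and spread that each box encodes can be tabulated per age group, which puts numbers behind the comparisons above. This is a minimal sketch on hypothetical toy data; `interquartile_range` is an illustrative helper and the group labels are assumptions, not the dataset's actual category names.

```python
import pandas as pd


def interquartile_range(values: pd.Series) -> float:
    """Width of the box in a box plot: third quartile minus first quartile."""
    return values.quantile(0.75) - values.quantile(0.25)


# Hypothetical toy rows; the real notebook would use customer_df instead
toy = pd.DataFrame({
    "Age_Category": ["Young People"] * 4 + ["Older Adults"] * 4,
    "Outstanding_Debt": [100.0, 200.0, 300.0, 400.0,
                         500.0, 1000.0, 1500.0, 3000.0],
})

# One row per age category with the median and the IQR of outstanding debt
debt_summary = toy.groupby("Age_Category")["Outstanding_Debt"].agg(
    median="median",
    iqr=interquartile_range,
)

print(debt_summary)
```

A larger `iqr` value corresponds to a wider box, i.e. more variability in debt within that age group.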



###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   Older adults could benefit from tailored debt management solutions as they tend to have higher outstanding debts. Younger people may require educational initiatives or smaller loan offerings to build creditworthiness. Financial products can be designed to suit the age group’s debt profiles. For instance, flexible repayment terms for older adults or small, low-interest loans for younger people.
*   The presence of frequent outliers in debt could indicate potential risks of default, especially among older or middle-aged customers. If not addressed, this could lead to financial losses for the organization.
*   Understanding debt patterns ensures the institution can balance profitability with risk mitigation. For example, aggressive loan offerings to older adults without analyzing repayment capacity might result in negative outcomes.






## Chart 8: Age Category vs Number of Loans Taken (Bivariate with Categorical-Continuous)

In [None]:
# creating a bar chart to show age category and number of loans taken

plt.figure(figsize=(18,6))
sns.barplot(x='Age_Category', y='Num_of_Loan', data=customer_df,hue='Credit_Mix')
plt.title('Age category vs Number of Loans Taken')
plt.xlabel('Age Category')
plt.ylabel('Number of Loans Taken by Customer')
plt.legend(loc='upper left',bbox_to_anchor=(1, 1), fontsize=12, borderpad=2,title='Credit Mix')
plt.show()


###1. Why did you pick the specific chart?

The bar chart was selected because it clearly visualizes the relationship between age categories and credit mix categories in terms of the number of loans taken, making it easier to spot trends and anomalies. It is effective for comparing categories, especially when the focus is on distinct age and credit mix groupings. The grouped representation with colors differentiates the credit mix types, providing a holistic view of each age category’s behavior.

###2. What is/are the insight(s) found from the chart?



*  Across all age categories, the number of loans held by customers with a bad credit mix is disproportionately high compared to standard or good credit mixes. Younger groups (14–24 and 25–33) have a noticeable preference for bad credit mix loans. This may indicate poor financial planning or potentially risky behavior, making these groups prone to defaults. The older age group (43–56) shows a better balance, with an increased share of loans in the standard credit mix, although bad credit mix still dominates. This group has relatively lower loan counts across all categories, showing fewer risky loans compared to younger groups.



###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   Insights can help financial institutions design targeted interventions, such as offering younger groups education on maintaining a good credit mix or incentivizing them with products like secured loans to improve their mix. Older adults can be targeted with advisory services to convert bad credit mix loans into more stable products. The disproportionate number of bad credit mix loans among young individuals may also signal potential fraudulent activities or high-risk borrowers. Strict verification processes can mitigate such risks. Banks can segment customers by age and loan behavior to develop specialized credit packages that align with the needs of each group.
*  A persistent reliance on bad credit mix loans across all age groups might indicate underlying systemic issues in the lending process or poor financial literacy among borrowers. This could lead to increased loan defaults, which would negatively affect profitability. Younger groups showing poor credit behavior could result in a higher loan write-off rate. If not addressed, this could reduce trust in this customer segment and lead to cautious lending practices, ultimately impacting growth in the youth loan market.
*   The insights reveal actionable areas to improve financial health and reduce risk. However, failing to address the high bad credit mix could result in rising defaults and reduced profitability. By leveraging these insights, businesses can proactively manage risks and design solutions to drive sustainable growth.







## Chart 9: Occupation vs Loan Profile Validity (Bivariate with Categorical-Categorical)

In [None]:
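# creating a horizontal grouped bar chart of customer counts by occupation and loan profile validity
# note: value_counts() names its output column 'count' only in pandas 2.0+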

plt.figure(figsize=(12, 6))

sns.barplot(data=customer_df.loc[:,['Occupation','Loan_Profile_Validity']].value_counts().reset_index(),
            x='count',y='Occupation',hue='Loan_Profile_Validity')
plt.xlabel('Frequency of Customers')
plt.ylabel('Occupation of the Customer')
plt.title('Occupation vs Loan Profile Validity')
plt.legend(title='Loan Profile Validity', loc='upper left', bbox_to_anchor=(1, 1))

plt.show()


###1. Why did you pick the specific chart?

The horizontal grouped bar chart was chosen because it effectively visualizes the distribution of loan profile validity across different occupations. The grouped representation differentiates between valid and invalid profiles, offering a comparative view. Occupations are listed on the y-axis for better readability, especially when dealing with many categories. The horizontal layout ensures that longer labels (e.g., occupation names) do not overlap, maintaining clarity.





###2. What is/are the insight(s) found from the chart?



*   A significant portion of customers across all occupations has invalid profiles due to student loans being present. Professions like writers, musicians, and journalists have a higher ratio of invalid profiles compared to others. Conversely, occupations like engineers, lawyers, and doctors show a relatively higher count of valid loan profiles. The presence of student loans affects customers across both creative and technical professions, highlighting its pervasive impact.







###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   The data allows financial institutions to identify occupations where student loans are disproportionately high (e.g., writers, musicians). Tailored loan products, like student loan consolidation or refinancing, can be introduced for these segments. Insights help assess risk by occupation. For example, targeting professionals like doctors, lawyers, or engineers with additional financial products may lead to better returns due to their higher proportion of valid profiles. Occupation-specific campaigns can encourage repayment planning and proper loan structuring, particularly for groups with a higher percentage of invalid profiles.
*   A large portion of customers having invalid profiles indicates challenges in compliance or eligibility, possibly leading to defaults or poor repayment rates. This trend, if unchecked, can reduce lender profitability and discourage investment in these occupations. High dependency on student loans across professions reflects broader financial stress, which may limit customers' ability to take additional loans, affecting cross-selling opportunities.
*  The chart highlights actionable insights for creating specialized loan products and risk assessment frameworks. While it presents opportunities to address pain points, failing to act on these insights (e.g., tackling the high invalid profile ratio) could result in financial losses and reduced customer trust.













## Chart 10: Annual Income vs Credit Score (Bivariate with Categorical-Continuous)

In [None]:
# creating a box plot to show the distribution of annual income within each credit score category
sns.boxplot(x='Credit_Score_Category',y='Annual_Income',data=customer_df)
plt.title('Annual Income vs Credit Score')
plt.xlabel('Credit Score of Customer')
plt.ylabel('Annual Income of Customer')
plt.show()

###1. Why did you pick the specific chart?

The box plot was chosen to visualize the relationship between annual income and credit scores because it effectively displays the spread, median, and presence of outliers in the income distribution for different credit score categories ("Good," "Standard," and "Poor"). This allows for easy comparison of income levels across groups and highlights variations and trends.

###2. What is/are the insight(s) found from the chart?



*  Customers with "Good" credit scores tend to have higher annual incomes, with a wider range and higher median compared to the "Standard" and "Poor" credit score categories. Customers in the "Standard" credit score category show a slightly lower median income and a tighter range than those in the "Good" category. Customers with "Poor" credit scores have the lowest median income, with more outliers on the higher end of the income spectrum. These insights suggest a positive correlation between annual income and credit score quality, implying that higher incomes might be associated with better financial health and creditworthiness.
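One way to quantify the suggested positive correlation is to treat the credit score categories as ordered and compute a rank correlation with income. The sketch below is an illustration on hypothetical toy rows, not the notebook's own analysis: the ordinal encoding `Poor < Standard < Good` is an assumption, and Spearman's rho is obtained as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would use customer_df instead
toy = pd.DataFrame({
    "Credit_Score_Category": ["Poor", "Poor", "Standard", "Standard", "Good", "Good"],
    "Annual_Income": [20000, 35000, 30000, 50000, 60000, 90000],
})

# Assumed ordinal encoding of the credit score categories
score_order = {"Poor": 0, "Standard": 1, "Good": 2}
score_level = toy["Credit_Score_Category"].map(score_order)

# Spearman's rho equals the Pearson correlation of the (average) ranks
rho = score_level.rank().corr(toy["Annual_Income"].rank())

# Median income per category, matching the center line of each box
median_income = toy.groupby("Credit_Score_Category")["Annual_Income"].median()

print(round(rho, 3))
print(median_income)
```

A value of `rho` close to 1 would support the reading that income rises with credit score category, while a value near 0 would suggest the box plot differences are weak.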








###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   The insights can help businesses like Paisa Bazaar tailor financial products (e.g., loans or credit cards) for customers based on their credit scores and income brackets. For instance, premium financial products could be targeted toward customers with "Good" credit scores and higher incomes. Identifying that "Poor" credit score customers typically have lower incomes can guide risk management teams to design products with stricter risk mitigation strategies for this segment.
*   Overemphasis on higher-income, good credit score customers might alienate the "Poor" credit score group, reducing market inclusivity and overall business growth potential. The presence of high-income outliers in the "Poor" credit category suggests there could be valuable customers who are being overlooked due to traditional credit score reliance.
*   By balancing inclusive financial offerings while managing risk, businesses can maximize customer reach and maintain sustainable growth. These insights provide a foundation to create tailored strategies that minimize potential negative impacts.



## Chart 11: Monthly Salary vs Number of Loans Taken (Bivariate with Categorical-Continuous)

In [None]:
# bar chart showing monthly salary relation with number of loans customer having

plt.figure(figsize=(12,6))

sns.barplot(x='Num_of_Loan', y='Monthly_Inhand_Salary', data=customer_df, estimator = 'mean',hue='Credit_Mix')
plt.xlabel('Number of Loans Customer Having')
plt.ylabel('Monthly Inhand Salary of the Customer')
plt.title('Monthly Salary vs Number of Loans')

plt.show()



###1. Why did you pick the specific chart?

The bar chart was chosen because it clearly illustrates the relationship between monthly salary and the number of loans a customer has, with a focus on different credit mix categories. This is important for understanding how customers' financial behavior (in terms of loans and salary) is affected by their credit mix (Standard, Good, or Bad). The use of different colors (for credit mix) in the hue helps segment the data and compare how monthly salary changes with the number of loans across different types of credit mix.

###2. What is/are the insight(s) found from the chart?



*   Standard Credit Mix (blue) consistently shows a higher monthly salary across all loan counts, with salary remaining stable or slightly decreasing as the number of loans increases.
*   Good Credit Mix (orange) also shows high monthly salaries, with the highest salary at 2 loans, then a slight decrease as the number of loans increases, suggesting a potential balancing between salary and loan capacity.
*   Bad Credit Mix (green) shows the lowest monthly salaries, with significant drops in salary as the number of loans increases. This could indicate that customers with poor credit mixes are taking more loans despite having lower incomes, which could lead to financial stress or indicate riskier borrowing behavior.
*   Customers with more loans (5 or more) tend to have relatively lower monthly salaries across all credit mix types, suggesting potential financial strain as loan numbers increase.





###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   Customers with a Bad Credit Mix and multiple loans but low monthly salary should be flagged for closer scrutiny to prevent over-leveraging and financial stress. A strategy could be to offer lower-risk products to customers with a good or standard credit mix, while providing more regulated loans or financial counseling to customers with poor credit mixes. By understanding that customers with good credit mixes tend to manage their finances more effectively, businesses can target them with higher-value loans or premium financial products. Businesses can use this data to adjust interest rates, loan limits, or repayment plans based on credit mix and salary patterns, enhancing overall portfolio health.
*   Customers with a Bad Credit Mix and multiple loans are at higher risk of default or fraud, especially when loans outweigh their salaries. Over-focusing on Good or Standard Credit Mix customers may exclude legitimate borrowers facing temporary challenges, limiting market reach. While stricter credit policies are crucial to managing risk, they must balance inclusion to avoid alienating lower-salary or high-loan customers. These insights enable precise loan offerings but require careful management to mitigate risks and ensure sustainable growth.



## Chart 12: Salary Discrepancy Category vs Outstanding Debt (Bivariate with Categorical-Continuous)

In [None]:
# bar chart showing salary discrepancy and outstanding debt of the customer
plt.figure(figsize=(15,6))
sns.barplot(x='Outstanding_Debt', y='Monthly_Salary_Discrepancy_Category', data=customer_df, estimator = 'mean',hue='Credit_Score_Category')
plt.title('Monthly Salary Discrepancy vs Outstanding Debt')
plt.xlabel('Outstanding Debt of the Customer')
plt.ylabel('Monthly Salary Discrepancy Category')
plt.show()

###1. Why did you pick the specific chart?

The horizontal grouped bar chart was chosen to effectively compare the relationship between Monthly Salary Discrepancy Categories and Outstanding Debt, segmented by Credit Score Categories (Good, Standard, Poor). This format highlights patterns and disparities across multiple variables, making it easier to identify trends in outstanding debt levels relative to salary discrepancies and creditworthiness.

###2. What is/are the insight(s) found from the chart?



*   Customers with a Poor Credit Score generally have higher outstanding debt across all salary discrepancy categories. Those with Good Credit Scores have the lowest outstanding debt, even in higher salary discrepancy categories.
*   In the Severe and High Discrepancy categories, customers with a 'Poor' credit score exhibit the largest outstanding debts, indicating financial strain or mismanagement. In the No or Minimal Discrepancy categories, customers across all credit score categories show significantly lower debt levels, suggesting better alignment of income and expenses.
*   Poor credit score customers accumulate more debt even when their income is misaligned, while customers with good credit scores manage their debts effectively regardless of discrepancies.









###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   The chart enables targeted identification of high-risk customers (e.g., Poor Credit Scores with Severe Salary Discrepancy). By flagging these cases, businesses can implement stricter lending criteria or offer financial assistance programs to mitigate default risks. Insights can guide the development of products tailored to customers with Minimal or No Discrepancy, encouraging stable borrowers to access premium loan products.
*   Overly conservative lending policies for customers with Poor Credit Scores and high discrepancies could alienate potential borrowers who might recover financially. Focusing on safer customer segments (e.g., good credit scores with minimal discrepancies) may limit the company's market share, especially in underserved or emerging segments.
*   The insights enable precise risk profiling and resource allocation, improving business efficiency. However, balancing risk aversion with customer inclusivity is essential to prevent alienating struggling yet recoverable borrowers.












## Chart 13: Number of Bank Accounts vs Outstanding Debt (Bivariate with Categorical-Continuous)

In [None]:
# creating a bar chart to show number of bank accounts customer have and the outstanding debt of the Customer
plt.figure(figsize=(15,6))
sns.barplot(data=customer_df,x='Num_Bank_Accounts',y='Outstanding_Debt',hue='high_risk_loan_flag_14_24')
plt.xlabel('Number of Bank Accounts Customers Having')
plt.ylabel('Outstanding Debt of the Customer')
plt.title('Number of Bank A/C vs. Outstanding Debt')
plt.legend(loc='upper left',bbox_to_anchor=(1, 1), fontsize=12, borderpad=2,title='Loan Risk Category')


plt.show()

###1. Why did you pick the specific chart?

This bar chart was chosen because it effectively visualizes the relationship between the number of bank accounts customers have and their outstanding debt, segmented by loan risk category (low-risk vs. high-risk loans). The dual bars for each account number clearly distinguish patterns for high-risk and low-risk loans, making it easier to spot trends and differences.

###2. What is/are the insight(s) found from the chart?



*   Customers with more than six bank accounts tend to have higher outstanding debt, especially in the high-risk loan category (red flag loans). For customers with six or fewer accounts, outstanding debt remains relatively consistent and lower across both risk categories. The gap between high-risk loans and low-risk loans widens significantly as the number of accounts increases, indicating that customers with more accounts are more likely to default or take on higher-risk loans.
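The six-account threshold mentioned above can be tested by bucketing customers and comparing mean debt per risk flag. This is a rough sketch on hypothetical toy rows: the flag labels `Red Flag` and `No Red Flag`, the helper column `Account_Bucket`, and all values are assumptions for illustration, since the actual encoding of `high_risk_loan_flag_14_24` is not shown here.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would use customer_df instead
toy = pd.DataFrame({
    "Num_Bank_Accounts": [2, 4, 6, 7, 8, 9, 3, 8],
    "Outstanding_Debt": [500.0, 700.0, 600.0, 2000.0, 2600.0, 1500.0, 900.0, 1300.0],
    "high_risk_loan_flag_14_24": ["No Red Flag", "Red Flag", "No Red Flag", "Red Flag",
                                  "Red Flag", "No Red Flag", "Red Flag", "No Red Flag"],
})

# Split customers at the six-account threshold discussed in the insights
toy["Account_Bucket"] = toy["Num_Bank_Accounts"].gt(6).map(
    {True: "More than 6", False: "6 or fewer"}
)

# Mean outstanding debt per bucket, one column per risk flag
bucket_means = (
    toy.groupby(["Account_Bucket", "high_risk_loan_flag_14_24"])["Outstanding_Debt"]
    .mean()
    .unstack()
)

# Difference between flagged and unflagged customers within each bucket
debt_gap = bucket_means["Red Flag"] - bucket_means["No Red Flag"]

print(bucket_means)
print(debt_gap)
```

If the gap in the "More than 6" bucket is clearly larger than in the "6 or fewer" bucket on the real data, that would support the widening pattern read from the chart.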





###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*  The insights can help in risk assessment models. For example, focus on customers with more than six bank accounts as they show a higher probability of being flagged for high-risk loans and design personalized repayment plans or introduce monitoring mechanisms for this segment to reduce defaults.
*   If these trends are not addressed, customers with higher outstanding debts and more accounts may lead to higher default rates, which could negatively impact the business. Blanket restrictions on providing loans to customers with multiple accounts may alienate potential low-risk customers, causing a loss of profitable business opportunities.
*   The insights justify targeted strategies to manage high-risk customers while still supporting low-risk ones. This dual approach ensures that the business minimizes risk without losing out on potential revenue opportunities.







## Chart 14: Number of Credit Card vs Credit Score (Bivariate with Categorical-Categorical)

In [None]:
# creating a bar chart to show number of credit card and credit score

plt.figure(figsize=(15,6))

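# note: value_counts() names its output column 'count' only in pandas 2.0+
credit_card_counts = customer_df.loc[:,['Credit_Score_Category','Num_Credit_Card']].value_counts().reset_index()
sns.barplot(data=credit_card_counts,x='Num_Credit_Card',y='count',hue='Credit_Score_Category')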
plt.xlabel('Number of Credit Cards Customers Having')
plt.ylabel('Frequency of Customers')
plt.title('Credit_Score vs. Number_credit_card')
plt.show()

###1. Why did you pick the specific chart?

This chart was selected to explore the relationship between the number of credit cards customers have and their credit score category (Good, Standard, Poor). The grouped bar representation allows for easy comparison of how the frequency of customers in different credit score categories changes as the number of credit cards increases.

###2. What is/are the insight(s) found from the chart?



*   The majority of customers fall under the Standard credit score category across all credit card counts, peaking for 3–6 credit cards. This suggests that having a moderate number of credit cards correlates with maintaining a standard credit score. Customers with good credit scores are significantly fewer compared to other categories, especially as the number of credit cards increases. This could indicate that excessive credit card ownership might make it challenging to maintain a good credit score. Poor credit scores dominate for customers owning 7 or more credit cards, showing a possible trend of financial overextension or mismanagement.








###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   Customers with Standard credit scores and moderate credit card ownership can be targeted for credit improvement plans to help them move into the Good category. Customers with Poor credit scores and high credit card counts can be offered tailored debt consolidation or advisory services to reduce the risk of defaults. These insights can help identify customers at risk (those with Poor credit scores and high credit card counts), enabling banks to implement stricter credit checks or modify lending criteria.
*   Focusing too heavily on customers with Poor credit scores for high-risk products (like loans) without proper assessment may increase default risks and lead to financial losses. Alienating customers in the Standard credit score category by offering overly conservative credit terms could hinder growth opportunities.
*  The chart supports the development of strategies to improve customer creditworthiness, optimize lending policies, and manage risks effectively, which can create positive business outcomes if implemented thoughtfully.






## Chart 15: Number of Credit Card vs Outstanding Debt (Bivariate with Categorical-Continuous)

In [None]:
# Grouped bar chart of mean outstanding debt by number of credit cards, split by loan risk flag
plt.figure(figsize=(15,6))
sns.barplot(x='Num_Credit_Card', y='Outstanding_Debt', data=customer_df,estimator = 'mean',hue='high_risk_loan_flag_14_24')

plt.title('Number of Cr Card vs Outstanding Debt')
plt.xlabel('Number of Cr card held by Customer ')
plt.ylabel('Outstanding Debt')
plt.legend(loc='upper left',bbox_to_anchor=(1, 1), fontsize=12, borderpad=2,title='Loan Risk Category')

# Show plot
plt.show()

###1. Why did you pick the specific chart?

This chart compares outstanding debt by the number of credit cards held by customers, split by loan risk category (red flag or no red flag). A grouped bar chart is well suited to visualizing two categorical variables (Loan Risk Category and Number of Cards) against a numerical variable (Outstanding Debt). It shows trends in outstanding debt for both high-risk and low-risk loans across different levels of credit card ownership.



###2. What is/are the insight(s) found from the chart?



*   Customers with a greater number of credit cards (6–11) tend to have higher outstanding debt levels.
*   As the number of credit cards increases, the outstanding debt associated with high-risk loans (red flag) consistently surpasses that of low-risk loans (no red flag).
*   For customers holding 1–4 credit cards, both high-risk and low-risk loans have relatively lower outstanding debt, indicating better financial management or less exposure to debt.
*   Starting from 5 credit cards, the gap between high-risk and low-risk loan debt becomes pronounced, suggesting that customers with many cards are at greater risk of default.
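
The gap between high-risk and low-risk debt described above can be read off a pivot table of group means. This is a sketch on made-up data; the flag labels `'Red Flag'` and `'No Red Flag'` are assumptions about how `high_risk_loan_flag_14_24` is encoded.

```python
import pandas as pd

# Hypothetical toy data; the flag labels are assumed, not taken from the dataset
demo_df = pd.DataFrame({
    'Num_Credit_Card': [2, 2, 2, 2, 7, 7, 7, 7],
    'high_risk_loan_flag_14_24': ['No Red Flag', 'No Red Flag',
                                  'Red Flag', 'Red Flag'] * 2,
    'Outstanding_Debt': [500.0, 700.0, 600.0, 800.0,
                         1500.0, 1700.0, 2400.0, 2600.0],
})

# Mean outstanding debt per card count and risk flag (the values the bars show)
mean_debt = demo_df.pivot_table(index='Num_Credit_Card',
                                columns='high_risk_loan_flag_14_24',
                                values='Outstanding_Debt',
                                aggfunc='mean')

# Difference between flagged and unflagged customers at each card count
mean_debt['gap'] = mean_debt['Red Flag'] - mean_debt['No Red Flag']
print(mean_debt)
```

A `gap` column that grows with the card count would support the claim that the difference becomes pronounced at higher levels of card ownership.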






###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   Insights can guide financial institutions to implement stricter loan eligibility criteria for customers with many credit cards, reducing exposure to high-risk loans. Customers with 6+ credit cards and low-risk loans can be targeted with financial planning tools to avoid slipping into high-risk categories. Debt restructuring or repayment plans can be offered to customers already in the high-risk category to improve recovery rates. Banks can develop customized credit card limits or loan offers based on outstanding debt trends and credit card usage.
*   Excessive focus on mitigating high-risk loans might lead to neglect of low-risk customers, who represent a more stable revenue stream. Overly cautious lending policies for individuals with moderate debt and credit card counts could result in lost opportunities for issuing profitable loans.
*   The chart provides actionable insights into customer debt behavior across loan risk categories and credit card usage. Careful application of these insights can help maintain portfolio health while capitalizing on lending opportunities. Balancing risk mitigation and opportunity capture is key to avoiding negative growth.





## Chart 16:  Number of Loans Taken vs Credit Score (Bivariate with Categorical-Categorical)



In [None]:
# creating a bar chart showing number of loans taken vs credit score

plt.figure(figsize=(15, 6))

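# count customers per (credit score, number of loans) pair;
# naming the column explicitly keeps it 'count' regardless of pandas version
score_loan_counts = customer_df[['Credit_Score_Category','Num_of_Loan']].value_counts().reset_index(name='count')
sns.barplot(data=score_loan_counts,x='Num_of_Loan',y='count',hue='Credit_Score_Category')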
plt.xlabel('Number Of Loans Taken By The Customer')
plt.ylabel('Frequency of Customers')
plt.title('Credit_Score vs. Num_of_Loan')
plt.legend(loc='upper left',bbox_to_anchor=(1, 1), fontsize=12, borderpad=2,title='Credit Score')

plt.show()


###1. Why did you pick the specific chart?

This bar chart was chosen to compare the number of loans taken by customers across different credit score categories (Standard, Good, and Poor). It clearly visualizes the frequency distribution of customers for each category, making it easy to observe patterns and relationships between credit score levels and loan-taking behavior.

###2. What is/are the insight(s) found from the chart?



*   Customers with a Standard credit score form the majority across all loan categories, particularly for 0 to 4 loans.
*   The proportion of customers with a Good credit score decreases as the number of loans increases, indicating they are more likely to manage fewer loans.
*   Customers with a Poor credit score are more frequent in higher loan categories (6+ loans), suggesting that taking multiple loans might correlate with a higher risk of poor credit scores.
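
The statement that the Good share falls as the number of loans rises can be tested directly instead of being read off bar heights. A minimal sketch on made-up data, assuming the same column names as the chart code:

```python
import pandas as pd

# Hypothetical toy data standing in for customer_df (illustration only)
demo_df = pd.DataFrame({
    'Num_of_Loan': [0, 0, 0, 0, 3, 3, 3, 3, 6, 6, 6, 6],
    'Credit_Score_Category': ['Good', 'Good', 'Standard', 'Poor',
                              'Good', 'Standard', 'Standard', 'Poor',
                              'Poor', 'Poor', 'Standard', 'Poor'],
})

# Fraction of customers with a Good score at each loan count:
# the mean of a boolean column is the share of True values
is_good = demo_df['Credit_Score_Category'].eq('Good')
good_share = is_good.groupby(demo_df['Num_of_Loan']).mean()

# True when the share never increases from one loan count to the next
trend_holds = good_share.is_monotonic_decreasing
print(good_share, trend_holds)
```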





###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


*   The insights can guide targeted financial strategies. For example, customers with a Standard credit score represent a stable and low-risk segment for offering additional financial products. Good credit score customers, who tend to take fewer loans, can be approached with exclusive, smaller loan products or incentives to boost engagement. Identifying customers in the Poor credit score group with many loans can help prioritize debt recovery strategies or design customized repayment plans.
*   Customers with Poor credit scores and multiple loans represent a higher risk. If not managed effectively, this segment could lead to increased defaults, negatively affecting revenue. Proactive risk assessment and credit monitoring are essential to mitigate this risk.






## Chart 17: Number of Loans Taken (Bivariate)

In [None]:
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# plot 1: bar chart showing number of loan and number of delayed payments
sns.barplot(x='Num_of_Loan', y='Num_of_Delayed_Payment', data=customer_df,estimator='mean',ax=axes[0],hue='Credit_Mix')
axes[0].set_title('Number of Loans vs Number of Delayed Payments')
axes[0].set_xlabel('Number of Loans Taken by Customer')
axes[0].set_ylabel('Number of Delayed Payments by Customer')
axes[0].get_legend().set_visible(False)

# plot 2: bar chart showing number of loan and number of credit card inquiries
sns.barplot(x='Num_of_Loan', y='Num_Credit_Inquiries', data=customer_df,estimator='mean',ax=axes[1],hue='Credit_Mix')
axes[1].set_title('Number of Loans vs Number of Cr Card Inquiries')
axes[1].set_xlabel('Number of Loans Taken by Customer')
axes[1].set_ylabel('Number of Cr Card Inquiries by Customer')

plt.legend(loc='upper left',bbox_to_anchor=(1, 1), fontsize=12, borderpad=2,title='Credit Mix')

# Adjust layout
plt.tight_layout()

# Show plot
plt.show()



###1. Why did you pick the specific chart?

The first chart, "Number of Loans vs Number of Delayed Payments", was chosen to understand how the frequency of delayed payments varies with the number of loans across different credit mixes (Standard, Good, and Bad).
The second chart, "Number of Loans vs Number of Credit Card Inquiries", was selected to explore the relationship between the number of loans and credit card inquiries for customers with varying credit mixes. Both charts help analyze customer behavior patterns based on their credit mix and loan activity.


###2. What is/are the insight(s) found from the chart?



*   Customers with a Bad Credit Mix consistently show the highest number of delayed payments across all loan categories, emphasizing higher financial risk. Customers with a Good Credit Mix have significantly fewer delayed payments, indicating better loan repayment discipline. Customers with a Standard Credit Mix show moderate delays but an increasing trend with the number of loans.
*   Customers with a Bad Credit Mix have more credit card inquiries, which might indicate financial instability or aggressive loan-seeking behavior. Customers with a Good Credit Mix exhibit the fewest credit card inquiries, implying responsible financial management. A steady increase in credit card inquiries is observed as the number of loans rises, particularly for the Bad Credit Mix category.
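
An "increasing trend with the number of loans" can be quantified per credit mix with a rank correlation. The sketch below computes Spearman's coefficient as the Pearson correlation of ranks, which avoids any dependency beyond pandas; the data is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for customer_df (illustration only)
demo_df = pd.DataFrame({
    'Credit_Mix': ['Bad'] * 4 + ['Good'] * 4,
    'Num_of_Loan': [1, 2, 3, 4, 1, 2, 3, 4],
    'Num_of_Delayed_Payment': [10, 14, 15, 20, 6, 4, 5, 3],
})

def rank_corr(group):
    # Spearman correlation = Pearson correlation computed on the ranks
    return group['Num_of_Loan'].rank().corr(group['Num_of_Delayed_Payment'].rank())

# One coefficient per credit mix: +1 means delays rise steadily with loans
trend = {mix: rank_corr(group) for mix, group in demo_df.groupby('Credit_Mix')}
print(trend)
```

Values near +1 indicate a steady increase, values near 0 indicate no monotonic trend, and negative values indicate a decrease.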






###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   The insights can help target and segment customers more effectively. For example, customers with a Good Credit Mix can be prioritized for premium financial products and lower interest rates, fostering loyalty and profitability. Customers with a Bad Credit Mix and high delayed payments can be flagged for risk management strategies, including stricter loan approval processes or customized repayment plans to minimize defaults.
*   Customers with a Bad Credit Mix showing frequent delayed payments and high credit card inquiries pose a risk of defaults. If not monitored, they could lead to revenue loss. Increased credit card inquiries from this group might reflect credit-seeking desperation, suggesting the need for stricter credit checks and better education on responsible borrowing.





## Chart 18: Loan Risk Category (Bivariate)

In [None]:
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# plot 1: bar chart showing loan risk category and number of delayed payments
sns.barplot(x='high_risk_loan_flag_14_24',y='Num_of_Delayed_Payment',data=customer_df,hue='Credit_Score_Category',ax=axes[0])
axes[0].set_title('Loan Risk Category vs Number of Delayed Payments')
axes[0].set_xlabel('Loan Risk Category')
axes[0].set_ylabel('Number of Delayed Payments by Customer')

# hiding legend of chart 1 to make visuals clear
axes[0].get_legend().set_visible(False)

# plot 2: bar chart showing loan risk category of delayed days from due date
sns.barplot(x='high_risk_loan_flag_14_24',y='Delay_from_due_date',data=customer_df,hue='Credit_Score_Category',ax=axes[1])
axes[1].set_title('Loan Risk Category vs Delay from Due Date')
axes[1].set_xlabel('Loan Risk Category')
axes[1].set_ylabel('Average Number of Days Delayed from Due Date')

# adjust legend
plt.legend(loc='upper left',bbox_to_anchor=(1, 1), fontsize=12, borderpad=2,title='Credit Score')


# Adjust layout
plt.tight_layout() # Automatically Adjust

# Show plot
plt.show()


###1. Why did you pick the specific chart?

The first chart, "Loan Risk Category vs Number of Delayed Payments", was chosen to evaluate the relationship between loan risk categories (High-Risk or Low-Risk) and the frequency of delayed payments across customers with varying credit scores (Good, Standard, and Poor).
The second chart, "Loan Risk Category vs Delay from Due Date", was selected to analyze how the average number of days delayed from the due date differs across loan risk categories and customer credit scores. These insights are critical for understanding risk and repayment behaviors.


###2. What is/are the insight(s) found from the chart?



*   In high-risk loans, customers with a Poor Credit Score have the highest number of delayed payments, indicating significant risk. Customers with a Standard Credit Score show moderate delays, while those with a Good Credit Score have the fewest. In low-risk loans, customers with a Poor Credit Score still have more delayed payments than the other groups, but overall delays are fewer than in high-risk loans.
*   Customers with a Poor Credit Score have the longest average delay from the due date in both risk categories, exceeding 30 days for high-risk loans. Customers with a Good Credit Score generally repay on time or with minimal delays, even for high-risk loans, showing better repayment discipline.
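
The 30-day observation can be checked by computing the group averages behind the second chart and comparing them with the threshold. A sketch on made-up data; the flag labels are assumptions about how `high_risk_loan_flag_14_24` is encoded.

```python
import pandas as pd

# Hypothetical toy data; the flag labels are assumed, not taken from the dataset
demo_df = pd.DataFrame({
    'high_risk_loan_flag_14_24': ['Red Flag'] * 4 + ['No Red Flag'] * 4,
    'Credit_Score_Category': ['Poor', 'Poor', 'Good', 'Good'] * 2,
    'Delay_from_due_date': [35, 33, 12, 10, 28, 26, 9, 7],
})

# Average delay per (risk flag, credit score) group
avg_delay = (demo_df
             .groupby(['high_risk_loan_flag_14_24', 'Credit_Score_Category'])['Delay_from_due_date']
             .mean()
             .rename('avg_delay_days')
             .reset_index())

# Mark the groups whose average delay exceeds 30 days
avg_delay['over_30_days'] = avg_delay['avg_delay_days'] > 30
print(avg_delay)
```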




###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   These insights enable the business to implement a risk-based pricing model, offering better loan terms to customers with Good Credit Scores while imposing stricter conditions for those with Poor Credit Scores. By identifying high-risk customers (e.g., Poor Credit Score in high-risk loans), proactive measures such as enhanced monitoring or financial counseling can reduce default rates and operational losses.
*   Customers with a Poor Credit Score and frequent delays (especially in high-risk loans) could lead to increased non-performing assets (NPAs) if not managed effectively. Delays exceeding 30 days may necessitate higher provisions for bad loans, impacting profitability.
*   A focused approach to improving risk management for high-risk loans while leveraging good credit customers for cross-selling opportunities can create a balance between growth and risk mitigation. However, overlooking repayment behavior in low-risk loans for poor credit customers could lead to a misestimation of risks.





## Chart 19: Loan Risk Category vs number of Cr Card Inquiries (Bivariate with Categorical-Continuous)

In [None]:
# creating a bar chart to show loan risk category and credit card inquiries

plt.figure(figsize=(12,6))

sns.barplot(x='high_risk_loan_flag_14_24',y='Num_Credit_Inquiries',data=customer_df,hue='Credit_Score_Category')
plt.title('Loan Risk Category vs Number of Cr card Inquiries')
plt.xlabel('Loan Risk Category')
plt.ylabel('Number of Cr card Inquiries by Customer')
plt.legend(loc='upper left',bbox_to_anchor=(1, 1), fontsize=12, borderpad=2,title='Credit Score')

plt.show()

###1. Why did you pick the specific chart?

The bar chart was chosen because it effectively compares the relationship between loan risk categories (high-risk vs. low-risk loans) and the number of credit card inquiries across different credit score segments (Good, Standard, Poor). This format visually represents variations and trends, making it easier to identify patterns and anomalies across categories.

###2. What is/are the insight(s) found from the chart?



*   Customers with a poor credit score tend to have significantly higher credit card inquiries, regardless of the loan risk category, compared to those with standard or good credit scores.
*   For high-risk loans (red flag), customers with poor credit scores exhibit the highest average inquiries, followed by those with standard scores, while customers with good scores have the lowest.
*   Similarly, for low-risk loans (no red flag), customers with poor credit scores still have the most inquiries, but the gap between credit score categories is narrower than for high-risk loans.
*   These insights suggest that credit card inquiry patterns are strongly associated with both credit score and loan risk classification.
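
Whether a gap between bars is meaningful depends on the uncertainty around each mean. seaborn's `barplot` draws a bootstrap confidence interval by default; the sketch below swaps in a simpler normal approximation (mean ± 1.96 standard errors) to produce comparable numbers in a table. The data and flag labels are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the flag labels are assumed, not taken from the dataset
demo_df = pd.DataFrame({
    'high_risk_loan_flag_14_24': ['Red Flag'] * 8 + ['No Red Flag'] * 8,
    'Credit_Score_Category': (['Poor'] * 4 + ['Good'] * 4) * 2,
    'Num_Credit_Inquiries': [8, 10, 9, 9, 3, 4, 2, 3,
                             6, 7, 6, 7, 3, 3, 2, 4],
})

# Mean, standard error and group size for every bar in the chart
summary = (demo_df
           .groupby(['high_risk_loan_flag_14_24', 'Credit_Score_Category'])['Num_Credit_Inquiries']
           .agg(['mean', 'sem', 'count']))

# Approximate 95% interval (normal approximation, not seaborn's bootstrap)
summary['ci_low'] = summary['mean'] - 1.96 * summary['sem']
summary['ci_high'] = summary['mean'] + 1.96 * summary['sem']
print(summary.round(2))
```

Intervals that do not overlap between two groups suggest that the difference in average inquiries is unlikely to be sampling noise.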






###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   The insights can be leveraged to strengthen credit risk assessment models. Identifying that high-risk loan customers with poor credit scores have more credit card inquiries could indicate potential financial instability or a high likelihood of default. This information can help businesses fine-tune loan approval criteria and minimize defaults. Customers with good credit scores and low credit card inquiries in the low-risk loan category can be targeted with premium loan products or exclusive offers, maximizing profitability while minimizing risks.
*   Stricter approval policies for high-risk loans or customers with poor credit scores may reduce the overall customer base, especially among those with frequent credit card inquiries. However, this could be offset by the reduction in loan defaults, ultimately supporting long-term growth. If poor credit score customers are categorically denied loans, potential future borrowers (who may improve their creditworthiness) could be lost, affecting customer lifetime value.
*   By understanding the patterns of credit card inquiries in conjunction with loan risk categories and credit scores, businesses can make data-driven decisions to enhance loan portfolio quality. However, there’s a trade-off between risk mitigation and customer acquisition that must be carefully balanced.






## Chart 20: Credit Mix vs Outstanding Debt (Bivariate with Categorical-Continuous)

In [None]:
# creating bar chart to show credit mix and outstanding debt
plt.figure(figsize=(12,6))
sns.barplot(x='Credit_Mix',y='Outstanding_Debt',data=customer_df,estimator='mean',hue='Credit_Score_Category')
plt.xlabel('Credit Mix of The Customer')
plt.ylabel('Outstanding Debt')
plt.title('Outstanding Debt vs. Credit_Mix')
plt.legend(loc='upper left',bbox_to_anchor=(1, 1), fontsize=12, borderpad=2,title='Credit Score')

plt.show()

###1. Why did you pick the specific chart?

The bar chart was selected because it clearly demonstrates the relationship between the credit mix of customers (Good, Standard, Bad) and their outstanding debt levels while segmenting by credit scores (Good, Standard, Poor). This visualization is effective in comparing how the credit mix influences outstanding debt levels across different credit score groups.

###2. What is/are the insight(s) found from the chart?



*   Customers with a bad credit mix have the highest outstanding debt levels, regardless of their credit score.
*   Among customers with a bad credit mix, those with good credit scores have slightly higher outstanding debt compared to those with standard or poor credit scores.
*   Customers with a good credit mix have consistently low outstanding debt levels, irrespective of their credit score category.
*   A standard credit mix is associated with moderate outstanding debt levels, with those having poor credit scores showing slightly higher debt than others.

These insights reveal a strong association between a customer's credit mix and their outstanding debt: a bad credit mix is the clearest marker of high debt levels.
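
One way to check that credit mix matters more than credit score for debt is to compare how far the group means spread along each dimension. This is a sketch on made-up data, assuming the column names used in the chart code.

```python
import pandas as pd

# Hypothetical toy data standing in for customer_df (illustration only)
demo_df = pd.DataFrame({
    'Credit_Mix': ['Bad'] * 4 + ['Good'] * 4,
    'Credit_Score_Category': ['Good', 'Good', 'Poor', 'Poor'] * 2,
    'Outstanding_Debt': [3100.0, 2900.0, 2700.0, 2900.0,
                         600.0, 800.0, 1000.0, 800.0],
})

# Mean debt for every (credit mix, credit score) combination
mean_debt = demo_df.pivot_table(index='Credit_Mix',
                                columns='Credit_Score_Category',
                                values='Outstanding_Debt',
                                aggfunc='mean')

# Holding the credit score fixed, how much does the mean move across mixes?
spread_across_mix = mean_debt.max(axis=0) - mean_debt.min(axis=0)
# Holding the credit mix fixed, how much does the mean move across scores?
spread_across_score = mean_debt.max(axis=1) - mean_debt.min(axis=1)
print(spread_across_mix, spread_across_score, sep='\n')
```

If the spread across mixes is much larger than the spread across scores, as in this toy example, credit mix is the stronger marker of debt level.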





###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   The insights highlight the risks associated with customers having a bad credit mix. Lenders can use this information to implement stricter monitoring or loan approval conditions for such customers, regardless of their credit score, minimizing potential defaults. Customers with a good credit mix and low outstanding debt can be targeted with favorable interest rates or loan offers, leading to increased customer satisfaction and retention.
*   Over-reliance on excluding customers with a bad credit mix might limit opportunities for growth in customer acquisition. These individuals may still be creditworthy but require customized solutions or credit improvement programs. Customers with a standard credit mix may represent a middle ground that requires nuanced analysis. Overlooking this group may result in lost business opportunities from relatively stable customers.
*   Understanding the interplay between credit mix and debt levels provides actionable insights for mitigating credit risks while ensuring that creditworthy customers are not excluded. Careful balancing of risk-based strategies with inclusive credit policies can maximize both profitability and customer growth.






## Chart 21: Credit Score Vs. Payment Behaviour Of Customer (Bivariate with Categorical-Categorical)

In [None]:
# creating a bar chart to show payment behaviour and credit score
plt.figure(figsize=(12,6))
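# count customers per (credit score, payment behaviour) pair;
# naming the column explicitly keeps it 'count' regardless of pandas version
score_behaviour_counts = customer_df[['Credit_Score_Category','Payment_Behaviour']].value_counts().reset_index(name='count')
sns.barplot(data=score_behaviour_counts,x='count',y='Payment_Behaviour',hue='Credit_Score_Category')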
plt.xlabel('Frequency of Customers')
plt.ylabel('Payment Behaviour Of The Customer')
plt.title('Credit_Score vs. Payment_behaviour')
plt.legend(loc='upper left',bbox_to_anchor=(1, 1),title='Credit Score', fontsize=12, borderpad=2)
plt.tight_layout()
plt.show()

###1. Why did you pick the specific chart?

This bar chart was selected because it effectively compares the payment behavior of customers (spending and payment value categories) against their credit score categories (Good, Standard, Poor). The chart allows for a clear visualization of how different customer segments behave in terms of their financial spending and payment habits, making it easier to draw insights.

###2. What is/are the insight(s) found from the chart?



*   Standard Credit Score Customers Dominate: Customers with a standard credit score are the most frequent across all payment behavior categories. They show particularly high frequencies in low-spent, small-value payments, and high-spent, large-value payments.
*   Poor Credit Score Customers: These customers show significant frequency in low-spent, small-value payments but taper off in other categories, indicating limited financial engagement or riskier financial behavior.
*   Good Credit Score Customers: Customers with good credit scores have lower frequencies overall but are present across all payment behavior categories. This could indicate more stable but limited spending habits compared to those with standard scores.
*   High-Spent, Large-Value Payments: This category has a high representation from customers with standard credit scores, highlighting their financial engagement compared to other groups.



###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   The insights provide a clear view of customer behavior, enabling segmentation based on spending and payment value. For example, high-spent, large-value customers with standard or good credit scores can be targeted for premium loan products or credit cards. Poor credit score customers engaging in low-spent, small-value payments may represent low-revenue, high-risk individuals. This group can be monitored more closely for potential default risks, helping reduce exposure. Good credit score customers, despite being fewer, show balanced spending habits across categories. This segment represents an opportunity for upselling financial products like investment services or high-value credit cards.
*   Focusing only on high-spending behavior might lead to the exclusion of low-spent customers (who may improve over time), reducing long-term customer lifetime value. Standard credit score customers dominate the dataset, but over-reliance on this group could lead to saturation in product offerings without addressing the growth of other segments, such as good score customers.
*   These insights balance the identification of high-value customer groups with risk management strategies. While there are risks of excluding certain segments, strategic policies can mitigate negative impacts and maximize profitability.






## Chart 22: Payment of Minimum Amount (Bivariate)

In [None]:
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# plot 1: bar chart showing payment of minimum amount and outstanding debt
sns.barplot(x='Payment_of_Min_Amount',y='Outstanding_Debt',data=customer_df,estimator='mean',hue='Credit_Score_Category',ax=axes[0])
axes[0].set_title('Payment_of_Min_Amount vs Outstanding_Debt')
axes[0].set_xlabel('Payment_of_Min_Amount')
axes[0].set_ylabel('Outstanding Debt of Customer')
axes[0].get_legend().set_visible(False)


# plot 2: bar chart showing payment of minimum amount and number of credit card inquiries
sns.barplot(x='Payment_of_Min_Amount',y='Num_Credit_Inquiries',data=customer_df,estimator='mean',hue='Credit_Score_Category',ax=axes[1])
axes[1].set_title('Payment_of_Min_Amount vs Number of Cr Card Inquiries')
axes[1].set_xlabel('Payment_of_Min_Amount')
axes[1].set_ylabel('Number of Cr Card Inquiries by Customer')
plt.legend(loc='upper left',bbox_to_anchor=(1, 1),title='Credit Score', fontsize=12, borderpad=2)

# Adjust layout
plt.tight_layout() # Automatically Adjust

# Show plot
plt.show()

###1. Why did you pick the specific chart?

The chosen charts effectively illustrate the relationship between "Payment of Minimum Amount" (categorical variable) and two key metrics: "Outstanding Debt" and "Number of Credit Card Inquiries" (numerical variables). Using bar plots with a categorical breakdown of credit scores (Good, Standard, Poor) ensures a clear comparative analysis across groups, providing insights into financial behavior patterns.

###2. What is/are the insight(s) found from the chart?



*   Customers who pay only the minimum amount tend to have higher outstanding debts, particularly among those with poor credit scores. For customers not paying the minimum amount, the outstanding debt is relatively lower and consistent across credit score categories.
*   Customers paying only the minimum have a higher number of credit card inquiries, particularly for those with poor credit scores. In contrast, those who do not pay the minimum amount have fewer inquiries, indicating less aggressive credit-seeking behavior.


###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   Insights about the correlation between "Minimum Payment" behavior and financial stress indicators (outstanding debt and credit inquiries) can help design targeted financial products. For instance, customers with poor credit scores and high inquiries may benefit from educational initiatives or personalized repayment plans. Identifying high-risk customers enables better risk management and improved loan approval processes.
*   Customers with poor credit scores making minimum payments may signal potential defaults. If such patterns go unchecked, it could lead to higher non-performing assets for the company. High credit card inquiries combined with minimal payments suggest credit dependency, which might increase risk exposure for the business.
*   The analysis provides actionable insights for tailoring customer-centric strategies, such as offering lower-risk financial products or early intervention mechanisms. However, failing to address red flags like high outstanding debts and credit inquiries could adversely affect the business's profitability and reputation.





## Chart 23: Correlation Heatmap (Multivariate-Comparing Monthly Salary with other Features)

In [None]:
# creating a correlation heatmap to show correlation of monthly inhand salary with other important metrics

mnthly_Salary_corr = customer_df.loc[:,['Monthly_Inhand_Salary','Outstanding_Debt','Credit_Utilization_Ratio',
                                        'Total_EMI_per_month','Amount_invested_monthly','Monthly_Balance',
                                        'Expected_Monthly_Salary','Monthly_Salary_Discrepancy']].corr()

z_values = mnthly_Salary_corr.values  # Extracting the correlation values
x_labels = mnthly_Salary_corr.columns  # Getting the variable names for the x-axis
y_labels = mnthly_Salary_corr.columns  # Getting the variable names for the y-axis (same as x-axis)

# Create the heatmap
fig = go.Figure(data=go.Heatmap(
    z=z_values,
    x=x_labels,
    y=y_labels,
    colorscale='RdYlBu',  # color scale
    zmin=-1, zmax=1,  # limits for color scale
))

# setting the title
fig.update_layout(title='Correlation Heatmap')
fig.show()

###1. Why did you pick the specific chart?

The correlation heatmap was chosen to visually represent the relationships between numerical variables in the dataset. It provides a quick overview of how strongly variables like "Monthly Balance," "Credit Utilization Ratio," "Outstanding Debt," and others are correlated. This is crucial in identifying key drivers or potential multicollinearity issues for further analysis and modeling.

###2. What is/are the insight(s) found from the chart?



*   "Monthly Inhand Salary" is strongly correlated with "Expected Monthly Salary," which is expected as these metrics are directly related. "Outstanding Debt" is positively correlated with "Credit Utilization Ratio," indicating that higher debt tends to go together with higher credit usage.
*   "Monthly Salary Discrepancy" has a strong negative correlation with "Expected Monthly Salary" and "Monthly Balance," implying discrepancies arise when expected earnings are higher, potentially due to mismanagement or errors.
*   Variables like "Total EMI per month" show weak correlations with most other variables, suggesting minimal linear association with the other financial indicators.
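
Reading the strongest relationships off a heatmap by eye is error-prone, so it can help to list the variable pairs ranked by absolute correlation. The following is a sketch on synthetic data (seeded random numbers, not the project dataset) that keeps each pair once by masking the lower triangle of the matrix.

```python
import numpy as np
import pandas as pd

# Synthetic illustration data: expected salary is built to track in-hand salary,
# while EMI is generated independently of both
rng = np.random.default_rng(0)
n = 200
salary = rng.normal(4000, 1000, n)
demo_df = pd.DataFrame({
    'Monthly_Inhand_Salary': salary,
    'Expected_Monthly_Salary': salary + rng.normal(0, 50, n),
    'Total_EMI_per_month': rng.normal(100, 30, n),
})

corr = demo_df.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format, strongest absolute correlation first
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Applied to the correlation matrix of the real columns, the top rows of `pairs` would name the relationships worth discussing, and the bottom rows the weakly related variables.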






###3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



*   Understanding the correlation between "Outstanding Debt" and "Credit Utilization Ratio" can help identify high-risk customers who may default due to financial strain. This can guide credit limit adjustments or targeted financial education. The insights into salary discrepancies can be leveraged to improve customer trust by addressing discrepancies, refining salary predictions, or implementing stricter data validation.
*   Weak correlations between "Total EMI per month" and other variables may indicate limited predictive value of EMI for understanding customer risk. This could lead to ineffective prioritization if used as a key metric.
*   The insights allow for tailored strategies that mitigate risk and enhance customer satisfaction. However, ignoring weakly correlated variables or over-relying on them may lead to misguided business decisions. Proper weighting of impactful factors like debt and utilization will ensure effective outcomes.






## Correlation plot

In [None]:
# Correlation Heatmap visualization code
corr = customer_df.select_dtypes(include=['int64','float64']).corr()
cmap = sns.diverging_palette(5, 250, as_cmap=True)

def magnify():
    return [dict(selector="th",
                 props=[("font-size", "7pt")]),
            dict(selector="td",
                 props=[('padding', "0em 0em")]),
            dict(selector="th:hover",
                 props=[("font-size", "12pt")]),
            dict(selector="tr:hover td:hover",
                 props=[('max-width', '200px'),
                        ('font-size', '12pt')])
]

corr.style.background_gradient(cmap, axis=1)\
    .set_properties(**{'max-width': '80px', 'font-size': '10pt'})\
    .set_caption("Hover to magnify")\
    .format("{:.2f}")\
    .set_table_styles(magnify())



###1. Why did you pick the specific chart?

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. The range of correlation is [-1,1].

Thus, to see the correlations between all the variables along with their correlation coefficients, I used a correlation heatmap.
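
As a worked illustration of what each cell in the matrix contains, the sketch below computes the Pearson coefficient by hand (covariance divided by the product of the standard deviations) and compares it with pandas' `Series.corr`, which uses Pearson by default. The two short series are made-up numbers.

```python
import numpy as np
import pandas as pd

# Made-up example values
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Deviations from the mean
dx, dy = x - x.mean(), y - y.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Same quantity from pandas
r_pandas = x.corr(y)
print(round(r_manual, 4), round(r_pandas, 4))
```

Because the numerator can never exceed the denominator in absolute value, the result always lies in [-1, 1].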

###2. What is/are the insight(s) found from the chart?

The heatmap reveals key correlations among financial variables in the dataset. Strong positive relationships are observed between Annual Income and Monthly Inhand Salary, as well as Expected Monthly Salary and Amount Invested Monthly, indicating predictable financial patterns. Risk-related behaviors are highlighted through positive correlations of Number of Delayed Payments with Outstanding Debt and Delay from Due Date, while the Credit Utilization Ratio negatively correlates with Credit History Age, suggesting that longer credit histories are associated with better credit management. Additionally, Monthly Balance shows a negative relationship with Outstanding Debt and Total EMI per Month, reflecting the impact of high financial obligations on liquidity. Features like SSN, ID, and Customer_ID are independent and serve as identifiers without influencing financial behavior. For fraud detection, focusing on variables such as Credit Utilization Ratio, Number of Delayed Payments, and Credit History Age could provide valuable insights into risky or fraudulent patterns.

## pair plot

In [None]:
# Pair Plot visualization code
numeric_cols = customer_df.select_dtypes(include=['int64','float64'])
sns.pairplot(numeric_cols)

plt.show()

###1. Why did you pick the specific chart?

A pair plot is used to find the set of features that best explains the relationship between two variables or that forms the most clearly separated clusters. It also helps in sketching simple classification models, since linearly separable groups show up as regions that a straight line can divide.

Thus, I used a pair plot to analyse patterns in the data and relationships between the features. It complements the correlation map: the correlation map summarizes each pair of variables as a single coefficient, while the pair plot shows the underlying scatter and the distribution of each variable.

###2. What is/are the insight(s) found from the chart?



*   The scatter plots in the pair plot allow us to observe linear or non-linear relationships. Variables with diagonal or curved scatter distributions indicate correlations.
*   The histograms along the diagonal provide an overview of the distribution of individual variables (e.g., normal, skewed, uniform).
*   Outliers can be easily identified as points that deviate significantly from the general trend.
*   Patterns in the scatter plots may hint at the presence of natural clusters or groupings in the data, which can be further explored with clustering techniques.
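A pair plot over every numeric column of a 100,000-row dataset can be slow to render and hard to read. One option is to plot a random sample of a few selected features. The sketch below is an illustration under assumptions rather than the notebook's own method: `pairplot_sample` is a hypothetical helper, the sample size of 2,000 is arbitrary, and the demo frame is synthetic with column names chosen to resemble the features discussed in this project.

```python
import numpy as np
import pandas as pd


def pairplot_sample(df, columns, n_rows=2000, seed=42):
    """Return a random sample of the numeric columns among `columns` (hypothetical helper)."""
    # Non-numeric columns (for example identifiers stored as text) are dropped.
    numeric = df[columns].select_dtypes(include="number")
    n = min(n_rows, len(numeric))
    return numeric.sample(n=n, random_state=seed)


# Synthetic demo data (NOT the Paisabazaar dataset), used only to show the call.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Outstanding_Debt": rng.gamma(2.0, 700.0, 10_000),
    "Num_of_Delayed_Payment": rng.integers(0, 25, 10_000),
    "Credit_Utilization_Ratio": rng.uniform(20, 50, 10_000),
    "Customer_ID": ["CUS"] * 10_000,
})

subset = pairplot_sample(
    demo,
    ["Outstanding_Debt", "Num_of_Delayed_Payment", "Credit_Utilization_Ratio", "Customer_ID"],
    n_rows=2000,
)
print(subset.shape)
```

The sampled frame can then be passed to seaborn, for example `sns.pairplot(subset, corner=True, plot_kws={"s": 10, "alpha": 0.4})`, where `corner=True` draws only the lower triangle and the small, semi-transparent markers reduce overplotting.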



#**5. Solution to Business Objective**

**Solution to enhance credit assessment processes, minimize loan default risks, and provide customers with personalized financial advice tailored to their needs:**

1. Fraud Detection and Prevention
*   Ensure all records undergo validation during onboarding to flag invalid entries (21.6% flagged currently).
*   Leverage machine learning to identify patterns of invalid loan profiles, especially among high-risk age groups (14-24, 34-42).
*   Monitor credit card ownership to cap risks for customers with 6+ cards.
*   Strengthen the verification process for loan applications to minimize invalid profiles.


2. Risk Management
*   Incorporate delayed payments, loan history, and salary discrepancies into risk models to improve predictions.
*   Focus on customers with more than 6 credit cards, more than 6 bank accounts, or high outstanding debt, and develop targeted interventions such as debt consolidation.


3. Financial Education and Support
*   Target young adults (14-24) with education programs to reduce bad credit mix issues.
*   Provide financial planning assistance for customers with severe discrepancies to reduce financial strain.
*   Educate customers on the importance of maintaining a good credit mix.
*   Provide credit consolidation services for customers with excessive credit cards.
*   Target high-income customers in the poor credit score category with financial improvement plans.
*   Offer support for legitimate customers to rectify SSN issues.

4. Personalized Product Offerings
*   Premium loans and investment opportunities for high-income, good-credit customers.
*   Focus premium loan products on middle-aged customers with stable profiles.
*   Low-risk, entry-level products for poor-credit, low-income customers to build creditworthiness.
*   Use data insights on spending behavior and credit mix to recommend appropriate products.


5. Interest Rate Optimization
*   Implement interest rate thresholds to prevent delayed payments among financially strained customers.
*   Offer custom repayment schedules to high-risk borrowers to reduce defaults.
*   Introduce risk-adjusted interest rates for loans based on salary discrepancy categories.
*   Incentivize customers to improve their credit mix by offering lower interest rates for balanced portfolios.


6. Debt Management and Credit Utilization
*   Identify and flag customers showing increasing debt and frequent credit inquiries for early intervention.
*   Introduce flexible repayment plans to reduce delayed payments and defaults.
*   Refine loan offerings to balance income levels and credit risks.
*   Use data-driven insights to identify cross-sell opportunities for existing customers.


7. Proactive Customer Engagement
*   Leverage stable profiles (34-42 age group) for premium product offerings and relationship-building.
*   Provide counseling and assistance to those with poor credit scores and high loans to prevent defaults.
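The risk-management thresholds above (more than 6 credit cards, more than 6 bank accounts, high outstanding debt) could be turned into simple rule-based flags as a first step before any machine-learning model. This is a minimal sketch under stated assumptions: the function `flag_high_risk`, the column names `Num_Credit_Card`, `Num_Bank_Accounts`, and `Outstanding_Debt`, and the use of the 75th percentile as the "high debt" cutoff are illustrative choices, not definitions taken from this analysis.

```python
import pandas as pd


def flag_high_risk(df, max_cards=6, max_accounts=6, debt_quantile=0.75):
    """Add boolean rule-based risk flags to a copy of `df` (hypothetical helper)."""
    out = df.copy()
    # Assumption: "high" outstanding debt means above the chosen quantile of the data.
    debt_cutoff = out["Outstanding_Debt"].quantile(debt_quantile)
    out["many_cards"] = out["Num_Credit_Card"] > max_cards
    out["many_accounts"] = out["Num_Bank_Accounts"] > max_accounts
    out["high_debt"] = out["Outstanding_Debt"] > debt_cutoff
    # A customer is flagged when at least one rule fires.
    out["high_risk"] = out[["many_cards", "many_accounts", "high_debt"]].any(axis=1)
    return out


# Tiny hand-made example (NOT real customer data).
demo = pd.DataFrame({
    "Customer_ID": ["A", "B", "C", "D"],
    "Num_Credit_Card": [3, 8, 5, 6],
    "Num_Bank_Accounts": [2, 4, 9, 6],
    "Outstanding_Debt": [500.0, 800.0, 700.0, 4000.0],
})

flagged = flag_high_risk(demo)
print(flagged[["Customer_ID", "many_cards", "many_accounts", "high_debt", "high_risk"]])
```

Keeping each rule in its own column makes it easy to see why a customer was flagged, and the thresholds can be tuned as the business refines its definition of risk.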

#**Conclusion**

The PaisaBazaar Banking Fraud Analysis project successfully identified critical factors influencing credit risk and fraud detection, providing actionable insights to strengthen decision-making and customer engagement. By leveraging a clean and comprehensive dataset, the analysis revealed key patterns, such as the strong correlation between salary discrepancies, SSN validity, and credit scores, as well as the impact of high credit card ownership and delayed payments on financial risk.

The study emphasized the importance of tailored financial products, proactive customer support, and enhanced fraud detection mechanisms to address the challenges of creditworthiness misclassification and loan defaults. With targeted interventions, such as flexible repayment plans, risk-adjusted interest rates, and financial literacy programs, the project offers a robust framework for improving credit risk assessment and customer satisfaction.

This analysis empowers PaisaBazaar to better serve its customers while minimizing operational risks, positioning the organization as a leader in delivering secure and personalized financial solutions.