<a href="https://colab.research.google.com/github/Dipu1764/Paisabazaar-banking-Fraud-Analysis/blob/main/Paisabazaar_banking_Fraud_Analysis_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name** - Paisabazaar Banking Fraud Analysis



# **Project Summary -**

In this project, the aim was to analyze customer data from Paisabazaar to detect patterns and potential indicators of banking fraud. The dataset contained 100,000 entries and 28 columns related to customer profiles, financial habits, credit history, and loan details. Various features such as Age, SSN, Occupation, Annual Income, Credit Utilization Ratio, Payment Behavior, Credit Score, and others were analyzed to understand the customers' behavior and identify anomalies that could suggest fraudulent activity.

The key objectives of the analysis were:

**Data Wrangling:** Cleaning and preparing the data for analysis by handling missing values, correcting data types, and ensuring consistency.

**Exploratory Data Analysis (EDA):** Using visualizations to explore relationships between different features, identify patterns, and understand the dataset structure.

**Feature Engineering:** Creating new variables or transforming existing ones to improve the ability to identify fraudulent patterns.

**Outlier Detection:** Identifying unusual patterns that could indicate fraudulent behavior, such as extreme credit utilization ratios, unusually high outstanding debt, or delayed payments.

**Credit Score Analysis:** Investigating the distribution of credit scores to see if specific profiles or occupations are more likely to have low or suspicious credit scores.

**Loan and Payment Behavior:** Examining loan types, the number of delayed payments, and payment behavior to spot inconsistencies.

**Risk Analysis:** Segmenting customers based on risk factors such as credit utilization, loan history, and debt, which can be indicators of financial distress or fraud.

# **GitHub Link -**

GitHub link: https://github.com/Dipu1764



# **Problem Statement**


Given a dataset of 100,000 banking customers with detailed information about their financial activities, credit history, and personal profiles, how can we effectively analyze this data to detect and flag potential fraudulent behavior? The analysis should focus on identifying patterns and anomalies that deviate from typical customer behavior, which could indicate fraudulent activities such as credit card fraud, loan exploitation, or unusual financial transactions.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
df = pd.read_csv('/content/dataset.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

#### Dataset Overview

**Total Entries (Rows):** 100,000 (RangeIndex: 0 to 99,999)

**Total Columns:** 28 columns with a variety of data types, including numerical, categorical, and string types.


#### Key Insights and Observations:

**Diverse Financial Information:** The dataset contains a wide range of financial data, including income, loans, debt, credit inquiries, and payment behavior, which will be crucial for fraud analysis.

**Fraud Indicators:** Columns such as 'Num_of_Delayed_Payment', 'Outstanding_Debt', 'Changed_Credit_Limit', and 'Payment_Behaviour' could be strong indicators of risky or fraudulent behavior.

**Categorical Data:** Several columns are categorical (e.g., 'Occupation', 'Type_of_Loan', 'Credit_Mix', 'Payment_Behaviour', 'Credit_Score'). These need to be encoded (e.g., using one-hot encoding or label encoding) if used in machine learning models.

**Potential Redundancies:** Some columns may have closely related or redundant information. For example, 'Annual_Income' and 'Monthly_Inhand_Salary' are related. Also, 'Num_of_Loan' and 'Type_of_Loan' might be correlated.

**Credit Score Insights:** The 'Credit_Score' column is a crucial feature, possibly reflecting the risk of a customer being involved in fraudulent activities.

**Customer Profile Analysis:** Using columns like 'Age', 'Occupation', 'Annual_Income', and 'Credit_History_Age', we can build detailed profiles of customers, which can help in detecting anomalies or fraudulent patterns.

**Data Quality**: There is no indication of missing values (based on the column descriptions), but data types need attention. For instance, 'SSN' should be a string, and certain numerical columns might contain outliers that need handling.
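
As a rough illustration of the encoding step mentioned above, the sketch below one-hot encodes a nominal column and ordinal-encodes the credit score on a small hand-made frame. The toy rows and the Poor < Standard < Good ordering are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Toy rows invented for illustration; column names follow the dataset
toy = pd.DataFrame({
    'Occupation': ['Engineer', 'Teacher', 'Engineer', 'Doctor'],
    'Credit_Score': ['Good', 'Poor', 'Standard', 'Good'],
})

# One-hot encoding suits nominal categories with no natural order
occupation_dummies = pd.get_dummies(toy['Occupation'], prefix='Occupation')

# Ordinal encoding suits ordered categories (assumed order: Poor < Standard < Good)
score_order = {'Poor': 0, 'Standard': 1, 'Good': 2}
toy['Credit_Score_Code'] = toy['Credit_Score'].map(score_order)

encoded = pd.concat([toy[['Credit_Score_Code']], occupation_dummies], axis=1)
print(encoded)
```

One-hot encoding adds one column per category, so it is usually reserved for columns with a modest number of distinct values.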

#### Areas for Further Analysis:

**Feature Engineering:** Create new features such as 'Debt-to-Income Ratio' (Outstanding Debt/Annual Income) and 'Credit Usage Ratio' (Credit Utilization Ratio/Credit Limit) to better understand customer financial health.

**Fraud Detection:** Focus on the behavior-related columns (e.g., 'Payment_of_Min_Amount', 'Payment_Behaviour') to predict and detect potentially fraudulent customers.
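
One possible way to compute the Debt-to-Income Ratio described above is sketched here on a toy frame. The rows are made up, and treating a zero income as missing is an assumption, chosen so the division does not produce infinite values. The 'Credit Usage Ratio' is left out because it would need a credit limit column whose name is not given here.

```python
import numpy as np
import pandas as pd

# Toy values invented for illustration
toy = pd.DataFrame({
    'Outstanding_Debt': [500.0, 1200.0, 300.0],
    'Annual_Income': [20000.0, 0.0, 60000.0],
})

# Replace zero income with NaN so the ratio is undefined rather than infinite
income = toy['Annual_Income'].replace(0, np.nan)
toy['Debt_to_Income_Ratio'] = toy['Outstanding_Debt'] / income
print(toy)
```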

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Make the dataset analysis-ready.
# Check for missing values
print("Missing values in each column:")
print(df.isnull().sum())

# Fill missing values in key numeric columns with the column median
df.fillna({
    'Age': df['Age'].median(),
    'Annual_Income': df['Annual_Income'].median()
}, inplace=True)

# Safety net: drop any rows where these columns are still NaN
# (for example, if a column was entirely empty and its median is NaN)
df.dropna(subset=['Age', 'Annual_Income'], inplace=True)

# Check if any missing values remain
print("\nMissing values after handling:")
print(df.isnull().sum())

# Save the cleaned DataFrame
df.to_csv('cleaned_data.csv', index=False)
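
The data type corrections noted earlier (for example, keeping 'SSN' as a string) are not shown in the cell above. A minimal sketch of what they might look like follows; the toy values, including the malformed 'abc' entry, are invented for illustration.

```python
import pandas as pd

# Toy values invented for illustration
toy = pd.DataFrame({
    'SSN': [821000265, 4075839, 486853974],
    'Num_of_Loan': ['4', 'abc', '2'],
})

# Identifiers are labels, not quantities, so store them as strings
toy['SSN'] = toy['SSN'].astype(str)

# Coerce numeric text to numbers; unparseable entries become NaN
toy['Num_of_Loan'] = pd.to_numeric(toy['Num_of_Loan'], errors='coerce')

print(toy.dtypes)
print(toy)
```

Counting the NaN values created by the coercion is a quick way to see how many entries were malformed before deciding whether to fill or drop them.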


In [None]:
# Example: Removing outliers in 'Annual_Income' column using IQR
Q1 = df['Annual_Income'].quantile(0.25)
Q3 = df['Annual_Income'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['Annual_Income'] >= (Q1 - 1.5 * IQR)) & (df['Annual_Income'] <= (Q3 + 1.5 * IQR))]
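
Dropping outliers is convenient for plotting, but in a fraud setting the extreme rows are often the ones of interest. As an alternative sketch, the same IQR rule can mark rows instead of removing them. The helper name `iqr_outlier_mask` and the toy incomes are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy incomes invented for illustration; the last value is deliberately extreme
toy = pd.DataFrame({'Annual_Income': [30000, 32000, 35000, 36000, 38000, 40000, 900000]})
toy['Income_Outlier'] = iqr_outlier_mask(toy['Annual_Income'])
print(toy)
```

Keeping the flag as a column lets charts filter on it while the flagged customers stay available for review.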


In [None]:
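# Create a new feature 'Income_per_Account'.
# Treat zero accounts as missing to avoid division by zero (inf values).
df['Income_per_Account'] = df['Annual_Income'] / df['Num_Bank_Accounts'].replace(0, np.nan)

# Create a feature 'Credit_Score_Level'.
# Credit_Score is categorical ('Good', 'Standard', 'Poor'), so a numeric
# threshold such as x > 700 would label every row 'Low'.
df['Credit_Score_Level'] = np.where(df['Credit_Score'] == 'Good', 'High', 'Low')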



In [None]:
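from sklearn.preprocessing import StandardScaler

# Standardize selected numeric columns to zero mean and unit variance.
# Note: this overwrites the raw values, so the charts below show these
# three columns in standardized units rather than currency amounts.
scaler = StandardScaler()
df[['Annual_Income', 'Monthly_Inhand_Salary', 'Outstanding_Debt']] = scaler.fit_transform(
    df[['Annual_Income', 'Monthly_Inhand_Salary', 'Outstanding_Debt']]
)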


In [None]:
# Example: Distribution of Annual Income
sns.histplot(df['Annual_Income'])
plt.title('Distribution of Annual Income')
plt.show()

# Example: Violin Plot
plt.figure(figsize=(12, 6))
sns.violinplot(x='Occupation', y='Monthly_Inhand_Salary', data=df)
plt.xticks(rotation=45)
plt.title('Monthly In-hand Salary by Occupation')
plt.show()




In [None]:
from sklearn.model_selection import train_test_split

X = df[['Annual_Income', 'Monthly_Inhand_Salary', 'Outstanding_Debt']]  # Example features
y = df['Credit_Score']  # Example target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
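
If these features later feed a model, one common practice is to fit the scaler on the training split only and reuse it on the test split, so that test-set statistics do not leak into training. The sketch below shows that ordering on randomly generated numbers; it is an illustration, not the workflow used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, generated only for this illustration
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'Annual_Income': rng.normal(50000, 15000, 200),
    'Outstanding_Debt': rng.normal(1500, 400, 200),
})
labels = rng.choice(['Good', 'Standard', 'Poor'], size=200)

X_tr, X_te, y_tr, y_te = train_test_split(toy, labels, test_size=0.3, random_state=42)

toy_scaler = StandardScaler()
X_tr_scaled = toy_scaler.fit_transform(X_tr)  # learn mean and std from training rows only
X_te_scaled = toy_scaler.transform(X_te)      # apply the same parameters to test rows

print(X_tr_scaled.mean(axis=0).round(3))      # close to 0 by construction
```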


In [None]:
df.to_csv('cleaned_data_final.csv', index=False)
df.head()

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1  Visualize the distribution of Annual_Income across different Occupation categories

In [None]:
# Chart - 1 visualization code
# Create a box plot for Annual Income across Occupation
plt.figure(figsize=(12, 8))
sns.boxplot(x='Occupation', y='Annual_Income', data=df)
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.title('Distribution of Annual Income Across Different Occupations')
plt.xlabel('Occupation')
plt.ylabel('Annual Income')
plt.show()

#### Chart - 2 Visualize the relationship between Credit_Utilization_Ratio and Credit_Score to understand how credit utilization impacts credit scores.

In [None]:
# Chart - 2 visualization code
# Check data types
print(df['Credit_Score'].dtype)

# Credit_Score is categorical ('Good', 'Standard', 'Poor'). Coercing it with
# pd.to_numeric would turn every value into NaN and empty the DataFrame,
# so map it to an ordinal code in a separate plotting frame instead.
score_order = {'Poor': 0, 'Standard': 1, 'Good': 2}
plot_df = df[['Credit_Utilization_Ratio', 'Credit_Score']].copy()
plot_df['Credit_Score_Code'] = plot_df['Credit_Score'].map(score_order)
plot_df = plot_df.dropna(subset=['Credit_Score_Code'])

sns.pairplot(plot_df[['Credit_Utilization_Ratio', 'Credit_Score_Code']])
plt.suptitle('Pair Plot of Credit Utilization Ratio and Credit Score (0 = Poor, 2 = Good)', y=1.02)
plt.show()



#### Chart - 3  visualize the average Annual_Income by Occupation.

In [None]:
# Chart - 3 visualization code

# Ensure df is not empty before proceeding
if not df.empty:
    # Calculate average Annual_Income by Occupation
    avg_income_by_occupation = df.groupby('Occupation')['Annual_Income'].mean().sort_values()

    # Plot bar chart
    plt.figure(figsize=(12, 6))
    avg_income_by_occupation.plot(kind='bar', color='skyblue')
    plt.title('Average Annual Income by Occupation')
    plt.xlabel('Occupation')
    plt.ylabel('Average Annual Income')
    plt.xticks(rotation=45)
    plt.grid(axis='y', linestyle='--', alpha=0.7)
    plt.show()
else:
    print("The DataFrame is empty. Please check your data loading or processing.")

#### Chart - 4  Correlation Matrix of Income, Loans, and Debts (Heatmap)

In [None]:
# Chart - 4 visualization code
selected_columns = ['Annual_Income', 'Num_of_Loan', 'Outstanding_Debt', 'Monthly_Balance']
corr_matrix = df[selected_columns].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Financial Attributes')
plt.show()


#### Chart - 5  Distribution of Number of Loans vs Credit Score (Joint Plot)

In [None]:
# Chart - 5 visualization code
# Work on a copy so the original Credit_Score labels are preserved for later charts
plot_df = df[['Num_of_Loan', 'Credit_Score']].copy()

# Convert 'Num_of_Loan' to numeric; invalid entries become NaN
plot_df['Num_of_Loan'] = pd.to_numeric(plot_df['Num_of_Loan'], errors='coerce')

# Credit_Score is categorical, so map it to an ordinal code rather than
# coercing it with pd.to_numeric (which would turn every label into NaN)
plot_df['Credit_Score_Code'] = plot_df['Credit_Score'].map({'Poor': 0, 'Standard': 1, 'Good': 2})

# Drop rows with missing values in either column
plot_df = plot_df.dropna(subset=['Num_of_Loan', 'Credit_Score_Code'])

# Check if the frame is empty after dropping rows
if plot_df.empty:
    print("DataFrame is empty after dropping missing values. Check data for errors or missing values.")
else:
    g = sns.jointplot(x='Num_of_Loan', y='Credit_Score_Code', data=plot_df, kind='hex', cmap='Blues')
    # jointplot creates its own figure, so set the title on that figure
    g.figure.suptitle('Joint Distribution of Number of Loans and Credit Score', y=1.02)
    plt.show()

#### Chart - 6  Explore the relationship between Age, Annual Income, Monthly Balance, and Outstanding Debt. Highlight the data points by Credit Score to see if there is any noticeable pattern.

In [None]:
# Chart - 6 visualization code
# Subset the relevant columns and filter the dataset to remove invalid or unknown values in Credit_Score
df_filtered = df[df['Credit_Score'].isin(['Good', 'Standard', 'Poor'])]

# Select columns of interest for the pairplot
cols_of_interest = ['Age', 'Annual_Income', 'Monthly_Balance', 'Outstanding_Debt', 'Credit_Score']

# Create a pairplot with hue based on Credit Score
sns.pairplot(df_filtered[cols_of_interest], hue='Credit_Score', palette='coolwarm')
plt.suptitle('Pairplot of Age, Income, Balance, and Debt Colored by Credit Score', y=1.02)
plt.show()

#### Chart - 7 Examine the relationship between the number of delayed payments and outstanding debt across different occupations using a bubble plot.

In [None]:
# Chart - 7 visualization code
# Group by occupation and calculate mean of 'Num_of_Delayed_Payment' and 'Outstanding_Debt'
occupation_stats = df.groupby('Occupation').agg({'Num_of_Delayed_Payment': 'mean', 'Outstanding_Debt': 'mean'}).reset_index()

# Scale the bubble size in proportion to the average number of delayed payments
occupation_stats['Size'] = occupation_stats['Num_of_Delayed_Payment'] * 10

# Create a bubble plot
plt.figure(figsize=(12, 8))
plt.scatter(occupation_stats['Num_of_Delayed_Payment'], occupation_stats['Outstanding_Debt'],
            s=occupation_stats['Size'], alpha=0.6, c='purple', edgecolor='black')

# Add labels to the bubbles
for i in range(occupation_stats.shape[0]):
    plt.text(occupation_stats['Num_of_Delayed_Payment'][i], occupation_stats['Outstanding_Debt'][i],
             occupation_stats['Occupation'][i], fontsize=10)

# Add axis labels and title
plt.xlabel('Average Number of Delayed Payments')
plt.ylabel('Average Outstanding Debt')
plt.title('Relationship Between Delayed Payments and Outstanding Debt by Occupation')
plt.grid(True)
plt.show()


#### Chart - 8  Distribution of Credit Scores by Occupation (Stacked Bar Chart)

In [None]:
# Chart - 8 visualization code
# Count the occurrences of each credit score per occupation

# Ensure df is not empty and contains the necessary columns
if not df.empty and 'Occupation' in df.columns and 'Credit_Score' in df.columns:
    occupation_credit_score = df.groupby(['Occupation', 'Credit_Score']).size().unstack(fill_value=0)

    # Check if occupation_credit_score has any numeric data
    if not occupation_credit_score.empty:
        # Plot a stacked bar chart
        occupation_credit_score.plot(kind='bar', stacked=True, figsize=(12, 8), colormap='Set3')

        # Add labels and title
        plt.title('Distribution of Credit Scores by Occupation')
        plt.xlabel('Occupation')
        plt.ylabel('Number of Customers')
        plt.xticks(rotation=45, ha='right')
        plt.legend(title='Credit Score', loc='upper right')
        plt.tight_layout()

        # Show the chart
        plt.show()
    else:
        print("No data available for plotting after grouping.")
else:
    print("Original DataFrame is empty or missing required columns.")
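
Stacked counts can be hard to compare when occupations differ in size. One option is to convert the counts to within-occupation shares using `pd.crosstab` with `normalize='index'`. The toy rows below are invented for illustration.

```python
import pandas as pd

# Toy rows invented for illustration
toy = pd.DataFrame({
    'Occupation': ['Engineer', 'Engineer', 'Engineer', 'Engineer', 'Teacher', 'Teacher'],
    'Credit_Score': ['Good', 'Good', 'Poor', 'Standard', 'Poor', 'Poor'],
})

# Each row sums to 1, giving the share of each credit score within an occupation
shares = pd.crosstab(toy['Occupation'], toy['Credit_Score'], normalize='index')
print(shares.round(2))
```

Plotting `shares` as a stacked bar chart would give bars of equal height, which makes an occupation with an unusually large 'Poor' share easier to spot.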

#### Chart - 9  Number of Bank Accounts vs Outstanding Debt (Hexbin Plot)

In [None]:
# Chart - 9 visualization code
# Create a hexbin plot to visualize the relationship between Num_Bank_Accounts and Outstanding_Debt
plt.figure(figsize=(10, 8))
plt.hexbin(df['Num_Bank_Accounts'], df['Outstanding_Debt'], gridsize=30, cmap='Blues', mincnt=1)

# Add color bar and labels
plt.colorbar(label='Count')
plt.title('Hexbin Plot: Num_Bank_Accounts vs. Outstanding_Debt')
plt.xlabel('Number of Bank Accounts')
plt.ylabel('Outstanding Debt')

plt.show()


#### Chart - 10  Credit Score Distribution Across Occupations (Violin Plot)

In [None]:
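# Chart - 10 visualization code
# Set the figure size
plt.figure(figsize=(12, 8))

# Credit_Score is categorical, so map it to an ordinal code
# (0 = Poor, 1 = Standard, 2 = Good) to give the violin plot a numeric y-axis
plot_df = df[['Occupation', 'Credit_Score']].copy()
plot_df['Credit_Score_Code'] = plot_df['Credit_Score'].map({'Poor': 0, 'Standard': 1, 'Good': 2})

# Create a violin plot to visualize the distribution of Credit_Score across Occupation
sns.violinplot(x='Occupation', y='Credit_Score_Code', data=plot_df)

# Add title and labels
plt.title('Credit Score Distribution Across Occupations', fontsize=16)
plt.xlabel('Occupation', fontsize=12)
plt.ylabel('Credit Score (0 = Poor, 1 = Standard, 2 = Good)', fontsize=12)

# Rotate x-axis labels for better readability
plt.xticks(rotation=45)

plt.show()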


# **Conclusion**

After analyzing the data, several insights emerged that could help in detecting fraud:

**Occupation and Credit Score Patterns:** Some occupations showed unusual distributions of credit scores. This could indicate potential fraudulent activities, particularly if certain professions are overrepresented in low-credit-score categories.

**Credit Utilization Ratio:** Customers with extremely high credit utilization were flagged as potential risks. High credit utilization can indicate financial distress, a key indicator of possible fraud.

**Payment Behavior:** Customers with a history of missed payments or delays in settling debts were found to have lower credit scores and higher outstanding debts. These behavioral patterns could be used to monitor and predict fraud.

**Loan Details:** The analysis of loan types and the number of loans revealed that some customers have excessive loan numbers, which could be a sign of fraudulent attempts to exploit the banking system.
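
The flags described above could be expressed as simple rules. The sketch below marks customers above the 95th percentile of credit utilization or with an unusually high loan count. The 95th-percentile cutoff, the limit of 8 loans, the `Customer_ID` column, and the toy rows are all assumptions chosen for illustration, not thresholds derived from this analysis.

```python
import pandas as pd

# Toy rows invented for illustration
toy = pd.DataFrame({
    'Customer_ID': ['C1', 'C2', 'C3', 'C4', 'C5'],
    'Credit_Utilization_Ratio': [25.0, 31.0, 28.0, 49.5, 33.0],
    'Num_of_Loan': [2, 9, 1, 3, 4],
})

util_cutoff = toy['Credit_Utilization_Ratio'].quantile(0.95)  # assumed threshold
max_loans = 8                                                  # assumed threshold

toy['High_Utilization'] = toy['Credit_Utilization_Ratio'] > util_cutoff
toy['Excessive_Loans'] = toy['Num_of_Loan'] > max_loans

# A customer is queued for review if either rule fires
toy['Needs_Review'] = toy['High_Utilization'] | toy['Excessive_Loans']
print(toy[toy['Needs_Review']])
```

Rules like these only rank customers for manual review; they do not by themselves establish that fraud occurred.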

**Final Thoughts:**
The project successfully identified key areas of risk and potential fraud indicators within the customer base. By combining data wrangling, exploratory data analysis, and advanced visualizations, we were able to highlight patterns that can be further investigated to develop fraud detection systems. Moving forward, machine learning algorithms can be employed to automate and improve the accuracy of fraud detection based on the features identified in this analysis.







