# Exploratory Data Analysis (EDA) for Insurance Fraud Dataset

This notebook explores the cleaned **Insurance Claims** Dataset to understand:
- **Data Structure** and **Summary Statistics**
- `Fraud` vs `non-fraud` **Distribution** and **Imbalance**
- **Distribution** of key **Numeric Features**
- **Relationships** between **Features** and **Fraud**
- **Categorical** **Feature Analysis**
- **Time-based** **Trends** (if available)
- **Outlier** **Detection**

I use `pandas` for **Data Manipulation** and `seaborn`/`matplotlib` for **Visualization**.

## Table of Contents
1. Load Data and Basic Inspection  
2. Data Overview and Summary Statistics  
3. Fraud Class Distribution  
4. Convert Fraud Labels for Numeric Analysis  
5. Numeric Feature Exploration  
6. Categorical Feature Exploration  
7. Time-Based Trends 
8. EDA Summary and Next Steps

In [None]:
# Importing essential Python libraries
import pandas as pd                                 # For data manipulation
import seaborn as sns                               # For statistical visualizations
import matplotlib.pyplot as plt                     # For basic plotting
import os                                           # For file path handling
from IPython.display import display, Markdown       # For displaying Markdown and DataFrames in Jupyter

# Set Seaborn's visual style for cleaner plots
sns.set(style="whitegrid")

# Enable inline plotting for Jupyter
%matplotlib inline

## 1. Load Data and Basic Inspection

In this step, I load the cleaned insurance claims dataset, saved as a CSV file by the ETL process, into a pandas DataFrame. I then inspect its shape (rows, columns) to gauge the dataset's size, display the first few rows to get an initial feel for its structure and content, and print the data types and non-null counts per column.

In [None]:
# Define the project directory path (update as per your environment)
project_dir = r"C:\Users\Cloud\OneDrive\Desktop\Fraud_Analytics_Project"

# Construct full path to the cleaned CSV file within the project folder
cleaned_file = os.path.join(project_dir, "data", "cleaned", "cleaned_insurance_claims.csv")

# Read the CSV file into a pandas DataFrame
df = pd.read_csv(cleaned_file)

# Display the shape of the dataframe (rows, columns)
display(Markdown("**Dataframe shape:**"))
display(df.shape)

# Show the first 5 rows to get a sense of the data
display(Markdown("**First 5 rows:**"))
display(df.head())

# Provide info about data types and non-null counts per column
display(Markdown("**Dataframe info:**"))
df.info()
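
Before going further, it can help to confirm that the columns used later in this notebook are actually present in the loaded file. The following is a minimal sketch, not part of the original pipeline: the `check_required_columns` helper and the tiny in-memory CSV are hypothetical, and the column names are the ones referenced in later sections.

```python
import io

import pandas as pd

# Columns referenced later in this notebook
REQUIRED_COLUMNS = [
    "fraud_reported",
    "total_claim_amount",
    "incident_hour_of_the_day",
]


def check_required_columns(frame, required):
    """Return the required column names missing from the frame (hypothetical helper)."""
    return [col for col in required if col not in frame.columns]


# Tiny stand-in for the cleaned CSV (illustrative values only)
sample_csv = io.StringIO(
    "fraud_reported,total_claim_amount\n"
    "Y,5000\n"
    "N,1200\n"
)
sample_df = pd.read_csv(sample_csv)

# An empty list means every required column was found
missing_cols = check_required_columns(sample_df, REQUIRED_COLUMNS)
print(missing_cols)
```

In the notebook itself, the same helper could be called on `df` right after `pd.read_csv`, so a renamed or dropped column surfaces immediately rather than as a `KeyError` several cells later.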

## 2. Data Overview and Summary Statistics

Here, I review the dataset's structure. The `info()` output above shows data types and non-null counts; the cell below summarizes missing values per column as counts and percentages. Summary statistics (mean, median, quartiles, etc.) for the key numeric columns are displayed in Section 5, where they help me understand feature distributions and spot anomalies.

In [None]:
# Count how many missing values in each column, sorted descending
missing_counts = df.isna().sum().sort_values(ascending=False)

# Calculate percentage of missing values relative to total rows, rounded to 2 decimals
missing_percent = (missing_counts / len(df) * 100).round(2)

# Combine missing counts and percentages into a DataFrame for easier viewing
missing_summary = pd.DataFrame({'Missing Count': missing_counts, 'Missing %': missing_percent})

# Display missing value summary table
display(missing_summary)

# List all columns that have any missing values
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()
display(Markdown(f"**Columns with missing values:** {cols_with_missing}"))
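
As a companion to the missing-value table, summary statistics can be produced with `describe()`. This is a minimal sketch on made-up values (the `sample_df` frame is illustrative, not the real dataset); the added `skew` column is my own suggestion for spotting long-tailed features.

```python
import pandas as pd

# Illustrative values only; in the notebook this would be df
sample_df = pd.DataFrame({
    "total_claim_amount": [1000, 2000, 3000, 4000, 50000],
    "incident_hour_of_the_day": [0, 6, 12, 18, 23],
})

# describe() gives count, mean, std, min, quartiles, and max per numeric column;
# transposing puts one feature per row
stats = sample_df.describe().T

# Skewness: a clearly positive value suggests a long right tail
stats["skew"] = sample_df.skew()

# The "50%" column is the median
print(stats[["mean", "50%", "skew"]])
```

A mean far above the median, as for the made-up claim amounts here, is a quick hint that a few large values dominate the feature.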

## 3. Fraud Class Distribution

I examine how many claims are labeled as fraudulent vs non-fraudulent, including the percentage imbalance. This is crucial because many fraud datasets are imbalanced, which can affect model training and evaluation. I then visualize this imbalance with a count plot.

In [None]:
# Count number of fraud ('Y') and non-fraud ('N') cases
fraud_counts = df['fraud_reported'].value_counts()

# Display raw counts for each fraud class
display(fraud_counts)

# Calculate percentage of fraud cases (.get returns 0 if 'Y' is not present in the data)
fraud_percentage = fraud_counts.get('Y', 0) / fraud_counts.sum() * 100

# Display fraud percentage as markdown for clarity
display(Markdown(f"**Percentage of fraud cases:** {fraud_percentage:.2f}%"))

# Plot a bar chart showing counts of fraud vs non-fraud claims
plt.figure(figsize=(6, 4))
sns.countplot(x='fraud_reported', data=df, palette=['#66b3ff', '#ff6666'])
plt.title('Fraud vs Non-Fraud Claim Counts')
plt.xlabel('Fraud Reported (Y/N)')
plt.ylabel('Count')
plt.show()

# Calculate and display class imbalance ratio (majority class count divided by minority class count)
imbalance_ratio = fraud_counts.max() / fraud_counts.min()
display(Markdown(f"**Class imbalance ratio (majority/minority):** {imbalance_ratio:.2f}"))
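
To illustrate why imbalance matters for evaluation, the sketch below scores a trivial predictor that always outputs the majority class. The 8-to-2 label split is invented for the example and says nothing about the real dataset.

```python
import pandas as pd

# Invented labels: 8 non-fraud, 2 fraud
labels = pd.Series(["N"] * 8 + ["Y"] * 2, name="fraud_reported")

# Share of each class in the data
class_share = labels.value_counts(normalize=True)
majority_class = class_share.idxmax()

# Accuracy of always predicting the majority class equals the majority share
baseline_accuracy = class_share.max()

# Recall on fraud for the same predictor: it never predicts 'Y'
predictions = pd.Series([majority_class] * len(labels))
fraud_recall = ((predictions == "Y") & (labels == "Y")).sum() / (labels == "Y").sum()

print(f"Baseline accuracy: {baseline_accuracy:.2f}, fraud recall: {fraud_recall:.2f}")
```

The predictor looks accurate while catching no fraud at all, which is why accuracy alone is a weak metric on imbalanced data.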

## 4. Convert Fraud Labels for Numeric Analysis

The fraud labels are stored as text (`"Y"`, `"N"`). Here, I convert them into a numeric format (`1` for fraud, `0` for non-fraud). This numeric encoding simplifies later statistical calculations and visualizations involving fraud.

In [None]:
# Map string labels 'Y'/'Yes' to 1 (fraud), and 'N'/'No' to 0 (non-fraud)
df['fraud_numeric'] = df['fraud_reported'].map({'Y': 1, 'N': 0, 'Yes': 1, 'No': 0})

# Show sample of the original and converted fraud columns to confirm mapping
display(Markdown("**Sample conversion:**"))
display(df[['fraud_reported', 'fraud_numeric']].head())
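
One caveat with `map`: any label that is not a key in the dictionary becomes `NaN` without raising an error. The sketch below shows one way to surface such labels. The sample values, including the lowercase `"y"`, are invented for illustration.

```python
import pandas as pd

# Invented sample with one unexpected label ('y') and one missing value
sample = pd.DataFrame({"fraud_reported": ["Y", "N", "Yes", "y", None]})

label_map = {"Y": 1, "N": 0, "Yes": 1, "No": 0}
sample["fraud_numeric"] = sample["fraud_reported"].map(label_map)

# Rows where the mapping produced NaN (unknown label or missing input)
unmapped = sample.loc[sample["fraud_numeric"].isna(), "fraud_reported"]

# Distinct non-null labels that the dictionary did not cover
unmapped_values = unmapped.dropna().unique().tolist()
n_unmapped = int(sample["fraud_numeric"].isna().sum())

print(f"Unmapped rows: {n_unmapped}, unexpected labels: {unmapped_values}")
```

If unexpected labels show up, normalizing the text first (for example with `.str.strip().str.upper()`) before mapping is one possible remedy.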

## 5. Numeric Feature Exploration

In this section, I analyze key numeric features to understand their:

- **Overall distribution** (using histograms and KDE)
- **Relationship with fraud** (via boxplots split by fraud class)
- **Presence of extreme values/outliers** (especially in claim amounts)

This helps identify skewed variables, potential transformation needs, and features that may differentiate fraud from non-fraud claims.

### Features Analyzed:
- `total_claim_amount`  
- `incident_hour_of_the_day`  
- `risk_score` *(engineered during ETL)*

Key components:
- **Histogram & Boxplot per feature**
- **Descriptive statistics**
- **Top 5 highest claim amounts**
- **Invalid hour check**

In [None]:
# Numeric Feature Exploration (Distribution + Fraud Comparison + Outliers)
numeric_features = ['total_claim_amount', 'incident_hour_of_the_day', 'risk_score']

# Check for invalid values in incident_hour_of_the_day
invalid_hours = df[(df['incident_hour_of_the_day'] < 0) | (df['incident_hour_of_the_day'] > 23)]
display(Markdown(f"**Invalid 'incident_hour_of_the_day' entries:** {len(invalid_hours)}"))

# Summary Statistics
display(Markdown("### Descriptive Statistics"))
display(df[numeric_features].describe().T)

# Combined Distribution (Histogram + Boxplot)
for feature in numeric_features:
    fig, axs = plt.subplots(1, 2, figsize=(12, 4))

    # Histogram
    sns.histplot(df[feature], bins=30, kde=True, color='skyblue', ax=axs[0])
    axs[0].set_title(f'Distribution of {feature}')
    axs[0].set_xlabel(feature)
    axs[0].set_ylabel('Count')

    # Boxplot by Fraud
    sns.boxplot(x='fraud_numeric', y=feature, data=df, palette=['#66b3ff', '#ff6666'], ax=axs[1])
    axs[1].set_title(f'{feature} by Fraud Reported')
    axs[1].set_xlabel('Fraud Reported (0=No, 1=Yes)')
    axs[1].set_ylabel(feature)

    plt.suptitle(f'{feature}: Distribution & Fraud Comparison', fontsize=14)
    plt.tight_layout()
    plt.show()

# Outlier Detection: Top 5 largest total claims
display(Markdown("### Top 5 Largest 'total_claim_amount' Claims"))
top_5_claims = df['total_claim_amount'].sort_values(ascending=False).head()
display(top_5_claims)
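
Listing the top five claims shows the largest values but does not say whether they are unusually large. A common complement, which the cell above does not use, is the 1.5 × IQR rule. The sketch below applies it to invented claim amounts.

```python
import pandas as pd

# Invented claim amounts with one extreme value
claims = pd.Series(
    [1000, 1200, 1500, 1800, 2000, 2200, 2500, 60000],
    name="total_claim_amount",
)

# Interquartile range
q1 = claims.quantile(0.25)
q3 = claims.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = claims[(claims < lower) | (claims > upper)]

print(f"Bounds: [{lower}, {upper}]")
print(outliers)
```

The 1.5 multiplier is a convention rather than a rule tied to this dataset; flagged claims are candidates for review, not automatically errors.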

## 6. Categorical Feature Exploration

This section investigates how key **categorical features** relate to fraud, using both visual and statistical analysis.

### Goals:
- Explore how categories (e.g., collision type) distribute across fraud and non-fraud cases
- Identify categories with **higher fraud rates**
- Visualize **fraud proportions** using stacked bar charts
- Examine whether **missing categorical values** signal higher fraud likelihood

### Features Analyzed:
- `incident_type`
- `collision_type`
- `police_report_available`

Key components:
- **Countplots split by fraud**
- **Fraud rate summary tables** (sorted by fraud likelihood)
- **Stacked bar charts** showing proportional fraud
- **Fraud rate comparison between missing vs. non-missing groups**

In [None]:
# Categorical Feature Exploration (Distribution + Fraud Rates + Missingness)
display(Markdown("## Categorical Feature Exploration"))

categorical_features = ['incident_type', 'collision_type', 'police_report_available']

# Countplots Split by Fraud
plt.figure(figsize=(18, 5))
for i, feature in enumerate(categorical_features, 1):
    plt.subplot(1, 3, i)
    sns.countplot(x=feature, hue='fraud_reported', data=df, palette=['#66b3ff', '#ff6666'])
    plt.title(f'{feature} by Fraud Reported')
    plt.xlabel(feature)
    plt.ylabel('Count')
    plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Fraud Rate Summary Tables
for feature in categorical_features:
    summary = df.groupby(feature).agg(
        count=(feature, 'size'),
        fraud_rate=('fraud_numeric', 'mean')
    ).sort_values('fraud_rate', ascending=False)

    display(Markdown(f"### Fraud Rate Summary for `{feature}`"))
    display(summary)

# Stacked Bar Charts (Fraud Proportions)
for feature in categorical_features:
    crosstab = pd.crosstab(df[feature], df['fraud_reported'], normalize='index')
    crosstab.plot(kind='bar', stacked=True, figsize=(8, 4), colormap='coolwarm')
    plt.title(f'Fraud Proportions by {feature}')
    plt.xlabel(feature)
    plt.ylabel('Proportion of Claims')
    plt.legend(title='Fraud Reported')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

# Missingness Impact on Fraud Rate
display(Markdown("### Impact of Missing Values on Fraud Rate"))

# Create flags for missing values
df['collision_type_missing'] = df['collision_type'].isna()
df['police_report_missing'] = df['police_report_available'].isna()

# Fraud rate by missingness
missing_features = ['collision_type_missing', 'police_report_missing']
for missing_col in missing_features:
    summary = df.groupby(missing_col)['fraud_numeric'].mean().to_frame('fraud_rate')
    label = missing_col.replace('_missing', '')
    display(Markdown(f"#### Fraud Rate: `{label}` Missing vs Not Missing"))
    display(summary)
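
Note that `groupby` drops rows whose key is `NaN` by default, so missing categories do not appear in the fraud rate summary tables. As an alternative to separate missingness flags, `dropna=False` keeps the missing group as its own row. This is a sketch on invented data, assuming pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Invented sample: two claims have no collision type recorded
sample = pd.DataFrame({
    "collision_type": ["Rear", "Rear", "Front", np.nan, np.nan],
    "fraud_numeric": [0, 1, 0, 1, 1],
})

# Default behaviour: the NaN group is silently excluded
default_rows = len(sample.groupby("collision_type").size())

# dropna=False keeps NaN as a category of its own
summary = (
    sample.groupby("collision_type", dropna=False)
    .agg(count=("fraud_numeric", "size"), fraud_rate=("fraud_numeric", "mean"))
    .sort_values("fraud_rate", ascending=False)
)

print(f"Groups with default dropna: {default_rows}")
print(summary)
```

With this variant, the missing group sits in the same table as the observed categories, which makes its fraud rate directly comparable.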

## 7. Time-Based Trends

If an `incident_date` column exists, this section analyzes fraud trends over time. This can reveal patterns, seasonality, or sudden spikes in fraud occurrences, which is useful for fraud detection and resource planning.

In [None]:
if 'incident_date' in df.columns:
    # Convert incident_date to datetime (handle errors by coercing)
    df['incident_date'] = pd.to_datetime(df['incident_date'], errors='coerce')
    
    # Drop rows where incident_date could not be parsed
    time_df = df.dropna(subset=['incident_date'])
    
    # Group by date and fraud status to count claims per day
    fraud_over_time = time_df.groupby(['incident_date', 'fraud_reported']).size().unstack(fill_value=0)
    
    # Plot fraud and non-fraud claims over time
    plt.figure(figsize=(12, 6))
    fraud_over_time.plot(ax=plt.gca())
    plt.title('Fraud vs Non-Fraud Claims Over Time')
    plt.xlabel('Incident Date')
    plt.ylabel('Number of Claims')
    plt.show()
else:
    display(Markdown("**No 'incident_date' column available to analyze time trends.**"))
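
Daily counts can be noisy when only a few claims occur per day. One possible refinement, sketched below on invented dates, is to aggregate by calendar month and look at the fraud rate rather than raw counts. The `incident_month` column is a name I introduce for illustration.

```python
import pandas as pd

# Invented incident dates and fraud flags
sample = pd.DataFrame({
    "incident_date": pd.to_datetime(
        ["2015-01-05", "2015-01-20", "2015-02-03", "2015-02-14", "2015-02-28"]
    ),
    "fraud_numeric": [1, 0, 0, 0, 1],
})

# Bucket each claim into its calendar month
sample["incident_month"] = sample["incident_date"].dt.to_period("M")

# Claims per month and the share of them flagged as fraud
monthly = sample.groupby("incident_month").agg(
    claims=("fraud_numeric", "size"),
    fraud_rate=("fraud_numeric", "mean"),
)

print(monthly)
```

Reporting the claim count next to the rate helps avoid over-reading months that contain only a handful of claims.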

## 8. EDA Summary and Next Steps

- Checked Data Quality and Missing Values  
- Explored Fraud Class Imbalance  
- Analyzed Numeric and Categorical Feature Distributions  
- Examined Feature Relationships with Fraud  
- Investigated Missing Data Impact  
- Visualized Time Trends and Outliers

### Next Steps
- Feature Engineering and Selection  
- Train/Test Split and Model Development  
- Model Evaluation and Tuning

---

## Transition to Modeling

The fraud prediction pipeline continues in [`model_training.ipynb`](./model_training.ipynb), where I:
- Engineer new fraud-predictive features  
- Split data into train/test sets  
- Train baseline and advanced models  
- Evaluate fraud detection performance