# 🏥 **Insurance Premium Prediction - Exploratory Data Analysis (EDA)**


<h2 style="font-family: 'poppins'; font-weight: bold;">👨‍💻 Author: Muhammad Hassan Saboor</h2>

[![GitHub](https://img.shields.io/badge/GitHub-Profile-blue?style=for-the-badge&logo=github)](https://github.com/MuhammadHassanSaboor) 
[![Kaggle](https://img.shields.io/badge/Kaggle-Profile-blue?style=for-the-badge&logo=kaggle)](https://www.kaggle.com/mhassansaboor) 
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Profile-blue?style=for-the-badge&logo=linkedin)](https://www.linkedin.com/in/muhammad-hassan-saboor/)  
[![Facebook](https://img.shields.io/badge/Facebook-Profile-blue?style=for-the-badge&logo=facebook)](https://www.facebook.com/profile.php?id=61555194218257) 
[![Twitter/X](https://img.shields.io/badge/Twitter-Profile-blue?style=for-the-badge&logo=twitter)](https://twitter.com/MUHAMMA84929767) 
[![Instagram](https://img.shields.io/badge/Instagram-Profile-blue?style=for-the-badge&logo=instagram)](https://www.instagram.com/m_hassan_saboor/) 

## 🗂️ **Metadata**

### Dataset:
- **Dataset Name**: `train.csv` (Kaggle Playground Series, Season 4, Episode 12)
- **File Type**: CSV

### 📊 **Columns:**
1. **id**: Unique identifier for each customer
2. **Age**: Age of the customer (in years)  
3. **Gender**: Gender of the customer (Male, Female)  
4. **Annual Income**: Annual income of the customer (in USD)  
5. **Marital Status**: Marital status of the customer (Married, Single)  
6. **Number of Dependents**: Number of dependents the customer has  
7. **Education Level**: Education level of the customer (e.g., High School, Bachelor's, Master's)  
8. **Occupation**: Occupation of the customer  
9. **Health Score**: A score representing the health condition of the customer  
10. **Location**: Customer's location (e.g., Region, City)  
11. **Policy Type**: Type of insurance policy (e.g., Health, Life, Vehicle)  
12. **Previous Claims**: The number of claims made by the customer in the past  
13. **Vehicle Age**: Age of the vehicle being insured (in years)  
14. **Credit Score**: Customer's credit score (numeric)  
15. **Insurance Duration**: Duration of the insurance policy (in years)  
16. **Policy Start Date**: The date when the insurance policy started  
17. **Customer Feedback**: Feedback provided by the customer (e.g., Positive, Negative)  
18. **Smoking Status**: Whether the customer smokes or not (Yes, No)  
19. **Exercise Frequency**: How often the customer exercises (e.g., Regularly, Occasionally, Never)  
20. **Property Type**: Type of property the customer owns (e.g., House, Apartment)  
21. **Premium Amount**: The premium paid by the customer (in USD)

### 🎯 **Goals:**
1. **Understand the distribution** of the target variable (`Premium Amount`).
2. **Examine relationships** between different features (numerical and categorical).
3. **Identify data quality issues**, such as missing values, outliers, and duplicates.
4. **Visualize trends** and feature interactions that influence the premium amount.

### 🔍 **Data Quality Checks:**
- **Missing Values**: Identification of missing values and their impact on analysis.
- **Outliers**: Detection of outliers in key numerical features like `Annual Income`, `Health Score`, and `Credit Score`.
- **Duplicates**: Checking for duplicate entries in the dataset.

### 📚 **Libraries Used:**
- **pandas**: Data manipulation and analysis
- **plotly**: Interactive visualizations
- **seaborn**: Statistical data visualization (optional)

### 📊 **Analysis Sections:**
1. **Missing Values Analysis**: Investigate the percentage and patterns of missing data.
2. **Target Variable Analysis**: Analyze the distribution and segments of the `Premium Amount`.
3. **Feature Relationships**: Explore correlations and interactions between numerical and categorical features.
4. **Categorical Feature Analysis**: Analyze the categorical features in the dataset.
5. **Outlier Detection**: Identify and analyze outliers in key numerical features.
6. **Policy Type and Insurance Duration**: Analyze how policy types and insurance durations impact premiums.
7. **Location-Based Analysis**: Investigate regional variations in premium amounts.
8. **Health and Lifestyle Features**: Analyze the effect of health, smoking status, and exercise frequency on premium amounts.
9. **Dependents and Family Dynamics**: Analyze how the number of dependents and income relate to premium amounts.
10. **Time-Based Trends**: Examine premium amounts by policy start year, month, and weekday.

---


# Important Note

Because the original dataset is large, I work with a random sample of 120,000 rows (`df_sample`) for most plots. The missing-values analysis uses the full training set (`df_train`).


# Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# import matplotlib.pyplot as plt
# import seaborn as sns

# Loading the Dataset

In [None]:
df_train = pd.read_csv("/kaggle/input/playground-series-s4e12/train.csv")

In [None]:
# Fixed random_state so the sample (and every plot built on it) is reproducible
df_sample = df_train.sample(120000, random_state=42)

# Dataset Overview

In [None]:
df_train.info()

In [None]:
df_sample.info()
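
The goals above also list duplicates, which `info()` does not report. Below is a minimal sketch of such a check. It runs on a small made-up frame rather than `df_train`, and the helper name `duplicate_report` is hypothetical, not part of the original notebook.

```python
import pandas as pd


def duplicate_report(df: pd.DataFrame, id_col: str = "id") -> dict:
    """Count exact duplicate rows, repeated ids, and rows that repeat once the id is ignored."""
    feature_cols = [c for c in df.columns if c != id_col]
    return {
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        "duplicate_rows_ignoring_id": int(df.duplicated(subset=feature_cols).sum()),
    }


# Toy data standing in for df_train (values are illustrative only)
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "Age": [25, 40, 25, 40],
    "Premium Amount": [500.0, 900.0, 500.0, 950.0],
})
report = duplicate_report(toy)
print(report)
```

On the real data the call would be `duplicate_report(df_train)`. Ignoring `id` matters because a unique identifier makes every row look distinct to a plain `duplicated()` call.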

# Exploratory Data Analysis

## 1. Missing Values Analysis

In [None]:
missing_values = df_train.isnull().mean() * 100
missing_values_df = missing_values.reset_index()
missing_values_df.columns = ["Feature", "Missing_Percentage"]

fig1 = px.bar(
    missing_values_df,
    x="Feature",
    y="Missing_Percentage",
    text="Missing_Percentage",
    title="Missing Values Percentage in Each Feature",
    template="plotly_dark",
)
fig1.update_traces(marker_color="cyan", 
                   texttemplate="%{text:.2f}%", 
                   textposition="outside")
fig1.show()

## 2. Target Variable: Premium Amount

In [None]:
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=(
        "Distribution of Premium Amount",
        "Premium Amount by Policy Type",
        "Premium Amount by Smoking Status"
    ),
    vertical_spacing=0.2  
)

fig.add_trace(
    go.Histogram(
        x=df_sample["Premium Amount"],
        nbinsx=50,
        marker=dict(color="cyan"),
        name="Premium Amount"
    ),
    row=1, col=1
)

fig.add_trace(
    go.Box(
        x=df_sample["Policy Type"],
        y=df_sample["Premium Amount"],
        marker=dict(color="cyan"),
        name="Policy Type"
    ),
    row=1, col=2
)

fig.add_trace(
    go.Box(
        x=df_sample["Smoking Status"],
        y=df_sample["Premium Amount"],
        marker=dict(color="orange"),
        name="Smoking Status"
    ),
    row=2, col=1
)

fig.update_layout(
    title_text="Premium Amount Analysis",
    template="plotly_dark",
    height=800, width=1000
)

fig.update_xaxes(title_text="Premium Amount", row=1, col=1)
fig.update_yaxes(title_text="Frequency", row=1, col=1)

fig.update_xaxes(title_text="Policy Type", row=1, col=2)
fig.update_yaxes(title_text="Premium Amount", row=1, col=2)

fig.update_xaxes(title_text="Smoking Status", row=2, col=1)
fig.update_yaxes(title_text="Premium Amount", row=2, col=1)

fig.show()
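
If the histogram shows a long right tail, it can help to put a number on it and see whether a log transform tames it. The sketch below does that on synthetic log-normal data, which is only an assumption standing in for `df_sample["Premium Amount"]`. The helper `skew_before_after_log` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def skew_before_after_log(values: pd.Series) -> tuple:
    """Return the sample skewness of a series and of its log1p transform."""
    clean = values.dropna()
    return float(clean.skew()), float(np.log1p(clean).skew())


# Synthetic right-skewed data, NOT the real premium column
rng = np.random.default_rng(0)
toy_premium = pd.Series(rng.lognormal(mean=6.5, sigma=1.0, size=5000))

raw_skew, log_skew = skew_before_after_log(toy_premium)
print(f"skew raw={raw_skew:.2f}, after log1p={log_skew:.2f}")
```

A skewness near zero after `log1p` would suggest plotting or modelling the target on the log scale. Whether that holds for the real target has to be checked on `df_sample` itself.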

## 3. Feature Relationships

In [None]:
numerical_features = [
    "Age",
    "Annual Income",
    "Health Score",
    "Number of Dependents",
    "Previous Claims",
    "Vehicle Age",
    "Credit Score",
    "Insurance Duration",
    "Premium Amount",
]

In [None]:
correlation_matrix = df_sample[numerical_features].corr()

heatmap = go.Heatmap(
    z=correlation_matrix.values,
    x=correlation_matrix.columns,
    y=correlation_matrix.columns,
    colorscale="viridis",
    colorbar=dict(title="Correlation"),
    text=correlation_matrix.round(2),  
    texttemplate="%{text}",  
)

fig1 = go.Figure(heatmap)

fig1.update_layout(
    title_text="Correlation Heatmap",
    template="plotly_dark",
    showlegend=False,
)

fig1.update_xaxes(title_text="Features")
fig1.update_yaxes(title_text="Features")

fig1.show()

In [None]:
box_plot = go.Box(
    x=df_sample["Education Level"],
    y=df_sample["Premium Amount"],
    marker=dict(color="cyan"),
    name="Education Level",
    boxmean=True,
)

fig2 = go.Figure(box_plot)

fig2.update_layout(
    title_text="Premium Amount by Education Level",
    template="plotly_dark",
    showlegend=True,
)

fig2.update_xaxes(title_text="Education Level")
fig2.update_yaxes(title_text="Premium Amount (USD)")

fig2.show()

## 4. Categorical Feature Analysis

In [None]:
fig = make_subplots(
    rows=2, cols=2,  
    subplot_titles=[
        "Distribution of Gender",
        "Distribution of Marital Status",
        "Impact of Smoking Status on Premium Amount",
        "Impact of Education Level on Premium Amount",
    ],
    vertical_spacing=0.2,  
)

fig.add_trace(
    go.Histogram(
        x=df_sample["Gender"],
        marker=dict(color="cyan"),
        name="Gender",
    ),
    row=1, col=1,
)

fig.add_trace(
    go.Histogram(
        x=df_sample["Marital Status"],
        marker=dict(color="orange"),
        name="Marital Status",
    ),
    row=1, col=2,
)

smoking_impact = df_sample.groupby("Smoking Status")["Premium Amount"].mean().reset_index()
fig.add_trace(
    go.Bar(
        x=smoking_impact["Smoking Status"],
        y=smoking_impact["Premium Amount"],
        marker=dict(color="purple"),
        name="Smoking Status",
    ),
    row=2, col=1,
)

education_impact = df_sample.groupby("Education Level")["Premium Amount"].mean().reset_index()
fig.add_trace(
    go.Bar(
        x=education_impact["Education Level"],
        y=education_impact["Premium Amount"],
        marker=dict(color="green"),
        name="Education Level",
    ),
    row=2, col=2,
)

fig.update_layout(
    title_text="Combined Analysis: Gender, Marital Status, Smoking, and Education Impact",
    template="plotly_dark",
    height=800,  
    showlegend=False,  
)

fig.update_xaxes(title_text="Gender", row=1, col=1)
fig.update_yaxes(title_text="Count", row=1, col=1)

fig.update_xaxes(title_text="Marital Status", row=1, col=2)
fig.update_yaxes(title_text="Count", row=1, col=2)

fig.update_xaxes(title_text="Smoking Status", row=2, col=1)
fig.update_yaxes(title_text="Average Premium Amount", row=2, col=1)

fig.update_xaxes(title_text="Education Level", row=2, col=2)
fig.update_yaxes(title_text="Average Premium Amount", row=2, col=2)

fig.show()

## 5. Outlier Detection

In [None]:
fig = make_subplots(
    rows=1, cols=3,  
    subplot_titles=[
        "Outliers in Annual Income",
        "Outliers in Health Score",
        "Outliers in Credit Score",
    ],
    horizontal_spacing=0.1,  
)

fig.add_trace(
    go.Box(
        y=df_sample["Annual Income"],
        marker=dict(color="cyan"),
        name="Annual Income",
    ),
    row=1, col=1,
)

fig.add_trace(
    go.Box(
        y=df_sample["Health Score"],
        marker=dict(color="orange"),
        name="Health Score",
    ),
    row=1, col=2,
)

fig.add_trace(
    go.Box(
        y=df_sample["Credit Score"],
        marker=dict(color="purple"),
        name="Credit Score",
    ),
    row=1, col=3,
)

fig.update_layout(
    title_text="Outlier Analysis: Annual Income, Health Score, and Credit Score",
    template="plotly_dark",
    height=500,
    showlegend=False,  
)

fig.update_yaxes(title_text="Annual Income", row=1, col=1)
fig.update_yaxes(title_text="Health Score", row=1, col=2)
fig.update_yaxes(title_text="Credit Score", row=1, col=3)
fig.show()
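
The box plots show outliers visually. To get counts, one option is the 1.5 × IQR rule that box plot whiskers are based on. This is a sketch on toy data: the helper name `iqr_outlier_summary` is hypothetical, and pandas may compute quartiles slightly differently from Plotly, so the counts need not match the plotted points exactly.

```python
import pandas as pd


def iqr_outlier_summary(df: pd.DataFrame, columns: list, k: float = 1.5) -> pd.DataFrame:
    """Per column: IQR fences plus count and percentage of non-null values outside them."""
    rows = []
    for col in columns:
        s = df[col].dropna()
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        n_out = int(((s < lower) | (s > upper)).sum())
        rows.append({
            "Feature": col,
            "Lower": lower,
            "Upper": upper,
            "Outliers": n_out,
            "Outlier_Percentage": 100 * n_out / len(s) if len(s) else 0.0,
        })
    return pd.DataFrame(rows)


# Toy data standing in for df_sample (values are illustrative only)
toy = pd.DataFrame({
    "Annual Income": [20_000, 25_000, 30_000, 35_000, 40_000, 45_000, 50_000, 500_000],
    "Credit Score": [600, 620, 640, 660, 680, 700, 720, None],
})
summary = iqr_outlier_summary(toy, ["Annual Income", "Credit Score"])
print(summary)
```

For the notebook's data the call would be `iqr_outlier_summary(df_sample, ["Annual Income", "Health Score", "Credit Score"])`.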

## 6. Policy Type and Insurance Duration

In [None]:
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=[
        "Average Premium Amount by Policy Type",
        "Insurance Duration vs Premium Amount",
        "Insurance Duration Distribution by Policy Type",
        None,
    ],
    horizontal_spacing=0.1,  
    vertical_spacing=0.15,  
)

policy_avg = df_sample.groupby("Policy Type")["Premium Amount"].mean().reset_index()
fig.add_trace(
    go.Bar(
        x=policy_avg["Policy Type"],
        y=policy_avg["Premium Amount"],
        marker=dict(color="cyan"),
        name="Policy Type",
    ),
    row=1, col=1,
)

fig.add_trace(
    go.Scatter(
        x=df_sample["Insurance Duration"],
        y=df_sample["Premium Amount"],
        mode="markers",
        marker=dict(color="purple"),
        name="Insurance Duration",
    ),
    row=1, col=2,
)

fig.add_trace(
    go.Box(
        x=df_sample["Policy Type"],
        y=df_sample["Insurance Duration"],
        marker=dict(color="green"),
        name="Policy Type",
    ),
    row=2, col=1,
)

fig.update_layout(
    title_text="Policy and Insurance Analysis",
    template="plotly_dark",
    height=800,  
    showlegend=False,  
)

fig.update_xaxes(title_text="Policy Type", row=1, col=1)
fig.update_yaxes(title_text="Average Premium Amount", row=1, col=1)

fig.update_xaxes(title_text="Insurance Duration (years)", row=1, col=2)
fig.update_yaxes(title_text="Premium Amount", row=1, col=2)

fig.update_xaxes(title_text="Policy Type", row=2, col=1)
fig.update_yaxes(title_text="Insurance Duration", row=2, col=1)

fig.show()

## 7. Location-Based Analysis

In [None]:
location_avg = df_sample.groupby("Location")["Premium Amount"].mean().reset_index()

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=[
        "Average Premium Amount by Location",
        "Premium Amount Distribution by Location",
    ],
    horizontal_spacing=0.2,  
)

fig.add_trace(
    go.Bar(
        x=location_avg["Location"],
        y=location_avg["Premium Amount"],
        marker=dict(color="blue"),
        name="Average Premium",
    ),
    row=1, col=1,
)

fig.add_trace(
    go.Box(
        x=df_sample["Location"],
        y=df_sample["Premium Amount"],
        marker=dict(color="red"),
        name="Premium Distribution",
    ),
    row=1, col=2,
)

fig.update_layout(
    title_text="Premium Amount Analysis by Location",
    template="plotly_dark",
    height=500,  
    showlegend=False,  
)

fig.update_xaxes(title_text="Location", tickangle=-45, row=1, col=1)
fig.update_yaxes(title_text="Average Premium Amount", row=1, col=1)

fig.update_xaxes(title_text="Location", tickangle=-45, row=1, col=2)
fig.update_yaxes(title_text="Premium Amount", row=1, col=2)
fig.show()

## 8. Health and Lifestyle Features

In [None]:
fig2 = px.box(
    df_sample,
    x="Exercise Frequency",
    y="Premium Amount",
    title="Premium Amount by Exercise Frequency",
    template="plotly_dark",
    color_discrete_sequence=["purple"],
)
fig2.update_layout(xaxis_title="Exercise Frequency", yaxis_title="Premium Amount")
fig2.show()

## 9. Dependents and Family Dynamics

In [None]:
fig1 = go.Figure()

fig1.add_trace(
    go.Scatter(
        x=df_sample["Number of Dependents"],
        y=df_sample["Premium Amount"],
        mode="markers",
        marker=dict(
            color=df_sample["Annual Income"],
            colorscale="Viridis",
            colorbar=dict(title="Annual Income (USD)"),
        ),
        name="Dependents vs Premium",
    )
)

fig1.update_layout(
    title_text="Number of Dependents vs Premium Amount by Income Level",
    template="plotly_dark",
)

fig1.update_xaxes(title_text="Number of Dependents")
fig1.update_yaxes(title_text="Premium Amount (USD)")

fig1.show()


In [None]:
fig2 = go.Figure()

fig2.add_trace(
    go.Box(
        x=df_sample["Number of Dependents"],
        y=df_sample["Premium Amount"],
        marker=dict(color="teal"),
        name="Premium Distribution",
    )
)

fig2.update_layout(
    title_text="Premium Amount Distribution by Number of Dependents",
    template="plotly_dark",
)

fig2.update_xaxes(title_text="Number of Dependents")
fig2.update_yaxes(title_text="Premium Amount (USD)")

fig2.show()


In [None]:
fig3 = go.Figure()

heatmap_data = go.Heatmap(
    x=df_sample["Annual Income"],
    y=df_sample["Number of Dependents"],
    z=df_sample["Premium Amount"],
    colorscale="Inferno",
    colorbar=dict(title="Premium Amount (USD)"),
    name="Heatmap",
)

fig3.add_trace(heatmap_data)

fig3.update_layout(
    title_text="Heatmap: Income, Dependents & Premium Amount",
    template="plotly_dark",
)

fig3.update_xaxes(title_text="Annual Income (USD)")
fig3.update_yaxes(title_text="Number of Dependents")

fig3.show()
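
The heatmap above receives one raw (income, dependents, premium) triple per row. If `Annual Income` has many distinct values, most cells are likely to hold a single point or nothing, which makes the figure hard to read. A possible alternative is to bin income into quantiles and average the premium per cell. The sketch below only builds that grid, on synthetic data. The helper `binned_mean_grid` and the bin count are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def binned_mean_grid(df, x_col, y_col, z_col, n_bins=10):
    """Mean of z_col for each (y_col value, quantile bin of x_col) cell."""
    tmp = df[[x_col, y_col, z_col]].dropna().copy()
    # Quantile bins give roughly equal row counts per income bin
    tmp["x_bin"] = pd.qcut(tmp[x_col], q=n_bins, duplicates="drop")
    return tmp.pivot_table(
        index=y_col, columns="x_bin", values=z_col, aggfunc="mean", observed=False
    )


# Synthetic data standing in for df_sample (values are illustrative only)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Annual Income": rng.uniform(10_000, 150_000, size=2_000),
    "Number of Dependents": rng.integers(0, 5, size=2_000),
    "Premium Amount": rng.uniform(100, 3_000, size=2_000),
})
grid = binned_mean_grid(
    toy, "Annual Income", "Number of Dependents", "Premium Amount", n_bins=8
)
print(grid.round(0))
```

The resulting `grid` could then be drawn with `go.Heatmap(z=grid.values, x=[str(c) for c in grid.columns], y=list(grid.index))`, keeping the same colorscale and layout as the cell above.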


## 10. Time-Based Trends

In [None]:
# Parse the start date; values that cannot be parsed become NaT
df_sample['Policy Start Date'] = pd.to_datetime(df_sample['Policy Start Date'], errors='coerce')

# Derive calendar features (Weekday: 0 = Monday, 6 = Sunday)
df_sample['Year'] = df_sample['Policy Start Date'].dt.year
df_sample['Month'] = df_sample['Policy Start Date'].dt.month
df_sample['Weekday'] = df_sample['Policy Start Date'].dt.weekday

In [None]:
fig = make_subplots(
    rows=1, cols=3,
    subplot_titles=("Premium Amount by Year", "Premium Amount by Month", "Premium Amount by Weekday"),
    column_widths=[0.33, 0.33, 0.33]
)


fig.add_trace(
    go.Box(
        x=df_sample["Year"],
        y=df_sample["Premium Amount"],
        name="By Year",
        marker=dict(color="blue"),
        boxmean=True  
    ),
    row=1, col=1
)

fig.add_trace(
    go.Box(
        x=df_sample["Month"],
        y=df_sample["Premium Amount"],
        name="By Month",
        marker=dict(color="red"),
        boxmean=True
    ),
    row=1, col=2
)

fig.add_trace(
    go.Box(
        x=df_sample["Weekday"],
        y=df_sample["Premium Amount"],
        name="By Weekday",
        marker=dict(color="green"),
        boxmean=True
    ),
    row=1, col=3
)

fig.update_layout(
    title_text="Premium Amount Distribution by Year, Month, and Weekday",
    template="plotly_dark",
    showlegend=False,
    height=500, width=1200
)

fig.update_xaxes(title_text="Policy Start Year", row=1, col=1)
fig.update_xaxes(title_text="Policy Start Month", row=1, col=2)
fig.update_xaxes(title_text="Policy Start Weekday", row=1, col=3)

fig.update_yaxes(title_text="Premium Amount (USD)", row=1, col=1)
fig.update_yaxes(title_text="Premium Amount (USD)", row=1, col=2)
fig.update_yaxes(title_text="Premium Amount (USD)", row=1, col=3)

fig.show()
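
The weekday axis above shows integers from `dt.weekday`, where 0 is Monday and 6 is Sunday. Named weekdays may read more easily. Here is a small sketch on toy data. The column name `Weekday Name` and the helper `add_weekday_name` are hypothetical.

```python
import pandas as pd

WEEKDAY_ORDER = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
]


def add_weekday_name(df: pd.DataFrame, date_col: str = "Policy Start Date") -> pd.DataFrame:
    """Return a copy with an ordered categorical weekday-name column."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col], errors="coerce")
    out["Weekday Name"] = pd.Categorical(
        dates.dt.day_name(), categories=WEEKDAY_ORDER, ordered=True
    )
    return out


# Toy data standing in for df_sample (values are illustrative only)
toy = pd.DataFrame({
    "Policy Start Date": ["2023-01-02 15:21:39", "2023-01-07 08:00:00", "not a date"],
    "Premium Amount": [800.0, 1200.0, 950.0],
})
labelled = add_weekday_name(toy)
print(labelled["Weekday Name"].tolist())
```

When plotting the names, `fig.update_xaxes(categoryorder="array", categoryarray=WEEKDAY_ORDER, row=1, col=3)` should keep the days in calendar order instead of order of first appearance.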

# 🎉 **Conclusion**
The purpose of this notebook is to help the community working on the competition **Playground Series - Season 4, Episode 12: Regression with an Insurance Dataset**. Feedback and suggestions are welcome.

Happy coding! 🚀
