In [None]:
#Import Libraries
import pandas as pd
import plotly.express as px

# --------------------------------------------------
# Load Data
# --------------------------------------------------
df = pd.read_csv(r"E:\Labmentix Internship\Week 5\TOTAL POLICY DETAILS.csv")

df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(' ', '_')
    .str.replace('.', '', regex=False)
)
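To make the effect of the header cleanup concrete, here is a minimal sketch that applies the same chain to a few made-up headers. The raw column names are hypothetical and are not taken from the CSV; passing `regex=False` to both replacements keeps the space and the `.` literal regardless of the pandas version.

```python
import pandas as pd

# Hypothetical raw headers, invented for illustration only
raw = pd.DataFrame(columns=[" Policy No.", "Charges in INR", "BMI "])

cleaned = (
    raw.columns.str.strip()                # drop leading/trailing whitespace
    .str.lower()                           # lower-case everything
    .str.replace(" ", "_", regex=False)    # spaces -> underscores
    .str.replace(".", "", regex=False)     # remove literal dots
)

print(list(cleaned))
```

With these sample headers the result should be `policy_no`, `charges_in_inr` and `bmi`.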

In [None]:
# Checking the final consolidated dataframe
df.head(5)

Unnamed: 0,policy_no,children,smoker,region,age,sex,bmi,charges_in_inr,bmi_category,age_group
0,PLC156898,0,yes,southwest,19,female,27.9,16884.92383,Overweight,<25
1,PLC156907,1,no,southeast,18,male,33.77,1725.552246,Obese,<25
2,PLC156916,3,no,southeast,28,male,33.0,4449.461914,Obese,25–40
3,PLC156925,0,no,northwest,33,male,22.705,21984.4707,Normal,25–40
4,PLC156934,0,no,northwest,32,male,28.879999,3866.855225,Overweight,25–40


In [None]:
# --------------------------------------------------
# Q1. Gender impact – DONUT CHART
# --------------------------------------------------
gender_avg = df.groupby('sex')['charges_in_inr'].mean().reset_index()

px.pie(
    gender_avg,
    names='sex',
    values='charges_in_inr',
    hole=0.5,
    title='Average Insurance Cost Share by Gender'
).show()

The donut chart above shows that the average charge for male policyholders makes up a slightly larger share (about 53%) than the average charge for female policyholders (about 47%). Gender therefore has only a minor impact on insurance costs: the male average is marginally higher, but the gap is too small to make gender a primary pricing driver on its own.
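Because the donut is drawn from group means, each slice is a group's mean divided by the sum of the group means, not a share of total charges. The sketch below reproduces that arithmetic on a small toy table (the values are invented and are not the policy data) and also reports the absolute gap between the means, which is often easier to interpret.

```python
import pandas as pd

# Toy data, invented for illustration; not the policy dataset
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female"],
    "charges_in_inr": [12000.0, 16000.0, 10000.0, 14000.0],
})

gender_avg = toy.groupby("sex")["charges_in_inr"].mean()

# What the donut displays: each mean as a percentage of the summed means
share_pct = (gender_avg / gender_avg.sum() * 100).round(1)

# Absolute difference between the two group means, in INR
gap_inr = gender_avg["male"] - gender_avg["female"]

print(share_pct.to_dict())
print(f"Gap between means: INR {gap_inr:,.2f}")
```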

In [None]:
# --------------------------------------------------
# Q2. Average policy cost – KPI PRINT
# --------------------------------------------------
print(f"Average cost per policy cover: INR {df['charges_in_inr'].mean():,.2f}")

Average cost per policy cover: INR 13,270.42


On average, the company spends ₹13,270 on each insurance policy.

In [None]:
# --------------------------------------------------
# Q3. Geographic impact – BAR (kept for clarity)
# --------------------------------------------------
region_avg = df.groupby('region')['charges_in_inr'].mean().reset_index()

px.bar(
    region_avg,
    x='region',
    y='charges_in_inr',
    title='Average Insurance Cost by Region',
    text_auto='.2s'
).show()

The Southeast region has the highest average insurance cost (~₹15k).
Reason: This may indicate higher health risk exposure, higher claim frequency, or higher medical costs in that region.

The Northeast region follows with an average cost of ~₹13k.
Reason: This may reflect moderate claim behavior and healthcare cost levels.

The Northwest and Southwest regions have the lowest average costs (~₹12k each).
Reason: These regions may have healthier policyholders, fewer claims, or better cost control.

In [None]:
# --------------------------------------------------
# Q4. Dependents impact – LINE CHART
# --------------------------------------------------
children_avg = df.groupby('children')['charges_in_inr'].mean().reset_index()

px.line(
    children_avg,
    x='children',
    y='charges_in_inr',
    markers=True,
    title='Trend of Insurance Cost with Number of Dependents'
).show()

1. Rising costs up to 3 dependents
   - More dependents generally mean higher healthcare utilization (more coverage needs, higher probability of claims).
   - Families with 2–3 dependents often fall into prime earning-age groups, enabling them to afford broader or higher-value insurance plans, which raises average charges.

2. Decline after 3 dependents
   - Households with 4–5 dependents are fewer in the dataset, making averages more sensitive to variation.
   - Such families may opt for cost-controlled or basic plans due to budget constraints, reducing average charges.
   - There may be a demographic overlap with younger parents or non-smokers, which lowers risk-based premiums.

3. Sharp drop at 5 dependents
   - Likely driven by small sample size and outlier effects.

Overall, this indicates that the number of dependents alone is not a strong independent driver of insurance cost.
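The small-sample explanation can be checked by reporting the number of policies next to each average. The following is a minimal sketch on invented toy data; the `MIN_N` threshold is a hypothetical choice for illustration, not a value from the analysis.

```python
import pandas as pd

# Toy data, invented for illustration; not the policy dataset
toy = pd.DataFrame({
    "children": [0, 0, 0, 1, 1, 2, 2, 3, 3, 5],
    "charges_in_inr": [12000, 13000, 11000, 12500, 13500,
                       15000, 14000, 15500, 15300, 8800],
})

# Mean charge and number of policies per dependents count
summary = (
    toy.groupby("children")["charges_in_inr"]
    .agg(mean_charge="mean", n_policies="count")
    .reset_index()
)

MIN_N = 2  # hypothetical threshold below which a mean is treated as unreliable
summary["low_sample"] = summary["n_policies"] < MIN_N

print(summary)
```

On the real dataframe, the same aggregation would show directly how many policies sit behind the 4- and 5-dependent averages.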

In [None]:
# --------------------------------------------------
# Q5. BMI impact – DONUT CHART
# --------------------------------------------------
df['bmi_category'] = pd.cut(
    df['bmi'],
    [0, 18.5, 25, 30, 100],
    labels=['Underweight', 'Normal', 'Overweight', 'Obese']
)

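# Explicit observed=False keeps empty categories and avoids the pandas
# FutureWarning about the changing default for categorical groupers
bmi_avg = df.groupby('bmi_category', observed=False)['charges_in_inr'].mean().reset_index()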

px.pie(
    bmi_avg,
    names='bmi_category',
    values='charges_in_inr',
    hole=0.4,
    title='Insurance Cost Distribution by BMI Category'
).show()





Obese (34.1%) → Highest insurance cost
Reason: Higher health risks lead to more medical expenses.

Overweight (24.1%) → Second highest cost
Reason: Increased chances of health issues compared to normal BMI.

Normal (22.9%) → Moderate cost
Reason: Generally healthier, fewer medical claims.

Underweight (19%) → Lowest cost
Reason: Fewer costly health conditions in this data.

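Overall Insight: Average insurance cost increases with BMI category, most likely because of higher health risks. Note that the percentages are each category's share of the summed category averages, not shares of total charges.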
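One detail worth checking in the BMI binning is how boundary values are assigned. `pd.cut` uses right-closed intervals by default, so a BMI of exactly 25 is labelled Normal and exactly 30 is labelled Overweight. The sketch below compares the default with `right=False` on a few hand-picked boundary values; which convention the analysis should use is a judgment call and is not stated in the notebook.

```python
import pandas as pd

# Boundary values chosen by hand to expose the difference
bmi = pd.Series([18.5, 24.9, 25.0, 30.0])
bins = [0, 18.5, 25, 30, 100]
labels = ["Underweight", "Normal", "Overweight", "Obese"]

# Default: intervals are (a, b], so 25.0 -> Normal and 30.0 -> Overweight
right_closed = pd.cut(bmi, bins, labels=labels)

# right=False: intervals are [a, b), so 25.0 -> Overweight and 30.0 -> Obese
left_closed = pd.cut(bmi, bins, labels=labels, right=False)

print(pd.DataFrame({"bmi": bmi,
                    "right_closed": right_closed,
                    "left_closed": left_closed}))
```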

In [None]:
# --------------------------------------------------
# Q6. Smoker vs Non-Smoker – BAR
# --------------------------------------------------
smoker_avg = df.groupby('smoker')['charges_in_inr'].mean().reset_index()

px.bar(
    smoker_avg,
    x='smoker',
    y='charges_in_inr',
    title='Average Insurance Cost: Smoker vs Non-Smoker',
    text_auto='.2s'
).show()

Smokers (~₹32k) pay much higher insurance costs than non-smokers.

Non-smokers (~₹8.4k) have significantly lower costs.

Reason: Smoking increases health risks and medical claims, so insurers charge higher premiums.

Key Insight:

On average, smokers are charged nearly four times as much as non-smokers (~₹32k vs ~₹8.4k).
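The "nearly four times" figure is simply the ratio of the two group means. As a rough sketch, the toy values below are invented so that the means land on the rounded figures quoted above (₹32k and ₹8.4k); they are not rows from the dataset.

```python
import pandas as pd

# Toy data chosen so the group means equal the rounded figures in the text
toy = pd.DataFrame({
    "smoker": ["yes", "yes", "no", "no"],
    "charges_in_inr": [30000.0, 34000.0, 8000.0, 8800.0],
})

smoker_avg = toy.groupby("smoker")["charges_in_inr"].mean()

# Ratio of the smoker mean to the non-smoker mean
ratio = smoker_avg["yes"] / smoker_avg["no"]

print(f"Smokers are charged {ratio:.2f}x the non-smoker average")
```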

In [None]:
# --------------------------------------------------
# Q7. Age impact – LINE CHART (GROUPED)
# --------------------------------------------------
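# Note: pd.cut bins are right-closed by default, so age 25 falls in '<25',
# age 40 in '25–40', and age 55 in '40–55'.
df['age_group'] = pd.cut(
    df['age'],
    [0, 25, 40, 55, 100],
    labels=['<25', '25–40', '40–55', '55+']
)

# Explicit observed=False keeps empty groups and avoids the pandas
# FutureWarning about the changing default for categorical groupers
age_avg = df.groupby('age_group', observed=False)['charges_in_inr'].mean().reset_index()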

px.line(
    age_avg,
    x='age_group',
    y='charges_in_inr',
    markers=True,
    title='Insurance Cost Trend Across Age Groups'
).show()





Older people have higher health risks, more medical needs, and frequent treatments, so insurance charges rise with age.

Key Insight:

Age is a strong factor: average insurance charges rise with each successive age group.

In [None]:
# --------------------------------------------------
# Q8. Discount eligibility – STACKED BAR
# --------------------------------------------------
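# Explicit observed=False keeps all BMI categories and avoids the pandas
# FutureWarning about the changing default for categorical groupers
discount_view = df.groupby(
    ['bmi_category', 'smoker'], observed=False
)['charges_in_inr'].mean().reset_index()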

px.bar(
    discount_view,
    x='bmi_category',
    y='charges_in_inr',
    color='smoker',
    title='BMI-Based Cost Comparison with Smoking Status',
    text_auto='.2s'
).show()





Smokers have much higher insurance costs than non-smokers in every BMI group.

Insurance cost increases as BMI increases, especially for smokers.

By BMI category:

Underweight: Low cost for non-smokers, much higher for smokers

Normal & Overweight: Costs rise steadily, smokers always pay more

Obese: Highest cost, especially for smokers

Reason:
Smoking and high BMI both increase health risks.
When smoking + obesity combine, medical expenses rise sharply.

Key Insight: Smoking multiplies the cost impact of high BMI on insurance charges.
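One way to test whether smoking really multiplies the BMI effect, rather than just adding a fixed amount, is to pivot the averages into a BMI-by-smoker table and compare the smoker gap and smoker ratio across rows. The sketch below uses invented toy values and the hypothetical column names `smoker_gap` and `smoker_ratio`; it illustrates the check, not the dataset's actual result.

```python
import pandas as pd

# Toy data, invented for illustration; not the policy dataset
toy = pd.DataFrame({
    "bmi_category": ["Normal", "Normal", "Obese", "Obese"],
    "smoker": ["no", "yes", "no", "yes"],
    "charges_in_inr": [8000.0, 20000.0, 9000.0, 40000.0],
})

# Rows: BMI category, columns: smoker status, cells: mean charge
table = toy.pivot_table(
    index="bmi_category",
    columns="smoker",
    values="charges_in_inr",
    aggfunc="mean",
)

# If both the gap and the ratio grow with BMI, the effects compound
table["smoker_gap"] = table["yes"] - table["no"]
table["smoker_ratio"] = table["yes"] / table["no"]

print(table)
```

If the gap stayed roughly constant across BMI categories, the two effects would be closer to additive than multiplicative.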