Insurance dataset found here: https://www.kaggle.com/datasets/alexisbcook/data-for-datavis?resource=download&select=insurance.csv

This notebook illustrates the average charges grouped by each column.
I decided not to separate by smoker because about 80% of the entries are non-smokers.
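The smoker split can be checked directly with `value_counts(normalize=True)`. This is a minimal sketch on a hypothetical toy frame (the `toy` rows below are made up for illustration and are not taken from insurance.csv); with the real data, the same call would be run on `df['smoker']`.

```python
import pandas as pd

# Hypothetical toy rows standing in for insurance.csv (not the real data)
toy = pd.DataFrame({
    'smoker': ['no', 'no', 'no', 'no', 'yes'],
    'charges': [2000.0, 3000.0, 4000.0, 5000.0, 30000.0],
})

# Share of each smoker category, as fractions that sum to 1
smoker_share = toy['smoker'].value_counts(normalize=True)
print(smoker_share)
```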

In [12]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

In [13]:
df = pd.read_csv('insurance.csv')
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


# Average Insurance Charge based on sex

In [14]:
#Calculating average charge based on sex
avg_female_cost = int(df[df['sex'] == 'female']['charges'].mean())
avg_male_cost = int(df[df['sex'] == 'male']['charges'].mean())
#Creating X,Y 
sex = ['female','male']
charges = [avg_female_cost, avg_male_cost]

In [15]:
#Creating a dataframe for the graph
avg_cost_df = pd.DataFrame(
    {
        'sex':sex,
        'charges':charges
    }
)

avg_cost_df

Unnamed: 0,sex,charges
0,female,12569
1,male,13956


In [16]:
#Making the graph
px.bar(
    avg_cost_df,
    x = 'sex',
    y = 'charges',
    color = 'sex',
    title = 'Insurance Charges based on sex',
    color_discrete_map={
        'female':'pink',
        'male':'blue'
    },
    text = 'charges'  
    
)


Males are charged about 1.4k more on average than females (13,956 vs. 12,569).
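The gap between the two averages can also be computed rather than read off the chart. A small sketch, assuming a frame with `sex` and `charges` columns; the `toy` values are hypothetical and only show the calls.

```python
import pandas as pd

# Hypothetical toy rows (not the real insurance data)
toy = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male'],
    'charges': [10000.0, 12000.0, 11000.0, 15000.0],
})

# Mean charge per sex in one step, indexed by the sex label
avg_by_sex = toy.groupby('sex')['charges'].mean()

# Difference between the male and female averages
gap = avg_by_sex['male'] - avg_by_sex['female']
print(avg_by_sex)
print(gap)
```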

# Average Insurance Charges based on Region

In [25]:
#Calculating the average charge per region and putting it into a dataframe
#as_index = False keeps 'region' as a regular column, so the result is a dataframe
region_df = df.groupby('region', as_index = False)['charges'].mean()
region_df.columns = ['region', 'avg_charge']
region_df

Unnamed: 0,region,avg_charge
0,northeast,13406.384516
1,northwest,12417.575374
2,southeast,14735.411438
3,southwest,12346.937377


In [26]:
#Make graph
fig = px.bar(
    region_df,
    x = 'avg_charge',
    y = 'region',
    color = 'region',
    orientation = 'h',
    title = 'Charges based on Region',
    color_discrete_map={
        'northeast' : 'blue',
        'northwest' : 'green',
        'southeast' : 'yellow',
        'southwest' : 'purple'
    },
    text = 'avg_charge',
        
)

fig.update_traces(
    textfont_color = 'black',
    textfont_family = 'Times New Roman',
    textfont_size = 16,
    textposition = 'outside',
    cliponaxis = False
)   
fig.show()

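From this, we can see that the western regions (northwest and southwest, roughly 12.3k to 12.4k) have lower average charges than the eastern regions (northeast at about 13.4k, southeast at about 14.7k).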

# Average Insurance charge based on Children


In [27]:

#The average cost for 4 or more children comes out lower, but those groups have far fewer entries, so their averages are less reliable
#We can check how many entries each group has using value_counts
children_df = df.groupby('children', as_index = False)['charges'].mean()
children_df.columns = ['children', 'costs']
df['children'].value_counts()
#Because of this, we will drop the rows for 4 and 5 children.

0    574
1    324
2    240
3    157
4     25
5     18
Name: children, dtype: int64

In [28]:
#Drop the rows with index labels 4 and 5 (the groups with 4 and 5 children)
children_df = children_df.drop([4,5])
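Dropping by index label works here because the index happens to match the number of children. An alternative sketch filters on group size instead, so the cut-off is explicit. The `MIN_ROWS` threshold and the `toy` rows are hypothetical, chosen only to illustrate the idea.

```python
import pandas as pd

# Hypothetical toy rows (not the real insurance data)
toy = pd.DataFrame({
    'children': [0, 0, 0, 1, 1, 1, 4],
    'charges': [1000.0, 2000.0, 3000.0, 2000.0, 3000.0, 4000.0, 9000.0],
})

MIN_ROWS = 3  # hypothetical minimum group size; pick a value that suits the data

# Mean charge and row count per number of children
summary = toy.groupby('children', as_index=False).agg(
    costs=('charges', 'mean'),
    n=('charges', 'size'),
)

# Keep only the groups with enough rows to give a stable average
children_toy = summary[summary['n'] >= MIN_ROWS]
print(children_toy)
```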

In [29]:
fig = px.scatter(
    children_df,
    x = 'children',
    y = 'costs',
    title = 'Insurance Charges corresponding to number of children',
    color = 'costs',
).update_traces(mode='lines+markers')

fig.show()
   

Among the groups with 0 to 3 children, the average insurance charge increases as the number of children increases.

# Average Insurance based on ages

In [30]:
#We will split the ages into 4 different groups, min value is 18 and max is 64
#The function receives one row at a time because apply is called with axis = 1
def age_grouper(row):
    if row['age'] <= 25:
        return 'young adult'
    elif row['age'] <= 44:
        return 'adult'
    elif row['age'] <= 59:
        return 'mid_age'
    else:
        return 'later_age'
df['age_group'] = df.apply(age_grouper,axis = 1)

age_group_df = df.groupby('age_group', as_index = False)['charges'].mean()
age_group_df


Unnamed: 0,age_group,charges
0,adult,11818.902596
1,later_age,21248.021885
2,mid_age,15922.929285
3,young adult,9087.015807
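The same age groups could also be built with `pd.cut`, which avoids a row-by-row `apply`. This is a sketch under the assumption that all ages fall between 18 and 64 as stated above; unlike the function, `pd.cut` returns NaN for ages outside the bins. The `toy` ages are hypothetical boundary values.

```python
import pandas as pd

# Hypothetical ages sitting on the group boundaries
toy = pd.DataFrame({'age': [18, 25, 26, 44, 45, 59, 60, 64]})

# Right-inclusive bins: (17, 25], (25, 44], (44, 59], (59, 64]
bins = [17, 25, 44, 59, 64]
labels = ['young adult', 'adult', 'mid_age', 'later_age']

toy['age_group'] = pd.cut(toy['age'], bins=bins, labels=labels)
print(toy)
```

Because `pd.cut` returns an ordered categorical, a later `groupby` on `age_group` lists the groups in bin order rather than alphabetically.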


In [31]:
fig = px.pie(
    age_group_df,
    values = 'charges',
    names = 'age_group',
)
fig.show()

# Average Insurance Charges based on BMI

In [32]:
bmi_df = df.groupby('bmi', as_index = False)['charges'].mean()
bmi_df.columns = ['bmi', 'charges']


In [33]:
fig = px.scatter(
    bmi_df,
    x = 'bmi',
    y = 'charges',
    title = 'Average Charges based on bmi',
    color = 'bmi',
)
fig.show()

In the scatter plot, BMI does not appear to affect the charges as much as the other categories do.
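Because bmi is continuous, grouping on its exact value tends to produce many very small groups, which makes the plot noisy. One way to check the relationship more directly is a correlation coefficient plus averages over a few BMI bands. The sketch below uses hypothetical `toy` rows and illustrative band edges (25 and 30); it shows the calls only and says nothing about the real data.

```python
import pandas as pd

# Hypothetical toy rows (not the real insurance data)
toy = pd.DataFrame({
    'bmi': [18.0, 22.0, 27.0, 29.0, 33.0, 38.0],
    'charges': [3000.0, 5000.0, 4000.0, 6000.0, 5000.0, 7000.0],
})

# Pearson correlation between bmi and charges (a value between -1 and 1)
r = toy['bmi'].corr(toy['charges'])

# Average charge per BMI band; the band edges are illustrative assumptions
toy['bmi_band'] = pd.cut(
    toy['bmi'],
    bins=[0, 25, 30, 100],
    labels=['25 or under', '25 to 30', 'over 30'],
)
band_avg = toy.groupby('bmi_band', observed=True)['charges'].mean()

print(r)
print(band_avg)
```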
