# **(HACKATHON 1 HEALTH INSURANCE)**

## Objectives

* Fetch the data from Kaggle (https://www.kaggle.com/datasets/willianoliveiragibin/healthcare-insurance) and save it as raw data
* Clean and transform the data
* Engineer features for modelling and visualisation

## Inputs

* This dataset contains information on the relationship between personal attributes (age, gender, BMI, family size, smoking habits), geographic factors, and their impact on medical insurance charges.
* Raw data: one CSV file with 8 columns and 1,339 rows, counting the index column and the header row

## Outputs

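* Cleaned dataset saved as cleaned_healthcare_insurance.csv
* Interactive Plotly charts: average charges by age, gender and region, correlation heatmaps, and scatter plots with OLS trendlines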




---

# DATA CLEANING

In [33]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [34]:
df = pd.read_csv(r'C:\Users\lilia\Documents\LP\Data AI\Hackathons\H1_HealtInsurance\H1_HealthInsurance\data\inputs\raw\insurance.csv.csv')

# Check data types and view a summary of the DataFrame

In [35]:

df.info()  
df.describe() 


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


# Sample of the first few rows

In [36]:
df.head(10)

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |
| 5 | 31 | female | 25.74 | 0 | no | southeast | 3756.6216 |
| 6 | 46 | female | 33.44 | 1 | no | southeast | 8240.5896 |
| 7 | 37 | female | 27.74 | 3 | no | northwest | 7281.5056 |
| 8 | 37 | male | 29.83 | 2 | no | northeast | 6406.4107 |
| 9 | 60 | female | 25.84 | 0 | no | northwest | 28923.13692 |


# Count of missing values per column

In [37]:
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
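
Missing values are only one data-quality check. Another that could be run at this point is a duplicate-row check. The sketch below is illustrative only: it uses a small toy frame whose last row is deliberately repeated, not the real dataset, so it says nothing about whether insurance.csv contains duplicates.

```python
import pandas as pd

# Toy frame (hypothetical): the last row deliberately repeats the first one
toy = pd.DataFrame({
    "age": [19, 18, 19],
    "sex": ["female", "male", "female"],
    "charges": [16884.924, 1725.5523, 16884.924],
})

# duplicated() flags every row that repeats an earlier row exactly
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each repeated row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(f"duplicate rows: {n_duplicates}, rows kept: {len(deduplicated)}")
```

On the real data the same two calls would be applied to df before rounding and saving.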

# Round bmi and charges to 2 decimal places

In [38]:
df['bmi'] = df['bmi'].round(2)
df['charges'] = df['charges'].round(2)

In [39]:
df.head(10)

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.92 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.55 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.46 |
| 3 | 33 | male | 22.7 | 0 | no | northwest | 21984.47 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.86 |
| 5 | 31 | female | 25.74 | 0 | no | southeast | 3756.62 |
| 6 | 46 | female | 33.44 | 1 | no | southeast | 8240.59 |
| 7 | 37 | female | 27.74 | 3 | no | northwest | 7281.51 |
| 8 | 37 | male | 29.83 | 2 | no | northeast | 6406.41 |
| 9 | 60 | female | 25.84 | 0 | no | northwest | 28923.14 |


# Save Cleaned Data

In [40]:
df.to_csv('cleaned_healthcare_insurance.csv', index=False)
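
The call above writes the file into whatever the current working directory happens to be. A minimal sketch of writing to an explicit folder instead is shown below. The helper name save_csv and the data/outputs folder are assumptions for illustration, not part of the project, and the demo writes into a temporary directory so nothing is left on disk.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_csv(df, path):
    """Write df to path as CSV, creating missing parent folders first."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Two rows copied from the rounded preview above, just to exercise the helper
demo = pd.DataFrame({"age": [19, 18], "charges": [16884.92, 1725.55]})

with tempfile.TemporaryDirectory() as tmp:
    # "data/outputs" is a hypothetical folder layout
    target = Path(tmp) / "data" / "outputs" / "cleaned_healthcare_insurance.csv"
    saved = save_csv(demo, target)
    reloaded = pd.read_csv(saved)

print(reloaded.shape)
```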


In [12]:
import pandas as pd
import numpy as np

In [13]:
df_cleaned = pd.read_csv('cleaned_healthcare_insurance.csv')


In [14]:
df_cleaned.head(10)

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.92 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.55 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.46 |
| 3 | 33 | male | 22.7 | 0 | no | northwest | 21984.47 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.86 |
| 5 | 31 | female | 25.74 | 0 | no | southeast | 3756.62 |
| 6 | 46 | female | 33.44 | 1 | no | southeast | 8240.59 |
| 7 | 37 | female | 27.74 | 3 | no | northwest | 7281.51 |
| 8 | 37 | male | 29.83 | 2 | no | northeast | 6406.41 |
| 9 | 60 | female | 25.84 | 0 | no | northwest | 28923.14 |
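
The objectives list feature engineering, but the notebook does not yet derive any new columns. The sketch below shows one possible approach, assuming a BMI band and a has-children flag would be useful. The function name add_features and both column names are hypothetical. The band edges (18.5, 25, 30) follow the commonly used WHO adult BMI categories.

```python
import numpy as np
import pandas as pd

# First five rows of the cleaned data, copied from the preview above
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.9, 33.77, 33.0, 22.7, 28.88],
    "children": [0, 1, 3, 0, 0],
})


def add_features(df):
    """Return a copy of df with two illustrative engineered columns."""
    out = df.copy()
    # Left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
    out["bmi_category"] = pd.cut(
        out["bmi"],
        bins=[0, 18.5, 25, 30, np.inf],
        labels=["underweight", "normal", "overweight", "obese"],
        right=False,
    )
    # True when the policyholder has at least one child
    out["has_children"] = out["children"] > 0
    return out


features = add_features(sample)
print(features)
```

Applied to df_cleaned, the new columns could then serve as grouping keys in the charts that follow.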


# DATA VISUALISATION

In [26]:
import plotly.express as px
import plotly.graph_objects as go



# Descriptive visualisation

In [17]:

avg_charges_by_age = df_cleaned.groupby('age', as_index=False)['charges'].mean()
fig = px.line(avg_charges_by_age, x='age', y='charges',
              title='Average Insurance Charges by Age',
              labels={'charges': 'Avg Charges ($)', 'age': 'Age'})
fig.show()


In [20]:
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Load dataset (adjust path as needed)
df_cleaned = pd.read_csv('cleaned_healthcare_insurance.csv')

# Precompute averages
avg_charges_by_age = df_cleaned.groupby('age', as_index=False)['charges'].mean()
avg_charges_by_gender = df_cleaned.groupby('sex', as_index=False)['charges'].mean()
avg_charges_by_region = df_cleaned.groupby('region', as_index=False)['charges'].mean()

# Create subplot figure with 1 row and 3 columns
fig = make_subplots(rows=1, cols=3, subplot_titles=[
    "Avg Charges by Age",
    "Avg Charges by Gender",
    "Avg Charges by Region"
])

# Add line chart (age)
fig.add_trace(
    go.Scatter(x=avg_charges_by_age['age'], y=avg_charges_by_age['charges'],
               mode='lines+markers', name='Age'),
    row=1, col=1
)

# Add bar chart (gender)
fig.add_trace(
    go.Bar(x=avg_charges_by_gender['sex'], y=avg_charges_by_gender['charges'], name='Gender'),
    row=1, col=2
)

# Add bar chart (region)
fig.add_trace(
    go.Bar(x=avg_charges_by_region['region'], y=avg_charges_by_region['charges'], name='Region'),
    row=1, col=3
)

# Update layout
fig.update_layout(height=400, width=1200, title_text="Descriptive Statistics: Insurance Charges")
fig.update_yaxes(title_text="Avg Charges ($)", row=1, col=1)
fig.update_yaxes(title_text="Avg Charges ($)", row=1, col=2)
fig.update_yaxes(title_text="Avg Charges ($)", row=1, col=3)
fig.show()


# Correlation Analysis

In [25]:
corr_matrix = df_cleaned.corr(numeric_only=True)
fig = px.imshow(corr_matrix, text_auto=True,
                title='Correlation Heatmap of Features (Including Charges)',
                color_continuous_scale='RdBu', zmin=-1, zmax=1)
fig.show()


In [37]:
# correlation with charges

import pandas as pd
import plotly.express as px

# Load your dataset
df_cleaned = pd.read_csv('cleaned_healthcare_insurance.csv')

# Calculate correlation matrix
corr_matrix = df_cleaned.corr(numeric_only=True).round(2)

# Optional: Sort by correlation with 'charges'
corr_sorted = corr_matrix[['charges']].sort_values(by='charges', ascending=False)

# Display only top correlations with charges
print("Correlations with charges:\n", corr_sorted)

# Full heatmap
fig = px.imshow(
    corr_matrix,
    text_auto=True,
    color_continuous_scale='RdBu_r',
    zmin=-1, zmax=1,
    title='Correlation Matrix (Including Insurance Charges)'
)

fig.update_layout(
    autosize=False,
    width=700,
    height=600,
    margin=dict(l=40, r=40, t=50, b=40)
)

fig.show()

Correlations with charges:
           charges
charges      1.00
age          0.30
bmi          0.20
children     0.07
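
Because numeric_only=True drops the text columns, the table above cannot show how smoker relates to charges. One way to include it, sketched here as an assumption rather than as part of the original analysis, is to encode yes/no as 1/0 before calling corr(). The column name smoker_flag is hypothetical, and the ten rows are the ones shown in the earlier preview, which is far too few to draw conclusions from.

```python
import pandas as pd

# First ten rows of the cleaned data, copied from the preview earlier in the notebook
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32, 31, 46, 37, 37, 60],
    "bmi": [27.9, 33.77, 33.0, 22.7, 28.88, 25.74, 33.44, 27.74, 29.83, 25.84],
    "smoker": ["yes", "no", "no", "no", "no", "no", "no", "no", "no", "no"],
    "charges": [16884.92, 1725.55, 4449.46, 21984.47, 3866.86,
                3756.62, 8240.59, 7281.51, 6406.41, 28923.14],
})

# Encode the binary text column as an integer flag so corr() keeps it
encoded = sample.assign(smoker_flag=sample["smoker"].map({"yes": 1, "no": 0}))

# Pearson correlation of every numeric column with charges, highest first
corr_with_charges = (
    encoded.corr(numeric_only=True)["charges"].sort_values(ascending=False)
)
print(corr_with_charges.round(2))
```

The same mapping applied to df_cleaned would add a smoker row to the heatmap.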


# Grouped Predictive Analysis Chart

In [34]:
%pip install statsmodels

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import statsmodels.api as sm


# Load dataset
df_cleaned = pd.read_csv('cleaned_healthcare_insurance.csv')

# Create subplots: 1 row, 3 columns
fig = make_subplots(rows=1, cols=3, subplot_titles=[
    'Charges vs Age (Smoker)',
    'Charges vs BMI (Gender)',
    'Charges vs Children'
])

# --- Charges vs Age (with smoker) ---
scatter1 = px.scatter(df_cleaned, x='age', y='charges', color='smoker', trendline='ols')
for trace in scatter1.data:
    fig.add_trace(trace, row=1, col=1)

# --- Charges vs BMI (with gender) ---
scatter2 = px.scatter(df_cleaned, x='bmi', y='charges', color='sex', trendline='ols')
for trace in scatter2.data:
    fig.add_trace(trace, row=1, col=2)

# --- Charges vs Number of Children ---
scatter3 = px.scatter(df_cleaned, x='children', y='charges', color='smoker', trendline='ols')
for trace in scatter3.data:
    fig.add_trace(trace, row=1, col=3)

# Layout
fig.update_layout(height=500, width=1300, title_text="Predictive Analysis: Estimating Insurance Charges")
fig.update_yaxes(title_text="Charges ($)", row=1, col=1)
fig.update_xaxes(title_text="Age", row=1, col=1)
fig.update_xaxes(title_text="BMI", row=1, col=2)
fig.update_xaxes(title_text="Children", row=1, col=3)
fig.show()


Note: you may need to restart the kernel to use updated packages.
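
The trendline='ols' option draws an ordinary least squares line for each colour group, but the chart alone does not give the slope or intercept as numbers. The sketch below fits a single straight line with numpy.polyfit, which is a stand-in for illustration and not the routine Plotly itself calls (Plotly relies on statsmodels). It uses the ten preview rows shown earlier and ignores the smoker split, so the coefficients are not meaningful estimates for the full dataset.

```python
import numpy as np

# age and charges from the first ten rows of the cleaned data
age = np.array([19, 18, 28, 33, 32, 31, 46, 37, 37, 60], dtype=float)
charges = np.array([16884.92, 1725.55, 4449.46, 21984.47, 3866.86,
                    3756.62, 8240.59, 7281.51, 6406.41, 28923.14])

# A degree-1 polynomial fit is a least squares straight line: charges ~ slope * age + intercept
slope, intercept = np.polyfit(age, charges, deg=1)

predicted = slope * age + intercept
residuals = charges - predicted

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Fitting one line per smoker group, as the chart does, would mean repeating the fit on each subset.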


# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [38]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\lilia\\Documents\\LP\\Data AI\\Hackathons\\H1_HealtInsurance\\H1_HealthInsurance\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [39]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [40]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\lilia\\Documents\\LP\\Data AI\\Hackathons\\H1_HealtInsurance\\H1_HealthInsurance'
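
Running the cells above a second time can move the working directory one level higher than intended. A minimal sketch of a guard is shown below, assuming the notebooks folder is named jupyter_notebooks as in the path output above. The function name move_to_project_root is hypothetical.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move to the parent folder only when currently inside the notebooks folder."""
    cwd = Path.cwd()
    if cwd.name == notebook_folder:
        os.chdir(cwd.parent)
    # Return the working directory after the (possible) change
    return Path.cwd()


print(move_to_project_root())
```

Because the function checks the folder name first, calling it repeatedly leaves the working directory at the project root.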


---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [41]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid until a folder name is supplied
except Exception as e:
  print(e)