Healthcare Insurance Cost Analysis

# Project Overview

## Objectives

- Load and clean the healthcare insurance dataset from Kaggle.  
- Visualise relationships between insurance charges and factors such as age, smoking status, gender, BMI, and geographic region.  
- Engineer features such as BMI categories and region flags for further analysis.  
- Prepare insights to support predictive modelling of insurance charges.

## Inputs

- `insurance.csv` dataset downloaded from Kaggle:  
  [https://www.kaggle.com/datasets/willianoliveiragibin/healthcare-insurance](https://www.kaggle.com/datasets/willianoliveiragibin/healthcare-insurance)

## Outputs

- Cleaned and transformed dataset held in a pandas DataFrame for analysis.  
- Visualisations illustrating key relationships between variables (scatter plots, boxplots).  
- Feature-engineered columns such as BMI categories and one-hot encoded regions.  
- Summary statistics and correlation matrices for exploratory data analysis.

## Additional Comments

- Visualisations use Plotly, Matplotlib, and Seaborn for interactivity and style variety.  
- Further modelling and prediction will be done in subsequent notebooks or scripts.  
- Ensure consistent data paths and environment setup before running the notebook.





---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook's current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [7]:
!python --version

Python 3.12.8


In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
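Running the `os.chdir` cell a second time moves one level higher again, which breaks relative paths. A minimal sketch of a guard against that is shown below. It assumes the notebooks live in a folder named `jupyter_notebooks`; that name and the helper `move_to_project_root` are hypothetical and should be adapted to the actual layout.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Change to the parent directory only while still inside the notebooks folder.

    `notebook_dir_name` is an assumed folder name, not taken from the project.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        # Still inside the notebooks subfolder: step up exactly once.
        os.chdir(cwd.parent)
    # Otherwise leave the working directory untouched.
    return Path.cwd()


project_root = move_to_project_root()
print(f"Working directory: {project_root}")
```

Because the function checks the folder name first, re-running the cell leaves the working directory unchanged.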


## Section 1 - Extracting Data

In this section, I:

- Imported all required Python libraries for data analysis and visualisation (pandas, NumPy, Matplotlib, Seaborn, Plotly).
- Loaded the dataset from the `data` folder.
- Displayed the first few rows of the dataset to understand its structure.
- Checked the column types and number of non-null values using `.info()`.
- Used `.describe()` to generate summary statistics for numerical variables.
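The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's actual cell: the column names (`age`, `sex`, `bmi`, `children`, `smoker`, `region`, `charges`) are assumed, the helper `load_insurance` is hypothetical, and the three rows are made-up stand-ins so the example runs without the real file. In the notebook, the path to `insurance.csv` inside the `data` folder would be passed instead.

```python
import io

import pandas as pd


def load_insurance(source) -> pd.DataFrame:
    """Read the insurance CSV from a file path or a file-like object."""
    return pd.read_csv(source)


# Made-up illustrative rows (not real records) with the assumed column layout.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,22.0,0,no,northeast,4000.0\n"
    "45,male,27.5,2,yes,southwest,21000.0\n"
    "52,female,31.2,1,no,southeast,9500.0\n"
)

df = load_insurance(sample_csv)

print(df.head())      # first rows, to check the structure
df.info()             # column dtypes and non-null counts
print(df.describe())  # summary statistics for the numeric columns
```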


---

## Section 2 - Transforming Data

In this section, I:

- Performed data cleansing and transformation routines.
- Encoded categorical variables (sex, smoker, region) into numerical format.
- Handled missing values if applicable.
- Created additional features, such as BMI categories.
- Prepared the cleaned dataset for further analysis and visualisation.
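A minimal sketch of these transformations is given below. Several choices are assumptions rather than details from the notebook: the column names and category values, the new column names (`sex_code`, `smoker_code`, `bmi_category`), the use of the commonly cited BMI cut-offs of 18.5, 25 and 30, and dropping incomplete rows instead of imputing them. The helper `transform_insurance` is hypothetical and the input rows are made up.

```python
import pandas as pd


def transform_insurance(df: pd.DataFrame) -> pd.DataFrame:
    """Clean the data and add encoded and engineered columns (illustrative only)."""
    # Remove exact duplicate rows and rows with missing values.
    out = df.drop_duplicates().dropna().copy()

    # Binary encoding for the two-level categorical columns (assumed values).
    out["sex_code"] = out["sex"].map({"female": 0, "male": 1})
    out["smoker_code"] = out["smoker"].map({"no": 0, "yes": 1})

    # BMI categories; right=False makes each bin include its lower edge.
    out["bmi_category"] = pd.cut(
        out["bmi"],
        bins=[0, 18.5, 25, 30, float("inf")],
        labels=["underweight", "normal", "overweight", "obese"],
        right=False,
    )

    # One-hot encode region into region_<name> flag columns.
    return pd.get_dummies(out, columns=["region"], prefix="region", dtype=int)


# Made-up rows; the second and third are duplicates on purpose.
raw = pd.DataFrame(
    {
        "age": [30, 45, 45, 52],
        "sex": ["female", "male", "male", "female"],
        "bmi": [17.0, 27.5, 27.5, 31.2],
        "children": [0, 2, 2, 1],
        "smoker": ["no", "yes", "yes", "no"],
        "region": ["northeast", "southwest", "southwest", "northeast"],
        "charges": [4000.0, 21000.0, 21000.0, 9500.0],
    }
)

clean = transform_insurance(raw)
print(clean)
```

Keeping the original `sex` and `smoker` columns alongside the encoded ones leaves the readable labels available for plots.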

---

NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook cells should run top-down. Do not create a workflow in which you need to go back to a previous cell to execute a task, such as refreshing the content of a variable.

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os

try:
    # create your folder here
    # os.makedirs(name='')
    pass  # placeholder so the try block is valid until a folder is created
except Exception as e:
    print(e)
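One possible way to fill in the folder-creation step is sketched below. The helper `save_dataframe`, the folder name `outputs`, and the file name `insurance_clean.csv` are all hypothetical, and the example writes into a temporary directory so it does not touch the repository. `exist_ok=True` lets the cell be re-run without raising an error when the folder already exists.

```python
import os
import tempfile

import pandas as pd


def save_dataframe(df: pd.DataFrame, folder: str, filename: str) -> str:
    """Create `folder` if needed and write `df` there as CSV; return the file path."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)
    return path


# Made-up frame and hypothetical names, written under a temporary directory.
demo = pd.DataFrame({"age": [30, 45], "charges": [4000.0, 21000.0]})
target_folder = os.path.join(tempfile.mkdtemp(), "outputs")
saved_path = save_dataframe(demo, target_folder, "insurance_clean.csv")
print(f"Saved to {saved_path}")
```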
