# **Medical Insurance Charges Study**

## Objectives

* Answer business requirement 1:
  * The client wants to understand the factors that influence medical insurance charges, so they can learn the most relevant factors and how they affect the charges.

## Inputs

* outputs/datasets/collection/insurance.csv

## Outputs

* EDA Report: Generate reports to summarize data distributions, correlations, and initial findings.
* Visualizations that illustrate the relationships between the variables and the medical insurance charges, such as scatter plots, box plots, and histograms. 

## Additional Comments

* This is the first step in the analysis, focusing on exploratory data analysis (EDA) to understand the dataset and identify key factors influencing medical insurance charges.


---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
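
One caveat with the cells above: each re-run moves the working directory up one more level. The sketch below shows a guarded variant that only moves up while still inside the notebooks subfolder. The folder name `jupyter_notebooks` and the helper name `move_to_project_root` are hypothetical placeholders, not something this notebook defines.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name="jupyter_notebooks"):
    """Move up one level only if the current folder is the notebooks subfolder.

    notebook_dir_name is an assumed folder name; adjust it to your project.
    Returns the working directory after the (possible) change.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()


# Safe to call repeatedly: a second call leaves the directory unchanged.
print(move_to_project_root())
```

Because the check is based on the folder name, running the cell twice does not climb past the project root.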

# Load Dataset

In [None]:
import pandas as pd
df = pd.read_csv('outputs/datasets/collection/insurance.csv')
df.head(10)
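
Before plotting, it can help to check the loaded data for missing values and duplicate rows. The sketch below uses a small hypothetical DataFrame with the same column names as this dataset; the values and the helper name `quick_overview` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows that only mimic the column layout used in this notebook.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 19],
    "sex": ["female", "male", "male", "male", "female"],
    "bmi": [27.9, 33.77, 33.0, None, 27.9],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "yes"],
    "region": ["southwest", "southeast", "southeast", "northwest", "southwest"],
    "charges": [16884.92, 1725.55, 4449.46, 21984.47, 16884.92],
})


def quick_overview(frame):
    """Return basic quality indicators for a freshly loaded DataFrame."""
    return {
        "shape": frame.shape,                                # (rows, columns)
        "missing_per_column": frame.isna().sum().to_dict(),  # NaN count per column
        "duplicate_rows": int(frame.duplicated().sum()),     # fully identical rows
    }


overview = quick_overview(sample)
print(overview)
```

The same helper could be called as `quick_overview(df)` on the real dataset.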

---

# Understand the dataset

In [None]:
# importing libraries
import matplotlib.pyplot as plt
import seaborn as sns

Let's look at the overall distribution of charges first, then at how charges vary with each of the other variables in the dataset.

In [None]:
sns.set_theme(style='whitegrid')
f, ax = plt.subplots(1,1, figsize=(12, 8))
ax = sns.histplot(df['charges'], kde = True, color = 'c', kde_kws=dict(cut=3))
plt.title('Distribution of Charges')
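
The histogram can be complemented with a numeric summary for the EDA report. The sketch below uses hypothetical charge values, chosen only to illustrate a long right tail; it is not data from this dataset.

```python
import pandas as pd

# Hypothetical charges, for illustration only.
charges = pd.Series([1000, 1200, 1500, 2000, 3000, 40000], name="charges")

summary = charges.describe()  # count, mean, std, min, quartiles, max
skewness = charges.skew()     # positive values indicate a longer right tail

print(summary)
print(f"skewness: {skewness:.2f}")
```

A mean well above the median, together with positive skewness, is the numeric counterpart of a histogram with a long right tail.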

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='age', y='charges', alpha=0.5)
plt.title('Distribution of Charges by Age')
plt.xlabel('Age')
plt.ylabel('Charges')


In [None]:
charges = df['charges'].groupby(df['sex']).mean()
f, ax = plt.subplots(1,1, figsize=(8, 6))
ax = sns.barplot(x=charges.index, y=charges.values)
plt.title('Average Charges by Sex')

df['sex'].value_counts()

In [None]:
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='bmi', y='charges', alpha=0.5)
plt.title('Distribution of Charges by BMI')
plt.xlabel('BMI')
plt.ylabel('Charges')

In [None]:
charges = df['charges'].groupby(df['children']).mean()
f, ax = plt.subplots(1,1, figsize=(8, 6))
ax = sns.barplot(x=charges.index, y=charges.values)
plt.title('Average Charges by Children')

df['children'].value_counts()

In [None]:
charges = df['charges'].groupby(df['region']).mean()
f, ax = plt.subplots(1,1, figsize=(8, 6))
ax = sns.barplot(x=charges.index, y=charges.values)
plt.title('Average Charges by Region')

df['region'].value_counts()

In [None]:
charges = df['charges'].groupby(df['smoker']).mean()
f, ax = plt.subplots(1,1, figsize=(8, 6))
ax = sns.barplot(x=charges.index, y=charges.values)
plt.title('Average Charges by Smoker')

df['smoker'].value_counts()
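
The cells above compute the group mean and the group size in separate steps. A minimal sketch of combining them into one table per categorical variable follows; the data and the helper name `charges_by` are hypothetical.

```python
import pandas as pd

# Hypothetical data, for illustration only.
sample = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "charges": [30000.0, 4000.0, 6000.0, 34000.0, 5000.0, 9000.0],
})


def charges_by(frame, column):
    """Summarize charges per category: group size plus location and range."""
    return (
        frame.groupby(column)["charges"]
        .agg(["count", "mean", "median", "min", "max"])
        .sort_values("mean", ascending=False)
    )


table = charges_by(sample, "smoker")
print(table)
```

Showing the count next to the mean makes it easier to judge whether a group average rests on enough observations.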

---

# Correlation Analysis

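Correlation measures the strength and direction of a linear relationship between two numerical variables.
* Values range from -1 to 1.
    * +1: Perfect positive linear correlation; values close to +1 indicate a strong positive relationship (e.g., as age increases, charges increase).
    * -1: Perfect negative linear correlation; values close to -1 indicate a strong negative relationship.
    * 0: No linear correlation (a non-linear relationship may still exist).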

We’ll use the Pearson correlation coefficient.
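
To make the definition concrete, here is a minimal from-scratch sketch of the Pearson coefficient: the covariance of the two variables divided by the product of their standard deviations. The function name `pearson_r` and the example values are hypothetical; in practice pandas computes this for us.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equally long numeric sequences.

    Raises ZeroDivisionError if either sequence is constant (zero variance).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


ages = [20, 30, 40, 50]  # hypothetical values
increasing = pearson_r(ages, [2 * a + 100 for a in ages])  # rising straight line
decreasing = pearson_r(ages, [500 - 3 * a for a in ages])  # falling straight line

# A symmetric U-shape: clearly related, yet with no linear trend.
xs = [-2, -1, 0, 1, 2]
u_shape = pearson_r(xs, [x ** 2 for x in xs])

print(increasing, decreasing, u_shape)
```

The U-shaped case illustrates why a coefficient near 0 only rules out a linear relationship.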



In [None]:
# Convert categorical variables to numeric to compute correlation
df_encoded = df.copy()
df_encoded['sex'] = df_encoded['sex'].map({'male': 1, 'female': 0})
df_encoded['smoker'] = df_encoded['smoker'].map({'yes': 1, 'no': 0})
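# Note: region has no natural order, so these integer codes are arbitrary
# and the Pearson coefficient for region should be read with caution
df_encoded['region'] = df_encoded['region'].astype('category').cat.codes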

# Correlation matrix
correlation = df_encoded.corr()

print("Correlation with Charges:")
print(correlation['charges'].sort_values(ascending=False))

# Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()
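
Region is a nominal variable, so a single integer code imposes an order that does not exist. A minimal sketch of an alternative is one-hot encoding with `pd.get_dummies`, which gives one 0/1 indicator per region. The rows below are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
sample = pd.DataFrame({
    "region": ["southwest", "southeast", "northwest", "northeast",
               "southeast", "northwest"],
    "charges": [16000.0, 9000.0, 7000.0, 11000.0, 12000.0, 5000.0],
})

# One 0/1 indicator column per region instead of one arbitrary integer code.
encoded = pd.get_dummies(sample, columns=["region"], dtype=int)

# Correlation of each region indicator with charges.
region_corr = encoded.corr()["charges"].drop("charges")
print(region_corr.sort_values(ascending=False))
```

Each coefficient then compares one region against all the others, which is easier to interpret than a single coefficient for an arbitrary coding.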

---

# Conclusions and Next Steps
## Conclusions

* Smoker status is the strongest driver of high insurance charges
    * Both the correlation analysis and the bar plot of average charges by smoker status show that smokers have substantially higher charges.
    * Pearson correlation between smoker and charges is very high (~0.79), confirming this is a critical risk factor.
* Age positively influences insurance costs
    * Correlation of ~0.3 indicates a moderate positive relationship.
    * The scatter plot of charges against age shows that charges tend to rise with age.
* BMI influences charges
    * Correlation of ~0.2 suggests a weak positive relationship.
    * The scatter plot of charges against BMI shows that higher BMI tends to go with higher charges, but the effect is less pronounced than for age or smoking status.
* Number of children, region, and sex have little to no impact on charges
    * These features showed weak or near-zero correlation with charges.
    * Their bar plots of average charges confirmed a minimal effect, suggesting they might not add much predictive value to a model.
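
The age conclusion can also be checked by comparing average charges across age bands. The sketch below bins age with `pd.cut`; the bin edges, labels, and data are hypothetical choices for illustration, not results from this dataset.

```python
import pandas as pd

# Hypothetical data, for illustration only.
sample = pd.DataFrame({
    "age": [19, 25, 34, 41, 52, 60, 63],
    "charges": [2000.0, 3000.0, 5000.0, 7000.0, 10000.0, 13000.0, 14000.0],
})

# Assumed bin edges; intervals are right-inclusive: (17, 30], (30, 45], (45, 65].
bins = [17, 30, 45, 65]
labels = ["18-30", "31-45", "46-65"]
sample["age_group"] = pd.cut(sample["age"], bins=bins, labels=labels)

# Average charges per age band.
group_means = sample.groupby("age_group", observed=True)["charges"].mean()
print(group_means)
```

The resulting `age_group` column could also be used as the x-axis of a seaborn box plot to compare the spread of charges per band.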
  
## Next Steps

* Data Cleaning
* Feature Engineering
* Data Preprocessing
* Model Development
* Model Evaluation

---