# Linear regression: health insurance cost

## Notebook setup

Handle imports of necessary modules up-front.

In [None]:
# Standard library imports
from pathlib import Path

# Third-party imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import root_mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Custom functions for this notebook
import helper_functions as funcs

## 1. Data loading

### 1.1. Load

In [None]:
data_url = 'https://raw.githubusercontent.com/4GeeksAcademy/linear-regression-project-tutorial/main/medical_insurance_cost.csv'
data_df = pd.read_csv(data_url, sep=',')

### 1.2. Save local copy

In [None]:
# Make a directory for raw data
Path('../data/raw').mkdir(exist_ok=True, parents=True)

# Save a local copy of the raw data
data_df.to_parquet('../data/raw/medical-insurance-cost.parquet')

### 1.3. Inspect

In [None]:
data_df.head()

In [None]:
data_df.info()

The dataset contains information about 1,338 insurance policyholders in 7 columns (6 features and the target):
- **age**: Age of the policyholder (numerical)
- **sex**: Gender (categorical: male/female) 
- **bmi**: Body Mass Index (numerical)
- **children**: Number of children covered (numerical)
- **smoker**: Smoking status (categorical: yes/no)
- **region**: Geographic region (categorical: southeast, southwest, northeast, northwest)
- **charges**: Insurance charges - our target variable (numerical)

The `charges` column is the target variable that we want to predict from the other six columns.

## 2. EDA

### 2.1. Data composition

#### 2.1.1. Interval features

In [None]:
# Take a look at some descriptive statistics for the numerical features - what do you see?

In [None]:
# Plot the distributions of the numerical features - what do you see?

#### 2.1.2. Nominal features

In [None]:
# Plot the level counts of the categorical features - what do you see?

### 2.2. Feature interactions

#### 2.2.1. Interval features vs label

In [None]:
# Plot the relationship between the numerical features and the target variable - what do you see?

#### 2.2.2. Nominal features vs label

In [None]:
# Plot the relationships between the categorical features and the target variable - what do you see?

## 3. Data preparation

### 3.1. Train-test split

In [None]:
# Split the data into 80% train and 20% test sets

We split the data into 80% training (1,070 samples) and 20% testing (268 samples) to evaluate model performance on unseen data. Holding out a test set does not prevent overfitting by itself, but it lets us detect it and gives us a realistic estimate of how the model will perform in practice.
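
As a minimal sketch of the split described above, the following uses a synthetic stand-in with the same number of rows as the insurance data. The name `toy_df`, the column values, and the `random_state` seed are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same row count as the insurance data (assumption)
rng = np.random.default_rng(315)
toy_df = pd.DataFrame({
    'age': rng.integers(18, 65, size=1338),
    'bmi': rng.normal(30, 6, size=1338),
    'charges': rng.gamma(2.0, 6000.0, size=1338),
})

# Hold out 20% of the rows; random_state makes the split reproducible
train_df, test_df = train_test_split(toy_df, test_size=0.2, random_state=315)

print(len(train_df), len(test_df))
```

With 1,338 rows and `test_size=0.2`, this gives 1,070 training rows and 268 testing rows, matching the counts quoted above.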

### 3.2. Feature encoding

In [None]:
# Encode the categorical features using OneHotEncoder

We use one-hot encoding for categorical variables, which creates a binary column for each category level. The `drop='first'` parameter drops one level per feature as the reference category, which avoids perfect multicollinearity between the encoded columns and the model intercept.

Our dataset now has the following encoded features:
- Original: sex, smoker, region (3 categorical columns)
- Encoded: sex_male, smoker_yes, region_northwest, region_southeast, region_southwest (5 binary columns)

This allows our linear model to properly handle categorical variables.

## 4. Model training

In [None]:
# Dictionary to store the results
results = {
    'RMSE': {},
    'R2': {}
}

### 4.1. Baseline

In [None]:
# Set a model baseline using a strategy that you devise - what is the 'easiest' 'prediction' you can make?

Any useful model should significantly outperform this baseline.

### 4.2. Linear regression model

In [None]:
# Train a default linear regression model on the training data

In [None]:
# Calculate the RMSE and R² for the model on the test data

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(8, 4))

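# Assumes `labels` (test-set charges) and `predictions` (model output) were defined in the cells above
# Predicted vs Actual plot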
axs[0].scatter(labels, predictions, color='black', alpha=0.6, s=30)
axs[0].plot([min(labels), max(labels)], [min(labels), max(labels)], 
            'r--', linewidth=2, label='Perfect prediction')
axs[0].set_xlabel('Actual charges ($)')
axs[0].set_ylabel('Predicted charges ($)')
axs[0].set_title('Predicted vs Actual Charges')
axs[0].legend()
axs[0].grid(alpha=0.3)

# Residuals plot
residuals = predictions - labels
axs[1].scatter(predictions, residuals, color='black', alpha=0.6, s=30)
axs[1].axhline(y=0, color='black', linestyle='-', linewidth=1)
axs[1].set_xlabel('Predicted charges ($)')
axs[1].set_ylabel('Residuals ($)')
axs[1].set_title('Residuals vs Predicted')
axs[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

**Predicted vs Actual Plot:**
- Points should lie on the red diagonal line for perfect predictions

**Residuals Plot:**
- Residuals should be randomly scattered around zero

## 5. Optimization

### 5.1. Feature transformations

In [None]:
# Try to improve model performance by scaling the data with StandardScaler or MinMaxScaler
# (MinMaxScaler is not imported above), then re-train the model.
# Is there any improvement in the RMSE and R²? Why or why not?
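
One way to explore this question is to compare predictions from an unscaled and a scaled fit. The sketch below does so on synthetic data; the feature ranges, coefficients, and seed are assumptions. Ordinary least squares is unaffected by an affine rescaling of the features, so the two fits should agree on predictions while differing in coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on different scales (values are made up)
rng = np.random.default_rng(315)
features = np.column_stack([rng.normal(40, 14, 300), rng.normal(30, 6, 300)])
target = 250.0 * features[:, 0] + 300.0 * features[:, 1] + rng.normal(scale=1000.0, size=300)

# Fit once on the raw features
raw_model = LinearRegression().fit(features, target)

# Fit again on standardized features (zero mean, unit variance)
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
scaled_model = LinearRegression().fit(scaled_features, target)

raw_predictions = raw_model.predict(features)
scaled_predictions = scaled_model.predict(scaled_features)

# Same predictions, different coefficients
print(np.allclose(raw_predictions, scaled_predictions))
print(raw_model.coef_, scaled_model.coef_)
```

Scaling changes how the coefficients read (per standard deviation rather than per raw unit). It matters more for regularized or distance-based models than for plain linear regression.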

### 5.2. Feature engineering

In [None]:
# Devise an additional strategy to improve the model performance - what features can you engineer or transform?
# Try it out and see if it improves the RMSE and R². Did it work? Why or why not?

## 6. Results

### 6.1. Model comparison

In [None]:
# Plot the RMSE and R² results for each model

### 6.2. Winning model evaluation

In [None]:
# Choose the best model based on RMSE and R². Evaluate it on the test data. Include plots of predicted vs actual and residuals vs predicted