## Linear Regression Health Costs Calculator

In this challenge, you will predict healthcare costs using a regression algorithm.

You are given a dataset that contains information about different people including their healthcare costs. Use the data to predict healthcare costs based on new data.

The first two cells of this notebook import libraries and the data.

Make sure to convert categorical data to numbers. Use 80% of the data as the `train_dataset` and 20% of the data as the `test_dataset`.

`pop` off the "expenses" column from these datasets to create new datasets called `train_labels` and `test_labels`. Use these labels when training your model.
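As a rough illustration of the preparation steps above (encoding, the 80/20 split, and `pop`), here is a minimal sketch on a tiny made-up table. The rows are invented for illustration and are not taken from `insurance.csv`; the `toy_` names are hypothetical and chosen so they do not clash with the names the challenge expects.

```python
import pandas as pd

# Made-up rows that only mimic the shape of the real data.
toy = pd.DataFrame({
    "age": [19, 33, 45, 52, 27, 61, 38, 24, 49, 30],
    "sex": ["female", "male", "male", "female", "male",
            "female", "male", "female", "male", "female"],
    "smoker": ["yes", "no", "no", "no", "yes",
               "no", "no", "no", "yes", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast", "southeast",
               "southwest", "northwest", "northeast", "southeast", "southwest"],
    "expenses": [17000.0, 4500.0, 8000.0, 11000.0, 18000.0,
                 13000.0, 6500.0, 2700.0, 24000.0, 3800.0],
})

# Two-valued text columns become 0/1 integer codes.
for column in ["sex", "smoker"]:
    toy[column] = toy[column].astype("category").cat.codes

# A multi-valued text column becomes one indicator column per value.
toy = pd.get_dummies(toy, columns=["region"], prefix="region", dtype=int)

# Shuffle the rows, then hold out the last 20% for testing.
toy = toy.sample(frac=1, random_state=0).reset_index(drop=True)
test_size = int(len(toy) * 0.2)
toy_train = toy.iloc[:-test_size].copy()
toy_test = toy.iloc[-test_size:].copy()

# pop removes the column from the frame in place and returns it as a Series.
toy_train_labels = toy_train.pop("expenses")
toy_test_labels = toy_test.pop("expenses")

print(toy_train.shape, toy_test.shape)
```

Because `pop` modifies the frame in place, the sketch calls `.copy()` on each split first, so the feature frames and the labels end up as independent objects.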

Create a model and train it with the `train_dataset`. Run the final cell in this notebook to check your model. The final cell will use the unseen `test_dataset` to check how well the model generalizes.

To pass the challenge, `model.evaluate` must return a Mean Absolute Error of under 3500. This means that, on average, the predicted health care costs are within $3500 of the actual costs.
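To make the pass criterion concrete, here is a small sketch of how a Mean Absolute Error is computed. The cost and prediction values are made up for illustration and do not come from the dataset; `example_mae` is a hypothetical name.

```python
import numpy as np

# Hypothetical true costs and model predictions, in dollars.
true_costs = np.array([2000.0, 12000.0, 30000.0, 8000.0])
predicted_costs = np.array([2500.0, 11000.0, 36000.0, 8500.0])

# Absolute error of each prediction, then the average over all of them.
absolute_errors = np.abs(predicted_costs - true_costs)
example_mae = absolute_errors.mean()

print(absolute_errors)
print(example_mae)
```

The result is 2000, which would pass the 3500 threshold even though one prediction is off by 6000: the metric bounds the average error, not each individual one.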

The final cell will also predict expenses using the `test_dataset` and graph the results.

### Imports

In [None]:
!pip install -q git+https://github.com/tensorflow/docs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling

### Load Dataset

In [None]:
# Import data
!wget https://cdn.freecodecamp.org/project-data/health-costs/insurance.csv
dataset = pd.read_csv('insurance.csv')
dataset.tail()

### Data Evaluation and Exploration

In [None]:
dataset.isnull().sum()

In [None]:
dataset.dtypes

In [None]:
dataset.region.value_counts()

In [None]:
temp_dataset = dataset.join(pd.get_dummies(dataset.region, prefix='region')).drop('region', axis=1)
temp_dataset.head()

In [None]:
temp_dataset.

In [None]:
# Load the dataset
!wget -q "https://cdn.freecodecamp.org/project-data/health-costs/insurance.csv"
df = pd.read_csv("insurance.csv")

# Handle categorical variables
# Region one-hot encoding
df = df.join(pd.get_dummies(df.region, prefix='region')).drop('region', axis=1)

# Convert 'sex' to numerical
df['sex'] = df['sex'].astype('category').cat.codes

# Convert 'smoker' to numerical
df['smoker'] = df['smoker'].astype('category').cat.codes

# Optional: Drop unused columns to match working example
df.drop(['region_northeast', 'region_northwest', 'region_southeast', 'region_southwest'], axis=1, inplace=True)
df.drop(['sex', 'children'], axis=1, inplace=True)

# Shuffle and split
df = df.sample(frac=1, random_state=42)
size = int(len(df) * 0.2)
train_dataset = df[:-size]
test_dataset = df[-size:]

# Extract labels
train_labels = train_dataset['expenses']
train_dataset = train_dataset.drop('expenses', axis=1)

test_labels = test_dataset['expenses']
test_dataset = test_dataset.drop('expenses', axis=1)

# Define model
model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(train_dataset.shape[1],)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

# Compile model
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.01),
    loss='mse',
    metrics=['mae', 'mse']
)

# Training callback: prints a metrics summary every 100 epochs and a dot per epoch
class EpochDots(tf.keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs=None):
    logs = logs or {}
    if epoch % 100 == 0:
      print()
      print('Epoch: {:d}, '.format(epoch), end='')
      for name, value in sorted(logs.items()):
        print('{}:{:0.4f}'.format(name, value), end=', ')
      print()
    print('.', end='')

# Train
model.fit(train_dataset, train_labels, epochs=1000, verbose=0, callbacks=[EpochDots()])


In [None]:
# Step 1: Download and load the dataset
!wget -q "https://cdn.freecodecamp.org/project-data/health-costs/insurance.csv"
import pandas as pd
df = pd.read_csv("insurance.csv")

# Step 2: Handle categorical variables
# Convert 'region' into separate columns (one-hot encoding)
df = df.join(pd.get_dummies(df['region'], prefix='region')).drop('region', axis=1)

# Convert 'sex' and 'smoker' from text to numeric codes
df['sex'] = df['sex'].astype('category').cat.codes
df['smoker'] = df['smoker'].astype('category').cat.codes

# Optional: Drop columns that didn't help accuracy in experimentation
df.drop(['region_northeast', 'region_northwest', 'region_southeast', 'region_southwest'], axis=1, inplace=True)
df.drop(['sex', 'children'], axis=1, inplace=True)

# Step 3: Shuffle and split the dataset (80% train, 20% test)
df = df.sample(frac=1, random_state=42)  # shuffle the rows
size = int(len(df) * 0.2)  # 20% for testing

# Copy each split so that pop() below modifies independent frames, not slices of df
train_dataset = df[:-size].copy()
test_dataset = df[-size:].copy()

# Step 4: Separate labels (expenses) from features
train_labels = train_dataset.pop('expenses')
test_labels = test_dataset.pop('expenses')

# Step 5: Define the model
model = tf.keras.models.Sequential([
    # Input layer: matches number of features
    tf.keras.layers.Input(shape=(train_dataset.shape[1],)),

    # BatchNormalization rescales the raw input features, which helps stabilize and speed up training
    tf.keras.layers.BatchNormalization(),

    # Hidden layers with ReLU activation
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),

    # Output layer: single prediction (regression output)
    tf.keras.layers.Dense(1)
])

# Step 6: Compile the model
# - Loss: Mean Squared Error (MSE) to penalize large errors
# - Metrics: MAE is used to evaluate the challenge success
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.01),
    loss='mse',
    metrics=['mae', 'mse']
)


# Custom callback to condense log output
print_callback = keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs:
        print(f"Epoch {epoch + 1}: loss={logs['loss']:.2f}, mae={logs['mae']:.2f}")
        if (epoch + 1) % 100 == 0 else None
)


# Step 7: Train the model
# - verbose=0 silences the built-in per-epoch output; print_callback reports every 100th epoch instead
# - Set verbose=1 for a progress bar or verbose=2 for one line per epoch if desired
model.fit(
    train_dataset,
    train_labels,
    epochs=1000,
    verbose=0,
    callbacks=[print_callback]
)


In [None]:
# RUN THIS CELL TO TEST YOUR MODEL. DO NOT MODIFY CONTENTS.
# Test model by checking how well the model generalizes using the test set.
loss, mae, mse = model.evaluate(test_dataset, test_labels, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} expenses".format(mae))

if mae < 3500:
  print("You passed the challenge. Great job!")
else:
  print("The Mean Abs Error must be less than 3500. Keep trying.")

# Plot predictions.
test_predictions = model.predict(test_dataset).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True values (expenses)')
plt.ylabel('Predictions (expenses)')
lims = [0, 50000]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims,lims)
