# **Linear Regression Health Costs Calculator**

This project was undertaken as part of FreeCodeCamp's Machine Learning certification. **The goal was to predict medical expenses based on various demographic and lifestyle factors using a neural network model.**

Let's dive right in.

In [None]:
# Import libraries. You may or may not use all of these.
!pip install -q git+https://github.com/tensorflow/docs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling

In [None]:
# Import data
!wget https://cdn.freecodecamp.org/project-data/health-costs/insurance.csv
dataset = pd.read_csv('insurance.csv')
dataset.tail()

In [None]:
dataset.head()

In [None]:
dataset.info()

We have taken a first look at our data. Next we:

- **Prepare data:** The categorical columns are converted to numerical values, because machine learning models typically require numerical inputs.
- **Shuffle data:** Shuffling the dataset ensures that the order of the rows does not bias the learning process.
- **Split data:** Splitting the dataset into training and testing sets lets us train the model on one portion of the data and evaluate it on unseen data to assess its performance.
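
The three steps above can be sketched on a tiny, made-up table. This is only an illustration in plain Python: the rows, the `encode_row` helper, and the fixed seed are assumptions for the example, while the notebook itself performs the same steps on a pandas DataFrame.

```python
import random

# Category-to-number mappings, matching the encoding used in this notebook.
SEX = {"female": 0, "male": 1}
SMOKER = {"no": 0, "yes": 1}
REGION = {"southwest": 0, "southeast": 1, "northwest": 2, "northeast": 3}

def encode_row(row):
    """Return a copy of `row` with categorical strings replaced by integers."""
    encoded = dict(row)
    encoded["sex"] = SEX[row["sex"]]
    encoded["smoker"] = SMOKER[row["smoker"]]
    encoded["region"] = REGION[row["region"]]
    return encoded

# Ten made-up rows, purely for illustration.
rows = [
    {
        "sex": "female" if i % 2 == 0 else "male",
        "smoker": "yes" if i % 3 == 0 else "no",
        "region": list(REGION)[i % 4],
        "expenses": 1000.0 * (i + 1),
    }
    for i in range(10)
]

# Step 1: prepare (encode categorical columns as numbers).
encoded_rows = [encode_row(r) for r in rows]

# Step 2: shuffle, with a fixed seed so the example is repeatable.
rng = random.Random(0)
rng.shuffle(encoded_rows)

# Step 3: 80/20 split; every row lands in exactly one of the two sets.
split_index = int(0.8 * len(encoded_rows))
train_rows = encoded_rows[:split_index]
test_rows = encoded_rows[split_index:]

print(len(train_rows), len(test_rows))  # 8 2
```

Slicing with `[:split_index]` and `[split_index:]` guarantees that no row is dropped or duplicated between the two sets.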

In [None]:
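# Convert categorical data to numbers.
# Assigning the mapped column back to the DataFrame avoids calling
# replace(..., inplace=True) on a selected column, which newer pandas
# versions warn about and may not apply to the original DataFrame.
dataset["sex"] = dataset["sex"].map({"female": 0, "male": 1})

dataset["smoker"] = dataset["smoker"].map({"no": 0, "yes": 1})

dataset["region"] = dataset["region"].map(
    {"southwest": 0, "southeast": 1, "northwest": 2, "northeast": 3}
)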

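from sklearn.utils import shuffle

# Shuffle the rows so the split does not depend on the order in the file
dataset = shuffle(dataset).reset_index(drop=True)

# Split into 80% training and 20% testing data.
# The test slice runs to the end of the dataset so the last row is not dropped,
# and .copy() makes each split independent of the original DataFrame.
split_index = int(0.8 * len(dataset))
train_dataset = dataset[:split_index].copy()
test_dataset = dataset[split_index:].copy()

# Separate the target column ("expenses") from the input features
train_labels = train_dataset.pop("expenses")
test_labels = test_dataset.pop("expenses")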


We have now processed our data, so **we are ready to create and build our model.**

Next we set up a **neural network** model for regression using TensorFlow/Keras. A normalization layer, adapted to the training dataset, first rescales the input features. The sequential model then stacks two dense layers with ReLU activation and ends in a single-unit output layer that produces the regression prediction.
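
To make the normalization step concrete, here is a minimal pure-Python sketch of the idea: for each feature, subtract the mean seen in the training data and divide by its standard deviation. The `fit_normalizer` and `normalize` helpers and the sample values are hypothetical; the Keras layer performs the equivalent computation on tensors.

```python
import math

def fit_normalizer(columns):
    """Compute (mean, population standard deviation) for each feature column."""
    stats = []
    for values in columns:
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        stats.append((mean, math.sqrt(variance)))
    return stats

def normalize(row, stats):
    """Rescale one row of features using previously fitted statistics.

    Assumes no feature is constant, so no standard deviation is zero.
    """
    return [(x - mean) / std for x, (mean, std) in zip(row, stats)]

# Made-up training data with two numeric features on different scales.
train_rows = [[20.0, 22.0], [30.0, 26.0], [40.0, 30.0], [50.0, 34.0]]

# "Adapt" step: learn the statistics from the training data only.
stats = fit_normalizer(list(zip(*train_rows)))

# Apply step: the same statistics are reused for any later input.
normalized = [normalize(row, stats) for row in train_rows]
for row in normalized:
    print([round(x, 3) for x in row])
```

After this rescaling each training feature has mean 0 and standard deviation 1, which keeps features with large raw values from dominating the dense layers that follow.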

In [None]:

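# Creating a normalization layer and adapting it to the training dataset.
# layers.Normalization is the stable name of this layer in TensorFlow 2.6 and later;
# older releases expose it as layers.experimental.preprocessing.Normalization.
normalizer = layers.Normalization()
normalizer.adapt(np.array(train_dataset))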

# Defining a sequential model with normalization, dense layers, and an output layer
model = keras.Sequential([
    normalizer,
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1)
])

# Compiling the model with Adam optimizer, MAE loss function, and metrics for evaluation
model.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    loss='mae',
    metrics=['mae', 'mse']
)

# Building the model to infer input shape and displaying a summary of its architecture
model.build()
model.summary()
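
The summary printed by the cell above lists a parameter count for each layer. For a dense layer this count is inputs × units weights plus one bias per unit. The sketch below works through that arithmetic for the three dense layers; it assumes six input features (the CSV's columns other than `expenses`), which is an assumption about the data rather than something shown above, and `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters in a fully connected layer."""
    return n_inputs * n_units + n_units  # weights + biases

n_features = 6             # assumed number of input columns
layer_units = [32, 16, 1]  # the dense layers defined in the model above

counts = []
n_inputs = n_features
for units in layer_units:
    counts.append(dense_params(n_inputs, units))
    n_inputs = units  # this layer's outputs feed the next layer

print(counts, sum(counts))  # [224, 528, 17] 769
```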

Our model is now set up, so we are ready to train it. Having compiled it with the Adam optimizer and MAE loss, we fit it to the training dataset for **100 epochs**, tracking MAE and MSE along the way.

In [None]:
# Training the model on the training dataset for 100 epochs
history = model.fit(
    train_dataset,
    train_labels,
    epochs=100
)

Now we test. Our aim is for the model to return a **Mean Absolute Error of under 3500** on the test set. This means its predictions of health care costs are off by less than $3500 on average.
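
As a quick illustration of the metric, the sketch below computes MAE and MSE by hand on four made-up expense values. The helper functions and the numbers are hypothetical; in the notebook, Keras computes the same quantities during `model.evaluate`.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between truth and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences; penalizes large misses more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up values, purely for illustration.
true_expenses = [10000.0, 20000.0, 5000.0, 12000.0]
predicted = [12000.0, 17000.0, 5500.0, 12500.0]

mae = mean_absolute_error(true_expenses, predicted)
mse = mean_squared_error(true_expenses, predicted)

print(mae)         # 1500.0
print(mse)         # 3375000.0
print(mae < 3500)  # True
```

Here the individual errors are 2000, 3000, 500 and 500, so a single prediction can miss by more than the MAE while the average still falls under the threshold.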

In [None]:
# RUN THIS CELL TO TEST YOUR MODEL. DO NOT MODIFY CONTENTS.
# Test model by checking how well the model generalizes using the test set.
loss, mae, mse = model.evaluate(test_dataset, test_labels, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} expenses".format(mae))

if mae < 3500:
  print("You passed the challenge. Great job!")
else:
  print("The Mean Abs Error must be less than 3500. Keep trying.")

# Plot predictions.
test_predictions = model.predict(test_dataset).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True values (expenses)')
plt.ylabel('Predictions (expenses)')
lims = [0, 50000]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims,lims)


Yay, we passed the challenge! In this run our model achieved a **mean absolute error (MAE) of 2078.07 on the testing set**, which is below the threshold of 3500. On average, the model's predictions were within about $2078 of the true expenses. Because the data is shuffled without a fixed seed, the exact figure will vary from run to run.