<a href="https://colab.research.google.com/github/MwangiMuriuki2003/MURIUKI/blob/main/fcc_predict_health_costs_with_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Import libraries. You may or may not use all of these.
!pip install -q git+https://github.com/tensorflow/docs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling

In [None]:
# Import data
!wget https://cdn.freecodecamp.org/project-data/health-costs/insurance.csv
dataset = pd.read_csv('insurance.csv')
dataset.tail()

In [None]:
# RUN THIS CELL TO TEST YOUR MODEL. DO NOT MODIFY CONTENTS.
# Test model by checking how well the model generalizes using the test set.
loss, mae, mse = model.evaluate(test_dataset, test_labels, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} expenses".format(mae))

if mae < 3500:
  print("You passed the challenge. Great job!")
else:
  print("The Mean Abs Error must be less than 3500. Keep trying.")

# Plot predictions.
test_predictions = model.predict(test_dataset).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True values (expenses)')
plt.ylabel('Predictions (expenses)')
lims = [0, 50000]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims,lims)


# Task
Create, compile, and train a Keras regression model to predict medical expenses using the `insurance.csv` dataset, preparing it for evaluation with the provided test cell.

## Analyze and Preprocess Data

### Subtask:
Examine the 'dataset' DataFrame to identify categorical features. Convert these categorical features into numerical representations using one-hot encoding. Then, split the preprocessed data into training and testing sets (80% train, 20% test), creating `train_dataset`, `test_dataset`, `train_labels` (expenses for training), and `test_labels` (expenses for testing).


**Reasoning**:
The subtask requires identifying categorical features, applying one-hot encoding, separating features and labels, and then splitting the data into training and testing sets. This code block will perform all these preprocessing steps sequentially.
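
To make the effect of `drop_first=True` concrete, the following sketch encodes a tiny frame that only mimics the column names of `insurance.csv`. The four rows are invented for illustration and are not taken from the real dataset, and `dtype=float` is an assumption added here so that the indicator columns come out numeric rather than boolean.

```python
import pandas as pd

# Four invented rows that only mimic the schema of insurance.csv.
toy = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, 22.7, 30.1, 25.3],
    "children": [0, 1, 2, 3],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
    "expenses": [100.0, 200.0, 300.0, 400.0],
})

toy_encoded = pd.get_dummies(
    toy, columns=["sex", "smoker", "region"], drop_first=True, dtype=float
)

# A categorical column with k levels yields k - 1 indicator columns,
# because the alphabetically first level is dropped as the baseline.
feature_columns = [c for c in toy_encoded.columns if c != "expenses"]
print(feature_columns)
```

With two levels for `sex`, two for `smoker`, and four for `region`, the three categorical columns become five indicator columns, which together with `age`, `bmi`, and `children` gives eight features.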



In [None]:
from sklearn.model_selection import train_test_split

# 1. Identify categorical columns
categorical_cols = ['sex', 'smoker', 'region']

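# 2. Apply one-hot encoding (dtype=float keeps the indicator columns numeric,
#    since recent pandas versions return boolean columns by default)
dataset_encoded = pd.get_dummies(dataset, columns=categorical_cols, drop_first=True, dtype=float)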

# 3. Separate features (x) and labels (y)
x = dataset_encoded.drop('expenses', axis=1)
y = dataset_encoded['expenses']

# 4. Split the data into training and testing sets (80% train, 20% test)
train_dataset, test_dataset, train_labels, test_labels = train_test_split(x, y, test_size=0.2, random_state=42)

print("Data preprocessing complete. Shapes of the datasets:")
print(f"Train features: {train_dataset.shape}")
print(f"Test features: {test_dataset.shape}")
print(f"Train labels: {train_labels.shape}")
print(f"Test labels: {test_labels.shape}")

## Build and Compile Model

### Subtask:
Construct a Keras Sequential model for regression. Define the layers and activation functions suitable for this task. Compile the model using an appropriate optimizer (e.g., 'adam'), a loss function for regression (e.g., 'mse' or 'mae'), and 'mae' as a metric.


**Reasoning**:
The subtask requires constructing and compiling a Keras Sequential model. This code block will define the model architecture, including input layer, hidden layers with 'relu' activation, and an output layer for regression, then compile it with the specified optimizer, loss, and metrics.
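
As a sanity check on the architecture, the number of trainable parameters can be worked out by hand before calling `model.summary()`. This is a small sketch that assumes 8 input features after one-hot encoding and the 64-64-1 layer sizes used in this notebook; `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus biases of one fully connected (Dense) layer."""
    return n_inputs * n_units + n_units


n_features = 8            # assumption: feature count after one-hot encoding
layer_units = [64, 64, 1]  # two hidden layers and one regression output

counts = []
prev = n_features
for units in layer_units:
    counts.append(dense_params(prev, units))
    prev = units

total_params = sum(counts)
print(counts, total_params)  # [576, 4160, 65] 4801
```

If the summary printed by Keras reports a different total, the number of input features is probably not what was expected.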



In [None]:
input_shape = train_dataset.shape[1]

model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=[input_shape]),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])

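model.compile(
    optimizer='adam',
    loss='mae',
    # The provided test cell unpacks loss, mae, and mse from model.evaluate(),
    # so both metrics must be tracked
    metrics=['mae', 'mse']
)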

print("Keras Sequential model created and compiled.")
model.summary()

**Reasoning**:
The previous code block produced a UserWarning about the `input_shape` argument passed to the first `Dense` layer of the `Sequential` model. Passing an explicit `Input` object as the first element of the model instead removes the warning and states the model's input shape up front.



In [None]:
input_shape = train_dataset.shape[1]

model = keras.Sequential([
    layers.Input(shape=(input_shape,)),  # Explicit Input object replaces the input_shape argument
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])

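model.compile(
    optimizer='adam',
    loss='mae',
    # The provided test cell unpacks loss, mae, and mse from model.evaluate(),
    # so both metrics must be tracked
    metrics=['mae', 'mse']
)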

print("Keras Sequential model created and compiled.")
model.summary()

## Train Model

### Subtask:
Train the compiled model using the `train_dataset` and `train_labels`. Monitor the training process to ensure the model learns effectively.


**Reasoning**:
This code block trains the compiled model on `train_dataset` and `train_labels` for 100 epochs, holding out 20% of the training data for validation so that progress can be monitored through the returned `history` object.
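
Keras documents that `validation_split` takes the validation rows from the end of the data, before any shuffling. The sketch below mirrors that behaviour with plain NumPy to show how many rows end up in each part; `holdout_split` is a hypothetical helper, not Keras code, and the 1070 rows and batch size of 32 (the Keras default) are assumptions based on this notebook's setup.

```python
import math

import numpy as np


def holdout_split(n_samples: int, validation_split: float):
    """Return (train_idx, val_idx), with the validation rows taken from the end."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    indices = np.arange(n_samples)
    return indices[:split_at], indices[split_at:]


# 1070 is the training-set size reported for the 80/20 split in this notebook.
train_idx, val_idx = holdout_split(1070, 0.2)

# 32 is the default batch_size of model.fit when none is given.
batches_per_epoch = math.ceil(len(train_idx) / 32)
print(len(train_idx), len(val_idx), batches_per_epoch)  # 856 214 27
```

Because the held-out rows are always the last ones, the split is only representative if the rows are in random order, which `train_test_split` already ensures here by shuffling.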



In [None]:
history = model.fit(
    train_dataset,
    train_labels,
    epochs=100,
    validation_split=0.2,
    verbose=0  # Suppress per-epoch progress output
)

print("Model training complete.")

## Final Task

### Subtask:
Evaluate the trained model and plot predictions using the pre-provided test cell.


## Summary:

### Data Analysis Key Findings

*   **Data Preprocessing**: Categorical features (`sex`, `smoker`, `region`) were successfully converted using one-hot encoding with `drop_first=True`. The dataset was split into 1070 training samples and 268 testing samples, each with 8 features after encoding.
*   **Model Architecture**: A Keras Sequential model was constructed, featuring an explicit `Input` layer, two hidden `Dense` layers with 64 units and 'relu' activation, and a final `Dense` output layer with 1 unit (for regression).
*   **Model Compilation**: The model was compiled using the 'adam' optimizer, 'mae' (Mean Absolute Error) as the loss function, and 'mae' as the primary metric.
*   **Model Training**: The model was trained for 100 epochs using the training dataset, with 20% of the training data reserved for validation during the training process.
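
To illustrate what the 'mae' loss measures and how it differs from 'mse', here is a minimal NumPy sketch. The two helper functions are written out by hand for clarity, and the values are invented, not model output.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred) -> float:
    """Average absolute difference, in the same unit as the labels."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def mean_squared_error(y_true, y_pred) -> float:
    """Average squared difference, which weights large errors more heavily."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Invented values: three small errors of 100 and one large error of 4000.
y_true = [1000.0, 2000.0, 3000.0, 4000.0]
y_pred = [1100.0, 1900.0, 3100.0, 8000.0]

toy_mae = mean_absolute_error(y_true, y_pred)
toy_mse = mean_squared_error(y_true, y_pred)
print(toy_mae, toy_mse)  # 1075.0 4007500.0
```

The single large error accounts for almost all of the MSE but only part of the MAE, which is one reason MAE can be a reasonable loss when a few expenses are far larger than the rest.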

### Insights or Next Steps

*   Training ran for all 100 epochs. The per-epoch training and validation loss stored in the `history` object can be compared to check for overfitting (validation loss rising while training loss keeps falling) or underfitting (both remaining high).
*   The next step is to run the provided test cell, which evaluates the model on the unseen `test_dataset` and `test_labels`, requires a Mean Absolute Error below 3500, and plots predictions against actual values.
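
One way to inspect the training run is sketched below. It assumes a dictionary shaped like `history.history`, with `loss` and `val_loss` lists holding one value per epoch; `summarize_history` is a hypothetical helper and the numbers are invented, not results from this notebook.

```python
def summarize_history(hist: dict) -> dict:
    """Report the epoch with the lowest validation loss and the final train/val gap."""
    val_loss = hist["val_loss"]
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": best_index + 1,  # epochs are reported 1-based
        "best_val_loss": val_loss[best_index],
        "final_gap": val_loss[-1] - hist["loss"][-1],
    }


# Invented values shaped like history.history, for illustration only.
fake_history = {
    "loss": [9000.0, 5000.0, 3000.0, 2500.0, 2200.0],
    "val_loss": [9500.0, 5200.0, 3300.0, 3400.0, 3600.0],
}

summary = summarize_history(fake_history)
print(summary)
```

For the real run, `summarize_history(history.history)` would be the call. A best epoch well before the last one, combined with a growing gap, suggests the model started to overfit.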
