*Note: You are currently reading this using Google Colaboratory, which is a cloud-hosted version of Jupyter Notebook. This is a document containing both text cells for documentation and runnable code cells. If you are unfamiliar with Jupyter Notebook, watch this 3-minute introduction before starting this challenge: https://www.youtube.com/watch?v=inN8seMm7UI*

---

In this challenge, you will predict healthcare costs using a regression algorithm.

You are given a dataset that contains information about different people including their healthcare costs. Use the data to predict healthcare costs based on new data.

The first two cells of this notebook import libraries and the data.

Make sure to convert categorical data to numbers. Use 80% of the data as the `train_dataset` and 20% of the data as the `test_dataset`.

`pop` off the "expenses" column from these datasets to create new datasets called `train_labels` and `test_labels`. Use these labels when training your model.
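As a rough illustration of the split-then-`pop` workflow described above, here is a minimal sketch on a made-up frame. The `toy` data and its values are hypothetical, and `DataFrame.sample` is used here only as one simple way to get an 80/20 split; it is not necessarily how the rest of this notebook does it.

```python
import pandas as pd

# Hypothetical toy data standing in for insurance.csv (all values are made up).
toy = pd.DataFrame({
    "age": [19, 33, 45, 52, 27, 61, 38, 24, 49, 30],
    "smoker": [1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
    "expenses": [17000.0, 4500.0, 8200.0, 36000.0, 2700.0,
                 13200.0, 22000.0, 1700.0, 9300.0, 19100.0],
})

# Randomly draw 80% of the rows for training; the remaining rows form the test set.
train_dataset = toy.sample(frac=0.8, random_state=0)
test_dataset = toy.drop(train_dataset.index)

# pop() removes the column from the frame in place and returns it as a Series.
train_labels = train_dataset.pop("expenses")
test_labels = test_dataset.pop("expenses")

print(train_dataset.shape, test_dataset.shape)
print(list(train_dataset.columns))
```

After `pop`, the feature frames no longer contain `expenses`, so the label cannot leak into the model inputs.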

Create a model and train it with the `train_dataset`. Run the final cell in this notebook to check your model. The final cell will use the unseen `test_dataset` to check how well the model generalizes.

To pass the challenge, `model.evaluate` must return a Mean Absolute Error of under 3500. This means the predicted health care costs are off by less than $3500 on average.
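To make the metric concrete, here is a small worked example of Mean Absolute Error on made-up numbers. The arrays `y_true` and `y_pred` are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Made-up true costs and predictions, in dollars.
y_true = np.array([5000.0, 12000.0, 30000.0, 8000.0])
y_pred = np.array([6000.0, 11000.0, 25000.0, 8500.0])

# MAE is the mean of the absolute differences between prediction and truth.
abs_errors = np.abs(y_true - y_pred)
mae = abs_errors.mean()

print(abs_errors)  # individual errors: 1000, 1000, 5000, 500
print(mae)         # 1875.0
```

Note that one individual prediction here is off by 5000 even though the MAE is below 3500: the threshold applies to the average error, not to every prediction.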

The final cell will also predict expenses using the `test_dataset` and graph the results.

In [None]:
# Import libraries. You may or may not use all of these.
!pip install -q git+https://github.com/tensorflow/docs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

import tensorflow_docs as tfdocs
import tensorflow_docs.plots
import tensorflow_docs.modeling

In [None]:
# Import data
!wget https://cdn.freecodecamp.org/project-data/health-costs/insurance.csv
dataset = pd.read_csv('insurance.csv')
dataset.tail()

In [None]:
len(dataset)

# Clean and Correlate

In [None]:
dataset.isnull().sum()

In [None]:
dataset.head()

Make sure to convert categorical data to numbers. 

One thing to consider is the region variable. Let's look at the unique values there.

In [None]:
dataset.region.unique()

Let's map the regions to integers manually rather than using OneHotEncoder:

southwest: 1\
southeast: 2\
northwest: 3\
northeast: 4
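One caveat with integer codes is that they give the regions an order (northeast = 4 is "greater than" southwest = 1) that has no real meaning. For comparison, here is a sketch of the one-hot alternative using `pandas.get_dummies`. This is a swapped-in technique shown for illustration only, not what this notebook does, and the `toy` rows are made up.

```python
import pandas as pd

# Made-up rows; only the region names come from the dataset.
toy = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# One 0/1 column per region instead of a single ordinal code.
encoded = pd.get_dummies(toy, columns=["region"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one region column set to 1, so no artificial ordering is introduced, at the cost of three extra columns.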

In [None]:
smoker_bin = dataset.smoker.map(dict(yes=1, no=0))

sex_bin = dataset.sex.map(dict(female=0, male=1))

dataset.smoker = smoker_bin
dataset.sex = sex_bin


region_enc = dataset.region.map(dict(southwest=1, southeast=2, northwest=3, northeast=4))
dataset.region = region_enc

In [None]:
dataset.head()

The expenses variable appears to vary widely. Let's look at the distribution of every non-binary variable two ways: as a histogram and as a box plot.

These variables are: age, bmi, children, region, expenses

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

dist_dataset = dataset.loc[:, ['age', 'bmi', 'children', 'region', 'expenses']]


def dist_plotter(df):
    # One row per variable: histogram on the left, box plot on the right.
    fig, ax = plt.subplots(len(df.columns), 2, figsize=(20, 10))

    # Draw each variable exactly once.
    for n, col in enumerate(df.columns):
        title = 'BMI' if col == 'bmi' else col.capitalize()

        # Column 0: histogram with a KDE overlay.
        # sns.distplot is deprecated; histplot(kde=True) is its replacement.
        sns.histplot(df[col], kde=True, ax=ax[n, 0]).set(title=title)

        # Column 1: box plot.
        sns.boxplot(x=df[col], ax=ax[n, 1]).set(title=title)

    plt.tight_layout()
    plt.show()

dist_plotter(dist_dataset)

BMI and Expenses clearly have some extreme outliers.

Let's use IQR to take out the outliers.
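As a small worked example of the IQR rule, the sketch below computes the fences for a made-up series and drops the values outside them. The `bmi` values are hypothetical; the 1.5 multiplier is the conventional choice for this rule.

```python
import pandas as pd

# Made-up BMI values with one obvious extreme.
bmi = pd.Series([22.0, 24.5, 26.0, 27.5, 29.0, 31.0, 33.5, 55.0])

q25, q75 = bmi.quantile(0.25), bmi.quantile(0.75)
iqr = q75 - q25

# Fences: anything more than 1.5 * IQR beyond the quartiles is treated as an outlier.
lower = q25 - 1.5 * iqr
upper = q75 + 1.5 * iqr

# Keep only the values strictly inside the fences.
kept = bmi[(bmi > lower) & (bmi < upper)]

print(lower, upper)   # 16.625 40.625
print(kept.tolist())  # every value except 55.0
```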

In [None]:
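def remove_outliers(df, columns):
    # Start from the full frame and narrow it down one column at a time,
    # so the filter for every column is applied (not just the last one).
    filtered_data = df

    for col in columns:
        # Quartiles are computed on the original, unfiltered column.
        q25, q75 = df[col].quantile(.25), df[col].quantile(.75)

        IQR = q75 - q25

        lower_range = q25 - 1.5 * IQR
        upper_range = q75 + 1.5 * IQR

        # Keep only rows strictly inside the fences for this column.
        in_range = (filtered_data[col] > lower_range) & (filtered_data[col] < upper_range)
        filtered_data = filtered_data.loc[in_range]

    return filtered_data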

dataset = remove_outliers(dataset, ['bmi', 'expenses'])

dataset.head()

In [None]:
len(dataset)

In [None]:
corr = dataset.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

plt.figure(figsize=(20, 20))
sns.heatmap(corr, mask=mask, annot=True)


Smoker has by far the strongest correlation with the expenses variable, and age has the next strongest. The correlations between the remaining variables and expenses are comparatively weak.
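Reading the target's column off a heatmap can be error-prone, so one option is to rank the features by the absolute value of their correlation with the target. The sketch below does this on a made-up frame; the `toy` values are hypothetical, so the resulting ranking says nothing about the real dataset.

```python
import pandas as pd

# Made-up numeric rows shaped like the encoded dataset.
toy = pd.DataFrame({
    "age": [19, 33, 45, 52, 27, 61],
    "smoker": [1, 0, 0, 1, 0, 0],
    "children": [0, 1, 3, 2, 0, 1],
    "expenses": [17000.0, 4500.0, 8200.0, 36000.0, 2700.0, 13200.0],
})

# Pearson correlation of every feature with the target, strongest first.
target_corr = (
    toy.corr()["expenses"]
    .drop("expenses")
    .sort_values(key=abs, ascending=False)
)

print(target_corr)
```

Keep in mind that Pearson correlation only measures linear association, so a weak value does not rule out a nonlinear relationship.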

# Using Tensorflow

Example: https://www.tensorflow.org/tutorials/keras/regression

In [None]:
from sklearn.model_selection import train_test_split

train_dataset, test_dataset = train_test_split(dataset, test_size=.2)

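# Use only the two features most correlated with expenses.
# 'expenses' is the label, so it must not appear among the features.
train_features = train_dataset[['age', 'smoker']].copy()
test_features = test_dataset[['age', 'smoker']].copy()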

train_labels = train_dataset.pop('expenses')
test_labels = test_dataset.pop('expenses')


### Normalizing

Normalizing data is important, especially when multiple features are all on different scales.
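To show what normalization does numerically, here is a NumPy sketch of per-column standardization, which is the same idea the Keras `Normalization` layer applies after `adapt`: subtract the column mean and divide by the column standard deviation. The `features` matrix is made up for illustration.

```python
import numpy as np

# Made-up feature matrix: columns are age and smoker.
features = np.array([
    [19.0, 1.0],
    [33.0, 0.0],
    [45.0, 0.0],
    [52.0, 1.0],
])

# Per-column statistics; in practice these come from the training data only.
mean = features.mean(axis=0)
std = features.std(axis=0)

# Standardize: each column ends up with mean 0 and standard deviation 1.
normalized = (features - mean) / std

print(normalized.mean(axis=0))
print(normalized.std(axis=0))
```

Without this step, a feature measured in tens (age) and a feature that is 0 or 1 (smoker) would contribute to the gradient at very different scales.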

In [None]:
normalizer = layers.Normalization(axis=-1)

In [None]:
normalizer.adapt(np.array(train_features))

In [None]:
print(normalizer.mean.numpy())


In [None]:
first = np.array(train_features[:1])

with np.printoptions(precision=2, suppress=True):
  print('First example:', first)
  print()
  print('Normalized:', normalizer(first).numpy())


In [None]:
model = tf.keras.Sequential([
    normalizer,
    layers.Dense(units=1)
])


In [None]:
model.predict(train_features[:10])


In [None]:
model.layers[1].kernel

In [None]:
# With these settings, model.evaluate returns three values, in order: loss, mae, and mse.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1.5),
    loss=['mean_absolute_error'], 
    metrics=['mae', 'mse']
    )


In [None]:
test_results = {}

In [None]:
%%time
history = model.fit(
    train_features,
    train_labels,
    epochs=400,
    # Suppress logging.
    verbose=0,
    # Calculate validation results on 20% of the training data.
    validation_split = 0.2)


In [None]:
def plot_loss(history):
  plt.plot(history.history['loss'], label='loss')
  plt.plot(history.history['val_loss'], label='val_loss')
  plt.xlabel('Epoch')
  plt.ylabel('Error')
  plt.legend()
  plt.grid(True)

plot_loss(history)


In [None]:
# model.metrics_names

In [None]:
# loss, mae, mse = model.evaluate(test_features, test_labels, verbose=2)

In [None]:
# RUN THIS CELL TO TEST YOUR MODEL. DO NOT MODIFY CONTENTS.
# Test model by checking how well the model generalizes using the test set.
loss, mae, mse = model.evaluate(test_features, test_labels, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} expenses".format(mae))

if mae < 3500:
  print("You passed the challenge. Great job!")
else:
  print("The Mean Abs Error must be less than 3500. Keep trying.")

# Plot predictions.
test_predictions = model.predict(test_features).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True values (expenses)')
plt.ylabel('Predictions (expenses)')
lims = [0, 50000]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims,lims)
