In [1]:
import pandas as pd

In this example, we will predict insurance charges from age, sex, body mass index (BMI), number of children, smoker status, and region.

Dataset info: https://www.kaggle.com/mirichoi0218/insurance

# Read CSV

Let us start by reading the CSV file using `pandas.read_csv`.

In [None]:
dataset = pd.read_csv('https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv')

In [None]:
dataset.head()

# Data Preparation

In [None]:
dataset.info()

The 'sex', 'smoker', and 'region' columns have the 'object' Dtype. Use `pd.Categorical` to convert them to the 'category' Dtype.

In [None]:
dataset['sex'] = pd.Categorical(dataset['sex'])
dataset['smoker'] = pd.Categorical(dataset['smoker'])
dataset['region'] = pd.Categorical(dataset['region'])

In [None]:
dataset.info()

Before we start building the model, let's do some data preprocessing:
- Convert all categorical columns into numerical representation
- Split dataset into a train set and a test set
- Normalize data

In [None]:
dataset['sex'] = dataset['sex'].cat.codes
dataset['smoker'] = dataset['smoker'].cat.codes
dataset['region'] = dataset['region'].cat.codes
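
To see which integer each label receives, the sketch below applies the same conversion to a small made-up Series (the values are illustrative and not taken from the dataset). pandas sorts string categories alphabetically by default, so the codes follow that order.

```python
import pandas as pd

# Hypothetical toy column standing in for dataset['smoker']
toy = pd.Series(["yes", "no", "no", "yes"], dtype="category")

# Integer code assigned to each row
codes = toy.cat.codes.tolist()

# Lookup table from integer code back to the original label
code_to_label = dict(enumerate(toy.cat.categories))

print(codes)          # [1, 0, 0, 1]
print(code_to_label)  # {0: 'no', 1: 'yes'}
```

Building such a lookup table before overwriting a column makes the encoded values easier to interpret later. Note that integer codes impose an ordering on labels such as the regions; one-hot encoding is a common alternative when that ordering is not meaningful.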

In [None]:
dataset.head()

`dataset.pop()` removes the 'charges' column from the dataset and returns it; we store it in a variable called `target`. The remaining columns become our features.

In [None]:
target = dataset.pop('charges')
features = dataset
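
As a minimal sketch of what `pop` does, the example below uses a made-up two-row DataFrame (the column values are illustrative). It also shows that a plain assignment creates a second name for the same DataFrame, not a copy.

```python
import pandas as pd

# Hypothetical toy frame with two feature columns and the target column
toy_df = pd.DataFrame({
    "age": [19, 33],
    "bmi": [27.9, 22.7],
    "charges": [100.0, 200.0],
})

# pop() returns the column as a Series and removes it from toy_df in place
toy_target = toy_df.pop("charges")

# Plain assignment only creates a second name for the same object
toy_features = toy_df

remaining_columns = list(toy_df.columns)
print(remaining_columns)       # ['age', 'bmi']
print(toy_features is toy_df)  # True
```

If the original DataFrame needs to stay untouched, `toy_df.copy()` before popping would be one way to keep it.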

In [None]:
target.head()

In [None]:
features.head()

# Create Train and Test Set

We split the original dataset so that the model can be evaluated. The train set is used for model training, and the test set is used to assess how well the model generalizes to new, unseen data.
We will use the `train_test_split()` function provided by sklearn to split the data into a train set and a test set.


In [None]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(features, target, test_size=0.2, random_state=123)

In [None]:
train_X.shape, test_X.shape, train_y.shape, test_y.shape
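
The following sketch runs the same call on a made-up array of ten samples to show how `test_size=0.2` divides the rows and that features and targets stay aligned after shuffling. The toy arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: row i of toy_X is [2*i, 2*i + 1], and toy_y[i] is i
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=123
)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Each feature row still belongs to its original target after the shuffle
rows_aligned = bool((X_tr[:, 0] == 2 * y_tr).all())
print(rows_aligned)  # True
```

Fixing `random_state` makes the split reproducible across runs.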

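Next, we will normalize the data. `StandardScaler` standardizes each feature column to zero mean and unit variance, which puts all features on a common scale while preserving the relative differences between values within each column. The scaler is fitted on the train set only and then applied to both sets, so no information from the test set leaks into training. Features on a common scale generally speed up learning and lead to faster convergence.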

In [None]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()

scaler.fit(train_X)

train_X_scaled = scaler.transform(train_X)
test_X_scaled = scaler.transform(test_X)
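
The arithmetic behind this step can be sketched by hand with NumPy. This is not the scaler's implementation, only the same formula applied to made-up numbers: subtract the train-set mean and divide by the train-set standard deviation, for both sets.

```python
import numpy as np

# Hypothetical toy data with two feature columns on very different scales
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[4.0, 40.0]])

# Statistics come from the train set only
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population standard deviation (ddof=0)

# The same train-set statistics are applied to both sets
toy_train_scaled = (toy_train - mean) / std
toy_test_scaled = (toy_test - mean) / std

print(toy_train_scaled.mean(axis=0))  # close to [0, 0]
print(toy_train_scaled.std(axis=0))   # close to [1, 1]
print(toy_test_scaled)
```

After scaling, both columns of the train set have the same spread, and the test row is expressed in the same units.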

# Create Input Pipelines

`tf.data.Dataset` is the TensorFlow API for building input pipelines. `from_tensor_slices` creates a dataset with one (features, target) pair per row.

In [None]:
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((train_X_scaled, train_y))
test_dataset = tf.data.Dataset.from_tensor_slices((test_X_scaled, test_y))

In [None]:
for feat, targ in train_dataset.take(5):
    print('Features: {}, Target: {}'.format(feat, targ))

Shuffle the train set, then group the samples into batches of eight.

In [None]:
train_dataset_batch = train_dataset.shuffle(buffer_size=100).batch(8)
test_dataset_batch = test_dataset.batch(8)
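
The idea of shuffling and then batching can be sketched in plain NumPy. This is an illustration of the concept on made-up sample indices, not how `tf.data` is implemented: `shuffle(buffer_size=100)` draws from a sliding buffer of 100 elements, so it only approximates a full shuffle when the dataset is larger than the buffer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset of ten sample indices
toy_samples = np.arange(10)
batch_size = 4

# Full shuffle of the samples
shuffled = rng.permutation(toy_samples)

# Consecutive slices of the shuffled samples form the batches
batches = [
    shuffled[i:i + batch_size]
    for i in range(0, len(shuffled), batch_size)
]

batch_sizes = [len(b) for b in batches]
print(batch_sizes)  # [4, 4, 2]
```

The last batch is smaller whenever the number of samples is not a multiple of the batch size. In `tf.data`, passing `drop_remainder=True` to `batch()` discards that final partial batch.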

In [None]:
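# Use separate names so the `features` DataFrame defined earlier is not overwritten
batch_features, batch_targets = next(iter(train_dataset_batch))
print('Features shape: {}, targets shape: {}'.format(batch_features.numpy().shape, batch_targets.numpy().shape))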

# Model Training

In Keras, we can define the layers we want and stack them using `tf.keras.Sequential()`. Our data has six features, so the first `Dense` layer receives six inputs per sample, and we give it six nodes. It is followed by one hidden layer with ten nodes and an output layer with a single node that outputs the prediction.

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(6, activation='relu', dtype='float64'),
    tf.keras.layers.Dense(10, activation='relu', dtype='float64'),
    tf.keras.layers.Dense(1, dtype='float64')
])
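
As a rough check on the size of this network, the sketch below counts its trainable parameters by hand, assuming six input features. Each `Dense` layer has one weight per (input, node) pair plus one bias per node. The helper `dense_params` is a name made up for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of a dense layer."""
    return n_inputs * n_units + n_units


layer_units = [6, 10, 1]  # nodes per layer, as in the model above
n_inputs = 6              # number of features (assumed input width)

param_counts = []
for units in layer_units:
    param_counts.append(dense_params(n_inputs, units))
    n_inputs = units  # this layer's output feeds the next layer

total_params = sum(param_counts)
print(param_counts, total_params)  # [42, 70, 11] 123
```

Once the model has been built on real input, `model.summary()` should report the same per-layer counts.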


After the model is defined, we configure how it will be trained: the optimizer, the loss function, and the evaluation metrics. We do that by calling the `.compile()` function and specifying each of them.

In [None]:
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=[tf.keras.metrics.mean_absolute_percentage_error]
)
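
To make the loss and the metric concrete, here is a NumPy sketch of the two formulas on made-up numbers. It is not the Keras implementation, and it assumes that no target value is zero, since the percentage error divides by the target.

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([100.0, 200.0])
y_pred = np.array([110.0, 180.0])

# Mean squared error: average of the squared differences
mse = np.mean((y_true - y_pred) ** 2)

# Mean absolute percentage error: average relative error, in percent
mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

print(mse)   # 250.0
print(mape)  # approximately 10.0
```

MSE is expressed in squared units of the target, which makes it hard to read for charges in the thousands. MAPE is scale-free, which is why it is convenient as a reporting metric here.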

Now we train the model using the `.fit()` function, supplying the batched train set and the number of epochs. The `validation_data` parameter is optional; here we pass the batched test set so that the model is evaluated on it at the end of each epoch.

In [None]:
model.fit(
    train_dataset_batch, 
    epochs=150, 
    validation_data=test_dataset_batch
)

# Inference

After training the model, we can use it to make predictions on unseen data.

In [None]:
predictions = model(test_X_scaled[:10]).numpy()

In [None]:
predictions
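
When comparing predictions with the true charges, note that a model ending in a single-node layer returns an array of shape `(n, 1)`, while the targets are one-dimensional. The sketch below uses made-up numbers to show why flattening first matters; with the real data, the targets would be something like `test_y[:10].to_numpy()`.

```python
import numpy as np

# Hypothetical values shaped like the model output and the targets
toy_predictions = np.array([[1.5], [2.5], [3.5]])  # shape (3, 1)
toy_targets = np.array([1.0, 2.0, 3.0])            # shape (3,)

# Subtracting directly broadcasts to a (3, 3) matrix, which is not intended
wrong = toy_predictions - toy_targets

# Flatten the predictions first to get one error per sample
errors = toy_predictions.ravel() - toy_targets
mae = np.abs(errors).mean()

print(wrong.shape, errors.shape, mae)  # (3, 3) (3,) 0.5
```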