In [1]:
import pandas as pd

In this example, we predict insurance charges from age, sex, body mass index (BMI), number of children, smoker status, and region.

Dataset info: https://www.kaggle.com/mirichoi0218/insurance

# Read CSV

Let us start by reading the CSV file using `pandas.read_csv`.

In [2]:
dataset = pd.read_csv('https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv')

In [3]:
dataset.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
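
`pd.read_csv` accepts a URL, a local path, or any file-like object. As a minimal offline sketch, the snippet below parses the five rows shown above from an in-memory string instead of the remote file. The `csv_text` and `sample` names are only for illustration.

```python
import io

import pandas as pd

# The first five rows of the dataset, copied from the output above.
csv_text = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33.0,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
32,male,28.88,0,no,northwest,3866.8552
"""

# read_csv treats the first line as the header and infers the column types.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(list(sample.columns))
```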


# Data Preparation

In [4]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


The 'sex', 'smoker', and 'region' columns have the 'object' Dtype. Use `pd.Categorical` to convert them to the 'category' Dtype.

In [5]:
dataset['sex'] = pd.Categorical(dataset['sex'])
dataset['smoker'] = pd.Categorical(dataset['smoker'])
dataset['region'] = pd.Categorical(dataset['region'])

In [6]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   age       1338 non-null   int64   
 1   sex       1338 non-null   category
 2   bmi       1338 non-null   float64 
 3   children  1338 non-null   int64   
 4   smoker    1338 non-null   category
 5   region    1338 non-null   category
 6   charges   1338 non-null   float64 
dtypes: category(3), float64(2), int64(2)
memory usage: 46.2 KB


Before we build the model, let's do some data preprocessing:
- Convert all categorical columns into numerical representation
- Split dataset into a train set and a test set
- Normalize data

In [7]:
dataset['sex'] = dataset['sex'].cat.codes
dataset['smoker'] = dataset['smoker'].cat.codes
dataset['region'] = dataset['region'].cat.codes

In [8]:
dataset.head()

,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,1,3,16884.924
1,18,1,33.77,1,0,2,1725.5523
2,28,1,33.0,3,0,2,4449.462
3,33,1,22.705,0,0,1,21984.47061
4,32,1,28.88,0,0,1,3866.8552
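
The integer codes follow the order of the categories, which `pd.Categorical` sorts alphabetically by default for strings. A small sketch shows the mapping behind the values above, such as 'southwest' becoming 3. The `region` variable here is a standalone example, not the notebook's column.

```python
import pandas as pd

region = pd.Categorical(["southwest", "southeast", "northwest", "northeast"])

# Categories are sorted, and each value's code is its position in that order.
code_of = {category: code for code, category in enumerate(region.categories)}

print(code_of)
print(region.codes.tolist())
```

Keep in mind that these codes impose an arbitrary numeric order on 'region'. One-hot encoding (for example with `pd.get_dummies`) is a common alternative when that ordering is undesirable.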


`dataset.pop()` removes the 'charges' column from `dataset` and returns it; we store it in a variable called `target`. The remaining columns become our features.

In [9]:
target = dataset.pop('charges')
features = dataset
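
Note that `pop` modifies the DataFrame in place, and a plain assignment creates another name for the same object rather than a copy. A minimal sketch on a toy frame (the `toy`, `toy_target`, and `toy_features` names are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"age": [19, 18], "charges": [16884.924, 1725.5523]})

toy_target = toy.pop("charges")  # returns the column and drops it from toy
toy_features = toy               # another name for the same object, not a copy

print(list(toy.columns))
print(toy_features is toy)
```

If the original frame is still needed later, calling `.copy()` on it before popping would preserve it.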

In [10]:
target.head()

0    16884.92400
1     1725.55230
2     4449.46200
3    21984.47061
4     3866.85520
Name: charges, dtype: float64

In [11]:
features.head()

,age,sex,bmi,children,smoker,region
0,19,0,27.9,0,1,3
1,18,1,33.77,1,0,2
2,28,1,33.0,3,0,2
3,33,1,22.705,0,0,1
4,32,1,28.88,0,0,1


# Create Train and Test Set

We split the original dataset so that we can evaluate the model: the train set is used for training, and the test set is used to assess how well the model generalizes to new, unseen data.
We will use the `train_test_split()` function provided by scikit-learn to split the data into train and test sets.


In [12]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(features, target, test_size=0.2, random_state=123)

In [13]:
train_X.shape, test_X.shape, train_y.shape, test_y.shape

((1070, 6), (268, 6), (1070,), (268,))
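
The sizes above follow from `test_size=0.2`: for a fractional `test_size`, scikit-learn rounds the test count up, so 1338 rows give 268 test rows and 1070 train rows. The sketch below reproduces this on placeholder arrays of the same shape (the values are dummies, not the insurance data).

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows, n_features = 1338, 6
X = np.zeros((n_rows, n_features))  # placeholder features
y = np.arange(n_rows)               # placeholder targets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

print(math.ceil(0.2 * n_rows))      # expected number of test rows
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, so rerunning the cell selects the same rows.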

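Next, we normalize the data. The goal is to bring every column to a common scale without distorting the relative differences between values. `StandardScaler` does this by standardizing each column to zero mean and unit variance. We fit the scaler on the train set only and reuse the same statistics for the test set, so that no information from the test set leaks into training. Normalization generally speeds up learning and leads to faster convergence.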

In [14]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()

scaler.fit(train_X)

train_X_scaled = scaler.transform(train_X)
test_X_scaled = scaler.transform(test_X)
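
To see what the scaler computes, here is a minimal sketch on toy arrays (the `train` and `test` values are made up). It checks that `transform` is equivalent to subtracting the train mean and dividing by the train standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # toy train set
test = np.array([[4.0, 40.0]])                              # toy test set

toy_scaler = StandardScaler().fit(train)

# Manual equivalent: subtract the train mean, divide by the train std (ddof=0).
mean, std = train.mean(axis=0), train.std(axis=0)
manual_test = (test - mean) / std

print(np.allclose(toy_scaler.transform(test), manual_test))
print(toy_scaler.transform(train).mean(axis=0))
```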

# Create Input Pipelines

`tf.data.Dataset` is the API provided by TensorFlow for building input pipelines. `from_tensor_slices` creates a dataset whose elements are (features, target) pairs, one per row.

In [15]:
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((train_X_scaled, train_y))
test_dataset = tf.data.Dataset.from_tensor_slices((test_X_scaled, test_y))

In [16]:
for feat, targ in train_dataset.take(5):
    print('Features: {}, Target: {}'.format(feat, targ))

Features: [ 0.06207177  0.98884723 -0.71961    -0.06736631 -0.5        -0.46038891], Target: 6389.37785
Features: [-0.15057538 -1.01127855  1.28697921 -0.8864794   2.          0.45695957], Target: 40419.0191
Features: [ 0.77089561  0.98884723 -0.67224951 -0.8864794  -0.5         1.37430805], Target: 8444.474
Features: [ 0.48736607  0.98884723 -0.97219929  1.57085987 -0.5        -1.37773739], Target: 9500.57305
Features: [-1.28469352 -1.01127855 -2.20107939 -0.06736631 -0.5         1.37430805], Target: 2585.269


Shuffle the train set, then group the samples into batches of eight.

In [17]:
train_dataset_batch = train_dataset.shuffle(buffer_size=100).batch(8)
test_dataset_batch = test_dataset.batch(8)

In [18]:
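# Use new names so the `features` DataFrame defined earlier is not overwritten.
batch_features, batch_targets = next(iter(train_dataset_batch))
print('Features shape: {}, targets shape: {}'.format(batch_features.numpy().shape, batch_targets.numpy().shape))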

Features shape: (8, 6), targets shape: (8,)
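
With 1070 train samples and a batch size of 8, the last batch is smaller than the rest, because `batch()` keeps the remainder by default. The arithmetic, as a plain-Python sketch that does not need TensorFlow:

```python
import math

n_train, n_test, batch_size = 1070, 268, 8

# Number of batches, and the size of the final (possibly partial) batch.
train_batches = math.ceil(n_train / batch_size)
last_train_batch = n_train % batch_size or batch_size

test_batches = math.ceil(n_test / batch_size)
last_test_batch = n_test % batch_size or batch_size

print(train_batches, last_train_batch)  # 134 6
print(test_batches, last_test_batch)    # 34 4
```

Each pass over `train_dataset_batch` therefore yields 134 batches.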


# Model Training

In Keras, we can define the layers we want and stack them using `tf.keras.Sequential`. Our data has six features, so each input sample is a vector of length six; Keras infers this input shape the first time the model is called. The first `Dense` layer has six units, matching the number of features. It is followed by a second hidden layer with ten units, and an output layer with a single unit that outputs the prediction.

In [19]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(6, activation='relu', dtype='float64'),
    tf.keras.layers.Dense(10, activation='relu', dtype='float64'),
    tf.keras.layers.Dense(1, dtype='float64')
])
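
To make the architecture concrete, here is a NumPy sketch of the forward pass these three layers compute, using random placeholder weights (a trained model would hold learned values instead). It also counts the trainable parameters: each `Dense` layer has `inputs * units` weights plus `units` biases. The `forward` helper and `layer_sizes` list are illustrative names, not part of Keras.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [6, 6, 10, 1]  # input features, then units of each Dense layer

# One (weights, bias) pair per Dense layer, with random placeholder values.
params = [
    (rng.normal(size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

def forward(x):
    """Apply Dense+ReLU for the hidden layers and a linear output layer."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU
    return x

batch = rng.normal(size=(8, 6))  # one batch of 8 scaled samples
n_params = sum(w.size + b.size for w, b in params)

print(forward(batch).shape)  # (8, 1)
print(n_params)              # 123
```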


After the model is defined, we configure how it will be trained by calling the `.compile()` method and specifying the optimizer, loss function, and evaluation metrics.

In [20]:
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=[tf.keras.metrics.mean_absolute_percentage_error]
)
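
For intuition about the loss and the metric chosen above, here is a NumPy sketch of mean squared error and mean absolute percentage error on made-up numbers. Keras' own implementations may differ in small numerical details, such as how they guard against division by zero.

```python
import numpy as np

y_true = np.array([100.0, 200.0])  # made-up targets
y_pred = np.array([110.0, 180.0])  # made-up predictions

# Mean squared error: average of the squared differences.
mse = np.mean((y_true - y_pred) ** 2)

# Mean absolute percentage error: average relative error, in percent.
mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

print(mse)             # 250.0
print(round(mape, 6))  # 10.0
```

Because charges span a wide range, MAPE is easier to interpret than the raw MSE: a value of 10 means the predictions are off by 10% on average.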

Now we train the model with the `.fit()` method, passing the batched train dataset and the number of epochs. The `validation_data` parameter is optional; here we pass the batched test dataset so that the model is evaluated on it at the end of each epoch.

In [21]:
model.fit(
    train_dataset_batch, 
    epochs=150, 
    validation_data=test_dataset_batch
)

Epoch 1/150
Epoch 2/150
...
Epoch 150/150

<tensorflow.python.keras.callbacks.History at 0x7ff353e19b90>

# Inference 

After training the model, we can use it to predict charges for unseen data. The inputs must be scaled with the same `scaler` that was fitted on the train set.

In [22]:
predictions = model(test_X_scaled[:10]).numpy()

In [23]:
predictions

array([[16305.95843232],
       [ 8903.49716806],
       [27894.62161983],
       [ 4704.51260921],
       [11569.63860048],
       [10391.21323598],
       [ 5757.19897229],
       [ 3306.98482641],
       [ 3521.82256052],
       [ 7962.86875142]])
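
The model returns one row per sample and one column per output unit, so the predictions have shape `(n, 1)`. A small sketch of flattening them into a 1-D array, which is convenient when comparing against `test_y`. The values are the first three predictions copied from the output above, and `first_predictions` is an illustrative name.

```python
import numpy as np

# First three predictions copied from the output above, shape (3, 1).
first_predictions = np.array(
    [[16305.95843232], [8903.49716806], [27894.62161983]]
)

flat = first_predictions.ravel()  # shape (3,)
rounded = np.round(flat, 2)       # charges rounded to two decimals

print(first_predictions.shape, flat.shape)
print(rounded.tolist())
```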