# Exercise: Using a trained model on a new dataset

Previously, we created a basic model that let us find the relationship between a person's shoe length and their height. We showed how this model could then be used to make a prediction about a new, previously unseen person.

Last time, we built, trained, and then used our model in a single session. In the real world, however, we might want to train our model in January and use it in March. How can we do this?

Here we will:

1. Create a basic model
2. Save it to disk
3. Load it from disk
4. Use it to make predictions about a new dataset

## Load the first dataset

Let's begin by loading the dataset from a file.

In [6]:
import pandas

# Load a file containing people's shoe lengths
# and heights, both in cm
data = pandas.read_csv('shoe-size-height.csv')

# Print the first few rows
data.head()


Unnamed: 0  shoe_length  height
         0           29   177.6
         1           32   183.8
         2           15   159.0
         3           32   192.8
         4           24   172.6
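
The `Unnamed: 0` header in the output above suggests that the CSV file stores a row index in an unnamed first column. That is an assumption, as the file itself is not shown here. If it holds, a minimal sketch of how `index_col=0` keeps that column out of the data could look like this. The `csv_text` string is a stand-in for the real file and contains only the five rows printed above.

```python
import io
import pandas

# Stand-in for 'shoe-size-height.csv': the five rows shown above, with an
# unnamed first column holding the saved row index (an assumption).
csv_text = """,shoe_length,height
0,29,177.6
1,32,183.8
2,15,159.0
3,32,192.8
4,24,172.6
"""

# Default behaviour: the unnamed column is kept as data, named 'Unnamed: 0'
with_extra_column = pandas.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use that column as the row index instead
sample = pandas.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra_column.columns))  # ['Unnamed: 0', 'shoe_length', 'height']
print(list(sample.columns))             # ['shoe_length', 'height']
```

The rest of this exercise only reads the `shoe_length` and `height` columns by name, so it works either way.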


## Create and train a model

As we have done before, we will create a simple Linear Regression model and train it on our dataset.

This time, for variety, we will use a different Python package, scikit-learn, to run our linear regression.

In [7]:
from sklearn.linear_model import LinearRegression
import numpy as np


# For simplicity, we will call shoe_length X and height y. 
# We will also convert them into numpy arrays, as these
# work nicely with our linear regression library 
X = np.array(data["shoe_length"])
y = np.array(data["height"])


# scikit-learn expects the input features as a 2D array
# (rows x columns), even when there is only one column.
# Here, we reshape X, and y to match, into that layout.
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)

# Now create the linear regression model and fit it
# to our training dataset
model = LinearRegression()
model = model.fit(X, y)

print("Model trained!")

Model trained!
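
To make the reshaping step and the result of training more concrete, here is a small sketch that uses only the five rows printed earlier as a stand-in dataset. The names `X_small`, `y_small` and `small_model` are illustrative. The fitted values will differ from those of the model trained on the full file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Five rows from the table shown earlier, used as a tiny stand-in dataset
shoe_length = np.array([29, 32, 15, 32, 24])
height = np.array([177.6, 183.8, 159.0, 192.8, 172.6])

# reshape(-1, 1) turns a flat array of n values into n rows of 1 column
X_small = shoe_length.reshape(-1, 1)
y_small = height.reshape(-1, 1)
print(shoe_length.shape, "->", X_small.shape)  # (5,) -> (5, 1)

small_model = LinearRegression().fit(X_small, y_small)

# The fitted line is: height = intercept + slope * shoe_length.
# Because y was 2D, coef_ has shape (1, 1) and intercept_ has shape (1,).
slope = small_model.coef_[0, 0]
intercept = small_model.intercept_[0]
print(f"height = {intercept:.1f} + {slope:.2f} * shoe_length")
```

The slope and intercept are all a linear regression model learns from the data, which is why a trained model is small and cheap to store.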


## Saving and loading a model

Our model is ready to use, but we don't need it yet. Let's save it to disk.

In [8]:
import joblib
model_filename = './height_shoes_model.pkl'
joblib.dump(model, model_filename)

print("Model saved!")

Model saved!


Loading our model is just as easy:

In [9]:
model_loaded = joblib.load(model_filename)

print("We have loaded the following:", model_loaded)

We have loaded the following: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
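
A quick way to convince ourselves that nothing is lost by saving and loading is to compare predictions from the original and the reloaded model. The sketch below does this with a tiny stand-in model and a temporary directory. The names `original`, `restored` and `demo_model.pkl` are illustrative and not part of the exercise files.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model, trained on four rows for illustration only
X_demo = np.array([[15], [24], [29], [32]])
y_demo = np.array([[159.0], [172.6], [177.6], [183.8]])
original = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(original, demo_path)
    restored = joblib.load(demo_path)

# The reloaded model should give the same predictions as the original
new_shoes = np.array([[20], [30]])
same = np.allclose(original.predict(new_shoes), restored.predict(new_shoes))
print("Predictions match after reload:", same)  # True
```

Two practical caveats: files written by `joblib.dump` are pickle-based, so they should only be loaded from sources we trust, and a model is safest to load with the same scikit-learn version that was used to save it.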


## Putting it together

The normal use case for saved models is that we receive new data and want to run predictions on it without retraining the model.

Let's put everything here together to make a function that loads a model from disk and uses it to make predictions on new data.

In [11]:
def load_model_and_predict(path_to_model, path_to_data):
    '''
    This function loads a pretrained model and a dataset of 
    shoe sizes from disk. It uses the model to predict how 
    tall the people are, based on their shoe size.
    '''

    # Load the model and print basic information about it
    loaded_model = joblib.load(path_to_model)

    print("We have loaded the following model:", loaded_model)

    # Load the dataset and prepare it for our model
    data = pandas.read_csv(path_to_data)

    shoe_lengths = np.array(data["shoe_length"])
    shoe_lengths = shoe_lengths.reshape(-1, 1)

    # Use the model to make a prediction
    predicted_heights = loaded_model.predict(shoe_lengths)

    # Print out a table of the shoe lengths and predicted heights
    dataframe = pandas.DataFrame({
        "shoe_length (cm)": shoe_lengths[:, 0],
        "Predicted height (cm)": predicted_heights[:, 0],
    })
    print("\nPredictions:")
    print(dataframe)

load_model_and_predict(model_filename, 'shoe-size-and-height-dataset-2.csv')

We have loaded the following model: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Predictions:
    shoe_length (cm)  Predicted height (cm)
0               31.0             187.689241
1               19.0             162.434414
2               34.0             194.002948
3               15.0             154.016138
4               31.0             187.689241
5               24.0             172.957259
6               33.0             191.898379
7               24.0             172.957259
8               16.0             156.120707
9               19.0             162.434414
10              15.0             154.016138
11              28.0             181.375534
12              35.0             196.107517
13              28.0             181.375534
14              25.0             175.061828
15              20.0             164.538983
16              21.0             166.643552
17              24.0             172.957259
18              25.0             1
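
Printing the table is fine for a notebook, but in other settings it can be handier to get the predictions back as a value. As one possible variation, the sketch below takes an already loaded model and a DataFrame, checks that the expected column is present, and returns the table. The helper name `predict_heights` is hypothetical, and the stand-in model and data exist only so that the example runs on its own. It assumes the model was trained with a 2D `y`, as in this exercise, so that predictions come back with shape `(n, 1)`.

```python
import numpy as np
import pandas
from sklearn.linear_model import LinearRegression


def predict_heights(model, data):
    """Return a table of shoe lengths and predicted heights (hypothetical helper)."""
    if "shoe_length" not in data.columns:
        raise ValueError("Expected a 'shoe_length' column in the dataset")

    # Same preparation as before: one column of features, as a 2D array
    shoe_lengths = data["shoe_length"].to_numpy().reshape(-1, 1)
    predicted = model.predict(shoe_lengths)

    return pandas.DataFrame({
        "shoe_length (cm)": shoe_lengths[:, 0],
        "Predicted height (cm)": predicted[:, 0],
    })


# Stand-in model and data so that the sketch is self-contained
demo_model = LinearRegression().fit(
    np.array([[15], [24], [29], [32]]),
    np.array([[159.0], [172.6], [177.6], [183.8]]),
)
new_people = pandas.DataFrame({"shoe_length": [31.0, 19.0, 34.0]})

results = predict_heights(demo_model, new_people)
print(results)
```

Returning the DataFrame lets the caller decide whether to print it, save it with `to_csv`, or pass it on to further analysis.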


## Summary

Well done!
In this exercise, we practiced:

1. Creating basic models
2. Training, then saving them to disk
3. Loading them from disk
4. Making predictions with them using new datasets