# Demonstration Machine Learning Model

This notebook outlines a simple machine-learning process using [Polars](https://pola.rs/) as the dataframe tool and [scikit-learn](https://scikit-learn.org/stable/index.html) for modelling. The (fake) data records ice-cream sales against mean daily temperatures. The model is intended to predict ice-cream sales from the mean daily temperature.

The following tasks will be executed:
1. Read initial data from S3
2. Separate the data into training and test sets
3. Train a linear regression model, verifying that its $R^2$ is adequate
4. Test the remaining data against the predictions of the model
5. If adequate, save the model to S3

Then, we will assume the model is being used at a later date to analyse new data (which does not conform to the previous data).
1. Load the serialised model from S3
2. Read the new data from S3
3. Compare it with the predictions of the loaded model and demonstrate that it is a poor fit
4. Create new training and test data sets
5. Retrain and test the model
6. Save the new parameters

Polars is already installed on this instance, so the install command below is left commented out. Uncomment it if Polars is missing.

In [None]:
# %pip install polars # does not need to be run as Polars is installed

Then we import all the necessary dependencies. The other libraries are either part of Python or included on the instance by default.

In [1]:
import polars as pl
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
import pickle
import io
import boto3
import re
import os



Specify the bucket and the source path of the initial data.

In [2]:
BUCKET = 'sfc-mt-sagemaker-demo'
DATA1 = f's3://{BUCKET}/data1/'

Set up the boto3 client and delete any parameters saved by previous runs.

In [3]:
s3 = boto3.client('s3')
# Remove parameters left over from earlier runs so the demo starts clean.
# S3 delete_object succeeds even when the key does not exist, so an exception
# here usually points to a permissions or connectivity problem.
try:
    for key in ('params/v1/params.pkl', 'params/v2/params.pkl'):
        s3.delete_object(Bucket=BUCKET, Key=key)
        print(f"Deleted {key} from {BUCKET}")
except Exception as e:
    print(e)
    print('Deletion failed')

Deleted params/v1/params.pkl from sfc-mt-sagemaker-demo
Deleted params/v2/params.pkl from sfc-mt-sagemaker-demo


## Read the initial data from S3

In [6]:
response = s3.get_object(Bucket=BUCKET, Key='data1/sales1.parquet')

ClientError: An error occurred (AccessDenied) when calling the GetObject operation: User: arn:aws:sts::344210435447:assumed-role/dev-sagemaker-execution-role/SageMaker is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::sfc-mt-sagemaker-demo/data1/sales1.parquet" because no identity-based policy allows the s3:GetObject action

In [4]:
df1 = pl.read_parquet(DATA1)
df1.head()

OSError: object-store error: The operation lacked the necessary privileges to complete for path data1/sales1.parquet: Error performing HEAD https://s3.eu-west-2.amazonaws.com/sfc-mt-sagemaker-demo/data1/sales1.parquet in 45.824217ms - Server returned non-2xx status code: 403 Forbidden: 

This error occurred with the following context stack:
	[1] 'parquet scan'
	[2] 'sink'


## Separate into training and test sets

In [None]:
TRAIN_FRAC = 0.7
df_train1 = df1.sample(fraction=TRAIN_FRAC, with_replacement=False, shuffle=True, seed=55)
df_test1 = df1.join(df_train1, on=df1.columns, how='anti')
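
One caveat with the anti-join above: it removes every row of `df1` that matches a sampled row on all columns, so if the data contains exact duplicate rows, unsampled duplicates are dropped from the test set as well. A minimal sketch of an alternative, splitting on row positions rather than row values, is shown below in plain Python. `split_indices` is a hypothetical helper written for illustration; it is not part of Polars or scikit-learn.

```python
import random


def split_indices(n_rows, train_frac, seed):
    """Return (train_idx, test_idx): disjoint row positions covering range(n_rows)."""
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_train = round(n_rows * train_frac)
    # Sort each part so the rows keep their original relative order.
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# Illustrative sizes only; the real row count would come from the dataframe.
train_idx, test_idx = split_indices(n_rows=10, train_frac=0.7, seed=55)
print(len(train_idx), len(test_idx))
```

Because the split is made on positions, two rows with identical values can land on different sides of the split without either being lost.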

## Training
1. Put the data into numpy arrays for use by scikit-learn
2. Create and fit the model
3. Get the predicted values from the model

In [None]:
x1 = df_train1['MeanDailyTemperature'].to_numpy().reshape(-1,1)
y1 = df_train1['IceCreamSales'].to_numpy()
model1 = LinearRegression()
model1.fit(x1, y1)
y_pred1 = model1.predict(x1)
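
With a single feature, the ordinary least-squares fit that `LinearRegression` performs has a simple closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below works this through in plain Python on made-up numbers. `fit_line` is a hypothetical helper, and the data is not taken from the sales files.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up temperatures (°C) and sales (£) lying exactly on y = 20x + 100.
temps = [10.0, 15.0, 20.0, 25.0]
sales = [300.0, 400.0, 500.0, 600.0]
slope, intercept = fit_line(temps, sales)
print(slope, intercept)
```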

### Plot the training set and regression line

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(x1, y1, alpha=0.5, s=10, label='Ice Cream Data')
ax.plot(x1, y_pred1, color='red', linewidth=2, label='Regression Line')
ax.set_title('Ice Cream Sales Vs Mean Daily Temperature, with Regression')
ax.set_xlabel('Mean Daily Temperature (°C)')
ax.set_ylabel('Ice Cream Sales (£)')
ax.ticklabel_format(style='plain', axis='y')
ax.grid(True, linestyle=':', alpha=0.7)
ax.legend()
plt.tight_layout()
plt.show()

Visually, this looks like a very good fit, but we should measure the fit numerically.

### Verify that the $R^2$ is good

In [None]:
r2_1 = r2_score(y1, y_pred1)

print(f'The model has an R2 score {r2_1}')


This is a high value so we are happy that the fit is good.
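
For reference, the score is the coefficient of determination: one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The plain-Python sketch below spells out that definition on made-up values. `r_squared` is a hypothetical stand-in for illustration, not the scikit-learn implementation.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after prediction.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the observations around their mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up sales figures, for illustration only.
actual = [300.0, 400.0, 500.0, 600.0]
perfect = [300.0, 400.0, 500.0, 600.0]
mean_only = [450.0, 450.0, 450.0, 450.0]
print(r_squared(actual, perfect))    # predictions match every observation
print(r_squared(actual, mean_only))  # predictions ignore temperature entirely
```

A score of 1 means the predictions match exactly, 0 means the model does no better than always predicting the mean, and the score can go negative when predictions are worse than the mean.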

## Check model against test set

In [None]:
x_test1 = df_test1['MeanDailyTemperature'].to_numpy().reshape(-1,1)
y_test1 = df_test1['IceCreamSales'].to_numpy()
y_test_pred1 = model1.predict(x_test1)
r2_test1 = r2_score(y_test1, y_test_pred1)
print(f'The model scores {r2_test1} on test data')

Again, this is a high value, so we are happy that the model is robust.

Now we can extract the model parameters if desired.

In [None]:
slope = model1.coef_[0]
intercept = model1.intercept_
print(f'The model slope is {slope}, and the intercept is {intercept}')

## Save the model to S3

We will serialise the model using the built-in [pickle](https://docs.python.org/3/library/pickle.html) module. For a simple linear model, it is possible just to save the model parameters in plain text, but it is better practice to save the full model instance for re-use.

In [None]:
buffer = io.BytesIO()
pickle.dump(model1, buffer)
buffer.seek(0)

s3.upload_fileobj(buffer, BUCKET, 'params/v1/params.pkl')
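
As an illustration of the plain-text alternative mentioned above, the two parameters of a one-feature linear model could be stored as JSON and used to rebuild predictions without scikit-learn. This is only a sketch: `params_to_json` and `predict_from_json` are hypothetical helpers, and the parameter values are invented.

```python
import json


def params_to_json(slope, intercept):
    """Serialise the two line parameters as a JSON string."""
    return json.dumps({"slope": slope, "intercept": intercept})


def predict_from_json(text, temperature):
    """Rebuild the line from JSON and predict sales for one temperature."""
    params = json.loads(text)
    return params["slope"] * temperature + params["intercept"]


# Hypothetical parameter values, for illustration only.
text = params_to_json(slope=20.0, intercept=100.0)
print(text)
print(predict_from_json(text, 18.0))
```

This approach stops working as soon as the model has more state than a slope and an intercept, which is one reason for saving the full model instance instead.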

## New data

Now we imagine that some time has passed and new data is available. First the analyst reads in the model that has been saved to S3. Then s/he reads in the new data and compares it to the model predictions. If these turn out to be a poor fit, the training/testing/serialisation cycle is restarted.
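
What counts as a "poor fit" is a judgement call. One way to make it explicit is to compare the score on the new data with the score recorded when the model was trained, and to retrain when the drop exceeds a tolerance. The sketch below assumes a tolerance of 0.1, which is an arbitrary illustrative value; `needs_retraining` is a hypothetical helper, and the scores are invented.

```python
def needs_retraining(r2_new, r2_baseline, max_drop=0.1):
    """Flag a retrain when R² on new data falls more than max_drop below the baseline."""
    return (r2_baseline - r2_new) > max_drop


# Invented scores, for illustration only.
print(needs_retraining(r2_new=0.55, r2_baseline=0.95))  # large drop
print(needs_retraining(r2_new=0.93, r2_baseline=0.95))  # small drop
```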

## Retrieve the serialised model

In [None]:
download = io.BytesIO()
s3.download_fileobj(BUCKET, 'params/v1/params.pkl', download)
download.seek(0)
loaded_model = pickle.load(download)
loaded_model.coef_[0], loaded_model.intercept_

## Retrieve the latest data

In [None]:
DATA2 = f's3://{BUCKET}/data2/'
df2 = pl.read_parquet(DATA2)

## Compare the model predictions to the new data

In [None]:
x2 = df2['MeanDailyTemperature'].to_numpy().reshape(-1,1)
y2 = df2['IceCreamSales'].to_numpy()
y_pred2 = loaded_model.predict(x2)
r2_loaded = r2_score(y2, y_pred2)
print(f'The R2 of the new data is {r2_loaded}')

This is a much lower score, indicating that the model is not doing as good a job as it was previously. We can see this if we plot the data and regression line...

In [None]:
fig2, ax2 = plt.subplots(figsize=(10, 6))
ax2.scatter(x2, y2, alpha=0.5, s=10, label='Ice Cream Data')
ax2.plot(x2, y_pred2, color='red', linewidth=2, label='Regression Line')
ax2.set_title('Ice Cream Sales Vs Mean Daily Temperature, with Regression')
ax2.set_xlabel('Mean Daily Temperature (°C)')
ax2.set_ylabel('Ice Cream Sales (£)')
ax2.ticklabel_format(style='plain', axis='y')
ax2.grid(True, linestyle=':', alpha=0.7)
ax2.legend()
plt.tight_layout()
plt.show()

## Repeat Training, Testing and Serialisation

### Separate the training and test sets

In [None]:
df_train2 = df2.sample(fraction=TRAIN_FRAC, with_replacement=False, shuffle=True, seed=55)
df_test2 = df2.join(df_train2, on=df2.columns, how='anti')

### Train

In this case, we will simply train the model on the new data set. Alternatively, we could train it on the new data and previous data combined.

In [None]:
x2 = df_train2['MeanDailyTemperature'].to_numpy().reshape(-1,1)
y2 = df_train2['IceCreamSales'].to_numpy()
model2 = LinearRegression()
model2.fit(x2, y2)
y_pred2 = model2.predict(x2)
r2_2 = r2_score(y2, y_pred2)

print(f'The new model has an R2 score {r2_2}')

This is much better and we can see the effect using another plot.

In [None]:
fig2, ax2 = plt.subplots(figsize=(10, 6))
ax2.scatter(x2, y2, alpha=0.5, s=10, label='Ice Cream Data')
ax2.plot(x2, y_pred2, color='red', linewidth=2, label='Regression Line')
ax2.set_title('Ice Cream Sales Vs Mean Daily Temperature, with Regression')
ax2.set_xlabel('Mean Daily Temperature (°C)')
ax2.set_ylabel('Ice Cream Sales (£)')
ax2.ticklabel_format(style='plain', axis='y')
ax2.grid(True, linestyle=':', alpha=0.7)
ax2.legend()
plt.tight_layout()
plt.show()

### Test

In [None]:
x_test2 = df_test2['MeanDailyTemperature'].to_numpy().reshape(-1,1)
y_test2 = df_test2['IceCreamSales'].to_numpy()
y_test_pred2 = model2.predict(x_test2)
r2_test2 = r2_score(y_test2, y_test_pred2)
print(f'The new model scores {r2_test2} on test data')

In [None]:
slope = model2.coef_[0]
intercept = model2.intercept_
print(f'The new model slope is {slope}, and the intercept is {intercept}')

### Serialise

In [None]:
buffer = io.BytesIO()
pickle.dump(model2, buffer)
buffer.seek(0)

s3.upload_fileobj(buffer, BUCKET, 'params/v2/params.pkl')
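
The two runs above write to `params/v1/` and `params/v2/` by hand. If the cycle were repeated regularly, the next key could be derived from the previous one. The sketch below shows one way to do that with the `re` module; `next_params_key` is a hypothetical helper, and it assumes keys keep the `/v<number>/` layout used in this notebook.

```python
import re


def next_params_key(key):
    """Increment the version number in a key such as 'params/v1/params.pkl'."""
    match = re.search(r"/v(\d+)/", key)
    if match is None:
        raise ValueError(f"no version segment in {key!r}")
    version = int(match.group(1)) + 1
    # Rebuild the key around the first '/v<number>/' segment.
    return key[:match.start()] + f"/v{version}/" + key[match.end():]


print(next_params_key("params/v1/params.pkl"))
```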