# Train Model

This notebook trains a simple linear regression model on a synthetic dataset downloaded from cloud storage (AWS S3). It saves the trained model and its metrics locally, then uploads both to S3 for use elsewhere.

## Imports

In [1]:
import re
from datetime import date
from urllib.request import urlopen
from typing import Tuple

import boto3 as aws
import numpy as np
import pandas as pd
from joblib import dump
from sklearn.base import BaseEstimator
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error, max_error, r2_score
from sklearn.model_selection import train_test_split

## Load Dataset

Load the dataset from AWS S3. The URL is hard-coded to the file dated 2021-01-12, which is the most recent dataset at the time of writing.

In [2]:
data_url = ('http://bodywork-ml-ops-project.s3.eu-west-2.amazonaws.com'
            '/datasets/regression-dataset-2021-01-12.csv')

data = pd.read_csv(urlopen(data_url))
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1440 entries, 0 to 1439
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   y       1440 non-null   float64
 1   X       1440 non-null   float64
dtypes: float64(2)
memory usage: 22.6 KB
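
The URL above is pinned to a single date. If several dated files were available, the most recent one could be chosen by parsing the date embedded in each name. The following is a minimal sketch of that idea, not part of the original notebook: `latest_dataset_key` and the example key names are hypothetical, and it assumes every dataset name contains a date in `YYYY-MM-DD` form.

```python
import re
from datetime import date
from typing import Iterable, Optional

# Matches the YYYY-MM-DD date assumed to be embedded in each dataset name.
DATE_PATTERN = re.compile(r'\d{4}-\d{2}-\d{2}')


def latest_dataset_key(keys: Iterable[str]) -> Optional[str]:
    """Return the key with the most recent embedded date, or None if no key has one."""
    dated = []
    for key in keys:
        match = DATE_PATTERN.search(key)
        if match is None:
            continue  # skip names without a date
        try:
            dated.append((date.fromisoformat(match.group()), key))
        except ValueError:
            continue  # skip strings that look like dates but are not valid ones
    if not dated:
        return None
    return max(dated, key=lambda pair: pair[0])[1]


# Hypothetical key names, for illustration only.
example_keys = [
    'datasets/regression-dataset-2021-01-11.csv',
    'datasets/regression-dataset-2021-01-12.csv',
    'datasets/README.txt',
]
print(latest_dataset_key(example_keys))
```

Comparing parsed `date` objects rather than raw strings means an invalid date such as `2021-13-40` is ignored instead of being silently treated as the latest file.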


## Define Task Metrics

This is a regression task, so we focus on:

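* Mean Absolute Percentage Error (MAPE): the average absolute error relative to the actual value. scikit-learn reports it as a fraction, so 0.17 means 17%.
* R-Squared (R2): the share of variance in the target that the model explains.
* Maximum Residual (MR): the largest absolute error on any single observation.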

In [3]:
def model_metrics(y_actual, y_predicted) -> pd.DataFrame:
    """Return regression metrics record."""
    mape = mean_absolute_percentage_error(y_actual, y_predicted)
    r_squared = r2_score(y_actual, y_predicted)
    max_residual = max_error(y_actual, y_predicted)
    metrics_record = pd.DataFrame({
        'MAPE': [mape],
        'R2': [r_squared],
        'MR': [max_residual]
    })
    return metrics_record
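
To make the three metrics concrete, here is a minimal pure-Python sketch of what each one computes. It is an illustration, not the scikit-learn implementation: the function names are hypothetical, and it assumes no actual value is zero (scikit-learn guards against that case, this sketch does not).

```python
from typing import Sequence


def mape(y_actual: Sequence[float], y_predicted: Sequence[float]) -> float:
    """Mean of |actual - predicted| / |actual|, returned as a fraction."""
    ratios = [abs(a - p) / abs(a) for a, p in zip(y_actual, y_predicted)]
    return sum(ratios) / len(ratios)


def r_squared(y_actual: Sequence[float], y_predicted: Sequence[float]) -> float:
    """One minus the ratio of residual to total sum of squares."""
    mean_actual = sum(y_actual) / len(y_actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(y_actual, y_predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in y_actual)
    return 1.0 - ss_residual / ss_total


def max_residual(y_actual: Sequence[float], y_predicted: Sequence[float]) -> float:
    """Largest absolute error over all observations."""
    return max(abs(a - p) for a, p in zip(y_actual, y_predicted))


# Toy values chosen so the arithmetic is easy to check by hand.
actual = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]
print(mape(actual, predicted), r_squared(actual, predicted), max_residual(actual, predicted))
```

For these toy values the relative errors are 0.2, 0.1 and 0, so MAPE is about 0.1, and the largest absolute error is 2.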

## Split Data into Train and Test Subsets

We hold out 20% of the data for testing the model and fix the random seed so that the split is reproducible.

In [4]:
X = data['X'].values.reshape(-1, 1)
y = data['y'].values

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
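
With 1440 rows, a 20% hold-out leaves 1152 rows for training and 288 for testing. The sketch below illustrates the idea of a seeded, shuffled hold-out using only the standard library. It is not how `train_test_split` is implemented and will not reproduce the same rows; `holdout_indices` is a hypothetical helper.

```python
import math
import random
from typing import List, Tuple


def holdout_indices(n_rows: int, test_fraction: float, seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle row indices with a fixed seed and reserve a fraction for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed gives the same shuffle
    n_test = math.ceil(n_rows * test_fraction)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return train_idx, test_idx


train_idx, test_idx = holdout_indices(n_rows=1440, test_fraction=0.2, seed=42)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two subsets, so the test metrics are computed on data the model never saw during fitting.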

## Train Model and Compute Metrics

In [5]:
ols_regressor = LinearRegression(fit_intercept=True)
ols_regressor.fit(X_train, y_train)
metrics = model_metrics(y_test, ols_regressor.predict(X_test))

for k, v in metrics.to_dict().items():
    print(f'{k}: {v[0]:.2f}')

MAPE: 0.17
R2: 0.66
MR: 31.10
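
With a single feature and an intercept, ordinary least squares has a simple closed form: the slope is the covariance of X and y divided by the variance of X, and the intercept makes the line pass through the means. The sketch below works through that formula in plain Python as an illustration. `fit_simple_ols` is a hypothetical helper, and it assumes the X values are not all identical.

```python
from typing import Sequence, Tuple


def fit_simple_ols(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx  # assumes s_xx > 0, i.e. xs are not all equal
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Noise-free toy data generated from y = 2x + 1.
slope, intercept = fit_simple_ols([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)
```

On noise-free data the fit recovers the generating line. On the notebook's dataset, the fitted line would not be perfect, which is consistent with the R2 of 0.66 reported above.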


## Persist Model and Metrics

Save the model and its metrics to local files named after the dataset date, then upload both artefacts to AWS S3.

In [6]:
def make_artefact_filenames(data_url: str) -> Tuple[str, str]:
    """Generate model and metrics filenames from data URL."""
    data_date = re.findall('20[0-9][0-9]-[0-1][0-9]-[0-3][0-9]', data_url)[0]
    model_filename = f'regressor-{data_date}.joblib'
    metrics_filename = f'regressor-{data_date}.csv'
    return (model_filename, metrics_filename)


model_filename, metrics_filename = make_artefact_filenames(data_url)
dump(ols_regressor, model_filename)
metrics.to_csv(metrics_filename, header=True, index=False)

s3_bucket_name = 'bodywork-ml-ops-project'
s3_client = aws.client('s3')

s3_client.upload_file(
    model_filename,
    s3_bucket_name,
    f'models/{model_filename}'
)
print(f'uploaded {model_filename} to s3://{s3_bucket_name}/models/')

s3_client.upload_file(
    metrics_filename,
    s3_bucket_name,
    f'model-metrics/{metrics_filename}'
)
print(f'uploaded {metrics_filename} to s3://{s3_bucket_name}/model-metrics/')

uploaded regressor-2021-01-12.joblib to s3://bodywork-ml-ops-project/models/
uploaded regressor-2021-01-12.csv to s3://bodywork-ml-ops-project/model-metrics/
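
The filename helper above indexes the first regex match directly, so a URL without a date fails with an `IndexError` that does not say what went wrong. A more defensive variant might look like the sketch below. The name `make_artefact_filenames_checked` is hypothetical, and the date pattern is the same loose check used above (it does not validate calendar dates).

```python
import re
from typing import Tuple

# Same shape as the pattern used above: 20YY-MM-DD, loosely checked.
DATE_IN_URL = re.compile(r'20[0-9][0-9]-[0-1][0-9]-[0-3][0-9]')


def make_artefact_filenames_checked(data_url: str) -> Tuple[str, str]:
    """Generate model and metrics filenames, failing clearly if the URL has no date."""
    match = DATE_IN_URL.search(data_url)
    if match is None:
        raise ValueError(f'no date of the form YYYY-MM-DD found in {data_url!r}')
    data_date = match.group()
    return (f'regressor-{data_date}.joblib', f'regressor-{data_date}.csv')


example_url = ('http://bodywork-ml-ops-project.s3.eu-west-2.amazonaws.com'
               '/datasets/regression-dataset-2021-01-12.csv')
print(make_artefact_filenames_checked(example_url))
```

Raising a `ValueError` with the offending URL in the message makes a misnamed dataset easier to diagnose before anything is written to disk or uploaded.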
