# Test Model-Scoring Service

This notebook tests the model-scoring service with newly generated data: data generated for period `t+1` is used to test a model trained on all data up to and including period `t`.

## Imports

In [1]:
import re
from datetime import date, datetime
from io import BytesIO
from typing import Dict, Tuple

import requests
import boto3 as aws
import numpy as np
import pandas as pd
from botocore.exceptions import ClientError

## Load Newly Generated Data File

Load the newly generated CSV data file from the project's AWS S3 bucket. We start by defining a helper function that finds the most recent file by the date embedded in its object key.

In [2]:
def download_latest_data_file(aws_bucket: str) -> Tuple[pd.DataFrame, date]:
    """Get latest data file from AWS S3 bucket."""
    def _date_from_object_key(key: str) -> date:
        """Extract date from S3 file object key."""
        date_string = re.findall('20[2-9][0-9]-[0-1][0-9]-[0-3][0-9]', key)[0]
        file_date = datetime.strptime(date_string, '%Y-%m-%d').date()
        return file_date

    print(f'downloading latest data file from s3://{aws_bucket}/datasets')
    try:
        s3_client = aws.client('s3')
        s3_objects = s3_client.list_objects(Bucket=aws_bucket, Prefix='datasets/')
        object_keys_and_dates = [
            (obj['Key'], _date_from_object_key(obj['Key']))
            for obj in s3_objects['Contents']
        ]
        latest_file_obj = sorted(object_keys_and_dates, key=lambda e: e[1])[-1]
        latest_file_obj_key = latest_file_obj[0]
        object_data = s3_client.get_object(Bucket=aws_bucket, Key=latest_file_obj_key)
        data = pd.read_csv(BytesIO(object_data['Body'].read()))
        dataset_date = latest_file_obj[1]
    except ClientError:
        print(f'failed to download data file from s3://{aws_bucket}/datasets')
        raise  # without this, the return below fails on undefined variables
    return (data, dataset_date)
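
The key-selection step can be tried without any AWS access. The following is a minimal sketch, not the notebook's own code: the names `date_from_key`, `latest_key` and the example keys are hypothetical. It differs from the helper above in one respect, which is that keys with no date in them (such as a bare `datasets/` prefix entry) are skipped rather than raising an `IndexError`.

```python
import re
from datetime import date, datetime
from typing import List, Optional, Tuple

# Same date pattern as the helper above, compiled once.
DATE_PATTERN = re.compile(r'20[2-9][0-9]-[0-1][0-9]-[0-3][0-9]')


def date_from_key(key: str) -> Optional[date]:
    """Return the first YYYY-MM-DD date found in an object key, or None."""
    match = DATE_PATTERN.search(key)
    if match is None:
        return None
    return datetime.strptime(match.group(0), '%Y-%m-%d').date()


def latest_key(keys: List[str]) -> Tuple[str, date]:
    """Return the (key, date) pair whose key carries the most recent date."""
    dated = [(k, date_from_key(k)) for k in keys]
    dated = [(k, d) for k, d in dated if d is not None]
    if not dated:
        raise ValueError('no object key contains a date')
    return max(dated, key=lambda pair: pair[1])


# Hypothetical object keys, for illustration only.
example_keys = [
    'datasets/',
    'datasets/regression-dataset-2021-01-12.csv',
    'datasets/regression-dataset-2021-01-13.csv',
]
print(latest_key(example_keys))
```

Using `max` instead of sorting the whole list gives the same result here and avoids building a sorted copy.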


Apply `download_latest_data_file` to the project's S3 bucket.

In [3]:
test_data, test_data_date = download_latest_data_file('bodywork-ml-ops-project')
print(f'- most recent data added on {test_data_date}\n')
display(test_data)

downloading latest data file from s3://bodywork-ml-ops-project/datasets
- most recent data added on 2021-01-13



| Unnamed: 0 | y | X |
|---|---|---|
| 0 | 31.087521 | 31.354226 |
| 1 | 52.499153 | 41.401191 |
| 2 | 48.643840 | 62.047707 |
| 3 | 76.006814 | 86.283587 |
| 4 | 76.897401 | 78.501607 |
| ... | ... | ... |
| 1435 | 63.243651 | 65.926000 |
| 1436 | 36.271838 | 6.710064 |
| 1437 | 41.803066 | 27.505270 |
| 1438 | 45.139632 | 50.133207 |


## Score Latest Data using Current Scoring-Service API Endpoint

We use the model-scoring REST API endpoint to get predictions for every instance in the new dataset. We use the known labels together with the scores to compute errors.

In [4]:
def get_model_score(url: str, features: Dict[str, float]) -> float:
    """Request score from REST API for a single instance of data."""
    session = requests.Session()
    session.mount(url, requests.adapters.HTTPAdapter(max_retries=3))
    response = session.post(url, json=features)
    response.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
    return response.json()['prediction']


def analyse_model_score(score: float, label: float) -> Tuple[float, float, float]:
    """Compute performance metrics for model score."""
    absolute_percentage_error = abs(score / label - 1)
    return (score, label, absolute_percentage_error)


def generate_model_test_results(url: str, test_data: pd.DataFrame) -> pd.DataFrame:
    """Get test results for all test data."""
    def single_test_result(X: float, label: float) -> Tuple[float, float, float]:
        score = get_model_score(url, {'X': X})
        test_result = analyse_model_score(score, label)
        return test_result
    
    # use a new name so the results list does not shadow the input DataFrame
    results = [single_test_result(row.X, row.y) for row in test_data.itertuples()]
    return pd.DataFrame(results, columns=['score', 'label', 'APE'])

        
scoring_service_url = 'http://localhost:5000/score/v1'
test_results = generate_model_test_results(scoring_service_url, test_data)
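
To make the error measure concrete, here is a small worked example of the absolute percentage error (APE) formula used in `analyse_model_score`, `abs(score / label - 1)`. The function name and the numbers are made up for illustration and do not come from the dataset.

```python
def absolute_percentage_error(score: float, label: float) -> float:
    """APE as defined above: |score / label - 1|."""
    return abs(score / label - 1)


# Made-up (score, label) pairs, for illustration only.
examples = [(55.0, 50.0), (45.0, 50.0), (5.0, 1.0)]
for score, label in examples:
    ape = absolute_percentage_error(score, label)
    print(f'score={score}, label={label}, APE={ape:.3f}')
```

The last pair shows that a small label inflates the APE even when the absolute error is modest, and that a label of exactly zero would raise a `ZeroDivisionError`.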

## Analyse Test Results

Compute test metrics using scores and labels.

In [5]:
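def test_metrics(test_results: pd.DataFrame, results_date: date) -> pd.DataFrame:
    """Compute summary test metrics from scores, labels and APEs."""
    MAPE = test_results.APE.mean()
    # note: Series.corr returns the Pearson correlation between scores and labels
    R2 = test_results.score.corr(test_results.label)
    MR = test_results.APE.max()  # maximum APE across all instances
    return pd.DataFrame({'date': [results_date], 'MAPE': [MAPE], 'R2': [R2], 'MR': [MR]})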


test_metrics = test_metrics(test_results, test_data_date)
for k, v in test_metrics.to_dict().items():
    print(f'{k}: {v[0]}')

date: 2021-01-13
MAPE: 0.17367066675741
R2: 0.8271467437255443
MR: 2.9814631505893168
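
The three metrics can be reproduced on toy data without pandas. The sketch below is illustrative only: `pearson_r` and the sample numbers are hypothetical. It also shows the relationship between the Pearson correlation `r`, which is what `Series.corr` returns by default, and its square, in case a squared measure is what is wanted under the `R2` label.

```python
from math import sqrt
from typing import List


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up scores and labels, for illustration only.
scores = [10.0, 20.0, 30.0, 40.0]
labels = [12.0, 18.0, 33.0, 39.0]

apes = [abs(s / l - 1) for s, l in zip(scores, labels)]
mape = sum(apes) / len(apes)   # mean absolute percentage error
max_ape = max(apes)            # worst single-instance error
r = pearson_r(scores, labels)  # correlation between scores and labels

print(f'MAPE={mape:.4f}, max APE={max_ape:.4f}, r={r:.4f}, r^2={r ** 2:.4f}')
```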


## Persist Test Results

Upload test metrics to AWS S3.

In [7]:
test_metrics_filename = f'regressor-test-results-{test_data_date}.csv'
test_metrics.to_csv(test_metrics_filename, header=True, index=False)

s3_bucket_name = 'bodywork-ml-ops-project'
s3_client = aws.client('s3')

s3_client.upload_file(
    test_metrics_filename,
    s3_bucket_name,
    f'test-metrics/{test_metrics_filename}'
)
print(f'uploaded {test_metrics_filename} to s3://{s3_bucket_name}/test-metrics/')

uploaded regressor-test-results-2021-01-13.csv to s3://bodywork-ml-ops-project/test-metrics/
