### Model Training Pipeline

This notebook retrains the model and saves it, along with its performance metrics (accuracy and AUC), to the Hopsworks.ai Model Registry.

It executes Notebook 07 as part of the process and uses the parameters set there (GPU or no GPU, whether to retune hyperparameters, and so on).

Notebook 07 is executed as a subprocess, and its output is captured and displayed in this notebook. Notebook 07 is used rather than a full conversion to .py scripts because, although Neptune.ai experiment tracking is integrated, I also like to be able to review the output in the notebook.


This Notebook does the following:
 - Retrieves train and test datasets from the Feature Store, split according to how many DAYS back from today you want to use as the test dataset.
 - Saves these datasets as CSV files in the data directory, where Notebook 07 expects to find them.
 - Executes Notebook 07 as a subprocess and captures the output.
 - Saves the model and performance metrics to the Hopsworks.ai Model Registry.


In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os

import pandas as pd
import numpy as np

import hopsworks

from hsml.schema import Schema
from hsml.model_schema import ModelSchema
from hsfs.client.exceptions import RestAPIError

import json

from datetime import datetime, timedelta

# change working directory to project root when running from notebooks folder to make it easier to import modules
# and to access sibling folders
os.chdir("..")

from pathlib import Path  # for Windows/Linux compatibility


from src.utils.hopsworks_utils import (
    convert_feature_names,
    create_train_test_data,
)

from dotenv import load_dotenv

In [None]:
CONFIGS_PATH = Path.cwd() / "configs"
DATA_PATH = Path.cwd() / "data"
NOTEBOOKS_PATH = Path.cwd() / "notebooks"
MODELS_PATH = Path.cwd() / "models"

**Parameters**

Train and Test will be divided by date. The earliest chunk of data will be used as the train dataset and the last DAYS of data will be used as the test dataset.

STARTDATE: The date from which the train dataset starts. The train dataset consists of all data from this date forward, except for the last DAYS days, which are held out as the test dataset.

DAYS: The number of days to use as the test dataset. The test dataset will be the last DAYS days of data.
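
The split described above can be sketched with plain pandas. This is only an illustration of the idea, not the actual implementation of `create_train_test_data`, which is not shown in this notebook. The function name `split_by_date`, the column name `GAME_DATE`, the explicit `today` argument, and the demo rows are all assumptions made for this example.

```python
import pandas as pd


def split_by_date(df, start_date, days, today, date_col="GAME_DATE"):
    """Split df into train/test by date (hypothetical helper, not the project's code).

    Rows before start_date are dropped, rows from the last `days` days
    before `today` become the test set, and everything in between is train.
    """
    start = pd.Timestamp(start_date)
    cutoff = pd.Timestamp(today) - pd.Timedelta(days=days)

    in_range = df[df[date_col] >= start]
    train_part = in_range[in_range[date_col] < cutoff]
    test_part = in_range[in_range[date_col] >= cutoff]
    return train_part, test_part


# Made-up rows, used only to demonstrate the split
games = pd.DataFrame(
    {
        "GAME_DATE": pd.to_datetime(
            ["2002-12-31", "2023-01-01", "2023-03-10", "2023-03-25"]
        ),
        "TARGET": [1, 0, 1, 0],
    }
)

train_demo, test_demo = split_by_date(
    games, start_date="2003-01-01", days=30, today="2023-03-31"
)
print(len(train_demo), len(test_demo))
```

Passing `today` explicitly keeps the sketch reproducible. In this notebook the reference point is the current date.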

In [None]:
STARTDATE = "2003-01-01"  # start date "YYYY-MM-DD" for training data, data goes back to 2003 season "2003-01-01"
DAYS = 30  # number of most recent days to use as test data

In [None]:
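# load variables from a local .env file, if one exists
load_dotenv()

# os.getenv returns None rather than raising when the variable is missing
HOPSWORKS_API_KEY = os.getenv("HOPSWORKS_API_KEY")
if not HOPSWORKS_API_KEY:
    raise Exception("Set environment variable HOPSWORKS_API_KEY")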

**Connect to Hopsworks FeatureStore and Pull Train and Test data**

In [None]:
train, test = create_train_test_data(HOPSWORKS_API_KEY, STARTDATE, DAYS)

**Save data**

So that the existing model training notebook can be reused, the data is first saved to files (currently under 100 MB in total).

In [None]:
train.to_csv(DATA_PATH / "processed" / "train_selected.csv", index=False)
test.to_csv(DATA_PATH / "processed" / "test_selected.csv", index=False)

**Model Training**

The existing model training notebook is reused. It includes Neptune.ai experiment tracking for both the training run and hyperparameter tuning.
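
The introduction describes Notebook 07 as being run as a subprocess, while the `%run` magic used in this notebook executes it within the current kernel. If a separate process is preferred, one possible approach is to call `jupyter nbconvert` through Python's `subprocess` module. The sketch below assumes `nbconvert` is installed. The helper names and the output name `07_model_testing_executed` are hypothetical, and only the command-building step is exercised here.

```python
import subprocess
from pathlib import Path


def build_nbconvert_command(notebook, output_name, output_dir):
    """Build a `jupyter nbconvert` command that executes a notebook and saves the result."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute", str(notebook),
        "--output", output_name,
        "--output-dir", str(output_dir),
    ]


def run_notebook(notebook, output_name, output_dir):
    """Run the notebook in a child process and return the captured console output."""
    result = subprocess.run(
        build_nbconvert_command(notebook, output_name, output_dir),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the notebook fails
    )
    # combine both streams so that no console output is lost
    return result.stdout + result.stderr


# Build (but do not run) the command for Notebook 07
command = build_nbconvert_command(
    Path("notebooks") / "07_model_testing.ipynb",
    "07_model_testing_executed",  # hypothetical output name
    Path("notebooks"),
)
print(" ".join(command))
```

With this approach the executed copy of the notebook is written to disk, so the cell outputs can still be reviewed afterwards.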


In [None]:
%run notebooks/07_model_testing.ipynb


**Save to Model Registry**
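
The registry step reads `model_data.json`, which is written during the training run. The cell that follows uses three keys from it: `model_name`, `calibration_method`, and `brier_loss`. As a rough sketch of the expected shape, the example below builds the same description string from a made-up dictionary. The values and the helper `build_description` are assumptions for illustration, and the real file may contain additional keys.

```python
import json

# Hypothetical values; only the three key names come from this notebook
example_model_data = {
    "model_name": "example_model",
    "calibration_method": "sigmoid",
    "brier_loss": 0.2,
}


def build_description(model_data):
    """Assemble the registry description from the training metadata."""
    return (
        f"{model_data['model_name']}"
        f", calibration_method: {model_data['calibration_method']}"
        f", brier_loss: {model_data['brier_loss']}"
    )


# Round-trip through JSON to mimic reading the metadata from disk
round_tripped = json.loads(json.dumps(example_model_data))
description = build_description(round_tripped)
print(description)
```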



In [None]:
# read in train_predictions to create model schema
train = pd.read_csv(DATA_PATH / "processed" / "train_predictions.csv")
target = train["TARGET"]
drop_columns = ["TARGET", "PredictionPct", "Prediction"]
train = train.drop(columns=drop_columns)

input_schema = Schema(train)
output_schema = Schema(target)
model_schema = ModelSchema(input_schema=input_schema, output_schema=output_schema)
model_schema.to_dict()

# read in model meta_data from training run
with open(MODELS_PATH / "model_data.json", "rb") as fp:
    model_data = json.load(fp)


# log back in to hopsworks.ai because hyperparameter tuning may take hours
project = hopsworks.login(api_key_value=HOPSWORKS_API_KEY)
mr = project.get_model_registry()

model = mr.sklearn.create_model(
    name=model_data["model_name"],
    # metrics = model_data['metrics'],
    description=(
        model_data["model_name"]
        + ", calibration_method: "
        + model_data["calibration_method"]
        + ", brier_loss: "
        + str(model_data["brier_loss"])
    ),
    model_schema=model_schema,
)
# build the path with pathlib so it works on both Windows and Linux
model.save(str(MODELS_PATH / "model.pkl"))