# End-To-End MLOps Pipeline With SageMaker (AWS)

Let's begin by setting up the environment.

In [1]:
%load_ext autoreload
%autoreload 2
%load_ext dotenv
%dotenv

import json
import sys
import os
from pathlib import Path
import boto3
import ipytest

Throughout this project we will have the ability to run our functions in local mode or not. By setting this to `True` we will run all of the pipelines locally (and not connect to AWS). Setting it to `False` will run the pipeline in SageMaker.

In [2]:
LOCAL_MODE = True

Now we are going to load the environment variables that we need to run this project.

In [3]:
import logging
# This stops SageMaker from reporting in this cell.
logging.getLogger('sagemaker.config').disabled = True

from sagemaker.workflow.pipeline_context import LocalPipelineSession, PipelineSession
import sagemaker

CODE_FOLDER = Path("code") 
CODE_FOLDER.mkdir(exist_ok=True) 
sys.path.extend([f"./{CODE_FOLDER}"])

DATA_FILEPATH = "penguins.csv" # Path to data

ipytest.autoconfig(raise_on_error=True) # Testing library

bucket = os.environ["BUCKET"]
role = os.environ["ROLE"]

COMET_API_KEY = os.environ.get("COMET_API_KEY", None)
COMET_PROJECT_NAME = os.environ.get("COMET_PROJECT_NAME", None)

Initialise our pipeline session and initialise required config variables.

In [4]:
pipeline_session = PipelineSession(default_bucket=bucket) if not LOCAL_MODE else LocalPipelineSession(default_bucket=bucket)
instance_type = "ml.m5.xlarge" if not LOCAL_MODE else "local"

config = {
        "session": pipeline_session,
        "instance_type": instance_type,
        "image": None,
    }

config["framework_version"] = "2.12"
config["py_version"] = "py310"

S3_LOCATION = f"s3://{bucket}/penguins"

sagemaker_session = sagemaker.session.Session()
sagemaker_client = boto3.client("sagemaker")
iam_client = boto3.client("iam")
region = boto3.Session().region_name

Windows Support for Local Mode is Experimental


# Splitting & Transforming The Data
Now we're going to start building a pipeline (which is just a chain of components). The first component will be for transforming our data.

*Note: If you haven't set up the project yet then you need to read [Configuring AWS & Local Environment for MLOps Pipelines Project](https://digitalredneck.co.uk/configuring-aws-local-environment-for-mlops-pipelines-project/) to get up to this point.* 

We've already uploaded our data to s3 so what this pipeline component is going to do is go to the s3 bucket and retrieve the data and then create a model out of that dataset. In order to do that there are a number of steps that we need to go through. We need to process the data, then we need to train the model, then we need to evaluate the model to make sure it is producing good results, and then we'll be in the position where we can register the model that we can then deploy for use in the future.

The pipeline is the structure that is going to combine all of these steps and execute them one after the other.

<img src="https://digitalredneck.co.uk/PipeLine_split_transform.jpg" style="width: 100%; height: auto;" />

This step is going to take the data from our s3 bucket and then split the data (training and testing sets) and transform the data (replacing missing values etc). The output of this component is going to be saved back in s3. Then the next step will be training and then evaluation etc.

We'll use the [Scikit-Learn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) for the transformations, and a [Processing Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-processing) with a [SKLearnProcessor](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-processor) to execute a processing script. For more information on this check out the [SageMaker Pipelines Overview](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-sdk.html) for an intro into the fundamental components of a SageMaker Pipeline.

## Step 1: The Processing Script
We need to write a script that will be used to split and transform the data. This Processing Step will create a SageMaker Processing Job in the background. It will then run the script and upload the output to AWS s3.

We will create a folder called "processing" and add it to the system path so that we can use it later.

In [5]:
(CODE_FOLDER / "processing").mkdir(parents=True, exist_ok=True)

The first line in the following cell i.e. `%%writefile {CODE_FOLDER}/processing/script.py` essentially writes the code written in the cell to a new file in a folder called `processing` without executing it. This means we're using Jupyter to write python files that will then be used in production (and not Jupyter itself).

In [6]:
%%writefile {CODE_FOLDER}/processing/script.py
# | filename: script.py
# | code-line-numbers: true

import os
import tarfile
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler


def preprocess(base_directory):
    """Load, split, and transform the data."""

    # Get all of the data in our data directory and receive a 
    # shuffled dataset that contains all of the data.
    df = _read_data_from_input_csv_files(base_directory)

    # This is where we create a SciKit-Learn pipeline to transform the dataset.

    # Create a ColumnTransformer for the target feature.
    # This will transform the categorical data in the target feature
    # using an ordinal encoder: so 0, 1, 2 etc for the different classes.
    target_transformer = ColumnTransformer(
        transformers=[("species", OrdinalEncoder(), [0])],
    )

    # This is applied to all of the numerical columns in the dataset.
    # We're going to impute missing values with the mean of all the other values.
    # We're also going to scale (normalise) all of the values.
    numeric_transformer = make_pipeline(
        SimpleImputer(strategy="mean"),
        StandardScaler(),
    )

    # This will be applied to all of the categorical columns in the dataset.
    # We're going to impute missing values with the most common of all the other values.
    # We're also going to one-hot encode the columns to create new features.
    categorical_transformer = make_pipeline(
        SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(),
    )

    # This is where we use the transformers we just created in a ColumnTransformer.
    features_transformer = ColumnTransformer(
        transformers=[
            (
                "numeric", # The name of this transformer
                numeric_transformer,
                make_column_selector(dtype_exclude="object"),
            ),
            (
                "categorical", # The name of this transformer 
                categorical_transformer, 
                ["island"]), # Only being applied to the island column.
                # We don't transform the "sex" column and as such SciKit Learn 
                # will drop it. By ignoring it it will be dropped.
                # The reason we're doing that is because "sex" doesn't have any 
                # predictive power in this dataset and as such we don't need it.
        ],
    )

    # Now that the data has been transformed we're now going to split it.
    # We need to split the data before transforming it.
    # 70% for training, 15% for validation, and 15% for test
    df_train, df_validation, df_test = _split_data(df)

    # This is going to help us when we get to monitoring our model
    # These are the baselines for our raw data.
    _save_train_baseline(base_directory, df_train)
    _save_test_baseline(base_directory, df_test)

    # This is where we now transform the data. This is how the transformers are applied.
    # This will be applied to all of the sets in our data i.e. train, validation, and test.
    # We "fit" the train set and from the info collected from that we will then transform
    # The validation and test set using the info collected by calling "fit_transform".
    
    # The reason we do this is in case some of the classes are missing in the validation 
    # or test sets. If they are we will know because we got that from "fit_transform"
    # By doing it this way, we can ensure that the encodings of the data are the same 
    # in all sets i.e. class A=0, class B=1, class C=2 etc. 
    y_train = target_transformer.fit_transform( # fit_transform 
        np.array(df_train.species.values).reshape(-1, 1),
    )
    y_validation = target_transformer.transform( # transform
        np.array(df_validation.species.values).reshape(-1, 1),
    )
    y_test = target_transformer.transform( # transform
        np.array(df_test.species.values).reshape(-1, 1),
    )

    # Now that we've got our "y" labels we can drop them from the three main datasets.
    # We are going to drop the target feature for the three sets.
    df_train = df_train.drop("species", axis=1)
    df_validation = df_validation.drop("species", axis=1)
    df_test = df_test.drop("species", axis=1)

    # This is where we transform the features in the dataset (minus the target column).
    X_train = features_transformer.fit_transform(df_train)  # noqa: N806
    X_validation = features_transformer.transform(df_validation)  # noqa: N806
    X_test = features_transformer.transform(df_test)  # noqa: N806

    # Once this is done our data is ready to be saved.
    _save_splits(
        base_directory,
        X_train,
        y_train,
        X_validation,
        y_validation,
        X_test,
        y_test,
    )

    # This is where we save the SciKit Learn transformation pipeline.
    # When we deploy this model in production we are going to be receiving raw data.
    # Before we can make a prediction on that raw data we need to transform the data
    # so that it works with the model (i.e. has had the same transformations made to it).
    # We need to transform that raw data using the exact same process as we have here.
    _save_model(base_directory, target_transformer, features_transformer)


def _read_data_from_input_csv_files(base_directory):
    """Read the data from the input CSV files.

    This function reads every CSV file available and
    concatenates them into a single dataframe.
    """

    # This script will grab data out of every csv file in the directory
    # This is because more data can be added to the project (in separate files)
    # And that new data will be included in the processing step.
    input_directory = Path(base_directory) / "input"
    files = list(input_directory.glob("*.csv"))

    # If there are no files then we are going to raise an error
    if len(files) == 0:
        message = f"The are no CSV files in {input_directory.as_posix()}/"
        raise ValueError(message)

    # If there are files then we are going to save them to a DataFrame
    # This DataFrame will contain the data from all of the files in the directory.
    raw_data = [pd.read_csv(file) for file in files]
    df = pd.concat(raw_data)

    # Shuffle and return the data
    return df.sample(frac=1, random_state=42)


def _split_data(df):
    """Split the data into train, validation, and test."""
    df_train, temp = train_test_split(df, test_size=0.3)
    df_validation, df_test = train_test_split(temp, test_size=0.5)

    return df_train, df_validation, df_test


def _save_train_baseline(base_directory, df_train):
    """Save the untransformed training data to disk.

    We will need the training data to compute a baseline to
    determine the quality of the data that the model receives
    when deployed.
    """
    baseline_path = Path(base_directory) / "train-baseline"
    baseline_path.mkdir(parents=True, exist_ok=True)

    df = df_train.copy().dropna()

    # To compute the data quality baseline, we don't need the
    # target variable, so we'll drop it from the dataframe.
    df = df.drop("species", axis=1)

    df.to_csv(baseline_path / "train-baseline.csv", header=True, index=False)


def _save_test_baseline(base_directory, df_test):
    """Save the untransformed test data to disk.

    We will need the test data to compute a baseline to
    determine the quality of the model predictions when deployed.
    """
    baseline_path = Path(base_directory) / "test-baseline"
    baseline_path.mkdir(parents=True, exist_ok=True)

    df = df_test.copy().dropna()

    # We'll use the test baseline to generate predictions later,
    # and we can't have a header line because the model won't be
    # able to make a prediction for it.
    df.to_csv(baseline_path / "test-baseline.csv", header=False, index=False)


def _save_splits(
    base_directory,
    X_train,  # noqa: N803
    y_train,
    X_validation,  # noqa: N803
    y_validation,
    X_test,  # noqa: N803
    y_test,
):
    """Save data splits to disk.

    This function concatenates the transformed features
    and the target variable, and saves each one of the split
    sets to disk.
    """
    train = np.concatenate((X_train, y_train), axis=1)
    validation = np.concatenate((X_validation, y_validation), axis=1)
    test = np.concatenate((X_test, y_test), axis=1)

    train_path = Path(base_directory) / "train"
    validation_path = Path(base_directory) / "validation"
    test_path = Path(base_directory) / "test"

    train_path.mkdir(parents=True, exist_ok=True)
    validation_path.mkdir(parents=True, exist_ok=True)
    test_path.mkdir(parents=True, exist_ok=True)

    pd.DataFrame(train).to_csv(train_path / "train.csv", header=False, index=False)
    pd.DataFrame(validation).to_csv(
        validation_path / "validation.csv",
        header=False,
        index=False,
    )
    pd.DataFrame(test).to_csv(test_path / "test.csv", header=False, index=False)


def _save_model(base_directory, target_transformer, features_transformer):
    """Save the Scikit-Learn transformation pipelines.

    This function creates a model.tar.gz file that
    contains the two transformation pipelines we built
    to transform the data.
    """
    with tempfile.TemporaryDirectory() as directory:
        joblib.dump(target_transformer, Path(directory) / "target.joblib")
        joblib.dump(features_transformer, Path(directory) / "features.joblib")

        model_path = Path(base_directory) / "model"
        model_path.mkdir(parents=True, exist_ok=True)

        with tarfile.open(f"{(model_path / 'model.tar.gz').as_posix()}", "w:gz") as tar:
            tar.add(Path(directory) / "target.joblib", arcname="target.joblib")
            tar.add(
                Path(directory) / "features.joblib", arcname="features.joblib",
            )


if __name__ == "__main__":
    preprocess(base_directory="/opt/ml/processing")

Overwriting code/processing/script.py


Let's test the script to ensure everything worked okay.

In [7]:
%%ipytest -s
# | code-fold: true

import os
import shutil
import tarfile
import tempfile

import pytest
from processing.script import preprocess
import pandas as pd


@pytest.fixture(autouse=False)
def directory():
    directory = tempfile.mkdtemp()
    input_directory = Path(directory) / "input"
    input_directory.mkdir(parents=True, exist_ok=True)
    shutil.copy2(DATA_FILEPATH, input_directory / "data.csv")

    directory = Path(directory)
    preprocess(base_directory=directory)

    yield directory

    shutil.rmtree(directory)


def test_preprocess_generates_data_splits(directory):
    output_directories = os.listdir(directory)

    assert "train" in output_directories
    assert "validation" in output_directories
    assert "test" in output_directories


def test_preprocess_generates_baselines(directory):
    output_directories = os.listdir(directory)

    assert "train-baseline" in output_directories
    assert "test-baseline" in output_directories


def test_preprocess_creates_two_models(directory):
    model_path = directory / "model"
    tar = tarfile.open(model_path / "model.tar.gz", "r:gz")

    assert "features.joblib" in tar.getnames()
    assert "target.joblib" in tar.getnames()


def test_splits_are_transformed(directory):
    train = pd.read_csv(directory / "train" / "train.csv", header=None)
    validation = pd.read_csv(directory / "validation" / "validation.csv", header=None)
    test = pd.read_csv(directory / "test" / "test.csv", header=None)

    # After transforming the data, the number of features should be 7:
    # * 3 - island (one-hot encoded)
    # * 1 - culmen_length_mm = 1
    # * 1 - culmen_depth_mm
    # * 1 - flipper_length_mm
    # * 1 - body_mass_g
    number_of_features = 7

    # The transformed splits should have an additional column for the target
    # variable.
    assert train.shape[1] == number_of_features + 1
    assert validation.shape[1] == number_of_features + 1
    assert test.shape[1] == number_of_features + 1


def test_train_baseline_is_not_transformed(directory):
    baseline = pd.read_csv(
        directory / "train-baseline" / "train-baseline.csv",
        header=None,
    )

    island = baseline.iloc[:, 0].unique()

    assert "Biscoe" in island
    assert "Torgersen" in island
    assert "Dream" in island


def test_test_baseline_is_not_transformed(directory):
    baseline = pd.read_csv(
        directory / "test-baseline" / "test-baseline.csv", header=None
    )

    island = baseline.iloc[:, 1].unique()

    assert "Biscoe" in island
    assert "Torgersen" in island
    assert "Dream" in island


def test_train_baseline_includes_header(directory):
    baseline = pd.read_csv(directory / "train-baseline" / "train-baseline.csv")
    assert baseline.columns[0] == "island"


def test_test_baseline_does_not_include_header(directory):
    baseline = pd.read_csv(directory / "test-baseline" / "test-baseline.csv")
    assert baseline.columns[0] != "island"

[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m[32m.[0m
[32m[32m[1m8 passed[0m[32m in 0.40s[0m[0m


## Step 2: Caching Config
SageMaker supports caching which means it will try to execute a previous run of the same step (if nothing has changed). For more info check out [Caching Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-caching.html).

Let's define a caching policy that we can use at every step.

In [8]:
from sagemaker.workflow.steps import CacheConfig

cache_config = CacheConfig(enable_caching=True, expire_after="15d")

## Step 3: Pipeline Config
We can make our pipeline more flexible by parameterising it. In our case we are going to pass through the location of the dataset. This means we can then switch out datasets by changing the value of this parameter. For more information visit [Pipeline Parameters](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-parameters.html).

In [9]:
from sagemaker.workflow.parameters import ParameterString
from sagemaker.workflow.pipeline_definition_config import PipelineDefinitionConfig

pipeline_definition_config = PipelineDefinitionConfig(use_custom_job_prefix=True)

dataset_location = ParameterString(
    name="dataset_location",
    default_value=f"{S3_LOCATION}/data",
)

## Step 4: Setting up the Processing Step

In [10]:
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(
    base_job_name="preprocess-data",
    framework_version="1.2-1",
    # By default, a new account doesn't have access to `ml.m5.xlarge` instances.
    # If you haven't requested a quota increase yet, you can use an
    # `ml.t3.medium` instance type instead. This will work out of the box, but
    # the Processing Job will take significantly longer than it should have.
    # To get access to `ml.m5.xlarge` instances, you can request a quota
    # increase under the Service Quotas section in your AWS account.
    instance_type=config["instance_type"],
    instance_count=1,
    role=role,
    sagemaker_session=config["session"],
)

Now let's define the Processing Step that we'll use in our pipeline.

We specify the list of inputs. In this instance it is the dataset that we stored in s3 (make sure you've completed that by visiting [Configuring AWS & Local Environment for MLOps Pipelines Project](https://digitalredneck.co.uk/configuring-aws-local-environment-for-mlops-pipelines-project/) to ensure you're properly set up to run this notebook project).

In [11]:
%%capture
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

preprocessing_step = ProcessingStep(
    name="preprocess-data",
    step_args=processor.run(
        code=f"{(CODE_FOLDER / 'processing' / 'script.py').as_posix()}",
        inputs=[
            ProcessingInput(
                source=dataset_location,
                destination="/opt/ml/processing/input",
            ),
        ],
        outputs=[
            ProcessingOutput(
                output_name="train",
                source="/opt/ml/processing/train",
                destination=f"{S3_LOCATION}/preprocessing/train",
            ),
            ProcessingOutput(
                output_name="validation",
                source="/opt/ml/processing/validation",
                destination=f"{S3_LOCATION}/preprocessing/validation",
            ),
            ProcessingOutput(
                output_name="test",
                source="/opt/ml/processing/test",
                destination=f"{S3_LOCATION}/preprocessing/test",
            ),
            ProcessingOutput(
                output_name="model",
                source="/opt/ml/processing/model",
                destination=f"{S3_LOCATION}/preprocessing/model",
            ),
            ProcessingOutput(
                output_name="train-baseline",
                source="/opt/ml/processing/train-baseline",
                destination=f"{S3_LOCATION}/preprocessing/train-baseline",
            ),
            ProcessingOutput(
                output_name="test-baseline",
                source="/opt/ml/processing/test-baseline",
                destination=f"{S3_LOCATION}/preprocessing/test-baseline",
            ),
        ],
    ),
    cache_config=cache_config,
)

# Training the Model
Now we're going to extend the pipeline that we just created with a [Training Step](https://docs.aws.amazon.com/sagemaker/latest/dg/build-and-manage-steps.html#step-type-training). We're going to be using TensorFlow for that so if you need to know more about that then visit the [TensorFlow docs in SageMaker](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/using_tf.html#train-a-model-with-tensorflow).

We're also going to implement tracking using [https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html](Amazon SageMaker Experiments) and [Comet ML](https://www.comet.com/site/).

To do this we're going to create a new folder in the project called `training` and this will house the script ()which we can then use later by importing it as a module.

In [12]:
(CODE_FOLDER / "training").mkdir(parents=True, exist_ok=True)

## Step 1: Creating the Training Script

In [13]:
%%writefile {CODE_FOLDER}/training/script.py
# | filename: script.py
# | code-line-numbers: true

import argparse
import json
import os
import tarfile

from pathlib import Path
from comet_ml import Experiment

import keras
import numpy as np
import pandas as pd
from keras import Input
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD
from packaging import version
from sklearn.metrics import accuracy_score


def train(
    model_directory,
    train_path,
    validation_path,
    pipeline_path,
    experiment,
    epochs=50,
    batch_size=32,
):
    print(f"Keras version: {keras.__version__}")

    X_train = pd.read_csv(Path(train_path) / "train.csv")
    y_train = X_train[X_train.columns[-1]]
    X_train = X_train.drop(X_train.columns[-1], axis=1)

    X_validation = pd.read_csv(Path(validation_path) / "validation.csv")
    y_validation = X_validation[X_validation.columns[-1]]
    X_validation = X_validation.drop(X_validation.columns[-1], axis=1)

    model = Sequential(
        [
            Input(shape=(X_train.shape[1],)),
            Dense(10, activation="relu"),
            Dense(8, activation="relu"),
            Dense(3, activation="softmax"),
        ]
    )

    model.compile(
        optimizer=SGD(learning_rate=0.01),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

    model.fit(
        X_train,
        y_train,
        validation_data=(X_validation, y_validation),
        epochs=epochs,
        batch_size=batch_size,
        verbose=2,
    )

    predictions = np.argmax(model.predict(X_validation), axis=-1)
    val_accuracy = accuracy_score(y_validation, predictions)
    print(f"Validation accuracy: {val_accuracy}")

    # Starting on version 3, Keras changed the model saving format.
    # Since we are running the training script using two different versions
    # of Keras, we need to check to see which version we are using and save
    # the model accordingly.
    model_filepath = (
        Path(model_directory) / "001"
        if version.parse(keras.__version__) < version.parse("3")
        else Path(model_directory) / "penguins.keras"
    )

    model.save(model_filepath)

    # Let's save the transformation pipelines inside the
    # model directory so they get bundled together.
    with tarfile.open(Path(pipeline_path) / "model.tar.gz", "r:gz") as tar:
        tar.extractall(model_directory)

    if experiment:
        experiment.log_parameters(
            {
                "epochs": epochs,
                "batch_size": batch_size,
            }
        )
        experiment.log_dataset_hash(X_train)
        experiment.log_confusion_matrix(
            y_validation.astype(int), predictions.astype(int)
        )
        experiment.log_model("penguins", model_filepath.as_posix())


if __name__ == "__main__":
    # Any hyperparameters provided by the training job are passed to
    # the entry point as script arguments.
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=50)
    parser.add_argument("--batch_size", type=int, default=32)
    args, _ = parser.parse_known_args()

    # Let's create a Comet experiment to log the metrics and parameters
    # of this training job.
    comet_api_key = os.environ.get("COMET_API_KEY", None)
    comet_project_name = os.environ.get("COMET_PROJECT_NAME", None)

    experiment = (
        Experiment(
            project_name=comet_project_name,
            api_key=comet_api_key,
            auto_metric_logging=True,
            auto_param_logging=True,
            log_code=True,
        )
        if comet_api_key and comet_project_name
        else None
    )

    training_env = json.loads(os.environ.get("SM_TRAINING_ENV", {}))
    job_name = training_env.get("job_name", None) if training_env else None

    # We want to use the SageMaker's training job name as the name
    # of the experiment so we can easily recognize it.
    if job_name and experiment:
        experiment.set_name(job_name)

    train(
        # This is the location where we need to save our model.
        # SageMaker will create a model.tar.gz file with anything
        # inside this directory when the training script finishes.
        model_directory=os.environ["SM_MODEL_DIR"],
        # SageMaker creates one channel for each one of the inputs
        # to the Training Step.
        train_path=os.environ["SM_CHANNEL_TRAIN"],
        validation_path=os.environ["SM_CHANNEL_VALIDATION"],
        pipeline_path=os.environ["SM_CHANNEL_PIPELINE"],
        experiment=experiment,
        epochs=args.epochs,
        batch_size=args.batch_size,
    )


Overwriting code/training/script.py


In [17]:
%%ipytest -s
#| code-fold: true

import os
import shutil
import pytest
import tempfile

from processing.script import preprocess
from training.script import train

@pytest.fixture(scope="function", autouse=False)
def directory():
    directory = tempfile.mkdtemp()
    input_directory = Path(directory) / "input"
    input_directory.mkdir(parents=True, exist_ok=True)
    shutil.copy2(DATA_FILEPATH, input_directory / "data.csv")
    
    directory = Path(directory)
    
    preprocess(base_directory=directory)
    train(
        model_directory=directory / "model",
        train_path=directory / "train", 
        validation_path=directory / "validation",
        pipeline_path=directory / "model",
        experiment=None,
        epochs=1
    )
    
    yield directory
    
    shutil.rmtree(directory)


def test_train_bundles_model_assets(directory):
    bundle = os.listdir(directory / "model")
    assert "001" in bundle
    
    assets = os.listdir(directory / "model" / "001")
    assert "saved_model.pb" in assets


def test_train_bundles_transformation_pipelines(directory):
    bundle = os.listdir(directory / "model")
    assert "target.joblib" in bundle
    assert "features.joblib" in bundle

Keras version: 2.14.0
8/8 - 0s - loss: 1.1912 - accuracy: 0.4226 - val_loss: 1.2092 - val_accuracy: 0.2941 - 445ms/epoch - 56ms/step
Validation accuracy: 0.29411764705882354
[32m.[0mKeras version: 2.14.0
8/8 - 1s - loss: 1.1659 - accuracy: 0.3473 - val_loss: 1.0872 - val_accuracy: 0.4314 - 519ms/epoch - 65ms/step
Validation accuracy: 0.43137254901960786
[32m.[0m
[32m[32m[1m2 passed[0m[32m in 2.36s[0m[0m


## Step 2: Creating the Training Step
Now we can create the Training Step

In [25]:
%%writefile {CODE_FOLDER}/training/requirements.txt
#| label: requirements.txt
#| filename: requirements.txt
#| code-line-numbers: false

comet_ml

Overwriting code/training/requirements.txt
