# Hopsworks and the Machine Learning Model Lifecycle


In this notebook we will look at how the Hopsworks platform can be used throughout the standard lifecycle of a machine learning model. A typical lifecycle could look something like this:


1. Domain Exploration
2. Data Collection and Data Exploration
3. Feature Engineering
4. Preparing Data for Model Training
5. Building a predictive model
6. Analyzing model performance
7. Deploying the model to production
8. Monitoring the performance of a model in production


Roughly speaking, steps 1-4 are data-centric and relate to the manipulation and storage of data and features, whereas steps 5-8 relate to the handling of models.

The Hopsworks platform can support your model lifecycle in most of these steps, but it is important to understand some of the underlying concepts in order to use the platform effectively. Below, you will learn more about these concepts and about the Hopsworks libraries you need in order to interface with them: [HSFS](https://github.com/logicalclocks/feature-store-api/blob/master/README.md), [HSML](https://pypi.org/project/hsml/), and [Maggy](https://github.com/logicalclocks/maggy).

For the data-centric part of the workflow, the most crucial component in the platform is the **feature store**, so we will first explain that concept as well as the related concepts of feature groups, training datasets, feature engineering and data validation.

In the model-centric part of this overview, we'll look into the ideas of the model registry and model deployment, as well as experimentation (hyperparameter optimization).

<!-- https://github.com/logicalclocks/telco_churn/blob/master/0_ml_lifecycle.ipynb -->

<!-- 
Nice references:
https://www.tecton.ai/blog/devops-ml-data/
https://feast.dev/blog/what-is-a-feature-store/
https://docs.hopsworks.ai/feature-store-api/2.5.8/quickstart/
-->

## Feature Store

The [Feature Store](https://www.hopsworks.ai/feature-store) is a data management system that allows data scientists and data engineers to efficiently collaborate on projects.

An organization might have a separate data pipeline for every model it trains. In many cases this results in duplicate work, as the pipelines typically share some preprocessing steps. A change to a shared preprocessing step must then be repeated in every pipeline, which means even more duplicate work and potential inconsistencies between pipelines.

Moreover, once a model has been deployed we need to make sure that online data is processed in the same way as the training data. The Feature Store streamlines the data pipeline creation process by putting all the feature engineering logic of a project in the same place. The built-in version control lets data scientists and data engineers work seamlessly with different versions of features and datasets.

Another advantage of having a feature store set up is that you can easily do "time travel" and recreate training data as it would have looked at some point in the past. This is very useful when, for example, benchmarking recommendation systems or assessing concept drift.
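To make the "time travel" idea concrete, here is a minimal sketch in plain Python. It does not use the `hsfs` API; the commit log, the feature name `avg_spend`, and the helper `as_of` are all hypothetical and only illustrate how a view of the data at a past point in time can be rebuilt from timestamped writes.

```python
from datetime import datetime

# Hypothetical commit log: each row records when a feature value was written.
commits = [
    {"customer_id": 1, "avg_spend": 10.0, "committed_at": datetime(2022, 1, 1)},
    {"customer_id": 2, "avg_spend": 55.0, "committed_at": datetime(2022, 1, 1)},
    {"customer_id": 1, "avg_spend": 12.5, "committed_at": datetime(2022, 2, 1)},
    {"customer_id": 1, "avg_spend": 20.0, "committed_at": datetime(2022, 3, 1)},
]


def as_of(commits, timestamp):
    """Return the latest value per customer committed at or before `timestamp`."""
    snapshot = {}
    # Replay the commits in chronological order; later writes overwrite earlier ones.
    for row in sorted(commits, key=lambda r: r["committed_at"]):
        if row["committed_at"] <= timestamp:
            snapshot[row["customer_id"]] = row["avg_spend"]
    return snapshot


february_view = as_of(commits, datetime(2022, 2, 15))
latest_view = as_of(commits, datetime(2022, 12, 31))
```

Here `february_view` ignores the March update for customer 1, which is the behaviour you want when recreating a training set as it looked in the past.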

In short, a feature store enables data scientists to reuse features across different experiments and tasks, and to recreate datasets from a given point in time. 

In Hopsworks, the `hsfs` library is the Python interface to the feature store.

### Feature Group

A Feature Group is a collection of conceptually related features that typically originate from the same data source. It is really up to the user (perhaps a data scientist) to decide which features should be grouped together into a feature group, so the process is subjective. Apart from conceptual relatedness or stemming from the same data source, another way to think about a feature group might be to consider features that are collected at the same rate (e.g. hourly, daily, monthly) to belong to the same feature group.

A feature group is stored in the feature store and can exist in several versions. New data can be freely appended to a feature group, as long as it conforms to the data validation rules the user has set up.


### Data Validation

The data might not conform to the data schema we have defined. It could happen that a numerical feature is given as a string, or that a feature which should be a positive number is given as a negative number. For example, we might decide that a transaction amount must be positive.

In Hopsworks you can define feature expectations, which ensure that the data adheres to a specified format. These expectations can be attached to a feature group either during the creation of the feature group, or after the feature group has been created.


To use the data validation functionality, the user needs to:
- Define an expectation using `Rule` (from the `hsfs` library), specifying the conditions that each example must fulfill
- Attach the expectation to a feature group
- Make sure that the feature group has enabled data validation

After this has been done, if the user tries to ingest new data that violates any of the expectations, the ingestion job will fail and the feature group will not be updated.
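The all-or-nothing behaviour described above can be sketched in plain Python. This is an illustration of the idea only, not the `hsfs` validation API: the `Expectation` class, the `violations` and `ingest` helpers, and the `amount` feature are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Expectation:
    """Hypothetical rule: `feature` must be numeric and greater than `lower_bound`."""
    feature: str
    lower_bound: float


def violations(rows, expectations):
    """Collect a message for every row that breaks an expectation."""
    problems = []
    for index, row in enumerate(rows):
        for exp in expectations:
            value = row.get(exp.feature)
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"row {index}: {exp.feature} is not numeric")
            elif value <= exp.lower_bound:
                problems.append(f"row {index}: {exp.feature} <= {exp.lower_bound}")
    return problems


def ingest(feature_group, new_rows, expectations):
    """Append `new_rows` only if the whole batch passes validation."""
    problems = violations(new_rows, expectations)
    if problems:
        raise ValueError("ingestion rejected: " + "; ".join(problems))
    feature_group.extend(new_rows)


transactions = [{"amount": 12.0}]
rules = [Expectation(feature="amount", lower_bound=0.0)]

ingest(transactions, [{"amount": 3.5}], rules)  # accepted, group now has 2 rows

try:
    # One negative amount and one amount given as a string: the batch is rejected.
    ingest(transactions, [{"amount": -1.0}, {"amount": "7"}], rules)
except ValueError as error:
    rejected = str(error)
```

Because the check runs before anything is appended, a failing batch leaves the feature group exactly as it was.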

### Feature Engineering

Typically it does not suffice to train a model using raw data sources alone. For instance, the data may contain problems such as missing values and duplicate entries that need to be dealt with. Moreover, model training can be made easier by preprocessing raw features, or by creating new features from the raw ones.
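As a small illustration of these three kinds of steps, the sketch below removes a duplicate entry, fills a missing value, and derives a new feature. The records and feature names are made up for the example, and median imputation is just one possible choice for handling missing values.

```python
from statistics import median

# Hypothetical raw records with a duplicate entry and a missing value.
raw = [
    {"customer_id": 1, "total_spend": 120.0, "n_orders": 4},
    {"customer_id": 1, "total_spend": 120.0, "n_orders": 4},  # duplicate
    {"customer_id": 2, "total_spend": None, "n_orders": 2},   # missing value
    {"customer_id": 3, "total_spend": 60.0, "n_orders": 3},
]

# 1. Drop duplicates, keeping the first record seen for each customer.
seen, unique = set(), []
for row in raw:
    if row["customer_id"] not in seen:
        seen.add(row["customer_id"])
        unique.append(dict(row))  # copy so the raw data stays untouched

# 2. Impute missing spend with the median of the observed values.
observed = [r["total_spend"] for r in unique if r["total_spend"] is not None]
fill_value = median(observed)
for row in unique:
    if row["total_spend"] is None:
        row["total_spend"] = fill_value

# 3. Derive a new feature from two raw ones.
for row in unique:
    row["avg_order_value"] = row["total_spend"] / row["n_orders"]
```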

Feature engineering can be considered a whole subfield by itself and there are general resources such as [Feature Engineering and Selection: A Practical Approach for Predictive Models](https://www.amazon.com/Feature-Engineering-Selection-Practical-Predictive-dp-1138079227/dp/1138079227/ref=as_li_ss_tl?_encoding=UTF8&me=&qid=1588630415&linkCode=sl1&tag=inspiredalgor-20&linkId=f3f8d9f56031a030893aad8fc684a800&language=en_US) and various Python libraries for feature engineering such as [tsfresh](https://tsfresh.readthedocs.io/en/latest/) for time series or [featuretools](https://www.featuretools.com/) which also handles relational data.

Engineered features are typically also stored in feature groups in the feature store.

### Preparing Data for Model Training

Hopsworks makes it easy to create datasets for model training.

In the Hopsworks framework, training datasets are immutable in the sense that, in contrast to feature groups, you cannot append data to them after they have been created. However, you *can* have different *versions* of the same training dataset.

Often, a training dataset will be created by running a query to join the feature groups of interest. This query is saved as metadata in the dataset, which makes it easy to see the dependencies between the dataset and the feature groups it originates from. The Hopsworks UI contains a dataset provenance graph that shows this.
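The relationship between feature groups, the join query, and immutable dataset versions can be sketched as follows. This is plain Python rather than the `hsfs` API; the two feature groups, the `create_training_dataset` helper, and the textual query description are hypothetical.

```python
from types import MappingProxyType

# Two hypothetical feature groups keyed by customer_id.
demographics = {1: {"age": 34}, 2: {"age": 51}, 3: {"age": 27}}
activity = {1: {"n_orders": 4}, 2: {"n_orders": 2}}


def create_training_dataset(version, left, right, description):
    """Inner-join two feature groups and freeze the result with its query."""
    rows = tuple(
        {"customer_id": key, **left[key], **right[key]}
        for key in sorted(left.keys() & right.keys())
    )
    # MappingProxyType gives a read-only view, so fields cannot be reassigned,
    # and the tuple of rows cannot be extended after creation.
    return MappingProxyType(
        {"version": version, "query": description, "rows": rows}
    )


query = "demographics INNER JOIN activity ON customer_id"
dataset_v1 = create_training_dataset(1, demographics, activity, query)

# New data arrives in a feature group. Version 1 is unaffected, and a new
# version is created instead of appending to the existing one.
activity[3] = {"n_orders": 3}
dataset_v2 = create_training_dataset(2, demographics, activity, query)
```

Storing the query alongside the rows is what makes it possible to trace a dataset back to the feature groups it was built from.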

<!-- Moreover, you can download the dataset in a format compatible with the framework you're working with, e.g. tfrecords, numpy, csv... -->

## Model handling

Here, we will briefly discuss the concepts of model registry and model deployment. Whereas the concepts above have related to the *data* - how to preprocess, store, and query it - we will now turn our attention towards the *model* part of the ML lifecycle.

### Model registry

The Hopsworks platform contains a model registry that keeps track of different ML models and their versions. These models can be scikit-learn, TensorFlow, PyTorch, or generic Python models. When you have trained a model that you want to keep around, you can register it with a single line of code.

Once a model (version) has been registered, it will be visible in the Hopsworks UI and you will be able to create a *deployment* (also called *serving*) based on it.

The model registry has functionality for retrieving the best performing model (based on some user-defined metric) from a collection of models. 
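Conceptually, retrieving the best model amounts to filtering the registered versions by name and comparing them on one metric. The sketch below shows that logic in plain Python; the registry entries and the `best_model` helper are hypothetical and are not the `hsml` API.

```python
# Hypothetical registry entries: one record per registered model version.
registry = [
    {"name": "churn", "version": 1, "metrics": {"accuracy": 0.81}},
    {"name": "churn", "version": 2, "metrics": {"accuracy": 0.86}},
    {"name": "churn", "version": 3, "metrics": {"accuracy": 0.84}},
    {"name": "fraud", "version": 1, "metrics": {"accuracy": 0.97}},
]


def best_model(registry, name, metric, direction="max"):
    """Return the version of `name` with the best value of `metric`."""
    if direction not in ("max", "min"):
        raise ValueError("direction must be 'max' or 'min'")
    # Only versions of the requested model that report the metric are compared.
    candidates = [
        m for m in registry if m["name"] == name and metric in m["metrics"]
    ]
    if not candidates:
        return None
    pick = max if direction == "max" else min
    return pick(candidates, key=lambda m: m["metrics"][metric])


best = best_model(registry, "churn", "accuracy", direction="max")
```

The `direction` argument matters because some metrics (accuracy) should be maximized while others (loss, error) should be minimized.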

The interface to the model registry is the `hsml` library.

### Model deployment

A model from the registry can easily be deployed to a prediction service. The prediction service can then be called either via the `hsml` API or by sending a POST request to a REST endpoint that is created automatically.
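A POST request to such an endpoint could be assembled as in the sketch below, which builds the request with the standard library but does not send it. The endpoint URL, the API key placeholder, the `ApiKey` authorization scheme, and the `{"instances": [...]}` payload shape (borrowed from the TensorFlow Serving REST convention) are all assumptions here; check your deployment's details in the Hopsworks UI for the actual values.

```python
import json
from urllib.request import Request

# Hypothetical endpoint and key; substitute the values for your deployment.
ENDPOINT = "https://hopsworks.example.com/models/churn:predict"
API_KEY = "<your-api-key>"


def build_prediction_request(feature_vectors):
    """Build (but do not send) a POST request carrying feature vectors."""
    # Assumed payload shape: one feature vector per entry under "instances".
    body = json.dumps({"instances": feature_vectors}).encode("utf-8")
    return Request(
        ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"ApiKey {API_KEY}",  # assumed auth scheme
        },
    )


request = build_prediction_request([[34, 4, 30.0]])
# Sending it would be: urllib.request.urlopen(request)
```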

The user can deploy the model via [KFServing/KServe](https://www.kubeflow.org/docs/external-add-ons/kserve/kserve/) or as a simple Docker container running Flask. The deployment can also be configured with respect to resource allocation.

### Experimentation

Hopsworks' [Maggy](https://maggy.ai/master/) library enables easy experimentation with model hyperparameter tuning. It implements several strategies for searching the hyperparameter space and returns the combination that gives the best value of the provided metric.
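To show what searching a parameter space means in practice, here is a minimal random search in plain Python. It is not Maggy's API: the objective `validation_loss` is a made-up stand-in for training and evaluating a model, and the search space (a learning rate and a number of layers) is hypothetical.

```python
import random


def validation_loss(learning_rate, n_layers):
    """Hypothetical stand-in for training a model and returning its loss."""
    return (learning_rate - 0.01) ** 2 * 1e4 + (n_layers - 3) ** 2


def random_search(objective, n_trials, seed=0):
    """Sample the search space at random and keep the lowest-loss trial."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    trials = []
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform in [1e-4, 1e-1]
            "n_layers": rng.randint(1, 6),               # integer in [1, 6]
        }
        trials.append({"params": params, "loss": objective(**params)})
    best = min(trials, key=lambda t: t["loss"])
    return best, trials


best_trial, all_trials = random_search(validation_loss, n_trials=20)
```

In a real experiment each trial is a full training run, which is why a library that can schedule trials and stop unpromising ones early is worth having.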