# Unified Model Documentation <a class="tocSkip">

A lightweight library to create self-contained executable models with a focus on compatibility, simplicity, and fast experimentation.

At its most basic, a trained model is just a function that takes some input and produces some output. This library builds on this assumption and lets you package the preprocessing and prediction logic together with all of your model artifacts into a single model archive file that you can easily share, distribute, and deploy. There is no need for configuration files, copying files around on the file system, or other manual tasks: you can define and save your model without leaving your Python notebook or script. With the unified model format, you build your model once and run it anywhere, which gives you great flexibility to deploy and serve your models in any environment. The format also makes it easy to combine models (e.g. via voting ensembles) and to evaluate them without having to know the underlying machine learning library.

**Key Aspects:**
* Build once, run anywhere.
* Works with any machine learning library that comes with a Python interface.
* Packages all your model logic, requirements, and artifacts into a single self-contained file.

This tutorial shows you everything you need to know in order to use this library for your machine learning models. 

**In this notebook:**

* Unified Model Basics: Define, Use, Save, Load, Deploy, ...
* Case Study - Text Classification: Unified Models in Action

_The library and this notebook are only tested with Python 3._

# Dependencies
In the first step, we will just install and import all dependencies required for this notebook.

In [None]:
# System libraries.
from __future__ import absolute_import, division, print_function
import os, logging, sys, inspect

# Third-party libraries
import pandas as pd
pd.set_option("display.max_rows", 120)
pd.set_option("display.max_columns", 120)

# Initialize tqdm to always use the notebook progress bar
import tqdm
tqdm.tqdm = tqdm.tqdm_notebook

# Enable logging
logging.basicConfig(format='[%(levelname)s] %(message)s', level=logging.INFO, stream=sys.stdout)

# Lab libraries
from lab_client import Environment

In [None]:
# Initialize environment
env = Environment() 

# Create experiment
exp = env.create_experiment('Unified Model Tutorial')
output_path = exp.output_path

# Unified Model - Basics
---

**Required Time:** ~10 minutes

## Define Model
Define your first unified model. The minimum requirement is that you override the `_predict` function. This function is expected to return predictions from the model based on the provided `data`.

In [None]:
from unified_model import UnifiedModel

# This basic model just returns the data that it gets in.
class MyEchoModel(UnifiedModel):
    def _predict(self, data, **kwargs):
        # Implement this function with the logic to make a prediction on the data item. 
        # In this example we just return the data itself.
        return data

## Initialize and Use Model
To use the model, you need to create a new instance and call the `predict` method with a data item.

In [None]:
echo_model = MyEchoModel()
echo_model.predict("This is an data example")

## Save Model
To make the model portable, you can save the model as a single file to a given path.

In [None]:
model_path = echo_model.save(os.path.join(output_path,'my_first_model.model.zip'))

**Note:** The saved model is a zip file that contains the Python pickle of your model and a few other artifacts in a specific structure (more details in ["A Look inside the Model"](#A-Look-inside-the-Model)). You can unzip the model with any unzipping tool. You can also compress the model via the `compress` parameter (disabled by default).
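
As a rough illustration of the idea behind this format (not the library's actual implementation), the following sketch writes a pickled object and a small JSON metadata file into one zip archive and reads them back using only the standard library. The `state` dictionary and the member name `data/model.pkl` are assumptions made for this sketch.

```python
import io
import json
import pickle
import zipfile

# Stand-in for a trained model: any picklable Python object.
state = {"weights": [0.1, 0.2, 0.3], "labels": ["a", "b"]}
info = {"description": "toy example"}

# Write the pickle and the metadata into a single in-memory zip archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
    archive.writestr("data/model.pkl", pickle.dumps(state))
    archive.writestr("info.json", json.dumps(info))

# Read both members back from the archive.
with zipfile.ZipFile(buffer, "r") as archive:
    member_names = sorted(archive.namelist())
    restored_state = pickle.loads(archive.read("data/model.pkl"))
    restored_info = json.loads(archive.read("info.json"))

print(member_names)
print(restored_state == state)
```

Passing `zipfile.ZIP_DEFLATED` instead of `zipfile.ZIP_STORED` would compress the archive members, which is roughly the trade-off a compression option controls: smaller files at the cost of extra work on save and load.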

## Load Model
You can load any unified model file via `UnifiedModel.load(model_path)`:

In [None]:
loaded_model = UnifiedModel.load(model_path)
loaded_model.predict("This is an data example")

**Note:** You can also load a unified model from a folder via `UnifiedModel.load(model_path)`, provided it has the right folder structure as explained in ["A Look inside the Model"](#A-Look-inside-the-Model).

## Model Metadata
Adding metadata to the model allows you to provide information on how the model was trained, which data was used, how the data is preprocessed, the type of the model, and more. You can save this additional metadata with the model by providing a dictionary at model initialization via the `info_dict` parameter.

In [None]:
model_info = {
    "description": "Returns the data that it gets in."
}

echo_model = MyEchoModel(info_dict=model_info)

Use the `info()` function to get this metadata about the model.

In [None]:
echo_model.info()

Every model also has a name, which you can change at model initialization via the `name` parameter.

In [None]:
print(str(echo_model))
echo_model = MyEchoModel(name="echo_model")
print(str(echo_model))

## Model Lifecycle
The Unified Model also provides lifecycle methods that can be overridden if additional processing is required for model save (`_save_model(output_path)`) and load (`_init_model()`). See the comments below for additional details:

In [None]:
class MyEchoModel(UnifiedModel):
    def _init_model(self):
        # Called after the model is unpickled. 
        # Overwrite this method if additional initialization is required.
        print("init model")
    
    def _save_model(self, output_path):
        # Called before the model is saved to the file system. 
        # Overwrite this method if additional processing is required before the model is saved.
        print("save model")
    
    def _predict(self, data, **kwargs):
        # Implement this function with the logic to make a prediction on the data item. 
        # In this example we just return the text itself.
        return data
    
echo_model = MyEchoModel()

# The save function will call _save_model() before saving the model to the filesystem, 
# and _init_model after it is saved to initialize the model again for direct usage.
print("Saving Model:")
model_path = echo_model.save(os.path.join(output_path,'my_first_model.model.zip'))

# The load function will only call _init_model() after unpickling
print("Loading Model:")
UnifiedModel.load(model_path).predict("This is an data example")

## Model Requirements
In most cases, you will have third-party dependencies in your model. You can add those dependencies to the model via `add_requirements()` or at model initialization with the `requirements` parameter.

To add dependencies, you can either add pip-installable libraries or imported modules. Imported modules are packaged with the unified model, including all source code of the given library/module, when the model is saved. If possible, we suggest adding dependencies as pip-installable libraries instead of imported modules.
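
To make the distinction between the two kinds of requirements concrete, here is a minimal sketch of how a mixed collection could be separated into pip requirement strings and module names. The `split_requirements` helper is hypothetical and only illustrates the idea; how the library itself resolves and packages requirements is not shown here.

```python
import inspect
import json


def split_requirements(requirements):
    """Split a mixed collection into pip requirement strings and module names."""
    pip_requirements, module_names = [], []
    for requirement in requirements:
        if isinstance(requirement, str):
            # Plain strings are treated as pip requirement specifiers.
            pip_requirements.append(requirement)
        else:
            # Resolve modules, functions, and classes to their defining module.
            if inspect.ismodule(requirement):
                module = requirement
            else:
                module = inspect.getmodule(requirement)
            module_names.append(module.__name__)
    return sorted(pip_requirements), sorted(module_names)


# json.dumps stands in for any imported function whose module should be packaged.
pip_requirements, module_names = split_requirements({"pandas>=0.23.0", json.dumps})
print(pip_requirements)
print(module_names)
```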

In [None]:
!pip install -q langdetect

In [None]:
from langdetect import detect

class LanguageDetectionModel(UnifiedModel):
    REQUIREMENTS = {
        # Add all (pip) dependencies or required modules here.
        "pandas",  # pip-installable dependency.
        # Version requirements for pip dependencies are supported as well, e.g. pandas>=0.23.0
        detect  # imported module.
        # Modules will be packaged with the unified model, including all related source code.
    }
    
    def __init__(self, **kwargs):
        super(LanguageDetectionModel, self).__init__(**kwargs)
        self.add_requirements(self.REQUIREMENTS)
            
    def _predict(self, data, **kwargs):
        # Call the language detection library.
        prediction = detect(str(data))
        # Put the prediction result into a dataframe.
        return pd.DataFrame([prediction], columns=["language"])

lang_detection_model = LanguageDetectionModel()
lang_detection_model_path = lang_detection_model.save(os.path.join(output_path,'language_detection_model.model.zip'))

# To install the requirements prior to model initialization, set install_requirements=True.
# This is disabled by default, but enabled when the model is loaded for serving.
UnifiedModel.load(lang_detection_model_path, install_requirements=True).predict("This is an data example")


<div class="alert alert-info">
Additionally, you can provide a script at model initialization via the <b>setup_script</b> parameter, which will be executed via /bin/sh prior to model initialization. However, we suggest using this option only if there is no other solution.
</div>

## Model Types


In addition to the generic `UnifiedModel` class, you can also derive your custom model from a variety of more specific model types. The goal of model types is to define common input/output data formats and usage parameters for a given machine learning task, such as image/text classification, regression, recommendation, object detection, sequence-to-sequence modeling, and any other task that can share common formats. Model types can therefore enforce or validate the input and output data of the predict function, provide default data transformations, add prediction parameters, and otherwise ensure a common usage for the given machine learning task. Here are a few examples of available model types:
* **RecommendationModel:** Input= any data; Output= dataframe with at least an item and a score column.
* **ClassificationModel:** Input= same as RecommendationModel; Output= same as RecommendationModel
* **TextClassificationModel:** Input= plain text (string); Output= same as ClassificationModel

You can see all currently provided model types with the following code. Additional model types will be added soon.

In [None]:
from unified_model import model_types
inspect.getmembers(model_types, inspect.isclass)
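
The following sketch illustrates the general idea of a model type that validates the output of `_predict`. It is not the library's code: `ValidatedModel`, `GoodModel`, and `BadModel` are hypothetical classes, and plain lists of dictionaries stand in for dataframe rows so that the snippet has no dependencies.

```python
class ValidatedModel:
    """Toy base class: predict() checks the format returned by _predict()."""

    REQUIRED_COLUMNS = ("item", "score")

    def predict(self, data, **kwargs):
        rows = self._predict(data, **kwargs)
        for row in rows:
            missing = [column for column in self.REQUIRED_COLUMNS if column not in row]
            if missing:
                raise ValueError("prediction row is missing columns: %s" % missing)
        return rows

    def _predict(self, data, **kwargs):
        raise NotImplementedError


class GoodModel(ValidatedModel):
    def _predict(self, data, **kwargs):
        return [{"item": str(data).upper(), "score": 1.0}]


class BadModel(ValidatedModel):
    def _predict(self, data, **kwargs):
        return [{"label": str(data)}]  # wrong column name, will be rejected


print(GoodModel().predict("spam"))
try:
    BadModel().predict("spam")
except ValueError as error:
    print("rejected:", error)
```

Putting the check into the shared `predict` method means that every subclass gets the same guarantee about its output format without repeating the validation.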

## Add Files to Model
You can add arbitrary files to the model that will be packaged within the model file on save. You can add files via `add_file(key, file_path)` and get a file (path to the file) by the key with `get_file(key)`.

In this example, we build a unified model that suggests similar words for a given input word. For this, we use pretrained word vectors (GloVe, saved in the word2vec format) that we download in the following section:

In [None]:
from gensim import downloader
pretrained_wv_path = os.path.join(output_path,'pretrained_word_embeddings.bin')
downloader.load("glove-wiki-gigaword-50").save_word2vec_format(pretrained_wv_path, binary=True)

We will now define a `WordSuggestionModel`. Here, we will use gensim to load the pretrained word vectors and predict similar words. The downloaded file (pretrained word vectors) will be added to the model via `add_file()` and requested again in `_init_model()` via `get_file()`. 

In [None]:
from gensim.models import KeyedVectors

class WordSuggestionModel(model_types.RecommendationModel):
    W2V_MODEL = "w2v_model.bin"
        
    REQUIREMENTS = {
        "gensim"
    }
    
    def __init__(self, wv_model_path, **kwargs):
        super(WordSuggestionModel, self).__init__(**kwargs)
        self.add_requirements(self.REQUIREMENTS)
        # Add the file with an identifier to the model. 
        # The file will be bundled into the model file on model save.
        self.add_file(self.W2V_MODEL,wv_model_path)
        # Call the model initialization
        self._init_model()
    
    def _init_model(self):
        # get the added file from the model and load it. 
        self.gensim_wv_model = KeyedVectors.load_word2vec_format(
            self.get_file(self.W2V_MODEL), binary=True)
    
    def _save_model(self, output_path):
        # Delete the word vectors instance, so that it is not included into the pickle.
        # The added file will be automatically packaged with the model.
        del self.gensim_wv_model
            
    def _predict(self, data, limit=5, **kwargs):
        most_similar = self.gensim_wv_model.similar_by_word(str(data), limit)
        return pd.DataFrame(most_similar, columns=["item", "score"])
    
word_suggestions_model = WordSuggestionModel(pretrained_wv_path)
word_suggestions_model_path = word_suggestions_model.save(
    os.path.join(output_path,'word_suggestion_model.model.zip'))

UnifiedModel.load(word_suggestions_model_path).predict("data")

**Note:** In the above example, we could also just pickle the gensim word vectors instance instead of storing and loading it as a file. However, there are many cases where pickle fails to serialize certain objects or where you would like to preserve the original file formats within the unified model package. In those cases, you have to use the add/get file functionality. To test whether a model can be successfully loaded in another Python environment, we provide test utilities as explained in the ["Test a Model"](#Test-a-Model) section.
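
As a small, library-independent illustration of such a case, the sketch below tries to pickle a dictionary that holds a thread lock, which Python's `pickle` module cannot serialize. The workaround mirrors the pattern used above: drop the problematic object before serializing and recreate it afterwards. The `resources` dictionary is an assumption made for this example.

```python
import pickle
import threading

resources = {"vocabulary": ["a", "b"], "lock": threading.Lock()}

# Pickling fails because a lock wraps operating-system state.
try:
    pickle.dumps(resources)
    pickle_failed = False
except TypeError as error:
    pickle_failed = True
    print("pickle failed:", error)

# Workaround: serialize only the picklable part ...
picklable_part = {key: value for key, value in resources.items() if key != "lock"}
restored = pickle.loads(pickle.dumps(picklable_part))

# ... and recreate the rest after loading, analogous to what _init_model() does.
restored["lock"] = threading.Lock()
print(sorted(restored))
```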

## A Look inside the Model

The model file is a zip file that contains the Python pickle of your model and a few other artifacts in a specific structure. The model file is a valid Python Zip Application as defined in [PEP 441](https://www.python.org/dev/peps/pep-0441) and can therefore be executed directly. The unified model has the following structure and artifacts, which are automatically created during `save()`:
* **info.json**                  _-> Export of the model metadata (info dictionary)._
* **\_\_main\_\_.py**                _-> Starts the CLI interface for serving and prediction of the model._
* **data/**                      _-> All the data required for the initialization of the model._
  * **unified_model.pkl**        _-> Pickle file of the model instance that contains all the model's data processing and prediction logic._
* **code/**                      _-> All modules and libraries in this directory are added to the python path prior to model initialization. Added modules will be stored here._
* **requirements.txt**           _-> Optional: List of requirements installed via pip prior to model initialization_
* **setup.sh**                   _-> Optional: Setup script that is executed prior to model initialization_

Use the following code to look inside the `WordSuggestionModel`:

In [None]:
import zipfile
zipfile.ZipFile(word_suggestions_model_path, 'r').namelist()
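
To illustrate what being a Python Zip Application means in practice, the following standalone sketch builds a tiny archive with the standard library's `zipapp` module and runs it with the Python interpreter. The archive content is a toy `__main__.py` and has nothing to do with the model's real CLI.

```python
import os
import subprocess
import sys
import tempfile
import zipapp

with tempfile.TemporaryDirectory() as workdir:
    # A zip application only needs a __main__.py at the archive root.
    source_dir = os.path.join(workdir, "app")
    os.makedirs(source_dir)
    with open(os.path.join(source_dir, "__main__.py"), "w") as main_file:
        main_file.write('print("hello from inside the archive")\n')

    archive_path = os.path.join(workdir, "app.zip")
    zipapp.create_archive(source_dir, archive_path)

    # The interpreter executes __main__.py when it is given the archive path.
    completed = subprocess.run(
        [sys.executable, archive_path],
        stdout=subprocess.PIPE, universal_newlines=True, check=True)
    output = completed.stdout.strip()

print(output)
```

This is the same mechanism that allows a model file to be started with `python <model file>`.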

## Predefined Models
For many common machine learning libraries, we will provide predefined model wrappers for different model types so that you are not always required to define your own custom models.

* [FastText](https://github.com/facebookresearch/fastText):
    * **FasttextClassifier** (TextClassificationModel): Can be initialized with any fasttext classification model file via the `ft_classifier_path` parameter.
* [Sklearn](https://github.com/scikit-learn/scikit-learn):
    * **SklearnTextClassifier** (TextClassificationModel): Can be initialized with any sklearn classifier via the `sklearn_classifier` parameter.

Additional predefined model wrappers will be added soon. You can see some predefined models in action in the ["Case Study - Text Classification"](#Case-Study---Text-Classification) section.

To add your own data preprocessing logic to a predefined model, you can provide a transformation method via the `transform_func` parameter during model initialization. The transformation method should have the following format: 
``` python
def transform(data, model=None, **kwargs):
    # Apply your preprocessing on the data
    return data
```
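
The sketch below shows how such a transformation function could be applied before prediction. `TransformingEcho` is a hypothetical stand-in for a predefined model and only demonstrates the calling convention described above; the predefined wrappers may apply the function differently.

```python
def preprocess(data, model=None, **kwargs):
    # Replace newlines with spaces, drop carriage returns, trim the ends.
    return data.replace("\n", " ").replace("\r", "").strip()


class TransformingEcho:
    """Toy stand-in for a predefined model that accepts transform_func."""

    def __init__(self, transform_func=None):
        self.transform_func = transform_func

    def predict(self, data, **kwargs):
        # Apply the user-provided transformation before the actual prediction.
        if self.transform_func is not None:
            data = self.transform_func(data, model=self, **kwargs)
        return data


model = TransformingEcho(transform_func=preprocess)
result = model.predict("  first line\nsecond line\r\n")
print(repr(result))
```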

## Serve & Deploy Model
The unified model format gives you great flexibility to deploy and serve your models in any environment. In this section, we serve the model through a REST API, first via the CLI and then via a Docker image.

### Serve via CLI

``` bash
python {word_suggestions_model_path} serve --port 8060
```

### Serve via Docker

``` bash
docker run -d -p 8091:8091 -v {word_suggestions_model_path}:/default_model mltooling/unified-model-service:latest
```

Take a look at the ["Compatibility & Deployment"](#Compatiblity-&amp;-Deployment) section for more deployment options.

## Test a Model
Many factors, especially related to pickling and requirements, can prevent a model from loading successfully in another Python environment. Therefore, we provide the `test_unified_model(model_instance, test_data_item=None)` utility, which helps you test whether your model instance can be loaded successfully in another Python environment. This utility saves the model instance, loads the model file in another Python process, and (optionally) calls `predict()` with the provided test data.

In [None]:
from unified_model.evaluation_utils import test_unified_model
test_unified_model(word_suggestions_model,"data")
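
The general technique behind such a check can be sketched with the standard library alone: write an artifact to disk, start a fresh interpreter, and let it load the artifact. This is a simplified illustration under the assumption that a plain pickled dictionary stands in for the model file; it is not how `test_unified_model` is implemented.

```python
import os
import pickle
import subprocess
import sys
import tempfile

# Stand-in for a saved model: a picklable object written to disk.
artifact = {"labels": ["ham", "spam"], "threshold": 0.5}

# Code executed by the fresh interpreter: load the file and report what it sees.
loader_code = (
    "import pickle, sys\n"
    "with open(sys.argv[1], 'rb') as f:\n"
    "    loaded = pickle.load(f)\n"
    "print(','.join(loaded['labels']))\n"
)

with tempfile.TemporaryDirectory() as workdir:
    artifact_path = os.path.join(workdir, "artifact.pkl")
    with open(artifact_path, "wb") as artifact_file:
        pickle.dump(artifact, artifact_file)

    # check=True raises an error if loading fails in the child process.
    completed = subprocess.run(
        [sys.executable, "-c", loader_code, artifact_path],
        stdout=subprocess.PIPE, universal_newlines=True, check=True)
    child_output = completed.stdout.strip()

print(child_output)
```

Running the load in a separate process matters because the current process already has all classes and modules imported, which can hide missing dependencies.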

# Case Study - Text Classification
---

In the following case study, we will train a fasttext and a sklearn text classification model on the [20-newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset, and wrap the trained models into unified models. In addition, we will combine both models into a voting ensemble and evaluate all models extensively.

**Required Time:** ~5 minutes

## Load Dataset
In the first step, we load the [20-newsgroups](http://qwone.com/~jason/20Newsgroups/) dataset and split it into train, validation, and test sets.

In [None]:
import numpy as np
from gensim import downloader

# Load newsgroups dataset: Collection of approximately 20,000 newsgroup posts, partitioned (nearly) evenly across 20 different newsgroups.
DATASET_NAME = "20-newsgroups"
df = pd.DataFrame([data for data in downloader.load(DATASET_NAME)])

# Split this dataset into train (60%), validate (20%), and test (20%)
train_df, validate_df, test_df = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])
train_df.head()
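
Note that the positions passed to `np.split` are cumulative cut points (60% and 80% of the rows), not the sizes of the individual parts. The dependency-free sketch below reproduces the same logic on a list of row positions. The `split_indices` helper and the fixed `seed` are assumptions of this example; the cell above shuffles without a fixed seed, so its split differs between runs.

```python
import random


def split_indices(n_items, fractions=(0.6, 0.8), seed=0):
    """Shuffle positions 0..n_items-1 and cut them at the cumulative fractions."""
    positions = list(range(n_items))
    random.Random(seed).shuffle(positions)
    first_cut = int(fractions[0] * n_items)
    second_cut = int(fractions[1] * n_items)
    return positions[:first_cut], positions[first_cut:second_cut], positions[second_cut:]


train_idx, validate_idx, test_idx = split_indices(10)
print(len(train_idx), len(validate_idx), len(test_idx))
```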

## Sklearn Classifier

### Train Model
For the sklearn classifier, we use TF-IDF to vectorize the text data and a linear SVM as the classifier.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.calibration import CalibratedClassifierCV

# Preprocessing logic
def preprocess(data, **kwargs):
    return data.replace('\n', ' ').replace('\r', '').strip()

# Training logic
def train(config):
    global sklearn_classifier
    
    # Experiment Implementation
    def default_analyzer(x):
        return x

    classification_pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer=default_analyzer,min_df=config['min_df'])),
        ("lsvc_calib", CalibratedClassifierCV(LinearSVC(verbose=0),method="isotonic", cv=3))])
    
    sklearn_classifier = classification_pipeline.fit(
        [preprocess(item).split() for item in train_df["data"].tolist()], 
        train_df["topic"].tolist()
    )
    
    score = sklearn_classifier.score(
        [preprocess(item).split() for item in validate_df["data"].tolist()],
        validate_df["topic"].tolist()
    )
    
    print("Model trained. Score: "+str(score))

# RUN 
# Define default parameter configuration
config = {
    'min_df':2
}

# Run model training
train(config)

### Option 1: Define Unified Model
Define the Unified Model for the sklearn classifier. Since we want to provide a text classification model, we can use the `TextClassificationModel` model type.

In [None]:
from unified_model.model_types import TextClassificationModel

class SklearnTextClassifier(TextClassificationModel):
    
    REQUIREMENTS = [
        "scikit-learn"
    ]

    def __init__(self, sklearn_classifier, **kwargs):
        super(SklearnTextClassifier, self).__init__(**kwargs)
        self.add_requirements(self.REQUIREMENTS)
        self.sklearn_classifier = sklearn_classifier

    def _predict(self, data, limit=None, **kwargs):
        if limit is None:
            limit = len(self.sklearn_classifier.classes_)

        # Text needs to be tokenized
        if isinstance(data, (list, np.ndarray)):
            # Input is already tokenized.
            tokenized_text = data
        else:
            # Also put preprocessing logic for the data directly into the model,
            # then split the cleaned text by whitespace.
            tokenized_text = preprocess(str(data)).split()
            
        result = []
        class_confidences = self.sklearn_classifier.predict_proba([tokenized_text])
        for index in np.argsort(class_confidences)[:, :-limit - 1:-1][0]:
            item = self.sklearn_classifier.classes_[index]
            score = class_confidences[0][index]
            result.append([str(item), score])

        # The TextClassificationModel type expects a dataframe as prediction result.
        # The resulting dataframe needs to have at least an item and a score column.
        return pd.DataFrame(result, columns=["item", "score"])

sklearn_model = SklearnTextClassifier(sklearn_classifier)
sklearn_model.predict("this is a data item")
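
The expression `np.argsort(class_confidences)[:, :-limit - 1:-1][0]` in `_predict` selects the indices of the `limit` highest confidences in descending order. The pure-Python sketch below shows the same reverse-slice trick on a single row; the class names and confidence values are made up for illustration.

```python
classes = ["sci.space", "rec.autos", "talk.politics", "comp.graphics"]
confidences = [0.10, 0.55, 0.05, 0.30]
limit = 2

# Indices sorted by ascending confidence (what argsort returns for one row).
ascending = sorted(range(len(confidences)), key=lambda index: confidences[index])

# Reverse slice: walk backwards from the end and stop after `limit` entries.
top_indices = ascending[:-limit - 1:-1]

top_predictions = [(classes[index], confidences[index]) for index in top_indices]
print(top_predictions)
```

When `limit` equals the number of classes, the slice simply returns all indices in reversed order, which is why the default of using all classes works without a special case.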

### Option 2: Use Predefined Model Wrapper
Instead of defining your custom model, you can also just use the predefined `SklearnTextClassifier` and provide the preprocessing logic via the `transform_func` parameter.

In [None]:
from unified_model.predefined_models.sklearn_models import SklearnTextClassifier
sklearn_model = SklearnTextClassifier(sklearn_classifier, transform_func=preprocess)
sklearn_model.predict("this is a data item")

## Fasttext Classifier

### Train Model

In [None]:
from pyfasttext import FastText
import multiprocessing

# Preprocessing logic
def preprocess(data, **kwargs):
    return data.replace('\n', ' ').replace('\r', '').strip()

# Create training data for fasttext
train_data_path = os.path.join(output_path,'data.train.txt')
test_data_path =  os.path.join(output_path,'data.test.txt')
with open(train_data_path, 'w') as f:
    for index, row in train_df.iterrows():
        f.write("__label__" + row["topic"].strip() + ' ' +  preprocess(row["data"]) + '\n')
                
with open(test_data_path, 'w') as f:
    for index, row in validate_df.iterrows():
        f.write("__label__" + row["topic"].strip() + ' ' +  preprocess(row["data"]) + '\n')

# Training logic
def train(config):
    global fasttext_classifier
    global fasttext_classifier_path
    
    fasttext_classifier_path = os.path.join(output_path, DATASET_NAME+"_classifier_ft.model")
    fasttext_classifier = FastText()
    fasttext_classifier.supervised(
        input=train_data_path, 
        output=fasttext_classifier_path, 
        thread=multiprocessing.cpu_count(),
        pretrainedVectors='',
        wordNgrams=config['word_ngrams'],
        epoch=config['epochs'],
        minCount=config['min_count'],
        ws=config['window_size'],
        dim=config['vector_dim'],
        lr=config['learning_rate'],
        lrUpdateRate=config['lr_update_rate'],
        neg=config['negativ_sampling'],
        t=config['sampling'],
        bucket=0)
    
    # .bin is added automatically by fasttext
    fasttext_classifier_path = fasttext_classifier_path+".bin" 
    
    # fasttext_classifier.test(test_data_path) -> print test scores out on console
    print("Model trained.")

# RUN 
# Define default parameter configuration
config = {
    'min_count':2,
    'window_size':5,
    'word_ngrams':1,
    'vector_dim':100,
    'learning_rate':0.1,
    'lr_update_rate':100,
    'negativ_sampling':5,
    'sampling':0.0001,
    'epochs':50
}

# Run model training
train(config)
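
The training files written above follow fastText's supervised input format: one example per line, with the label marked by the `__label__` prefix. This is also why the preprocessing removes newlines from the text. The helpers `to_fasttext_line` and `parse_fasttext_line` below are hypothetical and only illustrate the format.

```python
def to_fasttext_line(label, text):
    """Format one training example as '__label__<label> <text>' on a single line."""
    cleaned = text.replace("\n", " ").replace("\r", "").strip()
    return "__label__" + label.strip() + " " + cleaned + "\n"


def parse_fasttext_line(line):
    """Split a formatted line back into (label, text)."""
    prefixed_label, _, text = line.rstrip("\n").partition(" ")
    return prefixed_label[len("__label__"):], text


line = to_fasttext_line("sci.space ", "Launch window\nopens tomorrow")
print(repr(line))
print(parse_fasttext_line(line))
```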

### Option 1: Define Unified Model
Define the Unified Model for the fasttext classifier. Since we want to provide a text classification model, we can use the `TextClassificationModel` model type. Since pickle is not able to serialize the fasttext model instance, we have to add and load the fasttext model from the original fasttext binary file.

In [None]:
from unified_model.model_types import TextClassificationModel

class FasttextClassifier(TextClassificationModel):
    FT_MODEL_KEY = "ft_classifier.model"
    
    REQUIREMENTS = [
        "pyfasttext"
    ]

    def __init__(self, ft_classifier_path, **kwargs):
        super(FasttextClassifier, self).__init__(**kwargs)
        self.add_requirements(FasttextClassifier.REQUIREMENTS)

        self.add_file(FasttextClassifier.FT_MODEL_KEY, ft_classifier_path)
        self._init_model()

    def _init_model(self):
        self.model_instance = FastText(self.get_file(FasttextClassifier.FT_MODEL_KEY))

    def _save_model(self, output_path):
        del self.model_instance
        
    def _predict(self, data, limit=None, **kwargs):
        if limit is None:
            limit = self.model_instance.nlabels

        # Also put preprocessing logic for the data directly into the model.
        preprocessed_text = preprocess(str(data))

        result = self.model_instance.predict_proba_single(preprocessed_text, k=limit)

        # The TextClassificationModel type expects a dataframe as prediction result.
        # The resulting dataframe needs to have at least an item and a score column.
        return pd.DataFrame(result, columns=["item", "score"])

fasttext_model = FasttextClassifier(fasttext_classifier_path)
fasttext_model.predict("this is a data item")

### Option 2: Use Predefined Model Wrapper
Instead of defining your custom model, you can also just use the predefined `FasttextClassifier` and provide the preprocessing logic via the `transform_func` parameter.

In [None]:
from unified_model.predefined_models.fasttext_models import FasttextClassifier
fasttext_model = FasttextClassifier(fasttext_classifier_path, transform_func=preprocess)
fasttext_model.predict("this is a data item")

## Combine Models - Voting Ensemble
This library also provides a simple way to build voting ensembles from a collection of classification models via the `VotingEnsemble` class. Various voting strategies can be applied, such as `relative_score`, `rank_averaging`, `rank_vote`, `one_vote`, `total_score`, and `highest_scores`.

In [None]:
from unified_model.ensemble_utils import VotingEnsemble

voting_ensemble = VotingEnsemble(models=[sklearn_model,
                                        fasttext_model], strategy="relative_score")
voting_ensemble.predict("this is a data item")
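
The exact definitions of the voting strategies are not spelled out in this notebook. To give an intuition for how predictions from several models can be merged, the sketch below assumes one simple interpretation of a score-summing strategy. The `vote_by_total_score` helper and its tuple-based input are hypothetical and may differ from what `VotingEnsemble` actually computes.

```python
from collections import defaultdict


def vote_by_total_score(predictions_per_model, limit=None):
    """Sum the scores each label receives across models and rank by the total."""
    totals = defaultdict(float)
    for predictions in predictions_per_model:
        for item, score in predictions:
            totals[item] += score
    ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:limit]


# Made-up (item, score) predictions of two models for the same input.
model_a = [("sci.space", 0.75), ("rec.autos", 0.25)]
model_b = [("rec.autos", 0.5), ("sci.space", 0.25), ("comp.graphics", 0.25)]

combined = vote_by_total_score([model_a, model_b], limit=2)
print(combined)
```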

Like any unified model, a `VotingEnsemble` can be saved to a single file that includes all models combined in the ensemble.

In [None]:
voting_ensemble_path = voting_ensemble.save(os.path.join(output_path,"voting_ensemble_model.model.zip"))
voting_ensemble = UnifiedModel.load(voting_ensemble_path)
voting_ensemble.predict("this is a data item")

## Evaluate Models
After training our models, we want to have a closer look into how those models actually perform.

### Single Model

Every unified model provides an `evaluate(test_data, target_predictions)` method. It calculates common metrics for the given model type. For example, for text classification it returns micro/macro precision, recall, and F1 score.

In [None]:
# Select a model
selected_model = fasttext_model

# Evaluate with test data
print("Evaluating "+ str(selected_model))
metrics, label_scores = selected_model.evaluate(
    test_df['data'].tolist(), test_df['topic'].tolist(), per_label=True)

# model metrics
metrics
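
For readers unfamiliar with the difference between micro and macro averaging, here is a small self-contained sketch that computes both variants of the F1 score from true and predicted labels. It only illustrates the standard definitions on made-up labels; the function names are assumptions and the library's `evaluate` method may compute its metrics differently.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_macro_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    counts = {label: [0, 0, 0] for label in labels}  # tp, fp, fn per label
    for truth, prediction in zip(y_true, y_pred):
        if truth == prediction:
            counts[truth][0] += 1
        else:
            counts[prediction][1] += 1
            counts[truth][2] += 1
    # Macro: average the per-label F1 scores, so every label weighs the same.
    macro = sum(precision_recall_f1(*counts[label])[2] for label in labels) / len(labels)
    # Micro: pool the counts over all labels first, then compute one F1 score.
    pooled = [sum(counts[label][i] for label in labels) for i in range(3)]
    micro = precision_recall_f1(*pooled)[2]
    return micro, macro


y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
micro_f1, macro_f1 = micro_macro_f1(y_true, y_pred)
print(round(micro_f1, 3), round(macro_f1, 3))
```

Macro averaging is more sensitive to poor performance on rare labels, while micro averaging is dominated by the frequent ones.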

You can also get evaluation statistics per label (precision, recall, f1):

In [None]:
label_scores.style.background_gradient(cmap='BuGn', low=0.1, high=0.8, axis=0)

### Compare multiple models
With the `compare_models` method you can evaluate and compare a collection of unified models on the same test dataset.

In [None]:
from unified_model import evaluation_utils

evaluation_utils.compare_models(
    [sklearn_model, fasttext_model, voting_ensemble], 
    test_df['data'].tolist(), test_df['topic'].tolist(), styled=True)

# Next Steps

- [Lab Client Tutorial](./tutorials/lab-client-tutorial.ipynb): Learn how to connect to ML Lab and run your experiments.
- [Experiment Template](../templates/experiment-template.ipynb): Start your own high-quality reusable experiment notebook with this template.
- [Introduction to Pandas](./pandas-tutorial.ipynb): Introduction to the data structures and functionalities of the pandas library.
- [Introduction to Numpy](./numpy-tutorial.ipynb): Introduction to datatypes, arrays, and mathematical operations of the numpy library.