# Hands-on Workshop - *Predicting product ratings from reviews* - Using MLServer and V2 Inference Protocol

This workshop is focused on the deployment of a machine learning model for predicting product ratings from reviews.

We will use a fine-tuned [DistilBERT Hugging Face transformer model](https://huggingface.co/docs/transformers/main/en/model_doc/distilbert), which is a "small, fast, cheap and light Transformer model trained by distilling BERT base". For the sake of time, we will use a pretrained model rather than training the model in the workshop. The model is stored in the ```kelly-seldon``` Google Storage bucket at the path ```nlp-ratings/model/1```.

We will deploy our trained model through the Seldon Deploy UI and view our running deployment. We will deploy our model using MLServer, which is an open source inference server for ML models. MLServer supports the [standard V2 Inference Protocol](https://docs.seldon.io/projects/seldon-core/en/latest/reference/apis/v2-protocol.html) over both gRPC and REST; the protocol has been standardised and adopted by various model serving frameworks. Full MLServer docs can be found [here](https://mlserver.readthedocs.io/en/latest/).

-----------------------------------

## Deploying the Model using MLServer and V2 Inference Protocol

The reviews go through several preprocessing steps before being passed to the model. There are two ways to account for this:

1. **Custom Model:** Incorporate the pre-processing directly in the predict method of a custom model. This provides simplicity when creating the deployment as there is only a single code base to worry about and a single component to be deployed.
2. **Input Transformer:** Make use of a separate container to perform all of the input transformation and then pass the vectors to the model for prediction. The schematic below outlines how this would work.

```
            ________________________________________
            |            SeldonDeployment          |
            |                                      |
Request -->  Input transformer   -->     Model    -->  Response
            |  (Pre-processing)                    |
            |                                      |
            ________________________________________
```
         
The use of an input transformer allows us to separate the pre-processing logic from the prediction logic. This means that each of the components can be upgraded independently of the other. However, it makes the resulting deployment more complex, and complicates how it interacts with advanced monitoring components such as outlier and drift detectors.

This workshop will focus on the generation of a **custom model for this case**.

-----


## Set up 

We can now define our Seldon custom model. The component parts required to build the custom model are outlined below. Each of the files plays a key part in building the eventual Docker image.

---

### ratings.py


This is the critical file as it contains the logic associated with the deployment wrapped as part of a class.





```
import numpy as np
import datasets
from transformers import AutoTokenizer, DefaultDataCollator, TFAutoModelForSequenceClassification
import string
import nltk
from nltk.stem import WordNetLemmatizer

from mlserver import MLModel
from mlserver.utils import get_model_uri
from mlserver.types import (
    InferenceRequest,
    InferenceResponse
)
from mlserver.codecs import NumpyRequestCodec, PandasCodec
from mlserver.logging import logger


class ReviewRatings(MLModel):
    async def load(self) -> bool:
        model_uri = await get_model_uri(
            self.settings
        )

        logger.info("Loading model")
        self._model = TFAutoModelForSequenceClassification.from_pretrained(model_uri, num_labels=9)
        logger.info("Model successfully loaded")

        self._wordnet_lemmatizer = WordNetLemmatizer()

        logger.info("Loading tokenizer and data collator")
        self._tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
        self._data_collator = DefaultDataCollator(return_tensors="tf")

        nltk.download("stopwords", download_dir="./nltk")
        nltk.download("wordnet", download_dir="./nltk")
        nltk.download("omw-1.4", download_dir="./nltk")
        nltk.data.path.append("./nltk")
        # Stop words present in the library
        self.stopwords = nltk.corpus.stopwords.words('english')

        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        review_df = self.decode_request(payload, default_codec=PandasCodec)
        pred_proc = self.process_whole(review_df)
        response = NumpyRequestCodec.encode_response(model_name=self.name, payload=pred_proc)
        response.outputs[0].shape = [response.outputs[0].shape[0], 1]

        return response

    def preprocess_text(self, df, feature_names):
        logger.info("Preprocessing text")
        logger.info("Removing punctuation")
        df['review'] = df['review'].apply(lambda x: self.remove_punctuation(x))
        logger.info("Lowercase all characters")
        df['review'] = df['review'].apply(lambda x: x.lower())
        logger.info("Removing stopwords")
        df['review'] = df['review'].apply(lambda x: self.remove_stopwords(x))
        logger.info("Carrying out lemmatization")
        df['review'] = df['review'].apply(lambda x: self.lemmatizer(x))

        len_df = len(df)
        logger.info(f"{len(df)}")

        dataset = datasets.Dataset.from_pandas(df, preserve_index=False)
        logger.info(f"Dataset created: {dataset}")

        tokenized_revs = dataset.map(self.tokenize, batched=True)
        logger.info(f"Tokenized reviews: {tokenized_revs}")

        logger.info("Converting tokenized reviews to tf dataset")
        tf_inf = tokenized_revs.to_tf_dataset(
            columns=["attention_mask", "input_ids"],
            label_cols=["labels"],
            shuffle=False,  # keep predictions in the same order as the input rows
            batch_size=len_df,
            collate_fn=self._data_collator
        )
        logger.info(f"TF dataset created: {tf_inf}")

        return tf_inf

    def remove_punctuation(self, text):
        punctuation_free = "".join([i for i in text if i not in string.punctuation])
        return punctuation_free

    def remove_stopwords(self, text):
        text = ' '.join([word for word in text.split() if word not in self.stopwords])
        return text

    def lemmatizer(self, text):
        lemm_text = ' '.join([self._wordnet_lemmatizer.lemmatize(word) for word in text.split()])
        return lemm_text

    def tokenize(self, ds):
        return self._tokenizer(ds["review"], padding="max_length", truncation=True)

    def process_output(self, preds):
        logger.info("Processing model predictions")
        rating_preds = []
        for i in preds["logits"]:
            rating_preds.append(np.argmax(i, axis=0))

        logger.info("Create output array for predictions")
        rating_preds = np.array(rating_preds)

        return rating_preds

    def process_whole(self, text):
        logger.info("Start processing")
        tf_inf = self.preprocess_text(text, feature_names=None)
        logger.info("Predictions ready to be made")
        preds = self._model.predict(tf_inf)
        logger.info(f"Prediction type: {type(preds)}")
        logger.info(f"Predictions: {preds}")
        preds_proc = self.process_output(preds)
        logger.info(f"Processed predictions: {preds_proc}, Processed predictions type: {type(preds_proc)}")

        return preds_proc

```

The custom inference wrapper should be responsible for: 

1. Loading the model
2. Running inference using our model structure

Some things to note: 

- The `ReviewRatings` class inherits from `mlserver.MLModel`.
- There are 2 expected `async` methods, `load` and `predict`:
    - The `load` method loads the model and any other artefacts: in this example, the lemmatizer, the tokenizer and the data collator.
    - The `predict` method takes the inference request, runs the pre-processing and the model, and returns the inference response.
- We use the MLServer `logger` from `mlserver.logging`.
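
To make the text clean-up in `preprocess_text` more concrete, here is a minimal standalone sketch of the first three steps (punctuation removal, lowercasing, stop word removal) applied to one review. It is an illustration only: the tiny `STOPWORDS` set is a hand-written assumption standing in for NLTK's English list, lemmatization is left out so the snippet needs no downloads, and `clean_review` is a hypothetical helper name that does not appear in `ratings.py`.

```python
import string

# Assumption: a tiny hand-written stop word list for illustration only;
# ratings.py uses NLTK's full English list instead.
STOPWORDS = {"is", "it", "i", "its", "the", "a"}


def remove_punctuation(text: str) -> str:
    # Drop every character listed in string.punctuation.
    return "".join(ch for ch in text if ch not in string.punctuation)


def remove_stopwords(text: str) -> str:
    # Keep only whitespace-separated words that are not stop words.
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def clean_review(text: str) -> str:
    # Same order as preprocess_text: punctuation, lowercase, stop words.
    return remove_stopwords(remove_punctuation(text).lower())


print(clean_review("_product is excellent! I love it, it's great!"))
# prints: product excellent love great
```

Note that punctuation is stripped before the stop word filter runs, so `it's` has already become `its` by the time it is compared against the stop word list.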

## Content Types (and Codecs)

Machine learning models generally expect their inputs to be passed down as a particular Python type. Most commonly, this type ranges from “general purpose” NumPy arrays or Pandas DataFrames to more granular definitions, like datetime objects, Pillow images, etc. Unfortunately, the definition of the V2 Inference Protocol doesn’t cover any of these specific use cases. The protocol can be thought of as a wider “lower level” spec, which only defines what fields a payload should have.

To account for this gap, MLServer introduces support for content types, which offer a way to let MLServer know how it should “decode” V2-compatible payloads. When shaped in the right way, these payloads should “encode” all the information required to extract the higher level Python type that will be required for a model.

To let MLServer know that a particular payload must be decoded / encoded as a different Python data type (e.g. NumPy array, Pandas DataFrame, etc.), you can specify it through the `content_type` field of the `parameters` section of your request.

It’s important to keep in mind that content types can be specified at both the request level and the input level. The former will apply to the entire set of inputs, whereas the latter will only apply to a particular input of the payload.
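
The difference between the two levels is easiest to see side by side. The sketch below builds one payload of each kind as plain Python dictionaries. The review text is made up, and the input-level example uses MLServer's `str` content type purely for illustration; the model in this workshop expects the request-level `pd` form.

```python
import json

# Request-level content type: applies to every input in the payload.
request_level = {
    "parameters": {"content_type": "pd"},
    "inputs": [
        {"name": "review", "shape": [1, 1], "datatype": "BYTES",
         "data": ["great product"]},
    ],
}

# Input-level content type: applies only to the input that carries it.
input_level = {
    "inputs": [
        {"name": "review", "shape": [1, 1], "datatype": "BYTES",
         "parameters": {"content_type": "str"},
         "data": ["great product"]},
    ],
}

print(json.dumps(request_level, indent=2))
print(json.dumps(input_level, indent=2))
```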



### Example Request 

```
{
    "parameters": {
        "content_type": "pd"
    },
    "inputs": [
        {
          "name": "review",
          "shape": [1, 1],
          "datatype": "BYTES",
          "data": ["_product is excellent! I love it, it's great!"]
        }
    ]
}
```

Under the hood, the conversion between content types is implemented using codecs. In the MLServer architecture, codecs are an abstraction which know how to encode and decode high-level Python types to and from the V2 Inference Protocol.

Depending on the high-level Python type, encoding / decoding operations may require access to multiple input or output heads. For example, a Pandas DataFrame would need to aggregate all of the input / output heads present in a V2 Inference Protocol payload. However, a NumPy array or a list of strings could be encoded directly as a single input head within a larger request.
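
As a rough mental model of why a DataFrame needs every input head, the sketch below collects the heads of a payload into a dictionary of columns, one column per head. This is plain Python written for illustration, not MLServer's codec implementation; `inputs_to_columns` and the second `verified` head are hypothetical.

```python
from typing import Any, Dict, List


def inputs_to_columns(payload: Dict[str, Any]) -> Dict[str, List[Any]]:
    # One column per input head, keyed by the head's name.
    return {head["name"]: list(head["data"]) for head in payload["inputs"]}


payload = {
    "parameters": {"content_type": "pd"},
    "inputs": [
        {"name": "review", "shape": [2, 1], "datatype": "BYTES",
         "data": ["loved it", "broke after a week"]},
        # Hypothetical second head, not used by this workshop's model.
        {"name": "verified", "shape": [2, 1], "datatype": "BOOL",
         "data": [True, False]},
    ],
}

columns = inputs_to_columns(payload)
print(columns)
```

Each key of `columns` corresponds to what would become a DataFrame column, which is why `preprocess_text` can read the reviews from `df['review']`.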

More information on Content Types (and Codecs) can be found [here](https://mlserver.readthedocs.io/en/latest/user-guide/content-type.html).

The following 2 lines can be seen in the `predict` method of the above `ratings.py` file:

- Decoding Request:
    `review_df = self.decode_request(payload, default_codec=PandasCodec)`
    
This is a request codec, which works at the request / response level. Here it decodes the V2-compatible payload into a DataFrame (as specified in the example request above).
    
- Encoding Response: 
    `response = NumpyRequestCodec.encode_response(model_name=self.name, payload=pred_proc)`

This is also a request codec, which works at the request / response level. Here it encodes the NumPy array into a V2-compatible payload.
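
To show the shape of what comes back, here is a small sketch that mirrors `process_output` and the reshape in `predict` using only plain Python: it takes the arg-max of each row of logits and wraps the result in a V2-style response dictionary with shape `[n, 1]`. The logits are made up, the output name `predictions` is a placeholder (MLServer chooses its own), and the `INT64` datatype assumes the predictions are 64-bit integers.

```python
from typing import Any, Dict, List


def argmax(row: List[float]) -> int:
    # Index of the largest value, matching np.argmax on a 1-D row.
    return max(range(len(row)), key=lambda i: row[i])


def to_v2_response(model_name: str, logits: List[List[float]]) -> Dict[str, Any]:
    preds = [argmax(row) for row in logits]
    return {
        "model_name": model_name,
        "outputs": [
            {
                # Hypothetical output name; MLServer picks its own.
                "name": "predictions",
                # One row per review, one column for the predicted class.
                "shape": [len(preds), 1],
                "datatype": "INT64",
                "data": preds,
            }
        ],
    }


# Made-up logits for two reviews over three classes.
response = to_v2_response("nlp-ratings-v2", [[0.1, 2.3, 0.4], [1.9, 0.2, 0.3]])
print(response["outputs"][0]["data"])   # prints: [1, 0]
print(response["outputs"][0]["shape"])  # prints: [2, 1]
```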
    

## Settings Files

When serving a custom model, MLServer also expects the following 2 configuration files: 

- `settings.json`: holds the configuration of our server (e.g. ports, log level, etc.).
- `model-settings.json`: holds the configuration of our model (e.g. input type, runtime to use, etc.).

In this case, our files are as follows: 

`settings.json` 

```
{
    "debug": true
}
```

`model-settings.json`

```
{
    "implementation": "ratings.ReviewRatings",
    "parameters": {
        "uri": "./1/"
    },
    "parallel_workers": 0
}
```

## Requirements.txt

We also have our ```requirements.txt``` file, which contains a list of Python packages which the deployment requires to run:

```
datasets == 2.2.2
numpy == 1.21.6
pandas == 1.3.5
tensorflow == 2.7.3
transformers == 4.20.0
mlserver
nltk
```

### Testing Locally

First, we must download the model artifact so that we have it locally available for testing.  You can run the following command to get it from Google Storage:

```
gsutil cp -r gs://kelly-seldon/nlp-ratings/1/ .
```

Now that we have our config in place, we can start the server by running `mlserver start .`. This needs to be run either from the directory where our config files are, or with the path to that folder in place of `.`.

```
mlserver start .
```

Since this command starts the server and blocks the terminal while it waits for requests, it needs to be run either in the background or in a separate terminal.
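
Loading the model can take a little while, so it can help to wait until the server reports that it is ready before sending anything. The V2 Inference Protocol defines a `GET /v2/health/ready` endpoint for this. The sketch below polls it using only the standard library; `wait_until_ready` and `http_ready` are hypothetical helper names, and the retry count and delay are arbitrary choices.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ready(url: str) -> bool:
    # The server counts as ready when the endpoint answers with HTTP 200.
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    url: str,
    attempts: int = 30,
    delay: float = 1.0,
    probe: Callable[[str], bool] = http_ready,
) -> bool:
    # Poll the readiness endpoint until it succeeds or attempts run out.
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False


# Example call, assuming the server listens on localhost:8080:
# wait_until_ready("http://localhost:8080/v2/health/ready")
```

The `probe` parameter exists only so the polling loop can be exercised without a running server.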

We can then test sending requests by running the following Python script:

`test_request.py`


```
import requests

inference_request = {
    "parameters": {
        "content_type": "pd"
    },
    "inputs": [
        {
          "name": "review",
          "shape": [1, 1],
          "datatype": "BYTES",
          "data": ["_product is excellent! I love it, it's great!"]
        }
    ]
}

endpoint = "http://localhost:8080/v2/models/nlp-ratings-v2/infer"
response = requests.post(endpoint, json=inference_request)

print(response.json())
```
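
The script above prints the whole response body. If only the predicted classes are of interest, a small helper along the following lines can pull them out. This is a sketch under assumptions: `extract_predictions` is a hypothetical name, and the `sample` dictionary is an illustrative response shape with a placeholder output name, not output captured from a real run.

```python
from typing import Any, Dict, List


def extract_predictions(response_json: Dict[str, Any]) -> List[int]:
    # Fail loudly on error responses, which carry no "outputs" field.
    if "outputs" not in response_json:
        raise ValueError(f"Unexpected response: {response_json}")
    # The model returns a single output head holding one class per review.
    return list(response_json["outputs"][0]["data"])


# Illustrative response body, not captured from a real run.
sample = {
    "model_name": "nlp-ratings-v2",
    "outputs": [
        # Placeholder output name and value.
        {"name": "predictions", "shape": [1, 1], "datatype": "INT64", "data": [4]}
    ],
}

print(extract_predictions(sample))  # prints: [4]
```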

## Deployment 

Now that we have written and tested our custom model, the next step is to deploy it. With that goal in mind, the rough outline of steps will be to first build a custom image containing our code, and then deploy it using the Seldon Deploy UI.

### Building a custom image 

MLServer offers helpers to build a custom Docker image containing your code. In this example, we will use the `mlserver build` subcommand to create an image, which we’ll be able to deploy later.

Note that this section expects that Docker is available and running in the background.

We will run: 

```
mlserver build . -t kellyspry0316/ratings-v2:0.3
```

To ensure that the image is fully functional, we can spin up a container and then send a test request. To start the container, you can run the following in a separate terminal:

```
docker run -it --rm -p 8080:8080 kellyspry0316/ratings-v2:0.3
```

We can then test sending requests again using the above request. 

Finally, we can push our image to a Docker registry. In this case we will just push it to Docker Hub:

```
docker push kellyspry0316/ratings-v2:0.3
```

Now that we’ve built a custom image and verified that it works as expected, we can move to the next step and deploy it. 

### Deploying through the UI 

We will deploy the model using the UI for this session, so we will jump over there now.

A few things to note:

- Set the `Protocol` to `V2 Inference`. 
- Set the `Runtime` to `Custom`.
- Set the `Docker Image` to `kellyspry0316/ratings-v2:0.3`.
- Add an ENVIRONMENT VARIABLE to always use a writable HuggingFace cache location regardless of the user: 

    - Variable: `TRANSFORMERS_CACHE`
    - Value: `/opt/mlserver/.cache`
    
- Set CPU Requests and Limits to 1.
- Set Memory Requests and Limits to 3Gi.

## Making Predictions

Now you can have a go at sending some requests to your model using the 'Predict' tab in the UI.

Here is an example of a positive review, which we would expect to correspond to a higher rating:


```
{
    "parameters": {
        "content_type": "pd"
    },
    "inputs": [
        {
          "name": "review",
          "shape": [1, 1],
          "datatype": "BYTES",
          "data": ["_product is excellent! I love it, it's great!"]
        }
    ]
}
```

And here is an example of a negative review, which we would expect to correspond to a lower rating:

```
{
    "parameters": {
        "content_type": "pd"
    },
    "inputs": [
        {
          "name": "review",
          "shape": [1, 1],
          "datatype": "BYTES",
          "data": ["_product_ was terrible, I would not use it again, it was awful!"]
        }
    ]
}
```
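
Several reviews can also be packed into a single request by adding one entry to `data` per review and setting the first dimension of `shape` to match. The sketch below builds such a payload; `build_request` is a hypothetical helper, and it assumes the deployed model accepts multi-row payloads in the same `pd` format as the single-review examples.

```python
from typing import Any, Dict, List


def build_request(reviews: List[str]) -> Dict[str, Any]:
    # One row per review, one column named "review".
    return {
        "parameters": {"content_type": "pd"},
        "inputs": [
            {
                "name": "review",
                "shape": [len(reviews), 1],
                "datatype": "BYTES",
                "data": list(reviews),
            }
        ],
    }


payload = build_request([
    "_product is excellent! I love it, it's great!",
    "_product_ was terrible, I would not use it again, it was awful!",
])
print(payload["inputs"][0]["shape"])  # prints: [2, 1]
```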

# Congratulations!

We successfully packaged up and deployed a model on Seldon Deploy using MLServer with the V2 Protocol! 