## Packaging, Deployment and Monitoring of spaCy Entity Extractor using MLFlow and Seldon Deploy

This workshop is focused on the packaging, deployment and monitoring of a machine learning model for extracting entities from text.

In order to create an MLFlow `pyfunc` model, we will use the `mlflow.pyfunc` utilities. We will define a class for a custom `PythonModel` that creates the desired entity dictionary output, before saving this custom model.

We will then deploy the model with Seldon Deploy, using the MLServer MLFlow runtime. MLServer is an open source inference server for ML models. It supports the standard V2 Inference Protocol over both gRPC and REST, a protocol which has been standardised and adopted by various model serving frameworks. Full MLServer docs can be found [here](https://mlserver.readthedocs.io/en/latest/). In order to do this, we will create a custom conda environment to ensure that any extra dependencies not known in advance by MLServer are available. These are the dependencies that are not included in the default `seldonio/mlserver` Docker image; to cater for them, the image allows you to load a custom environment before starting the server itself. We will serialise our environment in the format expected by MLServer using a tool called `conda-pack`, which saves a portable version of the environment as a `.tar.gz` file, also known as a *tarball*. We will deploy our model using the V2 Inference Protocol.

We will then turn to the advanced monitoring capabilities provided by Seldon's Alibi Detect library. We will configure and deploy a drift detector, before running a batch job in order to observe drift in the Seldon Deploy UI.

Agenda:

1) Set up environment
2) Define Entity Extraction code
3) Define custom MLFlow `PythonModel`
4) Create custom conda environment
5) *Optional* - Test model locally using `mlserver`
6) Save MLFlow model to GCS
7) Deploy MLFlow model with Seldon Deploy
8) Add metadata and prediction schema to deployed MLFlow model 
9) Configure Maximum Mean Discrepancy (MMD) Drift Detector
10) Deploy Drift Detector with Seldon Deploy


---

## 1. Set up environment

Firstly, we will import the relevant packages which we will use throughout the packaging and deployment process.

In [None]:
import pandas as pd
import re
import json
import numpy as np
from datetime import date, timedelta

import spacy
import mlflow
import mlflow.spacy
import mlflow.pyfunc
from transformers import AutoTokenizer

from mlserver.codecs.string import StringRequestCodec
from mlserver.logging import logger

from seldon_deploy_sdk import Configuration, PredictApi, ApiClient, SeldonDeploymentsApi, ModelMetadataServiceApi, DriftDetectorApi, BatchJobsApi, BatchJobDefinition
from seldon_deploy_sdk.auth import OIDCAuthenticator, SessionAuthenticator
from seldon_deploy_sdk.rest import ApiException

from alibi_detect.models.tensorflow import TransformerEmbedding
from alibi_detect.cd import MMDDrift
from functools import partial
from alibi_detect.cd.tensorflow import preprocess_drift
from alibi_detect.utils.saving import save_detector

Next, we will get the saved spaCy artefact from where it has been stored in Google Cloud Storage.

In [None]:
!gsutil cp -r gs://andrew-seldon/sandbox/spacy_body_model ./

---

**The following 5 sections (*Define Entity Extraction code*, *Define custom MLFlow `PythonModel`*, *Create custom conda environment*, *Optional - Test model locally using `mlserver`*, and *Save MLFlow model to GCS*) concentrate on packaging up the spaCy Entity Extractor ready for deployment with Seldon. They assume that a trained spaCy entity extractor is already available, as is the case here.**

**These 5 sections would typically form part of an automation pipeline (i.e. a CI/CD pipeline). For the sake of showing each of the steps in the workshop, we will run through them in a notebook, but it is worth emphasising that all of the following steps would be abstracted away and fully automated in a real world scenario.**

## 2. Define Entity Extraction code

### Define text processing functions

In [None]:
def find_geboortedatum(preprocessed_text):
    """
    returns a list with all the valid geboortedatums (birth dates) in the db notation (%Y-%m-%d) in the preprocessed text
    @param preprocessed_text: processed text; entities are in database label notation format
    @return: list with all the found birth dates. A birthdate is considered valid if the date is older than
    12 years (currently hardcoded...)
    """
    gebdat_pattern = re.compile(r'\b[1-2][0-9]{3}-[0-1][0-9]-[0-3][0-9]\b')
    return [m.group() for m in re.finditer(gebdat_pattern, preprocessed_text) if valid_birthdate(m.group(), format='%Y-%m-%d')]


def find_postcode(preprocessed_text):
    """
    returns a list with all the postcodes in the db notation (1234AB)
    @param preprocessed_text: processed text; entities are in database label notation format
    @return: list with all the found postcodes in db notation (1234AB)
    """
    postcode_pattern = re.compile(r'\b[1-9][0-9]{3}[A-Z]{2}\b')
    return [m.group() for m in re.finditer(postcode_pattern, preprocessed_text)]


def find_polis(preprocessed_text):
    """
    returns a list with all the polisnummers in the db notation
    @param preprocessed_text: processed text; entities are in database label notation format
    @return: list with all the found polisnummers in db notation
    """
    polis_pattern = re.compile("|".join([
        r"\b29[0-9]{6}\b",   # PM29000000-PM29599999 & PM29600000-PM29999999
        r"\b147[0-9]{5}\b",  # PW14700000-PW14799999
        r"\b1[0-9]{7}\b",    # SW10000000-SW19999999
        r"\b4[0-9]{7}\b",    # PK40000000-PK42999999
        r"\b9[0-9]{7}\b",    # SL90000000-SL92999999 & SL93000000-SL93999999
        r"\b95[0-9]{6}\b",   # LP95000000-LP95999999
        r"\b81[0-9]{6}\b"]))  # PW81000000-PW81999999
    return [m.group() for m in re.finditer(polis_pattern, preprocessed_text)]

    
def find_nn_iban(preprocessed_text):
    """
    returns a list with all NN iban numbers found in the input text
    @param preprocessed_text: processed text; entities are in database label notation format
    @return: list with all the found iban numbers
    """
    return find_iban(preprocessed_text, nn_iban=True)


def find_rekeningnummer(preprocessed_text):
    """
    returns a list with all rekening nummers (length between 7 and 10) found in the input text
    @param preprocessed_text: processed text; entities are in database label notation format
    @return: list with all the found rekening numbers
    """
    rekeningnummer_pattern = re.compile(r'\b\d{7,10}\b')
    return [m.group() for m in re.finditer(rekeningnummer_pattern, preprocessed_text)]


def find_iban(preprocessed_text, nn_iban=False):
    """
    returns a list with all iban numbers found in the input text
    @param preprocessed_text: processed text; entities are in database label notation format
    @param nn_iban: If true the regex only searches for iban numbers containing 'NNBA'
    @return: list with all the found iban numbers
    """
    if nn_iban:
        iban_pattern = re.compile(r'\bNL\s*\d\d\s*NNBA\s*\d{10}\b')
    else:
        iban_pattern = re.compile(r'\bNL\s*\d\d\s*\w{4}\s*\d{10}\b')

    matches = [m.group() for m in re.finditer(iban_pattern, preprocessed_text)]
    cleaned_matches = [match.replace(' ', '') for match in matches]
    stripped_matches = [i[-10:] for i in cleaned_matches]
    return stripped_matches


def find_phonenumber(preprocessed_text):
    """
    returns a list with all the phonenumbers in the db notation
    @param preprocessed_text: processed text; entities are in database label notation format
    @return: list with all the found phonenumbers in db notation
    """
    # Every alternative ends in `$`, so only a number at the very end of the text is matched.
    phone_pattern = re.compile("|".join([
        r'\(?([+]31|0031|0)-?6(\s?|-)([0-9]\s{0,3}){8}$',
        r'((0)[1-9]{2}[0-9](\s?|-)?[1-9][0-9]{5})$',
        r'(((0)[1-9]{2}(\s?|-)?[1-9][0-9]{2}(\s?|-)?[0-9]{2}(\s?|-)?[0-9]{2}))$',  # landline
        r'(((0)[1-9]{2}[0-9](\s?|-)?[1-9][0-9]{2}(\s?|-)?[0-9]{2}(\s?|-)?[0-9]{2}))$',
        r'((\+31|0|0031)[1-9][0-9](\s?|-)?[1-9][0-9]{6})$',  # landline
        ]))
    found_numbers = [m.group() for m in re.finditer(phone_pattern, preprocessed_text)]
    # Strip separators and keep the last 9 digits (the number without its prefix)
    return [x.replace('-', '').replace(' ', '')[-9:] for x in found_numbers]


def remove_urls(text):
    """
    Removes the actual url and replaces it with 'URL'. Only works for
    @param text: string containing the text that needs to be processed
    @return: original string with the url replaced with 'URL'
    """
    url_pattern = r'<?http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return re.sub(url_pattern, 'URL', text)


def zipcode_db_notation(text):
    """
    DB notation for postal codes is 1234AB instead of 1234 AB -> the space is removed for postal codes.
    @param text: string containing the text that needs to be processed
    @return: original string with the optional space in postal codes removed (e.g. 1234 AB -> 1234AB)
    """
    zipcode_pattern = r'\b[1-9][0-9]{3}[ -]?[A-Z]{2}\b|\b[1-9][0-9]{3}-?[a-zA-Z]{2}\b'

    def db_notate(match):
        return match.group().replace(' ', '').replace("-", "").upper()
    return re.sub(zipcode_pattern, db_notate, text)


def valid_date(candidate_date, format='%d-%m-%Y'):
    """
    Checks whether a string can be converted to a valid date(time)
    @param candidate_date: string that needs to be converted to a date
    @param format: date format
    @return: bool (True/False) whether the string can be converted to a datetime with specified format
    """
    try:
        pd.to_datetime(candidate_date, format=format)
        return True
    except (ValueError, TypeError):
        # Raised by pandas when the string does not fit the format or is out of bounds
        return False

def valid_birthdate(candidate_date, format='%d-%m-%Y', threshold=12*365.25):
    """
    checks whether a date can be considered as a birthdate. The assumption is that very
    recent dates are unlikely to be actual birthdays.
    @param candidate_date: string that needs to be validated for a proper birthday
    @param format: date format
    @param threshold: minimum age in days
    @return: bool - whether the candidate is a proper birthdate or not
    """
    if valid_date(candidate_date, format=format):
        # Compare Timestamp with Timestamp: comparing a Timestamp with a
        # datetime.date is not supported consistently across pandas versions.
        cutoff = pd.Timestamp.today().normalize() - pd.Timedelta(days=threshold)
        return pd.to_datetime(candidate_date, format=format) <= cutoff
    else:
        return False

def birthdate_db_notation(text):
    """
    Converts potential birthdates to the db notation/date format %Y-%m-%d. TODO: explain which cases should be
    captured by this function.
    @param text: string containing the text that needs to be processed
    @return: original string with the potential birthdates in %Y-%m-%d format
    """
    # 'sept' is listed so that every key of replace_dict below can actually be matched
    months = ['januari', 'jan', 'februari', 'feb', 'maart', 'mrt', 'april', 'apr', 'mei', 'juni', 'jun', 'juli',
            'jul', 'augustus', 'aug', 'september', 'sept', 'sep', 'oktober', 'okt', 'november', 'nov', 'december', 'dec']
    gebdat_pattern = r'[0-3]?[0-9](/|-|\s)(' + '|'.join(months) + \
        r'|[0-1]?[0-9])(/|-|\s)(19|20)?[0-9]{2}'

    def db_notate(match):
        replace_dict = {
            'januari': '1',
            'jan': '1',
            'februari': '2',
            'feb': '2',
            'maart': '3',
            'mrt': '3',
            'april': '4',
            'apr': '4',
            'mei': '5',
            'juni': '6',
            'jun': '6',
            'juli': '7',
            'jul': '7',
            'augustus': '8',
            'aug': '8',
            'september': '9',
            'sep': '9',
            'sept': '9',
            'oktober': '10',
            'okt': '10',
            'november': '11',
            'nov': '11',
            'december': '12',
            'dec': '12'
        }

        candidate_date = re.sub(r'/|\s', '-', match.group()).lower()
        mon = re.search('[a-z]+', candidate_date)
        if mon:
            # string month to int month
            candidate_date = re.sub(
                mon.group(), replace_dict[mon.group()], candidate_date)
        if re.fullmatch('[0-3]?[0-9]-[0-1]?[0-9]-[0-9]{2}', candidate_date):
            # yy to yyyy (two-digit years are assumed to be 19yy)
            candidate_date = candidate_date[:-2] + '19' + candidate_date[-2:]

        if valid_birthdate(candidate_date):
            return pd.to_datetime(candidate_date, format='%d-%m-%Y').strftime('%Y-%m-%d')
        else:
            return match.group()

    return re.sub(gebdat_pattern, db_notate, text, flags = re.IGNORECASE)


def preprocess(text):
    """
    Preprocess text to match the database (label) notation
    @param: text: Text that will be annotated
    """
    preprocessed_text = remove_urls(text)
    preprocessed_text = zipcode_db_notation(preprocessed_text)
    preprocessed_text = birthdate_db_notation(preprocessed_text)
    return preprocessed_text
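
To make the "normalise first, then extract" idea concrete, here is a minimal, self-contained sketch for postcodes only. It re-declares the two regular expressions used by `zipcode_db_notation` and `find_postcode` so that it can run on its own; the helper name `normalise_postcodes` and the sample sentence are made up for illustration.

```python
import re

# Same patterns as zipcode_db_notation and find_postcode above.
ZIPCODE_PATTERN = r'\b[1-9][0-9]{3}[ -]?[A-Z]{2}\b|\b[1-9][0-9]{3}-?[a-zA-Z]{2}\b'
POSTCODE_PATTERN = re.compile(r'\b[1-9][0-9]{3}[A-Z]{2}\b')


def normalise_postcodes(text):
    """Rewrite postcodes such as '2596 CX' or '1234-ab' to '2596CX' / '1234AB'."""
    return re.sub(
        ZIPCODE_PATTERN,
        lambda m: m.group().replace(' ', '').replace('-', '').upper(),
        text,
    )


sample = "Adres 2596 CX, voorheen 1234-ab en 1013ZK."  # made-up example text
normalised = normalise_postcodes(sample)
postcodes = POSTCODE_PATTERN.findall(normalised)

print(normalised)  # Adres 2596CX, voorheen 1234AB en 1013ZK.
print(postcodes)   # ['2596CX', '1234AB', '1013ZK']
```

Without the normalisation step, the extraction pattern would only find `1013ZK`, because the other two postcodes contain a separator or lower-case letters.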

### Define Entity Extractor class

In [None]:
class Extractor:
    """
    Extracts entities using the spacy model and regular expressions. Returns an entity dictionary with
    entity_name: [value1, value2, ...] as output. For example: {"postcode": ["1234AB", "5678CD"], "polis": []}
    """
    def __init__(self, nlp_model, regex_extractors, prefix=None):
        '''
        Class which extracts entities with help of both regex extractors 
        and trained Named Entity Recognition model (spacy).
        @param: nlp_model: trained spacy NER model 
        @param: regex_extractors: dictionary {type entity: regex function to extract this entity}
        @param: prefix: prefix that needs to be applied for all found entity names (keys in the dictionary)
        '''
        self.nlp_model = nlp_model
        self.regex_extractors = regex_extractors
        self.prefix = prefix

    def nlp_extract(self, text):
        '''
        Extracts entities from text with help of trained NER spacy model.
        @param text: from which we want to extract entities with our trained spacy model
        @return: dictionary {type entity: found entity/entities with help of spacy}
        '''
        d = {}
        for label in self.nlp_model.pipe_labels['ner']:
            d[label] = []
        doc = self.nlp_model(text)
        for ent in doc.ents:
            if ent.text not in d[ent.label_]:
                d[ent.label_].append(ent.text)
        return d

    def regex_extract(self, text):
        '''
        applies regex functions found in regex_extractors to text and maps them in dictionary.
        @param text: from which we want to extract entities with regular expressions
        @return: dictionary with {type entity: found entities with help of regex extractor}
        '''
        return {key: regextract_function(text) for key, regextract_function in self.regex_extractors.items()}

    def create_ent_dict(self, text):
        """
        Creates an entity dictionary for all entities. A prefix is applied to the keys to track which method was used
        to extract the entity e.g. "spacy_" or "regex_"
        @param text: text from which we want to extract entities with our spacy model and regular expressions
        @return: dictionary containing all the found entities using the spacy model and regular expressions.
        """
        preprocessed_text = preprocess(text)
        spacy_ent_dict = self.nlp_extract(preprocessed_text)
        spacy_ent_dict = self.apply_prefix(spacy_ent_dict, 'spacy_')
        regex_ent_dict = self.regex_extract(preprocessed_text)
        regex_ent_dict = self.apply_prefix(regex_ent_dict, 'regex_')
        ent_dict = {**spacy_ent_dict, **regex_ent_dict}
        if self.prefix:
            ent_dict = self.apply_prefix(ent_dict, self.prefix)
        return ent_dict

    def apply_prefix(self, ent_dict, prefix):
        """
        Returns the dictionary with found entities with the specified prefix. Used to track which method was used
        to extract the entities
        @param ent_dict: the dictionary with entities extracted
        @param prefix: the prefix that needs to be applied for the keys e.g. "spacy_"
        @return: dictionary with the prefix applied to the keys
        """
        return {prefix + key: val for key, val in ent_dict.items()}
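
The way the prefixes stack up in `create_ent_dict` can be sketched without a spaCy model. The snippet below is only an illustration: the two input dictionaries are hand-written stand-ins for the output of `nlp_extract` and `regex_extract`, and the entity names and values are hypothetical.

```python
def apply_prefix(ent_dict, prefix):
    """Same logic as Extractor.apply_prefix: prepend the prefix to every key."""
    return {prefix + key: val for key, val in ent_dict.items()}


# Hypothetical stand-ins for the spaCy and regex extraction results.
spacy_ents = {"naam": ["Jan Jansen"]}
regex_ents = {"postcode": ["1013ZK"], "polis": []}

# Method prefix first, then the merge, then the optional global prefix.
merged = {**apply_prefix(spacy_ents, "spacy_"), **apply_prefix(regex_ents, "regex_")}
final = apply_prefix(merged, "subject_")

print(final)
```

Because each key carries its method prefix before the merge, a label found by both spaCy and a regex extractor does not overwrite the other entry.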

### Define regex extractors

In [None]:
regex_extractors = {
    'polis': find_polis,
    'postcode': find_postcode,
    'rekeningnummer': find_iban,
    'telefoon': find_phonenumber
    } 

## 3. Define custom MLFlow `PythonModel`

As mentioned at the top of the notebook, we will use the `mlflow.pyfunc` utilities to define a class for a custom `PythonModel` that creates the desired entity dictionary output, before saving this custom model.

We are doing this so that our MLFlow model will have the `pyfunc` flavour with the custom `predict` method as defined below. This flavour is expected by Seldon's MLFlow server.

In [None]:
model_path = "./spacy_body_model/"

# Define the model class
class EntityDictionary(mlflow.pyfunc.PythonModel):

    def __init__(self):
        self.extractor = Extractor(
            nlp_model=spacy.load(model_path),
            regex_extractors=regex_extractors,
            prefix='subject_')

    def predict(self, context, model_input):
        logger.info(f"Model input before indexing: {model_input}")
        str_payload = model_input[0]
        logger.info(f"Model input: {str_payload}")
        logger.info(f"Model input type: {type(str_payload)}")
        ent_dict = self.extractor.create_ent_dict(str_payload)
        logger.info(f"Entity dict {ent_dict}")
        logger.info(f"Entity dict type: {type(ent_dict)}")

        return [json.dumps(ent_dict)]

We can then create the custom model and use the `mlflow.pyfunc` utility `save_model()` to save our custom MLFlow `PythonModel`.

In [None]:
custom_model = EntityDictionary()
mlflow.pyfunc.save_model(path="spacy_pyfunc", python_model=custom_model)

## 4. Create custom conda environment

Now we will create our custom conda environment and install into it any extra dependencies that are not known in advance by MLServer.

Then we will use `conda-pack` to save our environment as a *tarball*.

Firstly we can copy and paste the following into the `requirements.txt` file within the `spacy_pyfunc` model folder: 

```
mlserver==1.2.0.dev12
mlserver-mlflow==1.2.0.dev12
datetime
pandas
```

Make sure to save the `requirements.txt` file afterwards.

Next we will create a new conda environment, install the requirements and then `conda-pack` our environment as a *tarball*:

In [None]:
!conda create --name model-pack python=3.8.13 -y

In [None]:
%%bash 

source ~/anaconda3/etc/profile.d/conda.sh
conda activate model-pack
pip install -r ./spacy_pyfunc/requirements.txt

In [None]:
!conda pack -n model-pack -o ./spacy_pyfunc/environment.tar.gz -f
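
Before uploading the model folder, it can be worth checking that the tarball was actually written and looks like a packed environment. The sketch below uses only the standard library. The helper name `tarball_contains` is made up, and the check for a `bin/python` entry assumes a Linux or macOS conda environment layout.

```python
import os
import tarfile


def tarball_contains(path, member_suffix):
    """Return True if any member of the .tar.gz archive ends with member_suffix."""
    with tarfile.open(path, "r:gz") as archive:
        return any(name.endswith(member_suffix) for name in archive.getnames())


env_path = "./spacy_pyfunc/environment.tar.gz"
if os.path.exists(env_path):
    # Assumption: a packed conda environment ships its interpreter as bin/python.
    print("Python interpreter packed:", tarball_contains(env_path, "bin/python"))
else:
    print(f"{env_path} not found; run conda pack first")
```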

## 5. *Optional* - Test model locally using `mlserver`

Now that we have our custom model and custom conda environment ready, we can test our model locally. 

In order to do this, we first need to create a `model-settings.json` file. This holds the configuration of our model (e.g. input type, runtime to use, etc):

  ```
  {
      "name": "nn-ee",
      "implementation": "mlserver_mlflow.MLflowRuntime",
      "parameters": {
          "uri": "./spacy_pyfunc/"
      }
  }
  ```
This should already be present in the folder containing this notebook.
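
If the file is missing, it can be regenerated from Python. The following is a minimal sketch that reproduces the settings shown above; the helper names `model_settings` and `write_model_settings` are hypothetical and not part of MLServer.

```python
import json
from pathlib import Path


def model_settings(name, uri):
    """Build the same settings dictionary as the JSON shown above."""
    return {
        "name": name,
        "implementation": "mlserver_mlflow.MLflowRuntime",
        "parameters": {"uri": uri},
    }


def write_model_settings(folder, name="nn-ee", uri="./spacy_pyfunc/"):
    """Write model-settings.json into `folder` and return the path of the file."""
    path = Path(folder) / "model-settings.json"
    path.write_text(json.dumps(model_settings(name, uri), indent=4))
    return path


# Print the settings; call write_model_settings(".") to create the file itself.
print(json.dumps(model_settings("nn-ee", "./spacy_pyfunc/"), indent=4))
```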

Then we can start the server by running:

`mlserver start .`

Note: This needs to be run either from the directory containing our `model-settings.json` file, or with the path to that directory passed in place of `.`.

Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.

We can then test sending requests by running the following python script:

```
import requests

inference_request = {
    "parameters": {
        "content_type": "str"
    },
    "inputs": [
        {
            "name": "test",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["Roeselare verkocht gemotiveerd student gevaarlijke eindeloze Koerdische luchthavens getal CERA uiteindelijke gesprekken overbodig losse achterop ongeval grafische belanden vergezeld witloof 1013ZK. steen ein VU buitenstaander alleen ongeval gepaard begeven file regelmaat Kennedy schuld vertrok dozen Ford Landen Nick voorbereiden vult benoemd 1013ZK. avec deel conservatief loonnorm ken huisvesting oceaan westen onvermijdelijk Wil schrijven mogelijke stamt muren gesprekken vult vorming Ludwig bezorgd bestemmingen 1013ZK. felle Elizabeth Express evenmin Buffett vult herhaald ontwerpers vertegenwoordigers strand Mobutu bedenken rekenen géén terechtgekomen hoort volumes authentieke wilden voorbereiden 1013ZK. westen afstand tweeduizend besliste Planet Hendrik onderschreven Finance menselijke conservatief nauwe verklaart constructeur Sam boeiende stamt CERA financier verwachte integratie 2596 CX"]
        }
    ]
}

endpoint = "http://localhost:8080/v2/models/nn-ee/infer"
response = requests.post(endpoint, json=inference_request)

print(response.json())

```

Again, this should already be present in the folder containing this notebook. 
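
Since `predict` returns the entity dictionary as a JSON string inside a list, the client has to decode it again. The sketch below shows one way to do that. The response used here is hand-written and hypothetical: it assumes the entity dictionary arrives as the first element of the first output's `data` field, and the output name and values are illustrative only.

```python
import json


def entities_from_v2_response(response_body):
    """Decode the entity dictionary from a V2 inference response (already parsed to a dict)."""
    payload = response_body["outputs"][0]["data"][0]
    if isinstance(payload, bytes):
        payload = payload.decode("utf-8")
    return json.loads(payload)


# Hypothetical response body; real output names and values may differ.
example_response = {
    "model_name": "nn-ee",
    "outputs": [
        {
            "name": "output-1",
            "shape": [1],
            "datatype": "BYTES",
            "data": ['{"subject_regex_postcode": ["1013ZK", "2596CX"]}'],
        }
    ],
}

entities = entities_from_v2_response(example_response)
print(entities["subject_regex_postcode"])
```

With the script above, this would be called as `entities_from_v2_response(response.json())`.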

## 6. Save MLFlow model to GCS

Finally we can save our custom spaCy MLFlow `PythonModel` to GCS ready for deployment.

In [None]:
!gsutil cp -r ./spacy_pyfunc/ gs://andrew-seldon/sandbox/spacy_pyfunc

## 7. Deploy MLFlow Model with Seldon Deploy

We can now deploy our models to the dedicated Seldon Deploy cluster that we have configured for this workshop. To do this, we will interact with the Seldon Deploy SDK. You can find the reference documentation [here](https://github.com/SeldonIO/seldon-deploy-sdk/blob/master/python/README.md).

Firstly, we need to set up the configuration and authentication required to access the cluster. 

⚠️ Make sure to fill in the following in the below cell if not already filled in: 

1) `YOUR_NAME` variable - NO capitals or underscores, but lower case characters and hyphens are fine. 
2) `SD_DOM` variable - Ensure to change this to the domain name of the cluster you are using. 
3) `CLIENT_SECRET` variable - this will either be `sd-api-secret` or an alternative secret provided during the workshop.
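
A quick check of `YOUR_NAME` before creating any resources can save a failed deployment later. The sketch below encodes the rule stated above (lower-case characters and hyphens, no capitals or underscores). It additionally assumes the usual Kubernetes naming convention that a name starts and ends with a letter or digit and stays within 63 characters; the helper name `is_valid_name` is made up.

```python
import re

# Lower-case letters, digits and hyphens; must start and end with a letter or digit.
_NAME_PATTERN = re.compile(r"[a-z0-9]([a-z0-9-]*[a-z0-9])?")


def is_valid_name(name, max_length=63):
    """Return True if `name` is safe to use as part of a deployment name."""
    return len(name) <= max_length and _NAME_PATTERN.fullmatch(name) is not None


for candidate in ["jane-doe", "Jane_Doe", "-jane", ""]:
    print(repr(candidate), is_valid_name(candidate))
```

Note that the deployment name later has `-spacy` appended, so a shorter limit may be sensible in practice.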

In [None]:
YOUR_NAME = ""
SD_DOM = ""
CLIENT_SECRET = ""

config = Configuration()
config.host = f"https://{SD_DOM}/seldon-deploy/api/v1alpha1"
config.oidc_client_id = "sd-api"
config.oidc_server = f"https://{SD_DOM}/auth/realms/deploy-realm"
config.oidc_client_secret = f"{CLIENT_SECRET}"
config.auth_method = "auth_code"

def auth():
    auth = OIDCAuthenticator(config)
    config.id_token = auth.authenticate()
    api_client = ApiClient(configuration=config, authenticator=auth)
    return api_client

Now that we have configured the domain name correctly and set up the authentication function, we can create the manifest for the deployment we would like to create. This is defined as a `SeldonDeployment` custom resource.

Notice that in the deployment manifest we have specified `protocol: v2`. MLServer will then be the inference server, and requests and responses will be compliant with the V2 Inference Protocol spec.

In [None]:
DEPLOYMENT_NAME = f"{YOUR_NAME}-spacy"
NAMESPACE = "seldon-gitops"
MODEL_LOCATION = f"gs://andrew-seldon/sandbox/spacy_pyfunc"


mldeployment = {
  "apiVersion": "machinelearning.seldon.io/v1alpha2",
  "kind": "SeldonDeployment",
  "metadata": {
    "name": f"{DEPLOYMENT_NAME}",
    "namespace": f"{NAMESPACE}"
  },
  "spec": {
    "protocol": "v2",
    "name": f"{DEPLOYMENT_NAME}",
    "predictors": [
      {
        "graph": {
          "children": [],
          "implementation": "MLFLOW_SERVER",
          "modelUri": f"{MODEL_LOCATION}",
          "name": f"{DEPLOYMENT_NAME}-container"
        },
        "name": "default",
        "replicas": 1
      }
    ]
  }
}

You can now invoke the `SeldonDeploymentsApi` and create a new `SeldonDeployment`.

In [None]:
deployment_api = SeldonDeploymentsApi(auth())
deployment_api.create_seldon_deployment(namespace=NAMESPACE, mldeployment=mldeployment)

We can now access the Seldon Deploy cluster and view our freshly created deployments using the link below.

⚠️ Make sure to replace the `XXXXX` in the URL below with the cluster domain name. 

http://XXXXX/seldon-deploy/

The username and password for accessing the cluster will be shared during the session. 

We can test sending a request to our model running in production. We can do this through the `Predict` pane in the Seldon Deploy UI by pasting in the following json: 

```
{
    "parameters": {
        "content_type": "str"
    },
    "inputs": [
        {
            "name": "test",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["Roeselare verkocht gemotiveerd student gevaarlijke eindeloze Koerdische luchthavens getal CERA uiteindelijke gesprekken overbodig losse achterop ongeval grafische belanden vergezeld witloof 1013ZK. steen ein VU buitenstaander alleen ongeval gepaard begeven file regelmaat Kennedy schuld vertrok dozen Ford Landen Nick voorbereiden vult benoemd 1013ZK. avec deel conservatief loonnorm ken huisvesting oceaan westen onvermijdelijk Wil schrijven mogelijke stamt muren gesprekken vult vorming Ludwig bezorgd bestemmingen 1013ZK. felle Elizabeth Express evenmin Buffett vult herhaald ontwerpers vertegenwoordigers strand Mobutu bedenken rekenen géén terechtgekomen hoort volumes authentieke wilden voorbereiden 1013ZK. westen afstand tweeduizend besliste Planet Hendrik onderschreven Finance menselijke conservatief nauwe verklaart constructeur Sam boeiende stamt CERA financier verwachte integratie 2596 CX"]
        }
    ]
}
```
```
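
To try other texts in the `Predict` pane, the same payload can be generated rather than edited by hand. This is a small sketch; `build_v2_request` is a hypothetical helper that simply mirrors the structure of the JSON above, and the sample sentence is made up.

```python
import json


def build_v2_request(text, input_name="test"):
    """Wrap a single string in a V2 inference request like the one shown above."""
    return {
        "parameters": {"content_type": "str"},
        "inputs": [
            {
                "name": input_name,
                "shape": [1, 1],
                "datatype": "BYTES",
                "data": [text],
            }
        ],
    }


request_body = build_v2_request("Mijn postcode is 2596 CX")
# ensure_ascii=False keeps accented characters readable when pasting into the UI
print(json.dumps(request_body, indent=2, ensure_ascii=False))
```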

## 8. Add metadata and prediction schema to deployed MLFlow model 

Seldon Deploy has a model catalogue where all deployed models are automatically registered. The model catalogue can store custom metadata as well as prediction schemas for your models. 

Metadata promotes lineage across different machine learning systems, aids knowledge transfer between teams, and allows for faster deployment. Meanwhile, prediction schemas allow Seldon Deploy to automatically profile tabular data into histograms, allowing for filtering on features to explore trends. 

In order to effectively construct a prediction schema, Seldon has the [ML Prediction Schema project](https://github.com/SeldonIO/ml-prediction-schema). 

In [None]:
prediction_schema = {
  "requests": [
    {
      "name": "Input text",
      "type": "TEXT",
      "dataType": "STRING",
      "nCategories": "0",
      "categoryMap": {},
      "schema": [],
      "shape": []
    }
  ],
  "responses": [
    {
      "name": "Entity Dictionary",
      "type": "TEXT",
      "dataType": "STRING",
      "nCategories": "0",
      "categoryMap": {},
      "schema": [],
      "shape": []
    }
  ]
}

We can then add the prediction schema to the wider model catalogue metadata. This includes information such as the model storage location, the name, the version, the artefact type etc. The metadata tags and metrics that can be associated with a model are freeform and can therefore be determined based upon the use case which is being developed.

In [None]:
model_catalog_metadata = {
      "URI": MODEL_LOCATION,
      "name": f"{DEPLOYMENT_NAME}-model",
      "version": "v1.0",
      "artifactType": "MLFLOW",
      "taskType": "Entity Extraction",
      "tags": {
        "auto_created": "true",
        "author": f"{YOUR_NAME}"
      },
      "project": "default",
      "prediction_schema": prediction_schema
    }

model_catalog_metadata

Next, using the `ModelMetadataServiceApi`, we can add this to the model which we have just created in Seldon.

In [None]:
metadata_api = ModelMetadataServiceApi(auth())
metadata_api.model_metadata_service_update_model_metadata(model_catalog_metadata)

## 9. Configure Maximum Mean Discrepancy (MMD) Drift Detector

Although powerful, modern machine learning models can be sensitive. Seemingly subtle changes in a data distribution can destroy the performance of otherwise state-of-the-art models, which can be especially problematic when ML models are deployed in production. 

Drift can be classified into the following types:

- **Covariate drift**: Also referred to as input drift, this occurs when the distribution of the input data has shifted `P(X) != Pref(X)`, whilst `P(Y|X) = Pref(Y|X)`. This may result in the model giving unreliable predictions.

- **Prior drift**: Also referred to as label drift, this occurs when the distribution of the outputs has shifted `P(Y) != Pref(Y)`, whilst `P(X|Y) = Pref(X|Y)`. This can affect the model’s decision boundary, as well as the model’s performance metrics.

- **Concept drift**: This occurs when the process generating `Y` from `X` has changed, such that `P(Y|X) != Pref(Y|X)`. It is possible that the model might no longer give a suitable approximation of the true process.

---

In this example we will use the Maximum Mean Discrepancy (MMD) method for Drift Detection. Covariate or input drift detection relies on creating a distance measure between two distributions; a reference distribution and a new distribution. The MMD drift detector is no different; the mean embeddings of your features are used to generate the distributions and then the distance between them is measured. The training data is used to calculate the reference distribution, while the new distribution comes from your inference data. More information on the Maximum Mean Discrepancy (MMD) detector can be found in the alibi detect documentation [here](https://docs.seldon.io/projects/alibi-detect/en/stable/cd/methods/mmddrift.html).
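
To make the "distance between two distributions" idea tangible, here is a small worked example in plain Python. It is a simplified sketch and not Alibi Detect's implementation: it uses a biased estimate of the squared MMD with a Gaussian (RBF) kernel of fixed bandwidth, a basic permutation test for the p-value, and made-up 2-D points in place of transformer embeddings. All function names are hypothetical.

```python
import math
import random


def rbf_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def mmd_squared(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between samples x and y."""
    def mean_kernel(p, q):
        return sum(rbf_kernel(a, b, sigma) for a in p for b in q) / (len(p) * len(q))
    return mean_kernel(x, x) + mean_kernel(y, y) - 2 * mean_kernel(x, y)


def permutation_p_value(x, y, n_permutations=200, sigma=1.0, seed=0):
    """Share of random re-splits of the pooled data whose MMD is at least the observed one."""
    rng = random.Random(seed)
    observed = mmd_squared(x, y, sigma)
    pooled = list(x) + list(y)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mmd_squared(pooled[:len(x)], pooled[len(x):], sigma) >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_permutations


# Made-up 2-D "embeddings"; in the notebook these would come from the transformer.
rng = random.Random(42)
reference = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
repeated = [reference[0]] * 30  # a single example repeated many times

mmd_same = mmd_squared(reference, reference)
mmd_repeated = mmd_squared(reference, repeated)
p_repeated = permutation_p_value(reference, repeated)

print("MMD^2, reference vs itself:  ", mmd_same)
print("MMD^2, reference vs repeated:", mmd_repeated)
print("p-value, reference vs repeated:", p_repeated)
```

A sample compared with itself gives a distance of zero, while the batch made of one repeated point is clearly separated from the reference, so its p-value falls below a 0.05 threshold and drift would be flagged.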

Firstly, we will define and load a tokenizer. We will use `BERTje` (`GroNLP/bert-base-dutch-cased`) as our Hugging Face transformer here. More information can be found [here](https://huggingface.co/GroNLP/bert-base-dutch-cased#benchmarks). `BERTje` is a Dutch pre-trained BERT model developed at the University of Groningen. 

In [None]:
model_name = "GroNLP/bert-base-dutch-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

Crucially, the pre-processing which we perform when feeding data to our drift detector does not necessarily have to match the pre-processing being used by the model. This means that the embedding method which generates the best results for the model and the drift detector can be controlled independently of one another.

In [None]:
emb_type = 'hidden_state'
n_layers = 8
layers = [-i for i in range(1, n_layers + 1)]  # the last 8 hidden layers: [-1, -2, ..., -8]

embedding = TransformerEmbedding(model_name, emb_type, layers)

In [None]:
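# x_ref is assumed to hold the reference texts (e.g. a list of strings); it is not defined earlier in this notebook
tokens = tokenizer(list(x_ref), padding=True, return_tensors='tf')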
x_emb = embedding(tokens)
print(x_emb.shape)

Alibi Detect then allows us to construct a preprocessing function by bringing together each of these different components into a single function which can be readily serialised.

In [None]:
preprocess_fn = partial(preprocess_drift, model=embedding, tokenizer=tokenizer, max_len=512, batch_size=32)

Finally, we can fit the MMD detector on our reference data set. 

In [None]:
cd = MMDDrift(x_ref, p_val=.05, preprocess_fn=preprocess_fn, input_shape=(512,))

We can then observe whether drift is flagged on 2 different batches of data: 

- `batch_0` containing the first 100 data points from the reference data set. 
- `batch_1` containing a single example repeated 100 times.

In [None]:
batch_0 = x_ref[:100]
batch_1 = [x_ref[0]] * 100

In [None]:
preds_cd = cd.predict(batch_0)
labels = ['No!', 'Yes!']
print('Drift? {}'.format(labels[preds_cd['data']['is_drift']]))
print('p-value: {}'.format(preds_cd['data']['p_val']))
print('Drift Distance: {}'.format(preds_cd['data']['distance']))

In [None]:

preds_cd = cd.predict(batch_1)
labels = ['No!', 'Yes!']
print('Drift? {}'.format(labels[preds_cd['data']['is_drift']]))
print('p-value: {}'.format(preds_cd['data']['p_val']))
print('Drift Distance: {}'.format(preds_cd['data']['distance']))
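
The two cells above repeat the same reporting code, so it can be folded into a helper. The sketch below relies only on the `data` keys used above (`is_drift`, `p_val`, `distance`); the function name `summarise_drift` is made up and the prediction dictionary is hand-written with hypothetical values so that the snippet runs on its own.

```python
def summarise_drift(preds):
    """Format the fields of a drift prediction dictionary as a one-line report."""
    data = preds["data"]
    verdict = "Yes!" if data["is_drift"] else "No!"
    return f"Drift? {verdict} (p-value: {data['p_val']}, distance: {data['distance']})"


# Hypothetical prediction, shaped like the output of cd.predict(...)
fake_preds = {"data": {"is_drift": 1, "p_val": 0.0, "distance": 0.42}}
print(summarise_drift(fake_preds))
```

In the notebook this would be called as `print(summarise_drift(cd.predict(batch_0)))`.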

Now we will save our drift detector to a Google Storage bucket, in a subfolder corresponding to the `YOUR_NAME` variable. We can then deploy the drift detector directly from object storage with Seldon Deploy.

In [None]:
save_detector(cd, "ee-dd")

In [None]:
!gsutil cp -r ee-dd gs://kelly-seldon/entity-extractor/dd/{YOUR_NAME}/ee-dd

## 10. Deploy Drift Detector with Seldon Deploy

Finally, we can use the Seldon Deploy SDK to deploy our newly configured drift detector. We can define the config for the drift detector and then call the `DriftDetectorApi` to create the drift detector seldon deployment. 

In [None]:
DD_URI = f"gs://kelly-seldon/entity-extractor/dd/{YOUR_NAME}/ee-dd/"
DD_NAME = "ee-dd"

dd_config = {'config': {'basic': 
                        {'drift_batch_size': '5',
                         'storage_uri': DD_URI},
                        'deployment': {'protocol': 'kfserving.http'}
                        },
             'deployment_name': DEPLOYMENT_NAME,
             'detector_type': 'drift',
             'name': DD_NAME,
             'namespace': NAMESPACE
            }

In [None]:
dd_api = DriftDetectorApi(auth())
dd_api.create_drift_detector_seldon_deployment(name=DEPLOYMENT_NAME, namespace=NAMESPACE, detector_data=dd_config)