<img src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="200"/>

# <center>Getting Started with the Arize Platform</center>
## <center>Investigating Embedding Drift in NLP: Named Entity Recognition</center>

**In this walkthrough, we are going to ingest embedding data and look at embedding drift.**

In this scenario, you are in charge of maintaining a Named Entity Recognition (NER) model. This simple model can automatically scan text, pull out some fundamental entities within it, and classify them into predefined categories: Person, Location, or Organization. You trained your NER model on text written in English (see [dataset](https://huggingface.co/datasets/arize-ai/xtreme_en_token_drift)). However, once the model was released into production, you noticed that its performance degraded over time.

Arize is able to surface the reason for this performance degradation. In this example, text including locations is under-represented in the training set. This label imbalance impacts the model's performance. You can surface and troubleshoot this issue by analyzing the _embedding vectors_ associated with the input text.

It is worth noting that, according to our research, inspecting embedding drift can surface problems with your data before they cause performance degradation.

In this tutorial, we will start from scratch. We will:
* Download the data
* Preprocess the data
* Train the model
* Extract embedding vectors and predictions
* Log the inferences into the Arize Platform

We will be using [🤗 Hugging Face](https://huggingface.co/)'s open source libraries to make this process extremely easy. In particular, we will use:
* [🤗 Datasets](https://huggingface.co/docs/datasets/index): a library for easily accessing and sharing datasets, and evaluation metrics for Natural Language Processing (NLP), computer vision, and audio tasks.
* [🤗 Transformers](https://huggingface.co/docs/transformers/index): a library to easily download and use state-of-the-art pre-trained models. Using pre-trained models can lower your compute costs, reduce your carbon footprint, and save you time from training a model from scratch.

Before we start, if this is your first Arize Tutorial, we recommend that you complete [Send Data to Arize in 5 Easy Steps](https://colab.research.google.com/github/Arize-ai/client_python/blob/main/arize/examples/tutorials/Arize_Tutorials/Quick_Start/Send_data_to_Arize_in_5_easy_steps_classification.ipynb) before continuing. If you are familiar with sending data to Arize, it only takes a few more lines to send embedding data.

Let's get started!

# Step 0. Setup and Getting the Data

We will first install 🤗 Hugging Face's `datasets` and `transformers` libraries, mentioned above. In addition, we will import some metrics from `seqeval`, an open-source library well suited to sequence labeling evaluation. Find out more [here](https://github.com/chakki-works/seqeval).

We'll explain each of the imports below as we use them through this tutorial.


## Install Dependencies and Import Libraries 📚

In [None]:
!pip -q install datasets transformers seqeval arize

import numpy as np
import pandas as pd
import torch
from datasets import load_dataset
from transformers import (AutoTokenizer,
                          AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          Trainer,
                          TrainingArguments
                          )
from seqeval.metrics import f1_score, accuracy_score

from datetime import datetime
import uuid
from arize.pandas.logger import Client, Schema
from arize.utils.types import Environments, ModelTypes, EmbeddingColumnNames

## Check if GPU is available
Here we use PyTorch to check whether a GPU is available. When appropriate, we will use PyTorch's `nn.Module.to()` method to ensure that the model runs on the GPU if we have one.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## **🌐 Download the Data**

The easiest way to load a dataset is from the [Hugging Face Hub](https://huggingface.co/datasets). There are already over 6000 datasets in over 100 languages on the Hub. The [arize-ai/xtreme_en_token_drift](https://huggingface.co/datasets/arize-ai/xtreme_en_token_drift) dataset has been crafted by Arize for this example notebook.

Thanks to Hugging Face 🤗 Datasets, we can download the dataset in one line of code. The `Dataset` object comes equipped with methods that make it very easy to inspect, pre-process, and post-process your data.

In [None]:
dataset = load_dataset("arize-ai/xtreme_en_token_drift")

You can select the splits of the dataset as you would in a dictionary.

In [None]:
train_ds, val_ds, prod_ds = dataset['training'], dataset['validation'], dataset['production']

## Inspect the Data

It is often convenient to convert a `Dataset` object to a Pandas `DataFrame` so we can access high-level APIs for data visualization. 🤗 Datasets provides a `set_format()` method that allows us to change the output format of the `Dataset`. This does not change the underlying data format, an Arrow table. When the `DataFrame` format is no longer needed, we can reset the output format using `reset_format()`.

In [None]:
train_ds.set_format("pandas")
display(train_ds[:].head())
train_ds.reset_format()

Let's also take a look at the categories we will be classifying into:

In [None]:
tags = train_ds.features["ner_tags"].feature
tags

These tag labels follow the [IOB2 format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). Let's assume you have an entity recognized as person, organization or location. The `B-` tag is applied to the first token of that _chunk_ of tokens. For the rest of tokens, the `I-` tag is used. If the word is not classified as any of the 3 options of this example, we give it the label `O`.

You will see a sample sentence classified using these tag labels later in this notebook.
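
In the meantime, here is a small hand-tagged illustration of the IOB2 convention. The sentence, its tags, and the helper `is_valid_iob2` are made up for this sketch and are not part of the dataset or of any library.

```python
# Hypothetical sentence, already split into words, with hand-written IOB2 tags.
example_words = ["Julia", "Smith", "works", "at", "Imperial", "College", "in", "London"]
example_tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]


def is_valid_iob2(tag_sequence):
    """Return True if every I- tag continues a chunk of the same entity type."""
    previous = "O"
    for tag in tag_sequence:
        if tag.startswith("I-"):
            # An I- tag must directly follow a B- or I- tag of the same type.
            if previous == "O" or previous[2:] != tag[2:]:
                return False
        previous = tag
    return True


for word, tag in zip(example_words, example_tags):
    print(f"{word:10s} {tag}")

print(is_valid_iob2(example_tags))    # True
print(is_valid_iob2(["O", "I-LOC"]))  # False: I- tag without a preceding B- tag
```

Note how the two-word entities ("Julia Smith", "Imperial College") start with a `B-` tag and continue with an `I-` tag of the same type.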

# Step 1. Setting up your Named Entity Recognition Model

## Pre-processing the data

Before being able to input our data into our model for fine-tuning we need to perform some transformations: *__tokenization__* and *__label alignment__*.

### Tokenization

Transformer models like __*XLM-RoBERTa*__ cannot receive raw strings as input. We need to _tokenize_ and _encode_ the text as numerical vectors. We will perform _Subword Tokenization_, which is learned from the pre-training corpus. Its goal is to break complex words (or misspellings) into smaller units from which the model can learn, and to represent common words as unique entities, keeping the length of the input to a reasonable size.
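
To build some intuition for what "breaking words into smaller units" means, here is a toy sketch of greedy longest-match splitting (similar in spirit to WordPiece inference). This is **not** the SentencePiece model that XLM-RoBERTa actually uses, and the vocabulary `toy_vocab` is invented for the example.

```python
def greedy_subword_split(word, vocab, unknown="<unk>"):
    """Split a word into the longest vocabulary pieces, scanning left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece starting at this position is known: give up on the word.
            return [unknown]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabulary, for illustration only.
toy_vocab = {"the", "token", "ization", "un", "believ", "able"}

print(greedy_subword_split("the", toy_vocab))           # ['the']
print(greedy_subword_split("tokenization", toy_vocab))  # ['token', 'ization']
print(greedy_subword_split("unbelievable", toy_vocab))  # ['un', 'believ', 'able']
```

The common word stays whole, while the rarer words are represented by pieces the model has seen in other contexts.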

🤗 Transformers provides the `AutoTokenizer` class, which allows us to quickly download the tokenizer required by the pre-trained model of our choosing.

In this case, we will use the following __checkpoint__: `xlm-roberta-base`.

In [None]:
model_ckpt = 'xlm-roberta-base'

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

Next, let's define a function to tokenize the examples in the dataset. The `padding` and `truncation` options are added to keep the inputs to a consistent length. Shorter sequences are _padded_ and longer ones are _truncated_.

In [None]:
def tokenize(batch, max_length=512):
    return tokenizer(batch["split_text"], padding=True, truncation=True, max_length=max_length, is_split_into_words=True)

### Label alignment

The tokenizer returns `input_ids` and `attention_mask` for the model's inputs, both padded and truncated if the options above are set to `True`. However, the `ner_tags` column is not of the same length. This is because the `ner_tags` are assigned to each word. But after tokenization, words are split into tokens. Hence, there is a misalignment between the tags and the tokens.

To fix this, we need to assign a tag to each token. The following function does the trick.

In [None]:
def align_labels(batch, tokenized_inputs):
    labels = []
    for idx, tags in enumerate(batch["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=idx)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None or word_idx == previous_word_idx:
                label_ids.append(-100)
            else:
                label_ids.append(tags[word_idx])
            previous_word_idx = word_idx
        labels.append(label_ids)
    return labels

It uses the `word_ids` to determine which tokens belong to which word. If the word ID is `None`, the token does not belong to a word: it is a special token such as `<s>` or `</s>` (XLM-RoBERTa's counterparts of BERT's `[CLS]` and `[SEP]`) or a padding token. Moreover, some words are split into several tokens. Since the lowest granularity of an entity is the word, all tokens in a word must share the same tag. To achieve this, we keep the label only on the first token of each word and mask the rest.
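
As a worked example of this logic, consider the sketch below. The token split, the `word_ids` list, and the tag IDs are hypothetical values written by hand; a real tokenizer may split the words differently.

```python
# Hypothetical tokenization of the words ["Julia", "lives", "in", "Londonderry"].
# word_ids maps each token to the index of the word it came from;
# None marks special tokens added by the tokenizer.
tokens = ["<s>", "▁Julia", "▁lives", "▁in", "▁London", "derry", "</s>"]
word_ids = [None, 0, 1, 2, 3, 3, None]
word_tags = [1, 0, 0, 5]  # hypothetical tag IDs, one per word

IGNORE = -100

aligned = []
previous_word_id = None
for word_id in word_ids:
    if word_id is None or word_id == previous_word_id:
        # Special token, or a non-first piece of a word: mask it.
        aligned.append(IGNORE)
    else:
        # First token of a new word: it carries the word's tag.
        aligned.append(word_tags[word_id])
    previous_word_id = word_id

print(aligned)  # [-100, 1, 0, 0, 5, -100, -100]
```

Only the first piece of "Londonderry" keeps its tag; the second piece and the special tokens are masked.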

In short, we want the model to ignore tokens not belonging to words and sub-word tokens excluding the first by assigning the label -100. This will tell `PyTorch` to ignore these labels when it computes the loss.
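
To see what "ignoring" a label means for the loss, here is a minimal pure-Python sketch of a cross-entropy that skips positions labeled -100. It mimics the idea behind the `ignore_index` option of PyTorch's cross-entropy loss, but it is an illustration, not PyTorch's implementation; the function name and the toy logits are assumptions.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over the positions whose label is not ignore_index."""
    losses = []
    for token_logits, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this token contributes nothing to the loss
        log_norm = math.log(sum(math.exp(z) for z in token_logits))
        losses.append(log_norm - token_logits[label])
    # Returning 0.0 when everything is ignored is a choice made for this sketch.
    return sum(losses) / len(losses) if losses else 0.0


# Three tokens, two classes; only the middle token has a real label.
toy_logits = [[5.0, 0.0], [0.0, 0.0], [0.0, 5.0]]
toy_labels = [-100, 0, -100]

loss = masked_cross_entropy(toy_logits, toy_labels)
print(round(loss, 4))  # 0.6931, i.e. log(2), from the middle token alone
```

Changing the logits of the ignored tokens leaves the loss untouched, which is exactly the behavior we want for special tokens and non-first sub-word tokens.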


### Apply to dataset

Now we apply the transformations above to the entire dataset using the `map()` method.

In [None]:
def preprocess(batch):
    max_length = 128 # max length of the tokenized sequence
    tokenized_inputs = tokenize(batch, max_length)
    tokenized_inputs["labels"] = align_labels(batch, tokenized_inputs)
    return tokenized_inputs

In [None]:
process_batch_size = 100

train_ds = train_ds.map(preprocess, batched=True, batch_size=process_batch_size, remove_columns=['ner_tags'])
val_ds = val_ds.map(preprocess, batched=True, batch_size=process_batch_size, remove_columns=['ner_tags'])
prod_ds = prod_ds.map(preprocess, batched=True, batch_size=process_batch_size, remove_columns=['ner_tags'])

In the following view of our dataset, two columns have appeared:
* `input_ids`: A numerical identifier to which each token has been mapped.
* `attention_mask`: Array of 1s and 0s, allowing the model to ignore the padded parts of the inputs.

Notice that we have replaced the `ner_tags` column with `labels`. The information is the same, but now the `labels` are aligned with `input_ids`.

In [None]:
train_ds.set_format(type="pandas")
display(train_ds[:].head())
train_ds.reset_format()

## Get the model

Similar to how we obtained the tokenizer, 🤗 Transformers provides the `AutoModelForTokenClassification` class, which allows us to quickly download a pre-trained transformer model with a token classification [task head](https://huggingface.co/course/en/chapter2/2?fw=pt#model-heads-making-sense-out-of-numbers) on top. The pre-trained model to use in this tutorial is [XLM-RoBERTa](https://huggingface.co/xlm-roberta-base). The weights of the token classification task head will be randomly initialized.

It is important to pass `output_hidden_states = True` to be able to compute the embedding vectors associated with the text (explained below).


In addition, we will need to provide the mapping of each tag label to a tag ID and vice versa.


In [None]:
index2tag = {idx: tag for idx, tag in enumerate(tags.names)}
tag2index = {tag: idx for idx, tag in enumerate(tags.names)}

In [None]:
model_name = "XLM-RoBERTa-xtreme-en-token-drift"
SKIP_TRAINING = False # Make True if you want to skip training

_NOTE_: You may skip the fine-tuning section if you would like to use a model that Arize has already fine-tuned for you. To skip, set `SKIP_TRAINING = True` and go ahead to [_B) Download the model_](#B\)-Download-the-fine-tuned-model).

### A) Fine-tune the model

Let's download the pre-trained model:

In [None]:
model = (AutoModelForTokenClassification
         .from_pretrained(model_ckpt,
                          num_labels=tags.num_classes,
                          id2label=index2tag,
                          label2id=tag2index,
                          output_hidden_states=True)
         .to(device))

Further, we use the `TrainingArguments` class to define the training parameters. This class stores a lot of information and gives you control over the training and evaluation.

In [None]:
training_batch_size = 32
training_epochs = 3
logging_steps = len(train_ds) // training_batch_size

training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=training_epochs,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=training_batch_size,
                                  per_device_eval_batch_size=training_batch_size,
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps,
                                  log_level="error",
                                  optim="adamw_torch",
                                  )

When evaluating an NER model, _all_ words of an entity need to be predicted correctly in order for a prediction to be counted as correct. For instance, if our model prediction was `New[LOC] York[PER]` it would not be correct. We would need both words predicted correctly, `New[LOC] York[LOC]`, for the entity prediction to be correct. For this purpose, we use the library `seqeval` ([github.com/chakki-works/seqeval](https://github.com/chakki-works/seqeval)).

The following functions will:
1. Remove the prediction and actual labels marked with -100, which should be ignored when computing metrics.
2. Compute the evaluation metrics on the two sequences of tag labels.
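
To make the "all words of an entity must match" rule concrete, here is a simplified, self-contained sketch of entity-level F1. It is not `seqeval`'s implementation (which may treat malformed tag sequences differently and supports several averaging modes); the helpers `extract_entities` and `entity_f1` are hypothetical names used only here.

```python
def extract_entities(tags):
    """Return a set of (entity_type, start, end) spans from an IOB2 tag list."""
    entities = set()
    start, entity_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == entity_type
        if start is not None and not continues:
            entities.add((entity_type, start, i - 1))
            start, entity_type = None, None
        if tag.startswith("B-"):
            start, entity_type = i, tag[2:]
    return entities


def entity_f1(true_tags, pred_tags):
    """F1 where a predicted entity counts only if type and full span match."""
    true_entities = extract_entities(true_tags)
    pred_entities = extract_entities(pred_tags)
    correct = len(true_entities & pred_entities)
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# "New York" followed by an unrelated word and a person name.
y_true = ["B-LOC", "I-LOC", "O", "B-PER"]
y_pred_partial = ["B-LOC", "I-PER", "O", "B-PER"]  # "York" mislabeled

print(entity_f1(y_true, y_true))          # 1.0
print(entity_f1(y_true, y_pred_partial))  # 0.5
```

Even though three of the four tokens are tagged correctly in the second case, the "New York" entity is counted as wrong, so only one of the two entities matches.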

In [None]:
def remove_labels_of_ignored_tokens(predictions, label_ids):
    batch_preds = np.argmax(predictions, axis=2)
    batch_size, seq_len = batch_preds.shape
    actuals_list, preds_list = [], []

    for batch_idx in range(batch_size):
        actuals, preds = [], []
        for seq_idx in range(seq_len):
            # Ignore label IDs = -100
            if label_ids[batch_idx, seq_idx] != -100:
                actuals.append(index2tag[label_ids[batch_idx][seq_idx]])
                preds.append(index2tag[batch_preds[batch_idx][seq_idx]])

        actuals_list.append(actuals)
        preds_list.append(preds)

    return preds_list, actuals_list

def compute_metrics(pred):
    y_pred, y_true = remove_labels_of_ignored_tokens(pred.predictions[0],
                                                     pred.label_ids)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average="weighted")
        }

Next, we need a _data collator_ so that we can pad each input sequence to match the length of the largest sequence in a batch.
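
Conceptually, the padding step looks like the toy function below. This is only a sketch of the idea: the real `DataCollatorForTokenClassification` also converts the batch to tensors and handles more options. The function name `pad_batch`, the token IDs, and the pad token ID are assumptions made for the example.

```python
def pad_batch(batch, pad_token_id, label_pad_id=-100):
    """Pad every example in the batch to the length of its longest sequence."""
    max_len = max(len(example["input_ids"]) for example in batch)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for example in batch:
        n_tokens = len(example["input_ids"])
        n_pad = max_len - n_tokens
        padded["input_ids"].append(example["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to.
        padded["attention_mask"].append([1] * n_tokens + [0] * n_pad)
        # Padded positions get -100 so they are ignored by the loss.
        padded["labels"].append(example["labels"] + [label_pad_id] * n_pad)
    return padded


# Two hypothetical examples of different lengths.
toy_batch = [
    {"input_ids": [0, 11, 12, 2], "labels": [-100, 1, 0, -100]},
    {"input_ids": [0, 21, 2], "labels": [-100, 5, -100]},
]

padded_batch = pad_batch(toy_batch, pad_token_id=1)
print(padded_batch["input_ids"])       # [[0, 11, 12, 2], [0, 21, 2, 1]]
print(padded_batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
print(padded_batch["labels"])          # [[-100, 1, 0, -100], [-100, 5, -100, -100]]
```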

In [None]:
data_collator = DataCollatorForTokenClassification(tokenizer)


Finally, we can fine-tune our model using the `Trainer` class.

In [None]:
if not SKIP_TRAINING:
    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        tokenizer=tokenizer,
    )

    print("Evaluation before training")
    eval_results = trainer.evaluate(eval_dataset=val_ds)  # avoid shadowing the built-in `eval`
    eval_df = pd.DataFrame({'Epoch': 0, 'Validation Loss': eval_results['eval_loss'], 'Accuracy': eval_results['eval_accuracy'], 'F1': eval_results['eval_f1']}, index=[0])
    display(eval_df)
    print(" ")

    torch.cuda.empty_cache() # Free up some memory

    print("\nTraining...")
    trainer.train()

### B) Download the fine-tuned model

If you decided to skip the fine-tuning in section A, you can download the already fine-tuned model [arize-ai/XLM-RoBERTa-xtreme-en-token-drift](https://huggingface.co/arize-ai/XLM-RoBERTa-xtreme-en-token-drift) from Arize's page in the Hugging Face Hub.


In [None]:
if SKIP_TRAINING: # Make sure you set SKIP_TRAINING = True if you wanted to skip training
    model_ckpt = f"arize-ai/{model_name}"

    model = (AutoModelForTokenClassification
            .from_pretrained(model_ckpt,
                             num_labels = tags.num_classes,
                             output_hidden_states=True
                             )
            .to(device))

## Try out the model

If you want to get a feel for how the fine-tuned model performs on text of your own, you can try it out with the following helper function.

In [None]:
def show_tags(split_text, index2tag, tokenizer, model):
    # Get tokens with special characters
    tokens = tokenizer(list(split_text), is_split_into_words=True).tokens()
    # Encode the sequence into IDs
    input_ids = tokenizer(list(split_text), is_split_into_words=True, return_tensors="pt").input_ids.to(device)
    # Get predictions as distribution over 7 possible classes
    logits = model(input_ids).logits
    # Take argmax to get most likely class per token
    predictions = torch.argmax(logits, dim=2).squeeze().cpu().numpy()
    # Convert to DataFrame
    preds = [index2tag[p] for p in predictions]

    return pd.DataFrame([tokens, preds], index=["Tokens", "Predicted Tags"])

In [None]:
text = "My name is Julia, I study at Imperial College, in London".split(" ")

pd.set_option("display.max_columns", 50)
show_tags(text, index2tag, tokenizer, model)

As we can see in the table above, our model fails to recognize locations. This is not surprising, given that the dataset we used to train it was crafted so that the `LOC` tokens are heavily under-represented in the training set.

We will see how Arize is able to surface this problem once the data is ingested into the platform.

If you'd like to ingest data from a model without this problem you can do one of the following:
* Substitute the current model with a model fine-tuned on a dataset without this representation problem, and keep going from here (note that it is missing `-token-drift`).
```
model_ckpt = "arize-ai/XLM-RoBERTa-xtreme-en"
model = (AutoModelForTokenClassification
        .from_pretrained(model_ckpt,
                         num_labels = tags.num_classes,
                         output_hidden_states=True
                        )
        .to(device))
```

* If you want to obtain a fine-tuned model like the one above, substitute the dataset downloaded at the beginning of this tutorial for one without this problem (note that it is missing `_token_drift`). Then, run the notebook again executing the [fine-tuning step](#A\)-Fine-tune-the-model).
```
dataset = load_dataset("arize-ai/xtreme_en")
```

# Step 2. Post-Processing your data

## Get model outputs
Now we will extract the prediction labels and the text embedding vectors. The latter are formed from the hidden states of our pre-trained (and then fine-tuned) model. We will choose the last hidden layer to compute our embeddings. We get the whole layer now so in a later function we can compute the desired token embedding vector from it.

In [None]:
def get_model_outputs(batch):
    # Keep only the tensors the model expects (e.g. input_ids, attention_mask) and move them to the device
    inputs = {k:v.to(device) for k,v in batch.items() if k in tokenizer.model_input_names}

    with torch.no_grad():
        # Pass data through model
        output = model(**inputs)
        predicted_labels = torch.argmax(output.logits, dim=2).cpu().numpy()
        # Keep only the last hidden layer: (batch_size, seq_length, hidden_size)
        last_hidden_state = output.hidden_states[-1].cpu().numpy()

    return {"pred_labels": predicted_labels, "last_hidden_state": last_hidden_state}

In [None]:
train_ds.set_format("torch", columns=["input_ids", "attention_mask"])
train_ds = train_ds.map(get_model_outputs, batched=True, batch_size=process_batch_size)

val_ds.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
val_ds = val_ds.map(get_model_outputs, batched=True, batch_size=process_batch_size)

prod_ds.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
prod_ds = prod_ds.map(get_model_outputs, batched=True, batch_size=process_batch_size)

## Expand the dataset

Each record sent to the Arize platform can contain one prediction label and one actual label. In NER, however, each input sequence yields one predicted label per token. Hence, we need to perform a few transformations on the dataset to obtain individual records, each containing one prediction and one actual. Since this can dramatically increase the size of our example, we will filter out the cases where both the prediction and the actual agree that the label is "O" (tokens that are not part of a Person, Organization, or Location entity).
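
Before the full implementation, here is a toy sketch of the expansion and filtering on a single sequence. The sentence, labels, and the function `expand_sequence` are hypothetical and much simpler than the `postprocess` function used in this notebook (for instance, no embeddings are attached).

```python
def expand_sequence(words, word_ids, labels, pred_labels):
    """Turn one tokenized sequence into one record per token worth logging."""
    records = []
    for word_id, label, pred in zip(word_ids, labels, pred_labels):
        if label == "IGN":
            continue  # special token or non-first sub-word token
        if label == "O" and pred == "O":
            continue  # both agree there is no entity: skip to save space
        marked = list(words)
        marked[word_id] = ">" + words[word_id] + "<"  # highlight the word of interest
        records.append({"text": " ".join(marked), "label": label, "pred_label": pred})
    return records


toy_words = ["Julia", "lives", "in", "London"]
toy_word_ids = [None, 0, 1, 2, 3, None]
toy_labels = ["IGN", "B-PER", "O", "O", "B-LOC", "IGN"]
toy_preds = ["O", "B-PER", "O", "O", "O", "O"]

toy_records = expand_sequence(toy_words, toy_word_ids, toy_labels, toy_preds)
for record in toy_records:
    print(record)
```

One input sequence becomes two records: a correct `B-PER` prediction for ">Julia<" and a missed `B-LOC` for ">London<". The two tokens where prediction and actual agree on `O` are dropped.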

In [None]:
index2tag[-100] = "IGN"

def find_label_indexes(labels, pred_labels):
    token_index_list = []
    for i in range(1, len(labels)-1):
        label = labels[i]
        pred_label = pred_labels[i]
        if (label == "IGN") or (label == "O" and pred_label == "O"):
            continue
        token_index_list.append(i)
    return token_index_list

def filter_labels(labels, filters):
    return [labels[index] for index in filters]

def filter_word_ids(word_ids, filters):
    return [word_ids[index] for index in filters]

def get_token_embeddings(last_hidden_state, filters):
    return [np.stack(last_hidden_state)[index, :] for index in filters]

def mark_word_in_text(split_text, word_id):
    marked_text = split_text[:word_id].tolist()
    marked_text.append(">" + split_text[word_id] + "<")
    marked_text += split_text[word_id+1:].tolist()
    return " ".join(marked_text)


def postprocess(df):
    # Helper column (will be deleted) so we can truncate the different arrays on each row
    df['N'] = df['attention_mask'].map(lambda x: x.sum())
    # Truncate the input_ids, labels, and pred_labels columns using the attention_mask
    df['input_ids'] = df.apply(lambda row: row['input_ids'][:row['N']], axis=1)
    df['labels'] = df.apply(lambda row: row['labels'][:row['N']], axis=1)
    df['pred_labels'] = df.apply(lambda row: row['pred_labels'][:row['N']], axis=1)

    # Translate the label IDs to their tag name
    df["labels"] = df["labels"].apply(lambda x: [index2tag[index] for index in x])
    df["pred_labels"] = df["pred_labels"].apply(lambda x: [index2tag[index] for index in x])

    # Get word_ids, this will be used to mark what word each prediction/actual will be corresponding to
    df['word_ids'] = df['split_text'].map(lambda split_text: tokenizer(list(split_text), is_split_into_words=True).word_ids())

    # We create a filter so we don't send to Arize the "IGN" labels and the redundant "O" labels (when both prediction and actuals agree to be "O"),
    # "O" labels are assigned to words that are not "LOC", "PER", or "ORG".
    # index_filter will be the indexes of those tokens to be considered and sent into Arize
    df['index_filter'] = df.apply(lambda row: find_label_indexes(row['labels'], row['pred_labels']), axis=1)

    # Use `index_filter` to filter labels and word_ids
    df['labels'] = df.apply(lambda row: filter_labels(row['labels'], row['index_filter']), axis=1)
    df['pred_labels'] = df.apply(lambda row: filter_labels(row['pred_labels'], row['index_filter']), axis=1)
    df['word_ids'] = df.apply(lambda row: filter_word_ids(row['word_ids'], row['index_filter']), axis=1)

    # We will get the token embeddings from the last_hidden_state layer, specifically using the token index from index_filter
    df['token_embeddings'] = df.apply(lambda row: get_token_embeddings(row['last_hidden_state'], row['index_filter']), axis=1)

    # We can now drop some columns that are no longer needed
    df.drop(columns=['input_ids', 'attention_mask','last_hidden_state', 'N', 'index_filter'], inplace=True)

    # We will expand our dataset to get predictions, actuals, and embeddings corresponding to individual tokens
    df = df.explode(['labels','pred_labels','word_ids','token_embeddings'])

    # Finally, we create a "text" column to be the raw text from which the prediction takes place, marking the word of interest
    # for this inference event
    df['split_text'] = df.apply(lambda row: mark_word_in_text(row['split_text'], row['word_ids']), axis=1)
    df.rename(columns={"split_text": "text"}, inplace=True)

    return df.reset_index(drop=True)

From this point forward, it is convenient to use Pandas DataFrames. We can do so easily using the format methods we have already seen.

In [None]:
train_ds.set_format("pandas")
train_df = postprocess(train_ds[:])

val_ds.set_format("pandas")
val_df = postprocess(val_ds[:])

prod_ds.set_format("pandas")
prod_df = postprocess(prod_ds[:])

Let's visualize our dataset after these transformations:

In [None]:
train_df.head()

# Step 3. Prepare your data to be sent to Arize


## Update the timestamps

The data that you are working with was constructed in April of 2022. Hence, we will update the timestamps so they are current at the time that you're sending data to Arize.

In [None]:
last_ts = max(prod_df['prediction_ts'])
now_ts = datetime.timestamp(datetime.now())
delta_ts = now_ts - last_ts

train_df['prediction_ts'] = (train_df['prediction_ts'] + delta_ts).astype(float)
val_df['prediction_ts'] = (val_df['prediction_ts'] + delta_ts).astype(float)
prod_df['prediction_ts'] = (prod_df['prediction_ts'] + delta_ts).astype(float)

## Add prediction ids

The Arize platform uses prediction IDs to link a prediction to an actual. Visit the [Arize documentation](https://docs.arize.com/arize/data-ingestion/model-schema/5.-prediction-id?q=prediction_id) for more details.

You can generate prediction IDs as follows:

In [None]:
def add_prediction_id(df):
    return [str(uuid.uuid4()) for _ in range(df.shape[0])]

In [None]:
train_df['prediction_id'] = add_prediction_id(train_df)
val_df['prediction_id'] = add_prediction_id(val_df)
prod_df['prediction_id'] = add_prediction_id(prod_df)

# Step 4. Sending Data into Arize 💫

## Select the columns we want to send to Arize (optional)

This step is not really necessary, since we will select the columns we want to send to Arize using the `Schema` definition (below). However, for the purpose of visibility, this is our final `DataFrame` with the data that will be sent to Arize.

In [None]:
arize_columns = [
    'prediction_id',
    'prediction_ts',
    'language',
    'text',
    'labels',
    'pred_labels',
    'token_embeddings'
    ]

train_df = train_df[arize_columns]
val_df = val_df[arize_columns]
prod_df = prod_df[arize_columns]

train_df.head()

## Import and Setup Arize Client

The first step is to set up the Arize client. After that we will log the data.

Copy the Arize `API_KEY` and `SPACE_KEY` from your admin page (shown below) to the variables in the cell below. We will also be setting up some metadata to use across all logging.

<img src="https://storage.googleapis.com/arize-assets/fixtures/copy-keys.png" width="700">

In [None]:
SPACE_KEY = "SPACE_KEY"
API_KEY = "API_KEY"
arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)
model_id = "NLP-demo-NER-token-drift" # Remove '-token-drift' if you chose a model without the drifting problem
model_version = "1.0"
model_type = ModelTypes.SCORE_CATEGORICAL
if SPACE_KEY == "SPACE_KEY" or API_KEY == "API_KEY":
    raise ValueError("❌ NEED TO CHANGE SPACE AND/OR API_KEY")
else:
    print("✅ Import and Setup Arize Client Done! Now we can start using Arize!")


Now that our Arize client is set up, let's go ahead and log all of our data to the platform. For more details on how **`arize.pandas.logger`** works, visit our documentation.

[![Buttons_OpenOrange.png](https://storage.googleapis.com/arize-assets/fixtures/Buttons_OpenOrange.png)](https://docs.arize.com/arize/sdks-and-integrations/python-sdk/arize.pandas)

## Define the Schema

A Schema instance specifies the column names for corresponding data in the dataframe. While we could define different Schemas for training and production datasets, the dataframes have the same column names, so the Schema will be the same in this instance.

To ingest non-embedding features, it suffices to provide a list of column names that contain the features in our dataframe. Embedding features, however, are a little bit different.

Arize allows you to ingest not only the embedding vector, but also the raw data associated with that embedding, or a URL link to that raw data. Therefore, up to 3 columns can be associated with the same _embedding object_*. To make this possible, Arize's SDK provides the `EmbeddingColumnNames` class, used below.

*NOTE: This is how we refer to the 3 possible pieces of information that can be sent as embedding objects:
* Embedding `vector` (required)
* Embedding `data` (optional): raw text, image, ...; associated with the embedding vector
* Embedding `link_to_data` (optional): link to the data associated with the embedding vector

Learn more [here](https://docs.arize.com/arize/data-ingestion/model-schema/7b.-embedding-features).

In [None]:
features = [
    'language',
]

embedding_features = [
    EmbeddingColumnNames(
        vector_column_name="token_embeddings",  # Will be name of embedding feature in the app
        data_column_name="text",
    ),
]

# Define a Schema() object for Arize to pick up data from the correct columns for logging
schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="pred_labels",
    actual_label_column_name="labels",
    feature_column_names=features,
    embedding_feature_column_names=embedding_features
)



## Log Training Data

In [None]:
# Logging Training DataFrame
response = arize_client.log(
    dataframe=train_df,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.TRAINING,
    schema=schema,
    sync=True
)


# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(f"❌ logging failed with response code {response.status_code}, {response.text}")
else:
    print(f"✅ You have successfully logged training set to Arize")


## Log Validation Data

In [None]:
# Logging Validation DataFrame
response = arize_client.log(
    dataframe=val_df,
    model_id=model_id,
    model_version=model_version,
    batch_id="validation",
    model_type=model_type,
    environment=Environments.VALIDATION,
    schema=schema,
    sync=True
)


# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(f"❌ logging failed with response code {response.status_code}, {response.text}")
else:
    print(f"✅ You have successfully logged validation set to Arize")


## Log Production Data

In [None]:
# send production data
response = arize_client.log(
    dataframe=prod_df,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.PRODUCTION,
    schema=schema,
    sync=True
)

if response.status_code != 200:
    print(f"❌ logging failed with response code {response.status_code}, {response.text}")
else:
    print(f"✅ You have successfully logged production set to Arize")

# Step 5. Confirm Data in Arize ✅
Note that the Arize platform takes about 15 minutes to index embedding data. While the model should appear immediately, the data will not show up until the indexing is complete. Feel free to head over to the **Data Ingestion** tab for your model to watch Arize work its magic!🔮

You will be able to see the predictions, actuals, and feature importances that have been sent in the last 30 minutes, last day or last week.

An example view of the Data Ingestion tab from a model, when data is sent continuously over 30 minutes, is shown in the image below.

<img src="https://storage.googleapis.com/arize-assets/fixtures/data-ingestion-tab.png" width="700">

# Step 6. Check the Embedding Data in Arize

First, set the baseline to the training set that we logged in the previous section.

<img src="https://storage.cloud.google.com/arize-assets/fixtures/embedding_setup_baseline.gif" width="700">

Once data is ingested and models contain embedding data, you will see it on the Model Overview page.

<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/NLP/NER-token-drift-overview.jpg" width="700">

Click on the Embedding Name or the Euclidean Distance value to see how your embedding data is drifting over time. The picture below shows the global Euclidean distance between your production set (at different points in time) and the baseline (which we set to be our training set).

We can see there is a period of a week where suddenly the distance is remarkably higher. When the drift distance is high, the tag proportions in our production set are different to those of the baseline. Conversely, when the drift distance is low, the tag proportions in our production set are similar to those of the baseline. More precisely, the LOC tags in the baseline were under-represented. Hence, when they are reasonably present in production, we observe higher drift values.
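
To build intuition for a drift value of this kind, here is a minimal sketch that measures the Euclidean distance between the mean (centroid) embedding of a baseline set and that of a production set. This is an assumption made for illustration: the exact computation Arize performs is not described in this notebook, and the toy 2-dimensional vectors below are invented, not taken from the dataset.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def centroid_distance(baseline, production):
    """Euclidean distance between the centroids of two embedding sets."""
    return math.dist(centroid(baseline), centroid(production))


# Toy embeddings for illustration only.
baseline_embeddings = [[0.0, 0.0], [2.0, 0.0]]    # centroid is [1.0, 0.0]
production_embeddings = [[4.0, 3.0], [4.0, 5.0]]  # centroid is [4.0, 4.0]

print(centroid_distance(baseline_embeddings, production_embeddings))  # 5.0
```

A distance near zero means the two sets are centered on the same region of the embedding space; a larger value means the production data has moved away from the baseline on average.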

<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/NLP/NER-token-drift-emb-0.jpg" width="700">
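
The link between drift and tag proportions described above can be checked directly on the labels. The sketch below compares the share of each tag in two label lists. The tag lists are invented for illustration, and the `PER` and `ORG` tag names are assumed abbreviations for Person and Organization; only `LOC` appears in the text above.

```python
from collections import Counter


def tag_proportions(tags):
    """Map each tag to its share of the given list of tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in counts.items()}


# Invented labels for illustration: LOC is rare in the baseline
# and common in production.
baseline_tags = ["PER", "PER", "ORG", "ORG", "ORG", "PER", "ORG", "LOC"]
production_tags = ["PER", "LOC", "LOC", "ORG", "LOC", "PER", "LOC", "ORG"]

print(tag_proportions(baseline_tags)["LOC"])    # 0.125
print(tag_proportions(production_tags)["LOC"])  # 0.5
```

A large gap in the share of a tag between the two sets, as with `LOC` here, is the kind of imbalance that would show up as a higher drift value.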

Below the drift tracking plot, you can find the UMAP visualization of your data for the selected point in time. At a point in time when the drift is low, the production data and our baseline (training) data are superimposed, which indicates that the model is seeing data in production similar to the data it was trained on.

<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/NLP/NER-token-drift-emb-1.jpg" width="700">

Next, select a point in time when the drift was high and generate a UMAP visualization in 2D. We can see that training and production data are superimposed for the most part, with the exception of two areas of production data that have no training data present. This indicates that the model is seeing data in production that is qualitatively different from the data it was trained on, which in this case causes performance degradation.

<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/NLP/NER-token-drift-emb-2.jpg" width="700">
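
Areas of production data with no training data nearby can also be approximated numerically. The sketch below flags production vectors whose nearest baseline vector is farther away than a chosen threshold. It is a rough illustration under stated assumptions: it works in the original embedding space rather than in the UMAP projection, the function names and the threshold are hypothetical, and the vectors are invented.

```python
import math


def nearest_baseline_distance(point, baseline):
    """Distance from one vector to its closest vector in the baseline."""
    return min(math.dist(point, candidate) for candidate in baseline)


def isolated_points(production, baseline, threshold):
    """Indices of production vectors with no baseline vector within threshold."""
    return [
        index
        for index, point in enumerate(production)
        if nearest_baseline_distance(point, baseline) > threshold
    ]


# Toy vectors for illustration only.
baseline = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
production = [[0.1, 0.1], [5.0, 5.0], [0.9, 0.1]]

print(isolated_points(production, baseline, threshold=1.0))  # [1]
```

This brute-force search compares every production vector with every baseline vector, so it is only practical for small samples; an approximate nearest-neighbor index would be a better fit for large datasets.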

You can also choose to color your data points by prediction label. This is very useful for gaining insight into how your model treats the inputs and for identifying potential flaws in its decision process. For instance, in the following image there are some purple data points close to the brown cluster. You can inspect those points and determine whether the model is wrong or there is a labeling problem. You can also identify patterns in the model's common mistakes, which will help you find the root cause of the problem.

<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/NLP/NER-token-drift-emb-3.jpg" width="700">

For further inspection, generate the UMAP visualization in 3D, and click _"Explore UMAP"_. With this view, you can interact with the dataset in 3D. You can zoom, rotate, and drag to see any areas of interest within the dataset.

<img src="https://storage.cloud.google.com/arize-assets/fixtures/Embeddings/NLP/NER-token-drift-workflow.gif" width="700">

The different coloring options can help you understand your model's decisions and flaws, and your dataset's structure and potential drift. More coloring options will be added to help understand and debug your model and dataset, including:

* Color by actual label
* Color by feature value
* Color by accuracy (correct vs incorrect predictions)

# Wrap Up 🎁
Congratulations, you've now sent your first machine learning embedding data to the Arize platform!!

Additionally, if you want to remove this example model from your account, just click **Models** -> **NLP-demo-NER-token-drift** -> **config** -> **delete**

### Overview
Arize is an end-to-end ML observability and model monitoring platform. The platform is designed to help ML engineers and data science practitioners surface and fix issues with ML models in production faster with:
- Automated ML monitoring and model monitoring
- Workflows to troubleshoot model performance
- Real-time visualizations for model performance monitoring, data quality monitoring, and drift monitoring
- Model prediction cohort analysis
- Pre-deployment model validation
- Integrated model explainability

### Website
Visit Us At: https://arize.com/model-monitoring/

### Additional Resources
- [What is ML observability?](https://arize.com/what-is-ml-observability/)
- [Monitor Unstructured Data with Arize](https://arize.com/blog/monitor-unstructured-data-with-arize)
- [Getting Started With Embeddings Is Easier Than You Think](https://arize.com/blog/getting-started-with-embeddings-is-easier-than-you-think)
- [Playbook to model monitoring in production](https://arize.com/the-playbook-to-monitor-your-models-performance-in-production/)
- [Using statistical distance metrics for ML monitoring and observability](https://arize.com/using-statistical-distance-metrics-for-machine-learning-observability/)
<!-- - [ML infrastructure tools for data preparation](https://arize.com/ml-infrastructure-tools-for-data-preparation/) -->
- [ML infrastructure tools for model building](https://arize.com/ml-infrastructure-tools-for-model-building/)
- [ML infrastructure tools for production](https://arize.com/ml-infrastructure-tools-for-production-part-1/)
<!-- - [ML infrastructure tools for model deployment and model serving](https://arize.com/ml-infrastructure-tools-for-production-part-2-model-deployment-and-serving/) -->

- [ML infrastructure tools for ML monitoring and observability](https://arize.com/ml-infrastructure-tools-ml-observability/)

Visit the [Arize Blog](https://arize.com/blog) and [Resource Center](https://arize.com/resource-hub/) for more resources on ML observability and model monitoring.
