![giskard_logo.png](https://raw.githubusercontent.com/Giskard-AI/giskard/main/readme/Logo_full_darkgreen.png)


# About Giskard

Open-Source CI/CD platform for ML teams. Deliver ML products, better & faster. 

*   Collaborate faster with feedback from business stakeholders.
*   Deploy automated tests to eliminate regressions, errors & biases.

🏡 [Website](https://giskard.ai/)

📗 [Documentation](https://docs.giskard.ai/)

## Installing `giskard` and other dependencies

In [None]:
!pip install giskard torch torchdata torchtext


# Text classification with the `torchtext` library

In this tutorial, we build a text classifier for the [`AG_NEWS` dataset](https://pytorch.org/text/stable/datasets.html?highlight=ag_news#torchtext.datasets.AG_NEWS). The classifier will take a news headline or article and detect its category (“World”, “Sports”, “Business”, “Sci/Tech”).

We will use `torch` and the `torchtext` library for the data processing pipeline, to implement the model, and to train it on the `AG_NEWS` data. Then, we will show how to upload the model to Giskard to inspect and validate its performance.

## 1. Data

### 1.1 Data preprocessing

The `AG_NEWS` dataset is made of tuples containing the label (category, e.g. “Sports”, “Business”) and the text (news headline or description). Before training an ML model, we need to transform the data into a format that the model can process. A standard preprocessing pipeline consists of two steps:

- tokenize sentences to produce a list of lexical tokens
- map each token to a numerical identifier (vocabulary)
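
The two steps above can be sketched in plain Python. This is only an illustration of the idea, not the `torchtext` implementation: `simple_tokenize`, `build_vocab`, and `encode` are hypothetical helpers, and the regular expression is a rough stand-in for the `basic_english` tokenizer.

```python
import re
from collections import Counter


def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(texts, unk_token="<unk>"):
    """Map each token to an integer id; id 0 is reserved for unknown tokens."""
    counts = Counter(tok for text in texts for tok in simple_tokenize(text))
    vocab = {unk_token: 0}
    for token, _ in counts.most_common():
        vocab[token] = len(vocab)
    return vocab


def encode(text, vocab, unk_token="<unk>"):
    """Convert a raw string into a list of token ids."""
    return [vocab.get(tok, vocab[unk_token]) for tok in simple_tokenize(text)]


# Hypothetical two-sentence corpus, for illustration only
corpus = ["The match was great!", "The market was down."]
toy_vocab = build_vocab(corpus)

# "cancelled" never appears in the corpus, so it falls back to the <unk> id
ids = encode("The match was cancelled!", toy_vocab)
print(ids)
```

Reserving an id for unknown tokens plays the same role as `set_default_index` in the `torchtext` version: words that were never seen during training still map to a valid identifier.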

We will implement this pipeline with the utilities provided by the `torchtext` library.

In [1]:
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# Get the AG News data
train_data, test_data = AG_NEWS()

# Simple English tokenizer provided by torchtext
tokenizer = get_tokenizer("basic_english")

# Build a vocabulary from all the tokens we can find in the train data
vocab = build_vocab_from_iterator(
    (tokenizer(text) for _, text in train_data), specials=["<unk>"]
)
vocab.set_default_index(vocab["<unk>"])


Let’s wrap the tokenization and vocabulary in a single `preprocess_text` function.

We also want to preprocess the labels associated with the news items. In the `AG_NEWS` dataset, texts are categorized into four classes: “World”, “Sports”, “Business”, “Sci/Tech”. These are represented by the integer values 1, 2, 3, and 4 respectively. For easier handling, we will convert these to integer IDs going from 0 to 3.

In [2]:
def preprocess_text(raw_text):
    return vocab(tokenizer(raw_text))


def preprocess_label(raw_label):
    return int(raw_label) - 1


Let’s test our preprocessing on a simple example:

In [3]:
preprocess_text("Here is a simple example!")

[475, 21, 5, 3390, 5297, 764]

The text preprocessing pipeline converts a text string into a list of integers based on the lookup table defined in the vocabulary.

    preprocess_text('here is the an example!')
    >>> [475, 21, 2, 30, 5297, 764]


## 2. Implement and train the model

### 2.1 Text classification model

The model is composed of an [`nn.EmbeddingBag`](https://pytorch.org/docs/stable/nn.html?highlight=embeddingbag#torch.nn.EmbeddingBag) layer followed by a linear layer for classification. With its default mode, `"mean"`, `nn.EmbeddingBag` computes the mean value of a “bag” of embeddings. Although the text entries have different lengths, the `nn.EmbeddingBag` module requires no padding, since the start position of each text is stored in `offsets`.

Additionally, since `nn.EmbeddingBag` accumulates the average across the embeddings on the fly, it can improve performance and memory efficiency when processing a sequence of tensors.
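
To make the role of the offsets concrete, here is a minimal plain-Python sketch of mean pooling over bags. It only mimics the behaviour described above and is not how PyTorch implements it; `embedding_bag_mean` and the toy embedding table are hypothetical, and every bag is assumed to be non-empty.

```python
def embedding_bag_mean(embeddings, token_ids, offsets):
    """Average the embedding rows of each bag.

    Bag i covers token_ids[offsets[i]:offsets[i + 1]]; the last bag runs
    to the end of token_ids. Every bag is assumed to be non-empty.
    """
    bounds = list(offsets) + [len(token_ids)]
    dim = len(embeddings[0])
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        rows = [embeddings[t] for t in token_ids[start:end]]
        pooled.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return pooled


# Toy embedding table: 4 tokens, 2 dimensions each
embeddings = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Two texts of different lengths, concatenated without padding:
# text A = tokens [1, 2, 3], text B = token [1]
token_ids = [1, 2, 3, 1]
offsets = [0, 3]

pooled = embedding_bag_mean(embeddings, token_ids, offsets)
print(pooled)  # [[3.0, 4.0], [1.0, 2.0]]
```

Each text is reduced to a single fixed-size vector regardless of its length, which is what lets the linear layer that follows work on a regular batch.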

In [10]:
from torch import nn


class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        # Return class probabilities (softmax over the class dimension) rather than raw scores
        return self.fc(embedded).softmax(axis=-1)


### 2.2 Preparing the data loaders

First, let’s define a custom function `collate_fn` that applies our preprocessing transformations (`preprocess_text` and `preprocess_label`), converts the data to `torch.tensor` objects, and groups samples into batches that can be fed to the `TextClassificationModel` we defined above.

In [11]:
import torch
from torchtext.data.functional import to_map_style_dataset

# The device we are working on (GPU or CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Define how we collate data into batches that can be parsed by our model
def collate_fn(batch):
    label_list, text_list, offsets = [], [], [0]

    for _label, _text in batch:
        label_list.append(preprocess_label(_label))
        processed_text = torch.tensor(preprocess_text(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))

    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)

    return label_list.to(device), text_list.to(device), offsets.to(device)


# Create the datasets
train_dataset = to_map_style_dataset(train_data)
test_dataset = to_map_style_dataset(test_data)
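
The offset computation in `collate_fn` is easier to follow on plain lists. The sketch below reproduces the same bookkeeping without tensors; `batch_to_flat` is a hypothetical helper introduced only for this illustration.

```python
from itertools import accumulate


def batch_to_flat(sequences):
    """Concatenate token-id sequences and record where each one starts."""
    flat = [tok for seq in sequences for tok in seq]
    lengths = [len(seq) for seq in sequences]
    # Start positions are the running sum of the lengths, shifted by one
    # slot: the first sequence starts at 0 and the last length is unused.
    offsets = [0] + list(accumulate(lengths))[:-1]
    return flat, offsets


batch = [[4, 8, 15], [16], [23, 42]]
flat, offsets = batch_to_flat(batch)
print(flat)     # [4, 8, 15, 16, 23, 42]
print(offsets)  # [0, 3, 4]
```

This matches `torch.tensor(offsets[:-1]).cumsum(dim=0)` in the cell above, where the list starts with `0` and then holds one length per sample.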


To iterate and shuffle the data, `torch` provides a [DataLoader](https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader) object (you can find a tutorial [here](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html)). Here we define the dataloaders for the train, validation, and test data.

We use a batch size of 64 samples. By setting `shuffle = True` in the `DataLoader` arguments, the samples will be returned in a random order when iterating over the data.

In [12]:
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split

BATCH_SIZE = 64

# We further divide the training data into a train and validation split.
train_split, valid_split = random_split(train_dataset, [0.95, 0.05])

# Prepare the data loaders
train_dataloader = DataLoader(
    train_split, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn
)
valid_dataloader = DataLoader(
    valid_split, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn
)
test_dataloader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_fn
)
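
As a rough sanity check on these settings, the arithmetic below estimates how many samples and batches each loader yields. It is a sketch under stated assumptions: the training set is taken to contain 120,000 samples, and `split_and_batch_counts` is a hypothetical helper that uses simple integer arithmetic rather than the exact rounding rule of `random_split`.

```python
import math


def split_and_batch_counts(n_samples, train_percent=95, batch_size=64):
    """Estimate split sizes and the number of batches per epoch."""
    n_train = n_samples * train_percent // 100
    n_valid = n_samples - n_train
    return {
        "train_samples": n_train,
        "valid_samples": n_valid,
        # The last batch may be smaller than batch_size, hence the ceiling
        "train_batches": math.ceil(n_train / batch_size),
        "valid_batches": math.ceil(n_valid / batch_size),
    }


# Assumption: 120,000 training samples before the train/validation split
counts = split_and_batch_counts(120_000)
print(counts)
```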


### 2.3 Training the text classifier model

We build a model with an embedding dimension of 64. The vocabulary size is equal to the length of the vocabulary instance. The number of classes is equal to the number of labels (1 = World, 2 = Sports, 3 = Business, 4 = Sci/Tech).

In [13]:
vocab_size = len(vocab)
embedding_size = 64
num_class = 4  # “World”, “Sports”, “Business”, “Sci/Tech”

model = TextClassificationModel(vocab_size, embedding_size, num_class).to(device)
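
To get a feel for the size of this model, the number of trainable parameters can be worked out by hand: one embedding row per vocabulary entry, plus the weights and biases of the linear layer. The helper `count_parameters` and the vocabulary size of 100,000 are hypothetical; the real value is whatever `len(vocab)` returns.

```python
def count_parameters(vocab_size, embed_dim, num_class):
    """Parameter count for an embedding table followed by one linear layer."""
    embedding_params = vocab_size * embed_dim           # one vector per token
    linear_params = embed_dim * num_class + num_class   # weights + biases
    return embedding_params + linear_params


# Hypothetical vocabulary size, for illustration only
total = count_parameters(vocab_size=100_000, embed_dim=64, num_class=4)
print(total)  # 6400260
```

Almost all parameters sit in the embedding table, so the vocabulary size is the main driver of the model's memory footprint.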


Let’s define the training and evaluation helpers:

In [14]:
import time

# Hyperparameters
EPOCHS = 1
LR = 5  # learning rate

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)


def train_model(dataloader, epoch=0):
    model.train()
    total_acc = total_count = 0

    for label, text, offset in dataloader:
        optimizer.zero_grad()
        predicted_label = model(text, offset)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)

    return total_acc / total_count


def evaluate_model(dataloader):
    model.eval()

    total_acc = total_count = 0
    with torch.no_grad():
        for label, text, offsets in dataloader:
            predicted_label = model(text, offsets)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)

    return total_acc / total_count


In [16]:
EPOCHS = 1

total_accu = None
for epoch in range(1, EPOCHS + 1):
    start_time = time.perf_counter()

    train_model(train_dataloader)
    accu_val = evaluate_model(valid_dataloader)

    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val

    print("-" * 59)
    print(
        "| end of epoch {:3d} | time: {:5.2f}s | "
        "valid accuracy {:8.3f} ".format(
            epoch, time.perf_counter() - start_time, accu_val
        )
    )
    print("-" * 59)


-----------------------------------------------------------
| end of epoch   1 | time: 27.21s | valid accuracy    0.871 
-----------------------------------------------------------
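
With `EPOCHS = 1` the scheduler branch in the loop above is never reached. The sketch below simulates what the same logic would do over several epochs: the learning rate is multiplied by `gamma` whenever the validation accuracy falls below the best value seen so far. `simulate_lr` and the accuracy values are hypothetical and only illustrate the control flow.

```python
def simulate_lr(valid_accuracies, initial_lr=5.0, gamma=0.1):
    """Replay the loop's scheduling rule on a list of validation accuracies."""
    lr = initial_lr
    best = None
    history = []
    for acc in valid_accuracies:
        if best is not None and best > acc:
            lr *= gamma   # accuracy dropped: decay the learning rate
        else:
            best = acc    # accuracy improved or matched: keep the rate
        history.append(lr)
    return history


# Hypothetical validation accuracies over four epochs
history = simulate_lr([0.87, 0.90, 0.89, 0.91])
print(history)
```

Here the rate would stay at its initial value for the first two epochs and be decayed once, after the third epoch's drop in accuracy.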


### 2.4 Evaluate the model on the test dataset

Now that we have a trained model, let’s evaluate its performance on the test data that we excluded from training.

In [17]:
accu_test = evaluate_model(test_dataloader)

print('Test accuracy {:8.3f}'.format(accu_test))

Test accuracy    0.869


We can see an example of classification by using our model to predict the category of a news article about golf.

In [18]:
news_labels = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}


def label_to_text(label_id: int):
    return news_labels[label_id]


def predict(text):
    """Given a text it predicts its category."""
    with torch.no_grad():
        text = torch.tensor(preprocess_text(text))
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1


In [19]:
example_news = (
    "MEMPHIS, Tenn. – Four days ago, Jon Rahm was "
    "enduring the season’s worst weather conditions on Sunday at The "
    "Open on his way to a closing 75 at Royal Portrush, which "
    "considering the wind and the rain was a respectable showing. "
    "Thursday’s first round at the WGC-FedEx St. Jude Invitational "
    "was another story. With temperatures in the mid-80s and hardly any "
    "wind, the Spaniard was 13 strokes better in a flawless round. "
    "Thanks to his best putting performance on the PGA Tour, Rahm "
    "finished with an 8-under 62 for a three-stroke lead, which "
    "was even more impressive considering he’d never played the "
    "front nine at TPC Southwind."
)

model = model.to("cpu")

predicted = predict(example_news)

print(f"This is a {label_to_text(predicted)} news!")


This is a Sports news!
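
The last step of `predict` turns the model output into a label: take the index of the largest probability, then add 1 to get back to the 1 to 4 label IDs used by `AG_NEWS`. Here is a minimal plain-Python sketch of that step; the scores are made up and `probabilities_to_label` is a hypothetical helper.

```python
import math

NEWS_LABELS = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def probabilities_to_label(probs):
    """Pick the most likely class (0-based) and map it to a 1-based label."""
    best_index = max(range(len(probs)), key=probs.__getitem__)
    return NEWS_LABELS[best_index + 1]


# Made-up scores for one news item, in the order World, Sports, Business, Sci/Tech
probs = softmax([0.2, 2.5, 0.1, -1.0])
label = probabilities_to_label(probs)
print(label)  # Sports
```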


## 3. Inspect the model with Giskard

### 3.1 Start the Giskard worker

Connect the external worker in daemon mode:

In [None]:
!giskard worker start -d

### 3.2 Create a Giskard project via the API

In [22]:
from giskard import GiskardClient

GISKARD_URL = "http://localhost:9000"  # if Giskard is installed locally (see: https://docs.giskard.ai/start/guides/installation)
GISKARD_API_TOKEN = "YOUR_GISKARD_TOKEN"  # you can generate your API token in the settings of the Giskard application

client = GiskardClient(GISKARD_URL, GISKARD_API_TOKEN)

# Create a Giskard project if it does not exist
if "news_classification_demo" in [p.project_key for p in client.list_projects()]:
    project = client.get_project("news_classification_demo")
else:
    project = client.create_project(
        project_key="news_classification_demo", name="News text classification"
    )


First, we need to pack the dataset in a format recognized by Giskard, using the `giskard.Dataset` class. This is easy, since a Giskard dataset can be created from standard `pandas.DataFrame` objects.

In [28]:
import pandas as pd

# Our data in a dataframe format
df = pd.DataFrame(
    {"text": text, "label": label_to_text(label_id)} for label_id, text in test_data
)

df.head()

|   | text | label |
|---|------|-------|
| 0 | Fears for T N pension after talks Unions repre... | Business |
| 1 | The Race is On: Second Private Team Sets Launc... | Sci/Tech |
| 2 | Ky. Company Wins Grant to Study Peptides (AP) ... | Sci/Tech |
| 3 | Prediction Unit Helps Forecast Wildfires (AP) ... | Sci/Tech |
| 4 | Calif. Aims to Limit Farm-Related Smog (AP) AP... | Sci/Tech |


In [29]:
from giskard import Dataset

# Create the Giskard dataset
dataset = Dataset(df, name="Test Dataset", target="label", feature_types={"text": "text", "label": "category"})
dataset_id = dataset.upload(client, project.project_key)

Dataset successfully uploaded to project key 'news_classification_demo' with ID = d61d7470-52ee-4101-a34f-7a3f3f723b7b


Head over to the Giskard app to check your newly uploaded dataset!

To let Giskard work with our model, we need to tell it how to transform the dataset we just uploaded to a format that the model can handle. To do that, we specify a `dataframe_to_torch_dataset` function that will take the data from Giskard and convert it to a `torch.utils.data.Dataset`.

In [30]:
from torch.utils.data import Dataset as TorchDataset

def dataframe_to_torch_dataset(df: pd.DataFrame) -> TorchDataset:
    """Returns `(preprocessed_text, offset)` as torch tensors."""
    return to_map_style_dataset((torch.tensor(preprocess_text(text)), torch.tensor([0])) for text in df["text"])

In [33]:
from giskard import Model
import numpy as np

giskard_model = Model(
    clf=model,
    name="SimpleNewsClassificationModel",
    feature_names=["text"],
    model_type="classification",
    classification_labels=list(news_labels.values()),
    data_preprocessing_function=dataframe_to_torch_dataset,
)

# Create a small slice of the dataset to validate that our model works fine before uploading it to Giskard.
validate_ds = dataset.slice(lambda x: x.head())

model_id = giskard_model.upload(client, "news_classification_demo", validate_ds=validate_ds)

Model successfully uploaded to project key 'news_classification_demo' with ID = 2cee75ca-2c1a-49d8-aeff-9c63c7fba106
