# Example 1 - Default Approach

This notebook demonstrates how to use the TX2 dashboard with a sequence classification transformer using the default approach as described in the "Basic Usage" docs.

In [1]:
%cd -q ..
%load_ext autoreload
%autoreload 2
%matplotlib inline

We enable logging to view the output from `wrapper.prepare()` further down in the notebook. (It is a long-running function and logs which step it is on.)

In [2]:
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

In [3]:
import numpy as np
import pandas as pd
import torch
from torch import cuda
from torch.utils.data import DataLoader, Dataset
from tqdm.notebook import tqdm
from transformers import AutoModel, AutoTokenizer, BertTokenizer

In this example notebook, we use the 20 newsgroups dataset, loaded below from the Hugging Face Hub with the `datasets` library:

In [4]:
from datasets import load_dataset

# getting newsgroups data from huggingface
train_data = pd.DataFrame(
    data=load_dataset("rungalileo/20_Newsgroups_Fixed", split="train")
)
test_data = pd.DataFrame(
    data=load_dataset("rungalileo/20_Newsgroups_Fixed", split="test")
)

train_data.drop(columns=["id"], inplace=True)
test_data.drop(columns=["id"], inplace=True)

# setting up pytorch device
if cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
    
device

Using custom data configuration rungalileo--20_Newsgroups_Fixed-66c0c93a05210a8a
Reusing dataset csv (/Users/s6t/.cache/huggingface/datasets/rungalileo___csv/rungalileo--20_Newsgroups_Fixed-66c0c93a05210a8a/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)
Using custom data configuration rungalileo--20_Newsgroups_Fixed-66c0c93a05210a8a
Reusing dataset csv (/Users/s6t/.cache/huggingface/datasets/rungalileo___csv/rungalileo--20_Newsgroups_Fixed-66c0c93a05210a8a/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)


'mps'

Defined below is a simple sequence classification model that stores the language model itself in the attribute `l1`. Since it is a BERT model, we take the sequence embedding from the `[CLS]` token (via `output_1[0][:, 0, :]`) and pipe that into the linear layer `l2`.

In [21]:
class BERTClass(torch.nn.Module):
    def __init__(self):
        super(BERTClass, self).__init__()
        self.l1 = AutoModel.from_pretrained("bert-base-cased")
        # 21 outputs: the 20 newsgroups plus the "None" label that appears in this dataset
        self.l2 = torch.nn.Linear(768, 21)

    def forward(self, ids, mask):
        output_1 = self.l1(ids, mask)
        output = self.l2(output_1[0][:, 0, :])  # use just the [CLS] output embedding
        return output


model = BERTClass()
model.to(device)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
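
To make the `[CLS]` slicing concrete, here is a small sketch on a random tensor standing in for the language model output. The batch size, sequence length, and `num_classes` value are illustrative assumptions; only the hidden size of 768 matches the model above.

```python
import torch

batch_size, seq_len, hidden_size = 2, 5, 768
num_classes = 4  # illustrative value, not the number of newsgroup labels

# Stand-in for output_1[0], the last hidden state: (batch, sequence, hidden)
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# Position 0 of every sequence holds the [CLS] token embedding
cls_embedding = last_hidden_state[:, 0, :]

# A linear head maps each [CLS] embedding to one logit per class
head = torch.nn.Linear(hidden_size, num_classes)
logits = head(cls_embedding)

print(tuple(cls_embedding.shape), tuple(logits.shape))
```

The slice drops the sequence dimension, so the linear layer sees one 768-dimensional vector per input text.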



Next, we apply some simple text cleaning and keep all data in dataframes for the wrapper.

In [6]:
def clean_text(text):
    text = str(text)
    # text = text[text.index("\n\n") + 2 :]
    text = text.replace("\n", " ")
    text = text.replace("    ", " ")
    text = text.replace("   ", " ")
    text = text.replace("  ", " ")
    text = text.strip()
    return text
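
The chained `replace` calls above each make a single pass, so very long runs of spaces can survive as more than one space. As an alternative sketch (not what this notebook uses), a regular expression can collapse whitespace runs of any length, including tabs; the helper name `collapse_whitespace` is hypothetical.

```python
import re


def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", str(text)).strip()


print(collapse_whitespace("line one\n\n   line    two  "))
```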

In [7]:
train_data.text = train_data.text.apply(clean_text)
test_data.text = test_data.text.apply(clean_text)

In [8]:
# convert labels to numeric
label_list = list(train_data.label.unique())

train_data.label = train_data.label.apply(lambda x: label_list.index(x))
test_data.label = test_data.label.apply(lambda x: label_list.index(x))

train_data

Unnamed: 0,text,label
0,I was wondering if anyone out there could enli...,0
1,A fair number of brave souls who upgraded thei...,1
2,"well folks, my mac plus finally gave up the gh...",1
3,Do you have Weitek's address/phone number? I'd...,2
4,"From article <C5owCB.n3p@world.std.com>, by to...",3
...,...,...
11309,DN> From: nyeda@cnsvax.uwec.edu (David Nye) DN...,5
11310,"I have a (very old) Mac 512k and a Mac Plus, b...",1
11311,I just installed a DX2-66 CPU in a clone mothe...,6
11312,Wouldn't this require a hyper-sphere. In 3-spa...,2


In [9]:
label_list

['rec.autos',
 'comp.sys.mac.hardware',
 'comp.graphics',
 'sci.space',
 'talk.politics.guns',
 'sci.med',
 'comp.sys.ibm.pc.hardware',
 'comp.os.ms-windows.misc',
 'rec.motorcycles',
 'talk.religion.misc',
 'None',
 'misc.forsale',
 'alt.atheism',
 'sci.electronics',
 'comp.windows.x',
 'rec.sport.hockey',
 'rec.sport.baseball',
 'soc.religion.christian',
 'talk.politics.mideast',
 'talk.politics.misc',
 'sci.crypt']

## Training

This section minimally trains the classifier and the underlying language model. There is nothing fancy here; the goal is just to give the dashboard demo something to work with. Most of this is similar to the Hugging Face tutorial notebooks.

In [10]:
# Creating the loss function and optimizer
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params=model.parameters(), lr=1e-05)

In [11]:
class EncodedSet(Dataset):
    def __init__(self, dataframe: pd.DataFrame, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len
        print(self.len)

    def __getitem__(self, index):
        text = str(self.data.text[index])
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding="max_length",
            truncation=True,
            return_token_type_ids=True,
        )
        ids = inputs["input_ids"]
        mask = inputs["attention_mask"]

        return {
            "ids": torch.tensor(ids, dtype=torch.long),
            "mask": torch.tensor(mask, dtype=torch.long),
            "targets": torch.tensor(self.data.label[index], dtype=torch.long),
        }

    def __len__(self):
        return self.len

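# reset_index returns a new dataframe, so the result must be assigned back
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)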

train_set = EncodedSet(train_data, tokenizer, 256)
test_set = EncodedSet(test_data[:1000], tokenizer, 256)

train_params = {"batch_size": 16, "shuffle": True, "num_workers": 0}

test_params = {"batch_size": 2, "shuffle": True, "num_workers": 0}

# put everything into data loaders
train_loader = DataLoader(train_set, **train_params)
test_loader = DataLoader(test_set, **test_params)

11314
1000


In [12]:
def train(epoch):
    model.train()

    loss_history = []
    for step, data in tqdm(enumerate(train_loader), total=len(train_loader), desc=f"Epoch {epoch}"):
        ids = data["ids"].to(device, dtype=torch.long)
        mask = data["mask"].to(device, dtype=torch.long)
        targets = data["targets"].to(device, dtype=torch.long)

        # forward pass: logits of shape (batch_size, num_classes)
        outputs = model(ids, mask)

        loss = loss_function(outputs, targets)
        if step % 100 == 0:
            print(f"Epoch: {epoch}, Loss:  {loss.item()}")
        loss_history.append(loss.item())

        # backward pass and parameter update
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss_history

In [14]:
losses = []
for epoch in range(1):
    losses.extend(train(epoch))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Epoch 0:   0%|          | 0/708 [00:00<?, ?it/s]

Epoch: 0, Loss:  3.080918312072754
Epoch: 0, Loss:  1.955832600593567
Epoch: 0, Loss:  1.5755281448364258
Epoch: 0, Loss:  0.9863734245300293
Epoch: 0, Loss:  0.7981112003326416
Epoch: 0, Loss:  0.5531709790229797
Epoch: 0, Loss:  0.4949667155742645
Epoch: 0, Loss:  0.8490170240402222
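
The notebook builds `test_loader` but does not use it. As a rough sketch of how accuracy on held-out data could be checked, the function below assumes batches shaped like those produced by `EncodedSet` (keys `ids`, `mask`, `targets`). The names `evaluate` and `AlwaysClassZero` are hypothetical and introduced only so the example runs on its own.

```python
import torch


def evaluate(model, loader, device="cpu"):
    """Return the fraction of examples in `loader` that `model` classifies correctly."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():  # no gradients needed for evaluation
        for batch in loader:
            ids = batch["ids"].to(device)
            mask = batch["mask"].to(device)
            targets = batch["targets"].to(device)
            logits = model(ids, mask)
            predictions = logits.argmax(dim=1)
            correct += (predictions == targets).sum().item()
            total += targets.size(0)
    return correct / total if total else 0.0


class AlwaysClassZero(torch.nn.Module):
    """Stand-in model that always predicts class 0 (illustration only)."""

    def forward(self, ids, mask):
        logits = torch.zeros(ids.size(0), 3)
        logits[:, 0] = 1.0
        return logits


# One hand-made batch in place of a real DataLoader
dummy_batches = [
    {
        "ids": torch.zeros(4, 8, dtype=torch.long),
        "mask": torch.ones(4, 8, dtype=torch.long),
        "targets": torch.tensor([0, 0, 1, 2]),
    }
]
accuracy = evaluate(AlwaysClassZero(), dummy_batches)
print(accuracy)
```

With the notebook's own objects, the equivalent call would be `evaluate(model, test_loader, device)`.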


The wrapper uses an `encodings` dictionary for various labels/visualizations, and can be set up with something similar to:

In [15]:
encodings = {}
for index, entry in enumerate(label_list):
    encodings[entry] = index
encodings

{'rec.autos': 0,
 'comp.sys.mac.hardware': 1,
 'comp.graphics': 2,
 'sci.space': 3,
 'talk.politics.guns': 4,
 'sci.med': 5,
 'comp.sys.ibm.pc.hardware': 6,
 'comp.os.ms-windows.misc': 7,
 'rec.motorcycles': 8,
 'talk.religion.misc': 9,
 'None': 10,
 'misc.forsale': 11,
 'alt.atheism': 12,
 'sci.electronics': 13,
 'comp.windows.x': 14,
 'rec.sport.hockey': 15,
 'rec.sport.baseball': 16,
 'soc.religion.christian': 17,
 'talk.politics.mideast': 18,
 'talk.politics.misc': 19,
 'sci.crypt': 20}
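
When inspecting raw predictions, it can be handy to map indices back to label names. The sketch below inverts a name-to-index dictionary of the same shape as `encodings`; the shortened label list and the `demo_*` names are hypothetical and used only for illustration.

```python
# Hypothetical, shortened label list (the notebook's label_list has more entries)
demo_labels = ["rec.autos", "comp.graphics", "sci.space"]

# Name -> index, the same shape as the encodings dictionary above
demo_encodings = {label: index for index, label in enumerate(demo_labels)}

# Index -> name, the inverse mapping
demo_decodings = {index: label for label, index in demo_encodings.items()}

predicted_indices = [2, 0, 0, 1]
predicted_names = [demo_decodings[i] for i in predicted_indices]
print(predicted_names)
```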

## TX2

This section shows how to put everything into the TX2 wrapper to get the dashboard widget displayed.

In [16]:
from tx2.dashboard import Dashboard
from tx2.wrapper import Wrapper

In [20]:
wrapper = Wrapper(
    train_texts=train_data.text,
    train_labels=train_data.label,
    test_texts=test_data.text[:2000],
    test_labels=test_data.label[:2000],
    encodings=encodings,
    classifier=model,
    language_model=model.l1,
    tokenizer=tokenizer,
    overwrite=True,
)
wrapper.prepare()

INFO:root:Cache path not found, creating...
INFO:root:Checking for cached predictions...
INFO:root:Running classifier...


RuntimeError: Placeholder storage has not been allocated on MPS device!

In [None]:
%matplotlib agg
import matplotlib.pyplot as plt

dash = Dashboard(wrapper)
dash.render()

To experiment with different UMAP and DBSCAN arguments without rerunning the entire `prepare()` function, you can use `recompute_projections` (which recomputes both the projections and the visual clusterings) or `recompute_visual_clusterings` (which only redoes the clustering).

In [None]:
# wrapper.recompute_visual_clusterings("KMeans", clustering_args=dict(n_clusters=18))
# wrapper.recompute_visual_clusterings("OPTICS", clustering_args=dict())
# wrapper.recompute_projections(umap_args=dict(min_dist=.2), dbscan_args=dict())

To test or debug the classification model, inspect the raw outputs the visualizations receive, or build your own visualization tools, you can manually call the `classify()`, `soft_classify()`, and `embed()` functions, or access any of the cached data listed in the bottom cell.

In [None]:
wrapper.classify(["testing"])

In [None]:
wrapper.soft_classify(["testing"])

In [None]:
wrapper.embed(["testing"])

In [None]:
# cached data:
# wrapper.embeddings_training
# wrapper.embeddings_testing
# wrapper.projector
# wrapper.projections_training
# wrapper.projections_testing
# wrapper.salience_maps
# wrapper.clusters
# wrapper.cluster_profiles
# wrapper.cluster_words_freqs
# wrapper.cluster_class_word_sets