In [1]:
from datasets import list_datasets

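# display all datasets
# note: newer releases of datasets deprecate list_datasets in favour of huggingface_hub.list_datasets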
all_datasets = list_datasets()
print(f"There are {len(all_datasets)} datasets currently available on the Hub")
print(f"The first 10 are: {all_datasets[:10]}")

In [2]:
from datasets import load_dataset

emotions = load_dataset("emotion")
emotions

In [3]:
train_ds = emotions["train"]
train_ds # contains a Dataset class

In [4]:
len(train_ds)

In [5]:
train_ds[0]

In [6]:
train_ds.column_names

In [7]:
# features shows the data type of each column
# text is a Value with dtype string
# label is a ClassLabel that contains information about class names and their mapping to integers
print(train_ds.features)

In [8]:
# can access several rows via a slice
train_ds[:5]

In [9]:
# or just full columns by name
train_ds["text"][:5]

Sidebar: `load_dataset` can also load CSV, text and JSON files: pass the loader name together with a `data_files` argument specifying the path/URL of one or more files. For CSV files, the delimiter can be set with `sep` and the column names with `names`. The Datasets documentation gives a more complete overview.

In [10]:
import pandas as pd

# convenient to convert to PD dataframe so we can access APIs for data viz
# Datasets provides a set_format() that allows us to change the output format of the dataset
emotions.set_format(type="pandas")
df = emotions["train"][:]
df.head()

In [11]:
# convert int labels to string labels using int2str() method
def label_int2str(row):
    return emotions["train"].features["label"].int2str(row)

df["label_name"] = df["label"].apply(label_int2str)
df.head()

In [12]:
# piece of advice, "becoming one with the data" is an essential step for training great models
# so we want to look at the distribution of the dataset generally
# a skewed distribution may require different treatment in terms of metric & loss than an evenly distributed one
import matplotlib.pyplot as plt

df["label_name"].value_counts(ascending=True).plot.barh()
plt.title("Frequency of Classes")
plt.show()

Ways to deal with imbalance:
- Randomly oversample minority
- Randomly undersample majority
- Gather more labeled data from underrepresented class

Can use imbalanced-learn library to aid in sampling techniques. Make sure to apply sampling methods *after* creating train/test splits to avoid leakage.
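A minimal sketch of random oversampling, to make the idea concrete. It assumes toy lists standing in for the training split only (validation/test data is left untouched to avoid leakage). The helper name `random_oversample` and the data are made up for illustration and are not the imbalanced-learn API.

```python
import random
from collections import Counter

def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    matches the size of the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(rows) for rows in by_class.values())
    out_texts, out_labels = [], []
    for label, rows in by_class.items():
        # draw extra rows with replacement from the same class
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for text in rows + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels

# hypothetical toy data standing in for the *training* split
train_texts = ["a", "b", "c", "d", "e", "f"]
train_labels = ["joy", "joy", "joy", "joy", "sadness", "surprise"]

bal_texts, bal_labels = random_oversample(train_texts, train_labels)
print(Counter(bal_labels))
```

Every original row is kept; only duplicates of minority rows are added, so the classifier sees each class equally often.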

In [13]:
# transformer models have a maximum input sequence length, referred to as the maximum context size
# for DistilBERT it is 512 tokens, approx. a few paragraphs of text
df["Words Per Tweet"] = df["text"].str.split().apply(len)
df.boxplot("Words Per Tweet", by="label_name", grid=False, showfliers=False, color="black")
plt.suptitle("")
plt.xlabel("")
plt.show()

In [14]:
# tweets approx 15 words long. Longest tweets are well below DistilBERT's max context size
# Longer texts need to be truncated, which may result in loss of performance as we may lose
# crucial information

# reset format as we don't need dataframe anymore; will begin working towards Transformers
emotions.reset_format()

Tokenisation: Breaking down a string into atomic units used in a model.

In [15]:
# Character Tokenisation; feed each character individually to the model
# str objects are really arrays under the hood; so we can have character level
# tokens in one line
text = "Tokenizing text is a core task of NLP."
tokenized_text = list(text)
print(tokenized_text)

In [16]:
# not done yet; model expects integers, a process called numericalisation
# so encode each unique token with a unique integer
token2idx = {ch: idx for idx, ch in enumerate(sorted(set(tokenized_text)))}
print(token2idx) # get mapping of each character in vocab to an integer

In [17]:
input_ids = [token2idx[token] for token in tokenized_text]
print(input_ids) # each token mapped to a unique numerical identifier

In [18]:
# last step to convert input_ids to 2d tensor of one-hot vectors
# e.g. can map each name to a unique ID
categorical_df = pd.DataFrame(
    {"Name":["Bumblebee", "Optimus Prime", "Megatron"], "Label ID": [0,1,2]}
)
categorical_df

In [19]:
# creates fictitious ordering between names; and NNs are really good at learning these kinds of relationships.
# so instead create new col for each category and 1 where true, 0 where false (one-hot encode)
pd.get_dummies(categorical_df["Name"]) # rows are one-hot vectors

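The result of adding two one-hot vectors is easy to interpret: the two entries that are 1 indicate that the corresponding tokens (e.g. both Bumblebee and Megatron) co-occur, whereas adding two integer IDs just gives another, unrelated ID. We can create one-hot encodings in PyTorch by converting `input_ids` to a tensor and applying the `one_hot()` function.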

In [20]:
import torch
import torch.nn.functional as F

input_ids = torch.tensor(input_ids)
# important to set num_classes; otherwise one_hot() infers the size from the largest ID present,
# so the vectors may be shorter than the vocabulary and would need manual zero-padding.
one_hot_encodings = F.one_hot(input_ids, num_classes = len(token2idx))

# for each of 38 input tokens, we have a one-hot vector with 20 dimensions
# since our vocabulary consists of 20 unique characters
one_hot_encodings.shape 

In [21]:
print(f"Token: {tokenized_text[0]}")
print(f"Tensor index: {input_ids[0]}")
print(f"One-hot: {one_hot_encodings[0]}")

So we see character-level tokenisation ignores any structure in the text and treats the whole string as a stream of characters. This helps deal with misspellings and rare words, but the drawback is that linguistic structures such as words need to be learned from the data, which requires significant compute, memory and data. So character tokenisation is rare in practice.

Instead, some structure of the text is preserved during tokenisation, and word tokenisation is a straightforward way to achieve this.


**Word Tokenization**

With word tokenisation the model no longer needs to learn words from characters, which reduces the complexity of the training process.

In [22]:
tokenized_text = text.split()
print(tokenized_text)

There are simply too many words to consider, so we often discard rare words and limit the vocabulary. Words outside the vocabulary are classed as "unknown" and mapped to a shared UNK token. This means we may lose some important information, however.
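A minimal sketch of limiting the vocabulary, assuming a made-up two-sentence corpus and a hypothetical `max_vocab` cutoff. Only the most frequent words get their own ID; everything else falls back to the shared `[UNK]` ID.

```python
from collections import Counter

# toy corpus (made up for illustration)
corpus = ["the cat sat on the mat", "the dog sat on the log"]

counts = Counter(word for line in corpus for word in line.split())
max_vocab = 3  # keep only the 3 most frequent words

vocab = {"[UNK]": 0}
for word, _ in counts.most_common(max_vocab):
    vocab[word] = len(vocab)

def encode(sentence):
    # any word outside the vocabulary maps to the shared [UNK] id
    return [vocab.get(word, vocab["[UNK]"]) for word in sentence.split()]

print(vocab)
print(encode("the bird sat on the mat"))
```

Here `bird` and `mat` both collapse to the same ID, which is exactly the information loss mentioned above.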

A compromise between word and character tokenisation is *subword tokenisation*.

**Subword Tokenisation**

Subword tokenisation is learned from the pretraining corpus using a mix of statistical rules and algorithms. There are several subword tokenisation algorithms used in NLP, such as WordPiece, which is used by the BERT and DistilBERT tokenizers. Transformers provides the `AutoTokenizer` class to quickly load the tokenizer associated with a pretrained model: just call its `from_pretrained()` method and provide the ID of the model (checkpoint) to load.

In [23]:
from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
# Autotokenizer belongs to "auto" classes which automatically retrieve model's config, pre-trained weights or vocab from ckpt
# so we can quickly switch between models
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In [24]:
from transformers import DistilBertTokenizer

distilbert_tokenizer = DistilBertTokenizer.from_pretrained(model_ckpt)

In [25]:
encoded_text = tokenizer(text)
print(encoded_text)

In [26]:
tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)

-  We have `[CLS]` and `[SEP]` tokens at the start and end of the sentence, which indicate the start and end of a sequence
-  Tokens have been lowercased
-  `tokenizing` and `NLP` have each been split into two tokens (as they are not common words). The `##` prefix means the preceding string is not whitespace, and that any token with this prefix should be merged with the previous token when converting back into a string. We can use the `convert_tokens_to_string()` method for this:

In [27]:
print(tokenizer.convert_tokens_to_string(tokens))
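To make the `##` convention concrete, here is a hand-rolled sketch of the merge step. It is not the library's implementation: the helper `merge_wordpieces` is hypothetical, the token list is written by hand to resemble the output above, and unlike `convert_tokens_to_string()` this sketch drops the special tokens and does not tidy up spacing around punctuation.

```python
def merge_wordpieces(tokens, special=("[CLS]", "[SEP]", "[PAD]")):
    """Glue '##'-prefixed pieces onto the previous token."""
    words = []
    for token in tokens:
        if token in special:
            continue  # this sketch simply skips special tokens
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation piece: no whitespace before it
        else:
            words.append(token)
    return " ".join(words)

# hand-written token list (assumed, not produced by a real tokenizer here)
example_tokens = ["[CLS]", "token", "##izing", "text", "is", "a", "core",
                  "task", "of", "nl", "##p", ".", "[SEP]"]
print(merge_wordpieces(example_tokens))
```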

In [28]:
# other attributes
tokenizer.vocab_size

In [29]:
tokenizer.model_max_length

In [30]:
tokenizer.model_input_names

Tip: When using pretrained model, it is *really* important to ensure the tokenizer is the same as the one the model was trained with. For a model, switching the tokenizer is like shuffling the vocabulary, so the model would have a hard time understanding what is going on.

In [31]:
# tokenise the whole dataset. Use `map()` method of DatasetDict
# convenient way to apply processing fn to each element in dataset
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

print(tokenize(emotions["train"][:2]))

We see that zeros have been added to the shorter element so that both have the same length. These zeros correspond to the `[PAD]` token (ID 0) in the vocabulary; the other special tokens include the `[CLS]` and `[SEP]` tokens encountered earlier.

We also get `attention_mask` arrays, which tell the model to ignore the padded parts of the input so that it is not confused by the additional padding tokens.
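A rough sketch of what padding and the attention mask look like, written by hand rather than taken from the tokenizer. The helper `pad_batch` and the token IDs are made up for illustration; 101 and 102 stand in for `[CLS]` and `[SEP]`.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build matching masks."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 = real token the model should attend to, 0 = padding to ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# made-up token IDs for two sequences of different length
toy_batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]]
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```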

In [32]:
# apply across all the splits in the corpus with a single line of code
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)

`batch_size=None` means that `tokenize` is applied to each full split as a single batch, which ensures the input tensors and attention masks have the same shape across the whole split. This operation has added new `input_ids` and `attention_mask` columns to the dataset.

In [33]:
print(emotions_encoded["train"].column_names)

### Training a Text Classifier
Models like DistilBERT are pretrained to predict masked words in a sequence of text, so we need to modify them slightly for text classification. Looking at the architecture:
-  First, text is tokenised and represented as one-hot vectors called *token encodings*. The size of the tokenizer vocabulary determines their dimension, usually 20k-200k unique tokens
-  Then these token encodings are converted to *token embeddings*, which are vectors in a lower-dimensional space
-  Token embeddings pass through the encoder blocks to yield a hidden state for each input token
-  For pretraining, each hidden state is fed to a layer that predicts the masked input tokens
-  For classification, we replace the language modeling layer with a classification layer

Sidebar: In practice PyTorch skips the step of creating one-hot vectors for token encodings, as multiplying a matrix with a one-hot vector is the same as selecting a single column (or row, depending on the matrix orientation) from the matrix.
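A small NumPy check of the sidebar's claim, assuming a random toy embedding matrix stored as `[vocab_size, embed_dim]` (so the selection is a row here; with the transposed layout it would be a column). The sizes and names are made up for illustration.

```python
import numpy as np

vocab_size, embed_dim = 5, 3
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))  # toy weights

token_id = 2
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

via_matmul = one_hot @ embedding_matrix   # (vocab,) @ (vocab, dim) -> (dim,)
via_lookup = embedding_matrix[token_id]   # plain index lookup, no multiplication

print(np.allclose(via_matmul, via_lookup))
```

The lookup avoids building a vocabulary-sized vector per token, which is why it is the preferred route in practice.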

So there are two options to train such a model on our Twitter dataset:

-  *Feature Extraction*: Use hidden states as features and just train a classifier on them, without modifying the pretrained model
-  *Fine-Tuning*: Train whole model end-to-end, also updating the parameters of the pretrained model

We explore both options for DistilBERT and examine the trade-offs.

**Transformers as Feature Extractors**

Freeze the body's weights during training and use the hidden states as features for a classifier. Pros: we can quickly train a small/shallow model, even one that does not rely on gradients, such as a random forest. It is also convenient if a GPU is unavailable, as the hidden states only need to be precomputed once.

In [34]:
# use AutoModel, this has a from_pretrained() method to load weights of pretrained model
from transformers import AutoModel

model_ckpt = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device) # chain model to GPU if we have one

In [35]:
# Automodel converts token encodings to embeddings, then feeds through encoder stack to return hidden states
# Retrieving the last hidden states
text = "this is a test"
inputs = tokenizer(text, return_tensors="pt")

# has [batch_size, n_tokens]
print(f"Input tensor shape: {inputs['input_ids'].size()}")

In [36]:
# place tensors on same device and pass the inputs to the model
inputs = {k:v.to(device) for k, v in inputs.items()}

# use no_grad to not calculate automatic gradient, reduces memory footprint
with torch.no_grad():
    outputs = model(**inputs)

# depending on the model config, the output can contain several objects, such as hidden states, losses, attns..
# here we just have an instance of BaseModelOutput and can access its attributes by name
print(outputs)

In [37]:
outputs.last_hidden_state.size() # [batch_size, n_tokens, hidden_dim]

In [38]:
# common practice in text classification to just use hidden state associated with [CLS] as input feature
# as this token appears at the start of each sequence
# we can extract it at the start of each sequence by indexing into last hidden state
outputs.last_hidden_state[:,0].size() # last hidden state for single string

In [39]:
tokenizer.model_input_names

In [40]:
# build a mapping fn that extracts all hidden states in one go for entire dataset

def extract_hidden_states(batch):
    # place model inputs on the same device as the model
    inputs = {k:v.to(device) for k, v in batch.items() if k in tokenizer.model_input_names}
    
    # extract last hidden states
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
        
    # return vector for [CLS] token
    return {"hidden_state": last_hidden_state[:, 0].cpu().numpy()}
        

In [41]:
# model expects tensors as inputs, so convert input_ids and attn masks to "torch" format
emotions_encoded.set_format("torch", columns=["input_ids", "attention_mask", "label"])

In [42]:
# extract hidden states across all splits in one go
emotions_hidden = emotions_encoded.map(extract_hidden_states, batched=True) # default batchsize=1000 is used

In [43]:
emotions_hidden["train"].column_names # extract_hidden_states() adds a new "hidden_state" column to our dataset

In [44]:
# create a feature matrix to train a classifier
import numpy as np
X_train = np.array(emotions_hidden["train"]["hidden_state"])
X_valid = np.array(emotions_hidden["validation"]["hidden_state"])
y_train = np.array(emotions_hidden["train"]["label"])
y_valid = np.array(emotions_hidden["validation"]["label"])
X_train.shape, X_valid.shape

In [45]:
# briefly visualise the training set, we can use UMAP to project to 2D
# works best if we scale to [0,1] first, so use MinMaxScaler
from umap import UMAP
from sklearn.preprocessing import MinMaxScaler

# scale features to [0,1] range
X_scaled = MinMaxScaler().fit_transform(X_train)
# initialise and fit UMAP
mapper = UMAP(n_components=2, metric="cosine").fit(X_scaled)
# create dataframe of 2D embeddings
df_emb = pd.DataFrame(mapper.embedding_, columns=["X", "Y"])
df_emb["label"] = y_train
df_emb.head()

In [46]:
# so we reduce 768 features to 2. Now we can plot the density of the points for each category
fig, axes = plt.subplots(2,3, figsize=(7,5))
axes = axes.flatten()
cmaps = ["Greys", "Blues", "Oranges", "Reds", "Purples", "Greens"]
labels = emotions["train"].features["label"].names

for i, (label, cmap) in enumerate(zip(labels, cmaps)):
    df_emb_sub = df_emb.query(f"label == {i}")
    axes[i].hexbin(df_emb_sub["X"], df_emb_sub["Y"], cmap=cmap, gridsize=20, linewidths=(0,))
    axes[i].set_title(label)
    axes[i].set_xticks([]), axes[i].set_yticks([])
    
plt.tight_layout()
plt.show()

Note: These are only projections in 2D space. Overlap does not mean they are not separable in original space. Conversely, if they are separable in the projected space then they will be separable in the original space.

We see that the negative emotions (sadness, anger, fear) occupy similar regions with similar distributions. On the other hand, joy and love are well separated from the negative emotions and share a similar space. Surprise is scattered all over the place. Separation is not guaranteed, as the model was not trained to distinguish these emotions; it only learned them implicitly by guessing masked words in texts.

**Training a simple classifier**

In [47]:
# use logistic regression, simple and does not require GPU
from sklearn.linear_model import LogisticRegression

# increase max_iter to guarantee convergence
lr_clf = LogisticRegression(max_iter=3000)
lr_clf.fit(X_train, y_train)
lr_clf.score(X_valid, y_valid)

As we are dealing with an unbalanced multiclass dataset, this accuracy is actually significantly better than it might first appear. We can check this with a `DummyClassifier` as a baseline, which is based on simple heuristics such as always choosing the majority class or always drawing a random class.

In [48]:
from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
dummy_clf.score(X_valid, y_valid)
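For intuition, the `most_frequent` strategy can be sketched by hand: find the majority label in the training set, predict it everywhere, and measure accuracy. The helper name and the toy labels below are made up for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_valid):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for y in y_valid if y == majority_label)
    return majority_label, hits / len(y_valid)

# toy integer labels standing in for the emotion classes
toy_train = [1, 1, 1, 0, 0, 3]
toy_valid = [1, 0, 1, 1, 3, 2, 1, 0]

label, acc = majority_baseline_accuracy(toy_train, toy_valid)
print(label, acc)
```

On an imbalanced dataset this baseline can look deceptively strong, which is why it is a useful reference point for any real classifier.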

We can investigate performance by looking at confusion matrix of the classifier, which tells us the relationship between true and predicted labels.

In [49]:
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def plot_confusion_matrix(y_preds, y_true, labels):
    cm = confusion_matrix(y_true, y_preds, normalize="true")
    fig, ax = plt.subplots(figsize=(6, 6))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()
    
y_preds = lr_clf.predict(X_valid)
plot_confusion_matrix(y_preds, y_valid, labels)

We see anger and fear are often confused with sadness, which aligns with the previous 2D observations of the embeddings. Also love and surprise are frequently mistaken for joy.
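The `normalize="true"` option used above divides each row by the number of examples with that true label, so every row sums to 1 and the diagonal reads as per-class recall. A hand-written sketch of that computation, with a hypothetical helper and toy labels:

```python
def normalized_confusion(y_true, y_pred, n_classes):
    """Row-normalised confusion matrix: rows are true labels, columns predictions."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # guard against classes that never occur in y_true
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# toy labels (made up for illustration)
toy_true = [0, 0, 0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 0, 1, 0, 2, 2]

cm = normalized_confusion(toy_true, toy_pred, n_classes=3)
for row in cm:
    print(row)
```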

We will now fine-tune, which leads to superior classification performance but may require more computational resources, i.e. GPUs. If these are not available, a traditional feature-based ML approach like this is a good compromise.

### Fine-Tuning Transformers

This requires the classification head to be differentiable, therefore a NN is preferable for classification. In fine-tuning the entire model, the hidden states adapt during training to decrease total model loss and increase performance.

In [50]:
# Sequence classification automodel has a classification head on top of pretrained model outputs so can easily be trained with base
# just need to specify number of labels the classification head has
from transformers import AutoModelForSequenceClassification

num_labels = 6
model = (AutoModelForSequenceClassification.from_pretrained(model_ckpt, num_labels=num_labels).to(device))

In [68]:
# define performance metrics
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    """Receives an EvalPrediction object (named tuple with predictions and label_ids attributes)
    and needs to return a dictionary that maps each metric's name to its value. Here we have
    F1-score and model accuracy."""
    labels = pred.label_ids
    pred = pred.predictions.argmax(-1)
    f1 = f1_score(labels, pred, average="weighted")
    acc = accuracy_score(labels, pred)
    return {"accuracy": acc, "f1": f1}
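To see what `average="weighted"` does, here is a hand-rolled sketch: compute F1 per class, then average with each class weighted by its number of true examples. The helper `weighted_f1` and the toy labels are made up for illustration; in practice `f1_score` from scikit-learn is the one to use.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)

# toy labels (made up for illustration)
toy_true = [0, 0, 0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 0, 1, 0, 2, 2]
print(weighted_f1(toy_true, toy_pred))
```

Weighting by support means frequent classes dominate the score, which is worth keeping in mind on an imbalanced dataset like this one.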

In [53]:
# log into huggingface hub so we can push our fine-tuned model to hub account and share with community
from huggingface_hub import notebook_login
notebook_login() 
# can run `huggingface-cli login` via terminal

In [64]:
# TrainingArguments class stores a lot of information and allows fine-grained control over training and evaluation
from transformers import Trainer, TrainingArguments

batch_size = 64
logging_steps = len(emotions_encoded["train"]) // batch_size
model_name = f"{model_ckpt}-finetuned-emotion"
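training_args = TrainingArguments(output_dir=model_name,
                                 num_train_epochs=2,
                                 learning_rate=2e-5,
                                 per_device_train_batch_size=batch_size,
                                 per_device_eval_batch_size=batch_size,
                                 weight_decay=0.01,
                                 # note: newer transformers releases rename this argument to eval_strategy
                                 evaluation_strategy="epoch",
                                 disable_tqdm=False,
                                 logging_steps=logging_steps,
                                 # note: in many releases None falls back to the default integrations;
                                 # the string "none" disables reporting
                                 report_to=None,
                                 push_to_hub=True,
                                 log_level="error")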

In [59]:
!sudo apt-get install git-lfs

In [69]:
# finally, instantiate trainer and train model
trainer = Trainer(model=model, args=training_args, 
                  compute_metrics=compute_metrics, 
                  train_dataset=emotions_encoded["train"],
                  eval_dataset=emotions_encoded["validation"],
                  tokenizer=tokenizer
                 )
trainer.train();

In [70]:
# get prediction outputs
preds_output = trainer.predict(emotions_encoded["validation"])

# we get a PredictionOutput object that has arrays of predictions and label_ids along with metrics
preds_output.metrics

In [71]:
# can greedily decode with np.argmax(), yields predicted labels
y_preds = np.argmax(preds_output.predictions, axis=1)

In [72]:
plot_confusion_matrix(y_preds, y_valid, labels)

Results look much better; there is still some confusion between love and joy, and between surprise and joy. Before we conclude, let's dive deeper into the mistakes our model is likely to make.

**Error Analysis**

A simple and powerful technique is to sort the validation samples by model loss. 

In [73]:
from torch.nn.functional import cross_entropy

# a function to return the loss along with predicted label
def forward_pass_with_label(batch):
    
    # place all input tensors on same device as model
    inputs = {k:v.to(device) for k, v in batch.items() if k in tokenizer.model_input_names}
    
    # get output loss
    with torch.no_grad():
        output = model(**inputs)
        pred_label = torch.argmax(output.logits, axis=-1)
        loss = cross_entropy(output.logits, batch["label"].to(device), reduction="none")
    
    # place outputs on CPU for compatibility with other dataset columns
    return {"loss":loss.cpu().numpy(), "predicted_label": pred_label.cpu().numpy()}


In [76]:
# convert dataset back to PyTorch tensors
emotions_encoded.set_format("torch", columns=["input_ids", "attention_mask", "label"])

# compute loss values
emotions_encoded["validation"] = emotions_encoded["validation"].map(
    forward_pass_with_label, batched=True, batch_size=16
)
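A pure-Python sketch of the idea behind `reduction="none"`: one cross-entropy value per example, which can then be sorted. The logits, texts and helper name are made up for illustration and no model is involved.

```python
import math

def cross_entropy_per_sample(logits, label):
    """-log softmax(logits)[label], computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# made-up logits for a 3-class problem
toy = [
    {"text": "confident and right", "logits": [4.0, 0.0, 0.0], "label": 0},
    {"text": "confident and wrong", "logits": [4.0, 0.0, 0.0], "label": 1},
    {"text": "unsure",              "logits": [0.0, 0.0, 0.0], "label": 2},
]
for row in toy:
    row["loss"] = cross_entropy_per_sample(row["logits"], row["label"])

# highest loss first: these are the examples worth inspecting by hand
ranked = sorted(toy, key=lambda r: r["loss"], reverse=True)
print([r["text"] for r in ranked])
```

Confidently wrong predictions get the largest loss, so sorting by loss surfaces both model mistakes and suspicious labels.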

In [77]:
# finally, create dataframe with texts, losses and pred/true labels
emotions_encoded.set_format("pandas")
cols = ["text", "label", "predicted_label", "loss"]
df_test = emotions_encoded["validation"][:][cols]
df_test["label"] = df_test["label"].apply(label_int2str)
df_test["predicted_label"] = df_test["predicted_label"].apply(label_int2str)

We can now easily sort by loss. This helps to detect:
-  *Wrong labels*: mistakes in the labelling process, which can be quickly corrected once found
-  *Quirks of the dataset*: real-world data is always a bit messy, so inspecting the model's weakest predictions can help identify special features of such a dataset

In [81]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 0)


# samples with highest losses
df_test.sort_values("loss", ascending=False).head(10)

Some of the texts appear dubious, so the labels are questionable in places. With this information we can refine the dataset, which can often lead to as big a performance gain as having more data or larger models.

**Data Quality Matters!**

It is also worth looking at the samples the model is most confident about, as there is a chance the model is exploiting shortcuts to get a prediction. Look at the smallest losses:

In [82]:
df_test.sort_values("loss", ascending=True).head(10)

Our model is most confident about joy it seems. So we can keep an eye on this class and make targeted improvements to our training dataset.

In [83]:
# share a model to the community
trainer.push_to_hub(commit_message = "Training completed!")

In [84]:
from transformers import pipeline

# can use the model to make predictions on new tweets
model_id = "stevevee0101/distilbert-base-uncased-finetuned-emotion"
classifier = pipeline("text-classification", model=model_id)

In [85]:
custom_tweet = "I walked up the stairs and saw in the corner of my eye a giant tomato!"
preds = classifier(custom_tweet, return_all_scores=True)

In [88]:
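# plot probabilities
# depending on the transformers version, a single input may come back wrapped in an extra list
scores = preds[0] if isinstance(preds[0], list) else preds
preds_df = pd.DataFrame(scores)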
plt.bar(labels, 100 * preds_df["score"], color='C0')
plt.title(f'"{custom_tweet}"')
plt.ylabel("Class probability (%)")
plt.show()
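The scores plotted above behave like probabilities: they are positive and sum to one across the classes. For a single-label classifier like this one they are typically obtained by applying a softmax to the model's logits. A minimal sketch with made-up logits (one per emotion class), not the pipeline's own code:

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# made-up logits for six classes
toy_logits = [2.0, 1.0, 0.0, 0.0, -1.0, 3.0]
probs = softmax(toy_logits)
print([round(100 * p, 1) for p in probs])
```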

We have trained a model to classify emotions in tweets! We have seen two complementary approaches, feature extraction and fine-tuning, and assessed their strengths and weaknesses.

Potential next challenges:
-  Model in production: we want to serve predictions, e.g. via the Hugging Face Hub
-  Faster predictions: knowledge distillation and other tricks to speed up models are covered in Chap 8
-  Other tasks: what else can the model do beyond text classification?
-  Non-English text: multilingual models are available
-  No labels: fine-tuning may not be an option, but there are potentially other techniques