# 0. Install requirements

Installation for Colab:

In [None]:
!pip install cloud-tpu-client==0.10 torch==1.11.0 https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.11-cp37-cp37m-linux_x86_64.whl
!pip install transformers datasets sentence-transformers

Collecting torch-xla==1.11
  Using cached https://storage.googleapis.com/tpu-pytorch/wheels/colab/torch_xla-1.11-cp37-cp37m-linux_x86_64.whl (152.9 MB)


Installation for local use:

In [None]:
!pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu113


Setting a cache directory is a good idea if you train on your own PC. Large datasets, models, and other files are stored there, so they are not re-downloaded on every run and do not fill up your boot drive.
If you set `cache_dir` to ```None```, the default Hugging Face cache is used.

In [1]:
cache_dir = 'D:\\cache\\huggingface'
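
The hard-coded Windows path above only works on one machine. A minimal sketch of a more portable setup follows. It assumes a hypothetical environment variable `NOTEBOOK_CACHE_DIR`; the name is made up for this example and is not something Hugging Face reads on its own. The helper `resolve_cache_dir` is likewise illustrative.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_cache_dir(env_var: str = "NOTEBOOK_CACHE_DIR") -> Optional[str]:
    """Return the cache path stored in `env_var`, or None if it is unset.

    Returning None lets the Hugging Face libraries fall back to their
    default cache location, as described above.
    """
    raw = os.environ.get(env_var)
    if not raw:
        return None
    return str(Path(raw).expanduser())


# Hypothetical usage: pass the result wherever `cache_dir` is expected.
portable_cache_dir = resolve_cache_dir()
print(f"Using cache dir: {portable_cache_dir}")
```

With the variable unset, the helper returns `None`, so the same notebook runs unchanged on Colab and on a local machine.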

# 1. Imports for zero shot

In [2]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# 3. Load model and pipeline

In [4]:
model_name = 'facebook/bart-large-mnli'
tokenizer_name = 'facebook/bart-large-mnli'
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, cache_dir=cache_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_name, cache_dir=cache_dir)
zero_shooter = pipeline(
    'zero-shot-classification', model=model, tokenizer=tokenizer,
    device=0, multi_label=True, cache_dir=cache_dir
)

# 4. Evaluate

Let's define the candidate labels and the hypothesis template, then implement the evaluation loop.

In [5]:
labels = ['negative', 'positive']
hypothesis_template = "This review is {}."
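
To make the role of the template concrete: the zero-shot pipeline treats each review as an NLI premise and builds one hypothesis per candidate label by filling the `{}` placeholder. The sketch below reproduces only that string-formatting step. The helper name `build_hypotheses` is made up for illustration and is not part of `transformers`.

```python
from typing import List

labels = ['negative', 'positive']
hypothesis_template = "This review is {}."


def build_hypotheses(template: str, candidate_labels: List[str]) -> List[str]:
    """Fill the template once per candidate label."""
    return [template.format(label) for label in candidate_labels]


hypotheses = build_hypotheses(hypothesis_template, labels)
for hypothesis in hypotheses:
    print(hypothesis)
```

The model then scores how strongly each review entails each of these sentences, which is why the wording of the template can affect the results.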


Let's start with "softballs" - easy and unambiguous examples.

In [6]:
easy_set = [
    {'text': 'I loved this movie! Hundred percent recommend!', 'label': 1},
    {'text': 'The worst movie ever! You shouldn\'t watch it!', 'label': 0},
    {'text': 'This is my favorite movie ever! Cannot wait for sequel!',
     'label': 1},
    {'text': 'Simply just waste of time.', 'label': 0}
]
for example in easy_set:
    result = zero_shooter(
        example['text'], labels, hypothesis_template=hypothesis_template,
        multi_label=True
    )
    true_y = labels[example['label']]
    scores = result['scores']
    res_labels = result['labels']
    # Pick the label with the higher score
    pred_y = res_labels[0] if scores[0] > scores[1] else res_labels[1]
    print(f"Example: {example['text']}")
    print(f"True sentiment: {true_y}")
    print(f"Predicted sentiment: {pred_y}\n\n")

Example: I loved this movie! Hundred percent recommend!
True sentiment: positive
Predicted sentiment: positive


Example: The worst movie ever! You shouldn't watch it!
True sentiment: negative
Predicted sentiment: negative


Example: This is my favorite movie ever! Cannot wait for sequel!
True sentiment: positive
Predicted sentiment: positive


Example: Simply just waste of time.
True sentiment: negative
Predicted sentiment: negative




So far so good! To evaluate on real-world examples, we need to set up an evaluation loop.

Our loop will print a few mismatched examples (up to five). This way we can evaluate the model not only quantitatively but also qualitatively.

In [7]:

def eval_loop(dataset):
    confusion_dict = {
        'positive': {'positive': 0, 'negative': 0},
        'negative': {'positive': 0, 'negative': 0}
    }
    shown_errors = 0
    for i in range(len(dataset)):
        example = dataset[i]
        result = zero_shooter(
            example['text'], labels, hypothesis_template=hypothesis_template,
            multi_label=False
        )
        true_y = labels[example['label']]
        scores = result['scores']
        res_labels = result['labels']
        if scores[0] > scores[1]:
            pred_y = res_labels[0]
        else:
            pred_y = res_labels[1]

        confusion_dict[true_y][pred_y] += 1
        if shown_errors < 5 and true_y != pred_y:
            shown_errors += 1
            print('Wrong prediction:')
            print(f"Example: {example['text']}\n")
            print(f"True sentiment: {labels[example['label']]}")
            print(f"Predicted sentiment: {pred_y}\n\n")

    n_hits = confusion_dict['positive']['positive'] + \
        confusion_dict['negative']['negative']
    n_examples = n_hits + confusion_dict['positive']['negative'] + \
        confusion_dict['negative']['positive']
    print(f'Accuracy of zero-shot is: {n_hits / n_examples:.2f}')
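
The nested dictionary in `eval_loop` is indexed as `confusion_dict[true_label][predicted_label]`, so it holds more information than the single accuracy number that gets printed. As a sketch, a hypothetical helper `summarize_confusion` (not part of the notebook) could also derive per-class precision and recall. The counts below are invented purely to show the arithmetic and are not results from the model.

```python
def summarize_confusion(confusion):
    """Compute accuracy plus per-class precision and recall.

    `confusion[true_label][predicted_label]` holds the example counts.
    """
    class_names = list(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    hits = sum(confusion[name][name] for name in class_names)
    summary = {'accuracy': hits / total if total else 0.0}
    for name in class_names:
        # Column sum: everything the model predicted as `name`
        predicted = sum(confusion[true][name] for true in class_names)
        # Row sum: everything that truly is `name`
        actual = sum(confusion[name].values())
        summary[name] = {
            'precision': confusion[name][name] / predicted if predicted else 0.0,
            'recall': confusion[name][name] / actual if actual else 0.0,
        }
    return summary


# Made-up counts, for illustration only
toy_confusion = {
    'positive': {'positive': 45, 'negative': 5},
    'negative': {'positive': 10, 'negative': 40},
}
toy_summary = summarize_confusion(toy_confusion)
print(toy_summary)
```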

Everything is set up. Now we only need evaluation examples. We can find them in the IMDB dataset available on Hugging Face Datasets.

Holding out about 10% of the data for evaluation is good practice. We will keep the remaining 90% of the training set for supervised fine-tuning. We also need to fix the shuffle seed so that our results are reproducible.

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("imdb", cache_dir=cache_dir)
dataset = raw_datasets['train'].shuffle(seed=42)
cut_val = int(0.9*len(dataset))
train_set = dataset.select(range(cut_val))
eval_set = dataset.select(range(cut_val, len(dataset)))
print(f'Amount of training examples: {len(train_set)}')
print(f'Amount of evaluation examples: {len(eval_set)}')

Reusing dataset imdb (D:\cache\huggingface\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached shuffled indices for dataset at D:\cache\huggingface\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1\cache-8a9e43a6ac4acdff.arrow


Amount of training examples: 22500
Amount of evaluation examples: 2500
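
The split itself is just a seeded shuffle followed by a cut at the 90% mark. The sketch below mirrors that logic on plain Python indices instead of a `datasets.Dataset`, so it runs without downloading anything. `split_indices` is a made-up helper, and Python's `random` module will not reproduce the same permutation as `Dataset.shuffle(seed=42)`. Only the idea is the same.

```python
import random


def split_indices(n_examples, train_fraction=0.9, seed=42):
    """Shuffle indices with a fixed seed and cut them into train/eval parts."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    cut = int(train_fraction * n_examples)
    return indices[:cut], indices[cut:]


# The IMDB train split has 25,000 reviews
train_idx, eval_idx = split_indices(25000)
print(f'Amount of training examples: {len(train_idx)}')
print(f'Amount of evaluation examples: {len(eval_idx)}')
```

Because the seed is fixed, calling the helper again yields the same two index lists, which is what makes later comparisons between approaches fair.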


Dataset ready. Zero-shot model ready. Eval loop ready. Now we can combine all the parts and see how accurate this approach is!

In [9]:
eval_loop(eval_set)



Wrong prediction:
Example: There are people claiming this is another "bad language" ultra violence Mexican movie. They are right, but more than that this film is a call to create awareness of what we have become. The awful truth hurts, or bores when you already have accepted the paradigm of living the third world as the only possible goal. One of the most important things of "Cero y van cuatro" is the open invitation to profound reflexion over our current identity. Is that what we all are? Is that all that we want to be? I am abroad and I realized how spoiled is the Mexican society when the Tlahuac Incident came to light. I still cannot understand viewers witnessing a mass broadcasted murder. I nearly puked when I saw some of the images. It was not Irak or Rwanda, just a tiny village near Mexico City when rampage was carried out with the indulgence of media and government. The recreation of a similar situation in this film shocked me deeply. The other stories were good portraying other

**The results are surprisingly good: 93% accuracy!**

This is especially impressive given that the model has never seen this specific dataset and, moreover, was neither trained nor fine-tuned for sentiment detection!

Lastly, we can free the memory reserved for the zero-shot approach above. Feel free to skip this step if it is not needed.

In [None]:
del tokenizer
del zero_shooter
del model
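
Note that `del` only drops the Python references; on a GPU, PyTorch's caching allocator may keep the freed blocks reserved. Below is a hedged sketch of an extra cleanup step, assuming PyTorch is installed. The helper name `release_cached_memory` is hypothetical, and the import guard is only there so the sketch also runs where PyTorch is missing.

```python
import gc

try:
    import torch
except ImportError:  # allows the sketch to run without PyTorch
    torch = None


def release_cached_memory() -> bool:
    """Collect garbage and, if CUDA is available, empty PyTorch's cache.

    Returns True when the CUDA cache was emptied, False otherwise.
    """
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False


emptied = release_cached_memory()
print(f"CUDA cache emptied: {emptied}")
```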

# 5. Fine-tune a pretrained sentence-transformers model for sentiment analysis of movie reviews

Let's redefine the constants and reload the imports. This way you can run this section as a self-contained part.

In [1]:
cache_dir = 'D:\\cache\\huggingface'
from datasets import load_dataset

raw_datasets = load_dataset("imdb", cache_dir=cache_dir)
dataset = raw_datasets['train'].shuffle(seed=42)
cut_val = int(0.9*len(dataset))
train_set = dataset.select(range(cut_val))
eval_set = dataset.select(range(cut_val, len(dataset)))
print(f'Amount of training examples: {len(train_set)}')
print(f'Amount of evaluation examples: {len(eval_set)}')

Reusing dataset imdb (D:\cache\huggingface\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1)


  0%|          | 0/3 [00:00<?, ?it/s]

Loading cached shuffled indices for dataset at D:\cache\huggingface\imdb\plain_text\1.0.0\2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1\cache-8a9e43a6ac4acdff.arrow


Amount of training examples: 22500
Amount of evaluation examples: 2500


We can fine-tune a large pretrained model to see how the more conventional approach stacks up against zero-shot classification.

In [2]:
# Build a classification head on top of a pretrained sentence encoder
import torch
from torch import nn
from sentence_transformers import SentenceTransformer

class ImdbReviewFineTunedModel(nn.Module):
    def __init__(
            self,
            hidden_size: int = 768,
            hidden_scale: float = 2,
            dropout: float = 0.2,
            device: str = 'cpu'
    ):
        super().__init__()
        hid_size_2 = int(hidden_size * hidden_scale)
        hid_size_3 = int(hid_size_2 // 2)
        self.fine_tuned = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(hidden_size, hid_size_2),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(hid_size_2, hid_size_3),
            nn.ReLU(),
            nn.Linear(hid_size_3, 2)
        ).to(device)

    def forward(self, batch_in):
        result = self.fine_tuned(batch_in)
        return result

In [3]:
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
pretrained_name = 'all-mpnet-base-v2'

pretrain_lm = SentenceTransformer(pretrained_name)
model = ImdbReviewFineTunedModel(
    hidden_size=768,
    hidden_scale=1.75,
    dropout=0.5,
    device=device
)
model = model.to(device)
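
To see what the constructor arguments above translate to, the layer widths and the number of trainable parameters in the head can be worked out by hand. The sketch below repeats the size arithmetic from `ImdbReviewFineTunedModel.__init__` in plain Python; the helpers `head_shapes` and `count_parameters` are illustrative only. Each `nn.Linear(in, out)` contributes `in * out` weights plus `out` biases.

```python
def head_shapes(hidden_size=768, hidden_scale=1.75, n_classes=2):
    """Return (in_features, out_features) for the three linear layers."""
    hid_size_2 = int(hidden_size * hidden_scale)
    hid_size_3 = hid_size_2 // 2
    return [
        (hidden_size, hid_size_2),
        (hid_size_2, hid_size_3),
        (hid_size_3, n_classes),
    ]


def count_parameters(shapes):
    """Weights plus biases over all linear layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


shapes = head_shapes()
n_params = count_parameters(shapes)
print(shapes)    # [(768, 1344), (1344, 672), (672, 2)]
print(n_params)  # 1938722
```

So only about 1.9 million parameters are trained here, while the sentence encoder itself stays frozen.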


In [4]:
import os

def train_loop(_model, path):
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=256)
    eval_loader = torch.utils.data.DataLoader(eval_set, batch_size=256)
    model_path = os.path.join(path, "model_state.pt")
    if os.path.exists(model_path):
        _model.load_state_dict(torch.load(model_path, map_location=device))
    _model.train()
    _model.to(device)
    optim = torch.optim.Adam(_model.parameters(), lr=3e-4)
    optim_path = os.path.join(path, "optim_state.pt")
    if os.path.exists(optim_path):
        optim.load_state_dict(torch.load(optim_path, map_location=device))
    loss_f = nn.CrossEntropyLoss()
    counter = 10
    for batch in train_loader:
        optim.zero_grad()
        x = batch['text']
        y = batch['label'].to(device)
        embedded = pretrain_lm.encode(x, convert_to_tensor=True)
        pred = _model(embedded)
        loss = loss_f(pred, y)
        loss.backward()
        optim.step()
        counter -= 1
        if counter == 0:
            print(f"Current loss: {loss:.4f}")
            torch.save(_model.cpu().state_dict(), model_path)
            _model.to(device)
            torch.save(optim.state_dict(), optim_path)
            counter = 50
    torch.save(_model.state_dict(), model_path)
    confusion_dict = {
        'positive': {'positive': 0, 'negative': 0},
        'negative': {'positive': 0, 'negative': 0}
    }
    with torch.no_grad():
        _model.eval()

        for batch in eval_loader:
            x = batch['text']
            y = batch['label'].to(device)
            embedded = pretrain_lm.encode(x, convert_to_tensor=True)
            pred = _model(embedded)
            pred_max = torch.argmax(pred, dim=-1).to(device)

            for i in range(y.size(0)):
                y_sentiment = 'negative' if y[i] == 0 else 'positive'
                pred_sentiment = 'negative' if pred_max[i] == 0 else 'positive'
                confusion_dict[y_sentiment][pred_sentiment] += 1

    n_hits = confusion_dict['positive']['positive'] + \
        confusion_dict['negative']['negative']
    n_examples = n_hits + confusion_dict['positive']['negative'] + \
        confusion_dict['negative']['positive']
    _val_acc = n_hits / n_examples
    print(f'Accuracy of fine-tuned model is: {_val_acc:.2f}')
    return _val_acc
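
The `counter` logic in `train_loop` determines how often the loss is printed and a checkpoint is saved: the first time after 10 batches, then after every further 50. With 22,500 training examples and a batch size of 256 there are 88 batches per epoch, so only batches 10 and 60 trigger a log, which gives two loss lines per epoch. A small sketch replaying that counter (the function `logging_steps` is hypothetical):

```python
import math


def logging_steps(n_examples, batch_size, first=10, every=50):
    """Replay the countdown from the training loop and record when it fires."""
    n_batches = math.ceil(n_examples / batch_size)  # last partial batch is kept
    steps = []
    counter = first
    for step in range(1, n_batches + 1):
        counter -= 1
        if counter == 0:
            steps.append(step)
            counter = every
    return n_batches, steps


n_batches, steps = logging_steps(22500, 256)
print(n_batches, steps)  # 88 [10, 60]
```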

In [5]:
save_path = 'D:\\models\\sentiment-fine-tuning\\'
os.makedirs(save_path, exist_ok=True)  # make sure the checkpoint directory exists

In [6]:
epochs = 15
best_val_acc = 0
best_model_path = os.path.join(save_path, "best_model_state.pt")
for epoch in range(epochs):
    print("Epoch: ", epoch + 1)
    val_acc = train_loop(model, save_path)
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.cpu().state_dict(), best_model_path)

    print('\n')

Epoch:  1
Current loss: 0.6663
Current loss: 0.3864
Accuracy of fine-tuned model is: 0.89


Epoch:  2
Current loss: 0.3464
Current loss: 0.3503
Accuracy of fine-tuned model is: 0.89


Epoch:  3
Current loss: 0.2920
Current loss: 0.3573
Accuracy of fine-tuned model is: 0.89


Epoch:  4
Current loss: 0.3177
Current loss: 0.3396
Accuracy of fine-tuned model is: 0.89


Epoch:  5
Current loss: 0.3158
Current loss: 0.3535
Accuracy of fine-tuned model is: 0.89


Epoch:  6
Current loss: 0.3136
Current loss: 0.3279
Accuracy of fine-tuned model is: 0.89


Epoch:  7
Current loss: 0.3134
Current loss: 0.3193
Accuracy of fine-tuned model is: 0.90


Epoch:  8
Current loss: 0.3252
Current loss: 0.3144
Accuracy of fine-tuned model is: 0.90


Epoch:  9
Current loss: 0.2986
Current loss: 0.3021
Accuracy of fine-tuned model is: 0.90


Epoch:  10
Current loss: 0.3014
Current loss: 0.3135
Accuracy of fine-tuned model is: 0.89


Epoch:  11
Current loss: 0.2975
Current loss: 0.2787
Accuracy of fine-tuned mod

As we can see, even one of the best sentence-transformers language models, with an auxiliary head trained specifically for this task, barely achieves similar results.
We could probably get better scores by engineering features, using a larger pretrained language model, or fine-tuning the transformer layers and not only the auxiliary layers. However, we need to remember that this is a trade-off. The more time we spend on engineering, the more costly the project becomes. And the more time we spend tuning hyperparameters for the auxiliary layers, the more task-specific the solution becomes.
All of this is important to keep in mind when zero-shot classification works "out of the box".