# Practical classification with pre-trained BERT

In this notebook I download pre-trained BERT model and fine-tune it with high-level HuggingFace tools.

There is another notebook, doing the same with lower-level PyTorch tools only.

## References:
* https://huggingface.co/course/chapter3/4?fw=pt - HuggingFace transformers course reference

In [None]:
# minimal example of using a pre-trained model for classification
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification


checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
output = model(**tokens)

torch.nn.functional.softmax(output.logits, dim=1)

In [1]:
import pandas as pd


essays = pd.read_csv("./data/essays.csv")

essays.loc[essays['cEXT'] == 'n', 'cEXT'] = 0
essays.loc[essays['cEXT'] == 'y', 'cEXT'] = 1

essays.loc[essays['cNEU'] == 'n', 'cNEU'] = 0
essays.loc[essays['cNEU'] == 'y', 'cNEU'] = 1

essays.loc[essays['cAGR'] == 'n', 'cAGR'] = 0
essays.loc[essays['cAGR'] == 'y', 'cAGR'] = 1

essays.loc[essays['cCON'] == 'n', 'cCON'] = 0
essays.loc[essays['cCON'] == 'y', 'cCON'] = 1

essays.loc[essays['cOPN'] == 'n', 'cOPN'] = 0
essays.loc[essays['cOPN'] == 'y', 'cOPN'] = 1

essays.astype({'cEXT': 'int32', 'cNEU': 'int32', 'cAGR': 'int32', 'cCON': 'int32', 'cOPN': 'int32'}).dtypes

essays

Unnamed: 0,#AUTHID,TEXT,cEXT,cNEU,cAGR,cCON,cOPN
0,1997_504851.txt,"Well, right now I just woke up from a mid-day ...",0,1,1,0,1
1,1997_605191.txt,"Well, here we go with the stream of consciousn...",0,0,1,0,0
2,1997_687252.txt,An open keyboard and buttons to push. The thin...,0,1,0,1,1
3,1997_568848.txt,I can't believe it! It's really happening! M...,1,0,1,1,0
4,1997_688160.txt,"Well, here I go with the good old stream of co...",1,0,1,0,1
...,...,...,...,...,...,...,...
2462,2004_493.txt,I'm home. wanted to go to bed but remembe...,0,1,0,1,0
2463,2004_494.txt,Stream of consiousnesssskdj. How do you s...,1,1,0,0,1
2464,2004_497.txt,"It is Wednesday, December 8th and a lot has be...",0,0,1,0,0
2465,2004_498.txt,"Man this week has been hellish. Anyways, now i...",0,1,0,0,1


In [2]:
import torch
from torch.utils.data import DataLoader, random_split, default_convert
from transformers import AdamW, AutoTokenizer, BertForSequenceClassification
from datasets import Dataset, DatasetDict


# prepare dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(essays):
    return tokenizer(essays["TEXT"], padding="max_length", truncation=True)  # , return_tensors="pt")

essays_dataset = Dataset.from_pandas(essays)
tokenized_dataset = essays_dataset.map(tokenize_function, batched=True, batch_size=8)
tokenized_dataset = tokenized_dataset.rename_column("TEXT", "text")
tokenized_dataset = tokenized_dataset.rename_column("cNEU", "labels")
tokenized_dataset = tokenized_dataset.remove_columns(['#AUTHID', 'text', 'cEXT', 'cAGR', 'cCON', 'cOPN'])

train_dataset, validation_dataset = random_split(tokenized_dataset, [2000, len(tokenized_dataset) - 2000])

ds = DatasetDict()
ds['train'] = train_dataset
ds['validation'] = validation_dataset

# vocab = tokenizer.get_vocab()
# ivocab = {v: k for k, v in vocab.items()}
print(ds['train'][0]['input_ids'])

train_dataloader = DataLoader(ds['train'], shuffle=True, batch_size=8)

  from .autonotebook import tqdm as notebook_tqdm
100%|█████████████████████████████████████████████████| 309/309 [00:01<00:00, 203.17ba/s]

[101, 10047, 5458, 1010, 13233, 1998, 1045, 2428, 2123, 1005, 1056, 2514, 2035, 2008, 2204, 2157, 2085, 1012, 2026, 4308, 13403, 2021, 10047, 5458, 1012, 1045, 2514, 25227, 1012, 3778, 2013, 7249, 1012, 2082, 1010, 2147, 1010, 2166, 1012, 2129, 2079, 1045, 2113, 2054, 10047, 2725, 2007, 2026, 2166, 2003, 2054, 1045, 2001, 3214, 2000, 2079, 1029, 1045, 2293, 6864, 2016, 2965, 1996, 2088, 2000, 2033, 1012, 1045, 4299, 2008, 1045, 2910, 1005, 1056, 3631, 2039, 2007, 2014, 2197, 2095, 1012, 2009, 9868, 1037, 2843, 1997, 2477, 1999, 2026, 2166, 1012, 2021, 1045, 2245, 2008, 1045, 2052, 2022, 19366, 2007, 2619, 2842, 1998, 1045, 2001, 2005, 1037, 2460, 2558, 1997, 2051, 2021, 2025, 1037, 2154, 2253, 2011, 2008, 1045, 2134, 1005, 1056, 2228, 2055, 6864, 1998, 4687, 2065, 2016, 2003, 2428, 1996, 2028, 1012, 1045, 3335, 2026, 15310, 2040, 2351, 2006, 1996, 2034, 2154, 1997, 2082, 2023, 2095, 1012, 4921, 2063, 2018, 1037, 2428, 7823, 2051, 7149, 2007, 2010, 2331, 1012, 1045, 3984, 1045, 2074, 22




In [20]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2, output_attentions=True)

# I'm running this on Apple Silicon. Activate Metal "mps" device, if available:
if not torch.backends.mps.is_available():
    if not torch.backends.mps.is_built():
        print("MPS not available because the current PyTorch install was not "
              "built with MPS enabled.")
    else:
        print("MPS not available because the current MacOS version is not 12.3+ "
              "and/or you do not have an MPS-enabled device on this machine.")

else:
    mps_device = torch.device("mps")


torch.device("mps")
model.to(mps_device)

model.train()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [4]:
from transformers import get_scheduler
from tqdm.auto import tqdm
from torch.optim import AdamW


# parameters
num_epochs = 1  # 3
num_training_steps = num_epochs * len(train_dataloader)

cross_entropy_loss = torch.nn.CrossEntropyLoss().to(mps_device)

optimizer = AdamW(model.parameters(), lr=5e-5)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)


# test on one batch
# batch = next(iter(train_dataloader))

# labels = batch["labels"]
# del batch["labels"]

# batch = {k: torch.transpose(torch.stack(default_convert(v)), 0, 1) for k, v in batch.items()}
# batch = {k: v.to(mps_device) for k, v in batch.items()}

# output = model(**batch)
# labels.to(mps_device)
# mps_labels = torch.as_tensor(labels, device=mps_device)

# loss = cross_entropy_loss(output.logits, mps_labels)
# loss.backward()


# progress bar
progress_bar = tqdm(range(num_training_steps))

# training
for epoch in range(num_epochs):
    for batch in train_dataloader:
        labels = batch["labels"]
        mps_labels = torch.as_tensor(labels, device=mps_device)
        del batch["labels"]
        
        batch = {k: torch.transpose(torch.stack(default_convert(v)), 0, 1) for k, v in batch.items()}
        batch = {k: v.to(mps_device) for k, v in batch.items()}

        output = model(**batch)

        loss = cross_entropy_loss(output.logits, mps_labels)
        loss.backward()        

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

100%|██████████████████████████████████████████████████| 250/250 [56:59<00:00, 14.26s/it]

In [14]:
model.eval()

# test on one batch
batch = next(iter(train_dataloader))

labels = batch["labels"]
del batch["labels"]

batch = {k: torch.transpose(torch.stack(default_convert(v)), 0, 1) for k, v in batch.items()}
batch = {k: v.to(mps_device) for k, v in batch.items()}

outputs = model(**batch)
print(f"outputs = {outputs}")

for index, item in enumerate(output.logits):
    softmax = torch.nn.Softmax()
    print(f"outputs = {softmax(item)}")
    print(f"label = {labels[index]}")

# labels.to(mps_device)
# mps_labels = torch.as_tensor(labels, device=mps_device)

# loss = cross_entropy_loss(output.logits, mps_labels)
# loss.backward()

outputs = SequenceClassifierOutput(loss=None, logits=tensor([[0.1740, 0.1375],
        [0.1741, 0.1376],
        [0.1743, 0.1378],
        [0.1740, 0.1377],
        [0.1741, 0.1377],
        [0.1743, 0.1377],
        [0.1741, 0.1377],
        [0.1743, 0.1379]], device='mps:0', grad_fn=<MpsLinearBackward0>), hidden_states=None, attentions=None)
outputs = tensor([0.4522, 0.5478], device='mps:0', grad_fn=<SoftmaxBackward0>)
label = 0
outputs = tensor([0.4769, 0.5231], device='mps:0', grad_fn=<SoftmaxBackward0>)
label = 1
outputs = tensor([0.3628, 0.6372], device='mps:0', grad_fn=<SoftmaxBackward0>)
label = 0
outputs = tensor([0.4993, 0.5007], device='mps:0', grad_fn=<SoftmaxBackward0>)
label = 0
outputs = tensor([0.5464, 0.4536], device='mps:0', grad_fn=<SoftmaxBackward0>)
label = 0
outputs = tensor([0.4529, 0.5471], device='mps:0', grad_fn=<SoftmaxBackward0>)
label = 1
outputs = tensor([0.5302, 0.4698], device='mps:0', grad_fn=<SoftmaxBackward0>)
label = 1
outputs = tensor([0.4415, 0.558

  print(f"outputs = {softmax(item)}")


In [None]:
from datasets import load_metric


metric = load_metric('accuracy')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
from datasets import load_metric


validation_dataloader = DataLoader(ds['validation'], shuffle=True, batch_size=8)

metric = load_metric("accuracy")
model.eval()
for batch in validation_dataloader:
    labels = batch["labels"]
    mps_labels = torch.as_tensor(labels, device=mps_device)
    del batch["labels"]

    batch = {k: torch.transpose(torch.stack(default_convert(v)), 0, 1) for k, v in batch.items()}
    batch = {k: v.to(mps_device) for k, v in batch.items()}
    
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    softmax = torch.nn.Softmax()
    for index, item in enumerate(logits):
        print(f"probabilities = {softmax(item)}")
        print(f"label = {labels[index]}")
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=mps_labels)

metric.compute()

In [24]:
from bertviz import head_view


inputs = tokenizer.encode("Well, here we go with the stream of consciousness essay. I used to do things like this in high school sometimes. They were pretty interesting, but I often find myself with a lack of things to say. I normally consider myself someone who gets straight to the point. I wonder if I should hit enter any time to send this back to the front. Maybe I'll fix it later. My friend is playing guitar in my room now. Sort of playing anyway. More like messing with it. He's still learning. There's a drawing on the wall next to me. Comic book characters I think, but I'm not sure who they are. It's been a while since I've kept up with comic's. I just heard a sound from ICQ. That's a chat program on the internet. I don't know too much about it so I can't really explain too well. Anyway, I hope I'm done with this by the time another friend comes over. It will be nice to talk to her again. She went home this weekend for Labor Day. So did my brother. I didn't go. I'm not sure why. No reason to go, I guess. Hmm. when did I start this. Wow, that was a long line. I guess I won't change it later. Okay, I'm running out of things to talk about. I've found that happens to me a lot in conversation. Not a very interesting person, I guess.", return_tensors='pt', truncation=True)
inputs = inputs.to(mps_device)
outputs = model(inputs)
attention = outputs[-1]  # Output includes attention weights when output_attentions=True
tokens = tokenizer.convert_ids_to_tokens(inputs[0])
head_view(attention, tokens)