Load dataset

In [3]:
from datasets import load_dataset

Load a small binary sentiment dataset

In [4]:
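# Shuffle before subsetting: a plain "train[:2000]" slice follows the stored order,
# which for IMDB appears to be grouped by label, so it could contain a single class.
dataset = load_dataset("imdb", split="train").shuffle(seed=42).select(range(2000))
dataset = dataset.train_test_split(test_size=0.2, seed=42)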

Generating train split: 100%|██████████| 25000/25000 [00:00<00:00, 70133.98 examples/s]
Generating test split: 100%|██████████| 25000/25000 [00:00<00:00, 97943.48 examples/s] 
Generating unsupervised split: 100%|██████████| 50000/50000 [00:00<00:00, 98098.61 examples/s] 
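A subset taken from a larger split is not guaranteed to contain both classes in similar proportions, so a quick label count is a cheap sanity check before training. The sketch below uses a hypothetical helper, label_counts, and a toy label list in place of the real data; with the dataset above it would be called on dataset["train"]["label"].

```python
from collections import Counter

def label_counts(labels):
    """Return how many examples each class label has, ordered by label."""
    return dict(sorted(Counter(labels).items()))

# Toy labels standing in for dataset["train"]["label"] (0 = negative, 1 = positive)
toy_labels = [0, 1, 1, 0, 1, 0, 0, 1]
counts = label_counts(toy_labels)
print(counts)
```

If one class is missing or heavily under-represented, reshuffling the subset before training is usually the simplest remedy.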


Tokenize using the BertTokenizer

In [5]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_fn(example):
    return tokenizer(example["text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_fn, batched=True)

Map: 100%|██████████| 1600/1600 [00:11<00:00, 138.06 examples/s]
Map: 100%|██████████| 400/400 [00:01<00:00, 216.52 examples/s]
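With padding="max_length" and truncation=True, every review is cut or padded to the tokenizer's maximum length (512 tokens for bert-base-uncased), and the attention mask marks which positions hold real tokens. The following is a toy illustration of that idea in plain Python, not the tokenizer's actual implementation: pad_or_truncate, the small max_length, and the token ids are hypothetical, and the special-token handling of real truncation is ignored.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or pad a list of token ids to exactly max_length entries.

    Returns the fixed-length ids and an attention mask
    (1 = real token, 0 = padding).
    """
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# Arbitrary ids used only for illustration
short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 2023, 3185, 2001, 6429, 999, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding everything to the maximum length is simple but spends compute on padding for short texts; padding per batch (for example with DataCollatorWithPadding from transformers) is a common alternative.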


Load model and prep for training

In [6]:
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Set up training

In [7]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./bert-sentiment",
    # Renamed from "evaluation_strategy" in recent transformers releases;
    # the old name raises a TypeError there.
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
)

Train the model

In [None]:
trainer.train()
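To get a feel for how long the run will be, the number of optimizer updates can be estimated from the settings above: 1600 training examples, a batch size of 8, and 3 epochs. This is a back-of-the-envelope sketch that assumes a single device and no gradient accumulation; total_training_steps is a hypothetical helper, not part of transformers.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer updates for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

steps = total_training_steps(num_examples=1600, batch_size=8, num_epochs=3)
print(steps)  # 600
```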

Evaluate

In [None]:
trainer.evaluate()
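As configured, trainer.evaluate() reports the evaluation loss and timing figures but no accuracy. A minimal sketch of an accuracy metric is shown below, using toy logits and labels in place of real model output; the function receives a (logits, labels) pair in the way the Trainer passes predictions to its compute_metrics callback.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # highest-scoring class per example
    return {"accuracy": float((predictions == labels).mean())}

# Toy values for illustration only
toy_logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.3], [-0.2, 0.4]])
toy_labels = np.array([0, 1, 1, 1])
print(compute_metrics((toy_logits, toy_labels)))
```

To use it, pass compute_metrics=compute_metrics when constructing the Trainer; the accuracy should then appear alongside the loss in the evaluation results.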

Predict on new text

In [None]:
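import torch

def predict_sentiment(text):
    # Tokenize and move the tensors to the model's device (training may have moved the model to a GPU)
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(model.device)
    model.eval()  # disable dropout for inference
    with torch.no_grad():  # gradients are not needed for prediction
        outputs = model(**inputs)
    probs = outputs.logits.softmax(dim=1)
    # In the IMDB dataset, label 1 is positive and label 0 is negative
    return "Positive" if probs[0][1] > 0.5 else "Negative"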

print(predict_sentiment("This movie was amazing!"))
print(predict_sentiment("I hated everything about this."))
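The prediction step turns the model's two raw scores (logits) into probabilities with a softmax and then thresholds the positive-class probability. The NumPy sketch below reproduces that arithmetic on made-up logits so it can be followed without a trained model; softmax and label_from_logits are hypothetical helpers written for this illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def label_from_logits(logits):
    """Map two-class logits [negative, positive] to a label and its probability."""
    probs = softmax(np.asarray(logits, dtype=float))
    label = "Positive" if probs[1] > 0.5 else "Negative"
    return label, float(probs.max())

# Made-up logits for illustration only
print(label_from_logits([-1.2, 2.3]))
print(label_from_logits([0.8, -0.4]))
```

With only two classes, checking whether the positive probability exceeds 0.5 selects the same class as taking the larger logit, except in the case of an exact tie.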