In [16]:
!pip install transformers datasets evaluate accelerate




In [17]:
from datasets import load_dataset
imdb = load_dataset("imdb")

There are two fields in this dataset:

text: the movie review text.

label: a value that is either 0 for a negative review or 1 for a positive review.

In [18]:
imdb["test"][0]

{'text': 'I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn\'t match the background, and painfully one-dimensional characters cannot be overcome with a \'sci-fi\' setting. (I\'m sure there are those of you out there who think Babylon 5 is good sci-fi TV. It\'s not. It\'s clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It\'s really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it\'s rubbish as 

# *Preprocess*
Load a DistilBERT tokenizer to preprocess the text field.

In [19]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Define a preprocessing function that tokenizes the text and truncates sequences so they are no longer than DistilBERT’s maximum input length.

In [20]:
def preprocess_function(examples):
  return tokenizer(examples["text"], truncation=True)
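To make the effect of truncation=True concrete, here is a minimal sketch in plain Python. It is not the DistilBERT tokenizer: the whitespace-based `toy_tokenize` helper and the tiny `max_length` of 8 are assumptions made purely for illustration (distilbert-base-uncased accepts up to 512 tokens, special tokens included).

```python
def toy_tokenize(text, max_length=8, truncation=True):
    """Hypothetical stand-in for a tokenizer: split on whitespace and
    wrap the tokens in [CLS] ... [SEP], truncating if requested."""
    tokens = text.lower().split()
    if truncation:
        # Reserve two positions for the special tokens so the final
        # sequence never exceeds max_length.
        tokens = tokens[: max_length - 2]
    return ["[CLS]"] + tokens + ["[SEP]"]


review = "I tried to like this I really did but it is not good"
truncated = toy_tokenize(review)
untruncated = toy_tokenize(review, truncation=False)
print(len(truncated), len(untruncated))
print(truncated)
```

The real tokenizer works on subword ids rather than words, but the length bookkeeping follows the same idea: anything beyond the limit is dropped before the model sees it.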

In [21]:
# Apply the preprocessing function to the entire dataset; batched=True processes several examples per call
tokenized_imdb = imdb.map(preprocess_function, batched=True)



Dynamically pad the sentences to the longest length in each batch during collation, instead of padding the whole dataset to the maximum length.

In [22]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
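The saving from dynamic padding can be illustrated without the library. The sketch below is an assumption-laden toy, not DataCollatorWithPadding itself: `pad_batch` is a hypothetical helper and the token ids are made up.

```python
def pad_batch(batch, pad_id=0, target_length=None):
    """Pad every sequence in `batch` to `target_length`
    (default: the longest sequence in this batch)."""
    if target_length is None:
        target_length = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (target_length - len(seq)) for seq in batch]


# Hypothetical token-id sequences of different lengths.
batch = [[101, 7, 8, 102], [101, 7, 102], [101, 7, 8, 9, 10, 102]]

dynamic = pad_batch(batch)                    # pad to the longest in the batch (6)
static = pad_batch(batch, target_length=512)  # pad to a fixed maximum length

dynamic_pads = sum(row.count(0) for row in dynamic)
static_pads = sum(row.count(0) for row in static)
print(dynamic_pads, static_pads)
```

Every padding position still costs compute in the forward pass, so padding per batch rather than to a global maximum can shorten training noticeably when most reviews are far shorter than the limit.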

In [23]:
import evaluate
accuracy = evaluate.load("accuracy")

Define a function that passes the predictions and labels to accuracy.compute to calculate the accuracy:

In [24]:
import numpy as np

def compute_metrics(eval_pred):
  predictions, labels = eval_pred
  predictions = np.argmax(predictions, axis=1)
  return accuracy.compute(predictions=predictions, references=labels)
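As a rough illustration of what compute_metrics does, the sketch below redoes the argmax and accuracy calculation in plain Python. The logits and labels are invented for the example, and `accuracy_from_logits` is a hypothetical helper rather than part of evaluate.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    # Turn each row of class scores into a predicted class id.
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four reviews: column 0 = negative, column 1 = positive.
logits = [[2.1, -1.0], [0.3, 0.9], [-0.5, 1.7], [1.2, 0.4]]
labels = [0, 1, 0, 0]

result = accuracy_from_logits(logits, labels)
print(result)
```

Three of the four predictions match their labels, so the reported accuracy is 0.75. The Trainer hands compute_metrics the raw logits, which is why the argmax step is needed before comparing against the labels.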


Create a mapping from the expected ids to their labels, and back, with id2label and label2id.

In [25]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
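A short sketch of how these two dictionaries are typically used: checking that they are consistent with each other and turning predicted class ids into readable labels. The `predicted_ids` list is hypothetical.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

# The two dictionaries should be exact inverses of each other.
inverted = {label: idx for idx, label in id2label.items()}
assert inverted == label2id

# Hypothetical predicted class ids, e.g. the argmax of the model's logits.
predicted_ids = [1, 0, 1]
predicted_labels = [id2label[i] for i in predicted_ids]
print(predicted_labels)
```

Passing both mappings to the model stores them in its configuration, so downstream tools can report NEGATIVE and POSITIVE instead of generic class indices.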

In [26]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [27]:
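training_args = TrainingArguments(
    output_dir="my_awesome_model",    # checkpoints are written here
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    evaluation_strategy="epoch",      # evaluate once per epoch (renamed eval_strategy in recent transformers releases)
    save_strategy="epoch",            # must match the evaluation strategy for load_best_model_at_end
    load_best_model_at_end=True,      # reload the best checkpoint when training finishes
)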

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



In [28]:
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,0.2315,0.284751,0.8986
2,0.1593,0.21493,0.93304
3,0.097,0.307949,0.93264
4,0.0502,0.330506,0.93256
5,0.0368,0.352176,0.93244
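The table suggests overfitting after the second epoch: training loss keeps falling while validation loss rises and accuracy stays flat. The sketch below picks the best epoch from the logged values. It only re-reads the numbers in the table; the claim that the Trainer compares checkpoints by evaluation loss when metric_for_best_model is not set reflects my understanding of the default and should be checked against the installed version.

```python
# (epoch, training loss, validation loss, accuracy), copied from the table above.
history = [
    (1, 0.2315, 0.284751, 0.8986),
    (2, 0.1593, 0.21493, 0.93304),
    (3, 0.097, 0.307949, 0.93264),
    (4, 0.0502, 0.330506, 0.93256),
    (5, 0.0368, 0.352176, 0.93244),
]

best_by_loss = min(history, key=lambda row: row[2])
best_by_accuracy = max(history, key=lambda row: row[3])

print("lowest validation loss: epoch", best_by_loss[0])
print("highest accuracy: epoch", best_by_accuracy[0])
```

Both criteria point to epoch 2, so the later epochs add training time without improving the model on the test split.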


TrainOutput(global_step=7815, training_loss=0.11883437961473423, metrics={'train_runtime': 7602.0001, 'train_samples_per_second': 16.443, 'train_steps_per_second': 1.028, 'total_flos': 1.6394784128794656e+16, 'train_loss': 0.11883437961473423, 'epoch': 5.0})
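The figures in TrainOutput can be cross-checked with a little arithmetic. This sketch assumes a single device and no gradient accumulation, and uses the 25,000 examples of the IMDb train split.

```python
import math

num_train_examples = 25000   # size of the IMDb train split
batch_size = 16              # per_device_train_batch_size
num_epochs = 5               # num_train_epochs
train_runtime = 7602.0001    # seconds, from TrainOutput

# Assumes a single device and no gradient accumulation.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_train_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The results (1563 steps per epoch, 7815 steps in total, about 16.443 samples and 1.028 steps per second) agree with global_step, train_samples_per_second, and train_steps_per_second in the output above.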

In [29]:
# text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

In [30]:
# from transformers import pipeline

# classifier = pipeline("sentiment-analysis", model="my_awesome_model")
# classifier(text)