In [6]:
!pip install transformers datasets evaluate accelerate tqdm


# Load the dataset

In [3]:
import pandas as pd

train_df = pd.read_csv('imdb_train.csv')
test_df = pd.read_csv('imdb_test.csv')
print('Load %s examples for training, %s examples for testing' % (len(train_df), len(test_df)))
train_df[1:10]

Load 2010 examples for training, 738 examples for testing


Unnamed: 0,text,label
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1
5,"The rest of the movie lacks art, charm, meanin...",0
6,Wasted two hours.,0
7,Saw the movie today and thought it was a good ...,1
8,A bit predictable.,0
9,Loved the casting of Jimmy Buffet as the scien...,1


Convert each dataset into a list of dict objects.

In [4]:
# Keep only the fields the model needs; cast labels to plain Python ints.
train = []
for _, row in train_df.iterrows():
  train.append({
      'text': row['text'],
      'label': int(row['label'])
  })
test = []
for _, row in test_df.iterrows():
  test.append({
      'text': row['text'],
      'label': int(row['label'])
  })
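The same list-of-dicts shape can also be produced without pandas. The sketch below uses the standard-library csv module instead of the DataFrame loop; the in-memory CSV and the helper name load_examples are illustrative assumptions, and the three rows are the short reviews visible in the preview above.

```python
import csv
import io

# Hypothetical in-memory stand-in for imdb_train.csv (index column, text, label).
csv_text = """,text,label
3,Very little music or anything to speak of.,0
6,Wasted two hours.,0
8,A bit predictable.,0
"""

def load_examples(handle):
    """Read CSV rows into a list of {'text': str, 'label': int} dicts."""
    examples = []
    for row in csv.DictReader(handle):
        # csv returns every field as a string, so the label is cast explicitly.
        examples.append({"text": row["text"], "label": int(row["label"])})
    return examples

examples = load_examples(io.StringIO(csv_text))
print(examples[1])
```

Each element has exactly the two keys used by the rest of the notebook, so it could be passed to the preprocessing step in the same way.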

# Preprocess

The next step is to load a tokenizer to preprocess the text field. In this case, we use the tokenizer of the DistilBERT model.

In [7]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

Create a preprocessing function to tokenize text and truncate sequences so they do not exceed DistilBERT's maximum input length:

In [8]:
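def preprocess_function(examples):
  # Tokenize each example and merge input_ids / attention_mask into its dict.
  # Note: the input dicts are updated in place.
  output = []
  for ex in examples:
    o = tokenizer(ex["text"], truncation=True)
    ex.update(o)
    output.append(ex)
  return output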

tokenized_train = preprocess_function(train)
tokenized_test = preprocess_function(test)

print(tokenized_test[0])

{'text': 'The restaurant atmosphere was exquisite.', 'label': 1, 'input_ids': [101, 1996, 4825, 7224, 2001, 19401, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
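To make the effect of truncation=True concrete, here is a minimal pure-Python sketch of what happens to a single encoded sequence that is too long. It is not the tokenizer's implementation; the helper name truncate_ids is hypothetical, and the sketch assumes the sequence starts and ends with the special tokens seen in the output above (101 and 102).

```python
def truncate_ids(token_ids, max_length):
    """Trim an encoded single sequence to max_length, keeping both special tokens."""
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    if len(token_ids) <= max_length:
        return list(token_ids)
    body = token_ids[1:-1]  # everything between the first and last special token
    # Drop tokens from the end of the body, then put the special tokens back.
    return [token_ids[0]] + body[: max_length - 2] + [token_ids[-1]]

encoded = [101, 1996, 4825, 7224, 2001, 19401, 1012, 102]  # from the printed example
print(truncate_ids(encoded, 6))  # [101, 1996, 4825, 7224, 2001, 102]
```

With the real tokenizer the limit comes from the checkpoint's configured maximum length rather than from an explicit argument as in this sketch.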


Now create a batch of examples using DataCollatorWithPadding. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [9]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
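The idea behind dynamic padding can be sketched in plain Python. This is an illustration of the technique, not the collator's actual code: the helper pad_batch is hypothetical, it assumes a padding id of 0, and the second example's token ids are made up. The real collator also returns tensors rather than lists.

```python
def pad_batch(features, pad_token_id=0):
    """Right-pad every example to the longest sequence in this batch only."""
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = longest - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padded positions get mask 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch

features = [
    {"input_ids": [101, 1996, 4825, 7224, 2001, 19401, 1012, 102],
     "attention_mask": [1] * 8},
    {"input_ids": [101, 2204, 102], "attention_mask": [1, 1, 1]},  # hypothetical ids
]
batch = pad_batch(features)
print(batch["input_ids"][1])       # [101, 2204, 102, 0, 0, 0, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 0, 0, 0, 0, 0]
```

Here the batch is padded to length 8, the longest sequence it contains, rather than to a fixed dataset-wide maximum, which is where the efficiency gain comes from.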

# Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the Evaluate library. For this task, load the accuracy metric (see the Evaluate quick tour to learn more about how to load and compute a metric):

In [10]:
import evaluate

accuracy = evaluate.load("accuracy")

Downloading builder script: 100% 4.20k/4.20k [00:00<00:00, 6.73MB/s]


Then create a function that passes your predictions and labels to accuracy.compute, which calculates the accuracy:

In [11]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
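What compute_metrics does can be reproduced by hand on a toy input. The sketch below is a plain-Python stand-in for np.argmax and the accuracy metric; the helper names and the logits are hypothetical values chosen for illustration.

```python
def argmax_rows(logits):
    """Return the index of the largest score in each row (the predicted class)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy_score(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical model outputs: one row of two logits (NEGATIVE, POSITIVE) per example.
logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.0, 0.4]]
labels = [0, 1, 0, 0]

predictions = argmax_rows(logits)
print(predictions)                          # [0, 1, 1, 0]
print(accuracy_score(predictions, labels))  # 0.75
```

Three of the four predictions match their labels, so the accuracy is 0.75.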

# Train

Before you start training your model, create a map of the expected ids to their labels with id2label and label2id:

In [12]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

You’re ready to start training your model now! Load DistilBERT with AutoModelForSequenceClassification along with the number of expected labels, and the label mappings:

In [13]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [14]:
training_args = TrainingArguments(
    output_dir="text_classification_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()


RuntimeError: MPS backend out of memory (MPS allocated: 5.20 GB, other allocations: 1.38 GB, max allowed: 6.77 GB). Tried to allocate 192.00 MB on private pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
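One common way to reduce memory pressure after an out-of-memory error like the one above is to use a smaller per-device batch and accumulate gradients, so that the effective batch size stays the same. The sketch below only computes keyword overrides; the helper name low_memory_overrides is hypothetical, and whether a given setting actually fits in memory on this machine is an assumption that would need to be tested.

```python
def low_memory_overrides(target_batch_size, per_device_batch_size):
    """Build TrainingArguments keyword overrides that keep the effective batch size."""
    if target_batch_size % per_device_batch_size != 0:
        raise ValueError("target_batch_size must be a multiple of per_device_batch_size")
    return {
        "per_device_train_batch_size": per_device_batch_size,
        "per_device_eval_batch_size": per_device_batch_size,
        # Gradients from this many small batches are summed before each update.
        "gradient_accumulation_steps": target_batch_size // per_device_batch_size,
    }

overrides = low_memory_overrides(target_batch_size=16, per_device_batch_size=4)
print(overrides)
```

The resulting dict could be merged into the existing call by replacing the two batch-size arguments and adding gradient_accumulation_steps, for example via TrainingArguments(..., **overrides).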

# Inference

Now that you have a fine-tuned model, you are ready for inference!

To get started, simply select some text that you would like to analyze using the model.

In [12]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="text_classification_model/checkpoint-252")
classifier(text)

[{'label': 'POSITIVE', 'score': 0.9864566326141357}]
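The checkpoint name used above, checkpoint-252, is consistent with the training setup shown earlier, since the Trainer names checkpoint folders after the optimizer step count. A small arithmetic sketch, assuming a single device, no gradient accumulation, and that the last partial batch of each epoch is kept:

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps when every batch triggers one update."""
    steps_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return steps_per_epoch * num_epochs

# 2010 training examples, batch size 16, 2 epochs (values from this notebook).
print(total_training_steps(2010, 16, 2))  # 252
```

With 126 steps per epoch and save_strategy="epoch", the checkpoint written at the end of the second epoch would be the one at step 252.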

# Task

Evaluate the performance of the fine-tuned DistilBERT model on the test set and report its F1-score.

Prepare the test set.

In [16]:
y_text = []
y_true = []
for x in tokenized_test:
  y_text.append(x['text'])
  y_true.append(x['label'])

In [17]:
classifier = pipeline("sentiment-analysis", model="text_classification_model/checkpoint-252")
y_output = classifier(y_text)

In [18]:
y_pred = [label2id[o['label']] for o in y_output]
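The pipeline returns label strings, so they have to be mapped back to integer ids before they can be compared with y_true. The sketch below does the same lookup with an explicit check for unexpected labels; the helper name to_label_ids and the sample outputs are hypothetical.

```python
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

def to_label_ids(outputs, mapping):
    """Map pipeline-style dicts to integer ids, failing loudly on unknown labels."""
    ids = []
    for o in outputs:
        if o["label"] not in mapping:
            raise KeyError(f"Unexpected label {o['label']!r}; known: {sorted(mapping)}")
        ids.append(mapping[o["label"]])
    return ids

# Hypothetical outputs in the shape returned by the sentiment-analysis pipeline.
sample_output = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.91},
    {"label": "POSITIVE", "score": 0.67},
]
print(to_label_ids(sample_output, label2id))  # [1, 0, 1]
```

The explicit check is useful if a checkpoint was saved without the id2label mapping, in which case the label strings would not match the keys of label2id.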

In [19]:
from sklearn.metrics import precision_recall_fscore_support
pre, recall, f1, support = precision_recall_fscore_support(y_true, y_pred, average='binary')
print('Precision:', pre)
print('   Recall:', recall)
print('       F1:', f1)

Precision: 0.8766404199475065
   Recall: 0.9355742296918768
       F1: 0.9051490514905149
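For reference, the three numbers are related by simple counting: precision and recall are computed for the positive class, and F1 is their harmonic mean. The sketch below is a plain-Python illustration of that definition, not scikit-learn's implementation; the helper names and the toy labels are hypothetical.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class, from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = harmonic_mean(precision, recall) if precision + recall else 0.0
    return precision, recall, f1

def harmonic_mean(precision, recall):
    """F1 = 2 * P * R / (P + R)."""
    return 2 * precision * recall / (precision + recall)

# Toy labels: 3 true positives, 1 false positive, 1 false negative.
print(binary_prf([1, 1, 1, 0, 0, 1], [1, 1, 0, 1, 0, 1]))  # (0.75, 0.75, 0.75)

# Plugging in the precision and recall reported above gives an F1 of about 0.9051.
print(harmonic_mean(0.8766404199475065, 0.9355742296918768))
```

The second call confirms that the reported F1 follows from the reported precision and recall.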
