## Evaluating a Language Model with F1-Score

This notebook walks through evaluating a language model on a binary classification task using PyTorch and the Hugging Face Transformers library. A pre-trained BERT model (`bert-base-uncased`) with a newly initialized classification head predicts a binary label for each input text. The core objective is to illustrate the end-to-end process of data preparation, model application, and evaluation within a language processing framework.

We tokenized four illustrative input texts, ran a single training pass over them, and then scored the model's predictions with the F1-score, the harmonic mean of precision and recall. The model achieved an F1-score of 0.6667. Because that score is computed on the same four examples used for training, it shows that the pipeline runs end to end; it is not an estimate of how well the model generalizes.
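To make the metric concrete, here is a minimal sketch of how precision, recall, and F1 follow from counts of true positives, false positives, and false negatives. The helper name `precision_recall_f1` and the counts passed to it are hypothetical, chosen only for illustration; they are not taken from the run in this notebook.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) for the positive class from raw counts."""
    # Precision: of everything predicted positive, how much was correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch. Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a model cannot score well by maximizing only precision or only recall.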

In [1]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.metrics import f1_score
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation

In [2]:
# Set device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [3]:
# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
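model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)  # Two classes: negative (0) and positive (1)
model.to(device)  # Move the model to the selected device (GPU if available, else CPU)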

# Example data (for illustration; in practice, use a larger, more diverse dataset)
texts = ["This is great!", "Awful experience.", "Loved it!", "Not good."]
labels = torch.tensor([1, 0, 1, 0])  # 1 for positive, 0 for negative


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
# Tokenize the input texts
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")


In [5]:
# Create a dataset and dataloader
dataset = TensorDataset(inputs['input_ids'], inputs['attention_mask'], labels)
dataloader = DataLoader(dataset, sampler=RandomSampler(dataset), batch_size=2)
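The dataloader above draws on all four examples, and the evaluation step later reuses it, so no data is held out. With a larger dataset, one common approach is to set aside part of the examples before building the dataloaders. The following is a minimal sketch in plain Python; the helper `split_examples` and its `eval_fraction` and `seed` parameters are hypothetical names introduced here, not part of the original notebook.

```python
import random


def split_examples(texts, labels, eval_fraction=0.25, seed=0):
    """Shuffle example indices and hold out a fraction for evaluation."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)  # Seeded so the split is reproducible
    n_eval = max(1, round(len(indices) * eval_fraction))
    eval_idx, train_idx = indices[:n_eval], indices[n_eval:]
    train = ([texts[i] for i in train_idx], [labels[i] for i in train_idx])
    held_out = ([texts[i] for i in eval_idx], [labels[i] for i in eval_idx])
    return train, held_out


texts = ["This is great!", "Awful experience.", "Loved it!", "Not good."]
labels = [1, 0, 1, 0]
(train_texts, train_labels), (eval_texts, eval_labels) = split_examples(texts, labels)
print(len(train_texts), "training examples,", len(eval_texts), "held out")
```

Each half would then be tokenized and wrapped in its own `TensorDataset`. This split is not stratified, so with very little data the held-out part can end up containing a single class, which makes F1 uninformative; four examples are too few for any split to be meaningful.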


In [6]:
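# Training loop (simplified: a single pass over the four examples)
optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()
for batch in dataloader:
    # Move batch to device and unpack it in the order used by TensorDataset
    input_ids, attention_mask, batch_labels = (b.to(device) for b in batch)
    optimizer.zero_grad()  # Clear gradients left over from the previous step
    # Passing labels makes the model compute and return the classification loss
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=batch_labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()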




In [7]:
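# Evaluation (this reuses the training examples, so the score is not a held-out estimate)
eval_dataloader = DataLoader(dataset, sampler=SequentialSampler(dataset), batch_size=2)
model.eval()
predictions, true_labels = [], []
for batch in eval_dataloader:
    input_ids, attention_mask, batch_labels = (b.to(device) for b in batch)
    with torch.no_grad():  # No gradients are needed for inference
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    # The predicted class is the index of the largest logit
    predictions.extend(torch.argmax(outputs.logits, dim=-1).tolist())
    true_labels.extend(batch_labels.tolist())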

In [8]:
# Calculate F1-score
f1 = f1_score(true_labels, predictions)
print(f"F1-score: {f1}")

F1-score: 0.6666666666666666
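The notebook does not record the individual predictions behind this score, so it is worth checking which outputs are consistent with it. The sketch below enumerates every possible prediction vector for the four labels used here and keeps those whose F1 is exactly 2/3. The helper `f1_exact` is a hypothetical name introduced for this illustration; it uses exact fractions to avoid floating-point comparison.

```python
from fractions import Fraction
from itertools import product

true_labels = [1, 0, 1, 0]  # Labels used in this notebook


def f1_exact(y_true, y_pred):
    """F1 for the positive class as an exact fraction (0 if undefined)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
    return Fraction(2 * tp, denom) if denom else Fraction(0)


# Try all 16 possible prediction vectors and keep those scoring exactly 2/3
matches = [pred for pred in product([0, 1], repeat=4)
           if f1_exact(true_labels, pred) == Fraction(2, 3)]
for pred in matches:
    print(pred)
```

Three prediction vectors qualify: two that find exactly one of the positives with no false positives, and one that predicts positive for every input. A score of 0.6667 on these four examples is therefore also what a constant "always positive" classifier would get, which is one more reason to read it as a pipeline check and not as evidence of learning.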
