## Text classification with pretrained language models

In this assignment, you will tackle text classification and solve it with a pretrained BERT model.

In [9]:
import json
# do not change the code in the block below
# __________start of block__________
import os
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from IPython.display import clear_output
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

%matplotlib inline
# __________end of block__________

We will use the SST-2 dataset. The holdout part of the data (which you will need for your submission) is available at the link below.

In [10]:
# do not change the code in the block below
# __________start of block__________

!wget https://raw.githubusercontent.com/girafe-ai/ml-course/refs/heads/24f_yandex_ml_trainings/homeworks/hw04_bert_and_co/texts_holdout.json
# __________end of block__________

--2024-11-23 16:28:37--  https://raw.githubusercontent.com/girafe-ai/ml-course/refs/heads/24f_yandex_ml_trainings/homeworks/hw04_bert_and_co/texts_holdout.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 51581 (50K) [text/plain]
Saving to: ‘texts_holdout.json.4’


2024-11-23 16:28:38 (946 KB/s) - ‘texts_holdout.json.4’ saved [51581/51581]



In [11]:
# do not change the code in the block below
# __________start of block__________
df = pd.read_csv(
    "https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv",
    delimiter="\t",
    header=None,
)
texts_train = df[0].values[:5000] # array of strings containing training texts
y_train = df[1].values[:5000] # array of labels (0 or 1) for training texts
texts_test = df[0].values[5000:] # array of strings containing test texts
y_test = df[1].values[5000:] # array of labels (0 or 1) for test texts
with open("texts_holdout.json") as iofile:
    texts_holdout = json.load(iofile)
# __________end of block__________

You will write all of the remaining code yourself.

To pass with the maximum score, you need to reach at least __84.5% accuracy on the test split__.
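
The accuracy target above is measured on hard labels, while the model produces probabilities. Below is a minimal sketch of that conversion in plain Python, assuming a decision threshold of 0.5. The helper name `accuracy_from_proba` and the toy data are illustrative and not part of the assignment.

```python
def accuracy_from_proba(y_true, proba, threshold=0.5):
    """Share of examples whose thresholded probability matches the label."""
    if len(y_true) != len(proba):
        raise ValueError("y_true and proba must have the same length")
    if len(y_true) == 0:
        raise ValueError("empty input")
    # A probability above the threshold is read as class 1, otherwise class 0.
    preds = [1 if p > threshold else 0 for p in proba]
    correct = sum(int(pred == int(label)) for pred, label in zip(preds, y_true))
    return correct / len(y_true)


# Toy data, made up for illustration only.
toy_labels = [1, 0, 1, 1]
toy_proba = [0.9, 0.2, 0.4, 0.7]
toy_accuracy = accuracy_from_proba(toy_labels, toy_proba)
print(f"accuracy = {toy_accuracy:.2f}, target reached: {toy_accuracy >= 0.845}")
```

In the notebook itself, `accuracy_score` from scikit-learn does the same comparison once the probabilities have been thresholded.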

In [12]:
# Import necessary libraries
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated; use the PyTorch one
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm

# Load pre-trained DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenize and encode the texts
def tokenize_data(texts):
    # Convert numpy array to list if necessary
    if isinstance(texts, np.ndarray):
        texts = texts.tolist()
    return tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")

# Tokenize train, test, and holdout data
train_encodings = tokenize_data(texts_train)
test_encodings = tokenize_data(texts_test)
holdout_encodings = tokenize_data(texts_holdout)

# Convert labels to tensors
train_labels = torch.tensor(y_train)
test_labels = torch.tensor(y_test)

# Create DataLoaders
train_dataset = TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], train_labels)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

test_dataset = TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'], test_labels)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Load pre-trained DistilBERT model
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

# Set up optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Training loop
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

num_epochs = 4
for epoch in range(num_epochs):
    model.train()
    for batch in tqdm(train_loader, desc=f"Epoch {epoch + 1}/{num_epochs}"):
        batch = tuple(t.to(device) for t in batch)
        input_ids, attention_mask, labels = batch
        
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Evaluation on test set
    model.eval()
    test_preds = []
    with torch.no_grad():
        for batch in tqdm(test_loader, desc="Evaluating"):
            batch = tuple(t.to(device) for t in batch)
            input_ids, attention_mask, _ = batch
            
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()
            test_preds.extend(preds)
    
    test_accuracy = accuracy_score(y_test, [1 if p > 0.5 else 0 for p in test_preds])
    print(f"Epoch {epoch + 1}/{num_epochs}, Test Accuracy: {test_accuracy:.4f}")





Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch 1/4: 100%|██████████| 157/157 [06:39<00:00,  2.55s/it]
Evaluating: 100%|██████████| 60/60 [00:37<00:00,  1.62it/s]


Epoch 1/4, Test Accuracy: 0.8766


Epoch 2/4: 100%|██████████| 157/157 [06:16<00:00,  2.40s/it]
Evaluating: 100%|██████████| 60/60 [00:37<00:00,  1.62it/s]


Epoch 2/4, Test Accuracy: 0.8833


Epoch 3/4: 100%|██████████| 157/157 [05:41<00:00,  2.17s/it]
Evaluating: 100%|██████████| 60/60 [00:36<00:00,  1.66it/s]


Epoch 3/4, Test Accuracy: 0.8760


Epoch 4/4: 100%|██████████| 157/157 [05:40<00:00,  2.17s/it]
Evaluating: 100%|██████████| 60/60 [00:36<00:00,  1.64it/s]

Epoch 4/4, Test Accuracy: 0.8875





In [13]:
# Predict probabilities for train, test, and holdout sets
def predict_proba(texts, encodings):
    dataset = TensorDataset(encodings['input_ids'], encodings['attention_mask'])
    dataloader = DataLoader(dataset, batch_size=16, shuffle=False)
    
    model.eval()
    proba_list = []
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Predicting"):
            batch = tuple(t.to(device) for t in batch)
            input_ids, attention_mask = batch
            
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            proba = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()
            proba_list.extend(proba)
    
    return [float(p) for p in proba_list]

train_proba = predict_proba(texts_train, train_encodings)
test_proba = predict_proba(texts_test, test_encodings)
holdout_proba = predict_proba(texts_holdout, holdout_encodings)

Predicting: 100%|██████████| 313/313 [02:01<00:00,  2.58it/s]
Predicting: 100%|██████████| 120/120 [00:40<00:00,  2.98it/s]
Predicting: 100%|██████████| 32/32 [00:09<00:00,  3.36it/s]


In [31]:
# Continue training for 3 more epochs
additional_epochs = 3
total_epochs = num_epochs + additional_epochs

for epoch in range(num_epochs, total_epochs):
    model.train()
    # train_loader was built with shuffle=True, so it reshuffles on every epoch;
    # there is no need to rebuild the dataset or the loader here.
    
    for batch in tqdm(train_loader, desc=f"Epoch {epoch + 1}/{total_epochs}"):
        batch = tuple(t.to(device) for t in batch)
        input_ids, attention_mask, labels = batch
        
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    # Evaluation on test set
    model.eval()
    test_preds = []
    with torch.no_grad():
        for batch in tqdm(test_loader, desc="Evaluating"):
            batch = tuple(t.to(device) for t in batch)
            input_ids, attention_mask, _ = batch
            
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()
            test_preds.extend(preds)
    
    test_accuracy = accuracy_score(y_test, [1 if p > 0.5 else 0 for p in test_preds])
    print(f"Epoch {epoch + 1}/{total_epochs}, Test Accuracy: {test_accuracy:.4f}")

# Recalculate probabilities for train, test, and holdout sets
train_proba = predict_proba(texts_train, train_encodings)
test_proba = predict_proba(texts_test, test_encodings)
holdout_proba = predict_proba(texts_holdout, holdout_encodings)


Epoch 5/7: 100%|██████████| 157/157 [05:48<00:00,  2.22s/it]
Evaluating: 100%|██████████| 60/60 [00:38<00:00,  1.57it/s]


Epoch 5/7, Test Accuracy: 0.8792


Epoch 6/7: 100%|██████████| 157/157 [12:04<00:00,  4.61s/it] 
Evaluating: 100%|██████████| 60/60 [00:37<00:00,  1.59it/s]


Epoch 6/7, Test Accuracy: 0.8693


Epoch 7/7: 100%|██████████| 157/157 [05:46<00:00,  2.21s/it]
Evaluating: 100%|██████████| 60/60 [00:36<00:00,  1.62it/s]


Epoch 7/7, Test Accuracy: 0.8766


Predicting: 100%|██████████| 313/313 [01:46<00:00,  2.93it/s]
Predicting: 100%|██████████| 120/120 [00:38<00:00,  3.10it/s]
Predicting: 100%|██████████| 32/32 [00:09<00:00,  3.54it/s]


#### Submitting the assignment to the contest
Store the predicted probabilities of class 1 (the positive class) in the `out_dict` dictionary.
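
For a two-class model, the positive-class probability is the second component of the softmax over the logits. Below is a minimal sketch in plain Python, assuming logits come in (class 0, class 1) order. The helper name `positive_class_proba` and the logit values are made up for illustration.

```python
import math


def positive_class_proba(logit_neg, logit_pos):
    """Softmax probability of class 1 for a two-class logit pair."""
    # Subtract the larger logit before exponentiating for numerical stability.
    shift = max(logit_neg, logit_pos)
    exp_neg = math.exp(logit_neg - shift)
    exp_pos = math.exp(logit_pos - shift)
    return exp_pos / (exp_neg + exp_pos)


# Made-up logit pairs in (class 0, class 1) order.
toy_logits = [(-1.2, 2.3), (0.5, 0.5), (3.0, -1.0)]
toy_proba = [positive_class_proba(neg, pos) for neg, pos in toy_logits]
print([round(p, 3) for p in toy_proba])
```

With two classes this equals a sigmoid applied to the logit difference. The results are plain Python floats, which is the type the submission checks expect.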

In [32]:
out_dict = {
    'train': train_proba,
    'test': test_proba,
    'holdout': holdout_proba
}

In [25]:
print(len(out_dict['holdout']))

500


A few `assert`s to check your submission:

In [33]:
assert isinstance(out_dict["train"], list), "Object must be a list of floats"
assert isinstance(out_dict["train"][0], float), "Object must be a list of floats"
assert (
    len(out_dict["train"]) == 5000
), "The predicted probas list length does not match the train set size"

assert isinstance(out_dict["test"], list), "Object must be a list of floats"
assert isinstance(out_dict["test"][0], float), "Object must be a list of floats"
assert (
    len(out_dict["test"]) == 1920
), "The predicted probas list length does not match the test set size"

assert isinstance(out_dict["holdout"], list), "Object must be a list of floats"
assert isinstance(out_dict["holdout"][0], float), "Object must be a list of floats"
assert(
    len(out_dict["holdout"]) == 500
), "The predicted probas list length does not match the holdout set size"

Run the code below to generate the submission.

In [34]:
# do not change the code in the block below
# __________start of block__________
FILENAME = "submission_dict_hw_text_classification_with_bert.json"

with open(FILENAME, "w") as iofile:
    json.dump(out_dict, iofile)
print(f"File saved to `{FILENAME}`")
# __________end of block__________

File saved to `submission_dict_hw_text_classification_with_bert.json`
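
Before uploading, it can be worth reloading the saved file and checking that it survived the JSON round trip. Below is a minimal sketch, demonstrated on a throwaway file with made-up probabilities. The helper name `check_submission` and the toy sizes are assumptions for illustration, not part of the assignment.

```python
import json
import os
import tempfile


def check_submission(path, expected_sizes):
    """Reload a submission JSON and verify keys, lengths, and value ranges."""
    with open(path) as iofile:
        loaded = json.load(iofile)
    for split, size in expected_sizes.items():
        values = loaded[split]  # raises KeyError if the split is missing
        assert len(values) == size, f"{split}: expected {size}, got {len(values)}"
        assert all(isinstance(v, float) and 0.0 <= v <= 1.0 for v in values), (
            f"{split}: values must be floats in [0, 1]"
        )
    return loaded


# Demo on a temporary file; the numbers below are invented.
toy_dict = {"train": [0.9, 0.1], "test": [0.5], "holdout": [0.3, 0.7, 0.2]}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_submission.json")
    with open(toy_path, "w") as iofile:
        json.dump(toy_dict, iofile)
    reloaded = check_submission(toy_path, {"train": 2, "test": 1, "holdout": 3})
print(reloaded == toy_dict)
```

For the real file, the expected sizes would be the ones used in the asserts above: 5000 for train, 1920 for test, and 500 for holdout.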


This completes the assignment. Congratulations!