## Task

The data folder (data) contains the bbc dataset, which holds news articles from 5 classes (business, entertainment, politics, sport, tech). You need to do the following:
1. Train any classical machine learning classification algorithm of your choice (not a neural network!) and measure the model's quality metric (remember to also check quality at inference).
2. Fine-tune DistilBert for multiclass classification and likewise check the quality of the algorithm (remember to check quality at inference). Similar code was shown in the lecture; think it through rather than copy-pasting, because this task has its own nuances.
3. Wrap both solutions in pipelines: news goes in, a class label and a probability come out. Scrape or manually collect five recent news items for each class from https://www.bbc.com/news (attach the links to the news in the notebook) and run the pipelines on them. Collect the inference results of both pipelines and assess which solution is better.

## Solution

### 1. Data preparation

Load the required libraries and set the path to the folder with the news articles.

In [1]:
import numpy as np
import pandas as pd
import os
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
import warnings
import re
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import transformers
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import torch
from transformers import Trainer, TrainingArguments
from evaluate import load


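nltk.download("punkt")
nltk.download("wordnet")
# Note: stopwords.words("english"), used below, also requires the NLTK "stopwords" corpus to be installed.
warnings.filterwarnings("ignore")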

DATA_DIR = "data/bbc/"

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\1379\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\1379\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Create the BbcDataset class, which reads all news files, preprocesses the data, and builds a dataset (as a pandas DataFrame) in which the first column holds the article and the second column holds the label (the news topic). Preprocessing removes English stop words and lemmatizes the text. The "label" column of the resulting table is encoded with LabelEncoder.

In [2]:
class BbcDataset:
    def __init__(self, data_dir=DATA_DIR):
        self.data_dir = data_dir
        self.documents = [
            os.path.join(path, name)
            for path, _, files in os.walk(self.data_dir)
            for name in files
        ]

        self.label_encoder = LabelEncoder()

        self.dataframe = self.get_pandas_alike_dataset()

        self.dataframe['label_encoded'] = self.label_encoder.fit_transform(self.dataframe['label'])

    def preprocess_text(self, text: str) -> str:
        """
        Removes everything except English letters and whitespace, then removes stop words and lemmatizes

        :param text: Text to process
        :return: Processed text string
        """

        mystopwords = stopwords.words('english') + ['also', 'the', 'this', 'that', 'not', 'would', 'could', 'did']

        def remove_stopwords(text, mystopwords=mystopwords):
            try:
                text = re.sub(r'[^a-zA-Z\s]', '', text)
                return " ".join([token for token in text.split() if token.lower() not in mystopwords])
            except Exception as e:
                # Report and re-raise so a failure is not silently turned into None
                print(f"Error while removing stop words: {e}")
                raise

        def lemmatize(text):
            try:
                lemmatizer = WordNetLemmatizer()
                tokens = word_tokenize(text)
                return " ".join([lemmatizer.lemmatize(word) for word in tokens])
            except Exception as e:
                # Report and re-raise so a failure is not silently turned into None
                print(f"Error while lemmatizing: {e}")
                raise

        text = remove_stopwords(text, mystopwords)
        text = lemmatize(text)
        return text

    def get_dataset(self):
        """Loads all documents and their categories"""
        data = []
        for file in self.documents:
            label = os.path.basename(os.path.dirname(file))
            try:
                with open(file, "r", encoding="utf-8") as f:
                    content = f.read().strip()
                    processed_content = self.preprocess_text(content)
                    data.append((processed_content.lower(), label))
            except Exception as e:
                print(f"Error reading file {file}: {e}")
        return data

    def get_pandas_alike_dataset(self):
        """Returns a DataFrame with texts and labels"""
        data = self.get_dataset()
        return pd.DataFrame(data, columns=["article", "label"])

In [3]:
bbc_pandas_dataset = BbcDataset()

In [4]:
bbc_pandas_dataset.dataframe

Unnamed: 0,article,label,label_encoded
0,ad sale boost time warner profit quarterly pro...,business,0
1,dollar gain greenspan speech dollar hit highes...,business,0
2,yukos unit buyer face loan claim owner embattl...,business,0
3,high fuel price hit bas profit british airways...,business,0
4,pernod takeover talk lift domecq shares uk dri...,business,0
...,...,...,...
2216,bt program beat dialler scam bt introducing tw...,tech,4
2217,spam email tempt net shopper computer user acr...,tech,4
2218,careful code new european directive put softwa...,tech,4
2219,us cyber security chief resigns man making sur...,tech,4
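
The integer codes in `label_encoded` follow from how `LabelEncoder` works: it sorts the distinct labels and assigns each one its position in that order. Below is a minimal pure-Python sketch of the same mapping. The helper name `encode_labels` and the short label list are hypothetical and only serve as an illustration; they are not part of the notebook.

```python
def encode_labels(labels):
    """Map string labels to integer codes in sorted order, as LabelEncoder does."""
    classes = sorted(set(labels))
    index = {name: code for code, name in enumerate(classes)}
    return classes, [index[name] for name in labels]


# Hypothetical label sequence covering the five BBC topics.
labels = ["tech", "business", "sport", "politics", "entertainment", "business"]
classes, codes = encode_labels(labels)
print(classes)  # ['business', 'entertainment', 'politics', 'sport', 'tech']
print(codes)    # [4, 0, 3, 2, 1, 0]
```

This matches the table above, where business articles get code 0 and tech articles get code 4.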


Inspecting the dataset info shows that there are no null values. The dataset contains 2221 news articles.

In [5]:
bbc_pandas_dataset.dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2221 entries, 0 to 2220
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   article        2221 non-null   object
 1   label          2221 non-null   object
 2   label_encoded  2221 non-null   int32 
dtypes: int32(1), object(2)
memory usage: 43.5+ KB


Evaluate the frequency distribution of the news classes. The classes are imbalanced (from 386 to 511 articles per class). When we later split the dataset into the train, test, and valid sets, we will take this distribution into account. In addition, to correct for the imbalance, we will later set class weights when defining the loss function.

In [6]:
bbc_pandas_dataset.dataframe.value_counts("label")

label
sport            511
business         506
politics         417
tech             401
entertainment    386
Name: count, dtype: int64

In [7]:
bbc_pandas_dataset.dataframe.value_counts("label_encoded")

label_encoded
3    511
0    506
2    417
4    401
1    386
Name: count, dtype: int64
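
To get a feel for how large the correction for imbalance is, the "balanced" weighting heuristic used by scikit-learn, `n_samples / (n_classes * count)`, can be applied by hand to the counts printed above. This is only an illustrative sketch: the notebook computes its weights on the training labels rather than on the full dataset, so the actual values differ slightly.

```python
# Class counts copied from the value_counts output above.
counts = {"business": 506, "entertainment": 386, "politics": 417, "sport": 511, "tech": 401}

n_samples = sum(counts.values())  # 2221
n_classes = len(counts)           # 5

# "balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
balanced_weights = {name: n_samples / (n_classes * c) for name, c in counts.items()}

for name, weight in balanced_weights.items():
    print(f"{name:<14}{weight:.3f}")
```

The rarest class (entertainment) gets the largest weight, about 1.15, and the most frequent one (sport) the smallest, about 0.87, so the correction is mild.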

Take the articles from the dataset as the variable X and the encoded labels as the variable y.

In [8]:
X = bbc_pandas_dataset.dataframe['article']
y = bbc_pandas_dataset.dataframe['label_encoded']

Split the data into training and test sets. Here the test set is 20 % of the data. The split is stratified so that the class proportions are preserved in both sets.

In [9]:
train_texts, test_texts, train_labels, test_labels = train_test_split(
    X,
    y,
    test_size=0.2,
    shuffle=True,
    random_state=42,
    stratify=bbc_pandas_dataset.dataframe['label_encoded']
    )

Check that the texts and labels have matching sizes within the training set and within the test set.

In [10]:
train_texts.shape, train_labels.shape, test_texts.shape, test_labels.shape

((1776,), (1776,), (445,), (445,))

In addition, carve a validation set out of the test set; it will be used to take the final quality measurements of the model.

In [11]:
test_texts, valid_texts, test_labels, valid_labels = train_test_split(
    test_texts,
    test_labels,
    test_size=0.2,
    shuffle=True,
    random_state=42,
    stratify=test_labels
)

Check the sizes of the test and validation sets.

In [12]:
test_texts.shape, test_labels.shape, valid_texts.shape, valid_labels.shape

((356,), (356,), (89,), (89,))
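
The split sizes can be reproduced with simple arithmetic. The sketch below assumes that the held-out part is rounded up to a whole number of samples, which is consistent with the shapes printed above; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size, rounding the held-out part up."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


n_train, n_holdout = split_sizes(2221, 0.2)    # first split: train vs. the rest
n_test, n_valid = split_sizes(n_holdout, 0.2)  # second split: test vs. valid
print(n_train, n_test, n_valid)  # 1776 356 89
```

The validation set therefore holds only 89 articles (about 4 % of the data), so a single misclassified article changes validation accuracy by more than one percentage point.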

Make sure that the label frequency distribution is approximately the same in all sets.

In [13]:
train_labels.value_counts(normalize=True), test_labels.value_counts(normalize=True), valid_labels.value_counts(normalize=True)

(label_encoded
 3    0.229730
 0    0.228041
 2    0.187500
 4    0.180743
 1    0.173986
 Name: proportion, dtype: float64,
 label_encoded
 3    0.230337
 0    0.227528
 2    0.188202
 4    0.179775
 1    0.174157
 Name: proportion, dtype: float64,
 label_encoded
 3    0.235955
 0    0.224719
 2    0.191011
 4    0.179775
 1    0.168539
 Name: proportion, dtype: float64)

Display train_texts. The training set is an array of articles. For now we do not limit the length of the articles. If the logistic regression results later turn out to be unacceptable, we will apply such a limit.

In [14]:
train_texts.values

array(['israeli club look africa four african player including zimbabwe goalkeeper energy murambadoro ready play israeli club hapoel bnei sakhnin uefa cup bnei sakhnin first arab side ever play european competition play english premiership side newcastle united first round warriors goalkeeper murambadoro made name african nations cup final tunisia helped bnei sakhnin overcome albanias partizani tirana previous round murambadoro moved israel recently brief stint south african club hellenic club israeli cup final last season based sakhnin near haifa club strong ethic high profile promoter peace cooperation within israel three africans club former cameroon defender ernest etchi dr congos alain masudi nigerian midfielder edith agoye stint tunisian side esperance',
       'humanoid robot learns run carmaker hondas humanoid robot asimo got faster smarter japanese firm leader developing twolegged robot new improved asimo advanced step innovative mobility run find way around obstacle well inte

### 2. Training the logistic regression model

Before training the model, tokenize the preprocessed text with the tokenizer from the nltk library.

In [15]:
train_texts_tokenized_clf = train_texts.apply(lambda x: ' '.join(word_tokenize(x)))
valid_texts_tokenized_clf = valid_texts.apply(lambda x: ' '.join(word_tokenize(x)))
test_texts_tokenized_clf = test_texts.apply(lambda x: ' '.join(word_tokenize(x)))

Vectorize the tokenized text with TfidfVectorizer. Then train a logistic regression model with default parameters, except that the solver is allowed up to 1000 iterations (max_iter=1000). In addition, because the classes are imbalanced, set class_weight="balanced" to balance them.

Print the accuracy computed on the training, test, and validation data. Also print the classification report, which contains the metrics for each class.

In [16]:
vectorizer = TfidfVectorizer()

train_texts_vectorized_clf = vectorizer.fit_transform(train_texts_tokenized_clf)
test_texts_vectorized_clf = vectorizer.transform(test_texts_tokenized_clf)
valid_texts_vectorized_clf = vectorizer.transform(valid_texts_tokenized_clf)

clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(train_texts_vectorized_clf, train_labels)

train_pred = clf.predict(train_texts_vectorized_clf)
train_accuracy = accuracy_score(train_labels, train_pred)

test_pred = clf.predict(test_texts_vectorized_clf)
test_accuracy = accuracy_score(test_labels, test_pred)
# Store the report under a new name so the imported classification_report function is not shadowed
test_report = classification_report(test_labels, test_pred)

valid_pred = clf.predict(valid_texts_vectorized_clf)
valid_accuracy = accuracy_score(valid_labels, valid_pred)

print(f"Train Accuracy: {train_accuracy * 100:.2f}%")
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")
print(f"Valid Accuracy: {valid_accuracy * 100:.2f}%")
print(f"\nClassification report for test:\n{test_report}")

Train Accuracy: 99.77%
Test Accuracy: 98.60%
Valid Accuracy: 100.00%

Classification report for test:
              precision    recall  f1-score   support

           0       1.00      0.96      0.98        81
           1       1.00      0.98      0.99        62
           2       0.97      0.99      0.98        67
           3       1.00      1.00      1.00        82
           4       0.96      1.00      0.98        64

    accuracy                           0.99       356
   macro avg       0.99      0.99      0.99       356
weighted avg       0.99      0.99      0.99       356



The metrics show that the logistic regression model handles the multiclass classification very well for every class. Accuracy is close to 100 % on both the training and the test set.
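
For intuition about the features the classifier receives, here is a simplified pure-Python sketch of TF-IDF weighting with smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation, which are the TfidfVectorizer defaults. The whitespace tokenization and the three toy documents are assumptions made for brevity; the real vectorizer uses its own token pattern and returns a sparse matrix.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalised)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df_t)) + 1 for term, df_t in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


# Toy corpus, not taken from the BBC dataset.
docs = ["profit rise profit", "match win", "profit match"]
vectors = tfidf(docs)
for vec in vectors:
    print({term: round(w, 3) for term, w in vec.items()})
```

Terms that occur in fewer documents get a larger idf, so topic-specific words dominate each article's vector. This is a plausible reason why a linear model separates the five topics so well.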

### 3. Training the DistilBert model

Tokenize the prepared data, this time with the DistilBert tokenizer. With truncation=True the tokenizer cuts each article to the model's maximum length of 512 tokens (tokens, not characters), and with padding=True it pads shorter sequences with zeros.

In [17]:
tokenizer_bert = transformers.DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [18]:
train_encodings = tokenizer_bert(list(train_texts), truncation=True, padding=True)
valid_encodings = tokenizer_bert(list(valid_texts), truncation=True, padding=True)
test_encodings = tokenizer_bert(list(test_texts), truncation=True, padding=True)
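
What truncation and padding do to a batch can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: it ignores special-token handling, and `pad_and_truncate` together with the token ids is hypothetical.

```python
def pad_and_truncate(batch_ids, max_length=512, pad_id=0):
    """Truncate each id sequence to max_length, then pad to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return input_ids, attention_mask


# Hypothetical token ids; max_length=4 keeps the example readable.
input_ids, attention_mask = pad_and_truncate([[101, 7, 8, 9, 10, 102], [101, 5, 102]], max_length=4)
print(input_ids)       # [[101, 7, 8, 9], [101, 5, 102, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The attention mask is what lets the model tell padding apart from real tokens, which is why it is passed to the model alongside `input_ids` in the training loop.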

To train the DistilBert model, prepare a Dataset and a DataLoader that feed the data in batches. The Dataset is built from the tokenized text and the labels. It is then passed to a DataLoader, which forms batches of 16 and shuffles the data (for the training set only).

In [19]:
class TorchDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx]).long()
        return item

    def __len__(self):
        return len(self.labels)

In [20]:
train_dataset = TorchDataset(train_encodings, np.array(train_labels))
test_dataset = TorchDataset(test_encodings, np.array(test_labels))
valid_dataset = TorchDataset(valid_encodings, np.array(valid_labels))

In [21]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16, shuffle=False)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=16, shuffle=False)

To evaluate the model, define a function that computes prediction accuracy: it compares the predicted class with the true one and returns the share of correct predictions.

In [22]:
def compute_accuracy(model, data_loader, device):
    with torch.no_grad():
        correct_pred, num_examples = 0, 0

        for batch_idx, batch in enumerate(data_loader):

            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs['logits']
            predicted_labels = torch.argmax(logits, 1)
            num_examples += labels.size(0)
            correct_pred += (predicted_labels == labels).sum()

        return correct_pred.float()/num_examples * 100

Load the DistilBert model with the transformers library. The task is classification into 5 classes, which we specify with an argument when creating the model (num_labels = 5). Use Adam as the optimizer with learning rate lr=1e-5.

In [23]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("DEVICE: ", DEVICE)
model_bert = transformers.DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels = 5)
model_bert.to(DEVICE)
model_bert.train()

optim = torch.optim.Adam(model_bert.parameters(), lr=1e-5)

DEVICE:  cuda


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


As with logistic regression, we balance the classes. Unlike logistic regression, however, the deep learning model here has no equally convenient built-in option for class balancing. We therefore compute the weights with sklearn and pass them to the CrossEntropyLoss loss function. Note that this correction was initially not part of the model, and the model still showed excellent results. Adding it did not improve the results, but it reflects the correct approach for imbalanced classes.

In [24]:
from sklearn.utils.class_weight import compute_class_weight
import torch.nn as nn

class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(train_labels), y=train_labels)

class_weights_tensor = torch.tensor(class_weights, dtype=torch.float32).to(DEVICE)

loss_fn = nn.CrossEntropyLoss(weight=class_weights_tensor)
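
To make the effect of the `weight` argument concrete, here is a pure-Python sketch of weighted cross-entropy with mean reduction, where the weighted per-sample losses are divided by the sum of the weights of the target classes. The function `weighted_cross_entropy`, the logits, and the weights are illustrative assumptions, not values from the notebook.

```python
import math


def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted mean of per-sample cross-entropy: sum(w_y * loss) / sum(w_y)."""
    weighted_sum, weight_total = 0.0, 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp for one row of logits.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        nll = log_norm - row[target]  # negative log-probability of the true class
        weighted_sum += class_weights[target] * nll
        weight_total += class_weights[target]
    return weighted_sum / weight_total


logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 1.5]]  # hypothetical 3-class logits
targets = [0, 1]                             # the second sample is misclassified
uniform = weighted_cross_entropy(logits, targets, [1.0, 1.0, 1.0])
weighted = weighted_cross_entropy(logits, targets, [1.0, 3.0, 1.0])
print(f"{uniform:.4f} {weighted:.4f}")
```

With a larger weight on class 1, the misclassified class-1 sample contributes more and the loss grows. With weights close to 1, as is typical for a mildly imbalanced dataset, the weighted and unweighted losses are nearly the same, which is consistent with the correction not changing the results here.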

Train the model and evaluate its quality on the training and test data. At the end, the model's quality is checked on the validation data.

In [25]:
import time
start_time = time.time()

n_epochs = 4
for epoch in range(n_epochs):

    model_bert.train()

    for batch_idx, batch in enumerate(train_loader):

        input_ids = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        labels = batch['labels'].to(DEVICE)

        outputs = model_bert(input_ids, attention_mask=attention_mask, labels=labels)
        logits = outputs['logits']
        loss = loss_fn(logits, labels)

        optim.zero_grad()
        loss.backward()
        optim.step()

        if not batch_idx % 30:
            print (f'Epoch: {epoch+1:04d}/{n_epochs:04d} | '
                   f'Batch {batch_idx:04d}/{len(train_loader):04d} | '
                   f'Loss: {loss:.4f}')

    model_bert.eval()

    with torch.no_grad():
        print(f'Training accuracy: '
              f'{compute_accuracy(model_bert, train_loader, DEVICE):.2f}%'
              f'\nTest accuracy: '
              f'{compute_accuracy(model_bert, test_loader, DEVICE):.2f}%')

    print(f'Time elapsed: {(time.time() - start_time)/60:.2f} min\n')

print(f'Total Training Time: {(time.time() - start_time)/60:.2f} min')
print(f'Valid accuracy: {compute_accuracy(model_bert, valid_loader, DEVICE):.2f}%')

Epoch: 0001/0004 | Batch 0000/0111 | Loss: 1.6301
Epoch: 0001/0004 | Batch 0030/0111 | Loss: 1.3137
Epoch: 0001/0004 | Batch 0060/0111 | Loss: 0.8163
Epoch: 0001/0004 | Batch 0090/0111 | Loss: 0.4038
Training accuracy: 97.41%
Test accuracy: 98.03%
Time elapsed: 1.85 min

Epoch: 0002/0004 | Batch 0000/0111 | Loss: 0.2808
Epoch: 0002/0004 | Batch 0030/0111 | Loss: 0.2931
Epoch: 0002/0004 | Batch 0060/0111 | Loss: 0.1240
Epoch: 0002/0004 | Batch 0090/0111 | Loss: 0.1399
Training accuracy: 99.27%
Test accuracy: 98.60%
Time elapsed: 3.73 min

Epoch: 0003/0004 | Batch 0000/0111 | Loss: 0.0714
Epoch: 0003/0004 | Batch 0030/0111 | Loss: 0.0693
Epoch: 0003/0004 | Batch 0060/0111 | Loss: 0.0416
Epoch: 0003/0004 | Batch 0090/0111 | Loss: 0.0789
Training accuracy: 99.49%
Test accuracy: 98.88%
Time elapsed: 5.61 min

Epoch: 0004/0004 | Batch 0000/0111 | Loss: 0.0362
Epoch: 0004/0004 | Batch 0030/0111 | Loss: 0.0277
Epoch: 0004/0004 | Batch 0060/0111 | Loss: 0.0277
Epoch: 0004/0004 | Batch 0090/0111

Compute the loss on the test set (test_loader).

In [26]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_bert.eval()

test_loss = 0.0
correct = 0

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        labels = batch['labels'].to(DEVICE)

        outputs = model_bert(input_ids, attention_mask=attention_mask, labels=labels)
        logits = outputs['logits']

        loss = loss_fn(logits, labels)
        test_loss += loss.item()

        predicted_labels = torch.argmax(logits, dim=1)
        correct += (predicted_labels == labels).sum().item()

test_loss /= len(test_loader)

test_accuracy = correct / len(test_loader.dataset) * 100.

print(f'\nTest Loss: {test_loss:.4f}')
print(f'Test Accuracy: {test_accuracy:.2f}%\n')


Test Loss: 0.0470
Test Accuracy: 98.88%



The DistilBert model shows excellent results. Accuracy is close to 100 % on all sets. The loss is low on both the training and the test set, which suggests that the training parameters were chosen well.

### 4. Training the DistilBert model with Trainer (optional part, for research purposes)

To train with Trainer, load a separate model. Use the same training parameters as for the previous model (number of classes, batch size, lr). However, class balancing is not applied here, because Trainer computes the loss itself under the hood.

In [27]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_hugg_bert = transformers.DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels = 5)
model_hugg_bert.to(DEVICE)
model_hugg_bert.train()

optim = torch.optim.Adam(model_hugg_bert.parameters(), lr=1e-5)
metric = load("accuracy")

training_args = TrainingArguments(
    output_dir='results',
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_dir='logs',
    logging_steps=30,
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [28]:
def compute_metrics_trainer(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(
               predictions=predictions, references=labels)

In [29]:
trainer = Trainer(
    model=model_hugg_bert,
    compute_metrics=compute_metrics_trainer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    optimizers=(optim, None)
)

In [30]:
import time

start_time = time.time()
trainer.train()
print(f'Total Training Time: {(time.time() - start_time)/60:.2f} min')

Step,Training Loss
30,1.5119
60,1.1214
90,0.709
120,0.4395
150,0.2815
180,0.1949
210,0.1793
240,0.1152
270,0.0947
300,0.0954


Total Training Time: 6.22 min


In [31]:
trainer.evaluate()

{'eval_loss': 0.07398007810115814,
 'eval_accuracy': 0.9887640449438202,
 'eval_runtime': 5.3575,
 'eval_samples_per_second': 66.448,
 'eval_steps_per_second': 4.293,
 'epoch': 4.0}

In [32]:
model_hugg_bert.eval()
model_hugg_bert.to(DEVICE)
print(f'Train accuracy: {compute_accuracy(model_hugg_bert, train_loader, DEVICE):.2f}%')
print(f'Test accuracy: {compute_accuracy(model_hugg_bert, test_loader, DEVICE):.2f}%')
print(f'Valid accuracy: {compute_accuracy(model_hugg_bert, valid_loader, DEVICE):.2f}%')

Train accuracy: 99.27%
Test accuracy: 98.88%
Valid accuracy: 100.00%


The results of training DistilBert with Trainer are also excellent. They match the results obtained for the same model without Trainer: test accuracy is 98.88 % in both cases.

### 5. Wrapping the solution in a Pipeline

Wrap the solution above in a Pipeline. The pipeline should cover the three main stages of training the model: preparing the text, vectorizing it (with tokenization first), and training the logistic regression model itself.

For text preprocessing we already have the preprocess_text function used in BbcDataset. Based on it, create the TextPreprocessor class, so that it can be included in the pipeline alongside the vectorizer and the classifier. The resulting pipeline is fitted on the training set and used for prediction through the predict_log function, which returns the predicted class label and the class probabilities, from which the probability of the predicted class is taken.

In [33]:
from sklearn.base import BaseEstimator, TransformerMixin

class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.mystopwords = stopwords.words('english') + ['also', 'the', 'this', 'that', 'not', 'would', 'could', 'did']
        self.lemmatizer = WordNetLemmatizer()

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [self.preprocess_text(text) for text in X]

    def preprocess_text(self, text: str) -> str:
        """
        Removes everything except English letters and whitespace, then removes stop words and lemmatizes
        :param text: Text to process
        :return: Processed text string
        """
        def remove_stopwords(text, mystopwords=self.mystopwords):
            text = re.sub(r'[^a-zA-Z\s]', '', text)
            return " ".join([token for token in text.split() if token.lower() not in mystopwords])

        def lemmatize(text):
            tokens = word_tokenize(text)
            return " ".join([self.lemmatizer.lemmatize(word) for word in tokens])

        text = remove_stopwords(text)
        text = lemmatize(text)

        return text


logistic_regression_pipeline = Pipeline([
    ('preprocessor', TextPreprocessor()),
    ('vectorization', TfidfVectorizer(tokenizer=word_tokenize)),
    ('classifier', LogisticRegression())
])

logistic_regression_pipeline.fit(train_texts, train_labels)

def predict_log(article):
    predictions = logistic_regression_pipeline.predict(article)
    probabilities = logistic_regression_pipeline.predict_proba(article)
    return predictions, probabilities

Create a pipeline for model_bert as well (the DistilBert model trained without Trainer).

In [34]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def preprocess_text(text: str) -> str:
    """
    Removes everything except English letters and whitespace, then removes stop words and lemmatizes
    :param text: Text to process
    :return: Processed text string
    """

    mystopwords = stopwords.words('english') + ['also', 'the', 'this', 'that', 'not', 'would', 'could', 'did']

    def remove_stopwords(text, mystopwords=mystopwords):
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        return " ".join([token for token in text.split() if token.lower() not in mystopwords])

    def lemmatize(text):
        lemmatizer = WordNetLemmatizer()
        tokens = word_tokenize(text)
        return " ".join([lemmatizer.lemmatize(word) for word in tokens])

    text = remove_stopwords(text)
    text = lemmatize(text)

    return text

class DistillBertPipeline:
    def __init__(self, model, tokenizer, preprocess_func, device=DEVICE):
        self.model = model
        self.tokenizer = tokenizer
        self.preprocess_func = preprocess_func
        self.device = device

    def predict(self, texts):
        preprocessed_text = [self.preprocess_func(text) for text in texts]

        self.model.eval()
        encodings = self.tokenizer(preprocessed_text, truncation=True, padding=True, return_tensors='pt').to(self.device)
        with torch.no_grad():
            outputs = self.model(**encodings)
        logits = outputs.logits
        probabilities = torch.softmax(logits, dim=1).cpu().numpy()
        predictions = torch.argmax(logits, dim=1).cpu().numpy()
        return predictions, probabilities

distillbert_pipeline = DistillBertPipeline(model_bert, tokenizer_bert, preprocess_func=preprocess_text, device=DEVICE)

def predict_distillbert(texts):
    return distillbert_pipeline.predict(texts)
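
The step from logits to a "label plus probability" answer can be shown without torch. The sketch below applies a numerically stable softmax to one row of logits and picks the most likely class. The logits are hypothetical; the class order is the LabelEncoder order seen earlier.

```python
import math

# Class names in LabelEncoder (alphabetical) order.
CLASS_NAMES = ["business", "entertainment", "politics", "sport", "tech"]


def label_and_probability(logits):
    """Return the most likely class name and its softmax probability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtracting the max avoids overflow
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASS_NAMES[best], probs[best]


name, prob = label_and_probability([0.1, -1.2, 0.3, 4.0, 0.0])  # hypothetical logits
print(name, round(prob, 3))
```

Softmax probabilities from a fine-tuned transformer tend to be very close to 1 even for wrong answers, so they are better read as a relative confidence score than as a calibrated probability.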

### 6. Predicting article classes

To make the predictions easier to inspect, present them as a pandas table with 3 columns: the article (article), the class label (as text rather than a number), and the probability of the predicted class.

In [35]:
def predictions_to_dataset(predict_function, article_set, model_name="log"):
    all_articles = []
    predicted_classes = []
    predicted_probabilities = []
    
    for articles in article_set:
        for article in articles:
            predictions, probabilities = predict_function([article])
            predicted_class = bbc_pandas_dataset.label_encoder.inverse_transform(predictions)
            
            all_articles.append(article)
            predicted_classes.append(predicted_class[0])
            # Index the single prediction explicitly instead of converting a whole array with int()
            predicted_probabilities.append(probabilities[0][int(predictions[0])])
    
    df = pd.DataFrame({
        "article": all_articles,
        "label_"+model_name: predicted_classes,
        "probability_"+model_name: predicted_probabilities
    })
    
    return df

Build a set of articles from bbc.com, 5 articles for each class.

In [36]:
articles_business = ["From chatbots to intelligent toys: How AI is booming in China Head in hands, eight-year-old Timmy muttered to himself as he tried to beat a robot powered by artificial intelligence at a game of chess. But this was not an AI showroom or laboratory – this robot was living on a coffee table in a Beijing apartment, along with Timmy. The first night it came home, Timmy hugged his little robot friend before heading to bed. He doesn't have a name for it – yet. Its like a little teacher or a little friend, the boy said, as he showed his mum the next move he was considering on the chess board. Moments later, the robot chimed in: Congrats! You win. Round eyes blinking on the screen, it began rearranging the pieces to start a new game as it continued in Mandarin: Ive seen your ability, I will do better next time. China is embracing AI in its bid to become a tech superpower by 2030. DeepSeek, the breakthrough Chinese chatbot that caught the world's attention in January, was just the first hint of that ambition. Money is pouring into AI businesses seeking more capital, fuelling domestic competition. There are more than 4,500 firms developing and selling AI, schools in the capital Beijing are introducing AI courses for primary and secondary students later this year, and universities have increased the number of places available for students studying AI.",
                      "Trump doubles planned tariffs on Canadian metal US President Donald Trump has said he will double the tariffs he previously announced on Canadian steel and aluminium imports into the US, taking the levies to 50% in total. In the latest twist in a deepening trade war, Trump said it was in retaliation for 25% tariffs Ontario placed on electricity it sends to northern US states. Trump said if tariffs including those on agricultural products were not dropped, he would hike taxes on the car industry, which will, essentially, permanently shut down the automobile manufacturing business in Canada. Ontario premier Doug Ford said: Until the threat of tariffs is gone for good, we won't back down. Ford added in a post on X that Trump had launched an unprovoked trade and tariff war with America's closest friend and ally. Writing on his social media platform Truth Social, Trump said his tariffs would go into effect on Wednesday morning, and that he would declare a national emergency on electricity in those states. He also said Canada relied on the US for military protection, and reiterated that he wanted the country to become the 51st US state. He add that it would make all tariffs, and everything else, totally disappear, if Canada were to join the US as a state. Canadian Prime Minister-designate Mark Carney has previously said Canada will never be part of America in any way, shape or form.",
                      "European stocks steady after US markets plunge A sell-off in global shares eased in Europe on Tuesday following a sharp fall in US stocks that came as investors raised concerns about the negative economic impact of President Donald Trump's tariffs. It followed the president saying in a TV interview that the world's biggest economy was in a period of transition, when asked about suggestions of a potential recession. Since those remarks were broadcast on Sunday, top Trump officials and advisers have sought to calm investor fears. The US S&P 500 share index fell nearly 3% on Monday, but in Europe most of the major markets opened little changed. In a Fox News interview broadcast at the weekend but recorded on Thursday, Trump appeared to acknowledge concerns about the economy. I hate to predict things like that, he said. There is a period of transition because what we're doing is very big. We're bringing wealth back to America. That's a big thing. Charu Chanana, an investment strategist at investment bank Saxo, told the BBC: The previous notion of Trump being a stock market president is being re-evaluated. On Monday in New York, the S&P 500, which tracks the biggest companies listed in the US, ended the trading day 2.7% lower, while the Dow Jones Industrial Average dropped 2%. The tech-heavy Nasdaq share index was hit particularly hard, sinking 4%.",
                      "Starmer says benefit system unfair and indefensible Sir Keir Starmer has called the current benefits system unsustainable, indefensible and unfair, and said the government could not shrug its shoulders and look away. Addressing Labour MPs on Monday evening, the prime minister said the current welfare system was the worst of all worlds, discouraging people from working while producing a spiralling bill. The comments come as Work and Pensions Secretary Liz Kendall prepares to set out changes to the welfare system and cut the benefits bill in the coming weeks. Chancellor Rachel Reeves has earmarked several billion pounds in draft spending cuts to welfare and other government departments ahead of the Spring Statement. Changes likely to be announced in the coming days include restrictions on eligibility for the Personal Independent Payment, which provides help with extra living costs to those with a long-term physical or mental health condition, and cuts to incapacity benefits for people unable to work and receiving Universal Credit. There is unease over the plans within the party, with Labour MP Rachael Maskell warning against draconian cuts that risk pushing disabled people into poverty. Maskell told the BBC she had picked up deep, deep concern among Labour MPs. She said: I look in the past at what Labour has achieved in this space and believe that we can hold on to our values, ensure that we're helping people and not harming people. Another Labour MP, Neil Duncan-Jordan, also expressed concern, telling Newsnight: If we are going to make poor people poorer then there will be a number of MPs who won't be able to sign up to that. It feels like it could be a re-run of austerity and I'm worried about that.",
                      "Luxury lounges: Credit card perks 'we are all paying for' I'm standing in what feels like a suite in a posh hotel - all soft lighting, marble counter tops, plush seating and parquet-style flooring. An ornate platter of food catches the corner of my eye. That is a seafood tower as a welcome food amenity, as well as caviar, and you can also see the champagne there for guests to enjoy, says Dana Pouwels, head of airport lounge benefits at US bank Chase. She is showing me around Chase's new Sapphire Lounge at New York's La Guardia Airport. This is what waiting to catch a flight can look like these days – if you can afford to pay $550 (£433) a year to have the correct credit card required to gain entry. Then once inside Chase's new La Guardia lounge you can then choose to pay up to $3,000 to access a private suite for a few hours. It is all part of what has been described as a global arms race among the credit card companies that you are probably entirely unaware of – they are competing to outdo each other with bigger, better, bolder airport lounges. And while most of us don't have access to these lounges, experts say we are almost certainly helping pay for them. And they don't come cheap. Yes, it's an arms race, and they're getting extraordinary, says Clint Thompson, the news editor at the flight and travel website The Points Guy. From what we do know, we're talking up to tens of millions of dollars per lounge."]
links_business = ["https://www.bbc.com/news/articles/ckg8jqj393eo",
                  "https://www.bbc.com/news/articles/cm2y811g1dgo",
                  "https://www.bbc.com/news/articles/c4gdwgjkk1no",
                  "https://www.bbc.com/news/articles/c0kgpyz3mmpo",
                  "https://www.bbc.com/news/articles/c5yen4k1xnko"]


articles_sport = ["Golden Ace wins dramatic Champion Hurdle at 25-1 Golden Ace was a shock 25-1 winner of the Champion Hurdle at Cheltenham after previous victors Constitution Hill and State Man fell. The 2023 champion Constitution Hill was sent off odds-on favourite but came down early in the big race. Last year's winner State Man looked set to claim victory before a dramatic fall at the last hurdle. That left Golden Ace, ridden by Lorcan Williams for trainer Jeremy Scott, to come through and win by nine lengths from 66-1 outsider Burdett Road. Both Constitution Hill and State Man galloped away unscathed from their falls. I'm lost for words. This is the best day of my life by far, said Williams who shrugged his shoulder after a first top-level Grade One win. You dream of these moments as a kid. I hope the others are OK - Constitution Hill and State Man are iconic horses. Somerset-based Scott said: It's marred by the two horses who came down, but I'm just delighted the gods favoured us. The trainer had been persuaded to run in the feature race, rather than the Mares' Hurdle, by owner Ian Gosden. Borrowing a line from Only Fools and Horses, Gosden said: He who dares, wins, Rodney. Constitution Hill, trained by Nicky Henderson, had recovered from a respiratory issue and suspected colic since his triumph two years ago to go into the race unbeaten from 10 starts. He was sent off the 1-2 favourite and his fall at the fifth hurdle is estimated to have saved bookmakers a £10m payout. Nobody is hurt, they're two jockeys and two horses who've had proper old falls but they're all OK. That's the main thing, said Henderson.",
                  "'Wales cannot let England win title in Cardiff' Wales lock Dafydd Jenkins says his side cannot let England realise their Six Nations title ambitions in Cardiff this weekend. Steve Borthwick's side will arrive at the Principality Stadium on Saturday afternoon hoping they can win the title, although France remain favourites when they host Scotland after events in Cardiff have finished. In contrast to the title contenders, Wales have lost 16 successive internationals and are bottom of the table hoping to avoid another Wooden Spoon. England hooker Jamie George has called the game Wales' World Cup final. This is the game you dream of playing in as a kid, said Jenkins. This game is a special one, and one we're definitely up for. They can win the championship, so we can't be having that in Cardiff. Jenkins plays his club rugby in Exeter so bragging rights are also important. Exeter centre Henry Slade is in the England squad, but missed Sunday's 42-17 win against Italy because of injury. If I go back to Exeter with a win I'd be a lot happier, in terms of you can start ripping into a few of the other boys, said Jenkins. I've taken my fair share of stick off them, so it'd be nice to hand out a bit. To derail England's hopes, Wales have to win a Test match for the first time since October 2023. Wales have also lost their past eight home matches and previous eight Six Nations games played at the Principality Stadium.",
                  "'Boxing's not broken' - Hearn responds to White's new league Eddie Hearn denies boxing is broken after UFC president Dana White signed a deal with Saudi Arabian investors to create a new boxing league. Details of this league are unclear, with White declaring in some interviews they will rebuild boxing from the ground up and have their own world titles, while in others saying the league would focus on young talent. Hearn, one of the biggest boxing promoters in the world, took issue with White suggesting boxing is broken. I think it's great for boxing, Matchroom's Hearn said on 5 Live Boxing with Steve Bunce podcast. One thing I disagree with, is boxing's not broken. Boxing is in a great place, it always has been. There's always ways we can improve it, but the fact those guys want to come into boxing shows where it's at. White will partner with Turki Alalshikh, chairman of Saudi Arabia's general entertainment authority, who has spearheaded the Saudi investment in boxing in the last two years. The new outfit will fall under the TKO banner, which owns the UFC and WWE. The UFC use a league system in MMA, signing fighters to long-term, exclusive deals and having their own promotional world title. TKO is expected to take over the operation of some of Saudi's major boxing events, including the mooted super-fight between Saul 'Canelo' Alvarez and Terence Crawford in September in Las Vegas. MMA in America, however, is not bound by the 2000 Ali Act and Professional Boxing Safety Act 1996, which set legal guidelines for writing contracts and limits to the amount of time fighters can be signed to a promotion. TKO president Mark Shapiro has spoken out against Ali Act recently, and Hearn is unsure if the UFC model can thrive in boxing.",
                  "Scots call up 'two for the future' Miller & Wilson Uncapped teenagers Lennon Miller and James Wilson have been named in Scotland's squad for this month's Nations League play-off matches with Greece. Meanwhile, Lewis Ferguson, Kevin Nisbet and Kieran Tierney have been recalled by Steve Clarke. Motherwell midfielder Miller and Hearts forward Wilson, both 18, have represented their country at age grade level. Well regular Miller has captained his club side of late while Wilson, who also qualifies for Northern Ireland, has broken into the Hearts team in recent months, scoring six times. A lot of call-offs, a lot of injuries, especially in middle to forward areas, Clarke said. I just felt it was a chance to have a look at two young boys, who've caught the eye - Lennon certainly over the last 18 months and James over the last six months. Two for the future but also can help us just now. Lennon plays with a maturity beyond his year. Good qualities, he can play deep in midfield, can play higher midfield, good delivery, box-to-box, good energy. Now we have to see if he can fit in among the group. [James] is someone who catches your eye, runs behind, looks to score goals, which is a great trait. We're always looking for goal scorers.",
                  "Bradley and Ballard out as Hale named in NI squad Conor Bradley and Daniel Ballard will miss Northern Ireland's friendlies with Switzerland and Sweden, while there is a first call-up for striker Ronan Hale. Ross County's Hale has been included after his international clearance came through in January after switching his allegiance from the Republic of Ireland. Northern Ireland host Switzerland at Windsor Park on Friday, 21 March before Michael O'Neill's side face Sweden in Stockholm four days later - with both matches available to watch on BBC Sport and BBC Northern Ireland. Michael O'Neill has made a number of changes to his squad from November's Nations League matches. Key defenders Bradley and Ballard are left out through injury, as is midfielder Ali McCann, while goalkeeper Bailey Peacock-Farrell and striker Josh Magennis drop to the standby list. Caolan Boyd-Munce, Jamie Reid and Jamal Lewis miss out on O'Neill's 25-man squad. With Magennis' omission, Paddy McNair is the sole remaining player from the Euro 2016 squad as he returns from injury, while goalkeeper Conor Hazard, defenders Eoin Toal and Aaron Donnelly and forward Dale Taylor have also been recalled. Portsmouth's Terry Devlin, who can play at right-back or in midfield, is the only uncapped player in O'Neill's squad."]
links_sport = ["https://www.bbc.com/sport/horse-racing/articles/cwygpwxpdeyo",
               "https://www.bbc.com/sport/rugby-union/articles/c2kgp212nkzo",
               "https://www.bbc.com/sport/articles/cjevzl9v58yo",
               "https://www.bbc.com/sport/football/articles/cr52zdm88pro",
               "https://www.bbc.com/sport/football/articles/cx2edk25wvxo"]


articles_politics = ["How JD Vance sees the world - and why that matters An argument in the White House tore apart the US alliance with Ukraine, shook European leaders and highlighted JD Vance's key role in forcefully expressing Donald Trump's foreign policy. The vice-president has come out punching on the global stage - so what is it that drives his worldview? Vance's first major foreign speech, at the Munich Security Conference in mid-February, caught many by surprise. Rather than focusing on the war raging in Ukraine, the US vice-president only briefly mentioned the bloodiest European conflict since World War Two. Instead, he used his debut on the international stage to berate close US allies about immigration and free speech, suggesting the European establishment was anti-democratic. He accused them of ignoring the wills of their people and questioned what shared values they were truly banding together with the US to defend. If you are running in fear of your own voters, there is nothing America can do for you, nor for that matter is there anything you can do for the American people, he warned. It was a bold and perhaps unexpected way to introduce himself to the world - by angering European allies. But days later he was back in the news, at the centre of a blistering row with Ukrainian President Volodymyr Zelensky, whom he accused of being ungrateful. For those who have been studying the rise of Vance, these two episodes came as no surprise. The vice-president has come to represent an intellectual wing of the conservative movement that gives expression to Trumpism and in particular how its America First mantra applies beyond its borders. In writings and interviews, Vance has expressed an ideology that seems to join the dots between American workers, global elites and the role of the US in the wider world.",
                    "Pakistan militants attack train and take passengers hostage Armed militants in Pakistan's Balochistan region have attacked a train carrying hundreds of passengers and taken a number of hostages, military sources have told the BBC. The Baloch Liberation Army (BLA) fired at the Jaffar Express Train as it travelled from Quetta to Peshawar. A statement from the separatist group said it had bombed the track before storming the train in remote Sibi district. It claimed the train was under its control. Pakistani police told local reporters at least three people, including the train driver, had been injured. Security forces have been sent to the scene, as well as helicopters to try to rescue hostages, police told the BBC. There were reports of intense firing at the train, a Balochistan government spokesman told local newspaper Dawn. A senior police official said it remains stuck just before a tunnel surrounded by mountains, AFP news agency reports. A senior army official confirmed to the BBC that there were more than 100 army personnel travelling from Quetta on the train. The Baloch Liberation Army has warned of severe consequences if an attempt is made to rescue those it is holding. It has waged a decades-long insurgency to gain independence and has launched numerous deadly attacks, often targeting police stations, railway lines and highways. The Pakistani authorities - as well as several Western countries, including the UK and US - have designated the BLA as a terrorist organisation.",
                    "Sweden says Russia is greatest threat to its security Russia poses the greatest threat to Sweden due to its aggressive attitude towards the West, the Scandinavian nation's security service Sapo has said. It wrote in its annual report that while Sweden joining the Nato military alliance had strengthened its security, it had also led to increased Russian intelligence activity. Russia denies any wrongdoing. Sapo also said that the security situation in Sweden was serious - with foreign powers operating in more threatening ways, with hybrid warfare, alongside incidents of violent extremism. Charlotte von Essen, the head of Sapo, said there was a tangible risk that the security situation can deteriorate further in a way that may be hard to predict. Sweden became a Nato member last year, seeing it as the best guarantee against Russia following its full-scale invasion of Ukraine in February 2022. That January, its civil defence minister warned there could be war in Sweden in the near future due to Russian aggression. Sapo said on Tuesday that Russia's intelligence activities were primarily aimed at undermining cohesion between Nato members, counteracting Western support for Ukraine, and circumventing sanctions. It said these activities showed Russia was becoming increasingly offensive and risk-prone in the face of a build-up of Swedish, and wider European, defences. When gathering intelligence, the Russian security and intelligence services use a wide range of resources and different platforms, the agency wrote, adding that these activities had been limited by expelling intelligence officers. Ms von Essen said Swedes needed to be vigilant about widespread anti-state narratives and conspiracy theories that seek to act as a destabilising force, adding that it was important that we do not normalise the new situation. In its report, Sapo mentioned suspicious incidents involving infrastructure and which countries may have been behind them in some cases",
                    "Greenland goes to polls in vote dominated by Trump and independence Residents of Greenland are going to the polls in a vote that in previous years has drawn little outside attention - but which may prove pivotal for the Arctic territory's future. US President Donald Trump's repeated interest in acquiring Greenland has put it firmly in the spotlight and fuelled the longstanding debate on the island's future ties with Copenhagen. There's never been a spotlight like this on Greenland before, says Nauja Bianco, a Danish-Greenlandic policy expert on the Arctic. Greenland has been controlled by Denmark – nearly 3,000km (1,860 miles) away – for about 300 years. It governs its own domestic affairs, but decisions on foreign and defence policy are made in Copenhagen. Now, five out of six parties on the ballot favour Greenland's independence from Denmark, differing only on how quickly that should come about. Voting takes place over 11 hours at 72 polling stations, and ends at 20:00 local time on Tuesday (22:00G). The debate over independence has been put on steroids by Trump, says Masaana Egede, editor of Greenlandic newspaper Sermitsiaq. The island's strategic location and untapped mineral resources have caught the US president's eye. He first floated the idea of buying Greenland during his first term in 2019. Since taking office again in January, he has reiterated his intention to acquire the territory. Greenland and Denmark's leaders have repeatedly rebuffed his demands. Addressing the US Congress last week, however, Trump again doubled down. We need Greenland for national security. One way or the other we're gonna get it, he said, prompting applause and laughter from a number of politicians, including Vice-President JD Vance.",
                    "Kurdish-led SDF agrees to integrate with Syrian government forces A Kurdish-led militia alliance which controls north-eastern Syria has signed a deal to integrate all military and civilian institutions into the Syrian state, the country's presidency says. The agreement, which includes a complete cessation of hostilities, says the US-backed Syrian Democratic Forces (SDF) will hand over control of the region's border posts, airport, and vital oil and gas fields. It also recognises the Kurdish minority as an integral part of the Syrian state and guarantees the rights of all Syrians to representation and participation in the political process. SDF commander Mazloum Abdi called the deal a real opportunity to build a new Syria. We are committed to building a better future that guarantees the rights of all Syrians and fulfils their aspirations for peace and dignity, he wrote on X after signing the deal in Damascus on Monday alongside interim President Ahmed al-Sharaa. The deal represents a major step towards Sharaa's goal to unify the fractured country after his Sunni Islamist group led the rebel offensive that overthrew president Bashar al-Assad in December and ended 13 years of devastating civil war. It could also de-escalate the SDF's conflict with neighbouring Turkey and Turkish-backed Syrian former rebel factions allied to the government, which are trying to push the alliance out of areas near the border. There were celebrations welcoming the announcement of the deal on the streets of several cities on Monday night, with many people expressing relief at a time when Syria is facing several other threats to its stability."]
links_politics = ["https://www.bbc.com/news/articles/cly82yx09zeo",
                  "https://www.bbc.com/news/articles/c5y2q5v9249o",
                  "https://www.bbc.com/news/articles/c89y8gn2w8vo",
                  "https://www.bbc.com/news/articles/cr4236e2wz2o",
                  "https://www.bbc.com/news/articles/cedlx0511w7o"]


articles_tech = ["How a tiny village became India's YouTube capital In Tulsi, a village in central India, social media has sparked an economic and social revolution. It's a microcosm of YouTube's effect on the world. As villagers head into the fields of Tulsi, a village outside Raipur in central India, on a muggy September morning, 32-year-old YouTuber Jai Varma asks a group of women to join him for his latest video. They gather around him – adjusting their sarees and sharing a quick word and a smile.  Varma places an elderly woman on a plastic chair, asks another to touch her feet and a third to serve water, staging a scene of a rural village festival for fans who will enjoy his content from cities and countries thousands of kilometres away. The women, familiar with this kind of work, are happy to oblige. Varma captures the moment, and they return to their farmwork. A few hundred metres away, another group is busy setting up their own production. One holds up a mobile phone, filming as 26-year-old Rajesh Diwar moves to the rhythm of a hip-hop track, his hands and body animated in the expressive style of a seasoned performer. Tulsi is like any other Indian village. The small outpost in the central state of Chhattisgarh is home to one-storey houses and partially paved roads. A water storage tank peers out above the buildings, overseeing the town. Banyan trees with concrete bases serve as gathering spots. But what sets Tulsi apart is its distinction as India's YouTube Village. Some 4,000 people live in Tulsi, and reports suggest more than 1,000 of them work on YouTube in some capacity. Walk around the village itself and it's hard to find someone who hasn't appeared in one of the many videos being filmed there.  The money that YouTube brings has transformed the local economy, locals say, and beyond financial benefits, the social media platform has become an instrument for equality and social change. The residents who've launched successful YouTube channels and found new streams of income include a number of women who previously had few opportunities for advancement in this rural setting. Under the banyan trees, conversations have turned to technology and the internet.",
                 "Global smartwatch sales fall for first time Global sales of smartwatches have fallen for the first time, new figures indicate, in large part due to a sharp decline in the popularity of market leader, Apple. Market research firm Counterpoint says 7% fewer of the devices were shipped in 2024 compared to the year before. Shipments of Apple Watches fell by 19% in that period, Counterpoint says. It blames the slump on a lack of new features in Apple's latest devices, and the fact a rumoured high-end Ultra 3 model never materialised. The biggest driver of the decline was North America, where the absence of the Ultra 3 and minimal feature upgrades in the S10 lineup led consumers to hold back purchases, said Counterpoint senior research analyst Anshika Jain. Apple was also hit with sales and import bans in the US in late 2023 and early 2024 over a disputed patent regarding blood oxygen level monitoring - which Ms Jain says also contributed to lower sales figures in the first half of 2024. It retained 22% of market share in the final three months of 2024, down from 25% a year earlier. We've been through a period where the smartwatch has gone from being a new and exciting gadget, to something now that's stabilising - the feature set isn't changing very dramatically year over year, said Leo Gebbie, principal analyst at CCS Insight. Despite the overall decline, last year did see a massive rise in sales for Chinese-made smartwatches from brands such as Xiaomi, Huawei and Imoo.",
                 "'Garbage' to blame Ukraine for massive X outage, experts say Experts have cast doubt on Elon Musk's claim that a large-scale outage which hit X was caused by hackers in Ukraine. Platform monitor Downdetector says it had more than 1.6 million reports of problems with the social media site from users around the world on Monday. We're not sure exactly what happened but there was a massive cyber-attack to try and bring down the X system with IP [Internet Protocol] addresses originating in the Ukraine area, Musk said in an interview with the Fox Business channel. However, Ciaran Martin, professor at Oxford University's Blavatnik School of Government told the BBC that explanation was wholly unconvincing and pretty much garbage. Prof Martin - former head of the UK's National Cyber Security Centre - says it looks as if X was targeted by what's known as a distributed denial of service (DDoS) attack, where hackers flood a server with internet traffic to prevent users from connecting to a website. It's not that sophisticated - it's a very old technique, Mr Martin told Radio 4's Today programme. I can't think of a company of the size and standing internationally of X that's fallen over to a DDoS attack for a very long time, he added. He said the incident at X doesn't reflect well on their cyber security. Many users trying to access the platform and refresh feeds on its app and desktop site during Monday's outages were met with a loading icon. Musk, who has been a frequent critic of Ukraine and its President Volodymyr Zelensky, has offered no evidence to support his claim and did not say whether or not he thought state actors were involved. He posted on X that either a large, coordinated group and/or a country is involved. But Prof Martin said tracing IP addresses tells you absolutely nothing, because hackers in this situation would hijack devices from all over the world.",
                 "Deep-sea mining tech advances but doubts remain There's one. And another. This robot was hunting for rocks. A three-pronged claw descended from above and plucked a stone off the seabed. All the while, the autonomous machine's on-board camera scanned for creatures that might be resting on those rocks, to avoid snatching an innocent lifeform from its habitat. The test, carried out in a harbour in November, demonstrated one approach to mining for polymetallic nodules, potato-sized lumps containing metals scattered on the seabed in vast quantities, in much deeper parts of the ocean. Such metals are sought-after for use in renewable energy devices and batteries, for example. But deep-sea mining is a controversial means of obtaining them because of its potentially significant environmental impacts. We felt that a vehicle that used AI to look for life and avoid it could have much less of an environmental footprint, explains Oliver Gunasekara, co-founder and chief executive of Impossible Metals. The firm's system is 95% accurate at detecting lifeforms of 1mm or greater in size, he says. The robot's arms are similar to those that pick and place items in automated warehouses - they are optimised for speed. Plus, each claw kicks up a relatively small puff of sediment as it plucks its target off the seafloor. Impossible Metals aims to further reduce this disturbance. Such a system is not likely to convince everyone that deep-sea mining is a good idea, however. Mining would by its nature remove the very substrate of life in and on the deep seafloor, no matter the technology, says Jessica Battle, who leads the global no deep-seabed mining initiative at the WWF.",
                 "A dating app for video games tackles one of the industry's big issues Almost 19,000 titles went live on PC games store Steam in 2024 - about 360 a week. There are positive ways to look at this. Tools are more accessible and easier to use, barriers to entry are lower, self-publishing is easier and ideas are never short in supply. But for developers discoverability – getting your new release noticed - has never been more challenging in a landscape dominated by blockbusters and online games such as Fortnite and Call of Duty. It's also harder for potential customers to find them, with recommendations often dictated by search engine and store algorithms. But Ludocene - described as a dating app for video games hopes to change that. Games journalist Andy Robertson, the man behind the project, says the goal is to help people find those ones that got away. In any given year there's just so many games and some of those will rise to the top, they'll get lucky or they'll just be brilliant enough to punch through that noise, he tells BBC Newsbeat. But there's loads of really great games that just get sort of buried and lost in that shuffle. Ludocene itself looks a bit like a game - each title in its database is represented by a card with a trailer on one side and more information on the reverse. The dating app element comes from users swiping to keep - or discard - the suggestions, slowly building up a collection of recommended titles. Ludocene's entries are chosen by a selection of well-known gaming experts – journalists, streamers and other figures."]
links_tech = ["https://www.bbc.com/future/article/20250217-how-a-tiny-village-became-indias-youtube-capital",
              "https://www.bbc.com/news/articles/cx20d3r7p5do",
              "https://www.bbc.com/news/articles/c62x5k44rl0o",
              "https://www.bbc.com/news/articles/cg45zwe0v0ro",
              "https://www.bbc.com/news/articles/cr52rey0ng8o"]


articles_entertainment = ["Noel Clarke says life 'smashed' by allegations Actor and producer Noel Clarke has accused the publisher of the Guardian of having smashed my life for years as he gave evidence at his High Court libel trial. The star of Doctor Who and Kidulthood began his testimony on Monday. The 49-year-old is suing Guardian News and Media (GNM) for libel over a series of articles from 2021 and 2022 that included allegations of sexually inappropriate behaviour. Clarke denies the allegations, while GNM is defending its reporting as being both true and in the public interest. Asked about his alleged inappropriate sexual behaviour towards an actress who appeared in a film he was involved with, he became quite emotional and tearful, telling the Guardian's barrister Gavin Millar KC: They have smashed my life for years with this rubbish. You know what you're doing. You make me sick, I would not do this. Mr Millar asked Mr Clarke about an allegation that while working on Doctor Who, he made an inappropriate sexual suggestion to a female costume assistant. He replied: I don't remember that incident, I don't remember the woman in question. So I say it didn't happen. Mr Millar asked: It didn't happen or you don't remember it? Mr Clarke replied: It didn't happen. He was also asked about his interactions with a woman whom he worked with in the run-up to a particular project, where it was alleged that he physically pushed his body against her in a sexual way, and groped her.",
                        "From Doechii to Nicole Kidman: Why celebrities and Gen Z women love the jacket-and-tie look From the catwalk to the red carpet, the jacket-and-necktie combo is back for women – it's a statement of power and authority, say fans of the look. The jacket-and-necktie combination has been a staple of male dress for centuries, but it's always been most striking and subversive when worn by women. Now, it's making a comeback in women's fashion once more. On runways, designers are re-inventing the look. In her recent London Fashion Week collection, Tolu Coker showcased oversized leather blazers, paired with satin ties and tailored shirts. While at New York Fashion Week, Thom Browne teamed classic pattern ties with structured patchwork jackets.   The trend isn't limited to the catwalk. Viewers of the recent Grammy awards may have noticed Sabrina Carpenter's Dolce & Gabbana show-girl outfit, a Swarovski crystal-encrusted black blazer, with matching tie and skirt. Billie Eilish even offers a tie as part of her official merchandise. The red carpet is adopting the look, too. At the Berlin Film Festival, Vicky Krieps wore an oversized Bottega Veneta suit. Meanwhile, Doechii styled a Thom Browne exaggerated trouser with a cropped jacket and tie to accept her Grammy win. Nicole Kidman has gone for the jacket-and-tie look too in YSL, joining a growing list of celebrities including Rihanna, Bella Hadid and Iris Law, who have all embraced the label's style of tailoring. Yasmine Tangou is also in the fan club. The content creator and architect, who lives in Paris, likes the oversized, masculine style of a YSL suit, in contrast to other styles of jacket that accentuate curves.",
                        "Anna Foster replaces Husain on Radio 4's Today Anna Foster is to join BBC Radio 4's Today programme as one of the programme's main presenters, following the departure of Mishal Husain in December. Foster is a former Middle East correspondent and has also previously presented BBC Radio 5 Live Drive and Radio 1's Newsbeat, as well as the corporation's flagship TV news bulletins. She will present many editions of Today from Salford, as part of the BBC's efforts to better represent more areas of the UK beyond London. In a statement, Foster said: There are few more exciting opportunities for a journalist than presenting Today, and I'm thrilled to be joining the team. I've always loved making important, agenda-setting, engaging radio, and there's nowhere better to do that. It's such a beloved programme to so many people, and I can't wait to be a part of it. The BBC said that, in addition to presenting Today, Foster would continue to play a key role in helping to lead the BBC's coverage of foreign news and will still be seen on TV news bulletins for major stories. Bosses are reported to have been keen to ensure the presenter replacing Husain had a similar level of international reporting knowledge and experience. She brings important international reporting experience at a time when it is needed so urgently by listeners to Radio 4, said the station's controller Mohit Bakaya. Foster is one of several correspondents and presenters who have co-hosted episodes of Today in recent weeks following Husain's departure.",
                        "Dancing on Ice winner crowned after 'brilliant' performance The latest series of Dancing on Ice has drawn to a close after its winners were announced following a public vote. Coronation Street actor Sam Aston and his skating partner Molly Lanaghan were crowned winners of the ice-skating competition on Sunday. BBC Springwatch presenter Michaela Strachan was the runner-up alongside her ice skating partner Mark Hanretty after former English footballer Anton Ferdinand and his skating partner Annette Dytrt lost the public vote. Former Olympic figure skater Christopher Dean, who was involved in choreographing the showcase routines of all three finalists, told Aston that tonight, everything came together. Before the winners were revealed, the Coronation Street star said his time on the competition had been such a journey, it's a mad one. The 31-year-old was seen covering his face with his hands after he was announced as the winner. Aston and Lanaghan performed to The Pink Panther Theme, after which Dean told Aston he was really proud of you. Dean, alongside fellow former Olympic figure skater Jayne Torvill, choreographed the showcase performances of all three finalists. After Ferdinand lost the public vote, Aston and Strachan performed the pair's Olympic gold medal winning Bolero routine, after which Torvill described Aston's performance as brilliant. I mean tonight, everything came together, Dean said. Your skating skills are always on show, but you had to have timing and acting, Ferdinand and Dytrt were the first finalists to perform in the final episode and impressed the judges with their showcase routine to Let's Go Crazy by Prince And The Revolution.",
                        "Next James Bond should be British, Pierce Brosnan says Actor Pierce Brosnan has said it is a given that the next James Bond should be British. In an interview with the Sunday Telegraph, the former Bond also said he thought it was the right decision for the franchise's long-standing producers to hand creative control to Amazon. It takes great courage for them to let go, said Brosnan, who is Irish. I hope that [Amazon] handles the work and the character with dignity and imagination and respect. The choice of Daniel Craig's successor will be a decision for Amazon MGM Studios. James Norton, Aaron Taylor-Johnson and Theo James - who are all English - are among the bookmakers' favourites to fill Craig's shoes. Bond has been played by two non-British actors in the past - Australian George Lazenby as well as Irishman Brosnan - but 007 has never been an American, and among the names mooted for the role is California-born Austin Butler. Other non-British names that have been suggested include Irish stars Paul Mescal, Cillian Murphy and Aidan Turner, or Australian Jacob Elordi. American Clint Eastwood once reportedly turned down the role, with the Hollywood legend previously claiming he was approached to take over after Sean Connery in 1967 but said: It didn't feel right for me to be doing it. According to author Ian Fleming's novels, Bond had a Scottish father and Swiss mother. Under the deal announced last month, Bond producers Barbara Broccoli and Michael G Wilson will remain co-owners of the franchise but Amazon MGM Studios will gain creative control."]
links_entertainment = ["https://www.bbc.com/news/articles/c74kje402klo",
                       "https://www.bbc.com/culture/article/20250310-why-celebrities-and-gen-z-women-love-the-jacket-and-tie-look",
                       "https://www.bbc.com/news/articles/cpwdvex0587o",
                       "https://www.bbc.com/news/articles/cx2edpdwk7qo",
                       "https://www.bbc.com/news/articles/c4g0wldz314o"]

all_articles = [articles_business, articles_sport, articles_politics, articles_tech, articles_entertainment]
all_links = [links_business, links_sport, links_politics, links_tech, links_entertainment]

We present the collected articles as a pandas DataFrame.

In [37]:
articles_classes = ["business", "sport", "politics", "tech", "entertainment"]
# Build one frame per class and concatenate once; this avoids repeatedly
# concatenating onto an empty DataFrame inside the loop.
frames = [
    pd.DataFrame({"article": articles, "link": links, "label": label})
    for articles, links, label in zip(all_articles, all_links, articles_classes)
]
control_dataset = pd.concat(frames, ignore_index=True)

control_dataset

Unnamed: 0,article,link,label
0,From chatbots to intelligent toys: How AI is b...,https://www.bbc.com/news/articles/ckg8jqj393eo,business
1,Trump doubles planned tariffs on Canadian meta...,https://www.bbc.com/news/articles/cm2y811g1dgo,business
2,European stocks steady after US markets plunge...,https://www.bbc.com/news/articles/c4gdwgjkk1no,business
3,Starmer says benefit system unfair and indefen...,https://www.bbc.com/news/articles/c0kgpyz3mmpo,business
4,Luxury lounges: Credit card perks 'we are all ...,https://www.bbc.com/news/articles/c5yen4k1xnko,business
5,Golden Ace wins dramatic Champion Hurdle at 25...,https://www.bbc.com/sport/horse-racing/article...,sport
6,'Wales cannot let England win title in Cardiff...,https://www.bbc.com/sport/rugby-union/articles...,sport
7,'Boxing's not broken' - Hearn responds to Whit...,https://www.bbc.com/sport/articles/cjevzl9v58yo,sport
8,Scots call up 'two for the future' Miller & Wi...,https://www.bbc.com/sport/football/articles/cr...,sport
9,Bradley and Ballard out as Hale named in NI sq...,https://www.bbc.com/sport/football/articles/cx...,sport


Using the function defined above that presents predictions as a table, we tabulate the classification results of the logistic regression model and of the DistilBert model, and then merge these tables into one.

In [38]:
predictions_log = predictions_to_dataset(predict_log, all_articles, "log")
predictions_log

Unnamed: 0,article,label_log,probability_log
0,From chatbots to intelligent toys: How AI is b...,tech,0.256702
1,Trump doubles planned tariffs on Canadian meta...,business,0.383486
2,European stocks steady after US markets plunge...,business,0.797402
3,Starmer says benefit system unfair and indefen...,politics,0.720381
4,Luxury lounges: Credit card perks 'we are all ...,business,0.302136
5,Golden Ace wins dramatic Champion Hurdle at 25...,sport,0.506349
6,'Wales cannot let England win title in Cardiff...,sport,0.909692
7,'Boxing's not broken' - Hearn responds to Whit...,sport,0.304767
8,Scots call up 'two for the future' Miller & Wi...,sport,0.695775
9,Bradley and Ballard out as Hale named in NI sq...,sport,0.725688


In [39]:
predictions_bert = predictions_to_dataset(predict_distillbert, all_articles, "bert")
predictions_bert

Unnamed: 0,article,label_bert,probability_bert
0,From chatbots to intelligent toys: How AI is b...,business,0.565327
1,Trump doubles planned tariffs on Canadian meta...,business,0.832203
2,European stocks steady after US markets plunge...,business,0.981626
3,Starmer says benefit system unfair and indefen...,politics,0.984708
4,Luxury lounges: Credit card perks 'we are all ...,business,0.945092
5,Golden Ace wins dramatic Champion Hurdle at 25...,sport,0.88905
6,'Wales cannot let England win title in Cardiff...,sport,0.988739
7,'Boxing's not broken' - Hearn responds to Whit...,sport,0.780503
8,Scots call up 'two for the future' Miller & Wi...,sport,0.9892
9,Bradley and Ballard out as Hale named in NI sq...,sport,0.98881


In [40]:
control_dataset = control_dataset.merge(predictions_log, how="left", on="article").merge(predictions_bert, how="left", on="article")
control_dataset = control_dataset[["link", "label", "label_log", "probability_log", "label_bert", "probability_bert"]]
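
The merge above joins on the full article text, so a duplicated article in any of the tables would silently multiply rows. A minimal sketch of a guard against that, assuming pandas' `validate="one_to_one"` option is acceptable here; the helper name `merge_predictions` and the toy frames are hypothetical and are not part of the notebook:

```python
import pandas as pd


def merge_predictions(base, *prediction_frames, key="article"):
    """Left-join each prediction table onto `base`.

    `validate="one_to_one"` makes pandas raise MergeError if `key`
    is duplicated on either side of a merge.
    """
    merged = base
    for frame in prediction_frames:
        merged = merged.merge(frame, how="left", on=key, validate="one_to_one")
    return merged


# Toy placeholder data (hypothetical), only to show the call shape.
base = pd.DataFrame({"article": ["text A", "text B"],
                     "label": ["business", "sport"]})
log_preds = pd.DataFrame({"article": ["text A", "text B"],
                          "label_log": ["tech", "sport"]})
bert_preds = pd.DataFrame({"article": ["text A", "text B"],
                           "label_bert": ["business", "sport"]})

merged = merge_predictions(base, log_preds, bert_preds)
print(merged)
```

If the same article text appeared twice in a prediction table, the call would fail with `pandas.errors.MergeError` instead of producing extra rows.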

To make the results easier to inspect, we highlight the cells in the predicted-class columns that do not match the true class.

In [41]:
def highlight_cell(row):
    # Mark a predicted label that differs from the true label:
    # red for logistic regression, yellow for DistilBert.
    styles = {col: '' for col in row.index}
    if row['label'] != row['label_log']:
        styles['label_log'] = 'background-color: red'
    if row['label'] != row['label_bert']:
        styles['label_bert'] = 'background-color: yellow'
    # Styler.apply expects one CSS string per cell, in column order.
    return [styles[col] for col in row.index]

df_styled = control_dataset.style.apply(highlight_cell, axis=1, subset=["label", "label_log", "label_bert"])
df_styled

Unnamed: 0,link,label,label_log,probability_log,label_bert,probability_bert
0,https://www.bbc.com/news/articles/ckg8jqj393eo,business,tech,0.256702,business,0.565327
1,https://www.bbc.com/news/articles/cm2y811g1dgo,business,business,0.383486,business,0.832203
2,https://www.bbc.com/news/articles/c4gdwgjkk1no,business,business,0.797402,business,0.981626
3,https://www.bbc.com/news/articles/c0kgpyz3mmpo,business,politics,0.720381,politics,0.984708
4,https://www.bbc.com/news/articles/c5yen4k1xnko,business,business,0.302136,business,0.945092
5,https://www.bbc.com/sport/horse-racing/articles/cwygpwxpdeyo,sport,sport,0.506349,sport,0.88905
6,https://www.bbc.com/sport/rugby-union/articles/c2kgp212nkzo,sport,sport,0.909692,sport,0.988739
7,https://www.bbc.com/sport/articles/cjevzl9v58yo,sport,sport,0.304767,sport,0.780503
8,https://www.bbc.com/sport/football/articles/cr52zdm88pro,sport,sport,0.695775,sport,0.9892
9,https://www.bbc.com/sport/football/articles/cx2edk25wvxo,sport,sport,0.725688,sport,0.98881
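
The highlighted table can also be summarised numerically. The sketch below computes overall and per-class accuracy for both models; it is an illustration only and uses just the ten rows visible in the output above (business and sport), so it is not the result for the full control set. The helper `accuracy_by_model` and the `correct_*` column names are hypothetical.

```python
import pandas as pd

# Labels copied from the ten rows shown in the table above.
visible = pd.DataFrame({
    "label": ["business"] * 5 + ["sport"] * 5,
    "label_log": ["tech", "business", "business", "politics", "business"]
                 + ["sport"] * 5,
    "label_bert": ["business", "business", "business", "politics", "business"]
                  + ["sport"] * 5,
})


def accuracy_by_model(df, true_col="label", pred_cols=("label_log", "label_bert")):
    """Return the share of correct predictions for each prediction column."""
    return {col: float((df[col] == df[true_col]).mean()) for col in pred_cols}


overall = accuracy_by_model(visible)

# Per-class accuracy: mean of a boolean "prediction is correct" flag per true class.
per_class = (
    visible.assign(
        correct_log=visible["label_log"] == visible["label"],
        correct_bert=visible["label_bert"] == visible["label"],
    )
    .groupby("label")[["correct_log", "correct_bert"]]
    .mean()
)

print(overall)
print(per_class)
```

On these ten rows logistic regression is correct on 8 of 10 articles and DistilBert on 9 of 10; every miss falls in the business class. Applying the same function to the full `control_dataset` would give the figures for all five classes.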


## Conclusion

The final table shows that both models predict the news class well, but DistilBert gives much more confident answers: the probability of many of its predictions exceeds 80% or even 90%. Both models classify sport and entertainment news with 100% accuracy. For the business, politics and tech classes both models made errors, and logistic regression erred more often. It should be noted, however, that several of the business and politics articles could reasonably be assigned to more than one class. Moreover, on bbc.com these articles were published in different news sections, so their classification is ambiguous. With this in mind, both models can be said to classify the news well. Logistic regression required substantially fewer resources and ran orders of magnitude faster, whereas DistilBert was more accurate and gave more confident predictions.