## Task

- Analyze text classification with the LLM Qwen2-7B-Instruct
- Generate a classification_report with per-class metrics
- Measure the inference time of the LLM and logistic regression (logreg) predictions
- Compare the LLM and logreg metrics

Model card: https://huggingface.co/Qwen/Qwen2-7B-Instruct

In [None]:
# install required libraries
!pip install fuzzywuzzy
!pip install -U datasets fsspec
!pip install python-Levenshtein



In [None]:
from typing import Dict, Union, List
from datasets import load_dataset
from fuzzywuzzy import fuzz, process
from sklearn.metrics import classification_report
from tqdm.notebook import tqdm
from transformers import pipeline

In [None]:
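# Build a single-turn chat prompt for each text. The prompt is in Russian to match the dataset; in English it reads:
# "Read the text and determine which topic from the list is most represented in the following text. Text: {msg}
# As the answer, write only the name of the topic from the list, nothing else: {categories}."
def prepare_message_for_llm(text: Union[str, List[str]], categories: List[str]) -> Dict[str, List[List[Dict[str, str]]]]: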
    if isinstance(text, str):
        text = [text]

    messages = []

    for msg in text:
        messages.append([{
            "role": "user",
            "content": f"Прочтите текст и определите, какая тема из списка наиболее представлена в следующем тексте. Текст: {msg} В качестве ответа напишите только название темы из списка, больше ничего: {', '.join(categories)}."
        }])

    return {'message_for_llm': messages}

In [None]:
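# Example prompts for quick manual experiments. The three Russian texts say, in order:
# 1) when the running distance grows from a quarter mile to half a mile, speed matters less and endurance becomes essential;
# 2) look at which trips the agent advertises on the website or in the office window;
# 3) the Vatican has about 800 inhabitants and is the smallest independent country by size and by population.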
messages = [[{'role': 'user', 'content': 'Прочтите текст и определите, какая тема из списка наиболее представлена в следующем тексте. Текст: Если увеличить расстояние для бега с четверти до половины мили, скорость становится не так важна, тогда как выносливость превращается в абсолютную необходимость. В качестве ответа напишите только название темы из списка, больше ничего: entertainment, geography, health, politics, science/technology, sports, travel.'}],
            [{'role': 'user', 'content': 'Прочтите текст и определите, какая тема из списка наиболее представлена в следующем тексте. Текст: Посмотрите, какие поездки рекламирует агент, на сайте или на витрине офиса. В качестве ответа напишите только название темы из списка, больше ничего: entertainment, geography, health, politics, science/technology, sports, travel.'}],
            [{'role': 'user', 'content': 'Прочтите текст и определите, какая тема из списка наиболее представлена в следующем тексте. Текст: Население Ватикана составляет около 800 человек. Это самая маленькая независимая страна в мире, а также страна, имеющая наименьшее население. В качестве ответа напишите только название темы из списка, больше ничего: entertainment, geography, health, politics, science/technology, sports, travel.'}],]

In [None]:
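# load Qwen/Qwen2-7B-Instruct as a text-generation pipeline
# note: device_map='auto' below decides GPU placement; the `device` variable is not passed to the pipeline
device = 'cuda'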

llm = pipeline("text-generation", model="Qwen/Qwen2-7B-Instruct", return_full_text=False,
                max_new_tokens=256, device_map='auto', torch_dtype='auto')

Device set to use cuda:0


In [None]:
# load the Russian (rus_Cyrl) subset of SIB-200 with its train, validation, and test splits
dataset = load_dataset("Davlan/sib200", "rus_Cyrl")
dataset

DatasetDict({
    train: Dataset({
        features: ['index_id', 'category', 'text'],
        num_rows: 701
    })
    validation: Dataset({
        features: ['index_id', 'category', 'text'],
        num_rows: 99
    })
    test: Dataset({
        features: ['index_id', 'category', 'text'],
        num_rows: 204
    })
})

In [None]:
# encode the 'category' column as integer class labels in every split, then read the list of category names
dataset['train'] = dataset['train'].class_encode_column('category')
dataset['validation'] = dataset['validation'].class_encode_column('category')
dataset['test'] = dataset['test'].class_encode_column('category')

categories = dataset['validation'].features['category'].names
categories

['entertainment',
 'geography',
 'health',
 'politics',
 'science/technology',
 'sports',
 'travel']
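
The classification reports further down list classes by integer id (0 to 6) rather than by name. A minimal sketch of the lookup between ids and names, assuming the order printed above; `id2label` and `label2id` are illustrative names that do not appear in the notebook:

```python
# Category names in the order printed above (index = class id).
categories = ['entertainment', 'geography', 'health', 'politics',
              'science/technology', 'sports', 'travel']

# Hypothetical helper mappings for reading the numeric report rows.
id2label = dict(enumerate(categories))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label[4])         # name of class 4
print(label2id['travel'])  # id of the 'travel' class
```

With this mapping, row `0` in a report corresponds to `entertainment` and row `6` to `travel`.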

In [None]:
# add a 'message_for_llm' column to each split, holding the chat prompts built by prepare_message_for_llm from the texts
dataset['train'] = dataset['train'].add_column(name="message_for_llm", column=prepare_message_for_llm(dataset['train']['text'], categories)['message_for_llm'])
dataset['validation'] = dataset['validation'].add_column(name="message_for_llm", column=prepare_message_for_llm(dataset['validation']['text'], categories)['message_for_llm'])
dataset['test'] = dataset['test'].add_column(name="message_for_llm", column=prepare_message_for_llm(dataset['test']['text'], categories)['message_for_llm'])

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['index_id', 'category', 'text', 'message_for_llm'],
        num_rows: 701
    })
    validation: Dataset({
        features: ['index_id', 'category', 'text', 'message_for_llm'],
        num_rows: 99
    })
    test: Dataset({
        features: ['index_id', 'category', 'text', 'message_for_llm'],
        num_rows: 204
    })
})

In [None]:
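import time

# start a wall-clock timer; it keeps running until the final cell, so it covers both the validation and the test predictions
start_time = time.time()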

In [None]:
# run the LLM on every validation prompt, one pipeline call per prompt
y_pred_val = [llm(msg)[0]['generated_text'] for msg in dataset['validation']['message_for_llm']]

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
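
The warning above is triggered by calling the pipeline once per prompt. One possible way to reduce the number of calls is to pass the prompts in chunks. The helper below is a sketch, not the notebook's method: `predict_in_batches` and `batch_size` are hypothetical names, and it assumes that `generate` accepts a list of chat conversations and returns, for each conversation, a list of `{'generated_text': ...}` dicts (the same shape the notebook indexes with `[0]['generated_text']` for a single prompt). A stub stands in for the model so the snippet runs without a GPU.

```python
from typing import Callable, Dict, List

Chat = List[Dict[str, str]]  # one conversation: a list of role/content messages


def predict_in_batches(generate: Callable, messages: List[Chat],
                       batch_size: int = 8) -> List[str]:
    """Call `generate` on chunks of `batch_size` chats and collect the answers."""
    answers: List[str] = []
    for start in range(0, len(messages), batch_size):
        chunk = messages[start:start + batch_size]
        outputs = generate(chunk)  # assumed: one list of dicts per chat
        answers.extend(out[0]['generated_text'] for out in outputs)
    return answers


def fake_llm(chats: List[Chat]) -> List[List[Dict[str, str]]]:
    """Stub with the assumed output shape; replace with the real pipeline."""
    return [[{'generated_text': f"answer to: {chat[0]['content']}"}] for chat in chats]


demo = [[{'role': 'user', 'content': f'text {i}'}] for i in range(5)]
print(predict_in_batches(fake_llm, demo, batch_size=2))
```

With the real pipeline the call would look like `predict_in_batches(llm, dataset['validation']['message_for_llm'])`. Whether this is actually faster depends on GPU memory and padding, so it is worth timing both variants.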


In [None]:
# map each free-form LLM answer to the closest category name with fuzzy matching, then print a classification_report for the validation split
valid_pred = [categories.index(process.extractOne(pred, categories)[0]) for pred in y_pred_val]
print(classification_report(dataset['validation']['category'], valid_pred))

              precision    recall  f1-score   support

           0       0.86      0.67      0.75         9
           1       0.78      0.88      0.82         8
           2       0.90      0.82      0.86        11
           3       0.93      0.93      0.93        14
           4       0.89      0.96      0.92        25
           5       0.92      0.92      0.92        12
           6       0.80      0.80      0.80        20

    accuracy                           0.87        99
   macro avg       0.87      0.85      0.86        99
weighted avg       0.87      0.87      0.87        99
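
With its default settings, `process.extractOne` returns the best-scoring category however low the score is, so an off-list answer is still counted as some class. A sketch of a stricter alternative that uses the standard library's `difflib` in place of fuzzywuzzy (a swapped-in technique, not what the notebook runs); `match_category`, the `cutoff` value, and the `-1` "no match" code are assumptions made for illustration:

```python
import difflib
from typing import List

categories = ['entertainment', 'geography', 'health', 'politics',
              'science/technology', 'sports', 'travel']


def match_category(answer: str, categories: List[str], cutoff: float = 0.6) -> int:
    """Return the index of the closest category, or -1 if nothing is close enough."""
    cleaned = answer.strip().strip('.').lower()
    if cleaned in categories:  # exact match after light normalisation
        return categories.index(cleaned)
    close = difflib.get_close_matches(cleaned, categories, n=1, cutoff=cutoff)
    return categories.index(close[0]) if close else -1


for raw in ['Sports.', 'science / technology', 'Спорт']:
    print(repr(raw), '->', match_category(raw, categories))
```

Counting the `-1` results separately shows how often the model ignores the instruction to answer with a name from the list.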



In [None]:
# run the LLM on every test prompt, again one pipeline call per prompt
y_pred_test = [llm(msg)[0]['generated_text'] for msg in dataset['test']['message_for_llm']]

In [None]:
# map the test answers to category ids with fuzzy matching and print the report for the test split
test_pred = [categories.index(process.extractOne(pred, categories)[0]) for pred in y_pred_test]
print(classification_report(dataset['test']['category'], test_pred))

              precision    recall  f1-score   support

           0       0.82      0.47      0.60        19
           1       0.79      0.88      0.83        17
           2       0.77      0.77      0.77        22
           3       0.85      0.93      0.89        30
           4       0.82      0.98      0.89        51
           5       0.91      0.80      0.85        25
           6       0.89      0.80      0.84        40

    accuracy                           0.84       204
   macro avg       0.84      0.81      0.81       204
weighted avg       0.84      0.84      0.83       204
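
As a sanity check, the accuracy line can be roughly reconstructed from the recall and support columns, since recall times support gives the number of correctly classified texts in each class. The values below are copied from the test report above. Because they are rounded to two decimals, the recovered counts are an approximation:

```python
# Per-class recall and support copied from the test report (classes 0..6).
recall = [0.47, 0.88, 0.77, 0.93, 0.98, 0.80, 0.80]
support = [19, 17, 22, 30, 51, 25, 40]

# recall * support ~= number of correctly classified texts in each class
correct = [round(r * n) for r, n in zip(recall, support)]
accuracy = sum(correct) / sum(support)

print(correct)             # [9, 15, 17, 28, 50, 20, 32]
print(sum(correct))        # 171
print(round(accuracy, 2))  # 0.84
```

Class 0 (`entertainment`) stands out: only about 9 of its 19 test texts are recovered, which matches its recall of 0.47.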



In [None]:
# total wall-clock time in seconds since start_time; this covers the validation and test predictions together (303 texts), not a single prediction
end_time = time.time()
print(end_time - start_time)

909.0742793083191
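
The single timer above lumps the validation run, the test run, and the post-processing together: 909 s over 99 + 204 = 303 texts is roughly 3 s per text. To report each stage separately, one option is a small context manager around `time.perf_counter`. This is a sketch: `timed` and `timings` are hypothetical names, and `time.sleep` stands in for the model calls:

```python
import time
from contextlib import contextmanager

timings = {}  # stage label -> elapsed seconds


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


with timed('validation'):
    time.sleep(0.01)  # placeholder for the validation predictions
with timed('test'):
    time.sleep(0.02)  # placeholder for the test predictions

for label, seconds in timings.items():
    print(f'{label}: {seconds:.3f} s')
```

Dividing each stage's time by the number of texts in that split gives a per-prediction figure that can be compared with the logreg baseline.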
