BERT (Bidirectional Encoder Representations from Transformers): This model is based on the transformer encoder architecture. A key feature of BERT is its bidirectional training. This means that when processing a word, BERT considers the context from both the left and the right sides of that word. It's pre-trained on two specific tasks:

Masked Language Modeling (MLM): The model attempts to predict randomly masked words in a sentence, using context from both directions.

Next Sentence Prediction (NSP): The model learns relationships between sentences by predicting whether the second sentence of a pair actually follows the first in the original text or was sampled at random.
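
The MLM corruption step can be sketched in a few lines. This is an illustration only: it assumes the scheme from the original BERT paper (about 15% of positions are selected; of those, 80% become [MASK], 10% become a random token, and 10% stay unchanged). The function name `mask_tokens` and the toy vocabulary are hypothetical, and real pipelines work on token IDs rather than strings.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels are None where nothing must be predicted."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model has to recover the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # most selected tokens are masked
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # some become a random token
            else:
                corrupted.append(tok)                # the rest are left unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

tokens = "the cat sat on the mat because it was tired".split()
# mask_prob is raised to 0.3 here only so that a short sentence shows some masking
corrupted, labels = mask_tokens(tokens, vocab=sorted(set(tokens)), mask_prob=0.3, seed=1)
print(corrupted)
print(labels)
```

Positions whose label is None do not contribute to the training loss; only the selected positions do.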

XLM-RoBERTa (Cross-lingual Language Model RoBERTa): This is a multilingual version of RoBERTa, which itself is an optimized version of BERT. XLM-RoBERTa is pre-trained on a large volume of text covering 100 languages. Unlike multilingual models that rely on translation-based objectives, it does not require parallel corpora (texts translated into multiple languages) for pre-training, so it can be trained on plain monolingual text in each language and still transfer well across languages. Like RoBERTa, it uses only MLM (without NSP); the RoBERTa authors reported that dropping NSP matched or slightly improved downstream performance.

2. How these models process text using tokenization

Tokenization is the process of breaking down text into smaller units called tokens. For transformer models, this often goes beyond simple splitting by spaces. They utilize subword tokenization algorithms, such as WordPiece (for BERT) or SentencePiece (for XLM-RoBERTa). This allows models to handle both common words and rare words or typos by breaking them down into known subwords. It also helps manage vocabulary size efficiently.
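
To make the subword idea concrete, here is a minimal sketch of WordPiece-style splitting using greedy longest-match-first lookup. The vocabulary `toy_vocab` and the function `wordpiece_tokenize` are hypothetical and written for this illustration; the real BERT tokenizer also normalizes text and splits on punctuation first.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"token", "##ization", "##ize", "un", "##believ", "##able", "play", "##ing"}
for w in ["tokenization", "unbelievable", "playing", "xyz"]:
    print(w, "->", wordpiece_tokenize(w, toy_vocab))
```

With this toy vocabulary, a rare word such as "unbelievable" is still representable because it decomposes into known pieces, while "xyz" falls back to the unknown token.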

---



**Different pre-trained versions of these models and their characteristics:**

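BERT: You'll encounter versions like bert-base-uncased (a base model, case-insensitive), bert-large-cased (a larger model, case-sensitive), and so on. The main differences are model size (the base models have 12 layers, a hidden size of 768, and 12 attention heads; the large models have 24 layers, a hidden size of 1024, and 16 attention heads) and whether the text is lowercased before tokenization. Some variants, such as the multilingual checkpoints, also differ in the data they were trained on.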

XLM-RoBERTa: Commonly used versions are xlm-roberta-base or xlm-roberta-large. Their key characteristic is their multilinguality. xlm-roberta-base is already trained on data from 100 languages, making it ideal for tasks where input data might be in different languages or for transferring knowledge between languages.
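
To get a feel for what "model size" means for the BERT variants, the sketch below estimates the parameter count of the encoder from its layer count and hidden size. It assumes the standard BERT layout (feed-forward width of four times the hidden size, 512 positions, 2 segment types, a pooler layer) and the 30522-entry vocabulary of the uncased models; `bert_param_count` is a hypothetical helper, not a library function.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT encoder (embeddings + layers + pooler)."""
    intermediate = 4 * hidden                    # feed-forward width (assumed 4x hidden)
    layer_norm = 2 * hidden                      # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)   # query, key, value, output projections
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

print("base :", bert_param_count(layers=12, hidden=768))
print("large:", bert_param_count(layers=24, hidden=1024))
```

The number of attention heads does not appear in the formula: heads split the hidden dimension, so they change how attention is computed but not how many weights there are. The base configuration comes out at roughly 110 million parameters and the large one at well over 300 million.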

In [1]:
from transformers import BertTokenizer, XLMRobertaTokenizer

# Load the tokenizers
# BertTokenizer for English, case-insensitive version
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# XLMRobertaTokenizer for multilingual tasks
xlm_roberta_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')

print("BertTokenizer loaded:", bert_tokenizer.name_or_path)
print("XLMRobertaTokenizer loaded:", xlm_roberta_tokenizer.name_or_path)

BertTokenizer loaded: bert-base-uncased
XLMRobertaTokenizer loaded: xlm-roberta-base


**2. Tokenizer Outputs and Usage**

input_ids: These are the numerical identifiers for each token that the model understands. Every token in the tokenizer's vocabulary has a unique ID.

attention_mask: This is a binary mask (0 or 1) that tells the model which tokens to "pay attention" to (1) and which to ignore (0). It's primarily used to disregard padding tokens, which are added to make all input sequences the same length.

token_type_ids (or segment_ids): For tasks involving two sentences (like NSP in BERT), these are segment IDs (0 for the first sentence, 1 for the second). This helps the model differentiate between two distinct sentences within the input.

labels: These are your target labels for a classification task (they are not generated by the tokenizer but are part of your dataset).
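
The relationship between these fields is easier to see when they are assembled by hand. The sketch below is a simplified stand-in for what the tokenizer does, not the library's implementation: `build_inputs` is a hypothetical helper, and the IDs 101 ([CLS]), 102 ([SEP]), 0 ([PAD]) and the sentence IDs are taken from the BertTokenizer output printed in this notebook.

```python
def build_inputs(ids_a, ids_b=None, max_length=16, cls_id=101, sep_id=102, pad_id=0):
    """Assemble BERT-style input_ids, attention_mask and token_type_ids by hand."""
    input_ids = [cls_id] + ids_a + [sep_id]
    token_type_ids = [0] * len(input_ids)          # segment 0: [CLS], sentence A, [SEP]
    if ids_b is not None:
        input_ids += ids_b + [sep_id]
        token_type_ids += [1] * (len(ids_b) + 1)   # segment 1: sentence B and its [SEP]
    input_ids = input_ids[:max_length]             # naive truncation, for illustration only
    token_type_ids = token_type_ids[:max_length]
    attention_mask = [1] * len(input_ids)          # 1 for every real token
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 for every padding position
        "token_type_ids": token_type_ids + [0] * n_pad,
    }

hello_ids = [7592, 1010, 2129, 2024, 2017, 2651, 1029]  # "hello, how are you today?"
encoded = build_inputs(hello_ids, ids_b=[1045, 2572], max_length=14)
for name, values in encoded.items():
    print(name, values)
```

All three lists always have the same length (`max_length`), which is what allows several examples to be stacked into one batch tensor.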

In [2]:
# Example with BertTokenizer
sentence1 = "Hello, how are you today?"
sentence2 = "I am doing great, thank you!"

print("\n--- BertTokenizer ---")

# Tokenize a single sentence
encoded_single = bert_tokenizer.encode_plus(
    sentence1,
    add_special_tokens=True, # Add [CLS] and [SEP]
    max_length=64,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_token_type_ids=False # token_type_ids are not needed for a single sentence
)

print("Single sentence encoding:")
print("Input IDs:", encoded_single['input_ids'])
print("Attention Mask:", encoded_single['attention_mask'])
print("Decoded:", bert_tokenizer.decode(encoded_single['input_ids']))

# Tokenize a pair of sentences
encoded_pair = bert_tokenizer.encode_plus(
    sentence1,
    sentence2,
    add_special_tokens=True, # Add [CLS], [SEP], [SEP]
    max_length=64,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_token_type_ids=True # token_type_ids are needed for a sentence pair
)

print("\nTwo sentence encoding (BERT):")
print("Input IDs:", encoded_pair['input_ids'])
print("Attention Mask:", encoded_pair['attention_mask'])
print("Token Type IDs:", encoded_pair['token_type_ids'])
print("Decoded:", bert_tokenizer.decode(encoded_pair['input_ids']))
# Note the [CLS] and [SEP] tokens: BERT adds [CLS] at the start and [SEP] between the sentences and at the end.

# Example with XLMRobertaTokenizer
print("\n--- XLMRobertaTokenizer ---")

# XLM-RoBERTa uses <s> and </s> instead of [CLS] and [SEP]
encoded_xlm_single = xlm_roberta_tokenizer.encode_plus(
    sentence1,
    add_special_tokens=True, # Add <s> and </s>
    max_length=64,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_token_type_ids=False # XLM-R does not use token_type_ids to distinguish segments
)

print("Single sentence encoding (XLM-RoBERTa):")
print("Input IDs:", encoded_xlm_single['input_ids'])
print("Attention Mask:", encoded_xlm_single['attention_mask'])
print("Decoded:", xlm_roberta_tokenizer.decode(encoded_xlm_single['input_ids']))

# Given two sentences, XLM-RoBERTa still encodes them as a single sequence.
# When `encode_plus` receives two text arguments with add_special_tokens=True,
# it inserts </s></s> between them automatically.
encoded_xlm_pair = xlm_roberta_tokenizer.encode_plus(
    sentence1,
    sentence2,
    add_special_tokens=True,
    max_length=64,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_token_type_ids=False # XLM-R does not use token_type_ids to distinguish segments
)
print("\nTwo sentence encoding (XLM-RoBERTa):")
print("Input IDs:", encoded_xlm_pair['input_ids'])
print("Attention Mask:", encoded_xlm_pair['attention_mask'])
print("Decoded:", xlm_roberta_tokenizer.decode(encoded_xlm_pair['input_ids']))
# Note: when you pass two sentences, XLM-RoBERTa wraps them by default
# as <s> sentence1 </s></s> sentence2 </s>.


--- BertTokenizer ---
Single sentence encoding:
Input IDs: [101, 7592, 1010, 2129, 2024, 2017, 2651, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Decoded: [CLS] hello, how are you today? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

Two sentence encoding (BERT):
Input IDs: [101, 7592, 1010, 2129, 2024, 2017, 2651, 1029, 102, 1045, 2572, 2725, 2307, 1010, 4067, 2017,

# 3. Preparing Input Data for the Model

BERT: Uses [CLS] (short for "classification"; placed at the beginning of the sequence, and its final hidden state is commonly used as the sequence representation for classification tasks) and [SEP] (short for "separator"; used to separate sentences and mark the end of a sequence).

XLM-RoBERTa: Uses <s> (start of sequence) and </s> (end of sequence/separator).
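
The two special-token layouts can be compared side by side with a small sketch. `format_pair` is a hypothetical helper that works on already-split tokens; it mirrors the layouts shown in this notebook ([CLS] A [SEP] B [SEP] for BERT, <s> A </s></s> B </s> for XLM-RoBERTa) rather than calling the tokenizers.

```python
def format_pair(tokens_a, tokens_b=None, style="bert"):
    """Wrap one or two token lists in the special tokens of the given model family."""
    if style == "bert":
        cls, sep, pair_sep = "[CLS]", "[SEP]", ["[SEP]"]
    elif style == "xlm-r":
        cls, sep, pair_sep = "<s>", "</s>", ["</s>", "</s>"]  # doubled separator between segments
    else:
        raise ValueError(f"unknown style: {style}")
    sequence = [cls] + tokens_a
    if tokens_b is None:
        return sequence + [sep]
    return sequence + pair_sep + tokens_b + [sep]

a, b = ["hello", "there"], ["good", "morning"]
print(" ".join(format_pair(a, b, style="bert")))
print(" ".join(format_pair(a, b, style="xlm-r")))
```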

The attention_mask is a binary tensor. A value of 1 marks a real token that the model should attend to; a value of 0 marks a padding token that it should ignore. Masking matters because padding carries no information: without the mask, real tokens would attend to padding positions and their internal representations would be distorted. Note that the mask does not skip computation for padded positions, so padding only to the longest sequence in a batch is what actually saves compute.
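
How the mask takes effect inside attention can be sketched with a masked softmax over one row of attention scores. This is a simplified illustration in plain Python, assuming at least one unmasked position; `masked_softmax` is a hypothetical helper, and real implementations typically add a large negative value to the masked scores of a whole tensor at once.

```python
import math

def masked_softmax(scores, attention_mask):
    """Softmax over attention scores in which masked (0) positions get zero weight."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
    top = max(masked)                            # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 3.0]   # the last position is padding but has the highest raw score
weights = masked_softmax(scores, attention_mask=[1, 1, 1, 0])
print(weights)
```

Even though the padding position has the largest raw score, it receives no attention weight, and the remaining weights still sum to 1.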

tokenizer.special_tokens_map: Shows the tokenizer's special tokens.

tokenizer.vocab_size: The size of the tokenizer's vocabulary.

In [3]:
print("\n--- Special Tokens and Vocab Size ---")
print("BERT Special Tokens Map:", bert_tokenizer.special_tokens_map)
print("XLM-RoBERTa Special Tokens Map:", xlm_roberta_tokenizer.special_tokens_map)

print("\nBERT Vocab Size:", bert_tokenizer.vocab_size)
print("XLM-RoBERTa Vocab Size:", xlm_roberta_tokenizer.vocab_size)

# A more detailed example of preparing data for the model (this will be part of the training loop)
import torch

def prepare_data_for_model(texts, tokenizer, max_len=128):
    input_ids = []
    attention_masks = []
    token_type_ids = [] # BERT only; XLM-R does not use segment IDs

    for text in texts:
        encoded_dict = tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_token_type_ids=isinstance(tokenizer, BertTokenizer), # Only for BertTokenizer
            return_tensors='pt' # Return PyTorch tensors
        )
        input_ids.append(encoded_dict['input_ids'])
        attention_masks.append(encoded_dict['attention_mask'])
        if 'token_type_ids' in encoded_dict:
            token_type_ids.append(encoded_dict['token_type_ids'])

    # Concatenate the per-text tensors of shape (1, max_len) into one (N, max_len) tensor
    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    if token_type_ids: # Only if token_type_ids were returned
        token_type_ids = torch.cat(token_type_ids, dim=0)
        return input_ids, attention_masks, token_type_ids
    else:
        return input_ids, attention_masks, None # For XLM-R
# Usage example:
sample_texts = ["This is a sample sentence.", "Another example for demonstration."]
bert_inputs, bert_masks, bert_types = prepare_data_for_model(sample_texts, bert_tokenizer)
print("\nPrepared BERT inputs (first sentence):")
print("Input IDs shape:", bert_inputs.shape)
print("Attention Mask shape:", bert_masks.shape)
print("Token Type IDs shape:", bert_types.shape if bert_types is not None else "N/A")

xlm_inputs, xlm_masks, _ = prepare_data_for_model(sample_texts, xlm_roberta_tokenizer)
print("\nPrepared XLM-RoBERTa inputs (first sentence):")
print("Input IDs shape:", xlm_inputs.shape)
print("Attention Mask shape:", xlm_masks.shape)


--- Special Tokens and Vocab Size ---
BERT Special Tokens Map: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}
XLM-RoBERTa Special Tokens Map: {'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}

BERT Vocab Size: 30522
XLM-RoBERTa Vocab Size: 250002

Prepared BERT inputs (first sentence):
Input IDs shape: torch.Size([2, 128])
Attention Mask shape: torch.Size([2, 128])
Token Type IDs shape: torch.Size([2, 128])

Prepared XLM-RoBERTa inputs (first sentence):
Input IDs shape: torch.Size([2, 128])
Attention Mask shape: torch.Size([2, 128])


**4. Loading and Exploring the Dataset**

In [4]:
import pandas as pd
import os

# Path to dataset.csv (assumed to be in the same folder; otherwise give the full path)
# If the file has not been downloaded yet, you will need to download it:
# https://raw.githubusercontent.com/Anshul-Aggarwal/transformers-finetuning-classification/main/data/dataset.csv
# For this example, a dummy CSV file is created if the file is missing.
dataset_path = 'dataset.csv'

if not os.path.exists(dataset_path):
    print(f"File '{dataset_path}' not found. Creating a dummy one for demonstration.")
    # Create a dummy DataFrame
    data = {
        'text': [
            "I love this product, it's amazing!",
            "This is the worst experience ever.",
            "It's okay, nothing special.",
            "Absolutely fantastic, highly recommend!",
            "Terrible quality, waste of money.",
            "Neutral feelings about this one.",
            "The movie was great and very engaging.",
            "I found the book to be quite boring.",
            "Service was average, not good not bad."
        ],
        'sentiment': [
            'positive',
            'negative',
            'neutral',
            'positive',
            'negative',
            'neutral',
            'positive',
            'negative',
            'neutral'
        ]
    }
    df = pd.DataFrame(data)
    df.to_csv(dataset_path, index=False)
    print("Dummy dataset.csv created.")
else:
    print(f"File '{dataset_path}' found.")

# Load the dataset
try:
    df = pd.read_csv(dataset_path)
    print("\nFirst 5 rows of the dataset:")
    print(df.head())

    print("\nDataset shape (rows, columns):", df.shape)

    # Identify the required columns
    # We assume the text column is named 'text' and the label column 'sentiment'
    text_column = 'text'
    label_column = 'sentiment'
    print(f"\nText column: '{text_column}'")
    print(f"Label column: '{label_column}'")

    # Check the label distribution (important for StratifiedKFold)
    print("\nLabel distribution:")
    print(df[label_column].value_counts())

    # Convert string labels to numeric IDs
    unique_labels = df[label_column].unique()
    label_to_id = {label: i for i, label in enumerate(unique_labels)}
    id_to_label = {i: label for i, label in enumerate(unique_labels)}
    df['label_id'] = df[label_column].map(label_to_id)
    print("\nLabels converted to numeric IDs:")
    print(df[[label_column, 'label_id']].head())

except FileNotFoundError:
    print(f"Error: file '{dataset_path}' not found. Please make sure you have downloaded it.")
    print("You can download it from: https://raw.githubusercontent.com/Anshul-Aggarwal/transformers-finetuning-classification/main/data/dataset.csv")

File 'dataset.csv' not found. Creating a dummy one for demonstration.
Dummy dataset.csv created.

First 5 rows of the dataset:
                                      text sentiment
0       I love this product, it's amazing!  positive
1       This is the worst experience ever.  negative
2              It's okay, nothing special.   neutral
3  Absolutely fantastic, highly recommend!  positive
4        Terrible quality, waste of money.  negative

Dataset shape (rows, columns): (9, 2)

Text column: 'text'
Label column: 'sentiment'

Label distribution:
sentiment
positive    3
negative    3
neutral     3
Name: count, dtype: int64

Labels converted to numeric IDs:
  sentiment  label_id
0  positive         0
1  negative         1
2   neutral         2
3  positive         0
4  negative         1


**5. Creating Cross-Validation Folds** (simple data)

StratifiedKFold ensures that each fold (or split) of your data keeps approximately the same percentage of samples for each target class as the complete dataset. This is especially important for imbalanced datasets where one class might have significantly fewer examples than others. By preserving the class distribution in each fold, StratifiedKFold makes every fold representative of the overall dataset, leading to more reliable model evaluation.
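
The idea behind stratification can be sketched without any library: group the sample indices by class, then deal each class's indices out to the folds in turn. This is a simplified illustration, not scikit-learn's algorithm (it does no shuffling, for instance); `stratified_folds` is a hypothetical helper, and the labels mirror the nine-row dummy dataset used in this notebook.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample index to a validation fold, class by class, round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)   # spread each class evenly over the folds
    return folds

labels = ["positive", "negative", "neutral"] * 3
folds = stratified_folds(labels, n_splits=3)
for k, val_idx in enumerate(folds):
    class_counts = dict(Counter(labels[i] for i in val_idx))
    print(f"fold {k}: validation indices {sorted(val_idx)} {class_counts}")
```

Each validation fold ends up with one example of every class, matching the balanced 3/3/3 distribution of the full dataset; the training set for a fold is simply every index not in that fold.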


In [6]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
import os
from transformers import BertTokenizer, XLMRobertaTokenizer
import torch

print("--- Starting Task 5 (using dummy data only) ---")

# --- 1. Understanding BERT and XLM-RoBERTa (only loading the tokenizers for later use) ---
# Load the tokenizers
# Wrapped in try/except because downloading them requires internet access
try:
    bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    xlm_roberta_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-base')
    print("\nTokenizers loaded successfully.")
except Exception as e:
    print("\nFailed to load the tokenizers. Make sure you have an internet connection and the 'transformers' library installed.")
    print(f"Error: {e}")
    # If the tokenizers failed to load, the code below that depends on them will be skipped.
    # For demonstration purposes, we can continue with the steps that do not need tokenizers.
    bert_tokenizer = None
    xlm_roberta_tokenizer = None


# --- 4. Generating and exploring the dataset (generated data only) ---
print("\n--- Step 4: Generating and exploring the dummy dataset ---")

# The generated dataset
data = {
    'text': [
        "I love this product, it's amazing!",
        "This is the worst experience ever.",
        "It's okay, nothing special.",
        "Absolutely fantastic, highly recommend!",
        "Terrible quality, waste of money.",
        "Neutral feelings about this one.",
        "The movie was great and very engaging.",
        "I found the book to be quite boring.",
        "Service was average, not good not bad."
    ],
    'sentiment': [
        'positive',
        'negative',
        'neutral',
        'positive',
        'negative',
        'neutral',
        'positive',
        'negative',
        'neutral'
    ]
}
# Build the DataFrame directly from the data above
df = pd.DataFrame(data)

print("\nFirst 5 rows of the generated dataset:")
print(df.head())

print("\nShape of the generated dataset (rows, columns):", df.shape)

text_column = 'text'
label_column = 'sentiment'
print(f"\nText column: '{text_column}'")
print(f"Label column: '{label_column}'")

print("\nLabel distribution in the generated dataset:")
print(df[label_column].value_counts())

# Convert string labels to numeric IDs
unique_labels = df[label_column].unique()
label_to_id = {label: i for i, label in enumerate(unique_labels)}
id_to_label = {i: label for i, label in enumerate(unique_labels)} # For the reverse mapping
df['label_id'] = df[label_column].map(label_to_id)
print("\nLabels converted to numeric IDs:")
print(df[[label_column, 'label_id']].head())


# --- 5. Creating cross-validation folds ---
print("\n--- Step 5: Creating folds for cross-validation ---")

X = df['text'].values # Input text
y = df['label_id'].values # Numeric labels

# Find the number of samples in the smallest class
min_samples_in_class = int(df['label_id'].value_counts().min())

# Choose n_splits so that every class can appear in every fold:
# it should not exceed the size of the smallest class (3 here),
# and StratifiedKFold requires at least 2 splits.
n_splits = max(2, min(3, min_samples_in_class)) # Capped at 3; more is not needed for this data

print(f"Smallest class size: {min_samples_in_class}")
print(f"Given that, using n_splits = {n_splits} for StratifiedKFold.")


skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

train_indices_folds = []
val_indices_folds = []

print(f"\nCreating {n_splits}-fold stratified splits...")
for fold, (train_index, val_index) in enumerate(skf.split(X, y)):
    train_indices_folds.append(train_index)
    val_indices_folds.append(val_index)

    print(f"\nFold {fold + 1}/{n_splits}:")
    print(f"  Number of training samples: {len(train_index)}")
    print(f"  Number of validation samples: {len(val_index)}")

    train_labels = y[train_index]
    val_labels = y[val_index]

    # normalize=True returns fractions that sum to 1 for each split
    train_label_counts = pd.Series(train_labels).value_counts(normalize=True).sort_index()
    val_label_counts = pd.Series(val_labels).value_counts(normalize=True).sort_index()
    overall_label_counts = pd.Series(y).value_counts(normalize=True).sort_index()

    print("  Label distribution in the training split (fraction):")
    print(train_label_counts)
    print("  Label distribution in the validation split (fraction):")
    print(val_label_counts)
    print("  Overall label distribution (fraction):")
    print(overall_label_counts)

print(f"\nSuccessfully created {n_splits} stratified folds.")
print("The indices for each fold are now available in `train_indices_folds` and `val_indices_folds`.")


# --- 2 and 3: Demonstrating data preparation for the model (for one fold) ---
# This function must be defined for the rest of the code to work.
def prepare_data_for_model(texts, tokenizer, max_len=128):
    if tokenizer is None: # The tokenizer failed to load
        print("Error: tokenizer is not initialized. Cannot prepare the data.")
        return None, None, None

    input_ids = []
    attention_masks = []
    token_type_ids = []

    for text in texts:
        encoded_dict = tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_token_type_ids=True if isinstance(tokenizer, BertTokenizer) else False,
            return_tensors='pt'
        )
        input_ids.append(encoded_dict['input_ids'])
        attention_masks.append(encoded_dict['attention_mask'])
        if 'token_type_ids' in encoded_dict:
            token_type_ids.append(encoded_dict['token_type_ids'])

    input_ids = torch.cat(input_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    if token_type_ids:
        token_type_ids = torch.cat(token_type_ids, dim=0)
        return input_ids, attention_masks, token_type_ids
    else:
        return input_ids, attention_masks, None


print("\n--- Demonstrating data preparation for the first training fold (Steps 2 and 3) ---")

# Make sure we have at least one fold
if len(train_indices_folds) > 0:
    # The fold indices are positional, so select rows with iloc rather than loc
    first_fold_train_texts = df[text_column].iloc[train_indices_folds[0]].tolist()
    first_fold_train_labels = df['label_id'].iloc[train_indices_folds[0]].tolist()

    if bert_tokenizer is not None:
        print("\nPreparing data for BERT (first fold):")
        bert_train_input_ids, bert_train_attention_masks, bert_train_token_type_ids = \
            prepare_data_for_model(first_fold_train_texts, bert_tokenizer)

        if bert_train_input_ids is not None: # Data preparation succeeded
            print("Input IDs shape:", bert_train_input_ids.shape)
            print("Attention Masks shape:", bert_train_attention_masks.shape)
            if bert_train_token_type_ids is not None:
                print("Token Type IDs shape:", bert_train_token_type_ids.shape)
            print("Labels shape:", torch.tensor(first_fold_train_labels).shape)
            print("\nFirst Input IDs / Attention Mask pair from the training split:")
            print("Input IDs:", bert_train_input_ids[0])
            print("Attention Mask:", bert_train_attention_masks[0])
            if bert_train_token_type_ids is not None:
                print("Token Type IDs:", bert_train_token_type_ids[0])
            print("Label:", first_fold_train_labels[0])
            print("\nDecoded first sentence from the first fold (BERT):")
            print(bert_tokenizer.decode(bert_train_input_ids[0], skip_special_tokens=False))
    else:
        print("The BERT tokenizer was not loaded; skipping the BERT data preparation demo.")

    if xlm_roberta_tokenizer is not None:
        print("\nPreparing data for XLM-RoBERTa (first fold):")
        xlm_train_input_ids, xlm_train_attention_masks, _ = \
            prepare_data_for_model(first_fold_train_texts, xlm_roberta_tokenizer)

        if xlm_train_input_ids is not None: # Data preparation succeeded
            print("Input IDs shape:", xlm_train_input_ids.shape)
            print("Attention Masks shape:", xlm_train_attention_masks.shape)
            print("Labels shape:", torch.tensor(first_fold_train_labels).shape)
            print("\nFirst Input IDs / Attention Mask pair from the training split:")
            print("Input IDs:", xlm_train_input_ids[0])
            print("Attention Mask:", xlm_train_attention_masks[0])
            print("Label:", first_fold_train_labels[0])
            print("\nDecoded first sentence from the first fold (XLM-RoBERTa):")
            print(xlm_roberta_tokenizer.decode(xlm_train_input_ids[0], skip_special_tokens=False))
    else:
        print("The XLM-RoBERTa tokenizer was not loaded; skipping the XLM-RoBERTa data preparation demo.")

else:
    print("\nNo folds were created; skipping the data preparation demo.")

print("\n--- Task 5 complete. ---")

--- Starting Task 5 (using dummy data only) ---

Tokenizers loaded successfully.

--- Step 4: Generating and exploring the dummy dataset ---

First 5 rows of the generated dataset:
                                      text sentiment
0       I love this product, it's amazing!  positive
1       This is the worst experience ever.  negative
2              It's okay, nothing special.   neutral
3  Absolutely fantastic, highly recommend!  positive
4        Terrible quality, waste of money.  negative

Shape of the generated dataset (rows, columns): (9, 2)

Text column: 'text'
Label column: 'sentiment'

Label distribution in the generated dataset:
sentiment
positive    3
negative    3
neutral     3
Name: count, dtype: int64

Labels converted to numeric IDs:
  sentiment  label_id
0  positive         0
1  negative         1
2   neutral         2
3  positive         0
4  negative         1

--- Step 5: Creating folds for 