## Installing libraries

In [None]:
# Upgrade datasets to avoid errors when calling load_dataset
!pip install --upgrade datasets

# Install FireDucks, a DataFrame library designed as a faster drop-in replacement for pandas
!pip install fireducks

Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 491.5/491.5 kB 11.1 MB/s eta 0:00:00
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.6/193.6 kB 16.4 MB/s eta 0:00:00
Installing collected packages: fsspec, datasets
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.2
    Uninstalling fsspec-2025.3.2:
      Successfully uninstalled fsspec-2025.3.2
  Attempting uninstall: datasets
    Found existing installation: datasets 2.14.4
    Uninstalling datasets-2.14.4:
      Successfully uninstalled datasets-2.14.4
ERROR: pip's dependency r

## Import libraries and modules

In [None]:
# Deep learning
import torch

# To generate random numbers
import random

# Linear algebra
import numpy as np

# Pandas
import fireducks.pandas as pd

# Dataset loading
from datasets import load_dataset, DatasetDict

# Transformers
from transformers import TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments, AutoTokenizer, AutoModelForCausalLM

# Disable interfering warnings
import warnings
warnings.filterwarnings("ignore")

## Loading a model

Let's use Sberbank's medium-sized Russian-language GPT model, `ai-forever/rugpt3medium_based_on_gpt2` (previously published under the `sberbank-ai` organization), which is small enough to fit on the GPU. We also tell PyTorch to run computations on a CUDA-capable GPU:

In [None]:
DEVICE = torch.device("cuda:0")

# Load and initialize the model and tokenizer
model_name = "ai-forever/rugpt3medium_based_on_gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(DEVICE)

tokenizer_config.json:   0%|          | 0.00/1.25k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.61M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.27M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/574 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/761 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.73G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.73G [00:00<?, ?B/s]

To check whether the loaded tokenizer is a fast (Rust-backed) implementation, we inspect its `is_fast` attribute:

In [None]:
tokenizer.is_fast

True

## Loading and preparing a dataset

As a dataset we will use **MLSUM**, a large-scale dataset for multilingual summarization. The data is extracted from online newspapers and contains over 1.5 million article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Download the Russian (`ru`) subset:

In [None]:
dataset = load_dataset("mlsum", "ru", trust_remote_code=True)

README.md:   0%|          | 0.00/11.0k [00:00<?, ?B/s]

mlsum.py:   0%|          | 0.00/3.72k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/714M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/25.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/26.8M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25556 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/750 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/757 [00:00<?, ? examples/s]

Data structure:

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 25556
    })
    validation: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 750
    })
    test: Dataset({
        features: ['text', 'summary', 'topic', 'url', 'title', 'date'],
        num_rows: 757
    })
})

Let's look at an example of the data:

In [None]:
print("Text: ", dataset["train"][0]["text"])
print("Summary: ", dataset["train"][0]["summary"])
print("Topic: ", dataset["train"][0]["topic"])
print("URL: ", dataset["train"][0]["url"])
print("Title: ", dataset["train"][0]["title"])
print("Date: ", dataset["train"][0]["date"])

Text:  Сладострастник в течение трех лет преследовал подростка в надежде совратить его. Как сообщили “МК” в следственном отделе по Хорошевскому району СУ СК при Прокуратуре РФ по Москве, 26 августа 2006 года 13-летний Павел вместе с другом отдыхал на берегу Москвы–реки рядом с Крылатским мостом. Там к ребятам подошел мужчина. Новый знакомый представился Евгением и предложил вместе пообедать в ресторане быстрого питания, а потом искупаться. Именно там, на берегу, педагог начал приставать к мальчику. Школьник убежал, но педофил успел снять голого подростка на мобильный телефон. После этого жизнь мальчика превратилась в сущий ад. Евгений узнал, где живет Павел, и стал шантажировать его. Этот кошмар продолжался три года. Преподаватель угрожал показать фотографию друзьям и знакомым Павла. Негодяй исписал непотребными надписями стены подъезда, где проживали друзья школьника. В один из дней он приехал в Сергиев Посад, к бабушке мальчика, и там накинулся на школьника с ножом. Наконец, отчаявши

Let's check which values the `topic` column takes:

In [None]:
np.unique(dataset["train"]["topic"])

array(['auto', 'culture', 'daily', 'economics', 'editions', 'incident',
       'moscow', 'mosobl', 'nasha-moskva', 'new-year-2016', 'politics',
       'science', 'social', 'specprojects', 'sport', 'zloba-dnya'],
      dtype='<U13')

Since our task is to teach GPT to write headlines for Russian-language news texts, we only need the `text` and `title` columns and can remove the rest: `summary`, `topic`, `url`, and `date`.

In [None]:
# Columns to remove
columns_to_remove = ['summary', 'topic', 'url', 'date']

# Remove columns in all parts of the dataset (train/val/test)
dataset = dataset.remove_columns(columns_to_remove)

# Check remaining columns
print(dataset["train"].column_names)

['text', 'title']


Let's check for null values:

In [None]:
for split in dataset:
    print(f"\nSplit: {split}")
    for column in dataset[split].column_names:
        null_count = sum(1 for item in dataset[split][column] if item is None)
        print(f"Column '{column}': {null_count} null values")


Split: train
Column 'text': 0 null values
Column 'title': 0 null values

Split: validation
Column 'text': 0 null values
Column 'title': 0 null values

Split: test
Column 'text': 0 null values
Column 'title': 0 null values


Reduce the training split to its first 10,000 rows:

In [None]:
small_train = dataset["train"].select(range(10000))

# Create a new DatasetDict with a smaller train
df = DatasetDict({
 "train": small_train,
 "validation": dataset["validation"], # validation unchanged
 "test": dataset["test"] # test unchanged
})

In [None]:
df

DatasetDict({
    train: Dataset({
        features: ['text', 'title'],
        num_rows: 10000
    })
    validation: Dataset({
        features: ['text', 'title'],
        num_rows: 750
    })
    test: Dataset({
        features: ['text', 'title'],
        num_rows: 757
    })
})

Great! The reduced dataset has the expected split sizes and, as checked earlier, contains no null values.

## Preparing training data

Data preparation

In [None]:
def prepare_examples(examples):
    texts = examples["text"]
    titles = examples["title"]
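    # Each example: article text, then "Title: <title>", closed by the end-of-text token.
    # Note: generate_title() further down prompts with an extra "Text: " prefix; keep both formats in sync.
    inputs = [f"{text}\nTitle: {title}<|endoftext|>" for text, title in zip(texts, titles)]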
    return {"formatted": inputs}
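
To make the target format concrete, here is a minimal pure-Python sketch of the string each example becomes. It assumes the intended separator is a newline followed by `Title: ` and that `<|endoftext|>` closes every example; the helper names `format_example` and `split_example` and the sample strings are hypothetical and not part of the notebook.

```python
EOS = "<|endoftext|>"  # end-of-text marker appended to every training example


def format_example(text: str, title: str) -> str:
    """Hypothetical helper: join an article and its title into one training string."""
    return f"{text}\nTitle: {title}{EOS}"


def split_example(formatted: str) -> tuple[str, str]:
    """Recover (text, title) from a formatted string, e.g. for a sanity check."""
    body = formatted.removesuffix(EOS)
    text, _, title = body.rpartition("\nTitle: ")
    return text, title


sample = format_example("Some article text.", "Some headline")
print(repr(sample))
print(split_example(sample))
```

Round-tripping a few examples this way is a cheap check that the separator is spelled exactly as the generation prompt expects.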

In [None]:
tokenized_datasets = {
    "train": df["train"].map(prepare_examples, batched=True, remove_columns=["text", "title"]),
    "validation": df["validation"].map(prepare_examples, batched=True, remove_columns=["text", "title"])
}

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/750 [00:00<?, ? examples/s]

Tokenization

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["formatted"], truncation=True, max_length=512)

In [None]:
tokenized_datasets["train"] = tokenized_datasets["train"].map(tokenize_function, batched=True, remove_columns=["formatted"])
tokenized_datasets["validation"] = tokenized_datasets["validation"].map(tokenize_function, batched=True, remove_columns=["formatted"])

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/750 [00:00<?, ? examples/s]
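
One thing to watch with `truncation=True, max_length=512`: by default the tokenizer truncates from the end of the sequence, and in this format the title sits at the end, so articles longer than the limit can lose their title entirely. A possible workaround is to shorten the article before formatting so that the title always fits. The sketch below illustrates the idea with whitespace-separated words standing in for tokens; `fit_to_budget` and the budget value are hypothetical, and a real version would count tokenizer tokens instead.

```python
def fit_to_budget(text: str, title: str, max_units: int) -> str:
    """Trim the article (not the title) so the whole example fits in max_units.

    Units are whitespace-separated words here, a rough stand-in for tokens.
    """
    suffix_units = len(f"Title: {title}".split())
    room_for_text = max(max_units - suffix_units, 0)
    trimmed = " ".join(text.split()[:room_for_text])
    return f"{trimmed}\nTitle: {title}<|endoftext|>"


long_article = " ".join(f"w{i}" for i in range(1000))
example = fit_to_budget(long_article, "Short headline", max_units=512)
print(len(example.split()))  # number of word-like units after trimming
print(example.endswith("Title: Short headline<|endoftext|>"))
```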

DataCollator for language modeling

In [None]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # We don't use masked language modeling
)
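
With `mlm=False`, the collator pads each batch to a common length and builds `labels` as a copy of `input_ids` in which padding is replaced by -100, the value the loss ignores; the model itself shifts the labels for next-token prediction. A simplified pure-Python sketch of that behavior follows. The function `collate_causal` and the token ids are made up for illustration, and the real collator works on tensors and masks every position equal to the pad id.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def collate_causal(sequences: list[list[int]], pad_id: int) -> dict[str, list[list[int]]]:
    """Right-pad sequences to the longest one and build causal-LM labels."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        # Labels mirror the inputs, except that padding is ignored by the loss
        labels.append(seq + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


batch = collate_causal([[11, 12, 13], [21, 22]], pad_id=0)
print(batch["input_ids"])
print(batch["labels"])
```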

Clearing the memory:

In [None]:
del dataset, small_train

## Training

In [None]:
training_args = TrainingArguments(
    output_dir="./finetuned",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
    fp16=DEVICE.type == "cuda",  # enable mixed precision only on a CUDA device
)
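
As a quick sanity check on these settings, the number of optimizer steps can be estimated ahead of time. The sketch assumes a single GPU and no gradient accumulation (neither is configured above) and the 10,000-example training split; `total_steps` is a hypothetical helper, and its handling of gradient accumulation is only approximate.

```python
import math


def total_steps(n_examples: int, batch_size: int, epochs: int, grad_accum: int = 1) -> int:
    """Estimate the number of optimizer steps for a full training run."""
    steps_per_epoch = math.ceil(n_examples / (batch_size * grad_accum))
    return steps_per_epoch * epochs


steps = total_steps(n_examples=10_000, batch_size=4, epochs=3)
warmup_share = 500 / steps  # fraction of training spent in warmup

print(steps)                  # 7500
print(f"{warmup_share:.1%}")  # 6.7%
```

The estimate agrees with the `global_step=7500` reported by the training run, and shows that `warmup_steps=500` covers roughly the first 7% of training.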

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
)

Starting the training:

In [None]:
trainer.train()





`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 3.0206        | 3.031713        |
| 2     | 2.7772        | 3.038888        |
| 3     | 2.5939        | 3.060896        |


There were missing keys in the checkpoint model loaded: ['lm_head.weight'].


TrainOutput(global_step=7500, training_loss=2.8386087890625, metrics={'train_runtime': 12907.5867, 'train_samples_per_second': 2.324, 'train_steps_per_second': 0.581, 'total_flos': 2.7811328093208576e+16, 'train_loss': 2.8386087890625, 'epoch': 3.0})
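
The logged losses are mean cross-entropy values, so they can be converted to perplexity with `exp(loss)`. The short sketch below does that for the validation losses and picks the epoch with the lowest one, which is the checkpoint `load_best_model_at_end=True` restores when no other metric is configured. The numbers are copied from the log above; the variable names are illustrative.

```python
import math

# Validation loss per epoch, copied from the training log
val_loss = {1: 3.031713, 2: 3.038888, 3: 3.060896}

perplexity = {epoch: math.exp(loss) for epoch, loss in val_loss.items()}
best_epoch = min(val_loss, key=val_loss.get)  # lowest validation loss wins

for epoch, ppl in perplexity.items():
    print(f"epoch {epoch}: validation perplexity ~ {ppl:.2f}")
print("best epoch:", best_epoch)
```

Validation loss rises slightly after the first epoch while training loss keeps falling, which suggests the model begins to overfit; under these settings a single epoch may already be enough.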

## Saving the model

In [None]:
# Saving the model and tokenizer
model.save_pretrained("./news_title_generator")
tokenizer.save_pretrained("./news_title_generator")

## Loading the saved model

In [None]:
DEVICE = torch.device("cuda:0")

# Path to saved model
model_path = "./news_title_generator"

# Loading tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load model
model = AutoModelForCausalLM.from_pretrained(model_path).to(DEVICE)

# Check loading
print("Model and tokenizer successfully loaded!")
print(f"Model architecture: {model.__class__.__name__}")

Model and tokenizer successfully loaded!
Model architecture: GPT2LMHeadModel


## Test

In [None]:
def generate_title(text, max_new_tokens=50):
    # Generate prompt with explicit separator
    prompt = f"Text: {text}\nTitle:"
    input_ids = tokenizer.encode(prompt, return_tensors="pt", truncation=True).to(DEVICE)

    # Generate with explicit tokens
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        num_beams=5,
        early_stopping=True,
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

    # Decode only the newly generated tokens, skipping the prompt tokens
    new_tokens = output[0][input_ids.shape[1]:]
    title = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    # Remove possible HTML tags and special characters
    title = title.split('<|endoftext|>')[0].split('</p>')[0].strip()

    return title

Let's perform the check on test cases from the dataset:

In [None]:
def generate_comparison_table(dataset_dict):
    # Convert test dataset to DataFrame
    df = dataset_dict['test'].to_pandas()

    # Function to truncate text to 100 words
    def truncate_to_100_words(text):
        words = text.split()[:100] # Take the first 100 words
        return ' '.join(words)

    # Apply truncate to all texts
    df['text'] = df['text'].apply(truncate_to_100_words)

    # Select 10 random samples from the dataset
    random_samples = df.sample(n=10)

    # Create rows for output
    output_lines = []

    for i, (_, row) in enumerate(random_samples.iterrows(), 1):
        original_text = row['text']
        original_title = row['title']
        predicted_title = generate_title(original_text)

        output_lines.append(f"Example {i}")
        output_lines.append(f"Original text: {original_text}")
        output_lines.append(f"Original title: {original_title}")
        output_lines.append(f"Predicted title: {predicted_title}")
        output_lines.append("") # Blank line between examples

    # Join all lines into a single newline-separated string
    return '\n'.join(output_lines)

In [None]:
comparison_table = generate_comparison_table(df)
print(comparison_table)

Example 1
Original text: — Юлия Викторовна, расскажите для начала, в чем заключается основная задача логопеда? — Я бы сказала так: логопед — это главный специалист в дошкольном детстве. Логопед занимается не только развитием общей речевой активности, фонематического слуха, коррекцией звукопроизношения, накоплением словаря, развитием грамматической стороны речи, обучением навыкам словообразования, развитием связной речи, но и развитием психических процессов — внимание, память, восприятие, мышление, формирует предпосылки обучения грамоте, т.е. дает понятия «звук», «слово», «предложение», занимается развитием общей и мелкой моторики. Логопедия всегда стояла на стыке таких наук, как педагогика, психология, нейропсихология, психолингвистика, физиология и неврология. Для того чтобы скорректировать дефект, логопед должен обладать всеми этими
Original title: Логопед рассказала, как воспитать умного ребенка
Predicted title: Как научить ребенка говорить правильно?

Example 2
О