I retrieved the data following the instructions provided at:
https://www.kaggle.com/code/d0rj3228/authorship-attribution-for-russian-literature

## Importing libraries, downloading punkt

In [None]:
from typing import List
import random

import glob
from nltk import tokenize, download
import numpy as np
import pandas as pd
download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Making lists of processed sentences per author

The `split_text` function returns a list of sentences that are at least 5 characters long (the `min_char` parameter), since shorter ones most likely carry no information useful for attribution. To help the sentence tokenizer, some character combinations are replaced before tokenization. In particular, sentence-final punctuation is moved outside closing quotation marks, so that lines of dialogue are separated from the author's narration, which should resolve the problem with quotes.

In [None]:
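def split_text(filepath: str, min_char: int = 5) -> List[str]:
    """Read a text file and return its sentences of at least min_char characters."""
    with open(filepath, 'r', encoding='utf-8') as file:
        # Treat every line break as a sentence boundary
        text = file.read().replace('\n', '. ')

    # Move sentence-final punctuation outside closing quotation marks
    text = text.replace('.”', '”.').replace('."', '".').replace('?”', '”?').replace('!”', '”!')
    # Replace double hyphens with a space, delete spaced ellipses and underscores,
    # and convert "..." to the single ellipsis character
    text = text.replace('--', ' ').replace('. . .', '').replace('_', '').replace("...", "…")
    # Normalize non-breaking spaces and strip stray control characters
    text = text.replace("\xa0", " ").replace("\x7f", "").replace("\x01", "")

    sentences = tokenize.sent_tokenize(text)
    # Keep only sentences that are long enough to be informative
    return [sentence for sentence in sentences if len(sentence) >= min_char]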
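
To see what the quote replacements do before tokenization, the sketch below applies only that step to a short string. `normalize_quotes` is a hypothetical helper introduced here for illustration. It is not part of the notebook and copies just the first chain of replacements from `split_text`.

```python
def normalize_quotes(text: str) -> str:
    """Move sentence-final punctuation outside a closing quote (illustrative helper)."""
    return (
        text.replace('.”', '”.')
            .replace('."', '".')
            .replace('?”', '”?')
            .replace('!”', '”!')
    )


sample = 'Он спросил: “Кто там?” Никто не ответил.'
normalized = normalize_quotes(sample)
print(normalized)  # Он спросил: “Кто там”? Никто не ответил.
```

After the replacement, the question mark sits immediately before the space that separates the two sentences, which is presumably the position the tokenizer needs in order to split there.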

Let's print the sentences from one txt file.

In [None]:
path = "./Chekhov/Агафья.txt"
new_text = split_text(path)
for number, sent in enumerate(new_text, start=1):
    print(number, sent)

1 Антон Чехов.
2 АГАФЬЯ.
3 В бытность мою в С—м уезде мне часто приходилось бывать на Дубовских огородах у огородника Саввы Стукача, или попросту Савки.
4 Эти огороды были моим излюбленным местом для так называемой «генеральной» рыбной ловли, когда, уходя из дому, не знаешь дня и часа, в которые вернешься, забираешь с собой все до одной рыболовные снасти и запасаешься провизией.
5 Собственно говоря, меня не так занимала рыбная ловля, как безмятежное шатанье, еда не вовремя, беседа с Савкой и продолжительные очные ставки с тихими летними ночами.
6 Савка был парень лет 25, рослый, красивый, здоровый, как кремень.
7 Слыл он за человека рассудительного и толкового, был грамотен, водку пил редко, но как работник этот молодой и сильный человек не стоил и гроша медного.
8 Рядом с силой в его крепких, как веревка, мышцах разливалась тяжелая, непобедимая лень.
9 Жил он, как и все на деревне, в собственной избе, пользовался наделом, но не пахал, не сеял и никаким ремеслом не занимался.
10 Старух

In [None]:
chekhov = []
for path in glob.glob('./Chekhov/*.txt'):
    chekhov += split_text(path)

dostoevsky = []
for path in glob.glob('./Dostoevsky/*.txt'):
    dostoevsky += split_text(path)

tolstoy = []
for path in glob.glob('./Tolstoy/*.txt'):
    tolstoy += split_text(path)

gogol = []
for path in glob.glob('./Gogol/*.txt'):
    gogol += split_text(path)

In [None]:
text_dict = {'Chekhov': chekhov, 'Dostoevsky': dostoevsky, 'Tolstoy': tolstoy, 'Gogol': gogol}

for key in text_dict.keys():
    print(key, ':', len(text_dict[key]), ' sentences')

Chekhov : 22729  sentences
Dostoevsky : 118479  sentences
Tolstoy : 92213  sentences
Gogol : 23750  sentences


Each list contains between 22729 and 118479 sentences. To get an even distribution of authors in our set, we limit each author to the same number of sentences, here 2750. Why this number? 2500 sentences per author form the training set (10000 sentences in total) and 250 sentences per author form the test set (1000 sentences in total).
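
The arithmetic behind these numbers can be checked with a short sketch. The variable names are illustrative, and the per-author counts are copied from the output printed above.

```python
n_authors = 4
train_per_author = 2500
test_per_author = 250
max_len = train_per_author + test_per_author

# Sentence counts printed by the notebook for each author
available = {'Chekhov': 22729, 'Dostoevsky': 118479, 'Tolstoy': 92213, 'Gogol': 23750}

train_total = n_authors * train_per_author
test_total = n_authors * test_per_author
# Sampling without replacement needs at least max_len sentences per author
enough = all(count >= max_len for count in available.values())

print(max_len, train_total, test_total, enough)  # 2750 10000 1000 True
```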

We are going to use NumPy's `random` module because it was used by the creators of the dataset on Kaggle.

In [None]:
# Set the random seed for reproducibility
np.random.seed(1)

# Maximum number of sentences to select from each author
max_len = 2750

# Create "names" - a list of lists of sentences for each author
names = [chekhov, dostoevsky, tolstoy, gogol]

# Create an empty list to store the combined sentences
combined = []

for author_sentences in names:
    # Randomly select exactly max_len sentences without replacement
    # (raises ValueError if an author has fewer than max_len sentences)
    selected_sentences = np.random.choice(author_sentences, max_len, replace=False)
    # Add the selected sentences to the combined list
    combined += list(selected_sentences)

print('Length of combo and internally shuffled list:', len(combined))

Length of combo and internally shuffled list: 11000
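
As a side note, NumPy's documentation recommends the `Generator` API for new code instead of the global `np.random.seed` state. The sketch below shows the same kind of draw on stand-in data. It is an alternative, not what the notebook uses, and it produces a different sample from `np.random.seed(1)` followed by `np.random.choice`, so it would not reproduce the Kaggle set.

```python
import numpy as np

# Local generator; the seed value simply mirrors the notebook's choice
rng = np.random.default_rng(1)

# Stand-in data (hypothetical), in place of a real author's sentence list
toy_sentences = [f'sentence {i}' for i in range(20)]

# Draw 5 distinct sentences without replacement
picked = rng.choice(toy_sentences, size=5, replace=False)

# rng.choice returns a NumPy array; convert its elements back to plain str
picked = [str(s) for s in picked]
print(picked)
```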


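Python's built-in `random` module offers an alternative. Note that the cell below reassigns `combined`, so when both cells are run in order, the rest of the notebook uses the sample drawn by `random.sample`, not the NumPy one: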

In [None]:
# Set the random seed for reproducibility
random.seed(1)

# Maximum number of sentences to select from each author
max_len = 2750

# List of author-specific sentence lists
names = [chekhov, dostoevsky, tolstoy, gogol]

# Initialize an empty list to store the combined sentences
combined = []

# Iterate through each author's sentences
for author_sentences in names:
    # Randomly select max_len sentences without replacement
    selected_sentences = random.sample(author_sentences, max_len)

    # Add the selected sentences to the combined list
    combined += selected_sentences

# Print the total number of sentences in the combined list
print('Length of combined and internally shuffled list:', len(combined))

Length of combined and internally shuffled list: 11000


## Make a list of all labels

At this point, it is important to list the author labels (their names) in the same order in which the authors' sentences were added to `combined` in the previous step. Otherwise the sentences will be paired with the wrong authors.

In [None]:
labels = ['Chekhov'] * max_len + ['Dostoevsky'] * max_len + ['Tolstoy'] * max_len +  ['Gogol'] * max_len

print('Length of marked list:', len(labels))

Length of marked list: 11000


Printing the lengths gives an additional check on the data and their labels. If the two lengths are equal, every sentence in our dataset has a label.

In [None]:
len(combined) == len(labels)

True
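
Equal lengths do not by themselves prove that the author order matches. One way to rule out an order mismatch is to attach the label at the moment each sentence is sampled. The sketch below does this on toy data. `toy_dict`, `toy_max_len`, and the sentence strings are hypothetical stand-ins, not the notebook's variables.

```python
import random

# Toy stand-ins for the per-author sentence lists (hypothetical data)
toy_dict = {
    'Chekhov': [f'c{i}' for i in range(10)],
    'Gogol': [f'g{i}' for i in range(10)],
}
toy_max_len = 4

random.seed(1)
pairs = []
for author, sentences in toy_dict.items():
    # Label each sampled sentence immediately, so order cannot drift
    for sentence in random.sample(sentences, toy_max_len):
        pairs.append((sentence, author))

toy_combined = [sentence for sentence, _ in pairs]
toy_labels = [author for _, author in pairs]
print(len(toy_combined) == len(toy_labels))  # True
```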

## Splitting the list of all sentences into a training and a testing set

Separate the combined list back into author-specific lists. From here on, each author's name refers only to that author's 2750 sampled sentences:


In [None]:
chekhov = combined[:max_len]
dostoevsky = combined[max_len:2*max_len]
tolstoy = combined[2*max_len:3*max_len]
gogol = combined[3*max_len:]

Let's write a function to split a list into training and test sets:

In [None]:
def split_train_test(data, train_size=2500, test_size=250):
    return data[:train_size], data[train_size:train_size + test_size]
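
A quick check on toy data shows how the slicing behaves. The function is repeated so that the snippet runs on its own, and the toy list and sizes are illustrative.

```python
def split_train_test(data, train_size=2500, test_size=250):
    return data[:train_size], data[train_size:train_size + test_size]


toy = list(range(11))
toy_train, toy_test = split_train_test(toy, train_size=8, test_size=2)

print(toy_train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(toy_test)   # [8, 9]
```

Two properties are worth keeping in mind. Items beyond `train_size + test_size` (here the value 10) are silently dropped, and a list that is too short yields a shorter test set without raising an error. With 2750 sentences per author and the default sizes, neither case occurs.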

Split each author's data

In [None]:
chekhov_train, chekhov_test = split_train_test(chekhov)
dostoevsky_train, dostoevsky_test = split_train_test(dostoevsky)
tolstoy_train, tolstoy_test = split_train_test(tolstoy)
gogol_train, gogol_test = split_train_test(gogol)

Combine training data and test data separately

In [None]:
train_data = chekhov_train + dostoevsky_train + tolstoy_train + gogol_train
test_data = chekhov_test + dostoevsky_test + tolstoy_test + gogol_test

Create the labels, keeping the same author order as in the data lists above

In [None]:
train_labels = (['Chekhov'] * len(chekhov_train) + ['Dostoevsky'] * len(dostoevsky_train) +
               ['Tolstoy'] * len(tolstoy_train) + ['Gogol'] * len(gogol_train))
test_labels = (['Chekhov'] * len(chekhov_test) + ['Dostoevsky'] * len(dostoevsky_test) +
              ['Tolstoy'] * len(tolstoy_test) + ['Gogol'] * len(gogol_test))

## Shuffle the training and test sets

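Each set is zipped with its labels so that every sentence stays paired with its author while the order is shuffled. Unzipping afterwards returns the data and labels as tuples.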

In [None]:
random.seed(3)
train_zipped = list(zip(train_data, train_labels))
random.shuffle(train_zipped)
train_data, train_labels = zip(*train_zipped)

test_zipped = list(zip(test_data, test_labels))
random.shuffle(test_zipped)
test_data, test_labels = zip(*test_zipped)
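
The zip, shuffle, unzip pattern can be sanity-checked on toy data. The sketch below confirms that each sentence keeps its label and that the class counts are unchanged. All `toy_*` names and strings are hypothetical.

```python
import random
from collections import Counter

# Toy stand-ins for the data and label lists (hypothetical)
toy_data = ['c0', 'c1', 'g0', 'g1', 't0', 't1']
toy_labels = ['Chekhov', 'Chekhov', 'Gogol', 'Gogol', 'Tolstoy', 'Tolstoy']
original = dict(zip(toy_data, toy_labels))

random.seed(3)
zipped = list(zip(toy_data, toy_labels))
random.shuffle(zipped)
shuffled_data, shuffled_labels = zip(*zipped)

# Every sentence must still carry the label it had before shuffling
aligned = all(original[s] == l for s, l in zip(shuffled_data, shuffled_labels))
# Shuffling must not change how many examples each author has
counts = Counter(shuffled_labels)
print(aligned, dict(counts))
```

The same `Counter` check applied to the notebook's `train_labels` and `test_labels` should report 2500 and 250 examples per author, respectively.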

In [None]:
train_data_2024 = pd.DataFrame()
train_data_2024['text'] = train_data
train_data_2024['author'] = train_labels

print(train_data_2024.head())
print(train_data_2024.tail())

                                                text      author
0  Он старался не развлекаться и не портить себе ...     Tolstoy
1         Всегда этак у меня перед припадком бывает.     Chekhov
2  Катерина Николаевна тут же и.   отказала ему, ...  Dostoevsky
3                                    Анна Андреевна.       Gogol
4  — То, что я видел сейчас, хуже всякой простуды...     Chekhov
                                                   text      author
9995  Он признавал себя теперь, так же как и Марью П...     Tolstoy
9996  Судя по выражению его морды и усов, он сам был...     Chekhov
9997  Он стал видеть во всем какое-то революционное ...       Gogol
9998            — Великий господин, ясновельможный пан!       Gogol
9999                                         Глава XIV.  Dostoevsky


## Save the shuffled sets to CSV

In [None]:
test_data_2024 = pd.DataFrame()
test_data_2024['text'] = test_data
test_data_2024['author'] = test_labels

print(test_data_2024.head())
print(test_data_2024.tail())

                                                text   author
0  — Это мы понимаем… Мы ведь не все отвинчиваем…...  Chekhov
1                                          Да что я?    Gogol
2  Я думаю, у меня горло замерзло от проклятого м...    Gogol
3  На деда, несмотря на весь страх, смех напал, к...    Gogol
4  Действительно, влияние товарищей оказало на не...  Tolstoy
                                                  text   author
995           Он, увидевши, что нет меня, начал звать.    Gogol
996  Как только табун загнали, лошади собрались вок...  Tolstoy
997                                       Ну, да что!.  Chekhov
998  И он вспомнил то робкое, жалостное выражение, ...  Tolstoy
999    И жених и невеста были предметом общей зависти.    Gogol


In [None]:
test_data_2024.to_csv('test_data_2024.csv', index=False)
train_data_2024.to_csv('train_data_2024.csv', index=False)