<a href="https://colab.research.google.com/github/Frellaa/Bilingual-audiobook-from-plain-ebook-pdf/blob/main/ReginaFiam_BigDataProject_Bilingualaudiobook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Aim of this Project:
I am an Erasmus student in Bologna, and I wanted to learn Italian in a fun way. I also enjoy reading but rarely have time to sit down with a book, so I prefer audiobooks that I can listen to at the gym, while swimming, or while walking. I therefore combined these interests and created a tool that generates a bilingual audiobook from a plain PDF ebook. I used English as the source language because I feel comfortable with it, although my mother tongue is Hungarian. Babies learn a language by connecting the speech they hear with its meaning. Following the same idea, I segmented the original book by sentences (or by other punctuation when a sentence was too long) and created an ABAB-structured book, where:

- A: Original language
- B: Translated Language

I then joined these segments to make an audiobook with proper pronunciation. I used George Orwell's **Animal Farm** as an example in this notebook. Note, however, that even with a GPU, the text-to-speech (TTS) part of the notebook takes more than 2.5 hours. I will provide a YouTube link with the rendered file.


## Part I. Bilingual book making procedure

In [1]:
!pip install transformers torch PyMuPDF pydub

Collecting PyMuPDF
  Downloading pymupdf-1.25.1-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pymupdf-1.25.1-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub, PyMuPDF
Successfully installed PyMuPDF-1.25.1 pydub-0.25.1


### Function: `extract_text_from_pdf`

This function extracts text from a PDF file, performing the following steps:

- **Opens the PDF**: Uses the `fitz` library to open the specified PDF file.
- **Extracts Text**: Iterates through each page, getting the text content.
- **Processes Text**: Splits the page text into lines, then filters out lines that:
  - Start with 'https://' to remove URLs.
  - Contain the word "CHAPTER" (case-insensitive) to possibly skip chapter headings.
- **Concatenates**: Joins the filtered lines back into a single string.

Finally, it returns this cleaned, concatenated text string.
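
The filtering step can be tried without a PDF. The sketch below applies the same two rules to a plain string; `filter_page_lines` and the sample text are hypothetical and only serve as an illustration, they are not part of the notebook.

```python
import re

def filter_page_lines(page_text):
    """Drop URL lines and chapter headings from one page of text (illustrative helper)."""
    kept = []
    for line in page_text.split('\n'):
        if line.strip().startswith('https://'):
            continue  # URL line, e.g. a footer link
        if re.search(r'\bCHAPTER\b', line.upper()):
            continue  # any line containing the word "chapter", in any letter case
        kept.append(line)
    return '\n'.join(kept)

sample = "Chapter I\nMr. Jones locked the hen-houses.\nhttps://example.com/ebook\nHe was too drunk."
print(filter_page_lines(sample))
```

Note that the second rule matches the word anywhere in a line, so a body line such as "In the last chapter of his life" would be dropped as well.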

In [2]:
import fitz
import re

def extract_text_from_pdf(pdf_path):
    document = fitz.open(pdf_path)
    text = ''
    for page in document:
        # Get the text from the page and split it into lines
        page_text = page.get_text()
        lines = page_text.split('\n')

        # Filter out lines that start with 'https://' or contain "CHAPTER"
        filtered_lines = [
            line for line in lines
            if not line.strip().startswith('https://')
            and not re.search(r'\bCHAPTER\b', line.upper())
        ]

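        # Join the filtered lines back into text; the trailing newline keeps the
        # last line of this page from fusing with the first line of the next page
        text += '\n'.join(filtered_lines) + '\n'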

    return text

In [None]:
#pdf_extracted=extract_text_from_pdf("/content/animalfarm_merged.pdf")

### Function: `segment_sentences`

This function segments a given text into smaller parts based on token count, ensuring that each segment does not exceed a specified number of tokens (`max_tokens`). **This matters for the later audio processing, because only a limited number of tokens can be processed at a time.** Here's how it works:

- **Tokenization**: Converts the input text into sentences using `sent_tokenize` from NLTK, and then into words with `word_tokenize`.

- **Processing Each Sentence:**
  - If a sentence exceeds `max_tokens`:
    - It splits the sentence into smaller segments at punctuation points like commas, semicolons, or colons if found within the last 10 words of the segment.
    - If no suitable punctuation is found for segmentation, it splits at exactly `max_tokens`.

  - If a sentence is within the token limit, it adds the whole sentence to the result.

- **Return**: The function returns a list of these segments.

### Notes:
- Requires NLTK's 'punkt_tab' dataset for sentence tokenization, which is downloaded in the script.
- The `max_tokens` parameter is set to 200 by default, but can be adjusted for different needs.
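
The splitting rule for over-long sentences can be sketched on a list that is already tokenized, so no NLTK is needed. `split_tokens` and its `lookback` parameter are hypothetical names, and the tiny limits are chosen only to make the behavior visible.

```python
def split_tokens(tokens, max_tokens=200, lookback=10, marks=(",", ";", ":")):
    """Split a token list into chunks of at most max_tokens tokens."""
    chunks, current = [], []
    for token in tokens:
        current.append(token)
        if len(current) == max_tokens:
            # Scan the last `lookback` tokens, oldest first, for a punctuation mark
            for i in range(min(lookback, len(current)), 0, -1):
                if current[-i] in marks:
                    chunks.append(current[:-i])   # everything before the mark
                    current = current[-i:]        # the mark starts the next chunk
                    break
            else:
                # No mark found: cut exactly at max_tokens
                chunks.append(current)
                current = []
    if current:
        chunks.append(current)
    return chunks

tokens = "a b , c d e f g h".split()
print(split_tokens(tokens, max_tokens=4, lookback=3))
# → [['a', 'b'], [',', 'c', 'd', 'e'], ['f', 'g', 'h']]
```

With this rule the punctuation mark opens the following chunk rather than closing the previous one.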


In [3]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Download necessary data for sentence tokenization
nltk.download('punkt_tab') # Download the 'punkt_tab' dataset explicitly

def segment_sentences(text, max_tokens=200):
    sentences = sent_tokenize(text)
    result = []

    for sentence in sentences:
        words = word_tokenize(sentence)

        if len(words) > max_tokens:
            # If the sentence exceeds max_tokens, split it at reasonable punctuation
            segments = []
            segment = []
            for word in words:
                segment.append(word)
                if len(segment) == max_tokens:
                    # Look for a good split point within the last few words
                    for i in range(min(10, len(segment)), 0, -1):
                        if segment[-i] in [',', ';', ':']:
                            segments.append(' '.join(segment[:-i]))
                            segment = segment[-i:]
                            break
                    else:
                        # If no punctuation found, just split at max_tokens
                        segments.append(' '.join(segment))
                        segment = []

            # Append any remaining words in the last segment
            if segment:
                segments.append(' '.join(segment))

            result.extend(segments)
        else:
            # Sentence fits within token limit, add it directly
            result.append(sentence)

    return result

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [4]:
# Example usage
text="This is an extraordinarily long sentence that we've crafted to go beyond three hundred tokens, including a plethora of words that are quite mundane like the, and, but, or, because, however, therefore, while, which, who, whom, whose, where, when, why, how, if, then, so, yet, nor, not, only, just, maybe, perhaps, possibly, likely, certainly, indeed, surely, definitely, absolutely, positively, undeniably, and we continue with nouns like book, chair, table, lamp, house, car, tree, river, mountain, ocean, sky, cloud, sun, moon, stars, planet, galaxy, universe, atom, molecule, cell, organism, species, genre, type, kind, variety, class, category, group, team, band, orchestra, choir, ensemble, collection, set, array, list, inventory, catalog, directory, index, archive, record, document, paper, essay, story, novel, poem, drama, comedy, tragedy, history, science, math, physics, chemistry, biology, geology, geography, astronomy, economics, psychology, sociology, philosophy, linguistics, literature, art, music, dance, theater, cinema, television, radio, internet, software, hardware, computer, laptop, smartphone, tablet, device, gadget, tool, instrument, machine, engine, motor, wheel, gear, lever, pulley, screw, nail, hammer, saw, drill, axe, knife, fork, spoon, plate, bowl, cup, glass, bottle, jar, can, box, bag, sack, wallet, purse, key, lock, door, window, wall, floor, ceiling, roof, garden, yard, park, forest, jungle, desert, beach, island, lake, pond, stream, brook, creek, canal, bridge, road, street, avenue, boulevard, highway, path, trail, track, route, journey, trip, voyage, expedition, adventure, experience, memory, thought, idea, concept, theory, hypothesis, principle, law, rule, regulation, policy, strategy, tactic, method, technique, skill, talent, ability, capacity, potential, power, strength, energy, force, speed, acceleration, momentum, velocity, mass, weight, volume, area, length, width, height, depth, dimension, size, shape, form, structure, pattern, 
design, style, fashion, trend, tradition, custom, culture, heritage, legacy, history, past, present, future, moment, instant, second, minute, hour, day, week, month, year, decade, century, millennium, era, epoch, period, age, stage, phase, step, level, degree, grade, rank, position, status, condition, situation, circumstance, context, environment, setting, scene, scenario, case, example, instance, occurrence, event, happening, incident, accident, emergency, crisis, challenge, problem, issue, matter, concern, topic, subject, theme, motif, element, component, part, piece, segment, section, portion, fraction, percentage, ratio, proportion, balance, harmony, melody, rhythm, beat, tempo, pace, tone, pitch, note, chord, scale, key, signature, measure, bar, line, verse, chorus, refrain, bridge, hook, lyric, word, syllable, sound, noise, silence, quiet, peace, calm, tranquility, serenity, bliss, happiness, joy, delight, pleasure, satisfaction, fulfillment, achievement, success, victory, triumph, progress, improvement, growth, development, evolution, change, transformation, transition, shift, move, action, activity, operation, function, purpose, goal, objective, aim, target, focus, concentration, attention, awareness, consciousness, knowledge, understanding, wisdom, insight, intuition, instinct, feeling, emotion, mood, temperament, character, personality, identity, self, ego, soul, spirit, mind, heart, body, health, fitness, wellness, vitality, vigor, strength, endurance, resilience, flexibility, agility, dexterity, coordination, balance, control, discipline, determination, willpower, motivation, inspiration, creativity, imagination, innovation, invention, discovery, exploration, investigation, research, study, learning, education, teaching, training, practice, exercise, work, labor, effort, exertion, struggle, fight, battle, war, peace, negotiation, compromise, agreement, contract, deal, transaction, exchange, trade, commerce, business, industry, market, economy, finance, 
investment, capital, asset, resource, wealth, prosperity, abundance, plenty, excess, surplus, deficit, shortage, scarcity, poverty, need, necessity, demand, supply, production, consumption, use, utility, benefit, advantage, gain, profit, income, revenue, earnings, salary, wage, fee, cost, expense, price, value, worth, merit, quality, standard, norm, average, mean, median, mode, range, variation, deviation, anomaly, exception, rarity, uniqueness, individuality, diversity, variety, multiplicity, complexity, intricacy, detail, specificity, precision, accuracy, exactness, correctness, truth, fact, reality, actuality, existence, being, life, living, survival, sustainability, continuity, permanence, stability, security, safety, protection, defense, guard, shield, armor, weapon, tool, utility, function, performance, efficiency, effectiveness, productivity, output, result, consequence, effect, impact, influence, power, authority, control, command, order, directive, instruction, guideline, rule, law, legislation, regulation, code, standard, norm, convention, tradition, custom, practice, habit, routine, procedure, process, method, approach, technique, strategy, tactic, plan, scheme, design, blueprint, model, prototype, sample, specimen, example, illustration, demonstration, explanation, description, definition, interpretation, analysis, evaluation, assessment, judgment, opinion, view, perspective, outlook, vision, foresight, prediction, forecast, estimate, guess, speculation, assumption, belief, faith, trust, confidence, assurance, certainty, conviction, commitment, dedication, devotion, loyalty, allegiance, fidelity, honesty, integrity, sincerity, authenticity, genuineness, reality, truth."
segments = segment_sentences(text)
for i, segment in enumerate(segments):
    print(f"Segment {i+1}: {segment}")

Segment 1: This is an extraordinarily long sentence that we 've crafted to go beyond three hundred tokens , including a plethora of words that are quite mundane like the , and , but , or , because , however , therefore , while , which , who , whom , whose , where , when , why , how , if , then , so , yet , nor , not , only , just , maybe , perhaps , possibly , likely , certainly , indeed , surely , definitely , absolutely , positively , undeniably , and we continue with nouns like book , chair , table , lamp , house , car , tree , river , mountain , ocean , sky , cloud , sun , moon , stars , planet , galaxy , universe , atom , molecule , cell , organism , species , genre , type , kind , variety , class , category , group , team , band , orchestra , choir , ensemble , collection , set , array , list , inventory , catalog , directory , index , archive
Segment 2: , record , document , paper , essay , story , novel , poem , drama , comedy , tragedy , history , science , math , physics , ch

In [None]:
#segmented_text=segment_sentences(pdf_extracted)

### Translation Using Transformers

This code uses the Hugging Face `transformers` library to set up and perform English to Italian translation:

- **Model Setup**: The `setup_translation_model` function initializes the tokenizer and model for the specified translation model and moves the model to the GPU, if one is available, for faster processing. (Note: I ran this code on Colab Premium; without it, this step may take longer.)

- **Translation**: The `translate_sentence` function converts an English sentence into Italian by encoding the input, generating translated tokens, and decoding them back into text. It limits the output length to prevent runaway translations and to manage computational resources.

In [None]:
#translation using transformers
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

def setup_translation_model(model_name="Helsinki-NLP/opus-mt-en-it"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    return tokenizer, model, device

def translate_sentence(sentence, tokenizer, model, device):
    inputs = tokenizer(sentence, return_tensors="pt").to(device)
    # Cap the output at twice the input length (a rough but safe upper bound), never above 512 tokens
    max_length = min(inputs["input_ids"].shape[1] * 2, 512)
    outputs = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

Loading the model

In [None]:
tokenizer, model, device = setup_translation_model()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
#translated_sentences = [translate_sentence(sent, tokenizer, model, device) for sent in segmented_text]

In [None]:
#the merged text is not used explicitly later, but it is handy to have if someone needs it.
def merge_sentences(original, translated):
    return [sent for pair in zip(original, translated) for sent in pair]
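
A small usage sketch of the interleaving, with made-up sentence pairs (the function is repeated so the snippet runs on its own):

```python
def merge_sentences(original, translated):
    # zip pairs the i-th original with the i-th translation;
    # the comprehension then flattens the pairs into one ABAB list
    return [sent for pair in zip(original, translated) for sent in pair]

english = ["Good morning.", "How are you?"]
italian = ["Buongiorno.", "Come stai?"]
merged = merge_sentences(english, italian)
print(merged)
# → ['Good morning.', 'Buongiorno.', 'How are you?', 'Come stai?']
```

Because `zip` stops at the shorter list, any unmatched sentences at the end of the longer list are silently left out.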

In [None]:
#Writing out the merged book

#merged_translated_sentences = merge_sentences(segmented_text, translated_sentences)
#with open('merged_translated_sentences.txt', 'w') as file:
#    for line in merged_translated_sentences:
#       file.write(f"{line}\n")

The following simple text processing function replaces the '\n' and '\t' characters with spaces, because they would otherwise cause trouble during TTS.

In [None]:
def preprocess_for_tts(text):
    # Replace newlines with spaces for TTS
    text = text.replace('\n', ' ')
    # Remove tabs
    text = text.replace('\t', ' ')
    # Optionally, you might want to add pauses where newlines were for better readability
    # text = text.replace('\n', ' <pause> ')  # If your TTS system supports custom tags for pauses
    return text
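
Replacing each newline and tab separately can leave double spaces where the two were adjacent. A possible variant, given only as a sketch (the name `normalize_whitespace` is not from the notebook), collapses every run of whitespace into a single space:

```python
import re

def normalize_whitespace(text):
    """Collapse any run of whitespace (newlines, tabs, repeated spaces) into one space."""
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_whitespace("Mr Jones\n\tlocked   the hen-houses.\n"))
# → Mr Jones locked the hen-houses.
```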

With all functions defined, the pipeline can now be run on the actual book.

In [None]:
#text extraction from pdf
pdf_extracted=extract_text_from_pdf("/content/animalfarm_merged.pdf")

#the segmentation of the extracted text
segmented_text=segment_sentences(pdf_extracted)

#translation of the segmented text to the desired language - here italian
translated_sentences = [translate_sentence(sent, tokenizer, model, device) for sent in segmented_text]

#process the texts to be suitable for further TTS applications
processed_original = [preprocess_for_tts(sentence) for sentence in segmented_text]
processed_translated = [preprocess_for_tts(sentence) for sentence in translated_sentences]

#write out the processed files so that, if the runtime stops, I don't need to rerun everything and can just reopen the files.
with open('processed_original.txt', 'w', encoding='utf-8') as file: # Added encoding='utf-8'
    for line in processed_original:
        file.write(f"{line}\n")

with open('processed_translated.txt', 'w', encoding='utf-8') as file: # Added encoding='utf-8'
    for line in processed_translated:
        file.write(f"{line}\n")
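
The two text files stay aligned only through their line order. As an alternative sketch, under the assumption that a single checkpoint file is acceptable, each pair can be stored together in JSON; `save_pairs`, `load_pairs` and the demo file name are hypothetical.

```python
import json
import os
import tempfile

def save_pairs(path, originals, translations):
    """Write aligned (original, translation) pairs to a single JSON file."""
    pairs = [{"original": o, "translated": t} for o, t in zip(originals, translations)]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(pairs, f, ensure_ascii=False, indent=2)

def load_pairs(path):
    """Read the pairs back as two parallel lists."""
    with open(path, "r", encoding="utf-8") as f:
        pairs = json.load(f)
    return [p["original"] for p in pairs], [p["translated"] for p in pairs]

# Demo in the system temp folder
path = os.path.join(tempfile.gettempdir(), "pairs_demo.json")
save_pairs(path, ["Good morning."], ["Buongiorno."])
originals, translations = load_pairs(path)
print(originals, translations)
```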

## Part II. Text to speech to create the bilingual audiobook

In [None]:
!pip install datasets
!pip install --upgrade tensorflow  # Upgrade to the latest stable release
!pip install --upgrade transformers  # Update transformers to potentially fix compatibility issues.
!pip install --upgrade datasets
!pip install --upgrade TTS
!pip install numpy==1.23
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding
!pip install pydub

Collecting numpy<2.1.0,>=1.26.0 (from tensorflow)
  Using cached numpy-2.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Using cached numpy-2.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.23.0
    Uninstalling numpy-1.23.0:
      Successfully uninstalled numpy-1.23.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tts 0.22.0 requires numpy==1.22.0; python_version <= "3.10", but you have numpy 2.0.2 which is incompatible.
cudf-cu12 24.10.1 requires pandas<2.2.3dev0,>=2.0, but you have pandas 1.5.3 which is incompatible.
cupy-cuda12x 12.2.0 requires numpy<1.27,>=1.20, but you have numpy 2.0.2 which is incompatible.
gensim 4.3.3 requires numpy<2.0,>=1.18.5, but you have numpy 2.0.2 which is inco

Collecting numpy==1.23
  Using cached numpy-1.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.2 kB)
Using cached numpy-1.23.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.22.0
    Uninstalling numpy-1.22.0:
      Successfully uninstalled numpy-1.22.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tts 0.22.0 requires numpy==1.22.0; python_version <= "3.10", but you have numpy 1.23.0 which is incompatible.
albucore 0.0.19 requires numpy>=1.24.4, but you have numpy 1.23.0 which is incompatible.
albumentations 1.4.20 requires numpy>=1.24.4, but you have numpy 1.23.0 which is incompatible.
bigframes 1.29.0 requires numpy>=1.24.0, but you have numpy 1.23.0 which is incompatible.
chex 0.1.88 requires nump



In [None]:
#now read those processed files for further analysis
with open('processed_original.txt', 'r', encoding='utf-8') as file:
    processed_original = [line.strip() for line in file]

with open('processed_translated.txt', 'r', encoding='utf-8') as file:
    processed_translated = [line.strip() for line in file]

It is necessary to remove the punctuation marks (`.`, `,`, `!`, `?`) from the sentences, because otherwise the TTS would pronounce them in the translated version, which I wanted to avoid.

In [None]:
import re

def clean_text(text):
    return re.sub(r'[.,!?]', '', text)

processed_original = [clean_text(line) for line in processed_original]
processed_translated = [clean_text(line) for line in processed_translated]
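
A quick check of what the pattern removes, on a made-up sentence (the function is repeated so the snippet runs on its own):

```python
import re

def clean_text(text):
    # Same pattern as in the notebook: delete '.', ',', '!' and '?' wherever they occur
    return re.sub(r'[.,!?]', '', text)

print(clean_text("Mr. Jones, of the Manor Farm; he said: stop!"))
# → Mr Jones of the Manor Farm; he said: stop
```

Semicolons, colons and quotation marks are left in place, and periods inside numbers are removed too, so "2.5 hours" becomes "25 hours".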

Running this part of the notebook on a CPU is not recommended, because:
- the running time is significantly longer (~10x);
- with such a long running time, the Colab runtime disconnects due to inactivity, which results in truncated audio files.

In [None]:
#import torch

# Check if CUDA is available
#device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize TTS with the appropriate device
#tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=(device == "cuda"))

TTS on GPU:

In [None]:
#!pip install numpy==1.23
from TTS.api import TTS

# Initialize TTS with the multilingual XTTS v2 model (used here for English and Italian)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)



 > You must confirm the following:
 | > "I have purchased a commercial license from Coqui: licensing@coqui.ai"
 | > "Otherwise, I agree to the terms of the non-commercial CPML: https://coqui.ai/cpml" - [y/n]
 | | > y
 > Downloading model to /root/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2



 > Model's license - CPML
 > Check https://coqui.ai/cpml.txt for more info.
 > Using model: xtts


  self.speakers = torch.load(speaker_file_path)
  return torch.load(f, map_location=map_location, **kwargs)
GPT2InferenceModel has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.


In [None]:
# Path to an audio file of the speaker whose voice you want to clone (must be at least 3 seconds long). In my experience, the longer and clearer the recording, the better the quality of the result.
speaker_wav = "/content/voice_training_1984-by-george-orwell-lex-fridman.mp3"

In [None]:
# Find the index and value of the longest element, just to check the success of the tokenization
longest_element_index, longest_element = max(enumerate(processed_translated), key=lambda x: len(x[1]))

print(f"The longest element '{longest_element}' is at index {longest_element_index}.")

The longest element 'Под руководством нашего лидера товарища Наполеона я положил пять яиц за шесть дней; или две коровы наслаждающиеся напитком в бассейне возгласят:  спасибо товарищу Наполеону за то как прекрасно вкусна эта вода  Общее чувство на ферме было хорошо выражено в стихотворении озаглавленном "Комрад Наполеон" которое было составлено Минимусом и которое шло следующим образом: "Дружок без отца"' is at index 977.


**Experience**: it is recommended to mount your Google Drive in Colab and save the result file there. Because the runtime is long, I usually leave the notebook running overnight; the Colab runtime then disconnects due to inactivity, and the result file is lost unless it has been saved to your computer or Drive.

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#define a path in your Drive where you want to receive the audiobook
drive_output_path = "/content/drive/MyDrive/big_data"
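
Since a disconnect can interrupt the run, one option is to write each rendered pair to the Drive folder as soon as it is ready and, after a restart, render only what is missing. The sketch below shows just the bookkeeping; `missing_segments`, the `pair_00000.wav` naming scheme and the temporary demo folder are assumptions, not part of the notebook.

```python
import os
import tempfile

def missing_segments(output_dir, n_segments, prefix="pair"):
    """Return the indices whose audio file is not yet present in output_dir."""
    return [
        i for i in range(n_segments)
        if not os.path.exists(os.path.join(output_dir, f"{prefix}_{i:05d}.wav"))
    ]

# Demo with a temporary folder standing in for the Drive folder
demo_dir = tempfile.mkdtemp()
for i in (0, 1):  # pretend segments 0 and 1 were rendered before a disconnect
    open(os.path.join(demo_dir, f"pair_{i:05d}.wav"), "wb").close()

todo = missing_segments(demo_dir, 4)
print(todo)  # → [2, 3]
```

The TTS loop would then iterate over `todo` instead of the full index range.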

In [None]:
import os
from pydub import AudioSegment

# Assuming processed_original and processed_translated lists are already defined and populated.

# Create a list to store the concatenated audio segments
concatenated_audio = []

for i in range(min(len(processed_original), len(processed_translated))):  # Iterate up to the length of the shorter list
    original_file = f"original_{i}.wav"
    translated_file = f"translated_{i}.wav"

    tts.tts_to_file(text=processed_original[i], file_path=original_file, language="en", speaker_wav=speaker_wav)
    tts.tts_to_file(text=processed_translated[i], file_path=translated_file, language="it", speaker_wav=speaker_wav)

    # Load audio files using pydub
    original_audio = AudioSegment.from_wav(original_file)
    translated_audio = AudioSegment.from_wav(translated_file)

    # Concatenate audio segments
    concatenated_audio.append(original_audio)
    concatenated_audio.append(translated_audio)

    # Remove temporary files
    os.remove(original_file)
    os.remove(translated_file)


# Combine all audio segments into a single file
final_audio = sum(concatenated_audio)
final_audio.export("final_output.wav", format="wav")

# Export the combined audio to your Google Drive (create the folder if it does not exist yet)
os.makedirs(drive_output_path, exist_ok=True)
final_audio.export(os.path.join(drive_output_path, "final_audio_merged_en_it.wav"), format="wav")

 > Text splitted to sentences.
['M r Jones of the Manor Farm had locked the hen-houses for the night but was too drunk to remember to shut the pop- holes']


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


 > Processing time: 7.649575233459473
 > Real-time factor: 0.8511966789351099
 > Text splitted to sentences.
['Мистер Джонс из поместья запер курятники на ночь но был слишком пьян чтобы заткнуть дыры']
 > Processing time: 3.9614100456237793
 > Real-time factor: 0.3732867158376254
 > Text splitted to sentences.
['With the ring of light from his lantern dancing from side to side he lurched across the yard kicked off his boots at the back door drew himself a last glass of beer from the barrel in the scullery and made his way up to bed where Mrs Jones was already snoring']
 > Processing time: 8.972793340682983
 > Real-time factor: 0.45406788905477685
 > Text splitted to sentences.
['С кольцом света от фонаря танцующего из стороны в сторону он прыгнул через двор снял ботинки у задней двери нарисовал себе последний бокал пива из ствола в скуллерии и пошёл в постель где миссис Джонс уже храпела']
 > Processing time: 8.540159702301025
 > Real-time factor: 0.3754795849332781
 > Text splitted to