# Machine translation

In [None]:
!pip install -q datasets sentencepiece torch "transformers[torch]" python-dotenv

import dotenv

dotenv.load_dotenv()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


## Translation using transformer

In [30]:
pl_text = "Wczesnym rankiem, gdy słońce dopiero zaczynało wschodzić, ulice miasta były jeszcze puste. Czuć było świeżość powietrza i delikatny zapach kawy unoszący się z pobliskiej kawiarni. To był idealny moment na spokojny spacer przed rozpoczęciem dnia."
en_text = "Early in the morning, as the sun was just beginning to rise, the city streets were still empty. The air felt fresh, and a gentle smell of coffee drifted from a nearby café. It was the perfect moment for a peaceful walk before the day began."

In [31]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-pl-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-pl-en")

In [32]:
tokenized_text_pl = tokenizer(pl_text, return_tensors="pt")
translated_tokens = model.generate(
    input_ids=tokenized_text_pl["input_ids"],
    attention_mask=tokenized_text_pl["attention_mask"],
)
translated = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]

In [33]:
print(f"Polish text:\n {pl_text}")
print(f"Translated text:\n {translated}")
print(f"Expected translation:\n {en_text}")

Polish text:
 Wczesnym rankiem, gdy słońce dopiero zaczynało wschodzić, ulice miasta były jeszcze puste. Czuć było świeżość powietrza i delikatny zapach kawy unoszący się z pobliskiej kawiarni. To był idealny moment na spokojny spacer przed rozpoczęciem dnia.
Translated text:
 Early in the morning, when the sun was just beginning to rise, the streets of the city were still empty. It felt fresh air and a gentle smell of coffee floating from a nearby cafe. It was the perfect moment for a quiet walk before the beginning of the day.
Expected translation:
 Early in the morning, as the sun was just beginning to rise, the city streets were still empty. The air felt fresh, and a gentle smell of coffee drifted from a nearby café. It was the perfect moment for a peaceful walk before the day began.
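A quick way to put a number on how close the model output is to the expected translation is a word-level similarity score. The sketch below uses only the standard library (`difflib.SequenceMatcher`); the helper name `word_similarity` and the shortened example strings are assumptions made for illustration, not part of the original notebook.

```python
from difflib import SequenceMatcher


def word_similarity(candidate: str, reference: str) -> float:
    """Return a 0..1 similarity score based on matching word sequences."""
    cand_words = candidate.lower().split()
    ref_words = reference.lower().split()
    return SequenceMatcher(None, cand_words, ref_words).ratio()


# First sentence of the model output and of the expected translation
candidate_text = (
    "Early in the morning, when the sun was just beginning to rise, "
    "the streets of the city were still empty."
)
reference_text = (
    "Early in the morning, as the sun was just beginning to rise, "
    "the city streets were still empty."
)

score = word_similarity(candidate_text, reference_text)
print(f"Word-level similarity: {score:.2f}")
```

A score close to 1 means the two texts share most of their word order; it says nothing about whether the differing words are acceptable synonyms.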


## Language detection

In [34]:
!pip install -q langdetect



In [35]:
from langdetect import detect

source_text = "Az ezer mérföldes utazás is egyetlen lépéssel kezdődik."
print(f"Detected language: {detect(source_text)}")

Detected language: hu
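To give an intuition for how character statistics can separate languages, here is a toy detector that compares character trigrams. It is a deliberately simplified sketch, not how `langdetect` is implemented: the sample sentences, `SAMPLES`, `PROFILES` and `toy_detect` are all hypothetical names and data invented for this example.

```python
def char_trigrams(text: str) -> set[str]:
    """Return the set of lowercase character trigrams in the text."""
    lowered = text.lower()
    return {lowered[i : i + 3] for i in range(len(lowered) - 2)}


# Hypothetical, hand-written sample sentences; real detectors use large corpora
SAMPLES = {
    "hu": "Ez egy magyar mondat, amely az utazás kezdetéről szól.",
    "en": "This is an English sentence about the start of a journey.",
    "pl": "To jest polskie zdanie o początku podróży.",
}
PROFILES = {lang: char_trigrams(sample) for lang, sample in SAMPLES.items()}


def toy_detect(text: str) -> str:
    """Pick the language whose profile shares the most trigrams with the text."""
    grams = char_trigrams(text)
    return max(PROFILES, key=lambda lang: len(grams & PROFILES[lang]))


print(toy_detect("Az ezer mérföldes utazás is egyetlen lépéssel kezdődik."))
```

With profiles this small the result is only reliable for texts that resemble the samples; the point is the mechanism, not the accuracy.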


## Language detection and translation with the HerBERT model

In [36]:
!pip install -q googletrans==4.0.0-rc1 nltk



In [37]:
import googletrans

text = """Szklanka ma u mnie pojemność 250 ml.
Do usmażenia naleśników użyłam szerokiej patelni do naleśników: średnica 24 cm.
Z podanej ilości składników wyszło mi 16 bardzo cienkich naleśników. 
Jajka wyjmij wcześniej z lodówki. Możesz też lekko podgrzać mleko do temperatury nie wyższej niż 40 stopni.

Kalorie policzone zostały na podstawie użytych przeze mnie składników. Jest to więc orientacyjna ilość kalorii, ponieważ nawet mąka może mieć inną ilość kalorii niż ta, której użyłam ja. 

Naleśniki szykuję przynajmniej raz w tygodniu. To danie proste, szybkie i uwielbiane przez dzieci. Możesz je podać na dowolny posiłek w ciągu dnia. Naleśniki możesz zabrać do pracy, do szkoły lub na piknik. Są po prostu boskie!""".replace(
    "\n", " "
)
translator = googletrans.Translator()


def translate(text, dest) -> str:
    return translator.translate(text, dest=dest, src="auto").text


destinations = {
    "en": "english",
    "es": "spanish",
    "de": "german",
    # "lt": "lithuanian",
    "ru": "russian",
    # "hu": "hungarian",
    "it": "italian",
    "fr": "french",
    # "sl": "slovenian",
    "no": "norwegian",
}
translated = []
for des in destinations.keys():
    translated.append(translate(text, des))

In [38]:
translated

['The glass has a capacity of 250 ml for me.To fry pancakes, I used a wide pan to pancakes: diameter 24 cm.I got 16 very thin pancakes from the amount of ingredients given.Take the eggs from the fridge beforehand.You can also lightly heat the milk to a temperature not higher than 40 degrees.Calories were counted on the basis of the ingredients used by me.So this is an approximate amount of calories, because even flour can have a different amount of calories than the one I used.I prepare pancakes at least once a week.It is a simple dish, fast and loved by children.You can serve them for any meal during the day.You can take pancakes to work, school or for a picnic.They are simply divine!',
 'El vidrio tiene una capacidad de 250 ml para mí.Para freír los panqueques, usé una sartén ancha a los panqueques: diámetro de 24 cm.Obtuve 16 panqueques muy delgados por la cantidad de ingredientes dados.Tome los huevos de la nevera de antemano.También puede calentar ligeramente la leche a una temper

In [39]:
import nltk
from nltk.tokenize import word_tokenize
import random

nltk.download("punkt_tab", quiet=True)
random.seed(42)


def safe_tokenize(text, language, verbose=False) -> list[str]:
    try:
        return word_tokenize(text, language=language)
    except Exception:
        if verbose:
            print(
                f"nltk tokenization not possible for language [{language}]... Defaulting to whitespace tokenization"
            )
        return text.split(" ")


tokenized = [
    safe_tokenize(t, lan, verbose=True)
    for t, lan in zip(translated, destinations.values())
]

# replace random tokens with the mask token
mask = "<mask>"


def apply_mask(text: list[str], num_masks=5):
    # note: masks the list in place and returns the same list
    all_indices = set(range(len(text)))  # track unused positions so indexes stay unique
    for _ in range(min(num_masks, len(text))):
        idx = random.choice(list(all_indices))
        text[idx] = mask
        all_indices -= {idx}
    return text


masked_splitted = []
for t in tokenized:
    masked_splitted.append(apply_mask(t))

In [40]:
for masked in masked_splitted:
    print(masked)

['The', 'glass', 'has', 'a', 'capacity', 'of', '<mask>', 'ml', 'for', 'me.To', 'fry', 'pancakes', ',', 'I', 'used', 'a', 'wide', 'pan', 'to', 'pancakes', ':', 'diameter', '24', 'cm.I', 'got', '16', 'very', 'thin', '<mask>', 'from', '<mask>', 'amount', 'of', '<mask>', 'given.Take', 'the', 'eggs', 'from', 'the', 'fridge', 'beforehand.You', 'can', 'also', 'lightly', 'heat', 'the', 'milk', 'to', 'a', 'temperature', 'not', 'higher', 'than', '40', 'degrees.Calories', 'were', 'counted', 'on', 'the', 'basis', 'of', 'the', 'ingredients', 'used', 'by', 'me.So', 'this', 'is', 'an', 'approximate', 'amount', 'of', '<mask>', ',', 'because', 'even', 'flour', 'can', 'have', 'a', 'different', 'amount', 'of', 'calories', 'than', 'the', 'one', 'I', 'used.I', 'prepare', 'pancakes', 'at', 'least', 'once', 'a', 'week.It', 'is', 'a', 'simple', 'dish', ',', 'fast', 'and', 'loved', 'by', 'children.You', 'can', 'serve', 'them', 'for', 'any', 'meal', 'during', 'the', 'day.You', 'can', 'take', 'pancakes', 'to', '
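The masking step above modifies its input list in place. A variant that leaves the input untouched can be sketched with `random.sample`, which draws distinct positions in one call. The function name `mask_tokens`, the `MASK` constant and the `seed` parameter are assumptions for this illustration.

```python
import random

MASK = "<mask>"


def mask_tokens(tokens: list[str], num_masks: int = 5, seed: int = 42) -> list[str]:
    """Return a copy of tokens with up to num_masks distinct positions masked."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    count = min(num_masks, len(tokens))  # never ask for more positions than exist
    positions = rng.sample(range(len(tokens)), count)
    masked = list(tokens)  # copy so the caller's list is not modified
    for idx in positions:
        masked[idx] = MASK
    return masked


tokens = "The glass has a capacity of 250 ml for me".split()
print(mask_tokens(tokens, num_masks=3))
print(tokens)  # the original list is unchanged
```

Returning a copy makes it possible to print the original and the masked tokens side by side, which the in-place version does not allow.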

### Masking, Translation and Recreation of masked tokens

In [41]:
# some additional dependencies
!pip install -q protobuf sacremoses



In [42]:
def fill_all_masks(sentence, mask_pipeline):
    placeholder = "UNIQUE_MASK_PLACEHOLDER"
    while mask in sentence:
        parts = sentence.split(mask, 1)
        rest = parts[1].replace(mask, placeholder)
        sentence_single_mask = parts[0] + mask + rest

        suggestions = mask_pipeline(sentence_single_mask)
        best_token = suggestions[0]["token_str"]

        sentence_filled = sentence_single_mask.replace(mask, best_token, 1)
        sentence = sentence_filled.replace(placeholder, mask)
    return sentence

In [43]:
from transformers import pipeline
from googletrans import Translator

mask_pipeline = pipeline("fill-mask", model="allegro/herbert-base-cased")
translator = Translator()

for foreign_text, lan in zip(translated, destinations.keys()):
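    # translate to HerBERT's native language, Polish
    pl_text = translator.translate(foreign_text, dest="pl").text
    # tokenize; `lan` is a two-letter code (e.g. "en") that NLTK does not
    # recognise, so safe_tokenize falls back to whitespace splitting here
    pl_tokens = safe_tokenize(pl_text, language=lan)
    # apply mask
    masked_tokens = apply_mask(pl_tokens)
    # join the tokens back into a single string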
    masked = " ".join(masked_tokens)
    print("\nMasked: ", masked)

    final_sentence = fill_all_masks(masked, mask_pipeline)
    print("Filled: ", final_sentence)

Device set to use cpu



Masked:  Szkło ma dla mnie pojemność 250 ml. Do smażenia naleśników użyłem <mask> patelni <mask> <mask> średnica 24 cm. I dostałem 16 bardzo cienkich naleśników z ilości podanych składników. Wczoraj jaja z lodówki.Może mieć inną ilość kalorii niż ta, której użyłem. Przynajmniej przygotowuję naleśniki przynajmniej raz w tygodniu. To <mask> danie, <mask> i kochane przez dzieci. Możesz podać je na każdy posiłek w ciągu dnia. Możesz zabrać naleśniki do pracy, szkoły lub na piknik. Są po prostu boskie!
Filled:  Szkło ma dla mnie pojemność 250 ml. Do smażenia naleśników użyłem specjalnej patelni , jej średnica 24 cm. I dostałem 16 bardzo cienkich naleśników z ilości podanych składników. Wczoraj jaja z lodówki.Może mieć inną ilość kalorii niż ta, której użyłem. Przynajmniej przygotowuję naleśniki przynajmniej raz w tygodniu. To ulubione danie, popularne i kochane przez dzieci. Możesz podać je na każdy posiłek w ciągu dnia. Możesz zabrać naleśniki do pracy, szkoły lub na piknik. Są po prostu 

## Recursive translation

In [16]:
original_text = "Als Zweiter Weltkrieg wird der zweite global geführte Krieg sämtlicher Großmächte im 20. Jahrhundert bezeichnet. Über 60 Staaten waren direkt oder indirekt beteiligt, mehr als 110 Millionen Menschen trugen Waffen. Schätzungen zufolge wurden über 65 Millionen Menschen getötet."

In [17]:
destinations = {
    "en": "english",
    "es": "spanish",
    "lt": "lithuanian",
    "ru": "russian",
    "hu": "hungarian",
    "it": "italian",
    "fr": "french",
    "sl": "slovenian",
    "no": "norwegian",
    "de": "german",
}

In [18]:
from googletrans import Translator
from nltk.metrics import edit_distance
import copy

translator = Translator()
translated = copy.copy(original_text)
current_language = "de"
for lan in destinations.keys():
    print(f"Translating from '{current_language}' to '{lan}'")
    translated = translator.translate(translated, dest=lan, src="auto").text
    print(f"    Translated: {translated}")

    distance = edit_distance(
        original_text, translated, substitution_cost=1, transpositions=False
    )
    print(f"    Levenshtein edit distance: {distance}")
    current_language = lan

Translating from 'de' to 'en'
    Translated: The second global war of all major powers in the 20th century was called the Second World War.Over 60 states were directly or indirectly involved, more than 110 million people wore weapons.It is estimated that over 65 million people were killed.
    Levenshtein edit distance: 183
Translating from 'en' to 'es'
    Translated: La Segunda Guerra Global de todas las potencias importantes en el siglo XX se llamó la Segunda Guerra Mundial. Over 60 estados estaban directa o indirectamente involucrados, más de 110 millones de personas llevaban armas. Se estima que más de 65 millones de personas fueron asesinadas.
    Levenshtein edit distance: 207
Translating from 'es' to 'lt'
    Translated: Antrasis visų svarbių XX amžiaus galių karas buvo vadinamas Antrojo pasaulinio karo.Daugiau nei 60 valstijų tiesiogiai ar netiesiogiai dalyvavo, daugiau nei 110 milijonų žmonių nešiojo ginklus.Manoma, kad žuvo daugiau nei 65 milijonai žmonių.
    Levenshtein e

### How far is the translation and the original text?

In [19]:
print(f"Original: \n{original_text}")
print(f"Translated: \n{translated}")
distance = edit_distance(
    original_text, translated, substitution_cost=1, transpositions=False
)
print(f"\nLevenshtein edit distance: {distance}")

Original: 
Als Zweiter Weltkrieg wird der zweite global geführte Krieg sämtlicher Großmächte im 20. Jahrhundert bezeichnet. Über 60 Staaten waren direkt oder indirekt beteiligt, mehr als 110 Millionen Menschen trugen Waffen. Schätzungen zufolge wurden über 65 Millionen Menschen getötet.
Translated: 
Der zweite Krieg für alle wichtigen Mächte in den 1900er Jahren wurde als Zweiten Weltkrieg bezeichnet. Hos 110 Millionen Menschen wurden direkt oder indirekt gemacht.

Levenshtein edit distance: 176


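The distance is high (176 edits), which means the text changed substantially on its way through the chain of translations. The final German text is also much shorter than the original, so part of the content was lost along the way.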
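A raw edit distance is hard to interpret without the text length, so it can help to normalize it. The sketch below implements the plain Levenshtein distance (insertions, deletions and substitutions, each with cost 1, no transpositions) and divides it by the length of the longer string. The function names are assumptions for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Scale the distance to 0..1 by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))
print(f"{normalized_distance('kitten', 'sitting'):.2f}")
```

A normalized value near 0 means the texts are almost identical, while a value near 1 means nearly every character had to change.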

## Translating pdf file

In [20]:
!pip install -q PyPDF2 fpdf



In [21]:
import PyPDF2
from fpdf import FPDF
from googletrans import Translator


def extract_text(pdf_path):
    with open(pdf_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        full_text = ""
        for page in reader.pages:
            full_text += page.extract_text() + "\n"
    return full_text


def translate(text, translate_to="de"):
    translator = Translator()
    return translator.translate(text, dest=translate_to, src="en").text


# 3. Save text to PDF using fpdf (simpler than reportlab)
def save_text_to_pdf(text, output_path):
    pdf = FPDF()
    pdf.add_page()
    pdf.set_auto_page_break(auto=True, margin=15)
    pdf.set_font("Arial", size=12)
    for line in text.split("\n"):
        pdf.cell(0, 10, txt=line, ln=True)
    pdf.output(output_path)


# Usage
input_pdf = "/kaggle/input/translation/EN-rival600-manu.pdf"
output_pdf = "/DE-rival600-manu.pdf"

text = extract_text(input_pdf)
translated_text = translate(text)
save_text_to_pdf(translated_text, output_pdf)

Comparing the output with the original file shows that all styling was lost; only the extracted plain text is carried over.

## Translation with context awareness and large documents handling

In [22]:
# [x] Find large document
# [x] Load document
# [x] Read full text
# [x] Merge text
# [x] Split into 10-sentence chunks and save 3 previous and 3 following sentences as context
# [x] Use mBART to translate with surrounding context
# [] Save results in 2 markdown files. One for each language. Each sentence in one
from nltk.tokenize import PunktSentenceTokenizer


class Chunk:
    def __init__(self, core, previous="", following=""):
        self.previous: str = previous
        self.core: str = core
        self.following: str = following

    @staticmethod
    def from_raw_text(corpus: str, chunk_size: int, context_size: int):
        # Train Punkt on the corpus itself, then split the corpus into sentences
        tokenizer = PunktSentenceTokenizer(corpus)
        sentences = tokenizer.tokenize(corpus)

        chunks: list["Chunk"] = []

        # Step through the sentences chunk_size at a time; the last chunk may be shorter
        for start_idx in range(0, len(sentences), chunk_size):
            end_idx = start_idx + chunk_size
            core = sentences[start_idx:end_idx]
            # Clamp the slices so the context never wraps around or runs past the text
            previous_context = sentences[max(0, start_idx - context_size) : start_idx]
            following_context = sentences[end_idx : end_idx + context_size]

            chunks.append(
                Chunk(
                    Chunk.join_str(core),
                    Chunk.join_str(previous_context),
                    Chunk.join_str(following_context),
                )
            )

        return chunks

    def __repr__(self) -> str:
        return f"{60*'='}\nCORE: {self.core}\nPREVIOUS: {self.previous}\nFOLLOWING: {self.following}\n{60*'='}"

    @staticmethod
    def join_str(splitted: list[str]):
        return " ".join(splitted)

In [None]:
from pathlib import Path

text = Path("ostatnie_życzenie.txt").read_text(encoding="utf-8")

text_chunks = Chunk.from_raw_text(text, 3, 2)
text_chunks[10:12]

 CORE: Obcy, ciągle w
 płaszczu, stał przed szynkwasem sztywno, nieruchomo, milczał. – Co podać? – Piwa – rzekł nieznajomy.
 PREVIOUS: Karczma nie miała najlepszej sławy. Karczmarz uniósł głowę znad beczki kiszonych ogórków i zmierzył gościa wzrokiem.
 FOLLOWING: Głos miał nieprzyjemny. Karczmarz wytarł ręce o płócienny fartuch i
 napełnił gliniany kufel.
 CORE: Głos miał nieprzyjemny. Karczmarz wytarł ręce o płócienny fartuch i
 napełnił gliniany kufel. Kufel był wyszczerbiony.
 PREVIOUS: – Co podać? – Piwa – rzekł nieznajomy.
 FOLLOWING: Nieznajomy nie był stary, ale włosy miał prawie zupełnie białe. Pod płaszczem nosił wytarty
 skórzany kubrak, sznurowany pod szyją i na ramionach.
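The chunking idea is easier to inspect on a toy input. The sketch below applies the same scheme (fixed-size core, a few sentences of context on each side) to a plain list of placeholder sentences. `chunk_with_context` and `toy_sentences` are hypothetical names used only here, and the sentence splitting itself is skipped.

```python
def chunk_with_context(sentences: list[str], chunk_size: int, context_size: int):
    """Yield (previous, core, following) sentence lists for each chunk."""
    for start in range(0, len(sentences), chunk_size):
        end = start + chunk_size
        # max(0, ...) keeps a negative index from wrapping to the end of the list
        previous = sentences[max(0, start - context_size) : start]
        core = sentences[start:end]
        following = sentences[end : end + context_size]
        yield previous, core, following


toy_sentences = [f"S{i}." for i in range(1, 9)]  # S1. to S8.
for previous, core, following in chunk_with_context(toy_sentences, 3, 2):
    print(f"PREVIOUS={previous} CORE={core} FOLLOWING={following}")
```

With 8 sentences, a chunk size of 3 and a context size of 2, the last chunk holds only the two leftover sentences and has no following context.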

In [24]:
import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
os.environ["TOKENIZERS_PARALLELISM"] = "true"

In [None]:
from transformers import AutoTokenizer, MBartForConditionalGeneration
import uuid
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.src_lang = "pl_XX"


def prepare_prompt(chunk: Chunk, start_id: str, end_id: str) -> str:
    return f"Kontekst początkowy: {chunk.previous} Tekst: {start_id}{chunk.core}{end_id} Kontekst końcowy: {chunk.following}"


def translate_chunks_in_batch(chunks: list[Chunk], batch_size: int = 8) -> list[str]:
    results = []

    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]

        # create unique id per chunk
        batch_ids = [(uuid.uuid4(), uuid.uuid4()) for _ in batch]
        prompts = [
            prepare_prompt(chunk, str(start_id), str(end_id))
            for chunk, (start_id, end_id) in zip(batch, batch_ids)
        ]

        inputs = tokenizer(
            prompts, return_tensors="pt", padding=True, truncation=True, max_length=1024
        ).to(device)

        generated = model.generate(
            **inputs, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]
        )
        decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)

        for text, (start_id, end_id) in zip(decoded, batch_ids):
            start_marker = str(start_id).lower()
            end_marker = str(end_id).lower()
            lower_text = text.lower()
            start = lower_text.find(start_marker)
            end = lower_text.find(end_marker)
            if start == -1 or end == -1:
                # The model dropped or altered a marker: keep the whole output
                results.append(text)
            else:
                results.append(text[start + len(start_marker) : end])

    return results

**I tested several variants with multithreading, multiprocessing and batching; batch translation turned out to be the best approach.**

In [26]:
text_chunks = text_chunks[:50]  # keep only the first 50 chunks
translations = translate_chunks_in_batch(text_chunks, batch_size=8)

In [27]:
originals = [c.core for c in text_chunks]

with open("translation_results.md", "w", encoding="utf-8") as f:
    for pl, en in zip(originals, translations):
        f.write(f"### PL:\n{pl.strip()}\n\n")
        f.write(f"### EN:\n{en.strip()}\n\n")
        f.write("---\n\n")

## Translation with tone adjustment and sentiment analysis

In [28]:
formal_letter = """Dear Sir or Madam,
I am writing to formally express my interest in the Data Analyst position at your esteemed organization. With a solid academic background in data science and hands-on experience in statistical modeling, data visualization, and machine learning, I believe I possess the qualifications necessary to contribute meaningfully to your team. My previous role involved designing and implementing data pipelines, generating actionable insights, and supporting key business decisions.
I have developed a strong proficiency in tools such as Python, SQL, and Tableau, and I am confident in my ability to adapt quickly to new systems and workflows. I am highly motivated, detail-oriented, and committed to delivering high-quality analytical support.
I would welcome the opportunity to further discuss how my skills and experience align with your organization’s goals. Please feel free to contact me at your earliest convenience should you require any additional information or documentation.
Thank you for your time and consideration."""

In [None]:
!pip install -q huggingface_hub vllm torchvision transformers

In [None]:
from huggingface_hub import login
import os

token = os.getenv("HF_TOKEN")

login(token)

In [None]:
!pip install -q vllm

In [40]:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "speakleash/Bielik-4.5B-v3.0-Instruct-FP8-Dynamic"

sampling_params = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=4096)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": "Jesteś tłumaczem z języka angielskiego na język polski. Użytkownik poda ci tekst po angielsku a ty masz odpowiedzieć pretłumaczonym tekstem. Tłumaczenie ma uwzględniać ton. Ton zostanie podany przez użytkownika po słowie kluczowym TON.",
    },
    {
        "role": "user",
        "content": "TEXT: 'I would be most grateful if you could kindly inform me at your earliest convenience regarding the status of my application.' TON: nieformalny",
    },
    {
        "role": "assistant",
        "content": "Będę wdzięczny, jeśli dasz mi znać, co z moją aplikacją, jak tylko będziesz mógł.",
    },
    {"role": "user", "content": f"TEXT: '{formal_letter}' TON: skrajnie nieformalny"},
]

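# add_generation_prompt=True appends the assistant turn header so the model answers the last user message
prompts = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)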

llm = LLM(model=model_id, max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

# Can't use 4.5B FP8 Dynamic due to the error: RuntimeError: ('Quantization scheme is not supported for ', 'the current GPU. Min capability: 80. ', 'Current capability: 75.')

INFO 07-04 16:20:44 [__init__.py:244] Automatically detected platform cuda.


ImportError: /usr/local/lib/python3.11/dist-packages/vllm/_C.abi3.so: undefined symbol: _ZN3c106ivalue14ConstantString6createENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model_name = "speakleash/Bielik-4.5B-v3-Instruct"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
)

messages = [
    {
        "role": "system",
        "content": "Odpowiadaj krótko, precyzyjnie i wyłącznie w języku polskim.",
    },
    {"role": "user", "content": "Jakie mamy pory roku w Polsce?"},
    {
        "role": "assistant",
        "content": "W Polsce mamy 4 pory roku: wiosna, lato, jesień i zima.",
    },
    {"role": "user", "content": "Która jest najcieplejsza?"},
]

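# add_generation_prompt=True appends the assistant turn header so the model answers the last user message
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)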

model_inputs = input_ids.to(device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

I tried to install several Bielik versions, but none of them worked. There is a dependency conflict between the CUDA version vLLM was built against and the CUDA version of the installed torch. What's more, the Hugging Face access token seems unreliable: sometimes it allows downloading the Bielik model and sometimes authentication fails.
Given that, I will move on to the next task.


## Statistical machine translation (SMT) and performance comparison

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.translate import AlignedSent, IBMModel1
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from transformers import pipeline

try:
    nltk.data.find("tokenizers/punkt")
except LookupError:  # nltk.data.find raises LookupError when the resource is missing
    nltk.download("punkt")


corpus_en = [
    "i like mac operating system very much",
    "cs2 is my favourite game",
    "i love this dog, this is my wife",
]
corpus_pl = [
    "bardzo lubię system operacyjny mac",
    "cs2 to moja ulubiona gra",
    "kocham tego psa, a to jest moja żona",
]


aligned_corpus = [
    AlignedSent(word_tokenize(pl), word_tokenize(en))
    for en, pl in zip(corpus_en, corpus_pl)
]

In [91]:
for aligned in aligned_corpus:
    print(aligned)

<AlignedSent: 'bardzo lubię system ...' -> 'i like mac operating...'>
<AlignedSent: 'cs2 to moja ulubiona...' -> 'cs2 is my favourite ...'>
<AlignedSent: 'kocham tego psa , a ...' -> 'i love this dog , th...'>


In [92]:
ibm1 = IBMModel1(aligned_corpus, 20)
translation_table = ibm1.translation_table

test_sentence_en = "i like mac operating system very much"
tokenized_test = word_tokenize(test_sentence_en.lower())

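# NOTE: with AlignedSent(pl, en), NLTK keys translation_table by the target
# (Polish) word first: translation_table[pl_word][en_word]. Looking up English
# tokens here therefore only hits words present in both vocabularies
# (e.g. "mac", "system"); every other token is copied through unchanged.
smt_translation_tokens = []
for token in tokenized_test:
    if token in translation_table:
        best_match = max(
            translation_table[token].keys(), key=lambda k: translation_table[token][k]
        )
        smt_translation_tokens.append(best_match)
    else:
        smt_translation_tokens.append(token)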

smt_translation = " ".join(smt_translation_tokens)
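In NLTK's IBM Model 1 the translation table is indexed as `translation_table[target_word][source_word]`, so translating an English token into Polish means searching over the outer (Polish) keys. The sketch below shows that lookup on a small hand-made table; `toy_table` and its probabilities are invented for illustration, and `best_target_word` is a hypothetical helper, not part of NLTK.

```python
# Hypothetical probabilities t(pl_word | en_word), laid out like
# translation_table: outer key = target (Polish), inner key = source (English)
toy_table = {
    "lubię": {"like": 0.7, "i": 0.2},
    "bardzo": {"very": 0.6, "much": 0.3},
    "system": {"system": 0.8, "operating": 0.1},
    "operacyjny": {"operating": 0.7},
}


def best_target_word(source_word: str, table: dict[str, dict[str, float]]) -> str:
    """Return the target word with the highest probability for source_word."""
    candidates = {
        target: probs[source_word]
        for target, probs in table.items()
        if source_word in probs
    }
    # Unknown source words are passed through unchanged
    return max(candidates, key=candidates.get) if candidates else source_word


print([best_target_word(w, toy_table) for w in ["like", "operating", "mac"]])
```

A word-by-word lookup like this still ignores word order, so even with the lookup in the right direction the output would follow English word order.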

In [93]:
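# transformer translator
# NOTE: opus-mt-pl-en translates Polish to English, so feeding it the English
# test sentence mostly copies the input; a fair comparison would need an
# English-to-Polish model
translator_nmt = pipeline(
    "translation", model="Helsinki-NLP/opus-mt-pl-en", device=device
)  # use device=-1 for CPU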
nmt_output = translator_nmt(test_sentence_en)
nmt_translation = nmt_output[0]["translation_text"]

Device set to use cuda


In [94]:
reference_translation = "bardzo lubię system operacyjny mac"
reference_tokens = [word_tokenize(reference_translation.lower())]

smoothing = SmoothingFunction().method1
bleu_smt = sentence_bleu(
    reference_tokens,
    smt_translation_tokens,
    weights=(0.5, 0.5),
    smoothing_function=smoothing,
)
bleu_nmt = sentence_bleu(
    reference_tokens,
    word_tokenize(nmt_translation.lower()),
    weights=(0.5, 0.5),
    smoothing_function=smoothing,
)

print(f"Source sentence:        '{test_sentence_en}'")
print(f"Reference translation:  '{reference_translation}'\n")

print("Statistical translation")
print(f"Result: '{smt_translation}'")
print(f"Quality (BLEU): {bleu_smt:.4f}\n")

print("Transformer translation")
print(f"Result: '{nmt_translation}'")
print(f"Quality (BLEU): {bleu_nmt:.4f}")

Source sentence:        'i like mac operating system very much'
Reference translation:  'bardzo lubię system operacyjny mac'

Statistical translation
Result: 'i like like operating like very much'
Quality (BLEU): 0.0000

Transformer translation
Result: 'i like mac operating system very much'
Quality (BLEU): 0.0690
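BLEU is built on clipped (modified) n-gram precision, and computing that part by hand helps explain the scores above. The sketch below is a minimal standard-library version of that one ingredient, not a full BLEU implementation: it leaves out smoothing and the brevity penalty, and the function names are assumptions for this example.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Modified n-gram precision: candidate counts are clipped by reference counts."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


reference = "bardzo lubię system operacyjny mac".split()
candidate = "i like mac operating system very much".split()
print(clipped_precision(candidate, reference, 1))  # only "mac" and "system" match
print(clipped_precision(candidate, reference, 2))  # no bigram matches
```

The transformer output shares just two unigrams and no bigrams with the reference, which appears consistent with its small non-zero BLEU: the score comes from loanwords that are spelled the same in both languages, not from an actual translation.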


### Final thoughts
During the exercises I found that choosing a translation method is usually a tradeoff between quality, speed and volume. Probably the most primitive method is SMT (Statistical Machine Translation), which works really fast but does not capture semantics and is not very flexible.\
Translation with transformers depends heavily on the model used, how it was trained and for which languages. For instance, the mBART model can handle multiple languages, but it is heavy and really slow, and it can process only 1024 tokens at once. Transformers designed for a single language pair, like _"Helsinki-NLP/opus-mt-pl-en"_, have richer vocabularies (not in general, but for those 2 languages) and can handle semantic relations while being relatively light on hardware.\
Using LLMs like Bielik is not ideal, especially when they are installed locally rather than used via an API. Even though they can handle many tokens, they can be very heavy, often need additional configuration and require many dependencies.\
My personal favourite is the solution from Google. The googletrans library is fairly small, the text does not require any tokenization or cleaning, it is fast and it covers lots of languages.

The quality of a translation often can't be easily measured and requires human evaluation.