Author: Mikołaj Nowak 151813


# Poem Generator in Polish

The problem I wanted to solve is a Polish poetry generator. It may be difficult to find commercial applications for it, but certain ideas and tools I used to solve this problem could certainly find applications in the broader field of Generative AI.

From my experience, language models (as the name suggests) handle language tasks quite well, but they struggle a bit more with modeling analogies and strictly following predefined rules. Their weaknesses are evident in domains such as mathematics or reasoning. If it were possible to compel a model to generate rhymes, thirteen-syllable verses, and middles, the same mechanisms could be applied to represent, for example, arithmetic rules in a knowledge base without relying on contexts but on general knowledge.

## 1. Training data and preprocessing

I decided to use the entirety of "Pan Tadeusz" as the test data. Other works have slightly different structures and types of rhymes, plus all 13 books provide a substantial amount of data.
It would also be useful to have a function for basic text processing, such as removing punctuation marks.

In [5]:
def remove_punctuation(word):
    return word.translate(str.maketrans('', '', '.,!?:;*«»')).replace('\n', ' ')

example = "This is. a, wei?!rd te,xt with ...:;punct,,u.a!t!ion characters..."
print(remove_punctuation(example))

This is a weird text with punctuation characters


Since we want to create thirteen-syllable verses, we should have a function that divides words into syllables. First, we'll define a helper function to correct syllables generated by regular expressions.

In [6]:
def count_consonants_at_end(syllable, vowels="aąeęiouyó"):
    consonants = 0
    for char in reversed(syllable):
        if char not in vowels:
            consonants += 1
        else:
            break
    return consonants

Now let's define the actual function for syllable division.

In [7]:
import re

def syllabify(word):
    word = word.lower()  # Convert to lowercase
    word = remove_punctuation(word)
    cluster_placeholders = {
        "ch": "0",
        "cz": "1",
        "dz": "2",
        "dź": "3",
        "dż": "4",
        "rz": "5",
        "sz": "6"
    }
    vowel_clusters = ["ię", "ie", "iu", "ia", "io", "ią", "ii"]

    # Replace vowel clusters with a placeholder
    # Ensure thathe letter "i" in "ię", "ie" etc. is not reated like a singular vowel
    for cluster in vowel_clusters:
        word = word.replace(cluster, 'I' + cluster[1])
    # Replace consonant clusters with unique placeholders to ensure that there are no two-letter consonants
    for cluster, placeholder in cluster_placeholders.items():
        word = word.replace(cluster, placeholder)

    # Regular expression to find syllables
    syllables = re.findall(r'[^aąeęiouyó]*[aąeęiouyó][^aąeęiouyóI]*', word)

    # Restore vowel clusters
    syllables = [syllable.replace('I', 'i') for syllable in syllables]

    # Adjust syllables by moving half of the consonants from the end of each syllable to the beginning of the next
    for i in range(len(syllables) - 1):
        found_consonants = count_consonants_at_end(syllables[i])
        consonants_to_move = (found_consonants+1)//2
        syllables[i+1] = syllables[i][len(syllables[i])-consonants_to_move:] + syllables[i+1]
        syllables[i] = syllables[i][:len(syllables[i])-consonants_to_move]

    # Restore consonant clusters
    for cluster, placeholder in cluster_placeholders.items():
        syllables = [syllable.replace(placeholder, cluster) for syllable in syllables]

    return syllables

# Example usage
text = "Litwo! Ojczyzno moja! ty jesteś jak zdrowie Ile cię trzeba cenić, ten tylko się dowie Kto cię stracił. Dziś piękność twą w całej ozdobie Widzę i opisuję, bo tęsknię po tobie."
words = text.split()

for word in words:
    print(f"{word}: {syllabify(word)}")


Litwo!: ['lit', 'wo']
Ojczyzno: ['oj', 'czyz', 'no']
moja!: ['mo', 'ja']
ty: ['ty']
jesteś: ['jes', 'teś']
jak: ['jak']
zdrowie: ['zdro', 'wie']
Ile: ['i', 'le']
cię: ['cię']
trzeba: ['trze', 'ba']
cenić,: ['ce', 'nić']
ten: ['ten']
tylko: ['tyl', 'ko']
się: ['się']
dowie: ['do', 'wie']
Kto: ['kto']
cię: ['cię']
stracił.: ['stra', 'cił']
Dziś: ['dziś']
piękność: ['pięk', 'ność']
twą: ['twą']
w: []
całej: ['ca', 'łej']
ozdobie: ['oz', 'do', 'bie']
Widzę: ['wi', 'dzę']
i: ['i']
opisuję,: ['o', 'pi', 'su', 'ję']
bo: ['bo']
tęsknię: ['tęs', 'knię']
po: ['po']
tobie.: ['to', 'bie']


Since we can syllabify words now, it would be useful to find their application in generating verses. Therefore, we will also need functions for counting syllables and for checking how much two words rhyme.

In [8]:
def count_syllables(word):
    return len(syllabify(word))


def rhyme_factor(word1, word2):
    syllables1 = syllabify(word1)
    syllables2 = syllabify(word2)
    # Get the last two syllables of each word (if they exist)
    lastsyllable1 = syllables1[-1] if len(syllables1) > 0 else 'xxx'
    lastsyllable2 = syllables2[-1] if len(syllables2) > 0 else 'xxx'
    beforelastsyllable1 = syllables1[-2] if len(syllables1) > 1 else 'xxx'
    beforelastsyllable2 = syllables2[-2] if len(syllables2) > 1 else 'xxx'
    # Remove consonants only from the beginning of the second last syllable
    beforelastsyllable1 = beforelastsyllable1.lstrip('bcdfghjklmnpqrstvwxyz')
    beforelastsyllable2 = beforelastsyllable2.lstrip('bcdfghjklmnpqrstvwxyz')
    # Combine the last two syllables into endings
    ending1 = beforelastsyllable1 + lastsyllable1
    ending2 = beforelastsyllable2 + lastsyllable2
    # Calculate rhyme factor
    min_length = min(len(ending1), len(ending2))
    matching_count = sum(1 for c1, c2 in zip(ending1[::-1], ending2[::-1]) if c1 == c2)
    return (matching_count / min_length) if min_length > 0 else 0


# Example usage
word1 = "trzeba"
word2 = "nieba"

print(f"Number of syllables in '{word1}': {count_syllables(word1)}")
print(f"Calculating rhyme factor between '{word1}' and '{word2}': {rhyme_factor(word1, word2):.2f}")


Number of syllables in 'trzeba': 2
Calculating rhyme factor between 'trzeba' and 'nieba': 1.00


## 2. Word N-grams in generating poems

Let's see if we can generate poems using only n-grams. If, after loading the entire "Pan Tadeusz," we manage to create a word database that consistently generates meaningful text, and additionally, in each line, we track the number of syllables and compel the model to rhyme the last one, then we'll have a fully functional poem generator. <br>
Let's start by creating a helper function to split text into n-grams.

In [9]:
def get_word_ngrams(data, n_gram_len):
    data = remove_punctuation(data.lower())
    words = data.split(' ')
    ngrams = []
    for i in range (len(words) - n_gram_len + 1):
        ngram = []
        for j in range(n_gram_len):
            ngram.append(words[i+j])
        ngrams.append(ngram)
    return ngrams
print(get_word_ngrams("Litwo! Ojczyzno moja! ty jesteś jak zdrowie", 3))

[['litwo', 'ojczyzno', 'moja'], ['ojczyzno', 'moja', 'ty'], ['moja', 'ty', 'jesteś'], ['ty', 'jesteś', 'jak'], ['jesteś', 'jak', 'zdrowie']]


And then let's create a function for generating Markov chains.

In [10]:
from collections import Counter

def generate_ngram_markov(n_gram_len):
    markov_dict = dict()  # Create a dictionary that will map a context (sequence of n-1 words) to a list of allowed next words observed after that context.
    with open("pan_tadeusz.txt", 'r', encoding="utf8") as f:  # Read the data corpus.
        data = f.read().lower()  # Convert all uppercase letters to lowercase.
        data = remove_punctuation(data) #Preprocessing
        n_grams = get_word_ngrams(data, n_gram_len)  # Generate all word n-grams from the corpus.
        for n_gram in n_grams:  # For each n-gram...
            context = " ".join(n_gram[:-1])  # Take all words from the n-gram except the last one and join them into a single string separated by spaces.
            last_word = str(n_gram[-1])  # Take the last word of the n-gram.

            if context not in markov_dict.keys():  # If the context without the last word does not exist in the dictionary yet.
                markov_dict[context] = list()  # Add it to the dictionary and create a list for it.
            markov_dict[context].append(last_word)  # Knowing that the context is in the dictionary, append the last word to the list.

    for context in markov_dict.keys():  # For each context (n-1 word sequence).
        markov_dict[context] = Counter(markov_dict[context])  # Create a word histogram for words appearing after this context in the corpus.

    return markov_dict

In [11]:
import random
import itertools


n_gram_len = 2  # Number of words to form an n-gram.
markov_dict = generate_ngram_markov(n_gram_len)  # Create a dictionary with word histograms for each context.

text = "szlachta"  # Text to start generating from.
generated = text
count = 0
line = 1
count += sum(count_syllables(word) for word in text.split())
to_rhyme = ""
print(count)

for i in range(500):  # Repeat 500 times...
    text_spl = text.split(" ")  # Split the existing text by space (perform a naive tokenization).
    context = " ".join(text_spl[-n_gram_len+1:])  # Get the last n_gram_len - 1 words.
    if(line == 1):
        idx = random.randrange(sum(markov_dict[context].values()))  # Check which words are allowed as successors to our context and choose one of them randomly according to the distribution created by the histogram.
        new_word = next(itertools.islice(markov_dict[context].elements(), idx, None))  # Choose the randomly selected word.
        generated = generated + new_word + " "
        count+=count_syllables(new_word)
        if(count > 13):
            generated+=" \n"
            line = 2
            to_rhyme = new_word
            count = 0
    if(line == 2):
        best_word = None
        best_rhyme = 0
        if(count > 10):
            for word, count in markov_dict[context].items():
                rhyme = rhyme_factor(word, to_rhyme)
                if(rhyme > best_rhyme and word!=to_rhyme):
                    best_word = word
                    best_rhyme = rhyme
            if(best_rhyme > 0.0):
                new_word = best_word
            else:
                idx = random.randrange(sum(markov_dict[context].values()))
                new_word = next(itertools.islice(markov_dict[context].elements(), idx, None))
            line = 1
            count = 0
            generated = generated + new_word + " \n"
            to_rhyme = ""
        else:
            idx = random.randrange(sum(markov_dict[context].values()))
            new_word = next(itertools.islice(markov_dict[context].elements(), idx, None))  
            generated = generated + new_word + " "
        count+=count_syllables(new_word)
    text = text + " " + new_word  # Append the chosen word at the end.
    
print(generated)


2
szlachtaże pierwszy komendant on odpowie majorze  
a byłem ci ojcem (mówiąc podkomorzemu jego 
zamek niech pan bóg mieczem rozciął pysk na łożu  
łaskawym chlebie nie miałem zachowanie u 
szlachty major — jego pamięć i srogość urzędów  
urzędów byłam za tą panią ciotką z 
tyłu jakby w plebanii świéce biega od sztućca  
scyzoryka   tam staje pobladły drżący i 
ziemianinowi ustępować z obrusa poległ  
poległ jeden drugiego zachęca dobrzyńscy ja 
zawsze miłą konewkę swój rydwan orły białe  
nasze lance pyta się z drugiej maciej stary 
ozwały się na głowni rapiera patrzy  
patrzy śmiele walczy zawołał cydzik — 
jutro o przyczynę tak świeżej niepomna przysięgi  
przysięgi o filary ten pan rejent i 
pusta rzekłbyś że mnie cudem (gdy od rana pisał  
  ach ja pewną dziewczynkę widziałem w wilnie na 
nic a na miejscu nieruchomy schyliwszy głowy  
głowę mu z niezwyczajnej ich karków rozpuszczają grzywy 
wstaje i czarny baran lub córki choć się  
się snopy zboża malowane na wszystko strwonił 
c

Even when loading over 17GB of Polish text, the n-gram approach generates poor text. We could increase the length of the n-grams and use "Pan Tadeusz" as the training data, but then we're merely talking about copying the input text, not Generative AI. The n-gram approach is definitely not suitable for generating poems.

## 3. Training Pre-built Models

Considering that the n-gram approach completely failed, we should definitely employ a pre-built model for this task.

In [1]:
from transformers import pipeline

generator = pipeline("text-generation", model="ai-forever/mGPT-13B", num_beams=1, temperature=1.0)
generated_text = generator("Mały chłopiec", truncation=True, max_length=50, num_return_sequences=1)
for i, text in enumerate(generated_text):
  print(i + 1, ":", text['generated_text'])


Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

  return self.fget.__get__(instance, owner)()


In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, Trainer, TrainingArguments

# Define model and tokenizer names
model_name = "ai-forever/mGPT-13B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load and preprocess text data (replace with your actual file path)
with open("pan_tadeusz.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Preprocess the text (e.g., cleaning, splitting into chunks)
def preprocess_text(text):
    # Replace this with your specific cleaning/splitting steps
    text = text.lower()  # Convert to lowercase (example pre-processing)
    sentences = text.split(". ")  # Split into sentences (example pre-processing)
    return sentences

sentences = preprocess_text(text)

# Prepare training data (tokenization and padding)
data = tokenizer(sentences, return_tensors="pt", truncation=True, padding="max_length")

# Define training arguments (adjust parameters as needed)
training_args = TrainingArguments(
    output_dir="./fine-tuned_model",
    overwrite_output_dir=True,
    per_device_train_batch_size=4,  # Adjust batch size based on your GPU memory
    save_steps=10_000,
    num_train_epochs=3,  # Adjust training epochs as needed
)

# Create model and trainer
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=data,
)

# Train the model (might take a while depending on GPU)
trainer.train()

print("Model fine-tuned successfully!")

# Load the fine-tuned model and tokenizer
fine_tuned_model = AutoModelForSeq2SeqLM.from_pretrained("./fine-tuned_model")
fine_tuned_tokenizer = AutoTokenizer.from_pretrained("./fine-tuned_model")

# Text generation function
def generate_text(model, tokenizer, prompt, max_length=50, num_return_sequences=1):
    """Generates text using the fine-tuned model."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    generated_outputs = model.generate(
        input_ids=input_ids,
        max_length=max_length,
        num_return_sequences=num_return_sequences,
        )
    decoded_texts = tokenizer.batch_decode(generated_outputs, skip_special_tokens=True)
    return decoded_texts

# Generate text similar to Pan Tadeusz
prompt = "Mały chłopiec"
generated_texts = generate_text(fine_tuned_model, fine_tuned_tokenizer, prompt)

# Print generated text
for i, text in enumerate(generated_texts):
    print(i + 1, ":", text)

I tested many models capable of generating text in Polish. Downloading the models was a significant time burden, and unfortunately, that's where I lost the most time. The most crucial feature this model needed to possess was high creativity and the ability to generate diverse texts. Some models, even when the temperature was set to the maximum, kept generating the same text. A common problem with all the models I tested was the awareness of context. The longer the poem, the more the model would forget what it was writing about. Therefore, I decided not to generate poems longer than 8 lines.

## 4. Comparision with other available models

In [30]:
import json
from openai import OpenAI

# Load the API key from the apiKeys.json file
with open('apiKeys.json') as f:
    api_keys = json.load(f)

# Set your OpenAI API key
openai_api_key = api_keys['openai']['api_key']
client = OpenAI(api_key=openai_api_key)

# Define the question
question = "Wygeneruj krótki rymowany wiersz w stylu Adama Mickiewicza. Powinien być trzynastozgłoskowcem i składać się z 8 wersów"

# Define a list of language models
language_models = [
    "gpt-3.5-turbo",
    "gpt-4",
    "gpt-4o",
]

# Loop through each language model
for model in language_models:
    # Generate text using the specified model and question
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        stream=True,
    )
    print(f"Model: {model}")
    for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="")
    print("\n")


Model: gpt-3.5-turbo
W polu szumiącym brzozowym,
Pod niebem błękitnym wiosennym,
Głos ptaków słychać wszędzie,
Wiatr przepływa jak wiewiórka.
Na łące słońce świeci,
A ja tam idę wolno,
W mym sercu radość rośnie,
Bo tu jestem, gdzie kocham.

Model: gpt-4
Gdzieś między morzem a stromymi wzgórzami,
Gdzie kwitną jabłonie złote jak korony,
Wróciłam do domu, wśród dawnych drzew starych,
Serca pełne tęsknoty, oczu pełne łez i skarg.

Niebo jak aksamit, pachnący jasmin krajobraz,
Wszystko jak dawniej, lecz serca już nie ma.
Kyoły śpiewa nucę stary, polski zdrój,
Cichy, jak łza co spływa po policzku mój.

Model: gpt-4o
O zmierzchu nad wodą, gdzie wierzby szepcą cicho,
Młodzieńca serce płacze, lecz dusza pełna licho.
Wśród cieni ukochanej, widmo błąka się śpiesznie,
Miłość już utracona, choć serce bije grzesznie.

Na skrzydłach pieśń niesiona, do gwiazd się wzlatuje,
Choć los nam sprzyjał kiedyś, dziś już nie raduje.
W blasku księżyca płomień, co w nocy migoce,
Tak naszą miłość wspomni, jak rosa

The GPT-4o model is by far the best for the vast majority of tasks related to NLP. <br>
And this time it significantly outperforms GPT-3.5 and GPT-4. The verses rhyme in pairs, and the middle rhyme is visible. <br>
However, sometimes there are trivial rhymes, and the number of syllables in each verse only fluctuates around 13.

## 5. Summary
For generating poetry, I would definitely use the GPT-4o model provided by the OpenAI API. However, as of now (June 13, 2024), it is likely the best language model in the world for most tasks. However, it is not entirely publicly available, and we have to pay for token generation. If we absolutely need full control, we can try using publicly available models, but we must expect lower quality.