1 Introduction:
In this assignment, you will use n-gram language modeling to generate poetry. For the purpose of this assignment, a poem consists of three stanzas, each containing four verses, where each verse consists of 7 to 10 words. For example, the following is a manually written stanza.

دل سے نکال یاس کہ زندہ ہوں میں ابھی،

ہوتا ہے کیوں اداس کہ زندہ ہوں میں ابھی،

مایوسیوں کی قید سے خود کو نکال کر،

آ جاؤ میرے پاس کہ زندہ ہوں میں ابھی،

آ کر کبھی تو دید سے سیراب کر مجھے،

مرتی نہیں ہے پیاس کہ زندہ ہوں میں ابھی،

مہر و وفا خلوص و محبت گداز دل،

سب کچھ ہے میرے پاس کہ زندہ ہوں میں ابھی،

لوٹیں گے تیرے آتے ہی پھر دن بہار کے،

رہتی ہے دل میں آس کہ زندہ ہوں میں،

نایاب شاخ چشم میں کھلتے ہیں اب بھی خواب،

سچ ہے ترا قیاس کہ زندہ ہوں میں ابھی

The task is to print three such stanzas with an empty line between them. The generation model can be trained on the provided Poetry Corpus, which contains poems by Faiz, Ghalib, and Iqbal. You can also scrape other Urdu poetry from the internet. You will train unigram and bigram models on this corpus and use them to generate poetry.

2 Assignment Task:

The task is to generate a poem using different models. We will generate a poem verse by verse until all stanzas have been generated. The poetry generation problem can be solved using the following algorithm:

1. Load the Poetry Corpus
2. Tokenize the corpus in order to split it into a list of words
3. Generate n-gram models
4. For each of the stanzas
   – For each verse
     - Generate a random number in the range [7...10]
     - Select first word
     - Select subsequent words until end of verse
     - [bonus] If not the first verse, try to rhyme the last word with the last word of the previous verse
     - Print verse
   – Print empty line after stanza
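
For the bonus rhyming step in the list above, one possible approach is to treat two words as rhyming when they end in the same letters. The following is a minimal sketch under that assumption. The helper names `rhyme_key`, `build_rhyme_index`, and `rhyming_candidates`, as well as the suffix length of 2, are illustrative choices and not part of the assignment.

```python
from collections import defaultdict

def rhyme_key(word, suffix_len=2):
    """Return the last `suffix_len` characters of a word (assumed rhyme criterion)."""
    return word[-suffix_len:]

def build_rhyme_index(vocabulary, suffix_len=2):
    """Group vocabulary words by their rhyme key."""
    index = defaultdict(set)
    for word in vocabulary:
        index[rhyme_key(word, suffix_len)].add(word)
    return index

def rhyming_candidates(previous_last_word, candidates, index, suffix_len=2):
    """Keep the candidates that share a rhyme key with the previous verse's last word."""
    rhymes = index.get(rhyme_key(previous_last_word, suffix_len), set())
    return [w for w in candidates if w in rhymes and w != previous_last_word]

# Toy vocabulary taken from the example stanza (illustrative only)
vocabulary = ["یاس", "اداس", "پاس", "پیاس", "دل", "گل"]
index = build_rhyme_index(vocabulary)
print(rhyming_candidates("یاس", ["پاس", "دل", "پیاس"], index))
```

When generating the last word of a verse, the candidate next words from the n-gram model could be filtered through `rhyming_candidates` first, falling back to the unfiltered candidates when no rhyme is found.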

2.1 Implementation Challenges:

One challenge in this assignment is selecting subsequent words once the first word of the verse has been chosen. To predict the next word, we want the most probable next word among all possible next words. In other words, we need to find the set of words that occur most frequently after the already selected word and choose the next word from that set. A Conditional Frequency Distribution (CFD) gives us exactly this: given a condition, it tells us the likelihood of each possible outcome.

[bonus] Rhyming the generated verses is also a challenge. You can build your own dictionary for rhyming.

Urdu is written from right to left, so build your n-gram models accordingly: the word that precedes another in reading order is the one to its right.
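
To make the idea of a CFD concrete, here is a minimal sketch built with the standard library only. The toy token list and the helper names `build_cfd` and `next_word_probabilities` are assumptions for illustration. `nltk.ConditionalFreqDist` supports the same kind of lookup.

```python
from collections import Counter, defaultdict

def build_cfd(tokens):
    """Map each word (the condition) to a Counter of the words that follow it."""
    cfd = defaultdict(Counter)
    # Python strings hold Urdu text in reading order, so tokens[0] is the
    # first word read (the rightmost one on screen).
    for current_word, next_word in zip(tokens, tokens[1:]):
        cfd[current_word][next_word] += 1
    return cfd

def next_word_probabilities(cfd, word):
    """Turn the follower counts of `word` into relative frequencies."""
    followers = cfd.get(word, Counter())
    total = sum(followers.values())
    return {w: count / total for w, count in followers.items()} if total else {}

# Toy token list (illustrative only)
tokens = ["دل", "کا", "درد", "دل", "کو", "دل", "کا"]
cfd = build_cfd(tokens)

# Words observed after the condition "دل", most frequent first
print(cfd["دل"].most_common())
print(next_word_probabilities(cfd, "دل"))
```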

2.2 Standard n-gram Models
We can develop our models using the Conditional Frequency Distribution method. First develop a unigram model, then a bigram model, and then a trigram model. Select the first word of each line at random from the line-starting words in the vocabulary, then use the bigram model to generate each next word until the verse is complete. Generate the next three lines in the same way.
Follow the same steps for the trigram model and compare the results of the two n-gram models.
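
The procedure above can be sketched as follows, assuming the corpus is available as a list of verse strings, one per line. The names `train`, `sample`, and `sample_verse` and the toy `lines` are illustrative. Next words are drawn in proportion to their bigram counts, with a back-off to unigram counts; this is one reasonable reading of "use the bigram model", not the only one.

```python
import random
from collections import Counter, defaultdict

def train(lines):
    """Collect line-start words, unigram counts, and bigram counts from verses."""
    start_words, unigrams, bigrams = [], Counter(), defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        start_words.append(tokens[0])
        unigrams.update(tokens)
        # Bigrams are counted within a line so they never span two verses
        for w1, w2 in zip(tokens, tokens[1:]):
            bigrams[w1][w2] += 1
    return start_words, unigrams, bigrams

def sample(counter, rng):
    """Draw one word with probability proportional to its count."""
    words = list(counter)
    weights = [counter[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

def sample_verse(length, start_words, unigrams, bigrams, rng=random):
    """Generate `length` words: a random start word, then sampled successors."""
    verse = [rng.choice(start_words)]
    while len(verse) < length:
        followers = bigrams.get(verse[-1])
        # Back off to the unigram model when the last word has no known successor
        verse.append(sample(followers if followers else unigrams, rng))
    return " ".join(verse)

# Toy corpus taken from the example stanza (illustrative only)
lines = [
    "دل سے نکال یاس کہ زندہ ہوں میں ابھی",
    "ہوتا ہے کیوں اداس کہ زندہ ہوں میں ابھی",
    "آ جاؤ میرے پاس کہ زندہ ہوں میں ابھی",
]
start_words, unigrams, bigrams = train(lines)
print(sample_verse(random.randint(7, 10), start_words, unigrams, bigrams))
```

A trigram version would follow the same pattern, keyed on the last two words of the verse instead of the last one.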


In [20]:
import random
import re

import nltk
from nltk import FreqDist, ConditionalFreqDist

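# Load the poetry corpus file
with open('ghalib.txt', 'r', encoding="utf8") as file:
    poetry = file.read()

# Remove punctuation: keep only word characters and whitespace
# (in Python 3, \w is Unicode-aware, so Urdu letters are kept)
clean_text = re.sub(r'[^\w\s]', '', poetry)

# Tokenize the text into individual Urdu words
# (nltk.word_tokenize needs the NLTK Punkt tokenizer data to be installed)
urdu_words = nltk.word_tokenize(clean_text)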

# Create bigrams and trigrams
bigrams = list(nltk.ngrams(urdu_words, 2))
trigrams = list(nltk.ngrams(urdu_words, 3))

In [None]:
# Frequency Distributions
bigrams_cfd = ConditionalFreqDist(bigrams)
trigrams_cfd = ConditionalFreqDist(((w1, w2), w3) for w1, w2, w3 in trigrams)

def generate_verse(start_word, length, bigrams_cfd, trigrams_cfd):
    verse = [start_word]
    while len(verse) < length:
        if len(verse) >= 2 and (verse[-2], verse[-1]) in trigrams_cfd:
            # Use trigram model when possible
            next_word = trigrams_cfd[(verse[-2], verse[-1])].max()
        elif verse[-1] in bigrams_cfd:
            # Fall back to bigram model
            next_word = bigrams_cfd[verse[-1]].max()
        else:
            # If no bigram or trigram match, break to prevent random output
            break
        verse.append(next_word)
    return ' '.join(verse)

# Most frequent words, computed once and used as a pool of start words
common_words = [word for word, _ in FreqDist(urdu_words).most_common(30)]

# Generate three stanzas of four verses each
for stanza_num in range(3):
    for verse_num in range(4):
        # Each verse is 7 to 10 words long
        verse_length = random.randint(7, 10)

        # Prompt the user for the first word of the first verse only
        if stanza_num == 0 and verse_num == 0:
            start_word = input("Enter the first word: ")
        else:
            # Pick one of the most frequent words as the start word
            start_word = random.choice(common_words)

        # Generate and print the verse
        verse = generate_verse(start_word, verse_length, bigrams_cfd, trigrams_cfd)
        print(verse)

    # Print an empty line to separate stanzas
    print()



کہ یوں ہوتا تو کیا ہوتا ہوا جب غم
کہ یوں ہوتا تو کیا ہوتا ہوا
ہوا ہے یہ وقت ہے شگفتن گل ہائے ناز


و دل عزیز اس کی گلی میں جائے
کر رہا تھا میں مجموعہٴ خیال ابھی فرد فرد
ہوا ہے یہ وقت ہے شگفتن گل ہائے ناز
و دل عزیز اس کی گلی میں جائے


دل کا درد تھا احباب چارہ سازی
ہیں وہ آئیں گھر میں لگی ایسی کہ جو تھا
ہیں وہ آئیں گھر میں لگی ایسی کہ جو
بھی نہیں دل کو چلا ہے عشق سامان صد


