# Question 1
 Poetry Generation using N-grams



1 Introduction:
In this assignment, you will use n-gram language modeling to generate poetry. For the purpose of this assignment, a poem consists of three stanzas, each containing four verses, where each verse consists of 7 to 10 words. For example, the following stanzas were generated manually.

دل سے نکال یاس کہ زندہ ہوں میں ابھی،

ہوتا ہے کیوں اداس کہ زندہ ہوں میں ابھی،

مایوسیوں کی قید سے خود کو نکال کر،

آ جاؤ میرے پاس کہ زندہ ہوں میں ابھی،



آ کر کبھی تو دید سے سیراب کر مجھے،

مرتی نہیں ہے پیاس کہ زندہ ہوں میں ابھی،

مہر و وفا خلوص و محبت گداز دل،

سب کچھ ہے میرے پاس کہ زندہ ہوں میں ابھی،




لوٹیں گے تیرے آتے ہی پھر دن بہار کے،

رہتی ہے دل میں آس کہ زندہ ہوں میں،

نایاب شاخ چشم میں کھلتے ہیں اب بھی خواب،

سچ ہے ترا قیاس کہ زندہ ہوں میں ابھی

The task is to print three such stanzas with an empty line between them. The generation model can be trained on the provided Poetry Corpus, which contains poems by Faiz, Ghalib and Iqbal. You may also scrape other Urdu poetry from the internet. You will train unigram and bigram models on this corpus and use them to generate poetry.

2 Assignment Task:

The task is to generate a poem using different models. We will generate a poem verse by verse until all stanzas have been generated. The poetry generation problem can be solved using the following algorithm:
1. Load the Poetry Corpus
2. Tokenize the corpus in order to split it into a list of words
3. Generate n-gram models
4. For each of the stanzas
– For each verse
* Generate a random number in the range [7...10]
* Select first word
* Select subsequent words until end of verse
* [bonus] If not the first verse, try to rhyme the last word with the last word of the previous verse
* Print verse
– Print empty line after stanza
2.1 Implementation Challenges:

One challenge in this assignment is selecting the subsequent words once the first word of the verse has been chosen. To predict the next word, we want the most probable next word among all possible next words. In other words, we need to find the set of words that occur most frequently after the already selected word and choose the next word from that set. A Conditional Frequency Distribution (CFD) gives us exactly this: given a condition, it tells us the likelihood of each possible outcome. [bonus] Rhyming the generated verses is also a challenge. You can build your own dictionary for rhyming. Urdu is written from right to left, so build your n-gram models accordingly.
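As a minimal sketch of the CFD idea, the block below counts followers with the standard library only. The helper names `build_cfd` and `sample_next` are illustrative assumptions, not part of the assignment, and the two verses are taken from the example stanza with the trailing comma removed.

```python
import random
from collections import Counter, defaultdict

def build_cfd(verses):
    """Count, for every word, how often each other word follows it."""
    cfd = defaultdict(Counter)
    for verse in verses:
        # Pair each word with its successor; pairs never cross a verse boundary.
        for current_word, next_word in zip(verse, verse[1:]):
            cfd[current_word][next_word] += 1
    return cfd

def sample_next(cfd, word, rng=random):
    """Draw a follower of `word` with probability proportional to its count."""
    followers = cfd.get(word)
    if not followers:
        return None  # `word` was never seen with a successor
    candidates = list(followers)
    weights = [followers[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Two verses from the example stanza, already split into words.
verses = [
    "دل سے نکال یاس کہ زندہ ہوں میں ابھی".split(),
    "ہوتا ہے کیوں اداس کہ زندہ ہوں میں ابھی".split(),
]
cfd = build_cfd(verses)
```

Python strings hold Urdu text in logical (reading) order, so `split()` already yields the words in the order they are read; the right-to-left direction only affects how the text is displayed. Calling `cfd[word].most_common(3)` lists the three most frequent followers of `word`.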

2.2 Standard n-gram Models
We can develop our models using the Conditional Frequency Distribution method. First develop a unigram model, then a bigram model, and then a trigram model. Select the first word of each line at random from the starting words in the vocabulary, then use the bigram model to generate each next word until the verse is complete. Generate the next three lines similarly.
 Follow the same steps for the trigram model and compare the results of the two n-gram models.
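One possible way to extend the bigram approach to trigrams is sketched below. It conditions on the last two words and falls back to the bigram model when that pair was never seen; the fallback is an assumption of this sketch, not a requirement of the assignment, and `build_models` and `generate_verse` are hypothetical names.

```python
import random
from collections import defaultdict

def build_models(verses):
    """Return starting words plus bigram and trigram follower lists."""
    starts = [verse[0] for verse in verses if verse]
    bigram, trigram = defaultdict(list), defaultdict(list)
    for verse in verses:
        for w1, w2 in zip(verse, verse[1:]):
            bigram[w1].append(w2)
        for w1, w2, w3 in zip(verse, verse[1:], verse[2:]):
            trigram[(w1, w2)].append(w3)
    return starts, bigram, trigram

def generate_verse(first_word, length, bigram, trigram, rng=random):
    """Grow a verse word by word, preferring the two-word history."""
    verse = [first_word]
    while len(verse) < length:
        candidates = []
        if len(verse) >= 2:
            candidates = trigram.get(tuple(verse[-2:]), [])
        if not candidates:
            # Fall back to the one-word history when the pair is unseen.
            candidates = bigram.get(verse[-1], [])
        if not candidates:
            break  # dead end: no known follower
        verse.append(rng.choice(candidates))
    return verse

# Toy corpus: a single verse from the example stanza.
corpus = ["دل سے نکال یاس کہ زندہ ہوں میں ابھی".split()]
starts, bigram, trigram = build_models(corpus)
verse = generate_verse(random.choice(starts), random.randint(7, 10), bigram, trigram)
print(" ".join(verse))
```

Storing followers as lists with repeats means `rng.choice` already samples in proportion to frequency, which is the same trick the bigram dictionary in the cell below relies on.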

In [20]:
import random
from nltk import ngrams

# Load the Poetry list from iqbal.txt and ghalib.txt
poetry_list = []

with open('iqbal.txt', 'r', encoding='utf-8') as file:
    poetry_list += file.read().splitlines()

with open('ghalib.txt', 'r', encoding='utf-8') as file:
    poetry_list += file.read().splitlines()

# Tokenize the poetry_list into words

words = []
for line in poetry_list:
    word_list = line.split()
    for word in word_list:
        words.append(word)

# Generate Bigram words
bigram_words = {}

for word1, word2 in ngrams(words, 2):
    if word1 not in bigram_words:
        bigram_words[word1] = []
    bigram_words[word1].append(word2)

# Function to generate a ghazal verse using a model
def generate_ghazal_verse(model, length_range):
    verse = []
    length = random.randint(length_range[0], length_range[1])
    
    while length > 0:
        if not verse:
            # Start with a random word from the poetry list
            current_word = random.choice(words)
        else:
            # Choose the next word based on the bigram words
            next_words = model.get(verse[-1], [])
            if not next_words:
                break
            current_word = random.choice(next_words)
        
        verse.append(current_word)
        length -= 1
    
    return verse

# Function to check if the last word of the verses rhyme
def last_words_rhyme(verses):
    last_word = verses[0][-1]
    for verse in verses[1:]:
        if last_word[-2:] != verse[-1][-2:]:
            return False
    return True

# Generating full ghazal with rhyming stanza 
def generate_complete_ghazal(stanza_count, verses_per_stanza):
    for _ in range(stanza_count):
        while True:
            
            rhyming_verses = []
            for _ in range(verses_per_stanza):
                verse = generate_ghazal_verse(bigram_words, (7, 10))
                rhyming_verses.append(verse)
            
            if last_words_rhyme(rhyming_verses):
                break
        
        for verse in rhyming_verses:
            print(" ".join(verse))
        
        print()  

generate_complete_ghazal(3, 4)


کی کیا تفصیر ہے؟ آب تو دلبری کیا آبرو میں
مجھ کو یہ کافر کو بخشی سختی خارا رہے ہیں
کہ کبھی ملے یہ حرف شیریں ترجماں تیرا کہیں
ہو سب دل نہیں غیر از سیلیِ استاد‘ نہیں

کمخواب تھا جن کے بدلے دشنہ مژگاں
قبا کا؟ ذرہ سویداۓ بیاباں نورد تھا واں
اعتمادِ دل تا پا نہ سکے گی جو واں
لیے ورق تمام ہے بے شرف صحبت اے ناداں

میرے نصیب کر کوئی موجود پھر مشاہدہ ہے
کی گستاخی دے کوئی اداس بیٹھا ہے
لگے نہ ہو جس سے گزر گرچہ ہے
کام بند کریں دماغ خجلت ہوں گلخن میں‘ ہے



# Question 2
 Classify the language of a text, choosing from the list given below, using only stop words. Remove punctuation and convert the text to lowercase first.
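The preprocessing step could look roughly like the sketch below. It assumes that "punctuation" means the ASCII characters in `string.punctuation`; punctuation specific to other scripts would need to be added to the table. The function name `preprocess` is illustrative.

```python
import string

def preprocess(text):
    """Lowercase the text, delete ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

tokens = preprocess("An article, is... USED!")
print(tokens)
```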

In [2]:
import nltk
from nltk.corpus import stopwords
stopwords.fileids()

['arabic',
 'azerbaijani',
 'basque',
 'bengali',
 'catalan',
 'chinese',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hebrew',
 'hinglish',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

In [1]:
Test="An article is qualunque member van un class of dedicated words naquele estão used with noun phrases per mark the identifiability of the referents of the noun phrases"

In [2]:
#Manual code
import nltk
from nltk.corpus import stopwords

# Languages to score, in the same order as the list printed above
LANGUAGES = [
    'arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese',
    'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek',
    'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh',
    'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene',
    'spanish', 'swedish', 'tajik', 'turkish',
]

# Load each stopword list once, as a set for fast membership tests
LANGUAGE_STOPWORDS = {lang: set(stopwords.words(lang)) for lang in LANGUAGES}

def classify_language(text):
    """Count how many tokens of `text` are stopwords in each language."""
    # Lowercase and tokenize, then drop punctuation-only tokens
    # (wordpunct_tokenize splits punctuation off but does not remove it)
    words = [w for w in nltk.wordpunct_tokenize(text.lower()) if w.isalnum()]

    # Count stopword hits for each language
    language_counts = {lang: 0 for lang in LANGUAGES}
    for word in words:
        for language, stopword_set in LANGUAGE_STOPWORDS.items():
            if word in stopword_set:
                language_counts[language] += 1

    # Return the language counts
    return language_counts

# Test text
test = "An article is qualunque member van un class of dedicated words naquele estão used with noun phrases per mark the identifiability of the referents of the noun phrases"

result = classify_language(test)
print(result)


{'arabic': 0, 'azerbaijani': 3, 'basque': 0, 'bengali': 0, 'catalan': 3, 'chinese': 0, 'danish': 0, 'dutch': 5, 'english': 9, 'finnish': 0, 'french': 1, 'german': 1, 'greek': 0, 'hebrew': 0, 'hinglish': 12, 'hungarian': 1, 'indonesian': 1, 'italian': 2, 'kazakh': 0, 'nepali': 0, 'norwegian': 0, 'portuguese': 1, 'romanian': 1, 'russian': 0, 'slovene': 0, 'spanish': 1, 'swedish': 0, 'tajik': 0, 'turkish': 0}
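The function above returns raw counts rather than a single label. A small sketch of the final step, picking the language or languages with the most stopword hits, is shown below. `top_languages` is a hypothetical helper, and the `counts` dictionary reuses a few of the values printed above.

```python
def top_languages(counts):
    """Return every language tied for the highest stopword count, sorted by name."""
    best = max(counts.values())
    return sorted(lang for lang, n in counts.items() if n == best)

# A subset of the counts printed above.
counts = {"catalan": 3, "dutch": 5, "english": 9, "hinglish": 12, "italian": 2}
print(top_languages(counts))
```

With the counts printed above this selects hinglish rather than english, so the raw maximum may need a tie-breaking or normalization rule (for example, dividing by the size of each stopword list) before it is used as the final answer.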


# Question 3
 Rule Based Roman Urdu Text Normalization

Roman Urdu lacks a standard lexicon, and many spelling variations usually exist for a given word. For example, the word zindagi (life) is also written as zindagee, zindagy, zaindagee and zndagi. In this question you have to normalize Roman Urdu words using the rules given in the attached PDF. Your code should work for a complete sentence or for multiple sentences.

For example, zaroori, zaruri and zarori all map to 'zrory', so zrory becomes the canonical form for all of the spellings above.

In [15]:
# your code here
def normalize_word(word):
    # Apply rules one by one
    if word.endswith("ain"):
        word = word.replace("ain", "ein")
    if not (word.startswith("ar")) and "ar" in word:
        word = word.replace("ar", "r")
    word = word.replace("ai", "ae")
    word = word.replace("iy", "I")
    word = word.replace("ih", "eh")
    word = word.replace("ay", "e")
    word = word.replace("s" * 3, "s")
    word = word.replace("ry", "ri")
    if word.startswith("es"):
        word = "is" + word[2:]
    word = word.replace("sy", "si")
    if not word.endswith("ty"):
        word = word.replace("ty", "ti")
    word = word.replace("aaa", "a")
    word = word.replace("aa", "a")
    word = word.replace("j" * 2, "j")
    word = word.replace("o" * 2, "o")
    word = word.replace("ee", "i")
    # A trailing "i" is written as "y"
    if word.endswith("i"):
        word = word[:-1] + "y"
    word = word.replace("d" * 2, "d")
    word = word.replace("u", "o")
    if word.startswith("h"):
        word = word[1:]

    return word

def normalize_sentence(sentence):
    words = sentence.split()
    normalized_words = [normalize_word(word) for word in words]
    return ' '.join(normalized_words)

# Test the code
input_sentence = "zaroori zaruri zarori zindagee zindagy sehar bain balu mera zindagi zindagi aram"
normalized_sentence = normalize_sentence(input_sentence)
print(normalized_sentence)


zrory zrory zrory zindagy zindagy sehr bein balo mera zindagy zindagy aram


# Question 4
In this question, you have been given two text files in Urdu. The first file contains an Urdu dictionary,
which consists of a list of words. The second file contains sentences that do not have spaces between the
words and are difficult to read.
آجخودبخشہوں
This sentence, without proper word segmentation, is difficult to understand. However, with proper word
segmentation, the sentence can be separated into individual words:
آج خود بخش ہوں
This makes the sentence much easier to read and understand.


The task is to insert spaces between words using

*   unigrams
*   bigrams
*   trigrams

You can use the list of words file/dictionary provided in assignment 1.
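As one hedged illustration of the unigram case, the sketch below scores every possible split with unigram log-probabilities and keeps the best one using dynamic programming. The word counts are made up for the example, and `segment_unigram` is a hypothetical name; a bigram or trigram variant would condition each word's score on the one or two words before it.

```python
import math

def segment_unigram(text, counts):
    """Return the most probable split of `text` into words from `counts`, or None."""
    total = sum(counts.values())
    max_len = max(map(len, counts))
    # best[i] = (log-probability, start index of the last word) for text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in counts and best[start][0] > -math.inf:
                score = best[start][0] + math.log(counts[word] / total)
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None  # the text cannot be covered by known words
    # Walk the back-pointers from the end of the text to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Hypothetical counts for the four words of the example sentence.
expected = ["آج", "خود", "بخش", "ہوں"]
counts = dict(zip(expected, [3, 2, 1, 4]))
joined = "".join(expected)
print(" ".join(segment_unigram(joined, counts)))
```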








In [2]:
# Read the dictionary file and collect every word it contains.
with open('words_dictionary.txt', 'r', encoding='utf-8') as file:
    dictionary = set(file.read().split())

# Longest dictionary entry, used to bound the lookahead window.
max_word_len = max(map(len, dictionary))

# Read the unsegmented sentences (separated by whitespace in the file).
with open('word_test.txt', 'r', encoding='utf-8') as file:
    joined_sentences = file.read().split()

def segment(sentence):
    """Split an unspaced sentence by greedy longest-match against the dictionary.

    Note: this is a dictionary (unigram) lookup only; it does not score
    bigrams or trigrams. Characters that start no dictionary word are
    emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest window first and shrink until a dictionary word matches.
        for j in range(min(len(sentence), i + max_word_len), i, -1):
            if sentence[i:j] in dictionary:
                break
        else:
            j = i + 1  # no match: fall back to a single character
        tokens.append(sentence[i:j])
        i = j
    return " ".join(tokens)

# Joined words file
print("File without proper segmentation: \n")
print(set(joined_sentences))
print("\nFile with proper Segmentation: \n")

# Segment every sentence
fixed_sentences = [segment(sentence) for sentence in joined_sentences]

# Write the properly spaced sentences to the output file and print them.
with open('output_file.txt', 'w', encoding='utf-8') as output_file:
    for sentence in fixed_sentences:
        output_file.write(sentence + '\n')
        print(sentence)

print("Fixing and spacing completed. Result saved in output_file.txt.")


File without proper segmentation: 

{'سابقوفاقیوزیرمفتاحاسماعیلکاکہناہےکہوزیرخزانہاسحاقڈارنےپھروہیحرکتکیجوشوکتتریننےکیتھیانہوںنےآئیایمایفسےکیاگیامعاہدہتوڑاتھاانہوںنےہمنیوزکےپروگراممیںگفتگوکرتےہوئےمزیدکہاکہآئیایمایفسےمعاہدہہونےکےبعدچیزیںبہترہوںگیپیآئیاےاسسال90اربروپےکانقصانکرےگیلیکناگرپیآئیاےکینجکاریہوگیتو90اربروپےکانقصاننہیںہوگا', 'قبلازیںوفاقیوزیرایازصادقنےکہاکہپرویزخٹکاسدقیصراوراعجازشاہکواےپیسیمیںشرکتکیدعوتدیانسےکہاکہوہیہدعوتعمرانخانتکپہنچادیں', 'آفسپنرنےکہاکہانٹرنیشنلکرکٹکونسل(آئیسیسی)نےپاکستانکوٹورنامنٹکیمیزبانیکاحقدےدیاہےلیکنبھارتپاکستانکادورہکرنےکوتیارنہیںرویچندرنایشوننےبھی2023ءمیں50اوورکےورلڈکپکےحوالےسےپاکستانکرکٹبورڈ(پیسیبی)کےحالیہبیانکاجوابدیتےہوئےکہامیرےخیالمیںیہممکننہیںہےآفسپنرنےمزیدکہاکہپاکستاننےپہلےبھارتکادورہکرنےسےانکارکردیاتھالیکنآخرکاروہمیگاایونٹسمیںشرکتکےلیےبھارتگئےتھے', 'مفتاحاسماعیلنےکہاکہمیںتوچاہرہاتھاپہلےدنہیپٹرولیممصنوعاتکیقیمتوںمیںاضافہکیاجائےسیلابآنےسےپہلےہمنےڈیفالٹرسککوکمکیاتھاشہبازشریفاگرآجوزیراعظمنہہوتےتوملکڈیفالٹکیطرفجاچکاتھاقبلازیںایکبیانمیںمفتاحاسماعیلنےکہ