# Question 1
 Poetry Generation Using N-grams



1 Introduction:
In this assignment, you will use n-gram language modeling to generate poetry. For the purpose of this assignment, a poem consists of three stanzas, each containing four verses, where each verse consists of 7 to 10 words. For example, the following stanzas were written by hand.

دل سے نکال یاس کہ زندہ ہوں میں ابھی،

ہوتا ہے کیوں اداس کہ زندہ ہوں میں ابھی،

مایوسیوں کی قید سے خود کو نکال کر،

آ جاؤ میرے پاس کہ زندہ ہوں میں ابھی،



آ کر کبھی تو دید سے سیراب کر مجھے،

مرتی نہیں ہے پیاس کہ زندہ ہوں میں ابھی،

مہر و وفا خلوص و محبت گداز دل،

سب کچھ ہے میرے پاس کہ زندہ ہوں میں ابھی،




لوٹیں گے تیرے آتے ہی پھر دن بہار کے،

رہتی ہے دل میں آس کہ زندہ ہوں میں،

نایاب شاخ چشم میں کھلتے ہیں اب بھی خواب،

سچ ہے ترا قیاس کہ زندہ ہوں میں ابھی

The task is to print three such stanzas with an empty line between them. The generation model can be trained on the provided Poetry Corpus containing poems by Faiz, Ghalib and Iqbal. You can also scrape other Urdu poetry from the internet. You will train unigram and bigram models on this corpus and use them to generate poetry.

2 Assignment Task:

The task is to generate a poem using different models. We will generate a poem verse by verse until all stanzas have been generated. The poetry generation problem can be solved using the following algorithm:
1. Load the Poetry Corpus
2. Tokenize the corpus in order to split it into a list of words
3. Generate n-gram models
4. For each of the stanzas
   – For each verse
     * Generate a random number in the range [7...10]
     * Select first word
     * Select subsequent words until end of verse
     * [bonus] If not the first verse, try to rhyme the last word with the last word of the previous verse
     * Print verse
   – Print empty line after stanza

2.1 Implementation Challenges:

One challenge in this assignment is selecting subsequent words once the first word of the verse has been chosen. To predict the next word, we want the most probable next word among all possible next words. In other words, we find the set of words that occur most frequently after the already selected word and choose the next word from that set. A Conditional Frequency Distribution (CFD) gives us exactly this: given a condition, it tells us the likelihood of each possible outcome. [bonus] Rhyming the generated verses is also a challenge; you can build your own rhyming dictionary. Urdu is written from right to left, so build your n-gram models accordingly.
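As a minimal sketch of the CFD idea, the snippet below builds the distribution with the standard library only. The helper name `build_cfd` and the toy token list are assumptions made for illustration; NLTK's `ConditionalFreqDist` offers the same kind of structure if you prefer to stay within NLTK.

```python
from collections import Counter, defaultdict

def build_cfd(tokens):
    """Map each word (the condition) to a Counter of the words that follow it."""
    cfd = defaultdict(Counter)
    for current_word, next_word in zip(tokens, tokens[1:]):
        cfd[current_word][next_word] += 1
    return cfd

# Toy token list standing in for the tokenized corpus (assumption).
tokens = "دل سے نکال یاس کہ زندہ ہوں میں ابھی دل سے دور دل میں آس".split()
cfd = build_cfd(tokens)

# Followers of the first token, most frequent first.
for follower, count in cfd[tokens[0]].most_common():
    print(follower, count)
```

Dividing each count by the total for its condition turns the counts into conditional probabilities, which is what the next-word choice relies on.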

2.2 Standard n-gram Models
We can develop our models using the Conditional Frequency Distribution method. First develop a unigram model, then a bigram model, and then a trigram model. Select the first word of each line randomly from the starting words in the vocabulary, then use the bigram model to generate each next word until the verse is complete. Generate the next three lines similarly.
Follow the same steps for the trigram model and compare the results of the two n-gram models.
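The bigram procedure described above could be sketched roughly as follows. All function names, the `seed` parameter and the fallback for words with no recorded follower are assumptions of this sketch, not part of the assignment; a trigram variant would condition on the last two words instead of one.

```python
import random
from collections import Counter, defaultdict

def build_bigram_cfd(lines):
    """Return (start_words, cfd) from a list of tokenized corpus lines."""
    start_words = [line[0] for line in lines if line]
    cfd = defaultdict(Counter)
    for line in lines:
        for current_word, next_word in zip(line, line[1:]):
            cfd[current_word][next_word] += 1
    return start_words, cfd

def generate_verse(start_words, cfd, rng, min_len=7, max_len=10):
    """Generate one verse of min_len..max_len words."""
    length = rng.randint(min_len, max_len)
    verse = [rng.choice(start_words)]
    while len(verse) < length:
        followers = cfd.get(verse[-1])
        if followers:
            # Sample the next word in proportion to its bigram count.
            words, counts = zip(*followers.items())
            verse.append(rng.choices(words, weights=counts, k=1)[0])
        else:
            # Dead end (assumed fallback): restart from a starting word.
            verse.append(rng.choice(start_words))
    return verse

def generate_poem(start_words, cfd, stanzas=3, verses=4, seed=None):
    rng = random.Random(seed)
    return [[generate_verse(start_words, cfd, rng) for _ in range(verses)]
            for _ in range(stanzas)]

# Toy corpus: two lines standing in for the real poetry files (assumption).
corpus_lines = [
    "دل سے نکال یاس کہ زندہ ہوں میں ابھی".split(),
    "ہوتا ہے کیوں اداس کہ زندہ ہوں میں ابھی".split(),
]
start_words, cfd = build_bigram_cfd(corpus_lines)
poem = generate_poem(start_words, cfd, seed=0)
for stanza in poem:
    for verse in stanza:
        print(" ".join(verse))
    print()
```

Keeping the corpus split by line makes it possible to collect genuine verse-opening words, which a flat token list cannot provide.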

In [47]:
import nltk
import random

nltk.download('punkt')

def load_data(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        data = f.read()
    return data

def tokenize_corpus(data):
    return nltk.word_tokenize(data)

def generate_n_gram_model(token_list, n):
    # Return every n-gram in the corpus as a list of tuples.
    # No padding is used: nltk.ngrams pads with None by default,
    # which would make ' '.join() fail in generate_poetry.
    if n not in (1, 2, 3):
        raise ValueError("Invalid n value. Please use 1, 2 or 3.")
    return list(nltk.ngrams(token_list, n))

def generate_poetry(stanzas, model, poet_name):
    # Each entry is drawn uniformly at random from the list of n-grams;
    # the only link between lines is the repeated entry described below.
    poetry = []
    for _ in range(stanzas):
        verse = ''
        previous_word = ''
        for _ in range(4):  # four lines per stanza
            line_length = random.randint(7, 10)
            line = []
            for position in range(line_length):
                if previous_word and position == 0:
                    # Open the line with the last n-gram of the previous line.
                    line.append(previous_word)
                else:
                    word = random.choice(model)
                    if isinstance(word, tuple):
                        word = ' '.join(word)
                    line.append(word)
                    previous_word = word
            verse += ' '.join(line) + '\n'
        poetry.append(verse)
    print(f"Poetry by {poet_name}:\n")
    for verse in poetry:
        print(verse)
        print()

ghalib_data = load_data('ghalib.txt')
iqbal_data = load_data('iqbal.txt')

ghalib_tokens = tokenize_corpus(ghalib_data)
iqbal_tokens = tokenize_corpus(iqbal_data)

ghalib_model = generate_n_gram_model(ghalib_tokens, 1)
iqbal_model = generate_n_gram_model(iqbal_tokens, 2)

generate_poetry(3, ghalib_model, "Ghalib")
print("\n")
generate_poetry(3, iqbal_model, "Iqbal")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Poetry by Ghalib:

پهر ‘ زیر نہ اور کے دل جفا
جفا پتھر سوزودرد کیا نہیں میری بھی
بھی کے اک ترے تو سمجھ سے آنکھ نے شب
شب کبھی مرتا جاتی ہجراں پر کہیں کبھی حاضر سے


کر دبا کی کو منظر باد دکان غالب و
و آيہ کا تحفہ کو کی چاہتا میرا نگار وقتِ
وقتِ کا سے ازل اگر بنتا ابھی
ابھی تھا نظر ہے معاصی کچھ کہ میدانءکربلا


کیے ! غم کا کا نہ نہ کاوش خاک مت
مت جس زار میں عتاب جگر صفا گاہے ہے
ہے ڈال شخص نہ کیا میرے مستانہ بہت جاں جنون
جنون تنہا خالی ہوئے وجہِ دہر انفعال چرایا




Poetry by Iqbal:

سرمہ ہے ہیں محبت یار ہوگا بغداد یہ آخر جو وہ اندیشہ ترے سامنے دیا گوش ذوق حسن
ذوق حسن ہیں سب میں فقط مہتاب کی یہاں بے نہ ایراں رہا ہے ۔۔۔۔۔۔۔۔ نو کے ہاتھ
کے ہاتھ کواکب وہی واسطے پیدا تھی سکھائے میں ہے کام دوسروں خدا مست شاخ نازک میں باقی میان غیب
میان غیب میں فقط نیم شب و خاشاک آزاد بندے جس سے نے ابلۂ تہی کیسہ


وقت سے ہیں تہی ہے ہزاروں پہ آئی بہا ہے ذرا دیر حق میں معلوم ہے
معلوم ہے ادب ہوں ہوئی کیوں بھی ہو تر ہے ہیں وہ تو مجھے سے گزر کش انتظار تھا تن
تھا تن موج تند کی انجمن ابھی یورپ کی نمو ہیں تہ انہیں کے ج

# Question 2
 Classify the language of a text, choosing from the list given below, using only stop words. Remove punctuation and convert the text to lowercase first.

In [None]:
import nltk
from nltk.corpus import stopwords
stopwords.fileids()


['arabic',
 'azerbaijani',
 'basque',
 'bengali',
 'catalan',
 'chinese',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hebrew',
 'hinglish',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

In [None]:
Test="An article is qualunque member van un class of dedicated words naquele estão used with noun phrases per mark the identifiability of the referents of the noun phrases"

In [None]:
# Your output should look like this

{'arabic': 0,
 'azerbaijani': 1,
 'basque': 0,
 'bengali': 0,
 'catalan': 3,
 'chinese': 0,
 'danish': 0,
 'dutch': 3,
 'english': 5,
 'finnish': 0,
 'french': 1,
 'german': 1,
 'greek': 0,
 'hebrew': 0,
 'hinglish': 8,
 'hungarian': 1,
 'indonesian': 1,
 'italian': 2,
 'kazakh': 0,
 'nepali': 0,
 'norwegian': 0,
 'portuguese': 1,
 'romanian': 1,
 'russian': 0,
 'slovene': 0,
 'spanish': 1,
 'swedish': 0,
 'tajik': 0,
 'turkish': 0}

In [19]:
from nltk.corpus import stopwords
import re

languages = ['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']

def classify_language(text, languages):
    # Lowercase the text; \w+ keeps only word characters, which drops punctuation.
    words = set(re.findall(r'\b\w+\b', text.lower()))

    # For each language, count the distinct words of the text that are stop words.
    stop_words_count_dict = {}
    for lang in languages:
        stop_words = {word.lower() for word in stopwords.words(lang)}
        stop_words_count_dict[lang] = len(words & stop_words)
    return stop_words_count_dict

Test = "An article is qualunque member van un class of dedicated words naquele estão used with noun phrases per mark the identifiability of the referents of the noun phrases"
output = classify_language(Test, languages)
print(output)


{'arabic': 0, 'azerbaijani': 1, 'basque': 0, 'bengali': 0, 'catalan': 3, 'chinese': 0, 'danish': 0, 'dutch': 3, 'english': 5, 'finnish': 0, 'french': 1, 'german': 1, 'greek': 0, 'hebrew': 0, 'hinglish': 8, 'hungarian': 1, 'indonesian': 1, 'italian': 2, 'kazakh': 0, 'nepali': 0, 'norwegian': 0, 'portuguese': 1, 'romanian': 1, 'russian': 0, 'slovene': 0, 'spanish': 1, 'swedish': 0, 'tajik': 0, 'turkish': 0}
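To turn these counts into a single classification, one option is to pick the language with the highest count. The sketch below is an assumption about how that final step might look: the helper name `predict_language` is hypothetical, and it returns every language tied for the top score rather than silently choosing one. The dictionary is a subset of the counts printed above.

```python
def predict_language(stop_word_counts):
    """Return the language(s) with the highest stop-word count, sorted by name."""
    best = max(stop_word_counts.values())
    return sorted(lang for lang, count in stop_word_counts.items() if count == best)

# Subset of the counts printed above.
counts = {'catalan': 3, 'dutch': 3, 'english': 5, 'hinglish': 8, 'italian': 2}
print(predict_language(counts))  # ['hinglish']
```

On the test sentence the top score goes to 'hinglish' rather than 'english', so raw counts alone may need normalizing (for example by stop-word list size) before they are trusted.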


# Question 3
 Rule Based Roman Urdu Text Normalization

Roman Urdu lacks a standard lexicon, and many spelling variations usually exist for a given word; e.g., the word zindagi (life) is also written as zindagee, zindagy, zaindagee and zndagi. In this question, you have to normalize Roman Urdu words using the rules given in the attached PDF. Your code should work for a complete sentence or for multiple sentences.

For example: zaroori, zaruri and zarori all map to 'zrory', so zrory becomes the canonical form for all the spellings mentioned above.
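A quick way to check a rule set is to confirm that all spellings of a word collapse to one form. The sketch below does this for the zaroori example. The four rules are illustrative assumptions chosen to reproduce that example, not the rule set from the PDF; in particular, the word-final `i` to `y` rule is hypothetical.

```python
import re

# Illustrative rules only (assumption): NOT the rule set from the PDF.
RULES = [
    (r"oo+", "o"),          # collapse repeated "o"
    (r"u", "o"),            # treat "u" as "o"
    (r"(?<=[^a])ar", "r"),  # drop the "a" of "ar" when the preceding character is not "a"
    (r"i\b", "y"),          # word-final "i" becomes "y" (hypothetical rule)
]

def normalize_word_variants(text, rules=RULES):
    """Apply each (pattern, replacement) rule in order to the whole text."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

variants = ["zaroori", "zaruri", "zarori"]
normalized = {normalize_word_variants(v) for v in variants}
print(normalized)
```

Because rules are applied in sequence, their order matters: an earlier rule can create or destroy a match for a later one.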

In [61]:
import re

def normalize_text(text):
    rules = [
        (r"ain$", "ein"),
        (r"([^a])ar", "\\1r"),
        (r"ai", "ae"),
        (r"y+", "I"),
        (r"ay$", "e"),
        (r"ih+", "eh"),
        (r"ey$", "e"),
        (r"s+", "s"),
        (r"ie$", "y"),
        (r"([^r])y", "\\1ri"),
        (r"es", "is"),
        (r"([^y])sy", "\\1si"),
        (r"a+", "a"),
        (r"([^t])y", "\\1ti"),
        (r"jj", "j"),
        (r"oo+", "o"),
        (r"ce+", "i"),
        (r"i([bdefghjklmnpqrtuvwxyz])", "y\\1"),
        (r"d+", "d"),
        (r"u", "o"),
        (r"h([^a-z])", "\\1")
    ]

    normalized_text = text

    for rule in rules:
        pattern, replacement = rule
        normalized_text = re.sub(pattern, replacement, normalized_text)

    return normalized_text

text = "zindagi zindagee zindagy zaindagee zndagi ainzindain"
normalized_text = normalize_text(text)
print(normalized_text)


zyndagi zyndagee zyndagI zaendagee zndagi aenzyndeyn


In [63]:
text="ek baar ke baad, Ali ne apne ghar mein kuch gadbad dekhi. Usne apne dost ko bulaya aur kaha, 'kya karun, yeh sab kaise hua?' Dost ne muskuraya aur kaha, 'shayad koi raaz hoga.' Phir Ali ne kaha, 'hum raaz jaan lenge.' Lekin phir unhone dekha, raaz koi aam nahi tha, woh tha ek purani haveli ka bhoot."
normalized_text=normalize_text(text)
print(normalized_text)

ek bar ke bad, Ali ne apne ghr meyn koc gadbad dekhi. Usne apne dost ko bolaIa aor kaha, 'kIa kron, Ie sab kaise hoa?' Dost ne moskoraIa aor kaha, 'shaIad koi raz hoga.' Phyr Ali ne kaha, 'hom raz jan lenge.' Lekyn phyr onhone dekha, raz koi am nahi tha, wo tha ek porani haveli ka bhot.


# Question 4
In this question, you have been given two text files in Urdu. The first file contains an Urdu dictionary,
which consists of a list of words. The second file contains sentences that do not have spaces between the
words and are difficult to read.
آجخودبخشہوں
This sentence, without proper word segmentation, is difficult to understand. However, with proper word
segmentation, the sentence can be separated into individual words:
آج خود بخش ہوں
This makes the sentence much easier to read and understand.


The task is to insert spaces between words using

*   unigrams
*   bigrams
*   trigrams

You can use the list of words file/dictionary provided in assignment 1.
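For the unigram case, one common approach is dynamic programming over character positions: keep the most probable split of every prefix into dictionary words. The sketch below is a minimal illustration under stated assumptions. The function name `segment`, the `max_word_len` limit and the toy word counts are hypothetical, and a real run would load counts from the provided dictionary or a corpus.

```python
import math
from collections import Counter

def segment(text, word_counts, max_word_len=10):
    """Split unspaced text into known words, maximizing unigram log-probability.

    Returns the spaced string, or None if no full split into known words exists.
    """
    total = sum(word_counts.values())
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_word_len), end):
            piece = text[start:end]
            if piece in word_counts and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_counts[piece] / total)
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        return None
    # Walk the back-pointers from the end to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return " ".join(reversed(words))

# Toy data: the example sentence above, with uniform counts (assumption).
spaced = "آج خود بخش ہوں"
counts = Counter(spaced.split())
unspaced = spaced.replace(" ", "")
print(segment(unspaced, counts))
```

Bigram and trigram versions follow the same pattern but score each candidate word given the previous one or two words, which requires counts from running text rather than a plain word list.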








In [10]:
pip install razdel


Collecting razdel
  Downloading razdel-0.5.0-py3-none-any.whl (21 kB)
Installing collected packages: razdel
Successfully installed razdel-0.5.0


In [13]:
import razdel
from collections import Counter
import re


def load_words(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        words = set(word.strip() for word in f)
    return words


def create_ngrams(words, n):
    ngrams = []
    for word in words:
        word = re.sub(r'[^\w\s]', '', word)  # Remove non-word characters
        if len(word) >= n:
            for i in range(len(word) - n + 1):
                ngrams.append(word[i:i + n])
    return Counter(ngrams)


def add_spaces(text, unigram_counts, bigram_counts, trigram_counts):
    words = [token.text for token in razdel.tokenize(text)]

    segmented_text_unigram = words[0]
    segmented_text_bigram = words[0]
    segmented_text_trigram = words[0]

    for i in range(1, len(words)):
        last_word = words[i - 1]
        current_word = words[i]

        best_segment_unigram = current_word
        best_segment_bigram = current_word
        best_segment_trigram = current_word

        if current_word not in unigram_counts:
            if last_word in bigram_counts and current_word in bigram_counts[last_word]:
                best_segment_bigram = ' ' + current_word
            else:
                best_segment_bigram = last_word[-1] + current_word

        if i > 1 and last_word in trigram_counts and current_word in trigram_counts[last_word]:
            segmented_text_trigram += ' ' + current_word
        else:
            segmented_text_trigram += best_segment_trigram

        segmented_text_unigram += best_segment_unigram
        segmented_text_bigram += best_segment_bigram

    return segmented_text_unigram, segmented_text_bigram, segmented_text_trigram


if __name__ == '__main__':
    words = load_words('words.txt')

    unigram_counts = create_ngrams(words, 1)
    bigram_counts = create_ngrams(words, 2)
    trigram_counts = create_ngrams(words, 3)

    with open('word_test.txt', 'r', encoding='utf-8') as f:
        text = f.read()

    segmented_text_unigram, segmented_text_bigram, segmented_text_trigram = add_spaces(text, unigram_counts, bigram_counts, trigram_counts)

    print("Unigram Model Output:")
    print(segmented_text_unigram)

    print("\nBigram Model Output:")
    print(segmented_text_bigram)

    print("\nTrigram Model Output:")
    print(segmented_text_trigram)


Unigram Model Output:
تجربہکارہندوستانیآفسپنررویچندرنایشوننےآئندہایشیاءکپ2023ءکیغیریقینیقسمتپراپنیرائےکااظہارکیاہےجوپاکستانمیںہونےجارہاہےاپنےیوٹیوبچینلپرباتکرتےہوئےرویچندرنایشوننےکہاکہاگرپڑوسیملکبھارتایشیاکپ2023ءمیںشرکتکرناچاہتاہےتومقامتبدیلکردیناچاہیےآفسپنرنےکہاکہانٹرنیشنلکرکٹکونسل(آئیسیسی)نےپاکستانکوٹورنامنٹکیمیزبانیکاحقدےدیاہےلیکنبھارتپاکستانکادورہکرنےکوتیارنہیںرویچندرنایشوننےبھی2023ءمیں50اوورکےورلڈکپکےحوالےسےپاکستانکرکٹبورڈ(پیسیبی)کےحالیہبیانکاجوابدیتےہوئےکہامیرےخیالمیںیہممکننہیںہےآفسپنرنےمزیدکہاکہپاکستاننےپہلےبھارتکادورہکرنےسےانکارکردیاتھالیکنآخرکاروہمیگاایونٹسمیںشرکتکےلیےبھارتگئےتھےغورطلبہےکہایشیاکپکےمعاملےپرحالہیمیںبحرینمیںہونےوالیایکہنگامیمیٹنگمیںغورکیاگیاجہاںوینیوکےبارےمیںحتمیفیصلہمارچتکموخرکردیاگیابحرینمیںایشینکرکٹکونسل(اےسیسی)کےاجلاسکےبعدبیسیسیآئیکےحکامنےاعلانکیاکہبورڈنےایونٹکےلیےاپنیٹیمپاکستاننہبھیجنےکافیصلہکیاہےتاہمپیسیبیکےحکامنےبھیسختردعملکااظہارکرتےہوئےکہاہےکہوہاکتوبرمیںبھارتمیںہونےوالےورلڈکپ2023ءمیںشرکتنہیںکریںگےتوشہخانہکیسمیںعمرانخانپرفردجرمعائدنہہوسکیسیشنعدالتنےالیکشن