# Question 1
 Poetry Generation using N-grams



1 Introduction:
In this assignment, you will use n-gram language modeling to generate poetry. For the purposes of this assignment, a poem consists of three stanzas, each containing four verses, where each verse consists of 7 to 10 words. For example, the following poem was written manually.

دل سے نکال یاس کہ زندہ ہوں میں ابھی،

ہوتا ہے کیوں اداس کہ زندہ ہوں میں ابھی،

مایوسیوں کی قید سے خود کو نکال کر،

آ جاؤ میرے پاس کہ زندہ ہوں میں ابھی،



آ کر کبھی تو دید سے سیراب کر مجھے،

مرتی نہیں ہے پیاس کہ زندہ ہوں میں ابھی،

مہر و وفا خلوص و محبت گداز دل،

سب کچھ ہے میرے پاس کہ زندہ ہوں میں ابھی،




لوٹیں گے تیرے آتے ہی پھر دن بہار کے،

رہتی ہے دل میں آس کہ زندہ ہوں میں،

نایاب شاخ چشم میں کھلتے ہیں اب بھی خواب،

سچ ہے ترا قیاس کہ زندہ ہوں میں ابھی

The task is to print three such stanzas with an empty line between them. The generation model can be trained on the provided Poetry Corpus, which contains poems by Faiz, Ghalib and Iqbal. You can also scrape other Urdu poetry from the internet. You will train unigram, bigram and trigram models on this corpus and use them to generate poetry.

2 Assignment Task:

The task is to generate a poem using different models. We will generate a poem verse by verse until all stanzas have been generated. The poetry generation problem can be solved using the following algorithm:
1. Load the Poetry Corpus
2. Tokenize the corpus in order to split it into a list of words
3. Generate n-gram models
4. For each of the stanzas
   – For each verse
     * Generate a random number in the range [7...10]
     * Select first word
     * Select subsequent words until end of verse
     * [bonus] If not the first verse, try to rhyme the last word with the last word of the previous verse
     * Print verse
   – Print empty line after stanza
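
The steps above can be sketched as a pair of nested loops. This is a minimal outline rather than a reference solution: `generate_poem`, `format_poem`, `pick_first` and `pick_next` are hypothetical names, and the word-selection functions are passed in as arguments so that any of the n-gram models can be plugged in.

```python
import random


def generate_poem(pick_first, pick_next, num_stanzas=3, verses_per_stanza=4,
                  min_words=7, max_words=10, rng=random):
    """Return a poem as a list of stanzas, each a list of verse strings."""
    poem = []
    for _ in range(num_stanzas):
        stanza = []
        for _ in range(verses_per_stanza):
            # Draw a fresh length for every verse
            length = rng.randint(min_words, max_words)
            verse = [pick_first()]
            while len(verse) < length:
                # pick_next receives the words chosen so far
                verse.append(pick_next(verse))
            stanza.append(" ".join(verse))
        poem.append(stanza)
    return poem


def format_poem(poem):
    """Join verses with newlines and stanzas with one empty line."""
    return "\n\n".join("\n".join(stanza) for stanza in poem)


# Toy usage: a fixed first word and a uniformly random next word
toy_vocab = ["dil", "se", "zinda", "hoon", "abhi"]
poem = generate_poem(lambda: "dil", lambda verse: random.choice(toy_vocab))
print(format_poem(poem))
```

Passing the selection functions in keeps the loop structure identical for the unigram, bigram and trigram variants; only `pick_next` changes.
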
2.1 Implementation Challenges:

Among the challenges of this assignment is selecting subsequent words once the first word of the verse has been chosen. To predict the next word, we want the most probable next word among all possible next words. In other words, we need to find the set of words that occur most frequently after the already selected word and choose the next word from that set. We can use a Conditional Frequency Distribution (CFD) to figure that out! A CFD tells us, given a condition, the likelihood of each possible outcome. [bonus] Rhyming the generated verses is also a challenge. You can build your own rhyming dictionary. Urdu is written from right to left, so build your n-gram models accordingly.
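
To make the CFD idea concrete, here is a small sketch that builds one with the standard library instead of `nltk.ConditionalFreqDist`. The helper names (`build_cfd`, `most_likely_next`, `sample_next`) are illustrative only. It assumes the tokens are already in reading order; Python strings store Urdu text in logical order, so the same pairing of "current word" and "following word" should apply without reversing anything.

```python
import random
from collections import Counter, defaultdict


def build_cfd(tokens):
    """Map each word to a Counter of the words that follow it."""
    cfd = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        cfd[current][following] += 1
    return cfd


def most_likely_next(cfd, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = cfd.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def sample_next(cfd, word, rng=random):
    """Sample a follower of `word` in proportion to its count, or None."""
    followers = cfd.get(word)
    if not followers:
        return None
    candidates = list(followers)
    weights = [followers[w] for w in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


tokens = "the cat sat on the mat and the cat slept".split()
cfd = build_cfd(tokens)
print(cfd["the"])                    # Counter({'cat': 2, 'mat': 1})
print(most_likely_next(cfd, "the"))  # cat
```

Always taking the most frequent follower tends to lock the generator into repeating loops, so sampling in proportion to the counts, as `sample_next` does, is one possible alternative.
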

2.2 Standard n-gram Models
We can develop our model using the Conditional Frequency Distribution method. First develop a unigram model (Unigram Model), then the bigram model (Bigram Model) and then the trigram model. Select the first word of each line randomly from the starting words in the vocabulary, and then use the bigram model to generate each next word until the verse is complete. Generate the next three lines similarly.
Follow the same steps for the trigram model and compare the results of the two n-gram models.
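
When comparing the two models, keep in mind that many two-word contexts never occur in a small corpus, so a trigram generator needs some fallback. The sketch below shows one common option, backing off from trigram to bigram to unigram counts. It uses standard-library counters rather than NLTK, and `train` and `next_word` are hypothetical names, not part of the assignment.

```python
import random
from collections import Counter, defaultdict


def train(tokens):
    """Count unigrams, bigram followers and trigram followers."""
    unigram = Counter(tokens)
    bigram = defaultdict(Counter)
    trigram = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        bigram[w1][w2] += 1
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        trigram[(w1, w2)][w3] += 1
    return unigram, bigram, trigram


def _sample(counter, rng):
    """Draw one key from a Counter in proportion to its count."""
    candidates = list(counter)
    weights = [counter[w] for w in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def next_word(history, unigram, bigram, trigram, rng=random):
    """Use the longest context that was seen in training."""
    if len(history) >= 2 and tuple(history[-2:]) in trigram:
        return _sample(trigram[tuple(history[-2:])], rng)
    if history and history[-1] in bigram:
        return _sample(bigram[history[-1]], rng)
    return _sample(unigram, rng)


tokens = "a b c a b d".split()
unigram, bigram, trigram = train(tokens)
print(next_word(["a", "b"], unigram, bigram, trigram))  # 'c' or 'd'
```
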

In [None]:
import random

import nltk
from nltk import ngrams
from nltk.corpus import PlaintextCorpusReader

# Load the Poetry Corpus and tokenize it
corpus_root = '/home/muhammadzohaib/Semester_3/PAI/A4'
wordlists = PlaintextCorpusReader(corpus_root, r'.*\.txt')
words = list(wordlists.words())

# Poem shape required by the assignment: three stanzas of four verses,
# each verse 7 to 10 words long
num_stanzas = 3
num_verses_per_stanza = 4
min_verse_length = 7
max_verse_length = 10

# Unigram Model
unigram_model = nltk.FreqDist(words)

# Bigram Model: condition = previous word, outcome = next word
bigrams = list(ngrams(words, 2))
cfd_bigram = nltk.ConditionalFreqDist(bigrams)

# Trigram Model: condition = previous two words, outcome = next word
trigrams = list(ngrams(words, 3))
cfd_trigram = nltk.ConditionalFreqDist(((w1, w2), w3) for w1, w2, w3 in trigrams)


def next_from_bigram(prev_word):
    """Most frequent follower of prev_word, or a random word if unseen."""
    # The CFD is keyed by the word itself, not by a one-element tuple
    if prev_word in cfd_bigram:
        return cfd_bigram[prev_word].max()
    return random.choice(words)


def next_from_trigram(verse):
    """Most frequent follower of the last two words of the verse."""
    if len(verse) < 2:
        # Not enough words for a trigram context, so use the bigram model
        return next_from_bigram(verse[-1])
    context = (verse[-2], verse[-1])
    if context in cfd_trigram:
        return cfd_trigram[context].max()
    # Fallback to a random word if the context is not found
    return random.choice(words)


# Generate poems using unigram, bigram, and trigram models
for _ in range(num_stanzas):
    for _ in range(num_verses_per_stanza):
        # Draw a new length for every verse
        verse_length = random.randint(min_verse_length, max_verse_length)

        # Select the first word randomly from the vocabulary
        first_word = random.choice(words)

        # Initialize the verse with the first word
        verse_unigram = [first_word]
        verse_bigram = [first_word]
        verse_trigram = [first_word]

        for _ in range(verse_length - 1):
            # The unigram model has no context, so max() always returns
            # the single most frequent word in the corpus
            verse_unigram.append(unigram_model.max())
            verse_bigram.append(next_from_bigram(verse_bigram[-1]))
            verse_trigram.append(next_from_trigram(verse_trigram))

        # Print the generated verse; swap in the other lists to compare models
        # print(' '.join(verse_unigram))
        # print(' '.join(verse_bigram))
        print(' '.join(verse_trigram))

    # Print an empty line after each stanza
    print()


OSError: ignored

In [None]:
#Your code here

In [None]:
import nltk
from nltk.corpus import PlaintextCorpusReader
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

corpus_root = '/home/muhammadzohaib/Semester_3/PAI/A4'
wordlists = PlaintextCorpusReader(corpus_root, r'.*\.txt')

# file_id = 'ghalib.txt'
# raw_text = wordlists.raw(file_id)

file_ids = wordlists.fileids()  # Get a list of file IDs in the corpus

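# NOTE: `words` is reassigned on every pass, so after the loop it holds
# only the tokens of the last file in `file_ids`.
for file_id in file_ids:
    raw_text = wordlists.raw(file_id)
    words = word_tokenize(raw_text)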

unigrams = nltk.ngrams(words, 1)
unigramFD = nltk.FreqDist(unigrams)

bigrams = nltk.ngrams(words, 2)
bigramFD = nltk.FreqDist(bigrams)

trigrams = nltk.ngrams(words, 3)
trigramFD = nltk.FreqDist(trigrams)


In [None]:
# bigrams = nltk.ngrams(words, 2)
# cfd1 = nltk.ConditionalFreqDist(bigrams)

# print(cfd1)

from nltk import ngrams

n = 2  # ConditionalFreqDist expects (condition, outcome) pairs, so keep n = 2 here
n_grams = list(ngrams(words, n))

# Create a conditional frequency distribution for n-grams
cfd = nltk.ConditionalFreqDist(n_grams)

# Verify the content of the CFD
print(cfd)

<ConditionalFreqDist with 1870 conditions>


In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize
import re

corpus_root = '/home/muhammadzohaib/Semester_3/PAI/A4'
wordlists = PlaintextCorpusReader(corpus_root, r'.*\.txt')

# file_id = 'ghalib.txt'
# raw_text = wordlists.raw(file_id)

file_ids = wordlists.fileids()  # Get a list of file IDs in the corpus
pattern = r'[،؛؟!"#\$%&\'\(\)\*\+,\-\.\/:؛;<=>?@\[\\\]^_`{|}~]'

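# NOTE: `words` is reassigned on every pass, so after the loop it holds
# only the punctuation-free tokens of the last file in `file_ids`.
for file_id in file_ids:
    raw_text = wordlists.raw(file_id)
    raw_text = re.sub(pattern, '', raw_text)
    words = word_tokenize(raw_text)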

unigrams = nltk.ngrams(words, 1)
unigramFD = nltk.FreqDist(unigrams)

bigrams = nltk.ngrams(words, 2)
bigramFD = nltk.FreqDist(bigrams)

trigrams = nltk.ngrams(words, 3)
trigramFD = nltk.FreqDist(trigrams)

# bigrams = nltk.ngrams(words, 2)
# cfd1 = nltk.ConditionalFreqDist(bigrams)

# print(cfd1)

from nltk import ngrams

n = 2  # ConditionalFreqDist expects (condition, outcome) pairs, so keep n = 2 here
n_grams = list(ngrams(words, n))

# Create a conditional frequency distribution for n-grams
cfd = nltk.ConditionalFreqDist(n_grams)

# Verify the content of the CFD
print(cfd)

import random

# The verse length is drawn once here and reused for every verse below
length = random.randint(7,10)

for i in range(4):

    for j in range(4):

        verse = [random.choice(words)]

        for k in range(length-1):

            context = verse[-1]
            # print(context)
            next_word = cfd[context].max()
            verse.append(' ')
            verse.append(next_word)

        print(" ".join(verse))

    print()

<ConditionalFreqDist with 1845 conditions>
غریب   اگرچہ   مغربیوں   کا   یہ   بات   کہ
بن   کے   لیے   تنگ   چیتے   کا   یہ
نومیدی   مجھے   نکتۂ   سلمان   خوش   آہنگ   دنیا
اور   بھی   ہیں   وہ   خاک   کہ   میں

میں   نے   آخر   جو   آوارۂ   جنوں   تھے
وہی   رہے   ہیں   وہ   خاک   کہ   میں
میں   نے   آخر   جو   آوارۂ   جنوں   تھے
کا   یہ   بات   کہ   میں   نے   آخر

لولوئے   لالا   اگر   کیفیت   ہے   یا   میرا
جہان   مے   خانہ   ہر   کوئی   بادہ   و
ہے   یا   میرا   یا   میرا   یا   میرا
ازلی   ہے   یا   میرا   یا   میرا   یا

آیا   ہے   یا   میرا   یا   میرا   یا
پارس   یہ   بات   کہ   میں   نے   آخر
اہل   خرد   کا   یہ   بات   کہ   میں
ستارے   یہ   بات   کہ   میں   نے   آخر



In [None]:
import random
import re

import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import word_tokenize

corpus_root = '/home/muhammadzohaib/Semester_3/PAI/A4'
wordlists = PlaintextCorpusReader(corpus_root, r'.*\.txt')

file_ids = wordlists.fileids()  # Get a list of file IDs in the corpus

# Urdu and ASCII punctuation to strip before tokenizing
pattern = r'[،؛؟!"#\$%&\'\(\)\*\+,\-\.\/:;<=>?@\[\\\]^_`{|}~]'

# Collect the tokens of every file; assigning to `words` inside the loop
# would keep only the last file
words = []
for file_id in file_ids:
    raw_text = wordlists.raw(file_id)
    raw_text = re.sub(pattern, '', raw_text)
    words.extend(word_tokenize(raw_text))

unigramFD = nltk.FreqDist(nltk.ngrams(words, 1))
bigramFD = nltk.FreqDist(nltk.ngrams(words, 2))
trigramFD = nltk.FreqDist(nltk.ngrams(words, 3))

# Conditional frequency distribution over bigrams:
# condition = previous word, outcome = next word
cfd = nltk.ConditionalFreqDist(nltk.ngrams(words, 2))

for i in range(3):  # three stanzas, as the assignment asks

    for j in range(4):

        # Draw a new length for every verse
        length = random.randint(7, 10)
        verse = [random.choice(words)]

        for k in range(length - 1):

            context = verse[-1]
            if context in cfd:
                next_word = cfd[context].max()
            else:
                # The context has no recorded follower; pick a random word
                next_word = random.choice(words)
            verse.append(next_word)

        print(" ".join(verse))

    print()


# Save all the generated verses to a new file
# output_file = "generated_verses.txt"
# with open(output_file, 'w', encoding='utf-8') as file:
#     for verse in verses:
#         file.write(verse + '\n')

# print(f'Generated verses saved to {output_file}')


OSError: ignored

# Question 2
 Classify the language of a text, choosing from the list given below, using only stop words. Remove punctuation and convert the text to lowercase first.

In [None]:
import nltk
from nltk.corpus import stopwords
stopwords.fileids()

['arabic',
 'azerbaijani',
 'basque',
 'bengali',
 'catalan',
 'chinese',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hebrew',
 'hinglish',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

In [None]:
Test="An article is qualunque member van un class of dedicated words naquele estão used with noun phrases per mark the identifiability of the referents of the noun phrases"

In [None]:
# Your output should look like

{'arabic': 0,
 'azerbaijani': 1,
 'basque': 0,
 'bengali': 0,
 'catalan': 3,
 'chinese': 0,
 'danish': 0,
 'dutch': 3,
 'english': 5,
 'finnish': 0,
 'french': 1,
 'german': 1,
 'greek': 0,
 'hebrew': 0,
 'hinglish': 8,
 'hungarian': 1,
 'indonesian': 1,
 'italian': 2,
 'kazakh': 0,
 'nepali': 0,
 'norwegian': 0,
 'portuguese': 1,
 'romanian': 1,
 'russian': 0,
 'slovene': 0,
 'spanish': 1,
 'swedish': 0,
 'tajik': 0,
 'turkish': 0}

In [None]:
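import re

from nltk.tokenize import word_tokenize

# Strip punctuation and lowercase before matching against the stop-word lists
Test = re.sub(r'[^\w\s]', '', Test)
Test = Test.lower()

words = word_tokenize(Test)

# For each language, count how many tokens appear in its stop-word list
result_dict = {}

for f_id in stopwords.fileids():
    stop_words = set(stopwords.words(f_id))
    result_dict[f_id] = sum(1 for w in words if w in stop_words)

result_dict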

{'arabic': 0,
 'azerbaijani': 3,
 'basque': 0,
 'bengali': 0,
 'catalan': 3,
 'chinese': 0,
 'danish': 0,
 'dutch': 5,
 'english': 9,
 'finnish': 0,
 'french': 1,
 'german': 1,
 'greek': 0,
 'hebrew': 0,
 'hinglish': 12,
 'hungarian': 1,
 'indonesian': 1,
 'italian': 2,
 'kazakh': 0,
 'nepali': 0,
 'norwegian': 0,
 'portuguese': 1,
 'romanian': 1,
 'russian': 0,
 'slovene': 0,
 'spanish': 1,
 'swedish': 0,
 'tajik': 0,
 'turkish': 0}
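
The counts only become a classification once the highest-scoring language is picked. The sketch below shows that last step with tiny, made-up stop-word sets so that it runs without NLTK data; `stopword_counts`, `best_languages` and `toy_sets` are hypothetical names. In the notebook, the sets would instead be built from `stopwords.words(lang)`.

```python
import re


def stopword_counts(text, stopword_sets):
    """Count, per language, how many tokens of `text` are stop words."""
    cleaned = re.sub(r"[^\w\s]", "", text).lower()
    tokens = cleaned.split()
    return {lang: sum(tok in stops for tok in tokens)
            for lang, stops in stopword_sets.items()}


def best_languages(counts):
    """Return every language tied for the highest non-zero count."""
    top = max(counts.values(), default=0)
    if top == 0:
        return []
    return sorted(lang for lang, c in counts.items() if c == top)


# Toy stop-word sets (assumptions for illustration, not the NLTK lists)
toy_sets = {
    "english": {"the", "is", "of", "with", "a", "an"},
    "dutch": {"van", "de", "het", "een"},
    "italian": {"un", "per", "di", "il"},
}

counts = stopword_counts(
    "An article is a member van un class of words, per the rules.", toy_sets)
print(counts)                  # {'english': 5, 'dutch': 1, 'italian': 2}
print(best_languages(counts))  # ['english']
```

Returning a list makes ties explicit. In the recorded output above, 'hinglish' scores 12 against 9 for 'english', presumably because that list shares many English function words, so a plain maximum would not return 'english' for this sentence.
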

# Question 3
 Rule Based Roman Urdu Text Normalization

Roman Urdu lacks a standard lexicon, and many spelling variations usually exist for a given word. For example, the word zindagi (life) is also written as zindagee, zindagy, zaindagee and zndagi. In this question you have to normalize Roman Urdu words using the rules given in the attached PDF. Your code should work for a complete sentence or for multiple sentences.

For example, zaroori, zaruri and zarori all map to 'zrory', so zrory becomes the normalized form for all the spellings mentioned above.

In [None]:
import re

normrules_dict = {
    "ain$": "ein",
    "ar": "r",
    "ai": "ae",
    "iy+": "i",
    "ay$": "ae",
    "ih+": "eh",
    "s+": "s",
    "ie$": "e",
    "ry": "ri",
    "es": "is",
    "sy": "si",
    "a+": "a",
    "ty": "ti",
    "j+": "j",
    "o+": "o",
    "e+": "i",
    "([bcdefghijklmnopqrstuvwxyz])i": r'\1y',
    "d+": "d",
    "u": "o"
}

def normalize(word):
    for pattern, replacement in normrules_dict.items():
        word = re.sub(pattern, replacement, word)
    return word




# Avoid naming the variable `str`, which would shadow the built-in type
text = input('Enter String:')
token_words = word_tokenize(text)

# Normalize each token, then rebuild the sentence once
n_words = [normalize(word) for word in token_words]
n_sent = ' '.join(n_words)

print(n_sent)

# zaroori zaruri zarori
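
The rules can also be checked without interactive input. The sketch below copies the same rules into an ordered list and applies them to a whole sentence. Two things here are assumptions rather than part of the original cell: the input is lowercased, and tokens are split on whitespace instead of using `word_tokenize`. The names `RULES`, `normalize_word` and `normalize_text` are illustrative.

```python
import re

# Ordered (pattern, replacement) pairs copied from normrules_dict above.
# Order matters: each rule sees the output of the rules before it.
RULES = [
    ("ain$", "ein"), ("ar", "r"), ("ai", "ae"), ("iy+", "i"),
    ("ay$", "ae"), ("ih+", "eh"), ("s+", "s"), ("ie$", "e"),
    ("ry", "ri"), ("es", "is"), ("sy", "si"), ("a+", "a"),
    ("ty", "ti"), ("j+", "j"), ("o+", "o"), ("e+", "i"),
    ("([bcdefghijklmnopqrstuvwxyz])i", r"\1y"), ("d+", "d"), ("u", "o"),
]
COMPILED = [(re.compile(pattern), repl) for pattern, repl in RULES]


def normalize_word(word):
    """Apply every rule to one word, in order."""
    for pattern, replacement in COMPILED:
        word = pattern.sub(replacement, word)
    return word


def normalize_text(text):
    """Normalize each whitespace-separated word of a sentence."""
    return " ".join(normalize_word(w) for w in text.lower().split())


print(normalize_text("zaroori zaruri zarori"))  # zrory zrory zrory
```
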

In [None]:
# import files

def load_word_list(path):
    """Read one dictionary entry per line, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as file:
        return [line.strip() for line in file if line.strip()]

#--------FOR LIST#1---------

path1 = "/home/muhammadzohaib/A1/Word dictionary/bigram_words.txt"  # Replace with the path to your .txt file
list1 = load_word_list(path1)

#--------FOR LIST#2---------

path2 = "/home/muhammadzohaib/A1/Word dictionary/words.txt"
list2 = load_word_list(path2)

#---------------------------------------------------------------------------

dictionary = list1 + list2

# Replace with the path to your text file
with open("/home/muhammadzohaib/A1/Q1/word_test.txt", "r", encoding="utf-8") as file:
    unprocessed_text = file.read()


# Segment the text by forward maximum matching: at each position, take the
# longest substring of at most n characters that is in the dictionary.
# (The earlier range bounds, min(len(sentence), i + n, i), were never
# greater than i, so the loops did not run and nothing was matched.)
def segment_with_context_ngrams(sentence, dictionary, n):
    vocab = set(dictionary)  # set membership tests are O(1)
    words = []
    i = 0
    while i < len(sentence):
        matched = False
        for j in range(min(len(sentence), i + n), i, -1):
            ngram = sentence[i:j]
            if ngram in vocab:
                words.append(ngram)
                i = j
                matched = True
                break
        if not matched:
            # No dictionary entry starts here; skip this character
            i += 1
    return words

# Segment the text with a window of up to 2 characters
segmented_words_with_context_trigrams = segment_with_context_ngrams(unprocessed_text, dictionary, 2)
segmented_sentence_with_context_trigrams = " ".join(segmented_words_with_context_trigrams)
print(segmented_sentence_with_context_trigrams)


# print("Unprocessed Text:", unprocessed_text)
# print("Dictionary:", dictionary)






In [None]:
# Replace with the path to your text file
with open("/home/muhammadzohaib/A1/Q1/word_test.txt", "r", encoding="utf-8") as file:
    unprocessed_text = file.read()


# Function to find the best word based on the dictionary
def find_best_word(ngram, dictionary):
    for i in range(len(ngram), 0, -1):
        if ngram[:i] in dictionary:
            return ngram[:i]
    return ngram[0]

# Tokenize the unprocessed text using n-grams and the dictionary
def tokenize_with_ngrams(text, dictionary, n):
    word_list = []
    words = set(dictionary)

    i = 0
    while i < len(text):
        for j in range(i + n, i, -1):
            ngram = text[i:j]
            if ngram not in words:
                ngram = text[i:i + n]
                segmented_word = find_best_word(ngram, dictionary)
                word_list.append(segmented_word)
                i += len(segmented_word)
            else:
                word_list.append(ngram)
                i = j
                break


    return word_list

# Tokenize the unprocessed text using n-grams
n = 3  # You can adjust 'n' based on your needs
tokenized_text = tokenize_with_ngrams(unprocessed_text, dictionary, n)

# Display the segmented text
segmented_text = " ".join(tokenized_text)
print(segmented_text)


تج ربہ کار ہند و س تان ی آ ف س پنر روی چند رنا یش ون نے آئن دہا یش یا ء کپ 2 0 2 3 ء کی غیر ی قی نی قسم تپ را پنی را ئے کا ا ظ ہار کیا ہے جوپ اکس تان میں ہون ے ج ارہ اہے اپن ے ی وٹ ی و بچی نل پرب ات کر تے ہوئ ے ر وی چند رنا یش ون نے کہا کہا گرپ ڑ و سیم لک بھا رتا یش یاک پ 2 0 2 3 ء میں شرک تکر ناچ اہت اہے توم قام تب دی لک ردی ناچ اہ یے 
 آفس پنر نے کہا کہا نٹر نیش نل کرک ٹکو نسل ( آ ئی سیس ی ) ن ے پ اکس تان کوٹ ورن امن ٹکی میز بان یک احق دے دیا ہے لیک نبھ ارت پاک ستا نکا دور ہک رن ے کوت یار نہی ں ر وی چند رنا یش ون نے بھی 2 0 2 3 ء میں 5 0 او ورک ے و رل ڈک پکے حوا لے سے پاک ستا نکر کٹ بور ڈ ( پ ی س ی ب ی ) کے حال یہ بیا نکا جوا بدی تے ہوئ ے کہا میر ے خ یال میں یہ مم کن نہی ں ہے آفس پنر نے م زید کہا کہ پاک ستا نن ے پہل ے ب ھا رت کا دور ہک رن ے سے انک ارک ردی ات ھا لیک ن آ خر کار وہم یگ ا ایو نٹ سمی ں شرک تکے لیے بھا رت گئے تھے 
 غ ور طلب ہے کہا یش یاک پکے معا ملے پر حال ہی میں بحر ین میں ہون ے و الی ایک ہن گام یم ی ٹ نگم ی ں غور کیا گیا جہا ں و ینی و ک ے ب ارے میں حتم یف ی صلہ مار چت کم
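
The output above is split into pieces of at most three characters because `n = 3` caps the matching window. One possible refinement, sketched below, is forward maximum matching with the window set to the length of the longest dictionary entry. `max_match` is a hypothetical helper, the example uses a toy English dictionary for readability, and emitting unmatched characters as single-character tokens is a design assumption.

```python
def max_match(text, dictionary, max_len=None):
    """Greedy longest-match segmentation of `text` against `dictionary`."""
    vocab = set(dictionary)
    if max_len is None:
        # Let the window grow to the longest known word
        max_len = max((len(w) for w in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1  # whitespace separates tokens but is not a token
            continue
        # Try the longest candidate first, then shrink the window
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: keep the character so no text is lost
            tokens.append(text[i])
            i += 1
    return tokens


print(max_match("thecatsat", ["the", "cat", "sat", "at", "he"]))
# ['the', 'cat', 'sat']
```

Greedy matching can still pick a long word that blocks a better split later in the text, so its quality depends heavily on dictionary coverage.
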

In [None]:
# Segment the sentence using trigrams with context
segmented_words_with_context_trigrams = segment_with_context_ngrams(unprocessed_text, dictionary, 3)
segmented_sentence_with_context_trigrams = " ".join(segmented_words_with_context_trigrams)
print(segmented_sentence_with_context_trigrams)


# print("Unprocessed Text:", unprocessed_text)
# print("Dictionary:", dictionary)


