# PA1.1 Text Generation using Shannon Visualization Method

### Introduction

In this notebook, you will be generating text using the Shannon Visualization method.

An n-gram is a contiguous sequence of n words. For example "Machine" is a unigram, "Machine Learning" is a bigram and "Machine Learning PA1" is a trigram. In language modeling, n-gram models are probabilistic models of text that use word dependencies and context to predict the likelihood of occurence of an n-gram, i.e. predicting the nth word in an n-gram based on the previous n-1 words. One use of the predictions made by such a model is text generation. In this part, you will be generating text using the Shannon Visualization Method.

For additional details of the working of n-gram models and shannon visualization method, you can also consult [Chapter 3](https://web.stanford.edu/~jurafsky/slp3/3.pdf) of the SLP3 book as reference.

### Instructions

- Follow along with the notebook, filling out the necessary code where instructed.

- <span style="color: red;">Read the Submission Instructions, Plagiarism Policy, and Late Days Policy in the attached PDF.</span>

- <span style="color: red;">Make sure to run all cells for credit.</span>

- <span style="color: red;">Do not remove any pre-written code.</span>

- <span style="color: red;">You must attempt all parts.</span>

For this notebook, in addition to standard libraries i.e. `numpy`, `pandas`, `regex`, `matplotlib` and `scipy`, you **can** use [UrduHack](https://github.com/urduhack/urduhack) for tokenization, and [NLTK](https://www.nltk.org/) for training your n-grams. However, no other machine learning toolkits or libraries are allowed.

In [1]:
#Installing the dependencies
%pip install matplotlib numpy pandas regex matplotlib scipy urduhack nltk

# import all required libraries here
import numpy as np
import pandas as pd
import regex as re
import matplotlib.pyplot as plt
import scipy
import nltk
import urduhack
urduhack.download()



Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.[0m
Note: you may need to restart the kernel to use updated packages.



TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 



### Dataset

You will be using the Urdu short stories by Patras Bukhari given in the folder `Urdu Short Stories` in the attached zip file for the purposes of this part of the assignment. This contains 6 stories of varying lengths which will serve as inputs for your n-gram model. 

You're required to implement an n-gram model that uses the given stories to generate Urdu text that mimics the input stories.

## Loading and Preprocessing the Dataset

Read in the short story files given and tokenize the text to be preprocessed.

In [2]:
# code here
# import nltk
from urduhack.tokenization import word_tokenizer
from urduhack import normalize
import os
urdu_stories=[]
tokenized_text=[]

for f in os.listdir('./DataP1'):
    story_location="./DataP1/"+ str(f)
    with open(story_location, 'r', encoding='utf-8') as file:
        story = file.read()
        urdu_stories.append(story)

# Adding the start and stop words for sentence generation in future (using bigrams) + Also removing \n tokens
# urdu_stories = [story.replace('۔', "<s>" +" "+ "</s>") for story in urdu_stories]
urdu_stories = [story.replace('\n', " ") for story in urdu_stories]

# Break each Urdu story into chunks of max length 256 because the urduhack model cannot take a length more than 256
chunked_urdu_stories = []

for story in urdu_stories:
    chunks = [story[i:i + 256] for i in range(0, len(story), 256)]
    chunked_urdu_stories.append(chunks)
print("chunked")
print(len(chunked_urdu_stories))

for i in range(6):
    #This is a preprocessing step to normalise the urdu sentences before creating tokens
    normalized_text = normalize(str(chunked_urdu_stories[i]))
    tokens=word_tokenizer(normalized_text)
    tokenized_text.append(tokens)

print(tokenized_text)

chunked
6
[['میبل', 'لڑکیوں', 'کے', 'کالج', 'میں', 'تھی۔', 'لیکن', 'ہم', 'دونوں', 'کیمبرج', 'یونیورسٹی', 'میں', 'ایک', 'ہی', 'مضمون', 'پڑھتے', 'تھے،اس', 'لیے', 'اکثرلکچروں', 'میں', 'ملاقات', 'ہو', 'جاتی', 'تھی۔اس', 'کے', 'علاوہ', 'ہم', 'دوست', 'بھی', 'تھے۔', 'کئی', 'دلچسپیوں', 'میں', 'ایک', 'دوسرے', 'کے', 'شریک', 'ہوتے', 'تھے۔', 'تصویروں', 'اور', 'موسیقی', 'کاشوق', 'اسے', 'بھی', 'تھا،', 'میں', 'بھی', 'ہمہدانی', 'کادعویدار۔', 'اکثرگیلریوں', 'یاکانسرٹوں', 'میں', 'اکھٹے', 'جایاکرتے', 'تھ'], ['ہم', 'نے', 'کالج', 'میں', 'تعلیم', 'تو', 'ضرور', 'پائی', 'اور', 'رفتہ', 'رفتہ', 'بی،', 'اے', 'بھی', 'پاس', 'کر', 'لی', 'ا،', 'لیکن', 'اس', 'نصف', 'صدی', 'کے', 'دوران', 'جو', 'کالج', 'میں', 'گزارنی', 'پڑی،ہاسٹل', 'میں', 'داخل', 'ہونے', 'کی', 'اجازت', 'ہمیں', 'صرف', 'ایک', 'ہی', 'دفعہ', 'ملی۔', 'خدا', 'کا', 'یہ', 'فضل', 'ہم', 'پر', 'کب', 'اور', 'کس', 'طرح', 'ہوا،یہ', 'سوال', 'ایک', 'داستان', 'کا', 'محتاج', 'ہے۔', 'جب', 'ہم', 'نے', 'انٹرنس', 'پاس', 'کیا', 'تو', 'مقامی', 'اس', 'کول', 'کے', 'ہیڈ', 'ماسٹر'

Preprocess the tokenized data. Go through the data and use your own discretion to decide on what kind of pre-processing might be required.

In [3]:
# code here
from urduhack.preprocessing import remove_english_alphabets
#Converting the list into a flat list
tokenized_text_flat = [item for sublist in tokenized_text for item in sublist]

#Preprocessing the tokens
#1 Normalising using the normalise function of urdu hack already done
#2 Removing punctuation using regex
preprocessed_tokens = [re.sub(r'[^\w\s]', '', token) for token in tokenized_text_flat] 
#3 Removing empty tokens
preprocessed_tokens = [token for token in preprocessed_tokens if token] 

#4 filtering out non-urdu tokens
preprocessed_tokens = [remove_english_alphabets(token) for token in preprocessed_tokens]

print( 'preprocessed tokens:',preprocessed_tokens)
print(len(preprocessed_tokens))

preprocessed tokens: ['میبل', 'لڑکیوں', 'کے', 'کالج', 'میں', 'تھی', 'لیکن', 'ہم', 'دونوں', 'کیمبرج', 'یونیورسٹی', 'میں', 'ایک', 'ہی', 'مضمون', 'پڑھتے', 'تھےاس', 'لیے', 'اکثرلکچروں', 'میں', 'ملاقات', 'ہو', 'جاتی', 'تھیاس', 'کے', 'علاوہ', 'ہم', 'دوست', 'بھی', 'تھے', 'کئی', 'دلچسپیوں', 'میں', 'ایک', 'دوسرے', 'کے', 'شریک', 'ہوتے', 'تھے', 'تصویروں', 'اور', 'موسیقی', 'کاشوق', 'اسے', 'بھی', 'تھا', 'میں', 'بھی', 'ہمہدانی', 'کادعویدار', 'اکثرگیلریوں', 'یاکانسرٹوں', 'میں', 'اکھٹے', 'جایاکرتے', 'تھ', 'ہم', 'نے', 'کالج', 'میں', 'تعلیم', 'تو', 'ضرور', 'پائی', 'اور', 'رفتہ', 'رفتہ', 'بی', 'اے', 'بھی', 'پاس', 'کر', 'لی', 'ا', 'لیکن', 'اس', 'نصف', 'صدی', 'کے', 'دوران', 'جو', 'کالج', 'میں', 'گزارنی', 'پڑیہاسٹل', 'میں', 'داخل', 'ہونے', 'کی', 'اجازت', 'ہمیں', 'صرف', 'ایک', 'ہی', 'دفعہ', 'ملی', 'خدا', 'کا', 'یہ', 'فضل', 'ہم', 'پر', 'کب', 'اور', 'کس', 'طرح', 'ہوایہ', 'سوال', 'ایک', 'داستان', 'کا', 'محتاج', 'ہے', 'جب', 'ہم', 'نے', 'انٹرنس', 'پاس', 'کیا', 'تو', 'مقامی', 'اس', 'کول', 'کے', 'ہیڈ', 'ماسٹر', 'صا

## Creating Unigrams

Generate a list of unigrams. Print the first 10 unigrams obtained.

In [4]:
# code here
from nltk.util import ngrams
unigrams = list(ngrams(preprocessed_tokens, 1))

print('First 10 unigrams')
for i in range(10):
    print(unigrams[i])

First 10 unigrams
('میبل',)
('لڑکیوں',)
('کے',)
('کالج',)
('میں',)
('تھی',)
('لیکن',)
('ہم',)
('دونوں',)
('کیمبرج',)


Find the probabilities for each unique unigram. (Refer to the Shannon Visualization Method that we studied in class.)

In [5]:
# code here
# Calculating the probabilties of each unigram
from nltk import FreqDist
frequency_unigrams = FreqDist(unigrams)
no_of_unigrams = len(unigrams)

#creating a dictionary for storing probabilties
probabilties={}
for unigram, freq in frequency_unigrams.items():
    probabilties[unigram[0]]=(freq/no_of_unigrams)
    print('Unigram:', unigram[0], " Probability: " ,freq/no_of_unigrams)


Unigram: میبل  Probability:  0.0024449877750611247
Unigram: لڑکیوں  Probability:  0.0024449877750611247
Unigram: کے  Probability:  0.034229828850855744
Unigram: کالج  Probability:  0.007334963325183374
Unigram: میں  Probability:  0.044009779951100246
Unigram: تھی  Probability:  0.0024449877750611247
Unigram: لیکن  Probability:  0.009779951100244499
Unigram: ہم  Probability:  0.012224938875305624
Unigram: دونوں  Probability:  0.0024449877750611247
Unigram: کیمبرج  Probability:  0.0024449877750611247
Unigram: یونیورسٹی  Probability:  0.0024449877750611247
Unigram: ایک  Probability:  0.012224938875305624
Unigram: ہی  Probability:  0.009779951100244499
Unigram: مضمون  Probability:  0.007334963325183374
Unigram: پڑھتے  Probability:  0.0024449877750611247
Unigram: تھےاس  Probability:  0.0024449877750611247
Unigram: لیے  Probability:  0.004889975550122249
Unigram: اکثرلکچروں  Probability:  0.0024449877750611247
Unigram: ملاقات  Probability:  0.0024449877750611247
Unigram: ہو  Probability:  0.

## Creating Bigrams

Generate a list of bigrams. Print the first 10 bigrams obtained.

In [6]:
from nltk.util import ngrams
bigrams = list(ngrams(preprocessed_tokens, 2))

print('First 10 bigrams')
for i in range(10):
    print(bigrams[i])

First 10 bigrams
('میبل', 'لڑکیوں')
('لڑکیوں', 'کے')
('کے', 'کالج')
('کالج', 'میں')
('میں', 'تھی')
('تھی', 'لیکن')
('لیکن', 'ہم')
('ہم', 'دونوں')
('دونوں', 'کیمبرج')
('کیمبرج', 'یونیورسٹی')


Find the probabilities for each unique bigram. 

In [7]:
# code here
# Calculating the probabilties of each bigram
from nltk import FreqDist
frequency = FreqDist(bigrams)
no_of_bigrams = len(bigrams)

#creating a dictionary for storing probabilties
probabilties_bigrams={}
for bigram, freq in frequency.items():
    individual_bigram_probability = (freq/(no_of_bigrams+1))
    unigram_probability=probabilties[bigram[0]]
    probabilties_bigrams[bigram]= individual_bigram_probability/unigram_probability
    print('Bigram:', bigram, " Probability: " ,individual_bigram_probability/unigram_probability)



Bigram: ('میبل', 'لڑکیوں')  Probability:  1.0
Bigram: ('لڑکیوں', 'کے')  Probability:  1.0
Bigram: ('کے', 'کالج')  Probability:  0.07142857142857144
Bigram: ('کالج', 'میں')  Probability:  1.0
Bigram: ('میں', 'تھی')  Probability:  0.05555555555555555
Bigram: ('تھی', 'لیکن')  Probability:  1.0
Bigram: ('لیکن', 'ہم')  Probability:  0.25
Bigram: ('ہم', 'دونوں')  Probability:  0.19999999999999998
Bigram: ('دونوں', 'کیمبرج')  Probability:  1.0
Bigram: ('کیمبرج', 'یونیورسٹی')  Probability:  1.0
Bigram: ('یونیورسٹی', 'میں')  Probability:  1.0
Bigram: ('میں', 'ایک')  Probability:  0.1111111111111111
Bigram: ('ایک', 'ہی')  Probability:  0.39999999999999997
Bigram: ('ہی', 'مضمون')  Probability:  0.25
Bigram: ('مضمون', 'پڑھتے')  Probability:  0.3333333333333333
Bigram: ('پڑھتے', 'تھےاس')  Probability:  1.0
Bigram: ('تھےاس', 'لیے')  Probability:  1.0
Bigram: ('لیے', 'اکثرلکچروں')  Probability:  0.5
Bigram: ('اکثرلکچروں', 'میں')  Probability:  1.0
Bigram: ('میں', 'ملاقات')  Probability:  0.0555555555

## Generating Text using the Shannon Visualization Method

Generate a paragraph with ten sentences. Use the Shannon visualization method that we studied in class.

In [8]:
# code here
import random
possible_first_words = [word[0] for word in probabilties_bigrams]
sentences=[]
for j in range(10):
    print("Sentence number", j+1,":")

    #choosing the first word
    first_word=random.choice(possible_first_words)
    sentence=[]
    possible_words=[]
    # creating sentences of variable length - 8 to 15 words
    random_sentence_length = random.randint(8, 15)
    for i in range(random_sentence_length):
        current_word=first_word
        
        for word in probabilties_bigrams:
            if word[0==current_word]:
                possible_words.append(word[1] )
        
        next_word = random.choice(possible_words)
        sentence.append(next_word)
        current_word=next_word

    print(sentence)
    sentences.append(sentence)



Sentence number 1 :
['انقطاع', 'اے', 'کا', 'طرف', 'کے', 'تمہید', 'بعد', 'داستان']
Sentence number 2 :
['کرے', 'صرف', 'کاشوق', 'دل', 'سے', 'کیونکہ', 'جو', 'بیٹھے', 'فوقتا', 'کرے', 'سوال', 'باتوں']
Sentence number 3 :
['دونوں', 'تعلیم', 'لالہ', 'اور', 'کہ', 'چاہتا', 'تھے', 'کے', 'خبر', 'سمجھئے', 'لاہ', 'آتے', 'موسیقی']
Sentence number 4 :
['نہ', 'کہ', 'یا', 'وجود', 'معاملے', 'ٹھہر', 'سامنے', 'تھےاس', 'نکرجی', 'ناہل', 'ہیڈ']
Sentence number 5 :
['عرض', 'ا', 'عرض', 'عرصہ', 'انٹرنس', 'گی', 'ایک', 'طور', 'مدت', 'پھر', 'جوشامت', 'خاص', 'تو']
Sentence number 6 :
['ٹھہر', 'ایک', 'ملک', 'عرض', 'کو', 'بائیں', 'میں', 'جگادیاکیجیئے', 'کے', 'کے']
Sentence number 7 :
['طور', 'کے', 'بہت', 'انقطاع', 'موت', 'کی', 'اور', 'ہی', 'ہو']
Sentence number 8 :
['سے', 'کیا', 'ہیںآپس', 'میرے', 'قریب', 'دائیں', 'کراچی', 'دریافت', 'عنوان']
Sentence number 9 :
['خبر', 'میرے', 'کہ', 'آتیہےتوشہر', 'مدت', 'گزر', 'تھی', 'دکھانے', 'پر', 'دونوں', 'آپ']
Sentence number 10 :
['البلد', 'جب', 'ادب', 'تمام', 'بائیں', 'اس', 'صبح

## Computing the Probability of Sentences

Compute the probability of each sentence that has been generated in the previous step. Refer to the lecture slides to see what does it mean to compute the _probability of a sentence_. 

Apply the **unigram assumption** while computing these probabilities.


In [9]:
#code here 
# Unigram assumption so need to check individual probabilities of the tokens and multiply them together

for sentence in sentences:
    sentence_prob=1
    for word in sentence:
        individual_probability=probabilties[word]
        sentence_prob=sentence_prob*individual_probability
    
    print('Probability:', sentence_prob, "for sentence:",sentence)

Probability: 5.006100948636449e-19 for sentence: ['انقطاع', 'اے', 'کا', 'طرف', 'کے', 'تمہید', 'بعد', 'داستان']
Probability: 1.314355333120921e-29 for sentence: ['کرے', 'صرف', 'کاشوق', 'دل', 'سے', 'کیونکہ', 'جو', 'بیٹھے', 'فوقتا', 'کرے', 'سوال', 'باتوں']
Probability: 1.1809916501758885e-30 for sentence: ['دونوں', 'تعلیم', 'لالہ', 'اور', 'کہ', 'چاہتا', 'تھے', 'کے', 'خبر', 'سمجھئے', 'لاہ', 'آتے', 'موسیقی']
Probability: 3.359820820290354e-28 for sentence: ['نہ', 'کہ', 'یا', 'وجود', 'معاملے', 'ٹھہر', 'سامنے', 'تھےاس', 'نکرجی', 'ناہل', 'ہیڈ']
Probability: 3.347482001632337e-32 for sentence: ['عرض', 'ا', 'عرض', 'عرصہ', 'انٹرنس', 'گی', 'ایک', 'طور', 'مدت', 'پھر', 'جوشامت', 'خاص', 'تو']
Probability: 1.3466833811887797e-21 for sentence: ['ٹھہر', 'ایک', 'ملک', 'عرض', 'کو', 'بائیں', 'میں', 'جگادیاکیجیئے', 'کے', 'کے']
Probability: 5.875130697666248e-20 for sentence: ['طور', 'کے', 'بہت', 'انقطاع', 'موت', 'کی', 'اور', 'ہی', 'ہو']
Probability: 5.620341866389907e-23 for sentence: ['سے', 'کیا', 'ہیںآپس'

## Discussion and Evaluation

- Analyze the text generated, and mention 3 distinct observations. Also compare it with the input text and how different it is and why might that be.

- Do you notice any repetition of words in the generated sentences? If yes, how would you solve it?

- Is going upto `n=2` enough? What do you think would be a good value of n and why?


Answer here:
<div style="color:green;">
...
 
- Three obsrvations:

1. The quality of the text generated and the words doesn't seem to be very good. This might be due to generation method issues.
2. The sentences generated are alot less coherent than the ones in the inputs.
3. The sentences generated are of variable lengths due to randomness given however in the inputs, the sentence stucture as well as the sentence length is depenedant on the meaning and contextual linkage.

- Repitition:
There is a definite repitition in words in the generated sentences like (the urdu word aur) becuase of the repition of the word in the input sentences as well. This makes sense as higher probable words are going to repeat more often as we are using bigrams so certain words have a higher probability of following others.
We can solve this by trying other methods like trigrams or higher order grams. We can also maybe limit the number of times a word appears in generation to put a bias towards less repeating words.

- N-gram level:
The quality of text generated shows that n=2 is not enough as it doesnot capture any long term dependancy or variable sentence structures. Higher values of n can be tried and then the best can be figured out. It is hard to conclude that what value of n would be a good choice however it must be greater than 2 so that it can capture better sentence context and result in better generation (more human like). Input dataset will always pose a huge impact however.
</div>
