Question 1 (SMS)

Required Imports

In [39]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import wordnet
from nltk import pos_tag
from nltk.util import ngrams
from collections import Counter


Tag Conversion Function:

    Converts NLTK (Penn Treebank) part-of-speech tags to WordNet part-of-speech tags. The WordNet lemmatizer uses the POS tag to choose the correct lemma, so passing the right tag improves the reduction.

In [40]:
# Function to convert NLTK POS tags to WordNet POS tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
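
The same mapping can also be written as a lookup table. The sketch below is only an illustration, not the notebook's function: the name penn_to_wordnet is hypothetical, and the single-letter values are the strings behind NLTK's wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV constants, written out so that the snippet runs without NLTK installed.

```python
# Hypothetical table-driven variant of get_wordnet_pos (no NLTK import needed).
# The values mirror NLTK's constants: wordnet.ADJ == 'a', wordnet.VERB == 'v',
# wordnet.NOUN == 'n', wordnet.ADV == 'r'.
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wordnet(tag):
    # Only the first letter of a Penn Treebank tag matters here;
    # every other tag falls back to noun, as in the function above.
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], "n")


for tag in ["JJ", "VBD", "NNS", "RB", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```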

File Reading/Data Frame Creation

In [41]:
sms_filename = "SMSSpamCollection"
sms = pd.read_csv(sms_filename, sep='\t', header=None, names=['label', 'message'])
sms.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Normalization Step:

    1. Converting all messages to lowercase to establish consistency across messages
    2. Removing all leading and trailing whitespace
    3. Removing all punctuation and numbers to reduce noise

In [42]:
sms['message'] = sms['message'].str.lower()

sms['message'] = sms['message'].str.strip()

sms['message'] = sms['message'].str.replace(r'[^\w\s]', '', regex=True) #remove punctuation
sms['message'] = sms['message'].str.replace(r'\d+', '', regex=True) #remove numbers
sms.head()

Unnamed: 0,label,message
0,ham,go until jurong point crazy available only in ...
1,ham,ok lar joking wif u oni
2,spam,free entry in a wkly comp to win fa cup final...
3,ham,u dun say so early hor u c already then say
4,ham,nah i dont think he goes to usf he lives aroun...
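
To make the effect of these three steps concrete, here is a small standalone sketch that applies the same regular expressions to one string with the standard re module. The function name normalize and the sample sentence are made up for illustration. Note two side effects: the underscore survives because \w matches it, and removing a number leaves a double space behind.

```python
import re


def normalize(text):
    # Same order as the notebook cell: lowercase, strip, punctuation, digits.
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation (underscore is kept by \w)
    text = re.sub(r"\d+", "", text)      # remove numbers
    return text


sample = "  Free entry in 2 a wkly comp, DON'T miss_out!  "
print(repr(normalize(sample)))
# → 'free entry in  a wkly comp dont miss_out'
```

The double space is harmless here because the later steps split on whitespace.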


Stopword removal to avoid unnecessary computations later

In [43]:
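english_stopwords = set(stopwords.words('english'))  # build the set once instead of reloading the list for every word
sms['message'] = sms['message'].apply(lambda x: ' '.join([word for word in x.split() if word not in english_stopwords]))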
sms.head()

Unnamed: 0,label,message
0,ham,go jurong point crazy available bugis n great ...
1,ham,ok lar joking wif u oni
2,spam,free entry wkly comp win fa cup final tkts st ...
3,ham,u dun say early hor u c already say
4,ham,nah dont think goes usf lives around though
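
One consequence of removing punctuation before stopwords is visible in row 4 above: "dont" survives because the stopword list contains the apostrophe form "don't". The sketch below reproduces this with a tiny hand-written stopword set. The set and the function name remove_stopwords are assumptions for illustration, not the full NLTK list.

```python
# Hypothetical miniature stopword set; the real NLTK English list is much longer.
STOPWORDS = {"i", "he", "to", "don't"}


def remove_stopwords(text, stopwords):
    # Keep only the words that are not in the stopword set.
    return " ".join(word for word in text.split() if word not in stopwords)


# Punctuation already stripped: "dont" no longer matches "don't".
print(remove_stopwords("nah i dont think he goes to usf", STOPWORDS))
# → nah dont think goes usf

# Punctuation still present: "don't" is removed.
print(remove_stopwords("nah i don't think he goes to usf", STOPWORDS))
# → nah think goes usf
```

If contractions should be dropped as well, one option is to remove stopwords before stripping punctuation.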


Tokenization step: split each message into a list of word tokens for the later steps

In [44]:
sms['tokens'] = sms['message'].apply(word_tokenize)
sms.head()

Unnamed: 0,label,message,tokens
0,ham,go jurong point crazy available bugis n great ...,"[go, jurong, point, crazy, available, bugis, n..."
1,ham,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]"
2,spam,free entry wkly comp win fa cup final tkts st ...,"[free, entry, wkly, comp, win, fa, cup, final,..."
3,ham,u dun say early hor u c already say,"[u, dun, say, early, hor, u, c, already, say]"
4,ham,nah dont think goes usf lives around though,"[nah, dont, think, goes, usf, lives, around, t..."


Lemmatization and Stemming: lemmatization (with POS tags) reduces each word to its dictionary form, and stemming then strips the remaining suffixes, so that the text is standardized

In [45]:
#Lemmatize the tokens with POS tagging
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()


#Apply Lemmatization and Stemming
def process_tokens(tokens):
    #POS Tagging
    pos_tagged_tokens = pos_tag(tokens)  # Tag each token with its POS
    
    #Lemmatization with PartOfSpeech tagging
    lemmatized_tokens = [
        lemmatizer.lemmatize(token, get_wordnet_pos(pos)) for token, pos in pos_tagged_tokens
    ]
    
    #Stemming
    stemmed_tokens = [stemmer.stem(token) for token in lemmatized_tokens]
    
    return stemmed_tokens

sms['processed_tokens'] = sms['tokens'].apply(process_tokens)
sms.head()

Unnamed: 0,label,message,tokens,processed_tokens
0,ham,go jurong point crazy available bugis n great ...,"[go, jurong, point, crazy, available, bugis, n...","[go, jurong, point, crazi, avail, bugi, n, gre..."
1,ham,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]"
2,spam,free entry wkly comp win fa cup final tkts st ...,"[free, entry, wkly, comp, win, fa, cup, final,...","[free, entri, wkli, comp, win, fa, cup, final,..."
3,ham,u dun say early hor u c already say,"[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, earli, hor, u, c, alreadi, say]"
4,ham,nah dont think goes usf lives around though,"[nah, dont, think, goes, usf, lives, around, t...","[nah, dont, think, go, usf, life, around, though]"


N-Gram Modeling

Function to generate the n-grams of a token list for a given n; longer n-grams capture more of the context around each word

In [46]:
def ngram_generator(tokens, n):
    return list(ngrams(tokens, n))
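
For readers without NLTK at hand, the sliding window behind this function can be sketched in plain Python. The helper name simple_ngrams is hypothetical; it is meant to return the same tuples as the function above for a list of tokens, and an empty list when the message is shorter than n.

```python
def simple_ngrams(tokens, n):
    # Zip n shifted copies of the token list: position i pairs
    # tokens[i], tokens[i+1], ..., tokens[i+n-1].
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["ok", "lar", "joke", "wif"]
print(simple_ngrams(tokens, 1))  # [('ok',), ('lar',), ('joke',), ('wif',)]
print(simple_ngrams(tokens, 2))  # [('ok', 'lar'), ('lar', 'joke'), ('joke', 'wif')]
print(simple_ngrams(tokens, 5))  # [] because the message has fewer than 5 tokens
```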

Generate N-Grams (N = 1 to 5)

In [47]:
for i in range(1, 6):
    column_name = f'ngrams_{i}'
    sms[column_name] = sms['processed_tokens'].apply(lambda x: ngram_generator(x, i))
sms.head()

Unnamed: 0,label,message,tokens,processed_tokens,ngrams_1,ngrams_2,ngrams_3,ngrams_4,ngrams_5
0,ham,go jurong point crazy available bugis n great ...,"[go, jurong, point, crazy, available, bugis, n...","[go, jurong, point, crazi, avail, bugi, n, gre...","[(go,), (jurong,), (point,), (crazi,), (avail,...","[(go, jurong), (jurong, point), (point, crazi)...","[(go, jurong, point), (jurong, point, crazi), ...","[(go, jurong, point, crazi), (jurong, point, c...","[(go, jurong, point, crazi, avail), (jurong, p..."
1,ham,ok lar joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]","[(ok,), (lar,), (joke,), (wif,), (u,), (oni,)]","[(ok, lar), (lar, joke), (joke, wif), (wif, u)...","[(ok, lar, joke), (lar, joke, wif), (joke, wif...","[(ok, lar, joke, wif), (lar, joke, wif, u), (j...","[(ok, lar, joke, wif, u), (lar, joke, wif, u, ..."
2,spam,free entry wkly comp win fa cup final tkts st ...,"[free, entry, wkly, comp, win, fa, cup, final,...","[free, entri, wkli, comp, win, fa, cup, final,...","[(free,), (entri,), (wkli,), (comp,), (win,), ...","[(free, entri), (entri, wkli), (wkli, comp), (...","[(free, entri, wkli), (entri, wkli, comp), (wk...","[(free, entri, wkli, comp), (entri, wkli, comp...","[(free, entri, wkli, comp, win), (entri, wkli,..."
3,ham,u dun say early hor u c already say,"[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, earli, hor, u, c, alreadi, say]","[(u,), (dun,), (say,), (earli,), (hor,), (u,),...","[(u, dun), (dun, say), (say, earli), (earli, h...","[(u, dun, say), (dun, say, earli), (say, earli...","[(u, dun, say, earli), (dun, say, earli, hor),...","[(u, dun, say, earli, hor), (dun, say, earli, ..."
4,ham,nah dont think goes usf lives around though,"[nah, dont, think, goes, usf, lives, around, t...","[nah, dont, think, go, usf, life, around, though]","[(nah,), (dont,), (think,), (go,), (usf,), (li...","[(nah, dont), (dont, think), (think, go), (go,...","[(nah, dont, think), (dont, think, go), (think...","[(nah, dont, think, go), (dont, think, go, usf...","[(nah, dont, think, go, usf), (dont, think, go..."


Count N-Grams

In [48]:
#count spam messages n-grams occurrences
spam_ngram_count = {}
spam_sms = sms[sms['label'] == 'spam']
for i in range(1, 6):
    spam_ngram_count[i] = Counter([ngram for ngrams_list in spam_sms[f'ngrams_{i}'] for ngram in ngrams_list])

#count human messages n-grams occurrences
ham_ngram_count = {}
ham_sms = sms[sms['label'] == 'ham']
for i in range(1, 6):
    ham_ngram_count[i] = Counter([ngram for ngrams_list in ham_sms[f'ngrams_{i}'] for ngram in ngrams_list])



Most Common Phrases in Spam Messages

In [49]:
#display the 10 most common n-grams for each level from 1 to 5
for i in range(1, 6):
    top_ngrams = spam_ngram_count[i].most_common(10)
    print(f"\nTop 10 {i}-grams:", top_ngrams)


Top 10 1-grams: [(('call',), 369), (('free',), 219), (('txt',), 163), (('u',), 163), (('ur',), 144), (('text',), 139), (('mobil',), 136), (('stop',), 118), (('claim',), 115), (('repli',), 110)]

Top 10 2-grams: [(('pleas', 'call'), 46), (('po', 'box'), 31), (('tri', 'contact'), 28), (('custom', 'servic'), 27), (('p', 'per'), 25), (('contact', 'u'), 24), (('guarante', 'call'), 23), (('call', 'landlin'), 23), (('prize', 'guarante'), 22), (('await', 'collect'), 22)]

Top 10 3-grams: [(('prize', 'guarante', 'call'), 21), (('call', 'land', 'line'), 18), (('call', 'custom', 'servic'), 16), (('privat', 'account', 'statement'), 16), (('tri', 'contact', 'u'), 16), (('call', 'identifi', 'code'), 15), (('guarante', 'call', 'land'), 15), (('call', 'p', 'per'), 15), (('identifi', 'code', 'expir'), 14), (('land', 'line', 'claim'), 14)]

Top 10 4-grams: [(('prize', 'guarante', 'call', 'land'), 15), (('guarante', 'call', 'land', 'line'), 15), (('call', 'identifi', 'code', 'expir'), 14), (('call', 'la

Most Common Phrases in Organic Messages

In [50]:
#display the 10 most common n-grams for each level from 1 to 5
for i in range(1, 6):
    topham_ngrams = ham_ngram_count[i].most_common(10)
    print(f"\nTop 10 {i}-grams:", topham_ngrams)


Top 10 1-grams: [(('u',), 1056), (('get',), 609), (('go',), 521), (('im',), 464), (('come',), 321), (('call',), 289), (('dont',), 276), (('ltgt',), 276), (('ok',), 273), (('know',), 256)]

Top 10 2-grams: [(('gon', 'na'), 58), (('call', 'later'), 52), (('ill', 'call'), 48), (('let', 'know'), 41), (('sorri', 'ill'), 39), (('r', 'u'), 37), (('u', 'r'), 37), (('dont', 'know'), 33), (('u', 'get'), 33), (('good', 'morn'), 31)]

Top 10 3-grams: [(('ill', 'call', 'later'), 42), (('sorri', 'ill', 'call'), 38), (('im', 'gon', 'na'), 20), (('happi', 'new', 'year'), 19), (('pl', 'send', 'messag'), 13), (('cant', 'pick', 'phone'), 12), (('pick', 'phone', 'right'), 12), (('phone', 'right', 'pl'), 12), (('right', 'pl', 'send'), 12), (('hi', 'hi', 'hi'), 11)]

Top 10 4-grams: [(('sorri', 'ill', 'call', 'later'), 38), (('cant', 'pick', 'phone', 'right'), 12), (('pick', 'phone', 'right', 'pl'), 12), (('phone', 'right', 'pl', 'send'), 12), (('right', 'pl', 'send', 'messag'), 12), (('ill', 'call', 'late

Analysis:

1. Analysis of Unigrams:
    The most common spam unigrams are dominated by calls to action (call, txt, text, repli, claim, free), whereas the most common ham unigrams are everyday conversational words (u, get, go, im, come). The word call appears in both lists, but claim and free appear only in the spam list.

2. Analysis of Bigrams:
    The spam bigrams are about contacting the recipient (pleas call, tri contact, custom servic) and claiming prizes (prize guarante, await collect). The ham bigrams are about everyday communication and delayed replies (call later, ill call, sorri ill).

3. Analysis of Trigrams:
    With trigrams the difference between ham and spam becomes clearer. Spam messages are oriented towards claiming prizes, guarantees and calls. Ham messages show a clear abundance of phrases about delayed communication.

4. Analysis beyond Trigrams:
    Beyond trigrams, the themes remain essentially the same for both ham and spam messages, except that birthday congratulations also appear in ham.
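
The comparison above is based on raw counts. Since the ham counts are much larger overall (for example, u occurs 1056 times in ham against 163 times in spam), a relative frequency can make the two classes easier to compare. The sketch below uses invented toy counts, not the dataset, and the helper name relative_freq is an assumption.

```python
from collections import Counter

# Toy counts, invented for illustration only (not taken from the dataset).
spam_counts = Counter({("call",): 30, ("free",): 20, ("u",): 10})
ham_counts = Counter({("call",): 30, ("u",): 120, ("go",): 50})


def relative_freq(counts):
    # Share of each n-gram among all n-grams of the same class.
    total = sum(counts.values())
    return {ngram: count / total for ngram, count in counts.items()}


spam_rel = relative_freq(spam_counts)
ham_rel = relative_freq(ham_counts)

# Print spam share and ham share side by side for every n-gram seen in either class.
for ngram in sorted(set(spam_counts) | set(ham_counts)):
    print(ngram, round(spam_rel.get(ngram, 0.0), 3), round(ham_rel.get(ngram, 0.0), 3))
```

In this toy data, call has the same raw count in both classes, yet it makes up half of the spam n-grams and only 15% of the ham n-grams.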

Question 1 (Urdu)

File Reading/Data Frame Creation

In [51]:
with open("urdu.txt", encoding="UTF-8-SIG", errors='ignore') as file:
    lines = file.readlines()
    
lines = [line.strip() for line in lines[1:]]  # skip the header row; column names are set explicitly below
urdu = pd.DataFrame([line.split(',') for line in lines], columns=['Index',
 'Headline',
 'News Text',
 'Category',
 'Date',
 'URL',
 'Source',
 'News length',
 'x'])
urdu = urdu.drop(columns=['Index', 'x'])
urdu.head()

Unnamed: 0,Headline,News Text,Category,Date,URL,Source,News length
0,عالمی بینک عسکریت پسندی سے متاثرہ خاندانوں کی ...,اسلام باد عالمی بینک خیبرپختونخوا کے قبائلی اض...,Business & Economics,2020-12-06,https://www.dawnnews.tv/news/1148499/,Dawn News,1854
1,مالی سال 2020 ریٹرن فائل کرنے والوں کی تعداد م...,اسلام باد فیڈرل بورڈ ریونیو ایف بی نے دسمبر کی...,Business & Economics,2020-12-06,https://www.dawnnews.tv/news/1148498/,Dawn News,2016
2,جاپان کو سندھ کے خصوصی اقتصادی زون میں سرمایہ ...,اسلام باد بورڈ انویسٹمنٹ بی او ئی کے چیئرمین ع...,Business & Economics,2020-12-05,https://www.dawnnews.tv/news/1148433/,Dawn News,2195
3,برامدات 767 فیصد بڑھ کر ارب 16 کروڑ ڈالر سے زائد,اسلام اباد پاکستان میں ماہ نومبر میں مسلسل تیس...,Business & Economics,2020-12-05,https://www.dawnnews.tv/news/1148430/,Dawn News,2349
4,کے الیکٹرک کو اضافی بجلی گیس کی فراہمی کے قانو...,اسلام باد نیشنل ٹرانسمیشن اینڈ ڈسپیچ کمپنی این...,Business & Economics,2020-12-05,https://www.dawnnews.tv/news/1148421/,Dawn News,2655


Normalization

In [52]:
urdu['Headline'] = urdu['Headline'].str.replace(r'[^\w\s]', '', regex=True) #remove punctuation
urdu['Headline'] = urdu['Headline'].str.replace(r'\d+', '', regex=True) #remove numbers
urdu['News Text'] = urdu['News Text'].str.replace(r'[^\w\s]', '', regex=True) #remove punctuation
urdu['News Text'] = urdu['News Text'].str.replace(r'\d+', '', regex=True) #remove numbers
urdu['Category'] = urdu['Category'].str.replace(r'[^\w\s]', '', regex=True) #remove punctuation
urdu['Category'] = urdu['Category'].str.lower() #convert to lowercase
urdu.head()

Unnamed: 0,Headline,News Text,Category,Date,URL,Source,News length
0,عالمی بینک عسکریت پسندی سے متاثرہ خاندانوں کی ...,اسلام باد عالمی بینک خیبرپختونخوا کے قبائلی اض...,business economics,2020-12-06,https://www.dawnnews.tv/news/1148499/,Dawn News,1854
1,مالی سال ریٹرن فائل کرنے والوں کی تعداد میں ...,اسلام باد فیڈرل بورڈ ریونیو ایف بی نے دسمبر کی...,business economics,2020-12-06,https://www.dawnnews.tv/news/1148498/,Dawn News,2016
2,جاپان کو سندھ کے خصوصی اقتصادی زون میں سرمایہ ...,اسلام باد بورڈ انویسٹمنٹ بی او ئی کے چیئرمین ع...,business economics,2020-12-05,https://www.dawnnews.tv/news/1148433/,Dawn News,2195
3,برامدات فیصد بڑھ کر ارب کروڑ ڈالر سے زائد,اسلام اباد پاکستان میں ماہ نومبر میں مسلسل تیس...,business economics,2020-12-05,https://www.dawnnews.tv/news/1148430/,Dawn News,2349
4,کے الیکٹرک کو اضافی بجلی گیس کی فراہمی کے قانو...,اسلام باد نیشنل ٹرانسمیشن اینڈ ڈسپیچ کمپنی این...,business economics,2020-12-05,https://www.dawnnews.tv/news/1148421/,Dawn News,2655
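
The same two regular expressions work on Urdu script because Python's \w and \d are Unicode-aware for str patterns: Urdu letters count as word characters, and digits are matched in any script. The sketch below shows this on a made-up headline. The function name clean and the final whitespace collapse are additions for illustration and are not part of the cell above.

```python
import re


def clean(text):
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation, including the Urdu comma
    text = re.sub(r"\d+", "", text)      # drop digits in any script
    return " ".join(text.split())        # collapse leftover whitespace (extra step, not in the notebook)


# Made-up headline containing Western digits, an Urdu comma and Eastern Arabic-Indic digits.
print(clean("مالی سال 2020، ریٹرن ۲۰۲۰!"))
```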


Stopword Removal

In [53]:
#keep only the first 5 rows to reduce computation time; copy so later column assignments do not act on a slice
urdu = urdu.head(5).copy()


#read file named stopwords-ur.txt and use the words in it as stopwords
with open('stopwords-ur.txt', 'r', encoding='utf-8') as file:
    stopwords_ur = set(file.read().splitlines())  # a set gives fast membership tests
    
urdu['Headline'] = urdu['Headline'].apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords_ur]))
urdu['News Text'] = urdu['News Text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stopwords_ur]))
urdu.head()

Unnamed: 0,Headline,News Text,Category,Date,URL,Source,News length
0,عالمی بینک عسکریت پسندی سے متاثرہ خاندانوں معا...,اسلام باد عالمی بینک خیبرپختونخوا قبائلی اضلاع...,business economics,2020-12-06,https://www.dawnnews.tv/news/1148499/,Dawn News,1854
1,مالی سال ریٹرن فائل کرنے والوں تعداد میں فیصد کمی,اسلام باد فیڈرل بورڈ ریونیو ایف بی نے دسمبر خر...,business economics,2020-12-06,https://www.dawnnews.tv/news/1148498/,Dawn News,2016
2,جاپان کو سندھ خصوصی اقتصادی زون میں سرمایہ کار...,اسلام باد بورڈ انویسٹمنٹ بی او ئی چیئرمین عاطف...,business economics,2020-12-05,https://www.dawnnews.tv/news/1148433/,Dawn News,2195
3,برامدات فیصد بڑھ کر ارب کروڑ ڈالر سے زائد,اسلام اباد پاکستان میں ماہ نومبر میں مسلسل تیس...,business economics,2020-12-05,https://www.dawnnews.tv/news/1148430/,Dawn News,2349
4,الیکٹرک کو اضافی بجلی گیس فراہمی قانونی تقاضے ...,اسلام باد نیشنل ٹرانسمیشن اینڈ ڈسپیچ کمپنی این...,business economics,2020-12-05,https://www.dawnnews.tv/news/1148421/,Dawn News,2655


Tokenization

In [54]:
urdu['tokens'] = urdu['Headline'].apply(word_tokenize)
urdu.head()

Unnamed: 0,Headline,News Text,Category,Date,URL,Source,News length,tokens
0,عالمی بینک عسکریت پسندی سے متاثرہ خاندانوں معا...,اسلام باد عالمی بینک خیبرپختونخوا قبائلی اضلاع...,business economics,2020-12-06,https://www.dawnnews.tv/news/1148499/,Dawn News,1854,"[عالمی, بینک, عسکریت, پسندی, سے, متاثرہ, خاندا..."
1,مالی سال ریٹرن فائل کرنے والوں تعداد میں فیصد کمی,اسلام باد فیڈرل بورڈ ریونیو ایف بی نے دسمبر خر...,business economics,2020-12-06,https://www.dawnnews.tv/news/1148498/,Dawn News,2016,"[مالی, سال, ریٹرن, فائل, کرنے, والوں, تعداد, م..."
2,جاپان کو سندھ خصوصی اقتصادی زون میں سرمایہ کار...,اسلام باد بورڈ انویسٹمنٹ بی او ئی چیئرمین عاطف...,business economics,2020-12-05,https://www.dawnnews.tv/news/1148433/,Dawn News,2195,"[جاپان, کو, سندھ, خصوصی, اقتصادی, زون, میں, سر..."
3,برامدات فیصد بڑھ کر ارب کروڑ ڈالر سے زائد,اسلام اباد پاکستان میں ماہ نومبر میں مسلسل تیس...,business economics,2020-12-05,https://www.dawnnews.tv/news/1148430/,Dawn News,2349,"[برامدات, فیصد, بڑھ, کر, ارب, کروڑ, ڈالر, سے, ..."
4,الیکٹرک کو اضافی بجلی گیس فراہمی قانونی تقاضے ...,اسلام باد نیشنل ٹرانسمیشن اینڈ ڈسپیچ کمپنی این...,business economics,2020-12-05,https://www.dawnnews.tv/news/1148421/,Dawn News,2655,"[الیکٹرک, کو, اضافی, بجلی, گیس, فراہمی, قانونی..."


Stemming and Lemmatization

In [55]:
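#Lemmatize the tokens with POS tagging
#Note: pos_tag, WordNetLemmatizer and PorterStemmer are English-language tools, so the Urdu tokens
#appear to pass through unchanged (processed_tokens matches tokens in the output below)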
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()


#Apply Lemmatization and Stemming
def process_tokens(tokens):
    #POS Tagging
    pos_tagged_tokens = pos_tag(tokens)  # Tag each token with its POS
    
    #Lemmatization with PartOfSpeech tagging
    lemmatized_tokens = [
        lemmatizer.lemmatize(token, get_wordnet_pos(pos)) for token, pos in pos_tagged_tokens
    ]
    
    #Stemming
    stemmed_tokens = [stemmer.stem(token) for token in lemmatized_tokens]
    
    return stemmed_tokens

urdu['processed_tokens'] = urdu['tokens'].apply(process_tokens)
urdu.head()

Unnamed: 0,Headline,News Text,Category,Date,URL,Source,News length,tokens,processed_tokens
0,عالمی بینک عسکریت پسندی سے متاثرہ خاندانوں معا...,اسلام باد عالمی بینک خیبرپختونخوا قبائلی اضلاع...,business economics,2020-12-06,https://www.dawnnews.tv/news/1148499/,Dawn News,1854,"[عالمی, بینک, عسکریت, پسندی, سے, متاثرہ, خاندا...","[عالمی, بینک, عسکریت, پسندی, سے, متاثرہ, خاندا..."
1,مالی سال ریٹرن فائل کرنے والوں تعداد میں فیصد کمی,اسلام باد فیڈرل بورڈ ریونیو ایف بی نے دسمبر خر...,business economics,2020-12-06,https://www.dawnnews.tv/news/1148498/,Dawn News,2016,"[مالی, سال, ریٹرن, فائل, کرنے, والوں, تعداد, م...","[مالی, سال, ریٹرن, فائل, کرنے, والوں, تعداد, م..."
2,جاپان کو سندھ خصوصی اقتصادی زون میں سرمایہ کار...,اسلام باد بورڈ انویسٹمنٹ بی او ئی چیئرمین عاطف...,business economics,2020-12-05,https://www.dawnnews.tv/news/1148433/,Dawn News,2195,"[جاپان, کو, سندھ, خصوصی, اقتصادی, زون, میں, سر...","[جاپان, کو, سندھ, خصوصی, اقتصادی, زون, میں, سر..."
3,برامدات فیصد بڑھ کر ارب کروڑ ڈالر سے زائد,اسلام اباد پاکستان میں ماہ نومبر میں مسلسل تیس...,business economics,2020-12-05,https://www.dawnnews.tv/news/1148430/,Dawn News,2349,"[برامدات, فیصد, بڑھ, کر, ارب, کروڑ, ڈالر, سے, ...","[برامدات, فیصد, بڑھ, کر, ارب, کروڑ, ڈالر, سے, ..."
4,الیکٹرک کو اضافی بجلی گیس فراہمی قانونی تقاضے ...,اسلام باد نیشنل ٹرانسمیشن اینڈ ڈسپیچ کمپنی این...,business economics,2020-12-05,https://www.dawnnews.tv/news/1148421/,Dawn News,2655,"[الیکٹرک, کو, اضافی, بجلی, گیس, فراہمی, قانونی...","[الیکٹرک, کو, اضافی, بجلی, گیس, فراہمی, قانونی..."


N-Gram Generator

In [56]:
for i in range(1, 6):
    column_name = f'ngrams_{i}'
    urdu[column_name] = urdu['processed_tokens'].apply(lambda x: ngram_generator(x, i))
urdu.head()

Unnamed: 0,Headline,News Text,Category,Date,URL,Source,News length,tokens,processed_tokens,ngrams_1,ngrams_2,ngrams_3,ngrams_4,ngrams_5
0,عالمی بینک عسکریت پسندی سے متاثرہ خاندانوں معا...,اسلام باد عالمی بینک خیبرپختونخوا قبائلی اضلاع...,business economics,2020-12-06,https://www.dawnnews.tv/news/1148499/,Dawn News,1854,"[عالمی, بینک, عسکریت, پسندی, سے, متاثرہ, خاندا...","[عالمی, بینک, عسکریت, پسندی, سے, متاثرہ, خاندا...","[(عالمی,), (بینک,), (عسکریت,), (پسندی,), (سے,)...","[(عالمی, بینک), (بینک, عسکریت), (عسکریت, پسندی...","[(عالمی, بینک, عسکریت), (بینک, عسکریت, پسندی),...","[(عالمی, بینک, عسکریت, پسندی), (بینک, عسکریت, ...","[(عالمی, بینک, عسکریت, پسندی, سے), (بینک, عسکر..."
1,مالی سال ریٹرن فائل کرنے والوں تعداد میں فیصد کمی,اسلام باد فیڈرل بورڈ ریونیو ایف بی نے دسمبر خر...,business economics,2020-12-06,https://www.dawnnews.tv/news/1148498/,Dawn News,2016,"[مالی, سال, ریٹرن, فائل, کرنے, والوں, تعداد, م...","[مالی, سال, ریٹرن, فائل, کرنے, والوں, تعداد, م...","[(مالی,), (سال,), (ریٹرن,), (فائل,), (کرنے,), ...","[(مالی, سال), (سال, ریٹرن), (ریٹرن, فائل), (فا...","[(مالی, سال, ریٹرن), (سال, ریٹرن, فائل), (ریٹر...","[(مالی, سال, ریٹرن, فائل), (سال, ریٹرن, فائل, ...","[(مالی, سال, ریٹرن, فائل, کرنے), (سال, ریٹرن, ..."
2,جاپان کو سندھ خصوصی اقتصادی زون میں سرمایہ کار...,اسلام باد بورڈ انویسٹمنٹ بی او ئی چیئرمین عاطف...,business economics,2020-12-05,https://www.dawnnews.tv/news/1148433/,Dawn News,2195,"[جاپان, کو, سندھ, خصوصی, اقتصادی, زون, میں, سر...","[جاپان, کو, سندھ, خصوصی, اقتصادی, زون, میں, سر...","[(جاپان,), (کو,), (سندھ,), (خصوصی,), (اقتصادی,...","[(جاپان, کو), (کو, سندھ), (سندھ, خصوصی), (خصوص...","[(جاپان, کو, سندھ), (کو, سندھ, خصوصی), (سندھ, ...","[(جاپان, کو, سندھ, خصوصی), (کو, سندھ, خصوصی, ا...","[(جاپان, کو, سندھ, خصوصی, اقتصادی), (کو, سندھ,..."
3,برامدات فیصد بڑھ کر ارب کروڑ ڈالر سے زائد,اسلام اباد پاکستان میں ماہ نومبر میں مسلسل تیس...,business economics,2020-12-05,https://www.dawnnews.tv/news/1148430/,Dawn News,2349,"[برامدات, فیصد, بڑھ, کر, ارب, کروڑ, ڈالر, سے, ...","[برامدات, فیصد, بڑھ, کر, ارب, کروڑ, ڈالر, سے, ...","[(برامدات,), (فیصد,), (بڑھ,), (کر,), (ارب,), (...","[(برامدات, فیصد), (فیصد, بڑھ), (بڑھ, کر), (کر,...","[(برامدات, فیصد, بڑھ), (فیصد, بڑھ, کر), (بڑھ, ...","[(برامدات, فیصد, بڑھ, کر), (فیصد, بڑھ, کر, ارب...","[(برامدات, فیصد, بڑھ, کر, ارب), (فیصد, بڑھ, کر..."
4,الیکٹرک کو اضافی بجلی گیس فراہمی قانونی تقاضے ...,اسلام باد نیشنل ٹرانسمیشن اینڈ ڈسپیچ کمپنی این...,business economics,2020-12-05,https://www.dawnnews.tv/news/1148421/,Dawn News,2655,"[الیکٹرک, کو, اضافی, بجلی, گیس, فراہمی, قانونی...","[الیکٹرک, کو, اضافی, بجلی, گیس, فراہمی, قانونی...","[(الیکٹرک,), (کو,), (اضافی,), (بجلی,), (گیس,),...","[(الیکٹرک, کو), (کو, اضافی), (اضافی, بجلی), (ب...","[(الیکٹرک, کو, اضافی), (کو, اضافی, بجلی), (اضا...","[(الیکٹرک, کو, اضافی, بجلی), (کو, اضافی, بجلی,...","[(الیکٹرک, کو, اضافی, بجلی, گیس), (کو, اضافی, ..."


N-Gram Counting

In [57]:
urdu_ngram_count = {}
for i in range(1, 6):
    urdu_ngram_count[i] = Counter([ngram for ngrams_list in urdu[f'ngrams_{i}'] for ngram in ngrams_list])

for i in range(1, 6):
    print(f"\nTop 5 {i}-grams in Headlines:", urdu_ngram_count[i].most_common(5))


Top 5 1-grams in Headlines: [(('سے',), 2), (('میں',), 2), (('فیصد',), 2), (('کو',), 2), (('عالمی',), 1)]

Top 5 2-grams in Headlines: [(('عالمی', 'بینک'), 1), (('بینک', 'عسکریت'), 1), (('عسکریت', 'پسندی'), 1), (('پسندی', 'سے'), 1), (('سے', 'متاثرہ'), 1)]

Top 5 3-grams in Headlines: [(('عالمی', 'بینک', 'عسکریت'), 1), (('بینک', 'عسکریت', 'پسندی'), 1), (('عسکریت', 'پسندی', 'سے'), 1), (('پسندی', 'سے', 'متاثرہ'), 1), (('سے', 'متاثرہ', 'خاندانوں'), 1)]

Top 5 4-grams in Headlines: [(('عالمی', 'بینک', 'عسکریت', 'پسندی'), 1), (('بینک', 'عسکریت', 'پسندی', 'سے'), 1), (('عسکریت', 'پسندی', 'سے', 'متاثرہ'), 1), (('پسندی', 'سے', 'متاثرہ', 'خاندانوں'), 1), (('سے', 'متاثرہ', 'خاندانوں', 'معاونت'), 1)]

Top 5 5-grams in Headlines: [(('عالمی', 'بینک', 'عسکریت', 'پسندی', 'سے'), 1), (('بینک', 'عسکریت', 'پسندی', 'سے', 'متاثرہ'), 1), (('عسکریت', 'پسندی', 'سے', 'متاثرہ', 'خاندانوں'), 1), (('پسندی', 'سے', 'متاثرہ', 'خاندانوں', 'معاونت'), 1), (('سے', 'متاثرہ', 'خاندانوں', 'معاونت', 'گا'), 1)]


Problems with Urdu Data Set

The biggest problem with the Urdu dataset was that it could not be read properly in VS Code, even with several different encodings. In the end, I read the whole file line by line and built the DataFrame manually; after that, the remaining steps were straightforward. The stopwords were taken from a separate file downloaded from GitHub, as I could not get the NLTK Urdu stopwords to work.
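
One caveat of building the rows by splitting each line on commas is that a quoted field containing a comma is cut into several columns. Assuming the file follows standard CSV quoting (which has not been verified for urdu.txt), the standard csv module handles that case. The sample text below is invented for illustration.

```python
import csv
import io

# Invented two-line sample; the headline field contains a comma inside quotes.
sample = 'Index,Headline,Category\n0,"Exports rise, imports fall",Business & Economics\n'

# Naive split: the quoted headline is broken into two pieces.
naive_rows = [line.split(",") for line in sample.splitlines()]

# csv.reader respects the quotes and keeps the headline in one field.
parsed_rows = list(csv.reader(io.StringIO(sample)))

print(len(naive_rows[1]))   # 4
print(len(parsed_rows[1]))  # 3
print(parsed_rows[1][1])    # Exports rise, imports fall
```

For a real file, the same reader can be given a file object opened with newline="" and the encoding used above.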

Question 2

Read File and convert to Data Frame

In [58]:
def load_lines(path):
    #read the file, strip whitespace from each line and create a DataFrame
    with open(path, encoding="ISO-8859-1") as file:
        lines = [line.strip() for line in file]
    df = pd.DataFrame(lines, columns=["Line"])
    df['Line'] = df['Line'].str.lower() #convert to lowercase
    df['Line'] = df['Line'].str.replace(r'[^\w\s]', '', regex=True) #remove punctuation
    df['Line'] = df['Line'].str.replace(r'\d+', '', regex=True) #remove numbers
    return df

robert = load_lines("robert.txt")
shakespeare = load_lines("shakespear.txt")
shakespeare.head()


Unnamed: 0,Line
0,over hill over dale
1,thorough bush thorough brier over park over pale
2,thorough flood thorough fire i do wander every...
3,swifter than the moons sphere and i serve the ...
4,to dew her orbs upon the green the cowslips ta...


Tokenization

In [59]:
#tokenization
robert['tokens']=robert['Line'].apply(word_tokenize)
shakespeare['tokens']=shakespeare['Line'].apply(word_tokenize)
shakespeare.head()

Unnamed: 0,Line,tokens
0,over hill over dale,"[over, hill, over, dale]"
1,thorough bush thorough brier over park over pale,"[thorough, bush, thorough, brier, over, park, ..."
2,thorough flood thorough fire i do wander every...,"[thorough, flood, thorough, fire, i, do, wande..."
3,swifter than the moons sphere and i serve the ...,"[swifter, than, the, moons, sphere, and, i, se..."
4,to dew her orbs upon the green the cowslips ta...,"[to, dew, her, orbs, upon, the, green, the, co..."


Unigrams

In [60]:
#unigrams of robert and shakespeare
robert['unigrams'] = robert['tokens'].apply(lambda x: ngram_generator(x, 1))
shakespeare['unigrams'] = shakespeare['tokens'].apply(lambda x: ngram_generator(x, 1))


Bigrams

In [61]:
#bigrams of shakespeare and robert
robert['bigrams'] = robert['tokens'].apply(lambda x: ngram_generator(x, 2))
shakespeare['bigrams'] = shakespeare['tokens'].apply(lambda x: ngram_generator(x, 2))


Trigrams

In [62]:
#trigrams of shakespeare and robert
robert['trigrams'] = robert['tokens'].apply(lambda x: ngram_generator(x, 3))
shakespeare['trigrams'] = shakespeare['tokens'].apply(lambda x: ngram_generator(x, 3))


In [63]:
robert.head()

Unnamed: 0,Line,tokens,unigrams,bigrams,trigrams
0,a boundless moment,"[a, boundless, moment]","[(a,), (boundless,), (moment,)]","[(a, boundless), (boundless, moment)]","[(a, boundless, moment)]"
1,he halted in the wind and what was that,"[he, halted, in, the, wind, and, what, was, that]","[(he,), (halted,), (in,), (the,), (wind,), (an...","[(he, halted), (halted, in), (in, the), (the, ...","[(he, halted, in), (halted, in, the), (in, the..."
2,far in the maples pale but not a ghost,"[far, in, the, maples, pale, but, not, a, ghost]","[(far,), (in,), (the,), (maples,), (pale,), (b...","[(far, in), (in, the), (the, maples), (maples,...","[(far, in, the), (in, the, maples), (the, mapl..."
3,he stood there bringing march against his thought,"[he, stood, there, bringing, march, against, h...","[(he,), (stood,), (there,), (bringing,), (marc...","[(he, stood), (stood, there), (there, bringing...","[(he, stood, there), (stood, there, bringing),..."
4,and yet too ready to believe the most,"[and, yet, too, ready, to, believe, the, most]","[(and,), (yet,), (too,), (ready,), (to,), (bel...","[(and, yet), (yet, too), (too, ready), (ready,...","[(and, yet, too), (yet, too, ready), (too, rea..."


In [64]:
shakespeare.head()

Unnamed: 0,Line,tokens,unigrams,bigrams,trigrams
0,over hill over dale,"[over, hill, over, dale]","[(over,), (hill,), (over,), (dale,)]","[(over, hill), (hill, over), (over, dale)]","[(over, hill, over), (hill, over, dale)]"
1,thorough bush thorough brier over park over pale,"[thorough, bush, thorough, brier, over, park, ...","[(thorough,), (bush,), (thorough,), (brier,), ...","[(thorough, bush), (bush, thorough), (thorough...","[(thorough, bush, thorough), (bush, thorough, ..."
2,thorough flood thorough fire i do wander every...,"[thorough, flood, thorough, fire, i, do, wande...","[(thorough,), (flood,), (thorough,), (fire,), ...","[(thorough, flood), (flood, thorough), (thorou...","[(thorough, flood, thorough), (flood, thorough..."
3,swifter than the moons sphere and i serve the ...,"[swifter, than, the, moons, sphere, and, i, se...","[(swifter,), (than,), (the,), (moons,), (spher...","[(swifter, than), (than, the), (the, moons), (...","[(swifter, than, the), (than, the, moons), (th..."
4,to dew her orbs upon the green the cowslips ta...,"[to, dew, her, orbs, upon, the, green, the, co...","[(to,), (dew,), (her,), (orbs,), (upon,), (the...","[(to, dew), (dew, her), (her, orbs), (orbs, up...","[(to, dew, her), (dew, her, orbs), (her, orbs,..."


Vocabulary Creation

In [65]:
robert_unigrams = [item for sublist in robert['unigrams'] for item in sublist]
robert_bigrams = [item for sublist in robert['bigrams'] for item in sublist]
robert_trigrams = [item for sublist in robert['trigrams'] for item in sublist]

shakespeare_unigrams = [item for sublist in shakespeare['unigrams'] for item in sublist]
shakespeare_bigrams = [item for sublist in shakespeare['bigrams'] for item in sublist]
shakespeare_trigrams = [item for sublist in shakespeare['trigrams'] for item in sublist]


Poem Generation

In [66]:
from nltk.probability import ConditionalFreqDist

#function to create Conditional Frequency Distribution (CFD) for a list of n-grams
def create_cfd(ngrams_list):
    return ConditionalFreqDist((ngram[:-1], ngram[-1]) for ngram in ngrams_list)


#create CFD models for Robert Frost
robert_unigram_cfd = create_cfd(robert_unigrams)
robert_bigram_cfd = create_cfd(robert_bigrams)
robert_trigram_cfd = create_cfd(robert_trigrams)

#create CFD models for William Shakespeare
shakespeare_unigram_cfd = create_cfd(shakespeare_unigrams)
shakespeare_bigram_cfd = create_cfd(shakespeare_bigrams)
shakespeare_trigram_cfd = create_cfd(shakespeare_trigrams)


1. Start with a randomly selected word.
2. Use the trigram model to predict the next word (the most frequent continuation of the last two words) whenever possible.
3. Fall back to the bigram and then the unigram model if the trigram prediction isn’t available.

In [67]:
import random

#function to generate a line based on n-gram models
def generate_line(start_word, unigram_cfd, bigram_cfd, trigram_cfd, min_len=7, max_len=10):
    line = [start_word]
    
    while len(line) < max_len:
        #try trigram
        if len(line) >= 2 and (line[-2], line[-1]) in trigram_cfd:
            next_word = trigram_cfd[(line[-2], line[-1])].max()
        #else bigram
        elif len(line) >= 1 and (line[-1],) in bigram_cfd:
            next_word = bigram_cfd[(line[-1],)].max()
        #else unigram
        else:
            next_word = unigram_cfd[()].max()

        line.append(next_word)

        #once min_len is reached, stop with probability 0.3 after each word (max_len is a hard cap)
        if len(line) >= min_len and random.random() > 0.7:
            break
    
    return ' '.join(line)

#function to generate a poem with multiple stanzas
def generate_poem(vocab, unigram_cfd, bigram_cfd, trigram_cfd, stanzas=3, lines_per_stanza=4):
    poem = []
    for _ in range(stanzas):
        stanza = []
        for _ in range(lines_per_stanza):
            start_word = random.choice(vocab)  # Randomly select starting word
            line = generate_line(start_word, unigram_cfd, bigram_cfd, trigram_cfd)
            stanza.append(line)
        poem.append('\n'.join(stanza) + '\n')
    return '\n'.join(poem)
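
Because the generator above always takes the most frequent continuation, a line can fall into a loop, as in "whiter and whiter and whiter" in the sample output. A possible variant, sketched here with only the standard library, keeps the same trigram, bigram, unigram backoff but samples the next word in proportion to its count. The names build_model and sample_next and the toy token list are assumptions for illustration; this is not the notebook's CFD-based code.

```python
import random
from collections import Counter, defaultdict


def build_model(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model


def sample_next(history, models, rng):
    """Back off from the longest context to the unigram model, sampling by frequency."""
    for n in sorted(models, reverse=True):
        context = tuple(history[-(n - 1):]) if n > 1 else ()
        if len(context) == n - 1 and context in models[n]:
            counts = models[n][context]
            words = list(counts)
            return rng.choices(words, weights=[counts[w] for w in words])[0]
    raise ValueError("unigram model is empty")


# Toy corpus taken from the first Shakespeare lines shown earlier.
tokens = "over hill over dale over park over pale".split()
models = {n: build_model(tokens, n) for n in (1, 2, 3)}

rng = random.Random(0)  # fixed seed so that a run is repeatable
line = ["over"]
while len(line) < 6:
    line.append(sample_next(line, models, rng))
print(" ".join(line))
```

Sampling trades some local coherence for variety; whether that reads better for these poems would need to be checked on the real data.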


In [68]:
#extract vocabulary for each poet
robert_vocab = list(set([token for sublist in robert['tokens'] for token in sublist]))
shakespeare_vocab = list(set([token for sublist in shakespeare['tokens'] for token in sublist]))

#generate Robert Frost style poem
robert_poem = generate_poem(robert_vocab, robert_unigram_cfd, robert_bigram_cfd, robert_trigram_cfd)
print("Robert Frost Style Poem:\n")
print(robert_poem)

#generate William Shakespeare style poem
shakespeare_poem = generate_poem(shakespeare_vocab, shakespeare_unigram_cfd, shakespeare_bigram_cfd, shakespeare_trigram_cfd)
print("William Shakespeare Style Poem:\n")
print(shakespeare_poem)


Robert Frost Style Poem:

preprimitives of the sun shines now no warmer than the
drew back in its body song and
whiter and whiter and whiter and whiter and
swarm still ran and scuttled just as

speck that would have been a quarry
boy bend them down to stay done hellforleather the worlds
neighbors may not have risen that so keep the
lie in stones and bushes unretrieved the worlds poetry archive

concerns the worlds poetry archive the worlds poetry archive the
swarm still ran and scuttled just as fast oh fast
frame the worlds poetry archive the worlds poetry archive the
hatchery near canada the worlds poetry archive the

William Shakespeare Style Poem:

seanymphs hourly ring his knell dingdong and in my verse
hearing die to themselves sweet roses do
endowed she gave the more which bounteous gift thou
combat doubtful that love is not so

moones sphere nbspnbspnbspand i serve the fairy queen nbspnbspnbspto
darken the day or night and weep
shrieking undistinguishd woe and moan th expense of