I. Lab Activity <br>
a) Obtain a book from the Project Gutenberg website. <br>
b) Find the top 3 most frequent words in the book. <br>
c) Record how many times each of these words was used. <br>
d) Search with regular expressions for other forms of each frequent word,
individually. <br>
e) Record how many times each form of the frequent words was used. <br>
f) Create three normalized versions: <br>
i. PorterStemmer <br>
ii. LancasterStemmer <br>
iii. WordNetLemmatizer <br>
g) Record how many times the frequent words were used in each version. <br>
h) Tabulate the frequencies of all the results. <br>

# Code Walkthrough

## Task A.


In [None]:
from urllib.request import urlopen
# The book is entitled Macbeth by William Shakespeare
url = "https://www.gutenberg.org/files/1533/1533-0.txt"
raw = urlopen(url).read().decode('utf-8')
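
The stopword list in Task B filters out 'project' and 'gutenberg', which presumably come from the license text surrounding the play. As a hedged alternative, the sketch below trims that boilerplate before tokenizing. It assumes the file contains the `*** START OF` and `*** END OF` marker lines that Project Gutenberg plain-text files typically carry; `strip_gutenberg_boilerplate` and the sample string are hypothetical and not part of the lab code.

```python
import re

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines, if both exist."""
    start = re.search(r"^\*\*\* ?START OF.*$", text, flags=re.MULTILINE)
    end = re.search(r"^\*\*\* ?END OF.*$", text, flags=re.MULTILINE)
    if start is None or end is None or end.start() < start.end():
        return text  # markers not found: leave the text unchanged
    return text[start.end():end.start()].strip()

# Made-up sample standing in for the downloaded file
sample = (
    "License header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK MACBETH ***\n"
    "ACT I\nWhen shall we three meet again?\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK MACBETH ***\n"
    "License footer\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

In the notebook this would be applied as `raw = strip_gutenberg_boilerplate(raw)` right after the download, if the markers are present.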

In [None]:
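import nltk
from nltk.tokenize import RegexpTokenizer
nltk.download('punkt')
# Keep only runs of word characters; punctuation is dropped
tokenizer = RegexpTokenizer(r'\w+')
# Case-preserving tokens, used in all later tasks
tokenized_book = tokenizer.tokenize(raw)
# Lowercased tokens (not used in the later tasks)
raw = raw.lower()
clean_text = tokenizer.tokenize(raw)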

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
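
To see what the `\w+` pattern does to the text, the sketch below applies `re.findall` from the standard library to a made-up line. This should match what `RegexpTokenizer(r'\w+')` returns with its default settings, but treat that equivalence as an assumption. Note that the apostrophe splits a contraction into two tokens and that case is preserved unless the text is lowercased first.

```python
import re

# Made-up line, not taken from the book
line = "Enter LADY MACBETH. Thou believ'd it, lady!"

tokens = re.findall(r"\w+", line)           # case-preserving tokens
lowered = re.findall(r"\w+", line.lower())  # lowercased tokens

print(tokens)
print(lowered)
```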


## Task B.

In [None]:
# Removing Stopwords
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
# Also drop character names and Project Gutenberg boilerplate words
newStopWords = ['_','gutenberg','macbeth','macduff','project','banquo']
stop_words.extend(newStopWords)

# Compare in lowercase so that capitalized stopwords are removed as well
filtered_books = [w for w in tokenized_book if w.lower() not in stop_words]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**I explicitly removed the character names from the most-frequent-word results because I want common words that I can search for with regex in Task D. For example, if the most frequent word is 'play', a regex can also find 'playing', but that does not work for character names (such as Macbeth and Macduff).**
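
The filtering idea can be sketched on a toy token list. The mini stopword set below is hypothetical and far smaller than NLTK's English list; a `set` is used here only because membership tests on a set are cheaper than on a list.

```python
# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
base_stop_words = {"the", "and", "of", "to"}
custom_stop_words = {"macbeth", "macduff", "banquo", "project", "gutenberg", "_"}
stop_set = base_stop_words | custom_stop_words

# Made-up tokens with mixed capitalization
tokens = ["Enter", "MACBETH", "and", "BANQUO", "So", "foul", "and", "fair", "a", "day"]

# Lowercase each token only for the comparison; the original spelling is kept
filtered = [w for w in tokens if w.lower() not in stop_set]
print(filtered)
```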

In [None]:
freq = nltk.FreqDist(filtered_books)
top3_words = freq.most_common(3)

def Extract(lst):
    # Keep only the word from each (word, count) pair
    return [item[0] for item in lst]

print(Extract(top3_words))

['LADY', 'Enter', 'thou']


## Task C.

In [None]:
top3_words

[('LADY', 80), ('Enter', 71), ('thou', 66)]

## Task D.

In [None]:
import re
# Case-insensitive match on the whole token to collect capitalization variants
for_lady = [w for w in set(tokenized_book) if re.fullmatch(r'lady', w, re.IGNORECASE)]
for_lady

['LADY', 'lady', 'Lady']

In [None]:
# Case-insensitive match on the whole token
for_thou = [w for w in set(tokenized_book) if re.fullmatch(r'thou', w, re.IGNORECASE)]
for_thou

['thou', 'Thou']

In [None]:
# Case-insensitive match on the whole token
for_enter = [w for w in set(tokenized_book) if re.fullmatch(r'enter', w, re.IGNORECASE)]
for_enter

['Enter', 'enter']
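
The patterns above only pick up capitalization variants. Task D also asks about other forms of each word, so a pattern with optional suffixes may be closer to the intent. The sketch below runs on a made-up vocabulary, not on the book, and the suffix lists are assumptions chosen for illustration.

```python
import re

# Hypothetical vocabulary; in the notebook this would be set(tokenized_book).
vocabulary = ["Enter", "enter", "Enters", "entered", "entering", "enterprise",
              "LADY", "Lady", "ladies", "lady", "thou", "Thou", "though"]

# One pattern per frequent word; suffix alternatives are illustrative guesses
patterns = {
    "enter": r"enter(s|ed|ing)?",
    "lady": r"lad(y|ies)",
    "thou": r"thou",
}

# fullmatch requires the whole token to fit the pattern, ignoring case
forms = {
    word: sorted(w for w in vocabulary if re.fullmatch(pat, w, flags=re.IGNORECASE))
    for word, pat in patterns.items()
}
for word, found in forms.items():
    print(word, found)
```

Because `fullmatch` anchors both ends, unrelated words such as 'enterprise' and 'though' are not picked up.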

In [None]:
all_frequent_words = [for_lady,for_thou,for_enter]

In [None]:
# Flatten the list of lists
from collections.abc import Iterable
def flatten(lis):
    for item in lis:
        # Recurse into nested iterables, but treat strings as single items
        if isinstance(item, Iterable) and not isinstance(item, str):
            yield from flatten(item)
        else:
            yield item

all_frequent_words = list(flatten(all_frequent_words))
all_frequent_words

['LADY', 'lady', 'Lady', 'thou', 'Thou', 'Enter', 'enter']
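
Since the lists here are nested only one level deep, a shorter alternative is possible. The sketch below uses `itertools.chain.from_iterable` on a copy of the three lists shown above; it is an alternative to the recursive generator, not the code that produced the output.

```python
from itertools import chain

# The three per-word lists from Task D, written out literally
nested = [["LADY", "lady", "Lady"], ["thou", "Thou"], ["Enter", "enter"]]

# chain.from_iterable concatenates the inner lists (one level only)
flat = list(chain.from_iterable(nested))
print(flat)
```

The recursive version remains the better fit if the nesting depth is unknown.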

## Task E.

In [None]:
from collections import Counter
# Case-sensitive counts over every token in the book
wordcounts = Counter(tokenized_book)
freq = [wordcounts[word] for word in all_frequent_words]
freq

[80, 3, 14, 66, 28, 71, 1]

In [None]:
import pandas as pd
all_frequent_words_frequency = pd.DataFrame(
    {'Word': all_frequent_words,
     'Frequency': freq
    })


In [None]:
all_frequent_words_frequency

| | Word | Frequency |
|---|---|---|
| 0 | LADY | 80 |
| 1 | lady | 3 |
| 2 | Lady | 14 |
| 3 | thou | 66 |
| 4 | Thou | 28 |
| 5 | Enter | 71 |
| 6 | enter | 1 |
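
The table above counts each capitalization separately. To get one total per word, the counts can be summed after lowercasing the keys. The sketch below copies the counts from the table rather than recomputing them from the book.

```python
from collections import Counter

# Counts copied from the table above (case-sensitive token counts)
case_sensitive = {"LADY": 80, "lady": 3, "Lady": 14,
                  "thou": 66, "Thou": 28, "Enter": 71, "enter": 1}

# Merge the capitalization variants by summing under the lowercase key
totals = Counter()
for word, count in case_sensitive.items():
    totals[word.lower()] += count

print(dict(totals))
```

This gives 97 for 'lady' (80 + 3 + 14), 94 for 'thou' (66 + 28), and 72 for 'enter' (71 + 1).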


## Task F.

In [None]:
nltk.download('wordnet')
nltk.download('omw-1.4')
porter_stemmer = nltk.PorterStemmer()
lancaster_stemmer = nltk.LancasterStemmer()
wordnet_lemmatizer = nltk.WordNetLemmatizer()

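# set() keeps each distinct token once, so every list below has one entry
# per vocabulary item rather than one per occurrence in the book
ps = [porter_stemmer.stem(t) for t in set(tokenized_book)]
ls = [lancaster_stemmer.stem(t) for t in set(tokenized_book)]
wl = [wordnet_lemmatizer.lemmatize(t) for t in set(tokenized_book)]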

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [None]:
df = pd.DataFrame(
    {
     'Porter Stemmer': ps,
     'Lancaster Stemmer': ls,
     'Wordnet Lemmatizer': wl
    })
df

| | Porter Stemmer | Lancaster Stemmer | Wordnet Lemmatizer |
|---|---|---|---|
| 0 | believ | believ | believ |
| 1 | wave | wav | wave |
| 2 | vouch | vouch | vouch |
| 3 | thank | thank | Thanks |
| 4 | war | war | war |
| ... | ... | ... | ... |
| 4296 | float | flo | float |
| 4297 | enter | ent | enter |
| 4298 | remov | remov | removed |
| 4299 | chamber | chamb | chamber |
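
Because the three lists are built from `set(tokenized_book)`, each distinct token contributes once. Counting them later therefore gives the number of distinct spellings that share a normalized form, not how often the word occurs in the book. The sketch below contrasts the two on made-up tokens. `normalize` is a hypothetical stand-in (plain lowercasing) so the example runs without NLTK; in the notebook one would pass something like `porter_stemmer.stem` instead.

```python
from collections import Counter

def normalize(token):
    # Stand-in normalizer (assumption): lowercase only, no real stemming
    return token.lower()

# Made-up tokens with repeated words
tokens = ["LADY", "Lady", "lady", "LADY", "thou", "Thou", "thou", "Enter"]

# One entry per distinct token: how many spellings map to each form
type_counts = Counter(normalize(t) for t in set(tokens))
# One entry per occurrence: how often each normalized form is used
token_counts = Counter(normalize(t) for t in tokens)

print(type_counts["lady"], token_counts["lady"])
print(type_counts["thou"], token_counts["thou"])
```

If Task G is meant to report usage frequency, the second variant (normalizing every token) is probably the one to use.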


## Task G.

In [None]:
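# Porter Stemmer
# ps holds one entry per distinct token, so each count below is the number of
# distinct tokens whose stem equals the lookup word (lookups are case-sensitive)
wordcounts_ps = Counter(ps)
freq_ps = [wordcounts_ps[word] for word in all_frequent_words]

ps_frequency = pd.DataFrame(
    {'Word': all_frequent_words,
     'Frequency': freq_ps
    })

ps_frequency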

| | Word | Frequency |
|---|---|---|
| 0 | LADY | 0 |
| 1 | lady | 0 |
| 2 | Lady | 0 |
| 3 | thou | 2 |
| 4 | Thou | 0 |
| 5 | Enter | 0 |
| 6 | enter | 2 |


In [None]:
# Lancaster Stemmer
# Number of distinct tokens whose Lancaster stem equals the lookup word
wordcounts_ls = Counter(ls)
freq_ls = [wordcounts_ls[word] for word in all_frequent_words]

ls_frequency = pd.DataFrame(
    {'Word': all_frequent_words,
     'Frequency': freq_ls
    })

ls_frequency

| | Word | Frequency |
|---|---|---|
| 0 | LADY | 0 |
| 1 | lady | 3 |
| 2 | Lady | 0 |
| 3 | thou | 2 |
| 4 | Thou | 0 |
| 5 | Enter | 0 |
| 6 | enter | 0 |


In [None]:
# Wordnet Lemmatizer
# Number of distinct tokens whose lemma equals the lookup word
wordcounts_wl = Counter(wl)
freq_wl = [wordcounts_wl[word] for word in all_frequent_words]

wl_frequency = pd.DataFrame(
    {'Word': all_frequent_words,
     'Frequency': freq_wl
    })

wl_frequency

| | Word | Frequency |
|---|---|---|
| 0 | LADY | 1 |
| 1 | lady | 1 |
| 2 | Lady | 1 |
| 3 | thou | 1 |
| 4 | Thou | 1 |
| 5 | Enter | 1 |
| 6 | enter | 1 |
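
The zeros in the two stemmer tables are likely a lookup effect: both NLTK stemmers lowercase their output by default, so a capitalized key such as 'LADY' cannot appear in the stemmed list. The constant 1s in the lemmatizer table suggest the lemmatizer left these tokens unchanged, so each distinct spelling occurs exactly once in the vocabulary. One possible fix is to normalize the query word with the same function before looking it up. The sketch below illustrates this with a hypothetical stand-in `normalize` (lowercase, then strip a trailing "s") so that it runs without NLTK.

```python
from collections import Counter

def normalize(token):
    # Stand-in for a stemmer (assumption): lowercase and strip a trailing "s"
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

# Made-up tokens
tokens = ["Enter", "enter", "Enters", "LADY", "Lady", "thou"]
normalized_counts = Counter(normalize(t) for t in tokens)

queries = ["LADY", "Enter", "thou"]
# Looking up the raw spelling misses anything the normalizer changed
raw_lookup = {q: normalized_counts[q] for q in queries}
# Normalizing the query first finds the matching entry
normalized_lookup = {q: normalized_counts[normalize(q)] for q in queries}

print(raw_lookup)
print(normalized_lookup)
```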


## Task H.

In [None]:
frequency_in_all_version = pd.DataFrame(
    {'Word': all_frequent_words,
     'Original Text': freq,
     'Porter Stemmer': freq_ps,
     'Lancaster Stemmer': freq_ls,
     'Wordnet Lemmatizer': freq_wl
    })
frequency_in_all_version

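| | Word | Original Text | Porter Stemmer | Lancaster Stemmer | Wordnet Lemmatizer |
|---|---|---|---|---|---|
| 0 | LADY | 80 | 0 | 0 | 1 |
| 1 | lady | 3 | 0 | 3 | 1 |
| 2 | Lady | 14 | 0 | 0 | 1 |
| 3 | thou | 66 | 2 | 2 | 1 |
| 4 | Thou | 28 | 0 | 0 | 1 |
| 5 | Enter | 71 | 0 | 0 | 1 |
| 6 | enter | 1 | 2 | 0 | 1 |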


In [None]:
# Transposed Version
frequency_in_all_version.T

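| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| Word | LADY | lady | Lady | thou | Thou | Enter | enter |
| Original Text | 80 | 3 | 14 | 66 | 28 | 71 | 1 |
| Porter Stemmer | 0 | 0 | 0 | 2 | 0 | 0 | 2 |
| Lancaster Stemmer | 0 | 3 | 0 | 2 | 0 | 0 | 0 |
| Wordnet Lemmatizer | 1 | 1 | 1 | 1 | 1 | 1 | 1 |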
