## CS431/631 Data Intensive Distributed Analytics
### Winter 2022 - Assignment 0
---

**Please edit this (text) cell to provide your name and UW student ID number!**
* **Name:** k79wu
* **ID:** 20775633

For this assignment, you will be using Python to do some analysis of the text of Shakespeare's plays.   Run the next block to download the text file (`Shakespeare.txt`) and the Python tokenizer module (`simple_tokenize.py`).

In [None]:
!wget -q https://student.cs.uwaterloo.ca/~cs451/content/cs431/Shakespeare.txt
!wget -q https://student.cs.uwaterloo.ca/~cs451/content/cs431/simple_tokenize.py

---


Let's first try running some simple Python code to make sure that everything is set up properly and ready to go.   The code in the next box will open `Shakespeare.txt`, read the file line by line, and tokenize each line that it reads in.    Try running the code by selecting the box and typing `Shift-Return`, i.e., hit the carriage return key while you are holding the shift key.   It may take a few seconds to run.

In [None]:
# import the simple_tokenize function from the simple_tokenize.py file downloaded above
from simple_tokenize import simple_tokenize

# Now, let's tokenize Shakespeare's plays
filename = 'Shakespeare.txt'
with open(filename) as f:
    for line in f:
        # tokenize, one line at a time
        t = simple_tokenize(line)
t

['the', 'end']

---
### Important

The questions that follow ask you to implement functions whose prototypes are given to you. **Do _not_ change the prototypes of the functions. Do _not_ write code outside of the functions.** In particular, do not use the cell that contains the function declaration to invoke the function.

You may use specific cells, identified by `# Your tests here`, for test purposes. Code in these cells will *not* be executed when marking your assignment.

---

#### Question 1  (2/10 marks):
After the code has finished running, the notebook will display the resulting output immediately after the cell.   In this case, you should see the output `['the', 'end']`. In the next cell, briefly explain why the code produced this output.


#### Your answer to Question 1:
The program reads the file one line at a time, tokenizes each line, and saves the tokenized list to `t`. Because `t` is overwritten on every iteration, after the loop it holds only the tokens of the last line. On that last line, neither 'THE' nor 'END' is followed by an apostrophe, so each is simply converted to lower case and put into a list, which is `['the', 'end']`.
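
A minimal sketch of why only the last line's tokens survive the loop. It uses a made-up two-line input and a hypothetical `toy_tokenize` helper (lowercase, then split on whitespace) standing in for `simple_tokenize`, whose real behaviour may differ:

```python
# Hypothetical stand-in for simple_tokenize (assumption: lowercase + whitespace split).
def toy_tokenize(line):
    return line.lower().split()

# Made-up input; only the final line mirrors the end of Shakespeare.txt.
lines = ["To be or not to be\n", "THE END\n"]

for line in lines:
    # t is rebound on every iteration, so earlier results are discarded
    t = toy_tokenize(line)

print(t)  # only the tokens of the last line remain
```

To keep the tokens of every line instead, the loop would have to accumulate them, for example by extending a list defined before the loop.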





Now it is time for you to write some code.   Let's find the most frequently appearing tokens in Shakespeare's work.

#### Question 2 (4/10 marks):
In the next box, implement the function `top_50_tokens` using Python code that returns the list of the 50 most frequently appearing tokens and their frequencies, i.e., the number of times that each occurs.   Please use the `simple_tokenize` function, without modification, to tokenize the text, so that everyone is working with the same definition of what a token is.   If you wish, feel free to start with the Python code in the box above - just copy it from there and paste it below.

In [None]:
from simple_tokenize import simple_tokenize

# Takes a file name and returns a list of [token, frequency] pairs for the
# 50 most frequently appearing tokens, most frequent first
# top_50_tokens: Str -> (listof (listof Str Int))
def top_50_tokens(filename='Shakespeare.txt'):
    # dictionary mapping each token to the number of times it occurs
    freq = {}

    # open the file
    with open(filename) as f:
        for line in f:
            # tokenize, one line at a time
            t = simple_tokenize(line)

            # count each token, starting from 0 the first time it is seen
            for word in t:
                freq[word] = freq.get(word, 0) + 1

    # sort the tokens by descending frequency and keep the first 50
    sorted_keys = sorted(freq, key=lambda k: freq[k], reverse=True)[:50]

    # pair each of those tokens with its frequency
    top_50_tokens_list = [[key, freq[key]] for key in sorted_keys]

    return top_50_tokens_list


In [None]:
# Your tests here
top_50_tokens(filename='Shakespeare.txt')

[['the', 27378],
 ['and', 26082],
 ['i', 20717],
 ['to', 19661],
 ['of', 17473],
 ['a', 14723],
 ['you', 13630],
 ['my', 12490],
 ['in', 10996],
 ['that', 10915],
 ['is', 9137],
 ['not', 8512],
 ['with', 7778],
 ['me', 7777],
 ['it', 7692],
 ['for', 7578],
 ['be', 6867],
 ['his', 6859],
 ['your', 6657],
 ['this', 6606],
 ['but', 6277],
 ['he', 6260],
 ['have', 5885],
 ['as', 5744],
 ['thou', 5491],
 ['him', 5205],
 ['so', 5056],
 ['will', 4977],
 ['what', 4469],
 ['thy', 4034],
 ['all', 3923],
 ['her', 3850],
 ['no', 3797],
 ['by', 3768],
 ['do', 3753],
 ['shall', 3592],
 ['if', 3500],
 ['are', 3405],
 ['we', 3298],
 ['thee', 3180],
 ['on', 3062],
 ['lord', 3062],
 ['our', 3061],
 ['king', 2871],
 ['good', 2834],
 ['now', 2789],
 ['sir', 2763],
 ['from', 2640],
 ['o', 2621],
 ['come', 2519]]

Be sure to test the code that you have written by running it.   When you submit your notebook to us, we will run your code when we mark your assignment.   As a sanity test on your output, our reference implementation finds that the most frequent word is "the", which occurs 27378 times.
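
One way to test the counting logic without the full text is to run it on a tiny input whose answer can be checked by hand. The sketch below does this with `collections.Counter` from the standard library. It is an illustration, not the reference implementation: `toy_tokenize`, `top_n_tokens`, and the sample lines are all hypothetical, and the tokenizer is a simple lowercase-and-split stand-in for `simple_tokenize`.

```python
from collections import Counter

# Hypothetical stand-in for simple_tokenize (assumption: lowercase + whitespace split).
def toy_tokenize(line):
    return line.lower().split()

# Hypothetical helper: return the n most frequent tokens as [token, count] pairs.
def top_n_tokens(lines, n):
    counts = Counter()
    for line in lines:
        # Counter.update adds one to the count of every token in the list
        counts.update(toy_tokenize(line))
    # most_common(n) returns (token, count) tuples, most frequent first
    return [[token, count] for token, count in counts.most_common(n)]

# Made-up sample: 'the' occurs 4 times and 'king' 3 times.
sample = ["the king and the queen", "the king", "king and the fool"]
top = top_n_tokens(sample, 2)
print(top)
```

Here the two most frequent tokens are `the` (4 occurrences) and `king` (3), which is easy to confirm by counting the sample by hand.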

---

Once you have found the 50 most frequent tokens, let's move on to something slightly more complicated.

#### Question 3 (4/10 marks):

Instead of the most frequent tokens appearing in Shakespeare's works, suppose that we want a list of words that appear after the word "perfect", on the same line, in Shakespeare's text. 
(Note: the "words" we are interested in for this question are tokens, as returned by `simple_tokenize`.)

For example, *All's Well That Ends Well* includes the line
>  Ere I can perfect mine intents, to kneel.

so "mine" should be part of the output, since it follows "perfect" on this line.  To keep the output from getting too long, include only words that appear after "perfect" on more than one line.

In the next box, implement the function `perfect_x` that returns a dictionary of key/value pairs, where the keys are the words that follow "perfect" on more than one line, and the values are the number of lines on which the pattern is observed. For instance, if 'x' follows 'perfect' on 3 different lines, the entry in the dictionary will be `'x': 3`. As a sanity check on your output, our reference implementation finds 5 such words.
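
The per-line counting rule can be sketched on a small example. The code below is only an illustration under stated assumptions: `toy_tokenize` is a hypothetical regex-based stand-in for `simple_tokenize`, `followers_per_line` is a made-up helper name, and every sample line except the first (quoted above) is invented for the demonstration.

```python
import re
from collections import Counter

# Hypothetical stand-in for simple_tokenize (assumption: lowercase runs of
# letters and apostrophes count as tokens).
def toy_tokenize(line):
    return re.findall(r"[a-z']+", line.lower())

# Hypothetical helper: count, per word, the number of lines on which it
# immediately follows `target`, keeping only words seen on more than one line.
def followers_per_line(lines, target="perfect"):
    counts = Counter()
    for line in lines:
        tokens = toy_tokenize(line)
        # zip pairs each token with its successor; a target at the end of the
        # line has no successor and so contributes nothing
        followers = {nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target}
        # a set is used so a word counts at most once per line
        counts.update(followers)
    return {word: n for word, n in counts.items() if n > 1}

sample = [
    "Ere I can perfect mine intents, to kneel.",  # 'mine' follows once
    "A perfect love, a perfect love indeed",      # 'love' twice, same line
    "So perfect love is rare",                    # 'love' on a second line
    "perfect",                                    # nothing follows
]
result = followers_per_line(sample)
print(result)
```

In this sample "love" follows "perfect" three times but on only two lines, so its value is 2, while "mine" appears on a single line and is filtered out.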

In [16]:
from simple_tokenize import simple_tokenize

# Takes a file name and returns a dictionary of the words that follow 'perfect'
# on more than one line, mapped to the number of lines on which they do so
# perfect_x: Str -> Dict(Str, Int)
def perfect_x(filename = 'Shakespeare.txt'):
    # dictionary mapping each follower to the number of lines it appears on
    perfect_x_dict = {}

    # open the file
    with open(filename) as f:
        for line in f:
            # tokenize, one line at a time
            t = simple_tokenize(line)

            # get a list of indexes of 'perfect' in this line
            perfect_indexes = [i for i, token in enumerate(t) if token == 'perfect']

            # get the distinct tokens immediately after 'perfect'; a set is used
            # so that each follower counts at most once per line, and a
            # 'perfect' at the end of the line is skipped
            followers = {t[i + 1] for i in perfect_indexes if i != len(t) - 1}

            # count this line once for each distinct follower
            for follower in followers:
                perfect_x_dict[follower] = perfect_x_dict.get(follower, 0) + 1

    # keep only the followers that appear on more than one line
    perfect_x_dict = {k: v for k, v in perfect_x_dict.items() if v > 1}

    return perfect_x_dict


In [15]:
# Your tests here
perfect_x('Shakespeare.txt')

{'honour': 2, 'in': 4, 'love': 4, 'that': 2, 'yellow': 2}

When you are finished and ready to submit your assignment, download your .ipynb notebook file from Colab (File > Download .ipynb) to your machine, and then follow the submission instructions in the assignment.