## CS431/631 Data Intensive Distributed Analytics
### Fall 2019 - Assignment 0
---

**Please edit this (text) cell to provide your name and UW student ID number!**
* **Name:** Vyas Anirudh Akundy
* **ID:** 20765080

---
For this assignment, you will be using Python to do some analysis of the text of Shakespeare's plays.    You should already have uploaded the text file (`Shakespeare.txt`) to the hub.   You should also have downloaded the Python tokenizer module, as `simple_tokenize.py`.

Let's first try running some simple Python code to make sure that everything is set up properly and ready to go.   The code in the next box will open `Shakespeare.txt`, read the file line by line, and tokenize each line that it reads in.    Try running the code by selecting the box and typing `Shift-Return`, i.e., hit the carriage return key while you are holding the shift key.   It may take a few seconds to run.

In [1]:
# this imports the simple_tokenize function from the simple_tokenize.py file that you uploaded
from simple_tokenize import simple_tokenize

# Now, let's tokenize Shakespeare's plays
filename = 'Shakespeare.txt'
with open(filename) as f:
    for line in f:
        # tokenize, one line at a time
        t = simple_tokenize(line)
t

['the', 'end']

---
### Important

The questions that follow ask you to implement functions whose prototypes are given to you. **Do _not_ change the prototypes of the functions. Do _not_ write code outside of the functions.** In particular, do not use the same cell as the function declaration to invoke the function.

You may use specific cells, identified by `# Your tests here`, for test purposes. Code in these cells will *not* be executed when marking your assignment.

---

#### Question 1  (2/10 marks):
When the code is being executed, the cell number will change to \*, i.e., you should see `In [*]` in the left margin next to the cell.   After the code has finished running, the cell number will change to `In [1]` (indicating that this is the first code cell to be executed) and the notebook will display the resulting output immediately after the cell.   In this case, you should see the output
`['the', 'end']`.   In the next cell, briefly explain why the code produced this output.


#### Your answer to Question 1:
- The last line of the `Shakespeare.txt` file is "THE END".
- The code above reads the file line by line and passes each line to the `simple_tokenize` function. This function converts the text to lower case and returns its tokens, i.e., the words of the line, separated with the help of a regular expression.
- The variable `t` holds the tokens of the current line. Because the `for` loop runs over the entire file, the value of `t` is overwritten on every iteration.
- When the loop ends after the last line of the file, `t` still holds the tokens of that line. Since the last line is "THE END", the output is its tokenized form, `['the', 'end']`.
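
The overwriting behaviour described above can be reproduced on a small in-memory example. This is only a sketch: `toy_tokenize` is a hypothetical stand-in for `simple_tokenize` (whose exact regular expression is not shown here), and the three sample lines are made up for illustration.

```python
import io
import re

def toy_tokenize(line):
    # Hypothetical stand-in for simple_tokenize (an assumption, not the course module):
    # lowercase the line, then extract runs of letters.
    return re.findall(r"[a-z]+", line.lower())

# A three-line in-memory "file" standing in for Shakespeare.txt.
sample = io.StringIO("To be, or not to be\nThat is the question\nTHE END\n")

t = None
for line in sample:
    # t is rebound on every iteration, so the tokens of earlier lines are discarded.
    t = toy_tokenize(line)

print(t)  # ['the', 'end']
```

Only the tokens of the final line survive the loop. Keeping every line's tokens would require accumulating them, for example by appending to a list inside the loop.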





Remember that when you close and halt a notebook, any unsaved work in the notebook will be lost.   To save the contents of your notebook, use `Save and Checkpoint` (from the File menu).   

---

Now it is time for you to write some code.   Let's find the most frequently appearing tokens in Shakespeare's work.

#### Question 2 (4/10 marks):
In the next box, implement the function `top_50_tokens` using Python code that returns the list of the 50 most frequently appearing tokens and their frequencies, i.e., the number of times that each occurs.   Please use the `simple_tokenize` function, without modification, to tokenize the text, so that everyone is working with the same definition of what a token is.   If you wish, feel free to start with the Python code in the box above - just copy it from there and paste it below.

In [6]:
from simple_tokenize import simple_tokenize

def top_50_tokens(filename='Shakespeare.txt'):
    top_50_tokens_list = []
    # In this function, write Python code to find the 50 most frequent tokens in Shakespeare.txt
    # Make sure that your code is commented
    with open(filename) as f:
        temp_freq={} #temporary dictionary to store key:value pairs
        for line in f:
            tokens = simple_tokenize(line) #tokenizing each line
            for t in tokens: #storing the frequency of each word in the temp_freq dictionary
                if t in temp_freq:
                    temp_freq[t]+=1
                else:
                    temp_freq[t]=1
        #sorting the dictionary according to the values in descending order
        sorted_temp_freq = sorted(temp_freq.items(), key=lambda item: item[1], reverse=True)[0:50]
        #creating the list of lists to store the 50 most frequent tokens
        for i in sorted_temp_freq:
            top_50_tokens_list.append([i[0],i[1]])
    return top_50_tokens_list


In [7]:
# Your tests here
print(top_50_tokens())

[['the', 27378], ['and', 26082], ['i', 20717], ['to', 19661], ['of', 17473], ['a', 14723], ['you', 13630], ['my', 12490], ['in', 10996], ['that', 10915], ['is', 9137], ['not', 8512], ['with', 7778], ['me', 7777], ['it', 7692], ['for', 7578], ['be', 6867], ['his', 6859], ['your', 6657], ['this', 6606], ['but', 6277], ['he', 6260], ['have', 5885], ['as', 5744], ['thou', 5491], ['him', 5205], ['so', 5056], ['will', 4977], ['what', 4469], ['thy', 4034], ['all', 3923], ['her', 3850], ['no', 3797], ['by', 3768], ['do', 3753], ['shall', 3592], ['if', 3500], ['are', 3405], ['we', 3298], ['thee', 3180], ['on', 3062], ['lord', 3062], ['our', 3061], ['king', 2871], ['good', 2834], ['now', 2789], ['sir', 2763], ['from', 2640], ['o', 2621], ['come', 2519]]


Be sure to test the code that you have written by running it.   When you submit your notebook to us, we will run your code when we mark your assignment.   As a sanity test on your output, our reference implementation finds that the most frequent word is "the", which occurs 27378 times.
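
As a point of comparison, the same counting pattern can be sketched with `collections.Counter` from the standard library. This is an alternative under stated assumptions, not the reference implementation: `toy_tokenize` is a hypothetical stand-in for `simple_tokenize`, and `top_k_tokens` and the sample lines are illustrative names and data.

```python
from collections import Counter
import re

def toy_tokenize(line):
    # Hypothetical stand-in for simple_tokenize (an assumption, not the course module).
    return re.findall(r"[a-z]+", line.lower())

def top_k_tokens(lines, k):
    # Count every token across all lines, then keep the k most frequent.
    counts = Counter()
    for line in lines:
        counts.update(toy_tokenize(line))
    # Return [token, frequency] pairs, matching the list-of-lists shape used above.
    return [[token, freq] for token, freq in counts.most_common(k)]

sample_lines = ["the cat and the hat", "The cat"]
result = top_k_tokens(sample_lines, 2)
print(result)  # [['the', 3], ['cat', 2]]
```

`Counter.update` replaces the explicit `if t in temp_freq` branch, and `most_common(k)` replaces the sort-and-slice step.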

---

Once you have found the 50 most frequent tokens, let's move on to something slightly more complicated.

#### Question 3 (4/10 marks):

Instead of the most frequent tokens appearing in Shakespeare's works, suppose that we want a list of words that appear after the word "perfect", on the same line, in Shakespeare's text. 
(Note: the "words" we are interested in for this question are tokens, as returned by simple_tokenize.)

For example, *All's Well That Ends Well* includes the line
>  Ere I can perfect mine intents, to kneel.

so "mine" should be part of the output, since it follows "perfect" on this line.  To keep the output from getting too long, include only words that appear after "perfect" on more than one line.

In the next box, implement the function `perfect_x` that returns a dictionary of key/value pairs, where the keys are the words that follow "perfect" on more than one line, and the values are the number of lines on which the pattern is observed. For instance, if 'x' follows 'perfect' on 3 different lines, the entry in the dictionary will be `{'x': 3}`. As a sanity check on your output, our reference implementation finds 5 such words.

In [8]:
from simple_tokenize import simple_tokenize

def perfect_x(filename = 'Shakespeare.txt'):
    perfect_x_dict = {}
    # In this function, write Python code to find tokens that follow "perfect" in Shakespeare.txt 
    # Make sure that your code is commented
    with open(filename) as f:
        temp_freq={} #temporary dictionary to store key:value pairs
        for line in f:
            tokens = simple_tokenize(line) #tokenizing each line
            #check that "perfect" is on the line and is not its last token
            #(only the first occurrence of "perfect" on a line is considered)
            if "perfect" in tokens and tokens.index("perfect") != len(tokens)-1:
                next_word = tokens[tokens.index("perfect")+1] # the word that follows "perfect"
                #counting the number of lines on which this word follows "perfect"
                if next_word in temp_freq:
                    temp_freq[next_word]+=1
                else:
                    temp_freq[next_word] = 1
        #keeping only those words that follow "perfect" on more than one line 
        for k,v in temp_freq.items():
            if v > 1:
                perfect_x_dict[k] = v
    return perfect_x_dict


In [9]:
# Your tests here
print(perfect_x())

{'honour': 2, 'in': 4, 'love': 4, 'yellow': 2, 'that': 2}
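
A line can contain "perfect" more than once, and `tokens.index("perfect")` only finds the first occurrence. The sketch below shows one way to consider every occurrence while still counting each following word at most once per line. It rests on assumptions: `toy_tokenize` is a hypothetical stand-in for `simple_tokenize`, `followers_on_multiple_lines` is an illustrative name, and whether the reference implementation treats repeated occurrences this way is not stated in the assignment.

```python
import re

def toy_tokenize(line):
    # Hypothetical stand-in for simple_tokenize (an assumption, not the course module).
    return re.findall(r"[a-z]+", line.lower())

def followers_on_multiple_lines(target, lines):
    line_counts = {}
    for line in lines:
        tokens = toy_tokenize(line)
        # Pair each token with its successor; the set ensures that a follower
        # is counted at most once per line, even if the pattern repeats.
        followers = {nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target}
        for word in followers:
            line_counts[word] = line_counts.get(word, 0) + 1
    # Keep only words that follow the target on more than one line.
    return {word: n for word, n in line_counts.items() if n > 1}

sample_lines = [
    "a perfect love and a perfect day",
    "perfect love, perfect love",
    "so perfect",
]
result = followers_on_multiple_lines("perfect", sample_lines)
print(result)  # {'love': 2}
```

Here "love" follows "perfect" on two lines, "day" on only one, and the last line ends with "perfect", so it contributes nothing.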


That's it!   Don't forget to save your work before closing and halting your notebook.      

When you are finished and ready to submit your assignment, download your notebook file (.ipynb) from the hub to your machine, and then follow the submission instructions in the assignment.