# Most frequently used words in Shakespeare's works

#### The goal of this exercise is to find the 100 words most frequently used by the Bard.


### First, we need to access the works in the corpus.
We do this by generating a list of the paths of all the files in the directory.

In [1]:
import os
import string

corpus_path = "../exercise-5/corpus/"
file_paths = [os.path.join(corpus_path, f) for f in os.listdir(corpus_path) if os.path.isfile(os.path.join(corpus_path, f))]

### Now we need to find and normalize all tokens.
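To do that, we read every work, split it on whitespace, lowercase each token, and strip punctuation from both ends. All tokens from all works are collected in a single flat list.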

In [2]:
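normalized_tokens = []
for path in file_paths:
    with open(path, "r", encoding="utf-8") as f:
        tokens = f.read().split()
    # Lowercase each token and strip punctuation from both ends.
    # Tokens that consist only of punctuation become empty and are skipped,
    # which avoids repeatedly scanning the whole list to remove '' afterwards.
    for token in tokens:
        normalized = token.lower().strip(string.punctuation)
        if normalized:
            normalized_tokens.append(normalized)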
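
To make the effect of this normalization concrete, here is a minimal sketch applied to a few made-up tokens. The helper name `normalize` and the sample tokens are illustrative assumptions and are not part of the notebook.

```python
import string

def normalize(token):
    """Lowercase a token and strip punctuation from both ends only."""
    return token.lower().strip(string.punctuation)

# Hypothetical sample tokens, chosen to show the edge cases.
samples = ["Hark!", "King's,", "--", "o'er"]
normalized = [normalize(t) for t in samples]
print(normalized)  # ['hark', "king's", '', "o'er"]
```

Punctuation inside a token (such as the apostrophe in `king's`) survives, and a token made up only of punctuation becomes an empty string, which is why empty strings have to be discarded.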

### Next, we need to sort the tokens by frequency.
For this, we create a dictionary that maps every token to the number of times it occurs and then sort the entries by count in descending order.

In [3]:
counts = {}
for token in normalized_tokens:
    counts[token] = counts.get(token, 0) + 1
    
sorted_token_frequencies = sorted(counts.items(), key=lambda item: item[1], reverse=True)
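
The same counting and sorting could also be done with `collections.Counter` from the standard library. The following is a small sketch on a made-up token list, not the code used to produce the results in this notebook; `sample_tokens` and `sample_counts` are illustrative names.

```python
from collections import Counter

# Hypothetical sample input standing in for the normalized token list.
sample_tokens = ["to", "be", "or", "not", "to", "be"]

# Counter builds the token -> count mapping in one step.
sample_counts = Counter(sample_tokens)

# most_common(n) returns the n most frequent (token, count) pairs,
# sorted by count in descending order.
top_two = sample_counts.most_common(2)
print(top_two)  # [('to', 2), ('be', 2)]
```

Tokens with equal counts are returned in the order in which they were first encountered.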

### Et voilà, here we have a list of the 100 words most frequently used by Shakespeare: 
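To establish the rank and relative frequency of a token, we iterate over the sorted list of (token, count) pairs. The relative frequency is the token's count divided by the total number of tokens in the corpus.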

In [4]:
# Total number of tokens in the corpus, used to turn counts into relative frequencies.
sum_tokens = sum(count for _, count in sorted_token_frequencies)

# Print rank, token, absolute count and relative frequency for the top 100 tokens.
for rank, (token, count) in enumerate(sorted_token_frequencies[:100], start=1):
    print(f"{rank}.\t{token}\t count: {count}\t frequency: {count / sum_tokens}\n")

1.	the	 count: 29311	 frequency: 0.030497283326847723

2.	and	 count: 28303	 frequency: 0.02944848725733585

3.	to	 count: 21930	 frequency: 0.02281755734563033

4.	i	 count: 21599	 frequency: 0.02247316101724895

5.	of	 count: 18434	 frequency: 0.01918006621565661

6.	a	 count: 15666	 frequency: 0.01630003891366369

7.	you	 count: 14599	 frequency: 0.015189854978972055

8.	my	 count: 13080	 frequency: 0.013609377568665969

9.	in	 count: 11929	 frequency: 0.01241179396151501

10.	that	 count: 11706	 frequency: 0.012179768640581333

11.	is	 count: 9874	 frequency: 0.010273623403135151

12.	not	 count: 8982	 frequency: 0.009345522119400438

13.	with	 count: 8616	 frequency: 0.008964709260827675

14.	me	 count: 8170	 frequency: 0.008500658618960318

15.	for	 count: 8106	 frequency: 0.008434068392324644

16.	it	 count: 8099	 frequency: 0.008426785086286368

17.	he	 count: 7946	 frequency: 0.008267592825735458

18.	his	 count: 7649	 frequency: 0.00795857255525428

19.	be	 count: 7293	 frequ

### Some limitations:

* We did not manage to filter out contractions completely (see rank 78), as `strip` only removes punctuation at the beginning and the end of a string.

* Rank 99 (*th*) is not a word either.

* Finally, as we did not filter out pronouns, articles, etc., the list says more about the most frequently used words in the English language in general than about Shakespeare specifically.
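
One possible way to address the last limitation is to drop function words before counting. The sketch below assumes a tiny hand-picked stop-word set (here simply the ten top-ranked tokens from the output above) and a made-up token list; `STOP_WORDS` and `content_word_counts` are hypothetical names, and a real analysis would need a much more complete list.

```python
from collections import Counter

# Assumption: a minimal stop-word set taken from ranks 1-10 of the output above.
STOP_WORDS = {"the", "and", "to", "i", "of", "a", "you", "my", "in", "that"}

def content_word_counts(tokens, stop_words=STOP_WORDS):
    """Count only the tokens that are not in the stop-word set."""
    return Counter(token for token in tokens if token not in stop_words)

# Hypothetical sample input standing in for the normalized token list.
sample = ["the", "king", "and", "the", "queen", "of", "the", "king"]
filtered = content_word_counts(sample)
print(filtered.most_common(2))  # [('king', 2), ('queen', 1)]
```

With the stop words removed, the ranking is driven by content words rather than by articles and pronouns.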