## Shakespeare's favorite words
Let's find out what Shakespeare's favorite (most frequently used) words were by looking at a corpus of his texts. At the end, a CSV file will be created to store the result.

### Access the files of the corpus
Create a list with the paths of all files in the corpus.

In [29]:
import os

path = "../exercise-5/corpus"
path_corpus = [os.path.join(path, f) for f in os.listdir(path) if os.path.isfile(os.path.join(path, f))]
print(f"The corpus contains {len(path_corpus)} files.")

The corpus contains 42 files.


### Find and normalize tokens
Create a list that contains lists of tokens, one for each text.

In [30]:
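all_normalized_tokens = []

for file_path in path_corpus:
    # The context manager closes each file after reading.
    with open(file_path) as f:
        content = f.read()
    tokens = content.split()
    normalized_tokens = [token.lower().strip(",.!?[]()=-") for token in tokens]
    # Build a new list instead of calling remove() inside a loop:
    # removing items from a list while iterating over it skips elements.
    normalized_tokens = [token for token in normalized_tokens if token]
    all_normalized_tokens.append(normalized_tokens)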


print(f"Tokens for {len(all_normalized_tokens)} texts are normalized.")

Tokens for 42 texts are normalized.
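
To make the normalization step easier to follow, here is a minimal sketch on a made-up sample sentence (not taken from the corpus). `str.strip` treats its argument as a set of characters and removes them only from the two ends of a token, so punctuation inside a word, such as the hyphen in "well-known", survives. The names `PUNCTUATION`, `normalize`, and `sample` are introduced here for illustration only.

```python
# Characters stripped from both ends of each token (the same set as in the
# cell above; repeating a character in the argument does not change the result).
PUNCTUATION = ",.!?[]()=-"


def normalize(text):
    """Lowercase whitespace-separated tokens and strip surrounding punctuation."""
    tokens = (token.lower().strip(PUNCTUATION) for token in text.split())
    # Drop tokens that consisted only of punctuation and are now empty.
    return [token for token in tokens if token]


# Made-up sample sentence, used only to show the behavior.
sample = "To be, or not to be -- that is the question! (well-known)"
print(normalize(sample))
```

The standalone "--" is stripped down to an empty string and then filtered out, while "(well-known)" keeps its inner hyphen.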


### Create a dictionary with counts per token

In [31]:
counts = {}
for token_list in all_normalized_tokens:
    for token in token_list:
        counts[token] = counts.get(token, 0) + 1

print(f"Counted {len(counts)} different tokens.")

Counted 35749 different tokens.
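
As an aside, the same counting pattern can be written with `collections.Counter` from the standard library. The sketch below uses a small, hypothetical stand-in for `all_normalized_tokens` so that it runs on its own; it is an alternative formulation, not the code that produced the numbers above.

```python
from collections import Counter

# Hypothetical stand-in for all_normalized_tokens: one token list per text.
example_token_lists = [
    ["to", "be", "or", "not", "to", "be"],
    ["to", "thine", "own", "self", "be", "true"],
]

example_counts = Counter()
for token_list in example_token_lists:
    # update() adds one count for every token in the list.
    example_counts.update(token_list)

print(f"Counted {len(example_counts)} different tokens.")
print(example_counts.most_common(3))
```

A `Counter` behaves like a dictionary, so `example_counts.items()` could be sorted exactly like `counts.items()`; `most_common()` is a shortcut for that sort.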


### Sort tokens by frequency

In [32]:
token_frequencies = sorted(counts.items(), key=lambda item: item[1], reverse=True)

print(f"The {len(token_frequencies)} tokens are now ordered by frequency, the five most frequent words are:\n")
for token, count in token_frequencies[:5]:
    print(token, count, sep=":\t")

The 35749 tokens are now ordered by frequency, the five most frequent words are:

the:	29236
and:	28282
to:	21904
i:	21122
of:	18427


### Store the frequencies in a CSV file
The file is stored in the current working directory. Each row contains the rank, the token, its absolute count, and its relative frequency.

In [33]:
output_file = "shakespeares_favorites.csv"

# Total number of tokens, needed for the relative frequency below.
sum_tokens = sum(tokencount for _, tokencount in token_frequencies)

with open(output_file, "w") as f:
    # enumerate() supplies the rank, starting at 1 for the most frequent token.
    for rank, (token, tokencount) in enumerate(token_frequencies, start=1):
        f.write(f"{rank},{token},{tokencount},{tokencount/sum_tokens}\n")

print(f"Created an output file called '{output_file}' where the frequencies are listed.\n\nFormat of the output file:\n")
with open(output_file, "r") as f:
    content = f.readlines()
for line in content[:5]:
    print(line)

Created an output file called 'shakespeares_favorites.csv' where the frequencies are listed.

Format of the output file:

1,the,29236,0.03041915295415173

2,and,28282,0.029426545486705407

3,to,21904,0.022790433927614567

4,i,21122,0.021976787135640746

5,of,18427,0.01917272306355705
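
One caveat of writing rows by hand is that a token containing a comma would break the column layout. A possible alternative, sketched below, is the standard-library `csv` module, which quotes such fields automatically. The sample data, the header names, and the in-memory buffer (used instead of a real file so the example leaves nothing on disk) are assumptions made for this illustration.

```python
import csv
import io

# Hypothetical sample in the same shape as token_frequencies.
sample_frequencies = [("the", 3), ("and", 2), ("o,ho", 1)]
total = sum(count for _, count in sample_frequencies)

# An in-memory buffer stands in for the output file here.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")

# Header row; these column names are illustrative, not part of the original file.
writer.writerow(["rank", "token", "count", "relative_frequency"])
for rank, (token, count) in enumerate(sample_frequencies, start=1):
    writer.writerow([rank, token, count, count / total])

print(buffer.getvalue())
```

When writing to a real file instead, the `csv` documentation recommends opening it with `newline=""` so that line endings are handled by the writer.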



### ... so what are Shakespeare's favorite words then?
His favorite words are stopwords like "the", "and", "to", and "I", which is no surprise, as these are among the most frequently used words in the English language. To find his "real" favorite words, one would need to remove these stopwords from the analysis.
(I was curious what the result might be, so I did a quick demonstration. I already had the stopword list from another course.)

In [34]:
# Read one stopword per line; a set makes the membership test below fast.
with open("stopwords.txt") as f:
    stopwords = {word.strip() for word in f}

# Keep only the (token, count) pairs whose token is not a stopword.
no_stopwords = [(word, count) for (word, count) in token_frequencies if word not in stopwords]

print(f"""Now, we have only {len(no_stopwords)} instead of {len(token_frequencies)} tokens left.\n
These are the 'new' most frequent words:\n""")
for (word, count) in no_stopwords[:15]:
    print(word, count, sep='\t')
    


Now, we have only 35594 instead of 35749 tokens left.

These are the 'new' most frequent words:

thou	5864
will	5233
thy	4329
shall	3837
thee	3309
lord	3078
now	2970
good	2876
king	2826
sir	2765
come	2527
enter	2506
o	2249
let	2207
love	2199


As we can see, the first few tokens are still partly function words, because Shakespeare used archaic forms such as "thou", "thy", and "thee" that the stopword list does not cover.
Further down the list, however, more and more content words appear:

In [35]:
for (word, count) in no_stopwords[15:35]:
    print(word, count, sep='\t')

well	2192
hath	2049
man	1947
one	1917
like	1905
upon	1879
may	1774
make	1770
know	1754
go	1754
say	1725
yet	1708
us	1704
must	1624
see	1537
'tis	1489
give	1407
th'	1365
can	1299
first	1295


### However, ...
This is still mostly a list of high-frequency words that could come from any author's work, not just Shakespeare's. So all in all, we now know that Shakespeare used many common words, but we still don't know much about his personal favorites. Finding those would require a different, more complex approach.
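
One possible direction, shown here only as a rough sketch, is to compare each token's relative frequency in the author's texts with its relative frequency in some reference corpus and to rank tokens by the ratio. Everything in the example is an assumption made for illustration: the function names, the smoothing constant, and the toy counts, which are invented and not derived from any real corpus.

```python
def relative_frequencies(counts):
    """Convert absolute counts into shares of the total token count."""
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


def frequency_ratios(target_counts, reference_counts, smoothing=1e-6):
    """Rank tokens by how much more frequent they are in the target corpus."""
    target = relative_frequencies(target_counts)
    reference = relative_frequencies(reference_counts)
    ratios = {
        # The small smoothing term avoids division by zero for tokens
        # that never occur in the reference corpus.
        token: freq / (reference.get(token, 0.0) + smoothing)
        for token, freq in target.items()
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Invented toy counts, not real corpus data.
author_counts = {"the": 50, "love": 30, "king": 20}
reference_counts = {"the": 60, "love": 10, "king": 5, "house": 25}

for token, ratio in frequency_ratios(author_counts, reference_counts):
    print(token, round(ratio, 2), sep="\t")
```

In this toy data "the" is the most frequent token in absolute terms but ends up last, because it is just as common in the reference counts; tokens that are unusually frequent for the author rise to the top.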