# Functions for frequency list generation

In [43]:
import os

path_to_corpus="exercise-5/corpus"

def traverse_directory(path):
  return [os.path.join(path, f) for f in os.listdir(path) if os.path.isfile(os.path.join(path, f))]

files = traverse_directory(path_to_corpus)
print(files[:5])

['exercise-5/corpus\\a-midsummer-nights-dream_TXT_FolgerShakespeare.txt', 'exercise-5/corpus\\alls-well-that-ends-well_TXT_FolgerShakespeare.txt', 'exercise-5/corpus\\antony-and-cleopatra_TXT_FolgerShakespeare.txt', 'exercise-5/corpus\\as-you-like-it_TXT_FolgerShakespeare.txt', 'exercise-5/corpus\\coriolanus_TXT_FolgerShakespeare.txt']


Here we list all the files in the corpus directory. For this demonstration we use Shakespeare texts stored in my local directory `corpus`. For brevity, only the first five elements of the list are printed.

(The same applies to the demonstrations that follow.)
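The order in which `os.listdir` returns entries is arbitrary, so the first five files can differ between machines. A minimal sketch of a reproducible listing follows, assuming `pathlib` is an acceptable substitute for `os.path`. The name `list_files` and the temporary demo directory are hypothetical and not part of the exercise.

```python
import tempfile
from pathlib import Path

def list_files(path):
    # Hypothetical variant of traverse_directory: keep regular files only
    # and sort the paths so the result is the same on every run.
    return sorted(str(p) for p in Path(path).iterdir() if p.is_file())

# Demo on a throwaway directory instead of the real corpus.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "b.txt").write_text("second")
    (Path(tmp) / "a.txt").write_text("first")
    (Path(tmp) / "subdir").mkdir()  # directories are skipped
    names = [Path(p).name for p in list_files(tmp)]
    print(names)
```

Sorting does not change the resulting counts, only the order in which files are read and therefore the order in which tokens first enter the dictionary.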

In [12]:
def tokenize_file(path):
  # Read the whole file and split it on whitespace.
  with open(path, "r") as f:
    tokens = f.read().split()

  # Lowercase each token and strip leading/trailing punctuation.
  normalized_tokens = [token.lower().strip(",;.!?[]()=-") for token in tokens]

  # Drop tokens that consisted only of punctuation and are now empty.
  return [token for token in normalized_tokens if token]

This function takes the path to a text file, reads its contents as a single string, splits that string into tokens on whitespace, normalizes each token (lowercasing it and stripping surrounding punctuation) and returns the list of normalized tokens. It is used by `compute_counts` below.
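To make the normalization rules concrete, here is a small sketch that applies the same splitting and stripping steps to an in-memory string rather than a file. The helper `normalize_tokens` and the sample sentence are hypothetical and only serve as an illustration.

```python
def normalize_tokens(text):
    # Hypothetical helper: same rules as tokenize_file, but for a string.
    tokens = text.split()
    normalized = [t.lower().strip(",;.!?[]()=-") for t in tokens]
    # Remove tokens that became empty after stripping.
    return [t for t in normalized if t]

sample = "To be, or not to be: that is the question. [Exit] --"
print(normalize_tokens(sample))
# ['to', 'be', 'or', 'not', 'to', 'be:', 'that', 'is', 'the', 'question', 'exit']
```

Note that `be:` keeps its colon because `:` is not in the set of stripped characters, and that `strip` only removes characters at the edges of a token, which is why a form like `night's` survives intact.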

In [48]:
def compute_counts(pathlist):
  counts = {}
  for path in pathlist:
    tokens = tokenize_file(path)
    for token in tokens:
      if token in counts:
        counts[token] = counts[token] + 1
      else:
        counts[token] = 1

  return counts

counts = compute_counts(files)

print(list(counts.items())[:5])

[('a', 15612), ('midsummer', 4), ("night's", 29), ('dream', 126), ('by', 4147)]


Here we see how the `compute_counts` function uses the `tokenize_file` function to build a dictionary with the normalized tokens as keys and their respective counts as values.
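The same counting pattern is available in the standard library as `collections.Counter`. The following is a minimal sketch of that alternative, not the approach used in this exercise. The function `count_tokens` and the toy token lists are hypothetical.

```python
from collections import Counter

def count_tokens(token_lists):
    # Hypothetical alternative to compute_counts: takes token lists
    # directly (one per document) instead of file paths.
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)  # adds 1 for every occurrence
    return counts

demo = count_tokens([["the", "king", "the"], ["king", "and", "the"]])
print(dict(demo))
```

A `Counter` is a `dict` subclass, so it can be passed to `sort_counts` unchanged.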

In [51]:
def sort_counts(counts):
  sorted_tuples = sorted(counts.items(), key=lambda item: item[1], reverse=True)
  return sorted_tuples


sorted_counts = sort_counts(counts)

print(sorted_counts[:5])

[('the', 29236), ('and', 28282), ('to', 21909), ('i', 21130), ('of', 18432)]


`sort_counts` creates a sorted list in which each key-value pair is stored as a tuple. The token with the highest count appears first, and the remaining tokens follow in descending order of count.
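Because `sorted` is stable, tokens with equal counts stay in the order in which they were first inserted into the dictionary. If a deterministic order for ties is wanted, one option is a compound sort key, sketched below. The function name `sort_counts_with_ties` and the toy dictionary are hypothetical.

```python
def sort_counts_with_ties(counts):
    # Sort by descending count first, then alphabetically by token.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

demo_counts = {"you": 2, "and": 3, "my": 2, "the": 3}
print(sort_counts_with_ties(demo_counts))
# [('and', 3), ('the', 3), ('my', 2), ('you', 2)]
```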


# Results

Now that all functions are defined and the corpus has been tokenized, counted and sorted, we can print the 100 most frequent words in the Shakespeare corpus using a final function:


In [166]:
def write_frequencies(sorted_tuples, number):
  # Total number of tokens in the corpus, used as the denominator.
  total = sum(count for _, count in sorted_tuples)

  for rank, (token, count) in enumerate(sorted_tuples[:number], start=1):
    # Relative frequency as a percentage of all tokens.
    frequency = float(count) / total * 100

    # Pad one-letter words with a space so the columns stay aligned.
    if len(token) > 1:
        print("Rank %3d\t|\tWord: %s\t|\tCount: %s\t|\tFrequency: %5.3f%%\n" % (rank, token, count, frequency))
    else:
        print("Rank %3d\t|\tWord: %s \t|\tCount: %s\t|\tFrequency: %5.3f%%\n" % (rank, token, count, frequency))


write_frequencies(sorted_counts, 100)

Rank   1	|	Word: the	|	Count: 29236	|	Frequency: 3.042%

Rank   2	|	Word: and	|	Count: 28282	|	Frequency: 2.943%

Rank   3	|	Word: to	|	Count: 21909	|	Frequency: 2.280%

Rank   4	|	Word: i 	|	Count: 21130	|	Frequency: 2.199%

Rank   5	|	Word: of	|	Count: 18432	|	Frequency: 1.918%

Rank   6	|	Word: a 	|	Count: 15612	|	Frequency: 1.624%

Rank   7	|	Word: you	|	Count: 14548	|	Frequency: 1.514%

Rank   8	|	Word: my	|	Count: 13055	|	Frequency: 1.358%

Rank   9	|	Word: in	|	Count: 11916	|	Frequency: 1.240%

Rank  10	|	Word: that	|	Count: 11693	|	Frequency: 1.217%

Rank  11	|	Word: is	|	Count: 9851	|	Frequency: 1.025%

Rank  12	|	Word: not	|	Count: 8965	|	Frequency: 0.933%

Rank  13	|	Word: with	|	Count: 8611	|	Frequency: 0.896%

Rank  14	|	Word: me	|	Count: 8121	|	Frequency: 0.845%

Rank  15	|	Word: for	|	Count: 8089	|	Frequency: 0.842%

Rank  16	|	Word: it	|	Count: 8069	|	Frequency: 0.840%

Rank  17	|	Word: he	|	Count: 7928	|	Frequency: 0.825%

Rank  18	|	Word: his	|	Count: 7640	|	Frequency