# Week 1, Lesson 5, Activity 10: Frequency Analysis

© 2021, Ekaterina Kochmar

Your task in this activity is to:

- Pre-process the provided texts using tokenization.
- Run frequency analysis using the techniques you learned in Activity 9, then visualise and analyse the results.

## Task 1: Import data

Import data from NLTK (see http://www.nltk.org/book_1ed/ch02.html), for example, using the Gutenberg dataset:

In [13]:
import nltk
from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    # print(fileid, gutenberg.raw(fileid)[:65])
    pass

## Task 2: Use NLTK's FreqDist functionality

Use the `FreqDist` functionality as shown in https://www.nltk.org/book/ch01.html and http://www.nltk.org/book_1ed/ch02.html. 

For the datasets available via NLTK, you can either tokenize the raw text with `word_tokenize` or rely on the corpus reader's `.words()` method, which returns already tokenized output:

In [12]:
fdist1 = nltk.FreqDist(gutenberg.words("austen-emma.txt"))
fdist1
# Print out most frequent 50 words with their counts. 
# Hint: you need to use most_common(number_of_words) method applied to fdist1

FreqDist({',': 11454, '.': 6928, 'to': 5183, 'the': 4844, 'and': 4672, 'of': 4279, 'I': 3178, 'a': 3004, 'was': 2385, 'her': 2381, ...})

What can you tell about the most frequent words in this text?
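
As a small illustration of the `most_common` hint: `nltk.FreqDist` is a subclass of `collections.Counter`, so the method can be tried out with the standard library alone. The sketch below uses a toy token list (an assumption for illustration, not the *Emma* text).

```python
from collections import Counter

# Toy token list standing in for gutenberg.words(...); purely illustrative.
tokens = [",", "the", ",", "cat", ",", "the", "sat", "."]

# Counter offers the same most_common(n) method that FreqDist inherits.
fdist = Counter(tokens)
top_two = fdist.most_common(2)
print(top_two)  # [(',', 3), ('the', 2)]
```

Even in this toy list, punctuation and a function word come out on top, which is worth comparing with the real output above.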

Let's try visualising the cumulative frequency of the $30$ most frequent words:

In [None]:
# Hint: you need to use plot(number_of_words, cumulative=True) method applied to fdist1

What does this plot suggest?
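
To reason about the plot without drawing it, it may help to compute the running totals by hand. The sketch below assumes that a cumulative frequency plot shows, for each rank, the sum of the counts of all words up to that rank; the token list is a toy example, not corpus data.

```python
from collections import Counter
from itertools import accumulate

# Toy token list; purely illustrative.
tokens = [",", "the", ",", "cat", ",", "the", "sat", "."]

# Rank words from most to least frequent, then take running totals of the counts.
ranked = Counter(tokens).most_common()
cumulative = list(accumulate(count for _, count in ranked))
print(cumulative)  # [3, 5, 6, 7, 8]
```

The curve rises steeply at first and then flattens, because each additional word contributes a smaller count than the one before it.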

## Task 3: Implement FreqDist from scratch

Collect the words and count how often each one occurs, returning the counts as a dictionary; then sort its items by count in descending order:

In [None]:
import operator

def collect_word_map(word_list):
    word_map = {}
    for a_word in word_list:
        # Increase the count for a_word by 1.
        # word_map.get(a_word, 0) returns the current count,
        # or 0 if a_word has not been seen yet.
        word_map[a_word] = word_map.get(a_word, 0) + 1
    return word_map
    
# Let's sort the word frequency map by word counts, 
# starting from the largest count (reverse order), 
# and print up to 10 most frequent words
word_map = collect_word_map(gutenberg.words("austen-emma.txt"))
sorted_map = (sorted(word_map.items(), key=operator.itemgetter(1)))[::-1]
print(sorted_map[:10])
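
A note on the sorting step: reversing an ascending sort with `[::-1]` and passing `reverse=True` to `sorted` both put the largest counts first, but they order tied counts differently, because `reverse=True` keeps tied items in their original order. The sketch below shows this on toy pairs (illustrative values only).

```python
import operator

# Toy (word, count) pairs with a tie between "a" and "c"; purely illustrative.
pairs = [("a", 2), ("b", 1), ("c", 2)]

# Ascending sort followed by reversal: tied items end up in reversed order.
reversed_slice = sorted(pairs, key=operator.itemgetter(1))[::-1]
# Descending sort: tied items keep their original relative order.
reverse_flag = sorted(pairs, key=operator.itemgetter(1), reverse=True)

print(reversed_slice)  # [('c', 2), ('a', 2), ('b', 1)]
print(reverse_flag)    # [('a', 2), ('c', 2), ('b', 1)]
```

Either form works for this activity; the difference only matters if the order of equally frequent words is important.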

Let's calculate the percentage of the text covered by specific (most frequent) words. For example, what percentage of the tokens in the text are commas?

In [None]:
def collect_percentage_map(word_map, up_to):
    total_count = sum(word_map.values())
    # Sort the word frequency map by word counts, starting from the largest count (reverse order)
    sorted_map = (sorted(word_map.items(), key=operator.itemgetter(1)))[::-1]
    percentage_map = [(item[0], 100*float(item[1])/float(total_count)) for item in sorted_map[:up_to]]
    return percentage_map

print(collect_percentage_map(word_map, 50))
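
The percentage for a single word is its count divided by the total number of tokens, multiplied by 100. A minimal worked example on a toy token list (an assumption for illustration, not corpus data):

```python
from collections import Counter

# Toy token list; purely illustrative.
tokens = [",", "the", ",", "cat", ",", "the", "sat", "."]

counts = Counter(tokens)
total = sum(counts.values())  # 8 tokens in total

# Share of commas: 3 of 8 tokens.
comma_share = 100 * counts[","] / total
print(f"{comma_share:.1f}%")  # 37.5%
```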

Finally, let's visualise the cumulative coverage as a bar chart:

In [None]:
import numpy as np
import matplotlib.ticker as ticker
import matplotlib.pyplot as plt
%matplotlib inline

def visualise_dist(word_map, up_to):
    width = 10.0
    # Apply collect_percentage_map from above to get (word, percentage) pairs
    percentage_map = collect_percentage_map(word_map, up_to)
    sort_map = {}
    rank = 0
    cum_sum = 0
    # Store cumulative percentage of coverage
    for item in percentage_map:
        rank += 1
        cum_sum += item[1]
        sort_map[rank] = cum_sum
    # How much do the top n words account for?
    print("Total cumulative coverage = %.2f" % cum_sum + "%")
    
    fig, ax = plt.subplots()
    plt.title("Cumulative coverage of the top " + str(up_to) + " words")
    plt.ylabel("Percentage")
    plt.xlabel("Top " + str(up_to) + " words")
    # Build the bar chart of cumulative percentages, one bar per rank (1..n)
    plt.bar(list(sort_map.keys()), list(sort_map.values()))
    # Label the x axis with the ranks of the 1st to n-th most frequent word,
    # printing every 5th label on the axis
    ax.xaxis.set_ticks(np.arange(0, len(sort_map) + 1, 5))
    plt.show()
    
# Explore statistics with a different number of top n words
visualise_dist(word_map, 50)    

What does this cumulative distribution suggest?
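
One way to put a number on the answer is to ask how many of the top-ranked words are needed to reach a given share of the text. The helper below is a hypothetical sketch (the name `words_needed_for_coverage` is not part of the activity) and is shown on a toy frequency map.

```python
def words_needed_for_coverage(word_map, target_pct):
    """Return how many top-ranked words cover at least target_pct percent of all tokens."""
    total = sum(word_map.values())
    running = 0
    # Walk through the counts from largest to smallest, accumulating coverage.
    for rank, count in enumerate(sorted(word_map.values(), reverse=True), start=1):
        running += count
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= target_pct * total:
            return rank
    return len(word_map)

# Toy frequency map; purely illustrative.
toy_map = {",": 3, "the": 2, "cat": 1, "sat": 1, ".": 1}
print(words_needed_for_coverage(toy_map, 60))  # 2
```

Applied to a `word_map` built from a real text, a small result for a large target would indicate that a handful of very frequent words dominate the text.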

## Task 4: Apply to other texts

This is an open-ended task.
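
As one possible starting point, the sketch below compares the top-n coverage across several texts. It uses two short inline strings and a simple regular-expression tokenizer as a stand-in for `word_tokenize`; the names `simple_tokenize` and `top_n_coverage` are hypothetical helpers. With NLTK, the same loop could run over `gutenberg.fileids()` and use `gutenberg.words(fileid)` instead.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks (rough stand-in for word_tokenize)."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

def top_n_coverage(tokens, n):
    """Return the percentage of tokens accounted for by the n most frequent ones."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(n))
    return 100 * top / total

# Toy texts; purely illustrative.
texts = {
    "sample_a": "The cat sat on the mat. The dog sat too.",
    "sample_b": "To be, or not to be, that is the question.",
}

for name, raw in texts.items():
    # Lowercasing merges "The" and "the"; whether to do this is a design choice.
    tokens = [token.lower() for token in simple_tokenize(raw)]
    print(name, len(tokens), round(top_n_coverage(tokens, 3), 1))
```

Questions worth exploring include whether different authors or genres show similar coverage curves, and how lowercasing or removing punctuation changes the picture.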
