# Additional Exercises - Frequency Distribution


In this notebook (set of exercises) we will create a tool that, given a corpus of text files and a search term, is able to provide us with information about the frequency distribution of the term across the files in the corpus.

## Setup

This is a little bit of setup. First, we import the necessary libraries. Of course, feel free to add libraries as needed! Afterwards, we clone the workshop repository and use the provided helper script to download a series of Sherlock Holmes short stories.

In [None]:
# Regular Expressions
import re

# Pathlib
from pathlib import Path

# Counter for getting frequencies
from collections import Counter

# DataFrames
import pandas as pd

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
%%capture
!git clone https://github.com/IngoKl/python-programming-for-linguists
!cd python-programming-for-linguists/2021/data && sh download_sherlockholmes.sh

## Step 1: Preparing the Data

After running the `download_sherlockholmes.sh` script above, you will have 12 short stories (*The Adventures of Sherlock Holmes*) in the `python-programming-for-linguists/2021/data/corpora/holmes` folder.

The goal of this first step is to read and prepare the data. By the end of it, you should have created the data structure below.

Please note that there are other, better and more efficient, data structures to achieve the same goals. However, we are building a solution that mirrors practices in corpus linguistics without being too conscious of memory or computation limitations.

In [None]:
corpus = [
          {
           'filename': 'bery.txt', 
           'text': '...', 
           'story_title': 'THE ADVENTURE OF THE BERYL CORONET', 
           'length': None, 
           'frequencies': {}
          },
]

corpus

Obviously, your solution will create a list with more than one item. The `frequencies` dictionary as well as `length` can be empty for now. We will populate them in the next step. `text` is supposed to contain the actual text.

If you want to, you can preprocess the text before adding it to `corpus`.

The trickiest bit is getting the `story_title` from the file. Have a look at one of the actual text files and remember what you've learned about regular expressions.
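As a hedged illustration of the regular-expression idea (not a solution for the real files), the sketch below pulls the first all-caps line out of a made-up sample string. The sample text, the helper name `first_uppercase_line`, and the assumption that a title is the first line written entirely in capital letters are all hypothetical; look at the actual story files before adapting it.

```python
import re

# Made-up sample; the real story files may be laid out differently.
sample = """

                 THE ADVENTURE OF THE EXAMPLE STORY

                        Arthur Conan Doyle

     It was a cold morning when the story began.
"""

def first_uppercase_line(text):
    # With re.MULTILINE, ^ and $ match at the start and end of every line.
    # The group captures a line made up only of capitals, spaces, and basic punctuation.
    match = re.search(r"^[ \t]*([A-Z][A-Z '\-.,]+?)[ \t]*$", text, flags=re.MULTILINE)
    return match.group(1) if match else None

print(first_uppercase_line(sample))
```

The `re.MULTILINE` flag is what lets the pattern look at individual lines instead of the whole string at once.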

In [None]:
def get_story_title(text):
  # YOUR CODE
  title = None

  return title

def preprocess_text(text):
  # YOUR CODE

  return text

In [None]:
corpus = []
files = Path('python-programming-for-linguists/2021/data/corpora/holmes').glob('*.txt')

# YOUR CODE

## Step 2: Getting the Frequencies

You will need to generate frequency tables and add them to `corpus`. At the same time, you should populate `length` with the number of tokens in the document.

This also means that you will have to tokenize the stories first. Remember that you can use `dict()` to turn a `Counter` object into a dictionary.
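To make the `Counter` hint concrete, here is a minimal sketch on a toy sentence. The tokenizer shown (lowercasing and keeping runs of letters and apostrophes) is only one simple assumption; you may well want a different strategy for the actual stories.

```python
import re
from collections import Counter

sample = "The dog saw the other dog."

# Very simple tokenizer (an assumption): lowercase, then keep runs of letters and apostrophes
tokens = re.findall(r"[a-z']+", sample.lower())

counts = Counter(tokens)      # Counter object mapping token -> count
frequencies = dict(counts)    # plain dictionary with the same content
length = len(tokens)          # number of tokens in the text

print(frequencies)
print(length)
```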

Ultimately, `frequencies`, for each story, should contain a structure like the one below:

In [None]:
frequencies = {
    'word_a': 42,
    'word_b': 12,
}

In [None]:
def tokenize(text):
  # YOUR CODE
  pass

In [None]:
# YOUR CODE

## Step 3: Frequencies and Frequency Distribution

Now you will need to write a function that takes a `corpus` as well as a `search_term`. You will also need to account for both the absolute and the relative (per 1,000 tokens) frequencies.

If you need to check whether something is in a dictionary, you can do the following: `if x in y`
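The membership check and the per-1,000-tokens normalisation can be sketched for a single story as follows. The numbers and the helper name `per_thousand` are hypothetical and only serve to show the arithmetic: the absolute frequency is multiplied by 1,000 and divided by the story's token count.

```python
# Hypothetical values for a single story
frequencies = {'holmes': 46, 'watson': 8}
length = 9200  # total number of tokens in this (made-up) story

def per_thousand(frequencies, length, term):
    # Absolute frequency: fall back to 0 if the term never occurs
    if term in frequencies:
        abs_frequency = frequencies[term]
    else:
        abs_frequency = 0

    # Relative frequency: occurrences per 1,000 tokens
    rel_frequency = abs_frequency * 1000 / length

    return abs_frequency, rel_frequency

print(per_thousand(frequencies, length, 'holmes'))
print(per_thousand(frequencies, length, 'lestrade'))
```

Relative frequencies make stories of different lengths comparable, which is why the plot later on uses them instead of the raw counts.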

You will generate a frequency table for the search term that looks as follows:

In [None]:
frequency_table = {
    # Story title: (abs_frequency, rel_frequency_per_1000)
    'story_title_a': (1, 2),
    'story_title_b': (1, 2)
}

In [None]:
def get_frequencies(corpus, search_term):
  frequency_table = {}

  # YOUR CODE

  return frequency_table

The following code is **provided for you**. You don't have to change anything here. You just need to make sure that your `get_frequencies` function works well with it.

* We will nicely print the results
* We will calculate a very basic dispersion statistic (Range_2)
* We will plot the results using `seaborn`

In [None]:
def plot_frequency_table(frequency_table, search_term):

  df = pd.DataFrame(frequency_table).transpose()
  df.columns = ['abs_frequency', 'rel_frequency']
  df = df.sort_values('rel_frequency', ascending=False)

  ax = sns.barplot(y=df.index, x='rel_frequency', data=df, color='#EF2D56')
  ax.set_title(f'Frequency Distribution of {search_term} (per 1,000 Tokens)')

In [None]:
search_term = 'watson'

parts_with_st = 0
frequency_table = get_frequencies(corpus, search_term)

print(f'Distribution of "{search_term}":\n')
for s in frequency_table:
  
  if frequency_table[s][0] > 0:
    parts_with_st += 1

  print(f'- {frequency_table[s][0]} ({round(frequency_table[s][1], 2)} per 1,000 tokens) in {s}')

# Range_2
range_2 = ( parts_with_st / len(frequency_table.keys()) ) * 100

print(f'\nThe Range_2 is: {round(range_2, 2)}%\n')

plot_frequency_table(frequency_table, search_term)