ELEC-E5550 - Statistical Natural Language Processing
# SET 1: Text Preprocessing

## Released: 16.01.2024
## Deadline: 26.01.2024 at 23:59

# Overview
Consider this assignment an introduction to the statistics of different language units (letters, pairs of letters, words). We will explore the frequency distributions of these units and then discuss what this knowledge can be used for. We'll also talk about how to handle raw text, how to separate it into different units, and what those units and operations are called.

# Table of contents

* [Task 1: Letter and Letter pair Frequency analysis](#task_1)
    * [Step 1.1: Prepare the text (1 point)](#subtask_1_1)
    * [Step 1.2: Get letter frequencies (3 points)](#subtask_1_2)
    * [Step 1.3: Letter frequency analysis (1 point)](#subtask_1_3)
    * [Step 1.4: Count all possible two-letter strings (1 point)](#subtask_1_4)
    * [Step 1.5: Get letter pair counts (3 points)](#subtask_1_5)
    * [Step 1.6: Letter pair frequency analysis (3 points)](#subtask_1_6)
* [Task 2: Word Tokenization](#task_2)
    * [Step 2.1: Tokenize by whitespaces (1 point)](#subtask_2_1)
    * [Step 2.2: Tokenize with regular expressions (5 points)](#subtask_2_2)
    * [Step 2.3: Use Treebank tokenizer (1 point)](#subtask_2_3)
* [Task 3: Word frequencies](#task_3)
    * [Step 3.1: Analyse word frequencies (3 points)](#subtask_3_1)
    * [Step 3.2: Remove stop words (1 point)](#subtask_3_2)
* [Checklist before submission](#checklist)

## TASK 1 <a class="anchor" id="task_1"></a>
## Letter and Letter pair Frequency analysis

The data used in this assignment is "The Gold-Bug" by Edgar Allan Poe. It is actually a story about the importance of letter frequencies. One of its characters deciphers a message leading to a hidden treasure by applying frequency analysis. The cipher used in the story is a substitution cipher, where each letter is replaced by a different letter or number.
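To see why frequency analysis works against such a cipher, here is a minimal sketch. The plaintext and the shift-by-three key are made up for illustration and are not the cipher from the story. A substitution only relabels letters, so the counts themselves survive encryption.

```python
from collections import Counter

# Toy plaintext (spaces already removed) and a hypothetical key: shift every letter by 3.
plaintext = "meetmeattheoldoaktree"
alphabet = "abcdefghijklmnopqrstuvwxyz"
shifted = alphabet[3:] + alphabet[:3]
key = str.maketrans(alphabet, shifted)

ciphertext = plaintext.translate(key)
print(ciphertext)

# The counts are the same before and after encryption; only the labels change.
plain_counts = sorted(Counter(plaintext).values(), reverse=True)
cipher_counts = sorted(Counter(ciphertext).values(), reverse=True)
print(plain_counts == cipher_counts)

# The most frequent ciphertext symbol therefore stands for the most frequent plaintext letter.
print(Counter(plaintext).most_common(1), Counter(ciphertext).most_common(1))
```

An attacker who knows that 'e' is the most frequent English letter can guess that the most frequent ciphertext symbol stands for 'e', and continue from there.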

Knowing the frequency of letters in a language is important not only for solving ciphers, but it also has practical applications like data compression. For example, Morse code uses the shortest symbols for the most frequent letters. 
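As a small illustration of the Morse code remark, the sketch below pairs a few International Morse code symbols with rounded letter frequencies from the Gold-Bug text (the frequencies computed in step 1.2 of this assignment). The selection of five letters is arbitrary.

```python
# International Morse code for a handful of letters.
morse = {"e": ".", "t": "-", "a": ".-", "q": "--.-", "z": "--.."}

# Rounded relative frequencies of the same letters in the Gold-Bug text.
freq = {"e": 0.1313, "t": 0.0941, "a": 0.0772, "q": 0.0010, "z": 0.0008}

# Print letters from most to least frequent together with their code length.
for letter in sorted(freq, key=freq.get, reverse=True):
    print(letter, freq[letter], morse[letter], len(morse[letter]))
```

Frequent letters such as 'e' and 't' get one-symbol codes, while rare ones such as 'q' and 'z' need four symbols.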

For the purposes of this assignment, "The Gold-Bug" text serves as a representation of the English language. While the text is short, the statistics you will compute from it in this assignment reflect those of the English language in general. You will not need any external data to answer the questions in this assignment.

In the first task you'll need to discover for yourself the frequency distribution of single letters and of letter pairs in English.

## 1.1  <a class="anchor" id="subtask_1_1"></a>
### Prepare the text (1 point)

First of all, we need to load the text into the Jupyter Notebook and prepare it for further analysis. Create a function that reads the data located in `/coursedata/text-processing/the_gold-bug.txt`, lowercases it, and returns the text as a list of separate letters, ignoring all non-alphabetic characters.

HINT1: [string methods might come in handy](https://www.w3schools.com/python/python_ref_string.asp)

HINT2: you can employ Python's open() and read() functions

HINT3: Alphabetic characters are characters defined as “Letter” in the Unicode character database
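To make HINT3 concrete, here is a small sketch on a made-up string. `str.isalpha()` is true for any Unicode letter, including accented ones, and false for digits, punctuation and whitespace.

```python
# Made-up sample; "\u00e9" is the letter e with an acute accent.
sample = "The Gold-Bug: 2 caf\u00e9s!"

# Keep only letters and lowercase them.
letters = [ch.lower() for ch in sample if ch.isalpha()]
print("".join(letters))
```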

In [2]:
def read(file_name):
    """This function creates a list of lowercase alphabetic characters from a .txt file 
    
    Parameters
    ----------
    file_name : str
        a path to the text file
    
    Returns
    -------
    lowercased_letters : list of strings
        text as a list of lowercase letters
    """
    # YOUR CODE HERE
    # raise NotImplementedError()

    with open(file_name, encoding='utf-8') as file:
        text = file.read()
        lowercased_letters = [char.lower() for char in text if char.isalpha()]    

    return lowercased_letters

bug_file_path = '../coursedata/text-processing/the_gold-bug.txt'
bug_letters = read(bug_file_path)
print(bug_letters[:30])

['t', 'h', 'e', 'g', 'o', 'l', 'd', 'b', 'u', 'g', 'w', 'h', 'a', 't', 'h', 'o', 'w', 'h', 'a', 't', 'h', 'o', 't', 'h', 'i', 's', 'f', 'e', 'l', 'l']


In [2]:
!pip install nose
from nose.tools import assert_equal

# checks if your function returns a list
assert_equal(type(read(bug_file_path)), list)

# checks if your function returns a list of strings
assert_equal(type(read(bug_file_path)[0]), str)

# checks if your list is of the right length
assert_equal(len(read(bug_file_path)), 58269)

# checks if the strings in your list are alphabetic characters
assert_equal(all([letter.isalpha() for letter in read(bug_file_path)]), True)

# checks if the strings in your list are lowercased
assert_equal(all([letter.islower() for letter in read(bug_file_path)]), True)

# checks if your list has the first 10 members right 
assert_equal(read(bug_file_path)[:10],
             ['t', 'h', 'e', 'g', 'o', 'l', 'd', 'b', 'u', 'g'])




## 1.2 <a class="anchor" id="subtask_1_2"></a>
### Get letter frequencies (3 points)

Now we can count how many times each letter occurred in the story, and then turn these counts into Maximum Likelihood probability estimates of seeing each letter. 

To do so, write a function that takes in a list of letters and returns a dictionary with their frequencies relative to the size of the whole text: $ freq_x = \frac{n_x}{N} $, where $n_x$ is the number of times a letter $x$ was seen, and $N$ is the total number of letters in the text. These relative frequencies are probability estimates for the letters in our text.

HINT: you might find **nltk.FreqDist** or **collections.Counter** useful
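As a worked instance of the formula on a toy input (the variable names below are illustrative only): with the letters b, b, b, a, a, c we have $N = 6$, so $freq_b = 3/6$, $freq_a = 2/6$ and $freq_c = 1/6$.

```python
from collections import Counter

toy_letters = ["b", "b", "b", "a", "a", "c"]

counts = Counter(toy_letters)      # n_x for every letter x
total = sum(counts.values())       # N, the total number of letters

# Relative frequency (MLE probability) of each letter.
toy_probs = {letter: n / total for letter, n in counts.items()}
print(toy_probs)
```

Because every count is divided by the same total, the values sum to one.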

In [3]:
from collections import Counter
from nltk.probability import FreqDist

def get_freq(letters):
    """This function computes MLE probabilities of letters
    
    Given a list of letters, this function should return a probability dictionary,
    where letters are keys and their MLE probabilities are values

    Parameters
    ----------
    letters : list of strings
        text as a list of only lowercase letters 
    
    Returns
    -------
    letter_freq_dict : dictionary-like object
        a frequency dictionary of letters where values are relative frequencies (MLE probabilities)
    """
    
    # YOUR CODE HERE
    # raise NotImplementedError()

    freq_dist = FreqDist(letters)
    total_letters = len(letters)    
    # Convert the counts to probabilities
    letter_freq_dict = {letter: freq / total_letters for letter, freq in freq_dist.items()}
    return letter_freq_dict

bug_probability_dict = get_freq(bug_letters)
print(bug_probability_dict)

{'t': 0.09413238600284886, 'h': 0.05786953611697472, 'e': 0.13125332509567694, 'g': 0.01961591927096741, 'o': 0.07221678765724485, 'l': 0.03988398633921982, 'd': 0.04347079922428736, 'b': 0.01769379944739055, 'u': 0.03248725737527673, 'w': 0.022361804733220067, 'a': 0.07722802862585594, 'i': 0.07178774305376787, 's': 0.06034083303300211, 'f': 0.023889203521598106, 'n': 0.06713689955207744, 'c': 0.026137397243817466, 'm': 0.025725514424479567, 'y': 0.019667404623384645, 'r': 0.056256328407901283, 'v': 0.009009936673016528, 'q': 0.0010297070483447459, 'p': 0.020062125658583466, 'k': 0.006023786232816764, 'x': 0.0020594140966894918, 'z': 0.0007551185021194804, 'j': 0.0019049580394377799}


In [4]:
from numpy.testing import assert_almost_equal

# checks if the alphabet length is 26
assert_equal(len(get_freq(bug_letters)), 26)

# checks if probability of all the letters equals one
assert_almost_equal(sum(get_freq(bug_letters).values()), 1., 3)  # the probabilities must sum to one

# checks if the algorithm is doing what it is supposed to be doing on a dummy example
assert_equal(get_freq(['b','b','b','a','a','c']), {"b":3/6,"a":2/6,"c":1/6})

# checks if the probability of 'e' is correct
assert_almost_equal(get_freq(bug_letters)['e'], 0.13125332509567694, 3)


## 1.3 <a class="anchor" id="subtask_1_3"></a>
### Letter frequency analysis (1 point)
The counts of letters in a text differ considerably. That means the probabilities of seeing each letter are also different.

Using your frequency dictionary, answer the following questions:

1. What is the probability of the most frequent letter? (0.3 points)
2. What is the least probable letter? (0.3 points)
3. What is the order of English letters according to their probability? Answer with a sorted string of letters starting with the most frequent one. (0.4 points)

Type your answers in the cell below. You can create an additional cell to do the calculations if needed. If the answer is a float, always assign the full float, or the code that produces it, to the variable. Floats are checked up to 3 decimal places, so there is some room for floating point errors.

In [5]:
# YOUR CODE HERE
# raise NotImplementedError()

sorted_bug_probability_dict = dict(sorted(bug_probability_dict.items(), key=lambda item: item[1], reverse=True))
print("The most frequent letter probability:\n", list(sorted_bug_probability_dict.values())[0])
print("The least probable letter:\n", list(sorted_bug_probability_dict.keys())[-1])
print("sorted letters (most to least frequent):\n", ''.join(list(sorted_bug_probability_dict.keys())))

# put your answer to question 1 as a float number into the variable below
# For example:
# most_frequent_letter_prob = 0.12345
most_frequent_letter_prob = 0.13125332509567694 ## FILL IN THE ANSWER

# put your answer to question 2 as a string into the variable below
# For example:
# least_probable_letter = 'a'
least_probable_letter = 'z' ## FILL IN THE ANSWER

# put your answer to question 3 as a string into the variable below
# For example:
# sorted_letters = 'abcdefghijklmnopqrstuvwxyz'
sorted_letters = 'etaoinshrdlucmfwpygbvkxjqz' ## FILL IN THE ANSWER

The most frequent letter probability:
 0.13125332509567694
The least probable letter:
 z
sorted letters (most to least frequent):
 etaoinshrdlucmfwpygbvkxjqz


In [6]:
# checks if your answer is a float
assert_equal(type(most_frequent_letter_prob), float)

In [7]:
# checks if your answer is a string
assert_equal(type(least_probable_letter), str)

In [8]:
# checks if you typed in a string of 26 letters
assert_equal(type(sorted_letters), str)
assert_equal(len(sorted_letters), 26)
# checks if the 3rd most frequent letter is 'a'
assert_equal(sorted_letters[2], 'a')

## 1.4 <a class="anchor" id="subtask_1_4"></a>
### Count all possible two-letter strings (1 point)
Similar to single letters, some combinations of language units are more likely than others. You'll see it in a minute.

There are 26 letters in the English alphabet. How many possible two-letter strings are there according to combinatorics (these are 'permutations with repetition', combinatorially speaking)? For example, if we have an alphabet of 3 letters **a**, **b** and **c**, we can have 9 two-letter strings: **aa**, **bb**, **cc**, **ab**, **ac**, **ba**, **bc**, **ca**, **cb**.

Complete the function below, so we can count all n-length strings for an alphabet of any length.
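One way to sanity-check a counting formula is to enumerate the strings by brute force for a small alphabet. The sketch below does this with `itertools.product` for the three-letter example above. It is only a check, since enumeration becomes impractical for large alphabets or long strings.

```python
from itertools import product

alphabet = "abc"

# Every ordered choice of 2 letters, repetition allowed.
strings = ["".join(p) for p in product(alphabet, repeat=2)]
print(strings)
print(len(strings))
```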

In [9]:
def number_of_permutations(number_of_letters, sequence_len):
    """Counts the number of possible letter permutations in a string.
    
    This function takes the number of letters in an alphabet and the desired length of a string,
    and outputs the number of all possible letter permutations in a string of this length.
    A string can contain the same letter "sequence_len" times.
    
    Parameters
    ----------
    number_of_letters : int 
        a number of letters in an alphabet
    sequence_len : int
        the length of a string
    
    Returns
    -------
    num_of_permutations : int
        the number of all possible strings of a given length with a given alphabet
    """
    
    # YOUR CODE HERE
    # raise NotImplementedError()
    num_of_permutations = number_of_letters ** sequence_len
    return num_of_permutations

In [10]:
assert_equal(number_of_permutations(2,2), 4)
assert_equal(number_of_permutations(2,1), 2)
assert_equal(number_of_permutations(3,2), 9)


## 1.5 <a class="anchor" id="subtask_1_5"></a>
### Get letter pair counts (3 points)

In the previous step, you computed the total number of possible permutations, where two-letter strings were given as an example. However, not all strings of two letters actually appear in English. A fact like this can be used in applications such as predictive text: your phone suggests the next word you might need based on the previous words you typed (not all word sequences are possible or equally probable).

In the following task, you'll need to count all two-letter strings that appear in the given Gold-Bug text.
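The predictive-text idea can be sketched at the letter level. The toy string and the helper `most_likely_next` below are hypothetical and only illustrate how pair counts can rank possible continuations.

```python
from collections import Counter

toy_text = "thethingthatthey"

# Slide a window of width 2 over the text and count each pair.
pairs = Counter(a + b for a, b in zip(toy_text, toy_text[1:]))

def most_likely_next(letter, pair_counts):
    """Return the letter that most often follows `letter`, or None if it was never seen."""
    followers = {pair[1]: n for pair, n in pair_counts.items() if pair[0] == letter}
    return max(followers, key=followers.get) if followers else None

print(pairs.most_common(3))
print(most_likely_next("t", pairs))
print(most_likely_next("z", pairs))
```

In this toy text 't' is followed by 'h' four times and by 't' once, so 'h' is the suggested continuation. A letter that never occurs yields no suggestion.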

In [11]:
def count_and_sort_pairs(letters):
    """This function counts letter pairs and sorts them according to their frequency.
    
    This function takes a text represented as a list of lowercase letters
    and converts it into a sorted list of tuples, where the first element
    is a two-letter string, and the second element is the count of this letter pair in the text.
    The first element of a list should be a tuple for the most frequent pair.
    
    Parameters
    ----------
    letters : list of strings
        text as a list of lowercase letters 
    
    Returns
    -------
    pairs_sorted : list of (str, int) tuples
        a list of tuples (letter_pair, count) sorted by the count element 
    """
    
    # YOUR CODE HERE
    # raise NotImplementedError()
    
    letters_string = ''.join(letters)
    # Count all possible two-letter pairs using a sliding window approach
    letter_pairs = [letters_string[i:i+2] for i in range(len(letters_string)-1)]
    letter_pairs_freq = Counter(letter_pairs)
    pairs_sorted = sorted(letter_pairs_freq.items(), key=lambda pair: pair[1], reverse=True)
    return pairs_sorted

bug_pairs_sorted = count_and_sort_pairs(bug_letters)

print(bug_pairs_sorted[:20])

[('th', 1800), ('he', 1432), ('in', 1099), ('re', 935), ('er', 924), ('an', 858), ('ed', 790), ('en', 726), ('es', 692), ('nd', 676), ('ea', 666), ('nt', 662), ('at', 646), ('it', 642), ('ha', 628), ('on', 614), ('ou', 603), ('te', 582), ('ti', 567), ('st', 567)]


In [12]:
# checks if the function returns a list
assert_equal(type(count_and_sort_pairs(bug_letters)), list)

# checks if the function returns a list of tuples
assert_equal(type(count_and_sort_pairs(bug_letters)[0]), tuple)

# checks if the function returns a list of tuples (str, int)
assert_equal((type(count_and_sort_pairs(bug_letters)[0][0]),type(count_and_sort_pairs(bug_letters)[0][1])), (str,int))

# checks that the most frequent pair was seen 1800 times
assert_equal(count_and_sort_pairs(bug_letters)[0][1], 1800)

# checks if the function works correctly on the dummy example
assert_equal(count_and_sort_pairs("aaaabbabc")[:2], [("aa",3),("ab",2)])
assert(count_and_sort_pairs("aaaabbabc")[2] in [("ba",1), ("bb", 1), ("bc", 1)])


## 1.6 <a class="anchor" id="subtask_1_6"></a>
### Letter pair frequency analysis (3 points)
Using the sorted list you've created (`bug_pairs_sorted`), answer the following questions in the cell below:

1. How many different two-letter combinations have you actually encountered in the data? (0.6 points)
2. What fraction of all theoretically possible two-letter strings (i.e. permutations) is it? (0.6 points)
3. What is the most frequent two-letter string in English (make a mental note if it is surprising or not)? The Gold-Bug text serves here as a proxy for the English language. (0.6 points)
4. What is the probability of seeing a pair where both letters are the same? (0.6 points)
5. What is the probability of a pair starting with 'm'? (0.6 points)

Type your answers in the cell below. You can create an additional cell to do the calculations if needed.

In [13]:
# YOUR CODE HERE
# raise NotImplementedError()

n_pairs = len(bug_pairs_sorted)

frac_of_pairs = n_pairs / (26 * 26)

most_frequent_pair = bug_pairs_sorted[0][0]

total_counts = sum(count for pair, count in bug_pairs_sorted)
p_same_letters = sum(count for pair, count in bug_pairs_sorted if pair[0] == pair[1]) / total_counts

p_starts_with_m = sum(count for pair, count in bug_pairs_sorted if pair[0] == 'm') / total_counts

# Output the results
print(f"Number of different two-letter combinations: {n_pairs}")
print(f"Fraction of all theoretically possible two-letter strings: {frac_of_pairs}")
print(f"Most frequent two-letter string: {most_frequent_pair}")
print(f"Probability of seeing a pair where both letters are the same: {p_same_letters}")
print(f"Probability of a pair starting with 'm': {p_starts_with_m}")


# put your answer to question 1 as an int into the variable below
# For example:
# n_pairs = 1000
n_pairs = 519 ##FILL IN THE ANSWER

# put your answer to question 2 as a float into the variable below
# For example:
# frac_of_pairs = 0.12345
frac_of_pairs = 0.7677514792899408 ##FILL IN THE ANSWER

# put your answer to question 3 as a string into the variable below
# For example:
# most_frequent_pair = 'ab'
most_frequent_pair = 'th' ##FILL IN THE ANSWER

# put your answer to question 4 as a float into the variable below
# For example:
# p_same_letters = 0.12345
p_same_letters = 0.03459875060067275 ##FILL IN THE ANSWER

# put your answer to question 5 as a float into the variable below
# For example:
# p_starts_with_m = 0.12345
p_starts_with_m = 0.025725955927781975 ##FILL IN THE ANSWER

Number of different two-letter combinations: 519
Fraction of all theoretically possible two-letter strings: 0.7677514792899408
Most frequent two-letter string: th
Probability of seeing a pair where both letters are the same: 0.03459875060067275
Probability of a pair starting with 'm': 0.025725955927781975


In [14]:
# checks if your answer is an int
assert_equal(type(n_pairs) , int)

In [15]:
# checks if your answer is a float
assert_equal(type(frac_of_pairs), float)

In [16]:
# checks if your answer is a string
assert_equal(type(most_frequent_pair),str)

In [17]:
# checks if your answer is a float
assert_equal(type(p_same_letters),float)

In [18]:
# checks if your answer is a float
assert_equal(type(p_starts_with_m),float)

## TASK 2 <a class="anchor" id="task_2"></a>
## Word Tokenization

In this task, you will create a function that splits the text into more elaborate units than just letters: words.

Text data is a part of virtually any NLP application. Sometimes you're lucky, and instead of plain raw text you get nice and clean text, but this is not always the case. Before getting your hands dirty with your actual application, you would most probably need to perform some manipulation of the text. For instance, separate it into words and sentences, and remove unwanted symbols. Different tasks require different preprocessing techniques. In this task we'll use some simple ones.

It's not trivial to separate words from a string of text. The first thing that needs to be decided is what to count as a word. Should punctuation and numbers be considered words? Should *frogs* and *frog* be considered the same word? What about *Frog*, *frog* and *FROG*? Before answering those questions, let's make sure we are on the same page and discuss some terminology.

When talking about words, we can mean several different things: lemmas, word types and word tokens.

* **Lemma** - an identifier of a set of lexical forms sharing the same stem (*run* is the lemma for *runs* and *running*), a dictionary form of a word.
* **Word type** - a distinct word in a text (all the instances of *runs* are counted as one word type).
* **Word token** - every instance of word occurrence (every instance of *runs* counted as a separate word token).

Thus:
* **Tokenization** - a process of separating out word tokens from text
* **Lemmatization** - a process of assigning a group of word forms their lemma, and further separating out these lemmas from text
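The difference between these notions can be shown on a made-up sentence. The tiny lemma table below is hand-written for this example and is not a real lemmatizer.

```python
tokens = "the dog runs and the cat runs".split()

n_tokens = len(tokens)       # every occurrence counts
n_types = len(set(tokens))   # distinct words only
print(n_tokens, n_types)

# Hypothetical lemma table covering just this example.
toy_lemmas = {"runs": "run", "running": "run"}
lemmas = [toy_lemmas.get(token, token) for token in tokens]
print(lemmas)
```

The sentence has 7 word tokens but only 5 word types, because *the* and *runs* each occur twice.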

It is common to also widen the definition of 'word' to include punctuation.

Generally, English doesn't require lemmatization since it has quite a limited number of word forms. For this reason, we'll leave this task out, for now, and focus on tokenization instead.

Let's create a tokenizer that considers numbers and punctuation as tokens and doesn't separate hyphenated words like *dum-dum*. For that you'll need:
- regular expressions
- string operations

## 2.1 <a class="anchor" id="subtask_2_1"></a>
### Tokenize by whitespaces (1 point)
Let's start off by separating words just by whitespaces and see what happens to our dummy sentence example: 
*It's a dum-dum example, we'll place it here to prove a point. Also, look at this number: 300.99.*

HINT: One string method is particularly useful
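A short sketch of why the choice of separator matters, using a made-up string: splitting on a literal space leaves empty strings and ignores tabs and newlines, whereas `str.split()` with no argument treats any run of whitespace as one separator.

```python
text = "one  two\tthree\nfour"

on_space = text.split(" ")   # splits only on single space characters
on_any = text.split()        # splits on any run of whitespace

print(on_space)
print(on_any)
```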

In [19]:
dummy_example = "It's a dum-dum example, we'll place it here to prove a point. Also, look at this number: 300.99."

def whitespace_tokenize(raw_string):    
    """This function tokenizes strings by whitespaces.
    
    Any whitespace separator should work. 
    For example, this function should be able to tokenize by '\n',
    and two consecutive whitespaces should be regarded as a single separator.
    
    Parameters
    ----------
    raw_string : str
        some text to tokenize
        
    Returns
    -------
    whitespace_tokenized : list of strings
        list of tokens 
    """
    # YOUR CODE HERE
    # raise NotImplementedError()
    whitespace_tokenized = raw_string.split()
    return whitespace_tokenized

dum_dum_example = whitespace_tokenize(dummy_example)
# see what you've got
print(dum_dum_example)

["It's", 'a', 'dum-dum', 'example,', "we'll", 'place', 'it', 'here', 'to', 'prove', 'a', 'point.', 'Also,', 'look', 'at', 'this', 'number:', '300.99.']


In [20]:
# checks if the first token is correct
assert_equal(dum_dum_example[0], "It's")

# checks if number of tokens is correct
assert_equal(len(dum_dum_example), 18)

# checks if all whitespaces were removed
assert_equal(any([' ' in t for t in dum_dum_example]), False)

# checks if double whitespaces are removed
assert_equal(whitespace_tokenize('  a  a  '), ['a','a'])

# checks if tabs are removed too
assert_equal(whitespace_tokenize('  a \t a  '), ['a','a'])



## 2.2 <a class="anchor" id="subtask_2_2"></a>
### Tokenize with regular expressions (5 points)

As can be seen from the dummy example, it's not enough to just separate words by the whitespaces. This way we get tokens like *'example,'*, *'point.'* and *'number:'*. It's not ideal because we would actually like to have punctuation marks as separate tokens but keep them inside the items like prices and numbers (4.99). Thus, we need something more complex: a regular expression.


If you've never used regular expressions before or you've never used them in Python, you can read about how they work with the re module [here](https://docs.python.org/3/howto/regex.html). 

But to give you a simple example, you can think of a regular expression as a shoe that only fits some strings. For example, the regular expression 'a' (`re.compile("a")`) only fits the string 'a', but a regular expression with the special character '\d' (`re.compile(r"\d")`) fits any digit '0','1','2',...'9'. The regular expression 'a|\d' (`re.compile(r"a|\d")`) fits either the string 'a' OR any single digit ('|' plays the role of disjunction). You can find all the substrings of a string that match a regular expression pattern with the following piece of code: `re.findall(regular_expression, string_to_look_for_matches)`. More examples are in the cell below.

In [21]:
# example of regular expression usage
import re

# raw strings (r"...") keep backslashes such as \d from being read as string escapes
regex_digit = re.compile(r"\d") # any digit
regex_a = re.compile(r"a") # only letter 'a'
regex_a_digit = re.compile(r"a\d") # letter 'a' followed by any digit 
regex_a_or_digit = re.compile(r"a|\d") # letter 'a' OR any digit 
regex_digit_once_or_more = re.compile(r"\d+") # any digit one or more times

print(re.findall(regex_digit ,'a123hnd7hjaf'))
print(re.findall(regex_a ,'a123hnd7hjaf'))
print(re.findall(regex_a_digit ,'a123hnd7hjaf'))
print(re.findall(regex_a_or_digit ,'a123hnd7hjaf'))
print(re.findall(regex_digit_once_or_more ,'a123hnd7hjaf'))

['1', '2', '3', '7']
['a', 'a']
['a1']
['a', '1', '2', '3', '7', 'a']
['123', '7']


We can use regular expressions to describe what type of substring we want to get out of a string. In addition to separation by whitespaces, we want to keep words or numbers with a hyphen, an apostrophe or a point inside, and we want to split punctuation marks from the end of words. For these means, you'll need to write a regular expression that matches:

- all alphanumeric strings with hyphen, apostrophe or point inside (i.e. should be able to find "44.44","a-ha","it's")

**OR**
- any non-whitespace character followed by zero or more alphanumeric characters (i.e. "hello!.?" should result in "hello", "!", "." and "?".)

HINT1: "\S" - non-whitespace character 

HINT2: "\w" - alphanumeric character (a letter, a digit or the underscore)

HINT3: "[123]" - one of the characters in the brackets

HINT4: "*" - zero or more times

HINT5: more hints in this [cheat sheet](https://www.rexegg.com/regex-quickstart.html)
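Each of the constructs from the hints can be tried in isolation on made-up strings. The sketch below is only a demonstration of the building blocks, not the tokenizer itself.

```python
import re

non_space = re.findall(r"\S", "a b\tc")        # every non-whitespace character
word_chars = re.findall(r"\w", "it's 4!")      # letters, digits and underscore
one_of = re.findall(r"[123]", "a1b2c4")        # only the characters listed in brackets
zero_or_more = re.findall(r"ab*", "a ab abbb") # 'a' followed by zero or more 'b'

print(non_space)
print(word_chars)
print(one_of)
print(zero_or_more)
```

Note that the apostrophe and the exclamation mark are not matched by `\w`, and that `*` also accepts zero repetitions, so a bare 'a' matches `ab*`.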

In [22]:
# YOUR CODE HERE
# raise NotImplementedError()

# put a compiled regular expression tokenizer into the variable below
# For example:
# regex_tokenizer = re.compile(r'\w+')
regex_tokenizer = re.compile(r"\w+[-.']*\w+|\S") ## FILL IN THE ANSWER

# look if our dummy example is now tokenized properly
print(re.findall(regex_tokenizer, dummy_example))

["It's", 'a', 'dum-dum', 'example', ',', "we'll", 'place', 'it', 'here', 'to', 'prove', 'a', 'point', '.', 'Also', ',', 'look', 'at', 'this', 'number', ':', '300.99', '.']


In [23]:
# checks if punctuation marks are separated from the end of a word
assert_equal(re.findall(regex_tokenizer, "hello!.?"),['hello', '!', '.', '?'])

# checks if punctuation marks are separated from the end of a word and words are whitespace separated
assert_equal(re.findall(regex_tokenizer, "well,  well, well"),['well', ',', 'well', ',', 'well'])

# checks if words and numbers with -'. inside are kept intact
assert_equal(re.findall(regex_tokenizer, "bye-bye, we'll call you at 3.15"),
             ['bye-bye', ',', "we'll", 'call', 'you', 'at', '3.15'])

# checks if the last token in the dummy sentence is correct
assert_equal(re.findall(regex_tokenizer, dummy_example)[-1], '.')

# checks if number of tokens in the dummy sentence is correct
assert_equal(len(re.findall(regex_tokenizer, dummy_example)), 23)


## 2.3 <a class="anchor" id="subtask_2_3"></a>
### Use Treebank tokenizer (1 point)

As you've already noticed, the process of creating a tokenizer is pretty complicated. There are many more things to consider, and a tokenizer should be chosen in accordance with a task. For example, we might also want to capture abbreviations (U.S.A.), percentages (82%) or URLs.

Luckily, there are already several good tokenizers implemented for us. For instance, the NLTK package has several (https://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize). 

Let's tokenize our text using the Treebank tokenizer. It uses regular expressions to tokenize text so that the tokens match those used in the popular [Penn Treebank](https://web.archive.org/web/19970614160127/http://www.cis.upenn.edu/~treebank/) dataset. This tokenizer assumes that the text has already been segmented into sentences, so first perform sentence segmentation using NLTK's `sent_tokenize()`. Don't forget to lowercase the tokens after tokenizing: if you lowercase before tokenization, it may affect the segmentation algorithm.
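To illustrate why the order of lowercasing and segmentation can matter, here is a deliberately naive sentence splitter. It is a stand-in written for this example and is not how `sent_tokenize()` works internally. It only assumes that capitalization can serve as a cue for a sentence boundary.

```python
import re

def naive_sent_split(text):
    """Toy splitter: break after '.', '!' or '?' when the next word starts with a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

sample = "It was hidden. We found it at 4 p.m. near the tree."

original_case = naive_sent_split(sample)
lowercased_first = naive_sent_split(sample.lower())

print(original_case)
print(lowercased_first)
```

With the original casing the toy splitter finds two sentences and does not break after "p.m.". After lowercasing, the capitalization cue is gone and the whole text stays in one piece.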

In [24]:
# There were some issues related to SSL certificates when downloading
# nltk library (punkt).
# This is a workaround that disables SSL check.
import ssl
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import sent_tokenize, TreebankWordTokenizer

# here is how the Treebank tokenizer handles the dummy sentence
print(TreebankWordTokenizer().tokenize(dummy_example))

def tokenize_and_lowercase(file_name):
    """This function tokenizes text files into lowercased tokens with TreebankWordTokenizer
    
    Read a text file into a string,
    tokenize this string into sentences using sent_tokenize(),
    tokenize each sentence into tokens using TreebankWordTokenizer().tokenize(),
    lowercase each token


    Parameters
    ----------
    file_name : str
        a path to the text file
    
    Returns
    -------
    tokens : list of strings
        text as a list of lowercased tokens
    """
    # YOUR CODE HERE
    # raise NotImplementedError()

    with open(file_name, 'r', encoding='utf-8') as file:
        text = file.read()
    tokens = []
    sentences = sent_tokenize(text)
    tokenizer = TreebankWordTokenizer()
    for sentence in sentences:
        words = tokenizer.tokenize(sentence)
        tokens.extend([word.lower() for word in words]) 
    return tokens


tokenized_bug = tokenize_and_lowercase(bug_file_path)

['It', "'s", 'a', 'dum-dum', 'example', ',', 'we', "'ll", 'place', 'it', 'here', 'to', 'prove', 'a', 'point.', 'Also', ',', 'look', 'at', 'this', 'number', ':', '300.99', '.']


In [25]:
# checks if the number of tokens is correct
assert_equal(len(tokenize_and_lowercase(bug_file_path)), 16172)
# checks if all tokens are lowercased
assert_equal(all([x == x.lower() for x in tokenize_and_lowercase(bug_file_path)]), True)
# checks if the first token is correct
assert_equal(tokenize_and_lowercase(bug_file_path)[0], 'the')
# checks if the last 5 tokens are correct
assert_equal(tokenize_and_lowercase(bug_file_path)[-5:], ['who', 'shall', 'tell', '?', '”'])



## TASK 3 <a class="anchor" id="task_3"></a>
## Word frequencies

In this task we will explore the distribution of word frequencies and discuss what influence it can have on different NLP tasks. Here you need to remember the definitions of word token and word type given at the beginning of [task 2](#task_2).

## 3.1 <a class="anchor" id="subtask_3_1"></a>
### Analyse word frequencies (3 points)

You've already recorded the statistics of letters and letter pairs, now you can repurpose those functions to answer the following questions:

1. How many word tokens are there in the text? (0.6 points)
2. How many word types are there in the text? (0.6 points)
3. What are the 10 most frequent word tokens? Report them as a list starting with the most frequent one. (0.6 points)
4. What is the fraction of word types (out of all word types) that appeared in the text only 2 times or less? (0.6 points)
5. What is the fraction of word types (out of all word types) that appeared in the text 50 times or more? (0.6 points)

Type your answers in the cell below. You can create an additional cell to do the calculations if needed.
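The kind of computation asked for in questions 4 and 5 can be sketched on a toy token list. The tokens and the threshold below are made up for illustration.

```python
from collections import Counter

toy_tokens = "a a a b b c d".split()
toy_counts = Counter(toy_tokens)

n_toy_types = len(toy_counts)                            # distinct tokens
rare_types = sum(1 for n in toy_counts.values() if n <= 2)
frac_rare = rare_types / n_toy_types                     # fraction of types seen at most twice

print(toy_counts.most_common(2))
print(frac_rare)
```

Here three of the four types ('b', 'c' and 'd') occur at most twice, so the fraction is 0.75. Note that the denominator is the number of types, not the number of tokens.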

In [26]:
# YOUR CODE HERE
# raise NotImplementedError()

from collections import Counter

tokens = tokenize_and_lowercase(bug_file_path)

n_word_tokens = len(tokens)
print("Number of Word Tokens:", n_word_tokens)

word_types = set(tokens)
n_word_types = len(word_types)
print("Number of Word Types:", n_word_types)

counter = Counter(tokens)
top_ten_words = [word for word, count in counter.most_common(10)]
print("10 Most Frequent Word Tokens:", top_ten_words)

freq_2_or_less = sum(1 for word in word_types if counter[word] <= 2)
frac_2_or_less = freq_2_or_less / n_word_types
print("Fraction of Word Types Appearing 2 Times or Less:", frac_2_or_less)

freq_50_or_more = sum(1 for word in word_types if counter[word] >= 50)
frac_50_or_more = freq_50_or_more / n_word_types
print("Fraction of Word Types Appearing 50 Times or More:", frac_50_or_more)


# put your answer to question 1 as an int into the variable below
# For example:
# n_word_tokens = 1000
n_word_tokens = 16172 ##FILL IN THE ANSWER

# put your answer to question 2 as an int into the variable below
# For example:
# n_word_types = 1000
n_word_types = 2977 ##FILL IN THE ANSWER

# put your answer to question 3 as list of strings into the variable below
# For example:
# top_ten_words = ['hello', 'world']
top_ten_words = [',', 'the', '.', 'of', 'and', 'to', 'a', 'i', 'in', 'it'] ##FILL IN THE ANSWER

# put your answer to question 4 as a float into the variable below
# For example:
# frac_2_or_less = 0.12345
frac_2_or_less = 0.7467248908296943 ##FILL IN THE ANSWER

# put your answer to question 5 as a float into the variable below
# For example:
# frac_50_or_more = 0.12345
frac_50_or_more = 0.014779979845482029 ##FILL IN THE ANSWER

Number of Word Tokens: 16172
Number of Word Types: 2977
10 Most Frequent Word Tokens: [',', 'the', '.', 'of', 'and', 'to', 'a', 'i', 'in', 'it']
Fraction of Word Types Appearing 2 Times or Less: 0.7467248908296943
Fraction of Word Types Appearing 50 Times or More: 0.014779979845482029


In [27]:
# checks if your answer is an int
assert_equal(type(n_word_tokens),int)

In [28]:
# checks if your answer is an int
assert_equal(type(n_word_types),int)

In [29]:
# checks if your answer is of the right format
assert_equal(type(top_ten_words),list)
assert_equal(type(top_ten_words[0]),str)
assert_equal(len(top_ten_words),10)
# checks if you've got the most frequent token right
assert_equal(top_ten_words[0],',')


In [30]:
# checks if your answer is a float
assert_equal(type(frac_2_or_less),float)

In [31]:
# checks if your answer is a float
assert_equal(type(frac_50_or_more),float)

## 3.2 <a class="anchor" id="subtask_3_2"></a>
### Remove stop words (1 point)
As you can see, the most frequent word types are not specific to Poe's story, but are pretty much the same across the English language.

In information theory, the more likely an event is to occur, the less information it carries. If an event is not a surprise, it's simply "old news". For some natural language applications, this means that words like *to* and *the* don't tell us anything important about a text. They are not helpful in recognising its topic or its author, for instance.
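To make the link between frequency and information concrete, here is a minimal sketch that computes the self-information `-log2(p)` of each word type, where `p` is its relative frequency. The toy token list and the helper name `self_information` are illustrative assumptions and not part of the assignment.

```python
import math
from collections import Counter

# Toy token list (hypothetical data, not taken from the story)
toy_tokens = ["the", "bug", "the", "gold", "the", "of", "the", "of"]

counts = Counter(toy_tokens)
total = sum(counts.values())

def self_information(token):
    """Return -log2 of the relative frequency of `token`, in bits."""
    return -math.log2(counts[token] / total)

# Frequent types carry fewer bits than rare ones
for token, count in counts.most_common():
    print(f"{token!r}: count={count}, information={self_information(token):.2f} bits")
```

In this toy sample, *the* makes up half of the tokens and carries 1 bit, while *bug* appears once in eight tokens and carries 3 bits.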

Such frequent uninformative words are called **stop words**, and, in some cases, they can simply be cleaned out from data. There exist prepared lists of such words in English. Let's remove stop words using a list provided by the NLTK package.
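As a rough illustration of the idea, the sketch below filters a toy token list against a tiny hand-made stop list. Both lists are hypothetical, and the NLTK list is much longer. Converting the stop list to a set first is an optional design choice, not something the assignment requires: it keeps each membership test fast when the text is long.

```python
# Hypothetical miniature stop list and token list, for illustration only
toy_stop_words = ["the", "of", "a", "to"]
toy_tokens = ["the", "treasure", "of", "a", "pirate"]

# Convert once to a set so each membership test is O(1) on average
stop_set = set(toy_stop_words)

# Keep only the tokens that are not stop words, preserving their order
toy_clean = [token for token in toy_tokens if token not in stop_set]
print(toy_clean)
```

The filtered list keeps the original token order, so it can be passed on to the same counting functions as before.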

In [32]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords', quiet=True)

stop_words_english = stopwords.words('english')

def remove_stop_words(tokenized_text, stop_words):
    """This function removes stop words from lowercased tokenized text
    
    Parameters
    ----------
    tokenized_text : list of strings
        lowercased text tokens 
    stop_words : list of strings
        a list of words to remove
    
    Returns
    -------
    clean_text : list of strings
        list of text tokens with stop words removed
    """
    
    # YOUR CODE HERE
    # raise NotImplementedError()
    clean_text = [word for word in tokenized_text if word not in stop_words]
    
    return clean_text
    
clean_bug = remove_stop_words(tokenized_bug, stop_words_english)

In [33]:
# checks if the number of tokens is right
assert_equal(len(clean_bug), 9309)

# checks if the first token is right
assert_equal(clean_bug[0], 'gold-bug')


## Checklist before submission <a class="anchor" id="checklist"></a>
### 1
To make sure that you didn't forget to import some package or to name some variable, press **Kernel -> Restart** and then **Cell -> Run All**. This way your code will be run in exactly the same order as during the autograding.
### 2
(Click the **Validate** button in the upper menu to check that you haven't missed anything.) **NB** At the moment, the **Validate** button in the upper menu doesn't work. **Instead, you can validate the notebook in the Assignments tab.**
### 3
To submit the notebook, click on the **jupyterhub** logo in the upper left part of the window, choose the **Assignments** tab, and press **submit**. You can submit multiple times; only the last submission counts.
### 4
Please fill in the feedback form in the [Assignment](https://mycourses.aalto.fi/mod/questionnaire/view.php?id=689919) section of MyCourses.