# Implementing prefix trees

Alright, we are finally in a position to implement prefix trees, and we will contrast them with an approach based on tuples.

## Storing n-gram frequencies in a Counter

With our modified definition of the `bigrams` function, a tokenized text is converted to a list of tuples, where each tuple is a bigram of the text.
Since tuples of strings are immutable and hence hashable, they can be used as dictionary keys.
This means we can feed the list of bigrams into a `Counter` without getting complaints about unhashable keys.
We then convert the counts to frequencies, giving us a database of bigram frequencies.

In [0]:
from nltk.corpus import brown
from collections import Counter


def bigrams(text):
    """Convert a text (list of strings) to bigram-tuples."""
    return [tuple(text[n:n+2]) for n in range(len(text) - 1)]


# compute counts
bigram_counts = Counter(bigrams(brown.words()))

# convert to frequencies
total = sum(bigram_counts.values())
for word in bigram_counts:
    bigram_counts[word] /= total

We can now easily determine the most common bigrams.

In [0]:
bigram_counts.most_common(10)

As you can see, the Brown corpus includes punctuation and does not normalize capitalization.
Depending on our specific domain of application, we might want to normalize some of this, but we will ignore these issues here.

We can also use our function for bigram predictions together with `bigram_counts`.
We just have to adjust for the fact that the context is now checked against a slice of a tuple, rather than a list.
Hence we first convert the context to a tuple.

In [0]:
def bigram_completions(word, context, counts):
    """Suggest word completion based on bigram frequencies."""
    # list of all compatible bigrams
    comps = [comp for comp in counts
             if comp[:-1] == tuple(context) and
                comp[-1].startswith(word)]
    # sort the bigram completions
    ordered_ngrams = sorted(comps,
                            key=counts.get,
                            reverse=True)
    # only keep last word of each bigram
    return [ngram[-1] for ngram in ordered_ngrams]


bigram_completions("test", ["a"], bigram_counts)
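
To try the same pipeline without downloading the Brown corpus, here is a minimal sketch on a hand-made token list. The names `toy_text`, `toy_counts`, and `toy_freqs` are made up for illustration, and the two functions are simply repeated from above so that the cell runs on its own.

```python
from collections import Counter


def bigrams(text):
    """Convert a text (list of strings) to bigram-tuples."""
    return [tuple(text[n:n+2]) for n in range(len(text) - 1)]


def bigram_completions(word, context, counts):
    """Suggest word completion based on bigram frequencies."""
    comps = [comp for comp in counts
             if comp[:-1] == tuple(context) and
                comp[-1].startswith(word)]
    ordered_ngrams = sorted(comps, key=counts.get, reverse=True)
    return [ngram[-1] for ngram in ordered_ngrams]


# toy corpus (an assumption for illustration, not part of Brown)
toy_text = ("a test is a test and a team is a team "
            "but a test is best").split()

# count bigrams, then normalize to frequencies
toy_counts = Counter(bigrams(toy_text))
total = sum(toy_counts.values())
toy_freqs = {bigram: count / total
             for bigram, count in toy_counts.items()}

# ("a", "test") occurs 3 times and ("a", "team") twice,
# so "test" is ranked before "team"
completions = bigram_completions("te", ["a"], toy_freqs)
print(completions)
```

Because the function only needs `counts.get` and iteration over keys, a plain dictionary of frequencies works just as well as a Counter here.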

We can also look at the length of the counter to determine how many bigrams were generated from the Brown corpus.

In [0]:
len(bigram_counts)

Well, that's quite a lot, but it's much less than one might expect.
Remember, just 10,000 distinct words already allow for 10,000 × 10,000 = 100 million distinct bigrams.
This shows quite nicely that English puts tight restrictions on the order of words.
It's anything but "anything goes".
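
The arithmetic behind this comparison is easy to check: a vocabulary of `V` distinct words allows `V ** 2` ordered pairs. The sketch below computes that number and, for a small made-up token list, the share of possible bigrams that are actually attested. The helper names `possible_bigrams` and `bigram_coverage` are hypothetical and only serve this illustration.

```python
def possible_bigrams(vocab_size):
    """Number of ordered word pairs over a vocabulary of this size."""
    return vocab_size ** 2


def bigram_coverage(tokens):
    """Share of possible bigrams that actually occur in tokens."""
    vocab = set(tokens)
    attested = {tuple(tokens[n:n+2]) for n in range(len(tokens) - 1)}
    return len(attested) / possible_bigrams(len(vocab))


# 10,000 distinct words allow 100,000,000 distinct bigrams
print(f"{possible_bigrams(10_000):,}")

# toy example: 7 distinct words, 9 distinct bigrams out of 49 possible
toy_tokens = "the old man saw the old woman and the ugly man".split()
print(bigram_coverage(toy_tokens))
```

The same ratio could be computed for the Brown corpus by comparing `len(bigram_counts)` with the square of the number of distinct tokens.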

At any rate, let's now look at how the same counter can be represented as a prefix tree instead.

## Prefix trees as nested dictionaries

The idea behind a prefix tree is to share information across items.
Suppose we have three trigrams `("the", "old", "man")`, `("the", "old", "woman")`, and `("the", "ugly", "man")`, with the respective frequencies `0.1`, `0.2`, and `0.3`.
With a prefix tree, we can conflate them into the representation below:

```
the
 |
 |--> old
 |     |
 |     |--> man: 0.1
 |     |
 |     |--> woman: 0.2
 |
 |--> ugly
       |
       |--> man: 0.3
```

We can replicate this structure with nested dictionaries.

In [0]:
trigram_tree = {"the": {"old": {"man": {"_freq": 0.1},
                                "woman": {"_freq": 0.2}},
                        "ugly": {"man": {"_freq": 0.3}}}}

print("Whole tree:\n", trigram_tree)
print()
print("Trigrams starting with \"the\":\n",
      trigram_tree.get("the"))
print()
print("Trigrams starting with \"the old\":\n",
      trigram_tree.get("the").get("old"))
print()
print("Subtree for \"the old woman\":\n",
      trigram_tree.get("the").get("old").get("woman"))
print()
print("Frequency for \"the old woman\":\n",
      trigram_tree.get("the").get("old").get("woman").get("_freq"))

Note that the key for frequency starts with an underscore.
This is a trick to make sure that it won't be confused with some word `freq` that might occur in the corpus.

Of course, we do not want to build a prefix tree like this by hand, as that would mean a lot of tedious and error-prone work.
Instead, we will build it from an existing counter of n-gram frequencies.

In [0]:
def ngramcounter_to_prefixtree(counter):
    """Convert counter with n-gram frequencies to prefix tree."""
    # initialize prefix tree as empty dictionary
    tree = {}
    # iterate over key-value pairs in counter
    for ngram, freq in counter.items(): 
        # start at root of prefix tree
        current_subtree = tree
        # then iterate over all words in ngram
        for word in ngram:
            if current_subtree.get(word):
                current_subtree = current_subtree[word]
            else:
                current_subtree[word] = {}
                current_subtree = current_subtree[word]
        # at the end, add frequency to current_subtree
        current_subtree["_freq"] = freq
    return tree


# make sure you've run the first cell in this notebook,
# otherwise bigram_counts won't be defined
bigram_tree = ngramcounter_to_prefixtree(bigram_counts)

print(bigram_tree.get("a").get("test").get("_freq"))

The code above might be a little confusing to you.
It builds the tree in a top-down fashion, constantly resetting the value for `current_subtree` to a lower node while it moves along the n-gram.
If the node does not exist, it creates a new subtree for it.
The code below does the same thing, but with additional `print`-statements so you can see what is going on.
We also run it over a toy example, because with the Brown bigram counts the output would take forever to print.

In [0]:
def ngramcounter_to_prefixtree_with_print(counter):
    """Convert counter with n-gram frequencies to prefix tree."""
    # initialize prefix tree as empty dictionary
    tree = {}
    print("Starting up... tree is set to", tree)
    # iterate over key-value pairs in counter
    for ngram, freq in counter.items(): 
        # start at root of prefix tree
        print("\n==========================\n")
        print(f"Working on \"{ngram}\" with frequency {freq}")
        current_subtree = tree
        print("Setting current subtree to tree:\n", current_subtree)
        # then iterate over all words in ngram
        for word in ngram:
            print("---")
            print(f"Working on \"{word}\" with current subtree\n", current_subtree)
            if current_subtree.get(word):
                current_subtree = current_subtree[word]
                print(f"Found subtree for \"{word}\"")
            else:
                print(f"No subtree found for \"{word}\"")
                print(f"Adding empty subtree under \"{word}\" in current subtree")
                current_subtree[word] = {}
                print("Subtree has been expanded to\n", current_subtree)
                current_subtree = current_subtree[word]
            print("Reference of current_subtree variable is now\n", current_subtree)
            print(f"under \"{word}\" in prefix tree:\n", tree)
        # at the end, add frequency to current_subtree
        print("Reached end of n-gram, adding frequency")
        current_subtree["_freq"] = freq
        print(f"Prefix tree updated with frequency for \"{ngram}\"")
        print(tree)
    return tree
        

example_counts = {("the", "old", "man"): 0.1,
                  ("the", "old", "woman"): 0.2,
                  ("the", "ugly", "man"): 0.3}
ngramcounter_to_prefixtree_with_print(example_counts)
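
Once the step-by-step logic is clear, the `if`/`else` branch can also be folded into a single call to `dict.setdefault`, which returns the existing subtree for a key or inserts and returns a fresh one. The following is a sketch of that alternative, not a replacement for the function above; the name `ngrams_to_tree` is made up for this example.

```python
def ngrams_to_tree(counter):
    """Convert n-gram frequencies to a prefix tree via setdefault."""
    tree = {}
    for ngram, freq in counter.items():
        # start at the root for every n-gram
        node = tree
        for word in ngram:
            # fetch the subtree for word, creating it if it is missing
            node = node.setdefault(word, {})
        # store the frequency at the node for the full n-gram
        node["_freq"] = freq
    return tree


example_counts = {("the", "old", "man"): 0.1,
                  ("the", "old", "woman"): 0.2,
                  ("the", "ugly", "man"): 0.3}
compact_tree = ngrams_to_tree(example_counts)
print(compact_tree)
```

The result should match the hand-built `trigram_tree` from earlier in this section.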

Alright, we have a function that converts an n-gram counter to an equivalent prefix tree.
Note that the prefix tree conversion works for any counter, as long as its keys can be iterated over.
For instance, we could just as well use the function above to convert a Counter with word counts to a prefix tree.
In this case, each character of the word would be the root of another subtree.
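
To see the character-level case on a small scale first, here is a sketch with a tiny made-up word list. The function `counter_to_tree` is a compact restatement of the conversion from above under a hypothetical name, repeated so that the cell runs on its own.

```python
from collections import Counter


def counter_to_tree(counter):
    """Convert a counter with iterable keys to a prefix tree."""
    tree = {}
    for item, freq in counter.items():
        current_subtree = tree
        # a string is iterated over character by character
        for part in item:
            if part not in current_subtree:
                current_subtree[part] = {}
            current_subtree = current_subtree[part]
        current_subtree["_freq"] = freq
    return tree


# toy word list (an assumption for illustration)
toy_words = ["word", "words", "wordsmith", "work", "word"]
toy_tree = counter_to_tree(Counter(toy_words))

# the node for "word" stores its own count
# and the subtree for all longer words that start with "word"
d_node = toy_tree["w"]["o"]["r"]["d"]
print(d_node)
```

A node that carries a `_freq` entry marks a complete word, while a node without one is only a prefix of longer words.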

In [0]:
from nltk.corpus import words
from pprint import pprint

wordlist = words.words()
counts = Counter(wordlist)
tree = ngramcounter_to_prefixtree(counts)

pprint(tree.get("w").get("o").get("r").get("d").get("s"))

The output above tells us that the word list contains at least the following:

- wordsman (count: 1)
- wordsmanship (count: 1)
- wordsmith (count: 1)
- wordspite (count: 1)
- wordster (count: 1)

Interestingly, *words* is apparently not in the list.
We can check independently that this is indeed the case, which suggests that NLTK's word list does not store plurals separately.

In [0]:
from nltk.corpus import words

"words" in words.words()

Let's compare that to the output if we use the Brown corpus as a basis instead.

In [0]:
from nltk.corpus import brown
from pprint import pprint

wordlist = brown.words()
counts = Counter(wordlist)
tree = ngramcounter_to_prefixtree(counts)

pprint(tree.get("w").get("o").get("r").get("d").get("s"))

Apparently, the Brown corpus contains 269 instances of *words*, but not a single instance of the completions found in the word list.
Let's check that too.

In [0]:
from nltk.corpus import brown

wordlist = set(brown.words())
print("words" in wordlist)
for suffix in ["man", "manship", "mith", "pite", "ter"]:
    print("words" + suffix in wordlist)
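
Another way to test the conversion is a round trip: build the tree, flatten it back into a dictionary of tuples, and compare the result with the input. The sketch below does this for the toy trigrams. Both function names are hypothetical, and the recursive `tree_to_ngrams` is just one possible way to walk the nested dictionaries.

```python
def ngrams_to_tree(counter):
    """Convert n-gram frequencies to a prefix tree."""
    tree = {}
    for ngram, freq in counter.items():
        current_subtree = tree
        for word in ngram:
            if word not in current_subtree:
                current_subtree[word] = {}
            current_subtree = current_subtree[word]
        current_subtree["_freq"] = freq
    return tree


def tree_to_ngrams(tree, prefix=()):
    """Flatten a prefix tree back into a dict from tuples to frequencies."""
    flat = {}
    for key, value in tree.items():
        if key == "_freq":
            # the path walked so far is a complete n-gram
            flat[prefix] = value
        else:
            # descend into the subtree, extending the path by one word
            flat.update(tree_to_ngrams(value, prefix + (key,)))
    return flat


example_counts = {("the", "old", "man"): 0.1,
                  ("the", "old", "woman"): 0.2,
                  ("the", "ugly", "man"): 0.3}
round_trip = tree_to_ngrams(ngrams_to_tree(example_counts))
print(round_trip == example_counts)
```

Note that for a tree built from strings rather than tuples, the flattened keys would be tuples of characters, so they would need to be joined before comparing them with the original words.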

Alright, it looks like our prefix tree conversion does indeed work as desired.
We have successfully created a new data structure for ourselves, one that is implemented in terms of nested dictionaries.
The task is far from done, though.
We still need at least two helper functions that make the data structure easy to work with:

- an update mechanism for adding new elements to the tree
- a search function for easily retrieving items (the chains of `.get` calls aren't exactly nice)

But this is left as an exercise to the reader (i.e. you).

## Bullet-point summary

- Prefix trees can be implemented as nested dictionaries.
  There are more efficient options, but those are a lot more difficult to handle.
  If you want a production-level implementation, check out the `pygtrie` package.
  
- The conversion builds the prefix tree in a top-down fashion, adding new subtrees (i.e. dictionaries) as necessary.
  Make sure you understand 100% of the code for this.