# Basic recipes for working with n-grams

## Getting word completions with `.startswith`

Given a list of words, it is very easy to find possible completions using the `.startswith` method.
A condition like `some_string.startswith(other_string)` evaluates to `True` if and only if `some_string` starts with `other_string`.

In [None]:
if "hello".startswith("hell"):
    print("You can't spell hello without hell.")

The set of all possible completions is best built with a set comprehension (see the Python recap for details on comprehensions).

In [None]:
from nltk.corpus import brown


def completions(word, corpus):
    """Compute set of completions for word according to corpus."""
    return {comp for comp in corpus
            if comp.startswith(word)}


completions("test", brown.words())
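The function can also be tried without downloading the Brown corpus. The sketch below repeats the definition and runs it on a small hand-made word list; `toy_corpus` is invented for illustration and is not part of the original recipe.

```python
def completions(word, corpus):
    """Compute set of completions for word according to corpus."""
    return {comp for comp in corpus
            if comp.startswith(word)}


# toy_corpus is a made-up word list, not taken from any real corpus
toy_corpus = ["test", "tested", "testing", "text", "contest", "test", "Test"]

result = completions("test", toy_corpus)
# sets are unordered, so we sort before printing to get a stable display
print(sorted(result))
```

Note that the duplicate `"test"` shows up only once because the result is a set, and that `"Test"` is not included because `.startswith` is case-sensitive.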

## Converting counts to frequencies

For many applications we do not want absolute counts but rather frequencies.
The frequency of a word is obtained by dividing its count in a corpus by the total number of words in the same corpus.

In [None]:
from nltk.corpus import brown
from collections import Counter
from pprint import pprint

# compute counts for all words in the Brown corpus
counts = Counter(brown.words())
# let's look at the 10 most common words and their counts
pprint(counts.most_common(10))

# now we convert to frequencies;
# the total is the sum of all the values (i.e. counts) in the Counter
total = sum(counts.values())
# we replace each word count by its frequency
for word in counts:
    counts[word] /= total  # augmented assignment divides each count by total

# and here are the results after converting to frequencies
pprint(counts.most_common(10))
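The loop above overwrites the counts, so the original numbers are gone afterwards. If both are needed, one possible variant is to build a separate dictionary with a dictionary comprehension. This is only a sketch on a made-up token list; the names `tokens` and `freqs` are assumptions for illustration.

```python
from collections import Counter

# tokens is a made-up tokenized text used only for illustration
tokens = ["the", "cat", "saw", "the", "dog", "and", "the", "bird"]

counts = Counter(tokens)
total = sum(counts.values())  # 8 tokens in total

# build a new dictionary instead of overwriting the counts
freqs = {word: count / total for word, count in counts.items()}

print(counts["the"])         # the absolute count is still available
print(freqs["the"])          # 3 divided by 8
print(sum(freqs.values()))   # frequencies add up to 1 (up to rounding)
```

A useful sanity check after any such conversion is that the frequencies sum to 1.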

## Sorting with `key`

When presenting word completions, we do not want them in alphabetical order, but rather in decreasing order of frequency.
This is a bit tricky.
The `sorted` function by default orders lists in ascending order:

- Lists of numbers are sorted from smallest to largest.

In [None]:
sorted([10, 4, 3, 8, 7, 5])

- Lists of strings are ordered alphabetically, but with all uppercase letters before all lowercase letters.

In [None]:
sorted(["banana", "John", "Bill", "J-pop", "apple"])

The option `reverse` can be used to flip the order from ascending to descending.

In [None]:
sorted([10, 4, 3, 8, 7, 5], reverse=True)

In [None]:
sorted(["banana", "John", "Bill", "J-pop", "apple"], reverse=True)

When a completely different ordering pattern is desired, it has to be specified by a function that is passed in through the `key` parameter.
Here is a silly example:

In [None]:
def new_sort(word):
    """Deprioritize banana and J-pop."""
    if word == "banana":
        return 1000
    elif word == "J-pop":
        return 2000
    else:
        return 250
    
sorted(["banana", "John", "Bill", "J-pop", "apple"], key=new_sort)

The function `new_sort` maps `"banana"` to `1000`, `"J-pop"` to `2000`, and everything else to `250`.
When we tell `sorted` to use `new_sort` as the key for sorting, it no longer tries to establish an alphabetical order.
Instead, it looks at the numbers returned by `new_sort` and orders the words as if they had been replaced by these numbers.
Intuitively, `["banana", "John", "Bill", "J-pop", "apple"]` becomes `[1000, 250, 250, 2000, 250]`, which is sorted as `[250, 250, 250, 1000, 2000]` and then mapped back to `["John", "Bill", "apple", "banana", "J-pop"]`.
For elements that are mapped to the same number, the original order is preserved.

In [None]:
# we've switched John and apple, and they're also switched after sorting
sorted(["banana", "apple", "Bill", "J-pop", "John"], key=new_sort)
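A key function does not have to be hand-written, and it does not have to return numbers. Any function that takes one item and returns something comparable will do. As a small illustration (not part of the original recipe), passing the built-in `str.lower` makes `sorted` compare lowercased copies of the words, which removes the uppercase-before-lowercase effect seen earlier.

```python
words = ["banana", "John", "Bill", "J-pop", "apple"]

# compare lowercased copies of the words; the words themselves are unchanged
case_insensitive = sorted(words, key=str.lower)
print(case_insensitive)
```

The output still contains `"Bill"`, `"J-pop"` and `"John"` with their capital letters, because the key is only used for comparison.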

So if we want to order completions by frequency, we can write a function that maps each word to its frequency using our frequency counter.
This function is then used as a key for sorting.
Fortunately, such a function already exists.
Instead of something like `some_counter[some_item]`, one can also use `some_counter.get(some_item)`.

In [None]:
from nltk.corpus import brown
from collections import Counter

counts = Counter(brown.words())
print(counts["test"])
print(counts.get("test"))

So we can use the `get` method as a key.

In [None]:
# a toy example
from nltk.corpus import brown
from collections import Counter

counts = Counter(brown.words())
test_list = ["test", "wicked", "John", "Sue", "polar", "the", "of"]
sorted(test_list, key=counts.get)  # sort in increasing order of frequency
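One caveat applies when the list may contain words that never occur in the corpus. For such a word, `counts[word]` returns `0`, but `counts.get(word)` returns `None`, and Python 3 cannot compare `None` with a number while sorting. The sketch below uses a made-up counter (`toy_counts`) and a small `lambda` that supplies `0` as the default; both are assumptions for illustration.

```python
from collections import Counter

# toy_counts is a made-up counter used only for illustration
toy_counts = Counter(["the", "the", "the", "of", "of", "test"])

print(toy_counts["unicorn"])      # missing keys count as zero
print(toy_counts.get("unicorn"))  # get has no default here, so this is None

words = ["the", "unicorn", "test", "of"]
# supplying 0 as the default keeps the sort from comparing None with ints
ranked = sorted(words, key=lambda w: toy_counts.get(w, 0))
print(ranked)
```

When every word in the list is known to be in the counter, plain `counts.get` is sufficient.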

And here's a full example for sorting word completions:

In [None]:
from nltk.corpus import brown
from collections import Counter


def completions(word, counter):
    """Compute list of completions for word according to counter"""
    return [comp for comp in counter
            if comp.startswith(word)]


def sorted_completions(word, counter):
    """Sort completions in descending order of frequency"""
    return sorted(completions(word, counter), key=counter.get, reverse=True)


counts = Counter(brown.words())
total = sum(counts.values())
for c in counts:
    counts[c] /= total

sorted_completions("test", counts)

## Converting a text to bigrams

The lecture notes provide the following recipe for obtaining a list of bigrams from a tokenized text.

In [None]:
def bigrams(text):
    """Convert a text (list of strings) to bigrams."""
    return [text[n:n+2] for n in range(len(text) - 1)]

bigrams(["A", "simple", "example", "for", "our", "bigram", "conversion"])

In order to understand how this works, you need to know the following:

1. slices (covered in Python recap), and
1. `len` (covered in Python recap), and
1. list comprehensions (covered in Python recap), and
1. `range`.

The range function is very intuitive: it can be used in a `for`-loop to iterate over numbers.

In [None]:
for n in range(4):
    print(n)

So `range(n)` allows us to iterate over `0`, `1`, `2`, ..., `n-1`.
Careful, though, `range(n)` does **not** include `n`.
For instance, `range(4)` above only goes from `0` to `3`; it excludes `4`.

What, then, does the code above do?
It constructs a list that consists of slices of `text`.
Each slice spans from position `n` to `n+2`.
The lowest value of `n` is 0, the beginning of the text.
The largest value is a bit trickier:

The largest value must be chosen such that `[n:n+2]` is the slice of the last two words in the text.
To achieve this, `n` must be the index of the last but one word.
Because list indices start at 0, the last word has index `len(text) - 1`, so the last but one word has index `len(text) - 2`.
But remember that the highest value produced by `range(len(text))` is `len(text) - 1`.
That's one too high, so we have to subtract one, giving us `range(len(text) - 1)`, whose highest value is `len(text) - 2`.
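To make the position math concrete, the following sketch prints every value of `n` together with the slice it produces. The four-word `text` and the variable `last_start` are made up for illustration.

```python
text = ["A", "simple", "example", "for"]

# len(text) is 4, so range(len(text) - 1) yields 0, 1, 2
for n in range(len(text) - 1):
    # each slice starts at n and stops before n + 2
    print(n, text[n:n+2])

# index of the last but one word, where the final bigram starts
last_start = len(text) - 2
print(last_start, text[last_start:last_start+2])
```

The last line printed by the loop and the line printed after it show the same bigram, which confirms that the range ends at exactly the right index.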

Observe how the code misbehaves if we change the range:

In [None]:
def bigrams(text):
    """Convert a text (list of strings) to bigrams."""
    return [text[n:n+2] for n in range(len(text))]  # range too big

bigrams(["A", "simple", "example", "for", "our", "bigram", "conversion"])

In [None]:
def bigrams(text):
    """Convert a text (list of strings) to bigrams."""
    return [text[n:n+2] for n in range(len(text) - 2)]  # range too small

bigrams(["A", "simple", "example", "for", "our", "bigram", "conversion"])

Make sure you understand the position math in these examples.
In particular:

- Why does the slice go from `n` to `n+2`, rather than `n+1` or `n+3`?
- Why do we subtract 1 from the length before computing the range?

If you are unsure, revisit the material on list indices.
From here on out, it is crucial that you can easily do index calculations of this kind.
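
The same index reasoning carries over to longer n-grams. The following is a minimal sketch, assuming a hypothetical helper `ngrams` that is not part of the lecture notes: a slice of length `n` starting at position `i` spans `i` to `i+n`, and the last such slice starts at `len(text) - n`.

```python
def ngrams(text, n):
    """Convert a text (list of strings) to a list of n-grams."""
    # the last n-gram starts at index len(text) - n,
    # so the range has to stop at len(text) - n + 1
    return [text[i:i+n] for i in range(len(text) - n + 1)]


example = ["A", "simple", "example", "for", "our", "bigram", "conversion"]
print(ngrams(example, 2))  # same result as the bigrams function
print(ngrams(example, 3))  # trigrams
```

With `n` set to 2, the range becomes `range(len(text) - 1)`, which is exactly the range used in the bigram recipe.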