Here, we'll work through some of the practicalities of creating and counting ngrams from text.

## First, unigrams


In [1]:
from gutenbergpy.textget import get_text_by_id, strip_headers


def get_clean_book(book_id):
    """Get the cleaned book

    Args:
        book_id (str|int): The book id

    Returns:
        (str): The full book
    """
    # Download the raw book (as bytes) from Project Gutenberg
    raw_book = get_text_by_id(book_id)
    # Remove the Project Gutenberg header and footer
    book_byte = strip_headers(raw_book)
    # Decode the bytes into a regular string
    book_clean = book_byte.decode("utf-8")

    return book_clean

In [2]:
moby_dick = get_clean_book(2701)

Next, we tokenize the text, splitting it into individual words and punctuation marks.

In [3]:
from nltk.tokenize import word_tokenize

moby_words = word_tokenize(moby_dick)

Next, we count how often each token occurs.

In [6]:
from collections import Counter

moby_count = Counter(moby_words)
moby_count.most_common(10)

[(',', 19211),
 ('the', 13717),
 ('.', 7164),
 ('of', 6563),
 ('and', 6009),
 ('to', 4515),
 ('a', 4507),
 (';', 4178),
 ('in', 3915),
 ('that', 2918)]
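
Punctuation marks appear among the most frequent items because the tokenizer emits them as tokens in their own right. If only word-like tokens are wanted, one option is to filter before counting. The sketch below uses a hand-written toy token list (`toy_tokens` is illustrative, not taken from the book), and the filtering rule is an assumption rather than something the rest of this walkthrough relies on.

```python
from collections import Counter

# Toy token list standing in for moby_words (illustrative only)
toy_tokens = ["Call", "me", "Ishmael", ".", "Call", "me", ",", "whale", "."]

# Keep only tokens that contain at least one letter or digit
word_like = [tok for tok in toy_tokens if any(ch.isalnum() for ch in tok)]

all_count = Counter(toy_tokens)    # counts including punctuation
word_count = Counter(word_like)    # counts with punctuation removed

print(all_count["."], word_count["."])
```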

The *"unigram"* probability of a word is its count divided by the summed counts of all $n$ distinct words:
$$
P(w_i) = \frac{C(w_i)}{\sum_{j=1}^n C(w_j)}
$$

We can get $C(w_i)$ from the counter dictionary.

In [7]:
moby_count["whale"]

771
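
One property worth knowing about these lookups: a `Counter` returns 0 for a key it has never seen instead of raising a `KeyError`, so asking for the count of an unattested word is safe. A minimal sketch with a toy token list (the tokens are made up for illustration):

```python
from collections import Counter

# Toy token list standing in for moby_words (illustrative only)
toy_words = ["the", "whale", "the", "sea"]
toy_count = Counter(toy_words)

seen = toy_count["the"]      # count of a word that occurs in the data
unseen = toy_count["ahab"]   # a missing key gives 0, not a KeyError

print(seen, unseen)
```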

The trickier thing to get is the denominator, $\sum_{j=1}^n C(w_j)$, the total count across all words. One way to compute it is with a for loop.

In [11]:
total_freq = 0
for word in moby_count:
    total_freq += moby_count[word]

total_freq

255958
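
The same total can also be computed without an explicit loop by summing the counter's values. A small sketch on a toy counter (the data is illustrative) showing that the two approaches agree:

```python
from collections import Counter

# Toy counter standing in for moby_count (illustrative only)
toy_count = Counter(["the", "whale", "the", "sea"])

# Explicit loop over the keys, adding up each count
loop_total = 0
for word in toy_count:
    loop_total += toy_count[word]

# Equivalent one-liner: sum the stored counts directly
sum_total = sum(toy_count.values())

print(loop_total, sum_total)
```

On Python 3.10 and later, `Counter` also has a `total()` method that returns this sum.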

In [12]:
moby_prob = {}
for word in moby_count:
    moby_prob[word] = moby_count[word] / total_freq

moby_prob["whale"]

0.0030122129411856635
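
As a sanity check, probabilities computed this way should sum to 1 over the whole vocabulary, up to floating point rounding. The sketch below builds the probability dictionary with a dictionary comprehension, which is equivalent to the loop, and checks the sum on toy data (the names prefixed `toy_` are illustrative):

```python
import math
from collections import Counter

# Toy counter standing in for moby_count (illustrative only)
toy_count = Counter(["the", "whale", "the", "sea"])
toy_total = sum(toy_count.values())

# Dictionary comprehension: each count divided by the total
toy_prob = {word: count / toy_total for word, count in toy_count.items()}

# The probabilities across all words should add up to 1
prob_sum = sum(toy_prob.values())

print(toy_prob["the"], math.isclose(prob_sum, 1.0))
```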

## Introducing numpy

`numpy` is a Python package that makes numeric computation easier.

In [13]:
## if you need to install it:
# ! pip install numpy

import numpy as np

In [14]:
sample_array = np.array([0, 1, 3])

In [19]:
sample_array.sum()

4

In [20]:
sample_array/sample_array.sum()

array([0.  , 0.25, 0.75])
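
The division above works because numpy applies arithmetic to every element of an array at once. A plain Python list does not support this, so each element has to be handled explicitly. A brief side-by-side sketch (`sample_list` is a name introduced here for comparison):

```python
import numpy as np

sample_list = [0, 1, 3]
sample_array = np.array(sample_list)

# With a list, each element is divided one at a time
list_props = [x / sum(sample_list) for x in sample_list]

# With an array, the division is applied to every element at once
array_props = sample_array / sample_array.sum()

print(list_props)
print(array_props)
```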

In [22]:
[
    sample_array.min(), 
    sample_array.max()
]

[0, 3]

Relating words, counts and probabilities

In [24]:
word_list = [w for w in moby_count]
count_array = np.array([moby_count[w] for w in word_list])

prob_array = count_array / count_array.sum()

:::{layout-ncol=2}

::::{.column}

$$
\frac{C(w_i)}{\sum_{j=1}^n C(w_j)}
$$

::::

::::{.column}

```python
count_array / count_array.sum()
```

::::

:::
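
Because the word list and the two arrays are built in the same order, position `i` in one lines up with position `i` in the others. The sketch below shows how that alignment might be used to look up one word's probability or to find the most probable word. It runs on toy data, and all `toy_` names are hypothetical stand-ins for `word_list`, `count_array` and `prob_array`.

```python
from collections import Counter

import numpy as np

# Toy counter standing in for moby_count (illustrative only)
toy_count = Counter(["the", "whale", "the", "sea", "the", "whale"])

# Build the word list and arrays in the same order
toy_words = [w for w in toy_count]
toy_counts = np.array([toy_count[w] for w in toy_words])
toy_probs = toy_counts / toy_counts.sum()

# Look up one word's probability through its position in the word list
whale_idx = toy_words.index("whale")
whale_prob = toy_probs[whale_idx]

# argmax returns the position of the largest probability
top_word = toy_words[toy_probs.argmax()]

print(whale_prob, top_word)
```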