Let's get a little bit practical with:

- getting text
- getting a tokenizer
- using the tokenizer


For this lesson, we're going to use `gutenbergpy` and `nltk`. If you try to import them right now, the way the course notes do, you're going to get an error.

::: codebox

```
---------------------------------------------------------
ModuleNotFoundError      Traceback (most recent call last)
Cell In[2], line 1
----> 1 import nltk

ModuleNotFoundError: No module named 'nltk'
```

:::

## Installing `gutenbergpy`

We'll need to install these packages. We'll start with `gutenbergpy`.

In [None]:
! pip install gutenbergpy

Now we can import the functions to get Project Gutenberg books. The URL for Moby Dick on Project Gutenberg is [https://www.gutenberg.org/ebooks/2701](https://www.gutenberg.org/ebooks/2701). The last part of the URL is the ID of the book, which we can pass to `get_text_by_id()` to download the book.

In [1]:
from gutenbergpy.textget import get_text_by_id, strip_headers

book_id = 2701

raw_book = get_text_by_id(book_id)
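
If you only have the URL handy, the ID can be pulled off the end of it. Here is a minimal sketch using only built-in string methods. The helper name `id_from_url` is made up for illustration and is not part of `gutenbergpy`, and it assumes the URL ends in the numeric ID.

```python
# Hypothetical helper (not part of gutenbergpy):
# treat the last path segment of the URL as the book ID.
def id_from_url(book_url):
    # Drop any trailing slash, then keep whatever follows the final "/"
    last_part = book_url.rstrip("/").rsplit("/", 1)[-1]
    return int(last_part)

moby_url = "https://www.gutenberg.org/ebooks/2701"
print(id_from_url(moby_url))
```

The result is an integer that could be passed on as `book_id`.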

`raw_book` contains the book with all of its legal headers and footers. We can remove the headers and footers with `strip_headers()`.

In [2]:
book_byte = strip_headers(raw_book)

One last hitch here has to do with "character encoding". What we have so far is raw bytes rather than a Python string, so we need to "decode" it into text.

In [3]:
book_clean = book_byte.decode("utf-8")
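
To see what decoding does without downloading anything, here is a small sketch. The `sample_bytes` value is made up to stand in for the kind of bytes object we get back for the book; it is not real output from `gutenbergpy`.

```python
# A made-up bytes value standing in for the book's raw bytes
sample_bytes = "methodically knocking people’s hats off—then".encode("utf-8")

# Before decoding we have bytes, afterwards an ordinary string
print(type(sample_bytes))
sample_text = sample_bytes.decode("utf-8")
print(type(sample_text))

# Characters like ’ and — take more than one byte in UTF-8,
# so the bytes object is longer than the decoded string.
print(len(sample_bytes) > len(sample_text))
```

String tools like regular expressions with string patterns and tokenizers expect the decoded `str` version.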

Let's wrap that up into one function we can re-run on new IDs.

In [4]:
def get_clean_book(book_id):
    """Get the cleaned book

    Args:
        book_id (str|int): The book id

    Returns:
        (str): The full book
    """
    raw_book = get_text_by_id(book_id)
    book_byte = strip_headers(raw_book)
    book_clean = book_byte.decode("utf-8")

    return book_clean

Go ahead and point `get_clean_book()` at another book id.

## NLTK tokenization

Let's tokenize one of our books with `nltk.tokenize.word_tokenize()`.

### Steps

1. Install `nltk`.
2. Try tokenizing your book.

It might not go right at first. You can double check what to do here in [the course notes](https://lin511-2024.github.io/notes/meetings/03_tokenization.html#tokenizers--part-1-).

## Let's try `spacy`

To work with `spacy`, we need to:

1. Install `spacy`
2. Install one of the spacy models.

### The steps

1. Go to the [spacy website](https://spacy.io/)
2. Can you find the code to successfully install it and its language model?

In [5]:
## Installation

Let's tokenize a book.

In [6]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [7]:
import re
first_para = re.findall(
    r"Call me Ishmael.*?\n\n", 
    book_clean, 
    re.DOTALL)[0]
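
The pattern in the cell above leans on two details: `re.DOTALL` lets `.` match newlines, and `*?` is non-greedy, so the match stops at the first blank line. Here is a small sketch of the difference on a made-up `toy_text` (not the real book), so you can see why the `?` matters.

```python
import re

# A small made-up text with paragraphs separated by blank lines
toy_text = (
    "Intro line.\n\n"
    "Call me Ishmael. Some years\nago I went to sea.\n\n"
    "Next paragraph.\n\n"
)

# Non-greedy: stops at the first blank line after the start of the match
lazy = re.findall(r"Call me Ishmael.*?\n\n", toy_text, re.DOTALL)[0]

# Greedy: runs on to the last blank line in the text
greedy = re.findall(r"Call me Ishmael.*\n\n", toy_text, re.DOTALL)[0]

print(repr(lazy))
print(repr(greedy))
```

On the full book, the greedy version would swallow nearly everything after "Call me Ishmael", which is why the non-greedy form is used to grab just the first paragraph.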

In [8]:
para_doc = nlp(first_para)

The output of `nlp` is actually a complex object enriched with a lot of information that we can access in a few different ways.

In [9]:
para_doc

Call me Ishmael. Some years ago—never mind how long precisely—having
little or no money in my purse, and nothing particular to interest me
on shore, I thought I would sail about a little and see the watery part
of the world. It is a way I have of driving off the spleen and
regulating the circulation. Whenever I find myself growing grim about
the mouth; whenever it is a damp, drizzly November in my soul; whenever
I find myself involuntarily pausing before coffin warehouses, and
bringing up the rear of every funeral I meet; and especially whenever
my hypos get such an upper hand of me, that it requires a strong moral
principle to prevent me from deliberately stepping into the street, and
methodically knocking people’s hats off—then, I account it high time to
get to sea as soon as I can. This is my substitute for pistol and ball.
With a philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in this. If they
but knew it, almost al

To get any particular token out, you can do ordinary indexing.

In [10]:
para_doc[2]

Ishmael

To get the actual *text* of a token, we need to get its `.text` attribute.

In [11]:
para_doc[2].text

'Ishmael'

There's lots of great stuff we can get out, like each sentence.

In [12]:
list(para_doc.sents)[0]

Call me Ishmael.

Or the parts of speech of each token.

In [13]:
first_sent = list(para_doc.sents)[0]
[x.pos_ for x in first_sent]

['VERB', 'PRON', 'PROPN', 'PUNCT']

Or the morphological features of each token.

In [14]:
[x.morph for x in first_sent]

[VerbForm=Inf,
 Case=Acc|Number=Sing|Person=1|PronType=Prs,
 Number=Sing,
 PunctType=Peri]

In [15]:
first_sent[1].morph.to_dict()

{'Case': 'Acc', 'Number': 'Sing', 'Person': '1', 'PronType': 'Prs'}
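
The printed form of `.morph` above is a `|`-separated list of `Feature=Value` pairs. As a rough illustration of how that string lines up with the dictionary, here is a standard-library sketch. It is not how spaCy implements `to_dict()`, and `morph_string_to_dict` is a made-up name.

```python
# Hypothetical helper, only to illustrate the "Feature=Value|Feature=Value" layout
def morph_string_to_dict(morph_string):
    # An empty string means no features at all
    if not morph_string:
        return {}
    # Split into pairs on "|", then split each pair on the first "="
    return dict(pair.split("=", 1) for pair in morph_string.split("|"))

print(morph_string_to_dict("Case=Acc|Number=Sing|Person=1|PronType=Prs"))
```

Note that every value comes out as a string, including `'1'` for `Person`, just like in the spaCy output above.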