# Cleaning Data for LLMs

Raw text gathered from a variety of sources is rarely ready for a large language model as-is. A series of steps prepares the data, from cleaning to vectorizing. We will focus on cleaning text data first, covering plain Python, NLTK, and spaCy.

## The Legend of Sleepy Hollow

For this example, we will work with the American short story *The Legend of Sleepy Hollow* by Washington Irving. A plain text version [can be found easily at Project Gutenberg](https://www.gutenberg.org/ebooks/41), but the cell below downloads a copy from a GitHub repository of demo data for convenience. Let's load the file contents as a string into the `text` variable.

In [2]:
import urllib.request

urllib.request.urlretrieve(
    r"https://raw.githubusercontent.com/thomasnield/machine-learning-demo-data/master/llm/legend_of_sleepy_hollow.txt", 
    "legend_of_sleepy_hollow.txt"
)

filename = 'legend_of_sleepy_hollow.txt'
# the context manager closes the file even if reading fails
with open(filename, encoding="utf-8") as file:
    text = file.read()
text

'\ufeffThe Project Gutenberg eBook of The Legend of Sleepy Hollow\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook.\n\nTitle: The Legend of Sleepy Hollow\n\n\nAuthor: Washington Irving\n\nRelease date: June 27, 2008 [eBook #41]\n                Most recently updated: June 27, 2022\n\nLanguage: English\n\n\n\n*** START OF THE PROJECT GUTENBERG EBOOK THE LEGEND OF SLEEPY HOLLOW ***\n\n\n\n\nThe Legend of Sleepy Hollow\n\nby Washington Irving\n\n\n\n\nFOUND AMONG THE PAPERS OF THE LATE DIEDRICH KNICKERBOCKER.\n\n\n        A pleasing land of drowsy head it was,\n          Of dreams that wave
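The output above begins with `\ufeff`, a byte-order mark (BOM) left over from the file's encoding. In this notebook it ends up being discarded along with the license header, but for other files it can linger as an invisible first character. Here is a minimal sketch of one way to avoid it, assuming a small throwaway demo file (the file and its content are made up for illustration): Python's `utf-8-sig` codec drops a leading BOM when reading.

```python
import os
import tempfile

# Write a small demo file that starts with a UTF-8 byte-order mark (BOM),
# mimicking the downloaded story. The file content here is made up.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbfThe Legend of Sleepy Hollow\n")
    demo_path = tmp.name

# Plain "utf-8" keeps the BOM as the first character of the string.
with open(demo_path, encoding="utf-8") as f:
    with_bom = f.read()

# "utf-8-sig" drops a leading BOM if one is present.
with open(demo_path, encoding="utf-8-sig") as f:
    without_bom = f.read()

os.remove(demo_path)

print(repr(with_bom[:4]))     # '\ufeffThe'
print(repr(without_bom[:4]))  # 'The '
```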

Here we can make some observations about our data. 

* Thankfully this is pretty clean text and we do not have to clean up any HTML, PDF markup, or other boilerplate here.
* There is some boilerplate for licensing and other metadata which we may want to remove.
* This book is in English and was not translated from another language.
* We do not anticipate spelling or grammar mistakes.
* There are some interesting hyphenations and historical spellings like "red-tipt" and "yellow-tipt."
* We also have frequent newline `\n` characters, which hard-wrap the text at roughly 70 characters per line.
* There do not seem to be enough numbers in the text that we need to handle them specially.
* There are names in this document, like Yost Van Houten.

There is a lot more going on here but this is simple enough to get us started. 
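One of the observations above is that the text is hard-wrapped with `\n` characters. The tokenizers used later treat newlines as ordinary whitespace, so this is not a problem for word splitting, but it can matter if you want whole paragraphs. The following is a rough sketch, assuming paragraphs are separated by blank lines; `unwrap_paragraphs` is a hypothetical helper name and the sample is an abridged, made-up stand-in for the story text. It would also flatten the indentation of the poem, so treat it as a starting point only.

```python
import re

def unwrap_paragraphs(raw: str) -> list[str]:
    """Split on blank lines, then join each paragraph's wrapped lines with spaces."""
    paragraphs = re.split(r"\n\s*\n", raw.strip())
    # str.split() with no arguments splits on any run of whitespace,
    # so joining with a single space removes the hard line breaks.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]

# Abridged sample in the same hard-wrapped style as the story (assumption).
sample = (
    "In the bosom of one of those spacious coves which indent the eastern\n"
    "shore of the Hudson, there lies a small market town.\n"
    "\n\n"
    "Not far from this village there is a little valley.\n"
)

for paragraph in unwrap_paragraphs(sample):
    print(paragraph)
```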

If we open the text file directly in a text editor, we will see there is license boilerplate before line 27 and after line 1159. Rather than rely on line numbers, it is easier to use the marker lines that end the first boilerplate section and start the second. We can use regular expression patterns for this.

In [5]:
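import re

# drop everything up to and including the "*** START OF ..." marker
text = re.sub(r"^(.|\n)+START OF THE PROJECT GUTENBERG EBOOK THE LEGEND OF SLEEPY HOLLOW \*{3}", '', text)
# drop everything from the "*** END OF ..." marker onward
text = re.sub(r"\*{3} END OF THE PROJECT GUTENBERG EBOOK THE LEGEND OF SLEEPY HOLLOW (.|\n)+", '', text)
# trim the leftover leading and trailing whitespace
text = text.strip()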

text

'The Legend of Sleepy Hollow\n\nby Washington Irving\n\n\n\n\nFOUND AMONG THE PAPERS OF THE LATE DIEDRICH KNICKERBOCKER.\n\n\n        A pleasing land of drowsy head it was,\n          Of dreams that wave before the half-shut eye;\n        And of gay castles in the clouds that pass,\n          Forever flushing round a summer sky.\n                                         CASTLE OF INDOLENCE.\n\n\nIn the bosom of one of those spacious coves which indent the eastern\nshore of the Hudson, at that broad expansion of the river denominated\nby the ancient Dutch navigators the Tappan Zee, and where they always\nprudently shortened sail and implored the protection of St. Nicholas\nwhen they crossed, there lies a small market town or rural port, which\nby some is called Greensburgh, but which is more generally and properly\nknown by the name of Tarry Town. This name was given, we are told, in\nformer days, by the good housewives of the adjacent country, from the\ninveterate propensity of their h
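The patterns above hard-code this story's title. As a sketch of how the same idea might generalize, the helper below searches for the marker lines without naming the book. It assumes other files use the same `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ...` wording seen in this one, which may not hold for every file; `strip_gutenberg_boilerplate` is a hypothetical name.

```python
import re

# Marker lines as they appear in this file, with the title left open (assumption:
# other files follow the same wording).
START_MARKER = re.compile(r"\*{3} START OF THE PROJECT GUTENBERG EBOOK .*? \*{3}")
END_MARKER = re.compile(r"\*{3} END OF THE PROJECT GUTENBERG EBOOK .*? \*{3}")

def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the START and END markers, if both are found."""
    start = START_MARKER.search(raw)
    end = END_MARKER.search(raw)
    if start is None or end is None:
        return raw.strip()  # markers missing: leave the text alone
    return raw[start.end():end.start()].strip()

# Made-up miniature document for illustration.
sample = (
    "License text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK DEMO ***\n\n"
    "Story body.\n\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK DEMO ***\n"
    "More license text"
)
print(strip_gutenberg_boilerplate(sample))  # Story body.
```

Slicing between two matches avoids the `(.|\n)+` alternation, which makes the regex engine step through the whole document one character at a time.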

## Manual Tokenization

Understandably, if we want to meaningfully prepare this data we will need to split up the words. We will learn how to do this from scratch in Python to understand the process a little bit before we bring in libraries to help us. 

Let's start by splitting the text on whitespace with `split()`. Notice that punctuation stays attached to the words, as in `'KNICKERBOCKER.'` and `'was,'`.

In [6]:
text.split()

['The',
 'Legend',
 'of',
 'Sleepy',
 'Hollow',
 'by',
 'Washington',
 'Irving',
 'FOUND',
 'AMONG',
 'THE',
 'PAPERS',
 'OF',
 'THE',
 'LATE',
 'DIEDRICH',
 'KNICKERBOCKER.',
 'A',
 'pleasing',
 'land',
 'of',
 'drowsy',
 'head',
 'it',
 'was,',
 'Of',
 'dreams',
 'that',
 'wave',
 'before',
 'the',
 'half-shut',
 'eye;',
 'And',
 'of',
 'gay',
 'castles',
 'in',
 'the',
 'clouds',
 'that',
 'pass,',
 'Forever',
 'flushing',
 'round',
 'a',
 'summer',
 'sky.',
 'CASTLE',
 'OF',
 'INDOLENCE.',
 'In',
 'the',
 'bosom',
 'of',
 'one',
 'of',
 'those',
 'spacious',
 'coves',
 'which',
 'indent',
 'the',
 'eastern',
 'shore',
 'of',
 'the',
 'Hudson,',
 'at',
 'that',
 'broad',
 'expansion',
 'of',
 'the',
 'river',
 'denominated',
 'by',
 'the',
 'ancient',
 'Dutch',
 'navigators',
 'the',
 'Tappan',
 'Zee,',
 'and',
 'where',
 'they',
 'always',
 'prudently',
 'shortened',
 'sail',
 'and',
 'implored',
 'the',
 'protection',
 'of',
 'St.',
 'Nicholas',
 'when',
 'they',
 'crossed,',
 'th

We can again use [regular expressions](https://www.oreilly.com/content/an-introduction-to-regular-expressions/) to split on more elaborate patterns than whitespace. Here `\W+` splits on any run of non-word characters, so punctuation disappears and hyphenated words like *half-shut* are split into separate tokens.

In [10]:
import re 

words = re.split(r'\W+', text)

words

['The',
 'Legend',
 'of',
 'Sleepy',
 'Hollow',
 'by',
 'Washington',
 'Irving',
 'FOUND',
 'AMONG',
 'THE',
 'PAPERS',
 'OF',
 'THE',
 'LATE',
 'DIEDRICH',
 'KNICKERBOCKER',
 'A',
 'pleasing',
 'land',
 'of',
 'drowsy',
 'head',
 'it',
 'was',
 'Of',
 'dreams',
 'that',
 'wave',
 'before',
 'the',
 'half',
 'shut',
 'eye',
 'And',
 'of',
 'gay',
 'castles',
 'in',
 'the',
 'clouds',
 'that',
 'pass',
 'Forever',
 'flushing',
 'round',
 'a',
 'summer',
 'sky',
 'CASTLE',
 'OF',
 'INDOLENCE',
 'In',
 'the',
 'bosom',
 'of',
 'one',
 'of',
 'those',
 'spacious',
 'coves',
 'which',
 'indent',
 'the',
 'eastern',
 'shore',
 'of',
 'the',
 'Hudson',
 'at',
 'that',
 'broad',
 'expansion',
 'of',
 'the',
 'river',
 'denominated',
 'by',
 'the',
 'ancient',
 'Dutch',
 'navigators',
 'the',
 'Tappan',
 'Zee',
 'and',
 'where',
 'they',
 'always',
 'prudently',
 'shortened',
 'sail',
 'and',
 'implored',
 'the',
 'protection',
 'of',
 'St',
 'Nicholas',
 'when',
 'they',
 'crossed',
 'there',
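Splitting with `re.split(r'\W+', text)` breaks hyphenated words apart, and it can also leave an empty string at the start or end of the list when the text begins or ends with punctuation. An alternative is to describe what a word looks like and collect the matches with `re.findall()`. The sketch below assumes ASCII letters only and treats internal hyphens and apostrophes as part of a word; the second half of the sample sentence is made up to include the historical spelling "red-tipt".

```python
import re

# One or more letters, optionally followed by groups of
# (hyphen or apostrophe + letters), e.g. half-shut, bird's.
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")

sample = "Of dreams that wave before the half-shut eye; the bird's red-tipt wings."
tokens = WORD_PATTERN.findall(sample)
print(tokens)
# ['Of', 'dreams', 'that', 'wave', 'before', 'the', 'half-shut', 'eye',
#  'the', "bird's", 'red-tipt', 'wings']

# For comparison, splitting on non-word characters leaves a trailing empty string.
print(re.split(r"\W+", sample)[-1] == "")  # True
```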


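Now let's say we want to remove punctuation. Splitting on `\W+` already discarded most of it, but the technique below is useful when tokens come from a plain `split()`. We can get a convenient set of punctuation characters from Python's standard library.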

In [11]:
import re 
import string 

print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


We can then build a regular expression character class from these punctuation characters and use it to strip them from each token.

In [14]:
regex_punct = re.compile(f'[{re.escape(string.punctuation)}]')
stripped = [regex_punct.sub('', w) for w in words]
stripped

['The',
 'Legend',
 'of',
 'Sleepy',
 'Hollow',
 'by',
 'Washington',
 'Irving',
 'FOUND',
 'AMONG',
 'THE',
 'PAPERS',
 'OF',
 'THE',
 'LATE',
 'DIEDRICH',
 'KNICKERBOCKER',
 'A',
 'pleasing',
 'land',
 'of',
 'drowsy',
 'head',
 'it',
 'was',
 'Of',
 'dreams',
 'that',
 'wave',
 'before',
 'the',
 'half',
 'shut',
 'eye',
 'And',
 'of',
 'gay',
 'castles',
 'in',
 'the',
 'clouds',
 'that',
 'pass',
 'Forever',
 'flushing',
 'round',
 'a',
 'summer',
 'sky',
 'CASTLE',
 'OF',
 'INDOLENCE',
 'In',
 'the',
 'bosom',
 'of',
 'one',
 'of',
 'those',
 'spacious',
 'coves',
 'which',
 'indent',
 'the',
 'eastern',
 'shore',
 'of',
 'the',
 'Hudson',
 'at',
 'that',
 'broad',
 'expansion',
 'of',
 'the',
 'river',
 'denominated',
 'by',
 'the',
 'ancient',
 'Dutch',
 'navigators',
 'the',
 'Tappan',
 'Zee',
 'and',
 'where',
 'they',
 'always',
 'prudently',
 'shortened',
 'sail',
 'and',
 'implored',
 'the',
 'protection',
 'of',
 'St',
 'Nicholas',
 'when',
 'they',
 'crossed',
 'there',
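The same punctuation stripping can be done without regular expressions. As a small sketch, `str.maketrans()` builds a table that deletes every character in `string.punctuation`, and `str.translate()` applies it. The sample tokens are taken from the whitespace-split output earlier; `cleaned` is just an illustrative name.

```python
import string

# Translation table that maps every ASCII punctuation character to None (deleted).
REMOVE_PUNCT = str.maketrans("", "", string.punctuation)

raw_tokens = ["KNICKERBOCKER.", "was,", "half-shut", "eye;", "St."]
cleaned = [t.translate(REMOVE_PUNCT) for t in raw_tokens]
print(cleaned)  # ['KNICKERBOCKER', 'was', 'halfshut', 'eye', 'St']
```

Note the side effect on `half-shut`: deleting the hyphen glues the two halves together, whereas splitting on `\W+` separates them. Which behavior you want depends on the task.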


We should also make the casing consistent by picking one convention, uppercase or lowercase, and sticking to it. Here we lowercase every token.

In [16]:
lowercased = [w.lower() for w in stripped]
lowercased

['the',
 'legend',
 'of',
 'sleepy',
 'hollow',
 'by',
 'washington',
 'irving',
 'found',
 'among',
 'the',
 'papers',
 'of',
 'the',
 'late',
 'diedrich',
 'knickerbocker',
 'a',
 'pleasing',
 'land',
 'of',
 'drowsy',
 'head',
 'it',
 'was',
 'of',
 'dreams',
 'that',
 'wave',
 'before',
 'the',
 'half',
 'shut',
 'eye',
 'and',
 'of',
 'gay',
 'castles',
 'in',
 'the',
 'clouds',
 'that',
 'pass',
 'forever',
 'flushing',
 'round',
 'a',
 'summer',
 'sky',
 'castle',
 'of',
 'indolence',
 'in',
 'the',
 'bosom',
 'of',
 'one',
 'of',
 'those',
 'spacious',
 'coves',
 'which',
 'indent',
 'the',
 'eastern',
 'shore',
 'of',
 'the',
 'hudson',
 'at',
 'that',
 'broad',
 'expansion',
 'of',
 'the',
 'river',
 'denominated',
 'by',
 'the',
 'ancient',
 'dutch',
 'navigators',
 'the',
 'tappan',
 'zee',
 'and',
 'where',
 'they',
 'always',
 'prudently',
 'shortened',
 'sail',
 'and',
 'implored',
 'the',
 'protection',
 'of',
 'st',
 'nicholas',
 'when',
 'they',
 'crossed',
 'there',
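To see why consistent casing matters, it can help to count tokens before and after lowercasing. This is a small sketch using `collections.Counter` on a hand-picked handful of tokens from the opening of the story (the selection is for illustration only).

```python
from collections import Counter

tokens = ["The", "Legend", "of", "Sleepy", "Hollow", "THE", "PAPERS",
          "OF", "THE", "LATE", "the", "of"]

raw_counts = Counter(tokens)
folded_counts = Counter(t.lower() for t in tokens)

# Before lowercasing, the same word is counted under three spellings.
print(raw_counts["the"], raw_counts["The"], raw_counts["THE"])  # 1 1 2
# After lowercasing, the counts are merged.
print(folded_counts["the"])                                     # 4
# The vocabulary shrinks as well.
print(len(raw_counts), len(folded_counts))                      # 10 7
```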


This was a simple example, using clean text and a few basic cleaning operations. Plain text like this is the ideal format to work with, but data is not always this clean. Sometimes you may have PDFs that store text as images, or social media posts filled with typos and grammar errors. You may even find domain-specific vocabulary that is not in any dictionary, or documents with lots of numeric data that should not be treated as words. Strive for simplicity first, and add cleaning complexity only as the data demands it.

## Using NLTK

The Natural Language Toolkit (NLTK) is a Python library for processing and working with text. We can use it to clean text and get it ready for machine learning applications.

You will need to install NLTK.

You will also need to download the data for the library. Note that `nltk.download('all')` fetches every corpus and model NLTK offers, so it can take a while.


In [17]:
conda install nltk

Channels:
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/anaconda-panel-2023.05-py310

  added / updated specs:
    - nltk


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2024.9.24  |       h06a4308_0         130 KB
    certifi-2024.8.30          |  py311h06a4308_0         163 KB
    nltk-3.9.1                 |  py311h06a4308_0         2.8 MB
    ------------------------------------------------------------
                                           Total:         3.1 MB

The following packages will be UPDATED:

  ca-certificates                     2023.08.22-h06a4308_0 --> 2024.9.24-h06a4308_0 
  certifi                         2023.7.22-py311h06a4308_0 --> 2024.8.30-py311h06a4308_0 
  nltk                                3.8.1-py311h06a4308_0 --> 3.

In [18]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /home/f31fc7f8-bc3d-40b7-
[nltk_data]    |     a34e-6280e4b16a68/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /home/f31fc7f8-bc3d-
[nltk_data]    |     40b7-a34e-6280e4b16a68/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to /ho
[nltk_data]    |     me/f31fc7f8-bc3d-40b7-a34e-
[nltk_data]    |     6280e4b16a68/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /home/f31fc7f8-bc3d-40b7-a34e-
[nltk_data]    |     6280e4b16a68/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to 
[nltk_data]    |     /home

True

### Breaking Up Words

We can split up words in NLTK using the `word_tokenize()` function. It splits on whitespace and punctuation, including commas and periods, and separates contractions, as in `what's -> what 's`.

In [19]:
from nltk.tokenize import word_tokenize
words = word_tokenize(text)
words

['The',
 'Legend',
 'of',
 'Sleepy',
 'Hollow',
 'by',
 'Washington',
 'Irving',
 'FOUND',
 'AMONG',
 'THE',
 'PAPERS',
 'OF',
 'THE',
 'LATE',
 'DIEDRICH',
 'KNICKERBOCKER',
 '.',
 'A',
 'pleasing',
 'land',
 'of',
 'drowsy',
 'head',
 'it',
 'was',
 ',',
 'Of',
 'dreams',
 'that',
 'wave',
 'before',
 'the',
 'half-shut',
 'eye',
 ';',
 'And',
 'of',
 'gay',
 'castles',
 'in',
 'the',
 'clouds',
 'that',
 'pass',
 ',',
 'Forever',
 'flushing',
 'round',
 'a',
 'summer',
 'sky',
 '.',
 'CASTLE',
 'OF',
 'INDOLENCE',
 '.',
 'In',
 'the',
 'bosom',
 'of',
 'one',
 'of',
 'those',
 'spacious',
 'coves',
 'which',
 'indent',
 'the',
 'eastern',
 'shore',
 'of',
 'the',
 'Hudson',
 ',',
 'at',
 'that',
 'broad',
 'expansion',
 'of',
 'the',
 'river',
 'denominated',
 'by',
 'the',
 'ancient',
 'Dutch',
 'navigators',
 'the',
 'Tappan',
 'Zee',
 ',',
 'and',
 'where',
 'they',
 'always',
 'prudently',
 'shortened',
 'sail',
 'and',
 'implored',
 'the',
 'protection',
 'of',
 'St.',
 'Nichol

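Notice that punctuation marks appear as separate tokens in the output above. We can filter those out with Python's built-in `str.isalpha()` method. Be aware that this also drops tokens that mix letters with other characters, such as `half-shut` and `St.`.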

In [20]:
no_puncts = [w for w in words if w.isalpha()]
no_puncts

['The',
 'Legend',
 'of',
 'Sleepy',
 'Hollow',
 'by',
 'Washington',
 'Irving',
 'FOUND',
 'AMONG',
 'THE',
 'PAPERS',
 'OF',
 'THE',
 'LATE',
 'DIEDRICH',
 'KNICKERBOCKER',
 'A',
 'pleasing',
 'land',
 'of',
 'drowsy',
 'head',
 'it',
 'was',
 'Of',
 'dreams',
 'that',
 'wave',
 'before',
 'the',
 'eye',
 'And',
 'of',
 'gay',
 'castles',
 'in',
 'the',
 'clouds',
 'that',
 'pass',
 'Forever',
 'flushing',
 'round',
 'a',
 'summer',
 'sky',
 'CASTLE',
 'OF',
 'INDOLENCE',
 'In',
 'the',
 'bosom',
 'of',
 'one',
 'of',
 'those',
 'spacious',
 'coves',
 'which',
 'indent',
 'the',
 'eastern',
 'shore',
 'of',
 'the',
 'Hudson',
 'at',
 'that',
 'broad',
 'expansion',
 'of',
 'the',
 'river',
 'denominated',
 'by',
 'the',
 'ancient',
 'Dutch',
 'navigators',
 'the',
 'Tappan',
 'Zee',
 'and',
 'where',
 'they',
 'always',
 'prudently',
 'shortened',
 'sail',
 'and',
 'implored',
 'the',
 'protection',
 'of',
 'Nicholas',
 'when',
 'they',
 'crossed',
 'there',
 'lies',
 'a',
 'small',
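Because `str.isalpha()` rejects any token containing a non-letter, hyphenated words and abbreviations disappear along with the punctuation. If you want to keep them, one option is a slightly more permissive test. The sketch below is one possible rule, not an NLTK feature: `is_word_like` is a hypothetical helper that accepts ASCII letters with internal hyphens, apostrophes, or periods, plus an optional trailing period. The sample tokens are taken from the `word_tokenize()` output above.

```python
import re

# Letters, optionally joined by internal hyphens, apostrophes, or periods,
# with an optional trailing period for abbreviations such as "St.".
WORD_LIKE = re.compile(r"[A-Za-z]+(?:[-'.][A-Za-z]+)*\.?")

def is_word_like(token: str) -> bool:
    """True if the whole token looks like a word rather than bare punctuation."""
    return WORD_LIKE.fullmatch(token) is not None

tokens = ["the", "half-shut", "eye", ";", "protection", "of", "St.", "Nicholas", ","]

print([t for t in tokens if t.isalpha()])
# ['the', 'eye', 'protection', 'of', 'Nicholas']
print([t for t in tokens if is_word_like(t)])
# ['the', 'half-shut', 'eye', 'protection', 'of', 'St.', 'Nicholas']
```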


### Breaking Up Sentences

Another way we can process this text is to break it up into sentences rather than words. We can bring in the `sent_tokenize()` function from NLTK to achieve this. We can then grab the sentence at index 25 (the 26th sentence, since indexing starts at zero).

In [21]:
from nltk import sent_tokenize

sentences = sent_tokenize(text)
print(sentences[25])

They are like those little nooks of still
water, which border a rapid stream, where we may see the straw and
bubble riding quietly at anchor, or slowly revolving in their mimic
harbor, undisturbed by the rush of the passing current.
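To see why a dedicated sentence tokenizer is worth using, compare it with a naive rule written by hand. The sketch below splits wherever `.`, `!`, or `?` is followed by whitespace, and it stumbles on the abbreviation "St."; NLTK's tokenizer is designed to cope better with cases like this. The sample is an abridged, made-up rewording of the story's opening. The last step also collapses the hard line breaks that remain inside each sentence, as seen in the printed sentence above.

```python
import re

sample = (
    "They implored the protection of St. Nicholas\n"
    "when they crossed. There lies a small market town."
)

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive = re.split(r"(?<=[.!?])\s+", sample)

# Collapse the hard line breaks inside each piece into single spaces.
naive = [" ".join(piece.split()) for piece in naive]

for piece in naive:
    print(piece)
# They implored the protection of St.
# Nicholas when they crossed.
# There lies a small market town.
```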


### Stop Words

Another task you might consider is removing **stop words**, which are words that bear little meaning, like *the* and *is*. You can look at the stop words available for English in NLTK.

In [22]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

{'other', 'it', 'mightn', "that'll", 'before', "couldn't", 'weren', 'its', "won't", 'while', 'most', 'needn', 'through', 'after', 'o', 'of', 'a', "mightn't", 'was', 'now', 'haven', 'to', 'only', 'doesn', 'into', "you're", 'aren', 'isn', 'hers', 'ours', 'for', 'too', 'until', 'her', "you'll", 'myself', 'herself', 'have', 'in', 'does', 'some', "don't", 'didn', "it's", 'down', 'being', 'ain', "isn't", 'above', 'off', 'because', 'if', 'himself', 'the', 'up', "aren't", 'nor', 'at', 'yourself', 'than', 'we', 'his', 'am', 'ourselves', 'below', "wasn't", "should've", 'who', "shouldn't", 'won', 'yourselves', 'd', 'he', 'themselves', 've', 'against', 'having', 'very', 'or', "hasn't", "you've", 'were', "weren't", 'with', 't', "hadn't", 'me', 'hasn', 'no', 'their', 'she', 'your', 'during', 'where', "didn't", 'should', 'shouldn', 'what', 'doing', 'can', 'this', 'both', 'you', 'just', 'further', 'about', 'my', "wouldn't", 'mustn', 'which', 'ma', 'i', 'between', 'wasn', 'will', "doesn't", 'on', 'all'

We can take this set of stop words and remove its members from our text. Because the stop words are lowercase, we compare each word in lowercase as well.

In [23]:
no_stop_words = [w for w in no_puncts if w.lower() not in stop_words]
no_stop_words

['Legend',
 'Sleepy',
 'Hollow',
 'Washington',
 'Irving',
 'FOUND',
 'AMONG',
 'PAPERS',
 'LATE',
 'DIEDRICH',
 'KNICKERBOCKER',
 'pleasing',
 'land',
 'drowsy',
 'head',
 'dreams',
 'wave',
 'eye',
 'gay',
 'castles',
 'clouds',
 'pass',
 'Forever',
 'flushing',
 'round',
 'summer',
 'sky',
 'CASTLE',
 'INDOLENCE',
 'bosom',
 'one',
 'spacious',
 'coves',
 'indent',
 'eastern',
 'shore',
 'Hudson',
 'broad',
 'expansion',
 'river',
 'denominated',
 'ancient',
 'Dutch',
 'navigators',
 'Tappan',
 'Zee',
 'always',
 'prudently',
 'shortened',
 'sail',
 'implored',
 'protection',
 'Nicholas',
 'crossed',
 'lies',
 'small',
 'market',
 'town',
 'rural',
 'port',
 'called',
 'Greensburgh',
 'generally',
 'properly',
 'known',
 'name',
 'Tarry',
 'Town',
 'name',
 'given',
 'told',
 'former',
 'days',
 'good',
 'housewives',
 'adjacent',
 'country',
 'inveterate',
 'propensity',
 'husbands',
 'linger',
 'village',
 'tavern',
 'market',
 'days',
 'may',
 'vouch',
 'fact',
 'merely',
 'adver
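Stop-word lists are not always harmless. Negations are often included, and dropping them can change what a sentence says. The sketch below uses a tiny made-up stop-word set and an invented sentence to show one way of protecting selected words; `remove_stop_words` and its `keep` parameter are hypothetical names, not part of NLTK.

```python
# A tiny, made-up stop-word set for illustration; a real list is much longer.
DEMO_STOP_WORDS = {"the", "was", "not", "a", "of", "by"}

def remove_stop_words(tokens, stop_words, keep=frozenset()):
    """Drop tokens whose lowercase form is a stop word, unless listed in `keep`."""
    return [
        t for t in tokens
        if t.lower() not in stop_words or t.lower() in keep
    ]

tokens = ["The", "schoolmaster", "was", "not", "afraid", "of", "the", "dark"]

print(remove_stop_words(tokens, DEMO_STOP_WORDS))
# ['schoolmaster', 'afraid', 'dark']
print(remove_stop_words(tokens, DEMO_STOP_WORDS, keep={"not"}))
# ['schoolmaster', 'not', 'afraid', 'dark']
```

Whether to keep negations depends on the goal: for topic discovery they rarely matter, while for sentiment they often do.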

### Stemming 

There might be times you want to reduce each word to its root or base. For example, *fighting* and *fights* both reduce to *fight*. This can help reduce the vocabulary and find broader tones or sentiments in the document. One of the most widely used stemming algorithms is the Porter stemming algorithm, which NLTK provides. Note that stems are not always real words, as with `sleepi` and `castl` below.

In [24]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

stemmed = [porter.stem(word) for word in no_stop_words]
stemmed

['legend',
 'sleepi',
 'hollow',
 'washington',
 'irv',
 'found',
 'among',
 'paper',
 'late',
 'diedrich',
 'knickerbock',
 'pleas',
 'land',
 'drowsi',
 'head',
 'dream',
 'wave',
 'eye',
 'gay',
 'castl',
 'cloud',
 'pass',
 'forev',
 'flush',
 'round',
 'summer',
 'sky',
 'castl',
 'indol',
 'bosom',
 'one',
 'spaciou',
 'cove',
 'indent',
 'eastern',
 'shore',
 'hudson',
 'broad',
 'expans',
 'river',
 'denomin',
 'ancient',
 'dutch',
 'navig',
 'tappan',
 'zee',
 'alway',
 'prudent',
 'shorten',
 'sail',
 'implor',
 'protect',
 'nichola',
 'cross',
 'lie',
 'small',
 'market',
 'town',
 'rural',
 'port',
 'call',
 'greensburgh',
 'gener',
 'properli',
 'known',
 'name',
 'tarri',
 'town',
 'name',
 'given',
 'told',
 'former',
 'day',
 'good',
 'housew',
 'adjac',
 'countri',
 'inveter',
 'propens',
 'husband',
 'linger',
 'villag',
 'tavern',
 'market',
 'day',
 'may',
 'vouch',
 'fact',
 'mere',
 'advert',
 'sake',
 'precis',
 'authent',
 'far',
 'villag',
 'perhap',
 'two',
 '
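A quick way to see what stemming buys you is to group words by their stem. The sketch below reuses NLTK's `PorterStemmer` on a few words from the story; the expected stems are the ones shown in the output above, and the grouping dictionary is just an illustration.

```python
from collections import defaultdict

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
words = ["castles", "CASTLE", "Sleepy", "navigators", "village", "days"]

# Map each stem to the original words that collapse into it.
groups = defaultdict(list)
for w in words:
    groups[porter.stem(w)].append(w)

for stem, originals in groups.items():
    print(f"{stem} <- {originals}")
# castl <- ['castles', 'CASTLE']
# sleepi <- ['Sleepy']
# navig <- ['navigators']
# villag <- ['village']
# day <- ['days']
```

The stemmer lowercases its input, so `castles` and `CASTLE` land in the same group without a separate lowercasing step.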

There are also **lemmatization** tools in NLTK, which help group and consolidate terms. For example, "better" has the word "good" as its lemma, and "was" has "be." We will talk more about lemmatization with spaCy. 

## Using spaCy

While NLTK is a great library, another that has grown popular for its scalability and efficiency is [spaCy](https://spacy.io/). We'll cover a few of its features here.  

First install spaCy as well as its English model. After that, you should be set to run spaCy.

In [53]:
conda install spacy

Channels:
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/anaconda-panel-2023.05-py310

  added / updated specs:
    - spacy


The following NEW packages will be INSTALLED:

  annotated-types    pkgs/main/linux-64::annotated-types-0.6.0-py311h06a4308_0 
  catalogue          pkgs/main/linux-64::catalogue-2.0.10-py311h06a4308_0 
  cloudpathlib       pkgs/main/linux-64::cloudpathlib-0.16.0-py311h06a4308_1 
  confection         pkgs/main/linux-64::confection-0.1.4-py311h92b7b1e_0 
  cymem              pkgs/main/linux-64::cymem-2.0.6-py311h6a678d5_0 
  cython-blis        pkgs/main/linux-64::cython-blis-0.7.9-py311hf4808d0_0 
  langcodes          pkgs/main/noarch::langcodes-3.3.0-pyhd3eb1b0_0 
  murmurhash         pkgs/main/linux-64::murmurhash-1.0.7-py311h6a678d5_0 
  preshed            pkgs/main/linux-64::preshed-3.0.6-py311h6a678d5_0 
  pydantic           pkgs/main/linux

In [28]:
!python -m spacy download en_core_web_sm

/opt/conda/envs/anaconda-panel-2023.05-py310/bin/python: No module named spacy


In [57]:
!pip list | grep spacy


spacy                         3.8.2
spacy-legacy                  3.0.12
spacy-loggers                 1.0.5


In [None]:
!pip uninstall spacy
!pip install spacy


Found existing installation: spacy 3.8.2
Uninstalling spacy-3.8.2:
  Would remove:
    /home/f31fc7f8-bc3d-40b7-a34e-6280e4b16a68/.local/bin/spacy
    /home/f31fc7f8-bc3d-40b7-a34e-6280e4b16a68/.local/lib/python3.11/site-packages/spacy-3.8.2.dist-info/*
    /home/f31fc7f8-bc3d-40b7-a34e-6280e4b16a68/.local/lib/python3.11/site-packages/spacy/*
Proceed (Y/n)? 

In [None]:
export PYTHONPATH=$PYTHONPATH:/path/to/your/site-packages


In [None]:
/path/to/python -m spacy download en_core_web_sm


In [None]:
conda install -c conda-forge spacy


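In [None]:
# %pip installs into the environment of the running kernel,
# which avoids the interpreter mismatch that `!pip` can cause
%pip install spacy
#%pip install numpy==1.26.4

import spacy 
# requires the en_core_web_sm model to be installed for this same kernel
nlp = spacy.load("en_core_web_sm")
nlp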

Let's load up Sleepy Hollow but this time into a spaCy doc. 

In [None]:
sleepy_hollow = nlp(text)
type(sleepy_hollow)

We can traverse the text tokens. 

In [None]:
[token.text for token in sleepy_hollow]

We can also traverse the sentences, which are packaged into `Span` objects. 

In [None]:
[sent.text for sent in sleepy_hollow.sents]

Each token in spaCy comes with a lot of helpful attributes. Below we iterate over a handful of tokens from the Sleepy Hollow document and print a few attributes that mirror the checks we did by hand earlier.

In [None]:
for token in sleepy_hollow[50:60]: 
    print(f"Character offset: {token.idx}")
    print(f"Text: {token.text}")
    print(f"Is Alpha: {token.is_alpha}")
    print(f"Is Punctuation: {token.is_punct}")
    print(f"Is Stop Word: {token.is_stop}\n\n")
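These attributes make filtering straightforward. As a small sketch, the example below uses `spacy.blank("en")`, which provides the English tokenizer and lexical attributes without downloading a trained model, and keeps only alphabetic tokens in lowercase. The sentence is made up, and `nlp_blank` is named so that it does not overwrite the `nlp` object loaded earlier.

```python
import spacy

# A blank English pipeline: tokenizer and lexical attributes only, no trained model.
nlp_blank = spacy.blank("en")
doc = nlp_blank("The rider, however, vanished.")

# Punctuation tokens such as ',' and '.' have is_alpha == False, so they are dropped.
kept = [token.lower_ for token in doc if token.is_alpha]
print(kept)  # ['the', 'rider', 'however', 'vanished']
```

Adding `and not token.is_stop` to the condition would drop stop words in the same pass.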
    

You can also implement your own tokenization procedures but we will keep the scope focused for now. Let's take a look at the lemmatization of each token. Sure enough, spaCy will find the lemma of each word. 

In [None]:
for token in sleepy_hollow: 
    if token.is_alpha:
        print(f"{token.text} -> {token.lemma_}")
    

This should give us enough tools and exposure to text cleaning. Keep in mind that how you clean your text data is driven by what you want to achieve and by the state of the data itself. We had a nice, clean short story to work with here: a UTF-8 text body with no HTML or PDF markup. There will be times you have to handle domain-specific words and language, or decide whether to remove numbers, dates, and mathematical symbols that may not be useful for your language model. Then there are simple but tedious matters like typos and errors, all of which might need to be handled before the text reaches your large language model.

Consider saving and documenting your cleaning steps too! Make reusable pipelines for your projects and perhaps even save the cleaned documents. 
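As a starting point for such a pipeline, the steps from the manual section can be wrapped in one function. This is a minimal sketch using only the standard library; `clean_text` is a hypothetical name, the word pattern assumes ASCII letters, and the stop words passed in are a made-up two-word set rather than NLTK's list.

```python
import re

# Lowercase letters, optionally joined by internal hyphens or apostrophes.
WORD_PATTERN = re.compile(r"[a-z]+(?:[-'][a-z]+)*")

def clean_text(raw: str, stop_words: frozenset = frozenset()) -> list[str]:
    """Lowercase the text, extract word tokens, and drop stop words."""
    tokens = WORD_PATTERN.findall(raw.lower())
    return [t for t in tokens if t not in stop_words]

sample = "FOUND AMONG THE PAPERS OF THE LATE DIEDRICH KNICKERBOCKER."
print(clean_text(sample, stop_words=frozenset({"the", "of"})))
# ['found', 'among', 'papers', 'late', 'diedrich', 'knickerbocker']
```

Keeping the steps in one documented function makes it easier to apply identical cleaning to every document in a project.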

## Exercise

Take this excerpt from an Edgar Allan Poe poem and tokenize it with the tool of your choice.

In [None]:
text = "Once upon a midnight dreary, While I pondered, weak and weary"

# tokenize the text below

### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

### Using NLTK

In [None]:
from nltk.tokenize import word_tokenize
poem = word_tokenize(text)

for w in poem: 
    print(w)

### Using spaCy

In [None]:
import spacy 
nlp = spacy.load("en_core_web_sm")
poem = nlp(text)

for w in poem: 
    print(w)