<a href="https://colab.research.google.com/github/Angel-Castro-RC/Final_NLP/blob/main/F3_1_Tokenization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS 195: Natural Language Processing
## Tokenization

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ericmanley/f23-CS195NLP/blob/main/F3_1_Tokenization.ipynb)


## References

Python `requests` library quickstart: https://requests.readthedocs.io/en/latest/user/quickstart/

Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

GPT Tokenizer Illustration: https://platform.openai.com/tokenizer

Python `split` method: https://docs.python.org/3/library/stdtypes.html#str.split

Hugging Face Byte-Pair Encoding tokenization: https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt

Hugging Face WordPiece tokenization: https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt

In [1]:
import sys
!{sys.executable} -m pip install requests chardet nltk beautifulsoup4 tokenizers transformers



In [32]:
#you shouldn't need to do this in Colab, but I had to do it on my own machine
#in order to connect to the nltk service
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context


## Tokenization

Before you can feed input into most NLP algorithms, you have to **tokenize** the text - break apart the string into units (the *tokens*) that the algorithm needs to work with.

Tokens can be
* letters
* words
* a mix of words and punctuation
* parts of words

See how GPT tokenizes here: https://platform.openai.com/tokenizer

Tokenization can be done with *rule-based* methods, or the rules can be learned automatically.

As we saw previously, the Python string `split` method can be very useful for rule-based methods:
* if you give it a parameter, it will break up the string using that delimiter
* if you don't it separates by whitespace

In [34]:
text = "I code when I am happy . I am happy therefore I code . "
text_tokens = text.split()

print(text_tokens)

['I', 'code', 'when', 'I', 'am', 'happy', '.', 'I', 'am', 'happy', 'therefore', 'I', 'code', '.']


In [35]:
text = "I code when I am happy . I am happy therefore I code . "
text_tokens = text.split("I") #you probably don't want to do this

print(text_tokens)

['', ' code when ', ' am happy . ', ' am happy therefore ', ' code . ']
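One detail is easy to miss: calling `split()` with no argument is not the same as calling `split(" ")`. The short sketch below uses a made-up string with doubled spaces to show the difference.

```python
spaced = "I  code .  "  # made-up example with doubled spaces

no_argument = spaced.split()       # runs of whitespace count as one separator
single_space = spaced.split(" ")   # every single space is a separator

print(no_argument)
print(single_space)   # contains empty strings where spaces were adjacent
```

For word tokens, the no-argument form is usually the one you want, since it never produces empty strings.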


## The requests library

The `requests` library is useful for loading data stored on the web.

Here's how we can request the text version of *The Adventures of Sherlock Holmes* from Project Gutenberg: https://www.gutenberg.org/ebooks/1661


In [36]:
import requests

response = requests.get("https://www.gutenberg.org/files/1661/1661-0.txt")

print(response)
print(response.headers)

<Response [200]>
{'date': 'Wed, 06 Dec 2023 17:19:43 GMT', 'server': 'Apache', 'last-modified': 'Tue, 10 Oct 2023 11:01:52 GMT', 'accept-ranges': 'bytes', 'content-length': '607504', 'content-type': 'text/plain'}


A response code of 200 means the request worked. The `.headers` attribute holds the other metadata that came back with it.

Now let's look at what some of this text looks like:

In [57]:
print(response.text) #prints the whole thing
#print(response.text[4000:6000]) #uncomment to print just a sample of text from the middle

The Project Gutenberg eBook of The Adventures of Sherlock Holmes,
by Arthur Conan Doyle

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: The Adventures of Sherlock Holmes

Author: Arthur Conan Doyle

Release Date: November 29, 2002 [eBook #1661]
[Most recently updated: October 10, 2023]

Language: English

Character set encoding: UTF-8

Produced by: an anonymous Project Gutenberg volunteer and Jose Menendez

*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK
HOLMES ***




The Adventures of Sherlock Holmes

by Arthur Conan Doyle


Contents

Notice: There are a lot of weird characters like â - if this looks different than what you see when you open the file, it means something went wrong.

Usually, the `requests` library can figure out the format that the characters are stored in, and that's what `response.text` relies on. Here it assumed the text used the [ISO-8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1) encoding, but that's not quite right.


In [38]:
print(response.encoding)

ISO-8859-1
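To see why a wrong encoding guess produces characters like â, here is a small sketch that needs no network access. It encodes a short made-up string containing a long dash as UTF-8 bytes and then decodes those bytes both ways. The variable names are only for illustration.

```python
original = "night\u2014it"              # \u2014 is the long dash used throughout the book
raw_bytes = original.encode("utf-8")    # the bytes as they would travel over the network

wrong = raw_bytes.decode("iso-8859-1")  # one character per byte: the dash becomes three characters
right = raw_bytes.decode("utf-8")       # the three bytes are read back as one dash

print(raw_bytes)
print(repr(wrong))
print(right)
```

The dash takes three bytes in UTF-8, so decoding with ISO-8859-1 turns it into three separate characters, the first of which is â.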


Let's see what the requests module documentation suggests: https://requests.readthedocs.io/en/latest/user/quickstart/#response-content

We can look for clues in `response.content`, which shows the text in its raw form, as bytes:

In [58]:
print(response.content)
#print(response.content[4000:6000])



One thing to notice: newlines are represented as `\r\n` rather than the usual `\n` - that will be important later, so remember it

Now we can use a module like `chardet` to detect the encoding

In [59]:
import chardet


encoding_info = chardet.detect(response.content)
print(encoding_info)

{'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}


Looks like it is actually a variant of the popular encoding [UTF-8](https://en.wikipedia.org/wiki/UTF-8).

Now we can set the encoding to match:

In [60]:
response.encoding = 'UTF-8-SIG'
print(response.text[4000:6000])

riend and companion.

One night—it was on the twentieth of March, 1888—I was returning from a
journey to a patient (for I had now returned to civil practice), when
my way led me through Baker Street. As I passed the well-remembered
door, which must always be associated in my mind with my wooing, and
with the dark incidents of the Study in Scarlet, I was seized with a
keen desire to see Holmes again, and to know how he was employing his
extraordinary powers. His rooms were brilliantly lit, and, even as I
looked up, I saw his tall, spare figure pass twice in a dark silhouette
against the blind. He was pacing the room swiftly, eagerly, with his
head sunk upon his chest and his hands clasped behind him. To me, who
knew his every mood and habit, his attitude and manner told their own
story. He was at work again. He had risen out of his drug-created
dreams and was hot upon the scent of some new problem. I rang the bell
and was shown up to the chamber which had formerly been in
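The `-SIG` part of the detected encoding refers to a byte order mark: three extra bytes at the very start of the file. The sketch below, using a made-up byte string, shows what Python's three relevant codecs do with those bytes.

```python
bom_bytes = b"\xef\xbb\xbfThe Project Gutenberg eBook"  # made-up sample starting with a UTF-8 byte order mark

as_latin1 = bom_bytes.decode("iso-8859-1")   # the mark shows up as three stray characters
as_utf8 = bom_bytes.decode("utf-8")          # the mark becomes one invisible character, \ufeff
as_utf8_sig = bom_bytes.decode("utf-8-sig")  # the mark is dropped

print(repr(as_latin1[:6]))
print(repr(as_utf8[:4]))
print(repr(as_utf8_sig[:3]))
```

With plain `utf-8` the text looks right on screen, but the first token would still carry an invisible extra character, which is why `utf-8-sig` is the safer choice for this file.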

## Cutting to the content

This ebook has markers showing where the actual content of the book starts and stops, so we can cut out the Project Gutenberg preamble at the beginning and the license text at the end.

In [61]:
#the marker lines wrap onto a second line in this file, so match only the part before the line break
start_text = "*** START OF THE PROJECT GUTENBERG EBOOK"
end_text = "*** END OF THE PROJECT GUTENBERG EBOOK"
marker_index = response.text.index(start_text)
start_index = response.text.index("***",marker_index+len(start_text))+len("***") #skip past the closing ***
end_index = response.text.index(end_text)
print("Start and end index of the text",start_index,end_index)
sherlock_text = response.text[start_index:end_index]
#print(sherlock_text)
#print(sherlock_text[:1000])
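In the header output shown earlier, the START marker wraps onto a second line, which is one way an exact search for the full one-line marker can fail. A more tolerant approach is a regular expression that allows any title, wrapped or not, between the fixed parts of the marker. This is only a sketch: `strip_gutenberg_boilerplate` is a hypothetical helper name, and the sample string is made up to imitate the file's layout.

```python
import re

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines (hypothetical helper)."""
    # [^*]* matches the title, including a line break if the marker wraps
    start = re.search(r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK[^*]*\*\*\*", text)
    end = re.search(r"\*\*\* END OF THE PROJECT GUTENBERG EBOOK", text)
    if start is None or end is None:
        raise ValueError("could not find the Project Gutenberg markers")
    return text[start.end():end.start()]

# made-up sample that imitates the wrapped markers
sample = (
    "License preamble\r\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK\r\n"
    "HOLMES ***\r\n"
    "To Sherlock Holmes she is always _the_ woman.\r\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK\r\n"
    "HOLMES ***\r\n"
    "License text\r\n"
)

book_body = strip_gutenberg_boilerplate(sample)
print(repr(book_body))
```

Because the pattern does not name a specific title, the same helper should work on other Project Gutenberg books that use these marker lines.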

## Now we're ready to tokenize

A question we need to answer: what do we want our tokens to look like?

Do we want to include punctuation? Should it be a separate token?

Do we want it broken into letters? words? sentences?

For this example, let's assume we want to keep punctuation but break it apart from the words it is next to.

Unfortunately, a simple `.split()` won't do the trick - notice the periods are stuck to the words they're next to.



In [50]:
#print(sherlock_text[:1000].split())
print(sherlock_text[:1000])


One strategy: use the `replace` method to put spaces before and after the periods.

In [51]:
example_strategy = sherlock_text[:1000].replace("."," . ")
print(example_strategy)
print(example_strategy.split()) #now . are separate tokens


OK - let's do the whole text and separate lots of other punctuation while we're at it

In [52]:
sherlock_text_intermediate = sherlock_text
sherlock_text_intermediate = sherlock_text_intermediate.replace("."," . ")
sherlock_text_intermediate = sherlock_text_intermediate.replace(","," , ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("!"," ! ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("?"," ? ")
sherlock_text_intermediate = sherlock_text_intermediate.replace(":"," : ")
sherlock_text_intermediate = sherlock_text_intermediate.replace(";"," ; ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("“"," “ ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("”"," ” ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("’"," ’ ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("‘"," ‘ ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("-"," - ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("—"," - ")
sherlock_text_intermediate = sherlock_text_intermediate.replace("_"," _ ")
print(sherlock_text_intermediate[4000:6000])


In [None]:
sherlock_tokens = sherlock_text_intermediate.split()
print(sherlock_tokens[:1000])

['cover', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', 'by', 'Arthur', 'Conan', 'Doyle', 'Contents', 'I', '.', 'A', 'Scandal', 'in', 'Bohemia', 'II', '.', 'The', 'Red', '-', 'Headed', 'League', 'III', '.', 'A', 'Case', 'of', 'Identity', 'IV', '.', 'The', 'Boscombe', 'Valley', 'Mystery', 'V', '.', 'The', 'Five', 'Orange', 'Pips', 'VI', '.', 'The', 'Man', 'with', 'the', 'Twisted', 'Lip', 'VII', '.', 'The', 'Adventure', 'of', 'the', 'Blue', 'Carbuncle', 'VIII', '.', 'The', 'Adventure', 'of', 'the', 'Speckled', 'Band', 'IX', '.', 'The', 'Adventure', 'of', 'the', 'Engineer', '’', 's', 'Thumb', 'X', '.', 'The', 'Adventure', 'of', 'the', 'Noble', 'Bachelor', 'XI', '.', 'The', 'Adventure', 'of', 'the', 'Beryl', 'Coronet', 'XII', '.', 'The', 'Adventure', 'of', 'the', 'Copper', 'Beeches', 'I', '.', 'A', 'SCANDAL', 'IN', 'BOHEMIA', 'I', '.', 'To', 'Sherlock', 'Holmes', 'she', 'is', 'always', '_the_', 'woman', '.', 'I', 'have', 'seldom', 'heard', 'him', 'mention', 'her', 'under', 'any', 'other
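The chain of `replace` calls works, but every new punctuation mark needs its own line. As an alternative sketch, a single regular expression can pull out words and punctuation marks in one pass. This is a different technique from the one used above, and `simple_tokenize` is a hypothetical helper name.

```python
import re

def simple_tokenize(text):
    """Split text into runs of letters/digits and single punctuation marks (hypothetical helper)."""
    # [^\W_]+  one or more letters or digits (word characters except the underscore)
    # [^\w\s]  any single character that is neither a word character nor whitespace
    # _        an underscore on its own
    return re.findall(r"[^\W_]+|[^\w\s]|_", text)

sample = "To Sherlock Holmes she is always _the_ woman. The Engineer\u2019s Thumb!"
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
```

One difference from the `replace` approach: the long dash stays a long dash here instead of being turned into a hyphen.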

## Exercise

The text also contains some underscores. What do these signify?

In Project Gutenberg plain-text files, underscores around a word or phrase mark text that was italicized in the original book, as in `_the_ woman`.

Should we separate them out? Should we remove them? Go ahead and do what you think you should do.

Because the underscores mark emphasis rather than being part of the word, one reasonable choice is to remove them. If you want to keep the emphasis information, separate them into their own tokens instead.

Can you find any other special characters we should deal with?

Hyphens, dashes, and quotation marks.

## What if I wanted it broken down by sentences?

In this example, suppose we want the text
* broken down by words
* no punctuation
* structured by sentence

In [3]:
#split into a list of sentences at each period
sherlock_sentences = sherlock_text.split(".")
print(sherlock_sentences[:100])

In [53]:
chars_to_remove = [",","!","?",";",":","“","”","’","‘"]
chars_to_change_to_spaces = ["-","—","\r\n"]

for idx in range(len(sherlock_sentences)):

    for c in chars_to_remove:
        sherlock_sentences[idx] = sherlock_sentences[idx].replace(c,"") #replace those characters with the empty string
    for c in chars_to_change_to_spaces:
        sherlock_sentences[idx] = sherlock_sentences[idx].replace(c," ") #replace those characters with a space
    sherlock_sentences[idx] = sherlock_sentences[idx].split()

print(sherlock_sentences[:100])


## Exercise

What if we wanted to convert all of the uppercase letters to lowercase? Edit the code to do this to each sentence.

Recall, you can use the `.lower()` string method.

In [None]:
my_string = "here’s another vacancy on the League of the Red-headed Men"
my_string_lower = my_string.lower()
print(my_string_lower)

here’s another vacancy on the league of the red-headed men


In [54]:
sherlock_sentences = sherlock_text.split(".")  # Split into sentences
print(sherlock_sentences[:5])  # Print the first 5 sentences for illustration

chars_to_remove = [",", "!", "?", ";", ":", "“", "”", "’", "‘"]
chars_to_change_to_spaces = ["-", "—", "\r\n"]

for idx in range(len(sherlock_sentences)):
    for c in chars_to_remove:
        sherlock_sentences[idx] = sherlock_sentences[idx].replace(c, "")  # Remove specified characters
    for c in chars_to_change_to_spaces:
        sherlock_sentences[idx] = sherlock_sentences[idx].replace(c, " ")  # Change specified characters to spaces
    sherlock_sentences[idx] = sherlock_sentences[idx].lower().split()  # Convert to lowercase and split into words

print(sherlock_sentences[:5])

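Splitting on periods treats every period as the end of a sentence, which is not always true. The sketch below uses a made-up sentence to show what happens with an abbreviation like "Mr." and with the empty piece left after the final period.

```python
sample = "Mr. Holmes arrived at 221B. He sat down."  # made-up example

pieces = sample.split(".")
print(pieces)  # "Mr" is cut off as its own sentence, and the last piece is empty

# drop the empty pieces, then split each remaining piece into words
sentences = [piece.split() for piece in pieces if piece.strip()]
print(sentences)
```

The NLTK sentence tokenizer later in these notes is one way around the abbreviation problem.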

## What if I wanted it broken down by paragraph?

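This time, we'll leave punctuation in. Note that splitting on `\r\n` gives one item per line of the file, so each paragraph is spread over several items and the blank lines between paragraphs show up as empty strings.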

In [None]:
sherlock_paragraphs = sherlock_text.split("\r\n")
print(sherlock_paragraphs[:50]) #look at the first few paragraphs

['', '', 'cover', '', '', '', '', 'The Adventures of Sherlock Holmes', '', 'by Arthur Conan Doyle', '', '', 'Contents', '', '   I.     A Scandal in Bohemia', '   II.    The Red-Headed League', '   III.   A Case of Identity', '   IV.    The Boscombe Valley Mystery', '   V.     The Five Orange Pips', '   VI.    The Man with the Twisted Lip', '   VII.   The Adventure of the Blue Carbuncle', '   VIII.  The Adventure of the Speckled Band', '   IX.    The Adventure of the Engineer’s Thumb', '   X.     The Adventure of the Noble Bachelor', '   XI.    The Adventure of the Beryl Coronet', '   XII.   The Adventure of the Copper Beeches', '', '', '', '', 'I. A SCANDAL IN BOHEMIA', '', '', 'I.', '', 'To Sherlock Holmes she is always _the_ woman. I have seldom heard him', 'mention her under any other name. In his eyes she eclipses and', 'predominates the whole of her sex. It was not that he felt any emotion', 'akin to love for Irene Adler. All emotions, and that one particularly,', 'were abhorrent 

In [None]:
chars_to_separate = [",","!","?",";",":","“","”","’","‘","-","—","."]

for idx in range(len(sherlock_paragraphs)):
    for c in chars_to_separate:
        sherlock_paragraphs[idx] = sherlock_paragraphs[idx].replace(c," "+c+" ") #put a space before and after the character

    sherlock_paragraphs[idx] = sherlock_paragraphs[idx].split()

print(sherlock_paragraphs[:50])


## Exercise

Remove empty paragraphs from `sherlock_paragraphs`.

In [None]:
sherlock_paragraphs = sherlock_text.split("\r\n")
print(sherlock_paragraphs[:5])  # Print the first 5 paragraphs for illustration

chars_to_separate = [",", "!", "?", ";", ":", "“", "”", "’", "‘", "-", "—", "."]

for idx in range(len(sherlock_paragraphs)):
    for c in chars_to_separate:
        sherlock_paragraphs[idx] = sherlock_paragraphs[idx].replace(c, " " + c + " ")  # Put a space before and after the character

    sherlock_paragraphs[idx] = sherlock_paragraphs[idx].split()

# Remove empty paragraphs
sherlock_paragraphs = [paragraph for paragraph in sherlock_paragraphs if paragraph]

print(sherlock_paragraphs[:5])
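Since splitting on `\r\n` produces lines, a different split is needed to get one item per paragraph. A minimal sketch, assuming paragraphs are separated by one or more blank lines, is to split wherever two or more line breaks appear in a row. The sample string and the name `split_paragraphs` are made up for illustration.

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs at blank lines (hypothetical helper)."""
    # two or more line breaks in a row (\r\n or \n) mark a paragraph boundary
    chunks = re.split(r"(?:\r?\n){2,}", text)
    # rejoin the wrapped lines of each paragraph with single spaces and drop empty chunks
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]

sample = (
    "First paragraph, line one.\r\n"
    "First paragraph, line two.\r\n"
    "\r\n"
    "\r\n"
    "Second paragraph.\r\n"
)

sample_paragraphs = split_paragraphs(sample)
print(sample_paragraphs)
```

Each paragraph string can then be tokenized with the same punctuation handling as above.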

## Working with HTML data

Most data you retrieve from the web is not in plain-text format - it usually has lots of HTML tags like `<title>`, `<br>`, and `<p>`.


In [None]:
import requests

response = requests.get("https://en.wikipedia.org/wiki/Sherlock_Holmes")

print(response)
print(response.headers)

<Response [200]>
{'date': 'Mon, 25 Sep 2023 12:17:58 GMT', 'server': 'mw2309.codfw.wmnet', 'x-content-type-options': 'nosniff', 'content-language': 'en', 'accept-ch': '', 'vary': 'Accept-Encoding,Cookie', 'last-modified': 'Mon, 25 Sep 2023 12:02:48 GMT', 'content-type': 'text/html; charset=UTF-8', 'content-encoding': 'gzip', 'age': '60160', 'x-cache': 'cp2027 hit, cp2027 hit/44', 'x-cache-status': 'hit-front', 'server-timing': 'cache;desc="hit-front", host;desc="cp2027"', 'strict-transport-security': 'max-age=106384710; includeSubDomains; preload', 'report-to': '{ "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }', 'nel': '{ "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}', 'set-cookie': 'WMF-Last-Access=26-Sep-2023;Path=/;HttpOnly;secure;Expires=Sat, 28 Oct 2023 00:00:00 GMT, WMF-Last-Access

In [None]:
response.text[:3000]

'<!DOCTYPE html>\n<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled vector-feature-custom-font-size-clientpref-disabled vector-feature-client-preferences-disabled" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8">\n<title>Sherlock Holmes - Wikipedia</title>\n<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabl

## Beautiful Soup

The Beautiful Soup package is great for *parsing* and manipulating HTML: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [None]:
from bs4 import BeautifulSoup
import requests

response = requests.get("https://en.wikipedia.org/wiki/Sherlock_Holmes")
sherlock_wiki_html = BeautifulSoup(response.text, 'html.parser')

You can look for a title tag:

In [None]:
print(sherlock_wiki_html.title)

<title>Sherlock Holmes - Wikipedia</title>


Or look for all of the `<a>` tags, which are the links to other pages:

In [None]:
list_of_links = sherlock_wiki_html.find_all('a')
for link in list_of_links[:100]:
    print(link.get('href'))

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
/w/index.php?title=Special:CreateAccount&returnto=Sherlock+Holmes
/w/index.php?title=Special:UserLogin&returnto=Sherlock+Holmes
/w/index.php?title=Special:CreateAccount&returnto=Sherlock+Holmes
/w/index.php?title=Special:UserLogin&returnto=Sherlock+Holmes
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
#
#Inspiration_for_the_character
#Fictional_character_biography
#Family_and_early_life
#Life_with_Watson
#Practice
#The_Great_Hiatus
#Retirement
#Personality_and_habits
#Drug_us

## Extracting text with Beautiful Soup

Use the `.get_text()` method on the soup object

In [None]:
sherlock_wiki_text = sherlock_wiki_html.get_text()

sherlock_wiki_text[:2000]

'\n\n\n\nSherlock Holmes - Wikipedia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJump to content\n\n\n\n\n\n\n\nMain menu\n\n\n\n\n\nMain menu\nmove to sidebar\nhide\n\n\n\n\t\tNavigation\n\t\n\n\nMain pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate\n\n\n\n\n\n\t\tContribute\n\t\n\n\nHelpLearn to editCommunity portalRecent changesUpload file\n\n\n\n\n\nLanguages\n\nLanguage links are at the top of the page across from the title.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\nCreate accountLog in\n\n\n\n\n\n\nPersonal tools\n\n\n\n\n\n Create account Log in\n\n\n\n\n\n\t\tPages for logged out editors learn more\n\n\n\nContributionsTalk\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nContents\nmove to sidebar\nhide\n\n\n\n\n(Top)\n\n\n\n\n\n1Inspiration for the character\n\n\n\n\n\n\n\n2Fictional character biography\n\n\n\nToggle Fictional character biography subsection\n\n\n\n\n\n2.1Fam

In [None]:
sherlock_wiki_no_lines = sherlock_wiki_text.replace("\n"," ")
sherlock_wiki_no_lines[:2000]

'    Sherlock Holmes - Wikipedia                                   Jump to content        Main menu      Main menu move to sidebar hide    \t\tNavigation \t   Main pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate      \t\tContribute \t   HelpLearn to editCommunity portalRecent changesUpload file      Languages  Language links are at the top of the page across from the title.                    Search            Search         Create accountLog in       Personal tools       Create account Log in      \t\tPages for logged out editors learn more    ContributionsTalk                            Contents move to sidebar hide     (Top)      1Inspiration for the character        2Fictional character biography    Toggle Fictional character biography subsection      2.1Family and early life        2.2Life with Watson        2.3Practice        2.4The Great Hiatus        2.5Retirement          3Personality and habits    Toggle Personality and habits subsection      3.1Drug u

In [None]:
chars_to_separate = [",","!","?",";",":","\"","\'","-",".","(",")"]

for c in chars_to_separate:
    sherlock_wiki_no_lines = sherlock_wiki_no_lines.replace(c," "+c+" ")

sherlock_wiki_no_lines[:2000]

'    Sherlock Holmes  -  Wikipedia                                   Jump to content        Main menu      Main menu move to sidebar hide    \t\tNavigation \t   Main pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate      \t\tContribute \t   HelpLearn to editCommunity portalRecent changesUpload file      Languages  Language links are at the top of the page across from the title .                     Search            Search         Create accountLog in       Personal tools       Create account Log in      \t\tPages for logged out editors learn more    ContributionsTalk                            Contents move to sidebar hide      ( Top )       1Inspiration for the character        2Fictional character biography    Toggle Fictional character biography subsection      2 . 1Family and early life        2 . 2Life with Watson        2 . 3Practice        2 . 4The Great Hiatus        2 . 5Retirement          3Personality and habits    Toggle Personality and habits subsect

In [None]:
sherlock_wiki_tokens = sherlock_wiki_no_lines.split()
print(sherlock_wiki_tokens[:500])

['Sherlock', 'Holmes', '-', 'Wikipedia', 'Jump', 'to', 'content', 'Main', 'menu', 'Main', 'menu', 'move', 'to', 'sidebar', 'hide', 'Navigation', 'Main', 'pageContentsCurrent', 'eventsRandom', 'articleAbout', 'WikipediaContact', 'usDonate', 'Contribute', 'HelpLearn', 'to', 'editCommunity', 'portalRecent', 'changesUpload', 'file', 'Languages', 'Language', 'links', 'are', 'at', 'the', 'top', 'of', 'the', 'page', 'across', 'from', 'the', 'title', '.', 'Search', 'Search', 'Create', 'accountLog', 'in', 'Personal', 'tools', 'Create', 'account', 'Log', 'in', 'Pages', 'for', 'logged', 'out', 'editors', 'learn', 'more', 'ContributionsTalk', 'Contents', 'move', 'to', 'sidebar', 'hide', '(', 'Top', ')', '1Inspiration', 'for', 'the', 'character', '2Fictional', 'character', 'biography', 'Toggle', 'Fictional', 'character', 'biography', 'subsection', '2', '.', '1Family', 'and', 'early', 'life', '2', '.', '2Life', 'with', 'Watson', '2', '.', '3Practice', '2', '.', '4The', 'Great', 'Hiatus', '2', '.', '

## Exercise

Suppose you needed to tokenize lots of Wikipedia pages like this. Can you come up with a strategy for jumping straight to the content like we did with the Project Gutenberg book?

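One strategy: identify the element that wraps the article body, then extract text from only that element instead of the whole page. The first link printed earlier is `#bodyContent`, which suggests the page has an element with that `id`.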
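Here is a small sketch of that strategy on a made-up HTML snippet, so it runs without a network connection. It assumes the main content sits in an element with `id="bodyContent"`, based on the `#bodyContent` link in the list of links above. The real page layout may differ or change over time.

```python
from bs4 import BeautifulSoup

# made-up page that imitates a sidebar plus a main content area
sample_html = """
<html><body>
  <div id="sidebar"><a href="/wiki/Main_Page">Main page</a><a href="/wiki/Help">Help</a></div>
  <div id="bodyContent"><p>Sherlock Holmes is a fictional detective.</p><p>He lives at 221B Baker Street.</p></div>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")
main = soup.find(id="bodyContent")  # returns None if no element has this id

# separator=" " keeps text from neighboring tags from running together
content_text = main.get_text(separator=" ", strip=True)
print(content_text)
```

Passing a separator also avoids run-together tokens like `pageContentsCurrent` in the output above, which come from adjacent tags having no whitespace between them.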

## NLTK Tokenizers

NLTK has some tokenizers - the `punkt` tokenizer is the most popular.

It can tokenize by words:


In [62]:
import nltk
import requests

nltk.download("punkt") #need to do this the first time you run it

response = requests.get("https://www.gutenberg.org/files/1661/1661-0.txt")
sherlock_raw_text = response.text

sherlock_words = nltk.word_tokenize(sherlock_raw_text)
print(sherlock_words[:1000])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['ï', '»', '¿The', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', ',', 'by', 'Arthur', 'Conan', 'Doyle', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.', 'You', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'www.gutenberg.org', '.', 'If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States', ',', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'eBook', '.', 'Title', ':', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', 'Author', ':', 'Arthur', 'Conan', 'Doyle', 'Release', 'Date', ':', 'November', '29

It can also tokenize by sentences:

In [63]:
import nltk
import requests

#nltk.download("punkt") #only need to do this once

response = requests.get("https://www.gutenberg.org/files/1661/1661-0.txt")
sherlock_raw_text = response.text

sherlock_sentences = nltk.sent_tokenize(sherlock_raw_text)
print(sherlock_sentences[:100])

['ï»¿The Project Gutenberg eBook of The Adventures of Sherlock Holmes,\r\nby Arthur Conan Doyle\r\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever.', 'You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online at\r\nwww.gutenberg.org.', 'If you are not located in the United States, you\r\nwill have to check the laws of the country where you are located before\r\nusing this eBook.', 'Title: The Adventures of Sherlock Holmes\r\n\r\nAuthor: Arthur Conan Doyle\r\n\r\nRelease Date: November 29, 2002 [eBook #1661]\r\n[Most recently updated: October 10, 2023]\r\n\r\nLanguage: English\r\n\r\nCharacter set encoding: UTF-8\r\n\r\nProduced by: an anonymous Project Gutenberg volunteer and Jose Menendez\r\n\r\n*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK\r\nHOLMES ***\r\n\r\n\r\n\r\n\r\nThe A

## Exercise

It seems that there are still some strange characters - can you preprocess the text to fix them before using the NLTK tokenizer?

Could you structure the words by sentences like we did earlier?

In [64]:
import nltk
import requests
import re


# nltk.download("punkt")

response = requests.get("https://www.gutenberg.org/files/1661/1661-0.txt")

# Preprocess: the strange characters come from decoding UTF-8 bytes as ISO-8859-1,
# so set the encoding that chardet detected earlier before reading response.text
response.encoding = "utf-8-sig"
sherlock_raw_text = response.text

# Structure the words by sentences
sherlock_sentences = nltk.sent_tokenize(sherlock_raw_text)

# Tokenize each sentence
sherlock_tokenized_sentences = []
for sentence in sherlock_sentences:
    # Keep only ASCII letters and whitespace (this also drops digits and punctuation), then tokenize by words
    sentence = re.sub(r'[^A-Za-z\s]', '', sentence)
    tokens = nltk.word_tokenize(sentence)
    sherlock_tokenized_sentences.append(tokens)

# Print the first few tokenized sentences
for tokens in sherlock_tokenized_sentences[:5]:
    print(tokens)


['The', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', 'by', 'Arthur', 'Conan', 'Doyle', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever']
['You', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 'reuse', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'wwwgutenbergorg']
['If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'eBook']
['Title', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', 'Author', 'Arthur', 'Conan', 'Doyle', 'Release', 'Date', 'November', 'eBook', 'Most', 'recently', 'updated', 'October', 'Languag

## Automatic Tokenizers

Rather than having to program specific rules for how to tokenize your text, you could learn to do it automatically.

Two popular algorithms:
* Byte-Pair Encoding tokenization (used by OpenAI's GPT)
* WordPiece tokenization (used by Google's BERT)

Main idea:
* do some normalization and pre-tokenization, like the rule-based tokenization we used above to break the text into words separated by spaces
* start with a vocabulary where each character is a different possible token
* find the most frequent pair of consecutive tokens and merge it into a new token
* keep going until the vocabulary reaches the desired size

Frequent words - don't break them apart

Less-frequent words - represent them as several subwords

For WordPiece, a `##` prefix marks a token that continues the current word rather than starting a new one.
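The merge loop in the main idea above can be sketched in plain Python. This is a toy version of Byte-Pair Encoding training, not the implementation used by any real tokenizer: `train_bpe` is a hypothetical name, the word counts are made-up toy data, and real tokenizers add normalization and other details. WordPiece follows the same loop but, as described in the Hugging Face course linked in the references, picks the pair to merge using a score rather than the raw count.

```python
from collections import Counter

def train_bpe(word_counts, num_merges):
    """Toy BPE training: return the merges learned and the final tokenized words."""
    # start with every word split into single-character tokens
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # count every pair of consecutive tokens, weighted by how often the word occurs
        pair_counts = Counter()
        for tokens, count in corpus.items():
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # rewrite each word, replacing the best pair with one merged token
        new_corpus = {}
        for tokens, count in corpus.items():
            merged, i = [], 0
            while i < len(tokens):
                if tokens[i:i + 2] == best:
                    merged.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            key = tuple(merged)
            new_corpus[key] = new_corpus.get(key, 0) + count
        corpus = new_corpus
    return merges, corpus

# made-up word frequencies
toy_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, final_corpus = train_bpe(toy_counts, num_merges=3)
print(merges)
print(final_corpus)
```

After three merges the frequent word "hug" has become a single token, while less frequent words like "pug" are still made of smaller pieces.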

In [65]:
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


response = requests.get("https://www.gutenberg.org/files/1661/1661-0.txt")
sherlock_raw_text = response.text

sherlock_hf_tokens = tokenizer.tokenize( sherlock_raw_text )
print(sherlock_hf_tokens[:1000])

tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (143245 > 512). Running this sequence through the model will result in indexing errors


['ï', '»', '¿', 'The', 'Project', 'G', '##ute', '##nberg', 'e', '##B', '##ook', 'of', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', ',', 'by', 'Arthur', 'Conan', 'Doyle', 'This', 'e', '##B', '##ook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.', 'You', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're', '-', 'use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'G', '##ute', '##nberg', 'License', 'included', 'with', 'this', 'e', '##B', '##ook', 'or', 'online', 'at', 'www', '.', 'gut', '##enberg', '.', 'org', '.', 'If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States', ',', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'e', '##B', '##ook', '.', 'Title', ':', 'The', 'Adventures', 'of'

## Applied Exploration

Find some new text, tokenize it according to one or more of the methods discussed here

Use it as input for the Markov Chain in the previous set of notes

Describe what you did and record notes about your results



In [68]:
import nltk
import requests

res = requests.get("https://gutenberg.org/cache/epub/236/pg236.txt")
res.encoding = "utf-8-sig"  # decode as UTF-8 and drop the byte order mark at the start of the file
raw_text1 = res.text

# Tokenize by sentences using NLTK
nltk.download("punkt")  # Download the punkt tokenizer if not already downloaded
sentence_tokens_nltk = nltk.sent_tokenize(raw_text1)

# Print the first few sentence tokens
for sentence in sentence_tokens_nltk[:5]:
    print(sentence)

﻿The Project Gutenberg eBook of The Jungle Book
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever.
You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org.
If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.
Title: The Jungle Book


Author: Rudyard Kipling

Release date: January 16, 2006 [eBook #236]
                Most recently updated: May 1, 2023

Language: English



*** START OF THE PROJECT GUTENBERG EBOOK THE JUNGLE BOOK ***



THE JUNGLE BOOK

By Rudyard Kipling



Contents

     Mowgli’s Brothers
     Hunting-Song of the Seeonee Pack
     Kaa’s Hunting
     Road-Song of the Bandar-Log
     “Tiger!
Tiger!”
      Mowgli’s Song
     The White Seal
     Lukann

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The word tokens from the simple `split` method were a long list that included special characters and metadata from the Project Gutenberg header.

---
The sentence tokens from NLTK's punkt tokenizer were mostly meaningful sentences from the text.

---
The punkt tokenizer uses a trained model of sentence boundaries, so it handles cases like abbreviations more accurately than a simple split on periods. It is not perfect, though: in the output above, “Tiger! Tiger!” in the table of contents was split into two sentences.