# Tokenization



Tokenization, or [text segmentation](https://en.wikipedia.org/wiki/Text_segmentation), is the task of dividing a string of written language into its component words.

The simplest way to divide a text into a list of words is to split it on whitespace.


In [None]:
text = "Let's eat, grandpa"
print(text.split())

["Let's", 'eat,', 'grandpa']


The problem with this approach is that contractions are not split (Let's -> Let + s) and punctuation marks stay attached to the nearest word (eat, -> eat + ,).

A more robust approach is to use a dedicated tokenizer. Most NLP libraries provide their own. Here we will use tokenizers from the [NLTK](https://www.nltk.org/) library.

The NLTK library offers many [tokenizers](https://www.nltk.org/api/nltk.tokenize.html). We'll work with the WordPunctTokenizer.

But first let's install NLTK and download the necessary resources.




In [None]:
!pip install nltk



In [None]:
import nltk
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
[nlt

True

# Apply the WordPunctTokenizer

We get a different result: punctuation marks are now separate tokens, and the contraction is split.

In [None]:
from nltk.tokenize import WordPunctTokenizer
tokens = WordPunctTokenizer().tokenize("Let's eat your soup, Grandpa.")
print(tokens)

['Let', "'", 's', 'eat', 'your', 'soup', ',', 'Grandpa', '.']
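To make the behaviour above less of a black box, here is a minimal sketch that uses only the standard library. It assumes that WordPunctTokenizer is equivalent to matching the regular expression `\w+|[^\w\s]+` (runs of word characters, or runs of non-space punctuation), which is the pattern given in the NLTK documentation. The function name `word_punct_tokenize` is illustrative and not part of NLTK.

```python
import re

# Assumed pattern: a run of word characters, or a run of non-space punctuation
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Split text into alphanumeric tokens and punctuation tokens."""
    return WORD_PUNCT_PATTERN.findall(text)

print(word_punct_tokenize("Let's eat your soup, Grandpa."))
# ['Let', "'", 's', 'eat', 'your', 'soup', ',', 'Grandpa', '.']
```

Because the pattern never matches whitespace, spaces act only as separators and do not appear in the output.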


Let's tokenize the text from the Wikipedia Earth page and look at the frequency of the most common words.

In [None]:
from nltk.tokenize import WordPunctTokenizer
from collections import Counter
import requests

def wikipedia_page(title):
    '''
    Return the raw text of a Wikipedia page
    given the title of the page.
    '''
    params = {
        'action': 'query',
        'format': 'json',  # request JSON-formatted content
        'titles': title,  # title of the Wikipedia page
        'prop': 'extracts',
        'explaintext': True
    }
    # send a request to the Wikipedia API
    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params=params
    ).json()

    # the result maps page ids to pages; take the first (and only) page
    page = next(iter(response['query']['pages'].values()))
    # return the page content, or a placeholder if there is no extract
    if 'extract' in page:
        return page['extract']
    else:
        return "Page not found"

In [None]:
text = wikipedia_page('Earth').lower()
tokens = WordPunctTokenizer().tokenize(text)
print(Counter(tokens).most_common(20))

[('the', 736), (',', 589), ('.', 490), ('of', 362), ('and', 288), ('earth', 261), ('is', 174), ('to', 165), ('s', 160), ("'", 159), ('in', 157), ('a', 140), ('(', 113), ('-', 78), ('by', 77), ('as', 74), ('with', 73), ('from', 70), ('surface', 66), ('at', 59)]


We now see that, for instance, earth and earth's are no longer separate tokens, and that punctuation marks are standalone tokens. This will come in handy if we want to remove them.
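As a rough illustration of that removal step, the sketch below drops every token made up only of ASCII punctuation. The helper name `remove_punctuation` and the sample token list are made up for this example. Note that `string.punctuation` covers ASCII only, so characters such as the Unicode minus sign would need separate handling.

```python
import string

def remove_punctuation(tokens):
    """Drop tokens that consist only of ASCII punctuation characters."""
    punctuation = set(string.punctuation)
    return [t for t in tokens if not all(ch in punctuation for ch in t)]

# Hypothetical tokens, similar to what WordPunctTokenizer produces
sample_tokens = ['earth', "'", 's', 'surface', ',', 'the', '(', 'sun', ')', '.', '===']
print(remove_punctuation(sample_tokens))
# ['earth', 's', 'surface', 'the', 'sun']
```

The leftover `s` from `earth's` survives this filter, so a later step (for example a stopword list) would still be needed to get rid of it.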

# Tokenization on characters

We can also tokenize on characters instead of words.


In [None]:
# example of character tokenization: every character becomes a token
char_tokens = list(text)
print("Most common characters in the text")
print(Counter(char_tokens).most_common(20))
print()
print(f"All characters in the text: \n{set(char_tokens)}")

Most common characters in the text
[(' ', 8960), ('e', 5579), ('t', 4310), ('a', 4142), ('i', 3293), ('o', 3229), ('s', 3115), ('r', 3059), ('n', 3033), ('h', 2241), ('l', 2022), ('c', 1575), ('d', 1475), ('m', 1300), ('u', 1164), ('f', 951), ('p', 943), ('g', 841), ('y', 676), (',', 632)]

All characters in the text: 
{'p', 'l', '1', 'x', '×', '−', 'n', 'ē', 'þ', '[', 'w', 'v', '-', 'i', 'ɡ', 'a', '2', 'µ', '*', 'y', 'k', '4', '0', '—', '±', '=', 'ć', 'c', 'f', '5', 'á', 't', 'â', '(', '6', 'ῖ', 'q', 'ū', 'ð', ')', 'e', '–', 'α', 'h', 'ʻ', 'u', '8', "'", 'ñ', '+', 'o', 'γ', '"', '3', 'j', '\n', 's', '7', '%', ':', '/', '̯', 'ö', ']', 'æ', 'č', 'r', '°', ',', ';', 'd', 'ō', '?', ' ', 'b', '.', 'm', 'z', '9', 'g', 'ῆ'}


# N-grams


Some words are better taken together: New York, happy end, Wall Street, linear regression, etc. When tokenizing, we may therefore also want to consider all pairs of adjacent words in the text (bigrams) or, more generally, all sequences of n adjacent tokens (n-grams). We can do this with the NLTK ngrams function.


In [None]:
from nltk import ngrams

text = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?".lower()

# Tokenize
tokens = WordPunctTokenizer().tokenize(text)

# bigrams, as tuples of adjacent tokens
bigrams = list(ngrams(tokens, n=2))
print(bigrams)

print()
# join each bigram into a single underscore-separated token
bigrams = ['_'.join(bg) for bg in bigrams]
print(bigrams)



[('how', 'much'), ('much', 'wood'), ('wood', 'would'), ('would', 'a'), ('a', 'woodchuck'), ('woodchuck', 'chuck'), ('chuck', 'if'), ('if', 'a'), ('a', 'woodchuck'), ('woodchuck', 'could'), ('could', 'chuck'), ('chuck', 'wood'), ('wood', '?')]

['how_much', 'much_wood', 'wood_would', 'would_a', 'a_woodchuck', 'woodchuck_chuck', 'chuck_if', 'if_a', 'a_woodchuck', 'woodchuck_could', 'could_chuck', 'chuck_wood', 'wood_?']


In [None]:
# and for trigrams

trigrams = ['_'.join(w) for w in ngrams(tokens, n=3)]

print(trigrams)


['how_much_wood', 'much_wood_would', 'wood_would_a', 'would_a_woodchuck', 'a_woodchuck_chuck', 'woodchuck_chuck_if', 'chuck_if_a', 'if_a_woodchuck', 'a_woodchuck_could', 'woodchuck_could_chuck', 'could_chuck_wood', 'chuck_wood_?']
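For intuition about what an n-gram function computes, here is a small pure-Python sketch that slides a window of size n over the token list. It is not NLTK's implementation, and the name `make_ngrams` is chosen for this example only.

```python
def make_ngrams(tokens, n):
    """Return all sequences of n adjacent tokens, as a list of tuples."""
    # zip stops at the shortest shifted copy, which yields exactly
    # the windows that fit entirely inside the token list
    return list(zip(*(tokens[i:] for i in range(n))))

sample = ['how', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck']
print(make_ngrams(sample, 2))
print(['_'.join(g) for g in make_ngrams(sample, 3)])
```

If the list has fewer than n tokens, no window fits and the result is an empty list.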


# Add n-grams to the list of tokens
Let's add the bigrams and trigrams to the list of tokens from the Wikipedia Earth page and look at the frequency of the n-grams.

In [None]:
text = wikipedia_page('Earth').lower()
unigrams = WordPunctTokenizer().tokenize(text)
bigrams = ['_'.join(w) for w in ngrams(unigrams, n=2)]
trigrams = ['_'.join(w) for w in ngrams(unigrams, n=3)]

In [None]:
tokens = unigrams + bigrams + trigrams

In [None]:
print(f"we have a total of {len(tokens)} tokens, including: \n- {len(unigrams)} unigrams \n- {len(bigrams)} bigrams \n- {len(trigrams)} trigrams. ")

we have a total of 33798 tokens, including: 
- 11267 unigrams 
- 11266 bigrams 
- 11265 trigrams. 
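The counts above follow a simple rule: a sequence of N tokens contains N - n + 1 n-grams, since each n-gram starts at one of the first N - n + 1 positions. The short check below reproduces the numbers printed above. The helper name `ngram_count` is only for illustration.

```python
def ngram_count(num_tokens, n):
    """Number of n-grams in a sequence of num_tokens tokens."""
    return max(num_tokens - n + 1, 0)

num_unigrams = 11267  # number of unigrams reported above
counts = [ngram_count(num_unigrams, n) for n in (1, 2, 3)]
print(counts)       # [11267, 11266, 11265]
print(sum(counts))  # 33798
```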


In [None]:
Counter(tokens).most_common(50)

[('the', 736),
 (',', 589),
 ('.', 490),
 ('of', 362),
 ('and', 288),
 ('earth', 261),
 ('is', 174),
 ('to', 165),
 ('s', 160),
 ("'", 159),
 ("'_s", 159),
 ('in', 157),
 ('a', 140),
 ("earth_'", 137),
 ("earth_'_s", 137),
 ('(', 113),
 ('of_the', 98),
 ('._the', 81),
 ('-', 78),
 ('by', 77),
 ('as', 74),
 ('with', 73),
 ('from', 70),
 (',_and', 67),
 ('surface', 66),
 ('at', 59),
 ('that', 58),
 ('in_the', 56),
 (')', 54),
 (',_the', 53),
 ('water', 52),
 ('of_earth', 51),
 ('sun', 50),
 ('are', 49),
 ('it', 46),
 ('about', 44),
 ('the_sun', 43),
 ('===', 42),
 ('to_the', 42),
 ('this', 41),
 ('its', 41),
 ('atmosphere', 41),
 ('on', 41),
 ('solar', 40),
 ('land', 40),
 ('which', 36),
 ('crust', 36),
 ('has', 36),
 ('million', 36),
 ("of_earth_'", 36)]

Several bigrams made of two words appear among the top 50 tokens, for example:

- of_the
- of_earth
- in_the

Adding n-grams to the list of tokens may help later on when classifying text.
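Many of the frequent n-grams in the output above, such as `'_s` or `._the`, merely glue punctuation to a neighbouring word. One possible way to keep only word-based n-grams is sketched below. The function name `alphabetic_ngrams` and the sample sentence are assumptions made for this example, and `str.isalpha` is a deliberately crude filter that also discards numbers.

```python
from collections import Counter

def alphabetic_ngrams(tokens, n):
    """Build underscore-joined n-grams, keeping only those made of alphabetic tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return ['_'.join(g) for g in grams if all(w.isalpha() for w in g)]

# Hypothetical pre-tokenized sentence
sample = "the surface of the earth , and the atmosphere of the earth .".split()
bigram_counts = Counter(alphabetic_ngrams(sample, 2))
print(bigram_counts.most_common(5))
```

Here `of_the` and `the_earth` each occur twice, while bigrams touching the comma or the final period are dropped.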