## Tokenization

***Tokenization is the process of splitting a phrase, sentence, paragraph, or entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.***

### Why is Tokenization required in NLP?


Before we can process natural-language text, we need to identify the words that make up a string of characters. That is why tokenization is the most basic first step when working with text data in NLP. It matters because much of the meaning of a text can be recovered by analyzing the words it contains.

Let’s take an example. Consider the following string:

“This is a cat.”

What happens when we tokenize this string? We get [‘This’, ‘is’, ‘a’, ‘cat’].

Tokenizing text has many uses. For example, we can use the tokenized form to:

- Count the number of words in the text
- Count the frequency of each word, that is, the number of times a particular word appears in the text
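Both counts can be sketched with the standard library. The snippet below is only a minimal illustration: the sample sentence and the names `sample`, `tokens`, `word_count`, and `frequencies` are assumptions made for this example, and the simple `\w+` pattern stands in for whichever tokenizer you actually use.

```python
import re
from collections import Counter

# Illustrative sample text (an assumption, not taken from the example above).
sample = "This is a cat. This cat is small."

# Lowercase first so that "This" and "this" count as the same word,
# then treat every run of word characters as one token.
tokens = re.findall(r"\w+", sample.lower())

# Number of words in the text.
word_count = len(tokens)

# Number of times each word occurs.
frequencies = Counter(tokens)

print(word_count)
print(frequencies.most_common(3))
```

`Counter.most_common(n)` returns the `n` most frequent tokens with their counts, which is often the first thing to inspect in a new corpus.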

### How to Perform Tokenization

> #### Tokenization using Python’s split() function

**Word Tokenization**

In [31]:
%%time
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization. A multi-planet species by building a self-sustaining city on Mars."""
# Splits at whitespace
text.split()

CPU times: user 46 µs, sys: 1e+03 ns, total: 47 µs
Wall time: 53.2 µs


['Founded',
 'in',
 '2002,',
 'SpaceX’s',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization.',
 'A',
 'multi-planet',
 'species',
 'by',
 'building',
 'a',
 'self-sustaining',
 'city',
 'on',
 'Mars.']
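Note that `split()` leaves punctuation attached to the neighbouring word (`'2002,'`, `'civilization.'`, `'Mars.'`). One possible way to tidy this up, sketched here under the assumption that only ASCII punctuation at the ends of tokens should be removed, is to strip each token with `string.punctuation`. The shortened `text` and the `tokens` name are illustrative.

```python
import string

# Shortened version of the sample text, used here for illustration.
text = "Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization."

# Split at whitespace, then trim ASCII punctuation from both ends of each token.
# Characters inside a token (such as the apostrophe in "SpaceX’s") are left alone.
tokens = [word.strip(string.punctuation) for word in text.split()]

# Drop any token that consisted only of punctuation.
tokens = [token for token in tokens if token]

print(tokens)
```

This keeps hyphenated and possessive words intact, but it is still a heuristic: it does not separate punctuation into its own tokens the way a dedicated tokenizer does.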

**Sentence Tokenization**

In [32]:
%%time
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization. A multi-planet species by building a self-sustaining city on Mars."""
# Splits at '. ' (a period followed by a space)
text.split('. ')

CPU times: user 42 µs, sys: 1 µs, total: 43 µs
Wall time: 48.9 µs


['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization',
 'A multi-planet species by building a self-sustaining city on Mars.']

> #### Tokenization using regular expressions (re module)

In [17]:
import re

In [33]:
%%time
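text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization. A multi-planet species by building a self-sustaining city on Mars."""
# Matches runs of word characters and straight apostrophes.
# The curly apostrophe in SpaceX’s and the hyphens are not matched, so those words are split.
word = re.findall(r"[\w']+", text)
word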

CPU times: user 42 µs, sys: 0 ns, total: 42 µs
Wall time: 46 µs


['Founded',
 'in',
 '2002',
 'SpaceX',
 's',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization',
 'A',
 'multi',
 'planet',
 'species',
 'by',
 'building',
 'a',
 'self',
 'sustaining',
 'city',
 'on',
 'Mars']
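The output above splits `SpaceX’s` and the hyphenated words because the character class only contains word characters and the straight apostrophe. A possible variation, shown here as a sketch rather than a general-purpose tokenizer, adds the curly apostrophe and the hyphen to the class. The `pattern` and `words` names are illustrative.

```python
import re

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization. A multi-planet species by building a self-sustaining city on Mars."""

# Word characters plus straight apostrophe, curly apostrophe, and hyphen.
# The hyphen is placed last in the class so it is read literally, not as a range.
pattern = r"[\w'’-]+"

words = re.findall(pattern, text)
print(words)
```

With this pattern, `SpaceX’s`, `multi-planet`, and `self-sustaining` each stay together as one token. The trade-off is that a free-standing hyphen or apostrophe in other text would also be returned as a token.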

In [38]:
%%time
# Splits at '.', '!' or '?' followed by a space (uses `text` from the previous cell)
sentences = re.compile(r'[.!?] ').split(text)
sentences

CPU times: user 30 µs, sys: 1 µs, total: 31 µs
Wall time: 35 µs


['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization',
 'A multi-planet species by building a self-sustaining city on Mars.']
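Splitting on `'[.!?] '` consumes the punctuation mark, which is why the first sentence above has lost its final period. If the punctuation should be kept, one option is a lookbehind assertion, so that only the whitespace after the mark is used as the separator. This is a minimal sketch; it will still mis-split abbreviations such as "Dr. Smith".

```python
import re

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization. A multi-planet species by building a self-sustaining city on Mars."""

# (?<=[.!?]) requires a '.', '!' or '?' immediately before the split point
# without consuming it; \s+ is the whitespace that actually gets removed.
sentences = re.split(r"(?<=[.!?])\s+", text)

for sentence in sentences:
    print(sentence)
```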

> #### Tokenization using NLTK functions

In [43]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [44]:
%%time
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization. A multi-planet species by building a self-sustaining city on Mars."""
# Splits into word and punctuation tokens
word_tokenize(text)


CPU times: user 533 µs, sys: 0 ns, total: 533 µs
Wall time: 538 µs


['Founded',
 'in',
 '2002',
 ',',
 'SpaceX',
 '’',
 's',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization',
 '.',
 'A',
 'multi-planet',
 'species',
 'by',
 'building',
 'a',
 'self-sustaining',
 'city',
 'on',
 'Mars',
 '.']

In [58]:
%%time
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization. A multi-planet species by building a self-sustaining city on Mars."""
# Splits into sentences, keeping the closing punctuation
sent_tokenize(text)

CPU times: user 194 µs, sys: 1e+03 ns, total: 195 µs
Wall time: 199 µs


['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization.',
 'A multi-planet species by building a self-sustaining city on Mars.']

> #### Tokenization using regexp_tokenize function

In [56]:
from nltk.tokenize.regexp import regexp_tokenize

In [59]:
%%time
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization. A multi-planet species by building a self-sustaining city on Mars."""
# Splits at whitespace; gaps=True means the pattern matches the separators, not the tokens
regexp_tokenize(text, pattern=r'\s+', gaps=True)

CPU times: user 71 µs, sys: 1e+03 ns, total: 72 µs
Wall time: 76.1 µs


['Founded',
 'in',
 '2002,',
 'SpaceX’s',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization.',
 'A',
 'multi-planet',
 'species',
 'by',
 'building',
 'a',
 'self-sustaining',
 'city',
 'on',
 'Mars.']

In [61]:
%%time
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization. A multi-planet species by building a self-sustaining city on Mars."""
# Splits at '.', '!' or '?'; the space after the punctuation stays on the next sentence
regexp_tokenize(text, pattern=r'[.!?]', gaps=True)

CPU times: user 51 µs, sys: 0 ns, total: 51 µs
Wall time: 55.1 µs


['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization',
 ' A multi-planet species by building a self-sustaining city on Mars']
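The second sentence above begins with a space because only the punctuation mark is treated as the gap. A simple clean-up is to strip each piece and discard empty ones. The sketch below uses the standard-library `re.split` as a stand-in for `regexp_tokenize`, so that it runs without NLTK installed; the same list comprehension could be applied to the NLTK result.

```python
import re

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization. A multi-planet species by building a self-sustaining city on Mars."""

# Split at sentence-ending punctuation. re.split also returns an empty
# string after the final period, which is filtered out below.
parts = re.split(r"[.!?]", text)

# Remove surrounding whitespace and drop empty pieces.
sentences = [part.strip() for part in parts if part.strip()]

print(sentences)
```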