# <span style = "color:green"> What is Tokenization?</span>

***

Tokenization is one of the most common tasks when it comes to working with text data. But what does the term 'tokenization' actually mean?

Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

Let's visualize it with an example:

<center><div class="alert alert-block alert-warning">
<b>Natural Language Processing</b>
</div></center>

<center><div class="alert alert-block alert-success">
<b>['Natural','Language','Processing']</b>
</div></center>

The tokens could be words, numbers, or punctuation marks. In tokenization, smaller units are created by locating word boundaries. Wait, what are word boundaries?

A word boundary is the point where one word ends and the next word begins. Tokenization is also the first step towards stemming and lemmatization (we will discuss these later on).
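
To make the idea of word boundaries concrete, here is a minimal sketch that uses Python's built-in re module. The helper name simple_tokenize and the sample sentence are made up for illustration; this is only one of many possible ways to locate boundaries.

```python
import re

def simple_tokenize(text):
    # \w+ matches a run of letters, digits or underscores (a "word").
    # [^\w\s] matches a single character that is neither a word
    # character nor whitespace, i.e. a punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Natural Language Processing, in 3 steps!")
print(tokens)
# ['Natural', 'Language', 'Processing', ',', 'in', '3', 'steps', '!']
```

Words, numbers, and punctuation marks each come out as separate tokens, and whitespace is discarded.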

### Why is Tokenization required in NLP?

I want you to think about the English language here. Pick any sentence you can think of and hold it in your mind as you read this section. This will make it much easier to understand why tokenization matters.

Before processing natural language text, we need to identify the words that make up a string of characters. That's why tokenization is the most basic step when working with text data in NLP. It matters because the meaning of a text can often be interpreted by analysing the words it contains.

Let's take an example. Consider the string below:
<center>"This is a cat"</center>

What do you think will happen after we perform tokenization on this string? We get
<center>['This','is','a','cat']</center>

There are numerous uses for this. We can use the tokenized form to:
* Count the number of words in the text.
* Count the frequency of each word, that is, the number of times a particular word occurs.
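
Both uses can be sketched in a few lines with collections.Counter from the standard library. The token list below is a made-up extension of the cat example, chosen so that some words repeat.

```python
from collections import Counter

# Hypothetical token list, in the style of the example above
tokens = ['This', 'is', 'a', 'cat', 'and', 'this', 'is', 'a', 'dog']

# Number of words in the text
word_count = len(tokens)

# Frequency of each word; lowercase first so that
# 'This' and 'this' are counted as the same word
word_freq = Counter(token.lower() for token in tokens)

print(word_count)         # 9
print(word_freq['this'])  # 2
print(word_freq['cat'])   # 1
```

Whether to lowercase before counting is a design choice: it merges 'This' and 'this', but it also merges proper nouns with ordinary words.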

And so on. We can extract a lot more information, which we'll discuss in detail in future articles. For now, it's time to dive into the meat of this article: the different methods of performing tokenization in NLP.

## Methods to perform Tokenization in Python

We are going to look at different ways we can perform tokenization on text data.

### 1. Tokenization using Python's split() function

#### Word tokenization

In [3]:
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

# With no argument, split() splits on any run of whitespace (spaces, tabs, newlines)
text.split() 

['Founded',
 'in',
 '2002,',
 'SpaceX’s',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization',
 'and',
 'a',
 'multi-planet',
 'species',
 'by',
 'building',
 'a',
 'self-sustaining',
 'city',
 'on',
 'Mars.',
 'In',
 '2008,',
 'SpaceX’s',
 'Falcon',
 '1',
 'became',
 'the',
 'first',
 'privately',
 'developed',
 'liquid-fuel',
 'launch',
 'vehicle',
 'to',
 'orbit',
 'the',
 'Earth.']

#### Sentence tokenization

This is similar to word tokenization, except that the unit of analysis is now the sentence. A sentence usually ends with a full stop (.), so we can use ". " (a full stop followed by a space) as the separator.

In [4]:
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# Splits at '. ' (a full stop followed by a space)
text.split('. ') 

['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars',
 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']
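
The output above already hints at one limitation: the first sentence has lost its full stop, because split() discards the separator. A second limitation shows up with abbreviations. The short sketch below uses a made-up sentence to illustrate it.

```python
# Hypothetical sample text containing an abbreviation
text = "Dr. Smith joined SpaceX in 2008. He worked on Falcon 1."

# Naive sentence splitting on '. ' (a full stop followed by a space)
sentences = text.split('. ')
print(sentences)
# ['Dr', 'Smith joined SpaceX in 2008', 'He worked on Falcon 1.']
```

The abbreviation "Dr." is treated as the end of a sentence, so two sentences come out as three pieces.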

<div class="alert alert-block alert-danger">
One major drawback of Python's split() method is that it accepts only one separator at a time. Another thing to note: in word tokenization, split() did not treat punctuation as a separate token, so '2002,' and 'Mars.' keep their trailing punctuation.
</div>
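
One possible way around the single-separator limit is re.split() from Python's built-in re module, which accepts a pattern instead of a fixed string. This is a minimal sketch with a made-up sample text, not a robust sentence splitter.

```python
import re

# Hypothetical sample text with three different sentence endings
text = "Is tokenization hard? Not really! It only takes a few lines."

# Split on whitespace that follows '.', '?' or '!'.
# The lookbehind (?<=[.?!]) checks for the punctuation without
# consuming it, so each sentence keeps its final mark.
sentences = re.split(r"(?<=[.?!])\s+", text)
print(sentences)
# ['Is tokenization hard?', 'Not really!', 'It only takes a few lines.']
```

This pattern still splits after abbreviations such as "Dr.", so it shares that weakness with split('. ').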

### 2. Tokenization with NLTK

Now, this is a library you will appreciate more and more as you work with text data. NLTK, short for Natural Language Toolkit, is a library written in Python for symbolic and statistical natural language processing.

NLTK contains a module called tokenize, which provides two functions that we will use here:
* <b>Word Tokenize:</b> We use the word_tokenize() function to split a sentence into tokens or words
* <b>Sentence Tokenize:</b> We use the sent_tokenize() function to split a document or paragraph into sentences

Let's look at both of these one by one.

#### Word Tokenization

In [5]:
from nltk.tokenize import word_tokenize 
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
word_tokenize(text)

['Founded',
 'in',
 '2002',
 ',',
 'SpaceX',
 '’',
 's',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization',
 'and',
 'a',
 'multi-planet',
 'species',
 'by',
 'building',
 'a',
 'self-sustaining',
 'city',
 'on',
 'Mars',
 '.',
 'In',
 '2008',
 ',',
 'SpaceX',
 '’',
 's',
 'Falcon',
 '1',
 'became',
 'the',
 'first',
 'privately',
 'developed',
 'liquid-fuel',
 'launch',
 'vehicle',
 'to',
 'orbit',
 'the',
 'Earth',
 '.']

Both word_tokenize() and sent_tokenize() rely on NLTK's pretrained Punkt models. If the models are not installed yet, these calls raise a LookupError, so download them once with the cell below (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt').

In [4]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [6]:
from nltk.tokenize import sent_tokenize, word_tokenize
t="One of the first steps in cleaning data for NLP is to remove unwanted characters from your text, such as punctuation, numbers, symbols, HTML tags, or emojis. These characters can introduce noise and ambiguity to your data, and may not be relevant for your NLP task."
st=sent_tokenize(t)
print(st)

['One of the first steps in cleaning data for NLP is to remove unwanted characters from your text, such as punctuation, numbers, symbols, HTML tags, or emojis.', 'These characters can introduce noise and ambiguity to your data, and may not be relevant for your NLP task.']


In [8]:
w_t=[word_tokenize(i) for i in st]
print(w_t)

[['One', 'of', 'the', 'first', 'steps', 'in', 'cleaning', 'data', 'for', 'NLP', 'is', 'to', 'remove', 'unwanted', 'characters', 'from', 'your', 'text', ',', 'such', 'as', 'punctuation', ',', 'numbers', ',', 'symbols', ',', 'HTML', 'tags', ',', 'or', 'emojis', '.'], ['These', 'characters', 'can', 'introduce', 'noise', 'and', 'ambiguity', 'to', 'your', 'data', ',', 'and', 'may', 'not', 'be', 'relevant', 'for', 'your', 'NLP', 'task', '.']]


Notice how NLTK treats punctuation marks as tokens? For many downstream tasks, we therefore need to remove the punctuation tokens from this list.
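
Removing punctuation tokens can be sketched with string.punctuation from the standard library. The token list below is a shortened, hand-written list in the style of the word_tokenize() output above, so that the example runs without NLTK.

```python
import string

# Shortened, hand-written token list in the style of word_tokenize() output
tokens = ['One', 'of', 'the', 'first', 'steps', ',', 'such', 'as',
          'punctuation', ',', 'or', 'emojis', '.']

# string.punctuation contains the ASCII punctuation characters only
punctuation = set(string.punctuation)

# Drop a token if every character in it is a punctuation character
words = [tok for tok in tokens if not all(ch in punctuation for ch in tok)]
print(words)
# ['One', 'of', 'the', 'first', 'steps', 'such', 'as', 'punctuation', 'or', 'emojis']
```

Note that string.punctuation does not include the curly apostrophe (’) seen in the SpaceX text, so that token would survive this filter unless it is added to the set explicitly.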

#### Sentence Tokenization

In [6]:
from nltk.tokenize import sent_tokenize
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
sent_tokenize(text)

['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars.',
 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']

### 3. Tokenization with Keras

Keras! It is one of the hottest deep learning frameworks in the industry right now. It is an open-source neural network library for Python. Keras is super easy to use and can also run on top of TensorFlow.

In the NLP context, we can use Keras for cleaning the unstructured text data that we typically collect.

#### Word Tokenization

In [8]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence
# Define the sample text
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
# Tokenize: by default this lowercases the text and strips punctuation
result = text_to_word_sequence(text)
result

['founded',
 'in',
 '2002',
 'spacex’s',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization',
 'and',
 'a',
 'multi',
 'planet',
 'species',
 'by',
 'building',
 'a',
 'self',
 'sustaining',
 'city',
 'on',
 'mars',
 'in',
 '2008',
 'spacex’s',
 'falcon',
 '1',
 'became',
 'the',
 'first',
 'privately',
 'developed',
 'liquid',
 'fuel',
 'launch',
 'vehicle',
 'to',
 'orbit',
 'the',
 'earth']

In [9]:
from tensorflow.keras.preprocessing.text import text_to_word_sequence
t="One of the first steps in cleaning data for NLP is to remove unwanted characters from your text, such as punctuation, numbers, symbols, HTML tags, or emojis. These characters can introduce noise and ambiguity to your data, and may not be relevant for your NLP task."
print(text_to_word_sequence(t))

['one', 'of', 'the', 'first', 'steps', 'in', 'cleaning', 'data', 'for', 'nlp', 'is', 'to', 'remove', 'unwanted', 'characters', 'from', 'your', 'text', 'such', 'as', 'punctuation', 'numbers', 'symbols', 'html', 'tags', 'or', 'emojis', 'these', 'characters', 'can', 'introduce', 'noise', 'and', 'ambiguity', 'to', 'your', 'data', 'and', 'may', 'not', 'be', 'relevant', 'for', 'your', 'nlp', 'task']


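By default, Keras converts all letters to lowercase and strips punctuation before splitting the text into tokens. This also splits hyphenated words: 'multi-planet' becomes 'multi' and 'planet'. That saves us quite a lot of cleaning work, as you can imagine!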
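
To see why the output looks the way it does, here is a rough pure-Python approximation of that behaviour. It is a sketch, not the Keras source: the helper name to_word_sequence is made up, and the filter string is assumed to match the Keras default.

```python
# Assumed to match the default `filters` of text_to_word_sequence
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def to_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=' '):
    # Hypothetical helper that mimics the steps described above
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Split and drop the empty strings left by consecutive separators
    return [tok for tok in text.split(split) if tok]

print(to_word_sequence("Founded in 2002, SpaceX’s Falcon 1 is a liquid-fuel vehicle."))
# ['founded', 'in', '2002', 'spacex’s', 'falcon', '1', 'is', 'a', 'liquid', 'fuel', 'vehicle']
```

The curly apostrophe in 'spacex’s' is not in the filter string, which is why that token stays in one piece. The real text_to_word_sequence() accepts lower and filters arguments as well, so passing lower=False keeps the original case.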

## Conclusion

Tokenization is a critical step in the overall NLP pipeline. We cannot simply jump into the model-building part without cleaning the text first.

There are various other ways to tokenize text as well, but the ones we discussed are good enough to get started on the topic.

***