
#### This notebook covers two types of tokenization, word and sentence, and shows how to implement them with plain Python, regular expressions, and several NLP libraries.

## **What is Tokenization?**

> Tokenization is the process of splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a **token**. Tokens can be words, numbers, or punctuation marks.

Let’s take an example. Consider the below string:

“This is a cat.”

Output after tokenization:- ["This", "is", "a", "cat"]

## **Why do we need Tokenization?**
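> Most text-processing steps, such as counting word frequencies or building a vocabulary, work on individual units rather than on one raw string. Tokenization divides the text into those units, which makes it easier to interpret the text by analysing its words.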
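
As a small illustration of analysing text by its words, the sketch below counts token frequencies. The sample sentence and variable names are made up for this example.

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat too"

# Tokenize on whitespace, then count how often each token occurs.
tokens = text.split()
counts = Counter(tokens)

print(counts.most_common(2))   # [('the', 3), ('sat', 2)]
```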

### **1. Tokenization using Python's split() function:**

#### **Word Tokenization** :- Dividing the text into a list of words.

In [53]:
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

text.split()     # With no argument, split() breaks the text on runs of whitespace (spaces, tabs, newlines).

['Founded',
 'in',
 '2002,',
 'SpaceX’s',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization',
 'and',
 'a',
 'multi-planet',
 'species',
 'by',
 'building',
 'a',
 'self-sustaining',
 'city',
 'on',
 'Mars.',
 'In',
 '2008,',
 'SpaceX’s',
 'Falcon',
 '1',
 'became',
 'the',
 'first',
 'privately',
 'developed',
 'liquid-fuel',
 'launch',
 'vehicle',
 'to',
 'orbit',
 'the',
 'Earth.']

#### **Sentence Tokenization :-** Dividing the text into a list of sentences.

In [54]:
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

text.split(". ")   # Splits on the literal separator ". " (a period followed by a space), which is removed from the result.

['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars',
 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']

#### **Drawbacks of using Python's split() method for tokenization:**

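1.  Only one separator can be used at a time, so sentences ending in "!" or "?" are missed when splitting on ". ".
2.  In word tokenization, split() does not treat punctuation as a separate token: tokens such as '2002,' and 'Mars.' keep their punctuation attached.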
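
Both drawbacks can be reproduced on a short string. The sketch below uses a made-up sentence, and the final regular expression is one possible workaround, not the only one.

```python
import re

text = "Hello, world! Is split() enough? Not quite."

# Drawback 1: only one separator at a time. The text contains no ". ",
# so the "!" and "?" boundaries are missed and one piece comes back.
print(text.split(". "))

# Drawback 2: whitespace splitting leaves punctuation glued to words.
print(text.split())

# A regex can return words and punctuation marks as separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```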

### **2. Tokenization using Regular Expressions (RegEx):**

#### **Word Tokenization**

In [27]:
import re
text = "Hello, I'm Raj Sinha, a statistic graduate and a machine learning enthusiast !"
words = re.split(r"\W+", text)   # Splitting on runs of non-word characters (whitespace and punctuation)
print(words)

['Hello', 'I', 'm', 'Raj', 'Sinha', 'a', 'statistic', 'graduate', 'and', 'a', 'machine', 'learning', 'enthusiast', '']
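
The output above ends with an empty string and breaks "I'm" into two tokens. One way to avoid both, sketched here under the assumption that a single inner apostrophe should stay inside a word, is to match tokens with re.findall() instead of splitting on separators.

```python
import re

text = "Hello, I'm Raj Sinha, a statistic graduate and a machine learning enthusiast !"

# re.split() returns an empty string when the text ends with a separator.
split_words = re.split(r"\W+", text)
print(split_words[-1] == "")   # True

# re.findall() returns only the matches, so no empty strings appear.
# The optional group keeps an apostrophe inside a word ("I'm") together.
found_words = re.findall(r"\w+(?:'\w+)?", text)
print(found_words)
```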


In [38]:
# Removing punctuation from the text.
import re
import string

text = "I'm with you for the entire life in U.K.!"
# Split into words on whitespace.
words = text.split()
# Build a regex character class that matches any ASCII punctuation character.
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
# Remove punctuation from each word.
stripped = [re_punc.sub('', w) for w in words]
print(stripped)

['Im', 'with', 'you', 'for', 'the', 'entire', 'life', 'in', 'UK']
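
The same result can be obtained without a regular expression. This is an alternative sketch using str.translate() from the standard library, not the method used in the cell above.

```python
import string

text = "I'm with you for the entire life in U.K.!"

# Build a translation table that deletes every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)
stripped = [word.translate(table) for word in text.split()]
print(stripped)
```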


In [30]:
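# string.printable holds all printable ASCII characters (letters, digits, punctuation and whitespace).
# This regex removes every character that is NOT printable ASCII. Punctuation is kept, so this text comes back unchanged.

words = "Hello, I'm Raj Sinha, a statistic graduate and a machine learning enthusiast !".split()
re_print = re.compile(r"[^%s]" % re.escape(string.printable))
result = [re_print.sub("", w) for w in words]
print(result)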

['Hello,', "I'm", 'Raj', 'Sinha,', 'a', 'statistic', 'graduate', 'and', 'a', 'machine', 'learning', 'enthusiast', '!']
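
The filter only has a visible effect when the text contains characters outside printable ASCII. The sketch below uses a made-up string with a curly apostrophe and an accented letter to show what gets removed.

```python
import re
import string

# The character class matches anything that is NOT in string.printable.
re_print = re.compile(r"[^%s]" % re.escape(string.printable))

# \u2019 is a curly apostrophe and \u00e9 is an accented e; neither is ASCII.
words = "SpaceX\u2019s caf\u00e9 is open".split()
result = [re_print.sub("", w) for w in words]
print(result)
```

Because accented letters are dropped as well, this filter is best suited to text that is expected to be plain ASCII.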


#### **Sentence Tokenization**

In [32]:
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on, Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

sentences = re.compile("[.!?]").split(text)
print(sentences)

['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on, Mars', ' In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth', '']


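#### **Note:-** We passed the character class [.!?] to re.compile(), so the text is split on any of these three separators. This is the advantage over Python's split() method, which accepts only one separator. The terminators themselves are removed, and a trailing empty string appears because the text ends with a separator.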
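
If the sentence terminators should be kept, one option is to split on the whitespace that follows them. The sketch below uses a lookbehind for this; the sample text is made up for the example.

```python
import re

text = "Is Mars next? SpaceX thinks so! Falcon 1 reached orbit in 2008."

# Split on whitespace that follows ., ! or ?. The lookbehind keeps the
# terminator attached to its sentence and avoids a trailing empty string.
sentences = re.split(r"(?<=[.!?])\s+", text)
print(sentences)
```

This pattern still treats every period as a sentence end, so abbreviations such as "U.K." followed by a space would be split incorrectly.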

### **3. Tokenization in NLTK**

NLTK provides a tokenize module that offers, among others, these two functions:
1. word_tokenize  (Word tokenization)
2. sent_tokenize  (Sentence tokenization)

#### **Word Tokenization**

In [35]:
import nltk
nltk.download('punkt')        # The tokenizers below rely on the 'punkt' data package, which must be downloaded once.

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [36]:
from nltk.tokenize import word_tokenize
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on, Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

tokenized = word_tokenize(text)
print(tokenized)

['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi-planet', 'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', ',', 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’', 's', 'Falcon', '1', 'became', 'the', 'first', 'privately', 'developed', 'liquid-fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth', '.']


#### **Sentence Tokenization**

In [37]:
from nltk.tokenize import sent_tokenize
text = """A.P.J. Abdul Kalam, in full Avul Pakir Jainulabdeen Abdul Kalam, (born October 15, 1931, Rameswaram, India—died July 27, 2015, Shillong), Indian scientist and politician who played a leading role in the development of India’s missile and nuclear weapons programs. He was president of India from 2002 to 2007."""

tokenized = sent_tokenize(text)
print(tokenized)

['A.P.J.', 'Abdul Kalam, in full Avul Pakir Jainulabdeen Abdul Kalam, (born October 15, 1931, Rameswaram, India—died July 27, 2015, Shillong), Indian scientist and politician who played a leading role in the development of India’s missile and nuclear weapons programs.', 'He was president of India from 2002 to 2007.']


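#### **Note:-** NLTK's word_tokenize treats punctuation marks as separate tokens. In the sentence example above, sent_tokenize also split the initials 'A.P.J.' off as a sentence of their own, so abbreviations can still produce wrong boundaries.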

In [31]:
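# Lowercasing the words in the text:

text = "Hello, I'm Raj Sinha, a statistic graduate and a machine learning enthusiast !"
words = text.split()
# Convert each word to lower case
lower_words = [word.lower() for word in words]
print(lower_words)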

['hello,', "i'm", 'raj', 'sinha,', 'a', 'statistic', 'graduate', 'and', 'a', 'machine', 'learning', 'enthusiast', '!']


### **4. Tokenization using spaCy**

#### **Word Tokenization**

In [39]:
# We will use spacy.lang.en, which supports the English language.
from spacy.lang.en import English

# Create a blank English pipeline. It contains only the tokenizer, which is all we need here.
nlp = English()

In [40]:
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth"""

In [42]:
# Create a word list containing the text of each token.
doc = nlp(text)
tokens_list = [token.text for token in doc]

print(tokens_list)

['Founded', 'in', '2002', ',', 'SpaceX', '’s', 'mission', 'is', 'to', 'enable', 'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi', '-', 'planet', '\n', 'species', 'by', 'building', 'a', 'self', '-', 'sustaining', 'city', 'on', 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’s', 'Falcon', '1', 'became', 'the', 'first', 'privately', 'developed', '\n', 'liquid', '-', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth']


In [43]:
# Looping through the tokens in the text.
doc2 = nlp(u"We're here to help! Join our Discord server, named as Datazen and also don't forget to follow us on Instagram.")
for tokens in doc2:
  print(tokens)

We
're
here
to
help
!
Join
our
Discord
server
,
named
as
Datazen
and
also
do
n't
forget
to
follow
us
on
Instagram
.


In [5]:
doc3 = nlp(u"I\'m 6 feet tall and my BMI is 22.31, which suggests that I\'m healthy. So I decided to order a cheese pizza for myself worth $5.")
for tokens in doc3:
  print(tokens)

I
'm
6
feet
tall
and
my
BMI
is
22.31
,
which
suggests
that
I
'm
healthy
.
So
I
decided
to
order
a
cheese
pizza
for
myself
worth
$
5
.


In [6]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")
for tokens in doc4:
  print(tokens)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.


In [7]:
len(doc4)

11

In [8]:
len(doc)      # Number of tokens in the SpaceX text tokenized above.

55

In [9]:
len(doc3.vocab)      # Size of the vocabulary shared by the whole pipeline, not only by doc3.

528

In [11]:
# Indexing:
doc5 = nlp(u'It is better to give than to receive')
print(doc5[2])

better


In [12]:
# Slicing:
print(doc5[1:4])

is better to


In [14]:
print(doc5[-4:])

give than to receive


In [18]:
import spacy

# Named entities and noun chunks need a trained pipeline. This assumes the
# small English model "en_core_web_sm" is installed.
nlp = spacy.load("en_core_web_sm")

doc6 = nlp(u'Apple to build a factory in Hong Kong for $6 million')
for tokens in doc6:
  print(tokens.text, end = ",")

print("\n\n-----------------------")

# Named entities with their labels and a description of each label.
for ent in doc6.ents:
  print(ent.text + " - " + ent.label_ + " - " + str(spacy.explain(ent.label_)))

Apple,to,build,a,factory,in,Hong,Kong,for,$,6,million,

-----------------------
Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


In [17]:
print(doc6.ents)
print(len(doc6.ents))

(Apple, Hong Kong, $6 million)
3


In [19]:
# Extracting noun chunks (base noun phrases) from the text.
doc7 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")
for chunks in doc7.noun_chunks:     # Noun chunks
  print(chunks.text)

Autonomous cars
insurance liability
manufacturers


In [20]:
doc8 = nlp(u"Red cars do not carry higher insurance rates.")
for chunks in doc8.noun_chunks:
  print(chunks.text)

Red cars
higher insurance rates


In [21]:
doc9 = nlp(u"He was a one-eyed, one-horned, flying, purple people-eater.")
for chunks in doc9.noun_chunks:
  print(chunks.text)

He
a one-eyed, one-horned, flying, purple people-eater


In [22]:
# Using 'displacy' to visualize the dependency parse of the text (style = 'dep'):
from spacy import displacy
doc10 = nlp(u"Apple is going to build a factory in U.K. for $6 million.")
displacy.render(doc10, style = 'dep', jupyter = True, options = {'distance': 110})

In [23]:
# Using 'displacy' to highlight the named entities in the text (style = 'ent'):
doc11 = nlp(u"Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.")
displacy.render(doc11, style = 'ent', jupyter= True)

In [24]:
doc12 = nlp(u"This is a sentence.")
displacy.serve(doc12, style = 'dep')


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


#### **Sentence Tokenization**

In [44]:
from spacy.lang.en import English

nlp = English()

In [45]:
# Add the rule-based "sentencizer" component to the pipeline.
# This is spaCy v3 syntax; in spaCy v2 it was nlp.add_pipe(nlp.create_pipe('sentencizer')).
nlp.add_pipe('sentencizer')

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

#  "nlp" Object is used to create documents with linguistic annotations.
doc = nlp(text)

# create list of sentence tokens
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
sents_list

['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet \nspecies by building a self-sustaining city on Mars.',
 'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel launch vehicle to orbit the Earth.']

### **5. Tokenization in Keras**

In [None]:
pip install keras

#### **Word Tokenization**

In [48]:
from keras.preprocessing.text import text_to_word_sequence

text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

result = text_to_word_sequence(text)
result



['founded',
 'in',
 '2002',
 'spacex’s',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization',
 'and',
 'a',
 'multi',
 'planet',
 'species',
 'by',
 'building',
 'a',
 'self',
 'sustaining',
 'city',
 'on',
 'mars',
 'in',
 '2008',
 'spacex’s',
 'falcon',
 '1',
 'became',
 'the',
 'first',
 'privately',
 'developed',
 'liquid',
 'fuel',
 'launch',
 'vehicle',
 'to',
 'orbit',
 'the',
 'earth']

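#### **Note:-** By default, Keras converts all letters to lowercase and filters out punctuation before splitting, which is why 'multi-planet' becomes 'multi' and 'planet' in the output above.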
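
To make that behaviour concrete, here is a standard-library sketch that approximates it. The function name simple_word_sequence is hypothetical, and the filter string is an assumption based on the documented Keras defaults, so treat this as an illustration and not as the Keras implementation.

```python
# Assumed default filter characters (taken from the Keras documentation).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    """Approximate Keras-style word splitting with the standard library only."""
    if lower:
        text = text.lower()
    # Replace every filter character with the split character, then split.
    table = str.maketrans({ch: split for ch in filters})
    return [tok for tok in text.translate(table).split(split) if tok]

words = simple_word_sequence("Falcon 1: a liquid-fuel, self-sustaining rocket!")
print(words)
```

The curly apostrophe in 'spacex’s' is not part of the filter string, which is consistent with it surviving in the Keras output above.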

### **6. Tokenization using Gensim**

In [49]:
pip install gensim



#### **Word Tokenization**

We can import the tokenize function from the gensim.utils module to perform word tokenization.

In [51]:
from gensim.utils import tokenize
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
list(tokenize(text))

['Founded',
 'in',
 'SpaceX',
 's',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization',
 'and',
 'a',
 'multi',
 'planet',
 'species',
 'by',
 'building',
 'a',
 'self',
 'sustaining',
 'city',
 'on',
 'Mars',
 'In',
 'SpaceX',
 's',
 'Falcon',
 'became',
 'the',
 'first',
 'privately',
 'developed',
 'liquid',
 'fuel',
 'launch',
 'vehicle',
 'to',
 'orbit',
 'the',
 'Earth']
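
Notice that the numbers '2002', '2008' and '1' are missing from the output above: only alphabetic tokens are returned. The sketch below imitates that behaviour with a plain regular expression. The pattern is an assumption chosen for illustration and is not taken from Gensim's source.

```python
import re

text = "Founded in 2002, SpaceX's Falcon 1 reached orbit in 2008."

# Match runs of word characters that are not digits, so numbers are dropped.
alphabetic_tokens = re.findall(r"[^\W\d]+", text)
print(alphabetic_tokens)
```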

#### **Sentence Tokenization**

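To perform sentence tokenization, we use the split_sentences function from the gensim.summarization.textcleaner module. The gensim.summarization package was removed in Gensim 4.0, so this cell needs an earlier Gensim version: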

In [52]:
from gensim.summarization.textcleaner import split_sentences
text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""
result = split_sentences(text)
result

['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet ',
 'species by building a self-sustaining city on Mars.',
 'In 2008, SpaceX’s Falcon 1 became the first privately developed ',
 'liquid-fuel launch vehicle to orbit the Earth.']

#### **Note:-** Gensim's split_sentences also split the text at each "\n", whereas the other sentence tokenizers above kept the newline inside the sentence.

## **That's It!**

### Congratulations! You have learned word and sentence tokenization and seen how they are implemented in different NLP libraries.