# **Tokenization**
---


### **What is Tokenization?**

Tokenization is the process of breaking a sentence, paragraph, or entire text document into smaller units, such as words or characters. Each of these smaller units is called a token. Tokens can be words, numbers, or punctuation marks.

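As a rough illustration of the two granularities mentioned above, the sketch below splits a made-up sentence into word-level and character-level tokens using only built-in string operations. The sentence and variable names are assumptions chosen for this example; real tokenizers handle punctuation more carefully.

```python
# A naive illustration of token granularity (not a production tokenizer).
sentence = "Tokens can be words, numbers or punctuation marks."

# Word-level tokens: split on whitespace, so punctuation stays attached to words.
word_tokens = sentence.split()

# Character-level tokens: every character, including spaces, becomes a token.
char_tokens = list(sentence)

print(word_tokens)      # 'words,' and 'marks.' still carry their punctuation
print(char_tokens[:6])  # ['T', 'o', 'k', 'e', 'n', 's']
```
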

---


### **Need for Tokenization**

Tokenization is the first step in most NLP pipelines. Once a text is split into tokens, its meaning can be interpreted by analyzing the words it contains. Given a paragraph, we first extract its sentences and then split each sentence into words; only then can the text be analyzed in full.

---



### **Different methods to perform Tokenization**



<hr>

### **1. Tokenization using Regular Expression**

<hr/>

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern. RegEx can be used to check if a string contains the specified search pattern. Python has a built-in package called re, which can be used to work with Regular Expressions.

---


| Function | Description |
| ----------- | ----------- |
| findall | Returns a list containing all matches |
| search | Returns a Match object if there is a match anywhere in the string |
| split | Returns a list where the string has been split at each match |
| sub | Replaces one or many matches with a string |

---
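The four functions listed above can be tried on a short string. This is a minimal sketch; the sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample string, used only for this illustration.
sample = "Order 66 shipped on day 12."

# findall: every non-overlapping match, returned as a list of strings.
numbers = re.findall(r"\d+", sample)

# search: a Match object for the first match, or None if nothing matches.
first = re.search(r"\d+", sample)

# split: the pieces of the string between matches of the pattern.
parts = re.split(r"\s+", sample)

# sub: the string with every match replaced.
masked = re.sub(r"\d+", "#", sample)

print(numbers)                       # ['66', '12']
print(first.group(), first.start())  # 66 6
print(parts)                         # ['Order', '66', 'shipped', 'on', 'day', '12.']
print(masked)                        # Order # shipped on day #.
```
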


| Special sequence | Description |
| ----------- | ----------- |
| \w | Matches any word character (letters a-z and A-Z, digits 0-9, and the underscore _ character) |
| \W | Matches any character that is NOT a word character |
| \d | Matches any digit (0-9) |
| \D | Matches any character that is NOT a digit |
| \s | Matches any whitespace character |
| \S | Matches any character that is NOT a whitespace character |

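To see what each special sequence in the table picks out, the sketch below runs all six against one made-up string. The sample text and the dictionary name are assumptions for this example.

```python
import re

# Hypothetical sample string, used only for this illustration.
sample = "Room 42, floor_3!"

matches = {
    r"\w+": re.findall(r"\w+", sample),  # runs of word characters
    r"\W+": re.findall(r"\W+", sample),  # runs of non-word characters
    r"\d+": re.findall(r"\d+", sample),  # runs of digits
    r"\D+": re.findall(r"\D+", sample),  # runs of non-digits
    r"\s": re.findall(r"\s", sample),    # single whitespace characters
    r"\S+": re.findall(r"\S+", sample),  # runs of non-whitespace characters
}

for pattern, found in matches.items():
    print(pattern, found)
```

Note that the underscore counts as a word character, so `floor_3` is matched as one unit by `\w+`.
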


#### **Word Tokenization**

Word tokenization splits a sentence into words (tokens).

In [None]:
#import the required library
import re

#Text to be tokenized
text = "Robofied is a comprehensive Artificial Intelligence platform based in Gurugram,Haryana working towards democratizing safe artificial intelligence towards a common goal of Singularity. At Robofied, we are doing research in speech, natural language, and machine learning. We develop open-source solutions for developers which empowers them so that they can make better products for the world. We educate people about Artificial Intelligence, its scope and impact via resources and tutorials."
#Word Tokenization
tokens = re.findall(r"[\w']+", text)
tokens

['Robofied',
 'is',
 'a',
 'comprehensive',
 'Artificial',
 'Intelligence',
 'platform',
 'based',
 'in',
 'Gurugram',
 'Haryana',
 'working',
 'towards',
 'democratizing',
 'safe',
 'artificial',
 'intelligence',
 'towards',
 'a',
 'common',
 'goal',
 'of',
 'Singularity',
 'At',
 'Robofied',
 'we',
 'are',
 'doing',
 'research',
 'in',
 'speech',
 'natural',
 'language',
 'and',
 'machine',
 'learning',
 'We',
 'develop',
 'open',
 'source',
 'solutions',
 'for',
 'developers',
 'which',
 'empowers',
 'them',
 'so',
 'that',
 'they',
 'can',
 'make',
 'better',
 'products',
 'for',
 'the',
 'world',
 'We',
 'educate',
 'people',
 'about',
 'Artificial',
 'Intelligence',
 'its',
 'scope',
 'and',
 'impact',
 'via',
 'resources',
 'and',
 'tutorials']
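Notice that the output above drops all punctuation and splits "open-source" into two tokens. If punctuation and hyphenated words matter for a task, the pattern can be extended. The following is one possible variant, shown on a made-up sentence; it is a sketch, not a standard or complete tokenization rule.

```python
import re

# Hypothetical sample sentence, used only for this illustration.
sample = "We develop open-source tools, and we don't stop."

# Pattern used above: runs of word characters and apostrophes only.
basic = re.findall(r"[\w']+", sample)

# Variant: a word may contain internal hyphens or apostrophes,
# and any other non-space symbol becomes a token of its own.
detailed = re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sample)

print(basic)
# ['We', 'develop', 'open', 'source', 'tools', 'and', 'we', "don't", 'stop']
print(detailed)
# ['We', 'develop', 'open-source', 'tools', ',', 'and', 'we', "don't", 'stop', '.']
```
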

#### **Sentence Tokenization**

Sentence tokenization splits a document or paragraph into sentences.


In [None]:
#import the required library
import re

#Text to be tokenized
text = "Robofied is a comprehensive Artificial Intelligence platform based in Gurugram,Haryana working towards democratizing safe artificial intelligence towards a common goal of Singularity. At Robofied, we are doing research in speech, natural language, and machine learning. We develop open-source solutions for developers which empowers them so that they can make better products for the world. We educate people about Artificial Intelligence, its scope and impact via resources and tutorials."
#Sentence Tokenization
sentences = re.split(r'[.!?]',text)
sentences

['Robofied is a comprehensive Artificial Intelligence platform based in Gurugram,Haryana working towards democratizing safe artificial intelligence towards a common goal of Singularity',
 ' At Robofied, we are doing research in speech, natural language, and machine learning',
 ' We develop open-source solutions for developers which empowers them so that they can make better products for the world',
 ' We educate people about Artificial Intelligence, its scope and impact via resources and tutorials',
 '']
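The split above removes the sentence-ending punctuation, leaves a leading space on every sentence after the first, and produces a trailing empty string. One possible refinement, sketched here on a made-up text, splits on the whitespace that follows a terminator by using a lookbehind. It is still a heuristic: abbreviations such as "Dr." would be split incorrectly.

```python
import re

# Hypothetical sample text, used only for this illustration.
sample = "Tokenization is a first step. Is it hard? Not really!"

# Approach used above: split on the terminator itself.
naive = re.split(r"[.!?]", sample)

# Refinement: split on whitespace that is preceded by ., ! or ?
# so each sentence keeps its punctuation and no empty strings appear.
refined = re.split(r"(?<=[.!?])\s+", sample.strip())

print(naive)
# ['Tokenization is a first step', ' Is it hard', ' Not really', '']
print(refined)
# ['Tokenization is a first step.', 'Is it hard?', 'Not really!']
```
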

<hr>

### **2. Tokenization using NLTK**

<hr/>

NLTK, short for Natural Language Toolkit, is a Python library for symbolic and statistical Natural Language Processing.
NLTK contains a module called tokenize, which provides two main functions:

1. Word tokenize: the word_tokenize() function splits a sentence into words (tokens)
2. Sentence tokenize: the sent_tokenize() function splits a document or paragraph into sentences



#### **Word Tokenization**



In [None]:
#importing libraries
import nltk
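#Download the Punkt tokenizer data (recent NLTK releases, 3.9 and later, may ask for 'punkt_tab' instead)
nltk.download('punkt')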
from nltk.tokenize import word_tokenize
#text to be tokenized
text = "Robofied is a comprehensive Artificial Intelligence platform based in Gurugram,Haryana working towards democratizing safe artificial intelligence towards a common goal of Singularity. At Robofied, we are doing research in speech, natural language, and machine learning. We develop open-source solutions for developers which empowers them so that they can make better products for the world. We educate people about Artificial Intelligence, its scope and impact via resources and tutorials."
#Word Tokenization
print(word_tokenize(text))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
['Robofied', 'is', 'a', 'comprehensive', 'Artificial', 'Intelligence', 'platform', 'based', 'in', 'Gurugram', ',', 'Haryana', 'working', 'towards', 'democratizing', 'safe', 'artificial', 'intelligence', 'towards', 'a', 'common', 'goal', 'of', 'Singularity', '.', 'At', 'Robofied', ',', 'we', 'are', 'doing', 'research', 'in', 'speech', ',', 'natural', 'language', ',', 'and', 'machine', 'learning', '.', 'We', 'develop', 'open-source', 'solutions', 'for', 'developers', 'which', 'empowers', 'them', 'so', 'that', 'they', 'can', 'make', 'better', 'products', 'for', 'the', 'world', '.', 'We', 'educate', 'people', 'about', 'Artificial', 'Intelligence', ',', 'its', 'scope', 'and', 'impact', 'via', 'resources', 'and', 'tutorials', '.']


#### **Sentence Tokenization**

In [None]:
#importing libraries
import nltk
from nltk.tokenize import sent_tokenize
#text to be tokenized
text = "Robofied is a comprehensive Artificial Intelligence platform based in Gurugram,Haryana working towards democratizing safe artificial intelligence towards a common goal of Singularity. At Robofied, we are doing research in speech, natural language, and machine learning. We develop open-source solutions for developers which empowers them so that they can make better products for the world. We educate people about Artificial Intelligence, its scope and impact via resources and tutorials."
#Sentence Tokenization
print(sent_tokenize(text))

['Robofied is a comprehensive Artificial Intelligence platform based in Gurugram,Haryana working towards democratizing safe artificial intelligence towards a common goal of Singularity.', 'At Robofied, we are doing research in speech, natural language, and machine learning.', 'We develop open-source solutions for developers which empowers them so that they can make better products for the world.', 'We educate people about Artificial Intelligence, its scope and impact via resources and tutorials.']


<hr>

### **3. Tokenization using spaCy**

<hr/>

spaCy is a free, open-source library for advanced Natural Language Processing in Python. It comes with pretrained NLP models that can perform most common NLP tasks, such as tokenization, part-of-speech (POS) tagging, named entity recognition (NER), lemmatization, and conversion to word vectors.
It supports 49+ languages and is very fast.

#### **Word Tokenization**

In [None]:
from spacy.lang.en import English

# Create a blank English pipeline; English() provides only the tokenizer (no tagger, parser, or NER)
nlp = English()

text = "Robofied is a comprehensive Artificial Intelligence platform based in Gurugram,Haryana working towards democratizing safe artificial intelligence towards a common goal of Singularity. At Robofied, we are doing research in speech, natural language, and machine learning. We develop open-source solutions for developers which empowers them so that they can make better products for the world. We educate people about Artificial Intelligence, its scope and impact via resources and tutorials."

my_doc = nlp(text)
for token in my_doc:
  print(token.text)


Robofied
is
a
comprehensive
Artificial
Intelligence
platform
based
in
Gurugram
,
Haryana
working
towards
democratizing
safe
artificial
intelligence
towards
a
common
goal
of
Singularity
.
At
Robofied
,
we
are
doing
research
in
speech
,
natural
language
,
and
machine
learning
.
We
develop
open
-
source
solutions
for
developers
which
empowers
them
so
that
they
can
make
better
products
for
the
world
.
We
educate
people
about
Artificial
Intelligence
,
its
scope
and
impact
via
resources
and
tutorials
.


#### **Sentence Tokenization**

In [None]:
from spacy.lang.en import English

# Load English tokenizer
nlp = English()

# Add the rule-based 'sentencizer' component to the pipeline by name
# (spaCy v3+; in spaCy v2 this was nlp.add_pipe(nlp.create_pipe('sentencizer')))
nlp.add_pipe('sentencizer')

text = "Robofied is a comprehensive Artificial Intelligence platform based in Gurugram,Haryana working towards democratizing safe artificial intelligence towards a common goal of Singularity. At Robofied, we are doing research in speech, natural language, and machine learning. We develop open-source solutions for developers which empowers them so that they can make better products for the world. We educate people about Artificial Intelligence, its scope and impact via resources and tutorials."

doc = nlp(text)
for sent in doc.sents:
  print(sent.text)


Robofied is a comprehensive Artificial Intelligence platform based in Gurugram,Haryana working towards democratizing safe artificial intelligence towards a common goal of Singularity.
At Robofied, we are doing research in speech, natural language, and machine learning.
We develop open-source solutions for developers which empowers them so that they can make better products for the world.
We educate people about Artificial Intelligence, its scope and impact via resources and tutorials.
