# Tokenization using
- 1. NLTK
- 2. spaCy
- 3. TextBlob
- 4. Gensim
- 5. TensorFlow
- 6. BERT
- 7. Enchant
- 8. Tokenization using Regular Expressions (RegEx)
- 9. Keras - Tokenization
- 10. split() function

# What is Tokenization
Tokenization is the process of splitting a phrase, sentence, paragraph, or entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

Put another way, tokenization (or word segmentation) separates the sentences or words of a corpus into small units, i.e. tokens.

# Types of tokenization
Whitespace tokenization, dictionary-based, rule-based, Penn Treebank, spaCy, Moses, and subword tokenization (byte-pair encoding, WordPiece, SentencePiece, unigram language model).
### White Space Tokenization
This is the simplest tokenization technique. Given a sentence or paragraph, it produces words by splitting the input whenever a whitespace character is encountered.

### Dictionary Based Tokenization
In this method, tokens are identified by looking them up in an existing dictionary. If a token is not found, special rules are used to tokenize it. It is a more advanced technique than the whitespace tokenizer.
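
The lookup-then-fallback idea can be sketched in a few lines. This is only an illustration: the `DICTIONARY` set, the `dictionary_tokenize` name, and the regex fallback rule are all assumptions made for the example, not part of any library.

```python
import re

# Hypothetical toy dictionary; a real system would load a much larger lexicon.
DICTIONARY = {"new-age", "information", "super", "highway", "the", "is", "here"}

def dictionary_tokenize(text, dictionary=DICTIONARY):
    """Keep whitespace-delimited chunks found in the dictionary; split the rest with a fallback rule."""
    tokens = []
    for chunk in text.split():
        if chunk.lower() in dictionary:
            # Known entry: keep it as a single token, even if it contains a hyphen
            tokens.append(chunk)
        else:
            # Fallback rule (an assumption): separate runs of word characters from punctuation
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print(dictionary_tokenize("The new-age information super-highway is here."))
```

Here "new-age" stays whole because it is in the dictionary, while "super-highway" and "here." are not and fall through to the punctuation-splitting rule.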

### Rule Based Tokenization
In this technique, a set of rules is created for the specific problem, and tokenization is done based on those rules. One example is creating rules based on the grammar of a particular language.

### Regular Expression Tokenizer
This technique uses regular expressions to control how text is split into tokens. Regular expressions range from simple to complex and can be difficult to comprehend. This technique should be preferred when the methods above do not serve the required purpose. It is a kind of rule-based tokenizer.

### Penn TreeBank Tokenization
A treebank is a corpus annotated with the syntactic and semantic structure of a language. The Penn Treebank is one of the largest published treebanks. Its tokenization convention splits off punctuation and clitics (word parts attached to other words, as in I’m and don’t) while keeping hyphenated words together.
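
A simplified sketch of this style of splitting, using only the standard library. It covers a handful of English clitics and is an assumption-laden illustration, not the full Penn Treebank rule set; the function name `treebank_like_tokenize` is made up for this example.

```python
import re

def treebank_like_tokenize(text):
    """Simplified illustration of Treebank-style splitting, not the full rule set."""
    # Split the negative clitic: "don't" -> "do n't"
    text = re.sub(r"(?i)(\w)(n't)\b", r"\1 \2", text)
    # Split other common clitics: "I'm" -> "I 'm", "they'll" -> "they 'll"
    text = re.sub(r"(?i)(\w)('(?:m|s|re|ve|ll|d))\b", r"\1 \2", text)
    # Put spaces around punctuation, leaving apostrophes and hyphens alone
    text = re.sub(r"([^\w\s'-])", r" \1 ", text)
    return text.split()

print(treebank_like_tokenize("I'm sure they don't know."))
# ['I', "'m", 'sure', 'they', 'do', "n't", 'know', '.']
```

Hyphens are deliberately excluded from the punctuation rule, so a word such as "path-breaking" stays a single token.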

### Spacy Tokenizer
This is a modern tokenizer that is fast and easily customizable. It lets you specify special tokens that must not be segmented, or that must be segmented using special rules. For example, if you want to keep # as a separate token, that rule takes precedence over the other tokenization operations.

### Moses Tokenizer
This is an advanced tokenizer that was available before spaCy was introduced. It is essentially a collection of complex normalization and segmentation logic that works very well for structured languages such as English.

##### Subword Tokenization
This tokenization is very useful for applications where subwords carry meaning. The most frequently used words are given unique ids, while less frequent words are split into subwords that best represent their meaning independently.
- This keeps the language model from learning fewer and fewest as two unrelated words, because both share the subword few.

##### Byte-Pair Encoding (BPE)
This technique comes from information theory and data compression. BPE starts from individual characters and repeatedly merges the most frequent pair of adjacent symbols into a new symbol. As a result, frequent words end up represented by few symbols (often a single one), while rare words are represented by a longer sequence of smaller symbols.
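
A minimal sketch of the merge-learning loop, assuming the input is a word-frequency mapping. The function name `learn_bpe` and the toy counts are hypothetical, and real implementations add details such as end-of-word markers.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: frequency} mapping."""
    # Represent each word as a tuple of symbols, starting from single characters
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair into one symbol
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges, vocab

# Toy counts, made up for illustration
merges, vocab = learn_bpe({"few": 5, "fewer": 3, "fewest": 2}, num_merges=2)
print(merges)
print(vocab)
```

After two merges on this toy input, "few" has become a single symbol that is shared by "few", "fewer", and "fewest", with the rarer endings still spelled out character by character.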

##### WordPiece
WordPiece is similar to BPE except in how a new token is added to the vocabulary. BPE merges the most frequently occurring pair of symbols. WordPiece also takes the frequency of the individual symbols into account and merges based on the score below.

Count(x, y) = frequency(x, y) / (frequency(x) * frequency(y))

The pair of symbols with the maximum score is merged into the vocabulary. Compared to BPE, this allows rare tokens to be included in the vocabulary.
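
The score can be computed directly from symbol and pair counts. The sketch below assumes each word has already been split into symbols; the `wordpiece_scores` name and the toy corpus are made up for illustration.

```python
from collections import Counter

def wordpiece_scores(corpus):
    """Score adjacent symbol pairs as freq(x, y) / (freq(x) * freq(y)).

    `corpus` is a list of words, each already split into a list of symbols.
    """
    symbol_freq = Counter()
    pair_freq = Counter()
    for symbols in corpus:
        symbol_freq.update(symbols)
        pair_freq.update(zip(symbols, symbols[1:]))
    return {
        (x, y): count / (symbol_freq[x] * symbol_freq[y])
        for (x, y), count in pair_freq.items()
    }

# Toy corpus: "hug" twice, "hum" once, "qi" once
corpus = [list("hug"), list("hug"), list("hum"), list("qi")]
scores = wordpiece_scores(corpus)
best_wordpiece = max(scores, key=scores.get)
print(best_wordpiece, scores[best_wordpiece])
```

On this toy corpus the pair (h, u) is the most frequent, so BPE would merge it first, but its score is only 3 / (3 * 3). The rare pair (q, i) scores 1 / (1 * 1) and wins under the WordPiece criterion, which is the effect described above.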



Corpus (plural: corpora) - a collection of language data (e.g. texts). Corpora are typically used to train models for tasks such as text classification or sentiment analysis.

Token - a string separated out of the original text; in other words, an output unit of tokenization.



# 1. NLTK Tokenization

NLTK - We recommend NLTK only as an education and research tool. Its modularized structure makes it excellent for learning and exploring NLP concepts, but it’s not meant for production.

In [None]:
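# Tokenization using NLTK
import nltk
# Download only the Punkt tokenizer models instead of opening the interactive downloader.
# (Recent NLTK releases may look for 'punkt_tab' instead; download that if 'punkt' is not found.)
nltk.download('punkt')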
paragraph = '''
The Telecommunications Industry has been among the best performing industries in the world in 
recent years. Telecom companies face a unique set of challenges that stem from technology 
trends and customer demands. The convergence of applications, networks or content in this
new-age information super highway has become the next path-breaking move in core mass-market 
technology providing single connectivity and integrated user experience. 
'''
# sentence tokenization
sentences = nltk.sent_tokenize(paragraph)
print(sentences)
# word tokenization
words = nltk.word_tokenize(paragraph)
print(words)



# 2. spacy Tokenization

In [29]:
!pip install spacy
# The blank English pipeline used below needs no trained model, so no model download is required.
from spacy.lang.en import English
nlp = English()
text = '''
The Telecommunications Industry has been among the best performing industries in the world in 
recent years. Telecom companies face a unique set of challenges that stem from technology 
trends and customer demands. The convergence of applications, networks or content in this
new-age information super highway has become the next path-breaking move in core mass-market 
technology providing single connectivity and integrated user experience. 
'''
# Word Tokenization
doc = nlp(text)
# create list of word tokens
token_list = []
for t in doc:
    token_list.append(t.text)
print("Word Tokenization:", token_list)

# sentence tokenization

# Load English tokenizer
nlp = English()

# Add the rule-based 'sentencizer' component to the pipeline
# (spaCy v3 API; spaCy v2 used nlp.add_pipe(nlp.create_pipe('sentencizer')))
nlp.add_pipe('sentencizer')

#  "nlp" Object is used to create documents with linguistic annotations.
doc = nlp(text)

# create list of sentence tokens
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
print("sentence list: ",sents_list)
    




Word Tokenization: ['\n', 'The', 'Telecommunications', 'Industry', 'has', 'been', 'among', 'the', 'best', 'performing', 'industries', 'in', 'the', 'world', 'in', '\n', 'recent', 'years', '.', 'Telecom', 'companies', 'face', 'a', 'unique',



# 3. textblob Tokenization

TextBlob - TextBlob is built on top of NLTK and is more easily accessible. This is our favorite library for fast prototyping or building applications that don’t require highly optimized performance. Beginners should start here.

TextBlob is a library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. Its features include:
- Noun phrase extraction
- Part-of-speech tagging
- Sentiment analysis
- Classification (Naive Bayes, Decision Tree)
- Tokenization (splitting text into words and sentences)
- Word and phrase frequencies
- Parsing
- n-grams
- Word inflection (pluralization and singularization) and lemmatization
- Spelling correction
- Add new models or languages through extensions
- WordNet integration


In [8]:
# Tokenization using TextBlob

!pip install -U textblob
!python -m textblob.download_corpora
from textblob import TextBlob
corpus = '''The Telecommunications Industry has been among the best performing industries in the world in 
recent years. Telecom companies face a unique set of challenges that stem from technology 
trends and customer demands. The convergence of applications, networks or content in this
new-age information super highway has become the next path-breaking move in core mass-market 
technology providing single connectivity and integrated user experience.
'''
blob_obj = TextBlob(corpus)

print(" --------- Sentence tokenization using textblob ----------")
print(blob_obj.sentences)
# Word Tokenize
print(blob_obj.words)


# 4. gensim Tokenization
Gensim - Gensim is most commonly used for topic modeling and similarity detection. It’s not a general-purpose NLP library, but it handles the tasks it does cover well.

In [18]:
# Build a Gensim dictionary from word tokens
from gensim import corpora

text = '''The Telecommunications Industry has been among the best performing industries in the world in
recent years. Telecom companies face a unique set of challenges that stem from technology
trends and customer demands. The convergence of applications, networks or content in this
new-age information super highway has become the next path-breaking move in core mass-market
technology providing single connectivity and integrated user experience.'''

# Split the text into sentences, then each sentence into whitespace-delimited tokens.
# (Iterating over the string directly would yield single characters, not sentences.)
sentences = text.replace('\n', ' ').split('. ')
tokens = [sentence.split() for sentence in sentences]

# Create the dictionary, which maps each unique token to an integer id
gensim_dictionary = corpora.Dictionary()
gensim_dictionary.add_documents(tokens)

print("The dictionary has: " + str(len(gensim_dictionary)) + " tokens")
print(gensim_dictionary.token2id)


# 5. tensorflow Tokenization

In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
    'where to go',
    'how will we move'
]
tokenizer = Tokenizer(num_words = 100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)


{'where': 1, 'to': 2, 'go': 3, 'how': 4, 'will': 5, 'we': 6, 'move': 7}


# 6. bert Tokenization

In [4]:
from transformers import BertTokenizer
tz = BertTokenizer.from_pretrained("bert-base-cased")
tz.convert_tokens_to_ids(["characteristically"])
sent = "algorithm that breaks a word into several subwords, such that commonly seen subwords can"
tz.tokenize(sent)




['algorithm',
 'that',
 'breaks',
 'a',
 'word',
 'into',
 'several',
 'sub',
 '##words',
 ',',
 'such',
 'that',
 'commonly',
 'seen',
 'sub',
 '##words',
 'can']
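
The `sub` / `##words` split in the output comes from greedy longest-match-first lookup against the WordPiece vocabulary. The sketch below illustrates that strategy on a single word with a made-up toy vocabulary; the function name and vocabulary are assumptions, and this is not the transformers implementation itself.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary
toy_vocab = {"sub", "##words", "word", "##s"}
print(wordpiece_tokenize("subwords", toy_vocab))
# ['sub', '##words']
```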

# 7. Enchant Tokenization

Enchant is a Python module (PyEnchant) used to check the spelling of words and to suggest corrections for words that are misspelled. It checks whether a word exists in a dictionary or not. It can use many popular spellchecking packages to perform this task, including ispell, aspell and MySpell, and it is quite flexible at handling multiple dictionaries and multiple languages.

Enchant also provides the enchant.tokenize module to tokenize text. Tokenizing involves splitting words from the body of the text.

More information is available in the PyEnchant tutorial: https://pyenchant.github.io/pyenchant/tutorial.html

In [22]:
!pip install pyenchant
from enchant.tokenize import get_tokenizer

# the text to be tokenized
text = '''The Telecommunications Industry has been among the best performing industries in the world in
recent years. Telecom companies face a unique set of challenges that stem from technology
trends and customer demands. The convergence of applications, networks or content in this
new-age information super highway has become the next path-breaking move in core mass-market
technology providing single connectivity and integrated user experience.
'''
# get a tokenizer for US English
tokenizer = get_tokenizer("en_US")

# each item yielded by the tokenizer is a (word, position) tuple
token_list = []
for token in tokenizer(text):
    token_list.append(token)

# print the words with their positions (character offsets)
print(token_list)


In [23]:
# print only the words 
word_list =[] 
  
for tokens in token_list: 
    word_list.append(tokens[0]) 
print(word_list) 

['The', 'Telecommunications', 'Industry', 'has', 'been', 'among', 'the', 'best', 'performing', 'industries', 'in', 'the', 'world', 'in', 'recent', 'years', 'Telecom', 'companies', 'face', 'a', 'unique', 'set', 'of', 'challenges', 'that', 'stem', 'from', 'technology', 'trends', 'and', 'customer', 'demands', 'The', 'convergence', 'of', 'applications', 'networks', 'or', 'content', 'in', 'this', 'new', 'age', 'information', 'super', 'highway', 'has', 'become', 'the', 'next', 'path', 'breaking', 'move', 'in', 'core', 'mass', 'market', 'technology', 'providing', 'single', 'connectivity', 'and', 'integrated', 'user', 'experience']


# 8. Tokenization using Regular Expressions (RegEx)

In [35]:
import re
text = '''
The Telecommunications Industry has been among the best performing industries in the world in 
recent years. Telecom companies face a unique set of challenges that stem from technology 
trends and customer demands. The convergence of applications, networks or content in this
new-age information super highway has become the next path-breaking move in core mass-market 
technology providing single connectivity and integrated user experience.'''
# Raw string so that \w is passed to the regex engine unchanged
tokens = re.findall(r"[\w']+", text)
print("------- Word Tokenization ------",tokens)

sentences = re.compile('[.!?]').split(text)
print("------- sent tokenization ------",sentences)

------- Word Tokenization ------ ['The', 'Telecommunications', 'Industry', 'has', 'been', 'among', 'the', 'best', 'performing', 'industries', 'in', 'the', 'world', 'in', 'recent', 'years', 'Telecom', 'companies', 'face', 'a', 'unique', 'set', 'of', 'challenges', 'that', 'stem', 'from', 'technology', 'trends', 'and', 'customer', 'demands', 'The', 'convergence', 'of', 'applications', 'networks', 'or', 'content', 'in', 'this', 'new', 'age', 'information', 'super', 'highway', 'has', 'become', 'the', 'next', 'path', 'breaking', 'move', 'in', 'core', 'mass', 'market', 'technology', 'providing', 'single', 'connectivity', 'and', 'integrated', 'user', 'experience']
------- sent tokenization ------ ['\nThe Telecommunications Industry has been among the best performing industries in the world in \nrecent years', ' Telecom companies face a unique set of challenges that stem from technology \ntrends and customer demands', ' The convergence of applications, networks or content in this\nnew-age infor
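
Splitting on `[.!?]` discards the sentence terminator and leaves leading whitespace on each piece, as the output above shows. One possible refinement, sketched here on a made-up sample string, is to split on the whitespace that follows a terminator by using a lookbehind.

```python
import re

# Sample text invented for this illustration
sample = "Telecom is growing fast. Is demand rising? Yes! Networks must adapt."

# Split on whitespace that follows ., ! or ?; the lookbehind keeps the punctuation
sentences = re.split(r"(?<=[.!?])\s+", sample.strip())
print(sentences)
# ['Telecom is growing fast.', 'Is demand rising?', 'Yes!', 'Networks must adapt.']
```

This is still a heuristic: abbreviations such as "Dr." or "e.g." would be treated as sentence ends.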

# 9. Keras - Tokenization 
Keras tokenization does three things:
- Splits words by space (split=" ").
- Filters out punctuation (filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n').
- Converts text to lowercase (lower=True).
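
The three steps can be approximated in plain Python to see what each one contributes. This is a sketch under the assumption that filtered characters are replaced by the split character before splitting; `simple_word_sequence` is a hypothetical name, not Keras's actual implementation.

```python
def simple_word_sequence(text,
                         filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                         lower=True,
                         split=" "):
    """Plain-Python approximation of the three steps listed above."""
    # Step 1: convert to lowercase
    if lower:
        text = text.lower()
    # Step 2: replace every filtered character with the split character
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    # Step 3: split, dropping the empty strings left by consecutive separators
    return [tok for tok in text.split(split) if tok]

print(simple_word_sequence("The new-age, path-breaking move!"))
# ['the', 'new', 'age', 'path', 'breaking', 'move']
```

Because the hyphen is in the filter list, hyphenated words such as "new-age" come out as two tokens.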

In [39]:
!pip install keras
import keras

from keras.preprocessing.text import text_to_word_sequence
result = text_to_word_sequence(text)
print("word tokenization",result)


word tokenization ['the', 'telecommunications', 'industry', 'has', 'been', 'among', 'the', 'best', 'performing', 'industries', 'in', 'the', 'world', 'in', 'recent', 'years', 'telecom', 'companies', 'face', 'a', 'unique', 'set', 'of', 'challenges', 'that', 'stem', 'from', 'technology', 'trends', 'and', 'customer', 'demands', 'the', 'convergence', 'of', 'applications', 'networks', 'or', 'content', 'in', 'this', 'new', 'age', 'information', 'super', 'highway', 'has', 'become', 'the', 'next', 'path', 'breaking', 'move', 'in', 'core', 'mass', 'market', 'technology', 'providing', 'single', 'connectivity', 'and', 'integrated', 'user', 'experience']




# 10. Tokenization using split() function

In [43]:
# Word tokenization using Python's built-in str.split()
print(text.split())
# Sentence tokenization
text.split('. ')

['The', 'Telecommunications', 'Industry', 'has', 'been', 'among', 'the', 'best', 'performing', 'industries', 'in', 'the', 'world', 'in', 'recent', 'years.', 'Telecom', 'companies', 'face', 'a', 'unique', 'set', 'of', 'challenges', 'that', 'stem', 'from', 'technology', 'trends', 'and', 'customer', 'demands.', 'The', 'convergence', 'of', 'applications,', 'networks', 'or', 'content', 'in', 'this', 'new-age', 'information', 'super', 'highway', 'has', 'become', 'the', 'next', 'path-breaking', 'move', 'in', 'core', 'mass-market', 'technology', 'providing', 'single', 'connectivity', 'and', 'integrated', 'user', 'experience.']


['\nThe Telecommunications Industry has been among the best performing industries in the world in \nrecent years',
 'Telecom companies face a unique set of challenges that stem from technology \ntrends and customer demands',
 'The convergence of applications, networks or content in this\nnew-age information super highway has become the next path-breaking move in core mass-market \ntechnology providing single connectivity and integrated user experience.']