### Word Tokenization Techniques in NLP

What is word tokenization?

Tokenization breaks raw text into smaller units, such as words or sentences, called **tokens**. These tokens are the basic input for understanding context or building an NLP model, because the meaning of a text is interpreted by analyzing the sequence of its words.

    For example, the text “It is cold” can be tokenized into ‘It’, ‘is’ and ‘cold’.

Tokenization can split a text into either words or sentences.

    If the text is split into words using some separation technique, it is called word tokenization;
    the same separation applied to sentences is called sentence tokenization.

In [6]:
# Let's import the relevant libraries
import nltk
nltk.download('punkt')

#Simple example of word tokenization
from nltk.tokenize import word_tokenize

text = 'Tokenization is breaking the raw text into small chunks'
print(word_tokenize(text))   #print tokenized words

'''You can see the sentence is broken down into tokens'''

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['Tokenization', 'is', 'breaking', 'the', 'raw', 'text', 'into', 'small', 'chunks']


'You can see the sentence is broken down into tokens'

#### Example: split a text into sentences first, then tokenize the words in each sentence

In [10]:
from nltk.tokenize import sent_tokenize

text2 = "Hello everyone. Welcome to the NLP Class. We'll learn about tokenization."
for t in sent_tokenize(text2):
    x = word_tokenize(t)
    print(x)

['Hello', 'everyone', '.']
['Welcome', 'to', 'the', 'NLP', 'Class', '.']
['We', "'ll", 'learn', 'about', 'tokenization', '.']
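
To see why a dedicated sentence tokenizer is useful, the sketch below splits sentences with a naive rule instead: break after '.', '!' or '?' whenever whitespace follows. The helper name `naive_sent_split` is hypothetical and the code uses only Python's standard `re` module; it illustrates the idea and is not how NLTK works internally.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text2 = "Hello everyone. Welcome to the NLP Class. We'll learn about tokenization."
print(naive_sent_split(text2))

# The naive rule also splits after abbreviations such as "Mr."
print(naive_sent_split("Mr. Manoj buys property. He lives in Mumbai."))
```

The first call gives the three expected sentences, but the second one breaks "Mr." off as a sentence of its own, which is the kind of mistake a simple punctuation rule cannot avoid.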


**Stop words:**

Stop words are words that add little or no meaning to a sentence, so removing them does not affect the processing of the text for the defined purpose. They are removed from the vocabulary to reduce noise and to reduce the dimension of the feature set.
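
As a minimal sketch of stop-word removal, the example below filters a token list against a tiny, hand-picked stop-word set. Both the set `STOP_WORDS` and the helper `remove_stop_words` are assumptions made for illustration; libraries such as NLTK provide much larger curated lists.

```python
# A tiny, hand-picked stop-word set (an assumption for illustration only)
STOP_WORDS = {"is", "the", "a", "an", "of", "into", "to", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop-word set."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ['Tokenization', 'is', 'breaking', 'the', 'raw', 'text', 'into', 'small', 'chunks']
print(remove_stop_words(tokens))
```

Only the content-bearing words remain, so the feature set built from these tokens is smaller.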

    Different libraries to perform word tokenization:
          NLTK, spaCy, Gensim, Keras

**Various Tokenization Techniques:**

    1. Whitespace Tokenization
This is the simplest tokenization technique. Given a sentence or paragraph, it produces words by splitting the input wherever whitespace is encountered. It is the fastest tokenization technique, but it only works for languages in which whitespace separates the sentence into meaningful words.


In [11]:
# You can also import the WhitespaceTokenizer class from nltk
from nltk.tokenize import WhitespaceTokenizer

In [15]:
wtk = WhitespaceTokenizer()

#Give string input
text3 = 'Natural Language Processing is a subset of Deep Learning'

#use tokenize method
tokens = wtk.tokenize(text3)
print(tokens)

['Natural', 'Language', 'Processing', 'is', 'a', 'subset', 'of', 'Deep', 'Learning']
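
For simple input, Python's built-in `str.split()` plays the same role, and it makes the main limitation of whitespace tokenization easy to see: punctuation stays attached to the neighbouring word. The variable names below are chosen for this sketch only.

```python
sample = "Hello, world!  Isn't   tokenization fun?"

# str.split() with no argument splits on any run of whitespace
whitespace_tokens = sample.split()
print(whitespace_tokens)
```

Tokens such as `'Hello,'` and `'fun?'` still carry their punctuation, which is why the techniques that follow add further rules.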


    2. Dictionary Based Tokenization
In this method, tokens are found by matching the text against tokens that already exist in a dictionary. If a token is not found, special rules are used to tokenize it. It is a more advanced technique than the whitespace tokenizer.

In [18]:
# a list of input texts and a dictionary that maps a category to a word
lst = ['Spinach','Mango','Cashewnut']
dct = {'vegetable':'Spinach','fruit':'Mango','dryfruit':'Cashewnut'}

# invert the dictionary so that each word maps to its category
word2index = {word: category for category, word in dct.items()}

# look up every whitespace-separated word of each text in the inverted dictionary
tokenized_words = [[word2index[word] for word in text.split()] for text in lst]
tokenized_words

[['vegetable'], ['fruit'], ['dryfruit']]
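
One common way to realise dictionary-based tokenization is greedy longest-match: at each position, take the longest dictionary entry that fits, and fall back to a simple rule when nothing matches. The sketch below is one possible interpretation, not a reference implementation; the function name `dictionary_tokenize`, the sample vocabulary and the fallback rule (take the characters up to the next whitespace) are all assumptions.

```python
def dictionary_tokenize(text, vocabulary):
    """Greedy longest-match tokenization against a known vocabulary."""
    tokens = []
    max_len = max(len(entry) for entry in vocabulary)
    i = 0
    while i < len(text):
        if text[i].isspace():          # skip whitespace between tokens
            i += 1
            continue
        match = None
        # try the longest candidate first, then shorter ones
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocabulary:
                match = text[i:j]
                break
        if match is None:
            # fallback rule (assumption): consume up to the next whitespace
            j = i
            while j < len(text) and not text[j].isspace():
                j += 1
            match = text[i:j]
        tokens.append(match)
        i += len(match)
    return tokens

vocab = {"New York", "New", "York", "is", "big"}
print(dictionary_tokenize("New York is big", vocab))
```

Because the longest entry wins, "New York" is kept as one token even though "New" and "York" are also in the vocabulary.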

    3. Rule Based Tokenization
In this technique, a set of rules is created for the specific problem, and the tokenization is done based on those rules. For example, rules can be based on the grammar of a particular language.

    Regular Expression Tokenizer
This technique uses a regular expression to control how text is split into tokens. Regular expressions range from simple to complex and can be difficult to comprehend. This technique should be preferred when the above methods do not serve the required purpose. It is a rule-based tokenizer.

In [20]:
# import RegexpTokenizer
from nltk.tokenize import RegexpTokenizer

tk = RegexpTokenizer(r"[\w']+")   # [\w']+ matches runs of word characters and apostrophes, so whole words are kept and other punctuation is dropped

#give an input string
text4 = "Let's see if we can work together."
tokens = tk.tokenize(text4)
tokens

["Let's", 'see', 'if', 'we', 'can', 'work', 'together']
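
The choice of pattern decides what counts as a token. As a small illustration with Python's standard `re` module (whose `findall` returns the pattern matches, much like the tokenizer above does), compare a pattern without the apostrophe to the one used in the cell:

```python
import re

sentence = "Let's see if we can work together."

# \w+ alone treats the apostrophe as a separator, so "Let's" is split in two
without_apostrophe = re.findall(r"\w+", sentence)
print(without_apostrophe)

# Adding ' to the character class keeps contractions in one piece
with_apostrophe = re.findall(r"[\w']+", sentence)
print(with_apostrophe)
```

In both cases the final period is dropped, because neither pattern matches punctuation other than the apostrophe.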

    Punctuation-based tokenizer
Punctuation-based tokenization splits on whitespace and punctuation and also retains the punctuation marks as tokens. Unlike the regular expression example above, which drops the final period, it keeps every punctuation mark, so no part of the text is lost.

In [22]:
# import wordpunct_tokenize from the nltk library
from nltk.tokenize import wordpunct_tokenize

text5 = 'Mr.Manoj buys property in: Hyderabad & Mumbai'
tokens = wordpunct_tokenize(text5)
tokens

['Mr', '.', 'Manoj', 'buys', 'property', 'in', ':', 'Hyderabad', '&', 'Mumbai']
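
The behaviour above can be approximated with a single regular expression that matches either a run of word characters or a run of punctuation. This is a sketch that assumes the pattern `\w+|[^\w\s]+`; the helper name `punct_tokenize` is hypothetical.

```python
import re

def punct_tokenize(text):
    """Return runs of word characters and runs of punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]+", text)

text5 = 'Mr.Manoj buys property in: Hyderabad & Mumbai'
print(punct_tokenize(text5))
```

Consecutive punctuation marks are grouped, so an input like "Wait...what?!" yields `'...'` and `'?!'` as single tokens.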

    Tweet Tokenizer
Special texts, such as tweets, have a characteristic structure, and the generic tokenizers mentioned above fail to produce viable tokens when applied to these datasets. NLTK offers a special tokenizer for tweets to help in this case. This is a rule-based tokenizer that keeps emoticons together as single tokens and can convert HTML entities, shorten problematic character sequences, remove Twitter handles, and normalize text length by reducing the occurrence of repeated letters.

In [24]:
#import TweetTokenizer from nltk
from nltk.tokenize import TweetTokenizer

#create object of tokenizer
tknzr = TweetTokenizer(strip_handles=True)
tweet = " @NLP_learner: NLP is way tooo coool:-) :-P <3"

x = tknzr.tokenize(tweet)
print(x)

[':', 'NLP', 'is', 'way', 'tooo', 'coool', ':-)', ':-P', '<3']
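
Two of the clean-up steps mentioned above, removing handles and shortening repeated letters, can be approximated with plain regular expressions. The sketch below is a simplified stand-in, not NLTK's actual implementation; the function name `normalize_tweet` and the handle pattern (an `@` followed by 1 to 15 word characters) are assumptions.

```python
import re

def normalize_tweet(text):
    """Approximate two tweet clean-up steps with simple regular expressions."""
    # Drop @handles (simplified: '@' followed by 1-15 word characters)
    text = re.sub(r"@\w{1,15}", "", text)
    # Shorten any character repeated 3 or more times to exactly 3 repeats
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)

print(normalize_tweet("@NLP_learner: NLP is waaaaay tooooo coooool!!!!!"))
```

Runs of exactly three characters are left as they are, so words such as "tooo" and "coool" pass through this rule unchanged.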


    MWE (Multi-Word Expression) Tokenizer
The multi-word expression tokenizer is a rule-based, “add-on” tokenizer offered by NLTK. Once the text has been tokenized by a tokenizer of choice, some tokens can be re-grouped into multi-word expressions.

The MWE tokenizer takes an already tokenized text (a list of tokens) and merges multi-word expressions into single tokens, using a lexicon of MWEs.

In [25]:
from nltk.tokenize import MWETokenizer

# Create an MWETokenizer with an initial lexicon of multi-word expressions
tk = MWETokenizer([('M', 'W', 'E'), ('Multi', 'Word', 'Tokenizer')])
tk.add_mwe(('Natural', 'Language', 'Processing'))

# Create a string input
text = "What is M W E in Natural Language Processing"

# Split on whitespace first, then merge the known multi-word expressions
tokenized = tk.tokenize(text.split())

print(tokenized)

['What', 'is', 'M_W_E', 'in', 'Natural_Language_Processing']
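
The merging step itself is simple to sketch in plain Python: scan the token list and, wherever a known expression starts, replace its tokens with one joined token. The helper `merge_mwes` below is hypothetical and only illustrates the idea; it prefers longer expressions when several could match.

```python
def merge_mwes(tokens, mwes, separator="_"):
    """Merge known multi-word expressions in a token list into single tokens."""
    # Try longer expressions first; ignore empty entries
    lexicon = sorted((tuple(m) for m in mwes if m), key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for mwe in lexicon:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # no expression starts here, keep the token as it is
            merged.append(tokens[i])
            i += 1
    return merged

lexicon = [('M', 'W', 'E'), ('Natural', 'Language', 'Processing')]
print(merge_mwes("What is M W E in Natural Language Processing".split(), lexicon))
```

For this input the sketch produces the same token list as the NLTK cell above.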


    4. Penn TreeBank/Default Tokenization
A treebank is a corpus annotated with the syntactic and semantic structure of a language. The Penn Treebank is one of the largest published treebanks. This tokenization technique separates punctuation and clitics (words that attach to other words, as in I’m and don’t) and keeps hyphenated words together.

In [28]:
#import tokenizer from nltk
from nltk.tokenize import TreebankWordTokenizer

#create object of TreebankWordTokenizer
tk = TreebankWordTokenizer()
text = "That's True, Mr. Manoj Singh."
tokens = tk.tokenize(text)
tokens

['That', "'s", 'True', ',', 'Mr.', 'Manoj', 'Singh', '.']
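
The clitic-splitting idea can be sketched with one regular expression that inserts a space before a few common English clitics. This is a simplified illustration, not the Treebank tokenizer's actual rule set; the pattern `CLITIC_RE` and the helper `split_clitics` are assumptions, and the clitic list is deliberately incomplete.

```python
import re

# A few common English clitics (an illustrative, incomplete list)
CLITIC_RE = re.compile(r"(\w)('s|'ll|'re|'ve|'m|'d|n't)\b")

def split_clitics(text):
    """Insert a space before each clitic, then split on whitespace."""
    return CLITIC_RE.sub(r"\1 \2", text).split()

print(split_clitics("I don't think that's what we'll do"))
```

Each clitic becomes its own token, for example "don't" turns into `'do'` and `"n't"`.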

    5. spaCy Tokenizer
This is a modern tokenization technique that is fast and easily customizable. It provides the flexibility to specify special tokens that must not be segmented, or that must be segmented using special rules. For example, if you want to keep $ as a separate token, that special rule takes precedence over the other tokenization operations.



In [30]:
import spacy
from spacy.lang.en import English

# Create a blank English pipeline; English() provides the tokenizer only (no tagger, parser, NER or word vectors)
nlp = English()


In [31]:
# give an input string with multiple sentences
string = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed 
liquid-fuel launch vehicle to orbit the Earth."""

In [32]:
text = nlp(string)

# Create list of word tokens
token_list = [token.text for token in text]
token_list

['Founded',
 'in',
 '2002',
 ',',
 'SpaceX',
 '’s',
 'mission',
 'is',
 'to',
 'enable',
 'humans',
 'to',
 'become',
 'a',
 'spacefaring',
 'civilization',
 'and',
 'a',
 'multi',
 '-',
 'planet',
 '\n',
 'species',
 'by',
 'building',
 'a',
 'self',
 '-',
 'sustaining',
 'city',
 'on',
 'Mars',
 '.',
 'In',
 '2008',
 ',',
 'SpaceX',
 '’s',
 'Falcon',
 '1',
 'became',
 'the',
 'first',
 'privately',
 'developed',
 '\n',
 'liquid',
 '-',
 'fuel',
 'launch',
 'vehicle',
 'to',
 'orbit',
 'the',
 'Earth',
 '.']

    6. Moses Tokenizer
This is an advanced tokenizer that was available before spaCy was introduced. It is essentially a collection of complex normalization and segmentation logic that works very well for structured languages like English.

In [34]:
!pip install sacremoses
from sacremoses import MosesTokenizer

Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)

In [36]:
#create tokenizer object
mt = MosesTokenizer()
text = "okay,Let me check Moses Tokenizer, Mr.Vishal"
tokens = mt.tokenize(text)
tokens

['okay', ',', 'Let', 'me', 'check', 'Moses', 'Tokenizer', ',', 'Mr.Vishal']

---------------