The input to natural language processing (NLP) is text. This text is collected from many different sources, so it needs substantial cleaning and preprocessing before it can be used for analysis.

**These are some of the methods of processing the data in NLP:**

* Tokenization
* Stop words removal
* Stemming
* Normalization
* Lemmatization
* Parts of speech tagging

# Tokenization 

* Tokenization converts a paragraph into sentences and words. The resulting tokens are returned in the form of a **list**.

* Tokenization breaks raw text into small chunks, such as words or sentences, called tokens. These tokens help in understanding the context and in developing NLP models. Analyzing the sequence of tokens helps in interpreting the meaning of the text.

* For example, the text “It is raining” can be tokenized into ‘It’, ‘is’, ‘raining’. 
* There are different methods and libraries available to perform tokenization. NLTK, Gensim, and Keras are some of the libraries that can be used to accomplish the task.

* Tokenization can separate either words or sentences. Splitting text into words is called word tokenization, and splitting it into sentences is called sentence tokenization.

* Various tokenization techniques are available, and the right choice depends on the language and the purpose of modeling. Below are a few of the tokenization techniques used in NLP.

<center><img src="https://miro.medium.com/max/1400/1*FTEu803GEsNrNslvY1RbXQ.png" width="600"/></center>


**White Space Tokenization:**

This is the simplest tokenization technique. Given a sentence or paragraph, it produces words by splitting the input whenever white space is encountered. It is the fastest tokenization technique, but it only works for languages in which white space breaks a sentence into meaningful words, such as English.
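As a minimal sketch, whitespace tokenization in Python needs nothing more than the built-in `str.split()`. The sample sentence is made up for illustration.

```python
text = "It is raining, but we're going out."

# str.split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines) and drops empty strings.
tokens = text.split()
print(tokens)
# ['It', 'is', 'raining,', 'but', "we're", 'going', 'out.']
```

Note that punctuation stays attached to the neighbouring word (`'raining,'`, `'out.'`), which is one reason the more elaborate techniques below exist.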

**Dictionary Based Tokenization:**

In this method, tokens are identified by looking them up in an existing dictionary. If a token is not found, special rules are used to tokenize it. It is a more advanced technique than whitespace tokenization.
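One common way to realise this idea is greedy longest-match segmentation, sketched below. The function name `dictionary_tokenize`, the toy dictionary, and the fallback rule (emit a single unknown character) are assumptions made for illustration, not a specific library's behaviour.

```python
def dictionary_tokenize(text, dictionary):
    """Greedy longest-match segmentation against a known word list."""
    max_len = max(len(word) for word in dictionary)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in dictionary:
                break
        else:
            # Fallback rule (an assumption): emit one unknown character as a token.
            candidate = text[pos]
        tokens.append(candidate)
        pos += len(candidate)
    return tokens


toy_dictionary = {"it", "is", "rain", "raining"}
print(dictionary_tokenize("itisraining", toy_dictionary))
# ['it', 'is', 'raining']
```

Because the longest entry wins, `raining` is preferred over `rain`; this kind of lookup is mainly useful for text without reliable white space between words.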

**Rule Based Tokenization:**

In this technique, a set of rules is created for the specific problem, and tokenization is done according to those rules. An example is a set of rules based on the grammar of a particular language.

**Regular Expression Tokenizer:**

This technique uses regular expressions to control how text is split into tokens. Regular expressions range from simple to complex and can be difficult to comprehend. This technique should be preferred when the methods above do not serve the required purpose. It is a form of rule-based tokenization.
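A small sketch using Python's standard `re` module shows how a pattern controls the result. The pattern and the sample sentence are illustrative assumptions; a real project would tune the alternatives to its own data.

```python
import re

text = "Dr. Smith paid $4.50 for a state-of-the-art pen, didn't he?"

# Alternatives are tried left to right at each position:
#   1. prices and numbers, e.g. $4.50
#   2. words, optionally joined by hyphens or apostrophes
#   3. any single character that is neither a word character nor whitespace
pattern = r"\$?\d+(?:\.\d+)?|\w+(?:[-']\w+)*|[^\w\s]"

tokens = re.findall(pattern, text)
print(tokens)
# ['Dr', '.', 'Smith', 'paid', '$4.50', 'for', 'a', 'state-of-the-art',
#  'pen', ',', "didn't", 'he', '?']
```

Changing the order or content of the alternatives changes what counts as a token, which is both the strength and the maintenance cost of this approach.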

**Penn TreeBank Tokenization:**

A treebank is a corpus annotated with the syntactic and semantic structure of a language. The Penn Treebank is one of the largest published treebanks. Its tokenization convention separates punctuation and clitics (shortened forms attached to other words, such as the ’m in I’m or the n’t in don’t) into their own tokens, and keeps hyphenated words together.
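The sketch below approximates these conventions with three regular-expression substitutions. It is a simplified illustration, not the actual Penn Treebank tokenizer: the function name `treebank_like_tokenize` and the short clitic list are assumptions, and the real rules treat periods, quotes, and many special cases more carefully.

```python
import re


def treebank_like_tokenize(text):
    """Rough approximation of Penn Treebank-style tokenization."""
    # Split the negative clitic: "don't" -> "do n't"
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # Split other common clitics: "I'm" -> "I 'm", "Python's" -> "Python 's"
    text = re.sub(r"(\w)('(?:s|m|d|re|ve|ll))\b", r"\1 \2", text)
    # Put spaces around punctuation; hyphens and apostrophes are left alone
    text = re.sub(r'([.,!?;:()"])', r" \1 ", text)
    return text.split()


print(treebank_like_tokenize("I'm sure they don't sell state-of-the-art toys."))
# ['I', "'m", 'sure', 'they', 'do', "n't", 'sell', 'state-of-the-art', 'toys', '.']
```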

**spaCy Tokenizer:**

This is a modern tokenization technique that is fast and easily customizable. It provides the flexibility to specify special tokens that should not be segmented, or that should be segmented using special rules. For example, if you specify that $ should be kept as a separate token, that rule takes precedence over other tokenization operations.

**Moses Tokenizer:**

This is an advanced tokenizer that was available before spaCy was introduced. It is essentially a collection of complex normalization and segmentation logic that works very well for structured languages like English.

**Subword Tokenization:**

This kind of tokenization is very useful for applications where subwords carry meaning. In this technique, the most frequently used words are given unique ids, while less frequent words are split into subwords that still carry meaning on their own.

For example, if the word few appears frequently in the text, it is assigned a unique id, whereas fewer and fewest, which are rarer, are split into subwords such as few, er, and est. This saves the language model from learning fewer and fewest as two unrelated words. It also lets the model represent words it did not see during training by composing them from known subwords. The main types of subword tokenization are listed below; Byte-Pair Encoding and WordPiece are discussed briefly.
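The few / fewer / fewest example can be sketched as a greedy longest-match lookup against a subword vocabulary. The function name `subword_tokenize`, the three-entry vocabulary, and the `<unk>` placeholder are assumptions for illustration; real subword tokenizers learn their vocabulary from data.

```python
def subword_tokenize(word, vocab, unk="<unk>"):
    """Split a word into the longest known subwords, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known subword starts here, so the whole word is unknown.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


vocab = {"few", "er", "est"}
for word in ["few", "fewer", "fewest"]:
    print(word, "->", subword_tokenize(word, vocab))
# few -> ['few']
# fewer -> ['few', 'er']
# fewest -> ['few', 'est']
```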

<center><img src="https://miro.medium.com/max/1400/1*7_s0e2RuWz5_0GYwqQzFlQ.png" width="600"/></center>

1. Byte-Pair Encoding (BPE)
2. WordPiece
3. Unigram Language Model
4. SentencePiece

BPE is based on concepts from information theory and data compression: it originated as a compression algorithm that repeatedly replaces the most frequent pair of adjacent bytes with a new symbol. It does not use Huffman encoding itself, but the effect is similar in spirit: frequently used words end up represented by fewer symbols, while less frequent words need more symbols.

**Byte-Pair Encoding (BPE):**

BPE is a bottom-up subword tokenization technique. The steps of the BPE algorithm are given below.

1. Start by splitting the input words into single Unicode characters; each character is a symbol in the vocabulary.
2. Find the most frequently occurring pair of adjacent symbols under the current vocabulary.
3. Merge this pair into a new symbol and add it to the vocabulary, which increases the vocabulary size by one.
4. Repeat steps 2 and 3 until the defined number of tokens has been built or no pair of symbols occurs with the required frequency.
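The training loop described above can be sketched in plain Python as follows. This is a minimal illustration on a made-up word-frequency table, not a production implementation; the helper names `get_pair_counts`, `merge_pair`, and `learn_bpe` are hypothetical, and ties between equally frequent pairs are simply broken by whichever pair was seen first.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    # Start from single characters
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Pick the most frequent pair, then merge it
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


word_freqs = {"few": 5, "fewer": 2, "fewest": 1}
merges, vocab = learn_bpe(word_freqs, num_merges=2)
print(merges)  # [('f', 'e'), ('fe', 'w')]
print(vocab)   # {('few',): 5, ('few', 'e', 'r'): 2, ('few', 'e', 's', 't'): 1}
```

After two merges the frequent word `few` has become a single symbol, while the rarer words are still spelled out with several symbols.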

**WordPiece:**

WordPiece is similar to BPE except for the way a new token is added to the vocabulary. BPE merges the most frequently occurring pair of symbols. WordPiece also takes the frequency of the individual symbols into account and merges the pair with the highest value of the count below.

`Count(x, y) = frequency(x, y) / (frequency(x) * frequency(y))`

The pair of symbols with the maximum count is merged into the vocabulary. Compared to BPE, this allows pairs made of rare symbols to enter the vocabulary.
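To see how the two selection rules differ, the sketch below computes both the raw pair frequency (the BPE criterion) and the normalised count above on a toy word-frequency table. The table and the function name `pair_statistics` are assumptions made for illustration.

```python
from collections import Counter


def pair_statistics(word_freqs):
    """Return (pair frequencies, normalised counts) over character symbols."""
    symbol_freq = Counter()
    pair_freq = Counter()
    for word, freq in word_freqs.items():
        symbols = list(word)
        for s in symbols:
            symbol_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    # Count(x, y) = frequency(x, y) / (frequency(x) * frequency(y))
    counts = {
        (a, b): freq / (symbol_freq[a] * symbol_freq[b])
        for (a, b), freq in pair_freq.items()
    }
    return pair_freq, counts


word_freqs = {"few": 5, "fewer": 2, "fewest": 1}
pair_freq, counts = pair_statistics(word_freqs)

bpe_choice = max(pair_freq, key=pair_freq.get)
wordpiece_choice = max(counts, key=counts.get)
print(bpe_choice)        # ('f', 'e')
print(wordpiece_choice)  # ('s', 't')
```

Here `s` and `t` each occur only once and always together, so their normalised count is 1.0 and they are merged first, even though the pair is rare. The raw-frequency rule picks the common pair `('f', 'e')` instead.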

Let's do the tokenization with different methods: 

## 1. NLTK: 

In [1]:
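import nltk  # Natural Language Toolkit

# Download the Punkt data used by sent_tokenize() and word_tokenize().
# (nltk.download() with no arguments opens the interactive downloader instead;
# newer NLTK releases may ask for 'punkt_tab' rather than 'punkt'.)
nltk.download('punkt')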

In [2]:
corpus = "Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.Often, programmers fall in love with Python because of the increased productivity it provides. Since there is no compilation step, the edit-test-debug cycle is incredibly fast. Debugging Python programs is easy: a bug or bad input will never cause a segmentation fault. Instead, when the interpreter discovers an error, it raises an exception. When the program doesn't catch the exception, the interpreter prints a stack trace. A source level debugger allows inspection of local and global variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a line at a time, and so on. The debugger is written in Python itself, testifying to Python's introspective power. On the other hand, often the quickest way to debug a program is to add a few print statements to the source: the fast edit-test-debug cycle makes this simple approach very effective."

* To split a paragraph into sentences, we use the `sent_tokenize()` function.

In [3]:
# Sentence tokenization: split the paragraph into sentences
token_corpus = nltk.sent_tokenize(corpus)

In [5]:
token_corpus  # results

['Python is an interpreted, object-oriented, high-level programming language with dynamic semantics.',
 'Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.',
 "Python's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance.",
 'Python supports modules and packages, which encourages program modularity and code reuse.',
 'The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.Often, programmers fall in love with Python because of the increased productivity it provides.',
 'Since there is no compilation step, the edit-test-debug cycle is incredibly fast.',
 'Debugging Python programs is easy: a bug or bad input will never cause a segmentati

* To split a paragraph into words, we use the `word_tokenize()` function.

In [6]:
# Word tokenization: split the paragraph into words and punctuation marks
word_corpus = nltk.word_tokenize(corpus)

**This is the first step in text preprocessing: you can split the text into either sentences or words.**

In [8]:
word_corpus  # results

['Python',
 'is',
 'an',
 'interpreted',
 ',',
 'object-oriented',
 ',',
 'high-level',
 'programming',
 'language',
 'with',
 'dynamic',
 'semantics',
 '.',
 'Its',
 'high-level',
 'built',
 'in',
 'data',
 'structures',
 ',',
 'combined',
 'with',
 'dynamic',
 'typing',
 'and',
 'dynamic',
 'binding',
 ',',
 'make',
 'it',
 'very',
 'attractive',
 'for',
 'Rapid',
 'Application',
 'Development',
 ',',
 'as',
 'well',
 'as',
 'for',
 'use',
 'as',
 'a',
 'scripting',
 'or',
 'glue',
 'language',
 'to',
 'connect',
 'existing',
 'components',
 'together',
 '.',
 'Python',
 "'s",
 'simple',
 ',',
 'easy',
 'to',
 'learn',
 'syntax',
 'emphasizes',
 'readability',
 'and',
 'therefore',
 'reduces',
 'the',
 'cost',
 'of',
 'program',
 'maintenance',
 '.',
 'Python',
 'supports',
 'modules',
 'and',
 'packages',
 ',',
 'which',
 'encourages',
 'program',
 'modularity',
 'and',
 'code',
 'reuse',
 '.',
 'The',
 'Python',
 'interpreter',
 'and',
 'the',
 'extensive',
 'standard',
 'library',

## 2. spaCy

In [10]:
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()

# Create a blank Tokenizer with just the English vocab
tokenizer = Tokenizer(nlp.vocab)

tokens = tokenizer("This is a sentence")

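ModuleNotFoundError: No module named 'thinc.neural'

The code above is valid; the error comes from the environment. `thinc.neural` exists only in older thinc releases, so this message usually means that the installed spaCy and thinc versions do not match. Reinstalling spaCy so that a compatible thinc version is pulled in should resolve it.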

## 3. Keras: 

In [11]:
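# Note: keras.preprocessing.text is deprecated and may be unavailable in recent Keras releases.
from keras.preprocessing.text import Tokenizer, text_to_word_sequence

tokenizer = Tokenizer()
# fit_on_texts() expects a list of texts; a bare string would be iterated character by character
tokenizer.fit_on_texts([corpus])

# Lowercase the text, strip punctuation, and split on whitespace
list_words = text_to_word_sequence(corpus)
list_words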

['python',
 'is',
 'an',
 'interpreted',
 'object',
 'oriented',
 'high',
 'level',
 'programming',
 'language',
 'with',
 'dynamic',
 'semantics',
 'its',
 'high',
 'level',
 'built',
 'in',
 'data',
 'structures',
 'combined',
 'with',
 'dynamic',
 'typing',
 'and',
 'dynamic',
 'binding',
 'make',
 'it',
 'very',
 'attractive',
 'for',
 'rapid',
 'application',
 'development',
 'as',
 'well',
 'as',
 'for',
 'use',
 'as',
 'a',
 'scripting',
 'or',
 'glue',
 'language',
 'to',
 'connect',
 'existing',
 'components',
 'together',
 "python's",
 'simple',
 'easy',
 'to',
 'learn',
 'syntax',
 'emphasizes',
 'readability',
 'and',
 'therefore',
 'reduces',
 'the',
 'cost',
 'of',
 'program',
 'maintenance',
 'python',
 'supports',
 'modules',
 'and',
 'packages',
 'which',
 'encourages',
 'program',
 'modularity',
 'and',
 'code',
 'reuse',
 'the',
 'python',
 'interpreter',
 'and',
 'the',
 'extensive',
 'standard',
 'library',
 'are',
 'available',
 'in',
 'source',
 'or',
 'binary',


## 4. TextBlob

In [14]:
from textblob import TextBlob

blob_text = TextBlob(corpus) 

print(blob_text.words)

['Python', 'is', 'an', 'interpreted', 'object-oriented', 'high-level', 'programming', 'language', 'with', 'dynamic', 'semantics', 'Its', 'high-level', 'built', 'in', 'data', 'structures', 'combined', 'with', 'dynamic', 'typing', 'and', 'dynamic', 'binding', 'make', 'it', 'very', 'attractive', 'for', 'Rapid', 'Application', 'Development', 'as', 'well', 'as', 'for', 'use', 'as', 'a', 'scripting', 'or', 'glue', 'language', 'to', 'connect', 'existing', 'components', 'together', 'Python', "'s", 'simple', 'easy', 'to', 'learn', 'syntax', 'emphasizes', 'readability', 'and', 'therefore', 'reduces', 'the', 'cost', 'of', 'program', 'maintenance', 'Python', 'supports', 'modules', 'and', 'packages', 'which', 'encourages', 'program', 'modularity', 'and', 'code', 'reuse', 'The', 'Python', 'interpreter', 'and', 'the', 'extensive', 'standard', 'library', 'are', 'available', 'in', 'source', 'or', 'binary', 'form', 'without', 'charge', 'for', 'all', 'major', 'platforms', 'and', 'can', 'be', 'freely', 'd

## 5. Gensim

In [15]:
from gensim.utils import tokenize 

list(tokenize(corpus))

['Python',
 'is',
 'an',
 'interpreted',
 'object',
 'oriented',
 'high',
 'level',
 'programming',
 'language',
 'with',
 'dynamic',
 'semantics',
 'Its',
 'high',
 'level',
 'built',
 'in',
 'data',
 'structures',
 'combined',
 'with',
 'dynamic',
 'typing',
 'and',
 'dynamic',
 'binding',
 'make',
 'it',
 'very',
 'attractive',
 'for',
 'Rapid',
 'Application',
 'Development',
 'as',
 'well',
 'as',
 'for',
 'use',
 'as',
 'a',
 'scripting',
 'or',
 'glue',
 'language',
 'to',
 'connect',
 'existing',
 'components',
 'together',
 'Python',
 's',
 'simple',
 'easy',
 'to',
 'learn',
 'syntax',
 'emphasizes',
 'readability',
 'and',
 'therefore',
 'reduces',
 'the',
 'cost',
 'of',
 'program',
 'maintenance',
 'Python',
 'supports',
 'modules',
 'and',
 'packages',
 'which',
 'encourages',
 'program',
 'modularity',
 'and',
 'code',
 'reuse',
 'The',
 'Python',
 'interpreter',
 'and',
 'the',
 'extensive',
 'standard',
 'library',
 'are',
 'available',
 'in',
 'source',
 'or',
 'binar

## 6. Transformer Tokenizers 

If you are curious about how transformer tokenizers work, you can open this [**link**](https://colab.research.google.com/drive/1v0LTKt0CGFdbjhoDUNhoWGRWpB-khXn4?usp=sharing).

<center><img src="https://thumbs.gfycat.com/UncomfortableHatefulKakarikis-size_restricted.gif" width="600"/></center>