# Exploring Tokenization

To build a vocabulary, the first step is to break the documents or sentences into chunks called tokens. Each token carries some semantic meaning. Tokenization is a fundamental step in almost any text-processing task.

Tokenization can be thought of as a segmentation technique in which you break larger chunks of text into smaller, meaningful ones. Tokens are generally words and numbers, but they can also include punctuation marks, symbols, and, at times, emoticons. The simplest approach, shown in the cells below, is to split on whitespace with `str.split()`; note how contractions, abbreviations, and punctuation stay attached to the neighbouring word.

In [1]:
sentence = "The capital of China is Beijing"
sentence.split()

['The', 'capital', 'of', 'China', 'is', 'Beijing']

In [2]:
sentence = "China's capital is Beijing"
sentence.split()

["China's", 'capital', 'is', 'Beijing']

In [3]:
sentence = "Beijing is where we'll go"
sentence.split()

['Beijing', 'is', 'where', "we'll", 'go']

In [4]:
sentence = "I'm going to travel to Beijing"
sentence.split()

["I'm", 'going', 'to', 'travel', 'to', 'Beijing']

In [5]:
sentence = "Most of the times umm I travel"
sentence.split()

['Most', 'of', 'the', 'times', 'umm', 'I', 'travel']

In [6]:
sentence = "Let's travel to Hong Kong from Beijing"
sentence.split()

["Let's", 'travel', 'to', 'Hong', 'Kong', 'from', 'Beijing']

In [7]:
sentence = "A friend is pursuing his M.S from Beijing"
sentence.split()

['A', 'friend', 'is', 'pursuing', 'his', 'M.S', 'from', 'Beijing']

In [8]:
sentence = "Beijing is a cool place!!! :-P <3 #Awesome"
sentence.split()

['Beijing', 'is', 'a', 'cool', 'place!!!', ':-P', '<3', '#Awesome']
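The last example hints at why whitespace splitting is rarely enough: `place!!!` keeps its punctuation, yet a blunt clean-up would damage `:-P` and `<3`. The sketch below is only an illustration of that trade-off, using the standard library; the clean-up step is an assumed, naive choice and not a recommended tokenizer.

```python
import string

sentence = "Beijing is a cool place!!! :-P <3 #Awesome"

# Plain whitespace splitting: punctuation stays glued to the word.
whitespace_tokens = sentence.split()

# Naive clean-up (an assumption for illustration): trim ASCII punctuation
# from both ends of every token. It repairs 'place!!!' but also destroys
# the emoticons and the hashtag marker.
stripped_tokens = [tok.strip(string.punctuation) for tok in whitespace_tokens]

print(whitespace_tokens)
print(stripped_tokens)
```

The tokenizers in the following sections can be read as progressively more careful answers to this problem.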

# Regexp Tokenizer

In [9]:
# Regular expressions are sequences of characters that define a search pattern.

# The regular expression \w+|\$[\d\.]+|\S+ allows three alternative patterns.
# The alternatives are tried from left to right at each position.
#
# First alternative: \w+ matches word characters (letters, digits, and the underscore;
# [a-zA-Z0-9_] for ASCII text). The + is a greedy quantifier: it matches one or more
# times, as many times as possible.
#
# Second alternative: \$[\d\.]+. Here, \$ matches the character $, \d matches a digit
# between 0 and 9, \. matches the character . (period), and + again acts as a quantifier
# matching one or more times.
#
# Third alternative: \S+. Here, \S matches any non-whitespace character and + again
# acts the same way as in the preceding two alternatives.

from nltk.tokenize import RegexpTokenizer

s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA."
# Use a raw string so that backslashes reach the regex engine unchanged.
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokenizer.tokenize(s)

['A',
 'Rolex',
 'watch',
 'costs',
 'in',
 'the',
 'range',
 'of',
 '$3000.0',
 '-',
 '$8000.0',
 'in',
 'USA',
 '.']
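To see why the order of the alternatives matters, the same pattern can be tried with the standard library's `re.findall`. This is a sketch under the assumption that `RegexpTokenizer` in its default mode behaves much like `re.findall` for this input; the reordered pattern `reordered` is a hypothetical variant added only for comparison.

```python
import re

s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA."

# Same pattern as above: word characters, then dollar amounts, then anything else.
pattern = r"\w+|\$[\d\.]+|\S+"
tokens = re.findall(pattern, s)

# Hypothetical reordering: with \S+ first, it wins at every position,
# so the result degenerates to plain whitespace splitting.
reordered = re.findall(r"\S+|\w+|\$[\d\.]+", s)

print(tokens)
print(reordered)
```

With the original order, `USA.` is split into `USA` and `.` because `\w+` matches first and stops at the period; the catch-all `\S+` only picks up what the first two alternatives cannot match.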

# Blankline Tokenizer

In [10]:
# There are other tokenizers built on top of RegexpTokenizer, such as
# BlanklineTokenizer, which tokenizes a string by treating blank lines as delimiters.
# Blank lines are lines that contain no characters other than spaces or tabs.

from nltk.tokenize import BlanklineTokenizer
s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.\n\n I want a book as well"
tokenizer = BlanklineTokenizer()
tokenizer.tokenize(s)

['A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.',
 'I want a book as well']
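The same behaviour can be approximated with `re.split`. The sketch below assumes the delimiter pattern `\s*\n\s*\n\s*` (two newlines with optional surrounding whitespace), which is my understanding of what `BlanklineTokenizer` uses internally; treat it as an illustration rather than the library's implementation.

```python
import re

s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.\n\n I want a book as well"

# Assumed delimiter: a blank line, together with any whitespace around it.
blank_line = re.compile(r"\s*\n\s*\n\s*")

# Here the pattern marks the gaps between tokens rather than the tokens
# themselves; empty chunks are dropped.
paragraphs = [chunk for chunk in blank_line.split(s) if chunk]

# A single newline is not a blank line, so this text stays in one piece.
single_break = [chunk for chunk in blank_line.split("line one\nline two") if chunk]

print(paragraphs)
print(single_break)
```

Because the delimiter also consumes the whitespace around the blank line, the leading space before `I want a book as well` does not appear in the output.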

# WordPunct Tokenizer

In [11]:
# WordPunctTokenizer is another implementation on top of RegexpTokenizer. It
# tokenizes a text into runs of word characters (letters, digits, underscore) and runs
# of punctuation characters using the regular expression \w+|[^\w\s]+.

from nltk.tokenize import WordPunctTokenizer
s = "A Rolex watch costs in the range of $3000.0 - $8000.0 in USA.\n I want a book as well"
tokenizer = WordPunctTokenizer()
tokenizer.tokenize(s)

['A',
 'Rolex',
 'watch',
 'costs',
 'in',
 'the',
 'range',
 'of',
 '$',
 '3000',
 '.',
 '0',
 '-',
 '$',
 '8000',
 '.',
 '0',
 'in',
 'USA',
 '.',
 'I',
 'want',
 'a',
 'book',
 'as',
 'well']
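The output shows the cost of this simple pattern: a price such as `$3000.0` falls apart into four tokens. A small comparison with `re.findall` makes the reason visible. It reuses the two patterns already quoted in this notebook; nothing here calls NLTK itself.

```python
import re

price = "$3000.0"

# WordPunct-style pattern: runs of word characters or runs of punctuation.
word_punct = re.findall(r"\w+|[^\w\s]+", price)

# Pattern from the RegexpTokenizer section, which has a dedicated
# alternative for dollar amounts.
with_price_rule = re.findall(r"\w+|\$[\d\.]+|\S+", price)

# Consecutive punctuation characters stay together as one token.
exclamations = re.findall(r"\w+|[^\w\s]+", "place!!!")

print(word_punct)
print(with_price_rule)
print(exclamations)
```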

# TreebankWord Tokenizer

In [12]:
# The Treebank tokenizer also uses regular expressions to tokenize text according to the
# conventions of the Penn Treebank (https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html). Here, words are mostly
# split based on punctuation.

# The Treebank tokenizer splits contractions such as doesn't into does and n't, and I'm
# into I and 'm. It also separates a period at the end of the text into its own token
# instead of leaving it attached to the last word. Punctuation such as commas is split
# off when followed by whitespace.

from nltk.tokenize import TreebankWordTokenizer

s = "I'm going to buy a Rolex watch which doesn't cost more than $3000.0"
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(s)

['I',
 "'m",
 'going',
 'to',
 'buy',
 'a',
 'Rolex',
 'watch',
 'which',
 'does',
 "n't",
 'cost',
 'more',
 'than',
 '$',
 '3000.0']
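The contraction handling can be imitated with two substitutions. The function `split_contractions` below is a hypothetical, heavily simplified sketch: it covers only a handful of English clitics and is not the rule set that `TreebankWordTokenizer` applies.

```python
import re

def split_contractions(text):
    """Rough approximation of Treebank-style contraction splitting."""
    # Separate the negation clitic: "doesn't" -> "does n't".
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text, flags=re.IGNORECASE)
    # Separate a few common clitics: "I'm" -> "I 'm", "we'll" -> "we 'll".
    text = re.sub(r"(\w)('(?:m|ll|re|ve|s|d))\b", r"\1 \2", text, flags=re.IGNORECASE)
    return text.split()

tokens = split_contractions("I'm sure it doesn't cost much")
print(tokens)
```

Splitting `doesn't` into `does` and `n't` keeps the negation as a token of its own, which is often useful for downstream steps that need to see it.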

# Tweet Tokenizer

In [13]:
# Social media has given rise to an informal language in which people tag each
# other using their handles and use a lot of emoticons, hashtags, and
# abbreviated text to express themselves. We need tokenizers that can parse such
# text without breaking these elements apart. TweetTokenizer is designed for this use case.

from nltk.tokenize import TweetTokenizer

s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"
tokenizer = TweetTokenizer()
tokenizer.tokenize(s)

['@amankedia',
 "I'm",
 'going',
 'to',
 'buy',
 'a',
 'Rolexxxxxxxx',
 'watch',
 '!',
 '!',
 '!',
 ':-D',
 '#happiness',
 '#rolex',
 '<3']

In [14]:
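# strip_handles=True removes @handles from the output; reduce_len=True shortens
# runs of a repeated character to at most three (Rolexxxxxxxx becomes Rolexxx).

from nltk.tokenize import TweetTokenizer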

s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D #happiness #rolex <3"
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
tokenizer.tokenize(s)

["I'm",
 'going',
 'to',
 'buy',
 'a',
 'Rolexxx',
 'watch',
 '!',
 '!',
 '!',
 ':-D',
 '#happiness',
 '#rolex',
 '<3']
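The effect of the two options can be mimicked with plain regular expressions. The helpers `reduce_lengthening` and `strip_handles` below are illustrative sketches with assumed, simplified patterns; the patterns NLTK uses for handles in particular are more careful than the one shown here.

```python
import re

def reduce_lengthening(text):
    # Any character repeated three or more times in a row is cut down to three.
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)

def strip_handles(text):
    # Simplified assumption: a handle is "@" followed by word characters,
    # and the "@" must not be preceded by a word character (skips e-mails).
    return re.sub(r"(?<!\w)@\w+", "", text).strip()

s = "@amankedia I'm going to buy a Rolexxxxxxxx watch!!! :-D"
cleaned = reduce_lengthening(strip_handles(s))
print(cleaned)
```

Cutting repeats down to three rather than one is a deliberate middle ground: `Rolexxx` still signals emphasis, while the many possible spellings of an elongated word collapse into a single form.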