# Tokenization

* Tokenization is the process of splitting a text into individual units called tokens. Tokens are typically words, but they can also be subwords, sentences (or phrases), or individual characters. Tokenization is an important NLP task because it gives the computer discrete units from which it can analyze the structure and meaning of a text more easily.


In [1]:
corpus = """I'll recently watched o'clock this show's called mindhunters:). 
I totally loved it 😍. It was gr8 <3! #bingewatching #nothingtodo 😎"""
print(corpus)

I'll recently watched o'clock this show's called mindhunters:). 
I totally loved it 😍. It was gr8 <3! #bingewatching #nothingtodo 😎


### Tokenising on whitespace using Python

In [2]:
print(corpus.split())

["I'll", 'recently', 'watched', "o'clock", 'this', "show's", 'called', 'mindhunters:).', 'I', 'totally', 'loved', 'it', '😍.', 'It', 'was', 'gr8', '<3!', '#bingewatching', '#nothingtodo', '😎']
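One practical consequence of splitting only on whitespace is that punctuation stays glued to the neighbouring word, so the same word can show up as several different tokens. A minimal sketch of that effect, using a made-up sentence (not part of the corpus above) and only the standard library:

```python
from collections import Counter

# Hypothetical example sentence, chosen only to show the effect.
sample = "I loved the show. The show was great!"

# str.split() with no arguments splits on any run of whitespace.
tokens = sample.split()
counts = Counter(tokens)

print(tokens)
# "show" and "show." are counted as two different tokens.
print(counts["show"], counts["show."])
```

This is one reason the dedicated tokenizers in the following sections separate punctuation from words.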


# Types of Tokenization in NLP

## 1) Word Tokenizer


* It splits the text into individual words, called tokens. It also separates punctuation marks (such as . , ! and ?) and other characters such as # (hashtags) and emojis into their own tokens.

In [3]:
# nltk stands for Natural Language Toolkit

from nltk.tokenize import word_tokenize

word_tokenize(corpus)

['I',
 "'ll",
 'recently',
 'watched',
 "o'clock",
 'this',
 'show',
 "'s",
 'called',
 'mindhunters',
 ':',
 ')',
 '.',
 'I',
 'totally',
 'loved',
 'it',
 '😍',
 '.',
 'It',
 'was',
 'gr8',
 '<',
 '3',
 '!',
 '#',
 'bingewatching',
 '#',
 'nothingtodo',
 '😎']

In [4]:
word = word_tokenize(corpus)

print(word)

['I', "'ll", 'recently', 'watched', "o'clock", 'this', 'show', "'s", 'called', 'mindhunters', ':', ')', '.', 'I', 'totally', 'loved', 'it', '😍', '.', 'It', 'was', 'gr8', '<', '3', '!', '#', 'bingewatching', '#', 'nothingtodo', '😎']


### Note

* NLTK's word tokenizer splits on whitespace and also splits contractions and possessives: "I'll" becomes "I" and "'ll", and "show's" becomes "show" and "'s". On the other hand, it does not split "o'clock" and keeps it as a single token.

## 2) Word Punctuation Tokenizer

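* It separates the text into individual words and splits off all punctuation. Every run of punctuation characters becomes its own token, so "I'll" becomes "I", "'" and "ll", while ":)." stays together as one token.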

In [5]:
from nltk.tokenize import wordpunct_tokenize

wordpunct_tokenize(corpus)

['I',
 "'",
 'll',
 'recently',
 'watched',
 'o',
 "'",
 'clock',
 'this',
 'show',
 "'",
 's',
 'called',
 'mindhunters',
 ':).',
 'I',
 'totally',
 'loved',
 'it',
 '😍.',
 'It',
 'was',
 'gr8',
 '<',
 '3',
 '!',
 '#',
 'bingewatching',
 '#',
 'nothingtodo',
 '😎']

In [6]:
word_punctuation = wordpunct_tokenize(corpus)

print(word_punctuation)

['I', "'", 'll', 'recently', 'watched', 'o', "'", 'clock', 'this', 'show', "'", 's', 'called', 'mindhunters', ':).', 'I', 'totally', 'loved', 'it', '😍.', 'It', 'was', 'gr8', '<', '3', '!', '#', 'bingewatching', '#', 'nothingtodo', '😎']
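The output above can be reproduced with a single regular expression. A minimal sketch with the standard re module, assuming the pattern \w+|[^\w\s]+ that the NLTK documentation gives for this tokenizer; the sample sentence is made up for illustration:

```python
import re

# Either a run of word characters, or a run of characters that are
# neither word characters nor whitespace (i.e. punctuation and symbols).
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"

# Hypothetical sample sentence.
sample = "I'll watch it at 5 o'clock :)."

wordpunct_tokens = re.findall(WORDPUNCT_PATTERN, sample)
print(wordpunct_tokens)
```

Because the second alternative matches whole runs of punctuation, ":)." comes out as one token while each apostrophe is cut away from the letters around it.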


## 3) Sentence Tokenizer


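* It splits the text into sentences. Sentence boundaries are looked for at sentence-ending punctuation such as . (full stop) and ! (exclamation mark). NLTK's sent_tokenize uses the pre-trained Punkt model to decide which of these marks actually end a sentence.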

In [7]:
from nltk.tokenize import sent_tokenize

sent_tokenize(corpus)

["I'll recently watched o'clock this show's called mindhunters:).",
 'I totally loved it 😍.',
 'It was gr8 <3!',
 '#bingewatching #nothingtodo 😎']

In [8]:
sentence = sent_tokenize(corpus)

print(sentence)

["I'll recently watched o'clock this show's called mindhunters:).", 'I totally loved it 😍.', 'It was gr8 <3!', '#bingewatching #nothingtodo 😎']
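To see why sentence splitting is more than "cut at every full stop", here is a deliberately naive splitter written with the standard re module. It is only a sketch for comparison, not how sent_tokenize works, and the function name naive_sent_split and the sample sentences are made up for illustration:

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

# Works on simple text ...
simple = naive_sent_split("I totally loved it. It was gr8 <3! #bingewatching")
print(simple)

# ... but wrongly cuts after the abbreviation "Dr.".
abbrev = naive_sent_split("Dr. Smith arrived. He was late!")
print(abbrev)
```

The second call shows the weakness of a fixed rule: the abbreviation is treated as a sentence end, which is the kind of case a trained sentence tokenizer is meant to handle.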


## 4) Tweet Tokenizer

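* The word tokenizer and the word punctuation tokenizer break up text emoticons: both split "<3" into '<' and '3', the word tokenizer splits ":)" into ':' and ')', and the word punctuation tokenizer fuses ":)" with the following full stop into ':).'. This is something that we don't want.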


* Emojis have their own significance in areas like sentiment analysis, where a happy face or a sad face alone can be a really good predictor of the sentiment.


* Similarly, hashtags are broken into two tokens ('#' and the tag text). A hashtag is used for searching specific topics or photos on social media apps such as Instagram and Facebook.


* The Tweet tokenizer breaks the text into individual tokens but keeps text emoticons, hashtags (#) and words containing an apostrophe (') intact.

In [9]:
from nltk.tokenize import TweetTokenizer

TweetTokenizer().tokenize(corpus)

["I'll",
 'recently',
 'watched',
 "o'clock",
 'this',
 "show's",
 'called',
 'mindhunters',
 ':)',
 '.',
 'I',
 'totally',
 'loved',
 'it',
 '😍',
 '.',
 'It',
 'was',
 'gr8',
 '<3',
 '!',
 '#bingewatching',
 '#nothingtodo',
 '😎']

In [10]:
tokenizer = TweetTokenizer()
tokenizer.tokenize(corpus)

["I'll",
 'recently',
 'watched',
 "o'clock",
 'this',
 "show's",
 'called',
 'mindhunters',
 ':)',
 '.',
 'I',
 'totally',
 'loved',
 'it',
 '😍',
 '.',
 'It',
 'was',
 'gr8',
 '<3',
 '!',
 '#bingewatching',
 '#nothingtodo',
 '😎']

In [11]:
t = tokenizer.tokenize(corpus)
print(t)

["I'll", 'recently', 'watched', "o'clock", 'this', "show's", 'called', 'mindhunters', ':)', '.', 'I', 'totally', 'loved', 'it', '😍', '.', 'It', 'was', 'gr8', '<3', '!', '#bingewatching', '#nothingtodo', '😎']
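The idea behind this kind of tokenizer can be sketched as an ordered list of regex alternatives, where the special cases (emoticons, hashtags) are tried before the general ones. The pattern below is a simplified illustration written for this note, not the pattern NLTK actually uses; the names TWEET_TOKEN_RE and simple_tweet_tokenize are hypothetical, and it only knows a handful of emoticons:

```python
import re

# Alternatives are tried left to right, so special tokens win over
# the generic "single symbol" fallback at the end.
TWEET_TOKEN_RE = re.compile(r"""
    <3                        # the heart emoticon
  | [:;=][-']?[)(\]\[DPp]     # a few emoticons such as :) ;-( =D
  | \#\w+                     # hashtags, kept together with the hash sign
  | \w+(?:'\w+)*              # words, keeping inner apostrophes
  | [^\w\s]                   # any other single symbol
""", re.VERBOSE)

def simple_tweet_tokenize(text):
    """Return all tokens matched by the simplified pattern above."""
    return TWEET_TOKEN_RE.findall(text)

tweet_tokens = simple_tweet_tokenize("I'll say it was gr8 <3! :) #bingewatching")
print(tweet_tokens)
```

The order of the alternatives is the design choice that matters here: if the single-symbol fallback came first, "<3" and ":)" would be split apart again.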


## 5) Regular Expression Tokenizer

* The regex tokenizer outputs only the parts of the text that match the given regex pattern. Everything that does not match is discarded.

In [12]:
from nltk.tokenize import regexp_tokenize

text = corpus

pattern = r"#\w+"  # raw string: a '#' followed by one or more word characters

regexp_tokenize(text, pattern)

['#bingewatching', '#nothingtodo']
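For a pattern without capture groups, the result above is the same list that the standard library's re.findall would return, which makes it easy to try patterns out before passing them to NLTK. A minimal sketch under that assumption; the @netflix mention in the sample text and the second pattern are made up for illustration:

```python
import re

# Hypothetical sample text with two hashtags and one mention.
sample = "It was gr8 <3! #bingewatching #nothingtodo @netflix"

# Same pattern as above: only hashtags are kept.
hashtags = re.findall(r"#\w+", sample)
print(hashtags)

# A small variation: keep hashtags and @mentions.
tags_and_mentions = re.findall(r"[#@]\w+", sample)
print(tags_and_mentions)
```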

## 6) Treebank Word Tokenizer


* It breaks the text into tokens, splitting off punctuation marks and symbols such as the # of a hashtag.


* It does not split off a full stop (.) that is attached to a word in the middle of the text.


* It splits off a full stop only when there is a space before it or when it is the last full stop of the text.

In [13]:
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

In [14]:
text = 'My Name is Sameer.live in #Nepal. Hel?lo . jd.'

In [15]:
tokenizer.tokenize(text)

['My',
 'Name',
 'is',
 'Sameer.live',
 'in',
 '#',
 'Nepal.',
 'Hel',
 '?',
 'lo',
 '.',
 'jd',
 '.']

In [16]:
t = tokenizer.tokenize(text)

print(t)

['My', 'Name', 'is', 'Sameer.live', 'in', '#', 'Nepal.', 'Hel', '?', 'lo', '.', 'jd', '.']
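The full stop rule seen in this output can be imitated with a few lines of plain Python. This is a toy illustration of that one rule only, not the Treebank algorithm (it does not split '#', '?' or anything else), and the function name split_final_full_stop is made up for this note:

```python
def split_final_full_stop(text):
    """Whitespace-split, then detach a full stop only from the last token."""
    tokens = text.split()
    if tokens and len(tokens[-1]) > 1 and tokens[-1].endswith("."):
        # Replace the last token by the word and a separate full stop.
        tokens[-1:] = [tokens[-1][:-1], "."]
    return tokens

toy_tokens = split_final_full_stop("My Name is Sameer.live in Nepal. Hel lo . jd.")
print(toy_tokens)
```

As in the real output, "Sameer.live" and "Nepal." keep their full stops, the full stop surrounded by spaces is already a token of its own, and only the final "jd." is split.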
