# Tokenization

* Tokenization is the process of splitting a text into individual units called tokens. These tokens are typically words, but they can also be subwords, sentences (or phrases), or individual characters. Tokenization is an important NLP task because it makes the structure and meaning of a text easier for a computer to analyse.
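
To make the different token granularities concrete, here is a minimal sketch in plain Python. The sample string and variable names are illustrative assumptions, and the fixed three-character chunks are only a crude stand-in for subwords; real subword tokenizers learn their splits from data.

```python
# Illustrative sample text (an assumption, not the corpus used in this notebook)
sample = "unbelievable results"

# Word-level tokens: split on whitespace
word_tokens = sample.split()

# Character-level tokens: every character of one word
char_tokens = list(word_tokens[1])

# Crude stand-in for subword tokens: fixed chunks of three characters
first_word = word_tokens[0]
chunk_tokens = [first_word[i:i + 3] for i in range(0, len(first_word), 3)]

print(word_tokens)
print(char_tokens)
print(chunk_tokens)
```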


In [1]:
corpus = """I'll recently watched o'clock this show's called mindhunters:). 
I totally loved it 😍. It was gr8 <3! #bingewatching #nothingtodo 😎"""
print(corpus)

I'll recently watched o'clock this show's called mindhunters:). 
I totally loved it 😍. It was gr8 <3! #bingewatching #nothingtodo 😎


### Tokenising on whitespace using Python

In [2]:
print(corpus.split())

["I'll", 'recently', 'watched', "o'clock", 'this', "show's", 'called', 'mindhunters:).', 'I', 'totally', 'loved', 'it', '😍.', 'It', 'was', 'gr8', '<3!', '#bingewatching', '#nothingtodo', '😎']


# Types of Tokenization in NLP

## 1) Word Tokenizer


* It splits the text into individual words, called tokens. It also separates punctuation marks such as . , ! and ?, and other characters such as # (hashtags) and emojis, into their own tokens.

In [3]:
# nltk stands for Natural Language Toolkit

from nltk.tokenize import word_tokenize

word_tokenize(corpus)

['I',
 "'ll",
 'recently',
 'watched',
 "o'clock",
 'this',
 'show',
 "'s",
 'called',
 'mindhunters',
 ':',
 ')',
 '.',
 'I',
 'totally',
 'loved',
 'it',
 '😍',
 '.',
 'It',
 'was',
 'gr8',
 '<',
 '3',
 '!',
 '#',
 'bingewatching',
 '#',
 'nothingtodo',
 '😎']

In [4]:
# NLTK's model names are lowercase; 'english' is also the default
word = word_tokenize(corpus.lower(), language='english')

print(word)

['i', "'ll", 'recently', 'watched', "o'clock", 'this', 'show', "'s", 'called', 'mindhunters', ':', ')', '.', 'i', 'totally', 'loved', 'it', '😍', '.', 'it', 'was', 'gr8', '<', '3', '!', '#', 'bingewatching', '#', 'nothingtodo', '😎']


In [5]:
word = word_tokenize(corpus.upper())

print(word)

['I', "'LL", 'RECENTLY', 'WATCHED', "O'CLOCK", 'THIS', 'SHOW', "'S", 'CALLED', 'MINDHUNTERS', ':', ')', '.', 'I', 'TOTALLY', 'LOVED', 'IT', '😍', '.', 'IT', 'WAS', 'GR8', '<', '3', '!', '#', 'BINGEWATCHING', '#', 'NOTHINGTODO', '😎']


### Note

* NLTK's word tokenizer splits on whitespace and also splits contractions and possessives: "I'll" becomes "I" and "'ll", and "show's" becomes "show" and "'s". On the other hand, it does not split "o'clock" and keeps it as a single token.
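
The contraction handling described in this note can be imitated with a small regular expression. This is only a sketch under stated assumptions: the clitic list and the helper name `split_clitics` are hypothetical, and NLTK's own rule set is more extensive. It shows why "I'll" and "show's" are split while "o'clock" is not: only known endings are separated.

```python
import re

# Hypothetical, simplified list of English clitics (not NLTK's actual rules)
CLITIC_PATTERN = re.compile(r"(\w)('ll|'s|'re|'ve|'d|'m)\b")

def split_clitics(text):
    """Insert a space before a known clitic, then split on whitespace."""
    return CLITIC_PATTERN.sub(r"\1 \2", text).split()

sample = "I'll watch the show's finale at nine o'clock"
clitic_tokens = split_clitics(sample)
print(clitic_tokens)
```

Because "'clock" is not in the clitic list, "o'clock" passes through unchanged.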

## 2) Word Punctuation Tokenizer

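* It splits the text into runs of alphanumeric characters and runs of punctuation, so every punctuation mark is separated from the words around it. Consecutive punctuation characters such as ":)." stay together as one token.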

In [6]:
from nltk.tokenize import wordpunct_tokenize

wordpunct_tokenize(corpus)

['I',
 "'",
 'll',
 'recently',
 'watched',
 'o',
 "'",
 'clock',
 'this',
 'show',
 "'",
 's',
 'called',
 'mindhunters',
 ':).',
 'I',
 'totally',
 'loved',
 'it',
 '😍.',
 'It',
 'was',
 'gr8',
 '<',
 '3',
 '!',
 '#',
 'bingewatching',
 '#',
 'nothingtodo',
 '😎']

In [7]:
word_punctuation = wordpunct_tokenize(corpus)

print(word_punctuation)

['I', "'", 'll', 'recently', 'watched', 'o', "'", 'clock', 'this', 'show', "'", 's', 'called', 'mindhunters', ':).', 'I', 'totally', 'loved', 'it', '😍.', 'It', 'was', 'gr8', '<', '3', '!', '#', 'bingewatching', '#', 'nothingtodo', '😎']
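
The output above can be approximated with a single regular expression from the standard library. The pattern below (runs of word characters, or runs of characters that are neither word characters nor whitespace) is the one the NLTK documentation gives for this tokenizer; treat the equivalence as an assumption and verify it against your NLTK version if it matters.

```python
import re

corpus = """I'll recently watched o'clock this show's called mindhunters:). 
I totally loved it 😍. It was gr8 <3! #bingewatching #nothingtodo 😎"""

# Either a run of word characters or a run of non-word, non-space characters
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")

regex_tokens = WORD_OR_PUNCT.findall(corpus)
print(regex_tokens)
```

Emojis are not word characters, so "😍." is captured as one punctuation run.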


## 3) Sentence Tokenizer


* It splits the text into sentences. Sentence boundaries are detected at sentence-ending punctuation such as . (full stop) and ! (exclamation mark).

In [8]:
from nltk.tokenize import sent_tokenize

sent_tokenize(corpus)

["I'll recently watched o'clock this show's called mindhunters:).",
 'I totally loved it 😍.',
 'It was gr8 <3!',
 '#bingewatching #nothingtodo 😎']

In [9]:
sentence = sent_tokenize(corpus)

print(sentence,'\n')

# Print the number of sentences generated by sent_tokenize
print(len(sentence))

["I'll recently watched o'clock this show's called mindhunters:).", 'I totally loved it 😍.', 'It was gr8 <3!', '#bingewatching #nothingtodo 😎'] 

4
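
For contrast, here is a naive sentence splitter built on the standard library only. It is a sketch, not what `sent_tokenize` does internally: the helper name `naive_sent_split` is hypothetical, and the rule (split after ., ! or ? when followed by whitespace) is deliberately simplistic. It gives the same four sentences on this corpus but breaks on abbreviations, which `sent_tokenize` is designed to handle better.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

corpus = """I'll recently watched o'clock this show's called mindhunters:). 
I totally loved it 😍. It was gr8 <3! #bingewatching #nothingtodo 😎"""

naive_sentences = naive_sent_split(corpus)
print(len(naive_sentences))

# The naive rule wrongly treats the abbreviation "Dr." as a sentence end
abbreviation_case = naive_sent_split("Dr. Smith arrived. He sat down.")
print(abbreviation_case)
```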


## 4) Tweet Tokenizer 

* The word tokenizer and the word punctuation tokenizer break up text emoticons: "<3" becomes '<' and '3', and ":)" becomes ':' and ')' (or is fused with the following full stop), which is something that we don't want.


* Emojis have their own significance in areas like sentiment analysis, where a happy or sad face alone can be a really good predictor of the sentiment.


* Similarly, hashtags are broken into two tokens ('#' and the tag text). A hashtag is used for searching specific topics or photos on social media apps such as Instagram and Facebook.


* The tweet tokenizer breaks text into individual tokens but keeps text emoticons, hashtags (#) and words containing an apostrophe (') intact.

In [10]:
from nltk.tokenize import TweetTokenizer

TweetTokenizer().tokenize(corpus)

["I'll",
 'recently',
 'watched',
 "o'clock",
 'this',
 "show's",
 'called',
 'mindhunters',
 ':)',
 '.',
 'I',
 'totally',
 'loved',
 'it',
 '😍',
 '.',
 'It',
 'was',
 'gr8',
 '<3',
 '!',
 '#bingewatching',
 '#nothingtodo',
 '😎']

In [11]:
tokenizer = TweetTokenizer()
tokenizer.tokenize(corpus)

["I'll",
 'recently',
 'watched',
 "o'clock",
 'this',
 "show's",
 'called',
 'mindhunters',
 ':)',
 '.',
 'I',
 'totally',
 'loved',
 'it',
 '😍',
 '.',
 'It',
 'was',
 'gr8',
 '<3',
 '!',
 '#bingewatching',
 '#nothingtodo',
 '😎']

In [12]:
t = tokenizer.tokenize(corpus)
print(t)

["I'll", 'recently', 'watched', "o'clock", 'this', "show's", 'called', 'mindhunters', ':)', '.', 'I', 'totally', 'loved', 'it', '😍', '.', 'It', 'was', 'gr8', '<3', '!', '#bingewatching', '#nothingtodo', '😎']
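
The idea behind this tokenizer can be sketched with one ordered regular expression: try hashtags and emoticons first, then words, then single punctuation characters. The pattern below is a hand-written assumption with only three emoticons, not the much larger pattern that `TweetTokenizer` uses, but it reproduces the output above on this corpus.

```python
import re

corpus = """I'll recently watched o'clock this show's called mindhunters:). 
I totally loved it 😍. It was gr8 <3! #bingewatching #nothingtodo 😎"""

# Order matters: hashtags and emoticons are tried before words and punctuation
TWEET_LIKE = re.compile(
    r"""
      \#\w+             # hashtags, kept together with their leading '#'
    | :\) | :\( | <3    # a tiny, hand-picked set of text emoticons
    | \w+(?:'\w+)*      # words, keeping internal apostrophes
    | [^\w\s]           # any other single non-space character
    """,
    re.VERBOSE,
)

tweet_like_tokens = TWEET_LIKE.findall(corpus)
print(tweet_like_tokens)
```

Moving the emoticon alternatives after the punctuation alternative would split ":)" again, which is why the order of the alternatives is part of the design.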


## 5) Regular Expression Tokenizer

* The regex tokenizer returns only the parts of the text that match the given regex pattern.

In [13]:
from nltk.tokenize import regexp_tokenize

text = corpus

# Raw string, so that \w reaches the regex engine instead of being read as a string escape
pattern = r"#\w+"

regexp_tokenize(text, pattern)

['#bingewatching', '#nothingtodo']
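
For a pattern without capturing groups, the same matches can be obtained with `re.findall` from the standard library, which is handy when NLTK is not available. The sketch below also adds a capturing group to drop the leading #; the variable names are illustrative.

```python
import re

corpus = """I'll recently watched o'clock this show's called mindhunters:). 
I totally loved it 😍. It was gr8 <3! #bingewatching #nothingtodo 😎"""

# Full hashtags, including the leading '#'
hashtags = re.findall(r"#\w+", corpus)

# With a capturing group, findall returns only the captured tag text
tag_names = re.findall(r"#(\w+)", corpus)

print(hashtags)
print(tag_names)
```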

## 6) Treebank Word Tokenizer


* It breaks text into tokens and splits off most punctuation marks, including the # of a hashtag and the characters of text emoticons.


* It does not split off a full stop (.) that is attached to a word in the middle of the text.


* It splits off a full stop when there is a space before it, or when it is the last full stop of the text.

In [14]:
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

In [15]:
text = 'My Name is Sameer.live in #Nepal. Hel?lo . jd.'

In [16]:
tokenizer.tokenize(text)

['My',
 'Name',
 'is',
 'Sameer.live',
 'in',
 '#',
 'Nepal.',
 'Hel',
 '?',
 'lo',
 '.',
 'jd',
 '.']

In [20]:
tokenizer.tokenize(corpus)

['I',
 "'ll",
 'recently',
 'watched',
 "o'clock",
 'this',
 'show',
 "'s",
 'called',
 'mindhunters',
 ':',
 ')',
 '.',
 'I',
 'totally',
 'loved',
 'it',
 '😍.',
 'It',
 'was',
 'gr8',
 '<',
 '3',
 '!',
 '#',
 'bingewatching',
 '#',
 'nothingtodo',
 '😎']

In [17]:
t = tokenizer.tokenize(text)

print(t)

['My', 'Name', 'is', 'Sameer.live', 'in', '#', 'Nepal.', 'Hel', '?', 'lo', '.', 'jd', '.']


# Stopwords


* Stop words are common words that are often removed from a text before analysis because they don't add much value in terms of meaning. Examples of stop words include "the", "and", "a", and "an".


* Removing stop words can help to improve the efficiency and accuracy of NLP tasks, such as sentiment analysis or topic modeling, because it reduces noise and helps to focus on the more important words in the text.


* However, the set of stopwords used can vary depending on the task and language being analyzed.

In [18]:
import nltk
from nltk.corpus import stopwords

# Extract all stopwords of english
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [19]:
# Use NLTK's built-in stop words corpus to remove stop words from a piece of text

# Load stop words
stop_words = set(stopwords.words('english'))

# Text to be processed
text = "This is a sample sentence to demonstrate stop word removal."

# Tokenize the text
tokens = nltk.word_tokenize(text)

# Remove stop words
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Print the filtered tokens
print(filtered_tokens)

# Print the length of filtered_tokens
print(len(filtered_tokens))

['sample', 'sentence', 'demonstrate', 'stop', 'word', 'removal', '.']
7
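
Because the right stop word set depends on the task, it can help to adjust the list before filtering. The sketch below uses a tiny hand-written set (an assumption for illustration, not NLTK's list) and plain whitespace splitting to show how treating "not" as a stop word changes a sentence meant for sentiment analysis. The helper name `remove_stop_words` is hypothetical.

```python
# Tiny illustrative stop word set (hypothetical; NLTK's English list is much longer)
BASE_STOP_WORDS = {"this", "is", "a", "the", "was", "not"}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "This movie was not good".split()

# With "not" treated as a stop word, the negation disappears
default_filtered = remove_stop_words(tokens, BASE_STOP_WORDS)

# For sentiment analysis, keep "not" by removing it from the stop word set
sentiment_stop_words = BASE_STOP_WORDS - {"not"}
sentiment_filtered = remove_stop_words(tokens, sentiment_stop_words)

print(default_filtered)
print(sentiment_filtered)
```

NLTK's English list printed earlier also contains 'not', so the same set difference can be applied to `set(stopwords.words('english'))`.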
