### Tokenization

##### Tokenization is the process of splitting a piece of text into smaller units called tokens. A token can be as small as a single character or as large as a whole word. Tokenization matters because it gives a program discrete units that it can count, look up, and analyze, which is the first step toward processing human language.
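To make the idea of token granularity concrete, here is a minimal sketch in plain Python (no NLTK needed). The sample string and variable names are illustrative assumptions, and whitespace splitting is only a rough stand-in for a real word tokenizer.

```python
# Illustrative sample text (not taken from the tutorial corpus).
text = "Hello world"

# Character-level tokens: every character, including the space, is a token.
char_tokens = list(text)

# Word-level tokens: a naive split on whitespace.
word_tokens = text.split()

print(char_tokens)
print(word_tokens)
```

The same text yields 11 character tokens but only 2 word tokens, which is the trade-off between granularity and sequence length.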

#### NLTK Library 

##### NLTK stands for Natural Language Toolkit. Researchers, developers, and data scientists use it to develop NLP applications, analyze text data, support research and teaching in NLP, and prototype and build research systems. NLTK was developed by Steven Bird and Edward Loper at the University of Pennsylvania, and it is written in Python.

In [1]:
%pip install nltk

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
   ---------------------------------------- 0.0/1.5 MB ? eta -:--:--
   ---------------------------------------- 1.5/1.5 MB 26.5 MB/s eta 0:00:00
Installing collected packages: nltk
Successfully installed nltk-3.9.1
Note: you may need to restart the kernel to use updated packages.


In [19]:
corpus = '''HEllO Welcome to Ansh's tokenization tutorial.
This is a demo text to understand how 
tokenization works.'''

In [20]:
print(corpus)

HEllO Welcome to Ansh's tokenization tutorial.
This is a demo text to understand how 
tokenization works.


In [21]:
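# Tokenization
import nltk

# 'punkt' and 'punkt_tab' hold the pretrained Punkt sentence-tokenizer data
# that sent_tokenize and word_tokenize rely on.
nltk.download('punkt')
nltk.download('punkt_tab')
# 'wordnet' and 'omw-1.4' are not needed by the tokenizers in this section.
nltk.download('wordnet')
nltk.download('omw-1.4')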




[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ansh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Ansh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Ansh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Ansh\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [28]:
# 1. Split the paragraph into sentences
from nltk.tokenize import sent_tokenize
documents = sent_tokenize(corpus)
print(documents)

["HEllO Welcome to Ansh's tokenization tutorial.", 'This is a demo text to understand how \ntokenization works.']


In [23]:
type(documents)

list

In [24]:
for sentence in documents:
    print(sentence)

HEllO Welcome to Ansh's tokenization tutorial.
This is a demo text to understand how 
tokenization works.
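Note that the second sentence still contains the line break from the original text (visible as `\n` in the printed list). If that is unwanted, one possible cleanup is to collapse runs of whitespace. This is a sketch under the assumption that line breaks carry no meaning in your text; the `documents` list is copied from the output above so the snippet runs on its own.

```python
# Sentences as returned by sent_tokenize above (copied from the printed output).
documents = [
    "HEllO Welcome to Ansh's tokenization tutorial.",
    'This is a demo text to understand how \ntokenization works.',
]

# str.split() with no arguments splits on any whitespace run (spaces, newlines),
# so joining the pieces with a single space normalizes each sentence.
cleaned = [" ".join(sentence.split()) for sentence in documents]

for sentence in cleaned:
    print(sentence)
```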


In [25]:
# 2. Paragraph to words
#    Sentence to words

from nltk.tokenize import word_tokenize
words = word_tokenize(corpus)
print(words)


['HEllO', 'Welcome', 'to', 'Ansh', "'s", 'tokenization', 'tutorial', '.', 'This', 'is', 'a', 'demo', 'text', 'to', 'understand', 'how', 'tokenization', 'works', '.']


In [26]:
for sentence in documents:
    print(word_tokenize(sentence))

['HEllO', 'Welcome', 'to', 'Ansh', "'s", 'tokenization', 'tutorial', '.']
['This', 'is', 'a', 'demo', 'text', 'to', 'understand', 'how', 'tokenization', 'works', '.']
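To see what `word_tokenize` adds over a plain whitespace split, the sketch below compares the two on the same corpus. The `nltk_words` list is copied from the `word_tokenize` output above rather than recomputed, so the snippet runs without NLTK; the other variable names are illustrative.

```python
corpus = '''HEllO Welcome to Ansh's tokenization tutorial.
This is a demo text to understand how 
tokenization works.'''

# Naive approach: split on whitespace only.
naive_words = corpus.split()

# Output of word_tokenize(corpus), copied from the cell above.
nltk_words = ['HEllO', 'Welcome', 'to', 'Ansh', "'s", 'tokenization', 'tutorial', '.',
              'This', 'is', 'a', 'demo', 'text', 'to', 'understand', 'how',
              'tokenization', 'works', '.']

# Tokens the naive split produces that word_tokenize breaks up further.
glued = [token for token in naive_words if token not in nltk_words]

print(len(naive_words), len(nltk_words))
print(glued)
```

The whitespace split leaves punctuation and the possessive attached to the neighbouring word (`Ansh's`, `tutorial.`, `works.`), while `word_tokenize` separates them into their own tokens.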


In [29]:
# To split every punctuation mark (such as the apostrophe) into its own token, use wordpunct_tokenize:
from nltk.tokenize import wordpunct_tokenize
wordP_document = wordpunct_tokenize(corpus)
print(wordP_document)

# By contrast, word_tokenize is based on the Treebank word tokenizer, which keeps "'s" together as one token.


['HEllO', 'Welcome', 'to', 'Ansh', "'", 's', 'tokenization', 'tutorial', '.', 'This', 'is', 'a', 'demo', 'text', 'to', 'understand', 'how', 'tokenization', 'works', '.']
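The behaviour of `wordpunct_tokenize` can be approximated with a single regular expression: match either a run of word characters or a run of characters that are neither word characters nor whitespace. The sketch below uses only the standard `re` module and is meant as an illustration of the idea, not a replacement for the NLTK function.

```python
import re

corpus = '''HEllO Welcome to Ansh's tokenization tutorial.
This is a demo text to understand how 
tokenization works.'''

# \w+      : one or more letters, digits, or underscores
# [^\w\s]+ : one or more characters that are neither word characters nor whitespace
pattern = r"\w+|[^\w\s]+"

regex_tokens = re.findall(pattern, corpus)
print(regex_tokens)
```

On this corpus the regex yields the same 20 tokens as the output above, including the apostrophe and the `s` as separate tokens.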
