### Tokenization

Tokenization is one of the first steps in any NLP pipeline. It splits raw text into smaller units called tokens, typically words or sentences. Splitting text into words is called word tokenization, and splitting it into sentences is called sentence tokenization. Word tokenization generally uses whitespace as the delimiter, while sentence tokenization relies on characters such as periods, exclamation marks, and newlines. The appropriate method depends on the task at hand. Depending on the method, characters such as whitespace and punctuation may be dropped and will not appear in the final list of tokens.



### Tokenization Using Python's Built-in split() Method

In [None]:
# Syntax: str.split(sep=None, maxsplit=-1)
text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""
# Split the text on whitespace (word tokenization)
tokens = text.split()
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge,', 'library', 'and', 'purpose', 'of', 'modeling.']
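
As the output above shows, `split()` leaves punctuation attached to neighbouring words (`'data.'`, `'langauge,'`). One possible workaround, sketched here with only the standard library, is to strip leading and trailing punctuation from each whitespace-separated chunk. The helper name `simple_word_tokenize` and the sample sentence are hypothetical and not part of the original notebook.

```python
import string


def simple_word_tokenize(text):
    """Split on whitespace, then strip punctuation from both ends of each chunk."""
    tokens = []
    for chunk in text.split():
        cleaned = chunk.strip(string.punctuation)
        if cleaned:  # skip chunks that consisted of punctuation only
            tokens.append(cleaned)
    return tokens


sample = "We can choose any method based on language, library and purpose of modeling."
word_tokens = simple_word_tokenize(sample)
print(word_tokens)
```

Because `str.strip` only removes characters at the ends of a chunk, internal punctuation such as the apostrophe in "don't" is left untouched.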


In [None]:
# Sentence tokenization
text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""
# maxsplit=0 performs no splits, so the whole text is returned as a single item
text.split(". ", 0)

['Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method.']

In [None]:
# maxsplit=1 splits only at the first ". "; the "!" boundary is not detected at all
text.split(". ", 1)

['Characters like periods, exclamation point and newline char are used to separate the sentences',
 'But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method.']

### Tokenization Using Regular Expressions (RegEx)

In [None]:
import re

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""
# \w+ matches runs of word characters; the raw string avoids an invalid-escape warning
tokens = re.findall(r"\w+", text)
print(tokens)

['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge', 'library', 'and', 'purpose', 'of', 'modeling']


In [None]:
# Sentence Tokenization
text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""
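# The character class [.!?] matches any one of the three terminators followed by a space;
# the matched punctuation is consumed by the split and does not appear in the result
tokens_sent = re.compile(r'[.!?] ').split(text)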
tokens_sent

['Characters like periods, exclamation point and newline char are used to separate the sentences',
 'But one drawback with split() method, that we can only use one separator at a time',
 'So sentence tonenization wont be foolproof with split() method.']
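
The pattern above consumes the sentence-ending punctuation, which is why the first two sentences lose their terminators. One way to keep them, sketched here as an assumption rather than as part of the original notebook, is to split only on the whitespace that follows a terminator by using a lookbehind assertion. The sample text and variable names are illustrative.

```python
import re

text = "Periods end sentences. So do exclamation marks! What about questions? Those too."

# (?<=[.!?]) asserts that the previous character is a terminator without consuming it,
# so only the whitespace after it is removed by the split.
sentence_pattern = re.compile(r"(?<=[.!?])\s+")
sentences = sentence_pattern.split(text)
print(sentences)
```

This is still a heuristic: an abbreviation such as "Dr. Smith" would be split after "Dr.", which is one reason to prefer a dedicated sentence tokenizer for real text.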

### Tokenization Using NLTK

The Natural Language Toolkit (NLTK) is a library written in Python for natural language processing.
NLTK provides the function word_tokenize() for word tokenization and sent_tokenize() for sentence tokenization.

In [None]:
!pip install nltk



Whitespace Tokenizer:

nltk.tokenize.WhitespaceTokenizer: Tokenizes text based on whitespace characters (spaces, tabs, and newlines).

Word Tokenizer:

nltk.tokenize.word_tokenize: Tokenizes text into words using rules based on punctuation and spaces.

Sentence Tokenizer:

nltk.tokenize.sent_tokenize: Tokenizes text into sentences. It uses the Punkt tokenizer, a sentence boundary detector whose model is trained with an unsupervised algorithm.

Regexp Tokenizer:

nltk.tokenize.RegexpTokenizer: Tokenizes text based on a regular expression pattern. You can specify custom patterns to match the tokens.

Treebank Tokenizer:

nltk.tokenize.TreebankWordTokenizer: Tokenizes text using the Penn Treebank conventions, for example splitting contractions and separating most punctuation from words. It is a good choice for working with the Penn Treebank corpus.

MWETokenizer (Multi-Word Expression Tokenizer):

nltk.tokenize.MWETokenizer: Takes text that has already been split into tokens and merges known multi-word expressions, so that certain sequences of words are treated as a single token.
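
The class-based tokenizers listed above can be compared side by side. The following is a minimal sketch, assuming NLTK is installed; none of these four classes needs downloaded data. The sample sentence and the choice of `("New", "York")` as the multi-word expression are illustrative and not taken from the original notebook.

```python
from nltk.tokenize import (
    MWETokenizer,
    RegexpTokenizer,
    TreebankWordTokenizer,
    WhitespaceTokenizer,
)

sample = "Don't split New York into two tokens."

# Splits on spaces, tabs and newlines only; punctuation stays attached to words.
whitespace_tokens = WhitespaceTokenizer().tokenize(sample)

# Keeps only the substrings that match the pattern (runs of word characters here).
regexp_tokens = RegexpTokenizer(r"\w+").tokenize(sample)

# Applies Penn Treebank conventions, e.g. contractions are split into two tokens.
treebank_tokens = TreebankWordTokenizer().tokenize(sample)

# Works on an existing token list and merges the registered expression into one token.
mwe_tokenizer = MWETokenizer([("New", "York")], separator="_")
mwe_tokens = mwe_tokenizer.tokenize(whitespace_tokens)

for name, tokens in [
    ("whitespace", whitespace_tokens),
    ("regexp", regexp_tokens),
    ("treebank", treebank_tokens),
    ("mwe", mwe_tokens),
]:
    print(name, tokens)
```

Note that `MWETokenizer` matches whole tokens, so an expression is only merged when its parts appear as separate, exactly matching items in the input list.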

In [None]:
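import nltk
from nltk.tokenize import word_tokenize

# Download the Punkt data used by word_tokenize() and sent_tokenize().
# Recent NLTK releases may ask for nltk.download('punkt_tab') instead.
nltk.download('punkt')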

text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""
tokens = word_tokenize(text)
print(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['There', 'are', 'multiple', 'ways', 'we', 'can', 'perform', 'tokenization', 'on', 'given', 'text', 'data', '.', 'We', 'can', 'choose', 'any', 'method', 'based', 'on', 'langauge', ',', 'library', 'and', 'purpose', 'of', 'modeling', '.']


In [None]:
from nltk.tokenize import sent_tokenize

text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""
sent_tokenize(text)

['Characters like periods, exclamation point and newline char are used to separate the sentences.',
 'But one drawback with split() method, that we can only use one separator at a time!',
 'So sentence tonenization wont be foolproof with split() method.']

### Tokenization Using spaCy

spaCy is an open-source library for advanced natural language processing, written in Python and Cython.
In spaCy, we create a language object (conventionally named nlp), which is then used for word and sentence tokenization.

In [None]:
!pip install spacy
# spaCy v3 no longer accepts shortcuts such as 'en'; download the model by its full name
!python -m spacy download en_core_web_sm

In [None]:
from spacy.lang.en import English
nlp = English()
text = """There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling."""
my_doc = nlp(text)
print(my_doc)
# The step above has already tokenized the text, but the result is a Doc object,
# so loop over it to build a plain list of token strings
token_list = []
for token in my_doc:
    token_list.append(token.text)

token_list

There are multiple ways we can perform tokenization on given text data. We can choose any method based on langauge, library and purpose of modeling.


['There',
 'are',
 'multiple',
 'ways',
 'we',
 'can',
 'perform',
 'tokenization',
 'on',
 'given',
 'text',
 'data',
 '.',
 'We',
 'can',
 'choose',
 'any',
 'method',
 'based',
 'on',
 'langauge',
 ',',
 'library',
 'and',
 'purpose',
 'of',
 'modeling',
 '.']

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')

# Add the rule-based 'sentencizer' component to the pipeline by name (spaCy v3 API);
# a separate create_pipe() call is not needed
nlp.add_pipe('sentencizer')

text = """Characters like periods, exclamation point and newline char are used to separate the sentences. But one drawback with split() method, that we can only use one separator at a time! So sentence tonenization wont be foolproof with split() method."""

# nlp object is used to create documents with linguistic annotations
doc = nlp(text)

# Create a list of sentence tokens
sentence_list = [sentence.text for sentence in doc.sents]
print(sentence_list)


['Characters like periods, exclamation point and newline char are used to separate the sentences.', 'But one drawback with split() method, that we can only use one separator at a time!', 'So sentence tonenization wont be foolproof with split() method.']
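
Both spaCy steps can also be combined without downloading a trained model. The following is a minimal sketch, assuming spaCy v3 is installed: `spacy.blank("en")` creates a tokenizer-only English pipeline, and the rule-based `sentencizer` component supplies sentence boundaries. The sample text and the names `nlp_blank`, `sentences`, and `words` are illustrative and not part of the original notebook.

```python
import spacy

# A blank English pipeline contains only the tokenizer, so no model download is required.
nlp_blank = spacy.blank("en")

# The sentencizer marks sentence boundaries using punctuation rules.
nlp_blank.add_pipe("sentencizer")

doc = nlp_blank("Hello world. How are you? Fine!")

# Sentence tokenization: each item in doc.sents is a Span.
sentences = [sent.text for sent in doc.sents]

# Word tokenization, dropping punctuation tokens via the is_punct attribute.
words = [token.text for token in doc if not token.is_punct]

print(sentences)
print(words)
```

Filtering on `token.is_punct` is one way to get a punctuation-free token list while still letting spaCy decide where the token boundaries are.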
