# Tokenization

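`TextTokenizer` is a class that encapsulates tokenization logic. The tokenization method (**nltk**, **spacy**, or **regex**) is selected at initialization. The cells in this section implement the same three approaches as standalone functions (`nltk_tokenize`, `spacy_tokenize`, `regex_tokenize`).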

**Methods**:
1. `tokenize_nltk`: Uses NLTK's `word_tokenize`.
2. `tokenize_spacy`: Uses spaCy's tokenizer.
3. `tokenize_regex`: Uses a regex pattern to extract word tokens.
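
The class itself is not shown in this section, so the following is only a minimal sketch of how such a class could look, not the author's implementation. The `tokenize` dispatcher, the `"regex"` default, and the lazy loading of NLTK and spaCy are assumptions made for illustration; the three method names follow the list above.

```python
import re


class TextTokenizer:
    """Hypothetical sketch: choose a tokenization backend at initialization."""

    def __init__(self, method: str = "regex"):  # default value is an assumption
        methods = {
            "nltk": self.tokenize_nltk,
            "spacy": self.tokenize_spacy,
            "regex": self.tokenize_regex,
        }
        if method not in methods:
            raise ValueError(
                f"Unknown method {method!r}; expected one of {sorted(methods)}"
            )
        self.method = method
        self._tokenize = methods[method]
        self._nlp = None  # spaCy pipeline, loaded on first use

    def tokenize(self, text: str) -> list:
        """Dispatch to the backend selected in __init__ (assumed helper)."""
        return self._tokenize(text)

    def tokenize_nltk(self, text: str) -> list:
        # Imported lazily so the class works without NLTK installed
        from nltk.tokenize import word_tokenize
        return word_tokenize(text)

    def tokenize_spacy(self, text: str) -> list:
        # Load the pipeline once and reuse it on later calls
        if self._nlp is None:
            import spacy
            self._nlp = spacy.load("en_core_web_sm")
        return [token.text for token in self._nlp(text)]

    def tokenize_regex(self, text: str) -> list:
        # Runs of word characters only; punctuation is dropped
        return re.findall(r"\b\w+\b", text)


tokenizer = TextTokenizer("regex")
print(tokenizer.tokenize("Hello, world!"))  # ['Hello', 'world']
```

Selecting the backend once in `__init__` keeps the calling code identical whichever library is used, and an unsupported name fails early with a `ValueError`.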


In [1]:
# Import necessary libraries
import re
import string
import nltk
import spacy
from nltk.tokenize import word_tokenize

In [2]:
# Download NLTK resources if not already downloaded
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

**Sentence tokenization using NLTK**

In [3]:
# Adjacent string literals are concatenated, which avoids the runs of
# indentation spaces that backslash continuations embed in the string
text = (
    "Natural Language Processing (NLP) bridges human language and machines. "
    "It enables computers to understand text and speech. "
    "NLP helps in tasks like sentiment analysis and machine translation. "
    "Chatbots and virtual assistants rely heavily on NLP. "
    "This technology powers text summarization tools. "
    "It also detects spam emails effectively. "
    "NLP is improving rapidly with AI advancements. "
    "Its applications are transforming industries worldwide."
)

sent = nltk.sent_tokenize(text)
print("NLTK Tokenization:", sent)

print("Number of sentences:", len(sent))


NLTK Tokenization: ['Natural Language Processing (NLP) bridges human language and machines.', 'It enables computers to understand text and speech.', 'NLP helps in tasks like sentiment analysis and machine translation.', 'Chatbots and virtual assistants rely heavily on NLP.', 'This technology powers text summarization tools.', 'It also detects spam emails effectively.', 'NLP is improving rapidly with AI advancements.', 'Its applications are transforming industries worldwide.']
Number of sentences: 8


**Word tokenization using NLTK**

In [5]:
print("Word Tokenization:", nltk.word_tokenize(text))


Word Tokenization: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'bridges', 'human', 'language', 'and', 'machines', '.', 'It', 'enables', 'computers', 'to', 'understand', 'text', 'and', 'speech', '.', 'NLP', 'helps', 'in', 'tasks', 'like', 'sentiment', 'analysis', 'and', 'machine', 'translation', '.', 'Chatbots', 'and', 'virtual', 'assistants', 'rely', 'heavily', 'on', 'NLP', '.', 'This', 'technology', 'powers', 'text', 'summarization', 'tools', '.', 'It', 'also', 'detects', 'spam', 'emails', 'effectively', '.', 'NLP', 'is', 'improving', 'rapidly', 'with', 'AI', 'advancements', '.', 'Its', 'applications', 'are', 'transforming', 'industries', 'worldwide', '.']


**Tokenization using NLTK, spaCy, and Regex**

In [8]:
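# Load the spaCy pipeline once; reloading it on every call is slow
nlp = spacy.load("en_core_web_sm")

# Define tokenization functions
def nltk_tokenize(text: str) -> list[str]:
    """Tokenize text using NLTK's word_tokenize."""
    return word_tokenize(text)

def spacy_tokenize(text: str) -> list[str]:
    """Tokenize text using the preloaded spaCy pipeline."""
    return [token.text for token in nlp(text)]

def regex_tokenize(text: str) -> list[str]:
    """Tokenize text using regex; keeps word characters, drops punctuation."""
    return re.findall(r'\b\w+\b', text)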


In [9]:
# Example usage
if __name__ == "__main__":
    # Sample text
    text = "Hello, world! Tokenization is an essential step in NLP."

    print("NLTK Tokenization:", nltk_tokenize(text))
    print("spaCy Tokenization:", spacy_tokenize(text))
    print("Regex Tokenization:", regex_tokenize(text))


NLTK Tokenization: ['Hello', ',', 'world', '!', 'Tokenization', 'is', 'an', 'essential', 'step', 'in', 'NLP', '.']
spaCy Tokenization: ['Hello', ',', 'world', '!', 'Tokenization', 'is', 'an', 'essential', 'step', 'in', 'NLP', '.']
Regex Tokenization: ['Hello', 'world', 'Tokenization', 'is', 'an', 'essential', 'step', 'in', 'NLP']
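
The regex output above has no punctuation tokens because `\b\w+\b` matches only runs of word characters. If punctuation should be kept as separate tokens, one possible variant is sketched below. The function name `regex_tokenize_keep_punct` and the pattern are illustrative assumptions, not part of the original notebook.

```python
import re


def regex_tokenize_keep_punct(text: str) -> list:
    """Hypothetical variant: words, plus each punctuation mark as its own token."""
    # \w+      matches a run of word characters
    # [^\w\s]  matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


text = "Hello, world! Tokenization is an essential step in NLP."
print("Regex (with punctuation):", regex_tokenize_keep_punct(text))
```

For this sample sentence the result matches the NLTK and spaCy token lists shown above. That agreement should not be expected in general: a simple pattern like this has no special handling for contractions, abbreviations, or hyphenated words, which the library tokenizers treat with their own rules.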
