In [1]:
from nltk.tokenize import word_tokenize, sent_tokenize, regexp_tokenize, TweetTokenizer, MWETokenizer, TreebankWordTokenizer
import nltk.data
import spacy
import gensim
from textblob import TextBlob
from tensorflow.keras.preprocessing.text import text_to_word_sequence

In [2]:
# Ensure required resources are downloaded
nltk.download('punkt')

# Sample text with punctuation, contractions (negation), a possessive, a dash, and a currency amount
text = """John's dog doesn't like playing outside; however, it enjoys running—especially in the morning! Also, $50 isn't too much for a toy, right?"""

[nltk_data] Error loading punkt: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


In [3]:
# a. Word Tokenization (NLTK)
word_tokens = word_tokenize(text)

In [4]:
# b. Sentence Tokenization (NLTK)
sentence_tokens = sent_tokenize(text)

In [5]:
# c. Punctuation-based Tokenizer
punct_tokens = regexp_tokenize(text, r"\w+|[^\w\s]")  # Splits words but keeps punctuation as separate tokens

In [6]:
# d. Treebank Word Tokenizer
treebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = treebank_tokenizer.tokenize(text)

In [7]:
# e. Tweet Tokenizer (handles emojis and hashtags well)
tweet_tokenizer = TweetTokenizer()
tweet_tokens = tweet_tokenizer.tokenize(text)

In [8]:
# f. Multi-Word Expression Tokenizer
# Note: neither phrase below occurs in the sample text, so the result matches word_tokenize(text)
mwe_tokenizer = MWETokenizer([("AI", "drones"), ("disaster", "management")])
mwe_tokens = mwe_tokenizer.tokenize(word_tokenize(text))

In [9]:
# g. TextBlob Word Tokenizer
textblob_tokens = TextBlob(text).words

In [10]:
# h. spaCy Tokenizer
nlp = spacy.load("en_core_web_sm")
spacy_tokens = [token.text for token in nlp(text)]

In [11]:
# i. Gensim word tokenizer
gensim_tokens = list(gensim.utils.tokenize(text, lower=True))

In [12]:
# j. Tokenization with Keras
keras_tokens = text_to_word_sequence(text)

In [13]:
# k. Whitespace tokenizer (keeps apostrophes and attached punctuation inside each token)
whitespace_tokens = text.split()


In [14]:
# Display results
print("Word Tokenization:", word_tokens)
print("Sentence Tokenization:", sentence_tokens)
print("Punctuation-based Tokenizer:", punct_tokens)
print("Treebank Word Tokenizer:", treebank_tokens)
print("Tweet Tokenizer:", tweet_tokens)
print("Multi-Word Expression Tokenizer:", mwe_tokens)
print("TextBlob Tokenizer:", list(textblob_tokens))
print("spaCy Tokenizer:", spacy_tokens)
print("Gensim Tokenizer:", gensim_tokens)
print("Keras Tokenizer:", keras_tokens)
print("Whitespace Tokenizer:", whitespace_tokens)

Word Tokenization: ['John', "'s", 'dog', 'does', "n't", 'like', 'playing', 'outside', ';', 'however', ',', 'it', 'enjoys', 'running—especially', 'in', 'the', 'morning', '!', 'Also', ',', '$', '50', 'is', "n't", 'too', 'much', 'for', 'a', 'toy', ',', 'right', '?']
Sentence Tokenization: ["John's dog doesn't like playing outside; however, it enjoys running—especially in the morning!", "Also, $50 isn't too much for a toy, right?"]
Punctuation-based Tokenizer: ['John', "'", 's', 'dog', 'doesn', "'", 't', 'like', 'playing', 'outside', ';', 'however', ',', 'it', 'enjoys', 'running', '—', 'especially', 'in', 'the', 'morning', '!', 'Also', ',', '$', '50', 'isn', "'", 't', 'too', 'much', 'for', 'a', 'toy', ',', 'right', '?']
Treebank Word Tokenizer: ['John', "'s", 'dog', 'does', "n't", 'like', 'playing', 'outside', ';', 'however', ',', 'it', 'enjoys', 'running—especially', 'in', 'the', 'morning', '!', 'Also', ',', '$', '50', 'is', "n't", 'too', 'much', 'for', 'a', 'toy', ',', 'right', '?']
Twee

**1. Word Tokenization**

🔹 Insight:

Splits text into individual words.

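Splits on whitespace and separates most punctuation; NLTK's word_tokenize also splits contractions Treebank-style (doesn't → does + n't). It can still miss boundaries: in the sample output, the two words joined by a dash remain a single token.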

🔹 Applications:

Sentiment Analysis (Identifying positive/negative words).

Information Retrieval (Extracting keywords).

Word Frequency Analysis (For text mining & corpus analysis).
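
As a rough illustration of the word-frequency application, the sketch below counts tokens with the standard library only. The sample sentence and the simple `\w+` pattern are assumptions made for this example; they stand in for whichever word tokenizer is actually used.

```python
import re
from collections import Counter

# Hypothetical sample sentence, chosen only for illustration
sample = "The dog chased the ball, and the ball rolled away."

# Lowercase first so "The" and "the" are counted together,
# then take every run of word characters as a token
tokens = re.findall(r"\w+", sample.lower())

# Count how often each token occurs and keep the two most frequent
freq = Counter(tokens)
top_two = freq.most_common(2)
print(top_two)
```

Swapping the `re.findall` call for another tokenizer changes which strings are counted, which is why the tokenizer choice matters for frequency analysis.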

**2. Sentence Tokenization**

🔹 Insight:

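Splits text into sentences, typically at terminal punctuation such as ".", "!", and "?".

Boundaries are not always clear: a period after an abbreviation ("Dr.") does not end a sentence, and some languages use different terminal marks (e.g., Chinese uses "。"), so language-specific models are often needed.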

🔹 Applications:

Text Summarization (Splitting long documents into meaningful chunks).

Question-Answering Systems (Breaking text into manageable responses).

Chatbots (Processing sentences independently).
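
To show why sentence boundaries are tricky, here is a deliberately naive splitter that breaks after ".", "!", or "?" followed by whitespace. It is not how NLTK's sent_tokenize works; the helper name `naive_sent_split` and the sample strings are assumptions for this sketch.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Works on simple input
simple = naive_sent_split("It rained! Did you stay in? Yes.")
print(simple)

# Fails on abbreviations: the period in "Dr." is treated as a sentence end
abbrev = naive_sent_split("Dr. Smith arrived. He sat down.")
print(abbrev)
```

The second call splits "Dr." into its own "sentence", which is the kind of error a trained sentence tokenizer is meant to avoid.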

**3. Punctuation-Based Tokenizer**

🔹 Insight:

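Uses the regular expression \w+|[^\w\s] to emit each run of word characters and each punctuation mark as a separate token.

Useful for detailed text analysis, but it breaks contractions and possessives apart (doesn't → doesn + ' + t).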

🔹 Applications:

Processing Code & Logs (Where punctuation has meaning).

Text Cleaning (Removing unnecessary punctuation from text).

Text Compression (Removing redundant punctuation).
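
The same pattern can be tried with the standard library's `re.findall`, which for this pattern should behave like the regexp_tokenize call in the notebook. The short sample string is an assumption chosen to keep the output readable; the comparison with `str.split` shows the opposite extreme.

```python
import re

sample = "John's dog doesn't cost $50!"

# Each run of word characters OR each single non-word, non-space character
punct_split = re.findall(r"\w+|[^\w\s]", sample)
print(punct_split)

# Plain whitespace splitting keeps punctuation glued to the words
space_split = sample.split()
print(space_split)
```

The regex version isolates every mark, including the apostrophes, while whitespace splitting leaves "$50!" as one token.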

**4. Treebank Word Tokenizer**

🔹 Insight:

Uses Penn Treebank rules for splitting text.

Handles contractions (e.g., don’t → do + n't).

🔹 Applications:

Part-of-Speech Tagging (Breaking words properly for linguistic parsing).

Named Entity Recognition (NER) (More accurate entity extraction).

Parsing Text for Syntax Analysis (Used in deep NLP models).

**5. Tweet Tokenizer**

🔹 Insight:

Special tokenizer for social media text (handles hashtags, emojis, mentions).

Avoids breaking URLs and special symbols.

🔹 Applications:

Social Media Sentiment Analysis (Extracting meaningful words from tweets/posts).

Fake News Detection (Analyzing text patterns in social media).

Hashtag and Mention Analysis (Tracking trends).
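
The idea of keeping URLs, hashtags, and mentions intact can be sketched with one ordered regular expression. This is a simplified stand-in, not the pattern TweetTokenizer actually uses; `SOCIAL_PATTERN` and the sample post are assumptions for illustration.

```python
import re

# Alternatives are tried left to right, so the more specific ones come first
SOCIAL_PATTERN = re.compile(
    r"https?://\S+"      # URLs stay whole
    r"|[@#]\w+"          # mentions and hashtags keep their prefix
    r"|\w+(?:'\w+)?"     # words, optionally with an internal apostrophe
    r"|[^\w\s]"          # any other single symbol
)

post = "Loving #NLP with @nltk_org! https://nltk.org"
social_tokens = SOCIAL_PATTERN.findall(post)
print(social_tokens)
```

Placing the URL alternative first matters: otherwise the word and symbol alternatives would break the link apart at ":" and "/".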

**6. Multi-Word Expression Tokenizer**

🔹 Insight:

Recognizes multi-word expressions like New York, machine learning, data science.

NLTK's MWETokenizer merges only the phrases in its predefined list (joined with "_" by default); detecting phrases statistically requires a separate tool.

🔹 Applications:

Information Extraction (Ensuring terms like "artificial intelligence" aren’t split).

Named Entity Recognition (NER) (Handling multi-word entity names).

Machine Translation (Preserving phrase meanings).
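
A lexicon-based merge of multi-word expressions can be sketched in plain Python as follows. The helper `merge_mwes` is hypothetical and only approximates the behavior of a predefined-phrase tokenizer; the sample sentence and phrase list are assumptions.

```python
def merge_mwes(tokens, phrases, separator="_"):
    """Greedily merge known multi-word phrases into single tokens."""
    phrase_set = {tuple(p) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest possible phrase starting at position i first
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in phrase_set:
                merged.append(separator.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No phrase starts here, so keep the single token
            merged.append(tokens[i])
            i += 1
    return merged

words = "I study machine learning in New York".split()
mwe_result = merge_mwes(words, [("machine", "learning"), ("New", "York")])
print(mwe_result)
```

Phrases that are not in the list are left untouched, which is why the phrase list has to match the vocabulary of the text being processed.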

**7. TextBlob Word Tokenizer**

🔹 Insight:

Tokenizes through TextBlob's .words property, which returns word tokens and excludes punctuation.

Simple, accurate, and widely used in beginner-level NLP tasks.

🔹 Applications:

Sentiment Analysis (Often used with TextBlob's built-in sentiment tools).

Spelling Correction (Tokenization is a preprocessing step).

Basic Chatbot Development (Easy and fast processing).

**8. spaCy Tokenizer**

🔹 Insight:

Highly efficient, optimized tokenizer that works well for large-scale NLP.

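Applies language-specific rules and exception lists (prefixes, suffixes, infixes, and special cases such as contractions) without requiring manual configuration.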

🔹 Applications:

Large-Scale NLP (Fast and accurate processing of big text data).

Dependency Parsing (Understanding sentence structure).

Legal/Medical Text Analysis (Extracting meaningful content).

**9. Gensim Word Tokenizer**

🔹 Insight:

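Designed for tokenizing text for topic modeling and document similarity; gensim.utils.tokenize keeps alphabetic tokens only, so digits (such as "50") and punctuation are dropped, and lower=True lowercases the text.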

Works well with word embeddings and vectorization.

🔹 Applications:

Topic Modeling (LDA, Word2Vec, Doc2Vec).

Document Similarity Search (Used in search engines).

Text Clustering (Grouping similar documents).

**10. Tokenization with Keras**

🔹 Insight:

Prepares text for deep learning models in Keras and TensorFlow.

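text_to_word_sequence itself only lowercases the text, strips punctuation, and splits on whitespace; converting words into integer ID sequences for embedding layers is a separate step (for example, with the Keras Tokenizer class).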

🔹 Applications:

Training NLP models (RNNs, LSTMs, Transformers).

Text Classification (Spam detection, emotion recognition).

Chatbot Training (Handling large datasets efficiently).
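
The two steps behind Keras-style preprocessing (splitting text into words, then mapping words to integer IDs) can be sketched without TensorFlow. Everything below is an approximation: `FILTERS` is assumed to match the default filter string of text_to_word_sequence, and the helpers `to_word_sequence`, `build_word_index`, and `to_ids` are hypothetical names, not Keras functions.

```python
from collections import Counter

# Assumed to mirror the Keras default filters (note: the apostrophe is kept)
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def to_word_sequence(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Assign IDs starting at 1, most frequent word first."""
    counts = Counter(w for t in texts for w in to_word_sequence(t))
    return {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def to_ids(text, word_index):
    """Map known words to their IDs, skipping unknown words."""
    return [word_index[w] for w in to_word_sequence(text) if w in word_index]

corpus = ["The cat sat.", "The dog sat down!"]
word_index = build_word_index(corpus)
ids = to_ids("The dog sat", word_index)
print(word_index, ids)
```

ID 0 is left unused here so that it can serve as a padding value, a common convention when sequences are padded to equal length before an embedding layer.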