In [2]:
paragraph = "Football, known as the beautiful game, unites fans across the globe in passion and excitement. From the roar of the crowd to the precision of a well-executed pass, every moment on the pitch holds its own story. In the heat of competition, players strive for glory, chasing dreams of victory and championship titles. The drama unfolds with each match, as teams battle for supremacy, fueled by skill, determination, and sheer grit. Whether it's the World Cup final or a local derby, the spirit of football transcends boundaries, creating memories that last a lifetime."

### a. Word Tokenization
##### NLTK's word_tokenize function splits the paragraph into a list of word and punctuation tokens, and also separates contractions ("it's" becomes 'it' and "'s"). It is a common first step in text analysis, for example before sentiment analysis or part-of-speech (POS) tagging.

In [3]:
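import nltk
# Fetch the Punkt tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')
# Split the paragraph into word and punctuation tokens
nltk_tokens = nltk.word_tokenize(paragraph)
print(nltk_tokens)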


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Football', ',', 'known', 'as', 'the', 'beautiful', 'game', ',', 'unites', 'fans', 'across', 'the', 'globe', 'in', 'passion', 'and', 'excitement', '.', 'From', 'the', 'roar', 'of', 'the', 'crowd', 'to', 'the', 'precision', 'of', 'a', 'well-executed', 'pass', ',', 'every', 'moment', 'on', 'the', 'pitch', 'holds', 'its', 'own', 'story', '.', 'In', 'the', 'heat', 'of', 'competition', ',', 'players', 'strive', 'for', 'glory', ',', 'chasing', 'dreams', 'of', 'victory', 'and', 'championship', 'titles', '.', 'The', 'drama', 'unfolds', 'with', 'each', 'match', ',', 'as', 'teams', 'battle', 'for', 'supremacy', ',', 'fueled', 'by', 'skill', ',', 'determination', ',', 'and', 'sheer', 'grit', '.', 'Whether', 'it', "'s", 'the', 'World', 'Cup', 'final', 'or', 'a', 'local', 'derby', ',', 'the', 'spirit', 'of', 'football', 'transcends', 'boundaries', ',', 'creating', 'memories', 'that', 'last', 'a', 'lifetime', '.']
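To see why a punctuation-aware tokenizer matters, the following minimal sketch splits a sentence on whitespace only, using nothing but the standard library. The `sample` string is a shortened, illustrative version of the paragraph, not part of the original notebook.

```python
# Whitespace splitting leaves punctuation glued to the neighbouring word.
sample = "Football, known as the beautiful game, unites fans."
whitespace_tokens = sample.split()
print(whitespace_tokens)
# ['Football,', 'known', 'as', 'the', 'beautiful', 'game,', 'unites', 'fans.']

# Tokens that still carry trailing punctuation
attached = [tok for tok in whitespace_tokens if not tok[-1].isalnum()]
print(attached)
# ['Football,', 'game,', 'fans.']
```

With word_tokenize, 'Football' and ',' come out as separate tokens, so a later lookup or count for 'Football' is not thrown off by the comma.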


### b. Sentence Tokenization
##### NLTK's sent_tokenize function divides the paragraph into a list of sentences. This helps in machine translation and sentiment analysis, where the context of each sentence matters.




In [4]:
sent_tokens = nltk.sent_tokenize(paragraph)
print(sent_tokens)

['Football, known as the beautiful game, unites fans across the globe in passion and excitement.', 'From the roar of the crowd to the precision of a well-executed pass, every moment on the pitch holds its own story.', 'In the heat of competition, players strive for glory, chasing dreams of victory and championship titles.', 'The drama unfolds with each match, as teams battle for supremacy, fueled by skill, determination, and sheer grit.', "Whether it's the World Cup final or a local derby, the spirit of football transcends boundaries, creating memories that last a lifetime."]
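For comparison, here is a rough standard-library sketch of sentence splitting with a single regular expression. The helper name `naive_sent_split` is hypothetical, and this is not how sent_tokenize works internally; it only illustrates why a plain rule breaks down on abbreviations, which the Punkt model behind sent_tokenize is designed to handle.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

print(naive_sent_split("The drama unfolds. Teams battle for supremacy!"))
# ['The drama unfolds.', 'Teams battle for supremacy!']

# An abbreviation is wrongly treated as a sentence end
print(naive_sent_split("Dr. Smith watched the final. It was tense."))
# ['Dr.', 'Smith watched the final.', 'It was tense.']
```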


### c. Punctuation-based Tokenizer
##### This regular expression matches either a run of word characters or a single punctuation mark from the set . , ; ! ?, so words and punctuation become separate tokens. Hyphens and apostrophes are not matched, which is why 'well-executed' becomes 'well', 'executed' and "it's" becomes 'it', 's'. Useful in text cleaning, where you want to separate punctuation from words, or for tasks that focus on specific patterns around punctuation.

In [5]:
import re
punct_tokens = re.findall(r'\b\w+\b|[.,;!?]', paragraph)
print(punct_tokens)

['Football', ',', 'known', 'as', 'the', 'beautiful', 'game', ',', 'unites', 'fans', 'across', 'the', 'globe', 'in', 'passion', 'and', 'excitement', '.', 'From', 'the', 'roar', 'of', 'the', 'crowd', 'to', 'the', 'precision', 'of', 'a', 'well', 'executed', 'pass', ',', 'every', 'moment', 'on', 'the', 'pitch', 'holds', 'its', 'own', 'story', '.', 'In', 'the', 'heat', 'of', 'competition', ',', 'players', 'strive', 'for', 'glory', ',', 'chasing', 'dreams', 'of', 'victory', 'and', 'championship', 'titles', '.', 'The', 'drama', 'unfolds', 'with', 'each', 'match', ',', 'as', 'teams', 'battle', 'for', 'supremacy', ',', 'fueled', 'by', 'skill', ',', 'determination', ',', 'and', 'sheer', 'grit', '.', 'Whether', 'it', 's', 'the', 'World', 'Cup', 'final', 'or', 'a', 'local', 'derby', ',', 'the', 'spirit', 'of', 'football', 'transcends', 'boundaries', ',', 'creating', 'memories', 'that', 'last', 'a', 'lifetime', '.']
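If hyphenated words and contractions should stay intact, the pattern can be extended. The sketch below is one possible variant, not the notebook's original expression; the name `PATTERN` and the sample sentence are illustrative.

```python
import re

# A word may continue across an internal hyphen or apostrophe;
# the listed punctuation marks are still emitted as separate tokens.
PATTERN = re.compile(r"\w+(?:[-']\w+)*|[.,;!?]")

sample = "Whether it's a well-executed pass, every moment counts."
tokens = PATTERN.findall(sample)
print(tokens)
# ['Whether', "it's", 'a', 'well-executed', 'pass', ',', 'every', 'moment', 'counts', '.']
```

The group is non-capturing, `(?:...)`, so `findall` returns whole matches rather than only the last hyphenated part.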


### d. Treebank Word Tokenizer
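##### NLTK's TreebankWordTokenizer follows the Penn Treebank conventions: it splits contractions ("it's" becomes 'it' and "'s") and keeps hyphenated words such as 'well-executed' together. It assumes the input is a single sentence, so only the final period is split off and sentence-internal periods stay attached ('excitement.', 'story.'). Suitable for tasks where handling contractions and hyphenated words is important, such as linguistic analysis.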

In [6]:
from nltk.tokenize import TreebankWordTokenizer
treebank_tokenizer = TreebankWordTokenizer()
treebank_tokens = treebank_tokenizer.tokenize(paragraph)
print(treebank_tokens)

['Football', ',', 'known', 'as', 'the', 'beautiful', 'game', ',', 'unites', 'fans', 'across', 'the', 'globe', 'in', 'passion', 'and', 'excitement.', 'From', 'the', 'roar', 'of', 'the', 'crowd', 'to', 'the', 'precision', 'of', 'a', 'well-executed', 'pass', ',', 'every', 'moment', 'on', 'the', 'pitch', 'holds', 'its', 'own', 'story.', 'In', 'the', 'heat', 'of', 'competition', ',', 'players', 'strive', 'for', 'glory', ',', 'chasing', 'dreams', 'of', 'victory', 'and', 'championship', 'titles.', 'The', 'drama', 'unfolds', 'with', 'each', 'match', ',', 'as', 'teams', 'battle', 'for', 'supremacy', ',', 'fueled', 'by', 'skill', ',', 'determination', ',', 'and', 'sheer', 'grit.', 'Whether', 'it', "'s", 'the', 'World', 'Cup', 'final', 'or', 'a', 'local', 'derby', ',', 'the', 'spirit', 'of', 'football', 'transcends', 'boundaries', ',', 'creating', 'memories', 'that', 'last', 'a', 'lifetime', '.']
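The attached periods in the output above ('excitement.', 'story.') appear because the whole paragraph was passed in a single call. A minimal sketch of the usual remedy is to tokenize one sentence at a time. The two short sentences here are hand-written for illustration; in the notebook they would come from sent_tokenize.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Illustrative, pre-split sentences (assumed input, not from the original cell)
sentences = [
    "Fans unite in passion and excitement.",
    "Every moment holds its own story.",
]

# Tokenize each sentence separately, then flatten into one token list
per_sentence_tokens = [tok for sent in sentences for tok in tokenizer.tokenize(sent)]
print(per_sentence_tokens)
```

Each sentence now ends in its own '.' token instead of carrying the period on its last word.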


### e. Tweet Tokenizer
##### NLTK's TweetTokenizer is designed for tweets and keeps hashtags, mentions, and emoticons as single tokens. Unlike word_tokenize, it leaves contractions such as "it's" intact. Ideal for sentiment analysis, topic modeling, and other NLP tasks involving Twitter data.

In [7]:
from nltk.tokenize import TweetTokenizer
tweet_tokenizer = TweetTokenizer()
tweet_tokens = tweet_tokenizer.tokenize(paragraph)
print(tweet_tokens)


['Football', ',', 'known', 'as', 'the', 'beautiful', 'game', ',', 'unites', 'fans', 'across', 'the', 'globe', 'in', 'passion', 'and', 'excitement', '.', 'From', 'the', 'roar', 'of', 'the', 'crowd', 'to', 'the', 'precision', 'of', 'a', 'well-executed', 'pass', ',', 'every', 'moment', 'on', 'the', 'pitch', 'holds', 'its', 'own', 'story', '.', 'In', 'the', 'heat', 'of', 'competition', ',', 'players', 'strive', 'for', 'glory', ',', 'chasing', 'dreams', 'of', 'victory', 'and', 'championship', 'titles', '.', 'The', 'drama', 'unfolds', 'with', 'each', 'match', ',', 'as', 'teams', 'battle', 'for', 'supremacy', ',', 'fueled', 'by', 'skill', ',', 'determination', ',', 'and', 'sheer', 'grit', '.', 'Whether', "it's", 'the', 'World', 'Cup', 'final', 'or', 'a', 'local', 'derby', ',', 'the', 'spirit', 'of', 'football', 'transcends', 'boundaries', ',', 'creating', 'memories', 'that', 'last', 'a', 'lifetime', '.']
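The football paragraph contains no hashtags or mentions, so it does not exercise the features that set TweetTokenizer apart. The sketch below uses an invented tweet (the handle and hashtag are illustrative) together with the constructor options `preserve_case`, `reduce_len`, and `strip_handles`.

```python
from nltk.tokenize import TweetTokenizer

tweet = "@FIFAcom What a gooooal in the #WorldCup final!"

# Default settings: handles and hashtags survive as single tokens
default_tokens = TweetTokenizer().tokenize(tweet)
print(default_tokens)

# Normalizing settings: lowercase, shorten long character runs, drop @handles
normalizing = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
normalized_tokens = normalizing.tokenize(tweet)
print(normalized_tokens)
```

With `reduce_len=True`, runs of three or more repeated characters are shortened to three, so 'gooooal' becomes 'goooal'.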


### f. Multi-Word Expression Tokenizer
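##### NLTK's MWETokenizer merges specified multi-word expressions into single tokens after an initial word tokenization. The two expressions registered below ('rhythmic symphony' and "water's edge") do not occur in this paragraph, so the output is identical to that of word_tokenize. Useful in tasks where understanding multi-word phrases is essential, such as specialized domain language processing.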


In [8]:
from nltk.tokenize import MWETokenizer
mwetokenizer = MWETokenizer([('rhythmic', 'symphony'), ('water\'s', 'edge')])
mwe_tokens = mwetokenizer.tokenize(nltk.word_tokenize(paragraph))
print(mwe_tokens)


['Football', ',', 'known', 'as', 'the', 'beautiful', 'game', ',', 'unites', 'fans', 'across', 'the', 'globe', 'in', 'passion', 'and', 'excitement', '.', 'From', 'the', 'roar', 'of', 'the', 'crowd', 'to', 'the', 'precision', 'of', 'a', 'well-executed', 'pass', ',', 'every', 'moment', 'on', 'the', 'pitch', 'holds', 'its', 'own', 'story', '.', 'In', 'the', 'heat', 'of', 'competition', ',', 'players', 'strive', 'for', 'glory', ',', 'chasing', 'dreams', 'of', 'victory', 'and', 'championship', 'titles', '.', 'The', 'drama', 'unfolds', 'with', 'each', 'match', ',', 'as', 'teams', 'battle', 'for', 'supremacy', ',', 'fueled', 'by', 'skill', ',', 'determination', ',', 'and', 'sheer', 'grit', '.', 'Whether', 'it', "'s", 'the', 'World', 'Cup', 'final', 'or', 'a', 'local', 'derby', ',', 'the', 'spirit', 'of', 'football', 'transcends', 'boundaries', ',', 'creating', 'memories', 'that', 'last', 'a', 'lifetime', '.']
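To see a merge actually happen, the expressions have to be present in the token stream. The following sketch registers two phrases that do occur in the paragraph; the choice of ('World', 'Cup') and ('local', 'derby') is illustrative, and the token list is a hand-copied slice of the word_tokenize output above.

```python
from nltk.tokenize import MWETokenizer

# A slice of the word-level tokens from the last sentence of the paragraph
tokens = ['Whether', 'it', "'s", 'the', 'World', 'Cup', 'final',
          'or', 'a', 'local', 'derby']

# Register expressions that occur in the text; matches are joined with '_'
mwe = MWETokenizer([('World', 'Cup'), ('local', 'derby')], separator='_')
merged = mwe.tokenize(tokens)
print(merged)
# ['Whether', 'it', "'s", 'the', 'World_Cup', 'final', 'or', 'a', 'local_derby']
```

Matching is exact and case-sensitive on the token sequence, so ('world', 'cup') would not match here.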


### g. TextBlob Word Tokenize
##### TextBlob's words attribute provides a convenient way to access the words in the paragraph, with punctuation tokens removed. Suitable for quick and simple NLP tasks, especially in educational or prototyping contexts.

In [9]:
from textblob import TextBlob
blob = TextBlob(paragraph)
textblob_tokens = blob.words
print(textblob_tokens)


['Football', 'known', 'as', 'the', 'beautiful', 'game', 'unites', 'fans', 'across', 'the', 'globe', 'in', 'passion', 'and', 'excitement', 'From', 'the', 'roar', 'of', 'the', 'crowd', 'to', 'the', 'precision', 'of', 'a', 'well-executed', 'pass', 'every', 'moment', 'on', 'the', 'pitch', 'holds', 'its', 'own', 'story', 'In', 'the', 'heat', 'of', 'competition', 'players', 'strive', 'for', 'glory', 'chasing', 'dreams', 'of', 'victory', 'and', 'championship', 'titles', 'The', 'drama', 'unfolds', 'with', 'each', 'match', 'as', 'teams', 'battle', 'for', 'supremacy', 'fueled', 'by', 'skill', 'determination', 'and', 'sheer', 'grit', 'Whether', 'it', "'s", 'the', 'World', 'Cup', 'final', 'or', 'a', 'local', 'derby', 'the', 'spirit', 'of', 'football', 'transcends', 'boundaries', 'creating', 'memories', 'that', 'last', 'a', 'lifetime']


### h. spaCy Tokenizer
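##### spaCy splits the paragraph with a rule-based tokenizer and returns a Doc of Token objects; the loaded en_core_web_sm pipeline then annotates each token with information such as part-of-speech tags, dependencies, and entities. Hyphenated words are split around the hyphen ('well', '-', 'executed'). Valuable in various NLP tasks, including named entity recognition, dependency parsing, and other advanced applications.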


In [10]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(paragraph)
spacy_tokens = [token.text for token in doc]
print(spacy_tokens)

['Football', ',', 'known', 'as', 'the', 'beautiful', 'game', ',', 'unites', 'fans', 'across', 'the', 'globe', 'in', 'passion', 'and', 'excitement', '.', 'From', 'the', 'roar', 'of', 'the', 'crowd', 'to', 'the', 'precision', 'of', 'a', 'well', '-', 'executed', 'pass', ',', 'every', 'moment', 'on', 'the', 'pitch', 'holds', 'its', 'own', 'story', '.', 'In', 'the', 'heat', 'of', 'competition', ',', 'players', 'strive', 'for', 'glory', ',', 'chasing', 'dreams', 'of', 'victory', 'and', 'championship', 'titles', '.', 'The', 'drama', 'unfolds', 'with', 'each', 'match', ',', 'as', 'teams', 'battle', 'for', 'supremacy', ',', 'fueled', 'by', 'skill', ',', 'determination', ',', 'and', 'sheer', 'grit', '.', 'Whether', 'it', "'s", 'the', 'World', 'Cup', 'final', 'or', 'a', 'local', 'derby', ',', 'the', 'spirit', 'of', 'football', 'transcends', 'boundaries', ',', 'creating', 'memories', 'that', 'last', 'a', 'lifetime', '.']


### i. Gensim Word Tokenizer
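##### Gensim's utils.tokenize yields alphabetic tokens only, so punctuation is dropped and words are split at hyphens and apostrophes ('well', 'executed'; 'it', 's'). Gensim is known for topic modeling and document similarity analysis, and this tokenizer is used in topic modeling, document clustering, and other tasks related to semantic analysis.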

In [11]:
from gensim.utils import tokenize
gensim_tokens = list(tokenize(paragraph))
print(gensim_tokens)


['Football', 'known', 'as', 'the', 'beautiful', 'game', 'unites', 'fans', 'across', 'the', 'globe', 'in', 'passion', 'and', 'excitement', 'From', 'the', 'roar', 'of', 'the', 'crowd', 'to', 'the', 'precision', 'of', 'a', 'well', 'executed', 'pass', 'every', 'moment', 'on', 'the', 'pitch', 'holds', 'its', 'own', 'story', 'In', 'the', 'heat', 'of', 'competition', 'players', 'strive', 'for', 'glory', 'chasing', 'dreams', 'of', 'victory', 'and', 'championship', 'titles', 'The', 'drama', 'unfolds', 'with', 'each', 'match', 'as', 'teams', 'battle', 'for', 'supremacy', 'fueled', 'by', 'skill', 'determination', 'and', 'sheer', 'grit', 'Whether', 'it', 's', 'the', 'World', 'Cup', 'final', 'or', 'a', 'local', 'derby', 'the', 'spirit', 'of', 'football', 'transcends', 'boundaries', 'creating', 'memories', 'that', 'last', 'a', 'lifetime']


### j. Tokenization with Keras
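##### Keras' text_to_word_sequence function lowercases the paragraph, strips punctuation, and splits on whitespace. The apostrophe is not in its default filter set, so "it's" survives while 'well-executed' splits in two. Used as a preprocessing step in text classification, language modeling, and sequence-to-sequence tasks.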

In [12]:
from keras.preprocessing.text import text_to_word_sequence
keras_tokens = text_to_word_sequence(paragraph)
print(keras_tokens)

['football', 'known', 'as', 'the', 'beautiful', 'game', 'unites', 'fans', 'across', 'the', 'globe', 'in', 'passion', 'and', 'excitement', 'from', 'the', 'roar', 'of', 'the', 'crowd', 'to', 'the', 'precision', 'of', 'a', 'well', 'executed', 'pass', 'every', 'moment', 'on', 'the', 'pitch', 'holds', 'its', 'own', 'story', 'in', 'the', 'heat', 'of', 'competition', 'players', 'strive', 'for', 'glory', 'chasing', 'dreams', 'of', 'victory', 'and', 'championship', 'titles', 'the', 'drama', 'unfolds', 'with', 'each', 'match', 'as', 'teams', 'battle', 'for', 'supremacy', 'fueled', 'by', 'skill', 'determination', 'and', 'sheer', 'grit', 'whether', "it's", 'the', 'world', 'cup', 'final', 'or', 'a', 'local', 'derby', 'the', 'spirit', 'of', 'football', 'transcends', 'boundaries', 'creating', 'memories', 'that', 'last', 'a', 'lifetime']
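The behaviour above can be approximated with the standard library alone. The sketch below is a rough re-implementation for illustration, not Keras code: the name `simple_word_sequence` is hypothetical, and `DEFAULT_FILTERS` is assumed to match the default `filters` string documented for text_to_word_sequence.

```python
# Assumed default filter characters; note that the apostrophe is absent.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Lowercase, replace every filter character with the separator, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: split for ch in filters})
    return [tok for tok in text.translate(table).split(split) if tok]

sequence = simple_word_sequence("Whether it's a well-executed pass.")
print(sequence)
# ['whether', "it's", 'a', 'well', 'executed', 'pass']
```

Because the hyphen is a filter character and the apostrophe is not, the sketch reproduces the two quirks visible in the Keras output.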
