# Introduction to Tokenization

Tokenization is the process of splitting text into smaller units called tokens. These tokens can be words, subwords, or even characters, depending on the method used.
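To make the three granularities concrete, here is a minimal sketch using only the standard library. The `chunk_word` helper is hypothetical and exists purely for illustration: it cuts words into fixed-size pieces, whereas real subword tokenizers (such as BPE or WordPiece) learn their pieces from data.

```python
sample = "Tokenizers split text."

# Word-level: split on whitespace (punctuation stays attached to words).
word_tokens = sample.split()

# Character-level: every character, including spaces, becomes a token.
char_tokens = list(sample)


def chunk_word(word, size=4):
    """Toy subword split (hypothetical helper): fixed-size chunks of a word."""
    return [word[i:i + size] for i in range(0, len(word), size)]


# Subword-level (toy version): break each word into smaller pieces.
subword_tokens = [piece for word in word_tokens for piece in chunk_word(word)]

print(word_tokens)       # ['Tokenizers', 'split', 'text.']
print(subword_tokens)    # ['Toke', 'nize', 'rs', 'spli', 't', 'text', '.']
print(len(char_tokens))  # 22
```

The same sentence yields 3, 7, or 22 tokens depending on the unit chosen, which is why the choice of tokenizer matters for everything that follows.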

### Why is it important?
- It’s the first step in most **NLP** pipelines.
- The choice of tokenization method can significantly affect downstream tasks such as machine translation and sentiment analysis.

In [8]:
# Uncomment on the first run to download the NLTK tokenizer data
# import nltk
# nltk.download('punkt_tab')

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.tokenize import word_tokenize
import spacy

# from utils import *
from utils_viz import visualize_tokens

Load sample text

In [48]:
text = "Although she’d planned to visit the museum—founded in 1876—on Monday, the 3.5-inch snowfall forced her to reschedule for next week’s tour."
# text = "sudo rm -rf ~/memories/childhood.cringe -> Error: File is read-only."

In [72]:
# whitespace tokenization
tokens_whitespace = text.split()

# nltk tokenization
tokens_nltk = word_tokenize(text)

# spaCy tokenization (install the model once with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
tokens_spacy = [token.text for token in doc]

In [74]:
for tokens, tokenizer_name in zip(
    [tokens_whitespace, tokens_nltk, tokens_spacy],
    ["whitespace", "nltk", "spacy"],
):
    print(f"\nTokenizer: {tokenizer_name}")
    print(f"length: {len(tokens)}")
    visualize_tokens(tokens)


Tokenizer: whitespace
length: 21

Tokenizer: nltk
length: 27

Tokenizer: spacy
length: 31
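
The whitespace count is the lowest because `str.split` leaves punctuation attached to the neighbouring word (for example `Monday,` and `tour.`), while NLTK and spaCy emit punctuation as separate tokens. The sketch below shows that effect with a simple regular expression. The pattern is an illustrative assumption, not the rule either library actually applies, so its count differs from both of theirs.

```python
import re

text = "Although she’d planned to visit the museum—founded in 1876—on Monday, the 3.5-inch snowfall forced her to reschedule for next week’s tour."

# Illustrative rule only: a run of word characters, or one non-space symbol.
PATTERN = re.compile(r"\w+|[^\w\s]")

whitespace_tokens = text.split()
regex_tokens = PATTERN.findall(text)

# Whitespace tokens that the regex rule breaks into more than one piece.
split_further = [tok for tok in whitespace_tokens if len(PATTERN.findall(tok)) > 1]

print(len(whitespace_tokens), len(regex_tokens))
print(split_further)
```

Every token listed in `split_further` contains punctuation, a dash, or an apostrophe, which are exactly the places where tokenizers tend to disagree.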
