## Tokenization: Explained with Simple Examples

### What is Tokenization?

Tokenization is the process of splitting text into smaller units called tokens: words, sentences, or subwords. It is an essential early step in Natural Language Processing (NLP) because later analysis operates on these units rather than on raw text.

### Types of Tokenization

- **Word Tokenization:** Splitting the text into individual words.
- **Sentence Tokenization:** Splitting the text into individual sentences.
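
The definition above also mentions subwords, which the examples in this notebook do not cover. As a rough illustration only, here is a WordPiece-style greedy longest-match split in plain Python. The function name `subword_tokenize` and the tiny `toy_vocab` are hypothetical; real subword tokenizers learn their vocabulary from data.

```python
def subword_tokenize(word, vocab):
    """Split one word into the longest pieces found in vocab (greedy, left to right)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched at this position
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabulary, chosen by hand for this example
toy_vocab = {"token", "##ization", "##ize", "##s"}

print(subword_tokenize("tokenization", toy_vocab))
print(subword_tokenize("tokens", toy_vocab))
```

With this vocabulary, "tokenization" becomes `['token', '##ization']` and "tokens" becomes `['token', '##s']`, so related words share the piece `token`.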

**Why It's Useful:**  
Think of a sentence as a big puzzle. Tokenization helps us by breaking it down into smaller pieces (tokens), making it easier to analyze and understand each piece.
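
To see why a dedicated tokenizer helps, compare plain whitespace splitting with a slightly smarter split. This is a minimal sketch using only the standard library; the regular expression is an illustrative assumption and not how spaCy tokenizes internally.

```python
import re

text = "I love Natural Language Processing (NLP). I am using Spacy for it"

# Whitespace splitting leaves punctuation glued to the neighbouring word.
naive_tokens = text.split()
print(naive_tokens)

# Runs of word characters, or any single non-space symbol, become tokens.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)
```

The naive split produces the single token `(NLP).`, while the regex version separates it into `(`, `NLP`, `)` and `.`.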




In [None]:
import spacy

# Load spaCy's small English pipeline
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

In [None]:
# word tokenization
text = "I love Natural Language Processing (NLP). I am using Spacy for it"

# Process the text
doc = nlp(text)

# Tokenize words
tokens = [token.text for token in doc]
print(tokens)


['I', 'love', 'Natural', 'Language', 'Processing', '(', 'NLP', ')', '.', 'I', 'am', 'using', 'Spacy', 'for', 'it']


In [None]:
# Sentence tokenization
text = "I love Natural Language Processing (NLP). I am using Spacy for it"

doc = nlp(text)

# doc.sents yields one span per detected sentence
sentences = [sent.text for sent in doc.sents]
print(sentences)


['I love Natural Language Processing (NLP).', 'I am using Spacy for it']


In [None]:
#Tokenize Only Words Without Punctuation
text = "I love Natural Language Processing (NLP). I am using Spacy for it"
doc = nlp(text)

# Keep only alphabetic tokens (is_alpha drops punctuation, and would drop numbers too)
tokens_without_punctuation = [token.text for token in doc if token.is_alpha]
print(tokens_without_punctuation)

['I', 'love', 'Natural', 'Language', 'Processing', 'NLP', 'I', 'am', 'using', 'Spacy', 'for', 'it']


In [None]:
#Tokenize Without Punctuation and Numbers

text = "The price of the book is $50  and the tax is $5."
doc = nlp(text)

tokens_without_punctuation_and_numbers = [token.text for token in doc if token.is_alpha]
print(tokens_without_punctuation_and_numbers)

# Drop pure-digit tokens; words, symbols, and punctuation remain
# (the ' ' token comes from the double space in the text)
tokens_without_numbers = [token.text for token in doc if not token.is_digit]
print(tokens_without_numbers)

# Drop punctuation but keep numbers; '$' survives because spaCy
# treats it as a currency symbol, not as punctuation
tokens_without_punctuation = [token.text for token in doc if not token.is_punct]
print(tokens_without_punctuation)

['The', 'price', 'of', 'the', 'book', 'is', 'and', 'the', 'tax', 'is']
['The', 'price', 'of', 'the', 'book', 'is', '$', ' ', 'and', 'the', 'tax', 'is', '$', '.']
['The', 'price', 'of', 'the', 'book', 'is', '$', '50', ' ', 'and', 'the', 'tax', 'is', '$', '5']


In [None]:
#Tokenizing Words with Apostrophes

text = "I'm learning NLP and it's amazing!"
doc = nlp(text)

# spaCy splits contractions into separate tokens: "I'm" becomes "I" and "'m"
tokens_with_apostrophes = [token.text for token in doc]
print(tokens_with_apostrophes)

['I', "'m", 'learning', 'NLP', 'and', 'it', "'s", 'amazing', '!']
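
If a later step needs the contractions in one piece, the split tokens can be glued back together. The helper below is a minimal sketch in plain Python; `merge_clitics` is a hypothetical name, and it only handles pieces that start with an apostrophe (so it would not rejoin a piece such as `n't`).

```python
def merge_clitics(tokens):
    """Attach tokens like "'m" or "'s" to the token before them."""
    merged = []
    for tok in tokens:
        # A clitic starts with an apostrophe and has at least one more character.
        if merged and tok.startswith("'") and len(tok) > 1:
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged


tokens = ['I', "'m", 'learning', 'NLP', 'and', 'it', "'s", 'amazing', '!']
print(merge_clitics(tokens))
```

For the token list above this prints `["I'm", 'learning', 'NLP', 'and', "it's", 'amazing', '!']`.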


In [None]:
# Extract number-like tokens from text that contains phone numbers
text = "Call me at 123-456-7890 or 987-654-3210."
doc = nlp(text)

# like_num keeps number-like tokens. spaCy splits each phone number at the
# hyphens, so this returns the digit groups, not the full phone numbers.
tokens_with_phone_numbers = [token.text for token in doc if token.like_num]
print(tokens_with_phone_numbers)

['123', '456', '7890', '987', '654', '3210']


In [None]:
import re

# Use a regular expression to keep each phone number as a single token

# Example with a phone number
text = "Call me at 123-456-7890 or 987-654-3210."
pattern = r'\d{3}-\d{3}-\d{4}'

# Tokenize using regex to capture phone numbers
tokens_with_phone_numbers = [match.group() for match in re.finditer(pattern, text)]
print(tokens_with_phone_numbers)

['123-456-7890', '987-654-3210']
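
The pattern above only matches the `ddd-ddd-dddd` layout. As a hedged sketch, the variant below also accepts an optional parenthesised area code and dots or spaces as separators. The sample text and the set of accepted formats are assumptions for illustration; real-world phone numbers vary far more than this.

```python
import re

# Hypothetical sample text with three different layouts
text = "Call me at 123-456-7890, (987) 654-3210 or 555.123.4567."

# Optional parentheses around the area code, then an optional
# hyphen, dot, or whitespace between each group of digits.
flexible_pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"

phone_numbers = re.findall(flexible_pattern, text)
print(phone_numbers)
```

This prints `['123-456-7890', '(987) 654-3210', '555.123.4567']`.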


In [None]:
# Tokenize a tweet containing an emoji, a hashtag, and a mention
tweet_text = "Hello world! 😊 I'm loving #NLP and it's so fun! @user"

# Process the text
doc = nlp(tweet_text)

tokens_tweet = [token.text for token in doc]
print(tokens_tweet)


['Hello', 'world', '!', '😊', 'I', "'m", 'loving', '#', 'NLP', 'and', 'it', "'s", 'so', 'fun', '!', '@user']
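
In the output above the hashtag is split into `#` and `NLP`. When hashtags and mentions are needed as whole units, a simple regular expression is one option. This is a minimal sketch using the standard library; the patterns assume that tags and user names consist only of word characters.

```python
import re

tweet_text = "Hello world! 😊 I'm loving #NLP and it's so fun! @user"

# '#' or '@' followed by one or more word characters
hashtags = re.findall(r"#\w+", tweet_text)
mentions = re.findall(r"@\w+", tweet_text)

print(hashtags)
print(mentions)
```

This prints `['#NLP']` and `['@user']`.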
