# Tokenization 

Tokenization is the process of breaking text into smaller units called tokens, such as words or subwords.
These tokens are the basic building blocks for natural language processing (NLP) tasks such as text analysis,
sentiment analysis, and machine translation. Tokenization is a crucial preprocessing step in NLP
because downstream components operate on discrete units rather than on raw character strings.

There are different approaches to tokenization depending on the specific requirements of the task and 
the characteristics of the text data.


### Word Tokenization:
Word tokenization, also known as word segmentation, divides a text into individual words based on spaces or punctuation
marks. For example, the sentence "Tokenization is important for NLP tasks"
would be tokenized into ["Tokenization", "is", "important", "for", "NLP", "tasks"].
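
The two splitting rules mentioned above can be sketched with Python's standard library alone. The regular expression here is an illustrative assumption, not the rule set that NLTK's `word_tokenize` applies internally.

```python
import re

text = "Tokenization is important for NLP tasks."

# Splitting on whitespace alone leaves punctuation attached to the last word.
whitespace_tokens = text.split()

# Treat punctuation as separate tokens: match runs of word characters,
# or any single character that is neither a word character nor whitespace.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

The whitespace split keeps `tasks.` as one token, while the regex version emits `tasks` and `.` separately.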

### Sentence Tokenization:
Sentence tokenization splits a text into individual sentences based on punctuation marks such as periods, exclamation marks,
or question marks. For example, the paragraph "This is sentence one. This is sentence two!"
would be tokenized into ["This is sentence one.", "This is sentence two!"].
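
A punctuation-based splitter can be sketched in a few lines. `naive_sent_split` is a hypothetical helper written for illustration; it has no notion of abbreviations, which is the kind of case that trained sentence tokenizers are designed to handle better.

```python
import re

def naive_sent_split(text):
    # Split at whitespace that follows '.', '!' or '?'.
    # The lookbehind keeps the punctuation mark attached to its sentence.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sentences = naive_sent_split("This is sentence one. This is sentence two!")
print(sentences)

# The same rule wrongly breaks after an abbreviation such as "Dr."
broken = naive_sent_split("Dr. Smith arrived. He was late.")
print(broken)
```

The second call shows the limitation: the period in `Dr.` is treated as a sentence boundary.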

### Subword Tokenization:
Subword tokenization divides words into smaller units. These may be linguistically meaningful pieces such as prefixes,
suffixes, or roots, or frequently occurring character sequences learned from data.
This technique is useful for handling languages with complex morphology or for dealing with out-of-vocabulary words.
Examples include Byte-Pair Encoding (BPE) and WordPiece tokenization.
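
To give a feel for how BPE arrives at its subwords, here is a minimal sketch of the merge-learning loop. It is a simplification under stated assumptions: `learn_bpe_merges` and the toy corpus are made up for illustration, there is no end-of-word marker, and production tokenizers add many details that are omitted here.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    # Represent each word as a tuple of symbols, starting from single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Toy corpus (an assumption for illustration): "low" is by far the most frequent word.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"]
merges, vocab = learn_bpe_merges(corpus, num_merges=2)
print(merges)
print(list(vocab))
```

After two merges the frequent word `low` has become a single symbol, while the rarer `newest` is still spelled out character by character. This is the behavior that lets BPE represent unseen words as sequences of known pieces.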

### Customized Tokenization:
Customized tokenization involves creating tokenization rules specific to the requirements of a particular task or domain. 
This may include handling special characters, emojis, or domain-specific abbreviations.
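
One common way to express such rules is an ordered regular-expression alternation in which the most specific patterns come first. The sketch below is illustrative only; `TOKEN_PATTERN`, `rule_based_tokenize`, and the sample text are assumptions, and a real domain would need its own rules.

```python
import re

# Order matters: earlier alternatives win, so specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+        # URLs are kept whole
    | [@#]\w+           # mentions and hashtags
    | \w+(?:'\w+)?      # words, with an optional internal apostrophe
    | [^\w\s]           # any other single non-space symbol
    """,
    re.VERBOSE,
)

def rule_based_tokenize(text):
    # findall returns the full match because the pattern has no capturing groups.
    return TOKEN_PATTERN.findall(text)

tokens = rule_based_tokenize("Can't wait! See https://example.com #nlp @user")
print(tokens)
```

Because the URL rule is tried before the generic word rule, `https://example.com` stays in one piece instead of being split at `:` and `/`.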

### TweetTokenizer
TweetTokenizer is a specific tokenizer provided by NLTK that is designed to handle tokenization of tweets.
Tweets often contain non-standard words, hashtags, mentions, URLs, and emojis,
which can pose challenges for standard tokenization methods.

In [4]:
# Word Tokenization
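import nltk  # word_tokenize needs the Punkt models: run nltk.download('punkt') once ('punkt_tab' on recent NLTK releases)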
from nltk.tokenize import word_tokenize

text = 'I am learning the Natural Language Processing'
print(word_tokenize(text))

['I', 'am', 'learning', 'the', 'Natural', 'Language', 'Processing']


In [9]:
import nltk
from nltk.tokenize import word_tokenize

# Sample text
text = "Tokenization is important for NLP tasks."

# Word tokenization
tokens = word_tokenize(text)

# Print the tokens
print(tokens)

['Tokenization', 'is', 'important', 'for', 'NLP', 'tasks', '.']


In [6]:
# Sentence Tokenization
from nltk.tokenize import sent_tokenize

text = "Our company annual growth rate is 25.3% . good job Mr.Bhimrao"
print(sent_tokenize(text))

['Our company annual growth rate is 25.3% .', 'good job Mr.Bhimrao']


In [7]:
# Subword Tokenization
import nltk
from nltk.tokenize import word_tokenize

# Sample text
text = "Subword tokenization is useful for handling complex words."

# Tokenize the text into words
words = word_tokenize(text)

# Define a function for subword tokenization
def subword_tokenize(word):
    # Enumerate every contiguous substring of the word. This is a naive illustration only;
    # it is not how BPE or WordPiece choose subwords.
    subwords = [word[i:j] for i in range(len(word)) for j in range(i + 1, len(word) + 1)]
    return subwords

# Tokenize each word into subwords
subword_tokens = [subword_tokenize(word) for word in words]

# Flatten the list of subword tokens
flat_subword_tokens = [token for sublist in subword_tokens for token in sublist]

# Print the subword tokens
print(flat_subword_tokens)

['S', 'Su', 'Sub', 'Subw', 'Subwo', 'Subwor', 'Subword', 'u', 'ub', 'ubw', 'ubwo', 'ubwor', 'ubword', 'b', 'bw', 'bwo', 'bwor', 'bword', 'w', 'wo', 'wor', 'word', 'o', 'or', 'ord', 'r', 'rd', 'd', 't', 'to', 'tok', 'toke', 'token', 'tokeni', 'tokeniz', 'tokeniza', 'tokenizat', 'tokenizati', 'tokenizatio', 'tokenization', 'o', 'ok', 'oke', 'oken', 'okeni', 'okeniz', 'okeniza', 'okenizat', 'okenizati', 'okenizatio', 'okenization', 'k', 'ke', 'ken', 'keni', 'keniz', 'keniza', 'kenizat', 'kenizati', 'kenizatio', 'kenization', 'e', 'en', 'eni', 'eniz', 'eniza', 'enizat', 'enizati', 'enizatio', 'enization', 'n', 'ni', 'niz', 'niza', 'nizat', 'nizati', 'nizatio', 'nization', 'i', 'iz', 'iza', 'izat', 'izati', 'izatio', 'ization', 'z', 'za', 'zat', 'zati', 'zatio', 'zation', 'a', 'at', 'ati', 'atio', 'ation', 't', 'ti', 'tio', 'tion', 'i', 'io', 'ion', 'o', 'on', 'n', 'i', 'is', 's', 'u', 'us', 'use', 'usef', 'usefu', 'useful', 's', 'se', 'sef', 'sefu', 'seful', 'e', 'ef', 'efu', 'eful', 'f', 

In [8]:
# Customized Tokenization
import re

# Sample text
text = "This is a sample sentence with some_special_characters. It also includes 'quotes' and emojis 😊🚀."

# Define a custom tokenization function
def custom_tokenize(text):
    # Define regex patterns for tokenization
    patterns = [
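        r'\w+',                      # Matches runs of word characters (letters, digits, underscore)
        r'[\u00A1-\u1FFF\u2C00-\uD7FF\w]+',  # Matches runs from some non-ASCII ranges; emojis lie outside these ranges
        r'\d+',                      # Matches digits (never reached, because \w+ already matches them)
        r'[^\w\s]'                   # Matches any single remaining symbol, including punctuation and emojis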
    ]
    
    # Combine patterns into a single regex pattern
    combined_pattern = '|'.join('(?:{})'.format(pattern) for pattern in patterns)
    
    # Tokenize text using the combined regex pattern
    tokens = re.findall(combined_pattern, text)
    
    return tokens

# Tokenize the text using the custom tokenization function
tokens = custom_tokenize(text)

# Print the tokens
print(tokens)

['This', 'is', 'a', 'sample', 'sentence', 'with', 'some_special_characters', '.', 'It', 'also', 'includes', "'", 'quotes', "'", 'and', 'emojis', '😊', '🚀', '.']


In [2]:
# TweetTokenizer
import nltk
from nltk.tokenize import TweetTokenizer

# Create a TweetTokenizer instance
tokenizer = TweetTokenizer()

# Sample tweet
tweet = "Just bought a new phone! #excited 📱😊"

# Tokenize the tweet
tokens = tokenizer.tokenize(tweet)

# Print the tokens
print(tokens)

['Just', 'bought', 'a', 'new', 'phone', '!', '#excited', '📱', '😊']


In [3]:
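# Customize TweetTokenizer: lowercase the tokens (preserve_case=False) and
# shorten runs of three or more repeated characters to three (reduce_len=True)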
tokenizer = TweetTokenizer(reduce_len=True, preserve_case=False)

# Tokenize the tweet
tokens = tokenizer.tokenize(tweet)

# Print the tokens
print(tokens)


['just', 'bought', 'a', 'new', 'phone', '!', '#excited', '📱', '😊']


In [12]:
# RegEx Tokenizations
from nltk.tokenize import regexp_tokenize

text = "NLP is fun and can deal with texts and sound, but can not deal with images. We have session at 11 AM!. We can learn a lot of $"

# Match runs of lowercase letters a-z; uppercase letters, digits, and punctuation are dropped
tokens = regexp_tokenize(text, pattern='[a-z]+')

print(tokens)


['is', 'fun', 'and', 'can', 'deal', 'with', 'texts', 'and', 'sound', 'but', 'can', 'not', 'deal', 'with', 'images', 'e', 'have', 'session', 'at', 'e', 'can', 'learn', 'a', 'lot', 'of']


In [14]:
# Adding ' to the character class keeps contractions such as can't in one token
# (this text contains none, so the output is unchanged)
regexp_tokenize(text,"[a-z']+")

['is',
 'fun',
 'and',
 'can',
 'deal',
 'with',
 'texts',
 'and',
 'sound',
 'but',
 'can',
 'not',
 'deal',
 'with',
 'images',
 'e',
 'have',
 'session',
 'at',
 'e',
 'can',
 'learn',
 'a',
 'lot',
 'of']

In [15]:
# Match runs of uppercase letters A-Z
regexp_tokenize(text,"[A-Z]+")

['NLP', 'W', 'AM', 'W']

In [16]:
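# Everything in one token: in a non-raw string, \a is the BEL character (0x07),
# so [\a-z'] covers almost all of ASCII up to z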
regexp_tokenize(text,"[\a-z']+")

['NLP is fun and can deal with texts and sound, but can not deal with images. We have session at 11 AM!. We can learn a lot of $']
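
The single-token result above comes from how Python reads the backslash, which the following sketch demonstrates with the standard `re` module. The shortened `sample` string is an assumption made to keep the example small.

```python
import re

# In a normal (non-raw) string literal, "\a" is the single BEL character (0x07).
assert "\a" == "\x07"

sample = "We have session at 11 AM!"

# The class [\a-z'] therefore covers every character from 0x07 up to 'z',
# which includes spaces, digits, punctuation, and uppercase letters.
whole = re.findall("[\a-z']+", sample)

# Without the backslash, only lowercase letters and apostrophes match.
lower = re.findall("[a-z']+", sample)

print(whole)
print(lower)
```

Writing regex patterns as raw strings (`r"..."`) avoids this kind of accidental escape.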

In [18]:
# A leading ^ inside the brackets negates the class: match runs of anything except a-z and '
regexp_tokenize(text,"[^a-z']+")

['NLP ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ', ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '. W',
 ' ',
 ' ',
 ' ',
 ' 11 AM!. W',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' $']

In [19]:
# Only runs of digits
regexp_tokenize(text,"[0-9]+")

['11']

In [20]:
# Everything except digits: the text is split wherever a number occurs
regexp_tokenize(text,"[^0-9]+")

['NLP is fun and can deal with texts and sound, but can not deal with images. We have session at ',
 ' AM!. We can learn a lot of $']

In [21]:
# Only the $ symbol
regexp_tokenize(text,"[$]")

['$']