## 1. word_tokenize - 

In [1]:
from nltk.tokenize import word_tokenize

text = "The quick brown fox, said hello! to the lazy dog."
tokens = word_tokenize(text)

print(tokens)

['The', 'quick', 'brown', 'fox', ',', 'said', 'hello', '!', 'to', 'the', 'lazy', 'dog', '.']


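- It splits text into word and punctuation tokens. Internally it first splits the text into sentences with the Punkt sentence tokenizer and then applies a Treebank-style word tokenizer to each sentence, so punctuation such as "," and "!" becomes its own token.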

## 2. sent_tokenize -

In [2]:
from nltk.tokenize import sent_tokenize

text = "This is a sample sentence. It's meant to be tokenized using a sentence tokenizer."

tokens = sent_tokenize(text)

print(tokens)

['This is a sample sentence.', "It's meant to be tokenized using a sentence tokenizer."]


- Splits text into sentences using NLTK's pre-trained Punkt model. Punkt learns abbreviations, collocations, and likely sentence starters from text, so it does more than look for a period followed by whitespace.
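
To illustrate why a trained model helps, the sketch below splits on sentence-final punctuation followed by whitespace using only Python's `re` module. It is not NLTK's implementation; the helper name `naive_sent_split` and the sample text are made up for illustration.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (a naive rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = "Dr. Smith arrived at 5 p.m. yesterday. He left early."
naive_sentences = naive_sent_split(text)
print(naive_sentences)
```

The naive rule also breaks after "Dr." and "p.m.", producing four pieces instead of two sentences. Abbreviations like these are the kind of case a Punkt model is meant to handle.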

## 3. WhitespaceTokenizer -

In [3]:
from nltk.tokenize import WhitespaceTokenizer

text = "The quick brown fox, said hello! to the lazy dog."
tokenizer = WhitespaceTokenizer()
tokens = tokenizer.tokenize(text)

print(tokens)

['The', 'quick', 'brown', 'fox,', 'said', 'hello!', 'to', 'the', 'lazy', 'dog.']


- This tokenizer simply splits text into tokens based on whitespace characters such as spaces, tabs, and line breaks.
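
As a rough point of comparison, Python's built-in `str.split()` with no arguments splits on the same kinds of whitespace, and `re.finditer` can recover the character offsets of each token. This is a standard-library sketch rather than the NLTK class itself, and the sample string is invented.

```python
import re

text = "The quick\tbrown fox,\nsaid hello!"

# Splitting on any run of whitespace: spaces, tabs and newlines all count.
tokens = text.split()
print(tokens)

# Character offsets (start, end) of each whitespace-delimited token.
spans = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
print(spans)
```

Offsets are useful when tokens have to be mapped back to the original string, for example to highlight them. NLTK tokenizers offer a `span_tokenize` method for the same purpose.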

## 4. WordPunctTokenizer -

In [4]:
from nltk.tokenize import WordPunctTokenizer

text = "I can't believe it's not butter!"
tokenizer = WordPunctTokenizer()
tokens = tokenizer.tokenize(text)

print(tokens)

['I', 'can', "'", 't', 'believe', 'it', "'", 's', 'not', 'butter', '!']


- This tokenizer splits text into runs of alphanumeric characters and runs of other non-whitespace characters (the pattern `\w+|[^\w\s]+`).
- Punctuation therefore always ends up in a separate token, which also breaks contractions apart: "can't" becomes `can`, `'`, `t`.
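
The behaviour can be reproduced with the standard `re` module, assuming the pattern `\w+|[^\w\s]+` that NLTK documents for this tokenizer. The constant name `WORD_PUNCT` and the second sample string are illustrative only.

```python
import re

# Either a run of word characters, or a run of non-word, non-space characters.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

first = WORD_PUNCT.findall("I can't believe it's not butter!")
second = WORD_PUNCT.findall("Wait... what?!")

print(first)
print(second)
```

Note that consecutive punctuation characters such as "..." or "?!" stay together in one token, because the second alternative matches a whole run.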

## 5. TreebankWordTokenizer -

In [5]:
from nltk.tokenize import TreebankWordTokenizer

text = "What's the matter with kids now-a-days?"
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)

print(tokens)

['What', "'s", 'the', 'matter', 'with', 'kids', 'now-a-days', '?']


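- This is a rule-based tokenizer that follows the conventions used in the Penn Treebank corpus.
- It splits off most punctuation, but treats a period as a separate token only at the end of the input, so it expects one sentence at a time.
- It splits standard English contractions (e.g. "can't" becomes `ca`, `n't`; "What's" becomes `What`, `'s`) and keeps hyphenated words such as "mother-in-law" or "now-a-days" as single tokens.
- Its rules are written for English; it has no language-specific handling for other languages such as German, French, or Spanish.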
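
A short sketch of the contraction, hyphen, and period handling, using invented sample sentences. It assumes NLTK is installed; `TreebankWordTokenizer` needs no extra data downloads.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# The contraction is split into stem + "n't"; the hyphenated word stays whole.
one_sentence = tokenizer.tokenize("They don't like mother-in-law jokes.")
print(one_sentence)

# With two sentences in one string, only the final period is split off.
two_sentences = tokenizer.tokenize("It rained. We stayed home.")
print(two_sentences)
```

The second call leaves "rained." with its period attached, which is why this tokenizer is usually applied after sentence splitting.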

## 6. RegexpTokenizer -

In [6]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\b[A-Z][a-z]*\b|\d{1,2}\.\d{1,2}\.\d{2,4}|\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')
text = "John Smith, PhD, was born on 1.2.1980 in Los Angeles, California. He now works at Google Inc. in New York City."
tokens = tokenizer.tokenize(text)
print(tokens)

['John', 'Smith', '1.2.1980', 'Los', 'Angeles', 'California', 'He', 'Google', 'Inc', 'New', 'York', 'City']


- By specifying a regular expression pattern that matches the desired tokens, RegexpTokenizer can be customized to suit the needs of a particular task or dataset.
- It allows more fine-grained control over tokenization than the built-in word_tokenize function.
- Here the regular expression matches three types of tokens:
    - __A__) <U>Capitalized words</U>: An uppercase letter followed by zero or more lowercase letters, bounded by word boundaries (e.g. "John", "Smith", "Los", "Angeles", "California", "He", "Google", "Inc", "New", "York", "City"). This catches proper nouns but also sentence-initial words such as "He", and it skips mixed-case words such as "PhD".
    - __B__) <U>Dates in the format "d.m.yyyy"</U>: A one- or two-digit day and a one- or two-digit month, each followed by a period, then a two- to four-digit year (e.g. "1.2.1980").
    - __C__) <U>IP addresses</U>: A sequence of four numbers, separated by periods, each number having one to three digits (e.g. "192.168.1.1"). Because the date alternative comes first, an address that begins like a date, such as "1.2.198.0", is matched by __B__ as "1.2.198" instead.
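
The order of the alternatives in the pattern matters, because the regex engine takes the first alternative that matches at each position. The sketch below uses `re.findall` directly, which is a close stand-in for how RegexpTokenizer applies a token pattern. The constant names and the sample text are invented, and the reordered pattern is a hypothetical variant rather than the notebook's.

```python
import re

CAPITALIZED = r"\b[A-Z][a-z]*\b"
DATE = r"\d{1,2}\.\d{1,2}\.\d{2,4}"
IP = r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b"

text = "Ping 192.168.1.1 and 1.2.198.0 on 1.2.1980"

# Same order as the notebook's pattern: the date alternative is tried before the IP one.
date_first = re.findall("|".join([CAPITALIZED, DATE, IP]), text)
# Hypothetical variant: the IP alternative is tried before the date one.
ip_first = re.findall("|".join([CAPITALIZED, IP, DATE]), text)

print(date_first)
print(ip_first)
```

With the date alternative first, "1.2.198.0" is cut down to "1.2.198". Putting the more specific IP alternative first keeps the address whole while still matching the date.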

## 7. TweetTokenizer -

In [7]:
from nltk.tokenize import TweetTokenizer

text = "I just had the best pizza ever delivered in roughly 20-30 mins! 🍕😍 #yum #pizzalove"
tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize(text)

print(tokens)

['I', 'just', 'had', 'the', 'best', 'pizza', 'ever', 'delivered', 'in', 'roughly', '20-30', 'mins', '!', '🍕', '😍', '#yum', '#pizzalove']


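- This tokenizer is designed for tweets and similar social-media text. It keeps hashtags, @usernames, URLs, emoticons, and emoji as single tokens, and its `preserve_case`, `reduce_len`, and `strip_handles` options can lowercase the text, shorten runs of repeated characters, and remove @usernames.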
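
The constructor options can be compared on a single tweet. This sketch assumes NLTK is installed and uses the sample tweet from the NLTK documentation; the variable names are illustrative.

```python
from nltk.tokenize import TweetTokenizer

tweet = "@remy: This is waaaaayyyy too much for you!!!!!!"

# Default settings keep the @username and the elongated word as written.
default_tokens = TweetTokenizer().tokenize(tweet)

# strip_handles drops @usernames; reduce_len caps repeated characters at three.
cleaned_tokens = TweetTokenizer(strip_handles=True, reduce_len=True).tokenize(tweet)

print(default_tokens)
print(cleaned_tokens)
```

Options like these are typically chosen when user names carry no signal for the task, or when elongated spellings such as "waaaaayyyy" would otherwise inflate the vocabulary.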

## 8. ToktokTokenizer -

In [8]:
import time
from nltk.tokenize import word_tokenize, ToktokTokenizer

text = "Mr.Troy is from the U.S.A,and he graduated in 2010. His Email address is troy@win_it.com. #NLP is a fascinating field!"

start_word = time.time()
tokens_word_tokenize = word_tokenize(text)
end_word = time.time()
print("Tokens with word_tokenize:\n", tokens_word_tokenize,"\n")
print("Time taken with word_tokenize: ", end_word-start_word, " seconds")
print("*"*125)
start_toktok = time.time()
tokenizer = ToktokTokenizer()
tokens_toktoktokenizer = tokenizer.tokenize(text)
end_toktok = time.time()
print("Tokens with ToktokTokenizer:\n", tokens_toktoktokenizer,"\n")
print("Time taken with ToktokTokenizer: ", end_toktok-start_toktok, " seconds \n")
print("ToktokTokenizer is", round(((end_word-start_word)/(end_toktok-start_toktok)),2), "x faster")

Tokens with word_tokenize:
 ['Mr.Troy', 'is', 'from', 'the', 'U.S.A', ',', 'and', 'he', 'graduated', 'in', '2010', '.', 'His', 'Email', 'address', 'is', 'troy', '@', 'win_it.com', '.', '#', 'NLP', 'is', 'a', 'fascinating', 'field', '!'] 

Time taken with word_tokenize:  0.0013442039489746094  seconds
*****************************************************************************************************************************
Tokens with ToktokTokenizer:
 ['Mr.Troy', 'is', 'from', 'the', 'U.S.A', ',', 'and', 'he', 'graduated', 'in', '2010.', 'His', 'Email', 'address', 'is', 'troy@win_it.com.', '#NLP', 'is', 'a', 'fascinating', 'field', '!'] 

Time taken with ToktokTokenizer:  0.00046753883361816406  seconds 

ToktokTokenizer is 2.88 x faster


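- The ToktokTokenizer is a fast, simple rule-based tokenizer (a port of the tok-tok.pl script) that is well suited to large volumes of text in whitespace-delimited scripts.
- It expects one sentence per input and separates a period only at the very end of the text, which is why "2010." and "troy@win_it.com." keep their periods in the output above.
- The timing comes from a single run and is only a rough indication: the first word_tokenize call may include one-time setup such as loading the Punkt model, and the ToktokTokenizer measurement includes constructing the tokenizer.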
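
A single `time.time()` measurement is noisy, so a steadier comparison repeats each call many times. The sketch below uses the standard `timeit` module. The helper `best_time_per_call` is a hypothetical name, and the two stand-in tokenizers are placeholders so the example runs without NLTK data; in the notebook, `word_tokenize` and `ToktokTokenizer().tokenize` would be passed instead.

```python
import re
import timeit

def best_time_per_call(tokenize, text, repeat=5, number=2000):
    """Return the best average seconds per call over `repeat` timing runs."""
    runs = timeit.repeat(lambda: tokenize(text), repeat=repeat, number=number)
    return min(runs) / number

text = "Mr.Troy is from the U.S.A,and he graduated in 2010. #NLP is a fascinating field!"

# Stand-in tokenizers (assumptions for illustration, not NLTK's tokenizers).
candidates = {
    "str.split": str.split,
    "regex": re.compile(r"\w+|[^\w\s]+").findall,
}

results = {name: best_time_per_call(fn, text) for name, fn in candidates.items()}
for name, seconds in results.items():
    print(f"{name}: {seconds * 1e6:.2f} microseconds per call")
```

Taking the minimum over several runs reduces the influence of other processes, and constructing each tokenizer before timing keeps setup cost out of the per-call figure.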

## 9. MWETokenizer -

In [9]:
from nltk.tokenize import MWETokenizer

# Define a list of multi-word expressions (MWEs)
mwe_list = [("New", "York"), ("San", "Francisco"), ("happy", "birthday")]

# Create an instance of the MWETokenizer, passing in the list of MWEs
mwe_tokenizer = MWETokenizer(mwe_list)

# Tokenize a sentence using the MWETokenizer
text = "I love New York and San Francisco. Happy birthday to you!"
tokens = mwe_tokenizer.tokenize(text.split())

print(tokens)

['I', 'love', 'New_York', 'and', 'San', 'Francisco.', 'Happy', 'birthday', 'to', 'you!']


- MWE stands for "multi-word expression". The MWETokenizer takes text that has already been split into tokens and merges each known multi-word expression into a single token (joined with "_" by default), rather than leaving its words as separate tokens.
- MWEs are expressions made up of multiple words that have a single, specific meaning (e.g. "Happy Birthday", "Hunger Games").
- Matching is exact and case-sensitive. In the output above, "San Francisco." is not merged because `text.split()` leaves the period attached to "Francisco.", and "Happy birthday" is not merged because the list contains lowercase "happy".
- The MWETokenizer can be useful in natural language processing tasks such as named entity recognition, sentiment analysis, and text classification, where keeping an MWE together preserves its meaning.
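
One way to get all three expressions merged in the example above is to separate punctuation before merging and to register the MWEs with the same capitalization as the text. This is a minimal sketch assuming NLTK is installed; the regex pre-tokenizer is a stand-in chosen for illustration, and any word tokenizer that splits off punctuation would serve.

```python
import re
from nltk.tokenize import MWETokenizer

text = "I love New York and San Francisco. Happy birthday to you!"

# Separate punctuation from words first (a simple regex stands in for a word tokenizer).
words = re.findall(r"\w+|[^\w\s]+", text)

mwe_tokenizer = MWETokenizer([("New", "York"), ("San", "Francisco")], separator="_")
# MWEs can also be added after construction; the case must match the text.
mwe_tokenizer.add_mwe(("Happy", "birthday"))

merged = mwe_tokenizer.tokenize(words)
print(merged)
```

An alternative is to lowercase both the tokens and the MWE list before merging, at the cost of losing case information in the output.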