### Tokenisation Revisited (concepts and code demos)

#### 1. Whitespace Tokenization

Whitespace tokenization splits text on spaces, tabs, and line breaks, treating every run of characters between whitespace as a distinct token. It is a quick way to break text into words or phrases, and it is especially common when processing structured data such as log files or code, where whitespace carries specific meaning.

Whitespace tokenization ignores the grammatical structure of the input and relies only on the visual separation of text. It suits NLP tasks where sentence boundaries matter little, such as keyword extraction or simple text pre-processing, where the goal is to split the input quickly into smaller, more manageable pieces. Below is the code demo.

In [6]:
from nltk.tokenize import WhitespaceTokenizer
from typing import List

def whitespace_tokenizer(text: str) -> List[str]:
    """
    Tokenizes the input text based on whitespace characters (spaces, tabs, newlines).
    
    Parameters:
    -----------
    text : str
        The input text to be tokenized.
    
    Returns:
    --------
    List[str]
        A list of tokens.
    """
    tokenizer = WhitespaceTokenizer()
    tokens = tokenizer.tokenize(text)
    return tokens

In [7]:
# Example usage
if __name__ == "__main__":
    sample_text = """Whitespace tokenization breaks a text document based on spaces and line breaks. 
    It considers every section of input text divided by whitespace as a distinct token. 
    Whitespace tokenization serves as a quick way to break down text into words or phrases. 
    This method is especially used for processing structured data including logs files or code, 
    in which whitespaces have specific meaning.
        
    """
    tokens = whitespace_tokenizer(sample_text)
    print("Tokens:", tokens)

Tokens: ['Whitespace', 'tokenization', 'breaks', 'a', 'text', 'document', 'based', 'on', 'spaces', 'and', 'line', 'breaks.', 'It', 'considers', 'every', 'section', 'of', 'input', 'text', 'divided', 'by', 'whitespace', 'as', 'a', 'distinct', 'token.', 'Whitespace', 'tokenization', 'serves', 'as', 'a', 'quick', 'way', 'to', 'break', 'down', 'text', 'into', 'words', 'or', 'phrases.', 'This', 'method', 'is', 'especially', 'used', 'for', 'processing', 'structured', 'data', 'including', 'logs', 'files', 'or', 'code,', 'in', 'which', 'whitespaces', 'have', 'specific', 'meaning.']
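
For plain strings, Python's built-in str.split() with no arguments also splits on runs of whitespace, so it yields the same kind of tokens as the demo above. The sketch below uses only the standard library. The helper name whitespace_spans and the log-style sample line are assumptions made for this example; the helper shows one way to keep character offsets alongside each token. Note that punctuation stays attached to the neighboring word.

```python
import re
from typing import List, Tuple

def whitespace_spans(text: str) -> List[Tuple[str, int, int]]:
    """Return (token, start, end) for every run of non-whitespace characters."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]

# Hypothetical log-style line mixing a space, a tab, and a newline.
sample = "ERROR 2024-01-05\tdisk full,\nretrying."

tokens = sample.split()            # split on any run of whitespace
spans = whitespace_spans(sample)   # same tokens, plus character offsets

print(tokens)  # ['ERROR', '2024-01-05', 'disk', 'full,', 'retrying.']
print(spans)
```

Keeping the offsets makes it possible to map each token back to its position in the original text, which is often useful when working with logs.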


#### 2. Dictionary-Based Tokenization

In dictionary-based tokenization, tokens are matched against words or phrases that already exist in a predefined dictionary. If a token is not found in the dictionary, special rules are applied to tokenize it.

Dictionary-based tokenization is used where identifying exact predefined phrases or terms matters most. It is common in domain-specific texts, where recognizing particular terms consistently is more important than simply splitting text into words. This method is particularly useful in NLP tasks such as medical text processing and parsing of legal documents, where precise term recognition is critical. Below is the code demo of dictionary-based tokenization.

In [None]:
import nltk
from nltk.tokenize import MWETokenizer
from typing import List

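# Download the Punkt data that nltk.word_tokenize relies on.
# (Newer NLTK releases may ask for the 'punkt_tab' resource instead.)
nltk.download('punkt')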

def dictionary_based_tokenizer(text: str, dictionary: List[List[str]]) -> List[str]:
    """
    Tokenizes the input text based on a predefined dictionary.
    
    Parameters:
    -----------
    text : str
        The input text string.
    dictionary : List[List[str]]
        A list of expressions, recognized as single tokens.
    
    Returns:
    --------
    List[str]
        A list of tokens.
    """
    # separator='_' joins the words of each matched expression with an underscore.
    tokenizer = MWETokenizer(dictionary, separator='_')
    tokens = tokenizer.tokenize(nltk.word_tokenize(text))
    return tokens

In [10]:
# Example usage
if __name__ == "__main__":
    # Define a dictionary of multi-word expressions
    dictionary = [
        ['heart', 'attack'],
        ['machine', 'learning'],
        ['New', 'York'],
    ]
    
    sample_text = "John suffered a massive heart attack. He was enrolled in machine learning in Washington DC."
    tokens = dictionary_based_tokenizer(sample_text, dictionary)
    print("Tokens:", tokens)

Tokens: ['John', 'suffered', 'a', 'massive', 'heart_attack', '.', 'He', 'was', 'enrolled', 'in', 'machine_learning', 'in', 'Washington', 'DC', '.']


Note: In the output, 'heart attack' appears as the single token 'heart_attack' and 'machine learning' as 'machine_learning'. 'New York' is in the dictionary but does not occur in the sample text, so it has no effect.
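
The demo above merges known phrases but does not show what happens to a token that is missing from the dictionary. One possible fallback rule is greedy longest-match segmentation: try to cover the unknown token with the longest dictionary entries, and emit a placeholder if that fails. The sketch below is a minimal standard-library illustration. The vocabulary, the <UNK> placeholder, and the function names segment_unknown and dictionary_tokenize are assumptions made for this example and are not part of NLTK.

```python
from typing import List, Set

def segment_unknown(word: str, vocab: Set[str], unk: str = "<UNK>") -> List[str]:
    """Greedy longest-match split of a word into vocabulary entries."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate starting at i first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No dictionary entry starts at position i: give up on this word.
            return [unk]
    return pieces

def dictionary_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Keep known words as they are; apply the fallback rule to the rest."""
    tokens: List[str] = []
    for word in text.lower().split():
        if word in vocab:
            tokens.append(word)
        else:
            tokens.extend(segment_unknown(word, vocab))
    return tokens

# Hypothetical toy vocabulary.
vocab = {"heart", "attack", "burn", "patient", "reports", "and"}
result = dictionary_tokenize("Patient reports heartburn and xyzzy", vocab)
print(result)  # ['patient', 'reports', 'heart', 'burn', 'and', '<UNK>']
```

Here 'heartburn' is not in the vocabulary, so it is covered by the entries 'heart' and 'burn', while 'xyzzy' cannot be covered and becomes the placeholder. Greedy matching is simple but can pick a poor split when entries overlap, which is one reason real systems often use more elaborate rules.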

#### 3. Rule-Based Tokenization

Rule-based tokenization splits text according to explicitly defined patterns, and it is particularly useful in text pre-processing tasks where precise structures or formats must be preserved. Regular expressions can be fine-tuned to handle various linguistic nuances or specific text formats. Below we use RegexpTokenizer from nltk with the required regular expression patterns.

In [17]:
from nltk.tokenize import RegexpTokenizer
from typing import List

def rule_based_tokenizer(text: str) -> List[str]:
    """
    Tokenizes the input text based on regular expressions.
    
    Parameters:
    -----------
    text : str
        The input string.
    
    Returns:
    --------
    List[str]
        A list of tokens.
    """
    # Define a regular expression that combines three rules.
    # Rule 1: Match dotted abbreviations made of single letters, like "U.K." or "U.A.E."
    # Rule 2: Match words, including those with apostrophes like "she's" or "haven't".
    # Rule 3: Match each remaining punctuation mark separately.
    pattern = r'''(?x)               # Set flag to allow verbose regex
                  (?:[A-Za-z]\.)+    # Abbreviations like U.K.
                | \w+(?:'\w+)?       # Words with apostrophes
                | [^\w\s]            # Separate punctuations
                '''
    
    # Create a RegexpTokenizer.
    tokenizer = RegexpTokenizer(pattern)
    
    # Tokenize.
    tokens = tokenizer.tokenize(text)
    
    return tokens

In [24]:
# Example usage
if __name__ == "__main__":
    sample_text = """Dr. John's a rocket engineer in the U.K. 
                   U.A.E. govt. has awarded him many awards."""
    tokens = rule_based_tokenizer(sample_text)
    print("Tokens:", tokens)

Tokens: ['Dr', '.', "John's", 'a', 'rocket', 'engineer', 'in', 'the', 'U.K.', 'U.A.E.', 'govt', '.', 'has', 'awarded', 'him', 'many', 'awards', '.']


- Please note that 'U.K.' and 'U.A.E.' are preserved as single tokens because they match the abbreviation rule (single letters, each followed by a period).
- 'Dr.' and 'govt.' do not match that rule, so each is split into the word ('Dr', 'govt') and a separate '.' token.
- For "John's", the tokenizer keeps the apostrophe and the following 's' as part of the token.
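
In the output above, 'Dr.' and 'govt.' each come out as a word followed by a separate period, because the abbreviation rule only matches single letters followed by periods. If such abbreviations should stay whole, one option is to add a rule that lists them explicitly. The sketch below does this with the standard re module and re.findall in place of RegexpTokenizer. The list KNOWN_ABBREVIATIONS and the function name rule_based_tokenize are assumptions made for this example.

```python
import re
from typing import List

# Hypothetical list of abbreviations that should keep their trailing period.
KNOWN_ABBREVIATIONS = ["Dr", "Mr", "Mrs", "govt"]
ABBREVIATION_RULE = "|".join(re.escape(a) for a in KNOWN_ABBREVIATIONS)

# The listed-abbreviation rule must come before the word rule,
# otherwise 'Dr' would be consumed as a plain word first.
PATTERN = re.compile(
    r"""
      (?:[A-Za-z]\.)+        # dotted acronyms such as U.K. or U.A.E.
    | \b(?:%s)\.             # listed abbreviations such as Dr. or govt.
    | \w+(?:'\w+)?           # words, optionally with an apostrophe part
    | [^\w\s]                # any other single punctuation mark
    """ % ABBREVIATION_RULE,
    re.VERBOSE,
)

def rule_based_tokenize(text: str) -> List[str]:
    """Return all non-overlapping matches of the combined pattern."""
    return PATTERN.findall(text)

sample = "Dr. John's a rocket engineer in the U.K. U.A.E. govt. has awarded him many awards."
tokens = rule_based_tokenize(sample)
print(tokens)
```

With this extra rule, 'Dr.' and 'govt.' are returned as single tokens, while the final period after 'awards' is still separate. The trade-off is that the list has to be maintained by hand, and a listed abbreviation that ends a sentence will absorb the sentence-final period.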

#### 4. Punctuation-Based Tokenizer

This tokenizer splits the input text on whitespace and punctuation while retaining the punctuation marks as tokens. Plain whitespace splitting leaves punctuation attached to neighboring words (for example 'breaks.' or 'code,' in the whitespace demo); punctuation-based tokenization overcomes this issue and yields cleaner tokens. Below is a simple code demo.

In [32]:
# Import wordpunct_tokenize from nltk.
from nltk.tokenize import wordpunct_tokenize

text = "Mr.Johnson owns three houses in New York on: 22nd street, downtown,and 42nd street."

tokens = wordpunct_tokenize(text)
tokens

['Mr',
 '.',
 'Johnson',
 'owns',
 'three',
 'houses',
 'in',
 'New',
 'York',
 'on',
 ':',
 '22nd',
 'street',
 ',',
 'downtown',
 ',',
 'and',
 '42nd',
 'street',
 '.']

- Please note: The output above retains punctuation marks as separate tokens. This helps in understanding the sentence structure, and the punctuation tokens help preserve the meaning and context of the original sentence.
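
NLTK documents wordpunct_tokenize as splitting on the regular expression \w+|[^\w\s]+, so its behavior can be reproduced with the standard re module. The sketch below does that (the function name wordpunct_like and the sample sentence are assumptions made for this example) and also shows a side effect worth knowing: decimals and contractions are broken apart as well.

```python
import re
from typing import List

def wordpunct_like(text: str) -> List[str]:
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = wordpunct_like("Mr.Johnson paid $3.50, didn't he?")
print(tokens)
# ['Mr', '.', 'Johnson', 'paid', '$', '3', '.', '50', ',', 'didn', "'", 't', 'he', '?']
```

The amount 3.50 becomes three tokens and "didn't" becomes 'didn', "'", 't'. When numbers or contractions matter, a rule-based pattern that handles them explicitly may be a better fit.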

#### 5. Tweet Tokenizer

Tweets are a special kind of text with their own typical structure, so generic tokenizers may fail to produce useful tokens on datasets of tweets. nltk provides a dedicated tokenizer for such cases. TweetTokenizer is a rule-based tokenizer that can deal with problematic character sequences and HTML entities, and it can remove Twitter handles when created with strip_handles=True. It can also normalize word length by shortening runs of repeated letters when created with reduce_len=True. In addition, it keeps hashtags and mentions intact as single tokens and handles emoticons and special symbols, all of which are common in tweets. Below is a small but reusable code demo with this tokenizer.

In [36]:
from nltk.tokenize import TweetTokenizer

def tweet_tokenizer(text):
    """
    Tokenizes the input text using NLTK's TweetTokenizer, which handles punctuation and special characters
    commonly found in tweets.

    Parameters:
    text (str): The input text to be tokenized.

    Returns:
    list: A list of tokens, including words and punctuation marks.
    """
    # Initialize the tokenizer with its default settings.
    tokenizer = TweetTokenizer()
    
    # Tokenize.
    tokens = tokenizer.tokenize(text)
    
    return tokens

In [37]:
# Example usage
if __name__ == "__main__":
    sample_text = """Mr.Johnson owns three houses in New York! 
    22nd street, downtown,and 42nd street." #marvellous @Rita"
    """
    tokens = tweet_tokenizer(sample_text)
    print(tokens)

['Mr.Johnson', 'owns', 'three', 'houses', 'in', 'New', 'York', '!', '22nd', 'street', ',', 'downtown', ',', 'and', '42nd', 'street', '.', '"', '#marvellous', '@Rita', '"']


- Please note: Punctuation marks such as '!', ',', and '.' are preserved as separate tokens.
- Hashtags (#marvellous) and mentions (@Rita) are preserved as distinct tokens.
- Mr.Johnson is kept intact.
- These examples emphasize that TweetTokenizer is specifically tailored to the nuances of social media text.
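
To make these ideas concrete without depending on NLTK, the sketch below imitates three of the behaviors described for tweets: keeping hashtags and mentions whole, optionally dropping handles, and shortening runs of repeated characters. The regular expressions are simplified assumptions written for this example and are not the patterns that TweetTokenizer uses internally. The function name simple_tweet_tokenize is hypothetical, and its two keyword arguments are only named after the TweetTokenizer options of the same name.

```python
import re
from typing import List

TOKEN_RE = re.compile(
    r"""
      [@\#]\w+             # mentions and hashtags
    | [:;=][-o]?[()DPp]    # a few western-style emoticons such as :) ;-) :D
    | \w+(?:'\w+)?         # words, optionally with an apostrophe part
    | [^\w\s]              # any other single punctuation mark
    """,
    re.VERBOSE,
)

def simple_tweet_tokenize(text: str,
                          strip_handles: bool = False,
                          reduce_len: bool = False) -> List[str]:
    """Simplified, regex-based imitation of a tweet-aware tokenizer."""
    if reduce_len:
        # Shorten any character repeated 3 or more times to exactly 3.
        text = re.sub(r"(.)\1{2,}", r"\1\1\1", text)
    tokens = TOKEN_RE.findall(text)
    if strip_handles:
        # Drop @mentions entirely.
        tokens = [t for t in tokens if not t.startswith("@")]
    return tokens

tweet = "@Rita this is soooooo cool :) #marvellous"
plain = simple_tweet_tokenize(tweet)
cleaned = simple_tweet_tokenize(tweet, strip_handles=True, reduce_len=True)
print(plain)    # ['@Rita', 'this', 'is', 'soooooo', 'cool', ':)', '#marvellous']
print(cleaned)  # ['this', 'is', 'sooo', 'cool', ':)', '#marvellous']
```

The order of the alternatives matters: mentions, hashtags, and emoticons are tried before the generic word and punctuation rules, so they are not split into pieces.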

#### 6. Multi-Word Expression (MWE) Tokenizer

The MWE (Multi-Word Expression) tokenizer handles sequences of words that act as a single unit in the input text. Examples include idiomatic phrases, compound nouns, and fixed expressions that can be misunderstood if tokenized into separate words. Treating these expressions as single tokens preserves their intended meaning and helps in accurate text analysis. Below is the code demo of this tokenizer with a reusable function.

In [43]:
from nltk.tokenize import MWETokenizer

def mwe_tokenizer(text, mwe_list):
    """
    Tokenizes while preserving indicated multi-word expressions (MWEs).

    Parameters:
    text (str): The input text.
    mwe_list (list of tuples): The list of MWEs to be treated as single tokens.

    Returns:
    list: A list of tokens with MWEs preserved as single tokens.
    """
    # Initialize the MWETokenizer with the provided MWEs (the default separator is '_')
    tokenizer = MWETokenizer(mwe_list)
    
    # Tokenize the input text
    tokens = tokenizer.tokenize(text.split())
    
    return tokens

In [48]:
if __name__ == "__main__":
    # Define multi-word expressions.
    mwe_list = [
        ('M', 'W', 'E'),
        ('Multi', 'Word', 'Tokenizer'),
        ('Natural', 'Language', 'Processing'),
        ('pre', 'processing')
    ]

In [49]:
# Input text
text = "M W E Tokenizer is an advanced tokenizer in Natural Language Processing pre processing tasks"

# Perform tokenization with MWEs
tokenized = mwe_tokenizer(text, mwe_list)

# Output the tokenized result
print(tokenized)

['M_W_E', 'Tokenizer', 'is', 'an', 'advanced', 'tokenizer', 'in', 'Natural_Language_Processing', 'pre_processing', 'tasks']


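- Note that 'M W E', 'Natural Language Processing', and 'pre processing' are combined into the single tokens M_W_E, Natural_Language_Processing, and pre_processing, as specified by mwe_list. The entry ('Multi', 'Word', 'Tokenizer') does not occur in the text, so it has no effect.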

- More task-specific tokenizers are available, such as the Penn Treebank (default) tokenizer, the spaCy tokenizer, the Moses tokenizer, and subword tokenization. Readers are encouraged to explore them using other books and web resources.
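
Returning to the MWE tokenizer above, the merging step itself can be sketched in a few lines of plain Python: scan the token list and, at each position, replace the longest listed expression with a single joined token. This is a simplified stand-in meant to illustrate the idea, not NLTK's implementation. The function name merge_mwes and the sample expressions are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

def merge_mwes(tokens: Sequence[str],
               mwes: Sequence[Tuple[str, ...]],
               separator: str = "_") -> List[str]:
    """Greedily replace the longest listed expression at each position."""
    # Try longer expressions first so ('New', 'York', 'City') beats ('New', 'York').
    # Empty expressions are skipped so the scan always moves forward.
    ordered = sorted((m for m in mwes if m), key=len, reverse=True)
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        for mwe in ordered:
            if tuple(tokens[i:i + len(mwe)]) == tuple(mwe):
                merged.append(separator.join(mwe))
                i += len(mwe)
                break
        else:
            # No expression starts here: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

mwes = [("New", "York"), ("New", "York", "City"), ("machine", "learning")]
tokens = "She studies machine learning in New York City".split()
merged = merge_mwes(tokens, mwes)
print(merged)  # ['She', 'studies', 'machine_learning', 'in', 'New_York_City']
```

Matching in this sketch is case-sensitive, so 'new york' in lowercase would not be merged unless the tokens or the expressions are normalized first.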

Code Snippet 5.1