### Dictionary-Based Tokenization in NLP

**Natural Language Processing (NLP)** is a subfield of artificial intelligence that aims to enable computers to process, understand, and generate human language. One of the critical tasks in NLP is tokenization, which is the process of splitting text into smaller meaningful units, known as tokens. Dictionary-based tokenization is a common method used in NLP to segment text into tokens based on a pre-defined dictionary.

Dictionary-based tokenization splits text into tokens with the help of a predefined dictionary of multi-word expressions, such as names and fixed phrases. It is useful when standard word tokenization is not sufficient, for example in sentiment analysis or named entity recognition, where a multi-word expression needs to be treated as a single token.

Here, a dictionary is a list of words, phrases, and other linguistic constructions, optionally stored together with related information such as definitions or parts of speech. During tokenization, each word sequence in the text is compared with the dictionary entries. Sequences that match an entry are emitted as single tokens, and all other words remain ordinary tokens. By creating a custom dictionary, we can keep names and phrases of our choice together.
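The matching idea can be illustrated without any library. The following is a minimal sketch, assuming the text is already split into words and that the longest matching entry should win. The function name `merge_multiword` and the sample sentence are illustrative and not part of any library.

```python
def merge_multiword(tokens, dictionary, separator=" "):
    """Merge dictionary entries found in `tokens` into single tokens."""
    entries = {tuple(entry) for entry in dictionary}
    max_len = max((len(entry) for entry in entries), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so longer entries win over shorter ones.
        for size in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + size]) in entries:
                merged.append(separator.join(tokens[i:i + size]))
                i += size
                break
        else:
            # No multi-word entry starts here: keep the word as its own token.
            merged.append(tokens[i])
            i += 1
    return merged


words = "He lives in New York City near New York Harbor".split()
phrases = [("New", "York"), ("New", "York", "City")]
result = merge_multiword(words, phrases)
print(result)
```

Checking longer candidates before shorter ones is a design choice: it keeps `New York City` together even though `New York` is also an entry.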


A **token** in natural language processing is a sequence of characters that is treated as a single unit of meaning. Words, phrases, numbers, and punctuation marks can all be tokens. Many NLP tasks, including text classification, sentiment analysis, machine translation, and named entity recognition, depend on tokenization as a first step.

The dictionary lookup typically runs on top of a base tokenizer, which can be rule-based, machine-learning-based, or hybrid. Rule-based tokenization divides the text into tokens according to surface characteristics such as punctuation, capitalization, and spacing. Machine-learning-based tokenization trains a model to separate text into tokens from a set of training data. Hybrid tokenization blends the rule-based and machine-learning-based methods to increase accuracy and efficiency.
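As a rough illustration of the rule-based approach, the sketch below splits text with a single regular expression. The function name `rule_based_tokenize` is hypothetical, and the rule used here (runs of word characters, plus each punctuation mark on its own) is only one of many possible rule sets.

```python
import re


def rule_based_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores (a word or number);
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


sample_tokens = rule_based_tokenize("He is from Himachal Pradesh.")
print(sample_tokens)
```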

### Steps for implementing dictionary-based tokenization

- **Step 1:** Collect a dictionary of words and phrases, optionally with their corresponding parts of speech. The dictionary can be created manually or obtained from a pre-existing source such as WordNet or Wikipedia.

- **Step 2:** Preprocess the text by removing noise such as HTML tags and, where appropriate, punctuation marks and stop words. Take care not to remove words or characters that occur inside a dictionary entry, because the entry would then no longer match.

- **Step 3:** Split the text into sentences if needed, then tokenize it into words using a whitespace tokenizer or a word tokenizer.

- **Step 4:** If the dictionary stores parts of speech, identify the part of speech of each word in the text using a part-of-speech tagger such as the Stanford POS Tagger.

- **Step 5:** Segment the text into tokens by comparing the words in the text with the entries in the dictionary. If a match is found, the matching dictionary entry is used as a single token. Otherwise, the word is left as an ordinary token or passed to a fallback rule, for example one based on its part of speech.
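Steps 2 and 3 above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the helper names `preprocess` and `tokenize_words` are hypothetical, the HTML handling is a simple regular expression rather than a full parser, and stop-word removal is left out so that no word belonging to a dictionary entry is dropped.

```python
import html
import re


def preprocess(text):
    """Step 2: strip HTML tags, decode entities, and remove punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop tags such as <p> or <br/>
    text = html.unescape(text)             # turn &amp; into &, and so on
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    return re.sub(r"\s+", " ", text).strip()


def tokenize_words(text):
    """Step 3: whitespace tokenization."""
    return text.split()


raw = "<p>Jammu Kashmir is an integral part of India.</p>"
clean = preprocess(raw)
step3_tokens = tokenize_words(clean)
print(clean)
print(step3_tokens)
```

The resulting word list is what the dictionary comparison in Step 5 would receive as input.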

**For example, consider the following sentences:**

Jammu Kashmir is an integral part of India.

My name is Anurag Kumar Vishwakarma.

He is from Uttar Pradesh.

The steps below tokenize these sentences with NLTK's `MWETokenizer`, which covers the dictionary-matching part of the procedure (no part-of-speech tagging is needed here):

**Step 1: Import the necessary libraries**

In [1]:
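# word_tokenize relies on NLTK's Punkt models; if they are missing, run
# nltk.download('punkt') once (newer NLTK releases may ask for 'punkt_tab' instead)
from nltk import word_tokenize
from nltk.tokenize import MWETokenizer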

**Step 2: Create a custom dictionary of names and phrases**

Collect the multi-word expressions, such as names and phrases, that should stay together. Each entry is a tuple containing the words of the expression in order. Let the dictionary contain the following **names and phrases**:

In [3]:
dictionary = [("Jammu", "Kashmir"),
              ("Anurag", "Kumar", "Vishwakarma"),
              ("Himachal", "Pradesh")]
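When the expressions are available as plain strings, for example one phrase per line in a word list, they can be converted into the tuple format used in the dictionary above. This is a small sketch; the variable names `phrases` and `dictionary_from_phrases` are illustrative.

```python
phrases = ["Jammu Kashmir", "Anurag Kumar Vishwakarma", "Himachal Pradesh"]

# Split each phrase on whitespace and freeze the words into a tuple,
# which is the entry format used in the dictionary above.
dictionary_from_phrases = [tuple(phrase.split()) for phrase in phrases]
print(dictionary_from_phrases)
```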


**Step 3: Create an instance of MWETokenizer with the dictionary**

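An instance is an object of the `MWETokenizer` class. The `separator` argument is the string used to join the words of a matched expression. Its default is an underscore, so `separator=' '` keeps the merged token readable as a normal phrase.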

In [4]:
Dictionary_tokenizer = MWETokenizer(dictionary, separator=' ')
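To see what the `separator` argument changes, the sketch below runs the same word list through two tokenizers. It assumes NLTK is installed; the word list is written out by hand so that no tokenizer models are needed, and the variable names are illustrative.

```python
from nltk.tokenize import MWETokenizer

words = ["Jammu", "Kashmir", "is", "an", "integral", "part", "of", "India", "."]

# Default separator: the matched words are joined with an underscore.
underscore_tokenizer = MWETokenizer([("Jammu", "Kashmir")])
underscore_tokens = underscore_tokenizer.tokenize(words)

# separator=" " joins them with a space instead.
space_tokenizer = MWETokenizer([("Jammu", "Kashmir")], separator=" ")
space_tokens = space_tokenizer.tokenize(words)

print(underscore_tokens[0])  # Jammu_Kashmir
print(space_tokens[0])       # Jammu Kashmir
```

An underscore makes merged tokens easy to spot in later processing, while a space preserves the original surface form. Which one is preferable depends on the downstream task.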


**Step 4: Create a text dataset and tokenize it with word_tokenize**

In [5]:
text = """
Jammu Kashmir is an integral part of India.
My name is Anurag Kumar Vishwakarma.
He is from Uttar Pradesh.
"""
tokens = word_tokenize(text)
tokens

['Jammu',
 'Kashmir',
 'is',
 'an',
 'integral',
 'part',
 'of',
 'India',
 '.',
 'My',
 'name',
 'is',
 'Anurag',
 'Kumar',
 'Vishwakarma',
 '.',
 'He',
 'is',
 'from',
 'Uttar',
 'Pradesh',
 '.']

**Step 5: Apply dictionary-based tokenization with Dictionary_tokenizer**

In [7]:
dictionary_based_token = Dictionary_tokenizer.tokenize(tokens)

In [8]:
dictionary_based_token

['Jammu Kashmir',
 'is',
 'an',
 'integral',
 'part',
 'of',
 'India',
 '.',
 'My',
 'name',
 'is',
 'Anurag Kumar Vishwakarma',
 '.',
 'He',
 'is',
 'from',
 'Uttar',
 'Pradesh',
 '.']

'Jammu Kashmir' and 'Anurag Kumar Vishwakarma' are now single tokens because both are dictionary entries. 'Uttar' and 'Pradesh' stay separate because the dictionary contains 'Himachal Pradesh' but not 'Uttar Pradesh'.
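In the output above, the state name in the third sentence remains two tokens because only 'Himachal Pradesh' is in the dictionary. One way to handle such a case is to register the missing expression with the tokenizer's `add_mwe` method. The sketch below assumes NLTK is installed and uses a hand-written token list and a separate tokenizer object (`extended_tokenizer`, an illustrative name), so it does not modify the objects created in the steps above.

```python
from nltk.tokenize import MWETokenizer

extended_tokenizer = MWETokenizer([("Himachal", "Pradesh")], separator=" ")
sentence_tokens = ["He", "is", "from", "Uttar", "Pradesh", "."]

# Before the entry exists, the two words stay separate.
before = extended_tokenizer.tokenize(sentence_tokens)

# add_mwe registers one more multi-word expression on an existing tokenizer.
extended_tokenizer.add_mwe(("Uttar", "Pradesh"))
after = extended_tokenizer.tokenize(sentence_tokens)

print(before)
print(after)
```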

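**Full code implementation**

The complete script below uses a different sample text in which all three dictionary entries occur, so each of them is merged into a single token.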


In [9]:
# import the necessary libraries
from nltk import word_tokenize
from nltk.tokenize import MWETokenizer

# custom dictionary of multi-word expressions
dictionary = [("Jammu", "Kashmir"),
              ("Pawan", "Kumar", "Gunjan"),
              ("Himachal", "Pradesh")]

# Create an instance of MWETokenizer with the dictionary
Dictionary_tokenizer = MWETokenizer(dictionary, separator=' ')

# Text
text = """
Jammu Kashmir is an integral part of India.
My name is Pawan Kumar Gunjan.
He is from Himachal Pradesh.
"""

# Standard word-level tokenization
tokens = word_tokenize(text)
print('General Word Tokenization \n', tokens)

# Merge dictionary entries into single tokens
dictionary_based_token = Dictionary_tokenizer.tokenize(tokens)
print('Dictionary based tokenization \n', dictionary_based_token)


General Word Tokenization 
 ['Jammu', 'Kashmir', 'is', 'an', 'integral', 'part', 'of', 'India', '.', 'My', 'name', 'is', 'Pawan', 'Kumar', 'Gunjan', '.', 'He', 'is', 'from', 'Himachal', 'Pradesh', '.']
Dictionary based tokenization 
 ['Jammu Kashmir', 'is', 'an', 'integral', 'part', 'of', 'India', '.', 'My', 'name', 'is', 'Pawan Kumar Gunjan', '.', 'He', 'is', 'from', 'Himachal Pradesh', '.']
