# Natural Language Processing (NLP) Tutorial

## Introduction to Natural Language Processing:

Natural Language Processing (NLP) is a branch of artificial intelligence (AI) that focuses on the interaction between computers and human language. It involves the development of algorithms and models that enable computers to understand, interpret, and generate human language in a meaningful way.

### History of NLP:

* 1940-1960: The early years of NLP were primarily focused on machine translation (MT). Notable developments include the first recognizable NLP application at Birkbeck College, London, in 1948.
* 1960-1980: This period saw the emergence of augmented transition networks (ATN), case grammar, and systems like SHRDLU and LUNAR, which laid the foundation for modern NLP.
* 1980-Current: Post-1980, NLP witnessed advancements in machine learning algorithms, leading to improved accuracy and efficiency in language processing tasks. The growth of the internet and availability of large text corpora further accelerated progress in NLP.

### Importance and Applications of NLP:

NLP has diverse applications across various domains, including:

* Information Retrieval: Extracting relevant information from large volumes of text data.
* Sentiment Analysis: Analyzing and understanding the sentiment or emotion expressed in text.
* Text Summarization: Generating concise summaries of lengthy documents or articles.
* Machine Translation: Translating text from one language to another.
* Speech Recognition: Converting spoken language into text.
* Chatbots: Building conversational agents for customer support and interaction.
* Named Entity Recognition (NER): Identifying and classifying named entities such as names, organizations, dates, and locations in text.

### NLP Fundamentals:

To understand NLP, it's essential to grasp some fundamental concepts:

* Syntax: The structure of language, including grammar rules and sentence structure.
* Semantics: The meaning of words, phrases, and sentences in context.
* Pragmatics: The study of how language is used in real-world situations to convey meaning and achieve communication goals.
* Tokenization: Breaking down text into smaller units such as words or sentences for analysis.
* Stemming and Lemmatization: Techniques to reduce words to their root forms, improving consistency in text analysis.

### Text Preprocessing in NLP:

Text preprocessing is a critical step in NLP that involves cleaning and preparing text data for analysis. This may include:

* Removing special characters, punctuation, and numerical values.
* Converting text to lowercase to standardize the text format.
* Removing stop words (commonly occurring words such as "the," "is," "and") that carry little semantic meaning.
* Tokenization to split text into individual words or phrases.
* Stemming and lemmatization to reduce words to their base or root forms.
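
The steps listed above can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: the stop-word set and the `crude_stem` helper are hypothetical stand-ins invented here, and a real project would more likely use a library stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to", "in"}

def crude_stem(word):
    """Strip a few common English suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                      # standardize case
    text = re.sub(r"[^a-z\s]", " ", text)    # drop punctuation, digits, symbols
    tokens = text.split()                    # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [crude_stem(t) for t in tokens]   # reduce to rough base forms

print(preprocess("The 3 cats are chasing the mice, and the dog barked!"))
# ['cat', 'are', 'chas', 'mice', 'dog', 'bark']
```

The output shows why stemming is only approximate: "chasing" becomes "chas", which is not a dictionary word, whereas a lemmatizer would aim for "chase".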

### Feature Extraction in NLP:

Feature extraction involves transforming raw text data into numerical features that can be used for analysis and modeling. Common techniques include:

* Bag of Words (BoW): Representing text as a matrix of word frequencies or presence/absence indicators.
* Term Frequency-Inverse Document Frequency (TF-IDF): Weighting each word by its frequency in a document, scaled down by how many documents in the corpus contain it, so that words common to every document receive low weight.
* Word Embeddings: Representing words as dense vectors in a continuous vector space, capturing semantic relationships between words.
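
To make the first two techniques concrete, here is a small sketch using only the standard library. The three-document corpus is invented for illustration, and the IDF used is the plain `log(N / df)` variant; libraries often apply smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def bag_of_words(tokens):
    """Return a count vector aligned with the sorted vocabulary."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def tf_idf(tokens):
    """Return {word: tf * idf} for one document, using idf = log(N / df)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    weights = {}
    for word, count in counts.items():
        tf = count / len(tokens)                          # term frequency
        df = sum(1 for doc in tokenized if word in doc)   # document frequency
        weights[word] = tf * math.log(n_docs / df)
    return weights

bow = bag_of_words(tokenized[0])
weights = tf_idf(tokenized[0])
print(vocab)
print(bow)
print(weights)
```

In the first document, "the" occurs twice but appears in two of the three documents, so its TF-IDF weight ends up lower than that of "cat", which occurs once but is unique to that document.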

### Key NLP Tasks and Algorithms:

NLP comprises various tasks, each with its specific goal and set of algorithms:

* Sentiment Analysis: Determining the sentiment or emotion expressed in a piece of text. Algorithms such as Naïve Bayes, Support Vector Machines (SVM), and Recurrent Neural Networks (RNNs) are commonly used.
* Text Classification: Categorizing text into predefined classes or categories. Algorithms like Decision Trees, Random Forests, and Neural Networks are employed for text classification tasks.
* Named Entity Recognition (NER): Identifying and classifying named entities such as names, organizations, dates, and locations in text. Conditional Random Fields (CRFs) and Bidirectional LSTMs are popular for NER tasks.
* Machine Translation: Translating text from one language to another. Statistical methods, rule-based approaches, and sequence-to-sequence models with attention mechanisms are used for machine translation.
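
As an illustration of the first task, the following is a minimal sketch of a multinomial Naïve Bayes sentiment classifier with add-one (Laplace) smoothing. The four training sentences are invented, and the code is meant to show the idea rather than to replace a library implementation.

```python
import math
from collections import Counter, defaultdict

# Toy labeled data, invented for illustration.
train = [
    ("a great and moving film", "pos"),
    ("great acting and a wonderful story", "pos"),
    ("a dull and boring film", "neg"),
    ("boring story and terrible acting", "neg"),
]

word_counts = defaultdict(Counter)   # label -> word counts
label_counts = Counter()             # label -> number of documents
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Return the label with the highest log posterior score."""
    scores = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        # log prior of the class
        score = math.log(label_counts[label] / len(train))
        for word in text.split():
            # log likelihood with add-one smoothing over the vocabulary
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("a wonderful film"))
print(predict("terrible and boring"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps unseen words from driving a class probability to zero.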

### Advanced NLP Techniques:

* Part-of-Speech (POS) Tagging: Assigning grammatical tags to words in a sentence.
* Dependency Parsing: Analyzing the grammatical structure and relationships between words in a sentence.
* Coreference Resolution: Identifying and resolving references to the same entity across text.
* Topic Modeling: Identifying themes or topics in a collection of documents using techniques like Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF).
* Sequence-to-Sequence Models: Generating sequences of text, such as machine translation or text summarization, using models like Encoder-Decoder architectures with attention mechanisms.

### Deep Learning for NLP:

Deep learning has significantly advanced NLP by providing powerful models capable of learning complex patterns and representations from data. Key deep learning architectures for NLP include:

* Recurrent Neural Networks (RNNs): Suitable for sequential data processing tasks due to their ability to capture temporal dependencies.
* Long Short-Term Memory (LSTM) Networks: A type of RNN that addresses the vanishing gradient problem, enabling better learning of long-range dependencies.
* Transformer Models: Introduced by the "Attention is All You Need" paper, transformer models like BERT, GPT, and XLNet have achieved state-of-the-art performance on various NLP tasks by leveraging self-attention mechanisms.

### NLP Libraries and Tools:

* NLTK (Natural Language Toolkit): A comprehensive library for NLP tasks in Python, providing tools for tokenization, stemming, lemmatization, POS tagging, and more.
* spaCy: An open-source NLP library that offers efficient tokenization, POS tagging, dependency parsing, and named entity recognition, optimized for production use.
* Gensim: Gensim is a Python library designed for topic modeling and document similarity analysis. It specializes in unsupervised learning algorithms for semantic analysis of text data. Gensim offers implementations of popular algorithms such as Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Word2Vec. It is widely used for tasks such as document clustering, semantic indexing, and similarity retrieval.
* TextBlob: TextBlob is a simplified and beginner-friendly NLP library built on top of NLTK and Pattern libraries. It provides a simple API for common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
* AllenNLP: AllenNLP is a powerful open-source NLP library built on top of PyTorch, designed for research and production-level NLP applications. It provides pre-built models and tools for various NLP tasks such as text classification, named entity recognition, semantic role labeling, and coreference resolution.
* Stanford NLP: The Stanford NLP toolkit is a suite of NLP tools developed by the Stanford NLP Group. It provides robust and efficient implementations of state-of-the-art NLP algorithms for tasks such as part-of-speech tagging, named entity recognition, dependency parsing, sentiment analysis, and coreference resolution.
* Transformers (Hugging Face): Transformers is a popular library developed by Hugging Face that provides pre-trained models and utilities for working with transformer-based architectures in NLP. It offers a wide range of pre-trained models such as BERT, GPT, RoBERTa, and T5, which can be fine-tuned for specific downstream tasks such as text classification, question answering, summarization, and translation.
* TensorFlow Text: TensorFlow Text is a library built on top of TensorFlow for text processing. It provides text-related operations, most notably tokenizers (such as whitespace, WordPiece, and SentencePiece tokenizers) and other preprocessing utilities, that run inside the TensorFlow graph. Because it integrates with other TensorFlow components, preprocessing can be packaged together with the model in end-to-end NLP pipelines.
* FastText: FastText is a library developed by Facebook Research for efficient text classification and word representation learning. It offers implementations of fast and scalable algorithms for training word embeddings and text classifiers. FastText is known for its ability to handle large text corpora and perform well on tasks such as sentiment analysis, topic classification, and language identification.

### Conclusion

Natural Language Processing (NLP) has emerged as a transformative field within artificial intelligence, enabling machines to understand, interpret, and generate human language. With the rapid advancement of technology and the availability of powerful NLP libraries and tools, developers and researchers can now tackle a wide range of linguistic tasks with unprecedented accuracy and efficiency.

In this comprehensive overview, we explored the fundamental concepts, key components, and practical applications of NLP. We delved into the history of NLP, tracing its evolution from rule-based systems to modern machine learning approaches. We discussed essential NLP tasks such as tokenization, part-of-speech tagging, named entity recognition, sentiment analysis, and machine translation, highlighting their significance in various domains.

Furthermore, we examined popular NLP libraries and tools such as NLTK, spaCy, Gensim, TextBlob, AllenNLP, Stanford NLP, Transformers (Hugging Face), TensorFlow Text, and FastText, each offering unique functionalities and capabilities for text processing and analysis. These libraries empower developers to build sophisticated NLP pipelines and applications with ease, leveraging pre-trained models and state-of-the-art algorithms.

As NLP continues to evolve, driven by advancements in deep learning, transfer learning, and natural language understanding, its impact on society, business, and academia will only grow stronger. From chatbots and virtual assistants to sentiment analysis and language translation, NLP technologies are reshaping how humans interact with machines and each other, unlocking new opportunities for communication, innovation, and discovery.

In short, NLP bridges the gap between human language and machine intelligence. As research and tooling continue to advance, its role in making textual data usable at scale is likely to keep growing.

<center> <h1> WordNet Access </h1> </center>

In [None]:
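import nltk
from nltk.corpus import wordnet

# Fetch the WordNet data (skipped if it is already present)
nltk.download('wordnet')

# All synsets (groups of synonymous senses) that contain the word "car"
synsets = wordnet.synsets('car')

# Print the dictionary-style definition of each sense
for synset in synsets:
    print(synset.definition())

# Usage examples for the first noun sense of "car"
print(wordnet.synset('car.n.01').examples())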

[nltk_data] Downloading package wordnet to /root/nltk_data...


a motor vehicle with four wheels; usually propelled by an internal combustion engine
a wheeled vehicle adapted to the rails of railroad
the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant
where passengers ride up and down
a conveyance for passengers or freight on a cable railway
['he needs a car to get to work']


In [None]:
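import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')

# Look up the senses of "Apple" (the output shows the fruit and the tree)
synsets = wordnet.synsets('Apple')

for synset in synsets:
    print(synset.definition())

# This sense has no usage examples in WordNet, so an empty list is printed
print(wordnet.synset('Apple.n.01').examples())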

fruit with red or yellow or green skin and sweet to tart crisp whitish flesh
native Eurasian tree widely cultivated in many varieties for its firm rounded edible fruits
[]


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


<center> <h1> Brown Corpus </h1> </center>

In [None]:
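import nltk
from nltk.corpus import brown

nltk.download('brown')

# Genre labels available in the Brown Corpus (e.g. 'news', 'fiction')
categories = brown.categories()

# First 50 tokens of the 'news' category
words = brown.words(categories='news')
print(words[:50])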

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.', 'The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise']


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [None]:
from nltk.corpus import brown
categories = brown.categories()

# First 50 tokens of the 'fiction' category
words = brown.words(categories='fiction')
print(words[:50])

['Thirty-three', 'Scotty', 'did', 'not', 'go', 'back', 'to', 'school', '.', 'His', 'parents', 'talked', 'seriously', 'and', 'lengthily', 'to', 'their', 'own', 'doctor', 'and', 'to', 'a', 'specialist', 'at', 'the', 'University', 'Hospital', '--', 'Mr.', 'McKinley', 'was', 'entitled', 'to', 'a', 'discount', 'for', 'members', 'of', 'his', 'family', '--', 'and', 'it', 'was', 'decided', 'it', 'would', 'be', 'best', 'for']


<center> <h1> Movie Reviews </h1> </center>

In [None]:
import nltk

# Download the movie_reviews corpus
nltk.download('movie_reviews')

from nltk.corpus import movie_reviews

# Access categories (sentiments) in the Movie Reviews Corpus
categories = movie_reviews.categories()

# Access fileids (individual movie reviews) in a specific category
fileids_positive = movie_reviews.fileids(categories='pos')
fileids_negative = movie_reviews.fileids(categories='neg')

# Access the words of all reviews in each category
words_positive = movie_reviews.words(fileids=fileids_positive)
words_negative = movie_reviews.words(fileids=fileids_negative)

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


In [None]:
words_positive

['films', 'adapted', 'from', 'comic', 'books', 'have', ...]

In [None]:
words_negative

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

## Frequency distribution in NLP

### Introduction:

Natural Language Processing (NLP) stands at the convergence of computer science, artificial intelligence, and linguistics, aiming to equip machines with the capability to understand, interpret, and generate human language. One of the fundamental concepts in NLP is frequency distribution, a cornerstone for various linguistic analyses such as text analysis, sentiment analysis, and topic modeling. Frequency distribution provides insights into the distribution of words within a text or corpus, shedding light on the underlying structure and patterns of language usage.

### Understanding Frequency Distribution:

Frequency distribution in NLP refers to the statistical representation of how often each word occurs within a given text or corpus. It quantifies the occurrences of words, phrases, or other textual elements, providing a quantitative understanding of language usage. By analyzing frequency distributions, researchers can uncover key insights such as the most common words, thematic concentrations, and stylistic preferences within a text.

The formula for calculating the frequency of a word $w$ in a text or corpus is:

$$\text{Frequency}(w) = \frac{\text{Number of occurrences of } w}{\text{Total number of words in the text/corpus}} \times 100$$
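
The percentage above (occurrences of a word divided by the total number of words, times 100) can be computed directly with `collections.Counter`. The sample sentence and the `frequency_percent` helper are illustrative assumptions, and whitespace splitting stands in for a proper tokenizer.

```python
from collections import Counter

# Sample text, invented for illustration.
text = "the cat sat on the mat and the dog sat too"
tokens = text.split()        # naive whitespace tokenization
counts = Counter(tokens)     # word -> number of occurrences

def frequency_percent(word):
    """Occurrences of `word` as a percentage of all tokens."""
    return counts[word] / len(tokens) * 100

print(len(tokens))                          # total number of words
print(counts["the"])                        # occurrences of "the"
print(round(frequency_percent("the"), 2))   # percentage for "the"
```

Here "the" occurs 3 times among 11 tokens, so its frequency is 3 / 11 × 100, roughly 27.27%.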

### Applications of Frequency Distribution:

1. Text Analysis: Frequency distribution is crucial for understanding the thematic focus and vocabulary usage within a text. By identifying frequently occurring words, researchers can discern central topics and analyze the linguistic characteristics of the text.
2. Vocabulary Building: In language learning applications, frequency distribution is utilized to construct vocabulary lists, prioritizing commonly used words for learners to acquire first.
3. Data Preprocessing for Machine Learning: In machine learning tasks involving text data, frequency distribution is employed for data preprocessing. Commonly occurring words, known as stopwords, are often filtered out to improve the quality of analysis.
4. Sentiment Analysis: Frequency distributions of words can indicate the sentiment or emotional tone of a text. By analyzing the frequency of positive and negative words, sentiment analysis algorithms can infer the overall sentiment of a document.

### Techniques for Analyzing Frequency Distribution:

1. Frequency Distribution Tables: These tables list words along with their corresponding frequencies, providing a clear and concise summary of word occurrences within a text.
2. Histograms: Histograms visualize frequency distributions by plotting words or phrases on the x-axis and their frequencies on the y-axis. This graphical representation allows for a visual assessment of word frequency distributions.
3. Word Clouds: Word clouds visually represent word frequencies by displaying words in varying sizes based on their frequency. This technique offers an intuitive way to identify prominent themes or topics within a text.
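
A frequency table and a rough text-only bar chart can be produced in a few lines. This sketch uses a sample phrase and a hypothetical `frequency_table` helper; a plotting library would normally be used for real charts or word clouds.

```python
from collections import Counter

# Sample tokens, invented for illustration.
tokens = "to be or not to be that is the question".split()
freq = Counter(tokens)

def frequency_table(counter, top_n=3):
    """Return the top_n (word, count) pairs, most frequent first."""
    return counter.most_common(top_n)

table = frequency_table(freq)
for word, count in table:
    # Left-align the word, right-align the count, then draw one '#' per occurrence.
    print(f"{word:<10}{count:>3}  {'#' * count}")
```

Each row pairs a word with its count, and the run of `#` characters gives a quick visual impression of relative frequency.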

### Challenges in Analyzing Frequency Distribution:

1. High Frequency but Low Information Words: Common words such as articles and prepositions often have high frequencies but may carry little semantic meaning. Analyzing frequency distributions requires consideration of both frequency and informativeness.
2. Polysemy and Homonymy: Words with multiple meanings (polysemy) or words that sound alike but have different meanings (homonymy) can introduce ambiguity in frequency distributions, requiring context-aware analysis.
3. Language and Context Dependence: The significance of word frequencies can vary across different languages and contexts. Analyzing frequency distributions necessitates understanding the specific linguistic and contextual factors at play.

### Conclusion:

Frequency distribution serves as a cornerstone in NLP, enabling researchers to gain valuable insights into language usage and textual characteristics. By analyzing the frequency of words, phrases, and other textual elements, researchers can uncover patterns, themes, and stylistic preferences within a text or corpus. Despite its simplicity, analyzing frequency distributions requires careful consideration of linguistic nuances, context, and potential challenges such as polysemy and homonymy. As NLP continues to advance, frequency distribution analysis will remain a fundamental technique for understanding and interpreting human language.

## Detailed Exploration of Tokenizers and Their Types

Tokenization is the process of breaking down a text into smaller units called tokens, which can be words, phrases, symbols, or other meaningful elements. Tokenization is a crucial step in natural language processing (NLP) tasks as it forms the foundation for further analysis and processing of textual data. There are several types of tokenizers used in NLP, each with its own characteristics and applications.  

### Word Tokenization:

Definition: Word tokenization is a fundamental process in natural language processing (NLP) that involves breaking down a text into individual words or tokens. It forms the initial step in analyzing textual data and serves as the foundation for various NLP tasks.

#### Techniques:

1. Whitespace Tokenization:

* Description: Whitespace tokenization splits the text based on whitespace characters such as spaces, tabs, and newline characters.
* Implementation: It involves scanning the text and identifying whitespace characters as token boundaries.
* Example: Given the text "The quick brown fox jumps over the lazy dog," whitespace tokenization would produce the tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"].
* Advantages: Simple and efficient approach suitable for many languages and text formats.

2. Punctuation Tokenization:

* Description: Punctuation tokenization divides the text based on punctuation marks such as periods, commas, and hyphens.
* Implementation: It identifies punctuation marks as token boundaries while preserving them as separate tokens.
* Example: For the text "Hello, world! How are you?", punctuation tokenization would generate the tokens: ["Hello", ",", "world", "!", "How", "are", "you", "?"].
* Advantages: Useful for tasks where punctuation marks carry semantic importance, such as sentiment analysis or parsing.

3. Language-Specific Tokenization:

* Description: Some languages, like Chinese and Japanese, lack spaces between words, making tokenization more challenging.
* Implementation: Specialized tokenization algorithms are required to handle these languages effectively, often based on linguistic rules or statistical models.
* Example: In Chinese, a sentence like "我爱自然语言处理" (meaning "I love natural language processing") would be tokenized as ["我", "爱", "自然", "语言", "处理"].
* Advantages: Ensures accurate tokenization for languages with unique writing systems or word segmentation conventions.
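
The first two techniques can be sketched with the standard library. The function names are illustrative assumptions, and neither approach handles languages written without spaces, which need a dedicated segmenter.

```python
import re

def whitespace_tokenize(text):
    """Split on runs of spaces, tabs, and newlines."""
    return text.split()

def punctuation_tokenize(text):
    """Keep words together and emit each punctuation mark as its own token."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokenize("The quick brown fox jumps over the lazy dog"))
print(whitespace_tokenize("Hello, world! How are you?"))
print(punctuation_tokenize("Hello, world! How are you?"))
```

Comparing the last two calls shows the difference: whitespace splitting leaves punctuation attached ("Hello," and "world!"), while the punctuation-aware version separates it.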

**Applications:** Word tokenization finds application across various NLP tasks, including:

* Text Classification: Breaking down text into words facilitates feature extraction and modeling for tasks such as sentiment analysis, spam detection, and topic classification.
* Named Entity Recognition (NER): Identifying individual words allows for the recognition and tagging of named entities such as person names, locations, and organizations.
* Language Modeling: Word tokenization is essential for training language models to predict the probability of word sequences, enabling tasks like speech recognition, machine translation, and autocomplete.

### Sentence Tokenization:

Definition: Sentence tokenization, also known as sentence segmentation, is the process of dividing a text into individual sentences. Each sentence typically represents a complete thought or unit of meaning within the text. Sentence tokenization is essential for various natural language processing (NLP) tasks that require sentence-level analysis or processing.

#### Common Techniques:

1. Rule-based Tokenization:

* Description: Rule-based tokenization employs predefined rules to identify sentence boundaries based on punctuation marks and other linguistic cues.
* Implementation: This approach typically involves scanning the text for punctuation marks that indicate the end of a sentence, such as periods (.), question marks (?), and exclamation marks (!). The presence of these punctuation marks, along with contextual rules, helps determine sentence boundaries.
* Example: For the text "The quick brown fox. Jumps over the lazy dog!", rule-based tokenization would identify two sentences: "The quick brown fox." and "Jumps over the lazy dog!".
* Advantages: Rule-based tokenization is straightforward to implement and can handle most common cases effectively.

2. Machine Learning-based Tokenization:

* Description: Machine learning-based tokenization utilizes machine learning models trained on large corpora to predict sentence boundaries based on patterns observed in the text.
* Implementation: This approach involves training a machine learning model, such as a classifier or sequence labeling model, on annotated data where sentence boundaries are marked. The trained model then predicts sentence boundaries for unseen text based on learned patterns.
* Example: An LSTM (Long Short-Term Memory) model trained on a corpus of text with annotated sentence boundaries can be used to predict sentence boundaries for new text inputs.
* Advantages: Machine learning-based tokenization can capture complex patterns and nuances in sentence structure, making it suitable for handling diverse text data.
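
A rule-based splitter along the lines described above might look like the following sketch. The abbreviation list is a small hypothetical sample; real systems rely on much larger lists or on trained models.

```python
import re

# Small, hypothetical abbreviation list used as a contextual rule.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e."}

def split_sentences(text):
    """Split on ., ! or ? followed by whitespace, unless the period ends an abbreviation."""
    text = text.strip()
    if not text:
        return []
    # Candidate boundaries: whitespace that follows sentence-ending punctuation.
    pieces = re.split(r"(?<=[.!?])\s+", text)
    sentences, buffer = [], ""
    for piece in pieces:
        buffer = f"{buffer} {piece}".strip()
        if buffer.split()[-1].lower() in ABBREVIATIONS:
            continue  # not a real boundary; keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

print(split_sentences("The quick brown fox. Jumps over the lazy dog!"))
print(split_sentences("Dr. Smith arrived. He was late."))
```

The second call shows the contextual rule at work: the period after "Dr." is not treated as a sentence boundary.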

**Applications:** Sentence tokenization is widely used in various NLP applications, including:

* Text Summarization: Sentence tokenization enables the identification of individual sentences for summarization, where key sentences are extracted to generate a concise summary of the text.
* Machine Translation: Sentence tokenization facilitates the division of source and target language texts into individual sentences, which are then translated independently.
* Named Entity Recognition (NER): In NER tasks, sentence tokenization helps identify boundaries for named entities, such as person names, locations, and organizations, within sentences.
* Information Extraction: Sentence tokenization assists in extracting structured information from text by segmenting the text into meaningful units for further analysis.

### Subword Tokenization:

Definition: Subword tokenization is a technique in natural language processing (NLP) that breaks down a text into smaller units, which can be parts of words or complete words. Unlike word tokenization, which divides text solely based on word boundaries, subword tokenization operates at a more granular level, allowing for the representation of morphologically rich languages and handling out-of-vocabulary words more effectively.

#### Techniques:

1. Byte Pair Encoding (BPE):

* Description: Byte Pair Encoding is a subword tokenization algorithm that merges the most frequent pairs of characters or character sequences iteratively to create a vocabulary of subword units.
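* Implementation: Start with each word split into individual characters, count how often each adjacent pair of symbols occurs across the corpus, merge the most frequent pair into a single new symbol, and repeat until a target number of merges (or vocabulary size) is reached.
* Example: In the word "subword," the pairs "s"+"u", "su"+"b", "w"+"o", and "r"+"d" might be merged over successive iterations to form the subword units ["sub", "wo", "rd"]. The actual merges depend on pair frequencies in the training corpus.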
* Advantages: BPE can capture both frequent words and rare morphological variations effectively, making it suitable for a wide range of languages and tasks.

2. WordPiece Tokenization:

* Description: WordPiece tokenization is a variant of BPE that also merges characters or character sequences iteratively but uses a different merging strategy to create subword units.
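* Implementation: Like BPE, it starts from individual characters and merges iteratively, but instead of merging the most frequent pair it selects the pair whose merge most increases the likelihood of the training data.
* Example: With a BERT-style vocabulary, the word "subword" might be split into ["sub", "##word"], where the "##" prefix marks a piece that continues the preceding one. The exact split depends on the learned vocabulary.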
* Advantages: WordPiece tokenization may yield more optimized vocabularies compared to BPE in some cases, especially when considering linguistic characteristics or specific task requirements.
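
The BPE merge loop can be sketched in a few dozen lines. This is a simplified illustration of the frequency-based merging idea, not the exact procedure of any particular library: the word frequencies are a toy example, end-of-word markers are omitted, and ties are broken by whichever pair was counted first.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: frequency} mapping."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair (first seen wins ties)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab

# Toy word frequencies, invented for illustration.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
print(merges)   # [('e', 's'), ('es', 't'), ('l', 'o')]
print(vocab)
```

After three merges the suffix "est" has become a single symbol shared by "newest" and "widest", which is how BPE ends up with reusable subword units for frequent character sequences.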

**Applications:** Subword tokenization is particularly useful in various NLP tasks and scenarios, including:

* Handling Morphologically Rich Languages: Subword tokenization is effective for languages with complex morphology, where words can be composed of multiple morphemes or affixes.
* Out-of-Vocabulary Words: Subword tokenization helps address the issue of out-of-vocabulary words by representing rare or unseen words as combinations of subword units.
* Machine Translation: Subword tokenization enables more robust translation models by providing a finer-grained representation of source and target languages, especially for languages with agglutinative or fusional morphology.
* Speech Recognition: Subword tokenization assists in building language models for speech recognition systems, where the vocabulary size may be large and handling out-of-vocabulary words is critical for accuracy.

### Regular Expression Tokenization:

Definition: Regular expression tokenization is a tokenization technique that involves using regular expressions to define patterns for identifying tokens within a text. Regular expressions provide a powerful and flexible way to specify patterns of characters, enabling tokenization based on custom-defined criteria.

#### Key Features:

* Flexibility: Regular expressions offer flexibility in defining tokenization patterns, allowing for customization to meet specific requirements.
* Customization: Tokenization rules can be tailored to handle various types of tokens, such as email addresses, URLs, or specific formatting conventions.
* Pattern Matching: Regular expressions use pattern matching to identify tokens within the text, enabling precise tokenization based on user-defined patterns.

#### Example Applications:

* Email Addresses: Regular expressions can be used to tokenize text and identify email addresses within the text. For example, the pattern "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b" matches typical email address formats. The dot before the top-level domain is escaped so that it matches a literal "." rather than any character.
* URLs: Regular expressions can tokenize text to identify URLs or web addresses. For example, the pattern "(http|https)://[^\s]+" matches URLs starting with "http://" or "https://".
* Formatting Conventions: Regular expressions can tokenize text based on specific formatting conventions, such as dates, phone numbers, or numerical values.
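
The following sketch combines several such patterns into one tokenizer using Python's `re` module. The alternatives are tried in order at each position, so the more specific patterns (emails, URLs, dates) are listed before the generic word pattern. The date format and the sample sentence are illustrative assumptions.

```python
import re

TOKEN_PATTERN = re.compile(
    r"""
    [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}   # email addresses
    | https?://\S+                                   # URLs (up to the next whitespace)
    | \d{4}-\d{2}-\d{2}                              # ISO-style dates (assumed format)
    | \w+                                            # ordinary words and numbers
    | [^\w\s]                                        # any single punctuation mark
    """,
    re.VERBOSE,
)

def regex_tokenize(text):
    """Return all tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)

sample = "Email jane.doe@example.com or visit https://example.org/docs on 2024-01-15."
print(regex_tokenize(sample))
```

Note that the URL pattern consumes everything up to the next whitespace, so a URL at the very end of a sentence would swallow the trailing period. Handling that case needs an extra rule.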

**Advantages:**

* Customizability: Regular expression tokenization can be customized to handle specific tokenization requirements and patterns.
* Versatility: Regular expressions can be applied to various types of text data and tokenization tasks.
* Efficiency: Regular expression tokenization can efficiently identify tokens based on predefined patterns, even in large text corpora.

### Tokenizer for Specialized Domains:

Definition: Tokenizers for specialized domains are designed to handle the unique characteristics and terminology of fields such as biomedical text or legal documents. They incorporate domain-specific rules, dictionaries, or machine learning models trained on domain-specific data to achieve accurate tokenization.

#### Key Features:

* Domain-specific Rules: Tokenizers for specialized domains may incorporate rules tailored to handle the unique linguistic characteristics and terminology of the domain.
* Dictionaries: Domain-specific tokenizers may utilize dictionaries or lexicons containing domain-specific terms and vocabulary to assist in tokenization.
* Machine Learning Models: Some tokenizers may leverage machine learning models trained on domain-specific data to improve tokenization accuracy and performance.

#### Example Applications:

* Biomedical Text: Tokenizers for biomedical text may handle specialized terminology, abbreviations, and symbols commonly found in medical literature.
* Legal Documents: Tokenizers for legal documents may address the unique formatting conventions, legal terminology, and citation styles prevalent in legal texts.
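
One simple way to combine a domain dictionary with general rules is to try lexicon entries before falling back to ordinary word and punctuation patterns. The sketch below assumes a tiny hypothetical biomedical lexicon. Real domain tokenizers use far larger resources and often trained models.

```python
import re

# Hypothetical mini-lexicon of biomedical terms (an illustrative assumption).
DOMAIN_TERMS = ["tumor necrosis factor", "IL-2", "mg/kg"]

def build_pattern(terms):
    """Build a regex that prefers lexicon terms over generic tokens."""
    # Longest terms first so multi-word entries win over their parts.
    escaped = sorted((re.escape(t) for t in terms), key=len, reverse=True)
    return re.compile("|".join(escaped) + r"|\w+|[^\w\s]", re.IGNORECASE)

PATTERN = build_pattern(DOMAIN_TERMS)

def domain_tokenize(text):
    """Tokenize text, keeping lexicon terms intact as single tokens."""
    return PATTERN.findall(text)

print(domain_tokenize("IL-2 raised tumor necrosis factor at 5 mg/kg."))
```

A generic punctuation-based tokenizer would break "IL-2" into three tokens and "mg/kg" into three more. Here they survive as single units, which is usually what downstream biomedical analysis expects.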

**Advantages:**

* Accuracy: Tokenizers for specialized domains can achieve higher accuracy by considering domain-specific rules and terminology.
* Improved Performance: By incorporating domain-specific knowledge, these tokenizers can improve performance in tokenizing text from specialized domains.
* Customizability: Tokenizers for specialized domains can be customized to handle specific linguistic features and requirements of the domain.

### Conclusion

Tokenization is a fundamental preprocessing step in natural language processing tasks, and the choice of tokenizer depends on the specific requirements of the task and the characteristics of the text data. Each type of tokenizer has its advantages and limitations, and selecting the appropriate tokenizer is essential for accurate and effective analysis of textual data in NLP applications.

<center> <h1> Bi-gram, Tri-gram, N-gram </h1> </center>

In [16]:
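from nltk.util import bigrams, ngrams, trigrams
import nltk

# Tokenizer models used by word_tokenize (newer NLTK releases may need 'punkt_tab' instead)
nltk.download('punkt')

string = "Natural language processing (NLP) is a machine learning technology that gives computers the ability to interpret, manipulate, and comprehend human language."

# Split the sentence into word and punctuation tokens
token = nltk.word_tokenize(string)
token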

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Natural',
 'language',
 'processing',
 '(',
 'NLP',
 ')',
 'is',
 'a',
 'machine',
 'learning',
 'technology',
 'that',
 'gives',
 'computers',
 'the',
 'ability',
 'to',
 'interpret',
 ',',
 'manipulate',
 ',',
 'and',
 'comprehend',
 'human',
 'language',
 '.']

In [21]:
bigram= list(bigrams(token))
bigram

[('Natural', 'language'),
 ('language', 'processing'),
 ('processing', '('),
 ('(', 'NLP'),
 ('NLP', ')'),
 (')', 'is'),
 ('is', 'a'),
 ('a', 'machine'),
 ('machine', 'learning'),
 ('learning', 'technology'),
 ('technology', 'that'),
 ('that', 'gives'),
 ('gives', 'computers'),
 ('computers', 'the'),
 ('the', 'ability'),
 ('ability', 'to'),
 ('to', 'interpret'),
 ('interpret', ','),
 (',', 'manipulate'),
 ('manipulate', ','),
 (',', 'and'),
 ('and', 'comprehend'),
 ('comprehend', 'human'),
 ('human', 'language'),
 ('language', '.')]

In [22]:
trigram= list(trigrams(token))
trigram

[('Natural', 'language', 'processing'),
 ('language', 'processing', '('),
 ('processing', '(', 'NLP'),
 ('(', 'NLP', ')'),
 ('NLP', ')', 'is'),
 (')', 'is', 'a'),
 ('is', 'a', 'machine'),
 ('a', 'machine', 'learning'),
 ('machine', 'learning', 'technology'),
 ('learning', 'technology', 'that'),
 ('technology', 'that', 'gives'),
 ('that', 'gives', 'computers'),
 ('gives', 'computers', 'the'),
 ('computers', 'the', 'ability'),
 ('the', 'ability', 'to'),
 ('ability', 'to', 'interpret'),
 ('to', 'interpret', ','),
 ('interpret', ',', 'manipulate'),
 (',', 'manipulate', ','),
 ('manipulate', ',', 'and'),
 (',', 'and', 'comprehend'),
 ('and', 'comprehend', 'human'),
 ('comprehend', 'human', 'language'),
 ('human', 'language', '.')]

In [24]:
n_gram= list(ngrams(token, n= 4))
n_gram

[('Natural', 'language', 'processing', '('),
 ('language', 'processing', '(', 'NLP'),
 ('processing', '(', 'NLP', ')'),
 ('(', 'NLP', ')', 'is'),
 ('NLP', ')', 'is', 'a'),
 (')', 'is', 'a', 'machine'),
 ('is', 'a', 'machine', 'learning'),
 ('a', 'machine', 'learning', 'technology'),
 ('machine', 'learning', 'technology', 'that'),
 ('learning', 'technology', 'that', 'gives'),
 ('technology', 'that', 'gives', 'computers'),
 ('that', 'gives', 'computers', 'the'),
 ('gives', 'computers', 'the', 'ability'),
 ('computers', 'the', 'ability', 'to'),
 ('the', 'ability', 'to', 'interpret'),
 ('ability', 'to', 'interpret', ','),
 ('to', 'interpret', ',', 'manipulate'),
 ('interpret', ',', 'manipulate', ','),
 (',', 'manipulate', ',', 'and'),
 ('manipulate', ',', 'and', 'comprehend'),
 (',', 'and', 'comprehend', 'human'),
 ('and', 'comprehend', 'human', 'language'),
 ('comprehend', 'human', 'language', '.')]
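
To make the outputs above less of a black box, here is a minimal pure-Python sketch of the sliding-window idea behind n-grams. The helper name `sliding_ngrams` is made up for this example and is not part of NLTK; it is only meant to mirror what `bigrams`, `trigrams`, and `ngrams` return for a plain token list.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a list of tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = list(tokens)
    # zip() over n copies of the list, each shifted one position further,
    # stops at the shortest copy, so k tokens give k - n + 1 tuples
    # (and an empty list when there are fewer than n tokens).
    return list(zip(*(tokens[i:] for i in range(n))))

words = ["Natural", "language", "processing", "(", "NLP", ")"]
print(sliding_ngrams(words, 2))
print(sliding_ngrams(words, 3))
```

This also explains the list lengths above: the 26 tokens of the sample sentence yield 25 bigrams, 24 trigrams, and 23 four-grams.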

<center> <h1> Stemming </h1> </center>

In [26]:
from nltk.stem import PorterStemmer
# Initialize the Porter stemmer
pst = PorterStemmer()
# Stem a single word, "having"
print("Stemming 'having' :", pst.stem('having'))

# Stem several inflected forms of "give"
word_to_stem = ['give', 'giving', 'given', 'gave']
for word in word_to_stem:
  print(word + " : " + pst.stem(word))

Stemming 'having' : have
give : give
giving : give
given : given
gave : gave


In [27]:
from nltk.stem import LancasterStemmer
# Initialize the Lancaster stemmer
lst = LancasterStemmer()
# Stem a single word, "having"
print("Stemming 'having' :", lst.stem('having'))

# Stem several inflected forms of "give"
word_to_stem = ['give', 'giving', 'given', 'gave']
for word in word_to_stem:
  print(word + " : " + lst.stem(word))

Stemming 'having' : hav
give : giv
giving : giv
given : giv
gave : gav


In [28]:
from nltk.stem import SnowballStemmer

sbst= SnowballStemmer('english')

support_language= sbst.languages
print(support_language)

('arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')


In [29]:
from nltk.stem import SnowballStemmer

sbst= SnowballStemmer('english')

print("Stemming 'having' :", sbst.stem('having'))

word_to_stem= ['give', 'giving', 'given', 'gave']
for word in word_to_stem:
  print(word + " : "+ sbst.stem(word))

Stemming 'having' : have
give : give
giving : give
given : given
gave : gave
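
The three stemmers are easier to compare side by side. The sketch below runs the same words through all of them and collects the stems in one dictionary; the names `stemmers` and `comparison` are just labels chosen for this example.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}
words = ["having", "give", "giving", "given", "gave"]

# One entry per word, holding the stem produced by each stemmer.
comparison = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in words
}

for word, stems in comparison.items():
    print(f"{word:<8}", *(f"{name}={stem}" for name, stem in stems.items()))
```

As the individual cells above already show, Lancaster cuts more aggressively (`giv`, `hav`), while Porter and Snowball agree on these words. None of the three maps the irregular form `gave` back to `give`.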


<center> <h1> Lemmatization </h1> </center>

In [31]:
from nltk.stem import WordNetLemmatizer
# Requires the WordNet corpus: nltk.download('wordnet')

word_lem = WordNetLemmatizer()

print("Lemmatized Word 'corpora' :", word_lem.lemmatize('corpora'))

# With no pos argument, lemmatize() treats every word as a noun
word_to_lemmat = ['give', 'giving', 'given', 'gave']
for word in word_to_lemmat:
  print(word + " : " + word_lem.lemmatize(word))

Lemmatized Word 'corpora' : corpus
give : give
giving : giving
given : given
gave : gave


<center> <h1> Stopwords </h1> </center>

In [33]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
nltk.download('stopwords')

text= "Natural language processing (NLP) is a machine learning technology that gives computers the ability to interpret, manipulate, and comprehend human language."

tokens= word_tokenize(text)

stop_words= set(stopwords.words('english'))

filtered_tokens= [word for word in tokens if word.lower() not in stop_words]

fdist= FreqDist(filtered_tokens)

fdist_top10= fdist.most_common(10)

print("Top 10 most common words after stopwords removal:")

print(fdist_top10)

Top 10 most common words after stopwords removal:
[('language', 2), (',', 2), ('Natural', 1), ('processing', 1), ('(', 1), ('NLP', 1), (')', 1), ('machine', 1), ('learning', 1), ('technology', 1)]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
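
Notice that punctuation such as `,` and `(` still appears in the counts above, because it is not in the stopword list. The sketch below shows one possible way to drop it with `str.isalpha()` and to lowercase tokens before counting. To stay self-contained it hard-codes the token list and uses a tiny stand-in stopword set (an assumption for this example, not the full NLTK list) together with `collections.Counter` instead of `FreqDist`.

```python
from collections import Counter

# Tokens as produced by word_tokenize for the sample sentence.
tokens = ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a',
          'machine', 'learning', 'technology', 'that', 'gives', 'computers',
          'the', 'ability', 'to', 'interpret', ',', 'manipulate', ',', 'and',
          'comprehend', 'human', 'language', '.']

# ASSUMPTION: a tiny stand-in for stopwords.words('english').
stop_words = {'is', 'a', 'that', 'the', 'to', 'and'}

# Keep alphabetic tokens that are not stopwords, lowercased for counting.
content_words = [word.lower() for word in tokens
                 if word.isalpha() and word.lower() not in stop_words]

word_counts = Counter(content_words)
print(word_counts.most_common(3))
# → [('language', 2), ('natural', 1), ('processing', 1)]
```

Be aware that `isalpha()` also discards tokens containing digits or apostrophes, so this filter may be too strict for some texts.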


<center> <h1> Part of Speech (POS) Tagging </h1> </center>

In [36]:
import nltk

nltk.download('averaged_perceptron_tagger')

from nltk.tokenize import word_tokenize

sent= "Wasiq is Driving Luxury Car."
sent_tokens= word_tokenize(sent)
print(sent_tokens)

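# Each token is tagged on its own here, so the tagger sees no sentence context;
# nltk.pos_tag(sent_tokens) would tag the whole sentence in one call
for token in sent_tokens:
  print(nltk.pos_tag([token]))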

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


['Wasiq', 'is', 'Driving', 'Luxury', 'Car', '.']
[('Wasiq', 'NN')]
[('is', 'VBZ')]
[('Driving', 'VBG')]
[('Luxury', 'NN')]
[('Car', 'NN')]
[('.', '.')]


<center> <h1> Bag of Words (BoW) </h1> </center>

In [37]:
from sklearn.feature_extraction.text import CountVectorizer

documents = ["This is the first document.",
"This document is the second document.",
"And this is the third one.",
"Is this the first document?"]

vectorizer= CountVectorizer()

X= vectorizer.fit_transform(documents)

feature_name= vectorizer.get_feature_names_out()

In [38]:
print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


In [39]:
print("Feature Name :", feature_name)

Feature Name : ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']


In [41]:
import pandas as pd
pd.DataFrame(X.toarray(), columns= feature_name)


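|   | and | document | first | is | one | second | the | third | this |
|---|-----|----------|-------|----|-----|--------|-----|-------|------|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 2 | 0 | 1 | 0 | 1 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 3 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |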
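
To see where these counts come from, the following pure-Python sketch rebuilds the same matrix by hand. It approximates the `CountVectorizer` defaults (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the function name `bag_of_words` is made up for this example and is not a scikit-learn API.

```python
import re
from collections import Counter

# Tokens of two or more word characters, like CountVectorizer's default pattern.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def bag_of_words(documents):
    """Return (vocabulary, matrix) with one count row per document."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in documents]
    # The vocabulary is every distinct token, sorted alphabetically.
    vocabulary = sorted({token for doc in tokenized for token in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

documents = ["This is the first document.",
             "This document is the second document.",
             "And this is the third one.",
             "Is this the first document?"]

vocabulary, matrix = bag_of_words(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is simply a word count over the shared vocabulary, so word order is lost: the first and fourth documents get identical rows even though one is a statement and the other a question.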
