<a href="https://colab.research.google.com/github/RDGopal/Prompt-Engineering-Guide/blob/main/Lecture_POS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Part-of-Speech (POS)
Part-of-Speech (POS) tagging is a fundamental technique in Natural Language Processing (NLP) that assigns each word in a text a part-of-speech label, based on both the word's definition and its context in the sentence. POS tags are commonly used as input to syntactic parsing and other text analysis tasks.

## Concepts
**Definition and Usage**: Each word in a sentence is tagged with a part of speech that explains its usage and function in that sentence. Common parts of speech include noun, verb, adjective, adverb, pronoun, preposition, conjunction, and interjection.

**Tag Sets**: There are different sets of tags, but one of the most commonly used is the Penn Treebank tagset. Examples include:

* NN: Noun, singular
* NNS: Noun, plural
* VB: Verb, base form
* VBD: Verb, past tense
* JJ: Adjective
* RB: Adverb

Techniques for POS Tagging:

**Rule-Based POS Tagging**: Uses hand-written rules to distinguish the POS based on the word itself and its context within a sentence.
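As a rough illustration of the rule-based idea, the sketch below tags tokens with a handful of hand-written regular-expression rules plus one contextual rule. The rule list, the default tag, and the function name `rule_based_tag` are all hypothetical choices made for this example; a real rule-based tagger would use a far larger and more carefully ordered rule set.

```python
import re

# Hypothetical hand-written rules: (pattern, Penn Treebank tag).
# Rules are tried in order and the first match wins.
RULES = [
    (re.compile(r"^(the|a|an)$", re.IGNORECASE), "DT"),  # determiners
    (re.compile(r"^-?\d+(\.\d+)?$"), "CD"),              # cardinal numbers
    (re.compile(r".+ing$"), "VBG"),                      # gerund / present participle
    (re.compile(r".+ed$"), "VBD"),                       # past tense
    (re.compile(r".+ly$"), "RB"),                        # adverbs
    (re.compile(r".+s$"), "NNS"),                        # plural nouns
]
DEFAULT_TAG = "NN"  # fall back to singular noun when no rule matches


def rule_based_tag(tokens):
    """Tag each token using the word-level rules, then one contextual rule."""
    tagged = []
    for i, token in enumerate(tokens):
        tag = DEFAULT_TAG
        for pattern, candidate in RULES:
            if pattern.match(token):
                tag = candidate
                break
        # Contextual rule: an otherwise untagged word right after "to"
        # is treated as a base-form verb.
        if tag == DEFAULT_TAG and i > 0 and tokens[i - 1].lower() == "to":
            tag = "VB"
        tagged.append((token, tag))
    return tagged


print(rule_based_tag(["The", "dog", "quickly", "jumped", "over", "3", "fences"]))
print(rule_based_tag(["I", "want", "to", "book", "a", "ticket"]))
```

Note that "over" and "I" fall through to the default noun tag here, which shows why purely hand-written rules need broad coverage to be useful.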

**Stochastic POS Tagging**: Uses model-based systems like Hidden Markov Models (HMMs), Maximum Entropy Markov Models, or Conditional Random Fields (CRFs), where the model learns from a tagged corpus to apply tags to new sentences.
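To make the HMM approach more concrete, here is a minimal Viterbi decoding sketch. The tag set, the start, transition, and emission probabilities, and the `UNSEEN` smoothing constant are hand-picked assumptions for illustration; in a real stochastic tagger they would be estimated from a tagged corpus.

```python
import math

# Toy HMM with hand-picked (not learned) probabilities, for illustration only.
TAGS = ["DT", "NN", "VB"]
START = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
TRANS = {
    "DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
    "VB": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
}
EMIT = {
    "DT": {"the": 0.9, "a": 0.1},
    "NN": {"dog": 0.4, "book": 0.3, "walk": 0.3},
    "VB": {"book": 0.3, "walk": 0.5, "dog": 0.2},
}
UNSEEN = 1e-6  # small probability for (tag, word) pairs not in EMIT


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    if not words:
        return []
    # best[i][tag] = (log-probability of the best path ending in tag, backpointer)
    best = [{}]
    for tag in TAGS:
        score = math.log(START[tag]) + math.log(EMIT[tag].get(words[0], UNSEEN))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in TAGS:
            emit = math.log(EMIT[tag].get(words[i], UNSEEN))
            prev_tag, score = max(
                ((p, best[i - 1][p][0] + math.log(TRANS[p][tag]) + emit) for p in TAGS),
                key=lambda pair: pair[1],
            )
            best[i][tag] = (score, prev_tag)
    # Follow the backpointers from the best final tag.
    path = [max(TAGS, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    path.reverse()
    return list(zip(words, path))


print(viterbi(["the", "dog", "walk"]))
print(viterbi(["the", "walk"]))
```

With these toy numbers, "walk" is decoded as a verb after a noun but as a noun directly after a determiner, which is the kind of context sensitivity the transition probabilities provide.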

**Deep Learning Methods**: Use neural networks, especially recurrent neural networks (RNNs) and transformers, to perform POS tagging, often achieving higher accuracy than rule-based and classical statistical taggers.

## Example with Explanation
Consider the sentence: "Apple is looking at buying U.K. startup for $1 billion."

The POS tagging for this might look like:

Apple (NNP - Proper noun, singular)

is (VBZ - Verb, 3rd person singular present)

looking (VBG - Verb, gerund or present participle)

at (IN - Preposition or subordinating conjunction)

buying (VBG - Verb, gerund or present participle)

U.K. (NNP - Proper noun, singular)

startup (NN - Noun, singular)

for (IN - Preposition or subordinating conjunction)

$ ($ - Dollar sign)

1 (CD - Cardinal number)

billion (CD - Cardinal number)

Each word is assigned a tag that explains how it functions grammatically within the sentence.

# Algorithm
The `averaged_perceptron_tagger` in NLTK is a part-of-speech tagger based on the Averaged Perceptron algorithm, a supervised learning method. It assigns a part-of-speech tag to each word in a sentence, based on contextual information about the word and its neighbors. The Averaged Perceptron algorithm works as follows:

## Overview of the Perceptron
In its basic form, a perceptron is a simple linear binary classifier that makes predictions based on a weighted sum of the input features. POS tagging has more than two classes, so the tagger keeps a separate weight for each (feature, tag) pair and computes one score per tag. The features are typically aspects of words (like the word itself, the previous word, the next word, prefixes, suffixes, etc.). The basic steps are:

**Extract Features**: For each word, relevant features are extracted. These might include:

* The word itself.
* The previous word and its tag.
* The next word.
* Suffixes and prefixes of the word.
* Whether the word is capitalized.


**Weight Calculation**: Each feature has an associated weight, which determines its importance. In training, the perceptron algorithm adjusts these weights.

**Decision Making**: To predict the POS tag of a word, the perceptron computes a weighted sum of the features. The tag with the highest sum (score) is chosen as the output.
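The three steps above can be sketched in a few lines of Python. The feature names, the hand-set `WEIGHTS` table, and the small `TAGS` list are assumptions made for this example; they are not the feature templates or weights that NLTK's tagger actually uses.

```python
def extract_features(tokens, i, prev_tag):
    """Return the active (binary) features for tokens[i]."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<START>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "<END>"
    return [
        "bias",
        f"word={word.lower()}",
        f"suffix3={word[-3:].lower()}",
        f"prefix1={word[:1].lower()}",
        f"is_capitalized={word[:1].isupper()}",
        f"prev_word={prev_word.lower()}",
        f"prev_tag={prev_tag}",
        f"next_word={next_word.lower()}",
    ]


def predict(features, weights, tags):
    """Sum the weights of the active features per tag and return the best tag."""
    scores = {tag: 0.0 for tag in tags}
    for feat in features:
        for tag, weight in weights.get(feat, {}).items():
            scores[tag] += weight
    return max(tags, key=lambda t: scores[t])


# Hand-set weights (feature -> tag -> weight), purely illustrative.
TAGS = ["MD", "NN", "VB"]
WEIGHTS = {
    "word=can": {"MD": 1.0, "NN": 0.5},
    "prev_tag=DT": {"NN": 2.0},
    "prev_tag=MD": {"VB": 2.0},
}

print(predict(extract_features(["you", "can"], 1, "PRP"), WEIGHTS, TAGS))  # MD
print(predict(extract_features(["the", "can"], 1, "DT"), WEIGHTS, TAGS))   # NN
```

In this toy setup the word "can" alone favors the modal reading, but the `prev_tag=DT` feature adds enough weight to the noun score to flip the decision.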

## Averaged Perceptron
The Averaged Perceptron is an enhancement over the basic perceptron, designed to reduce the sensitivity to the order of training data and to provide better generalization:

**Weight Update**: During training, each time the perceptron makes a mistake (the predicted tag differs from the actual tag), it updates the weights of the features. Weights that would have led to the correct tag are increased, while those that led to the wrong tag are decreased.

**Averaging**: The core idea of the Averaged Perceptron is that it keeps a running average of all the weight values throughout training. Instead of using just the final values of the weights (as in a standard perceptron), it uses the average value each weight held over all training steps, including steps where no update was made. This averaging smooths out the effect of any particular training instance and tends to improve performance on unseen data.
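A minimal sketch of the weight update and the averaging step is shown below. The class name `AveragedPerceptron`, the +1/-1 update size, and the simple "add every weight to a running total after each step" bookkeeping are assumptions chosen for clarity rather than efficiency.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Multi-class perceptron that also tracks the average of its weights."""

    def __init__(self, tags):
        self.tags = sorted(tags)
        # feature -> tag -> current weight
        self.weights = defaultdict(lambda: defaultdict(float))
        # feature -> tag -> sum of that weight's value over all steps so far
        self.totals = defaultdict(lambda: defaultdict(float))
        self.steps = 0

    def predict(self, features, weights=None):
        weights = self.weights if weights is None else weights
        scores = {tag: 0.0 for tag in self.tags}
        for feat in features:
            for tag, weight in weights.get(feat, {}).items():
                scores[tag] += weight
        return max(self.tags, key=lambda t: scores[t])

    def update(self, features, gold):
        """Process one training example: fix mistakes, then accumulate totals."""
        guess = self.predict(features)
        if guess != gold:
            for feat in features:
                self.weights[feat][gold] += 1.0   # reward the correct tag
                self.weights[feat][guess] -= 1.0  # penalize the wrong tag
        self.steps += 1
        for feat, per_tag in self.weights.items():
            for tag, weight in per_tag.items():
                self.totals[feat][tag] += weight
        return guess

    def averaged_weights(self):
        """Average value of each weight over all training steps."""
        return {
            feat: {tag: total / self.steps for tag, total in per_tag.items()}
            for feat, per_tag in self.totals.items()
        }


model = AveragedPerceptron(["NN", "VB"])
model.update(["word=book", "prev_tag=DT"], "NN")
model.update(["word=book", "prev_tag=MD"], "VB")
model.update(["word=book", "prev_tag=DT"], "NN")
print(dict(model.weights["word=book"]))          # final weights
print(model.averaged_weights()["word=book"])     # averaged weights
```

Summing every weight after every step is easy to read but slow for large feature sets; practical implementations often record when each weight last changed and update the totals lazily.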

## Training and Prediction
The training process involves multiple passes (epochs) over the training dataset, adjusting the weights based on errors. These steps are iterated until a stopping criterion is met, typically a maximum number of epochs or a minimum error threshold.

When predicting POS tags:

**Feature Extraction**: Extract the same types of features used during training.

**Score Calculation**: Compute scores for each tag based on the current weights and the extracted features.

**Tag Selection**: Select the tag with the highest score.
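The training loop and the three prediction steps can be sketched end to end as follows. This is a simplified illustration under several assumptions: the feature set is tiny, the toy training sentences in `TRAIN` are made up, the stopping criterion is "an epoch with zero errors or `max_epochs` reached", and weight averaging is left out for brevity.

```python
from collections import defaultdict


def features(tokens, i, prev_tag):
    """Feature extraction shared by training and prediction."""
    word = tokens[i].lower()
    return ["bias", f"word={word}", f"suffix2={word[-2:]}", f"prev_tag={prev_tag}"]


def best_tag(feats, weights, tags):
    """Score every tag from the active features and pick the highest."""
    scores = {tag: 0.0 for tag in tags}
    for feat in feats:
        for tag, weight in weights[feat].items():
            scores[tag] += weight
    return max(tags, key=lambda t: scores[t])


def train(sentences, max_epochs=10):
    tags = sorted({tag for sent in sentences for _, tag in sent})
    weights = defaultdict(lambda: defaultdict(float))
    for _ in range(max_epochs):
        errors = 0
        for sent in sentences:
            tokens = [word for word, _ in sent]
            prev_tag = "<START>"
            for i, (_, gold) in enumerate(sent):
                feats = features(tokens, i, prev_tag)
                guess = best_tag(feats, weights, tags)
                if guess != gold:
                    errors += 1
                    for feat in feats:
                        weights[feat][gold] += 1.0
                        weights[feat][guess] -= 1.0
                prev_tag = guess  # use the predicted tag as history, as at test time
        if errors == 0:  # stopping criterion: a full pass without mistakes
            break
    return weights, tags


def tag(tokens, weights, tags):
    """Greedy left-to-right tagging."""
    result, prev_tag = [], "<START>"
    for i in range(len(tokens)):
        prev_tag = best_tag(features(tokens, i, prev_tag), weights, tags)
        result.append((tokens[i], prev_tag))
    return result


TRAIN = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("runs", "VBZ")],
]
weights, tags = train(TRAIN)
print(tag(["a", "dog", "runs"], weights, tags))
```

Feeding the predicted tag (rather than the gold tag) into the next word's features during training mirrors the situation at prediction time, where only the model's own earlier guesses are available.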

# Training
The `averaged_perceptron_tagger` in NLTK is typically trained on the Wall Street Journal (WSJ) portion of the Penn Treebank dataset. This dataset is one of the most widely used corpora for training linguistic models in English, particularly for tasks involving part-of-speech tagging and syntactic parsing.

## Penn Treebank Dataset
**Overview**: The Penn Treebank, developed in a collaboration between the University of Pennsylvania and IBM in the early 1990s, is a corpus annotated with the syntactic and predicate-argument structure of English sentences. It includes a range of texts from the Wall Street Journal (WSJ), as well as other materials.

**Content**: For POS tagging, specifically, the portion from the Wall Street Journal is commonly used because it covers a wide variety of topics in financial and economic news, making the vocabulary and grammatical structures varied and rich.

**Annotations**: Each word in the dataset is annotated with a POS tag based on the Penn Treebank tagset, which includes detailed tags not only for traditional parts of speech like nouns and verbs but also for more specific categories like cardinal numbers and different verb forms.

## Training Process
The training of the `averaged_perceptron_tagger` involves the following steps using this dataset:

**Feature Extraction**: The model extracts features from the training sentences, which might include the word itself, the context of the word (e.g., previous word, next word), prefixes, suffixes, and whether the word is capitalized.

**Learning Weights**: As it processes each word in the training data, the model updates the weights associated with each feature based on whether its current predictions are correct. If a prediction is wrong, the model adjusts the weights to be more likely to make the correct prediction next time.

**Averaging Weights**: Throughout training, the model maintains an average for each weight to ensure that the final model isn’t too biased towards the latter stages of the training process or specific quirks of the training data order.

## Benefits of Using Penn Treebank
Using the Penn Treebank, especially the WSJ corpus, provides several advantages:

**High Quality of Annotation**: The data is manually annotated by linguistic experts, ensuring high-quality and reliable annotations.

**Diversity of Examples**: The corpus contains a wide range of sentence structures and vocabulary, which helps in building a robust model capable of handling various real-world texts.

**Standardization**: Because many linguistic models are trained on this corpus, it provides a common benchmark for comparing the performance of different algorithms and approaches.

In [None]:
pos_tag_full_form = {
    'CC': 'Coordinating conjunction',
    'CD': 'Cardinal number',
    'DT': 'Determiner',
    'EX': 'Existential there',
    'FW': 'Foreign word',
    'IN': 'Preposition or subordinating conjunction',
    'JJ': 'Adjective',
    'JJR': 'Adjective, comparative',
    'JJS': 'Adjective, superlative',
    'LS': 'List item marker',
    'MD': 'Modal',
    'NN': 'Noun, singular or mass',
    'NNS': 'Noun, plural',
    'NNP': 'Proper noun, singular',
    'NNPS': 'Proper noun, plural',
    'PDT': 'Predeterminer',
    'POS': 'Possessive ending',
    'PRP': 'Personal pronoun',
    'PRP$': 'Possessive pronoun',
    'RB': 'Adverb',
    'RBR': 'Adverb, comparative',
    'RBS': 'Adverb, superlative',
    'RP': 'Particle',
    'SYM': 'Symbol',
    'TO': 'to',
    'UH': 'Interjection',
    'VB': 'Verb, base form',
    'VBD': 'Verb, past tense',
    'VBG': 'Verb, gerund or present participle',
    'VBN': 'Verb, past participle',
    'VBP': 'Verb, non-3rd person singular present',
    'VBZ': 'Verb, 3rd person singular present',
    'WDT': 'Wh-determiner',
    'WP': 'Wh-pronoun',
    'WP$': 'Possessive wh-pronoun',
    'WRB': 'Wh-adverb'
}

In [None]:
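import nltk
import pandas as pd

# Download the tagger model and the tokenizer data used below.
# Note: recent NLTK releases may instead ask for the
# 'averaged_perceptron_tagger_eng' and 'punkt_tab' resources.
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')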

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
# Example text
sentence = "The quick brown fox jumps over the lazy dog"

# Tokenizing the sentence
tokens = nltk.word_tokenize(sentence)

# POS Tagging
tags = nltk.pos_tag(tokens)
print(tags)


[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]


In [None]:
# Create the DataFrame
df_pos = pd.DataFrame(tags)
df_pos.columns = ['Word','POS']

In [None]:
# Add POS explainers
df_pos['POS Full Form'] = df_pos['POS'].map(pos_tag_full_form)
df_pos

| | Word | POS | POS Full Form |
|---|---|---|---|
| 0 | The | DT | Determiner |
| 1 | quick | JJ | Adjective |
| 2 | brown | NN | Noun, singular or mass |
| 3 | fox | NN | Noun, singular or mass |
| 4 | jumps | VBZ | Verb, 3rd person singular present |
| 5 | over | IN | Preposition or subordinating conjunction |
| 6 | the | DT | Determiner |
| 7 | lazy | JJ | Adjective |
| 8 | dog | NN | Noun, singular or mass |


# Context-Dependent POS Tagging
**Word**: "set"

**As a Noun**: "I have a chess set at home." (NN - Noun, singular)

**As a Verb**: "Please set the table for dinner." (VB - Verb, base form)

**Word**: "can"

**As a Modal Verb**: "You can see the stars tonight." (MD - Modal)

**As a Noun**: "Throw it in the trash can." (NN - Noun, singular)

**Word**: "right"

**As an Adverb**: "Turn right at the corner." (RB - Adverb)

**As an Adjective**: "He is the right person for the job." (JJ - Adjective)

**As a Noun**: "They fought for their rights." (NNS - Noun, plural)

## How POS Tagging Handles Context
POS tagging models, especially those that are context-aware (like HMMs, Conditional Random Fields, or neural network-based models), determine the part of speech for each word by considering:

* The word itself: Some words are more likely to be associated with certain parts of speech.
* Contextual clues: The words surrounding a particular word give strong hints about its likely part of speech.
* Word endings: For example, words ending in "-ing" are often verbs.
* Grammatical patterns: Certain common patterns or sequences of tags are more likely than others, which guides the tagging process.
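The last point, that some tag sequences are more likely than others, can be illustrated by counting tag-to-tag transitions in a tagged sample. The four hand-tagged sentences below are adapted from the examples in this section; the tags were assigned by hand for this sketch, and the helper name `transition_probs` is hypothetical.

```python
from collections import Counter, defaultdict

# Small hand-tagged sample (an assumption for illustration, not a real corpus).
TAGGED = [
    [("I", "PRP"), ("have", "VBP"), ("a", "DT"), ("chess", "NN"), ("set", "NN")],
    [("Please", "UH"), ("set", "VB"), ("the", "DT"), ("table", "NN")],
    [("You", "PRP"), ("can", "MD"), ("see", "VB"), ("the", "DT"), ("stars", "NNS")],
    [("Throw", "VB"), ("it", "PRP"), ("in", "IN"), ("the", "DT"), ("trash", "NN"), ("can", "NN")],
]


def transition_probs(tagged_sentences):
    """Estimate P(next tag | previous tag) by relative frequency."""
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        prev = "<START>"
        for _, tag in sent:
            counts[prev][tag] += 1
            prev = tag
    return {
        prev: {tag: n / sum(nxt.values()) for tag, n in nxt.items()}
        for prev, nxt in counts.items()
    }


probs = transition_probs(TAGGED)
print(probs["DT"])  # what tends to follow a determiner
print(probs["MD"])  # what tends to follow a modal
```

In this sample a determiner is always followed by a noun and a modal by a base-form verb, which is the kind of pattern a sequence model uses to tag "can" or "set" differently depending on the preceding tag.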

In [18]:
import nltk
from nltk.tokenize import word_tokenize
sentences = [
    "It is a good book",
    "Can you book the ticket?",
    "That is a trash can",
    "I love fresh air",
    "can you air out the room?"
]

# Tokenize and POS tag each sentence
for sentence in sentences:
    tokens = word_tokenize(sentence)
    tags = nltk.pos_tag(tokens)
    print(tags)

[('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('book', 'NN')]
[('Can', 'MD'), ('you', 'PRP'), ('book', 'NN'), ('the', 'DT'), ('ticket', 'NN'), ('?', '.')]
[('That', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('trash', 'NN'), ('can', 'MD')]
[('I', 'PRP'), ('love', 'VBP'), ('fresh', 'JJ'), ('air', 'NN')]
[('can', 'MD'), ('you', 'PRP'), ('air', 'VB'), ('out', 'RP'), ('the', 'DT'), ('room', 'NN'), ('?', '.')]
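Not every tag in the output above matches the intended reading. The sketch below compares two of the printed results against reference tags. The `predicted` lists are copied from the output above, while the `expected` lists are hand-assigned for this example (treating "book" after a modal as a base-form verb and "trash can" as a noun phrase) and should be read as an assumption rather than gold-standard annotation.

```python
# Tagger output copied from the printed results above.
predicted = {
    "Can you book the ticket?": [
        ("Can", "MD"), ("you", "PRP"), ("book", "NN"),
        ("the", "DT"), ("ticket", "NN"), ("?", "."),
    ],
    "That is a trash can": [
        ("That", "DT"), ("is", "VBZ"), ("a", "DT"), ("trash", "NN"), ("can", "MD"),
    ],
}

# Hand-assigned reference tags (an assumption for this illustration).
expected = {
    "Can you book the ticket?": ["MD", "PRP", "VB", "DT", "NN", "."],
    "That is a trash can": ["DT", "VBZ", "DT", "NN", "NN"],
}


def mismatches(pred, gold_tags):
    """Return (word, predicted tag, expected tag) for every disagreement."""
    return [(word, p, g) for (word, p), g in zip(pred, gold_tags) if p != g]


for sentence, pred in predicted.items():
    print(sentence, "->", mismatches(pred, expected[sentence]))
```

Under these reference tags, the tagger misses the verb reading of "book" and the noun reading of "can", both cases where the most frequent tag for the word wins over the local context.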


# SpaCy
SpaCy and NLTK (Natural Language Toolkit) are two of the most popular libraries in Python for natural language processing (NLP), but they are designed with different goals and use cases in mind. Here's a detailed comparison highlighting their key differences:

## Design Philosophy and Use Case
### SpaCy:

**Performance-Oriented**: SpaCy is built with performance in mind, both in terms of processing speed and accuracy. It is designed for practical, real-world applications and is often preferred in industry settings where performance and scalability are crucial.

**Modern and Minimalist**: Provides a clear and concise API that focuses on common use cases, making it easier to implement complex NLP pipelines with fewer lines of code.

**Production-Ready**: Includes pre-trained models that are optimized for efficient inference, making it well suited for deployment in production environments.

### NLTK:

**Educational Tool**: Originally designed as an educational and research tool, NLTK is excellent for teaching and studying NLP concepts, including traditional NLP tasks.

**Comprehensive Toolkit**: Offers a vast array of algorithms for almost every NLP task and is ideal for experimenting with different methods, especially in an academic or research setting.

**Modular and Extensive**: Contains a wide range of modules and datasets, providing tools for almost every computational task related to human language.

## Features and Capabilities
### SpaCy:

**Built-In Word Vectors**: SpaCy supports word vectors natively and includes functions to leverage these for various tasks.

**Dependency Parsing**: Comes with a highly efficient and accurate syntactic dependency parser.

**Entity Recognition**: Has strong support for named entity recognition with pre-trained models.

**Pipeline Customization**: Allows for easy customization and extension of processing pipelines to include custom components or models.

### NLTK:

**Text Processing Libraries**: Includes a broad spectrum of libraries for tokenization, stemming, tagging, parsing, and semantic reasoning, which is more comprehensive than SpaCy's.

**Corpora and Resources**: Ships with a large suite of text corpora and lexical resources, facilitating a wide variety of NLP tasks.

**Prototyping**: Provides more flexibility for NLP prototyping, especially for less common or more experimental techniques.

## Performance
SpaCy is generally faster and more memory-efficient than NLTK, due to its Cython-based implementation, which makes it suitable for high-volume and real-time applications.

NLTK can be slower and less efficient, making it less ideal for production but excellent for teaching and prototyping, where performance is often a secondary concern.

## Practical Implementation
SpaCy is often chosen for commercial software and production systems due to its performance and ease of integration into applications.

NLTK is typically used in academic settings or in scenarios where one needs to try a variety of algorithms or conduct comprehensive linguistic research.

In SpaCy, when you process a text through its NLP pipeline, each word in the text is converted into a Token object. Each Token object has various attributes that provide detailed information about the word. Here's what each of the attributes represents in the context of a SpaCy Token:

`token.text`:

This attribute returns the exact text of the token or word as it appeared in the input text.

`token.lemma_`:

The lemma of the token is its base form or dictionary form. For example, the lemma of "was" is "be", and the lemma of "mice" is "mouse". This is useful for normalizing the text and reducing the number of forms you need to deal with in NLP applications.

`token.pos_`:

Stands for "Part Of Speech". This attribute returns the coarse-grained part-of-speech tag of the token from the Universal POS tag set (like VERB, NOUN, ADJ, etc.).

`token.tag_`:

This provides the detailed part-of-speech tag. It's more specific than pos_ and uses a tag set specific to the language (such as the Penn Treebank tag set for English). For example, NN for singular noun, NNS for plural noun, etc.

`token.dep_`:

This stands for "syntactic dependency". It indicates the relation between this token and the token it is attached to (its head). Examples include nsubj (nominal subject), dobj (direct object), or amod (adjectival modifier).

`token.shape_`:

This attribute returns a transformation of the token text that shows its general shape by replacing lowercase letters with 'x', uppercase letters with 'X', and digits with 'd'. For example, "Apple123" becomes "Xxxxxddd". Runs of the same shape character are truncated after four characters, so a longer word such as "looking" has the shape "xxxx".

`token.is_alpha`:

A boolean attribute that returns True if the token consists of only alphabetic characters (no digits or punctuation). For example, "Apple" would return True, but "Apple123" would return False.

`token.is_stop`:

Another boolean attribute that indicates whether the token is a stop word (a common word that may be filtered out in some types of text processing, such as "and", "the", or "is" in English). SpaCy has a built-in list of stop words for each language.
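To make the `token.shape_` transformation described above more tangible, here is a pure-Python approximation. The function name `word_shape` and the exact handling of long runs (keeping at most four identical shape characters in a row) are assumptions modeled on the outputs shown in this notebook, not a copy of SpaCy's implementation.

```python
def word_shape(text):
    """Approximate a SpaCy-style shape string for a single token."""
    shape = []
    last = ""
    run = 0  # how many times the current shape character has repeated
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 0
            last = shape_char
        if run < 4:  # keep at most four identical shape characters in a row
            shape.append(shape_char)
    return "".join(shape)


for token in ["Apple123", "Apple", "looking", "U.K.", "$", "1"]:
    print(token, word_shape(token))
```

Shapes are useful as features because they generalize across words: any capitalized five-letter word shares the shape of "Apple", regardless of whether the word itself was seen during training.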

In [2]:
import spacy

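# Load the small English pipeline (tokenizer, tagger, parser, NER).
# It must be installed first, e.g. with: python -m spacy download en_core_web_sm
# Note: the small model does not ship with static word vectors.
nlp = spacy.load('en_core_web_sm')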

text = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(text)

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)


Apple Apple PROPN NNP nsubj Xxxxx True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP dobj X.X. False False
startup startup NOUN NN dep xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
1 1 NUM CD compound d False False
billion billion NUM CD pobj xxxx True False


In [17]:
import spacy

# Load the English NLP model
nlp = spacy.load('en_core_web_sm')

sentences = [
    "It is a good book",
    "Can you book the ticket?",
    "That is a trash can",
    "I love fresh air",
    "can you air out the room?"
]

# Process the sentences and print POS tags
for sentence in sentences:
    doc = nlp(sentence)
    print([(token.text, token.pos_) for token in doc])


[('It', 'PRON'), ('is', 'AUX'), ('a', 'DET'), ('good', 'ADJ'), ('book', 'NOUN')]
[('Can', 'AUX'), ('you', 'PRON'), ('book', 'VERB'), ('the', 'DET'), ('ticket', 'NOUN'), ('?', 'PUNCT')]
[('That', 'PRON'), ('is', 'AUX'), ('a', 'DET'), ('trash', 'NOUN'), ('can', 'AUX')]
[('I', 'PRON'), ('love', 'VERB'), ('fresh', 'ADJ'), ('air', 'NOUN')]
[('can', 'AUX'), ('you', 'PRON'), ('air', 'VERB'), ('out', 'ADP'), ('the', 'DET'), ('room', 'NOUN'), ('?', 'PUNCT')]


# BERT-Based POS Tagging

BERT's bidirectional Transformer architecture lets it represent each word using the full sentence, including both the left and the right context. This is particularly beneficial for tasks like POS tagging, where the correct tag often depends on the surrounding words.

## How BERT is Used for POS Tagging
BERT can be fine-tuned for specific tasks including POS tagging. The process typically involves:

**Pre-training**: BERT models are pre-trained on a large corpus of text with tasks like masked language modeling (MLM) and next sentence prediction (NSP). This pre-training helps the model understand language deeply.

**Fine-Tuning**: For POS tagging, BERT is fine-tuned on a labeled dataset where each word in a sentence is tagged with its correct part of speech. During fine-tuning, a token-classification layer on top of BERT is trained to predict a POS tag for each token, replacing the output layers used for the original pre-training tasks.

## Hugging Face Transformers
The Hugging Face transformers library provides an accessible way to use pre-trained BERT models and fine-tune them for various tasks including POS tagging.

In [None]:
%pip install transformers

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

model_name = "QCRI/bert-base-multilingual-cased-pos-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
pipeline = TokenClassificationPipeline(model=model, tokenizer=tokenizer)

In [12]:
outputs = pipeline("It is a good book")
# print(outputs)

# Create a new list to store just the word and entity
filtered_outputs = [{'word': item['word'], 'entity': item['entity']} for item in outputs]

# Display the filtered output
print(filtered_outputs)

[{'word': 'It', 'entity': 'PRP'}, {'word': 'is', 'entity': 'VBZ'}, {'word': 'a', 'entity': 'DT'}, {'word': 'good', 'entity': 'JJ'}, {'word': 'book', 'entity': 'NN'}]


In [16]:
sentences = [
    "It is a good book",
    "Can you book the ticket?",
    "That is a trash can",
    "I love fresh air",
    "can you air out the room?"
]

# Process the sentences and print (word, tag) pairs.
# Tuples are used instead of sets so that the word always comes first.
for sentence in sentences:
    outputs = pipeline(sentence)
    filtered_outputs = [(item['word'], item['entity']) for item in outputs]
    print(filtered_outputs)

[('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('book', 'NN')]
[('Can', 'MD'), ('you', 'PRP'), ('book', 'VB'), ('the', 'DT'), ('ticket', 'NN'), ('?', '.')]
[('That', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('tras', 'NN'), ('##h', 'NN'), ('can', 'MD')]
[('I', 'PRP'), ('love', 'VBP'), ('fresh', 'JJ'), ('air', 'NN')]
[('can', 'MD'), ('you', 'PRP'), ('air', 'VB'), ('out', 'IN'), ('the', 'DT'), ('room', 'NN'), ('?', '.')]
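In the third result, the tokenizer has split "trash" into the WordPiece fragments "tras" and "##h", each with its own tag. One possible way to get back to word-level tags is to glue continuation pieces (those starting with "##") onto the preceding piece and keep the tag of the first piece. The helper name `merge_wordpieces` and the "first piece wins" policy are assumptions made for this sketch; other policies, such as majority vote, are equally plausible.

```python
def merge_wordpieces(pieces):
    """Merge '##' continuation pieces into whole words, keeping the first tag.

    `pieces` is a list of (wordpiece, tag) pairs in sentence order.
    """
    merged = []
    for piece, tag in pieces:
        if piece.startswith("##") and merged:
            word, first_tag = merged[-1]
            merged[-1] = (word + piece[2:], first_tag)
        else:
            merged.append((piece, tag))
    return merged


# (wordpiece, tag) pairs taken from the third result above.
pieces = [
    ("That", "DT"), ("is", "VBZ"), ("a", "DT"),
    ("tras", "NN"), ("##h", "NN"), ("can", "MD"),
]
print(merge_wordpieces(pieces))
```

After merging, the sentence has one tag per word again, which makes the BERT output directly comparable to the NLTK and SpaCy results above.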
