# POS-TAGGING APPLICATION

**Part-of-Speech (POS) Tagging** is a fundamental task in Natural Language Processing (NLP) that involves **assigning a grammatical category (or "tag") to each word in a given text.**

**Objective:** To identify the lexical category of each word, such as noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, etc., based on its definition and its context within the sentence.

**How it Works:**

* **Input:** A sequence of words (a sentence or a chunk of text).
* **Output:** A sequence of words, where each word is paired with its corresponding POS tag.
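
A minimal sketch of this input/output contract in plain Python. The helper `pair_words_with_tags` and the hand-picked tags are illustrative assumptions, not part of any library; a real tagger would predict the tags.

```python
from typing import List, Tuple

# A tagged sentence is an ordered list of (word, tag) pairs.
TaggedSentence = List[Tuple[str, str]]


def pair_words_with_tags(words: List[str], tags: List[str]) -> TaggedSentence:
    """Pair each input word with its tag, keeping the original word order."""
    if len(words) != len(tags):
        raise ValueError("every word needs exactly one tag")
    return list(zip(words, tags))


words = ["Dogs", "bark"]
tags = ["NNS", "VBP"]  # chosen by hand for illustration only
tagged = pair_words_with_tags(words, tags)
print(tagged)  # [('Dogs', 'NNS'), ('bark', 'VBP')]
```

The output has the same length and order as the input, which is what lets later steps line tags up with the original words.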

**Example:**

* **Sentence:** "The quick brown fox jumps over the lazy dog."
* **POS Tagged Output:**
    * "The": Determiner (DT)
    * "quick": Adjective (JJ)
    * "brown": Adjective (JJ)
    * "fox": Noun (NN)
    * "jumps": Verb (VBZ)
    * "over": Preposition (IN)
    * "the": Determiner (DT)
    * "lazy": Adjective (JJ)
    * "dog": Noun (NN)
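
Once a sentence is tagged, downstream code usually filters or groups words by tag. The sketch below writes the example above as (word, tag) pairs and groups the words by category; the variable names are illustrative.

```python
from collections import defaultdict

# The tagged example from above, written as (word, tag) pairs.
tagged_sentence = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
    ("dog", "NN"),
]

# Group words by tag, preserving sentence order within each group.
words_by_tag = defaultdict(list)
for word, tag in tagged_sentence:
    words_by_tag[tag].append(word)

print(words_by_tag["JJ"])  # ['quick', 'brown', 'lazy']
print(words_by_tag["NN"])  # ['fox', 'dog']
```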

**Importance/Applications:**

* **Foundation for Higher-Level NLP Tasks:** POS tagging is a crucial preprocessing step for many more complex NLP applications.
* **Word Sense Disambiguation:** Helps to understand the correct meaning of a word that might have multiple senses (e.g., "bank" as a financial institution vs. "bank" as a river bank).
* **Syntactic Parsing:** Essential for building parse trees and understanding the grammatical structure of sentences.
* **Named Entity Recognition (NER):** Proper-noun tags flag candidate names of people, locations, organizations, etc.
* **Machine Translation:** Provides grammatical information that can guide translation.
* **Information Extraction:** Tag patterns (e.g., an adjective followed by a noun) help locate the phrases and facts to extract from text.
* **Text-to-Speech Systems:** Helps determine pronunciation and intonation (e.g., "read" - present vs. past tense).
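
As one illustration of how a downstream task can consume tags, the sketch below picks a pronunciation for "read" from its POS tag. The lookup table, the informal respellings ("reed", "red"), and the function `pronounce` are hypothetical and only meant to show the idea; a real text-to-speech system would use a full pronunciation lexicon.

```python
# Hypothetical lookup: (lowercased word, Penn Treebank tag) -> informal respelling.
HETERONYM_PRONUNCIATIONS = {
    ("read", "VB"): "reed",   # base form: "I want to read"
    ("read", "VBP"): "reed",  # present tense: "I read every day"
    ("read", "VBD"): "red",   # past tense: "I read it yesterday"
    ("read", "VBN"): "red",   # past participle: "I have read it"
}


def pronounce(word: str, tag: str) -> str:
    """Return the respelling for a known (word, tag) pair, else the word itself."""
    return HETERONYM_PRONUNCIATIONS.get((word.lower(), tag), word)


print(pronounce("read", "VBP"))  # reed
print(pronounce("read", "VBD"))  # red
```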

**Challenges:**

* **Ambiguity:** Many words can function as different parts of speech depending on the context (e.g., "book" as a noun vs. "book" as a verb).
* **New Words/Slang:** Models need to be robust enough to handle words not seen during training.
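
One common way to cope with unseen words is to fall back on word shape, such as suffixes. The sketch below is a toy illustration under stated assumptions: the lexicon `KNOWN_WORDS`, the rule list `SUFFIX_RULES`, and the default tag are all made up for this example, and the rules will mistag many real words.

```python
# Toy lexicon of words "seen during training" (illustrative only).
KNOWN_WORDS = {"the": "DT", "dog": "NN", "runs": "VBZ"}

# Suffix heuristics for unknown words, checked in order.
SUFFIX_RULES = [
    ("ing", "VBG"),  # gerund / present participle
    ("ly", "RB"),    # adverb
    ("ed", "VBD"),   # past tense
    ("s", "NNS"),    # plural noun
]


def tag_word(word: str) -> str:
    """Tag a word from the lexicon, else guess from its suffix, else default to noun."""
    lower = word.lower()
    if lower in KNOWN_WORDS:
        return KNOWN_WORDS[lower]
    for suffix, tag in SUFFIX_RULES:
        if lower.endswith(suffix):
            return tag
    return "NN"  # nouns are a common default guess for unknown words


print(tag_word("doomscrolling"))  # VBG
print(tag_word("yeeted"))         # VBD
print(tag_word("rizz"))           # NN
```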

**Common Approaches:**

* **Rule-Based Tagging:** Uses hand-crafted rules based on suffixes, prefixes, and context.
* **Stochastic/Statistical Tagging:** Uses probabilities estimated from how often a word appears with a given tag and how often one tag follows another (e.g., Hidden Markov Models (HMMs), Maximum Entropy Models).
* **Neural Network-Based Tagging:** Uses deep learning models (like RNNs, LSTMs, Transformers) to learn complex patterns from data.
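
To make the statistical approach concrete, here is a minimal sketch of Viterbi decoding for an HMM tagger. All probabilities below are invented for illustration (a real tagger estimates them from an annotated corpus), and the function and variable names are assumptions of this example. It shows how context resolves the ambiguity of "book": a noun after a determiner, a verb after a modal.

```python
TAGS = ["DT", "MD", "NN", "VB"]

# Toy probabilities, made up for this example.
START_P = {"DT": 0.5, "MD": 0.3, "NN": 0.1, "VB": 0.1}
TRANS_P = {  # TRANS_P[previous_tag][next_tag]
    "DT": {"DT": 0.025, "MD": 0.025, "NN": 0.9, "VB": 0.05},
    "MD": {"DT": 0.05, "MD": 0.05, "NN": 0.1, "VB": 0.8},
    "NN": {"DT": 0.1, "MD": 0.4, "NN": 0.2, "VB": 0.3},
    "VB": {"DT": 0.6, "MD": 0.05, "NN": 0.3, "VB": 0.05},
}
EMIT_P = {  # EMIT_P[tag][word]
    "DT": {"the": 1.0},
    "MD": {"will": 1.0},
    "NN": {"book": 0.6, "flight": 0.4},
    "VB": {"book": 0.5, "fly": 0.5},
}


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    # best[i][tag] = (probability of the best path ending in `tag` at word i,
    #                 previous tag on that path)
    best = [{t: (START_P[t] * EMIT_P[t].get(words[0], 0.0), None) for t in TAGS}]
    for i in range(1, len(words)):
        column = {}
        for tag in TAGS:
            emit = EMIT_P[tag].get(words[i], 0.0)
            column[tag] = max(
                (best[i - 1][prev][0] * TRANS_P[prev][tag] * emit, prev)
                for prev in TAGS
            )
        best.append(column)

    # Backtrace from the most probable final tag.
    path = [max(TAGS, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    path.reverse()
    return path


print(viterbi(["the", "book"]))   # ['DT', 'NN']
print(viterbi(["will", "book"]))  # ['MD', 'VB']
```

Plain probabilities keep the sketch short; implementations working on long sentences typically sum log-probabilities instead to avoid numerical underflow.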

In [None]:
import nltk # Import the Natural Language Toolkit (NLTK) library, which is a powerful tool for working with human language data.

# Download the 'averaged_perceptron_tagger' resource. This is a pre-trained statistical model
# used by NLTK for Part-of-Speech (POS) tagging. It assigns grammatical categories (like noun, verb, adjective)
# to words in a text. This resource needs to be downloaded once to be available for use.
# Note: recent NLTK releases may instead expect 'averaged_perceptron_tagger_eng'; if a LookupError
# names that resource, download it as well.
nltk.download('averaged_perceptron_tagger')

# Download the 'punkt' tokenizer models. This resource contains pre-trained models for tokenizing
# text into sentences and words. Tokenization is the process of breaking down a text into smaller units.
# 'punkt' is essential for the `word_tokenize` function used below. This also needs to be downloaded once.
# Note: recent NLTK releases may instead expect 'punkt_tab'; if a LookupError names it, download it as well.
nltk.download('punkt')

# Define a sample text string.
text_sentence = "I will buy ice cream."

# Tokenize the text_sentence into individual words. The `nltk.word_tokenize()` function splits the string
# into a list of words and punctuation marks.
# For example, "I will buy ice cream." becomes ['I', 'will', 'buy', 'ice', 'cream', '.'].
words_tokenized = nltk.word_tokenize(text_sentence)

# Perform Part-of-Speech (POS) tagging on the tokenized words.
# The `nltk.pos_tag()` function takes a list of words and returns a list of tuples,
# where each tuple contains a word and its corresponding POS tag.
# For example, ['I', 'will', 'buy', 'ice', 'cream', '.'] would ideally become
# [('I', 'PRP'), ('will', 'MD'), ('buy', 'VB'), ('ice', 'NN'), ('cream', 'NN'), ('.', '.')]
# (PRP: personal pronoun, MD: modal verb, VB: verb in base form, NN: noun, .: sentence-final punctuation)
# The tagger is statistical and can err: in the run shown below it labels 'ice' as JJ (adjective).
pos_tagged_words = nltk.pos_tag(words_tokenized)

# `pos_tagged_words` is a list of (word, tag) tuples that can be printed
# or passed on to further NLP tasks.
print(pos_tagged_words)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[('I', 'PRP'),
 ('will', 'MD'),
 ('buy', 'VB'),
 ('ice', 'JJ'),
 ('cream', 'NN'),
 ('.', '.')]
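
To make the Penn Treebank tags in this output easier to read, the sketch below maps each tag to a short description. The dictionary `PENN_TAG_DESCRIPTIONS` and the helper `describe_tags` are illustrative and cover only the tags that appear above.

```python
# Short descriptions for the Penn Treebank tags seen in the output above.
PENN_TAG_DESCRIPTIONS = {
    "PRP": "personal pronoun",
    "MD": "modal",
    "VB": "verb, base form",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    ".": "sentence-final punctuation",
}

# The tagged output printed above, copied as-is.
tagged_output = [
    ("I", "PRP"), ("will", "MD"), ("buy", "VB"),
    ("ice", "JJ"), ("cream", "NN"), (".", "."),
]


def describe_tags(tagged):
    """Attach a human-readable description to each (word, tag) pair."""
    return [
        (word, tag, PENN_TAG_DESCRIPTIONS.get(tag, "unknown tag"))
        for word, tag in tagged
    ]


for word, tag, description in describe_tags(tagged_output):
    print(f"{word:<6} {tag:<4} {description}")
```

Reading the descriptions makes the questionable tag easy to spot: "ice" is labeled an adjective, although in "ice cream" it is more naturally read as a noun.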

In [4]:
# Install the model first (run once in a terminal): python -m spacy download pt_core_news_sm

# Import the spaCy library, which is an open-source library for advanced Natural Language Processing (NLP) in Python.
# spaCy is known for its efficiency, speed, and production-ready NLP models.
import spacy 

# Load a pre-trained spaCy language model for Portuguese.
# 'pt_core_news_sm' refers to a small (sm) Portuguese (pt) model trained on news text (core_news).
# This model includes components for tokenization, POS tagging, dependency parsing, named entity recognition (NER), etc.
# The loaded model object, assigned to the variable 'nlp_pipeline_pt', is essentially the "pipeline" for processing Portuguese text.
nlp_pipeline_pt = spacy.load('pt_core_news_sm')

# Process a given text string using the loaded spaCy model.
# When a string is passed to the nlp_pipeline_pt object, spaCy processes it through its various components
# (tokenizer, tagger, parser, etc.) and returns a 'Doc' object.
# The 'Doc' object is a container for all the processed information about the text.
# Note: this sample sentence is English, while the loaded model was trained on Portuguese.
text_to_process = "They are solving a mystery!"
doc_object = nlp_pipeline_pt(text_to_process)

# Iterate through each 'token' (word or punctuation mark) in the processed 'Doc' object.
# For each token, extract its original text ('token.text') and its Part-of-Speech (POS) tag ('token.pos_').
# The POS tag represents the grammatical category of the word (e.g., NOUN, VERB, ADJ, PRON).
# The results are collected into a list of tuples, where each tuple is (word, POS_tag).
# This provides a quick way to see the POS tagging performed by the spaCy model.
pos_tagged_list = [(token.text, token.pos_) for token in doc_object]

# With a model trained for the language of the text (e.g., spaCy's English 'en_core_web_sm'),
# 'pos_tagged_list' would contain something like:
# [('They', 'PRON'), ('are', 'AUX'), ('solving', 'VERB'), ('a', 'DET'), ('mystery', 'NOUN'), ('!', 'PUNCT')]
# A Portuguese model applied to English text will mistag most tokens, as the printed output shows.
print(pos_tagged_list)

[('They', 'PROPN'), ('are', 'PROPN'), ('solving', 'PROPN'), ('a', 'PRON'), ('mystery', 'VERB'), ('!', 'PUNCT')]
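
The printed tags differ from the ones anticipated in the code comment, most likely because a Portuguese model was applied to an English sentence. The sketch below compares the two lists token by token. The "expected" tags are the ones from the comment in the cell, not the output of a real English model, and the variable names are illustrative.

```python
# Tags anticipated in the code comment above (not produced by running a model).
expected = [
    ("They", "PRON"), ("are", "AUX"), ("solving", "VERB"),
    ("a", "DET"), ("mystery", "NOUN"), ("!", "PUNCT"),
]

# Tags actually printed by the Portuguese pipeline above.
actual = [
    ("They", "PROPN"), ("are", "PROPN"), ("solving", "PROPN"),
    ("a", "PRON"), ("mystery", "VERB"), ("!", "PUNCT"),
]

# Collect (word, expected_tag, actual_tag) for every disagreement.
mismatches = [
    (word, exp_tag, act_tag)
    for (word, exp_tag), (_, act_tag) in zip(expected, actual)
    if exp_tag != act_tag
]
matches = len(expected) - len(mismatches)

print(f"{matches} of {len(expected)} tags match")  # 1 of 6 tags match
for word, exp_tag, act_tag in mismatches:
    print(f"{word}: expected {exp_tag}, got {act_tag}")
```

Only the punctuation mark is tagged as anticipated, which suggests matching the model to the language of the text before trusting the tags.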
