# POS-TAGGING APPLICATION

**Part-of-Speech (POS) Tagging** is a fundamental task in Natural Language Processing (NLP) that involves **assigning a grammatical category (or "tag") to each word in a given text.**

**Objective:** To identify the lexical category of each word, such as noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, etc., based on its definition and its context within the sentence.

**How it Works:**

* **Input:** A sequence of words (a sentence or a chunk of text).
* **Output:** A sequence of words, where each word is paired with its corresponding POS tag.

**Example:**

* **Sentence:** "The quick brown fox jumps over the lazy dog."
* **POS Tagged Output:**
    * "The": Determiner (DT)
    * "quick": Adjective (JJ)
    * "brown": Adjective (JJ)
    * "fox": Noun (NN)
    * "jumps": Verb (VBZ)
    * "over": Preposition (IN)
    * "the": Determiner (DT)
    * "lazy": Adjective (JJ)
    * "dog": Noun (NN)
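
The tagged output above maps naturally onto a list of (word, tag) pairs. The following is a minimal sketch in plain Python: the tags are copied by hand from the example rather than produced by a tagger, and the names `tagged_sentence` and `words_with_tag` are illustrative.

```python
# Tags copied by hand from the example above (Penn Treebank tagset); no tagger is run here.
tagged_sentence = [
    ("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]


def words_with_tag(tagged, tag):
    """Return the words whose POS tag equals `tag`, in sentence order."""
    return [word for word, word_tag in tagged if word_tag == tag]


adjectives = words_with_tag(tagged_sentence, "JJ")
nouns = words_with_tag(tagged_sentence, "NN")
print(adjectives)  # ['quick', 'brown', 'lazy']
print(nouns)       # ['fox', 'dog']
```

Keeping each word next to its tag makes later steps, such as filtering by category, a simple pass over the pairs.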

**Importance/Applications:**

* **Foundation for Higher-Level NLP Tasks:** POS tagging is a crucial preprocessing step for many more complex NLP applications.
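* **Word Sense Disambiguation:** Narrows the candidate senses of a word by fixing its part of speech (e.g., "book" as a noun, something to read, vs. "book" as a verb, to reserve). Senses that share a part of speech (e.g., "bank" as a financial institution vs. "bank" as a river bank) still require additional context.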
* **Syntactic Parsing:** Essential for building parse trees and understanding the grammatical structure of sentences.
* **Named Entity Recognition (NER):** Helps to identify proper nouns, locations, organizations, etc.
* **Machine Translation:** Provides grammatical information that can guide translation.
* **Information Extraction:** Aids in extracting specific data from text.
* **Text-to-Speech Systems:** Helps determine pronunciation and intonation (e.g., "read" - present vs. past tense).

**Challenges:**

* **Ambiguity:** Many words can function as different parts of speech depending on the context (e.g., "book" as a noun vs. "book" as a verb).
* **New Words/Slang:** Models need to be robust enough to handle words not seen during training.
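
To make the ambiguity challenge concrete, here is a toy sketch that picks a tag for an ambiguous word by looking only at the preceding word. The word lists and the function `guess_tag` are assumptions made for illustration; real taggers use far richer context.

```python
# Toy lexicon of ambiguous words and their possible tags (illustrative, not exhaustive).
AMBIGUOUS = {"book": ("NN", "VB"), "play": ("NN", "VB")}
DETERMINERS = {"a", "an", "the", "this", "that", "my", "your"}
VERB_CUES = {"to", "will", "can", "must", "should", "i", "we", "you", "they"}


def guess_tag(tokens, index):
    """Guess the tag of tokens[index] from the previous token; None if the word is not ambiguous."""
    word = tokens[index].lower()
    if word not in AMBIGUOUS:
        return None
    previous = tokens[index - 1].lower() if index > 0 else ""
    if previous in DETERMINERS:
        return "NN"  # "the book" -> noun
    if previous in VERB_CUES:
        return "VB"  # "will book" -> verb
    return AMBIGUOUS[word][0]  # no cue: fall back to the first listed tag


noun_reading = guess_tag("I read the book".split(), 3)
verb_reading = guess_tag("I will book a flight".split(), 2)
print(noun_reading, verb_reading)  # NN VB
```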

**Common Approaches:**

* **Rule-Based Tagging:** Uses hand-crafted rules based on suffixes, prefixes, and context.
* **Stochastic/Statistical Tagging:** Uses probabilities estimated from how frequently a word appears with a certain tag and how frequently one tag follows another (e.g., Hidden Markov Models (HMMs) and Maximum Entropy models).
* **Neural Network-Based Tagging:** Uses deep learning models (like RNNs, LSTMs, Transformers) to learn complex patterns from data.
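
As a rough illustration of the statistical idea, the sketch below implements a most-frequent-tag baseline: it counts how often each word appears with each tag and assigns the most common one. It deliberately ignores tag-to-tag transitions, so it is simpler than an HMM. The training sentences, the `default_tag` fallback, and the function names are all made up for this example.

```python
from collections import Counter, defaultdict

# Tiny hand-labelled corpus (illustrative only).
TRAINING_DATA = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("old", "JJ")],
    [("they", "PRP"), ("book", "VBP"), ("flights", "NNS")],
    [("a", "DT"), ("good", "JJ"), ("book", "NN")],
]


def train(sentences):
    """Count how often each (lowercased) word occurs with each tag."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, word_tag in sentence:
            counts[word.lower()][word_tag] += 1
    return counts


def tag_tokens(tokens, counts, default_tag="NN"):
    """Assign each token its most frequent training tag; unseen words get default_tag."""
    tagged = []
    for token in tokens:
        tag_counts = counts.get(token.lower())
        best = tag_counts.most_common(1)[0][0] if tag_counts else default_tag
        tagged.append((token, best))
    return tagged


model = train(TRAINING_DATA)
result = tag_tokens(["the", "old", "book", "sleeps"], model)
print(result)
```

In this toy corpus "book" appears twice as a noun and once as a verb, so the baseline always tags it NN, and the unseen word "sleeps" falls back to the default tag. Both behaviours show why context and unknown-word handling matter.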

In [None]:
import nltk # Import the Natural Language Toolkit (NLTK) library, which is a powerful tool for working with human language data.

# Download the 'averaged_perceptron_tagger' resource. This is a pre-trained statistical model
# used by NLTK for Part-of-Speech (POS) tagging. It assigns grammatical categories (like noun, verb, adjective)
# to words in a text. This resource needs to be downloaded once to be available for use.
nltk.download('averaged_perceptron_tagger')

# Download the 'punkt' tokenizer models. This resource contains pre-trained models for tokenizing
# text into sentences and words. Tokenization is the process of breaking down a text into smaller units.
# 'punkt' is essential for the `word_tokenize` function used below. This also needs to be downloaded once.
nltk.download('punkt')

# Define a sample text string.
text_sentence = "I will buy ice cream."

# Tokenize the text_sentence into individual words. The `nltk.word_tokenize()` function splits the string
# into a list of words and punctuation marks.
# For example, "I will buy ice cream." becomes ['I', 'will', 'buy', 'ice', 'cream', '.'].
words_tokenized = nltk.word_tokenize(text_sentence)

# Perform Part-of-Speech (POS) tagging on the tokenized words.
# `nltk.pos_tag()` takes a list of tokens and returns a list of (word, tag) tuples
# that use Penn Treebank tags, for example:
# [('I', 'PRP'), ('will', 'MD'), ('buy', 'VB'), ('ice', 'NN'), ('cream', 'NN'), ('.', '.')]
# (PRP: personal pronoun, MD: modal verb, VB: verb base form, NN: noun, .: sentence-final punctuation)
# The tagger is statistical, so individual tags can differ from this idealized result
# (in the recorded output of this cell, 'ice' is tagged JJ, adjective).
pos_tagged_words = nltk.pos_tag(words_tokenized)

# Print the list of (word, tag) tuples.
print(pos_tagged_words)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[('I', 'PRP'),
 ('will', 'MD'),
 ('buy', 'VB'),
 ('ice', 'JJ'),
 ('cream', 'NN'),
 ('.', '.')]
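
Penn Treebank tags are terse, so a small lookup table can make tagger output easier to read. The sketch below applies one to the output shown above. The dictionary covers only the tags that appear in this output, and `describe_tags` is a hypothetical helper, not part of NLTK.

```python
# Human-readable names for the Penn Treebank tags seen in the output above (partial list).
PENN_TAG_NAMES = {
    "PRP": "personal pronoun",
    "MD": "modal",
    "VB": "verb, base form",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    ".": "sentence-final punctuation",
}


def describe_tags(tagged_words):
    """Replace each tag with its description; unknown tags are labelled as such."""
    return [(word, PENN_TAG_NAMES.get(tag, "unknown tag")) for word, tag in tagged_words]


# Output copied from the cell above.
tagged_output = [("I", "PRP"), ("will", "MD"), ("buy", "VB"),
                 ("ice", "JJ"), ("cream", "NN"), (".", ".")]

described = describe_tags(tagged_output)
for word, description in described:
    print(f"{word!r}: {description}")
```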

In [7]:
#python -m spacy download pt_core_news_sm

# Import the spaCy library, which is an open-source library for advanced Natural Language Processing (NLP) in Python.
# spaCy is known for its efficiency, speed, and production-ready NLP models.
import spacy 

# Load a pre-trained spaCy language model for Portuguese.
# 'pt_core_news_sm' refers to a small (sm) Portuguese (pt) model trained on news text (core_news).
# This model includes components for tokenization, POS tagging, dependency parsing, named entity recognition (NER), etc.
# The loaded model object, assigned to the variable 'nlp_pipeline_pt', is essentially the "pipeline" for processing Portuguese text.
nlp_pipeline_pt = spacy.load('pt_core_news_sm')

# Process a given text string using the loaded spaCy model.
# When a string is passed to the nlp_pipeline_pt object, spaCy processes it through its various components
# (tokenizer, tagger, parser, etc.) and returns a 'Doc' object.
# The 'Doc' object is a container for all the processed information about the text.
text_to_process = "O mistério foi resolvido antes do esperado!"
doc_object = nlp_pipeline_pt(text_to_process)

# Iterate through each 'token' (word or punctuation mark) in the processed 'Doc' object.
# For each token, extract its original text ('token.text') and its Part-of-Speech (POS) tag ('token.pos_').
# The POS tag represents the grammatical category of the word (e.g., NOUN, VERB, ADJ, PRON).
# The results are collected into a list of tuples, where each tuple is (word, POS_tag).
# This provides a quick way to see the POS tagging performed by the spaCy model.
pos_tagged_list = [(token.text, token.pos_) for token in doc_object]

# For the sentence above, 'pos_tagged_list' contains (word, tag) pairs such as:
# [('O', 'DET'), ('mistério', 'NOUN'), ('foi', 'AUX'), ('resolvido', 'VERB'), ...]
# The `pos_` attribute holds coarse-grained Universal POS tags.
print(pos_tagged_list)

[('O', 'DET'), ('mistério', 'NOUN'), ('foi', 'AUX'), ('resolvido', 'VERB'), ('antes', 'ADV'), ('do', 'PRON'), ('esperado', 'VERB'), ('!', 'PUNCT')]


In [8]:
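from stanza import Pipeline

# Build a Portuguese pipeline; the required models are downloaded if they are not available locally.
PLN = Pipeline(lang='pt')

# Process the text. The result is a document made of sentences, each holding its words.
doc = PLN("Existem muitas possibilidades. Porém, não evidentes.")

# Collect (word, universal POS tag) pairs for every word in every sentence.
tags = [(word.text, word.upos) for sent in doc.sentences for word in sent.words]

print(tags)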

2025-06-02 08:23:54 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2025-06-02 08:23:55 INFO: Downloaded file to C:\Users\Felipe\stanza_resources\resources.json



2025-06-02 08:25:50 INFO: Loading these models for language: pt (Portuguese):
| Processor    | Package         |
----------------------------------
| tokenize     | bosque          |
| mwt          | bosque          |
| pos          | bosque_charlm   |
| lemma        | bosque_nocharlm |
| constituency | cintil_charlm   |
| depparse     | bosque_charlm   |

2025-06-02 08:25:50 INFO: Using device: cpu
2025-06-02 08:25:50 INFO: Loading: tokenize
2025-06-02 08:25:53 INFO: Loading: mwt
2025-06-02 08:25:53 INFO: Loading: pos
2025-06-02 08:25:58 INFO: Loading: lemma
2025-06-02 08:25:59 INFO: Loading: constituency
2025-06-02 08:26:00 INFO: Loading: depparse
2025-06-02 08:26:01 INFO: Done loading processors!


[('Existem', 'VERB'), ('muitas', 'DET'), ('possibilidades', 'NOUN'), ('.', 'PUNCT'), ('Porém', 'ADV'), (',', 'PUNCT'), ('não', 'ADV'), ('evidentes', 'ADJ'), ('.', 'PUNCT')]


In [10]:
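from textblob import TextBlob

# Note: TextBlob's default POS tagger is trained on English, so the Penn Treebank tags it assigns
# to this Portuguese sentence are unreliable (e.g., the preposition 'em' is tagged VBD below).
blob = TextBlob("Só vai ser resolvido em 20 anos.")

blob.tags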

[('Só', 'NNP'),
 ('vai', 'NNP'),
 ('ser', 'NN'),
 ('resolvido', 'NN'),
 ('em', 'VBD'),
 ('20', 'CD'),
 ('anos', 'NNS')]

In [None]:
# Information extraction

import nltk
from sklearn.datasets import fetch_20newsgroups

# Ensure the necessary NLTK resources are downloaded. These models are required for tokenization and POS tagging.
nltk.download('punkt') # Downloads the 'punkt' tokenizer models, used by word_tokenize.
nltk.download('averaged_perceptron_tagger') # Downloads the pre-trained POS tagger model.

def extract_nouns(text):
    """
    Extracts nouns from a given text using NLTK's word tokenization and Part-of-Speech (POS) tagging.

    Args:
        text (str): The input text from which to extract nouns.

    Returns:
        list: A list of words identified as nouns.
    """
    # Tokenize the input text into a list of words.
    tokens = nltk.word_tokenize(text)

    # Perform Part-of-Speech tagging on the tokenized words.
    # This assigns a grammatical tag (e.g., 'NN', 'VB') to each word.
    tags = nltk.pos_tag(tokens)

    # Define the POS tags corresponding to nouns in the Penn Treebank tagset.
    # NN: Noun, singular or mass (e.g., 'table', 'water')
    # NNS: Noun, plural (e.g., 'tables', 'waters')
    # NNP: Proper noun, singular (e.g., 'John', 'London')
    # NNPS: Proper noun, plural (e.g., 'Americans', 'Russians')
    noun_tags = ["NN", "NNS", "NNP", "NNPS"]

    # Filter the words based on their POS tags to include only nouns.
    # List comprehension pattern: [<output expression> for <item> in <iterable> if <optional condition>]
    nouns = [word for word, tag in tags if tag in noun_tags]
    return nouns

# Load the 20 Newsgroups dataset.
# subset='all' fetches both the training and test sets.
# remove=('headers', 'footers', 'quotes') strips common boilerplate from each post,
# leaving the body text for analysis.
newsgroups_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# Select a sample text from the dataset.
# newsgroups_data.data[10] accesses the 11th document (Python lists are 0-indexed).
# .split("\n")[3] attempts to get the 4th line of that document.
# This approach might yield an empty or uninformative line, depending on the document and the cleaning.
# It is more robust to select the entire document or to check that the chosen line is not empty.

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
import nltk

# (Assuming nltk.download('punkt') and nltk.download('averaged_perceptron_tagger') have been run previously
# to ensure the necessary resources for tokenization and POS tagging are available.)

def find_verb_verb_sequences(text):
    """
    Identifies and returns sequences of two consecutive verbs in a given text.
    This function can serve as a simple heuristic for grammar checking: two adjacent main verbs
    can signal an error in English (e.g., "I want go"). Grammatical sequences such as "is going"
    are also matched, so the results are only candidates for review.

    Args:
        text (str): The input text string to analyze.

    Returns:
        list: A list of tuples, where each tuple contains two consecutive words
              that were identified as verbs.
    """
    # 1. Tokenize the input text into individual words and punctuation marks.
    # Example: "I want go there." -> ['I', 'want', 'go', 'there', '.']
    tokens = nltk.word_tokenize(text)

    # 2. Perform Part-of-Speech (POS) tagging on the tokenized words.
    # This assigns a grammatical tag (e.g., 'VB', 'NNP') to each word.
    # Example: ['I', 'want', 'go', 'there', '.'] -> [('I', 'PRP'), ('want', 'VBP'), ('go', 'VB'), ('there', 'RB'), ('.', '.')]
    tags = nltk.pos_tag(tokens)

    # 3. Identify consecutive verb sequences using a list comprehension.
    # This comprehension iterates through the 'tags' list.
    # `range(len(tags)-1)`: This ensures the loop goes up to the second-to-last element,
    # because we are always looking at `tags[i]` and `tags[i+1]`.
    # `tags[i][1].startswith("VB")`: Checks if the POS tag of the current word (tags[i][1]) starts with "VB".
    #    "VB" is the prefix for all verb tags in the Penn Treebank tagset (e.g., VB, VBD, VBG, VBN, VBP, VBZ).
    # `tags[i+1][1].startswith("VB")`: Checks if the POS tag of the *next* word (tags[i+1][1]) also starts with "VB".
    # If both conditions are true, then:
    # `(tags[i][0], tags[i+1][0])`: A tuple containing the actual words (tags[i][0] is the word for the current tag,
    #    tags[i+1][0] is the word for the next tag) is added to the `verb_verb` list.
    verb_verb = [(tags[i][0], tags[i+1][0]) #list comprehension
                 for i in range(len(tags)-1)
                 if tags[i][1].startswith("VB") and tags[i+1][1].startswith("VB")]

    # 4. Return the list of identified consecutive verb sequences.
    return verb_verb

# Define a sample text string to test the function.
sample_text = "I want go there." # This sentence has a common grammatical error: "want go" (should be "want to go" or "wanna go").

# Call the function with the sample text and print the result.
# Expected Output: [('want', 'go')] because 'want' is a verb and 'go' is a verb.
print(find_verb_verb_sequences(sample_text))

# Another example:
# sample_text_2 = "They plan to eat pizza. We will finish soon. He likes swimming."
# print(find_verb_verb_sequences(sample_text_2))
# "plan to eat" and "will finish" should not be flagged: "to" is tagged TO and modals are tagged MD,
# and neither tag starts with "VB". "likes swimming" may be flagged if the tagger labels
# 'swimming' as VBG, which would be a false positive because the sentence is grammatical.

[('want', 'go')]
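
One way to reduce false positives from the verb-verb heuristic is to skip pairs whose first word is an auxiliary. The sketch below works on already-tagged input so it runs without NLTK. The tag sequences are written by hand as an assumption about what a Penn Treebank tagger would return, and the auxiliary list and function name are illustrative.

```python
# Forms of "be", "have", and "do" that commonly precede another verb in grammatical English.
AUXILIARIES = {"be", "am", "is", "are", "was", "were", "been", "being",
               "have", "has", "had", "do", "does", "did"}


def verb_pairs_excluding_auxiliaries(tagged):
    """Return adjacent (verb, verb) word pairs, skipping pairs that start with an auxiliary."""
    pairs = []
    for (word, tag), (next_word, next_tag) in zip(tagged, tagged[1:]):
        both_verbs = tag.startswith("VB") and next_tag.startswith("VB")
        if both_verbs and word.lower() not in AUXILIARIES:
            pairs.append((word, next_word))
    return pairs


# Hand-written tag sequences (assumed tagger output).
tagged_error = [("I", "PRP"), ("want", "VBP"), ("go", "VB"), ("there", "RB"), (".", ".")]
tagged_ok = [("She", "PRP"), ("is", "VBZ"), ("going", "VBG"), ("home", "NN"), (".", ".")]

flagged_error = verb_pairs_excluding_auxiliaries(tagged_error)
flagged_ok = verb_pairs_excluding_auxiliaries(tagged_ok)
print(flagged_error)  # [('want', 'go')]
print(flagged_ok)     # []
```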


In [24]:
def disambiguate_word(word, context):
    """Attempts to disambiguate the meaning of a word based on its context.

    This is a simplified example that uses a dictionary of ambiguous words and their
    possible meanings with associated keywords.  It checks if any of the keywords
    for a particular meaning are present in the context.  This is a very basic
    Word Sense Disambiguation (WSD) approach and would not be robust in real-world scenarios.

    Args:
        word (str): The ambiguous word to disambiguate.
        context (str): The surrounding text providing context for the word.

    Returns:
        str: The disambiguated meaning of the word, or None if the word is not in the dictionary
             or no matching context is found.
    """
    word_meanings = {
        'manga': [
            {'meaning': 'fruta', 'keywords': ['comer', 'doce', 'fruta', 'vitamina', 'café da manhã']},  # Added 'café da manhã' for contexto1
            {'meaning': 'parte da roupa', 'keywords': ['vestir', 'camisa', 'tecido', 'costura']}
        ],
        # ... (other ambiguous words can be added here)
    }
    if word not in word_meanings:
        return None
    meanings = word_meanings[word]
    for meaning_data in meanings:
        for keyword in meaning_data['keywords']:
            if keyword in context:
                return meaning_data['meaning']
    return None

contexto1 = "Eu gosto de comer manga no café da manhã."
contexto2 = "A manga da minha camisa rasgou."
print(disambiguate_word('manga', contexto1))  # Expected: 'fruta'
print(disambiguate_word('manga', contexto2))  # Expected: 'parte da roupa'

fruta
parte da roupa
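
The keyword check above uses substring matching, so a keyword such as 'vestir' would also match inside 'investir'. A possible refinement, sketched here under simplifying assumptions, is to compare whole tokens and pick the sense with the largest keyword overlap. The function `disambiguate_by_overlap` is hypothetical, and the multi-word keyword 'café da manhã' is reduced to the single token 'café' so that token matching works.

```python
import re

# Keyword sets per sense (same idea as the dictionary above, with single-token keywords).
WORD_MEANINGS = {
    "manga": {
        "fruta": {"comer", "doce", "fruta", "vitamina", "café"},
        "parte da roupa": {"vestir", "camisa", "tecido", "costura"},
    },
}


def disambiguate_by_overlap(word, context):
    """Return the sense whose keywords overlap most with the context tokens, or None."""
    senses = WORD_MEANINGS.get(word)
    if senses is None:
        return None
    # Split the context into lowercase word tokens (\w is Unicode-aware for str patterns).
    context_tokens = set(re.findall(r"\w+", context.lower()))
    scores = {meaning: len(keywords & context_tokens) for meaning, keywords in senses.items()}
    best_meaning = max(scores, key=scores.get)
    return best_meaning if scores[best_meaning] > 0 else None


sense_1 = disambiguate_by_overlap("manga", "Eu gosto de comer manga no café da manhã.")
sense_2 = disambiguate_by_overlap("manga", "A manga da minha camisa rasgou.")
sense_3 = disambiguate_by_overlap("manga", "Vou investir em manga.")
print(sense_1, sense_2, sense_3, sep=" | ")
```

The third context contains no whole-token keyword, so the sketch returns `None` instead of being misled by 'vestir' inside 'investir'.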


In [25]:
# Lemmatization

import nltk
from nltk.stem import WordNetLemmatizer # Import the WordNetLemmatizer for lemmatization.
from nltk.corpus import wordnet # Import the WordNet corpus, which is a lexical database for English.

# Download the 'wordnet' resource. This is necessary for the WordNetLemmatizer to function correctly,
# as it uses WordNet to look up lemmas. This needs to be downloaded once.
nltk.download('wordnet')
# Ensure 'punkt' and 'averaged_perceptron_tagger' are also downloaded for tokenization and POS tagging.
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

def get_wordnet_pos(treebank_tag):
    """
    Converts a Penn Treebank POS tag to a WordNet POS tag.
    The WordNetLemmatizer requires specific WordNet POS tags (NOUN, VERB, ADJ, ADV)
    for accurate lemmatization, while nltk.pos_tag returns Penn Treebank tags.

    Args:
        treebank_tag (str): The POS tag returned by nltk.pos_tag (e.g., 'NN', 'VBP', 'JJ').

    Returns:
        str: The corresponding WordNet POS tag (e.g., wordnet.NOUN, wordnet.VERB).
        Defaults to wordnet.NOUN if no specific mapping is found.
    """
    if treebank_tag.startswith('J'): # 'J' for Adjective (e.g., JJ, JJR, JJS)
        return wordnet.ADJ
    elif treebank_tag.startswith('V'): # 'V' for Verb (e.g., VB, VBD, VBG, VBN, VBP, VBZ)
        return wordnet.VERB
    elif treebank_tag.startswith('N'): # 'N' for Noun (e.g., NN, NNS, NNP, NNPS)
        return wordnet.NOUN
    elif treebank_tag.startswith('R'): # 'R' for Adverb (e.g., RB, RBR, RBS)
        return wordnet.ADV
    else: # Default to Noun if the tag is not recognized or doesn't start with a specific letter.
        return wordnet.NOUN

# Initialize the WordNet Lemmatizer.
lemmatizer_instance = WordNetLemmatizer()

# Define a sample text and tokenize it into words.
sample_text = "The flies are flying."
text_tokens = nltk.word_tokenize(sample_text)

# Perform Part-of-Speech tagging on the tokens.
# This is crucial because lemmatization accuracy heavily depends on knowing the word's POS.
pos_tags = nltk.pos_tag(text_tokens)

# Lemmatize each word in the tokenized and tagged list.
# For each (word, tag) pair:
# 1. 'word' is the actual word.
# 2. 'get_wordnet_pos(tag)' converts the NLTK POS tag to a WordNet-compatible POS tag.
# 3. 'lemmatizer_instance.lemmatize()' then finds the base form (lemma) of the word,
#    using the provided WordNet POS tag for context.
lemmatized_words_list = [lemmatizer_instance.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags]

# Print the list of lemmatized words.
print(lemmatized_words_list)

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['The', 'fly', 'be', 'fly', '.']


In Natural Language Processing (NLP), both **lemmatization** and **stemming** are techniques used to reduce words to their base or root form. This process, known as **word normalization**, helps in reducing redundancy and improving the accuracy of various NLP tasks like information retrieval, text classification, and sentiment analysis. However, they achieve this goal in different ways and with different levels of linguistic sophistication.

-----

### Lemmatization

**Lemmatization** is a more sophisticated and linguistically informed process. It aims to reduce a word to its **lemma**, which is its canonical or dictionary form. This means that the output of lemmatization is always a valid word that can be found in a dictionary. It achieves this by considering the word's morphological analysis and its Part-of-Speech (POS) tag.

For example:

  * "running" would become "run"
  * "better" would become "good" (as "better" is the comparative form of "good")
  * "flies" (verb) would become "fly"
  * "flies" (noun, plural of "fly") would become "fly"

Lemmatization typically requires a lexicon (dictionary) and morphological analysis (understanding word structure and derivations). This makes it more accurate but also computationally more expensive.
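
The role of the lexicon and the POS tag can be sketched with a toy lookup table keyed by (word, part of speech). The entries and the function `lemmatize_with_lexicon` are assumptions made for illustration; a real lemmatizer relies on a full dictionary and morphological rules.

```python
# Toy lexicon: (surface form, coarse POS) -> lemma. Real lexicons are far larger.
LEMMA_LEXICON = {
    ("running", "VERB"): "run",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
    ("flies", "VERB"): "fly",
    ("flies", "NOUN"): "fly",
}


def lemmatize_with_lexicon(word, pos):
    """Look up the lemma for (word, pos); return the word unchanged if it is not listed."""
    return LEMMA_LEXICON.get((word.lower(), pos), word)


print(lemmatize_with_lexicon("better", "ADJ"))   # good
print(lemmatize_with_lexicon("better", "ADV"))   # well
print(lemmatize_with_lexicon("table", "NOUN"))   # table
```

The same surface form "better" maps to different lemmas depending on its part of speech, which is why lemmatization benefits from POS tagging.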

-----

### Stemming

**Stemming** is a more crude and heuristic process. It primarily involves **removing suffixes** from words to get to a "stem" or "root form." The stem is often not a valid word itself. It works by applying a set of rules, often regular expressions, to chop off word endings.

For example:

  * "running" might become "run" (if the rules remove "ing" and then reduce the doubled "nn")
  * "connection" might become "connect" (if the rule removes "ion")
  * "flies" might become "fli" (if the rule just removes "es")

Stemming algorithms (like the Porter Stemmer or Snowball Stemmer) are typically faster and simpler to implement than lemmatization, as they don't require external lexical resources. However, their output might not always be linguistically correct, leading to "over-stemming" (removing too much of the word) or "under-stemming" (not removing enough).
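
A suffix-stripping stemmer can be sketched in a few lines. The rule list below is invented for this example and is much cruder than the Porter or Snowball algorithms, but it reproduces the behaviour described above, including a case of over-stemming.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins. Illustrative only.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ion", ""), ("es", ""), ("s", "")]


def simple_stem(word, min_stem_length=2):
    """Strip the first matching suffix, keeping at least min_stem_length leading characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            stem = word[: len(word) - len(suffix)] + replacement
            # Reduce a doubled final consonant, e.g. "runn" -> "run" (but keep "ll", "ss", "zz").
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


for example in ["running", "connection", "flies", "onion"]:
    print(example, "->", simple_stem(example))
```

Here "flies" becomes the non-word "fli", and "onion" is over-stemmed to "on", which conflates it with an unrelated word.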

-----

### Key Differences:

| Feature           | Lemmatization                                      | Stemming                                           |
| :---------------- | :------------------------------------------------- | :------------------------------------------------- |
| **Output** | A **valid word** (lemma/dictionary form).           | Often a **root or "stem"** that may not be a real word. |
| **Approach** | Linguistically informed; uses vocabulary and morphological analysis. | Rule-based; chops off prefixes/suffixes heuristically. |
| **Accuracy** | Generally **more accurate** and reliable.          | Less accurate; can lead to over-stemming or under-stemming. |
| **POS Tagging** | **Requires or benefits significantly** from POS tagging for better accuracy. | Does **not typically use** POS tagging.         |
| **Speed/Complexity** | Slower and more computationally intensive.        | Faster and less computationally intensive.         |
| **Language Dependency** | Highly language-dependent (needs a lexicon for each language). | Can be language-dependent, but simpler rules can be generalized. |
| **Use Cases** | When linguistic accuracy is crucial (e.g., machine translation, complex question answering). | When speed and basic normalization are sufficient (e.g., information retrieval index, large-scale text analysis where some errors are acceptable). |
| **Example** | "am", "are", "is" $\rightarrow$ "be"<br>"better" $\rightarrow$ "good"<br>"flies" (noun) $\rightarrow$ "fly" | "am", "are", "is" are not merged into "be" (the exact stems depend on the stemmer)<br>"better" $\rightarrow$ "better"<br>"flies" $\rightarrow$ "fli" |

In [None]:
from sklearn.datasets import fetch_20newsgroups # Import the function to fetch the 20 Newsgroups dataset.
import nltk # Import the Natural Language Toolkit (NLTK) library for NLP tasks.

# This model is used by nltk.word_tokenize to split text into words.
nltk.download('punkt')

# This model is used by nltk.pos_tag to assign grammatical tags to words.
nltk.download('averaged_perceptron_tagger')

# Fetch the 20 Newsgroups dataset.
# subset='train' loads only the training portion of the data.
newsgroups_data = fetch_20newsgroups(subset='train')

# Select a sample line from one of the documents.
# newsgroups_data.data is a list where each element is a document (as a string).
# [5] accesses the 6th document in the list.
# .split("\n")[2] splits that document into lines and selects the 3rd line (index 2).
# Headers are not removed in this call, so the selected line can be a header field
# rather than body text.
sample_text_line = newsgroups_data.data[5].split("\n")[2]

# Tokenize the selected sample text line into individual words.
# Example: "This is a sample sentence." -> ['This', 'is', 'a', 'sample', 'sentence', '.']
word_tokens = nltk.word_tokenize(sample_text_line)

# Perform Part-of-Speech (POS) tagging on the word tokens.
# This assigns a grammatical category (e.g., noun, verb, adjective) to each word.
# The result is a list of tuples, where each tuple is (word, pos_tag).
# Example: [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'NN'), ('sentence', 'NN'), ('.', '.')]
pos_tags = nltk.pos_tag(word_tokens)

# Print the list of words with their corresponding POS tags.
print(pos_tags)

# Splitting a string that contains no newline characters returns a single-element list.
frase = 'oi, eu sou o goku'
print(frase.split('\n'))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Felipe\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[('Organization', 'NN'), (':', ':'), ('VTT', 'NN')]


In [34]:
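# Splitting on spaces keeps punctuation attached to words ('oi,'),
# unlike nltk.word_tokenize, which separates punctuation into its own token.
frase = 'oi, eu sou o goku'
print(frase.split(' '))
print(frase)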

['oi,', 'eu', 'sou', 'o', 'goku']
oi, eu sou o goku
