# POS Tagging

POS (part-of-speech) tagging, also called morphological tagging, is the process of classifying the words of a text according to their lexical category.

Each word is assigned a lexical category from a set of coded tags, based on its meaning and use in the corresponding language. Before POS tagging can be performed, the text must already be tokenized.

NLTK offers a function called pos_tag. This function classifies English words according to a predefined tagset. This particular tagger is based on machine learning and was trained on thousands of manually tagged example sentences. It can therefore estimate the most likely lexical category of a term, which does not mean it is free of errors.

A complete list of the tag codes used by NLTK can be obtained as follows:

In [1]:
import nltk
nltk.download('tagsets')
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

[nltk_data] Downloading package tagsets to /home/lucas/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


It is also possible to get the description of a single, specific tag.

In [2]:
nltk.help.upenn_tagset("NNP")

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...


Since this tagger may not be good enough in some cases, tagging accuracy can be improved by combining it with manually created POS taggers.
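One simple way to layer a hand-written tagger over a statistical one is to post-process its output with a few pattern rules. The sketch below is only an illustration under stated assumptions: the rule list, the tags HT, USR and URL, and the helper apply_manual_rules are hypothetical and not part of NLTK, and the sample input is hand-written in the (token, tag) format rather than real pos_tag output.

```python
import re

# Hypothetical manual rules: (compiled pattern, tag to force).
# HT, USR and URL are made-up tags, not part of the Penn Treebank tagset.
MANUAL_RULES = [
    (re.compile(r"^#\w+$"), "HT"),           # hashtags
    (re.compile(r"^@\w+$"), "USR"),          # user mentions
    (re.compile(r"^https?://\S+$"), "URL"),  # links
]

def apply_manual_rules(tagged_tokens, rules=MANUAL_RULES):
    """Return a new list where the first matching rule overrides the tag."""
    result = []
    for token, tag in tagged_tokens:
        for pattern, manual_tag in rules:
            if pattern.match(token):
                tag = manual_tag
                break
        result.append((token, tag))
    return result

# Hand-written sample in the (token, tag) format that a tagger returns.
sample = [
    ("Vote", "VB"), ("for", "IN"), ("#GRAMMYs", "NNP"),
    ("with", "IN"), ("@RecordingAcad", "NNP"),
    ("https://t.co/i5q6EU5jrT", "NN"),
]
print(apply_manual_rules(sample))
```

Tokens that match no rule keep the tag assigned by the statistical tagger. NLTK also ships rule-based taggers such as RegexpTagger and DefaultTagger that can be chained through a backoff argument, which may be a better fit than this hand-rolled helper.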

## Tagging

In [1]:
import nltk
example = "The Palace of Westminster serves as the meeting place for both the House of Commons and the House of Lords, the two houses of the Parliament of the United Kingdom. Informally known as the Houses of Parliament after its occupants, the Palace lies on the north bank of the River Thames in the City of Westminster, in central London, England."
# Download the resources used below (assumption: resource names can differ between NLTK versions)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Tokenize the text
tokenized_text = nltk.word_tokenize(example)
print(tokenized_text)
# Tag the tokens with pos_tag
text_pos = nltk.pos_tag(tokenized_text)
print(text_pos)

# POS in Spanish

If we want to perform POS tagging in another language, we need to find a tagger already trained for that language or train one ourselves. We also need to know which word categories (the tagset) exist for that language.

The following <a href="https://colab.research.google.com/github/vitojph/kschool-nlp-18/blob/master/notebooks/pos-tagger-es.ipynb">link</a> shows a practical example of POS tagging in Spanish.
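To give an idea of what training a tagger ourselves involves, the sketch below builds about the simplest statistical tagger possible: for each word it remembers the tag it was seen with most often. The three Spanish sentences, their Universal-POS-style tags, and the names train_unigram_tagger and tag are invented for illustration; a real tagger would need a large hand-annotated corpus and a better strategy for unknown words.

```python
from collections import Counter, defaultdict

# Toy hand-tagged corpus (invented for this example).
train_sents = [
    [("el", "DET"), ("perro", "NOUN"), ("come", "VERB")],
    [("la", "DET"), ("casa", "NOUN"), ("es", "VERB"), ("grande", "ADJ")],
    [("el", "DET"), ("gato", "NOUN"), ("es", "VERB"), ("negro", "ADJ")],
]

def train_unigram_tagger(tagged_sents):
    """Map each lower-cased word to the tag it appears with most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag(tokens, model, default="NOUN"):
    """Tag tokens with the learned model; unknown words get the default tag."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]

model = train_unigram_tagger(train_sents)
print(tag(["El", "gato", "come"], model))
# [('El', 'DET'), ('gato', 'NOUN'), ('come', 'VERB')]
```

NLTK's UnigramTagger is built on the same idea and is trained on tagged sentences in this same list-of-pairs format, so in practice it would normally replace the hand-written functions above.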

# Exercise

- Retrieve from the API a list of tweets in English that are not retweets and that contain the hashtag #GRAMMYs.
- Tokenize the tweets
- POS-tag them with NLTK's pos_tag function
- Get the list and frequency of singular and plural nouns

In [4]:
import os
from dotenv import load_dotenv
import pandas as pd
import requests
# Load the values from the .env file into environment variables
load_dotenv()
# Read the token value into a variable
bearer_token = os.environ.get("BEARER_TOKEN")
url = "https://api.twitter.com/2/tweets/search/recent"
headers = {
    "Authorization": f"Bearer {bearer_token}",
    "User-Agent":"v2FullArchiveSearchPython"
}
hashtag='#GRAMMYs'
params = {
    'query': f'{hashtag} -is:retweet lang:en',
    'max_results':100
}
response = requests.get(url, headers=headers, params=params)
print(response)
# Raise an exception if the response is not successful
if response.status_code != 200:
    raise Exception(response.status_code, response.text)
json_response = response.json()['data']
print(json_response)
df = pd.json_normalize(json_response)
df

<Response [200]>
[{'id': '1450201218580680705', 'text': "The reason the #GRAMMYs is music’s highest honor is because it’s music’s only peer-voted award. Today at 3pm PT, I'm doing an Instagram Live with the  @RecordingAcad to shed light on the power, prestige, and privilege of recognizing music’s creators! #Vote4GRAMMYs https://t.co/i5q6EU5jrT"}, {'id': '1450197865880489984', 'text': "RT @ImEricaCampbell: The reason the #GRAMMYs is music’s highest honor is because it’s music’s only peer-voted award. Today at 1 p.m PT, I'm doing a @TwitterSpaces conversation with the @RecordingAcad to shed light on the power of recognizing music’s crea… https://t.co/giZAdfoYay"}, {'id': '1450182785004621829', 'text': 'In a few minutes we’re talking with @ImEricaCampbell, @J_Ivy, and @lukesmorgan on the importance of #GRAMMYs\xa0 voting and the power of recognizing music’s creators.\n\nSet your reminder and join us for #Vote4GRAMMYs\xa0 conversation.\n https://t.co/HPXubusMUw'}, {'id': '14501792155848990

| | id | text |
|---|---|---|
| 0 | 1450201218580680705 | The reason the #GRAMMYs is music’s highest hon... |
| 1 | 1450197865880489984 | RT @ImEricaCampbell: The reason the #GRAMMYs i... |
| 2 | 1450182785004621829 | In a few minutes we’re talking with @ImEricaCa... |
| 3 | 1450179215584899072 | The reason the #GRAMMYs is music’s highest hon... |
| 4 | 1450178500946640898 | In a few minutes we’re talking with @ImEricaCa... |
| ... | ... | ... |
| 95 | 1449480556412997634 | SONGWRITER: “The human voice is an instrument.... |
| 96 | 1449477397674401796 | We are 2 for 2❗️❕ Thank You @christongray @dre... |
| 97 | 1449468672595017729 | Give this man an award #Grammys \n#AmericaFirs... |
| 98 | 1449461690299568130 | PRESAVE SG NOW 🚨\n\n#PreSave #NewMusic #collab... |


In [8]:
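import re
from nltk.tokenize import TweetTokenizer
# Instantiate the tokenizer
tt = TweetTokenizer()
# Apply the tokenizer to the column (disabled: the next cell tokenizes with nltk.word_tokenize instead)
#tokenized_text = df['text'].apply(tt.tokenize)
#df["tokenized_text"] = tokenized_text
#df
# Despite its name, tokenized_text still holds the raw tweet text at this point
tokenized_text = df['text']
tokenized_text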

0     The reason the #GRAMMYs is music’s highest hon...
1     RT @ImEricaCampbell: The reason the #GRAMMYs i...
2     In a few minutes we’re talking with @ImEricaCa...
3     The reason the #GRAMMYs is music’s highest hon...
4     In a few minutes we’re talking with @ImEricaCa...
                            ...                        
95    SONGWRITER: “The human voice is an instrument....
96    We are 2 for 2❗️❕ Thank You @christongray @dre...
97    Give this man an award #Grammys \n#AmericaFirs...
98    PRESAVE SG NOW 🚨\n\n#PreSave #NewMusic #collab...
99    This album by @grampsmorgan\nRated ⭐⭐⭐⭐⭐ by #b...
Name: text, Length: 100, dtype: object

In [16]:
import nltk
# Tokenize each tweet
tweets=[]
for tweet in tokenized_text:
    tweets.append(nltk.word_tokenize(tweet))
# Tag each tokenized tweet with pos_tag
for i in range(len(tweets)):
    tweets[i] = nltk.pos_tag(tweets[i])
# tweets2 is a second name for the same list, not a copy
tweets2 = tweets
tweets


[[('The', 'DT'),
  ('reason', 'NN'),
  ('the', 'DT'),
  ('#', '#'),
  ('GRAMMYs', 'NNP'),
  ('is', 'VBZ'),
  ('music', 'NN'),
  ('’', 'NN'),
  ('s', 'NN'),
  ('highest', 'JJS'),
  ('honor', 'NN'),
  ('is', 'VBZ'),
  ('because', 'IN'),
  ('it', 'PRP'),
  ('’', 'VBZ'),
  ('s', 'JJ'),
  ('music', 'NN'),
  ('’', 'NN'),
  ('s', 'VBZ'),
  ('only', 'RB'),
  ('peer-voted', 'JJ'),
  ('award', 'NN'),
  ('.', '.'),
  ('Today', 'NN'),
  ('at', 'IN'),
  ('3pm', 'CD'),
  ('PT', 'NNP'),
  (',', ','),
  ('I', 'PRP'),
  ("'m", 'VBP'),
  ('doing', 'VBG'),
  ('an', 'DT'),
  ('Instagram', 'NNP'),
  ('Live', 'NNP'),
  ('with', 'IN'),
  ('the', 'DT'),
  ('@', 'NNP'),
  ('RecordingAcad', 'NNP'),
  ('to', 'TO'),
  ('shed', 'VB'),
  ('light', 'NN'),
  ('on', 'IN'),
  ('the', 'DT'),
  ('power', 'NN'),
  (',', ','),
  ('prestige', 'NN'),
  (',', ','),
  ('and', 'CC'),
  ('privilege', 'NN'),
  ('of', 'IN'),
  ('recognizing', 'VBG'),
  ('music', 'NN'),
  ('’', 'NNP'),
  ('s', 'NN'),
  ('creators', 'NNS'),
  ('!', 

In [17]:
for tweet in tweets2:
    for word in tweet:
        print(word)

('The', 'DT')
('reason', 'NN')
('the', 'DT')
('#', '#')
('GRAMMYs', 'NNP')
('is', 'VBZ')
('music', 'NN')
('’', 'NN')
('s', 'NN')
('highest', 'JJS')
('honor', 'NN')
('is', 'VBZ')
('because', 'IN')
('it', 'PRP')
('’', 'VBZ')
('s', 'JJ')
('music', 'NN')
('’', 'NN')
('s', 'VBZ')
('only', 'RB')
('peer-voted', 'JJ')
('award', 'NN')
('.', '.')
('Today', 'NN')
('at', 'IN')
('3pm', 'CD')
('PT', 'NNP')
(',', ',')
('I', 'PRP')
("'m", 'VBP')
('doing', 'VBG')
('an', 'DT')
('Instagram', 'NNP')
('Live', 'NNP')
('with', 'IN')
('the', 'DT')
('@', 'NNP')
('RecordingAcad', 'NNP')
('to', 'TO')
('shed', 'VB')
('light', 'NN')
('on', 'IN')
('the', 'DT')
('power', 'NN')
(',', ',')
('prestige', 'NN')
(',', ',')
('and', 'CC')
('privilege', 'NN')
('of', 'IN')
('recognizing', 'VBG')
('music', 'NN')
('’', 'NNP')
('s', 'NN')
('creators', 'NNS')
('!', '.')
('#', '#')
('Vote4GRAMMYs', 'NNP')
('https', 'NN')
(':', ':')
('//t.co/i5q6EU5jrT', 'NN')
('RT', 'NNP')
('@', 'NNP')
('ImEricaCampbell', 'NNP')
(':', ':')
('The',

('is', 'VBZ')
('in', 'IN')
('my', 'PRP$')
('bio', 'NN')
('above', 'IN')
('!', '.')
('#', '#')
('foryourconsideration', 'NN')
('#', '#')
('grammys', 'JJ')
('#', '#')
('grammys2022', 'JJ')
('#', '#')
('spokenword', 'JJ')
('#', '#')
('faylitahicks', 'NNS')
('My', 'PRP$')
('latest', 'JJS')
('album', 'NN')
('A', 'DT')
('NEW', 'JJ')
('NAME', 'NN')
('FOR', 'NNP')
('MY', 'NNP')
('LOVE', 'NNP')
('was', 'VBD')
('released', 'VBN')
('in', 'IN')
('September', 'NNP')
('in', 'IN')
('support', 'NN')
('of', 'IN')
('the', 'DT')
('incredible', 'JJ')
('work', 'NN')
('being', 'VBG')
('done', 'VBN')
('by', 'IN')
('@', 'NNP')
('civrightscorps', 'NN')
('and', 'CC')
('the', 'DT')
('many', 'JJ')
('grassroots', 'NNS')
('organizations', 'NNS')
('fighting', 'VBG')
('for', 'IN')
('our', 'PRP$')
('rights', 'NNS')
('.', '.')
('Produced', 'VBN')
('by', 'IN')
('@', 'NNP')
('hanznobe', 'NN')
('and', 'CC')
('@', 'JJ')
('room380llc', 'NN')
('.', '.')
('#', '#')
('fyc', 'JJ')
('#', '#')
('GRAMMYs', 'NNP')
('#', '#')
('spok

('#', '#')
('grammys', 'JJ')
('#', '#')
('beat', 'JJ')
('https', 'NN')
(':', ':')
('//t.co/jTy4xOlPlu', 'NN')
('KiCk', 'NNP')
('i', 'NN')
('deserved', 'VBD')
('the', 'DT')
('Grammy', 'NNP')
('for', 'IN')
('best', 'JJS')
('Dance/Electronic', 'NNP')
('Album', 'NNP')
('of', 'IN')
('2020', 'CD')
('#', '#')
('GRAMMYs', 'NNP')
('#', '#')
('kicki', 'NN')
('#', '#')
('grammy2020', 'JJ')
('#', '#')
('electronicmusic', 'JJ')
('@', 'NN')
('arca1000000', 'NN')
('https', 'NN')
(':', ':')
('//t.co/rWs3Kphm4A', 'NN')
('The', 'DT')
('latest', 'JJS')
('The', 'DT')
('Lemonade', 'NNP')
('(', '(')
('Beyoncé', 'NNP')
('album', 'RB')
(')', ')')
('Feminist', 'NNP')
('Daily', 'NNP')
('!', '.')
('https', 'NN')
(':', ':')
('//t.co/bb09aXx3ZQ', 'JJ')
('#', '#')
('grammys', 'JJ')
('#', '#')
('musicnews', 'NNS')
('What', 'WP')
('an', 'DT')
('amazing', 'JJ')
('Review', 'NN')
('of', 'IN')
('the', 'DT')
('3rd', 'CD')
('Fiction', 'NNP')
('Syxx', 'NNP')
('release', 'NN')
('“', 'NNP')
('Ghost', 'NNP')
('of', 'IN')
('My'

- Get the list and frequency of singular and plural proper nouns


- Get the list and frequency of verbs in all tenses

- Get the list and frequency of all adjectives
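All of the frequency items in this exercise follow the same pattern: keep the tokens whose tag belongs to a group of Penn Treebank tags, then count them. A minimal sketch, assuming the input has the same shape as the tweets list built earlier (a list of tweets, each a list of (token, tag) pairs); the helper tag_frequencies and the small hand-picked sample are illustrative only.

```python
from collections import Counter

# Penn Treebank tag groups for each item of the exercise.
NOUNS = {"NN", "NNS"}
PROPER_NOUNS = {"NNP", "NNPS"}
VERBS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
ADJECTIVES = {"JJ", "JJR", "JJS"}

def tag_frequencies(tagged_tweets, wanted_tags):
    """Count the tokens whose tag is in wanted_tags, across all tweets."""
    return Counter(
        token
        for tweet in tagged_tweets
        for token, tag in tweet
        if tag in wanted_tags
    )

# Small hand-picked sample in the same format as the tagged tweets.
tagged_tweets = [
    [("The", "DT"), ("reason", "NN"), ("music", "NN"), ("is", "VBZ"),
     ("highest", "JJS"), ("creators", "NNS")],
    [("GRAMMYs", "NNP"), ("music", "NN"), ("doing", "VBG"), ("only", "RB")],
]

for name, group in [("nouns", NOUNS), ("proper nouns", PROPER_NOUNS),
                    ("verbs", VERBS), ("adjectives", ADJECTIVES)]:
    print(name, tag_frequencies(tagged_tweets, group).most_common())
```

The keys of each Counter give the list of words and the values give their frequencies. Whether to lower-case tokens before counting is a design choice left open here, since it would merge variants such as GRAMMYs and grammys but also lose the capitalization of proper nouns.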