# POS Tagging

POS (part-of-speech) tagging, also called grammatical tagging, is the process of classifying the words of a text according to their lexical category.

Each word receives a lexical category drawn from a set of coded tags, according to the role it plays in the corresponding language. Before POS tagging can be performed, the text must already be tokenized.

NLTK offers a function called pos_tag. This function classifies English words according to a predefined tag set. This particular tagger is based on machine learning and was trained on thousands of manually tagged example sentences. It can therefore estimate the most likely lexical category of a term, which does not mean it is free of errors.

A complete list of the tag codes used by NLTK can be obtained as follows:

In [9]:
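import nltk
# Uncomment on the first run to download the required resources
#nltk.download('tagsets')
#nltk.download('averaged_perceptron_tagger')
# Print every tag in the Penn Treebank tag set with examples
nltk.help.upenn_tagset()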

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

It is also possible to get the description of a single, specific category.

In [2]:
nltk.help.upenn_tagset("NNP")

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...


Because this tagger may not be good enough in some cases, tagging quality can be improved by combining it with manually created POS taggers.
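As a rough sketch of what a manually created tagger could look like, the example below uses NLTK's `RegexpTagger` with a `DefaultTagger` as backoff. The patterns and the tag names `HT`, `USR`, and `URL` are assumptions made for illustration; they are not part of the Penn Treebank tag set, and the `NN` fallback merely stands in for a trained tagger.

```python
from nltk.tag import DefaultTagger, RegexpTagger

# Hypothetical hand-written rules for tokens a general-purpose tagger may mislabel.
# The tag names HT, USR and URL are illustrative, not Penn Treebank tags.
patterns = [
    (r"^#\w+$", "HT"),           # hashtags
    (r"^@\w+$", "USR"),          # user mentions
    (r"^https?://\S+$", "URL"),  # links
]

# Tokens that match no rule fall through to the backoff tagger.
# DefaultTagger("NN") is only a stand-in for a real trained tagger here.
manual_tagger = RegexpTagger(patterns, backoff=DefaultTagger("NN"))

tagged = manual_tagger.tag(["#GRAMMYs", "@finneas", "https://t.co/abc", "music"])
print(tagged)
```

Neither tagger class needs a downloaded model, so the sketch runs with a plain NLTK installation.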

## Tagging

In [10]:
example = "The Palace of Westminster serves as the meeting place for both the House of Commons and the House of Lords, the two houses of the Parliament of the United Kingdom. Informally known as the Houses of Parliament after its occupants, the Palace lies on the north bank of the River Thames in the City of Westminster, in central London, England."
# Tokenize the text
tokenized_text = nltk.word_tokenize(example)
print(tokenized_text)
# Tag the tokens with pos_tag
text_pos = nltk.pos_tag(tokenized_text)
print(text_pos)

['The', 'Palace', 'of', 'Westminster', 'serves', 'as', 'the', 'meeting', 'place', 'for', 'both', 'the', 'House', 'of', 'Commons', 'and', 'the', 'House', 'of', 'Lords', ',', 'the', 'two', 'houses', 'of', 'the', 'Parliament', 'of', 'the', 'United', 'Kingdom', '.', 'Informally', 'known', 'as', 'the', 'Houses', 'of', 'Parliament', 'after', 'its', 'occupants', ',', 'the', 'Palace', 'lies', 'on', 'the', 'north', 'bank', 'of', 'the', 'River', 'Thames', 'in', 'the', 'City', 'of', 'Westminster', ',', 'in', 'central', 'London', ',', 'England', '.']
[('The', 'DT'), ('Palace', 'NNP'), ('of', 'IN'), ('Westminster', 'NNP'), ('serves', 'NNS'), ('as', 'IN'), ('the', 'DT'), ('meeting', 'NN'), ('place', 'NN'), ('for', 'IN'), ('both', 'CC'), ('the', 'DT'), ('House', 'NNP'), ('of', 'IN'), ('Commons', 'NNPS'), ('and', 'CC'), ('the', 'DT'), ('House', 'NNP'), ('of', 'IN'), ('Lords', 'NNPS'), (',', ','), ('the', 'DT'), ('two', 'CD'), ('houses', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Parliament', 'NNP'), ('of'

# POS in Spanish

To perform POS tagging in another language, we need either to find a tagger already trained for that language or to train one ourselves. We also need to know which word categories exist for that language.

The following <a href="https://colab.research.google.com/github/vitojph/kschool-nlp-18/blob/master/notebooks/pos-tagger-es.ipynb">link</a> shows a practical example of POS tagging in Spanish.
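To give an idea of what training a tagger ourselves can involve, here is a minimal sketch using NLTK's `UnigramTagger`. The two training sentences and their tags are hypothetical toy data written for this example; a usable tagger would need a real annotated Spanish corpus.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Tiny hand-labelled training set (hypothetical data with simple category names)
train_sents = [
    [("el", "DET"), ("gato", "NOUN"), ("come", "VERB"), ("pescado", "NOUN")],
    [("la", "DET"), ("casa", "NOUN"), ("es", "VERB"), ("grande", "ADJ")],
]

# Each known word gets the tag it carried most often in training;
# unknown words fall back to NOUN through the backoff tagger.
spanish_tagger = UnigramTagger(train_sents, backoff=DefaultTagger("NOUN"))

tagged_es = spanish_tagger.tag(["el", "gato", "es", "grande", "hoy"])
print(tagged_es)
```

The word `hoy` never appears in the training data, so it receives the backoff tag, which shows how quickly a tagger trained on too little data becomes unreliable.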

In [11]:
import requests
import os
from dotenv import load_dotenv
import pandas as pd
import string
from nltk.tokenize import TweetTokenizer
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import nltk
url = "https://api.twitter.com/2/tweets/search/recent"
# Load the values from the .env file into environment variables
load_dotenv()
# Read the bearer token into a variable
bearer_token = os.environ.get("BEARER_TOKEN")

# Exercise

- Retrieve from the API a list of English-language tweets that are not retweets and contain the hashtag #GRAMMYs.
- Tokenize the tweets
- Apply POS tagging with NLTK's pos_tag function
- Get the list and frequency of singular and plural nouns

# Exercise No. 1

In [12]:
params = {
    'query': '#GRAMMYs  lang:en -is:retweet',
    'tweet.fields':'created_at',
    'max_results':100
}
headers = {
    "Authorization": f"Bearer {bearer_token}",
    "User-Agent":"TweeHunch"
}
response = requests.get(url, headers=headers, params=params)
print(response)
# Raise an exception if the request was not successful
if response.status_code != 200:
    raise Exception(response.status_code, response.text)
print(response.json())

<Response [200]>
{'data': [{'created_at': '2021-10-20T19:11:57.000Z', 'id': '1450902629400334340', 'text': '‘Justice’ by Justin Bieber has been submitted for “Album of the Year” while “Peaches” feat. Daniel Caesar and GIVEON vying for “Record of the Year” at the 2022 #Grammys. https://t.co/yL3xZXMuRm'}, {'created_at': '2021-10-20T19:08:11.000Z', 'id': '1450901678727864323', 'text': '#FYC Please consider “The Chopstars” for Best Remixed Recording for the following #ChopNotSlopRemixes #Grammys @ Houston, Texas https://t.co/ORTHOrSdw6'}, {'created_at': '2021-10-20T18:55:14.000Z', 'id': '1450898420135342087', 'text': 'Happy Birthday NBA Youngboy #NBAYoungboy #hiphop #BETAwards #GRAMMYs #Billboards2021 https://t.co/fZWgkXsXua'}, {'created_at': '2021-10-20T18:31:31.000Z', 'id': '1450892451007324161', 'text': "RT @sekouandrews: The reason the #GRAMMYs is music’s highest honor is because it’s music’s only peer-voted award. Today at 3 p.m PT, I'm doing a @TwitterSpaces conversation with the @Re

In [13]:
df = pd.json_normalize(response.json()['data'])
df

Unnamed: 0,created_at,id,text
0,2021-10-20T19:11:57.000Z,1450902629400334340,‘Justice’ by Justin Bieber has been submitted ...
1,2021-10-20T19:08:11.000Z,1450901678727864323,#FYC Please consider “The Chopstars” for Best ...
2,2021-10-20T18:55:14.000Z,1450898420135342087,Happy Birthday NBA Youngboy #NBAYoungboy #hiph...
3,2021-10-20T18:31:31.000Z,1450892451007324161,RT @sekouandrews: The reason the #GRAMMYs is m...
4,2021-10-20T18:30:08.000Z,1450892102842388481,Attention @RecordingAcad members: First-round ...
...,...,...,...
95,2021-10-20T08:21:27.000Z,1450738925656440833,SXY GIRL #GRAMMYs https://t.co/jtIe97UCc8
96,2021-10-20T08:09:25.000Z,1450735895691214849,#KritiSanon is ecstatic as #ARRahman shares th...
97,2021-10-20T07:47:35.000Z,1450730399051747328,#GRAMMYs Voting is critically important becaus...
98,2021-10-20T07:46:32.000Z,1450730135297142787,The #Grammys will stick to its word with the p...
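The sample above still contains a tweet whose text starts with `RT @`, even though the query uses `-is:retweet`. If such manually quoted retweets should be excluded as well, a local filter is one option. The sketch below works on plain dictionaries shaped like the entries in `response.json()['data']`; the helper name `drop_manual_retweets` and the sample records are hypothetical.

```python
def drop_manual_retweets(tweets):
    """Keep only tweets whose text does not start with the 'RT @' prefix."""
    return [t for t in tweets if not t["text"].lstrip().startswith("RT @")]

# Hypothetical records with the same keys as the API response entries
sample = [
    {"id": "1", "text": "Happy Birthday #GRAMMYs"},
    {"id": "2", "text": "RT @someone: The reason the #GRAMMYs matter"},
]

kept = drop_manual_retweets(sample)
print([t["id"] for t in kept])
```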


In [14]:
# Tokenize

tt = TweetTokenizer()

tokenized_text = df['text'].apply(tt.tokenize)
df["tokenized_text"] = tokenized_text
df

Unnamed: 0,created_at,id,text,tokenized_text
0,2021-10-20T19:11:57.000Z,1450902629400334340,‘Justice’ by Justin Bieber has been submitted ...,"[‘, Justice, ’, by, Justin, Bieber, has, been,..."
1,2021-10-20T19:08:11.000Z,1450901678727864323,#FYC Please consider “The Chopstars” for Best ...,"[#FYC, Please, consider, “, The, Chopstars, ”,..."
2,2021-10-20T18:55:14.000Z,1450898420135342087,Happy Birthday NBA Youngboy #NBAYoungboy #hiph...,"[Happy, Birthday, NBA, Youngboy, #NBAYoungboy,..."
3,2021-10-20T18:31:31.000Z,1450892451007324161,RT @sekouandrews: The reason the #GRAMMYs is m...,"[RT, @sekouandrews, :, The, reason, the, #GRAM..."
4,2021-10-20T18:30:08.000Z,1450892102842388481,Attention @RecordingAcad members: First-round ...,"[Attention, @RecordingAcad, members, :, First-..."
...,...,...,...,...
95,2021-10-20T08:21:27.000Z,1450738925656440833,SXY GIRL #GRAMMYs https://t.co/jtIe97UCc8,"[SXY, GIRL, #GRAMMYs, https://t.co/jtIe97UCc8]"
96,2021-10-20T08:09:25.000Z,1450735895691214849,#KritiSanon is ecstatic as #ARRahman shares th...,"[#KritiSanon, is, ecstatic, as, #ARRahman, sha..."
97,2021-10-20T07:47:35.000Z,1450730399051747328,#GRAMMYs Voting is critically important becaus...,"[#GRAMMYs, Voting, is, critically, important, ..."
98,2021-10-20T07:46:32.000Z,1450730135297142787,The #Grammys will stick to its word with the p...,"[The, #Grammys, will, stick, to, its, word, wi..."


In [17]:
# Apply POS tagging

# Flatten the per-tweet token lists into a single list of words
data = [word for tokens in tokenized_text for word in tokens]

# Tag the words with pos_tag
data_pos = nltk.pos_tag(data)
print(data_pos)
# Keep only common nouns: singular (NN) and plural (NNS)
data_noun = [word for word, tag in data_pos if tag in ("NN", "NNS")]
print(data_noun)

[('‘', 'JJ'), ('Justice', 'NNP'), ('’', 'NN'), ('by', 'IN'), ('Justin', 'NNP'), ('Bieber', 'NNP'), ('has', 'VBZ'), ('been', 'VBN'), ('submitted', 'VBN'), ('for', 'IN'), ('“', 'NNP'), ('Album', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Year', 'NNP'), ('”', 'NNP'), ('while', 'IN'), ('“', 'JJ'), ('Peaches', 'NNP'), ('”', 'NNP'), ('feat', 'NN'), ('.', '.'), ('Daniel', 'NNP'), ('Caesar', 'NNP'), ('and', 'CC'), ('GIVEON', 'NNP'), ('vying', 'VBG'), ('for', 'IN'), ('“', 'NNP'), ('Record', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Year', 'NNP'), ('”', 'NN'), ('at', 'IN'), ('the', 'DT'), ('2022', 'CD'), ('#Grammys', 'NN'), ('.', '.'), ('https://t.co/yL3xZXMuRm', 'NN'), ('#FYC', 'JJ'), ('Please', 'NNP'), ('consider', 'VB'), ('“', 'NNP'), ('The', 'DT'), ('Chopstars', 'NNP'), ('”', 'NNP'), ('for', 'IN'), ('Best', 'NNP'), ('Remixed', 'NNP'), ('Recording', 'NNP'), ('for', 'IN'), ('the', 'DT'), ('following', 'JJ'), ('#ChopNotSlopRemixes', 'NNS'), ('#Grammys', 'VBP'), ('@', 'JJ'), ('Houston', 'NNP'), (',', '
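The cell above flattens all tweets into one list before tagging, so the tagger treats the end of one tweet and the start of the next as a single sequence. An alternative is to tag each tweet separately and flatten afterwards. In this sketch `tag_documents` is a hypothetical helper and `toy_tagger` is a stand-in so the example runs without downloading a model; in the notebook one would pass `nltk.pos_tag` instead.

```python
from itertools import chain

def tag_documents(tokenized_docs, tagger):
    """Tag every token list on its own, then flatten the (token, tag) pairs."""
    return list(chain.from_iterable(tagger(tokens) for tokens in tokenized_docs if tokens))

def toy_tagger(tokens):
    """Stand-in tagger: capitalised tokens become NNP, everything else NN."""
    return [(tok, "NNP" if tok[:1].isupper() else "NN") for tok in tokens]

docs = [["Happy", "birthday"], [], ["Album", "of", "the", "year"]]
pairs = tag_documents(docs, toy_tagger)
print(pairs)
```

Because the tagger is passed in as an argument, the same helper works with any callable that maps a token list to a list of `(token, tag)` pairs.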

In [8]:
# Keep only: NN - NNS

from nltk.probability import FreqDist

# Get the frequency of each term
fdist = FreqDist(data_noun)
# Convert to a DataFrame
df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
df_fdist.columns = ['Frequency']
df_fdist.index.name = 'Term'
df_fdist.sort_values(by=['Frequency'], inplace=True, ascending=False)
#pd.set_option('display.max_rows', None)

df_fdist


NameError: name 'data_noun' is not defined

- Get the list and frequency of singular and plural proper nouns


# Exercise No. 2

In [13]:
# Apply POS tagging

# Flatten the per-tweet token lists into a single list of words
data = [word for tokens in tokenized_text for word in tokens]

# Tag the words with pos_tag
data_pos = nltk.pos_tag(data)
print(data_pos)
# Keep only proper nouns: singular (NNP) and plural (NNPS)
data_names = [word for word, tag in data_pos if tag in ("NNP", "NNPS")]
print(data_names)

[('Here', 'RB'), ('we', 'PRP'), ('gow', 'VBP'), ('!', '.'), ('Nothing', 'NN'), ('but', 'CC'), ('SOTY', 'NNP'), ('/', 'NNP'), ('ROTY', 'NNP'), ('behaviour', 'NN'), ('!', '.'), ('@billieeilish', 'JJ'), ('@finneas', 'NNS'), ('#HappierThanEver', 'VBP'), ('#GRAMMYs', 'JJ'), ('🔥', 'NNP'), ('😏', 'NNP'), ('😎', 'NNP'), ('❤', 'NNP'), ('️', 'NNP'), ('💕', 'NNP'), ('🎶', 'NNP'), ('https://t.co/1ze8rRtACr', 'NN'), ('At', 'IN'), ('the', 'DT'), ('27th', 'CD'), ('#GRAMMYs', 'NN'), ('in', 'IN'), ('1985', 'CD'), (',', ','), ('@LionelRichie', 'NNP'), ('won', 'VBD'), ('the', 'DT'), ('Album', 'NNP'), ('Of', 'IN'), ('The', 'DT'), ('Year', 'NNP'), ('GRAMMY', 'NNP'), ('for', 'IN'), ("'", "''"), ("Can't", 'NNP'), ('Slow', 'NNP'), ('Down', 'NNP'), ("'", 'POS'), ('.', '.'), ('🎶', 'FW'), ('https://t.co/2alXkGP6fY', 'NN'), ('At', 'IN'), ('the', 'DT'), ('27th', 'CD'), ('#GRAMMYs', 'NN'), ('in', 'IN'), ('1985', 'CD'), (',', ','), ('@LionelRichie', 'NNP'), ('won', 'VBD'), ('the', 'DT'), ('Album', 'NNP'), ('Of', 'IN'), 

In [14]:
# Keep only: NNP - NNPS

from nltk.probability import FreqDist

# Get the frequency of each term
fdist = FreqDist(data_names)
# Convert to a DataFrame
df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
df_fdist.columns = ['Frequency']
df_fdist.index.name = 'Term'
df_fdist.sort_values(by=['Frequency'], inplace=True, ascending=False)
#pd.set_option('display.max_rows', None)

df_fdist

Term,Frequency
#GRAMMYs,37
Album,13
Year,9
Best,9
God,8
...,...
Tranny,1
J6,1
@RepKinzinger,1
@kelly_rdc,1


- Get the list and frequency of verbs in all tenses

# Exercise No. 3

In [15]:
# Apply POS tagging

# Flatten the per-tweet token lists into a single list of words
data = [word for tokens in tokenized_text for word in tokens]

# Tag the words with pos_tag
data_pos = nltk.pos_tag(data)
print(data_pos)
# Keep every verb form in the Penn Treebank tag set
verb_tags = ("VB", "VBD", "VBG", "VBN", "VBP", "VBZ")
data_verbs = [word for word, tag in data_pos if tag in verb_tags]
print(data_verbs)

[('Here', 'RB'), ('we', 'PRP'), ('gow', 'VBP'), ('!', '.'), ('Nothing', 'NN'), ('but', 'CC'), ('SOTY', 'NNP'), ('/', 'NNP'), ('ROTY', 'NNP'), ('behaviour', 'NN'), ('!', '.'), ('@billieeilish', 'JJ'), ('@finneas', 'NNS'), ('#HappierThanEver', 'VBP'), ('#GRAMMYs', 'JJ'), ('🔥', 'NNP'), ('😏', 'NNP'), ('😎', 'NNP'), ('❤', 'NNP'), ('️', 'NNP'), ('💕', 'NNP'), ('🎶', 'NNP'), ('https://t.co/1ze8rRtACr', 'NN'), ('At', 'IN'), ('the', 'DT'), ('27th', 'CD'), ('#GRAMMYs', 'NN'), ('in', 'IN'), ('1985', 'CD'), (',', ','), ('@LionelRichie', 'NNP'), ('won', 'VBD'), ('the', 'DT'), ('Album', 'NNP'), ('Of', 'IN'), ('The', 'DT'), ('Year', 'NNP'), ('GRAMMY', 'NNP'), ('for', 'IN'), ("'", "''"), ("Can't", 'NNP'), ('Slow', 'NNP'), ('Down', 'NNP'), ("'", 'POS'), ('.', '.'), ('🎶', 'FW'), ('https://t.co/2alXkGP6fY', 'NN'), ('At', 'IN'), ('the', 'DT'), ('27th', 'CD'), ('#GRAMMYs', 'NN'), ('in', 'IN'), ('1985', 'CD'), (',', ','), ('@LionelRichie', 'NNP'), ('won', 'VBD'), ('the', 'DT'), ('Album', 'NNP'), ('Of', 'IN'), 

In [16]:
from nltk.probability import FreqDist

# Get the frequency of each term
fdist = FreqDist(data_verbs)
# Convert to a DataFrame
df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
df_fdist.columns = ['Frequency']
df_fdist.index.name = 'Term'
df_fdist.sort_values(by=['Frequency'], inplace=True, ascending=False)
#pd.set_option('display.max_rows', None)

df_fdist

Term,Frequency
is,26
be,20
was,9
#GRAMMYs,9
Expose,8
...,...
https://t.co/psgAlCfl1r,1
perform,1
calls,1
truly,1
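The table above lists `#GRAMMYs` and a URL among the verbs, which suggests that hashtags, mentions, and links confuse the tagger. One possible mitigation is to drop such tokens before calling `pos_tag`. The sketch below is a minimal version of that idea; the regular expression and the helper name `strip_twitter_tokens` are assumptions, not part of NLTK.

```python
import re

# Matches hashtags, user mentions and http(s) links (illustrative pattern)
TWITTER_TOKEN = re.compile(r"^(#\w+|@\w+|https?://\S+)$")

def strip_twitter_tokens(tokens):
    """Return the tokens that are not hashtags, mentions or URLs."""
    return [tok for tok in tokens if not TWITTER_TOKEN.match(tok)]

tokens = ["The", "#Grammys", "will", "stick", "to", "its", "word",
          "https://t.co/abc", "@finneas"]
clean = strip_twitter_tokens(tokens)
print(clean)
```

Removing these tokens changes the context the tagger sees, so whether this improves the frequency tables is something to check on the actual data.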


- Get the list and frequency of all adjectives

# Exercise No. 4

In [18]:
# Apply POS tagging

# Flatten the per-tweet token lists into a single list of words
data = [word for tokens in tokenized_text for word in tokens]

# Tag the words with pos_tag
data_pos = nltk.pos_tag(data)
print(data_pos)
# Keep adjectives: base (JJ), comparative (JJR) and superlative (JJS)
data_adjective = [word for word, tag in data_pos if tag in ("JJ", "JJR", "JJS")]
print(data_adjective)

[('Here', 'RB'), ('we', 'PRP'), ('gow', 'VBP'), ('!', '.'), ('Nothing', 'NN'), ('but', 'CC'), ('SOTY', 'NNP'), ('/', 'NNP'), ('ROTY', 'NNP'), ('behaviour', 'NN'), ('!', '.'), ('@billieeilish', 'JJ'), ('@finneas', 'NNS'), ('#HappierThanEver', 'VBP'), ('#GRAMMYs', 'JJ'), ('🔥', 'NNP'), ('😏', 'NNP'), ('😎', 'NNP'), ('❤', 'NNP'), ('️', 'NNP'), ('💕', 'NNP'), ('🎶', 'NNP'), ('https://t.co/1ze8rRtACr', 'NN'), ('At', 'IN'), ('the', 'DT'), ('27th', 'CD'), ('#GRAMMYs', 'NN'), ('in', 'IN'), ('1985', 'CD'), (',', ','), ('@LionelRichie', 'NNP'), ('won', 'VBD'), ('the', 'DT'), ('Album', 'NNP'), ('Of', 'IN'), ('The', 'DT'), ('Year', 'NNP'), ('GRAMMY', 'NNP'), ('for', 'IN'), ("'", "''"), ("Can't", 'NNP'), ('Slow', 'NNP'), ('Down', 'NNP'), ("'", 'POS'), ('.', '.'), ('🎶', 'FW'), ('https://t.co/2alXkGP6fY', 'NN'), ('At', 'IN'), ('the', 'DT'), ('27th', 'CD'), ('#GRAMMYs', 'NN'), ('in', 'IN'), ('1985', 'CD'), (',', ','), ('@LionelRichie', 'NNP'), ('won', 'VBD'), ('the', 'DT'), ('Album', 'NNP'), ('Of', 'IN'), 

In [19]:
from nltk.probability import FreqDist

# Get the frequency of each term
fdist = FreqDist(data_adjective)
# Convert to a DataFrame
df_fdist = pd.DataFrame.from_dict(fdist, orient='index')
df_fdist.columns = ['Frequency']
df_fdist.index.name = 'Term'
df_fdist.sort_values(by=['Frequency'], inplace=True, ascending=False)
#pd.set_option('display.max_rows', None)

df_fdist

Term,Frequency
latest,7
#GRAMMYs,7
likely,6
first,5
#music,5
...,...
Recent,1
(310) 882-1967,1
🥲,1
greatest,1
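The four exercises repeat the same filter-and-count steps with a different set of tags each time. As a sketch, that logic could be collected in one helper. `count_by_tags` and the `TAG_GROUPS` dictionary are hypothetical names, and `collections.Counter` is used here in place of `FreqDist` so the example has no dependencies.

```python
from collections import Counter

# Penn Treebank tag groups used in the exercises
TAG_GROUPS = {
    "nouns": {"NN", "NNS"},
    "proper_nouns": {"NNP", "NNPS"},
    "verbs": {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},
    "adjectives": {"JJ", "JJR", "JJS"},
}

def count_by_tags(tagged_tokens, tags):
    """Count the tokens whose POS tag belongs to the given set of tags."""
    return Counter(token for token, tag in tagged_tokens if tag in tags)

# A few (token, tag) pairs in the format returned by nltk.pos_tag
tagged = [("Justin", "NNP"), ("has", "VBZ"), ("been", "VBN"),
          ("submitted", "VBN"), ("Album", "NNP"), ("Justin", "NNP")]

proper_counts = count_by_tags(tagged, TAG_GROUPS["proper_nouns"])
verb_counts = count_by_tags(tagged, TAG_GROUPS["verbs"])
print(proper_counts.most_common())
print(verb_counts.most_common())
```

With a helper like this, each exercise reduces to one call on `data_pos` with the matching tag group, and the tagging itself only has to run once.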
