# Tagging with NLTK

## Basic pipeline for English

In [2]:
#@title Dependencies
import nltk
nltk.download('punkt') # tokenizer
nltk.download('averaged_perceptron_tagger') # POS tagger
from nltk import word_tokenize

[nltk_data] Downloading package punkt to /home/luis/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/luis/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [3]:
#@title One-line tagging ...

text = word_tokenize("And now here is the example of the class today")
# text = word_tokenize("And now here I am enjoying today")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('here', 'RB'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('example', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('class', 'NN'),
 ('today', 'NN')]

In [4]:
#@title Grammatical category of each tag
nltk.download('tagsets')
for tag in ['CC', 'RB', 'PRP', 'VBP', 'VBG', 'NN']:
  nltk.help.upenn_tagset(tag)  # prints the description itself and returns None, so no print() is needed

CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
    appear tend stray glisten obtain comprise detest tease attract
    emphasize mold postpone sever return wag ...
VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...
N

[nltk_data] Downloading package tagsets to /home/luis/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


In [5]:
# download the tagsets, i.e. the metadata
# describing what each tag means
nltk.download('tagsets')
for tag in ['CC', 'RB', 'PRP']:
  nltk.help.upenn_tagset(tag)  # prints the description itself and returns None, so no print() is needed

CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us


[nltk_data] Downloading package tagsets to /home/luis/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


### Homonyms

In [6]:
#@title Homonyms
text = word_tokenize("They do not permit other people to get a residence permit")
nltk.pos_tag(text)

[('They', 'PRP'),
 ('do', 'VBP'),
 ('not', 'RB'),
 ('permit', 'VB'),
 ('other', 'JJ'),
 ('people', 'NNS'),
 ('to', 'TO'),
 ('get', 'VB'),
 ('a', 'DT'),
 ('residence', 'NN'),
 ('permit', 'NN')]
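
The output above tags `permit` once as a verb (`VB`) and once as a noun (`NN`). A minimal sketch, reusing that output as a literal list so it runs without NLTK, that collects the tags assigned to each word and keeps the words that received more than one:

```python
from collections import defaultdict

# Tagged output copied from the cell above.
tagged = [
    ("They", "PRP"), ("do", "VBP"), ("not", "RB"), ("permit", "VB"),
    ("other", "JJ"), ("people", "NNS"), ("to", "TO"), ("get", "VB"),
    ("a", "DT"), ("residence", "NN"), ("permit", "NN"),
]

# Collect the distinct tags seen for each (lowercased) word, in order of appearance.
tags_by_word = defaultdict(list)
for word, tag in tagged:
    if tag not in tags_by_word[word.lower()]:
        tags_by_word[word.lower()].append(tag)

# Words that received more than one distinct tag in this sentence.
ambiguous = {w: t for w, t in tags_by_word.items() if len(t) > 1}
print(ambiguous)  # {'permit': ['VB', 'NN']}
```

The tagger can only tell the two uses apart because it looks at the surrounding words, not at `permit` in isolation.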

## Tagging in Spanish

For English, NLTK ships a pre-trained tokenizer and tagger by default. For other languages, a tagger has to be trained first.

* we use the `cess_esp` corpus https://mailman.uib.no/public/corpora/2007-October/005448.html

* which follows the grammatical tag convention defined by the EAGLES group https://www.cs.upc.edu/~nlp/tools/parole-sp.html
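
EAGLES tags are positional: the first character gives the category and the following characters encode attributes such as gender, number or mood. The sketch below is a partial, illustrative decoder. The helper `describe_tag` and its lookup tables are hypothetical and cover only the category plus a few noun and verb attributes, so check them against the EAGLES page linked above before relying on them.

```python
# Partial, illustrative decoding of EAGLES-style tags (assumed mapping,
# not an official API). Position 1 is the category for every tag.
CATEGORY = {
    "a": "adjective", "c": "conjunction", "d": "determiner",
    "f": "punctuation", "i": "interjection", "n": "noun",
    "p": "pronoun", "r": "adverb", "s": "preposition",
    "v": "verb", "w": "date", "z": "numeral",
}
GENDER = {"m": "masculine", "f": "feminine", "c": "common"}
NUMBER = {"s": "singular", "p": "plural", "n": "invariable"}
MOOD = {"i": "indicative", "s": "subjunctive", "m": "imperative",
        "n": "infinitive", "g": "gerund", "p": "participle"}


def describe_tag(tag):
    """Return a rough, human-readable description of an EAGLES tag."""
    if not tag:
        return "unknown"
    tag = tag.lower()
    parts = [CATEGORY.get(tag[0], "unknown")]
    if tag[0] == "n" and len(tag) >= 4:
        # Nouns: position 3 is gender, position 4 is number.
        parts += [GENDER.get(tag[2]), NUMBER.get(tag[3])]
    elif tag[0] == "v" and len(tag) >= 3:
        # Verbs: position 3 is mood.
        parts.append(MOOD.get(tag[2]))
    # Attributes marked "0" (not applicable) are simply skipped.
    return " ".join(p for p in parts if p)


for example in ["ncfs000", "vmip3s0", "vmn0000", "sps00", "rg"]:
    print(example, "->", describe_tag(example))
```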

In [7]:
nltk.download('cess_esp')
from nltk.corpus import cess_esp as cess
from nltk import UnigramTagger as ut
from nltk import BigramTagger as bt

[nltk_data] Downloading package cess_esp to /home/luis/nltk_data...
[nltk_data]   Package cess_esp is already up-to-date!


### Training the unigram tagger

In [8]:
#@title Training the unigram tagger
cess_sents = cess.tagged_sents()

fraction = int(len(cess_sents)*90/100) # index marking 90% of the dataset
uni_tagger = ut(cess_sents[:fraction]) # train the tagger on the first 90% of the dataset
# evaluate on the remaining 10%; accuracy() replaces the deprecated evaluate()
# note: the slice starts at fraction+1, so the sentence at index `fraction` falls in neither split
uni_tagger.accuracy(cess_sents[fraction+1:])


0.8069484240687679
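
The figure above is a token-level accuracy: the share of words in the held-out sentences whose predicted tag equals the gold tag, with `None` counted as a miss. A minimal sketch of that computation in plain Python follows; `tagging_accuracy` and the toy data are hypothetical, written for illustration rather than taken from NLTK.

```python
def tagging_accuracy(tag_fn, gold_sents):
    """Share of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [w for w, _ in sent]
        predicted = tag_fn(words)
        for (_, guess), (_, gold) in zip(predicted, sent):
            correct += guess == gold  # True counts as 1, False as 0
            total += 1
    return correct / total if total else 0.0


# Toy check: a lookup "tagger" that knows only one of the two words.
gold = [[("la", "da0fs0"), ("tarde", "ncfs000")]]
lookup = {"la": "da0fs0"}
toy_tagger = lambda words: [(w, lookup.get(w)) for w in words]
print(tagging_accuracy(toy_tagger, gold))  # 0.5
```

With the notebook's objects, the same function could be called as `tagging_accuracy(uni_tagger.tag, cess_sents[fraction+1:])`.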

In [9]:
uni_tagger.tag("A mi me gusta correr por la tarde.".split())

[('A', 'sps00'),
 ('mi', 'dp1css'),
 ('me', 'pp1cs000'),
 ('gusta', 'vmip3s0'),
 ('correr', 'vmn0000'),
 ('por', 'sps00'),
 ('la', 'da0fs0'),
 ('tarde.', None)]
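
A likely reason for the `None` on `'tarde.'` is that `str.split()` leaves the period attached to the word, so the tagger is asked about the token `tarde.` and not `tarde`. A minimal sketch of a tokenizer that separates punctuation, using only the standard library (`simple_tokenize` is a hypothetical helper; NLTK's `wordpunct_tokenize` is a ready-made alternative):

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "A mi me gusta correr por la tarde."
print(sentence.split())           # last token is 'tarde.'
print(simple_tokenize(sentence))  # last tokens are 'tarde' and '.'
```

Whether `tarde` and `.` then receive a tag still depends on their being present in the training data.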

In [10]:
uni_tagger.tag("Yo soy una persona muy amable".split(" "))

[('Yo', 'pp1csn00'),
 ('soy', 'vsip1s0'),
 ('una', 'di0fs0'),
 ('persona', 'ncfs000'),
 ('muy', 'rg'),
 ('amable', None)]
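
The `None` on `amable` shows the main limit of a unigram tagger: it can only tag words it saw during training. A minimal sketch of the idea in plain Python, assuming a simplified model that stores the most frequent tag per word (NLTK's `UnigramTagger` has more options, such as a cutoff and a backoff tagger; the functions and toy data below are hypothetical):

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each word to the tag it carried most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_unigram(model, words):
    """Tag each word independently; unseen words get None."""
    return [(w, model.get(w)) for w in words]


toy_train = [
    [("una", "di0fs0"), ("persona", "ncfs000")],
    [("muy", "rg"), ("una", "di0fs0")],
]
model = train_unigram(toy_train)
print(tag_unigram(model, ["una", "persona", "amable"]))
# [('una', 'di0fs0'), ('persona', 'ncfs000'), ('amable', None)]
```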

### Training the bigram tagger

In [11]:
#@title Training the bigram tagger
fraction = int(len(cess_sents)*90/100) # index marking 90% of the dataset
bi_tagger = bt(cess_sents[:fraction]) # train the tagger on the first 90% of the dataset
# evaluate on the remaining 10% (same slice as for the unigram tagger); accuracy() replaces the deprecated evaluate()
bi_tagger.accuracy(cess_sents[fraction + 1:])


0.1095272206303725

In [12]:
bi_tagger.tag("Yo soy una persona muy amable".split(" "))

[('Yo', 'pp1csn00'),
 ('soy', 'vsip1s0'),
 ('una', None),
 ('persona', None),
 ('muy', None),
 ('amable', None)]

In [13]:
bi_tagger.tag("A mi me gusta correr por la tarde.".split())

[('A', 'sps00'),
 ('mi', 'dp1css'),
 ('me', None),
 ('gusta', None),
 ('correr', None),
 ('por', None),
 ('la', None),
 ('tarde.', None)]

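Bigram tagging on its own performs poorly here: about 11% accuracy versus about 81% for the unigram tagger. As the outputs above show, once the tagger meets a (previous tag, word) context it never saw in training it returns `None`, and every later word in the sentence gets `None` as well. Using it without a backoff tagger is not recommended.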
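
A common remedy in NLTK is to chain taggers through the `backoff` parameter, so that a bigram tagger falls back to a unigram tagger and finally to a default tag. The sketch below uses a tiny hand-made training set, not `cess_esp`, so it runs without downloads. The default tag `ncms000` is an arbitrary guess chosen for illustration, and no accuracy figure on `cess_esp` is claimed.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# Toy training data (assumption: two hand-tagged sentences, for illustration only).
train = [
    [("yo", "pp1csn00"), ("soy", "vsip1s0"), ("una", "di0fs0"), ("persona", "ncfs000")],
    [("una", "di0fs0"), ("persona", "ncfs000"), ("muy", "rg"), ("amable", "aq0cs0")],
]
words = "yo soy una casa muy amable".split()  # 'casa' is not in the training data

# Bigram tagger alone: after the unseen word, every tag is None.
plain = BigramTagger(train).tag(words)
print(plain)

# Chain: bigram -> unigram -> default tag for anything still unknown.
default = DefaultTagger("ncms000")
uni = UnigramTagger(train, backoff=default)
bi = BigramTagger(train, backoff=uni)
chained = bi.tag(words)
print(chained)
```

In the chained version only the unseen word receives the default guess, and the words after it are tagged again because the unigram tagger does not depend on the previous tag.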

In [14]:
# train on 100% of the corpus (cess_sents was loaded above with cess.tagged_sents())
uni_tagger_100 = ut(cess_sents)

In [15]:
uni_tagger_100.tag("A mi me gusta correr por la tarde.".split())


[('A', 'sps00'),
 ('mi', 'dp1css'),
 ('me', 'pp1cs000'),
 ('gusta', 'vmip3s0'),
 ('correr', 'vmn0000'),
 ('por', 'sps00'),
 ('la', 'da0fs0'),
 ('tarde.', None)]

In [16]:
uni_tagger_100.tag("Yo soy una persona muy amable".split(" "))


[('Yo', 'pp1csn00'),
 ('soy', 'vsip1s0'),
 ('una', 'di0fs0'),
 ('persona', 'ncfs000'),
 ('muy', 'rg'),
 ('amable', 'aq0cs0')]

# Improved tagging with Stanza (StanfordNLP)

**What is Stanza?**

* Stanford's NLP research group maintained a suite of libraries covering various NLP tasks. The suite was unified into a single Java-based service called **CoreNLP**: https://stanfordnlp.github.io/CoreNLP/index.html

* For Python there is **StanfordNLP**: https://stanfordnlp.github.io/stanfordnlp/index.html

* However, **StanfordNLP** has been deprecated, and new versions of the NLP suite are maintained under the name **Stanza**: https://stanfordnlp.github.io/stanza/

In [17]:
!which pip

/home/luis/projects/cursos/platzi-curso-algoritmos-clasificacion-texto/.venv/bin/pip


In [19]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.4.2-py3-none-any.whl (691 kB)
Collecting protobuf
  Downloading protobuf-4.21.12-cp37-abi3-manylinux2014_x86_64.whl (409 kB)
Collecting emoji
  Downloading emoji-2.2.0.tar.gz (240 kB)
  Preparing metadata (setup.py) ... done
Collecting torch>=1.3.0
  Downloading torch-1.13.1-cp310-cp310-manylinux1_x86_64.whl (887.5 MB)
Collecting requests
  Using cached requests-2.28.2-py3-none-any.whl (62 kB)
Collecting n

In [22]:
# this step can take a while ...
import stanza
stanza.download('es')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json: 193kB [00:00, 14.0MB/s]                    
2023-02-07 10:41:46 INFO: Downloading default packages for language: es (Spanish) ...
2023-02-07 10:41:48 INFO: File exists: /home/luis/stanza_resources/es/default.zip
2023-02-07 10:41:52 INFO: Finished downloading models and saved to /home/luis/stanza_resources.


In [24]:
nlp = stanza.Pipeline('es', processors='tokenize, pos')
doc = nlp('yo soy una persona muy amable')

2023-02-07 10:51:54 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.4.1.json: 193kB [00:00, 17.1MB/s]                    
2023-02-07 10:51:54 INFO: Loading these models for language: es (Spanish):
| Processor | Package |
-----------------------
| tokenize  | ancora  |
| mwt       | ancora  |
| pos       | ancora  |

2023-02-07 10:51:54 INFO: Use device: gpu
2023-02-07 10:51:54 INFO: Loading: tokenize
2023-02-07 10:51:54 INFO: Loading: mwt
2023-02-07 10:51:55 INFO: Loading: pos
2023-02-07 10:51:55 INFO: Done loading processors!


In [28]:
doc.sentences[0].words

[{
   "id": 1,
   "text": "yo",
   "upos": "PRON",
   "xpos": "pp1csn00",
   "feats": "Case=Nom|Number=Sing|Person=1|PronType=Prs",
   "start_char": 0,
   "end_char": 2
 },
 {
   "id": 2,
   "text": "soy",
   "upos": "AUX",
   "xpos": "vsip1s0",
   "feats": "Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin",
   "start_char": 3,
   "end_char": 6
 },
 {
   "id": 3,
   "text": "una",
   "upos": "DET",
   "xpos": "di0fs0",
   "feats": "Definite=Ind|Gender=Fem|Number=Sing|PronType=Art",
   "start_char": 7,
   "end_char": 10
 },
 {
   "id": 4,
   "text": "persona",
   "upos": "NOUN",
   "xpos": "ncfs000",
   "feats": "Gender=Fem|Number=Sing",
   "start_char": 11,
   "end_char": 18
 },
 {
   "id": 5,
   "text": "muy",
   "upos": "ADV",
   "xpos": "rg",
   "start_char": 19,
   "end_char": 22
 },
 {
   "id": 6,
   "text": "amable",
   "upos": "ADJ",
   "xpos": "aq0cs0",
   "feats": "Number=Sing",
   "start_char": 23,
   "end_char": 29
 }]
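
In the output above, `upos` is the universal POS tag, `xpos` is the treebank-specific tag (here the same EAGLES-style tags seen with `cess_esp`), and `feats` packs the morphological features into a `|`-separated string of `Key=Value` pairs. A minimal sketch of turning that string into a dictionary follows. `parse_feats` is a hypothetical helper, and it assumes the field is `None` or empty when a word has no features, as with `muy` above.

```python
def parse_feats(feats):
    """Turn 'Key=Value|Key=Value' into a dict; None or '' gives {}."""
    if not feats:
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Feature string copied from the entry for "soy" above.
soy_feats = parse_feats("Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin")
print(soy_feats["Tense"], soy_feats["Person"])  # Pres 1
```

With the notebook's objects, this could be applied as `parse_feats(word.feats)` for each `word` in `doc.sentences[0].words`.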

In [29]:
for sentence in doc.sentences:
  for word in sentence.words:
    print(word.text, word.pos)

yo PRON
soy AUX
una DET
persona NOUN
muy ADV
amable ADJ


Stanza is a powerful, easy-to-use NLP library created by Stanford. Here it tagged every word of the example sentence, including `amable`, which the unigram tagger trained on 90% of `cess_esp` left as `None`.

# Additional references

* POS tagging with Stanza: https://stanfordnlp.github.io/stanza/pos.html#accessing-pos-and-morphological-feature-for-word

* Stanza | GitHub: https://github.com/stanfordnlp/stanza

* Paper on arXiv: https://arxiv.org/pdf/2003.07082.pdf