# EXERCISE 1. TRAINING AND TAGGING WITH PoS TAGGERS

## **(a)** WITH PRE-TRAINED MODELS (ELECTIVE, 0.75 POINTS)

Find and download one or more freely available taggers that ship with pre-trained models for two languages: (1) English and (2) a Romance language. Use them to tag a text file (.txt) of approximately 10,000 words for each language.


DELIVERABLES:

For each tagger used, the report must include a section analyzing its characteristics (the model it is based on, etc.), the URL of the website it was obtained from, whether the input text had to be preprocessed and how, a brief analysis of the output, etc. A compressed file must also be attached containing one subdirectory per language and, within each one:
*	A text file with the URL of the original source of the tagged text (URL.txt),
*	The raw input text file to be tagged (INPUT_RAW.txt),
*	A copy of the tagger output for that input (OUTPUT_RAW.txt).


In [None]:
import pandas as pd
import numpy as np
import spacy
import random as rn
from google.colab import drive

In [None]:
# Mount Google Drive in the project directory so the data archives can be unzipped
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### Data preprocessing

This section preprocesses the downloaded datasets so that they fit the needs of the task.

#### EN

In [None]:
!unzip -n '/content/gdrive/MyDrive/PLN/1/(a)/EN_Blogger_Corpus.zip' >> /dev/null

# Define the paths
en_raw_text_path = "/content/gdrive/MyDrive/PLN/1/(a)/EN/"
en_raw_text_file_name = "INPUT_RAW.txt"
en_complete_path = en_raw_text_path + en_raw_text_file_name
en_corpus_dataset = "blogtext.csv"

In [None]:
# Read the CSV as a DataFrame
blogtext_df = pd.read_csv(en_corpus_dataset, usecols=[6])
print(blogtext_df.head())

                                                text
0             Info has been found (+/- 100 pages,...
1             These are the team members:   Drewe...
2             In het kader van kernfusie op aarde...
3                   testing!!!  testing!!!          
4               Thanks to Yahoo!'s Toolbar I can ...


In [None]:
# Drop rows with missing text (reassigning, because dropna(inplace=True) on a selected column does not remove rows from the DataFrame)
blogtext_df = blogtext_df.dropna(subset=["text"]).copy()
# Cast the text column to str
blogtext_df["text"] = blogtext_df["text"].astype(str)
# Strip leading and trailing whitespace from the texts
blogtext_df.text = blogtext_df.text.apply(lambda s: s.strip())
print(blogtext_df.head())

                                                text
0  Info has been found (+/- 100 pages, and 4.5 MB...
1  These are the team members:   Drewes van der L...
2  In het kader van kernfusie op aarde:  MAAK JE ...
3                             testing!!!  testing!!!
4  Thanks to Yahoo!'s Toolbar I can now 'capture'...


In [None]:
# Add a column with the number of words in the text of each row
blogtext_df["word_count"] = blogtext_df.text.apply(lambda s: len(s.split()))
print(blogtext_df.head())

                                                text  word_count
0  Info has been found (+/- 100 pages, and 4.5 MB...          28
1  These are the team members:   Drewes van der L...          20
2  In het kader van kernfusie op aarde:  MAAK JE ...        4326
3                             testing!!!  testing!!!           2
4  Thanks to Yahoo!'s Toolbar I can now 'capture'...          65


In [None]:
# Build a new DataFrame by shuffling the rows of the existing one
seed = 7
rand_blogtext_df = blogtext_df.sample(frac=1, random_state=seed, ignore_index=True)
print(rand_blogtext_df.head())

                                                text  word_count
0  urlLink    Do you see the man in the picture o...          17
1  Yeah, cause it's  obviously  a  urlLink family...           9
2              (damned tarnished halo showing again)           5
3  I suggest that the underground press could per...        5470
4  sometimes when I am quiet and in tune  I feel ...         120


In [None]:
# Write INPUT_RAW.txt row by row,
# keeping a running word count and stopping once it reaches 10000
word_count = 0
row = 0
with open(en_complete_path, 'w', encoding='utf-8') as f:
  while word_count < 10000 and row < len(rand_blogtext_df):
    f.write(rand_blogtext_df["text"][row])
    f.write('\n')
    # Count the words of the row just written before moving on to the next one
    word_count += rand_blogtext_df["word_count"][row]
    row += 1
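The selection logic of the loop above can be isolated from the file writing, which makes it easier to check on a toy input. The following is a minimal sketch, not the notebook's own code: `take_until_word_target` is a hypothetical helper name, and words are counted by whitespace splitting, as in the `word_count` column.

```python
from typing import Iterable, List


def take_until_word_target(texts: Iterable[str], target: int = 10_000) -> List[str]:
    """Collect texts in order until their combined word count reaches `target`.

    Hypothetical helper for illustration; words are whitespace-separated tokens.
    """
    selected: List[str] = []
    total = 0
    for text in texts:
        if total >= target:
            break
        selected.append(text)
        total += len(text.split())
    return selected


sample = ["one two three", "four five", "six seven eight nine", "ten"]
chosen = take_until_word_target(sample, target=5)
print(chosen)
```

Because a whole text is always added, the final count can overshoot the target by up to the length of the last text, which matches the "approximately 10,000 words" requirement.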

#### ES

In [None]:
!unzip -n '/content/gdrive/MyDrive/PLN/1/(a)/ES_News_Corpus.zip' >> /dev/null

# Define the paths
es_raw_text_path = "/content/gdrive/MyDrive/PLN/1/(a)/ES/"
es_raw_text_file_name = "INPUT_RAW.txt"
es_complete_path = es_raw_text_path + es_raw_text_file_name
es_corpus_dataset = "df_total.csv"

In [None]:
# Read the CSV as a DataFrame
newstext_df = pd.read_csv(es_corpus_dataset, usecols=[1])
print(newstext_df.head())

                                                news
0  Durante el foro La banca articulador empresari...
1  El regulador de valores de China dijo el domin...
2  En una industria históricamente masculina como...
3  Con el dato de marzo el IPC interanual encaden...
4  Ayer en Cartagena se dio inicio a la versión n...


In [None]:
# Drop rows with missing text (reassigning, because dropna(inplace=True) on a selected column does not remove rows from the DataFrame)
newstext_df = newstext_df.dropna(subset=["news"]).copy()
# Cast the text column to str
newstext_df["news"] = newstext_df["news"].astype(str)
# Strip leading and trailing whitespace from the texts
newstext_df.news = newstext_df.news.apply(lambda s: s.strip())
print(newstext_df.head())

                                                news
0  Durante el foro La banca articulador empresari...
1  El regulador de valores de China dijo el domin...
2  En una industria históricamente masculina como...
3  Con el dato de marzo el IPC interanual encaden...
4  Ayer en Cartagena se dio inicio a la versión n...


In [None]:
# Add a column with the number of words in the text of each row
newstext_df["word_count"] = newstext_df.news.apply(lambda s: len(s.split()))
print(newstext_df.head())

                                                news  word_count
0  Durante el foro La banca articulador empresari...         221
1  El regulador de valores de China dijo el domin...         342
2  En una industria históricamente masculina como...         367
3  Con el dato de marzo el IPC interanual encaden...         477
4  Ayer en Cartagena se dio inicio a la versión n...         793


In [None]:
# Build a new DataFrame by shuffling the rows of the existing one
seed = 7
rand_newstext_df = newstext_df.sample(frac=1, random_state=seed, ignore_index=True)
print(rand_newstext_df.head())

                                                news  word_count
0  El Bbva Consumption Tracker es un indicador de...         408
1  Los cambios tecnológicos modifican la forma en...         195
2  Este viernes Shell inauguró su primera estació...         163
3  Los precios que pagan los consumidores estadou...         401
4  En un mundo tan hiperconectado como el actual,...         731


In [None]:
# Write INPUT_RAW.txt row by row,
# keeping a running word count and stopping once it reaches 10000
word_count = 0
row = 0
with open(es_complete_path, 'w', encoding='utf-8') as f:
  while word_count < 10000 and row < len(rand_newstext_df):
    f.write(rand_newstext_df["news"][row])
    f.write('\n')
    # Count the words of the row just written before moving on to the next one
    word_count += rand_newstext_df["word_count"][row]
    row += 1

### PoS Tagging

This section uses the pre-trained PoS taggers to tag the texts generated in the previous section.

#### EN

In [None]:
# Load the pre-trained model and define the input and output paths.
nlp = spacy.load("en_core_web_sm")
input_file = "/content/gdrive/MyDrive/PLN/1/(a)/EN/INPUT_RAW.txt"
output_file = "/content/gdrive/MyDrive/PLN/1/(a)/EN/OUTPUT_RAW.txt"

In [None]:
# Read the input and pass it to the model
with open(input_file, encoding='utf-8') as f:
  t_input = f.read()
doc = nlp(t_input)

In [None]:
# Save the tagged text to the output file, one sentence per line.
with open(output_file, 'w') as of:
  for sentence in doc.sents:
    for token in sentence:
      of.write(f"{token.text}({token.pos_}) ")
    of.write('\n')
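For the brief analysis of the output that the deliverables ask for, the `token(TAG)` format written above can be read back and summarized. This is a minimal sketch under stated assumptions: `parse_tagged_line` and `tag_frequencies` are hypothetical helper names, items are assumed to be separated by single spaces, and the tag is taken to be the last parenthesised group of each item.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_tagged_line(line: str) -> List[Tuple[str, str]]:
    """Split one output line into (token_text, tag) pairs."""
    pairs = []
    for item in line.split():
        if not item.endswith(")") or "(" not in item:
            continue  # skip anything that is not in text(TAG) form
        # Drop the closing bracket, then split on the LAST opening bracket,
        # so tokens that are themselves brackets are still parsed correctly.
        text, _, tag = item[:-1].rpartition("(")
        pairs.append((text, tag))
    return pairs


def tag_frequencies(lines: Iterable[str]) -> Counter:
    """Count how often each tag occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(tag for _, tag in parse_tagged_line(line))
    return counts


example = ["The(DET) cat(NOUN) sat(VERB) .(PUNCT) ", "((PUNCT) dogs(NOUN) )(PUNCT) "]
print(tag_frequencies(example).most_common())
```

Tokens whose text contains spaces would be split incorrectly by this sketch; a tab-separated output format would avoid that ambiguity.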

#### ES

In [None]:
# The Spanish pipeline package has to be installed before it can be used:
!python -m spacy download es_core_news_sm

# Load the pre-trained model and define the input and output paths.
nlp = spacy.load("es_core_news_sm")
input_file = "/content/gdrive/MyDrive/PLN/1/(a)/ES/INPUT_RAW.txt"
output_file = "/content/gdrive/MyDrive/PLN/1/(a)/ES/OUTPUT_RAW.txt"

Collecting es-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.7.0/es_core_news_sm-3.7.0-py3-none-any.whl (12.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.9/12.9 MB 53.2 MB/s eta 0:00:00
Installing collected packages: es-core-news-sm
Successfully installed es-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.




In [None]:
# Read the input and pass it to the model
with open(input_file, encoding='utf-8') as f:
  t_input = f.read()
doc = nlp(t_input)

In [None]:
# Save the tagged text to the output file, one sentence per line.
with open(output_file, 'w') as of:
  for sentence in doc.sents:
    for token in sentence:
      of.write(f"{token.text}({token.pos_}) ")
    of.write('\n')

## **(b)** TRAINING THE MODELS (OPTIONAL, UP TO 1.5 POINTS)


The same task, but without using pre-trained models. The student must find freely available corpora with which to train the tagger. Remember that a treebank that also records the morphosyntactic tags of the words in the text can likewise be used to train a tagger.
Extending the experiment to more taggers and, above all, more languages will be valued positively, especially (in increasing order of merit): non-Latin languages, non-Indo-European languages, and languages written in a non-Latin alphabet. Both the variety of the set of languages and the variety of the tagger types used will also be taken into account.

DELIVERABLES:

The same as in the previous part, additionally including in the report the information on the training corpora used, the characteristics of the machine used, and the training times required.


In [None]:
import pandas as pd
import numpy as np
import re  # used in the Test cells below to collapse whitespace
import spacy
import random as rn
from google.colab import drive

In [None]:
# Mount Google Drive in the project directory so the data archives can be unzipped
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### Data preprocessing

This section preprocesses the treebanks used for training and the datasets used for testing.

#### JA

In [None]:
# Call spaCy's "convert" CLI command to transform the treebanks from .conllu to .spacy format

!python -m spacy convert "/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/ja_train.conllu" "/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/" --converter conllu --n-sents 10 --merge-subtokens
!python -m spacy convert "/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/ja_dev.conllu" "/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/" --converter conllu --n-sents 10 --merge-subtokens

ℹ Grouping every 10 sentences into a document.
✔ Generated output file (705 documents):
/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/ja_train.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (51 documents):
/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/ja_dev.spacy
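The `--n-sents 10` option groups sentences into documents, so the document counts in the log relate directly to the number of sentences in each `.conllu` file. The sketch below illustrates that relation on a toy input. It assumes the standard CoNLL-U layout (comment lines starting with `#`, sentences separated by blank lines) and that a final, shorter group still forms a document; both helper names are hypothetical.

```python
import math


def count_conllu_sentences(conllu_text: str) -> int:
    """Count sentences: blocks of token lines separated by blank lines."""
    count = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        stripped = line.strip()
        if not stripped:
            in_sentence = False  # a blank line closes the current sentence
        elif not stripped.startswith("#") and not in_sentence:
            count += 1  # first token line of a new sentence
            in_sentence = True
    return count


def documents_after_grouping(n_sentences: int, n_sents: int = 10) -> int:
    """Number of documents when every `n_sents` sentences form one document."""
    return math.ceil(n_sentences / n_sents)


# Two toy sentences with the ten tab-separated CoNLL-U columns.
row = lambda i, form, upos: "\t".join([str(i), form, form, upos] + ["_"] * 6)
toy = "\n".join([
    "# text = Hello world",
    row(1, "Hello", "INTJ"),
    row(2, "world", "NOUN"),
    "",
    "# text = Bye",
    row(1, "Bye", "INTJ"),
    "",
])
n = count_conllu_sentences(toy)
print(n, documents_after_grouping(n))
```

Running `count_conllu_sentences` on the real training file gives a quick sanity check that the conversion did not silently drop sentences.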


The test partition of the UD treebank is used: its .txt file contains the raw text and its word count is adequate, so it can be used without modification.

#### RU

In [None]:
# Call spaCy's "convert" CLI command to transform the treebanks from .conllu to .spacy format

!python -m spacy convert "/content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/ru_train.conllu" "/content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/" --converter conllu --n-sents 10 --merge-subtokens
!python -m spacy convert "/content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/ru_dev.conllu" "/content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/" --converter conllu --n-sents 10 --merge-subtokens

ℹ Grouping every 10 sentences into a document.
✔ Generated output file (1605 documents):
/content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/ru_train.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (95 documents):
/content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/ru_dev.spacy


The test partition of the UD treebank is used: its .txt file contains the raw text, which is trimmed so that the word count is adequate.

#### ZH

In [None]:
# Call spaCy's "convert" CLI command to transform the treebanks from .conllu to .spacy format

!python -m spacy convert "/content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/zh_train.conllu" "/content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/" --converter conllu --n-sents 10 --merge-subtokens
!python -m spacy convert "/content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/zh_dev.conllu" "/content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/" --converter conllu --n-sents 10 --merge-subtokens

ℹ Grouping every 10 sentences into a document.
✔ Generated output file (400 documents):
/content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/zh_train.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (50 documents):
/content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/zh_dev.spacy


### Training

This section trains the taggers for the three languages (JA, RU and ZH).
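The deliverables for this part ask for the characteristics of the machine and the training times. One way to collect both from Python is sketched below, using only the standard library. `describe_machine` and `timed_run` are hypothetical helper names, and the command being timed is a placeholder; in the notebook it would be replaced by the `python -m spacy train ...` invocation used in the cells that follow.

```python
import os
import platform
import subprocess
import sys
import time


def describe_machine() -> dict:
    """Collect basic facts about the machine for the report."""
    return {
        "platform": platform.platform(),
        "processor": platform.processor(),
        "cpu_count": os.cpu_count(),
        "python": platform.python_version(),
    }


def timed_run(command: list) -> tuple:
    """Run a command and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(command, check=False)
    return completed.returncode, time.perf_counter() - start


# Placeholder command standing in for a real training run.
code, seconds = timed_run([sys.executable, "-c", "print('training placeholder')"])
print(describe_machine())
print(f"exit code {code}, {seconds:.2f} s")
```

Wall-clock time measured this way includes model loading and saving, which is usually what a training-time table is meant to report.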


#### JA

In [None]:
# Install the dependencies needed to train for Japanese
!pip install sudachipy sudachidict_core

Collecting sudachipy
  Downloading SudachiPy-0.6.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 24.6 MB/s eta 0:00:00
Collecting sudachidict_core
  Downloading SudachiDict_core-20240109-py3-none-any.whl (71.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71.8/71.8 MB 7.1 MB/s eta 0:00:00
Installing collected packages: sudachipy, sudachidict_core
Successfully installed sudachidict_core-20240109 sudachipy-0.6.8


In [None]:
# Generate the final configuration file
!python -m spacy init fill-config "/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/ja_base_config.cfg" "/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/ja_config.cfg"

✔ Auto-filled config with all values
✔ Saved config
/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/ja_config.cfg
You can now add your data and train your pipeline:
python -m spacy train ja_config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
!python -m spacy train "/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/ja_config.cfg" --output "/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/Trained" --paths.train "/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/ja_train.spacy" --paths.dev  "/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/ja_dev.spacy"

ℹ Saving to output directory: /content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/Trained
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'tagger']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  TAG_ACC  SCORE 
---  ------  ------------  -----------  -------  ------
  0       0          0.00       191.52    76.98    0.77
  0     200        422.05     16352.68    76.98    0.77
  0     400        534.09      7665.17    76.98    0.77
  0     600        515.68      5920.69    76.98    0.77
  1     800        453.05      4530.33    76.98    0.77
  1    1000        447.13      3672.46    76.98    0.77
  1    1200        458.69      3616.99    76.98    0.77
  1    1400        449.00      3400.83    76.98    0.77
  2    1600        315.59      2131.50    76.98    0.77
✔ Saved pipeline to output directory
/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/Trained/model-last


#### RU

In [None]:
# Generate the final configuration file
!python -m spacy init fill-config "/content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/ru_base_config.cfg" "/content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/ru_config.cfg"

✔ Auto-filled config with all values
✔ Saved config
/content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/ru_config.cfg
You can now add your data and train your pipeline:
python -m spacy train ru_config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
!python -m spacy train "/content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/ru_config.cfg" --output "/content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/Trained" --paths.train "/content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/ru_train.spacy" --paths.dev  "/content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/ru_dev.spacy"

ℹ Saving to output directory: /content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/Trained
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'tagger']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  TAG_ACC  SCORE 
---  ------  ------------  -----------  -------  ------
  0       0          0.00       221.93    27.12    0.27
  0     200        221.28      8269.95    73.95    0.74
  0     400        258.31      4072.54    79.06    0.79
  0     600        227.22      3505.29    83.90    0.84
  0     800        238.35      3729.30    85.30    0.85
  0    1000        244.76      3789.38    86.86    0.87
  1    1200        269.04      4120.66    87.47    0.87
  1    1400        273.16      3814.18    88.90    0.89
  1    1600        341.07      4799.93    88.50    0.89
  2    1800        367.69      5059.93    89.55    0.90
  2    2000        384.57      5016.32    89.66    0.90
  3    2200      

#### ZH

In [None]:
# Generate the final configuration file
!python -m spacy init fill-config "/content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/zh_base_config.cfg" "/content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/zh_config.cfg"

✔ Auto-filled config with all values
✔ Saved config
/content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/zh_config.cfg
You can now add your data and train your pipeline:
python -m spacy train zh_config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
!python -m spacy train "/content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/zh_config.cfg" --output "/content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/Trained" --paths.train "/content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/zh_train.spacy" --paths.dev  "/content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/zh_dev.spacy"

ℹ Saving to output directory: /content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/Trained
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'tagger']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  TAG_ACC  SCORE 
---  ------  ------------  -----------  -------  ------
  0       0          0.00       445.23    33.29    0.33
  0     200        852.73     31544.49    79.39    0.79
  1     400       1215.99     20462.36    82.37    0.82
  1     600       1221.03     16127.07    84.66    0.85
  2     800       1347.94     16223.04    85.53    0.86
  2    1000       1314.94     13311.15    85.81    0.86
  3    1200       1328.53     12900.86    86.33    0.86
  3    1400       1295.65     10935.07    87.05    0.87
  4    1600       1367.49     11325.33    87.13    0.87
  4    1800       1255.06      9339.91    87.31    0.87
  5    2000       1400.09     10124.44    87.69    0.88
  5    2200       1

### Test

In this section, the taggers trained in the previous section are fed the test corpora; they annotate them and the results are saved.

#### JA

In [None]:
# Load the model and define the input and output paths.
nlp = spacy.load("/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/Trained/model-best")
input_file = "/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/INPUT_RAW.txt"
output_file = "/content/gdrive/MyDrive/PLN/1/(b)/JA (GSD)/OUTPUT_RAW.txt"

For some languages (Japanese among them), spaCy has a limitation: the maximum size of the text that can be passed to the model does not match the one given in the library's documentation, so the text has to be split and passed to the tagger in several parts.

The problem originates in one of spaCy's dependencies rather than in the library itself; for Japanese, it lies in the sudachipy library.

To cut the text as efficiently as possible, the length of the text in bytes after UTF-8 encoding is computed, and the length of the text in characters is divided by it. The resulting ratio gives an initial estimate of how many characters fit in one chunk.

The maximum allowed size can be read from the error traceback, which reports the maximum expected value; for Japanese it is 49149 bytes.
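The reason the limit has to be handled in bytes rather than characters is that Japanese characters typically take three bytes each in UTF-8. The sketch below shows the difference and a deliberately simple byte-bounded splitter. It is an illustrative alternative to the function used in the next cell, not a copy of it; `chunk_by_utf8_bytes` is a hypothetical name, and the sketch makes no attempt to cut at word or sentence boundaries.

```python
def chunk_by_utf8_bytes(text: str, max_bytes: int) -> list:
    """Split text into pieces whose UTF-8 encoding is at most `max_bytes` long.

    Characters are never split, because the byte count is tracked per character.
    """
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4, the longest UTF-8 character")
    chunks = []
    current = []
    current_bytes = 0
    for char in text:
        size = len(char.encode("utf-8"))
        if current_bytes + size > max_bytes:
            # The next character would exceed the limit: close the current chunk.
            chunks.append("".join(current))
            current = []
            current_bytes = 0
        current.append(char)
        current_bytes += size
    if current:
        chunks.append("".join(current))
    return chunks


sample = "日本語のテキスト abc"
# Character count versus UTF-8 byte count of the same string.
print(len(sample), len(sample.encode("utf-8")))
pieces = chunk_by_utf8_bytes(sample, max_bytes=10)
print(pieces)
```

Walking the text character by character is slower than estimating a cut point from the character-to-byte ratio, which is the optimization the next cell relies on.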

In [None]:
# Functions taken from the spaCy repository, https://github.com/explosion/spaCy/issues/13207, by JWittmeyer:

def __utf8len(s:str):
    return len(s.encode('utf-8'))

# splits not after x bytes but ensures that max x bytes are used without destroying the final character
def __chunk_text_on_bytes(text: str, max_chunk_size: int = 1_000_000):
    factor = len(text) / __utf8len(text)
    increase_by = int(max(min(max_chunk_size*.1, 10), 1))
    initial_size_guess = int(max(max_chunk_size * factor - 10, 1))
    final_list = []
    remaining = text
    while len(remaining):
        part = remaining[:initial_size_guess]
        if __utf8len(part) > max_chunk_size:
            # Keep the guess an int so it can still be used as a slice index
            initial_size_guess = int(max(initial_size_guess - min(max_chunk_size *.001, 10), 1))
            continue
        cut_after = initial_size_guess
        while __utf8len(part) < max_chunk_size and part != remaining:
            cut_after = min(len(remaining), cut_after+increase_by)
            part = remaining[:cut_after]

        if __utf8len(part) > max_chunk_size:
            cut_after-=increase_by
        final_list.append(remaining[:cut_after])
        remaining = remaining[cut_after:]

    return final_list

In [None]:
# Read the input, collapse runs of whitespace, and pass it to the model in chunks
with open(input_file, 'r', encoding='utf-8') as file:
  t_input = re.sub(r"\s\s+", " ", " ".join(file.read().splitlines()))
texts = __chunk_text_on_bytes(t_input, 49149)
docs = []
for text in texts:
  docs.append(nlp(text))

In [None]:
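# Save the tagged text to the output file, one line per chunk (the pipeline has no sentence segmenter).
with open(output_file, 'w', encoding='utf-8') as of:
  for doc in docs:
    for token in doc:
      # token.tag_ holds the trained tagger's prediction, as in the RU and ZH cells
      of.write(f"{token.text}({token.tag_}) ")
    of.write('\n')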

#### RU

In [None]:
# Load the model and define the input and output paths.
nlp = spacy.load("/content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/Trained/model-best")
input_file = "/content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/INPUT_RAW.txt"
output_file = "/content/gdrive/MyDrive/PLN/1/(b)/RU (Taiga)/OUTPUT_RAW.txt"

In [None]:
# Read the input, collapse runs of whitespace, and pass it to the model
with open(input_file, 'r', encoding='utf-8') as file:
  t_input = re.sub(r"\s\s+", " ", " ".join(file.read().splitlines()))
doc = nlp(t_input)

In [None]:
# Save the tagged text to the output file as a single line of token(TAG) pairs.
with open(output_file, 'w', encoding='utf-8') as of:
  for token in doc:
    of.write(f"{token.text}({token.tag_}) ")
  of.write('\n')

#### ZH

In [None]:
# Load the model and define the input and output paths.
nlp = spacy.load("/content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/Trained/model-best")
input_file = "/content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/INPUT_RAW.txt"
output_file = "/content/gdrive/MyDrive/PLN/1/(b)/ZH (GSD)/OUTPUT_RAW.txt"

In [None]:
# Read the input, collapse runs of whitespace, and pass it to the model
with open(input_file, 'r', encoding='utf-8') as file:
  t_input = re.sub(r"\s\s+", " ", " ".join(file.read().splitlines()))
doc = nlp(t_input)

In [None]:
# Save the tagged text to the output file as a single line of token(TAG) pairs.
with open(output_file, 'w', encoding='utf-8') as of:
  for token in doc:
    of.write(f"{token.text}({token.tag_}) ")
  of.write('\n')

# EXERCISE 2. TRAINING AND EVALUATION OF CONSTITUENCY PARSERS

## 2.a. ENGLISH + ROMANCE LANGUAGE (ELECTIVE, 1.5 POINTS)


Find and download a freely available constituency parser, then train and evaluate it for two languages: (1) English and (2) a Romance language. To do so, the student must locate and download suitable treebanks: phrase-structure treebanks (i.e. phrase structure grammars) that are freely accessible.

When evaluating the parser, it may be necessary to adapt the treebank format, preprocess the input text to be parsed, or postprocess the parser output, for example. It may also be necessary to split the corpus in two: a training (sub)corpus and an evaluation or gold-standard (sub)corpus. To compare the output obtained with the expected one, the EVALB evaluation tool can be used, for example.

DELIVERABLES:

For each parser used, the report must include a section analyzing its characteristics (the model it is based on, etc.), the URL of the website it was obtained from, whether the input text had to be preprocessed or the output postprocessed and how, etc. Similarly, separate sections must describe the characteristics of the treebanks used, whether they had to be adapted and how, etc.

Finally, for each language, the report must include one or more comparative tables and/or plots of the results obtained with each parser, together with a brief analysis of those results, the characteristics of the machine used, and the training times required.
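The kind of comparison that a bracket-scoring tool such as EVALB performs can be illustrated with a small sketch: every phrasal node of a tree is reduced to a (label, start, end) span, and precision, recall and F1 are computed over the spans shared by the gold and predicted trees. This is a simplified illustration, not EVALB itself. All function names are hypothetical, every node is assumed to carry a label, preterminals (a tag directly over a word) are not counted, and none of EVALB's parameter-file options are modelled.

```python
from collections import Counter


def _parse(tokens, pos):
    """Parse the node starting at tokens[pos] == '('; return (node, next_pos)."""
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = _parse(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])  # a word
            pos += 1
    return (label, children), pos + 1


def _collect(node, start, spans):
    """Add the spans of all phrasal nodes under `node`; return the end index."""
    label, children = node
    end = start
    for child in children:
        end = end + 1 if isinstance(child, str) else _collect(child, end, spans)
    if not all(isinstance(child, str) for child in children):
        spans[(label, start, end)] += 1  # skip preterminals (tag over a word)
    return end


def labeled_spans(tree: str) -> Counter:
    tokens = tree.replace("(", " ( ").replace(")", " ) ").split()
    root, _ = _parse(tokens, 0)
    spans = Counter()
    _collect(root, 0, spans)
    return spans


def bracket_scores(gold: str, predicted: str) -> tuple:
    """Return labeled bracketing (precision, recall, f1) for one sentence."""
    gold_spans, pred_spans = labeled_spans(gold), labeled_spans(predicted)
    matched = sum((gold_spans & pred_spans).values())
    precision = matched / sum(pred_spans.values()) if pred_spans else 0.0
    recall = matched / sum(gold_spans.values()) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
pred = "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on)) (NP (DT the) (NN mat))))"
print(bracket_scores(gold, pred))
```

In the example, the predicted tree attaches the final noun phrase to the verb phrase instead of the prepositional phrase, so one of its five brackets has no match in the gold tree.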


In [None]:
import pandas as pd
import numpy as np
import re
from google.colab import drive

In [None]:
# Mount Google Drive in the project directory so the data archives can be unzipped
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.8.1-py3-none-any.whl (970 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 970.4/970.4 kB 6.0 MB/s eta 0:00:00
Collecting emoji (from stanza)
  Downloading emoji-2.11.0-py2.py3-none-any.whl (433 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 433.8/433.8 kB 9.2 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 38.5 MB/s eta 0:00:00
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 kB 60.7 MB/s eta 0:00:00
Collecting nvi

In [None]:
import stanza

### Setting up the Stanza environment

The instructions at https://stanfordnlp.github.io/stanza/new_language_constituency.html are followed to set up the environment so that models can be trained with Stanza.

The following setup steps only need to be carried out the first time the Stanza environment is prepared.

In [None]:
!cd /content/gdrive/MyDrive/PLN/2 && git clone https://github.com/stanfordnlp/stanza.git

Cloning into 'stanza'...
remote: Enumerating objects: 40391, done.
remote: Counting objects: 100% (2489/2489), done.
remote: Compressing objects: 100% (717/717), done.
remote: Total 40391 (delta 1897), reused 2282 (delta 1770), pack-reused 37902
Receiving objects: 100% (40391/40391), 83.30 MiB | 11.08 MiB/s, done.
Resolving deltas: 100% (30968/30968), done.
Updating files: 100% (519/519), done.


In [None]:
import stanza
stanza.install_corenlp(dir="/content/gdrive/MyDrive/PLN/2/CoreNLP")

INFO:stanza:Installing CoreNLP package into /content/gdrive/MyDrive/PLN/2/CoreNLP


Downloading https://huggingface.co/stanfordnlp/CoreNLP/resolve/main/stanford-corenlp-latest.zip:   0%|        …

INFO:stanza:Downloaded file to /content/gdrive/MyDrive/PLN/2/CoreNLP/corenlp.zip


### EN

In [None]:
!cd /content/gdrive/MyDrive/PLN/2/stanza && source scripts/config.sh && python3 -m stanza.utils.datasets.constituency.prepare_con_dataset en_masc

Read 5201 natural trees
Split 5201 trees into 4160 train 520 dev 521 test
Total lengths 4160 train 520 dev 521 test
Writing 4160 trees to /content/gdrive/MyDrive/PLN/2/constituency/en_masc_train.mrg
Writing 520 trees to /content/gdrive/MyDrive/PLN/2/constituency/en_masc_dev.mrg
Writing 521 trees to /content/gdrive/MyDrive/PLN/2/constituency/en_masc_test.mrg
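The split sizes in the log above are consistent with cutting the list of trees at 80% and 90% of its length, rounding both cut points down. The sketch below reproduces that arithmetic. It is an assumption inferred from the logged counts, not taken from Stanza's source code, and `split_sizes` is a hypothetical name.

```python
def split_sizes(n_trees: int) -> tuple:
    """Return (train, dev, test) sizes for cut points at 80% and 90%, rounded down.

    Integer arithmetic avoids floating-point rounding at the boundaries.
    """
    train_end = n_trees * 8 // 10  # floor(0.8 * n)
    dev_end = n_trees * 9 // 10    # floor(0.9 * n)
    return train_end, dev_end - train_end, n_trees - dev_end


print(split_sizes(5201))
```

The same arithmetic applied to 9948 trees gives 7958, 995 and 995, which matches the split reported for the natural trees of the Portuguese treebank later in this notebook.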


In [None]:
!cd /content/gdrive/MyDrive/PLN/2/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_constituency en_masc --epochs 20

2024-03-26 10:26:37 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/2/stanza/stanza/utils/training/run_constituency.py en_masc --epochs 100
2024-03-26 10:26:37 INFO: Using default pretrain for language, found in /root/stanza_resources/en/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-03-26 10:26:37 INFO: Using model /root/stanza_resources/en/forward_charlm/1billion.pt for forward charlm
2024-03-26 10:26:37 INFO: Using model /root/stanza_resources/en/backward_charlm/1billion.pt for backward charlm
2024-03-26 10:26:37 INFO: Expanded save_name: en_masc_charlm_constituency.pt
2024-03-26 10:26:37 INFO: Expanded save_name: saved_models/constituency/en_masc_charlm_constituency.pt
2024-03-26 10:26:37 INFO: en_masc: saved_models/constituency/en_masc_charlm_constituency.pt does not exist, training new model
2024-03-26 10:26:37 INFO: Using default pretrain for language, found in /root/stanza_resources/en/pretrain/conll17.pt  To use a differe

In [None]:
!cd /content/gdrive/MyDrive/PLN/2/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_constituency en_masc --score_dev

2024-03-27 10:51:34 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/2/stanza/stanza/utils/training/run_constituency.py en_masc --score_dev
2024-03-27 10:51:34 INFO: Default pretrain should be /root/stanza_resources/en/pretrain/conll17.pt  Attempting to download
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json: 379kB [00:00, 20.6MB/s]
2024-03-27 10:51:34 INFO: Downloaded file to /root/stanza_resources/resources.json
2024-03-27 10:51:34 INFO: Downloading these customized packages for language: en (English)...
| Processor | Package |
-----------------------
| pretrain  | conll17 |

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.8.0/models/pretrain/conll17.pt: 100% 107M/107M [00:00<00:00, 245MB/s] 
2024-03-27 10:51:35 INFO: Downloaded file to /root/stanza_resour

In [None]:
!cd /content/gdrive/MyDrive/PLN/2/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_constituency en_masc --score_test

2024-03-27 10:52:10 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/2/stanza/stanza/utils/training/run_constituency.py en_masc --score_test
2024-03-27 10:52:10 INFO: Using default pretrain for language, found in /root/stanza_resources/en/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-03-27 10:52:10 INFO: Using model /root/stanza_resources/en/forward_charlm/1billion.pt for forward charlm
2024-03-27 10:52:10 INFO: Using model /root/stanza_resources/en/backward_charlm/1billion.pt for backward charlm
2024-03-27 10:52:10 INFO: Running test step with args: ['--eval_file', '/content/gdrive/MyDrive/PLN/2/constituency/en_masc_test.mrg', '--shorthand', 'en_masc', '--mode', 'predict', '--wordvec_pretrain_file', '/root/stanza_resources/en/pretrain/conll17.pt', '--charlm_forward_file', '/root/stanza_resources/en/forward_charlm/1billion.pt', '--charlm_backward_file', '/root/stanza_resources/en/backward_charlm/1billion.pt']
2024-03-27 10:52:10 I

### PT

The `read_xml_file` function in Stanza's `convert_cintil.py` script is modified so that it parses the treebank correctly.

In [None]:
!cd /content/gdrive/MyDrive/PLN/2/stanza && source scripts/config.sh && python3 -m stanza.utils.datasets.constituency.prepare_con_dataset pt_cintil

Read 785 synthetic trees
Read 9948 natural trees
Split 9948 trees into 7958 train 995 dev 995 test
Total lengths 8743 train 995 dev 995 test
Writing 8743 trees to /content/gdrive/MyDrive/PLN/2/constituency/pt_cintil_train.mrg
Writing 995 trees to /content/gdrive/MyDrive/PLN/2/constituency/pt_cintil_dev.mrg
Writing 995 trees to /content/gdrive/MyDrive/PLN/2/constituency/pt_cintil_test.mrg
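As a quick sanity check on the counts reported above, the arithmetic can be verified in a few lines. This is only a sketch over the numbers printed in the log; the assumption that the synthetic trees are added to the training split alone is inferred from the totals, not from the Stanza source.

```python
# Counts copied from the prepare_con_dataset log above.
synthetic_trees = 785
natural_trees = 9948
train, dev, test = 7958, 995, 995

# The three natural splits should cover every natural tree exactly once.
assert train + dev + test == natural_trees

# Assumption inferred from the totals: synthetic trees go only into train.
total_train = train + synthetic_trees
print(total_train)                      # 8743
print(round(train / natural_trees, 3))  # 0.8
```

The natural trees are therefore split roughly 80/10/10, and the 8743 training trees match the "Total lengths" line.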


In [None]:
!cd /content/gdrive/MyDrive/PLN/2/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_constituency pt_cintil --epochs 20

2024-03-24 18:09:57 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/2/stanza/stanza/utils/training/run_constituency.py pt_cintil --epochs 100
2024-03-24 18:09:57 INFO: Using default pretrain for language, found in /root/stanza_resources/pt/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-03-24 18:09:57 INFO: Using model /root/stanza_resources/pt/forward_charlm/oscar2023.pt for forward charlm
2024-03-24 18:09:57 INFO: Using model /root/stanza_resources/pt/backward_charlm/oscar2023.pt for backward charlm
2024-03-24 18:09:57 INFO: Expanded save_name: pt_cintil_charlm_constituency.pt
2024-03-24 18:09:57 INFO: Expanded save_name: saved_models/constituency/pt_cintil_charlm_constituency.pt
2024-03-24 18:09:57 INFO: pt_cintil: saved_models/constituency/pt_cintil_charlm_constituency.pt does not exist, training new model
2024-03-24 18:09:57 INFO: Using default pretrain for language, found in /root/stanza_resources/pt/pretrain/conll17.pt  To u

In [None]:
!cd /content/gdrive/MyDrive/PLN/2/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_constituency pt_cintil --score_dev

2024-03-24 19:18:30 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/2/stanza/stanza/utils/training/run_constituency.py pt_cintil --score_dev
2024-03-24 19:18:30 INFO: Using default pretrain for language, found in /root/stanza_resources/pt/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-03-24 19:18:30 INFO: Using model /root/stanza_resources/pt/forward_charlm/oscar2023.pt for forward charlm
2024-03-24 19:18:30 INFO: Using model /root/stanza_resources/pt/backward_charlm/oscar2023.pt for backward charlm
2024-03-24 19:18:30 INFO: Running dev step with args: ['--eval_file', '/content/gdrive/MyDrive/PLN/2/constituency/pt_cintil_dev.mrg', '--shorthand', 'pt_cintil', '--mode', 'predict', '--retag_method', 'upos', '--wordvec_pretrain_file', '/root/stanza_resources/pt/pretrain/conll17.pt', '--charlm_forward_file', '/root/stanza_resources/pt/forward_charlm/oscar2023.pt', '--charlm_backward_file', '/root/stanza_resources/pt/backward_charlm/osc
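Since the deliverables ask for training times, the log timestamps can give a rough estimate. The following is a minimal sketch, assuming the training cell and the `--score_dev` cell were run back to back, so that the gap between their first log lines bounds the training time from above. The helper name `elapsed` is hypothetical.

```python
from datetime import datetime, timedelta

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two Stanza-style log timestamps."""
    start_time = datetime.strptime(start, LOG_TIME_FORMAT)
    end_time = datetime.strptime(end, LOG_TIME_FORMAT)
    return end_time - start_time

# First log line of the pt_cintil training run and of the --score_dev run.
gap = elapsed("2024-03-24 18:09:57", "2024-03-24 19:18:30")
print(gap)  # 1:08:33
```

Under that assumption, training the `pt_cintil` model took at most about 1 hour and 9 minutes; any idle time between the two cells is included in this figure.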

In [None]:
!cd /content/gdrive/MyDrive/PLN/2/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_constituency pt_cintil --score_test

2024-03-24 19:18:57 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/2/stanza/stanza/utils/training/run_constituency.py pt_cintil --score_test
2024-03-24 19:18:57 INFO: Using default pretrain for language, found in /root/stanza_resources/pt/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-03-24 19:18:57 INFO: Using model /root/stanza_resources/pt/forward_charlm/oscar2023.pt for forward charlm
2024-03-24 19:18:57 INFO: Using model /root/stanza_resources/pt/backward_charlm/oscar2023.pt for backward charlm
2024-03-24 19:18:57 INFO: Running test step with args: ['--eval_file', '/content/gdrive/MyDrive/PLN/2/constituency/pt_cintil_test.mrg', '--shorthand', 'pt_cintil', '--mode', 'predict', '--retag_method', 'upos', '--wordvec_pretrain_file', '/root/stanza_resources/pt/pretrain/conll17.pt', '--charlm_forward_file', '/root/stanza_resources/pt/forward_charlm/oscar2023.pt', '--charlm_backward_file', '/root/stanza_resources/pt/backward_charlm/

## 2.b. OTHER PARSERS AND/OR LANGUAGES (OPTIONAL, UP TO 3 POINTS)

As in Section 1.b, the goal is to extend the study to more parsers and, above all, more languages. Once again, the following will be especially valued (from least to most): non-Romance languages, non-Indo-European languages, and languages written in a non-Latin script. Variety, both in the set of languages and in the types of parsers used, will again be taken into account favorably.


In [None]:
import pandas as pd
import numpy as np
import re
from google.colab import drive

In [None]:
# Mount Google Drive in the project directory and unzip the data file
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
!pip install stanza

### JA

In [None]:
!cd /content/gdrive/MyDrive/PLN/2/stanza && source scripts/config.sh && python3 -m stanza.utils.datasets.constituency.prepare_con_dataset ja_alt

Eliminated 9 trees for having wide spaces in it
Eliminated 19 trees for not being correctly encoded
Writing 17195 trees to /content/gdrive/MyDrive/PLN/2/constituency/ja_alt_train.mrg
Writing 934 trees to /content/gdrive/MyDrive/PLN/2/constituency/ja_alt_dev.mrg
Writing 931 trees to /content/gdrive/MyDrive/PLN/2/constituency/ja_alt_test.mrg
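To double-check the tree counts reported above, the bracketed trees in a `.mrg` file can be counted directly. The sketch below counts top-level parenthesised groups, so it does not depend on whether each tree sits on a single line. The function name `count_trees` and the toy input are illustrative, and it assumes that literal parentheses inside tokens have already been escaped (for example as `-LRB-`).

```python
def count_trees(text: str) -> int:
    """Count top-level bracketed trees in Penn-Treebank-style text."""
    depth = 0
    trees = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth == 0:
                trees += 1  # a top-level bracket just closed
            elif depth < 0:
                raise ValueError("unbalanced closing parenthesis")
    if depth != 0:
        raise ValueError("unbalanced opening parenthesis")
    return trees

sample = "(ROOT (S (NP (PRP I)) (VP (VBP run))))\n(ROOT (NP (NN test)))\n"
print(count_trees(sample))  # 2
```

For a real file, the text would be read with `open(path, encoding="utf-8").read()`; if the assumptions hold, the result for `ja_alt_dev.mrg` should match the 934 trees reported in the log.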


In [None]:
!cd /content/gdrive/MyDrive/PLN/2/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_constituency ja_alt --epochs 20

2024-03-27 10:52:54 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/2/stanza/stanza/utils/training/run_constituency.py ja_alt --epochs 20
2024-03-27 10:52:54 INFO: Default pretrain should be /root/stanza_resources/ja/pretrain/conll17.pt  Attempting to download
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json: 379kB [00:00, 17.9MB/s]
2024-03-27 10:52:54 INFO: Downloaded file to /root/stanza_resources/resources.json
2024-03-27 10:52:54 INFO: Downloading these customized packages for language: ja (Japanese)...
| Processor | Package |
-----------------------
| pretrain  | conll17 |

Downloading https://huggingface.co/stanfordnlp/stanza-ja/resolve/v1.8.0/models/pretrain/conll17.pt: 100% 107M/107M [00:01<00:00, 106MB/s] 
2024-03-27 10:52:56 INFO: Downloaded file to /root/stanza_resour

In [None]:
!cd /content/gdrive/MyDrive/PLN/2/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_constituency ja_alt --score_dev

2024-03-28 14:26:15 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/2/stanza/stanza/utils/training/run_constituency.py ja_alt --score_dev
2024-03-28 14:26:15 INFO: Default pretrain should be /root/stanza_resources/ja/pretrain/conll17.pt  Attempting to download
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json: 379kB [00:00, 19.0MB/s]
2024-03-28 14:26:15 INFO: Downloaded file to /root/stanza_resources/resources.json
2024-03-28 14:26:15 INFO: Downloading these customized packages for language: ja (Japanese)...
| Processor | Package |
-----------------------
| pretrain  | conll17 |

Downloading https://huggingface.co/stanfordnlp/stanza-ja/resolve/v1.8.0/models/pretrain/conll17.pt: 100% 107M/107M [00:00<00:00, 211MB/s]
2024-03-28 14:26:16 INFO: Downloaded file to /root/stanza_resourc

In [None]:
!cd /content/gdrive/MyDrive/PLN/2/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_constituency ja_alt --score_test

2024-03-28 14:27:04 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/2/stanza/stanza/utils/training/run_constituency.py ja_alt --score_test
2024-03-28 14:27:04 INFO: Using default pretrain for language, found in /root/stanza_resources/ja/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-03-28 14:27:04 INFO: Using model /root/stanza_resources/ja/forward_charlm/conll17.pt for forward charlm
2024-03-28 14:27:04 INFO: Using model /root/stanza_resources/ja/backward_charlm/conll17.pt for backward charlm
2024-03-28 14:27:04 INFO: Running test step with args: ['--eval_file', '/content/gdrive/MyDrive/PLN/2/constituency/ja_alt_test.mrg', '--shorthand', 'ja_alt', '--mode', 'predict', '--wordvec_pretrain_file', '/root/stanza_resources/ja/pretrain/conll17.pt', '--charlm_forward_file', '/root/stanza_resources/ja/forward_charlm/conll17.pt', '--charlm_backward_file', '/root/stanza_resources/ja/backward_charlm/conll17.pt']
2024-03-28 14:27:04 INFO: Ex

### ID

In [None]:
!cd /content/gdrive/MyDrive/PLN/2/stanza && source scripts/config.sh && python3 -m stanza.utils.datasets.constituency.prepare_con_dataset id_icon

100% 8000/8000 [00:03<00:00, 2118.25it/s]
Writing 8000 trees to /content/gdrive/MyDrive/PLN/2/constituency/id_icon_train.mrg
Writing 1000 trees to /content/gdrive/MyDrive/PLN/2/constituency/id_icon_dev.mrg
Writing 1000 trees to /content/gdrive/MyDrive/PLN/2/constituency/id_icon_test.mrg


In [None]:
!cd /content/gdrive/MyDrive/PLN/2/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_constituency id_icon --epochs 20

2024-03-28 14:28:22 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/2/stanza/stanza/utils/training/run_constituency.py id_icon --epochs 20
2024-03-28 14:28:22 INFO: Default pretrain should be /root/stanza_resources/id/pretrain/conll17.pt  Attempting to download
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json: 379kB [00:00, 19.5MB/s]
2024-03-28 14:28:22 INFO: Downloaded file to /root/stanza_resources/resources.json
2024-03-28 14:28:22 INFO: Downloading these customized packages for language: id (Indonesian)...
| Processor | Package |
-----------------------
| pretrain  | conll17 |

Downloading https://huggingface.co/stanfordnlp/stanza-id/resolve/v1.8.0/models/pretrain/conll17.pt: 100% 107M/107M [00:01<00:00, 68.4MB/s]
2024-03-28 14:28:24 INFO: Downloaded file to /root/stanza_res

In [None]:
!cd /content/gdrive/MyDrive/PLN/2/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_constituency id_icon --score_dev

2024-03-28 15:25:40 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/2/stanza/stanza/utils/training/run_constituency.py id_icon --score_dev
2024-03-28 15:25:40 INFO: Default pretrain should be /root/stanza_resources/id/pretrain/conll17.pt  Attempting to download
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json: 379kB [00:00, 13.5MB/s]        
2024-03-28 15:25:40 INFO: Downloaded file to /root/stanza_resources/resources.json
2024-03-28 15:25:40 INFO: Downloading these customized packages for language: id (Indonesian)...
| Processor | Package |
-----------------------
| pretrain  | conll17 |

Downloading https://huggingface.co/stanfordnlp/stanza-id/resolve/v1.8.0/models/pretrain/conll17.pt: 100% 107M/107M [00:02<00:00, 46.9MB/s]
2024-03-28 15:25:43 INFO: Downloaded file to /root/stanza_resources/id/pretrain/conll17.pt
2024-03-28 15:25:43 INFO: Finished downloading models and saved to /root/stanza_resources
2024-03-28 15:2

In [None]:
!cd /content/gdrive/MyDrive/PLN/2/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_constituency id_icon --score_test

2024-03-28 15:29:59 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/2/stanza/stanza/utils/training/run_constituency.py id_icon --score_test
2024-03-28 15:29:59 INFO: Using default pretrain for language, found in /root/stanza_resources/id/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-03-28 15:29:59 INFO: Using model /root/stanza_resources/id/forward_charlm/oscar2023.pt for forward charlm
2024-03-28 15:29:59 INFO: Using model /root/stanza_resources/id/backward_charlm/oscar2023.pt for backward charlm
2024-03-28 15:29:59 INFO: Running test step with args: ['--eval_file', '/content/gdrive/MyDrive/PLN/2/constituency/id_icon_test.mrg', '--shorthand', 'id_icon', '--mode', 'predict', '--retag_method', 'upos', '--wordvec_pretrain_file', '/root/stanza_resources/id/pretrain/conll17.pt', '--charlm_forward_file', '/root/stanza_resources/id/forward_charlm/oscar2023.pt', '--charlm_backward_file', '/root/stanza_resources/id/backward_charlm/oscar2

# EXERCISE 3. TRAINING AND EVALUATION OF DEPENDENCY PARSERS

Essentially the same as Exercise 2, but this time with dependency-based parsers. A couple of notes:

*	Students are advised to restrict themselves to the annotation formalism based on Universal Dependencies (UD). The best reference is the website http://universaldependencies.org, which is dedicated to creating and publishing treebanks and other tools and resources that use this formalism.

*	For dependency parsing with UD, the evaluation tool to use would be one of the evaluation scripts from the CoNLL Shared Task (2018 or 2017).
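To make the evaluation metrics concrete, here is a simplified sketch of how unlabeled and labeled attachment scores (UAS and LAS) can be computed from two CoNLL-U files. It is not the CoNLL Shared Task script: it assumes gold and predicted files share exactly the same tokenization and it compares dependency labels verbatim, whereas the official scripts also handle tokenization mismatches. The helper names (`read_arcs`, `attachment_scores`, `make_conllu`) and the toy sentence are hypothetical.

```python
def read_arcs(conllu: str) -> list[tuple[str, str]]:
    """Return (HEAD, DEPREL) for every syntactic word in CoNLL-U text."""
    arcs = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        arcs.append((cols[6], cols[7]))
    return arcs

def attachment_scores(gold: str, pred: str) -> tuple[float, float]:
    """Compute (UAS, LAS) in percent, assuming identical tokenization."""
    gold_arcs, pred_arcs = read_arcs(gold), read_arcs(pred)
    if len(gold_arcs) != len(pred_arcs):
        raise ValueError("gold and predicted files differ in word count")
    total = len(gold_arcs)
    uas = sum(g[0] == p[0] for g, p in zip(gold_arcs, pred_arcs))
    las = sum(g == p for g, p in zip(gold_arcs, pred_arcs))
    return 100 * uas / total, 100 * las / total

def make_conllu(rows):
    """Build minimal CoNLL-U lines from (ID, FORM, HEAD, DEPREL) tuples."""
    return "\n".join(
        "\t".join([i, form, "_", "_", "_", "_", head, rel, "_", "_"])
        for i, form, head, rel in rows
    )

gold = make_conllu([("1", "She", "2", "nsubj"), ("2", "reads", "0", "root"),
                    ("3", "old", "4", "amod"), ("4", "books", "2", "obj")])
pred = make_conllu([("1", "She", "2", "nsubj"), ("2", "reads", "0", "root"),
                    ("3", "old", "2", "amod"), ("4", "books", "2", "nmod")])
uas, las = attachment_scores(gold, pred)
print(uas, las)  # 75.0 50.0
```

In the toy example, one word has the wrong head and another has the right head but the wrong label, so three of four heads are correct (UAS) and two of four head-label pairs are correct (LAS).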


DELIVERABLES:

For each parser used, the report will include a section analyzing its characteristics (the model it is based on, etc.), the URL of the website where it was obtained, whether the input text had to be preprocessed or the output postprocessed and how, etc. Similarly, sections describing the characteristics of the treebanks used, whether they had to be adapted and how, etc., must be included.
Finally, for each language, include comparative table(s) and/or chart(s) of the results obtained with each parser, a brief analysis of those results, the specifications of the machine used, and the training times required.

## (a) ENGLISH + ROMANCE LANGUAGE (ELECTIVE, 2 POINTS)

The task is essentially the same as for constituency-based parsing (Section 2.a), but this time with (1) two parsers instead of just one, and (2) dependency parsing instead of constituency parsing.

In [18]:
import pandas as pd
import numpy as np
import spacy
import random as rn
import re
from google.colab import drive

In [19]:
# Mount Google Drive in the project directory and unzip the data file
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### Stanza environment setup

In [3]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.8.1-py3-none-any.whl (970 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 970.4/970.4 kB 7.8 MB/s eta 0:00:00
Collecting emoji (from stanza)
  Downloading emoji-2.11.0-py2.py3-none-any.whl (433 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 433.8/433.8 kB 12.9 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.3.0->stanza)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.3.0->stanza)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.3.0->stanza)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.3.0->stanza)
  Using cached nvidia_cudnn_cu12-8.9.2.2

In [22]:
import stanza

The instructions at https://github.com/stanfordnlp/stanza-train/tree/master are followed to set up the environment for training models with Stanza.

The following setup steps only need to be run the first time the Stanza environment is prepared.

In [None]:
!cd /content/gdrive/MyDrive/PLN/3 && git clone https://github.com/stanfordnlp/stanza-train.git

Cloning into 'stanza-train'...
remote: Enumerating objects: 187, done.
remote: Counting objects: 100% (187/187), done.
remote: Compressing objects: 100% (101/101), done.
remote: Total 187 (delta 80), reused 171 (delta 66), pack-reused 0
Receiving objects: 100% (187/187), 35.37 KiB | 739.00 KiB/s, done.
Resolving deltas: 100% (80/80), done.


In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train && git clone https://github.com/stanfordnlp/stanza.git

Cloning into 'stanza'...
remote: Enumerating objects: 40307, done.
remote: Counting objects: 100% (2405/2405), done.
remote: Compressing objects: 100% (770/770), done.
remote: Total 40307 (delta 1827), reused 2125 (delta 1633), pack-reused 37902
Receiving objects: 100% (40307/40307), 83.29 MiB | 14.05 MiB/s, done.
Resolving deltas: 100% (30897/30897), done.
Updating files: 100% (519/519), done.


In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train && cp config/config.sh stanza/scripts/config.sh

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train && cp config/xpos_vocab_factory.py stanza/stanza/models/pos/xpos_vocab_factory.py

### Data preprocessing

#### SpaCy

##### EN

In [None]:
# Run spaCy's "convert" CLI command to transform the treebanks from .conllu to .spacy format

!python -m spacy convert "/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/en_ewt_train.conllu" "/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/" --converter conllu --n-sents 10 --merge-subtokens
!python -m spacy convert "/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/en_ewt_dev.conllu" "/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/" --converter conllu --n-sents 10 --merge-subtokens

ℹ Grouping every 10 sentences into a document.
✔ Generated output file (1255 documents):
/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/en_ewt_train.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (201 documents):
/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/en_ewt_dev.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (208 documents):
/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/en_ewt_test.spacy
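The document counts in the output follow from the `--n-sents 10` option. The sketch below shows the relationship, assuming that sentences in a CoNLL-U file are separated by blank lines and that the final, shorter group of sentences is also written as a document. Both helper names and the toy input are illustrative.

```python
import math

def count_sentences(conllu: str) -> int:
    """Count sentences in CoNLL-U text (blocks separated by blank lines)."""
    blocks = [block for block in conllu.split("\n\n") if block.strip()]
    return len(blocks)

def expected_documents(n_sentences: int, n_sents: int = 10) -> int:
    """Number of documents when every n_sents sentences form one document."""
    return math.ceil(n_sentences / n_sents)

sample = "# sent_id = 1\n1\tHi\t_\n\n# sent_id = 2\n1\tBye\t_\n\n"
print(count_sentences(sample))  # 2
print(expected_documents(25))   # 3
```

Under these assumptions, the 1255 documents reported for `en_ewt_train.spacy` correspond to between 12541 and 12550 training sentences.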


##### IT

In [None]:
# Run spaCy's "convert" CLI command to transform the treebanks from .conllu to .spacy format

!python -m spacy convert "/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/it_isdt_train.conllu" "/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/" --converter conllu --n-sents 10 --merge-subtokens
!python -m spacy convert "/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/it_isdt_dev.conllu" "/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/" --converter conllu --n-sents 10 --merge-subtokens

ℹ Grouping every 10 sentences into a document.
✔ Generated output file (1313 documents):
/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/it_isdt_train.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (57 documents):
/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/it_isdt_dev.spacy


### Training

#### SpaCy

##### EN

In [None]:
# Generate the final configuration file
!python -m spacy init fill-config "/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/en_base_config.cfg" "/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/en_config.cfg"

✔ Auto-filled config with all values
✔ Saved config
/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/en_config.cfg
You can now add your data and train your pipeline:
python -m spacy train en_config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
!python -m spacy train "/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/en_config.cfg" --output "/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/Trained" --paths.train "/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/en_ewt_train.spacy" --paths.dev  "/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/en_ewt_dev.spacy"

ℹ Saving to output directory:
/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/Trained
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'tagger', 'morphologizer',
'trainable_lemmatizer', 'parser']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS MORPH...  LOSS TRAIN...  LOSS PARSER  TAG_ACC  POS_ACC  MORPH_ACC  LEMMA_ACC  DEP_UAS  DEP_LAS  SENTS_F  SCORE 
---  ------  ------------  -----------  -------------  -------------  -----------  -------  -------  ---------  ---------  -------  -------  -------  ------
  0       0          0.00       133.66         134.94         149.73       303.27    21.53    24.51      24.75      77.09    18.93     4.52     0.61    0.34
  0     200       4107.87     16020.21       17236.33        9966.39     31016.08    80.92    83.46      81.87      86.80    64.78    52.55    56.45    0.78
  0     400       6146.95      6733.73        7989.81        4329.39  
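For the comparative tables requested in the deliverables, the rows of this training log can be parsed into named columns. The sketch below is one possible approach, assuming every complete row holds 15 whitespace-separated numbers in the order of the header. The shortened column names `LOSS MORPH` and `LOSS TRAIN` stand in for the truncated headers, and `parse_row` is a hypothetical helper.

```python
COLUMNS = ["E", "#", "LOSS TOK2VEC", "LOSS TAGGER", "LOSS MORPH", "LOSS TRAIN",
           "LOSS PARSER", "TAG_ACC", "POS_ACC", "MORPH_ACC", "LEMMA_ACC",
           "DEP_UAS", "DEP_LAS", "SENTS_F", "SCORE"]

def parse_row(line: str) -> dict[str, float]:
    """Map one whitespace-separated log row onto the column names."""
    values = [float(v) for v in line.split()]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} values, got {len(values)}")
    return dict(zip(COLUMNS, values))

# The two complete rows from the English training log above.
rows = [
    "  0       0          0.00       133.66         134.94         149.73       303.27    21.53    24.51      24.75      77.09    18.93     4.52     0.61    0.34",
    "  0     200       4107.87     16020.21       17236.33        9966.39     31016.08    80.92    83.46      81.87      86.80    64.78    52.55    56.45    0.78",
]
for row in map(parse_row, rows):
    print(int(row["#"]), row["DEP_UAS"], row["DEP_LAS"])
```

Truncated rows, such as the last line of the output above, raise a `ValueError` instead of being silently misaligned.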

##### IT

In [None]:
# Generate the final configuration file
!python -m spacy init fill-config "/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/it_base_config.cfg" "/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/it_config.cfg"

✔ Auto-filled config with all values
✔ Saved config
/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/it_config.cfg
You can now add your data and train your pipeline:
python -m spacy train it_config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
!python -m spacy train "/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/it_config.cfg" --output "/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/Trained" --paths.train "/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/it_isdt_train.spacy" --paths.dev  "/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/it_isdt_dev.spacy"

ℹ Saving to output directory:
/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/Trained
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'tagger', 'morphologizer',
'trainable_lemmatizer', 'parser']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS MORPH...  LOSS TRAIN...  LOSS PARSER  TAG_ACC  POS_ACC  MORPH_ACC  LEMMA_ACC  DEP_UAS  DEP_LAS  SENTS_F  SCORE 
---  ------  ------------  -----------  -------------  -------------  -----------  -------  -------  ---------  ---------  -------  -------  -------  ------
  0       0          0.00       208.86         213.21         236.67       482.78    31.71    32.05      24.54      55.33    28.99    10.81     1.02    0.34
  0     200       4581.50     14062.22       18709.24       16303.58     31114.47    87.81    88.49      82.89      85.75    71.21    60.11    76.34    0.81
  0     400       6786.83      5875.72        8619.94        7657.37  

#### Stanza

##### EN

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.datasets.prepare_tokenizer_treebank UD_English-EWT

2024-03-13 19:06:07 INFO: Datasets program called with:
/content/gdrive/MyDrive/PLN/3/a/Stanza/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py UD_English-EWT
Preparing data for UD_English-EWT: en_ewt, en
Reading from ../data/udbase/UD_English-EWT/en_ewt-ud-train.conllu and writing to ../data/processed/tokenize/en_ewt.train.gold.conllu
Augmented 66 quotes: Counter({'″″': 10, '““': 10, '《》': 8, '«»': 7, '「」': 6, '„”': 6, '""': 6, '„“': 5, '””': 5, '»«': 3})
Swapped 'w1, w2' for 'w1 ,w2' 74 times
Added 109 new sentences with asdf, zzzz -> asdf,zzzz
Reading from ../data/udbase/UD_English-EWT/en_ewt-ud-dev.conllu and writing to ../data/processed/tokenize/en_ewt.dev.gold.conllu
Reading from ../data/udbase/UD_English-EWT/en_ewt-ud-test.conllu and writing to ../data/processed/tokenize/en_ewt.test.gold.conllu
Tokenizer labels written to ../data/processed/tokenize/en_ewt-ud-train.toklabels
  579 unique MWTs found in data.  MWTs written to ../data/processed/tokenize/en_ewt

In [36]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_tokenizer UD_English-EWT --step 1000

2024-04-09 17:46:53 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/training/run_tokenizer.py UD_English-EWT --step 1000
2024-04-09 17:46:53 DEBUG: UD_English-EWT: en_ewt
2024-04-09 17:46:53 INFO: Save file for en_ewt model: en_ewt_tokenizer.pt
2024-04-09 17:46:53 INFO: UD_English-EWT: saved_models/tokenize/en_ewt_tokenizer.pt does not exist, training new model
2024-04-09 17:46:54 INFO: Running train step with args: ['--label_file', '../data/processed/tokenize/en_ewt-ud-train.toklabels', '--txt_file', '../data/processed/tokenize/en_ewt.train.txt', '--lang', 'en', '--max_seqlen', '300', '--mwt_json_file', '../data/processed/tokenize/en_ewt-ud-dev-mwt.json', '--dev_txt_file', '../data/processed/tokenize/en_ewt.dev.txt', '--dev_label_file', '../data/processed/tokenize/en_ewt-ud-dev.toklabels', '--dev_conll_gold', '../data/processed/tokenize/en_ewt.dev.gold.conllu', '--conll_file', '/tmp/tmppwmyxuhe', '--shorthand', 'en_ewt', '--step', '10

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.datasets.prepare_pos_treebank UD_English-EWT

2024-03-14 18:02:12 INFO: Datasets program called with:
/content/gdrive/MyDrive/PLN/3/a/Stanza/stanza-train/stanza/stanza/utils/datasets/prepare_pos_treebank.py UD_English-EWT
Preparing data for UD_English-EWT: en_ewt, en
Reading from ../data/udbase/UD_English-EWT/en_ewt-ud-train.conllu and writing to /tmp/tmpr95htl7m/en_ewt.train.gold.conllu
Augmented 66 quotes: Counter({'″″': 10, '““': 10, '《》': 8, '«»': 7, '「」': 6, '„”': 6, '""': 6, '„“': 5, '””': 5, '»«': 3})
Swapped 'w1, w2' for 'w1 ,w2' 74 times
Added 109 new sentences with asdf, zzzz -> asdf,zzzz
Reading from ../data/udbase/UD_English-EWT/en_ewt-ud-dev.conllu and writing to /tmp/tmpr95htl7m/en_ewt.dev.gold.conllu
Reading from ../data/udbase/UD_English-EWT/en_ewt-ud-test.conllu and writing to /tmp/tmpr95htl7m/en_ewt.test.gold.conllu
Copying from /tmp/tmpr95htl7m/en_ewt.train.gold.conllu to ../data/processed/pos/en_ewt.train.in.conllu
Copying from /tmp/tmpr95htl7m/en_ewt.dev.gold.conllu to ../data/processed/pos/en_ewt.dev.in.conll

In [11]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_pos UD_English-EWT --max_steps 500

2024-04-09 16:58:31 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/training/run_pos.py UD_English-EWT --max_steps 500
2024-04-09 16:58:31 DEBUG: UD_English-EWT: en_ewt
2024-04-09 16:58:31 INFO: Using model /root/stanza_resources/en/forward_charlm/1billion.pt for forward charlm
2024-04-09 16:58:31 INFO: Using model /root/stanza_resources/en/backward_charlm/1billion.pt for backward charlm
2024-04-09 16:58:31 INFO: Using default pretrain for language, found in /root/stanza_resources/en/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-04-09 16:58:31 INFO: UD_English-EWT: saved_models/pos/en_ewt_charlm_tagger.pt does not exist, training new model
2024-04-09 16:58:31 INFO: Using model /root/stanza_resources/en/forward_charlm/1billion.pt for forward charlm
2024-04-09 16:58:31 INFO: Using model /root/stanza_resources/en/backward_charlm/1billion.pt for backward charlm
2024-04-09 16:58:31 INFO: Using defaul

In [5]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.datasets.prepare_depparse_treebank UD_English-EWT

2024-04-09 16:50:01 INFO: Datasets program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/datasets/prepare_depparse_treebank.py UD_English-EWT
2024-04-09 16:50:01 INFO: Using tagger model in saved_models/pos/en_ewt_charlm_tagger.pt for en_ewt
2024-04-09 16:50:01 INFO: Using default pretrain for language, found in /root/stanza_resources/en/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-04-09 16:50:01 INFO: Using model /root/stanza_resources/en/forward_charlm/1billion.pt for forward charlm
2024-04-09 16:50:01 INFO: Using model /root/stanza_resources/en/backward_charlm/1billion.pt for backward charlm
Preparing data for UD_English-EWT: en_ewt, en
Reading from ../data/udbase/UD_English-EWT/en_ewt-ud-train.conllu and writing to /tmp/tmp9k9mv4uz/en_ewt.train.gold.conllu
Augmented 66 quotes: Counter({'″″': 10, '““': 10, '《》': 8, '«»': 7, '「」': 6, '„”': 6, '""': 6, '„“': 5, '””': 5, '»«': 3})
Swapped 'w1, w2' for 'w1 ,w2' 74 t

In [12]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_depparse UD_English-EWT --max_steps 1000

2024-04-09 17:08:01 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/training/run_depparse.py UD_English-EWT --max_steps 1000
2024-04-09 17:08:01 DEBUG: UD_English-EWT: en_ewt
2024-04-09 17:08:01 INFO: Using model /root/stanza_resources/en/forward_charlm/1billion.pt for forward charlm
2024-04-09 17:08:01 INFO: Using model /root/stanza_resources/en/backward_charlm/1billion.pt for backward charlm
2024-04-09 17:08:01 INFO: Using default pretrain for language, found in /root/stanza_resources/en/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-04-09 17:08:01 INFO: UD_English-EWT: saved_models/depparse/en_ewt_charlm_parser.pt does not exist, training new model
2024-04-09 17:08:01 INFO: Using model /root/stanza_resources/en/forward_charlm/1billion.pt for forward charlm
2024-04-09 17:08:01 INFO: Using model /root/stanza_resources/en/backward_charlm/1billion.pt for backward charlm
2024-04-09 17:08:01 INFO: U
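The English Stanza cells above run the same six module invocations that are then repeated for other treebanks. As a compact summary, the sketch below builds those command strings for a given treebank name, using only the modules and flags that appear in the cells. The function name `stanza_training_steps` is hypothetical, and the commands are only printed, not executed.

```python
def stanza_training_steps(treebank: str) -> list[str]:
    """Return the module invocations in the order the cells above list them."""
    return [
        f"python3 -m stanza.utils.datasets.prepare_tokenizer_treebank {treebank}",
        f"python3 -m stanza.utils.training.run_tokenizer {treebank} --step 1000",
        f"python3 -m stanza.utils.datasets.prepare_pos_treebank {treebank}",
        f"python3 -m stanza.utils.training.run_pos {treebank} --max_steps 500",
        f"python3 -m stanza.utils.datasets.prepare_depparse_treebank {treebank}",
        f"python3 -m stanza.utils.training.run_depparse {treebank} --max_steps 1000",
    ]

steps = stanza_training_steps("UD_English-EWT")
for step in steps:
    print(step)
```

The order appears to matter: the `prepare_depparse_treebank` log reports using the tagger model in `saved_models/pos/`, which suggests the tagger should already be trained when the dependency data is prepared.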

##### IT

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.datasets.prepare_tokenizer_treebank UD_Italian-ISDT

2024-03-16 11:57:40 INFO: Datasets program called with:
/content/gdrive/MyDrive/PLN/3/a/Stanza/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py UD_Italian-ISDT
Preparing data for UD_Italian-ISDT: it_isdt, it
Reading from ../data/udbase/UD_Italian-ISDT/it_isdt-ud-train.conllu and writing to ../data/processed/tokenize/it_isdt.train.gold.conllu
Augmented 140 quotes: Counter({'„”': 21, '″″': 18, '""': 15, '「」': 14, '»«': 13, '““': 13, '””': 12, '„“': 12, '《》': 12, '«»': 10})
Swapped 'w1, w2' for 'w1 ,w2' 125 times
Added 159 new sentences with asdf, zzzz -> asdf,zzzz
Reading from ../data/udbase/UD_Italian-ISDT/it_isdt-ud-dev.conllu and writing to ../data/processed/tokenize/it_isdt.dev.gold.conllu
Reading from ../data/udbase/UD_Italian-ISDT/it_isdt-ud-test.conllu and writing to ../data/processed/tokenize/it_isdt.test.gold.conllu
Tokenizer labels written to ../data/processed/tokenize/it_isdt-ud-train.toklabels
  807 unique MWTs found in data.  MWTs written to ../data/pr

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_tokenizer UD_Italian-ISDT --step 1000

2024-03-16 11:57:59 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/3/a/Stanza/stanza-train/stanza/stanza/utils/training/run_tokenizer.py UD_Italian-ISDT --step 1000
2024-03-16 11:57:59 DEBUG: UD_Italian-ISDT: it_isdt
2024-03-16 11:57:59 INFO: Save file for it_isdt model: it_isdt_tokenizer.pt
2024-03-16 11:57:59 INFO: UD_Italian-ISDT: saved_models/tokenize/it_isdt_tokenizer.pt does not exist, training new model
2024-03-16 11:57:59 INFO: Running train step with args: ['--label_file', '../data/processed/tokenize/it_isdt-ud-train.toklabels', '--txt_file', '../data/processed/tokenize/it_isdt.train.txt', '--lang', 'it', '--max_seqlen', '400', '--mwt_json_file', '../data/processed/tokenize/it_isdt-ud-dev-mwt.json', '--dev_txt_file', '../data/processed/tokenize/it_isdt.dev.txt', '--dev_label_file', '../data/processed/tokenize/it_isdt-ud-dev.toklabels', '--dev_conll_gold', '../data/processed/tokenize/it_isdt.dev.gold.conllu', '--conll_file', '/tmp/tmpur1pn5ge', '--shorthand', '

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.datasets.prepare_pos_treebank UD_Italian-ISDT

2024-03-16 11:58:53 INFO: Datasets program called with:
/content/gdrive/MyDrive/PLN/3/a/Stanza/stanza-train/stanza/stanza/utils/datasets/prepare_pos_treebank.py UD_Italian-ISDT
Preparing data for UD_Italian-ISDT: it_isdt, it
Reading from ../data/udbase/UD_Italian-ISDT/it_isdt-ud-train.conllu and writing to /tmp/tmps__y1fqn/it_isdt.train.gold.conllu
Augmented 140 quotes: Counter({'„”': 21, '″″': 18, '""': 15, '「」': 14, '»«': 13, '““': 13, '””': 12, '„“': 12, '《》': 12, '«»': 10})
Swapped 'w1, w2' for 'w1 ,w2' 125 times
Added 159 new sentences with asdf, zzzz -> asdf,zzzz
Reading from ../data/udbase/UD_Italian-ISDT/it_isdt-ud-dev.conllu and writing to /tmp/tmps__y1fqn/it_isdt.dev.gold.conllu
Reading from ../data/udbase/UD_Italian-ISDT/it_isdt-ud-test.conllu and writing to /tmp/tmps__y1fqn/it_isdt.test.gold.conllu
Copying from /tmp/tmps__y1fqn/it_isdt.train.gold.conllu to ../data/processed/pos/it_isdt.train.in.conllu
Copying from /tmp/tmps__y1fqn/it_isdt.dev.gold.conllu to ../data/processe

In [35]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_pos UD_Italian-ISDT --max_steps 500

2024-04-09 17:30:31 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/training/run_pos.py UD_Italian-ISDT --max_steps 500
2024-04-09 17:30:31 DEBUG: UD_Italian-ISDT: it_isdt
2024-04-09 17:30:31 INFO: Using model /root/stanza_resources/it/forward_charlm/conll17.pt for forward charlm
2024-04-09 17:30:31 INFO: Using model /root/stanza_resources/it/backward_charlm/conll17.pt for backward charlm
2024-04-09 17:30:31 INFO: Using default pretrain for language, found in /root/stanza_resources/it/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-04-09 17:30:31 INFO: UD_Italian-ISDT: saved_models/pos/it_isdt_charlm_tagger.pt does not exist, training new model
2024-04-09 17:30:31 INFO: Using model /root/stanza_resources/it/forward_charlm/conll17.pt for forward charlm
2024-04-09 17:30:31 INFO: Using model /root/stanza_resources/it/backward_charlm/conll17.pt for backward charlm
2024-04-09 17:30:31 INFO: Using defau

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.datasets.prepare_depparse_treebank UD_Italian-ISDT

2024-03-16 12:14:36 INFO: Datasets program called with:
/content/gdrive/MyDrive/PLN/3/a/Stanza/stanza-train/stanza/stanza/utils/datasets/prepare_depparse_treebank.py UD_Italian-ISDT
2024-03-16 12:14:36 INFO: Using tagger model in saved_models/pos/it_isdt_charlm_tagger.pt for it_isdt
2024-03-16 12:14:36 INFO: Using default pretrain for language, found in /root/stanza_resources/it/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-03-16 12:14:36 INFO: Using model /root/stanza_resources/it/forward_charlm/conll17.pt for forward charlm
2024-03-16 12:14:36 INFO: Using model /root/stanza_resources/it/backward_charlm/conll17.pt for backward charlm
Preparing data for UD_Italian-ISDT: it_isdt, it
Reading from ../data/udbase/UD_Italian-ISDT/it_isdt-ud-train.conllu and writing to /tmp/tmpoziz9yga/it_isdt.train.gold.conllu
Augmented 140 quotes: Counter({'„”': 21, '″″': 18, '""': 15, '「」': 14, '»«': 13, '““': 13, '””': 12, '„“': 12, '《》': 12, '«»': 10})
Swapped 'w

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_depparse UD_Italian-ISDT --max_steps 1000

2024-03-16 12:16:38 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/3/a/Stanza/stanza-train/stanza/stanza/utils/training/run_depparse.py UD_Italian-ISDT --max_steps 1000
2024-03-16 12:16:38 DEBUG: UD_Italian-ISDT: it_isdt
2024-03-16 12:16:38 INFO: Using model /root/stanza_resources/it/forward_charlm/conll17.pt for forward charlm
2024-03-16 12:16:38 INFO: Using model /root/stanza_resources/it/backward_charlm/conll17.pt for backward charlm
2024-03-16 12:16:38 INFO: Using default pretrain for language, found in /root/stanza_resources/it/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-03-16 12:16:38 INFO: UD_Italian-ISDT: saved_models/depparse/it_isdt_charlm_parser.pt does not exist, training new model
2024-03-16 12:16:38 INFO: Using model /root/stanza_resources/it/forward_charlm/conll17.pt for forward charlm
2024-03-16 12:16:38 INFO: Using model /root/stanza_resources/it/backward_charlm/conll17.pt for backward charlm
2024-03-16 12:16:
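The six Stanza calls run above for each treebank (prepare and train the tokenizer, then the POS tagger, then the dependency parser) all follow one pattern. The sketch below only builds those command strings so the sequence can be reviewed or reused; it does not run them. `stanza_training_commands` and its parameter names are hypothetical helpers, and the default step counts simply mirror the values used in the Italian cells.

```python
# Sketch: rebuild the sequence of Stanza CLI calls used in the cells above.
# The helper name and its defaults are illustrative, not part of Stanza.
STANZA_DIR = "/content/gdrive/MyDrive/PLN/3/stanza-train/stanza"


def stanza_training_commands(treebank, tokenizer_steps=1000, pos_steps=500,
                             depparse_steps=1000, stanza_dir=STANZA_DIR):
    """Return the six shell commands (as strings) for one UD treebank."""
    prefix = f"cd {stanza_dir} && source scripts/config.sh && python3 -m "
    # (module, extra arguments) in the order the notebook runs them
    steps = [
        ("stanza.utils.datasets.prepare_tokenizer_treebank", ""),
        ("stanza.utils.training.run_tokenizer", f" --step {tokenizer_steps}"),
        ("stanza.utils.datasets.prepare_pos_treebank", ""),
        ("stanza.utils.training.run_pos", f" --max_steps {pos_steps}"),
        ("stanza.utils.datasets.prepare_depparse_treebank", ""),
        ("stanza.utils.training.run_depparse", f" --max_steps {depparse_steps}"),
    ]
    return [f"{prefix}{module} {treebank}{extra}" for module, extra in steps]


for cmd in stanza_training_commands("UD_Italian-ISDT"):
    print(cmd)
```

In a notebook each string would be run with a leading `!`, exactly as in the cells above. The order matters because `prepare_depparse_treebank` looks for the tagger saved by `run_pos`, as its log shows.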

### Test

In [27]:
# Helper that prints the evaluation report. conll18_ud_eval only implements this report inside main(),
# so it is wrapped here to make it callable as a function from outside the script.
def print_eval(verbose:bool, counts:bool, evaluation):
  if not verbose and not counts:
    print("LAS F1 Score: {:.2f}".format(100 * evaluation["LAS"].f1))
    print("MLAS Score: {:.2f}".format(100 * evaluation["MLAS"].f1))
    print("BLEX Score: {:.2f}".format(100 * evaluation["BLEX"].f1))
  else:
    if counts:
      print("Metric     | Correct   |      Gold | Predicted | Aligned")
    else:
      print("Metric     | Precision |    Recall |  F1 Score | AligndAcc")
    print("-----------+-----------+-----------+-----------+-----------")
    for metric in ["Tokens", "Sentences", "Words", "UPOS", "XPOS", "UFeats", "AllTags", "Lemmas", "UAS", "LAS", "CLAS", "MLAS", "BLEX"]:
      if counts:
        print("{:11}|{:10} |{:10} |{:10} |{:10}".format(
        metric,
        evaluation[metric].correct,
        evaluation[metric].gold_total,
        evaluation[metric].system_total,
        evaluation[metric].aligned_total or (evaluation[metric].correct if metric == "Words" else "")
        ))
      else:
        print("{:11}|{:10.2f} |{:10.2f} |{:10.2f} |{}".format(
        metric,
        100 * evaluation[metric].precision,
        100 * evaluation[metric].recall,
        100 * evaluation[metric].f1,
        "{:10.2f}".format(100 * evaluation[metric].aligned_accuracy) if evaluation[metric].aligned_accuracy is not None else ""
        ))

In [29]:
import sys
sys.path.append('/content/gdrive/MyDrive/PLN/3')
import conll18_ud_eval as conll

#### SpaCy

In [None]:
!pip install spacy_conll

Collecting spacy_conll
  Downloading spacy_conll-3.4.0-py3-none-any.whl (21 kB)
Installing collected packages: spacy_conll
Successfully installed spacy_conll-3.4.0


In [None]:
import spacy_conll

##### EN

In [None]:
nlp = spacy.load("/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/Trained/model-best")
config = {"include_headers": True}
nlp.add_pipe("conll_formatter", config=config, last=True)

ConllFormatter(conversion_maps=None, ext_names={'conll_str': 'conll_str', 'conll': 'conll', 'conll_pd': 'conll_pd'}, field_names={'ID': 'ID', 'FORM': 'FORM', 'LEMMA': 'LEMMA', 'UPOS': 'UPOS', 'XPOS': 'XPOS', 'FEATS': 'FEATS', 'HEAD': 'HEAD', 'DEPREL': 'DEPREL', 'DEPS': 'DEPS', 'MISC': 'MISC'}, include_headers=True, disable_pandas=False)

In [None]:
input_file = "/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/en_ewt_test.txt"
output_file = "/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/output.conllu"
gold_standard = "/content/gdrive/MyDrive/PLN/3/a/Spacy/EN/en_ewt_gold_test.conllu"

In [None]:
with open(input_file, 'r') as file:
  # Join all lines into one string and collapse runs of whitespace into a single space
  t_input = re.sub(r"\s\s+", " ", " ".join(file.read().splitlines()))
doc = nlp(t_input)
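As an illustration of what this normalization does to the raw test text before it reaches the pipeline, here is a minimal, self-contained sketch. `normalize_whitespace` is a hypothetical name that wraps the same expression used in the cell above, and the sample string is invented.

```python
import re


def normalize_whitespace(raw_text):
    """Join all lines with spaces, then collapse runs of 2+ whitespace characters."""
    return re.sub(r"\s\s+", " ", " ".join(raw_text.splitlines()))


# Invented sample: a blank line and a run of spaces inside a line
sample = "First line.\n\nSecond   line.\nThird line."
print(normalize_whitespace(sample))
# prints: First line. Second line. Third line.
```

The original line breaks are discarded, so the pipeline receives one long string and has to recover every sentence boundary by itself.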

In [None]:
# Save the processed text to the output file.
with open(output_file, 'w') as of:
  of.write(doc._.conll_str)
  of.write('\n')

In [None]:
gold = conll.load_conllu_file(gold_standard)
test = conll.load_conllu_file(output_file)
metrics = conll.evaluate(gold, test)

In [None]:
print_eval(True, True, metrics)
print('\n')
print_eval(True, False, metrics)

Metric     | Correct   |      Gold | Predicted | Aligned
-----------+-----------+-----------+-----------+-----------
Tokens     |     24031 |     24740 |     25511 |          
Sentences  |      1448 |      2077 |      2024 |          
Words      |     24647 |     25094 |     25511 |     24647
UPOS       |     23016 |     25094 |     25511 |     24647
XPOS       |     22509 |     25094 |     25511 |     24647
UFeats     |     23122 |     25094 |     25511 |     24647
AllTags    |     22077 |     25094 |     25511 |     24647
Lemmas     |     23376 |     25094 |     25511 |     24647
UAS        |     20405 |     25094 |     25511 |     24647
LAS        |     17799 |     25094 |     25511 |     24647
CLAS       |      9536 |     15177 |     13330 |     14883
MLAS       |      8598 |     15177 |     13330 |     14883
BLEX       |      9126 |     15177 |     13330 |     14883


Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-------

##### IT

In [None]:
nlp = spacy.load("/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/Trained/model-best")
config = {"include_headers": True}
nlp.add_pipe("conll_formatter", config=config, last=True)

ConllFormatter(conversion_maps=None, ext_names={'conll_str': 'conll_str', 'conll': 'conll', 'conll_pd': 'conll_pd'}, field_names={'ID': 'ID', 'FORM': 'FORM', 'LEMMA': 'LEMMA', 'UPOS': 'UPOS', 'XPOS': 'XPOS', 'FEATS': 'FEATS', 'HEAD': 'HEAD', 'DEPREL': 'DEPREL', 'DEPS': 'DEPS', 'MISC': 'MISC'}, include_headers=True, disable_pandas=False)

In [None]:
input_file = "/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/it_isdt_test.txt"
output_file = "/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/output.conllu"
gold_standard = "/content/gdrive/MyDrive/PLN/3/a/Spacy/IT/it_isdt_gold_test.conllu"

In [None]:
with open(input_file, 'r') as file:
  # Join all lines into one string and collapse runs of whitespace into a single space
  t_input = re.sub(r"\s\s+", " ", " ".join(file.read().splitlines()))
doc = nlp(t_input)

In [None]:
# Save the processed text to the output file.
with open(output_file, 'w') as of:
  of.write(doc._.conll_str)
  of.write('\n')

In [None]:
gold = conll.load_conllu_file(gold_standard)
test = conll.load_conllu_file(output_file)
metrics = conll.evaluate(gold, test)

In [None]:
print_eval(True, True, metrics)
print('\n')
print_eval(True, False, metrics)

Metric     | Correct   |      Gold | Predicted | Aligned
-----------+-----------+-----------+-----------+-----------
Tokens     |      9665 |      9680 |      9674 |          
Sentences  |       478 |       482 |       486 |          
Words      |      8929 |     10417 |      9674 |      8929
UPOS       |      8654 |     10417 |      9674 |      8929
XPOS       |      8627 |     10417 |      9674 |      8929
UFeats     |      8649 |     10417 |      9674 |      8929
AllTags    |      8521 |     10417 |      9674 |      8929
Lemmas     |      8657 |     10417 |      9674 |      8929
UAS        |      7756 |     10417 |      9674 |      8929
LAS        |      7062 |     10417 |      9674 |      8929
CLAS       |      3579 |      5133 |      4568 |      5021
MLAS       |      2874 |      5133 |      4568 |      5021
BLEX       |      3452 |      5133 |      4568 |      5021


Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-------

#### Stanza

In [30]:
from stanza.utils.conll import CoNLL

##### EN

In [20]:
input_file = "/content/gdrive/MyDrive/PLN/3/stanza-train/data/udbase/UD_English-EWT/en_ewt-ud-test.txt"
output_file = "/content/gdrive/MyDrive/PLN/3/a/Stanza/EN_output.conllu"
gold_standard = "/content/gdrive/MyDrive/PLN/3/stanza-train/data/processed/depparse/en_ewt.test.gold.conllu"

In [23]:
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse', use_gpu=True,
                      tokenize_model_path = '/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/saved_models/tokenize/en_ewt_tokenizer.pt',
                      pos_model_path = '/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/saved_models/pos/en_ewt_charlm_tagger.pt',
                      depparse_model_path = '/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/saved_models/depparse/en_ewt_charlm_parser.pt')

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.8.0/models/tokenize/combined.pt:   0%|    …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.8.0/models/mwt/combined.pt:   0%|         …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.8.0/models/pos/combined_charlm.pt:   0%|  …

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.8.0/models/lemma/combined_nocharlm.pt:   0…

Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.8.0/models/depparse/combined_charlm.pt:   …

INFO:stanza:Loading these models for language: en (English):
| Processor | Package                 |
---------------------------------------
| tokenize  | /content/g...kenizer.pt |
| mwt       | combined                |
| pos       | /content/g..._tagger.pt |
| lemma     | combined_nocharlm       |
| depparse  | /content/g..._parser.pt |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Done loading processors!


In [24]:
with open(input_file, 'r') as file:
  # Join all lines into one string and collapse runs of whitespace into a single space
  t_input = re.sub(r"\s\s+", " ", " ".join(file.read().splitlines()))
doc = nlp(t_input)

In [31]:
# Write the processed document to the output file in CoNLL-U format.
CoNLL.write_doc2conll(doc, output_file)

The output file is edited by hand because Stanza emits <UNK\> labels; only two occurrences were found, so they were corrected manually.
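A small script can locate such placeholders before the file is edited. The sketch below assumes the standard ten tab-separated CoNLL-U columns. `find_unk` is a hypothetical helper and the sample rows are invented; the notebook does not say which column held the placeholder.

```python
# Sketch: report where a placeholder such as "<UNK>" appears in CoNLL-U lines.
CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def find_unk(lines, marker="<UNK>"):
    """Return (line_number, column_name, value) for every field containing marker."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines
        if not line or line.startswith("#"):
            continue
        for name, value in zip(CONLLU_FIELDS, line.split("\t")):
            if marker in value:
                hits.append((lineno, name, value))
    return hits


# Invented sample rows; in practice pass the lines of the output .conllu file
sample = [
    "# text = Hello world",
    "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\t_",
    "2\tworld\tworld\tNOUN\t<UNK>\t_\t1\tvocative\t_\t_",
    "",
]
print(find_unk(sample))
# prints: [(3, 'XPOS', '<UNK>')]
```

To check a real file, something like `find_unk(open(output_file, encoding="utf-8"))` would list each hit with its line number, so the two occurrences can be found quickly.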

In [33]:
gold = conll.load_conllu_file(gold_standard)
test = conll.load_conllu_file(output_file)
metrics = conll.evaluate(gold, test)

In [34]:
print_eval(True, True, metrics)
print('\n')
print_eval(True, False, metrics)

Metric     | Correct   |      Gold | Predicted | Aligned
-----------+-----------+-----------+-----------+-----------
Tokens     |     24538 |     24740 |     24725 |          
Sentences  |      1357 |      2077 |      1676 |          
Words      |     24832 |     25094 |     25067 |     24832
UPOS       |     24694 |     25094 |     25067 |     24832
XPOS       |     24677 |     25094 |     25067 |     24832
UFeats     |     24723 |     25094 |     25067 |     24832
AllTags    |     24630 |     25094 |     25067 |     24832
Lemmas     |     24154 |     25094 |     25067 |     24832
UAS        |     21526 |     25094 |     25067 |     24832
LAS        |     20949 |     25094 |     25067 |     24832
CLAS       |     12010 |     15177 |     15107 |     15004
MLAS       |     11853 |     15177 |     15107 |     15004
BLEX       |     11594 |     15177 |     15107 |     15004


Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-------

##### IT

In [None]:
input_file = "/content/gdrive/MyDrive/PLN/3/stanza-train/data/udbase/UD_Italian-ISDT/it_isdt-ud-test.txt"
output_file = "/content/gdrive/MyDrive/PLN/3/a/Stanza/IT_output.conllu"
gold_standard = "/content/gdrive/MyDrive/PLN/3/stanza-train/data/processed/depparse/it_isdt.test.gold.conllu"

In [None]:
nlp = stanza.Pipeline('it', processors='tokenize,pos,lemma,depparse', use_gpu=True,
                      tokenize_model_path = '/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/saved_models/tokenize/it_isdt_tokenizer.pt',
                      pos_model_path = '/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/saved_models/pos/it_isdt_charlm_tagger.pt',
                      depparse_model_path = '/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/saved_models/depparse/it_isdt_charlm_parser.pt')

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json


Downloading https://huggingface.co/stanfordnlp/stanza-it/resolve/v1.8.0/models/tokenize/combined.pt:   0%|    …

Downloading https://huggingface.co/stanfordnlp/stanza-it/resolve/v1.8.0/models/mwt/combined.pt:   0%|         …

Downloading https://huggingface.co/stanfordnlp/stanza-it/resolve/v1.8.0/models/pos/combined_charlm.pt:   0%|  …

Downloading https://huggingface.co/stanfordnlp/stanza-it/resolve/v1.8.0/models/lemma/combined_nocharlm.pt:   0…

Downloading https://huggingface.co/stanfordnlp/stanza-it/resolve/v1.8.0/models/depparse/combined_charlm.pt:   …

INFO:stanza:Loading these models for language: it (Italian):
| Processor | Package                 |
---------------------------------------
| tokenize  | /content/g...kenizer.pt |
| mwt       | combined                |
| pos       | /content/g..._tagger.pt |
| lemma     | combined_nocharlm       |
| depparse  | /content/g..._parser.pt |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Done loading processors!


In [None]:
with open(input_file, 'r') as file:
  # Join all lines into one string and collapse runs of whitespace into a single space
  t_input = re.sub(r"\s\s+", " ", " ".join(file.read().splitlines()))
doc = nlp(t_input)

In [None]:
# Write the processed document to the output file in CoNLL-U format.
CoNLL.write_doc2conll(doc, output_file)

In [None]:
gold = conll.load_conllu_file(gold_standard)
test = conll.load_conllu_file(output_file)
metrics = conll.evaluate(gold, test)

In [None]:
print_eval(True, True, metrics)
print('\n')
print_eval(True, False, metrics)

Metric     | Correct   |      Gold | Predicted | Aligned
-----------+-----------+-----------+-----------+-----------
Tokens     |      9659 |      9680 |      9674 |          
Sentences  |       477 |       482 |       484 |          
Words      |     10343 |     10417 |     10399 |     10343
UPOS       |     10341 |     10417 |     10399 |     10343
XPOS       |     10341 |     10417 |     10399 |     10343
UFeats     |     10339 |     10417 |     10399 |     10343
AllTags    |     10339 |     10417 |     10399 |     10343
Lemmas     |     10163 |     10417 |     10399 |     10343
UAS        |      9377 |     10417 |     10399 |     10343
LAS        |      9108 |     10417 |     10399 |     10343
CLAS       |      4210 |      5133 |      5107 |      5078
MLAS       |      4165 |      5133 |      5107 |      5078
BLEX       |      4097 |      5133 |      5107 |      5078


Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-------
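The percentage tables above are cut off, but the count tables contain everything needed to recompute them. The sketch below applies the usual definitions (precision = correct / predicted, recall = correct / gold, F1 = 2 · correct / (predicted + gold)) to the LAS rows of the four count tables in this section. The `Counts` class and the `LAS_COUNTS` dictionary are illustrative names introduced here; only the numbers come from the outputs above.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Correct / Gold / Predicted columns of one row of the count table."""
    correct: int
    gold: int
    predicted: int

    @property
    def precision(self):
        return self.correct / self.predicted

    @property
    def recall(self):
        return self.correct / self.gold

    @property
    def f1(self):
        return 2 * self.correct / (self.predicted + self.gold)


# LAS rows copied from the four count tables printed in this section
LAS_COUNTS = {
    ("SpaCy", "EN"): Counts(correct=17799, gold=25094, predicted=25511),
    ("SpaCy", "IT"): Counts(correct=7062, gold=10417, predicted=9674),
    ("Stanza", "EN"): Counts(correct=20949, gold=25094, predicted=25067),
    ("Stanza", "IT"): Counts(correct=9108, gold=10417, predicted=10399),
}

for (tool, lang), c in LAS_COUNTS.items():
    print(f"{tool:7} {lang}  P={100 * c.precision:6.2f}  "
          f"R={100 * c.recall:6.2f}  F1={100 * c.f1:6.2f}")
```

The same class works for any other row (UPOS, UAS, and so on) by substituting the corresponding counts.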

## (b) OTHER PARSERS AND/OR LANGUAGES (OPTIONAL, UP TO 3 POINTS)

Once again the task statement is essentially the same as for constituency-based parsing (Section 2.b), but this time for dependency parsing.

In [None]:
import pandas as pd
import numpy as np
import spacy
import random as rn
import re
from google.colab import drive

In [None]:
# Mount Google Drive so the project files are available under /content/gdrive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


### Stanza environment setup

In [None]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.8.1-py3-none-any.whl (970 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.10.1-py2.py3-none-any.whl (421 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nv

In [None]:
import stanza

### Data preprocessing

#### SpaCy

##### EL

In [None]:
# Invoke spaCy's "convert" CLI command to transform the treebanks from .conllu to .spacy format

!python -m spacy convert "/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/el_gdt_train.conllu" "/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/" --converter conllu --n-sents 10 --merge-subtokens
!python -m spacy convert "/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/el_gdt_dev.conllu" "/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/" --converter conllu --n-sents 10 --merge-subtokens

ℹ Grouping every 10 sentences into a document.
✔ Generated output file (167 documents):
/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/el_gdt_train.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (41 documents):
/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/el_gdt_dev.spacy
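The document counts reported by `spacy convert` can be sanity-checked against the treebank. With `--n-sents 10`, the number of documents should be the sentence count divided by ten and rounded up. This assumes the converter emits a final, shorter document for any leftover sentences. The sketch counts sentences as blocks of non-empty lines in CoNLL-U text; both function names are hypothetical and the sample is invented.

```python
import math


def count_sentences(conllu_text):
    """Count sentences as maximal runs of non-empty lines separated by blank lines."""
    count = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if line.strip():
            if not in_sentence:
                count += 1
                in_sentence = True
        else:
            in_sentence = False
    return count


def expected_documents(n_sentences, n_sents=10):
    """Documents produced when grouping n_sents sentences per document (rounding up)."""
    return math.ceil(n_sentences / n_sents)


# Invented three-sentence sample; in practice read the .conllu treebank file instead
sample = (
    "# sent_id = 1\n1\tA\ta\tDET\t_\t_\t0\troot\t_\t_\n\n"
    "# sent_id = 2\n1\tB\tb\tNOUN\t_\t_\t0\troot\t_\t_\n\n"
    "# sent_id = 3\n1\tC\tc\tVERB\t_\t_\t0\troot\t_\t_\n"
)
n = count_sentences(sample)
print(n, expected_documents(n))
```

Running `count_sentences` on the full training and development files and comparing `expected_documents` with the figures in the log is a quick way to confirm that no sentences were dropped during conversion.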


##### ZH

In [None]:
# Invoke spaCy's "convert" CLI command to transform the treebanks from .conllu to .spacy format

!python -m spacy convert "/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/zh_gsd_train.conllu" "/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/" --converter conllu --n-sents 10 --merge-subtokens
!python -m spacy convert "/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/zh_gsd_dev.conllu" "/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/" --converter conllu --n-sents 10 --merge-subtokens

ℹ Grouping every 10 sentences into a document.
✔ Generated output file (400 documents):
/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/zh_gsd_train.spacy
ℹ Grouping every 10 sentences into a document.
✔ Generated output file (50 documents):
/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/zh_gsd_dev.spacy


##### RU

### Training

#### SpaCy

##### EL

In [None]:
# Generate the final configuration file
!python -m spacy init fill-config "/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/el_base_config.cfg" "/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/el_config.cfg"

✔ Auto-filled config with all values
✔ Saved config
/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/el_config.cfg
You can now add your data and train your pipeline:
python -m spacy train el_config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
!python -m spacy train "/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/el_config.cfg" --output "/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/Trained" --paths.train "/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/el_gdt_train.spacy" --paths.dev "/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/el_gdt_dev.spacy"

ℹ Saving to output directory:
/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/Trained
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'tagger', 'morphologizer',
'trainable_lemmatizer', 'parser']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS MORPH...  LOSS TRAIN...  LOSS PARSER  TAG_ACC  POS_ACC  MORPH_ACC  LEMMA_ACC  DEP_UAS  DEP_LAS  SENTS_F  SCORE 
---  ------  ------------  -----------  -------------  -------------  -----------  -------  -------  ---------  ---------  -------  -------  -------  ------
  0       0          0.00       262.43         279.88         310.54       612.21    39.23    38.40      22.06      52.69    21.89     8.39     0.29    0.35
  1     200       5567.76     13862.37       25498.12       23945.71     39980.36    90.00    89.13      74.59      78.02    76.78    68.51    77.88    0.81
  2     400       7967.11      4362.40       12391.19       11459.80  

##### ZH

In [None]:
# Generate the final configuration file
!python -m spacy init fill-config "/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/zh_base_config.cfg" "/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/zh_config.cfg"

✔ Auto-filled config with all values
✔ Saved config
/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/zh_config.cfg
You can now add your data and train your pipeline:
python -m spacy train zh_config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
!python -m spacy train "/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/zh_config.cfg" --output "/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/Trained" --paths.train "/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/zh_gsd_train.spacy" --paths.dev "/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/zh_gsd_dev.spacy"

✔ Created output directory:
/content/gdrive/MyDrive/PLN/3/a/Spacy/ZH/Trained
ℹ Saving to output directory:
/content/gdrive/MyDrive/PLN/3/a/Spacy/ZH/Trained
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'tagger', 'morphologizer',
'trainable_lemmatizer', 'parser']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS TAGGER  LOSS MORPH...  LOSS TRAIN...  LOSS PARSER  TAG_ACC  POS_ACC  MORPH_ACC  LEMMA_ACC  DEP_UAS  DEP_LAS  SENTS_F  SCORE 
---  ------  ------------  -----------  -------------  -------------  -----------  -------  -------  ---------  ---------  -------  -------  -------  ------
  0       0          0.00       445.23         443.90         475.31       421.12    33.29    31.38      88.42      40.17     4.85     4.30     0.00    0.35
  0     200       3979.12     33418.12       33975.74        1867.88     24312.45    78.30    77.07      96.72      40.17    29.06    17.

#### Stanza

##### EL

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.datasets.prepare_tokenizer_treebank UD_Greek-GDT

2024-03-17 14:35:18 INFO: Datasets program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py UD_Greek-GDT
Preparing data for UD_Greek-GDT: el_gdt, el
Reading from ../data/udbase/UD_Greek-GDT/el_gdt-ud-train.conllu and writing to ../data/processed/tokenize/el_gdt.train.gold.conllu
Augmented 15 quotes: Counter({'""': 4, '„“': 4, '″″': 2, '»«': 2, '「」': 2, '““': 1})
Swapped 'w1, w2' for 'w1 ,w2' 22 times
Added 0 new sentences with asdf, zzzz -> asdf,zzzz
Added 4 sentences with parens replaced with square brackets
Reading from ../data/udbase/UD_Greek-GDT/el_gdt-ud-dev.conllu and writing to ../data/processed/tokenize/el_gdt.dev.gold.conllu
Reading from ../data/udbase/UD_Greek-GDT/el_gdt-ud-test.conllu and writing to ../data/processed/tokenize/el_gdt.test.gold.conllu
Tokenizer labels written to ../data/processed/tokenize/el_gdt-ud-train.toklabels
  14 unique MWTs found in data.  MWTs written to ../data/processed/tokenize/el_gdt

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_tokenizer UD_Greek-GDT --step 1000

2024-03-17 14:35:41 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/training/run_tokenizer.py UD_Greek-GDT --step 1000
2024-03-17 14:35:41 DEBUG: UD_Greek-GDT: el_gdt
2024-03-17 14:35:41 INFO: Save file for el_gdt model: el_gdt_tokenizer.pt
2024-03-17 14:35:41 INFO: UD_Greek-GDT: saved_models/tokenize/el_gdt_tokenizer.pt does not exist, training new model
2024-03-17 14:35:41 INFO: Running train step with args: ['--label_file', '../data/processed/tokenize/el_gdt-ud-train.toklabels', '--txt_file', '../data/processed/tokenize/el_gdt.train.txt', '--lang', 'el', '--max_seqlen', '500', '--mwt_json_file', '../data/processed/tokenize/el_gdt-ud-dev-mwt.json', '--dev_txt_file', '../data/processed/tokenize/el_gdt.dev.txt', '--dev_label_file', '../data/processed/tokenize/el_gdt-ud-dev.toklabels', '--dev_conll_gold', '../data/processed/tokenize/el_gdt.dev.gold.conllu', '--conll_file', '/tmp/tmpvarwop8b', '--shorthand', 'el_gdt', '--step', '1000', '

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.datasets.prepare_pos_treebank UD_Greek-GDT

2024-03-17 14:36:21 INFO: Datasets program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/datasets/prepare_pos_treebank.py UD_Greek-GDT
Preparing data for UD_Greek-GDT: el_gdt, el
Reading from ../data/udbase/UD_Greek-GDT/el_gdt-ud-train.conllu and writing to /tmp/tmpr4yf5zuw/el_gdt.train.gold.conllu
Augmented 15 quotes: Counter({'""': 4, '„“': 4, '″″': 2, '»«': 2, '「」': 2, '““': 1})
Swapped 'w1, w2' for 'w1 ,w2' 22 times
Added 0 new sentences with asdf, zzzz -> asdf,zzzz
Added 4 sentences with parens replaced with square brackets
Reading from ../data/udbase/UD_Greek-GDT/el_gdt-ud-dev.conllu and writing to /tmp/tmpr4yf5zuw/el_gdt.dev.gold.conllu
Reading from ../data/udbase/UD_Greek-GDT/el_gdt-ud-test.conllu and writing to /tmp/tmpr4yf5zuw/el_gdt.test.gold.conllu
Copying from /tmp/tmpr4yf5zuw/el_gdt.train.gold.conllu to ../data/processed/pos/el_gdt.train.in.conllu
Copying from /tmp/tmpr4yf5zuw/el_gdt.dev.gold.conllu to ../data/processed/pos/el_gdt.dev.in.conl

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_pos UD_Greek-GDT --max_steps 1000

2024-03-17 14:36:40 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/training/run_pos.py UD_Greek-GDT --max_steps 500
2024-03-17 14:36:40 DEBUG: UD_Greek-GDT: el_gdt
2024-03-17 14:36:40 INFO: Default pretrain should be /root/stanza_resources/el/pretrain/conll17.pt  Attempting to download
2024-03-17 14:36:40 DEBUG: Downloading resource file from https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json: 373kB [00:00, 21.1MB/s]
2024-03-17 14:36:40 INFO: Downloaded file to /root/stanza_resources/resources.json
2024-03-17 14:36:40 DEBUG: Processing parameter "processors"...
2024-03-17 14:36:40 DEBUG: Found pretrain: conll17.
2024-03-17 14:36:40 DEBUG: Found dependencies 

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.datasets.prepare_depparse_treebank UD_Greek-GDT

2024-03-17 14:42:08 INFO: Datasets program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/datasets/prepare_depparse_treebank.py UD_Greek-GDT
2024-03-17 14:42:08 INFO: Using tagger model in saved_models/pos/el_gdt_nocharlm_tagger.pt for el_gdt
2024-03-17 14:42:08 INFO: Using default pretrain for language, found in /root/stanza_resources/el/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
Preparing data for UD_Greek-GDT: el_gdt, el
Reading from ../data/udbase/UD_Greek-GDT/el_gdt-ud-train.conllu and writing to /tmp/tmpja36wjwh/el_gdt.train.gold.conllu
Augmented 15 quotes: Counter({'""': 4, '„“': 4, '″″': 2, '»«': 2, '「」': 2, '““': 1})
Swapped 'w1, w2' for 'w1 ,w2' 22 times
Added 0 new sentences with asdf, zzzz -> asdf,zzzz
Added 4 sentences with parens replaced with square brackets
Reading from ../data/udbase/UD_Greek-GDT/el_gdt-ud-dev.conllu and writing to /tmp/tmpja36wjwh/el_gdt.dev.gold.conllu
Reading from ../data/udbase/UD_

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_depparse UD_Greek-GDT --max_steps 1000

2024-03-17 14:42:54 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/training/run_depparse.py UD_Greek-GDT --max_steps 1000
2024-03-17 14:42:54 DEBUG: UD_Greek-GDT: el_gdt
2024-03-17 14:42:54 INFO: Using default pretrain for language, found in /root/stanza_resources/el/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-03-17 14:42:54 INFO: UD_Greek-GDT: saved_models/depparse/el_gdt_nocharlm_parser.pt does not exist, training new model
2024-03-17 14:42:54 INFO: Using default pretrain for language, found in /root/stanza_resources/el/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-03-17 14:42:54 INFO: Running train depparse for UD_Greek-GDT with args ['--wordvec_dir', '../data/wordvec', '--train_file', '../data/processed/depparse/el_gdt.train.in.conllu', '--eval_file', '../data/processed/depparse/el_gdt.dev.in.conllu', '--output_file', '/tmp/tmpqvvkdoey', '--gold_fi

##### RU

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.datasets.prepare_tokenizer_treebank UD_Russian-GSD

2024-03-17 14:51:05 INFO: Datasets program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/datasets/prepare_tokenizer_treebank.py UD_Russian-GSD
Preparing data for UD_Russian-GSD: ru_gsd, ru
Reading from ../data/udbase/UD_Russian-GSD/ru_gsd-ud-train.conllu and writing to ../data/processed/tokenize/ru_gsd.train.gold.conllu
Augmented 0 quotes: Counter()
Swapped 'w1, w2' for 'w1 ,w2' 39 times
Added 0 new sentences with asdf, zzzz -> asdf,zzzz
Changed 4 sentences to use fancy unicode ellipses
Added 72 sentences with parens replaced with square brackets
Reading from ../data/udbase/UD_Russian-GSD/ru_gsd-ud-dev.conllu and writing to ../data/processed/tokenize/ru_gsd.dev.gold.conllu
Reading from ../data/udbase/UD_Russian-GSD/ru_gsd-ud-test.conllu and writing to ../data/processed/tokenize/ru_gsd.test.gold.conllu
Tokenizer labels written to ../data/processed/tokenize/ru_gsd-ud-train.toklabels
  0 unique MWTs found in data.  MWTs written to ../data/processed/tokenize/r

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_tokenizer UD_Russian-GSD --step 1000

2024-03-17 14:53:00 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/training/run_tokenizer.py UD_Russian-GSD --step 1000
2024-03-17 14:53:00 DEBUG: UD_Russian-GSD: ru_gsd
2024-03-17 14:53:00 INFO: Save file for ru_gsd model: ru_gsd_tokenizer.pt
2024-03-17 14:53:00 INFO: UD_Russian-GSD: saved_models/tokenize/ru_gsd_tokenizer.pt does not exist, training new model
2024-03-17 14:53:00 INFO: Running train step with args: ['--label_file', '../data/processed/tokenize/ru_gsd-ud-train.toklabels', '--txt_file', '../data/processed/tokenize/ru_gsd.train.txt', '--lang', 'ru', '--max_seqlen', '400', '--mwt_json_file', '../data/processed/tokenize/ru_gsd-ud-dev-mwt.json', '--dev_txt_file', '../data/processed/tokenize/ru_gsd.dev.txt', '--dev_label_file', '../data/processed/tokenize/ru_gsd-ud-dev.toklabels', '--dev_conll_gold', '../data/processed/tokenize/ru_gsd.dev.gold.conllu', '--conll_file', '/tmp/tmp0hcwpe_8', '--shorthand', 'ru_gsd', '--step', '10

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.datasets.prepare_pos_treebank UD_Russian-GSD

2024-03-17 14:53:38 INFO: Datasets program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/datasets/prepare_pos_treebank.py UD_Russian-GSD
Preparing data for UD_Russian-GSD: ru_gsd, ru
Reading from ../data/udbase/UD_Russian-GSD/ru_gsd-ud-train.conllu and writing to /tmp/tmpjhw2xtha/ru_gsd.train.gold.conllu
Augmented 0 quotes: Counter()
Swapped 'w1, w2' for 'w1 ,w2' 39 times
Added 0 new sentences with asdf, zzzz -> asdf,zzzz
Changed 4 sentences to use fancy unicode ellipses
Added 72 sentences with parens replaced with square brackets
Reading from ../data/udbase/UD_Russian-GSD/ru_gsd-ud-dev.conllu and writing to /tmp/tmpjhw2xtha/ru_gsd.dev.gold.conllu
Reading from ../data/udbase/UD_Russian-GSD/ru_gsd-ud-test.conllu and writing to /tmp/tmpjhw2xtha/ru_gsd.test.gold.conllu
Copying from /tmp/tmpjhw2xtha/ru_gsd.train.gold.conllu to ../data/processed/pos/ru_gsd.train.in.conllu
Copying from /tmp/tmpjhw2xtha/ru_gsd.dev.gold.conllu to ../data/processed/pos/ru_gsd.dev.i

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_pos UD_Russian-GSD --max_steps 500

2024-03-17 14:53:45 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/training/run_pos.py UD_Russian-GSD --max_steps 500
2024-03-17 14:53:45 DEBUG: UD_Russian-GSD: ru_gsd
2024-03-17 14:53:45 DEBUG: Downloading resource file from https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json: 373kB [00:00, 19.7MB/s]
2024-03-17 14:53:45 INFO: Downloaded file to /root/stanza_resources/resources.json
2024-03-17 14:53:45 DEBUG: Processing parameter "processors"...
2024-03-17 14:53:45 DEBUG: Found forward_charlm: newswiki.
2024-03-17 14:53:45 DEBUG: Found dependencies [] for processor forward_charlm model newswiki
2024-03-17 14:53:45 INFO: Downloading these customized packages f

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.datasets.prepare_depparse_treebank UD_Russian-GSD

2024-03-17 15:01:53 INFO: Datasets program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/datasets/prepare_depparse_treebank.py UD_Russian-GSD
2024-03-17 15:01:53 INFO: Using tagger model in saved_models/pos/ru_gsd_charlm_tagger.pt for ru_gsd
2024-03-17 15:01:53 INFO: Using default pretrain for language, found in /root/stanza_resources/ru/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-03-17 15:01:53 INFO: Using model /root/stanza_resources/ru/forward_charlm/newswiki.pt for forward charlm
2024-03-17 15:01:53 INFO: Using model /root/stanza_resources/ru/backward_charlm/newswiki.pt for backward charlm
Preparing data for UD_Russian-GSD: ru_gsd, ru
Reading from ../data/udbase/UD_Russian-GSD/ru_gsd-ud-train.conllu and writing to /tmp/tmpzummupqa/ru_gsd.train.gold.conllu
Augmented 0 quotes: Counter()
Swapped 'w1, w2' for 'w1 ,w2' 39 times
Added 0 new sentences with asdf, zzzz -> asdf,zzzz
Changed 4 sentences to use fancy unic

In [None]:
!cd /content/gdrive/MyDrive/PLN/3/stanza-train/stanza && source scripts/config.sh && python3 -m stanza.utils.training.run_depparse UD_Russian-GSD --max_steps 1000

2024-03-17 15:02:48 INFO: Training program called with:
/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/stanza/utils/training/run_depparse.py UD_Russian-GSD --max_steps 1000
2024-03-17 15:02:48 DEBUG: UD_Russian-GSD: ru_gsd
2024-03-17 15:02:48 INFO: Using model /root/stanza_resources/ru/forward_charlm/newswiki.pt for forward charlm
2024-03-17 15:02:48 INFO: Using model /root/stanza_resources/ru/backward_charlm/newswiki.pt for backward charlm
2024-03-17 15:02:48 INFO: Using default pretrain for language, found in /root/stanza_resources/ru/pretrain/conll17.pt  To use a different pretrain, specify --wordvec_pretrain_file
2024-03-17 15:02:48 INFO: UD_Russian-GSD: saved_models/depparse/ru_gsd_charlm_parser.pt does not exist, training new model
2024-03-17 15:02:48 INFO: Using model /root/stanza_resources/ru/forward_charlm/newswiki.pt for forward charlm
2024-03-17 15:02:48 INFO: Using model /root/stanza_resources/ru/backward_charlm/newswiki.pt for backward charlm
2024-03-17 15:02:48 INFO: U
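
The cells above run the same six Stanza modules in a fixed order for each treebank: prepare and train the tokenizer, then the tagger, then the parser. As a rough sketch, that sequence could be generated from a single list. The helper `build_commands` and the constants `STANZA_DIR` and `STEPS` are hypothetical names introduced here, and the step counts are the ones used in the `UD_Russian-GSD` cells. The code only builds and prints the shell commands; it does not run them.

```python
from typing import List

# Directory of the stanza checkout used throughout the notebook.
STANZA_DIR = "/content/gdrive/MyDrive/PLN/3/stanza-train/stanza"

# (module, extra flags) in the order the notebook runs them for UD_Russian-GSD.
STEPS = [
    ("stanza.utils.datasets.prepare_tokenizer_treebank", ""),
    ("stanza.utils.training.run_tokenizer", "--step 1000"),
    ("stanza.utils.datasets.prepare_pos_treebank", ""),
    ("stanza.utils.training.run_pos", "--max_steps 500"),
    ("stanza.utils.datasets.prepare_depparse_treebank", ""),
    ("stanza.utils.training.run_depparse", "--max_steps 1000"),
]


def build_commands(treebank: str) -> List[str]:
    """Return the shell commands for one treebank, in execution order."""
    prefix = f"cd {STANZA_DIR} && source scripts/config.sh && "
    commands = []
    for module, flags in STEPS:
        # strip() drops the trailing space left when a step has no extra flags.
        commands.append(f"{prefix}python3 -m {module} {treebank} {flags}".strip())
    return commands


for command in build_commands("UD_Russian-GSD"):
    print(command)
```

The order matters: the log of `prepare_depparse_treebank` shows that it loads the tagger saved under `saved_models/pos/`, so the tagger has to be trained before the parser data is prepared.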

### Test

In [None]:
# Prints the evaluation results. conll18_ud_eval only implements this inside main,
# so this wrapper makes the same output available as a function outside the script.
def print_eval(verbose:bool, counts:bool, evaluation):
  if not verbose and not counts:
    print("LAS F1 Score: {:.2f}".format(100 * evaluation["LAS"].f1))
    print("MLAS Score: {:.2f}".format(100 * evaluation["MLAS"].f1))
    print("BLEX Score: {:.2f}".format(100 * evaluation["BLEX"].f1))
  else:
    if counts:
      print("Metric     | Correct   |      Gold | Predicted | Aligned")
    else:
      print("Metric     | Precision |    Recall |  F1 Score | AligndAcc")
    print("-----------+-----------+-----------+-----------+-----------")
    for metric in ["Tokens", "Sentences", "Words", "UPOS", "XPOS", "UFeats", "AllTags", "Lemmas", "UAS", "LAS", "CLAS", "MLAS", "BLEX"]:
      if counts:
        print("{:11}|{:10} |{:10} |{:10} |{:10}".format(
        metric,
        evaluation[metric].correct,
        evaluation[metric].gold_total,
        evaluation[metric].system_total,
        evaluation[metric].aligned_total or (evaluation[metric].correct if metric == "Words" else "")
        ))
      else:
        print("{:11}|{:10.2f} |{:10.2f} |{:10.2f} |{}".format(
        metric,
        100 * evaluation[metric].precision,
        100 * evaluation[metric].recall,
        100 * evaluation[metric].f1,
        "{:10.2f}".format(100 * evaluation[metric].aligned_accuracy) if evaluation[metric].aligned_accuracy is not None else ""
        ))
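
`print_eval` reads `correct`, `gold_total`, `system_total`, `aligned_total`, `precision`, `recall`, `f1` and `aligned_accuracy` from each metric. The sketch below shows how the four derived values follow from the four counts, assuming the standard definitions of precision, recall and F1. The `Score` class here is a hypothetical stand-in written for illustration, not the class from `conll18_ud_eval`. The example counts are the UPOS row reported further down for the Stanza EL model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Score:
    """Hypothetical stand-in holding the four counts printed by print_eval."""
    gold_total: int
    system_total: int
    correct: int
    aligned_total: Optional[int] = None

    @property
    def precision(self) -> float:
        # Share of predicted units that are correct.
        return self.correct / self.system_total if self.system_total else 0.0

    @property
    def recall(self) -> float:
        # Share of gold units that were recovered.
        return self.correct / self.gold_total if self.gold_total else 0.0

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall, written in terms of counts.
        total = self.system_total + self.gold_total
        return 2 * self.correct / total if total else 0.0

    @property
    def aligned_accuracy(self) -> Optional[float]:
        # Accuracy restricted to words aligned between gold and system output.
        if not self.aligned_total:
            return None
        return self.correct / self.aligned_total


# UPOS counts from the Stanza EL table: Correct, Gold, Predicted, Aligned.
upos = Score(gold_total=10672, system_total=10693, correct=10641, aligned_total=10643)
print(f"P={100 * upos.precision:.2f} R={100 * upos.recall:.2f} F1={100 * upos.f1:.2f}")
```

Because F1 divides by both the gold and the predicted totals, a tokenization mismatch lowers every downstream metric even when the tags on the aligned words are correct.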

In [None]:
import sys
sys.path.append('/content/gdrive/MyDrive/PLN/3')
import conll18_ud_eval as conll

#### SpaCy

In [None]:
!pip install spacy_conll

Collecting spacy_conll
  Downloading spacy_conll-3.4.0-py3-none-any.whl (21 kB)
Installing collected packages: spacy_conll
Successfully installed spacy_conll-3.4.0


In [None]:
import spacy_conll

##### EL

In [None]:
nlp = spacy.load("/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/Trained/model-best")
config = {"include_headers": True}
nlp.add_pipe("conll_formatter", config=config, last=True)

ConllFormatter(conversion_maps=None, ext_names={'conll_str': 'conll_str', 'conll': 'conll', 'conll_pd': 'conll_pd'}, field_names={'ID': 'ID', 'FORM': 'FORM', 'LEMMA': 'LEMMA', 'UPOS': 'UPOS', 'XPOS': 'XPOS', 'FEATS': 'FEATS', 'HEAD': 'HEAD', 'DEPREL': 'DEPREL', 'DEPS': 'DEPS', 'MISC': 'MISC'}, include_headers=True, disable_pandas=False)

In [None]:
input_file = "/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/el_gdt_test.txt"
output_file = "/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/output.conllu"
gold_standard = "/content/gdrive/MyDrive/PLN/3/b/Spacy/EL/el_gdt_gold_test.conllu"

In [None]:
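import re

# Join the lines of the test file and collapse whitespace runs before parsing.
with open(input_file, 'r', encoding='utf-8') as file:
  t_input = re.sub(r"\s\s+", " ", " ".join(file.read().splitlines()))
doc = nlp(t_input)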
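
The preprocessing in the cell above joins the lines of the file with single spaces and then collapses any run of two or more whitespace characters into one space. A small self-contained sketch of that step follows; the function name `normalize_whitespace` and the sample text are made up for illustration.

```python
import re


def normalize_whitespace(raw: str) -> str:
    """Join lines with spaces and collapse runs of 2+ whitespace characters."""
    return re.sub(r"\s\s+", " ", " ".join(raw.splitlines()))


sample = "First line.\nSecond   line.\n\nThird line."
print(normalize_whitespace(sample))
```

Empty lines disappear in this step, so any paragraph or sentence boundary encoded as a line break in the input is lost and sentence splitting is left entirely to the pipeline.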

In [None]:
# Save the processed text to the output file.
with open(output_file, 'w', encoding='utf-8') as of:
  of.write(doc._.conll_str)
  of.write('\n')

In [None]:
gold = conll.load_conllu_file(gold_standard)
test = conll.load_conllu_file(output_file)
metrics = conll.evaluate(gold, test)

In [None]:
print_eval(True, True, metrics)
print('\n')
print_eval(True, False, metrics)

Metric     | Correct   |      Gold | Predicted | Aligned
-----------+-----------+-----------+-----------+-----------
Tokens     |     10407 |     10422 |     10431 |          
Sentences  |       401 |       456 |       439 |          
Words      |     10157 |     10672 |     10431 |     10157
UPOS       |      9734 |     10672 |     10431 |     10157
XPOS       |      9718 |     10672 |     10431 |     10157
UFeats     |      9088 |     10672 |     10431 |     10157
AllTags    |      8915 |     10672 |     10431 |     10157
Lemmas     |      9076 |     10672 |     10431 |     10157
UAS        |      8659 |     10672 |     10431 |     10157
LAS        |      7894 |     10672 |     10431 |     10157
CLAS       |      3847 |      5633 |      5164 |      5624
MLAS       |      3100 |      5633 |      5164 |      5624
BLEX       |      3272 |      5633 |      5164 |      5624


Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-------

##### ZH

In [None]:
nlp = spacy.load("/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/Trained/model-best")
config = {"include_headers": True}
nlp.add_pipe("conll_formatter", config=config, last=True)

ConllFormatter(conversion_maps=None, ext_names={'conll_str': 'conll_str', 'conll': 'conll', 'conll_pd': 'conll_pd'}, field_names={'ID': 'ID', 'FORM': 'FORM', 'LEMMA': 'LEMMA', 'UPOS': 'UPOS', 'XPOS': 'XPOS', 'FEATS': 'FEATS', 'HEAD': 'HEAD', 'DEPREL': 'DEPREL', 'DEPS': 'DEPS', 'MISC': 'MISC'}, include_headers=True, disable_pandas=False)

In [None]:
input_file = "/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/zh_gsd_test.txt"
output_file = "/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/output.conllu"
gold_standard = "/content/gdrive/MyDrive/PLN/3/b/Spacy/ZH/zh_gsd_gold_test.conllu"

In [None]:
with open(input_file, 'r', encoding='utf-8') as file:
  t_input = re.sub(r"\s\s+", " ", " ".join(file.read().splitlines()))
doc = nlp(t_input)

In [None]:
# Save the processed text to the output file.
with open(output_file, 'w', encoding='utf-8') as of:
  of.write(doc._.conll_str)
  of.write('\n')

In [None]:
gold = conll.load_conllu_file(gold_standard)
test = conll.load_conllu_file(output_file)
metrics = conll.evaluate(gold, test)

In [None]:
print_eval(True, True, metrics)
print('\n')
print_eval(True, False, metrics)

Metric     | Correct   |      Gold | Predicted | Aligned
-----------+-----------+-----------+-----------+-----------
Tokens     |      6157 |     12012 |     19206 |          
Sentences  |       240 |       500 |      1084 |          
Words      |      6157 |     12012 |     19206 |      6157
UPOS       |      5488 |     12012 |     19206 |      6157
XPOS       |      5580 |     12012 |     19206 |      6157
UFeats     |      6014 |     12012 |     19206 |      6157
AllTags    |      5394 |     12012 |     19206 |      6157
Lemmas     |      6157 |     12012 |     19206 |      6157
UAS        |      1455 |     12012 |     19206 |      6157
LAS        |      1296 |     12012 |     19206 |      6157
CLAS       |       445 |      7782 |     14124 |      2205
MLAS       |       364 |      7782 |     14124 |      2205
BLEX       |       445 |      7782 |     14124 |      2205


Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-------
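
One possible reading of the ZH counts above is that the pipeline splits the text into many more tokens and sentences than the gold standard contains, which would limit every downstream metric. The sketch below quantifies that with a simple predicted-to-gold ratio. It is a diagnostic written for this notebook, not part of `conll18_ud_eval`, and the helper name `segmentation_ratio` is hypothetical. The counts are copied from the table above.

```python
# Gold and Predicted counts copied from the spaCy ZH table above.
counts = {
    "Tokens": {"gold": 12012, "predicted": 19206},
    "Sentences": {"gold": 500, "predicted": 1084},
}


def segmentation_ratio(gold: int, predicted: int) -> float:
    """Predicted units per gold unit; values well above 1 point to over-segmentation."""
    return predicted / gold


ratios = {name: segmentation_ratio(c["gold"], c["predicted"]) for name, c in counts.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} predicted per gold unit")
```

A ratio close to 1 would be expected when tokenizer and sentence splitter agree with the treebank; the same helper can be applied to the EL and RU tables for comparison.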

#### Stanza

In [None]:
import stanza
from stanza.utils.conll import CoNLL

##### EL

In [None]:
input_file = "/content/gdrive/MyDrive/PLN/3/stanza-train/data/udbase/UD_Greek-GDT/el_gdt-ud-test.txt"
output_file = "/content/gdrive/MyDrive/PLN/3/b/Stanza/EL_output.conllu"
gold_standard = "/content/gdrive/MyDrive/PLN/3/stanza-train/data/processed/depparse/el_gdt.test.gold.conllu"

In [None]:
nlp = stanza.Pipeline('el', processors='tokenize,pos,lemma,depparse', use_gpu=True,
                      tokenize_model_path = '/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/saved_models/tokenize/el_gdt_tokenizer.pt',
                      pos_model_path = '/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/saved_models/pos/el_gdt_nocharlm_tagger.pt',
                      depparse_model_path = '/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/saved_models/depparse/el_gdt_nocharlm_parser.pt')

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES



INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: el (Greek):
| Processor | Package                 |
---------------------------------------
| tokenize  | /content/g...kenizer.pt |
| mwt       | gdt                     |
| pos       | /content/g..._tagger.pt |
| lemma     | gdt_nocharlm            |
| depparse  | /content/g..._parser.pt |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Done loading processors!


In [None]:
with open(input_file, 'r', encoding='utf-8') as file:
  t_input = re.sub(r"\s\s+", " ", " ".join(file.read().splitlines()))
doc = nlp(t_input)

In [None]:
# Save the processed text to the output file.
CoNLL.write_doc2conll(doc, output_file)

In [None]:
gold = conll.load_conllu_file(gold_standard)
test = conll.load_conllu_file(output_file)
metrics = conll.evaluate(gold, test)

In [None]:
print_eval(True, True, metrics)
print('\n')
print_eval(True, False, metrics)

Metric     | Correct   |      Gold | Predicted | Aligned
-----------+-----------+-----------+-----------+-----------
Tokens     |     10395 |     10422 |     10444 |          
Sentences  |       399 |       456 |       442 |          
Words      |     10643 |     10672 |     10693 |     10643
UPOS       |     10641 |     10672 |     10693 |     10643
XPOS       |     10641 |     10672 |     10693 |     10643
UFeats     |     10633 |     10672 |     10693 |     10643
AllTags    |     10633 |     10672 |     10693 |     10643
Lemmas     |     10240 |     10672 |     10693 |     10643
UAS        |      9537 |     10672 |     10693 |     10643
LAS        |      9237 |     10672 |     10693 |     10643
CLAS       |      4560 |      5633 |      5630 |      5609
MLAS       |      4486 |      5633 |      5630 |      5609
BLEX       |      4284 |      5633 |      5630 |      5609


Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-------

##### RU

In [None]:
input_file = "/content/gdrive/MyDrive/PLN/3/stanza-train/data/udbase/UD_Russian-GSD/ru_gsd-ud-test.txt"
output_file = "/content/gdrive/MyDrive/PLN/3/b/Stanza/RU_output.conllu"
gold_standard = "/content/gdrive/MyDrive/PLN/3/stanza-train/data/processed/depparse/ru_gsd.test.gold.conllu"

In [None]:
nlp = stanza.Pipeline('ru', processors='tokenize,pos,lemma,depparse', use_gpu=True,
                      tokenize_model_path = '/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/saved_models/tokenize/ru_gsd_tokenizer.pt',
                      pos_model_path = '/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/saved_models/pos/ru_gsd_charlm_tagger.pt',
                      depparse_model_path = '/content/gdrive/MyDrive/PLN/3/stanza-train/stanza/saved_models/depparse/ru_gsd_charlm_parser.pt')

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES



INFO:stanza:Downloaded file to /root/stanza_resources/resources.json


Downloading https://huggingface.co/stanfordnlp/stanza-ru/resolve/v1.8.0/models/tokenize/syntagrus.pt:   0%|   …

Downloading https://huggingface.co/stanfordnlp/stanza-ru/resolve/v1.8.0/models/pos/syntagrus_charlm.pt:   0%| …

Downloading https://huggingface.co/stanfordnlp/stanza-ru/resolve/v1.8.0/models/lemma/syntagrus_nocharlm.pt:   …

Downloading https://huggingface.co/stanfordnlp/stanza-ru/resolve/v1.8.0/models/depparse/syntagrus_charlm.pt:  …

INFO:stanza:Loading these models for language: ru (Russian):
| Processor | Package                 |
---------------------------------------
| tokenize  | /content/g...kenizer.pt |
| pos       | /content/g..._tagger.pt |
| lemma     | syntagrus_nocharlm      |
| depparse  | /content/g..._parser.pt |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Done loading processors!


In [None]:
with open(input_file, 'r', encoding='utf-8') as file:
  t_input = re.sub(r"\s\s+", " ", " ".join(file.read().splitlines()))
doc = nlp(t_input)

In [None]:
# Save the processed text to the output file.
CoNLL.write_doc2conll(doc, output_file)

In [None]:
gold = conll.load_conllu_file(gold_standard)
test = conll.load_conllu_file(output_file)
metrics = conll.evaluate(gold, test)

In [None]:
print_eval(True, True, metrics)
print('\n')
print_eval(True, False, metrics)

Metric     | Correct   |      Gold | Predicted | Aligned
-----------+-----------+-----------+-----------+-----------
Tokens     |     11329 |     11385 |     11391 |          
Sentences  |       574 |       601 |       598 |          
Words      |     11329 |     11385 |     11391 |     11329
UPOS       |     11329 |     11385 |     11391 |     11329
XPOS       |     11325 |     11385 |     11391 |     11329
UFeats     |     11317 |     11385 |     11391 |     11329
AllTags    |     11314 |     11385 |     11391 |     11329
Lemmas     |     10666 |     11385 |     11391 |     11329
UAS        |     10007 |     11385 |     11391 |     11329
LAS        |      9559 |     11385 |     11391 |     11329
CLAS       |      5993 |      7343 |      7349 |      7304
MLAS       |      5917 |      7343 |      7349 |      7304
BLEX       |      5539 |      7343 |      7349 |      7304


Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-------