#### Introduction


In this tutorial, we fine-tune a transformer model for *Named Entity Recognition* (NER), starting from BETO (a Spanish adaptation of the BERT model).
The goal is to train a transformer-based pipeline with the spaCy and scikit-learn libraries and to evaluate it with k-fold cross-validation.

**Warning**: This processing chain runs only in the notebook interface (on a Colab GPU runtime), because it relies on `!` shell commands. Outside a notebook, the native *subprocess* library can be used to execute the same bash commands.
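As a minimal sketch of that alternative, the snippet below runs a command through the standard `subprocess` module instead of a `!` cell. The helper name `run_command` is hypothetical, and the command shown is only a stand-in; in a real script the argument list would be the spaCy command line used later in this notebook.

```python
import subprocess
import sys


def run_command(args):
    """Run a command and return (exit code, output); stderr is merged into stdout."""
    completed = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,
    )
    return completed.returncode, completed.stdout


# Stand-in command. For training, the list could look like:
# [sys.executable, "-m", "spacy", "train", "config.cfg", "--output", "some_output_dir"]
code, output = run_command([sys.executable, "-c", "print('hello from a subprocess')"])
print(code, output.strip())
```

Passing the arguments as a list avoids shell quoting issues, and checking the returned exit code makes a failed step visible instead of silently continuing.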

##### Flow of the notebook

The notebook is divided into separate sections to provide an organized walkthrough of the process.

1. [Checking system and requirements](#section00)
2. [Configuration system](#section01)
3. [Finetuning](#section02) \
3-1. [Preprocessing](#section021) \
3-2. [Training](#section022)
4. [Global evaluation](#section03)

<a name="section00"></a>
#### Checking

In [10]:
#Check gpu activity
!nvidia-smi

Wed Jul 20 19:09:14 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P0    31W /  70W |   2368MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
#open google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


##### install dependencies

In [None]:
!pip install -U pandas
!pip install -U scikit-learn
!pip install -U spacy[transformers]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


<a name="section01"></a>
#### Configuration

In [None]:
#create the working directories on Google Drive (dev and train data, models, cross-validation runs)
import os

base_dir = '/content/drive/MyDrive/nlp'
for folder in (
    base_dir,                                  #specific train
    f'{base_dir}/spacyNER_data',               #data
    f'{base_dir}/spacyNER_model',              #model
    f'{base_dir}/spacyNER_model/cross_valid',  #cross_validation
):
  #exist_ok=True skips directories that already exist
  os.makedirs(folder, exist_ok=True)

In [None]:
import spacy
spacy.require_gpu()

True

<a id='section02'></a>
### Finetuning

<a id='section021'></a>
#### Preprocessing

In [None]:
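def split_dataset(dataframe, train_ratio: float):
  """
  split a dataframe into a train part and a test part
  :dataframe: dataframe to split
  :train_ratio: float, share of rows kept for training (strictly between 0 and 1)
  :return: train dataframe, test dataframe
  """
  from sklearn.model_selection import train_test_split

  assert 0 < train_ratio < 1, "train_ratio must be a value strictly between 0 and 1"

  #no random_state is set, so the split changes on every run
  train_df, test_df = train_test_split(dataframe, test_size=1 - train_ratio)

  return train_df, test_df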

In [None]:
def convert_binary(name: str, data):
  """
  convert a dataset (DataFrame) to the binary spaCy format (.spacy)
  :name: str, name of the binary output file (without extension)
  :data: dataframe with a "text" column and a "label" column of [start, end, label] spans
  :return: number of tokens and number of entities in the file
  """
  from spacy.tokens import DocBin

  #blank Spanish pipeline, used only for tokenization
  nlp = spacy.blank("es")
  # the DocBin will store the example documents
  db = DocBin(attrs=["ENT_IOB", "ENT_TYPE"])

  #counting variables
  n_token = 0
  n_entities = 0

  for index, row in data.iterrows():
    doc = nlp(row["text"])
    n_token += len(doc)
    n_entities += len(row["label"])
    ents = []
    for ent in row["label"]:
      start, end, label = tuple(ent)
      #char_span returns None when the offsets do not match token boundaries
      span = doc.char_span(start, end, label=label)
      if span is not None:
        ents.append(span)
      else:
        n_entities -= 1
    try:
      doc.ents = ents
    except (TypeError, ValueError):
      #spaCy rejects overlapping spans; print them for inspection
      print(ents)
    db.add(doc)
  #write the file once, after all documents have been added
  db.to_disk(f"/content/drive/MyDrive/nlp/spacyNER_data/{name}.spacy")
  return n_token, n_entities
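
The conversion above silently drops spans that spaCy cannot align and prints documents whose spans conflict. A small pre-check can make such rows easier to find before conversion. The sketch below is only an illustration under assumptions: the helper `check_spans` is hypothetical, the sample row is built by hand from the first line of the letter used later in this notebook, and the check covers only bounds and overlaps, not the token-boundary alignment that `doc.char_span` performs.

```python
def check_spans(text, spans):
    """Split [start, end, label] annotations into (valid, rejected) lists.

    A span is rejected if its offsets fall outside the text, are empty or
    reversed, or overlap a span that was already accepted.
    """
    valid, rejected = [], []
    for start, end, label in sorted(spans, key=lambda s: (s[0], s[1])):
        in_bounds = 0 <= start < end <= len(text)
        # Accepted spans are sorted and disjoint, so only the last one can overlap.
        overlaps = bool(valid) and start < valid[-1][1]
        if in_bounds and not overlaps:
            valid.append((start, end, label))
        else:
            rejected.append((start, end, label))
    return valid, rejected


# Hand-made row in the same shape as the dataset rows ("text" and "label" columns).
row = {
    "text": "Arauco Diciembre 14 de 1859",
    "label": [[0, 6, "LOC"], [7, 27, "DATE"], [20, 27, "DATE"], [25, 40, "DATE"]],
}
valid, rejected = check_spans(row["text"], row["label"])
print([(row["text"][s:e], label) for s, e, label in valid])
print(rejected)
```

Here the third span overlaps the second and the fourth runs past the end of the text, so both end up in `rejected`.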

In [None]:
def export_json(data: dict):
  import json

  with open("/content/drive/MyDrive/nlp/spacyNER_model/cross_valid/metadata.json", "w", encoding="utf-8") as json_files:
    json.dump(data, json_files, ensure_ascii=False, indent=3)

<a name="section022"></a>
#### Training
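
The training cell in this section relies on scikit-learn's `KFold` to rotate the development set across six folds. To illustrate the idea only, here is a simplified pure-Python sketch of how k-fold index splitting works. It is not scikit-learn's implementation: the function `k_fold_indices` is hypothetical, and unlike the notebook's `KFold(shuffle=True, random_state=2)` it does not shuffle the rows.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, dev_indices) pairs for contiguous, unshuffled folds."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        dev = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, dev
        start = stop


folds = list(k_fold_indices(n_samples=10, n_splits=3))
for number, (train, dev) in enumerate(folds, start=1):
    print(f"fold {number}: {len(train)} train rows, dev rows {dev}")
```

Every row serves as development data exactly once and as training data in all other folds, which is why each fold produces its own model and its own scores.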

In [None]:
from sklearn.model_selection import KFold
import pandas as pd

#import dataset
df = pd.read_json("/content/drive/MyDrive/nlp/spacyNER_data/dataset_araucania.jsonl", lines = True, encoding="utf-8")

#structure dataset

dataset = split_dataset(df, 0.90)

X = dataset[0]
test_dataset = convert_binary("test", dataset[1])
print("test: " + str(test_dataset[0]) + " tokens, " + str(test_dataset[1]) + " entities")

#cross validation

kf = KFold(n_splits=6, shuffle = True, random_state = 2)

#variables
n = 0
results = {}

for train_index , test_index in kf.split(X):
  
  #enumeration
  n += 1
  print("<----- run k " + str(n) + "----->")

  #paths for this fold
  path_ouput = f"/content/drive/MyDrive/nlp/spacyNER_model/cross_valid/fold_{n}"
  #exist_ok=True lets the cell be re-run without failing on existing folders
  os.makedirs(path_ouput, exist_ok=True)
  best_model = f"{path_ouput}/model-best"
  json_output = f"{path_ouput}/scores.json"

  #convert in binary format
  X_train , X_test = X.iloc[train_index,:],X.iloc[test_index,:]
  train_binary = convert_binary("train",X_train)
  dev_binary = convert_binary("dev",X_test)

  #metadata
  results[n] ={
                    "test_data": [test_dataset[0],test_dataset[1]],
                    "train_data" : [train_binary[0],train_binary[1]],
                    "dev_data" : [dev_binary[0],dev_binary[1]]
                }

  #init config
  !python -m spacy init fill-config /content/drive/MyDrive/nlp/base_config.cfg config.cfg

  #training
  !python -m spacy train config.cfg --paths.train /content/drive/MyDrive/nlp/spacyNER_data/train.spacy --paths.dev /content/drive/MyDrive/nlp/spacyNER_data/dev.spacy -g 0 --output $path_ouput
  
  #evaluation
  !python -m spacy evaluate $best_model /content/drive/MyDrive/nlp/spacyNER_data/test.spacy --output $json_output --gold-preproc --gpu-id 0

#export json metadata
export_json(results)


Aborted!
^C
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 183, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.7/runpy.py", line 142, in _get_module_details
    return _get_module_details(pkg_main_name, error)
  File "/usr/lib/python3.7/runpy.py", line 109, in _get_module_details
    __import__(pkg_name)
  File "/usr/local/lib/python3.7/dist-packages/spacy/__init__.py", line 11, in <module>
    from thinc.api import prefer_gpu, require_gpu, require_cpu  # noqa: F401
  File "/usr/local/lib/python3.7/dist-packages/thinc/__init__.py", line 2, in <module>
    import numpy
  File "/usr/local/lib/python3.7/dist-packages/numpy/__init__.py", line 160, in <module>
    from . import polynomial
  File "/usr/local/lib/python3.7/dist-packages/numpy/polynomial/__init__.py", line 121, in <module>
    from .laguerre import Laguerre
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  

KeyboardInterrupt: ignored

<a name="section03"></a>
### Global Evaluation

##### Choose model

In [11]:
path_model = '/content/drive/MyDrive/nlp/spacyNER_model/cross_valid/fold_1/model-best' #@param {type:"string"}

#### Metrics
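
One possible way to compare the folds is to read the `scores.json` file that `spacy evaluate` wrote for each of them and to summarize a single metric. The sketch below is an assumption-laden illustration: the helper `summarize_folds` is hypothetical, and it assumes each scores file is a JSON object with an `ents_f` key (the entity F-score); adjust the key if your files differ.

```python
import json
import statistics
from pathlib import Path


def summarize_folds(cross_valid_dir, metric="ents_f"):
    """Collect one metric from every fold_*/scores.json and summarize it."""
    scores = {}
    for path in sorted(Path(cross_valid_dir).glob("fold_*/scores.json")):
        with path.open(encoding="utf-8") as handle:
            value = json.load(handle).get(metric)
        if value is not None:
            scores[path.parent.name] = value
    values = list(scores.values())
    return {
        "per_fold": scores,
        "mean": statistics.mean(values) if values else None,
        "stdev": statistics.stdev(values) if len(values) > 1 else None,
    }


# Directory used by the training cell; the summary is empty if no fold has been evaluated.
summary = summarize_folds("/content/drive/MyDrive/nlp/spacyNER_model/cross_valid")
print(summary)
```

The mean gives an overall estimate of model quality, while the standard deviation indicates how sensitive the result is to the particular train/dev split.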

#### Visualisation Model

In [12]:
#init GPU spacy
import spacy
gpu = spacy.prefer_gpu()
print('GPU:', gpu)

GPU: True


##### Choose text to analyze

In [13]:
#if img 189 - 203
text_clean = """
Comandancia de Armas
Arauco Diciembre 14 de 1859
En este momento recibí la nota de Usted fecha de ayer.
Inmediatamente hice propio a Los Ángeles, participando al Señor Intendente, el éxito que ha tenido.
Tendré mucho cuidado en sujetar todos los animales que arreen ilegalmente.
Todavía no he recibido los 80 animales que le mandé, para hacer la correspondiente devolución a sus dueños.
Aquí no hay novedad. El conductor don Salvador [Hermosilla] lleva 25 lanzas, con éstas son 124.
Dios guarde a Usted,
José del Carmen Díaz
Al Señor Comandante en Jefe de la División Pacificadora de Arauco

--------------------

Gobierno Interino de
Arauco Enero 2 de 1860
Por el Gobernador del Departamento de Lautaro se me comunica lo que sigue:
Santa Juana, Enero 2 de 1860
Por la Intendencia de mi provincia en nota oficial fecha 29 del mes próximo pasado Nº 472, se me ordena poner a disposición de Usted al reo Juan [Hermosilla], titulado Sargento Mayor de la montonera de Patricio Silva; para que allí sea juzgado, y en su consecuencia se lo remito bajo segura custodia, y Usted se servirá acusar recibo. Dios guarde a Usted, Pascual Ruiz
Yo lo transcribo a Usted para su conocimiento.
Mientras Usted se sirva determinar de dicho reo, he dispuesto mandarlo a bordo del Vapor “Maipú”, para la mayor seguridad.
Dios guarde a Usted,
José del Carmen Díaz
"""

text_brut = """"""

text_htr = """"""

In [14]:
text = text_clean #@param ["text_clean", "text_brut", "text_htr"] {type:"raw"}

##### Process

In [15]:
nlp = spacy.load(path_model)
doc_clean = nlp(text)

In [16]:
from spacy import displacy
from pathlib import Path

#option visualizers ent
colors = {
    "MISC": "#808D8E",
    "LOC": "#766C7F",
    "PERS": "#947EB0",
    "DATE": "#A3A5C3",
    "ORG": "#A9D2D5"
    }
options= {"ents": ["MISC", "LOC", "PERS", "DATE", "ORG"], "colors": colors}

#render (with jupyter=True the markup is displayed in the cell and render returns None)
displacy.render(doc_clean, style="ent", jupyter=True, options=options, page=True)