#### Introduction


In this tutorial, we will fine-tune a transformer model for *Named Entity Recognition* based on the BETO model (a Spanish adaptation of the BERT model). 
The aim is to produce a transformer model with the spaCy and scikit-learn libraries, using k-fold cross-validation.

**Warning**: This processing chain works only through the notebook interface (when using a Colab GPU runtime). Outside the notebook, it is possible to use Python's native `subprocess` library to execute the bash commands.
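
The `!` shell shortcuts used in the cells below are a notebook feature. As a minimal sketch of the alternative, assuming the command is passed as a list of arguments, Python's standard `subprocess` module can run the same commands from a plain script. The `run_command` helper is hypothetical and not part of the original notebook:

```python
import subprocess
import sys

def run_command(args):
    """Run a command, raise CalledProcessError if it fails, and return its standard output."""
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout

# Harmless demonstration: start the current Python interpreter and print a word.
# In a script, a notebook line such as `!python -m spacy train config.cfg` could become
# run_command([sys.executable, "-m", "spacy", "train", "config.cfg"]).
output = run_command([sys.executable, "-c", "print('ok')"])
print(output.strip())
```

Passing the arguments as a list avoids shell quoting issues with paths that contain spaces.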

##### Flow of the notebook

The notebook is divided into separate sections to provide an organized walkthrough of the process.

1. [Checking system and requirements](#section00)
2. [Configuration system](#section01)
3. [Finetuning](#section02) \
3-1. [Preprocessing](#section021) \
3-2. [Training](#section022)
4. [Global evaluation](#section03)

<a name="section00"></a>
#### Checking

In [None]:
#Check gpu activity
!nvidia-smi

Mon Jul 25 16:57:17 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
#open google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


##### install dependencies

In [None]:
!pip install -U pandas
!pip install -U scikit-learn
!pip install -U spacy[transformers]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy-transformers<1.2.0,>=1.1.2
  Downloading spacy_transformers-1.1.7-py2.py3-none-any.whl (53 kB)
Collecting spacy-alignments<1.0.0,>=0.7.2
  Downloading spacy_alignments-0.8.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)
Collecting transformers<4.21.0,>=3.4.0
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)

<a name="section01"></a>
#### Configuration

In [None]:
#create the working directories on Google Drive
import os

base_dir = '/content/drive/MyDrive/nlp'
#data
os.makedirs(f'{base_dir}/spacyNER_data', exist_ok=True)
#model and cross_validation (intermediate directories are created as needed)
os.makedirs(f'{base_dir}/spacyNER_model/cross_valid', exist_ok=True)

In [None]:
import spacy
spacy.require_gpu()

True

<a id='section02'></a>
### Finetuning

<a id='section021'></a>
#### Preprocessing

In [None]:
def split_dataset(dataframe, train_ratio: float):
  """
  Split a dataframe into a training part and a test part
  :dataframe: dataframe
  :train_ratio: float, share of rows kept for training (strictly between 0 and 1)
  :return: tuple (train_df, test_df)
  """
  from sklearn.model_selection import train_test_split

  assert 0 < train_ratio < 1, "train_ratio must be a value strictly between 0 and 1"

  #no random_state is set, so the split differs from one run to the next
  train_df, test_df = train_test_split(dataframe, test_size=1 - train_ratio)

  return train_df, test_df
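
To get a feel for how many rows end up in each partition once this split is combined with the 6-fold cross-validation of the training section, the arithmetic can be sketched in plain Python. This is an illustration under stated assumptions: it assumes scikit-learn rounds a fractional test size up and gives the first folds one extra row when the pool does not divide evenly. The `partition_sizes` helper and the row counts are hypothetical.

```python
import math

def partition_sizes(n_rows, train_ratio=0.90, n_splits=6):
    """Return (test size, list of (train size, dev size) for each fold)."""
    # Assumption: a fractional test size is rounded up to a whole number of rows
    n_test = math.ceil((1 - train_ratio) * n_rows)
    n_pool = n_rows - n_test
    # Assumption: the first (n_pool % n_splits) folds receive one extra dev row
    base, extra = divmod(n_pool, n_splits)
    folds = []
    for i in range(n_splits):
        n_dev = base + (1 if i < extra else 0)
        folds.append((n_pool - n_dev, n_dev))
    return n_test, folds

n_test, folds = partition_sizes(100)
print("test rows:", n_test)
print("train/dev rows per fold:", folds)
```

With 100 rows, 10 are held out for testing and each fold trains on 75 rows and validates on 15.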

In [None]:
def convert_binary(name: str, data):
  """
  Convert a dataset (dataframe) to the binary spaCy format
  :name: str, name of the binary output file
  :data: dataframe with a "text" column and a "label" column of (start, end, label) entries
  :return: count of tokens and entities in file
  """
  from spacy.tokens import DocBin
  
  #Generate tokenization
  nlp = spacy.blank("es")
  # the DocBin will store the example documents
  db = DocBin(attrs=["ENT_IOB", "ENT_TYPE"])

  #counting variable
  n_token = 0
  n_entities = 0

  for index, row in data.iterrows():
    doc = nlp(row["text"])
    n_token += len(doc)
    n_entities += len(row["label"])
    ents = []
    for ent in row["label"]:
      start, end, label = tuple(ent)
      span = doc.char_span(start, end, label=label)
      if span is not None:
        ents.append(span)
      else:
        n_entities -= 1
    try:
      doc.ents = ents
    except TypeError:
      #show the spans that could not be assigned
      print(ents)
    db.add(doc)
  #write the file once, after all documents have been added
  db.to_disk(f"/content/drive/MyDrive/nlp/spacyNER_data/{name}.spacy")
  return n_token, n_entities
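
`doc.char_span` returns `None` when the character offsets of an annotation do not coincide with token boundaries, which is why `convert_binary` decrements `n_entities` in that case. The simplified sketch below illustrates the idea with a whitespace tokenizer. It is a stand-in under that assumption, not spaCy's tokenizer or alignment code, and the sample sentence and offsets are made up:

```python
import re

def token_boundaries(text):
    """Return the sets of start and end character offsets of whitespace-delimited tokens."""
    starts, ends = set(), set()
    for match in re.finditer(r"\S+", text):
        starts.add(match.start())
        ends.add(match.end())
    return starts, ends

def is_aligned(text, start, end):
    """True if the span [start, end) begins and ends exactly on token boundaries."""
    starts, ends = token_boundaries(text)
    return start in starts and end in ends

text = "Arauco Enero 2 de 1860"
for start, end in [(0, 6), (7, 22), (0, 5), (8, 12)]:
    # An annotation that cuts through a token would be dropped by convert_binary
    print(repr(text[start:end]), is_aligned(text, start, end))
```

The first two spans cover whole tokens; the last two cut a token in half and would be discarded.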

In [None]:
def export_json(data: dict):
  """
  Export a JSON file with the models' metadata
  
  :data: dictionary of metadata
  """
  import json

  with open("/content/drive/MyDrive/nlp/spacyNER_model/cross_valid/metadata.json", "w", encoding="utf-8") as json_files:
    json.dump(data, json_files, ensure_ascii=False, indent=3)
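
In the training cell, the metadata dictionary is keyed by the integer fold number. JSON object keys are always strings, so those keys come back as strings when the file is reloaded. A small sketch of that round trip, using a temporary file instead of the Drive path and made-up counts:

```python
import json
import os
import tempfile

def round_trip(data):
    """Write a dict to a temporary JSON file and load it back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "metadata.json")
        with open(path, "w", encoding="utf-8") as json_file:
            json.dump(data, json_file, ensure_ascii=False, indent=3)
        with open(path, encoding="utf-8") as json_file:
            return json.load(json_file)

# Hypothetical metadata for one fold: [token count, entity count]
results = {1: {"train_data": [1000, 90], "dev_data": [200, 18]}}
reloaded = round_trip(results)
print(list(reloaded.keys()))  # the integer key 1 is now the string "1"
```

When reading `metadata.json` later, look folds up with `reloaded["1"]` rather than `reloaded[1]`.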

<a name="section022"></a>
#### Training

In [None]:
# Configuration file
config = '/content/drive/MyDrive/nlp/base_config_NoTransformers.cfg' #@param {type:"string"}

In [None]:
from sklearn.model_selection import KFold
import pandas as pd

#import dataset
df = pd.read_json("/content/drive/MyDrive/nlp/spacyNER_data/dataset_araucania.jsonl", lines = True, encoding="utf-8")

#structure dataset

dataset = split_dataset(df, 0.90)

X = dataset[0]
test_dataset = convert_binary("test", dataset[1])
print("test: " + str(test_dataset[0]) + " tokens, " + str(test_dataset[1]) + " entities")

#K-Fold configuration
kf = KFold(n_splits=6, shuffle = True, random_state = 2)

#variables
n = 0
results = {}

#loop on k
for train_index , test_index in kf.split(X):
  
  #enumeration
  n += 1
  print("<----- run k " + str(n) + "----->")

  #path (exist_ok avoids an error when the notebook is re-run)
  path_ouput = f"/content/drive/MyDrive/nlp/spacyNER_model/cross_valid/fold_{n}"
  os.makedirs(path_ouput, exist_ok=True)
  best_model = f"{path_ouput}/model-best"
  json_output = f"{path_ouput}/scores.json"

  #convert in binary format
  X_train , X_test = X.iloc[train_index,:],X.iloc[test_index,:]
  train_binary = convert_binary("train",X_train)
  dev_binary = convert_binary("dev",X_test)

  #metadata
  results[n] ={
                    "test_data": [test_dataset[0],test_dataset[1]],
                    "train_data" : [train_binary[0],train_binary[1]],
                    "dev_data" : [dev_binary[0],dev_binary[1]]
                }

  #init config
  !python -m spacy init fill-config $config config.cfg

  #training
  !python -m spacy train config.cfg --paths.train /content/drive/MyDrive/nlp/spacyNER_data/train.spacy --paths.dev /content/drive/MyDrive/nlp/spacyNER_data/dev.spacy -g 0 --output $path_ouput
  
  #evaluation
  !python -m spacy evaluate $best_model /content/drive/MyDrive/nlp/spacyNER_data/test.spacy --output $json_output --gold-preproc --gpu-id 0

#export json metadata
export_json(results)

test: 3104 tokens, 276 entities
<----- run k 1----->
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
ℹ Saving to output directory:
/content/drive/MyDrive/nlp/spacyNER_model/cross_valid/fold_1
ℹ Using GPU: 0
[2022-07-25 15:30:09,741] [INFO] Set up nlp object from config
[2022-07-25 15:30:09,751] [INFO] Pipeline: ['tok2vec', 'ner']
[2022-07-25 15:30:09,755] [INFO] Created vocabulary
[2022-07-25 15:30:09,756] [INFO] Finished initializing nlp object
[2022-07-25 15:30:11,929] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  --

<a name="section03"></a>
### Global Evaluation

##### Choose model

Enter the path of the model you selected.

In [None]:
path_model = '/content/drive/MyDrive/nlp/spacyNER_model/cross_valid/fold_6/model-best' #@param {type:"string"}
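
One way to choose the fold is to compare the `scores.json` file written for each fold by `spacy evaluate`. The sketch below assumes each file holds an `ents_f` entry (the overall entity F-score) and that the folds live in `fold_*` directories as in the training cell. `summarize_folds` is a hypothetical helper, and the demonstration uses made-up scores in a temporary directory:

```python
import json
import statistics
import tempfile
from pathlib import Path

def summarize_folds(cross_valid_dir, metric="ents_f"):
    """Return ({fold name: score}, best fold name, mean score) from fold_*/scores.json."""
    scores = {}
    for score_file in sorted(Path(cross_valid_dir).glob("fold_*/scores.json")):
        with open(score_file, encoding="utf-8") as handle:
            scores[score_file.parent.name] = json.load(handle)[metric]
    best_fold = max(scores, key=scores.get)
    return scores, best_fold, statistics.mean(scores.values())

# Demonstration with made-up scores, not results from this notebook
with tempfile.TemporaryDirectory() as tmp_dir:
    for fold, value in {"fold_1": 0.71, "fold_2": 0.78, "fold_3": 0.74}.items():
        fold_dir = Path(tmp_dir) / fold
        fold_dir.mkdir()
        (fold_dir / "scores.json").write_text(json.dumps({"ents_f": value}), encoding="utf-8")
    scores, best_fold, mean_score = summarize_folds(tmp_dir)

print(best_fold, round(mean_score, 3))
```

In the notebook, the call would be `summarize_folds('/content/drive/MyDrive/nlp/spacyNER_model/cross_valid')`. Since every fold is evaluated on the same `test.spacy` file, the scores are directly comparable.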

#### Metrics

#### Visualisation Model

In [None]:
#init GPU spacy
import spacy
#gpu = spacy.prefer_gpu()
#print('GPU:', gpu)

##### Choose text to analyze

`text_clean` is a corrected transcription, modernized to contemporary language. `text_brut` is the XML-ALTO output after spellchecking and post-processing. Finally, `text_htr` is a raw HTR transcription, without filtering or correction.
Comparing these transcriptions makes it possible to observe visually how well the NER model performs on each of them.

In [None]:
text_clean = """
Comandancia de Armas
Arauco Diciembre 14 de 1859
En este momento recibí la nota de Usted fecha de ayer.
Inmediatamente hice propio a Los Ángeles, participando al Señor Intendente, el éxito que ha tenido.
Tendré mucho cuidado en sujetar todos los animales que arreen ilegalmente.
Todavía no he recibido los 80 animales que le mandé, para hacer la correspondiente devolución a sus dueños.
Aquí no hay novedad. El conductor don Salvador [Hermosilla] lleva 25 lanzas, con éstas son 124.
Dios guarde a Usted,
José del Carmen Díaz
Al Señor Comandante en Jefe de la División Pacificadora de Arauco

--------------------

Gobierno Interino de
Arauco Enero 2 de 1860
Por el Gobernador del Departamento de Lautaro se me comunica lo que sigue:
Santa Juana, Enero 2 de 1860
Por la Intendencia de mi provincia en nota oficial fecha 29 del mes próximo pasado Nº 472, se me ordena poner a disposición de Usted al reo Juan Hermosilla, titulado Sargento Mayor de la montonera de Patricio Silva; para que allí sea juzgado, y en su consecuencia se lo remito bajo segura custodia, y Usted se servirá acusar recibo. Dios guarde a Usted, Pascual Ruiz
Yo lo transcribo a Usted para su conocimiento.
Mientras Usted se sirva determinar de dicho reo, he dispuesto mandarlo a bordo del Vapor “Maipú”, para la mayor seguridad.
Dios guarde a Usted,
José del Carmen Díaz
"""

text_brut = """"""

text_htr = """
Comand^oo de armas
En este promente recibi la nota de Ue pha. de Aytt.
Inmediatamente hire prepio a los Anpeles, participando al Sõr Intend^te, el endito que ha tenido.
Tendré mucho cuidado en sufetar todos los animales que Vanien ilegalmente.
Jodabia no he recibo do los 80 animales que le mande, para hacer la corespondiente débolucion asus dueños
Aqui no hai novedad.
El conductor D^n Salvador Elmnosilla lleba 25 lanzar, con estas son 124.
Naueo Obre. 14 de 1859.
Dios gue. aUd.
Jdel C. Dize
Al Señor Coronel Comand^
U Jefe dela Divicion pacificadora de Arauco.

--------------------

E^te inteime de
or el Gobemador del dep^t de Sauteno se me comunica lo que sigue.
Santa Juana Conero 2 de 1860.
Por la Intend^a demi provincia en nota oficial tha 29 del mes ep^a No N72, seme ordena poner a disporicion de Ud al reo Juan llmorcilla, tetulado Sarjento mayor dela montonera de Patricio Silva; para que allí sea juegado; i en su concecuencio selo remito bajo segura custodia; i Ud.
se servirá acuzarme recibo. Dios que a Us Parcual Ruir.
Tolo lo trarcribo a Ul para su conocimiento.
Mientras Ud se sirva determinas de dicho reo, he dispueto mandarlo abordo del vapor Maipú, para la mayor seguridad.
Arauco Lonezo 2 de 1860
No 3
35
1
Al Sõr Comand^o en fefe dela Piarcion deoperaciones de Arauco.
Dios gu~e a Ud
Jel C. Diaz
"""

In [None]:
text = text_htr #@param ["text_clean", "text_brut", "text_htr"] {type:"raw"}

##### Process
If you want to use a non-transformer model, you need to install the standard version of spaCy.

In [None]:
#Processing tokenization
nlp = spacy.load(path_model)
doc_clean = nlp(text)



In [None]:
from spacy import displacy
from pathlib import Path

#option visualizers ent
colors = {
    "MISC": "#808D8E",
    "LOC": "#766C7F",
    "PERS": "#947EB0",
    "DATE": "#A3A5C3",
    "ORG": "#A9D2D5"
    }
options= {"ents": ["MISC", "LOC", "PERS", "DATE", "ORG"], "colors": colors}

#render
html = displacy.render(doc_clean, style="ent", jupyter=True, options=options, page=True)