**First step: Open the "Runtime" menu in Google Colab, choose "Change runtime type", and select GPU.**

**Connection to Google Drive**

Click the "play" button to run the cell. You will be connected to your personal Google Drive.

In [1]:
#connects to google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Set working directory's path**

Open the file browser (the folder icon in the left sidebar) and navigate into the drive folder to find the EMODnet folder. Click the three dots next to the EMODnet folder and copy its path. In the os.chdir call below, replace everything before "/models/bert" with the copied path, so that the call looks something like: os.chdir(r"/your path/EMODnet/models/bert")

In [2]:
#changes the working directory to the location of the folder in your drive
import os
os.chdir(r"/content/drive/MyDrive/Workspace/Melina_Loulakaki/EMODnet/models/bert")
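
Before running the later cells, it can help to confirm that the working directory really contains the folders this notebook reads from and writes to. The sketch below is one possible check, not part of the original workflow: the helper name missing_subfolders is hypothetical, and the folder list is an assumption taken from the relative paths used in the cells further down.

```python
import os

# Folders that later cells read from or write to (assumed from the relative
# paths used in this notebook); adjust the list if your copy differs.
EXPECTED_SUBFOLDERS = ["data", "formatted_data", "config_files", "dictionaries"]

def missing_subfolders(base_dir, expected=EXPECTED_SUBFOLDERS):
    """Return the expected subfolders that do not exist under base_dir."""
    return [name for name in expected
            if not os.path.isdir(os.path.join(base_dir, name))]

# os.getcwd() is the directory selected with os.chdir
missing = missing_subfolders(os.getcwd())
if missing:
    print("Missing folders:", ", ".join(missing))
else:
    print("All expected folders found.")
```

If any folder is reported as missing, the path passed to os.chdir is probably not the models/bert folder of the project.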

**Required installations**

In [None]:
#downloads the CUDA 9.2 installer for Ubuntu 16.04 and saves it as a .deb file
!wget https://developer.nvidia.com/compute/cuda/9.2/Prod/local_installers/cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64 -O cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
#installs the downloaded CUDA 9.2 package
!dpkg -i cuda-repo-ubuntu1604-9-2-local_9.2.88-1_amd64.deb
#adds the public key to the system to authenticate the package
!apt-key add /var/cuda-repo-9-2-local/7fa2af80.pub
#updates the package list
!apt-get update
#installs CUDA 9.2
!apt-get install cuda-9.2

The pip commands below install PyTorch and its related libraries.

In [None]:
#installs PyTorch version 1.7.1 with CUDA 9.2 support, torchvision version 0.8.2 with CUDA 9.2 support, and torchaudio version 0.7.2.
%pip install torch==1.7.1+cu92 torchvision==0.8.2+cu92 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
#installs the torchtext library, which provides data processing utilities and popular datasets for natural language processing tasks.
%pip install torchtext 
#installs the torchvision library, which provides datasets, transforms, and models for computer vision tasks.
%pip install torchvision 
#installs the torchaudio library, which provides audio processing utilities and popular datasets for speech and audio-related tasks.
%pip install torchaudio  

Installs spaCy and downloads the English model en_core_web_trf.


In [None]:
%pip install -U spacy==3.4.2
!python -m spacy download en_core_web_trf

Installs the latest version of spaCy with CUDA 9.2 and Transformers support.


In [None]:
%pip install -U spacy[cuda92,transformers]

This command installs the CuPy library with support for CUDA 9.2. CuPy is a library that provides NumPy-like arrays for GPU computation, which can greatly accelerate certain types of operations. By installing CuPy with support for CUDA 9.2, this command allows for efficient computation on GPUs that support that version of CUDA.

In [None]:
%pip install cupy-cuda92

The cell below sets two environment variables. They are set from Python through os.environ because each ! command runs in its own subshell, so a variable set with !export is lost as soon as that command finishes.

CUDA_PATH is set to the path where CUDA 9.2 is installed (/usr/local/cuda-9.2), which other programs use to locate the CUDA libraries and binaries.

LD_LIBRARY_PATH is a list of directories that the operating system searches when looking for shared libraries. Here, the CUDA 9.2 library directory (\$CUDA_PATH/lib64) is prepended so that programs linked against CUDA can find the necessary libraries. Any existing value of LD_LIBRARY_PATH is preserved.

In [None]:
#sets the variables through os.environ so that they persist for later cells and the commands they launch
import os
os.environ["CUDA_PATH"] = "/usr/local/cuda-9.2"
os.environ["LD_LIBRARY_PATH"] = os.environ["CUDA_PATH"] + "/lib64:" + os.environ.get("LD_LIBRARY_PATH", "")

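The script below fixes a formatting issue in the UBIAI export: when an entry in the first column consists of two or more space-separated words and the second word is at most two characters long (e.g. "harbour :" or "harbour at"), the first two words are concatenated into a single token. The resulting data frames are stored in the variables dftrain, dfdev, and dfeval and then saved as .tsv files in the formatted_data folder.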

In [None]:
#script that fixes the format exported from ubiai (an error occurred every time a string had 2 words and the second one had 2 characters or fewer, e.g. "harbour :" or "harbour at")
import pandas as pd #load pandas library for read_csv
#reads the three datasets into dftrain, dfdev, dfeval variables as dataframes.
dftrain = pd.read_csv("data/training_IOBall.tsv", sep='\t+', header=None, skiprows=1, engine='python', encoding="utf8")
dfdev = pd.read_csv("data/development_IOBall.tsv", sep='\t+', header=None, skiprows=1, engine='python', encoding="utf8")
dfeval = pd.read_csv("data/evaluation_IOBall.tsv", sep='\t+', header=None, skiprows=1, engine='python', encoding="utf8")

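def df_collapse(df):
  for i in range(len(df[0])):
      a=df[0][i].split(" ")
      #joins the first two words when the second one has at most 2 characters; other rows stay unchanged
      if len(a) >= 2 and len(a[1]) <= 2:
        #df.loc avoids chained assignment, which pandas may apply to a copy instead of df
        df.loc[i, 0]=a[0] + a[1]
  return df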

dftrain=df_collapse(dftrain)
dfdev=df_collapse(dfdev)
dfeval=df_collapse(dfeval)

#saves the files as .tsv in the formatted_data folder of the working directory
dftrain.to_csv("formatted_data/dftrain.tsv",index=False,sep="\t",header=None)
dfdev.to_csv("formatted_data/dfdev.tsv",index=False,sep="\t",header=None)
dfeval.to_csv("formatted_data/dfeval.tsv",index=False,sep="\t",header=None)
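
To make the collapsing rule concrete, the sketch below applies the same rule to single strings without pandas. The helper name collapse_token is hypothetical and exists only for illustration. Like df_collapse, it keeps only the first two words when the rule fires.

```python
def collapse_token(text):
    """Join the first two space-separated words when the second has <= 2 characters.

    Mirrors the rule in df_collapse: any words after the second are dropped
    when the rule fires, and other strings are returned unchanged.
    """
    words = text.split(" ")
    if len(words) >= 2 and len(words[1]) <= 2:
        return words[0] + words[1]
    return text

# A few example strings, including the two cases named in the comment above
for example in ["harbour :", "harbour at", "sea water", "harbour"]:
    print(repr(example), "->", repr(collapse_token(example)))
```

Running a handful of entries from your own data through such a helper is a quick way to see which rows the formatting step will change.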


**Convert data into .spacy [doc bin file](https://spacy.io/api/docbin)**

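The cell below uses the spacy convert command in two steps. First, the IOB-formatted .tsv files are converted to JSON: -c iob selects the IOB converter, -t json sets the output file type, -s enables sentence segmentation, and -n 1 sets the number of sentences per document to 1. Then the JSON files are converted to the binary .spacy format used for training.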

In [None]:
#in order to convert from IOB to JSON
!python -m spacy convert formatted_data/dftrain.tsv ./json_data -t json -s -n 1 -c iob #-n 10
!python -m spacy convert formatted_data/dfdev.tsv ./json_data -t json -s -n 1 -c iob
!python -m spacy convert formatted_data/dfeval.tsv ./json_data -t json -s -n 1 -c iob

#in order to convert from json to .spacy
!python -m spacy convert json_data/dftrain.json ./spacy_data -t spacy
!python -m spacy convert json_data/dfdev.json ./spacy_data -t spacy
!python -m spacy convert json_data/dfeval.json ./spacy_data -t spacy
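
For orientation, IOB tagging marks each token as the beginning of an entity (B-), inside an entity (I-), or outside any entity (O). The sketch below decodes such tags into entity spans in plain Python. It only illustrates the tagging scheme and is not what spacy convert runs internally; the function name iob_to_spans and the example tokens and tags are made up for illustration.

```python
def iob_to_spans(tags):
    """Turn a list of IOB tags into (label, start, end) token spans, end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag with the same label continues the open entity
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open entity, if there is one
        if label is not None:
            spans.append((label, start, i))
            start, label = None, None
        # B- opens a new entity; a stray I- is treated like B- in this sketch
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
    # Close an entity that runs to the end of the sequence
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans

tokens = ["Larvae", "were", "collected", "with", "a", "plankton", "net"]
tags = ["B-LIFE_STAGE", "O", "O", "O", "O", "B-SAMPLING_DEVICE", "I-SAMPLING_DEVICE"]
for label, start, end in iob_to_spans(tags):
    print(label, tokens[start:end])
```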


**Initialization of the configuration file**

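Training config files include all settings and hyperparameters for training the pipeline, such as the number of epochs (an "epoch" is one full pass over the entire training dataset). Instead of providing lots of arguments on the command line, you pass a single config file to the train command. The cell below uses init fill-config to complete base_config.cfg with default values for any missing settings and saves the result as config.cfg.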

In [None]:
# #load locale library. Locale may be a string, or an iterable of two strings (language code and encoding).
# import locale
# locale.getpreferredencoding = (lambda *args: 'UTF-8') #sets the encoding to UTF-8

#fills the base configuration file base_config.cfg with default values and saves the result as config.cfg for training
!python -m spacy init fill-config config_files/base_config.cfg config_files/config.cfg

In [None]:
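#checks the training and development data against the config file created in the previous cell
!python -m spacy debug data config_files/config.cfg --paths.train ./spacy_data/dftrain.spacy --paths.dev ./spacy_data/dfdev.spacy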

**Model Training**

The pipeline is trained with spaCy's train command.

In [None]:
#this command is using the spaCy library to train a new language model using the configuration file config.cfg. The training data will be loaded from dftrain.spacy and the development/validation data from dfdev.spacy.
#you can set additional command-line options starting with -- that correspond to the config section and value to override. For example, --paths.train ./corpus/train.spacy sets the train value in the [paths] block.
!python -m spacy train config_files/config.cfg --output ./ --paths.train ./spacy_data/dftrain.spacy --paths.dev ./spacy_data/dfdev.spacy
#after training is complete, the trained model is saved in the working directory as model-best and model-last, use the model-best
#if you want to keep the trained model and not overwrite it if you choose to train another one, after training is complete go to the folder of the project in your working directory and rename it 
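
The way a dotted override such as --paths.train maps onto a config section and key can be sketched with Python's standard configparser. This is only an illustration of the idea and not spaCy's own implementation; the config fragment, its values, and the helper name apply_override are hypothetical.

```python
import configparser

# Hypothetical config fragment; a real config.cfg has many more sections
CONFIG_TEXT = """
[paths]
train = null
dev = null

[training]
max_epochs = 0
"""

def apply_override(config, dotted_key, value):
    """Set config[section][key], where dotted_key has the form 'section.key'."""
    section, key = dotted_key.rsplit(".", 1)
    if not config.has_section(section):
        config.add_section(section)
    config.set(section, key, value)

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

# Similar in spirit to: --paths.train ... --paths.dev ...
apply_override(config, "paths.train", "./spacy_data/dftrain.spacy")
apply_override(config, "paths.dev", "./spacy_data/dfdev.spacy")

print(dict(config["paths"]))
```

Overriding values on the command line leaves the config file itself unchanged, which is convenient when the same config is reused with different datasets.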

**Model's Evaluation**

This command uses spaCy to evaluate the performance of the trained model on the evaluation dataset stored in spacy_data/dfeval.spacy. The model being evaluated is the one that performed best during training and was saved as model-best.

In [None]:
#evaluation of the model
!python -m spacy evaluate model-best spacy_data/dfeval.spacy --gpu-id 0
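
The evaluation reports precision, recall and F-score for the extracted entities. As a rough sketch of how such scores follow from exact span matches, the example below computes them for a made-up set of gold and predicted spans. The function name precision_recall_f1 and the spans are hypothetical; the numbers printed by spacy evaluate come from spaCy's own scorer.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entity spans against gold spans using exact matches."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up spans as (label, start_token, end_token)
gold = [("LIFE_STAGE", 0, 1), ("BODY_SIZE", 5, 7), ("SAMPLING_DEVICE", 9, 10)]
predicted = [("LIFE_STAGE", 0, 1), ("BODY_SIZE", 5, 6)]

p, r, f = precision_recall_f1(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

In this example only one of the two predictions matches a gold span exactly, so a prediction with the right label but the wrong boundary counts as an error.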

**Dictionaries load**

The dictionaries created for distribution descriptors, life stages, body size and sampling devices are stored as .csv files in the dictionaries folder inside the model's folder.

In [None]:
#read the dictionaries into variables df,df1,df2,df3 as pandas dataframes
#read_csv() function imports a CSV file to DataFrame format
df=pd.read_csv("dictionaries/Distribution_descriptors.csv",header=None)
df1=pd.read_csv("dictionaries/Life_stages.csv",header=None)
df2=pd.read_csv("dictionaries/Body_size.csv",header=None)
df3=pd.read_csv("dictionaries/Sampling_devices.csv",header=None)

#keeps the first column of each dictionary into variables to have the ability to access them later
df_distr_descr=df.iloc[:,0]
df_life_stages=df1.iloc[:,0]
df_body_size=df2.iloc[:,0]
df_sampl_devices=df3.iloc[:,0]

**Add dictionaries to the [entity ruler](https://spacy.io/api/entityruler)**

In order to improve the model, dictionaries are added in the pipeline with the help of the entity ruler.

In [None]:
#imports spacy and saves the tokenizer of a blank English pipeline in the tokenizer variable
import spacy
nlp=spacy.blank("en")
tokenizer = nlp.tokenizer

#script that adds the entity_ruler with the dictionaries to the ner pipeline
#the entity ruler runs after the statistical model, so it only adds new entities that match the patterns if they don't overlap with entities already predicted by the model
def entity_ruler(nlp_model,model):
    if "entity_ruler" not in nlp_model.pipe_names:
        ruler=nlp_model.add_pipe("entity_ruler")
    else:
        ruler=nlp_model.get_pipe("entity_ruler")

    #script for adding the desirable patterns of dictionaries' entries into the entity ruler pipe
    def dict_func(df,linkdf,label):
        patterns=[]
        j=0
        for i in df:
            #named "pattern" rather than "dict" to avoid shadowing the built-in dict type
            pattern={"label": label}
            pattern["pattern"]=[{"LOWER" : i.lower()}]
            #takes the id link of the dictionary's entity if it exists; commented out for now because not all entities have links
            # pattern["id"]=linkdf[[1]][1][j]
            patterns.append(pattern)
            tokens=tokenizer(i)
            if len(tokens) == 2:
                #two-token entries also get a variant with a punctuation token in between
                pattern={"label": label}
                pattern["pattern"]=[{"LOWER" : str(tokens[0]).lower()}, {"IS_PUNCT": True}, {"LOWER" : str(tokens[1]).lower()}]
                # pattern["id"]=linkdf[[1]][1][j]
                patterns.append(pattern)
            j=j+1
        ruler.add_patterns(patterns)
    #calls dict func to add the dictionaries
    dict_func(df_distr_descr,df,"DISTRIBUTION_DESCRIPTOR")
    dict_func(df_life_stages,df1,"LIFE_STAGE")
    dict_func(df_body_size,df2,"BODY_SIZE")
    dict_func(df_sampl_devices,df3,"SAMPLING_DEVICE")

    #saves the pipeline, which now includes the entity ruler, to disk under the given name
    nlp_model.to_disk(model)

nlp_full=spacy.load("model-best")#loads the trained model
entity_ruler(nlp_full,"model-best_ruler")#calls the entity_ruler script and saves the model with entity ruler in the name of model-best_ruler
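
To see what kind of patterns dict_func hands to the entity ruler, the sketch below builds the same pattern dictionaries in plain Python. It is an approximation under stated assumptions: entries are split on whitespace with str.split instead of spaCy's tokenizer, and the helper name build_patterns, the example entries and the label are made up for illustration.

```python
def build_patterns(entries, label):
    """Build token patterns in the style of dict_func, using str.split for tokens."""
    patterns = []
    for entry in entries:
        entry = entry.lower()
        # One pattern that matches the whole entry as a single token
        patterns.append({"label": label, "pattern": [{"LOWER": entry}]})
        words = entry.split()
        if len(words) == 2:
            # Two-word entries also get a variant with punctuation in between
            patterns.append({
                "label": label,
                "pattern": [{"LOWER": words[0]}, {"IS_PUNCT": True}, {"LOWER": words[1]}],
            })
    return patterns

# Made-up dictionary entries
for pattern in build_patterns(["Larva", "plankton net"], "EXAMPLE_LABEL"):
    print(pattern)
```

Each dictionary inside a "pattern" list describes exactly one token, so the first pattern of a two-word entry can only match a single token whose lowercase text equals the whole entry.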


**Model Evaluation after dictionaries addition**

This command uses spaCy to evaluate the performance of the trained model, with the dictionaries added through the entity ruler, on the evaluation dataset stored in spacy_data/dfeval.spacy. The model being evaluated was saved as model-best_ruler.

In [None]:
#evaluation of the model after dictionaries addition
!python -m spacy evaluate model-best_ruler ./spacy_data/dfeval.spacy --gpu-id 0

**Model's Performance Testing**

In this section, the trained model is loaded into the nlp_full variable and tested on an example text. The output is the list of entities that the model found in the text.

In [None]:
import spacy
from spacy import displacy

#load trained model
nlp_full=spacy.load("model-best_ruler") 
#put the text you want to extract entities into the nlp_full(" "), it will be saved to the doc variable
doc=nlp_full("Atherinids are small marine, estuarine and freshwater fishes not exceeding 120 mm SL (a soon to be described species of Craterocephalus may reach 300 mm SL), occurring predominantly in the Old World, with only Alepidomus evermanni (freshwaters of Cuba) and two marine species, Atherinomorus stipes and Hypoatherina harringtonensis (predominantly in the shore waters of the Caribbean) known from the New World. I")

print([(ent.text, ent.label_ ,ent.start_char, ent.end_char, ent.ent_id_) for ent in doc.ents]) #prints the text, the entity-label, the start char, the end char and an id link of extracted entities, if it exists
print("\n")

displacy.render(doc, style="ent") #call displacy.render in order to visualise the resulted entities