**Connection to Google Drive**

Click the "play" button to run the cell. This mounts your personal Google Drive at /content/drive.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Import the locale module, which reports the system's locale settings.
# Workaround for UTF-8 encoding errors: make getpreferredencoding() always return 'UTF-8'.
import locale
locale.getpreferredencoding = (lambda *args: 'UTF-8')  # force the reported encoding to UTF-8


**Dependency package installation**

In [3]:
%pip install spacy
%pip install tabulate 
%pip install spacy-transformers


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy-transformers
  Downloading spacy_transformers-1.2.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (192 kB)
Collecting transformers<4.27.0,>=3.4.0
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting spacy-alignments<1.0.0,>=0.7.2
  Downloading spacy_alignments-0.9.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)

**Load required Python libraries for further use**

A Python library is a reusable collection of code that you can include in your programs or projects. For example, the pandas library is used for data analysis.

In [4]:
import spacy
import spacy_transformers
import pandas as pd
import numpy as np
from spacy.tokens.doc import Doc
from spacy.tokens import DocBin
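
If one of the imports above fails, it can help to check which of the installed packages are actually available in the current runtime. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return versions


# The three packages installed with %pip earlier in this notebook.
for name, version in installed_versions(["spacy", "tabulate", "spacy-transformers"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

A package reported as NOT INSTALLED usually means the installation cell needs to be run again, for example after the runtime was restarted.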

**Set working directory's path**

Open the "file folder" icon in the sidebar and navigate into the drive folder to find the EMODNET folder. Click the three dots next to the EMODNET folder and copy its path. In the os.chdir call below, delete the part of the existing path that comes before "models/..." and paste your own in its place, so that you end up with something like: os.chdir(r"/your path/EMODnet/models/trained_models")

In [6]:
# Replace this path with the location of the folder in your own Drive
import os
os.chdir(r"/content/drive/Shareddrives/Advance/Projects/EMODnet_training/models/trained_models")
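
A wrong path is the most likely reason for the model loading step to fail, so it may be worth checking the working directory first. The sketch below assumes the two model folders are named `spacy_model` and `spacy_model_ruler`, as in the loading cell of this notebook; the helper `missing_model_dirs` is a hypothetical name introduced only for illustration.

```python
from pathlib import Path


def missing_model_dirs(base_dir=".", model_names=("spacy_model", "spacy_model_ruler")):
    """Return the model folder names that are not present inside base_dir."""
    base = Path(base_dir)
    return [name for name in model_names if not (base / name).is_dir()]


missing = missing_model_dirs()  # checks the current working directory
if missing:
    print("Missing model folders:", ", ".join(missing))
else:
    print("All expected model folders were found.")
```

If any folder is reported as missing, the path passed to os.chdir probably does not point at the trained_models folder.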

**Load Models**

Loads trained models from the working directory.

In [7]:
# Load the trained model from the working directory.
# To load a different model, replace "spacy_model" with the name of the desired model folder inside "trained_models".
model = spacy.load("spacy_model")

# Load the trained model with the entity ruler from the working directory.
# To load a different model, replace "spacy_model_ruler" with the name of the desired model folder inside "trained_models".
model_ruler = spacy.load("spacy_model_ruler")



**Model's Performance Testing**

In this section the trained model, loaded above into the model variable, is applied to an example text. The output lists the entities that the model found in the text.

In [8]:
# Select the desired model (model or model_ruler) and pass it the text to extract entities from; the result is stored in doc.
doc = model("The maximum body length is 8 mm.")

# Print the text, label, start character, end character and entity ID (if any) of each extracted entity.
print([(ent.text, ent.label_, ent.start_char, ent.end_char, ent.ent_id) for ent in doc.ents])
print("\n")


[('maximum body length', 'BODY_SIZE', 4, 23, 0), ('8 mm', 'BODY_SIZE', 27, 31, 0)]
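
The start and end values in this output are character offsets into the input text, with the end offset exclusive, so slicing the original string recovers each entity. The sketch below checks this using the values copied from the output above; it needs no model, and the helper `offset_mismatches` is a hypothetical name used only for illustration.

```python
def offset_mismatches(text, entities):
    """Return the entities whose (start, end) offsets do not slice back to their text."""
    return [
        (ent_text, label, start, end)
        for ent_text, label, start, end in entities
        if text[start:end] != ent_text
    ]


text = "The maximum body length is 8 mm."
# Entity text, label, start and end offsets copied from the output above.
entities = [
    ("maximum body length", "BODY_SIZE", 4, 23),
    ("8 mm", "BODY_SIZE", 27, 31),
]

for ent_text, label, start, end in entities:
    print(f"{label}: text[{start}:{end}] = {text[start:end]!r}")

print("Mismatches:", offset_mismatches(text, entities))
```

The same check can be useful on longer documents, where a mismatch suggests that the text was modified after the entities were extracted.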




**Performance Testing on a TXT File / Output Entities Matrix Extraction**

To extract the entities that the model recognises in a TXT file, put the file into the drive folder named "txt_files" and change the file name in the cell below.
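
The cell in this section writes its results to the fixed names output.csv and output.json, so processing a second file overwrites the results of the first. One possible alternative, sketched here under the assumption that output names should follow the input file name, is to derive them automatically. The helper `output_paths` is hypothetical and not part of the original notebook.

```python
from pathlib import PurePosixPath


def output_paths(input_filename, output_dir="extracted_entities"):
    """Build CSV and JSON output paths that reuse the input file's base name."""
    stem = PurePosixPath(input_filename).stem  # file name without folder or extension
    out_dir = PurePosixPath(output_dir)
    return str(out_dir / f"{stem}.csv"), str(out_dir / f"{stem}.json")


csv_path, json_path = output_paths("txt_files/Rees_12_Ecology_Cc_in_Amvrakikos.txt")
print(csv_path)
print(json_path)
```

The two returned strings could then be passed to df.to_csv and df.to_json in place of the fixed names.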

In [None]:
filename = "txt_files/Rees_12_Ecology_Cc_in_Amvrakikos.txt"  # the file you want to extract entities from
with open(filename, "r", encoding="utf-8") as file:  # open the file and close it automatically after reading
    context = file.read()

spacy_doc = model(context)  # or model_ruler for the model with dictionaries

# Set up the output dataframe
header = ["Label", "Trait", "Starting_Character_Position", "Ending_character_Postion"]
listOfLists = [[ent.text, ent.label_, ent.start_char, ent.end_char] for ent in spacy_doc.ents]
df = pd.DataFrame(listOfLists, columns=header)
# To change the output file names, replace "output.csv" and "output.json" below with the desired names
df.to_csv("extracted_entities/output.csv", index=True, index_label="Index")  # save the dataframe as output.csv in the extracted_entities folder
df.to_json("extracted_entities/output.json", orient="records")  # save the dataframe as output.json in the extracted_entities folder
print(df.to_markdown(tablefmt="grid"))  # print the dataframe as a grid table

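+-----+--------------------------------------------------------------+-------------------------+-------------------------------+----------------------------+
|     | Label                                                        | Trait                   |   Starting_Character_Position |   Ending_character_Postion |
+=====+==============================================================+=========================+===============================+============================+
|   0 | Springer                                                     | BODY_SIZE               |                           321 |                        331 |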
+-----+--------------------------------------------------------------+-------------------------+-------------------------------+----------------------------+
|   1 | marine turtles                                               | DISTRIBUTION_DESCRIPTOR |                           438 |                        452 |
+-----+--------------------------------------------------------------+-------------------------+-------------------------------+----------------------------+
|   2 | juveniles                        