Import packages

In [1]:
!pip install spacy
!pip install spacy-transformers

Collecting spacy-transformers
  Downloading spacy_transformers-1.3.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB)
Collecting transformers<4.37.0,>=3.4.0 (from spacy-transformers)
  Downloading transformers-4.36.2-py3-none-any.whl.metadata (126 kB)
Collecting spacy-alignments<1.0.0,>=0.7.2 (from spacy-transformers)
  Downloading spacy_alignments-0.9.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.7 kB)
Collecting tokenizers<0.19,>=0.14 (from transformers<4.37.0,>=3.4.0->spacy-transformers)
  Downloading tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading spacy_transformers-1.3.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (197 kB)

In [2]:
import json
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
import spacy_transformers

import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")



In [3]:
# load the Bible data
import pandas as pd
data = pd.read_csv('/content/drive/MyDrive/NLP/NER/t_bbe.csv')
data.head()

Unnamed: 0,id,b,c,v,t
0,1001001,1,1,1,At the first God made the heaven and the earth.
1,1001002,1,1,2,And the earth was waste and without form; and ...
2,1001003,1,1,3,"And God said, Let there be light: and there wa..."
3,1001004,1,1,4,"And God, looking on the light, saw that it was..."
4,1001005,1,1,5,"Naming the light, Day, and the dark, Night. An..."


**Column Explanation**

*   **id** : unique id for each verse
*   **b** : book labeled by number
*   **c** : chapter number from the book
*   **v** : verse number of the chapter
*   **t** : text of the verse

Since our work is on the Book of Genesis, we select only the book labeled with number one (b == 1).

Annotation takes a long time and normally needs a team, so I used only the first 10 chapters for annotation and for fine-tuning the pretrained spaCy model.

In [4]:
genesis = data[(data['b'] == 1) & (data['c'] < 11)]  # parenthesize each comparison: & binds tighter than ==
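When two conditions are combined in a pandas filter, each comparison should sit in its own parentheses, because `&` binds more tightly than `==` and `<` in Python. The sketch below shows the pattern on a tiny made-up table; the column names follow the dataset, but the rows are only illustrative.

```python
import pandas as pd

# Toy stand-in for the verse table: b = book, c = chapter (rows are made up).
df = pd.DataFrame({
    "b": [1, 1, 1, 2],
    "c": [1, 10, 11, 1],
    "t": ["verse a", "verse b", "verse c", "verse d"],
})

# Each comparison is parenthesized so that & combines two boolean masks.
first_ten = df[(df["b"] == 1) & (df["c"] < 11)]

print(first_ten["t"].tolist())
```

Only the rows of book 1 with a chapter number below 11 are kept.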

In [5]:
genesis.tail()

Unnamed: 0,id,b,c,v,t
262,1010028,1,10,28,And Obal and Abimael and Sheba
263,1010029,1,10,29,And Ophir and Havilah and Jobab; all these wer...
264,1010030,1,10,30,"And their country was from Mesha, in the direc..."
265,1010031,1,10,31,"These, with their families and their languages..."
266,1010032,1,10,32,"These are the families of the sons of Noah, in..."


In [6]:
sentences = genesis['t'].tolist()
sentences[0:5]

['At the first God made the heaven and the earth.',
 'And the earth was waste and without form; and it was dark on the face of the deep: and the Spirit of God was moving on the face of the waters.',
 'And God said, Let there be light: and there was light.',
 'And God, looking on the light, saw that it was good: and God made a division between the light and the dark,',
 'Naming the light, Day, and the dark, Night. And there was evening and there was morning, the first day.']

## **File Selection**

Let us save each chapter of Genesis to its own file, named by its chapter number. This helps us annotate each chapter individually.

In [9]:
for i in tqdm(range(1, 51)):  # Genesis has 50 chapters

  # select chapter i of Genesis (b == 1); without the book filter,
  # chapter i of every book would end up in the same file
  my_file = data[(data['b'] == 1) & (data['c'] == i)]

  sentences = ' '.join(my_file['t'])  # join the verses into one string

  file_path = '/content/drive/MyDrive/NLP/NER/GENESIS' + '/bibel_' + str(i) + '.txt' # create path name to save

  with open(file_path, 'w') as file:
    file.write(sentences)

100%|██████████| 50/50 [00:00<00:00, 122.26it/s]
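As an alternative to filtering inside a loop, the verses can be grouped by chapter in one step. This is only a sketch on a made-up table (the rows and the name `chapter_text` are assumptions for illustration), not the code that produced the files above.

```python
import pandas as pd

# Toy verse table with the same column names as the dataset (rows are made up).
verses = pd.DataFrame({
    "b": [1, 1, 1, 2],
    "c": [1, 1, 2, 1],
    "t": ["First verse.", "Second verse.", "Third verse.", "Other book."],
})

# Keep one book, then join the verse texts of each chapter into a single string.
book_one = verses[verses["b"] == 1]
chapter_text = book_one.groupby("c")["t"].apply(" ".join).to_dict()

for chapter, text in chapter_text.items():
    print(chapter, "->", text)
```

Each entry of `chapter_text` could then be written to its own file in the same way as in the loop above.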


***Now annotate the chapter files in doccano, and come back when you are finished.***

## **Doccano to Spacy Conversion**

spaCy needs the data in its own format, so we need to convert the doccano-formatted annotations to the spaCy format.

### Example of doccano format

```
{ "id":1,
  "text": "In the beginning, God created the heavens and the earth.",
  "entities": [
    {"id":1, "start_offset": 3, "end_offset": 16, "label": "TIME"},
    {"id":2, "start_offset": 18, "end_offset": 21, "label": "PERSON"},
    {"id":3, "start_offset": 34, "end_offset": 41, "label": "LOCATION"},
    {"id":4, "start_offset": 50, "end_offset": 55, "label": "LOCATION"}
  ]
}
```

### Expected spacy format

```
[
    (
        "In the beginning, God created the heavens and the earth.",
            [
                (3, 16, "TIME"),
                (18, 21, "PERSON"),
                (34, 41, "LOCATION"),
                (50, 55, "LOCATION")
            ]
    )
]
```

The following function converts a doccano-formatted record into the spaCy format.

In [10]:
def direct_convert(json_data):
  """Convert one doccano record into a (text, [(start, end, label), ...]) tuple."""
  text = json_data['text']
  my_annot = []
  for entity in json_data['entities']:
    start = entity['start_offset']
    end = entity['end_offset']
    label = entity['label']
    my_annot.append((start, end, label))

  return (text, my_annot)
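Character offsets are easy to get wrong, so it can help to slice the text with each `(start, end)` pair and look at what comes back. The helper below is a small sketch; the name `show_spans` and the sample annotation are assumptions made for this illustration.

```python
def show_spans(text, entities):
    """Return (substring, label) for each (start, end, label) triple."""
    return [(text[start:end], label) for start, end, label in entities]


text = "In the beginning, God created the heavens and the earth."
entities = [
    (3, 16, "TIME"),
    (18, 21, "PERSON"),
    (34, 41, "LOCATION"),
    (50, 55, "LOCATION"),
]

spans = show_spans(text, entities)
for substring, label in spans:
    print(repr(substring), label)
```

If a printed substring starts or ends in the middle of a word, the offsets of that entity need to be corrected in the annotation.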

First save your annotated files into one folder; Google Drive is advisable. Then loop over the annotated files and call the conversion function for each JSON record.

In [11]:
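spacy_data = []
for i in range(1, 11):
  file_path = '/content/drive/MyDrive/NLP/NER/FROM DOCCANO/bibel_' + str(i) + '.jsonl'
  # a .jsonl file holds one JSON record per line, so parse it line by line
  with open(file_path) as f:
    for line in f:
      if line.strip():
        record = json.loads(line)
        spacy_data.append( direct_convert(record) ) # merge all files into one variable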

In [12]:
spacy_data[0]

('At the first God made the heaven and the earth. And the earth was waste and without form; and it was dark on the face of the deep: and the Spirit of God was moving on the face of the waters. And God said, Let there be light: and there was light. And God, looking on the light, saw that it was good: and God made a division between the light and the dark, Naming the light, Day, and the dark, Night. And there was evening and there was morning, the first day. And God said, Let there be a solid arch stretching over the waters, parting the waters from the waters. And God made the arch for a division between the waters which were under the arch and those which were over it: and it was so. And God gave the arch the name of Heaven. And there was evening and there was morning, the second day. And God said, Let the waters under the heaven come together in one place, and let the dry land be seen: and it was so. And God gave the dry land the name of Earth; and the waters together in their place we

## **Split Data**

The training process needs train and test data, so we split the spaCy-formatted data.

In [13]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(spacy_data, test_size=0.2) # 20% for test data

In [14]:
len(train), len(test)

(8, 2)
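With only 10 documents, which two chapters land in the test set changes from run to run. A fixed `random_state` makes the split repeatable. The sketch below uses placeholder documents, and the seed value 42 is an arbitrary choice, not one taken from this notebook.

```python
from sklearn.model_selection import train_test_split

# Placeholder documents in the (text, annotations) shape used above.
docs = [(f"chapter {i}", []) for i in range(1, 11)]

# Fixing random_state gives the same split on every run.
train_docs, test_docs = train_test_split(docs, test_size=0.2, random_state=42)

print(len(train_docs), len(test_docs))
```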

## **Convert to DocBin**

`DocBin` serializes annotated `Doc` objects into the binary `.spacy` format that the spaCy training command reads.
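To see what a `DocBin` stores, a small round trip can help: write one annotated document to disk, read it back, and check that the entity survived. This is a sketch with a blank English pipeline and a made-up sentence; the file name `sample.spacy` is an assumption.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp_blank = spacy.blank("en")  # tokenizer only, no trained components needed

# Build one document with a single gold entity.
doc = nlp_blank.make_doc("Noah built an ark.")
doc.ents = [doc.char_span(0, 4, label="PERSON")]

db = DocBin()
db.add(doc)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.spacy"
    db.to_disk(path)
    # Reading back needs a vocab to rebuild the Doc objects.
    loaded = list(DocBin().from_disk(path).get_docs(nlp_blank.vocab))

ents = [(ent.text, ent.label_) for ent in loaded[0].ents]
print(ents)
```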

In [15]:
# Load a pre-trained spaCy model (e.g., 'en_core_web_sm')
nlp = spacy.load("en_core_web_sm")

In [16]:
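def to_docbin(data, path):
  db = DocBin()
  for text, annotations in data:
    doc = nlp.make_doc(text)  # tokenize only; the gold entities are set below
    ents = []
    for start, end, label in annotations:
      # char_span returns None when the offsets do not line up with token boundaries
      span = doc.char_span(start, end, label=label)
      if span is None:
        print("Skipping entity:", (start, end, label))
      else:
        ents.append(span)
    try:
      doc.ents = ents
      db.add(doc)
    except ValueError as err:
      # raised when entity spans overlap; report the dropped document instead of hiding it
      print("Skipping document:", err)
  db.to_disk(path)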

In [17]:
to_docbin(train, "./train.spacy")
to_docbin(test, "./test.spacy")

Skipping entity: (1705, 1716, 'QUANTITY')
Skipping entity: (2962, 2990, 'DATE')
Skipping entity: (2354, 2361, 'PERSON')
Skipping entity: (189, 192, 'PERSON')
Skipping entity: (902, 905, 'PERSON')
Skipping entity: (22, 28, 'LOC')
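An entity is skipped when its character offsets do not line up with token boundaries, because `doc.char_span` returns `None` in that case. The sketch below reproduces this on a made-up span and shows the `alignment_mode="expand"` option, which snaps the span outward to whole tokens. Whether expanding is the right fix depends on the annotation, so treat it as one option rather than the method used in this notebook.

```python
import spacy

nlp_blank = spacy.blank("en")
doc = nlp_blank.make_doc("And God said, Let there be light.")

# Offsets 4..6 cover only "Go", which ends inside the token "God".
strict = doc.char_span(4, 6, label="PERSON")
expanded = doc.char_span(4, 6, label="PERSON", alignment_mode="expand")

print("strict:", strict)
print("expanded:", expanded.text)
```

The cleaner fix is usually to correct the offsets in doccano, since an expanded span can pull in characters the annotator did not intend.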


## **Configuring Spacy**

We need to create a full configuration file from a base config file. See the [spaCy usage guide](https://spacy.io/usage/training) for details.

In [18]:
# change the path based on your location
!python -m spacy init fill-config '/content/drive/MyDrive/NLP/NER/config_files/base_config.cfg' config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


## **Debug Spacy data**

We should check our spaCy data before training. Pass the paths of the train and test data to the command (or set them in the config file). This helps us spot overlapping annotations, labels with too few examples, invalid annotations, and more.

In [19]:
!python -m spacy debug data /content/config.cfg --paths.train /content/train.spacy --paths.dev /content/test.spacy

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
✔ Pipeline can be initialized with data
✔ Corpus is loadable

Language: en
Training pipeline: transformer, ner
8 training docs
1 evaluation docs
✔ No overlap between t

*    We have 4 warnings and 1 error.
*    The error is raised because of the low number of examples, and the 4 warnings are raised because some labels have too few examples.
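To see which labels are short of examples before running the debug command, the entities can be counted per label directly on the `(text, annotations)` tuples. This is a minimal sketch; the helper name `label_counts` and the two sample documents are assumptions for illustration.

```python
from collections import Counter


def label_counts(examples):
    """Count how often each label occurs in (text, [(start, end, label), ...]) data."""
    counts = Counter()
    for _text, annotations in examples:
        for _start, _end, label in annotations:
            counts[label] += 1
    return counts


sample = [
    ("Noah built an ark.", [(0, 4, "PERSON")]),
    ("Noah and Shem saw Ararat.", [(0, 4, "PERSON"), (9, 13, "PERSON"), (18, 24, "LOC")]),
]

counts = label_counts(sample)
for label, n in counts.most_common():
    print(label, n)
```

Calling `label_counts(train)` on the real data would show how strongly one label dominates the others.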

## **Train Spacy pretrained Model**

In [20]:
!python -m spacy train /content/config.cfg --output ./output --paths.train /content/train.spacy --paths.dev /content/test.spacy --gpu-id 0

✔ Created output directory: output
ℹ Saving to output directory: output
ℹ Using GPU: 0

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
✔ Initialized pipeline

ℹ Pipeline: ['transformer', 'ner']
ℹ Initial learn rate: 0.0
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
  0       0        1197.61    

We can abort the training process at any point if we want to make further changes to improve the performance.
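The `ENTS_P`, `ENTS_R` and `ENTS_F` columns in the training log are entity-level precision, recall and F-score, where a predicted entity counts as correct only if its start, end and label all match the gold annotation. The sketch below computes the same kind of scores by hand. The gold labels here are made up for illustration and are not taken from the annotated data.

```python
def entity_prf(gold, predicted):
    """Precision, recall and F-score over exact (start, end, label) matches."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# "Noah built an ark, and God made a covenant with him at Mount Ararat."
gold = [(0, 4, "PERSON"), (23, 26, "Creator"), (55, 67, "LOC")]
pred = [(0, 4, "PERSON"), (23, 26, "Creator"), (55, 67, "PERSON")]

p, r, f = entity_prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

Here two of the three predictions match the gold spans exactly, so a right span with a wrong label still counts as an error.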

## **Model Evaluation**

In [21]:
# Load the trained spaCy model
nlp = spacy.load('./output/model-best')

In [23]:
text = "Noah built an ark, and God made a covenant with him at Mount Ararat. Euphritus"

# Process the text with the loaded model
doc = nlp(text)

# Print the entities
print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])

Entities: [('Noah', 'PERSON'), ('God', 'Creator'), ('Mount Ararat', 'PERSON'), ('Euphritus', 'PERSON')]


The result shows the entities selected by the model. One thing to note is that '**Euphritus**' is not a **PERSON** but a river ('Mount Ararat', a place, is mislabeled in the same way). This happens because the training data has too few labels that refer to rivers. On the other hand, the Book of Genesis contains many **PERSON** entities, which biases the model toward predicting **PERSON** for most spans.

## **Recommendation**

The more annotated files we have, the better the model generalizes. I recommend using 50 to 200 annotated examples for the fine-tuning process.

## **Save packages**

In [None]:
!pip freeze > requirements.txt

In [None]:
from google.colab import files

files.download('requirements.txt')


# **MANY REGARDS ...**