# ❗ GENERAL INSTRUCTIONS ❗
- You should be running this notebook using Google Colab.
- Make sure to change your runtime type by pressing 'Runtime' -> 'Change runtime type' and use one of the available GPUs instead of using the CPU.
---
- Before uploading, compress the `Google_Colab` folder into a zip file on your local machine.
- In Google Colab, use the file upload feature to upload the zipped folder.
- After uploading, unzip the folder in Colab using the code block below.
- After unzipping, the `Google_Colab` folder should have the following structure:
```
Google_Colab/
│
├── data/
│ ├── training/
│ │ ├── train_data.json
│ │ └── ...
```
---
- Additional relevant information and instructions for training a custom NER (Named Entity Recognition) model will be provided throughout the remaining code blocks.

In [1]:
!unzip -q /content/Google_Colab.zip
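
Before moving on, it can help to confirm that the archive unpacked into the layout described above. The following is a minimal sketch and not part of the original workflow: the helper name `missing_files` is hypothetical, and the two paths it checks are the files this notebook reads in later cells (`train_data.json` and `base_config.cfg`).

```python
from pathlib import Path

# Files that later cells of this notebook read, relative to the unzipped folder.
EXPECTED_FILES = (
    "data/training/train_data.json",
    "data/training/base_config.cfg",
)

def missing_files(root, expected=EXPECTED_FILES):
    """Return the expected files that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]

if __name__ == "__main__":
    missing = missing_files("/content/Google_Colab")
    if missing:
        print("Missing files:", missing)
    else:
        print("Folder structure looks complete.")
```

If the list is not empty, re-check that the zip was created from the `Google_Colab` folder itself rather than from its contents.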

# 💻 CODE

In [2]:
# spacy_transformers provides spaCy components and architectures for using transformer models, such as BERT, with spaCy.
!pip install spacy_transformers
# Upgrade the spacy package to the latest version to ensure compatibility with recent features and improvements in natural language processing.
!pip install -U spacy

Collecting spacy_transformers
  Downloading spacy_transformers-1.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (197 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 197.8/197.8 kB 4.7 MB/s eta 0:00:00
Collecting spacy-alignments<1.0.0,>=0.7.2 (from spacy_transformers)
  Downloading spacy_alignments-0.9.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (313 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 314.0/314.0 kB 9.8 MB/s eta 0:00:00
Installing collected packages: spacy-alignments, spacy_transformers
Successfully installed spacy-alignments-0.9.1 spacy_transformers-1.3.3
Collecting spacy
  Downloading spacy-3.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 18.7 MB/s eta 0:00:00
Collecting weasel<0.4.0,>=0.1.0 (from spacy)
  Downloading weasel-0.3.4-py3-n

In [3]:
# spacy.tokens: the spaCy submodule that holds the container classes (Doc, Token, Span, DocBin). Tokenization itself, the process of splitting text into words, punctuation, and other tokens, is performed by the pipeline's tokenizer.
# DocBin: a spaCy class for efficiently serializing and deserializing collections of Doc objects.
from spacy.tokens import DocBin
# Library that provides a fast, extensible progress bar.
from tqdm import tqdm
import spacy
import json

In [4]:
with open('/content/Google_Colab/data/training/train_data.json', encoding='utf-8') as f:
  cv_data = json.load(f)
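
The conversion function later in this notebook unpacks each record as a `(text, annotation)` pair whose annotation holds an `'entities'` list of `[start, end, label]` triples. A minimal sketch of a sanity check for that shape follows. `find_bad_records` is a hypothetical helper, and the sample records and the `NAME` label are invented for illustration.

```python
def find_bad_records(data):
    """Return (index, reason) pairs for records that do not match the expected shape."""
    problems = []
    for i, record in enumerate(data):
        if not isinstance(record, (list, tuple)) or len(record) != 2:
            problems.append((i, "record is not a (text, annotation) pair"))
            continue
        text, annot = record
        if not isinstance(text, str) or not isinstance(annot, dict):
            problems.append((i, "text must be str and annotation must be dict"))
            continue
        for ent in annot.get("entities", []):
            if len(ent) != 3:
                problems.append((i, f"entity {ent!r} is not [start, end, label]"))
                continue
            start, end, _label = ent
            # Offsets are character positions; end is exclusive.
            if not (0 <= start < end <= len(text)):
                problems.append((i, f"offsets {start}-{end} fall outside the text"))
    return problems

# Invented sample: one valid record and one whose offsets exceed the text length.
sample = [
    ["Jane Doe, Python developer", {"entities": [[0, 8, "NAME"]]}],
    ["Short text", {"entities": [[0, 50, "NAME"]]}],
]
print(find_bad_records(sample))
# → [(1, 'offsets 0-50 fall outside the text')]
```

Running `find_bad_records(cv_data)` before conversion would surface malformed records early instead of leaving them to be dropped silently later.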

In [5]:
# This helps in setting up a spaCy training configuration with appropriate defaults based on given setup and requirements.
# It will read the base_config.cfg, fill in any missing values with defaults, and then save the complete configuration to config.cfg.
!python -m spacy init fill-config /content/Google_Colab/data/training/base_config.cfg config.cfg

2023-12-15 23:27:03.753487: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-15 23:27:03.753538: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-15 23:27:03.755078: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [6]:
# The function processes training data for NER by creating spaCy Doc objects with their respective entity annotations,
# handles potential errors, and stores these Doc objects in a DocBin for efficient batch processing.
def get_spacy_doc(file, data):

  # Create a new spaCy pipeline for English without any NLP components.
  nlp = spacy.blank("en")
  # Initialize a DocBin object for efficient storage of spaCy Doc objects.
  db = DocBin()

  for text, annot in tqdm(data):

    # Create a Doc object from the text. This only tokenizes the text.
    doc = nlp.make_doc(text)

    # Extract the entity annotations for the current text.
    annot = annot['entities']

    # Initialize a list to store the entity spans.
    ents = []

    # Keep track of character indices that already belong to an accepted entity.
    entity_indices = set()

    # start: The starting character index of an entity in the text.
    # end: The ending character index of the entity in the text (exclusive, as char_span expects).
    # label: The label/classification of the entity (e.g., ORG for organization, PERSON for person's name).
    for start, end, label in annot:
      # Skip the current entity if it overlaps with a previously accepted one.
      # spaCy does not allow overlapping spans in doc.ents.
      if any(idx in entity_indices for idx in range(start, end)):
        continue

      # Try to create a character span for the entity.
      # Creating spans is crucial because it maps the annotated labels (like "PERSON", "ORG") to specific segments of text.
      # This mapping is what the NER model learns from.
      try:
        span = doc.char_span(start, end, label=label, alignment_mode='strict')
      except Exception:
        continue

      # With alignment_mode='strict', char_span returns None when the offsets do not line up with token boundaries.
      if span is None:
        # Write error data to the file.
        err_data = str([start, end]) + "   " + str(text) + "\n"
        file.write(err_data)
      else:
        # Reserve the characters only once the span is known to be valid,
        # so a rejected span cannot block a later valid one.
        entity_indices.update(range(start, end))
        ents.append(span)

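    # Try to set the entities for the Doc and add it to the DocBin.
    # If spaCy rejects the entity list, log the reason and skip this Doc instead of failing silently.
    try:
      doc.ents = ents
      db.add(doc)
    except Exception as err:
      file.write("Skipped doc: " + str(err) + "\n")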

  # Return the filled DocBin object.
  return db
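
To see the overlap rule in isolation, the sketch below applies the same first-come-first-kept check to plain `(start, end, label)` tuples, without spaCy. It is an illustration only: `drop_overlaps` is a hypothetical helper, the sample offsets and labels are invented, and unlike the function above it does not check token alignment.

```python
def drop_overlaps(entities):
    """Keep each entity only if none of its characters is already claimed."""
    taken = set()   # character indices covered by entities kept so far
    kept = []
    for start, end, label in entities:
        span = range(start, end)        # end is exclusive
        if any(idx in taken for idx in span):
            continue                    # overlaps an earlier entity: skip it
        taken.update(span)
        kept.append((start, end, label))
    return kept

entities = [(0, 8, "NAME"), (5, 12, "SKILL"), (8, 14, "SKILL")]
print(drop_overlaps(entities))
# → [(0, 8, 'NAME'), (8, 14, 'SKILL')]
```

The second entity is dropped because characters 5 to 7 are already claimed, while the third is kept because it starts exactly where the first one ends. The order of the annotations therefore decides which of two overlapping entities survives.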

In [7]:
from sklearn.model_selection import train_test_split
# Splitting the data into training and test sets makes it possible to evaluate the model on data it was not trained on.
# The model is trained on the training set and evaluated on the test set.
# 70% of the data goes to the training set; the remaining 30% forms the test set.
# Note: the test set is later passed to spaCy as the dev set (--paths.dev), so it is used for evaluation during training.
train, test = train_test_split(cv_data, test_size=0.3)
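
`train_test_split` shuffles the data before splitting, so without a fixed seed the two sets contain different records on every run. The stdlib-only sketch below illustrates the idea of a seeded 70/30 shuffle split. `seeded_split` is a hypothetical helper and is not how scikit-learn implements the split; with scikit-learn itself, the `random_state` argument of `train_test_split` serves the same purpose.

```python
import random

def seeded_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of `records` with a fixed seed and split it in two."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled records form the test part, the rest the training part.
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(200))          # stand-in for 200 annotated records
train_part, test_part = seeded_split(records)
print(len(train_part), len(test_part))
# → 140 60
```

Calling `seeded_split(records)` twice returns identical parts, which makes scores from separate training runs comparable.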

In [8]:
# Processes training and test data, converts them to a format that spaCy can use for training a model,
# and logs any errors that occur during processing to a text file.

# The context manager closes error.txt even if processing raises an exception.
with open('error.txt', 'w') as file:
  db = get_spacy_doc(file, train)
  db.to_disk('train_data.spacy')

  db = get_spacy_doc(file, test)
  db.to_disk('test_data.spacy')

100%|██████████| 140/140 [00:01<00:00, 95.26it/s] 
100%|██████████| 60/60 [00:00<00:00, 90.58it/s] 


In [9]:
# This command starts the training process for a spaCy model using the specified configuration file and data.
# The trained model and other related files will be saved in the ./output directory.
!python -m spacy train ./config.cfg --output ./output --paths.train ./train_data.spacy --paths.dev ./test_data.spacy --gpu-id 0

2023-12-15 23:27:45.115653: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-15 23:27:45.115706: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-15 23:27:45.116885: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
✔ Created output directory: output
ℹ Saving to output directory: output
ℹ Using GPU: 0
config.json: 100% 481/481 [00:00<00:00, 2.52MB/s]
vocab.json: 100% 899k/899k [00:00<00:00, 18.6MB/s]
merges.txt: 100% 456k/456k [00:00<00:00, 33.9MB/s]
tokenizer.json: 100% 1.36M/1.36M [00:00<00:00, 57.4MB/s]
model.safetensors: 100% 4

In [10]:
# Save your trained spaCy model to Google Drive so you can then download it for local use.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [11]:
from datetime import datetime
current_time = datetime.now().strftime("%Y%m%d%H%M%S")
model_name = f"model-best-{current_time}.zip"

# Zip the model.
!zip -r "/content/{model_name}" /content/output/model-best

  adding: content/output/model-best/ (stored 0%)
  adding: content/output/model-best/transformer/ (stored 0%)
  adding: content/output/model-best/transformer/cfg (stored 0%)
  adding: content/output/model-best/transformer/model (deflated 13%)
  adding: content/output/model-best/ner/ (stored 0%)
  adding: content/output/model-best/ner/cfg (deflated 33%)
  adding: content/output/model-best/ner/moves (deflated 73%)
  adding: content/output/model-best/ner/model (deflated 8%)
  adding: content/output/model-best/config.cfg (deflated 61%)
  adding: content/output/model-best/meta.json (deflated 65%)
  adding: content/output/model-best/tokenizer (deflated 81%)
  adding: content/output/model-best/vocab/ (stored 0%)
  adding: content/output/model-best/vocab/vectors (deflated 45%)
  adding: content/output/model-best/vocab/vectors.cfg (stored 0%)
  adding: content/output/model-best/vocab/lookups.bin (stored 0%)
  adding: content/output/model-best/vocab/key2row (stored 0%)
  adding: content/output/m
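
The same kind of archive can be produced from Python rather than with the shell `zip` command. The sketch below uses the standard library's `shutil.make_archive`. The helper name `zip_model` and its default paths are assumptions that mirror the directories used in this notebook. Unlike the `zip -r` call above, it stores the files under a top-level `model-best/` folder instead of `content/output/model-best/`.

```python
import shutil
from datetime import datetime
from pathlib import Path

def zip_model(model_dir="/content/output/model-best", dest_dir="/content"):
    """Zip `model_dir` into `dest_dir` and return the archive path (hypothetical helper)."""
    model_dir = Path(model_dir)
    stamp = datetime.now().strftime("%Y%m%d%H%M%S")
    base_name = Path(dest_dir) / f"{model_dir.name}-{stamp}"
    # root_dir/base_dir make the archive contain a top-level 'model-best/' folder.
    return shutil.make_archive(
        str(base_name),
        "zip",
        root_dir=str(model_dir.parent),
        base_dir=model_dir.name,
    )

# Hypothetical usage in Colab once training has finished:
# archive_path = zip_model()
```

The returned path can then be copied to Google Drive in the same way as the archive created by the shell command.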

In [12]:
# Copy the zipped model to Google Drive.
!cp "/content/{model_name}" /content/drive/MyDrive/

# After downloading the zipped model, you only need the 'model-best' directory inside it: replace the existing local 'model-best' directory with the new one.
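
Once the `model-best` directory is in place locally, it can be loaded with `spacy.load`, which accepts a path to a saved pipeline directory. The following is a minimal sketch, assuming spaCy and spacy-transformers are installed locally in versions compatible with those used for training. `extract_entities`, the path, and the sample text are hypothetical.

```python
from pathlib import Path

def extract_entities(model_dir, text):
    """Load a saved spaCy pipeline and return (entity text, label) pairs.

    Returns None when `model_dir` does not exist, so a wrong path is
    reported up front instead of failing inside spaCy.
    """
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        print(f"Model directory not found: {model_dir}")
        return None
    import spacy  # imported lazily so the path check works without spaCy
    nlp = spacy.load(model_dir)
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

# Hypothetical usage; adjust the path to wherever 'model-best' was unzipped.
print(extract_entities("./model-best", "Jane Doe, Python developer"))
```

For repeated predictions, load the pipeline once and reuse the `nlp` object, since loading a transformer-based model is comparatively slow.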