<a href="https://colab.research.google.com/github/Satwikram/NLP-Implementations/blob/main/NER/Custom%20NER%20Training%20using%20spaCy%20v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Author: Satwik Ram K

### Setup

In [None]:
!pip install spacy-transformers

### Imports

In [2]:
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm
import os
from spacy.cli.train import train

### Training Data

In [3]:
TRAIN_DATA = [
              ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
              ("I reached Chennai yesterday.", {"entities": [(19, 28, "GPE")]}),
              ("I recently ordered a book from Amazon", {"entities": [(24,32, "ORG")]}),
              ("I was driving a BMW", {"entities": [(16,19, "PRODUCT")]}),
              ("I ordered this from ShopClues", {"entities": [(20,29, "ORG")]}),
              ("Fridge can be ordered in Amazon ", {"entities": [(0,6, "PRODUCT")]}),
              ("I bought a new Washer", {"entities": [(16,22, "PRODUCT")]}),
              ("I bought a old table", {"entities": [(16,21, "PRODUCT")]}),
              ("I bought a fancy dress", {"entities": [(18,23, "PRODUCT")]}),
              ("I rented a camera", {"entities": [(12,18, "PRODUCT")]}),
              ("I rented a tent for our trip", {"entities": [(12,16, "PRODUCT")]}),
              ("I rented a screwdriver from our neighbour", {"entities": [(12,22, "PRODUCT")]})
]

### Test Data

In [4]:
TEST_DATA = [
              ("I repaired my computer", {"entities": [(15,23, "PRODUCT")]}),
              ("I got my clock fixed", {"entities": [(16,21, "PRODUCT")]}),
              ("I got my truck fixed", {"entities": [(16,21, "PRODUCT")]}),
              ("Flipkart started it's journey from zero", {"entities": [(0,8, "ORG")]}),
              ("I recently ordered from Max", {"entities": [(24,27, "ORG")]}),
              ("Flipkart is recognized as leader in market",{"entities": [(0,8, "ORG")]}),
              ("I recently ordered from Swiggy", {"entities": [(24,29, "ORG")]})
]

### Initialize the blank en model

In [5]:
nlp = spacy.blank('en')

### Convert Data into spaCy format

In [6]:
class Convert:

    def __init__(self, nlp) -> None:
        self.nlp = nlp

    def to_spacy(self, data, split):
        
        db = DocBin()
        for text, annot in data:
            doc = self.nlp.make_doc(text)
            ents = []
            for start, end, label in annot["entities"]:
                span = doc.char_span(start, end, label=label, alignment_mode="contract")
                if span is None:
                    print("Skipping", label)
                else:
                    ents.append(span)
            try:
                doc.ents = ents 
                db.add(doc)

            except Exception as e:
                print(f"Error:{e} {text}, {annot}")
        
        os.makedirs("./data", exist_ok=True)
        db.to_disk(f"data/{split}.spacy")

In [None]:
convert = Convert(nlp)

convert.to_spacy(TRAIN_DATA, "train")
convert.to_spacy(TEST_DATA, "test")

### Config

In [8]:
!python -m spacy init config --help

Usage: python -m spacy init config 
           [OPTIONS] OUTPUT_FILE

  Generate a starter config file for training.
  Based on your requirements specified via the
  CLI arguments, this command generates a config
  with the optimal settings for your use case.
  This includes the choice of architecture,
  pretrained weights and related
  hyperparameters.

  DOCS: https://spacy.io/api/cli#init-config

Arguments:
  OUTPUT_FILE  File to save the config to or - for
               stdout (will only output config and
               no additional logging info)
               [required]


Options:
  -l, --lang TEXT                 Two-letter code
                                  of the language
                                  to use
                                  [default: en]

  -p, --pipeline TEXT             Comma-separated
                                  names of
                                  trainable
                                  pipeline
                                  

In [9]:
!python -m spacy init config config.cfg --lang en --pipeline ner --optimize efficiency --gpu

[38;5;4mℹ Generated config template specific for your use case[0m
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: GPU
- Transformer: roberta-base
[38;5;2m✔ Auto-filled config with all values[0m
[38;5;2m✔ Saved config[0m
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
train("config.cfg", output_path="./models", overrides={"paths.train": "data/train.spacy", "paths.dev": "data/test.spacy"},
        use_gpu=0)

[38;5;4mℹ Saving to output directory: models[0m
[38;5;4mℹ Using GPU: 0[0m
[1m


Downloading:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/478M [00:00<?, ?B/s]

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[38;5;2m✔ Initialized pipeline[0m
[1m
[38;5;4mℹ Pipeline: ['transformer', 'ner'][0m
[38;5;4mℹ Initial learn rate: 0.0[0m
E    #       LOSS TRANS...  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  -------------  --------  ------  ------  ------  ------
  0       0          75.54     43.87    0.00    0.00    0.00    0.00
200     200        1597.07   1622.85   66.67   66.67   66.67    0.67


In [None]:
train()

In [None]:
import shutil
shutil.rmtree("models")