<h1 style="background-color:DodgerBlue; color:white" >Custom NER using spaCy 3.0+</h1>

Recently, at work, I built a custom NER model with spaCy, a production-grade NLP library.

Building on that experience, this notebook trains a custom transformer-based NER model that detects dataset mentions as entities. This requires spaCy 3.0+.

The whole process is quite straightforward:
1. Build your training dataset by marking the entities in it. spaCy 3.0 expects the data in DocBin format.
    - For our problem, the training labels tell us where the entities are (the **positive examples**).
    - The remaining sentences can serve as **negative examples**, with the entity's start and end indexes both set to 0.
    - **Caution:** In this competition, the train data is not exhaustively labeled, so some of the examples we mark as negative actually contain positives. Ideally, you would increase the class-prior weight of the positive examples we already know.
2. Initialize the spaCy config file (**spacy init** command)
3. Train the spaCy model using the settings in the config file (**spacy train** command)
4. Load the model and use it like any other spaCy pipeline (**spacy.load()** function)
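
To make step 1 concrete, here is a minimal sketch of a single training example in the character-offset format used later in this notebook. The helper name `make_example` is hypothetical, and it matches the label case-insensitively, which is a slight variation on the lowercase search used further down.

```python
import re

def make_example(sentence, label, tag="DATASET"):
    """Return a (text, annotations) pair with character offsets for `label`."""
    match = re.search(re.escape(label), sentence, flags=re.IGNORECASE)
    if match is None:
        # negative example: same convention as this notebook, a (0, 0) span
        return sentence, {"entities": [(0, 0, tag)]}
    # positive example: start and end character offsets of the label
    return sentence, {"entities": [(match.start(), match.end(), tag)]}

text, annot = make_example(
    "Data came from the National Education Longitudinal Study NELS",
    "national education longitudinal study",
)
start, end, tag = annot["entities"][0]
print(text[start:end])  # National Education Longitudinal Study
```

Slicing the text with the stored offsets is a quick way to confirm that an annotation points at the intended words.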


In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import json
import glob
import re
from tqdm import tqdm

**Note:** This notebook uses the internet, so you cannot submit it directly. However, you can take the trained model and use it to make your submissions.

## Install spaCy 3.0+ with Transformers

In [None]:
!pip install -U spacy[transformers]

## Predefined function for preprocessing

For preprocessing, we stick to the given function, which replaces every run of characters other than letters and digits with a single ' '. However, for training our spaCy model, we do not lowercase the text.

In [4]:
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

# does not lowercase the text
def clean_text2(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt))
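
The two functions differ only in lowercasing, so for plain ASCII text they return strings of the same length, and offsets found in one apply to the other. The sketch below checks this on a made-up sentence (the sample text is an assumption). Characters whose lowercase form has a different length could break this property.

```python
import re

def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

def clean_text2(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt))

raw = "Data from the ADNI (Alzheimer's) study, 2018!"  # made-up sample sentence
lowered = clean_text(raw)   # used to search for the cleaned label
cased = clean_text2(raw)    # used as the training text

m = re.search("adni", lowered)         # offsets come from the lowercased text
print(len(lowered) == len(cased))      # True
print(cased[m.start():m.end()])        # ADNI
```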

# Read train csv and create a sample (for faster demo)

In [5]:
df = pd.read_csv("../input/coleridgeinitiative-show-us-the-data/train.csv")

In [6]:
df.head()

|   | Id | pub_title | dataset_title | dataset_label | cleaned_label |
|---|---|---|---|---|---|
| 0 | d0fa7568-7d8e-4db9-870f-f9c6f668c17b | The Impact of Dual Enrollment on College Degre... | National Education Longitudinal Study | National Education Longitudinal Study | national education longitudinal study |
| 1 | 2f26f645-3dec-485d-b68d-f013c9e05e60 | Educational Attainment of High School Dropouts... | National Education Longitudinal Study | National Education Longitudinal Study | national education longitudinal study |
| 2 | c5d5cd2c-59de-4f29-bbb1-6a88c7b52f29 | Differences in Outcomes for Female and Male St... | National Education Longitudinal Study | National Education Longitudinal Study | national education longitudinal study |
| 3 | 5c9a3bc9-41ba-4574-ad71-e25c1442c8af | Stepping Stone and Option Value in a Model of ... | National Education Longitudinal Study | National Education Longitudinal Study | national education longitudinal study |
| 4 | c754dec7-c5a3-4337-9892-c02158475064 | Parental Effort, School Resources, and Student... | National Education Longitudinal Study | National Education Longitudinal Study | national education longitudinal study |


In [7]:
df.shape

(19661, 5)

In [8]:
# number of unique labels
len(df.cleaned_label.unique()) 

130

In [9]:
# create a subset for quick demo
sample = df.sample(500)
sample.shape

(500, 5)

## Create the training dataset by marking entries

In [10]:
# get positive and negative examples for entities
POSITIVE_DATA = []
NEGATIVE_DATA = []
for idx, row in tqdm(sample.iterrows()):
    pub = "../input/coleridgeinitiative-show-us-the-data/train/" + row.Id + ".json"
    with open(pub) as f: # close the file once it has been read
        data = json.load(f)
    paper_text = str([sec['text'] for sec in data]).strip("[").strip("]")
    sentences = paper_text.split(".")
    for sentence in sentences:
        sentence2 = clean_text(sentence) # use given clean_text to find cleaned_label
        cased = clean_text2(sentence) # same text without lowercasing, so offsets found in sentence2 apply to it
        a = re.search(row.cleaned_label, sentence2)
        if a is not None: # if label is found, make it a positive example
            POSITIVE_DATA.append((cased, {"entities": [(a.start(), a.end(), "DATASET")]}))
        elif len(cased) > 20: # if label is not found, keep sentences longer than 20 chars as negative examples
            NEGATIVE_DATA.append((cased, {"entities": [(0, 0, "DATASET")]}))

500it [00:17, 28.35it/s]


In [11]:
POSITIVE_DATA[0:10]

[(' Similarly Ludwig and Miller 2007 using data from the National Education Longitudinal Study NELS reported significant impacts of Head Start on these measures of young adult schooling attainment despite a lack of significant impacts on 8th grade reading and math test scores or grades',
  {'entities': [(63, 91, 'DATASET')]}),
 ('1 years with SMC from the AD Neuroimaging Initiative ADNI were included in the study',
  {'entities': [(53, 57, 'DATASET')]}),
 (' Data used in the preparation of this article were obtained from the AD Neuroimaging Initiative ADNI database 1 during January 2018',
  {'entities': [(96, 100, 'DATASET')]}),
 (' The ADNI was launched in 2003 as a public private partnership led by Principal Investigator Michael W',
  {'entities': [(5, 9, 'DATASET')]}),
 (' The primary goal of ADNI has been to test whether serial MRI PET other biological markers and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD',
  {'entitie

In [12]:
len(POSITIVE_DATA)

1698

In [14]:
len(NEGATIVE_DATA)

185789

## We have an IMBALANCED CLASS problem.
#### For brevity, let's downsample the negative class to 2000 examples

In [18]:
import random
NEG_SAMPLE = random.sample(NEGATIVE_DATA, k=2000) # downsampling negative class (without replacement, so no duplicates)
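
When downsampling, it matters whether the draw is with or without replacement. The sketch below uses a toy list (not the notebook's data) to contrast `random.choices`, which may return the same item more than once, with `random.sample`, which never does. The fixed seed is an assumption added for reproducibility; the notebook does not set one.

```python
import random

random.seed(0)  # assumption: fixed seed so the draw is repeatable
pool = [f"sentence {i}" for i in range(10)]  # toy stand-in for NEGATIVE_DATA

with_replacement = random.choices(pool, k=10)    # items may repeat
without_replacement = random.sample(pool, k=10)  # every item appears at most once

print(len(set(with_replacement)), "unique items drawn with replacement")
print(len(set(without_replacement)), "unique items drawn without replacement")
```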

In [17]:
TRAIN_DATA = POSITIVE_DATA + NEG_SAMPLE # our train data is positive + negative examples
random.shuffle(TRAIN_DATA) # shuffle the train data in place
len(TRAIN_DATA) # total examples in train data

3698

## spaCy 3.0 uses the DocBin format - convert the train set to this format
#### DocBin is the efficient, serializable format used by spaCy 3.0
Use the converter below to change the TRAIN_DATA above into the new format.

In [19]:
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

for text, annot in tqdm(TRAIN_DATA): # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is not None: # skip spans that cannot be aligned to tokens
            ents.append(span)
    doc.ents = ents # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy") # save the docbin object

100%|██████████| 3698/3698 [00:03<00:00, 1107.87it/s]
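
The scores spaCy reports during training are computed on whatever file is passed as `--paths.dev`, so it can help to hold out part of the examples before converting them. Below is a minimal sketch of such a split. The helper name `split_train_dev`, the 20% fraction, and the seed are all assumptions, not part of the original notebook.

```python
import random

def split_train_dev(examples, dev_fraction=0.2, seed=42):
    """Shuffle a copy of `examples` and split it into (train, dev) lists."""
    shuffled = list(examples)              # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # local RNG, does not affect global state
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]

# toy examples in the same (text, annotations) format as TRAIN_DATA
toy = [(f"sentence {i}", {"entities": [(0, 0, "DATASET")]}) for i in range(10)]
train_part, dev_part = split_train_dev(toy)
print(len(train_part), len(dev_part))  # 8 2
```

Each part could then go through the same DocBin conversion and be saved to its own `.spacy` file.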


# Train the spaCy transformer model
https://spacy.io/usage/training#quickstart

In [21]:
# step1: Get baseconfig file from https://spacy.io/usage/training#quickstart
!cp "../input/spacybaseconfigcfg/base_config.cfg" ./

In [22]:
# step2: fill in the base config file.
# The config file contains the training settings.
# `spacy init fill-config` fills in all remaining settings with their default values
!python -m spacy init fill-config base_config.cfg config.cfg

2021-05-15 10:10:54.640173: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [23]:
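# step3: train using spacy train command
# note: --paths.dev points at the training file here, so the scores reported during training are measured on data the model has already seen
!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./train.spacy --gpu-id 0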

2021-05-15 10:11:09.963082: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
✔ Created output directory: output
ℹ Using GPU: 0
[2021-05-15 10:11:13,604] [INFO] Set up nlp object from config
[2021-05-15 10:11:13,614] [INFO] Pipeline: ['transformer', 'ner']
[2021-05-15 10:11:13,620] [INFO] Created vocabulary
[2021-05-15 10:11:13,620] [INFO] Finished initializing nlp object
Downloading: 100%|██████████████████████████████| 481/481 [00:00<00:00, 425kB/s]
Downloading: 100%|███████████████████████████| 899k/899k [00:00<00:00, 2.40MB/s]
Downloading: 100%|███████████████████████████| 456k/456k [00:00<00:00, 1.45MB/s]
Downloading: 100%|█████████████████████████| 1.36M/1.36M [00:00<00:00, 3.49MB/s]
Downloading: 100%|███████████████████████████| 501M/501M [00:11<00:00, 43.6MB/s]
[2021-05-15 10:11:53,903] [INFO] Initialized pipeline components: ['transformer', 'ner']
✔ Initialized pipeline

### Explaining Training Pipeline Variables

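- E is the epoch number
- Loss Transformer is the training loss of the transformer component
- Loss NER is the training loss of the NER component
- ENTS_F is the F-score of the predicted entities on the dev set
- ENTS_P is the precision
- ENTS_R is the recall
- Score is the overall score of the model (used to pick the best model later)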
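
To show how precision, recall, and F-score relate, here is a small sketch that computes them from two sets of entity spans. It assumes an entity counts as correct only when start, end, and label all match exactly. The function name `prf` and the example spans are made up for illustration.

```python
def prf(gold_spans, predicted_spans):
    """Precision, recall and F-score for exact-match entity spans."""
    gold, pred = set(gold_spans), set(predicted_spans)
    tp = len(gold & pred)                        # spans predicted and correct
    precision = tp / len(pred) if pred else 0.0  # share of predictions that are correct
    recall = tp / len(gold) if gold else 0.0     # share of gold spans that were found
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# made-up spans: one of three predictions matches one of two gold spans
gold = {(63, 91, "DATASET"), (5, 9, "DATASET")}
pred = {(63, 91, "DATASET"), (20, 24, "DATASET"), (40, 44, "DATASET")}
p, r, f = prf(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.333 0.5 0.4
```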

# Load the custom NER model and predict.

In [28]:
from thinc.api import set_gpu_allocator, require_gpu
set_gpu_allocator("pytorch")
require_gpu(0)
# Use spacy.load to load your custom model
custom_ner_model = spacy.load("./output/model-best") # output model is stored as "model-best" and "model-last"

In [29]:
test_pubs = glob.glob("../input/coleridgeinitiative-show-us-the-data/test/*.json")

In [31]:
from spacy import displacy

for pub in test_pubs:
    with open(pub) as f: # close the file once it has been read
        data = json.load(f)
    paper_text = str([sec['text'] for sec in data]).strip("[").strip("]")
    sentences = paper_text.split(".")
    for sentence in sentences:
        sentence = clean_text2(sentence)
        doc = custom_ner_model(sentence)
        if len(doc.ents) > 0: # only render sentences that contain a predicted entity
            displacy.render(doc, style="ent", jupyter=True)
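
To use the predictions outside the notebook, the entity texts found in a publication could be cleaned, deduplicated, and joined into one string. The sketch below assumes a pipe-delimited prediction format and a hypothetical helper `to_prediction_string`. In practice the input list could be built from `ent.text` for each `ent` in `doc.ents`.

```python
import re

def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

def to_prediction_string(entity_texts):
    """Clean and deduplicate predicted entity texts, then join them with '|'."""
    cleaned = {clean_text(t).strip() for t in entity_texts}
    cleaned.discard("")  # drop entities that are empty after cleaning
    return "|".join(sorted(cleaned))

# made-up entity texts standing in for model predictions
print(to_prediction_string(["ADNI", "adni", "National Education Longitudinal Study"]))
# adni|national education longitudinal study
```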
        

# References
1. https://spacy.io/usage/training

#### This is my first notebook on Kaggle. Your feedback and suggestions would be appreciated! - Shivam