# Build Custom Stanford NER Model

#### Sources


Bochet, Charles, “[Python: How to Train your Own Model with NLTK and Stanford NER Tagger?](https://www.sicara.ai/blog/2018-04-25-python-train-model-NTLK-stanford-ner-tagger),” <i>Sicara</i>, Accessed 10/16/2020.

DataTurks, “[Stanford CoreNLP: Training your own custom NER tagger](https://medium.com/swlh/stanford-corenlp-training-your-own-custom-ner-tagger-8119cc7dfc06),” <i>Medium</i>, Accessed 10/26/2020.

Christina, “[Named Entity Recognition in Python with Stanford-NER and Spacy](https://lvngd.com/blog/named-entity-recognition-in-python-with-stanford-ner-and-spacy/),” <i>LVNG</i>, Accessed 10/26/2020.

In [29]:
# Import necessary libraries.
import re, glob, random, csv, sys, os, warnings
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET

# Ignore warnings related to deprecated functions.
warnings.simplefilter("ignore")

# Declare directory location to shorten filepaths later.
abs_dir = "/Users/quinn.wi/Documents/SemanticData/"

## Parse XML

In [51]:
%%time

# Declare regex to simplify file paths below
regex = re.compile(r'.*/(.*)\.xml')

# Get plain text of every element (designated by first argument).
def get_textContent(ancestor, xpath_as_string, namespace):
    text_list = []
    for elem in ancestor.findall(xpath_as_string, namespace):
        text = ''.join(ET.tostring(elem, encoding='unicode', method='text'))

        # Add text (cleaned of additional whitespace) to text_list.
        text_list.append(re.sub(r'\s+', ' ', text))

    # Return concatenated text list.
    return ' '.join(text_list)


# Choose either all .xml files or the training set by setting dataset = 'all' or 'training'.
# The selection determines which XML elements are parsed.

# dataset = 'all'
dataset = 'training'

# Conditionally choose directory and create dataframe.
if dataset == 'all':
    # Gather all .xml files using glob.
    list_of_files = glob.glob(abs_dir + "Data/JQA/*/*.xml")
    
    # Create dataframe to store results.
    data = pd.DataFrame(columns = ['file', 'entry', 'text',
                                       'element', 'refKey', 'entity'])

elif dataset == "training":
    # Or, use training document(s) alone.
    list_of_files = glob.glob(abs_dir + "Data/TestEncoding/TrainingData/*.xml")
    
    # Create dataframe to store results.
    data = pd.DataFrame(columns = ['file', 'entry', 'text',
                                       'element', 'entity'])

else:
    # Stop here: list_of_files and data are undefined for any other value.
    raise ValueError('Dataset not found.')

    
# Loop through each file within a directory.
for file in list_of_files:
    tree = ET.parse(file)
    root = tree.getroot()
    namespace = re.match(r"{(.*)}", str(root.tag))
    ns = {"ns":namespace.group(1)}
    reFile = str(regex.match(file).group(1))
    
    for eachDoc in root.findall('.//ns:div/[@type="entry"]', ns):
        entry = eachDoc.get('{http://www.w3.org/XML/1998/namespace}id')
        text = get_textContent(eachDoc, './ns:div/[@type="docbody"]/ns:p', ns)
        
        if dataset == 'all':
            for elem in eachDoc.findall('.//ns:p/ns:persRef/[@ref]', ns):
                name = elem.text
                try:
                    entity = re.sub(r'\s+', ' ', name)
                except TypeError:
                    entity = name

                data = data.append({'file':reFile,
                                'entry':entry,
                                'text':text,
                                'element':re.sub(r'.*}(.*)', '\\1', elem.tag),
                                'refKey':elem.get('ref'),
                                'entity':entity},
                               ignore_index = True)

        elif dataset == 'training':
            for xpath in ['.//ns:p//ns:persName', './/ns:p//ns:placeName']:
                for elem in eachDoc.findall(xpath, ns):
                    name = elem.text
                    try:
                        entity = re.sub(r'\s+', ' ', name)
                    except TypeError:
                        entity = name

                    data = data.append({'file':reFile,
                                    'entry':entry,
                                    'text':text,
                                    'element':re.sub(r'.*}(.*)', '\\1', elem.tag),
                                    'entity':entity},
                                   ignore_index = True)

            
        else:
            print ('Selected dataset not found.')
        

        
# Create a dictionary to change element tags to NER labels.
element_ner_dictionary = {'persName':'PERSON', 'placeName':'LOC'}

# Change elements to NER labels.
labels_for_sentences = (data['element'].map(element_ner_dictionary))

# Attach labels as a column.
data['label'] = labels_for_sentences
                
data.head()

CPU times: user 258 ms, sys: 3.68 ms, total: 261 ms
Wall time: 263 ms


Unnamed: 0,file,entry,text,element,entity,label
0,TrainCopy_JQADiaries-v23-1821-05-p359,jqadiaries-v23-1821-05-01,1 V:15. Tuesday. W. A. Schoolfield at the Offi...,persName,W. A. Schoolfield,PERSON
1,TrainCopy_JQADiaries-v23-1821-05-p359,jqadiaries-v23-1821-05-02,2. VI: Mrs. Adams unwell. Despatches to A. Gal...,persName,Adams,PERSON
2,TrainCopy_JQADiaries-v23-1821-05-p359,jqadiaries-v23-1821-05-02,2. VI: Mrs. Adams unwell. Despatches to A. Gal...,persName,A. Gallatin,PERSON
3,TrainCopy_JQADiaries-v23-1821-05-p359,jqadiaries-v23-1821-05-02,2. VI: Mrs. Adams unwell. Despatches to A. Gal...,persName,La Forêt,PERSON
4,TrainCopy_JQADiaries-v23-1821-05-p359,jqadiaries-v23-1821-05-02,2. VI: Mrs. Adams unwell. Despatches to A. Gal...,persName,Canning,PERSON
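
The parsing step above can be sketched on a tiny inline document. This is a minimal illustration, not the notebook's exact code: `SAMPLE_XML`, `extract_entities`, and the entry id are invented for the example. It also collects rows in a plain list, because `DataFrame.append` was removed in pandas 2.0; on newer pandas versions the list can be passed once to `pd.DataFrame(rows)`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature TEI document, invented for illustration only.
SAMPLE_XML = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="entry" xml:id="sample-entry-01">
      <div type="docbody">
        <p>Dined with <persName>A. Gallatin</persName> in
           <placeName>Paris</placeName>.</p>
      </div>
    </div>
  </body></text>
</TEI>"""

ELEMENT_TO_LABEL = {'persName': 'PERSON', 'placeName': 'LOC'}
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'


def extract_entities(xml_string):
    """Return one dict per tagged name found inside each entry."""
    root = ET.fromstring(xml_string)
    # The root tag looks like '{namespace-uri}TEI'; pull out the URI.
    ns = {'ns': re.match(r'{(.*)}', root.tag).group(1)}
    rows = []
    for entry in root.findall('.//ns:div[@type="entry"]', ns):
        for tag, label in ELEMENT_TO_LABEL.items():
            for elem in entry.findall('.//ns:p//ns:' + tag, ns):
                # elem.text can be None, so fall back to an empty string.
                entity = re.sub(r'\s+', ' ', elem.text or '').strip()
                rows.append({'entry': entry.get(XML_ID),
                             'element': tag,
                             'entity': entity,
                             'label': label})
    return rows


rows = extract_entities(SAMPLE_XML)
for row in rows:
    print(row)
```

Building the rows first and creating the dataframe once also avoids copying the whole table on every iteration.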


## Tokenize Entries and Shape Data for Custom Model

The training corpus for the custom model should be a tab-separated file (.tsv) with a token column and an NER label column (no header).

|Token (no header in file) | NER label (no header in file)|
|----------|----------|
|En |O|
|2017 |DATE|
|, |O|
|Une |O|
|intelligence |O|
|artificielle |O|
|est |O|
|en |O|
|mesure |O|
|de |O|
|développer |O|
|par |O|
|elle-même |O|
|Super |PERSON|
|Mario |PERSON|
|Bros |PERSON|
|. |O |

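Multi-token entities, including initials and titles, are split so that each token gets its own row: 'W. A. Schoolfield' becomes three rows (`W.`, `A.`, and `Schoolfield`), each labeled `PERSON`. The example file provided by Stanford is [here](https://nlp.stanford.edu/software/crf-faq.shtml#b).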
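
As a rough sketch of that target shape, the function below labels whitespace-separated tokens and prints them as tab-separated rows. The name `label_tokens` and the sample sentence are assumptions made for illustration, and the matching is deliberately naive: an entity followed directly by punctuation (for example `Schoolfield,`) would not match.

```python
def label_tokens(sentence, entities, background='O'):
    """Pair each whitespace token with an NER label.

    `entities` maps an entity string to its label; every token that is
    not part of a matched entity receives the background label.
    """
    tokens = sentence.split()
    labels = [background] * len(tokens)
    for entity, label in entities.items():
        entity_tokens = entity.split()
        width = len(entity_tokens)
        # Slide a window over the sentence and label exact token matches.
        for start in range(len(tokens) - width + 1):
            if tokens[start:start + width] == entity_tokens:
                labels[start:start + width] = [label] * width
    return list(zip(tokens, labels))


rows = label_tokens('Tuesday. W. A. Schoolfield at the Office.',
                    {'W. A. Schoolfield': 'PERSON'})

# One token and one label per line, separated by a tab, with no header.
tsv_text = '\n'.join(f'{token}\t{label}' for token, label in rows)
print(tsv_text)
```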

In [52]:
%%time

# Work on a copy so the parsed dataframe is left unchanged.
tokens = data.copy()

# Function to link entity-tree (e.g., first name & last name) with underscores.
def link_entity_trees(text_column, entity_column):
    sentence = text_column
    # Skip missing entities and replace literally, so that characters
    # such as '.' in 'W. A.' are not treated as regex wildcards.
    if isinstance(entity_column, str) and entity_column in sentence:
        re_entity = re.sub(r'\s', '_', entity_column)
        sentence = sentence.replace(entity_column, re_entity)
    return sentence

# Apply link_entity_trees() to text column & tokenize text.
tokens['text'] = tokens \
    .apply(lambda row: link_entity_trees(row['text'], row['entity']), axis = 1) \
    .str.split(' ')

# Unnest text column.
tokens = tokens.explode('text')

# Replace underscores with whitespace to match 'entity' column.
tokens['text'] = tokens['text'].str.replace('_', ' ')

# Define function to properly label PERSONs and LOCs.
# Non-entity tokens get the letter 'O' (not the digit zero), matching the example above.
def correct_entity_label(text_column, entity_column, label_column):
    if text_column != entity_column:
        label = 'O'
    else:
        label = label_column
    return label

tokens['label'] = tokens \
    .apply(lambda row: correct_entity_label(row['text'], row['entity'], row['label']),
           axis = 1)

# Further tokenize to match example data.
tokens['text'] = tokens['text'].str.split(' ')
tokens = tokens.explode('text')

# Remove whitespace and rows with only whitespace.
# regex = True is explicit because newer pandas versions treat the pattern literally by default.
tokens['text'] = tokens['text'].str.replace(r'\s+', '', regex = True) \
    .replace('', np.nan)

tokens = tokens.dropna()

# Subset dataframe by columns
tokens = tokens[['text', 'label']]

tokens.head()

CPU times: user 84.4 ms, sys: 2.44 ms, total: 86.8 ms
Wall time: 85.7 ms


Unnamed: 0,text,label
0,1,O
0,V:15.,O
0,Tuesday.,O
0,W.,PERSON
0,A.,PERSON


## Save Corpus and Model Parameters

In [59]:
%%time

# Create variable for file path.
PATH_TO_STANFORD_FOLDER = "/Users/quinn.wi/stanfordNLP/stanford-ner-4.0.0/custom-models/"

# Write training corpus to folder.
tokens.to_csv(PATH_TO_STANFORD_FOLDER + 'jqa-ner-corpus.tsv',
              sep = '\t', index = False, header = False)


# Save model parameters/properties as txt.
params = """trainFile = custom-models/jqa-ner-corpus.tsv
serializeTo = jqa-ner-model.ser.gz
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true"""

with open(PATH_TO_STANFORD_FOLDER + "jqa-prop.txt", "w") as writeFile:
    writeFile.write(params)

CPU times: user 3.92 ms, sys: 1.87 ms, total: 5.79 ms
Wall time: 5.85 ms
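
Before training, it can help to confirm that the saved corpus really has two tab-separated columns and to see which labels it contains. The helper below is a sketch under that assumption; `summarize_corpus` is a hypothetical name, and the in-memory sample stands in for the real file.

```python
import io
from collections import Counter


def summarize_corpus(handle):
    """Count labels in a token<TAB>label corpus, failing on malformed rows."""
    counts = Counter()
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip('\n')
        if not line.strip():
            continue  # skip blank lines
        fields = line.split('\t')
        if len(fields) != 2 or not all(fields):
            raise ValueError(f'line {line_number}: expected token<TAB>label, got {line!r}')
        counts[fields[1]] += 1
    return counts


# In-memory stand-in for the real corpus file.
sample = io.StringIO('W.\tPERSON\nA.\tPERSON\nSchoolfield\tPERSON\nat\tO\n')
label_counts = summarize_corpus(sample)
print(label_counts)
```

For the real corpus, the same function can be given `open(PATH_TO_STANFORD_FOLDER + 'jqa-ner-corpus.tsv', encoding='utf8')`. An unexpected label, such as the digit `0` in place of the letter `O`, would show up as its own key in the counts.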


## Train Custom Model

To train the custom model, run the following commands in Terminal:

```shell
# Run from the Home directory.
cd stanfordNLP/stanford-ner-4.0.0

java -cp "stanford-ner.jar:lib/*" -mx4g edu.stanford.nlp.ie.crf.CRFClassifier -prop custom-models/jqa-prop.txt
```

File paths must be correct (the relative `trainFile` path is resolved against the directory in which the command is run): https://stackoverflow.com/questions/59681871/customized-stanfordner
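
The same command could also be launched from the notebook. The sketch below only builds the argument list and defines a helper that is not called here; `build_train_command`, `train`, and the `stanford_ner_dir` argument are hypothetical names. It uses `os.pathsep` so the classpath separator is `:` on macOS/Linux and `;` on Windows, and relies on Java (not the shell) expanding the `lib/*` wildcard.

```python
import os
import subprocess


def build_train_command(prop_file, jar='stanford-ner.jar',
                        lib_glob='lib/*', max_heap='4g'):
    """Return the java invocation as a list of arguments."""
    classpath = os.pathsep.join([jar, lib_glob])
    return ['java', '-cp', classpath, '-mx' + max_heap,
            'edu.stanford.nlp.ie.crf.CRFClassifier', '-prop', prop_file]


def train(stanford_ner_dir, prop_file):
    """Run training from inside the Stanford NER folder (requires Java)."""
    # cwd matters: relative paths in the properties file resolve against it.
    subprocess.run(build_train_command(prop_file), cwd=stanford_ner_dir, check=True)


command = build_train_command('custom-models/jqa-prop.txt')
print(' '.join(command))
```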

To call the custom model in Python (NLTK):

```python
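from nltk.tag import StanfordNERTagger

jar = './stanford-ner-4.0.0/stanford-ner.jar'
# Must match the serializeTo value in the properties file (.ser.gz).
model = './stanford-ner-4.0.0/jqa-ner-model.ser.gz'

ner_tagger = StanfordNERTagger(model, jar, encoding='utf8')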
```
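
The tagger's `tag()` method returns a list of `(token, label)` tuples, one per token. To read multi-token names back as single entities, consecutive tokens with the same label can be merged. The following is a sketch with a hand-written sample in place of real tagger output; `group_entities` is a hypothetical helper, and `background` should be whatever label the training corpus used for non-entity tokens.

```python
from itertools import groupby


def group_entities(tagged_tokens, background='O'):
    """Merge runs of identically labeled tokens into (entity, label) pairs."""
    entities = []
    for label, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != background:
            entities.append((' '.join(token for token, _ in group), label))
    return entities


# Hand-written stand-in for ner_tagger.tag(sentence.split()).
sample_output = [('W.', 'PERSON'), ('A.', 'PERSON'), ('Schoolfield', 'PERSON'),
                 ('at', 'O'), ('the', 'O'), ('Office', 'O'),
                 ('in', 'O'), ('Washington', 'LOC')]

entities = group_entities(sample_output)
print(entities)
```

One limitation of this approach: two different people mentioned back to back, with no untagged token between them, would be merged into a single entity.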
