# Training Custom NER Models

#### Sources

Bochet, Charles, "[Python: How to Train your Own Model with NLTK and Stanford NER Tagger? (for English, French, German...)](https://www.sicara.ai/blog/2018-04-25-python-train-model-NTLK-stanford-ner-tagger)," <i>Sicara</i>, Accessed October 21, 2020.

Christina, "[Named Entity Recognition in Python with Stanford-NER and Spacy](https://lvngd.com/blog/named-entity-recognition-in-python-with-stanford-ner-and-spacy/)," <i>LVNGD</i>, Accessed October 21, 2020.

spaCy, "[Simple training style](https://spacy.io/usage/training#ner)," <i>spaCy</i>, Accessed October 21, 2020.

Stanford NLP Group, "[Stanford NER CRF FAQ](https://nlp.stanford.edu/software/crf-faq.shtml#b)," <i>Stanford NLP</i>, Accessed October 21, 2020.

In [1]:
# Import necessary libraries.
import re, nltk, warnings, glob, csv, sys, os
import pandas as pd
import numpy as np
import seaborn as sns
import xml.etree.ElementTree as ET
from itertools import chain
from nltk import word_tokenize, pos_tag, ne_chunk, Tree
from fuzzywuzzy import fuzz, process

# Suppress all warnings. To silence only deprecation warnings, pass
# DeprecationWarning as the second (category) argument.
warnings.simplefilter("ignore")

# Declare directory location to shorten filepaths later.
abs_dir = "/Users/quinn.wi/Documents/SemanticData/"
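
The base directory above is joined to relative paths by string concatenation, which depends on the trailing slash being present. As a minimal sketch of an alternative (not what this notebook does), `pathlib` can build the same path; `base_dir` and `names_path` are illustrative names introduced here:

```python
from pathlib import Path

# Illustrative only: same base directory as abs_dir, expressed as a Path.
base_dir = Path("/Users/quinn.wi/Documents/SemanticData")

# The "/" operator joins path segments, so no trailing slash is needed.
names_path = base_dir / "Data" / "JQA" / "DJQA_Names-List_singleSheet.xlsx"

print(names_path.name)
```

A `Path` object can be passed directly to `pd.ExcelFile` in place of a string.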

## Import Names List

In [2]:
%%time

# Read in the Excel file and print its sheet names.
excel = pd.ExcelFile(abs_dir + 'Data/JQA/DJQA_Names-List_singleSheet.xlsx')
print(excel.sheet_names)

['Sheet1']
CPU times: user 1min 1s, sys: 249 ms, total: 1min 1s
Wall time: 1min 2s


In [3]:
%%time

# Parse the Excel sheet into a DataFrame.
names = excel.parse(sheet_name = 'Sheet1')

# Subset dataframe by selecting key columns.
names = names[['Last Name', "First Name", 'Middle Name', 'Maiden Name',
               'Variant form of name', 'Short-hand option for name',
               'Hyogebated-unique-string-of-characters']]

# Drop rows only when both the last and first name are "??".
names = names.drop(names[(names['Last Name'] == "??")
                         & (names['First Name'] == "??")].index)


# Delete the ExcelFile object to reduce memory usage.
del excel

names.head()

CPU times: user 259 ms, sys: 7.98 ms, total: 267 ms
Wall time: 267 ms


Unnamed: 0,Last Name,First Name,Middle Name,Maiden Name,Variant form of name,Short-hand option for name,Hyogebated-unique-string-of-characters
22,??,Aaron,,,,,aaron
23,??,Abbas Mirza,,,,,abbasmirza
24,??,Abd al-Rahman,,,,,adbalrahman
25,??,Abdiel,,,,,abdiel
26,??,Abdon,,,,,abdon
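
The output above still contains rows whose last name is "??", because the drop condition requires both the last and first name to be "??". A minimal sketch of that logic on a toy DataFrame follows; the rows are invented for illustration and are not taken from the names list, and `both_unknown` and `kept` are names introduced here:

```python
import pandas as pd

# Invented rows for illustration; not from the actual names list.
toy = pd.DataFrame({
    "Last Name": ["??", "??", "Doe"],
    "First Name": ["??", "Aaron", "Jane"],
})

# True only where BOTH columns hold the "??" placeholder.
both_unknown = (toy["Last Name"] == "??") & (toy["First Name"] == "??")

# Keep every row that is not fully unknown; original index labels are preserved.
kept = toy[~both_unknown]

print(kept)
```

Filtering with a negated boolean mask is equivalent to calling `drop` on the matching index labels, as the notebook does. In both cases the surviving rows keep their original index, which is why the output above starts at 22 rather than 0.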
