# End-to-end process for adding both an entity ruler and word vectors to an NER model
1. Clean the documents and split the corpus into train and test sets
2. Build word vectors
3. Build training data with the entity ruler and split it into train and validation data
4. Add the word vectors to the model and run it

## Notebook 1
- Load the documents
- Segment documents into sentences and strip quotation marks
- Remove extra whitespace, punctuation, and stop words
- Shuffle the corpus
- Extract hold out data for testing
- Save training and testing data

In [1]:
#import files
import glob
import docx2txt

#splitting into sentences
import spacy
from spacy.lang.en import English

#cleaning
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from gensim.parsing.preprocessing import strip_punctuation
from gensim.parsing.preprocessing import strip_multiple_whitespaces
import contractions

#shuffle corpus
import random

#save data
import json

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sarasharick/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/sarasharick/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Load data

In [2]:
#Import files
files = []
for file in glob.glob('./docs/*.docx'):
    files.append(docx2txt.process(file))

### Split into sentences

In [3]:
nlp = English() 
nlp.add_pipe('sentencizer') #other sentence splitters seemed to have trouble with question marks


corpus = []
for file in files:
    file = file.replace('“', '').replace('”', '') #removes quotation marks
    doc = nlp(file)
    for sent in doc.sents:
        corpus.append(sent.text.strip()) #strip leading and trailing whitespace before storing

In [4]:
#review corpus to ensure it came out correctly
print(corpus[1])

It was a nice section of Coruscant, high up, near lots of shopping and entertainment, but also easily accessible from Starfighter Command headquarters.


### Prepare stop words and clean

In [5]:
#prep stop words list with capitals
stops = stopwords.words('english') #the NLTK corpus file is named in lowercase
add_stops = [word.capitalize() for word in stops]
stops.extend(add_stops)
more_stops = ['said', 'would', 'could', 'back', 'oh', 'Oh', 'Well', 'like', 'around', 'time', 'one', 'get',
              'to', 'know', 'us' , 'got', 'Um', 'um', 'Look']
stops.extend(more_stops)

In [6]:
corpus_clean = []
for sentence in corpus:
    sentence = contractions.fix(sentence)  #expand contractions
    sentence = sentence.replace("’s", '').replace("'s", '').replace("‘s", '') #remove possessives
    sentence = sentence.replace("'", '').replace("’", '').replace("‘", '').replace("——", '') #remove remaining apostrophes in all forms and doubled dashes
    sentence = strip_punctuation(sentence) #strip additional punctuation
    sentence = word_tokenize(sentence) 
    sentence = [word for word in sentence if word not in stops]
    sentence = ' '.join(sentence)
    sentence = strip_multiple_whitespaces(sentence) #collapse repeated whitespace, including line breaks
    sentence = sentence.strip() #strip any additional stray whitespace
    corpus_clean.append(sentence)

In [7]:
#review corpus again
print(corpus_clean[1])

nice section Coruscant high near lots shopping entertainment also easily accessible Starfighter Command headquarters
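
A sentence made up entirely of stop words and punctuation comes out of the cleaning loop as an empty string. The notebook keeps those entries. If they are not wanted in the training data, a filter along these lines could be applied before shuffling. This is a minimal sketch: `drop_empty` is a hypothetical helper, and dropping empty sentences is an assumption, not a step from the original workflow.

```python
def drop_empty(sentences):
    """Return only the sentences that still contain at least one token."""
    return [s for s in sentences if s.strip()]

# Small stand-in for corpus_clean; the second and third entries mimic
# sentences that were reduced to nothing by stop word removal.
example = ["nice section Coruscant", "", "   ", "Starfighter Command"]
kept = drop_empty(example)
print(len(example) - len(kept), "empty sentences removed")
```

Applying it to the real data would be `corpus_clean = drop_empty(corpus_clean)`, which would also change the corpus size reported later.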


### Shuffle corpus

In [8]:
# shuffle corpus
random.shuffle(corpus_clean)
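
`random.shuffle` gives a different order on every run, so the hold-out set changes each time the notebook is executed. If a repeatable split is wanted, one option is to shuffle with a seeded generator. This is a sketch only: `seeded_shuffle` and the seed value 42 are hypothetical choices, not part of the original notebook.

```python
import random

def seeded_shuffle(items, seed=42):
    """Return a shuffled copy of items; the same seed gives the same order."""
    shuffled = list(items)
    # A private Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    return shuffled

first = seeded_shuffle(["a", "b", "c", "d", "e"])
second = seeded_shuffle(["a", "b", "c", "d", "e"])
print(first == second)
```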

In [9]:
print(len(corpus_clean))

18399


### Split corpus into training and test data

In [10]:
#extract 10% hold out test data
test_corpus = corpus_clean[-1839:]
print('Unlabeled sentences for testing: ', len(test_corpus))
corpus_clean = corpus_clean[:-1839]
print('Sentences to be labeled for training and validation: ', len(corpus_clean))

Unlabeled sentences for testing:  1839
Sentences to be labeled for training and validation:  16560
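
The split above hard-codes 1839, which is 10% of this particular corpus. If the corpus size changes, the hold-out size could instead be derived from a fraction. A possible sketch, where `split_holdout` and `test_fraction` are hypothetical names introduced for illustration:

```python
def split_holdout(sentences, test_fraction=0.1):
    """Split a shuffled list into (train, test), taking the test items from the end."""
    n_test = int(len(sentences) * test_fraction)  # rounds down
    if n_test == 0:
        # Avoid sentences[:-0], which would return an empty training set.
        return list(sentences), []
    return list(sentences[:-n_test]), list(sentences[-n_test:])

# Stand-in data with the same length as the corpus in this notebook.
train_part, test_part = split_holdout(list(range(18399)))
print(len(train_part), len(test_part))
```

With 18399 sentences this reproduces the sizes shown above, 16560 for training and validation and 1839 for testing.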


### Save training and test data

In [11]:
#Function to save data as json file
def save_data(file, data):
    with open(file, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4)

In [12]:
#save hold out test data
save_data('data/sw_test_ner.json', test_corpus)

In [13]:
#save training data
save_data('data/sw_train_ner.json', corpus_clean)
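
Before moving on to the next notebook, it can be worth confirming that a saved file reads back unchanged. The sketch below repeats `save_data` so that it runs on its own and adds a hypothetical `load_data` counterpart. It writes to a temporary directory rather than to `data/`, and the sample sentences are stand-ins for the real corpus.

```python
import json
import os
import tempfile

def save_data(file, data):
    # Same helper as in the notebook.
    with open(file, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4)

def load_data(file):
    # Hypothetical counterpart to save_data: read a JSON file back into Python.
    with open(file, 'r', encoding='utf-8') as f:
        return json.load(f)

sample = ["nice section Coruscant", "Starfighter Command headquarters"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample_ner.json')
    save_data(path, sample)
    restored = load_data(path)

print(restored == sample)
```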