===========================================

Title: 9.2 Exercises

Author: Chad Wood

Date: 20 Jan 2022

Modified By: Chad Wood

Description: This program demonstrates building a Named Entity Recognition (NER) tagger, including training an NER model, and contrasts the results with spaCy's NER engine.

===========================================

In [1]:
import pandas as pd


df = pd.read_csv(r'data/ner_dataset.csv.gz', compression='gzip',
                 encoding='ISO-8859-1')
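# 'Sentence #' appears to be set only on the first row of each sentence;
# forward-fill copies it down to the sentence's remaining rows.
df = df.ffill()  # Same result as fillna(method='ffill'), which newer pandas deprecates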
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Sentence #  1048575 non-null  object
 1   Word        1048575 non-null  object
 2   POS         1048575 non-null  object
 3   Tag         1048575 non-null  object
dtypes: object(4)
memory usage: 32.0+ MB
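To see what the forward fill does here, the sketch below applies it to a small made-up frame laid out like the dataset, under the assumption that the sentence id is present only on the first token of each sentence. The frame name and its values are illustrative and not taken from the real file.

```python
import pandas as pd

# Toy frame (hypothetical values): the sentence id is given only on the
# first token of each sentence, and is missing on the following tokens.
toy = pd.DataFrame({
    'Sentence #': ['Sentence: 1', None, None, 'Sentence: 2', None],
    'Word': ['Thousands', 'of', 'demonstrators', 'Families', 'gathered'],
})

# Forward fill copies the last seen sentence id down to the rows below it
filled = toy.ffill()
print(filled)
```

After the fill, every row carries the id of the sentence it belongs to, which is what makes grouping by 'Sentence #' possible later on.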


In [6]:
def word_features(sent, i):
    
    # Current word features
    word = sent[i][0] # Instantiates word
    postag = sent[i][1] # Instantiates POS tag
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(), # Returns lowercase word
        'word[-3:]': word[-3:], # Last 3 chars
        'word[-2:]': word[-2:], # Last 2 chars
        'word.isupper()': word.isupper(), # Boolean if word uppercase 
        'word.istitle()': word.istitle(), # Boolean if word title
        'word.isdigit()': word.isdigit(), # Boolean if word is digit
        'postag': postag, # POS tag
        'postag[:2]': postag[:2] # First 2 chars of POS tag
    }
    
    # Previous word features
    if i > 0:
        word1 = sent[i-1][0] # Instantiates prior word
        postag1 = sent[i-1][1] # Instantiates POS tag
        features.update({
            '-1:word.lower()': word1.lower(), # Returns lowercase word
            '-1:word.istitle()': word1.istitle(), # Boolean if word title
            '-1:word.isupper()': word1.isupper(), # Boolean if word uppercase 
            '-1:postag': postag1, # POS tag
            '-1:postag[:2]': postag1[:2] # First 2 chars of POS tag
        })
    else:
        features['BOS'] = True # 'Beginning of Sentence'
    
    # Next word features  
    if i < len(sent)-1:
        word1 = sent[i+1][0] # Instantiates next word
        postag1 = sent[i+1][1] # Instantiates POS tag
        features.update({
            '+1:word.lower()': word1.lower(), # Returns lowercase word
            '+1:word.istitle()': word1.istitle(), # Boolean if word title
            '+1:word.isupper()': word1.isupper(), # Boolean if word uppercase 
            '+1:postag': postag1, # POS tag
            '+1:postag[:2]': postag1[:2] # First 2 chars of POS tag
        })
    else:
        features['EOS'] = True # 'End of Sentence'
        
    return features

In [7]:
# Generates list of word features for each word in sentence
def sent_features(sent):
    return [word_features(sent, i) for i in range(len(sent))]

# Generates list of NER labels for each sentence
def sent_labels(sent):
    return [label for token, postag, label in sent]

In [8]:
# Builds a list of (word, POS, NER tag) tuples for each sentence by zipping the column values
agg_func = lambda s: [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(),
                                                   s['POS'].values.tolist(),
                                                   s['Tag'].values.tolist())]

# Groups data and applies prior function to wrangle data into desired format
grouped_df = df.groupby('Sentence #').apply(agg_func)

In [26]:
# Converts grouped_df to a list of sentences, each a list of (w, p, t) tuples
sentences = [s for s in grouped_df] # Essentially pd.Series.tolist()
sentences[0][:5]

[('Thousands', 'NNS', 'O'),
 ('of', 'IN', 'O'),
 ('demonstrators', 'NNS', 'O'),
 ('have', 'VBP', 'O'),
 ('marched', 'VBN', 'O')]

In [24]:
# Features for the tokens at indices 4-5 of the first sentence.
# BOS/EOS below mark the edges of this two-token slice, not of the full sentence.
sent_features(sentences[0][4:6])

[{'bias': 1.0,
  'word.lower()': 'marched',
  'word[-3:]': 'hed',
  'word[-2:]': 'ed',
  'word.isupper()': False,
  'word.istitle()': False,
  'word.isdigit()': False,
  'postag': 'VBN',
  'postag[:2]': 'VB',
  'BOS': True,
  '+1:word.lower()': 'through',
  '+1:word.istitle()': False,
  '+1:word.isupper()': False,
  '+1:postag': 'IN',
  '+1:postag[:2]': 'IN'},
 {'bias': 1.0,
  'word.lower()': 'through',
  'word[-3:]': 'ugh',
  'word[-2:]': 'gh',
  'word.isupper()': False,
  'word.istitle()': False,
  'word.isdigit()': False,
  'postag': 'IN',
  'postag[:2]': 'IN',
  '-1:word.lower()': 'marched',
  '-1:word.istitle()': False,
  '-1:word.isupper()': False,
  '-1:postag': 'VBN',
  '-1:postag[:2]': 'VB',
  'EOS': True}]

In [29]:
from sklearn.model_selection import train_test_split
import numpy as np

# Stores list data in memory as an array for creating model
X = np.array([sent_features(s) for s in sentences], dtype=object)
y = np.array([sent_labels(s) for s in sentences], dtype=object)

# Splits features and labels into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=42)

X_train.shape, X_test.shape # Outputs size to verify split

((33571,), (14388,))

In [32]:
import sklearn_crfsuite

# Configurations
crf = sklearn_crfsuite.CRF(algorithm='lbfgs', # Training alg. https://en.wikipedia.org/wiki/Limited-memory_BFGS
                           c1=0.1, # Coefficient for Lasso
                           c2=0.1, # Coefficient for Ridge
                           max_iterations=100,
                           all_possible_transitions=True,
                           verbose=True)

# Suppresses the AttributeError described in https://github.com/TeamHG-Memex/sklearn-crfsuite/issues/60
# An alternative fix is downgrading the package
try:
    crf.fit(X_train, y_train)
except AttributeError:
    pass

loading training data to CRFsuite: 100%|███████████████████████████████████████| 33571/33571 [00:12<00:00, 2709.61it/s]



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 129580
Seconds required: 2.663

L-BFGS optimization
c1: 0.100000
c2: 0.100000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=3.92  loss=1180124.79 active=128640 feature_norm=1.00
Iter 2   time=3.95  loss=927293.60 active=127333 feature_norm=4.42
Iter 3   time=2.03  loss=724497.81 active=122222 feature_norm=3.87
Iter 4   time=9.01  loss=393567.11 active=123154 feature_norm=3.24
Iter 5   time=1.72  loss=330653.15 active=125018 feature_norm=4.04
Iter 6   time=1.91  loss=242580.46 active=118725 feature_norm=6.18
Iter 7   time=1.78  loss=205070.81 active=111791 feature_norm=8.00
Iter 8   time=1.79  loss=183148.35 active=107309 feature_norm=8.86
Iter 9   time=1.81  loss=166599.67 active=102936 feature_norm

In [66]:
from sklearn_crfsuite import metrics as crf_metrics

# Lists all labels and removes 'O' (tokens outside any named entity)
labels = list(crf.classes_)
labels.remove('O')

y_pred = crf.predict(X_test)
crf_metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=labels)

0.8542474536519857
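The reason for dropping 'O' before scoring can be illustrated with a small made-up example: since most tokens are 'O', a tagger that never finds an entity still gets a high weighted F1 unless 'O' is excluded. The label sequences below are hypothetical, and the sketch calls scikit-learn's f1_score directly on already-flattened lists.

```python
from sklearn.metrics import f1_score

# Made-up flat label sequences: the "model" predicts 'O' for every token
y_true_flat = ['O'] * 8 + ['B-per', 'I-per']
y_pred_flat = ['O'] * 10

entity_labels = ['B-per', 'I-per']  # every label except 'O'

# Weighted F1 over all labels, dominated by the frequent 'O' class
f1_all = f1_score(y_true_flat, y_pred_flat, average='weighted',
                  zero_division=0)

# Weighted F1 restricted to the entity labels only
f1_entities = f1_score(y_true_flat, y_pred_flat, average='weighted',
                       labels=entity_labels, zero_division=0)

print(f1_all, f1_entities)
```

The score restricted to entity labels is zero here, while the score over all labels is not, even though no entity was found.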

In [None]:
'''
The following call raises a TypeError that I haven't found a fix for, aside from
perhaps downgrading the installed package (untested). I don't see many duplicate
reports online, and I'm considering submitting this to the appropriate GitHub
repository as an issue.

The error complains that three positional arguments were given, and it still occurs
when the arguments are reduced to only 'y_test, y_pred'. That suggests the extra
positional argument is passed inside sklearn_crfsuite itself, not by this call. I
attempted to work around this with sklearn's metrics.classification_report, only to
find that the data type used here is no longer accepted due to a deprecation.
'''

crf_metrics.flat_classification_report(y_test, y_pred, labels=labels, digits=3)
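One possible workaround, untested against this notebook's environment, is to flatten the per-sentence label lists by hand and call scikit-learn's classification_report with labels passed as a keyword. The helper name flat_report and the toy data are hypothetical; in the notebook the inputs would be y_test, y_pred, and the labels list.

```python
from itertools import chain

from sklearn.metrics import classification_report


def flat_report(y_true, y_pred, labels, digits=3):
    """Flatten per-sentence label lists and build a per-label report."""
    flat_true = list(chain.from_iterable(y_true))
    flat_pred = list(chain.from_iterable(y_pred))
    # labels and digits are passed by keyword, never positionally
    return classification_report(flat_true, flat_pred, labels=labels,
                                 digits=digits, zero_division=0)


# Made-up example: two short sentences with their true and predicted tags
toy_true = [['B-per', 'I-per', 'O'], ['B-geo', 'O']]
toy_pred = [['B-per', 'I-per', 'O'], ['O', 'O']]

report = flat_report(toy_true, toy_pred, labels=['B-geo', 'B-per', 'I-per'])
print(report)
```

Flattening first also sidesteps the nested-list input that scikit-learn's report does not accept directly.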

1. Run the following sentence through your tagger: “Fourteen days ago, Emperor Palpatine left San Diego, CA for Tatooine to follow Luke Skywalker.” Report on the tags applied to the sentence.

In [84]:
import nltk

text = 'Fourteen days ago, Emperor Palpatine left San Diego, CA for Tatooine to follow Luke Skywalker.'

# Retrieves text POS
text_tokens = nltk.word_tokenize(text)
text_pos = nltk.pos_tag(text_tokens)

# Retrieves features
features = [sent_features(text_pos)]

# Predicts NER labels (named so the `labels` list used for scoring is not overwritten)
predictions = crf.predict(features)
text_labels = predictions[0]

# Formats report
text_ner_df = pd.DataFrame([[token, tag] for token, tag in zip(text_tokens, text_labels)], columns=['Text', 'NER'])
text_ner_df

        Text    NER
0   Fourteen  B-per
1       days      O
2        ago      O
3          ,      O
4    Emperor  B-per
5  Palpatine  I-per
6       left      O
7        San  B-geo
8      Diego  I-geo
9          ,      O


2. Run the same sentence through spaCy’s NER engine.
3. Compare and contrast the results – you can do this in your Jupyter Notebook or as a comment in your .py file.

In [86]:
import spacy

nlp = spacy.load('en_core_web_sm')
nlp_text = nlp(text)

spacy_ner = pd.DataFrame([(word.text, word.ent_type_) for word in nlp_text], columns=['Text', 'SpaCy'])
spacy_ner['Homebrew'] = text_ner_df['NER']

spacy_ner

        Text  SpaCy  Homebrew
0   Fourteen   DATE     B-per
1       days   DATE         O
2        ago   DATE         O
3          ,                O
4    Emperor            B-per
5  Palpatine    GPE     I-per
6       left                O
7        San    GPE     B-geo
8      Diego    GPE     I-geo
9          ,                O


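spaCy's NER engine seems better suited to this text, mostly because it accurately recognizes more of the entities than the CRF tagger built here. In the rows shown, for example, spaCy tags "Fourteen days ago" as a DATE, while the CRF tagger labels "Fourteen" as B-per. On the other hand, spaCy tags "Palpatine" as GPE, whereas the CRF tagger labels "Emperor Palpatine" as a person. The version built here could likely be improved by tuning the model (for example, the c1 and c2 regularization coefficients) and perhaps by adjusting the data used for training.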
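Because spaCy reports entity types per token without B-/I- prefixes, the two taggers are easier to compare at the entity level. The sketch below groups the CRF tagger's BIO tags into (text, type) spans. The function name bio_to_spans is hypothetical, and the tokens and tags are the first ten rows of the output above.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition('-')
        if prefix == 'B' or (prefix == 'I' and ent_type != current_type):
            # A new entity starts: close any open span first
            if current:
                spans.append((' '.join(current), current_type))
            current, current_type = [token], ent_type
        elif prefix == 'I':
            # Continuation of the open entity
            current.append(token)
        else:
            # 'O' tag: close any open span
            if current:
                spans.append((' '.join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((' '.join(current), current_type))
    return spans


tokens = ['Fourteen', 'days', 'ago', ',', 'Emperor', 'Palpatine',
          'left', 'San', 'Diego', ',']
tags = ['B-per', 'O', 'O', 'O', 'B-per', 'I-per', 'O', 'B-geo', 'I-geo', 'O']

spans = bio_to_spans(tokens, tags)
print(spans)
# [('Fourteen', 'per'), ('Emperor Palpatine', 'per'), ('San Diego', 'geo')]
```

The resulting spans can then be set side by side with the entities in spaCy's doc.ents.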