# **Title: 9.2 Exercise**
# **Author: Michael J. Montana**
# **Date: 14 May 2023**
# **Modified By: N/A**
# **Description: Creating a custom Named Entity Recognition (NER) model from text and comparing its output with spaCy's built-in NER model**

In [148]:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import sklearn_crfsuite
import nltk
from sklearn_crfsuite import metrics as crf_metrics
import re
import string
import spacy
from spacy import displacy

# <font color=46ab18>**Using the Kaggle NER corpus (ner_dataset.csv), which you can also find in our GitHub, create a NER tagger using Scikit-learn, which implies creating the NER model.**

### I highly encourage you to look at the Author's Notebook for Chapter 8. In the text, this all starts on p. 545 and note the Author's GitHub is a little different than what's in the text. Note that building this model is going to take some time so plan accordingly. For example, the fit() alone was 3 minutes (not too bad, but it could take much longer on your machine).

### There's also a package installed by the author in his Notebook (sklearn-crfsuite). He installs it in-line in the Notebook, which may not work with Visual Studio Code. But you can just install it at a terminal.

In [149]:
#importing data and filling nulls
df = pd.read_csv('data/ner_dataset.csv.gz', compression='gzip', encoding='ISO-8859-1')
df = df.ffill() # forward fill; same result as fillna(method='ffill'), which newer pandas versions deprecate

In [150]:
def word2features(sent, i):
    #current word
    word = sent[i][0] #instantiates word
    postag = sent[i][1] #instantiates pos
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(), #lowercase
        'word[-3:]': word[-3:], #last 3 characters
        'word[-2:]': word[-2:], #last 2 characters
        'word.isupper()': word.isupper(), #uppercase
        'word.istitle()': word.istitle(), #title
        'word.isdigit()': word.isdigit(), #digit
        'postag': postag, # part of speech tag
        'postag[:2]': postag[:2]} #first two characters of POS tag

    if i > 0:# previous word
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2]})
    else:
        features['BOS'] = True #BOS = beginning of sentence

    if i < len(sent)-1:#next word
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2]})
    else:
        features['EOS'] = True #EOS = end of sentence

    return features

# Generates list of word features for each word in sentence
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

# Returns the label of each token in a sentence
def sent2labels(sent):
    return [label for token, postag, label in sent]
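
To make the feature template concrete, here is a minimal sketch that applies a trimmed-down version of it to a two-token sentence. The function name `token_features`, the toy sentence, and its hand-assigned POS tags are assumptions made for illustration; the notebook's own `word2features` produces more keys than this.

```python
# Trimmed-down restatement of the feature template, for illustration only.
def token_features(sent, i):
    """Build a feature dict for token i of a [(word, pos), ...] sentence."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word.istitle()': word.istitle(),
        'postag': postag,
    }
    if i > 0:
        # Context from the previous token
        features['-1:word.lower()'] = sent[i - 1][0].lower()
    else:
        features['BOS'] = True  # first token of the sentence
    if i < len(sent) - 1:
        # Context from the next token
        features['+1:word.lower()'] = sent[i + 1][0].lower()
    else:
        features['EOS'] = True  # last token of the sentence
    return features

# Hypothetical two-token sentence with hand-assigned POS tags.
toy_sent = [('Luke', 'NNP'), ('left', 'VBD')]
first = token_features(toy_sent, 0)
last = token_features(toy_sent, 1)
print(first)
print(last)
```

The first token gets a `BOS` flag and next-word context only, while the last token gets `EOS` and previous-word context only, which is how the CRF learns about sentence boundaries.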

In [151]:
#assigns part of speech and entity type to each word
agg_func = lambda s: [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(),
                                                   s['POS'].values.tolist(),
                                                   s['Tag'].values.tolist())]

In [160]:
grouped_df = df.groupby('Sentence #').apply(agg_func) # grouping by sentence

sentences = [s for s in grouped_df] #nesting agg_func output in sentence

X = np.array([sent2features(s) for s in sentences], dtype=object)
y = np.array([sent2labels(s) for s in sentences], dtype=object)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42) #declaring variables for training/testing
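
The forward fill and the `groupby` above work together: the fill gives every row a sentence id, and the grouping then collects each sentence's `(word, pos, tag)` triples. The plain-Python sketch below mimics that on a few hypothetical rows. It assumes, as in this corpus, that only the first token of a sentence carries a `Sentence #` value; the function name `group_sentences` is made up for the example.

```python
# Hypothetical rows shaped like the corpus: (Sentence #, Word, POS, Tag).
# Only the first token of each sentence carries the sentence id.
rows = [
    ('Sentence: 1', 'London', 'NNP', 'B-geo'),
    (None, 'calling', 'VBG', 'O'),
    ('Sentence: 2', 'Hello', 'UH', 'O'),
    (None, 'world', 'NN', 'O'),
    (None, '.', '.', 'O'),
]

def group_sentences(rows):
    """Forward-fill the sentence id, then collect (word, pos, tag) per sentence."""
    grouped = {}
    current_id = None
    for sent_id, word, pos, tag in rows:
        if sent_id is not None:
            current_id = sent_id  # a new sentence starts here
        grouped.setdefault(current_id, []).append((word, pos, tag))
    return list(grouped.values())

sentences = group_sentences(rows)
print(sentences)
```

Unlike `groupby`, which sorts the group keys by default, this sketch keeps sentences in their original order; that difference does not matter for training because the split shuffles the data anyway.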

In [153]:
# Training algorithm: limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) -- https://towardsdatascience.com/limited-memory-broyden-fletcher-goldfarb-shanno-algorithm-in-ml-net-118dec066ba
crf = sklearn_crfsuite.CRF(algorithm='lbfgs',
                           c1=0.1,
                           c2=0.1,
                           max_iterations=100,
                           all_possible_transitions=True,
                           verbose=True)
try:
    crf.fit(X_train, y_train)
except AttributeError: #ignoring the errors
    pass

loading training data to CRFsuite: 100%|██████████| 35969/35969 [00:07<00:00, 4854.33it/s]



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 133629
Seconds required: 1.334

L-BFGS optimization
c1: 0.100000
c2: 0.100000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=2.39  loss=1264028.26 active=132637 feature_norm=1.00
Iter 2   time=2.41  loss=994059.01 active=131294 feature_norm=4.42
Iter 3   time=1.20  loss=776413.87 active=125970 feature_norm=3.87
Iter 4   time=5.96  loss=422143.40 active=127018 feature_norm=3.24
Iter 5   time=1.20  loss=355775.44 active=129029 feature_norm=4.04
Iter 6   time=1.19  loss=264125.22 active=124046 feature_norm=6.10
Iter 7   time=1.20  loss=222304.71 active=117183 feature_norm=7.69
Iter 8   time=1.20  loss=197827.17 active=110838 feature_norm=8.75
Iter 9   time=1.66  loss=176877.92 active=105650 feature_norm
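
The imports include `crf_metrics`, but none of the cells shown here scores the fitted model on the held-out split. As a rough sketch of what a token-level check could look like, the helper below computes precision, recall, and F1 per tag in plain Python. The name `per_label_scores` and the toy sequences are hypothetical; with the fitted model one would presumably pass `y_test` and `crf.predict(X_test)` instead.

```python
from collections import Counter

def per_label_scores(y_true, y_pred, ignore=('O',)):
    """Token-level precision/recall/F1 per label over lists of tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_seq, pred_seq in zip(y_true, y_pred):
        for t, p in zip(true_seq, pred_seq):
            if t == p:
                tp[t] += 1
            else:
                fp[p] += 1  # predicted label was wrong
                fn[t] += 1  # true label was missed
    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        if label in ignore:
            continue
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {'precision': precision, 'recall': recall, 'f1': f1}
    return scores

# Hypothetical gold and predicted tag sequences.
y_true_toy = [['B-per', 'I-per', 'O', 'B-geo'], ['O', 'B-geo']]
y_pred_toy = [['B-per', 'O', 'O', 'B-geo'], ['O', 'B-org']]
toy_scores = per_label_scores(y_true_toy, y_pred_toy)
for label, s in toy_scores.items():
    print(label, s)
```

The `O` tag is ignored by default because it dominates the corpus and would otherwise make the scores look better than the entity tagging really is.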

# <font color=46ab18>**1. Run the following sentence through your tagger: “Fourteen days ago, Emperor Palpatine left San Diego, CA for Tatooine to follow Luke Skywalker.” Report on the tags applied to the sentence.**

In [154]:
text="Fourteen days ago, Emperor Palpatine left San Diego, CA for Tatooine to follow Luke Skywalker."
#building words only DataFrame for #3 comparison
text2="Fourteen days ago Emperor Palpatine left San Diego CA for Tatooine to follow Luke Skywalker"
words=text2.split()
word_df= pd.DataFrame(words,columns=['word'])

In [155]:
# Retrieves text POS
text_tokens = nltk.word_tokenize(text)
text_pos = nltk.pos_tag(text_tokens)

# Retrieves features
features = [sent2features(text_pos)]

# Generates labels
labels = crf.predict(features)
text_labels = labels[0]

# Formats report
text_ner_df = pd.DataFrame([[token, tag] for token, tag in zip(text_tokens, text_labels)], columns=['word', 'NER'])
text_ner_df

| | word | NER |
|---|---|---|
| 0 | Fourteen | B-per |
| 1 | days | O |
| 2 | ago | O |
| 3 | , | O |
| 4 | Emperor | B-per |
| 5 | Palpatine | I-per |
| 6 | left | O |
| 7 | San | B-geo |
| 8 | Diego | I-geo |
| 9 | , | O |
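
Reading per-token tags is easier once they are collapsed into whole entities. The sketch below groups B-/I- tags into `(text, type)` spans, using the ten rows displayed above as input. The helper name `bio_to_spans` is hypothetical, and treating an `I-` tag that has no matching `B-` before it as the start of a new span is a design assumption.

```python
def bio_to_spans(tokens, tags):
    """Collapse BIO tags into (entity_text, entity_type) spans."""
    spans, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition('-')
        # A B- tag, or an I- tag of a different type, opens a new span.
        starts_new = prefix == 'B' or (prefix == 'I' and etype != current_type)
        if prefix == 'O' or starts_new:
            if current_tokens:
                spans.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if prefix in ('B', 'I'):
            current_tokens.append(token)
            current_type = etype
    if current_tokens:
        # Flush a span that runs to the end of the sentence.
        spans.append((' '.join(current_tokens), current_type))
    return spans

# The ten rows shown in the output above.
tokens = ['Fourteen', 'days', 'ago', ',', 'Emperor', 'Palpatine',
          'left', 'San', 'Diego', ',']
tags = ['B-per', 'O', 'O', 'O', 'B-per', 'I-per', 'O', 'B-geo', 'I-geo', 'O']
spans = bio_to_spans(tokens, tags)
print(spans)
```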


In [156]:
word_ner_merge_df=pd.merge(word_df, text_ner_df, on='word', how='left') #joining dataframes on words
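
Joining on the `word` column works for this sentence because no word repeats; if a word appeared twice on both sides, the merge would return a row for every matching pair. One position-preserving alternative, sketched here in plain Python, is to drop punctuation tokens from the tagged output and keep the remaining rows in order. The helper name `drop_punctuation` and the example sentence are hypothetical.

```python
import string

def drop_punctuation(tokens, tags):
    """Keep (token, tag) pairs in order, skipping punctuation-only tokens."""
    return [(tok, tag) for tok, tag in zip(tokens, tags)
            if not all(ch in string.punctuation for ch in tok)]

# Hypothetical tagged sentence in which the word "to" appears twice.
tokens = ['Go', 'to', 'Paris', ',', 'then', 'to', 'Rome', '.']
tags = ['O', 'O', 'B-geo', 'O', 'O', 'O', 'B-geo', 'O']
aligned = drop_punctuation(tokens, tags)
print(aligned)
```

Because alignment is by position rather than by string, each occurrence of "to" keeps its own tag and no rows are duplicated.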

# <font color=46ab18>**2. Run the same sentence through spaCy’s NER engine.**

In [157]:
nlp = spacy.load('en_core_web_sm') #loading the english model
text_nlp = nlp(text) #creates a doc object with provided text using the english model
displacy.render(text_nlp, style='ent', jupyter=True) #rendering the entity visual

In [158]:
spacy_ner = pd.DataFrame([(word.text, word.ent_type_, word.ent_iob_) for word in text_nlp], columns=['word', 'spaCy', 'spaCy_iob']) # creating a DataFrame for spaCy NER output
comparison_df = pd.merge(word_ner_merge_df, spacy_ner, on='word', how='left') # joining for comparison

# <font color=46ab18>**3. Compare and contrast the results – you can do this in your Jupyter Notebook or as a comment in your .py file.**

In [159]:
# I- prefix before a tag indicates that the token is inside a chunk.
# B- prefix before a tag indicates that the token is the beginning of a chunk.
# O tag indicates that a token belongs to no chunk (outside).

# The tags in this dataset are explained as follows:

# geo = Geographical Entity
# org = Organization
# per = Person
# gpe = Geopolitical Entity
# tim = Time indicator
# art = Artifact
# eve = Event
# nat = Natural Phenomenon
comparison_df

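| | word | NER | spaCy | spaCy_iob |
|---|---|---|---|---|
| 0 | Fourteen | B-per | DATE | B |
| 1 | days | O | DATE | I |
| 2 | ago | O | DATE | I |
| 3 | Emperor | B-per | | O |
| 4 | Palpatine | I-per | PERSON | B |
| 5 | left | O | | O |
| 6 | San | B-geo | GPE | B |
| 7 | Diego | I-geo | GPE | I |
| 8 | CA | B-org | PERSON | B |
| 9 | for | O | | O |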
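
The two models use different label sets, so a direct comparison needs a mapping between them. The sketch below counts agreement over the ten rows displayed above. The `SPACY_TO_CORPUS` mapping is an assumption made for this example (in particular, mapping spaCy's `GPE` to the corpus tag `geo`), not something defined by either library.

```python
# Rows copied from the comparison table above: (word, CRF tag, spaCy type).
rows = [
    ('Fourteen', 'B-per', 'DATE'), ('days', 'O', 'DATE'), ('ago', 'O', 'DATE'),
    ('Emperor', 'B-per', ''), ('Palpatine', 'I-per', 'PERSON'), ('left', 'O', ''),
    ('San', 'B-geo', 'GPE'), ('Diego', 'I-geo', 'GPE'), ('CA', 'B-org', 'PERSON'),
    ('for', 'O', ''),
]

# Assumed mapping from spaCy entity types to the corpus tag types.
SPACY_TO_CORPUS = {'PERSON': 'per', 'GPE': 'geo', 'DATE': 'tim', 'ORG': 'org', '': 'O'}

def entity_type(bio_tag):
    """Strip the B-/I- prefix so only the entity type is compared."""
    return 'O' if bio_tag == 'O' else bio_tag.split('-', 1)[1]

agree = [word for word, crf_tag, spacy_type in rows
         if entity_type(crf_tag) == SPACY_TO_CORPUS.get(spacy_type, spacy_type)]
print(f'{len(agree)} of {len(rows)} tokens agree: {agree}')
```

This compares entity types only and ignores chunk boundaries, so "Palpatine" counts as agreement even though the two models start the person chunk on different tokens.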


## <font color=46ab18>**Fourteen days ago**
- Both named entity recognition (NER) models identified "Fourteen" as the beginning of a chunk
- spaCy tagged "days" and "ago" as inside the chunk, while the text NER (tNER) tagged them as outside (O)
- spaCy correctly identified the chunk as a date entity, while the tNER incorrectly identified "Fourteen" as a person entity

## <font color=46ab18>**Emperor Palpatine**
- Both identified Palpatine as a person
- spaCy did not include "Emperor" as part of the person entity, but tNER did

## <font color=46ab18>**San Diego, CA**
- Both models identified San as the beginning of the chunk with Diego inside of it
- The tNER correclty identified San Diego as geographical entities while spaCy incorrectly identified them as geopolitical
- Both models struggled with CA: tNER identified it as an organization and spaCy identified it as a person

## <font color=46ab18>**Tatooine**
- Both models mislabeled Tatooine: tNER identified it as an organization and spaCy identified it as a person

## <font color=46ab18>**Luke Skywalker**
- Both models aced this one, identifying Luke Skywalker as a person

## <font color=46ab18>**Overall**
- I think both models performed well. With only a single test sentence, the outcomes were very similar and, in my opinion, too close to tell which is better. A larger test set might have yielded more conclusive results.