#### Week 9 Milestone 9.3

Author: Rex Gayas

Course & Section: DSC360-T301 Data Mining: Text Analytics an (2243-1)

Date: 11 FEB 2024

##### 1. Run the following sentence through your tagger: “Fourteen days ago, Emperor Palpatine left San Diego, CA for Tatooine to follow Luke Skywalker.” Report on the tags applied to the sentence.

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn_crfsuite import CRF
from sklearn_crfsuite.metrics import flat_classification_report

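# Function to extract features from a sentence.
# Each token is a tuple whose first element is the word; the POS tag in
# position 1 is available but is not used as a feature here.
def word2features(sent, i):
    word = sent[i][0]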

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
    }
    if i > 0:
        word1 = sent[i-1][0]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
        })
    else:
        features['EOS'] = True

    return features

# Function to transform a sentence into CRF features
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

# Function to extract labels from a sentence
def sent2labels(sent):
    return [label for token, postag, label in sent]

# Load the dataset
df = pd.read_csv("D:\\ALPHA\\Dynamic Folder\\Bellevue\\Winter 2023\\Data Mining\\Week 9\\ner_dataset.csv", encoding="latin1")
df = df.ffill()

# Creating a list of sentences
agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                    s["POS"].values.tolist(),
                                                    s["Tag"].values.tolist())]
grouped_df = df.groupby("Sentence #").apply(agg_func)
sentences = [s for s in grouped_df]

# Using a smaller subset for faster training
sentences = sentences[:1000]  # Considering only the first 1000 sentences

# Extract features and labels
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Train the CRF model
crf = CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=30, all_possible_transitions=True)
crf.fit(X_train, y_train)

# Predict on a single sentence, given as (word, POS tag) pairs.
# Punctuation tokens (the two commas and the final period) are omitted here.
test_sentence = [("Fourteen", "CD"), ("days", "NNS"), ("ago", "RB"), ("Emperor", "NNP"),
                 ("Palpatine", "NNP"), ("left", "VBD"), ("San", "NNP"), ("Diego", "NNP"),
                 ("CA", "NNP"), ("for", "IN"), ("Tatooine", "NNP"), ("to", "TO"),
                 ("follow", "VB"), ("Luke", "NNP"), ("Skywalker", "NNP")]

# Extract features from the test sentence
test_sentence_features = sent2features(test_sentence)

# Predict NER tags for the test sentence
test_sentence_tags = crf.predict([test_sentence_features])

# Output the results
print("Predicted NER tags for the test sentence:")
print(test_sentence_tags)


Predicted NER tags for the test sentence:
[['O', 'O', 'O', 'B-per', 'I-per', 'O', 'B-geo', 'I-geo', 'O', 'O', 'B-geo', 'O', 'O', 'B-per', 'I-per']]
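
The flat tag list is easier to read once consecutive `B-`/`I-` tags are merged back into whole entities. The following is a minimal sketch of that grouping step, not part of the original notebook: `bio_to_spans` is a hypothetical helper name, and the tokens and tags are copied by hand from the cell and output above.

```python
# Tokens and predicted tags copied from the CRF cell and its output above.
tokens = ["Fourteen", "days", "ago", "Emperor", "Palpatine", "left", "San",
          "Diego", "CA", "for", "Tatooine", "to", "follow", "Luke", "Skywalker"]
crf_tags = ["O", "O", "O", "B-per", "I-per", "O", "B-geo", "I-geo",
            "O", "O", "B-geo", "O", "O", "B-per", "I-per"]


def bio_to_spans(tokens, tags):
    """Merge BIO tags into (entity text, label) pairs (hypothetical helper)."""
    spans = []
    words, label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, tag_label = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and tag_label != label)
        if starts_new or prefix == "O":
            # Close the entity that was open, if any.
            if words:
                spans.append((" ".join(words), label))
            words, label = ([token], tag_label) if starts_new else ([], None)
        else:
            # An I- tag that continues the open entity.
            words.append(token)
    if words:
        spans.append((" ".join(words), label))
    return spans


crf_entities = bio_to_spans(tokens, crf_tags)
for text, label in crf_entities:
    print(f"{text:<20} {label}")
```

Read this way, the CRF output contains four entities: two persons and two geographic names. "Fourteen days ago" and "CA" are tagged `O`.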


##### Run the same sentence through spaCy’s NER engine.

In [7]:
import spacy

# Load the pre-trained spaCy model
nlp = spacy.load("en_core_web_sm")

# The sentence to be processed
sentence = "Fourteen days ago, Emperor Palpatine left San Diego, CA for Tatooine to follow Luke Skywalker."

# Process the sentence using spaCy
doc = nlp(sentence)

# Print each named entity's text, start and end character offsets, and label
print("Named Entities, Phrase and Label")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

# Visualize the named entities in Jupyter Notebook
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)


Named Entities, Phrase and Label
Fourteen days ago 0 17 DATE
Palpatine 27 36 PERSON
San Diego 42 51 GPE
CA 53 55 WORK_OF_ART
Tatooine 60 68 PERSON
Luke Skywalker 79 93 PERSON
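
spaCy reports entities as character spans, whereas the CRF emits one tag per token. To put the two outputs in the same form, the spans can be projected onto tokens as BIO tags. The sketch below is an illustration under stated assumptions rather than part of the original notebook: the offsets are copied by hand from the output above, `char_spans_to_bio` is a hypothetical helper, and the `\w+` regular expression is a stand-in tokenizer (not spaCy's own) that happens to yield the same 15 words used in the CRF test sentence.

```python
import re

sentence = ("Fourteen days ago, Emperor Palpatine left San Diego, CA "
            "for Tatooine to follow Luke Skywalker.")

# (start_char, end_char, label) triples copied from the spaCy output above.
spacy_entities = [
    (0, 17, "DATE"), (27, 36, "PERSON"), (42, 51, "GPE"),
    (53, 55, "WORK_OF_ART"), (60, 68, "PERSON"), (79, 93, "PERSON"),
]


def char_spans_to_bio(text, entities):
    """Project character-offset entities onto word tokens (hypothetical helper).

    Assumption: words are found with a simple regex, so punctuation is
    skipped, matching the token list used for the CRF test sentence.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\w+", text):
        tag = "O"
        for start, end, label in entities:
            # A token belongs to an entity if it lies fully inside its span.
            if start <= match.start() and match.end() <= end:
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


tokens, spacy_tags = char_spans_to_bio(sentence, spacy_entities)
for token, tag in zip(tokens, spacy_tags):
    print(f"{token:<10} {tag}")
```

In this token-level view, "Emperor" falls outside the `PERSON` span that covers "Palpatine", and "Tatooine" carries a `PERSON` tag.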


##### Compare and contrast the results

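The CRF model and spaCy's NER engine show different strengths on this sentence. The CRF model, trained on 750 of the first 1,000 sentences in the dataset, tagged "Emperor Palpatine" and "Luke Skywalker" as persons and "San Diego" and "Tatooine" as geographic entities, but it left "Fourteen days ago" and "CA" untagged. spaCy recognized "Fourteen days ago" as a date, "San Diego" as a geopolitical entity, and "Luke Skywalker" as a person, but it dropped the title "Emperor" from "Palpatine", labeled "Tatooine" as a person, and misclassified "CA" as a work of art.

The performance of the CRF model depends on the quality and quantity of its training data and on its feature set, which here covers only the word form and capitalization of the current and neighboring tokens. The missed date suggests the model may be underfitting, plausibly because of the small training subset and the cap of 30 training iterations, although a single sentence is not enough to confirm that. spaCy's pre-trained model, by contrast, requires no additional training and offers a broader range of entity types, which makes it the more general-purpose recognizer even though it made errors of its own here.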
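
The comparison can also be checked token by token. The sketch below is illustrative only: both tag lists are copied by hand from the outputs above (the spaCy entities projected onto the same 15 word tokens), `compare_tags` and `entity_type` are hypothetical helpers, and `LABEL_MAP` is an assumed correspondence between spaCy's labels and the dataset's labels, not something either library defines.

```python
tokens = ["Fourteen", "days", "ago", "Emperor", "Palpatine", "left", "San",
          "Diego", "CA", "for", "Tatooine", "to", "follow", "Luke", "Skywalker"]
crf_tags = ["O", "O", "O", "B-per", "I-per", "O", "B-geo", "I-geo",
            "O", "O", "B-geo", "O", "O", "B-per", "I-per"]
spacy_tags = ["B-DATE", "I-DATE", "I-DATE", "O", "B-PERSON", "O", "B-GPE",
              "I-GPE", "B-WORK_OF_ART", "O", "B-PERSON", "O", "O",
              "B-PERSON", "I-PERSON"]

# Assumption: a rough mapping from spaCy labels to the dataset's label names.
LABEL_MAP = {"PERSON": "per", "GPE": "geo", "DATE": "tim"}


def entity_type(tag):
    """Return the normalized entity type of a BIO tag, or None for 'O'."""
    if tag == "O":
        return None
    label = tag.split("-", 1)[1]
    return LABEL_MAP.get(label, label)


def compare_tags(tokens, crf, spacy_):
    """Label each token by how the two systems relate on entity type.

    Only the entity type is compared; the B-/I- prefix is ignored, so
    differing span boundaries (e.g. "Emperor Palpatine" vs "Palpatine")
    show up on the boundary token rather than on every token.
    """
    rows = []
    for token, crf_tag, spacy_tag in zip(tokens, crf, spacy_):
        crf_type, spacy_type = entity_type(crf_tag), entity_type(spacy_tag)
        if crf_type == spacy_type:
            verdict = "agree"
        elif crf_type is None:
            verdict = "spaCy only"
        elif spacy_type is None:
            verdict = "CRF only"
        else:
            verdict = "type differs"
        rows.append((token, crf_tag, spacy_tag, verdict))
    return rows


rows = compare_tags(tokens, crf_tags, spacy_tags)
disagreements = [(tok, verdict) for tok, _, _, verdict in rows if verdict != "agree"]
for token, crf_tag, spacy_tag, verdict in rows:
    print(f"{token:<10} {crf_tag:<6} {spacy_tag:<14} {verdict}")
```

Under these assumptions the two systems disagree on six of the fifteen tokens: the three date tokens and "CA" are tagged only by spaCy, "Emperor" only by the CRF, and "Tatooine" receives a different type from each.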

In terms of practicality, the pre-trained spaCy model is easier to use and can be deployed quickly across a wide range of applications. The CRF model, by contrast, is more flexible and can potentially achieve higher accuracy on a specific task when properly trained, but at the cost of more complex tuning and a longer setup time. For tasks that require rapid deployment and general-purpose coverage, spaCy's out-of-the-box functionality is therefore advantageous; for specialized tasks that benefit from customized training, a CRF model may be the better option despite its more labor-intensive setup.
