# Assignment 3 Part 2: IE

## Overview

In this assignment, the task is to code a Named Entity Recognizer (NER) application in Python using the CRFsuite library.

It is recommended you complete the Named_Entity_Extraction_Tutorial.ipynb tutorial before attemping this.

Your tasks for this assignment are to:
1. Build a NER classifier following the tutorial.
2. Improve the performance of your NER classifier.
3. Answer three written assignments.

* Write answers in this notebook file, and upload the file to Wattle submission site. **Please rename and submit jupyter notebook file (Assignment5.ipynb) to your_uid.ipynb (e.g. u6000001.ipynb) with your written answers therein**. Do not upload any other files to Wattle except this notebook file.

### <span style="color:blue"> Question 1 (2 points) Build a NER model <a id='Task1'></a> </span>
### Part A (1.5 marks)

* Build a NER model using the train and test data files.
* You can use the code provided in [tutorial sheet](Named_Entity_Extraction_Tutorial.ipynb) 
* Try changing the feature extraction, model hyper parameters, or other settings in order to improve your model performance.
* Marks will be awarded based on how well your model performs.


In [81]:
from __future__ import print_function
from sklearn.metrics import confusion_matrix
import io
import nltk
import scipy
import codecs
import sklearn
import pycrfsuite
import pandas as pd
from itertools import chain
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report

print('sklearn version:', sklearn.__version__)
print('Libraries succesfully loaded!')

sklearn version: 0.20.1
Libraries succesfully loaded!


In [82]:
def sent2features(sent, feature_func):
    return [feature_func(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [s[-1] for s in sent]

def sent2tokens(sent):
    return [s[0] for s in sent]

def bio_classification_report(y_true, y_pred):
    """
    Classification report for a list of BIO-encoded sequences.
    It computes token-level metrics and discards "O" labels.
    
    Note that it requires scikit-learn 0.15+ (or a version from github master)
    to calculate averages properly!
    """
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(y_true)
    y_pred_combined = lb.transform(y_pred)
        
    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}
    
    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels = [class_indices[cls] for cls in tagset],
        target_names = tagset,
    )
            
def word2simple_features(sent, i):
    '''
    This makes a simple baseline.  
    You can add and/or remove features to get (much?) better results.
    Experiment with it as you will need to do this for assignment.
    '''
    word = sent[i][0]
    
    features = {
        'bias': 1.0, # This feature is constant for all words.
        'word.lower()': word.lower(), # This feature is the word, ignoring case.
        'word[-2:]': word[-2:], # This feature is the last two characters of the word (i.e. the suffix).
    }
    if i == 0:
        features['BOS'] = True # Mark the beginning of sentence.
        
    if i == len(sent)-1:
        features['EOS'] = True # Mark the end of sentence.

    return features

# load data and preprocess
def extract_data(path):
    """
    Extracting data from train file or test file. 
    path - the path of the file to extract
    
    return:
        res - a list of sentences, each sentence is a
              a list of tuples. For train file, each tuple
              contains token and label. For test file, each
              tuple only contains token.
        ids - a list of ids for the corresponding token. This
              is mainly for Kaggle submission.
    """
    file = io.open(path, mode="r", encoding="utf-8")
    next(file)
    res = []
    ids = []
    sent = []
    for line in file:
        if line != '\n':
            # Each line contains the position ID, the token, and (for the training set) the label.
            parts = line.strip().split(' ')
            sent.append(tuple(parts[1:]))
            ids.append(parts[0])
        else:
            res.append(sent)
            sent = []
                
    return res, ids

In [83]:
# Load train and test data
train_data, train_ids = extract_data('train')
test_data, test_ids = extract_data('test')

# Load true labels for test data
test_labels = list(pd.read_csv('test_ground_truth').loc[:, 'label'])

print('Train and Test data loaded succesfully!')

# Feature extraction using the word2simple_features function
train_features = [sent2features(s, feature_func=word2simple_features) for s in train_data]
train_labels = [sent2labels(s) for s in train_data]
test_features = [sent2features(s, feature_func=word2simple_features) for s in test_data]

trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(train_features, train_labels):
    trainer.append(xseq, yseq)
print('Feature Extraction done!')    

# Explore the extracted features    
sent2features(train_data[0], word2simple_features)

Train and Test data loaded succesfully!
Feature Extraction done!


[{'bias': 1.0, 'word.lower()': 'también', 'word[-2:]': 'én', 'BOS': True},
 {'bias': 1.0, 'word.lower()': 'el', 'word[-2:]': 'el'},
 {'bias': 1.0, 'word.lower()': 'secretario', 'word[-2:]': 'io'},
 {'bias': 1.0, 'word.lower()': 'general', 'word[-2:]': 'al'},
 {'bias': 1.0, 'word.lower()': 'de', 'word[-2:]': 'de'},
 {'bias': 1.0, 'word.lower()': 'la', 'word[-2:]': 'la'},
 {'bias': 1.0, 'word.lower()': 'asociación', 'word[-2:]': 'ón'},
 {'bias': 1.0, 'word.lower()': 'española', 'word[-2:]': 'la'},
 {'bias': 1.0, 'word.lower()': 'de', 'word[-2:]': 'de'},
 {'bias': 1.0, 'word.lower()': 'operadores', 'word[-2:]': 'es'},
 {'bias': 1.0, 'word.lower()': 'de', 'word[-2:]': 'de'},
 {'bias': 1.0, 'word.lower()': 'productos', 'word[-2:]': 'os'},
 {'bias': 1.0, 'word.lower()': 'petrolíferos', 'word[-2:]': 'os'},
 {'bias': 1.0, 'word.lower()': ',', 'word[-2:]': ','},
 {'bias': 1.0, 'word.lower()': 'aurelio', 'word[-2:]': 'io'},
 {'bias': 1.0, 'word.lower()': 'ayala', 'word[-2:]': 'la'},
 {'bias': 1.

In [84]:
trainer.params()

['feature.minfreq',
 'feature.possible_states',
 'feature.possible_transitions',
 'c1',
 'c2',
 'max_iterations',
 'num_memories',
 'epsilon',
 'period',
 'delta',
 'linesearch',
 'max_linesearch']

In [85]:
trainer.set_params({
    'c1': 0.3,   # coefficient for L1 penalty
    'c2': 1e-2,  # coefficient for L2 penalty
    'max_iterations': 100,  # stop earlier
    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})

In [86]:
%%time
trainer.train('ner-esp.model')

print('Training done :)')

Training done :)
Wall time: 12.8 s


In [87]:
# Make predictions
tagger = pycrfsuite.Tagger()
tagger.open('ner-esp.model')
test_pred = [tagger.tag(xseq) for xseq in test_features]
test_pred = [s for w in test_pred for s in w]

## Print evaluation
print(bio_classification_report(test_pred, test_labels))

              precision    recall  f1-score   support

       B-LOC       0.74      0.81      0.77      1857
       I-LOC       0.54      0.73      0.62       561
      B-MISC       0.36      0.61      0.45       510
      I-MISC       0.40      0.42      0.41      1199
       B-ORG       0.71      0.84      0.77      2726
       I-ORG       0.62      0.67      0.65      2056
       B-PER       0.78      0.89      0.83      1662
       I-PER       0.85      0.87      0.86      1581

   micro avg       0.67      0.76      0.71     12152
   macro avg       0.63      0.73      0.67     12152
weighted avg       0.67      0.76      0.71     12152
 samples avg       0.08      0.08      0.08     12152



The output of the above cell should look something like this (but with different numbers)

                precision    recall  f1-score   support

      B-LOC       0.68      0.47      0.55      1084
      I-LOC       0.52      0.25      0.34       325
     B-MISC       0.54      0.11      0.19       339
     I-MISC       0.54      0.22      0.32       557
      B-ORG       0.76      0.51      0.61      1400
      I-ORG       0.67      0.44      0.53      1104
      B-PER       0.73      0.68      0.71       735
      I-PER       0.78      0.82      0.80       634

avg / total       0.68      0.48      0.55      6178



### Part B (0.5 marks)

Briefly explain what changes to your model you tried and how these changes affected the model's performance.

I increased the L1 penalty, decreased the L2 penalty and increased the number of iterations.

### <span style="color:blue"> Written Part (3 points) </span>

Answer briefly and concisely the following questions.
Check [this](https://sourceforge.net/p/jupiter/wiki/markdown_syntax/#md_ex_lists) if you are not familiar with markdown syntax.

### Question 2 (0.5 point)
Think of three relevant baselines for the Named Entity Classification task.
Provide answers using bullet list with 3 items. Give a short description of each of them.

YOUR ANSWER HERE

### Question 3 (1.5 point)
How does Maximal Marginal Relevance (MMR) address redundancy issues? (0.5 point)

How can you tell MMR that "Sydney" and "Melbourne" are cities? (0.5 points)

How can you tell MMR that "solar panels" and "photovoltaic cells" have similar meaning? (0.5 points)

a) Reference - https://www.aclweb.org/anthology/X98-1025.pdf


The Maximal Marginal Relevance (MMR) criterion strives to reduce redundancy while maintaining query relevance in reranking retrieved documents and in selecting appropriate passages for text summarization. 

b) MMR can tell that these two words or sentences in a document or corpuse are similar meaning as MMR uses cosine similarity through which we can know that this belongs to the cities group as there similarity of both being in cities group would be close to 1.

c) MMR can tell that these two words are similar meaning as MMR uses cosine similarity which helps us to know whether the word has same meaning or not.

### Question 4 (1 point)

Imagine you are developing an extractive text summarization tool using HMM.

What are the hidden states and the observations of the HMM model? (0.5 point)

Which algorithm is used to compute the probability of a particular observation sequence? (0.5 point)

a)Reference - https://en.wikipedia.org/wiki/Hidden_Markov_model

The states of the stochastic processes are called hidden states whereas the observations is the result that we get after soving the markov hidden model using the hidden layers.

b) Forward algorithm is used to compute the probability of a particular observation sequence.