In [1]:
!pip install sklearn_crfsuite

Collecting sklearn_crfsuite
  Downloading https://files.pythonhosted.org/packages/25/74/5b7befa513482e6dee1f3dd68171a6c9dfc14c0eaa00f885ffeba54fe9b0/sklearn_crfsuite-0.3.6-py2.py3-none-any.whl
Collecting python-crfsuite>=0.8.3
  Downloading https://files.pythonhosted.org/packages/95/99/869dde6dbf3e0d07a013c8eebfb0a3d30776334e0097f8432b631a9a3a19/python_crfsuite-0.9.7-cp36-cp36m-manylinux1_x86_64.whl (743kB)
Installing collected packages: python-crfsuite, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.7 sklearn-crfsuite-0.3.6


In [None]:

from itertools import chain
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import sklearn

# Pre-processing

Build the [dataset reader](https://www.nltk.org/_modules/nltk/corpus/reader/conll.html) and represent every sentence as a list of (word, POS tag, OBI label) tuples.

In [2]:
import nltk
#nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus.reader.conll import ConllCorpusReader
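# CoNLL-format file reader; the data files have two columns (token, entity label),
# so the entity label is read through the 'pos' column slot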
READER = ConllCorpusReader(root="./", fileids=".conll", columntypes=('words','pos','tree','chunk','ne','srl','ignore'))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [3]:
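def load(filename):
    """Return the sentences of `filename` as lists of (word, POS tag, OBI label) tuples."""
    # POS tags come from NLTK's tagger; the labels come from the file's second column
    word_pos = [nltk.pos_tag(sentence) for sentence in READER.sents(filename)]
    word_obi = list(READER.tagged_sents(filename))
    return [[(a,b,d) for (a,b),(c,d) in zip(lst1, lst2)] for lst1,lst2 in zip(word_pos,word_obi)]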

# training set
Train_sents = load("wnut17train.conll")
# test set
Test_sents = load("emerging.test.annotated")
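
To make the merging step in `load` concrete, here is a small sketch on one hand-made sentence. The tokens, POS tags, and labels are invented for illustration and do not come from the dataset; in the notebook the two input lists come from `nltk.pos_tag` and `READER.tagged_sents`.

```python
# Toy stand-ins for one sentence (illustrative values, not from the dataset).
word_pos = [[("Justin", "NNP"), ("visits", "VBZ"), ("Toronto", "NNP")]]            # (word, POS tag)
word_obi = [[("Justin", "B-person"), ("visits", "O"), ("Toronto", "B-location")]]  # (word, label)

# Same comprehension as in load(): pair the two views token by token
# and keep the word, its POS tag, and its entity label.
merged = [[(a, b, d) for (a, b), (c, d) in zip(lst1, lst2)]
          for lst1, lst2 in zip(word_pos, word_obi)]

print(merged[0])
```

The word from the second view (`c`) is discarded, so the sketch relies on both views tokenizing the sentence identically.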

In [4]:
# Development set
word_pos = [nltk.pos_tag(sentence) for sentence in READER.sents("emerging.dev.conll")[:1008]]
word_obi = list(READER.tagged_sents("emerging.dev.conll")[:1008])
Dev_sents = [[(a,b,d) for (a,b),(c,d) in zip(lst1, lst2)] for lst1,lst2 in zip(word_pos,word_obi)]

# Baseline

Extract the OBI label and the following features for each word in a sentence. Fit a Conditional Random Field (**CRF**) model on the training data and evaluate it on the test set. As a baseline, generate **transition features** for all possible label pairs and train with **gradient descent using the L-BFGS method** for **at most $100$ iterations**, with Elastic-Net regularization: the **L1 penalty** is controlled by $c_1 = 0.1$ and the **L2 penalty** by $c_2 = 0.1$.

**Features:**
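1.   **Word Identity**: lowercased form
2.   **Word Suffix**: the last two and three characters
3.   **Word Shape**: whether a word consists only of digits, is all uppercase, or is title-cased (an initial capital followed by lowercase letters)
4.   **Part-of-Speech Tag**: noun, verb, adjective, etc.
5.   **BOS**: whether a word is the start of the sentence
6.   **EOS**: whether a word is the end of the sentence

In the code below, each token also receives a bias term, the two-character prefix of its POS tag, and the identity, shape, and POS features of the previous and next tokens.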
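
As a quick illustration of the per-token features listed above, the following sketch computes a subset of them for a single word. The helper name `toy_features` and the sample token are hypothetical and chosen only for this example; the notebook's own extractor is `word2features`.

```python
def toy_features(word, postag):
    """Illustrative subset of the per-token features (hypothetical helper)."""
    return {
        'word.lower()': word.lower(),      # word identity
        'word[-3:]': word[-3:],            # word suffixes
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),  # word shape
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,                  # POS tag
    }

feats = toy_features("Toronto", "NNP")
print(feats)
```

String-valued features such as the suffix are treated as categorical indicators by the CRF, while the boolean shape features act as on/off flags.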



In [5]:
from sklearn_crfsuite import CRF, metrics

In [6]:
# extract features and labels
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {'bias': 1.0,
                'word.lower()': word.lower(),        # word identity
                'word[-3:]': word[-3:],              # word suffix 
                'word[-2:]': word[-2:],
                'word.isupper()': word.isupper(),    # word shape
                'word.istitle()': word.istitle(),
                'word.isdigit()': word.isdigit(),
                'postag': postag,                    # POS tag
                'postag[:2]': postag[:2],
               }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({'-1:word.lower()': word1.lower(),
                         '-1:word.istitle()': word1.istitle(),
                         '-1:word.isupper()': word1.isupper(),
                         '-1:postag': postag1,
                         '-1:postag[:2]': postag1[:2],
                        })
    else:
        features['BOS'] = True                      # BOS

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({'+1:word.lower()': word1.lower(),
                         '+1:word.istitle()': word1.istitle(),
                         '+1:word.isupper()': word1.isupper(),
                         '+1:postag': postag1,
                         '+1:postag[:2]': postag1[:2],
                       })
    else:
        features['EOS'] = True                     # EOS

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def setup(data_sents):
    return [sent2features(s) for s in data_sents], [sent2labels(s) for s in data_sents]

In [7]:
# set up datasets
X_train,y_train = setup(Train_sents)
X_test, y_test  = setup(Test_sents)
X_dev,  y_dev   = setup(Dev_sents)

In [None]:
# training
baseline = CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100, all_possible_transitions=True)
baseline.fit(X_train, y_train)
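
As a rough guide to what `c1` and `c2` control: with Elastic-Net regularization the training objective has the general shape below, where $w$ denotes the feature weights and the sum runs over the training sentences $(x^{(n)}, y^{(n)})$. This is a sketch of the standard form; the exact scaling constants used inside CRFsuite are an assumption here and should be checked against its documentation.

$$
\min_{w}\; -\sum_{n} \log p_w\!\left(y^{(n)} \mid x^{(n)}\right) \;+\; c_1 \lVert w \rVert_1 \;+\; c_2 \lVert w \rVert_2^2
$$

A larger $c_1$ drives more weights to exactly zero, while a larger $c_2$ shrinks all weights toward zero.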

In [13]:
# evaluate
y_pred = baseline.predict(X_test)

labels = list(baseline.classes_)
labels.remove('O')
sorted_labels = sorted(labels, key = lambda name: (name[1:], name[0]))
print(metrics.flat_classification_report(y_test, y_pred, labels=sorted_labels, digits=4))

                 precision    recall  f1-score   support

  B-corporation     0.0000    0.0000    0.0000        66
  I-corporation     0.0000    0.0000    0.0000        22
B-creative-work     0.3333    0.0352    0.0637       142
I-creative-work     0.2963    0.0367    0.0653       218
        B-group     0.3000    0.0364    0.0649       165
        I-group     0.3571    0.0714    0.1190        70
     B-location     0.3846    0.2333    0.2905       150
     I-location     0.2308    0.0638    0.1000        94
       B-person     0.5514    0.1375    0.2201       429
       I-person     0.5472    0.2214    0.3152       131
      B-product     0.6000    0.0236    0.0455       127
      I-product     0.3750    0.0476    0.0845       126

      micro avg     0.4297    0.0931    0.1530      1740
      macro avg     0.3313    0.0756    0.1141      1740
   weighted avg     0.4009    0.0931    0.1422      1740
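
The row order of the report comes from the sort key used in the evaluation cell. A minimal sketch on a handful of label names (a subset picked for illustration) shows how the key groups the B- and I-tags of each entity type together:

```python
# A few label names in arbitrary order (subset chosen for illustration).
labels = ['I-person', 'O', 'B-location', 'B-person', 'I-location']

labels.remove('O')  # keep the non-entity label out of the report

# Sort by entity type first (name[1:] is e.g. '-person'), then by the B/I prefix.
sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
print(sorted_labels)
```

Because `'O'` is removed before scoring, the averages in the report cover entity labels only.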



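The baseline recognizes B- and I-tags poorly: the micro-averaged F1-score over the entity labels is only $0.1530$. Recall is the main weakness ($0.0931$ micro-averaged, against a precision of $0.4297$), and no `corporation` token is labeled correctly.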
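
To see why low recall dominates the result, recall that F1 is the harmonic mean of precision and recall. The short check below uses the micro-averaged values from the baseline report; because those inputs are rounded to four decimals, the result only matches the reported F1-score up to rounding.

```python
# Micro-averaged precision and recall copied from the baseline report.
precision, recall = 0.4297, 0.0931

# F1 is the harmonic mean of the two, so it stays close to the smaller value.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```

Even with precision above $0.4$, the F1-score cannot rise much beyond twice the recall, which is why improving recall matters most here.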

# Hyperparameter Optimization

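Run a grid search over the training algorithm and the hyperparameters of the CRF model. The candidate algorithms are **gradient descent using the L-BFGS method** (`lbfgs`) and **stochastic gradient descent with L2 regularization** (`l2sgd`). For `lbfgs`, $c_1$ and $c_2$ each take $15$ evenly spaced values in $[0.01, 0.8]$; for `l2sgd`, $c_2$ takes $25$ evenly spaced values in $[0.01, 1]$. Each candidate is fitted on the training set and scored on the development set, and the combination with the highest weighted F1-score over the entity labels is selected.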
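
The search code scores each candidate with `average='weighted'`, i.e. the per-label F1-scores averaged with weights proportional to each label's support. As a sketch of that averaging, the block below recomputes the "weighted avg" F1 of the baseline test report from its per-label rows. The numbers are copied from that report, not from the development set, and are rounded, so agreement is only up to rounding.

```python
# (f1-score, support) per entity label, copied from the baseline report.
report = {
    'B-corporation':   (0.0000, 66),
    'I-corporation':   (0.0000, 22),
    'B-creative-work': (0.0637, 142),
    'I-creative-work': (0.0653, 218),
    'B-group':         (0.0649, 165),
    'I-group':         (0.1190, 70),
    'B-location':      (0.2905, 150),
    'I-location':      (0.1000, 94),
    'B-person':        (0.2201, 429),
    'I-person':        (0.3152, 131),
    'B-product':       (0.0455, 127),
    'I-product':       (0.0845, 126),
}

total = sum(n for _, n in report.values())
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total
print(total, round(weighted_f1, 4))
```

Frequent labels such as `B-person` therefore influence the selection metric more than rare ones such as `I-corporation`.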

In [10]:
import numpy as np
import pandas as pd
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

In [None]:
# grid search of hyperparameters (c1,c2)

LB = dict()
a,b,k = 0.01, 0.8, 15
for c1 in np.linspace(a,b,k):
    lst = list()
    for c2 in np.linspace(a,b,k):
        crf = CRF(algorithm='lbfgs', c1=c1, c2=c2, max_iterations=100, all_possible_transitions=True)
        crf.fit(X_train, y_train)
        lst.append(metrics.flat_f1_score(y_dev, crf.predict(X_dev), average='weighted', labels=labels))
    LB["c1 = "+str(round(c1,2))] = lst
LB = pd.DataFrame(LB, index=["c2 = "+str(round(ele,2)) for ele in np.linspace(a,b,k)])

L2 = dict()
for c2 in np.linspace(0.01, 1, 25):
    crf = CRF(algorithm='l2sgd', c2=c2, max_iterations=100, all_possible_transitions=True)
    crf.fit(X_train, y_train)
    L2["c2 = "+str(round(c2,2))] = metrics.flat_f1_score(y_dev, crf.predict(X_dev), average='weighted', labels=labels)
L2 = pd.DataFrame(L2, index=["F1-Score"])
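
To get a feel for the size of this search, the sketch below rebuilds the two grids without NumPy. The helper `linspace` is a hypothetical pure-Python stand-in for `np.linspace`, written only for this illustration.

```python
def linspace(a, b, k):
    """Pure-Python stand-in for np.linspace(a, b, k): k evenly spaced points from a to b."""
    step = (b - a) / (k - 1)
    return [a + i * step for i in range(k)]

# One (c1, c2) pair per L-BFGS model, one c2 value per l2sgd model.
lbfgs_grid = [(c1, c2) for c1 in linspace(0.01, 0.8, 15) for c2 in linspace(0.01, 0.8, 15)]
l2sgd_grid = linspace(0.01, 1, 25)

print(len(lbfgs_grid), len(l2sgd_grid))        # number of models fitted per algorithm
print([round(c, 2) for c in l2sgd_grid[:4]])   # first few c2 values for l2sgd
```

Every grid point means one full training run, so the L-BFGS part of the search accounts for most of the cost.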

In [23]:
LB

| | c1 = 0.01 | c1 = 0.07 | c1 = 0.12 | c1 = 0.18 | c1 = 0.24 | c1 = 0.29 | c1 = 0.35 | c1 = 0.4 | c1 = 0.46 | c1 = 0.52 | c1 = 0.57 | c1 = 0.63 | c1 = 0.69 | c1 = 0.74 | c1 = 0.8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c2 = 0.01 | 0.174173 | 0.166999 | 0.178355 | 0.180863 | 0.16794 | 0.172363 | 0.167137 | 0.176358 | 0.174541 | 0.179982 | 0.176777 | 0.175989 | 0.176176 | 0.171714 | 0.177268 |
| c2 = 0.07 | 0.170796 | 0.172471 | 0.172723 | 0.171451 | 0.176263 | 0.170435 | 0.176292 | 0.174425 | 0.179678 | 0.177776 | 0.182451 | 0.183068 | 0.1744 | 0.172945 | 0.171772 |
| c2 = 0.12 | 0.162685 | 0.166598 | 0.168769 | 0.170756 | 0.171139 | 0.163661 | 0.174488 | 0.173265 | 0.174159 | 0.173072 | 0.171218 | 0.171711 | 0.170358 | 0.175267 | 0.170462 |
| c2 = 0.18 | 0.166577 | 0.168122 | 0.167928 | 0.170587 | 0.170514 | 0.164991 | 0.157729 | 0.160636 | 0.167353 | 0.160685 | 0.179673 | 0.163391 | 0.173452 | 0.171811 | 0.161528 |
| c2 = 0.24 | 0.161486 | 0.164725 | 0.168842 | 0.168689 | 0.154363 | 0.166381 | 0.157046 | 0.162348 | 0.16336 | 0.15741 | 0.160676 | 0.149903 | 0.157035 | 0.159091 | 0.154336 |
| c2 = 0.29 | 0.158153 | 0.158154 | 0.166712 | 0.163621 | 0.160555 | 0.162432 | 0.160178 | 0.159947 | 0.151647 | 0.150187 | 0.16187 | 0.161583 | 0.163111 | 0.156691 | 0.160024 |
| c2 = 0.35 | 0.150127 | 0.155762 | 0.157733 | 0.154167 | 0.16095 | 0.156625 | 0.1544 | 0.150427 | 0.151641 | 0.154757 | 0.152533 | 0.154265 | 0.157651 | 0.149658 | 0.157342 |
| c2 = 0.4 | 0.154145 | 0.15901 | 0.155008 | 0.155176 | 0.161758 | 0.154904 | 0.155162 | 0.15115 | 0.150884 | 0.149696 | 0.157346 | 0.154006 | 0.15209 | 0.146283 | 0.154228 |
| c2 = 0.46 | 0.149506 | 0.161904 | 0.153625 | 0.154701 | 0.155259 | 0.157213 | 0.153238 | 0.151508 | 0.149813 | 0.145481 | 0.148806 | 0.147874 | 0.150861 | 0.154616 | 0.148866 |
| c2 = 0.52 | 0.145169 | 0.157734 | 0.155533 | 0.156424 | 0.153177 | 0.153577 | 0.149349 | 0.156995 | 0.14859 | 0.154563 | 0.147727 | 0.146164 | 0.153977 | 0.152677 | 0.153235 |


In [25]:
# L2 is already a DataFrame indexed by "F1-Score" (built in the grid-search cell)
L2

| | c2 = 0.01 | c2 = 0.05 | c2 = 0.09 | c2 = 0.13 | c2 = 0.18 | c2 = 0.22 | c2 = 0.26 | c2 = 0.3 | c2 = 0.34 | c2 = 0.38 | c2 = 0.42 | c2 = 0.46 | c2 = 0.5 | c2 = 0.55 | c2 = 0.59 | c2 = 0.63 | c2 = 0.67 | c2 = 0.71 | c2 = 0.75 | c2 = 0.79 | c2 = 0.84 | c2 = 0.88 | c2 = 0.92 | c2 = 0.96 | c2 = 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| F1-Score | 0.187054 | 0.18208 | 0.188952 | 0.185973 | 0.164164 | 0.154185 | 0.153367 | 0.173979 | 0.181454 | 0.17179 | 0.163563 | 0.133909 | 0.195393 | 0.149855 | 0.143019 | 0.1253 | 0.145295 | 0.125317 | 0.131505 | 0.14596 | 0.188742 | 0.152017 | 0.12958 | 0.1336 | 0.152995 |
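
The `l2sgd` candidates examined in the next cell can be read off this table by ranking the scores. A minimal sketch, with the values copied from the table above:

```python
# Development-set weighted F1 per c2 for l2sgd, copied from the table above.
scores = {
    0.01: 0.187054, 0.05: 0.18208,  0.09: 0.188952, 0.13: 0.185973, 0.18: 0.164164,
    0.22: 0.154185, 0.26: 0.153367, 0.3:  0.173979, 0.34: 0.181454, 0.38: 0.17179,
    0.42: 0.163563, 0.46: 0.133909, 0.5:  0.195393, 0.55: 0.149855, 0.59: 0.143019,
    0.63: 0.1253,   0.67: 0.145295, 0.71: 0.125317, 0.75: 0.131505, 0.79: 0.14596,
    0.84: 0.188742, 0.88: 0.152017, 0.92: 0.12958,  0.96: 0.1336,   1.0:  0.152995,
}

# Rank the c2 values by their score, best first, and keep the top two.
ranked = sorted(scores, key=scores.get, reverse=True)
top_two = ranked[:2]
print(top_two, [scores[c2] for c2 in top_two])
```

The scores do not vary smoothly with `c2`, so neighboring grid points can differ by several points of F1.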


In [31]:
# explore whether to generate transition features for label pairs never seen in training
from random import seed

seed(59)  # seeds Python's built-in random module

def dev_f1(**params):
    """Fit a CRF on the training set and return its weighted F1-score on the development set."""
    crf = CRF(max_iterations=1000, **params)
    crf.fit(X_train, y_train)
    return metrics.flat_f1_score(y_dev, crf.predict(X_dev), average='weighted', labels=labels)

def compare_transitions(**params):
    """Print the development F1-score with and without all possible transitions."""
    print("With Transition Features: F1-Score =", dev_f1(all_possible_transitions=True, **params))
    print("Without Transition Features: F1-Score =", dev_f1(all_possible_transitions=False, **params))

print("When using Gradient Descent with L-BFGS method,")
print("c1 = 0.63, c2 = 0.07")
compare_transitions(algorithm='lbfgs', c1=0.63, c2=0.07)

print("")
print("When using Stochastic Gradient Descent with L2-regularization,")
print("c2 = 0.5")
compare_transitions(algorithm='l2sgd', c2=0.5)
print("c2 = 0.09")
compare_transitions(algorithm='l2sgd', c2=0.09)

When using Gradient Descent with L-BFGS method,
c1 = 0.63, c2 = 0.07
With Transition Features: F1-Score = 0.18338749199487608
Without Transition Features: F1-Score = 0.18240127626053698

When using Stochastic Gradient Descent with L2-regularization,
c2 = 0.5
With Transition Features: F1-Score = 0.13817990173390696
Without Transition Features: F1-Score = 0.13362640398165923
c2 = 0.09
With Transition Features: F1-Score = 0.15803183095715065
Without Transition Features: F1-Score = 0.16508960364465672
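
For context on what is being toggled: as described in the sklearn-crfsuite documentation, `all_possible_transitions=True` also creates transition features for label pairs that never occur in the training data, so the model can learn weights against them. The sketch below counts both kinds of pairs on two toy label sequences, which are invented for illustration and are not taken from the dataset.

```python
from itertools import product

# Toy label sequences (illustrative, not from the dataset).
y_toy = [['B-person', 'I-person', 'O'], ['O', 'B-location', 'O']]

label_set = sorted({lab for seq in y_toy for lab in seq})

# Label bigrams that actually occur in the toy training data.
observed = {(prev, cur) for seq in y_toy for prev, cur in zip(seq, seq[1:])}

# Every ordered label pair, whether or not it was seen.
all_pairs = set(product(label_set, repeat=2))

print(len(observed), len(all_pairs))
print(('I-person', 'B-person') in all_pairs - observed)  # an unseen transition
```

With many labels and little data, most possible pairs are unseen, which is where the two settings can diverge; in the results above the difference is small in either direction.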


In [33]:
# evaluate on test set
opt = CRF(algorithm='lbfgs', c1=0.63, c2=0.07, max_iterations=1000, all_possible_transitions=True)
opt.fit(X_train, y_train)
y_pred = opt.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred, labels=sorted_labels, digits=4))

                 precision    recall  f1-score   support

  B-corporation     0.0000    0.0000    0.0000        66
  I-corporation     0.0000    0.0000    0.0000        22
B-creative-work     0.3529    0.0423    0.0755       142
I-creative-work     0.3226    0.0459    0.0803       218
        B-group     0.2500    0.0364    0.0635       165
        I-group     0.2778    0.0714    0.1136        70
     B-location     0.3302    0.2333    0.2734       150
     I-location     0.2759    0.0851    0.1301        94
       B-person     0.5429    0.1329    0.2135       429
       I-person     0.5200    0.1985    0.2873       131
      B-product     0.1667    0.0079    0.0150       127
      I-product     0.1290    0.0317    0.0510       126

      micro avg     0.3753    0.0908    0.1462      1740
      macro avg     0.2640    0.0738    0.1086      1740
   weighted avg     0.3420    0.0908    0.1365      1740



# Feature Pruning