<a href="https://colab.research.google.com/github/PawinData/TM/blob/main/TM_A2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install sklearn_crfsuite

Collecting sklearn_crfsuite
  Downloading https://files.pythonhosted.org/packages/25/74/5b7befa513482e6dee1f3dd68171a6c9dfc14c0eaa00f885ffeba54fe9b0/sklearn_crfsuite-0.3.6-py2.py3-none-any.whl
Collecting python-crfsuite>=0.8.3
  Downloading https://files.pythonhosted.org/packages/95/99/869dde6dbf3e0d07a013c8eebfb0a3d30776334e0097f8432b631a9a3a19/python_crfsuite-0.9.7-cp36-cp36m-manylinux1_x86_64.whl (743kB)
Installing collected packages: python-crfsuite, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.7 sklearn-crfsuite-0.3.6


In [None]:

from itertools import chain
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import sklearn

# Pre-processing

Build a [reader for the dataset](https://www.nltk.org/_modules/nltk/corpus/reader/conll.html) and represent every sentence as a list of (word, POS, OBI) tuples.
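As a minimal sketch of the target representation, the toy example below merges a POS-tagged sentence with its label-tagged counterpart, token by token. The sentence, its tags, and the names `toy_pos`, `toy_obi` and `merged` are made up for illustration; they are not taken from the dataset.

```python
# Toy data (hypothetical): the same sentence tagged twice, once with POS tags
# and once with entity labels, as two parallel lists of (word, tag) pairs.
toy_pos = [("Emma", "NNP"), ("visited", "VBD"), ("New", "NNP"), ("York", "NNP")]
toy_obi = [("Emma", "B-person"), ("visited", "O"),
           ("New", "B-location"), ("York", "I-location")]

# Pair the two views position by position and keep (word, POS, label).
merged = [(word, pos, label) for (word, pos), (_, label) in zip(toy_pos, toy_obi)]

for token in merged:
    print(token)
```

Each sentence thus becomes one list of triples, which is the shape the feature and label extractors later consume.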

In [2]:
import nltk
#nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus.reader.conll import ConllCorpusReader
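# a CoNLL file reader; the label column of the data files is read through the
# 'pos' slot, so READER.tagged_sents() yields (word, OBI label) pairs
READER = ConllCorpusReader(root="./", fileids=".conll", columntypes=('words','pos','tree','chunk','ne','srl','ignore'))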

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [3]:
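def load(filename):
    """Return every sentence of `filename` as a list of (word, POS, OBI) tuples."""
    # POS tags come from NLTK's tagger; OBI labels come from the file itself
    word_pos = [nltk.pos_tag(sentence) for sentence in READER.sents(filename)]
    word_obi = list(READER.tagged_sents(filename))
    return [[(a,b,d) for (a,b),(c,d) in zip(lst1, lst2)] for lst1,lst2 in zip(word_pos,word_obi)]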

# training set
Train_sents = load("wnut17train.conll")
# test set
Test_sents = load("emerging.test.annotated")

In [4]:
# development set (only the first 1008 sentences of the file)
word_pos = [nltk.pos_tag(sentence) for sentence in READER.sents("emerging.dev.conll")[:1008]]
word_obi = list(READER.tagged_sents("emerging.dev.conll")[:1008])
Dev_sents = [[(a,b,d) for (a,b),(c,d) in zip(lst1, lst2)] for lst1,lst2 in zip(word_pos,word_obi)]

# Baseline

Extract the OBI label and the following features from each word in a sentence. Train a Conditional Random Field (**CRF**) model on the training data and evaluate its performance on the test set. As a baseline, generate **transition features** for all possible label pairs and fit the model parameters by **Gradient Descent using the L-BFGS method**, running **at most $100$ iterations**, with Elastic-Net regularization; specifically, **L1-regularization** is controlled by $c_1 = 0.1$ and **L2-regularization** by $c_2 = 0.1$.

**Features:**
1.   **Word Identity**: lowercased form
2.   **Word Suffix**: the last two and three characters
3.   **Word Shape**: whether a word is a digit, is uppercased, or starts with an uppercase character
4.   **Part-of-Speech Tag**: noun, verb, adjective, etc.
5.   **BOS**: whether a word is the start of sentence
6.   **EOS**: whether a word is the end of sentence
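
Written out, the Elastic-Net objective described above has roughly the following form. This is a sketch that assumes $c_1$ and $c_2$ act as plain coefficients on the L1 and squared L2 norms of the weights; the symbols $\mathbf{w}$ (feature weights) and $\mathbf{x}^{(n)}, \mathbf{y}^{(n)}$ (the $n$-th training sentence and its label sequence, $n = 1, \dots, N$) are introduced here for illustration, and the exact scaling used inside the library may differ.

$$
\min_{\mathbf{w}} \; -\sum_{n=1}^{N} \log p\left(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}; \mathbf{w}\right) \; + \; c_1 \lVert \mathbf{w} \rVert_1 \; + \; c_2 \lVert \mathbf{w} \rVert_2^2
$$

The L1 term tends to push individual weights to exactly zero, while the L2 term shrinks all weights smoothly; with $c_1 = c_2 = 0.1$ both penalties carry the same coefficient.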



In [5]:
from sklearn_crfsuite import CRF, metrics

In [6]:
# extract features and labels
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {'bias': 1.0,
                'word.lower()': word.lower(),        # word identity
                'word[-3:]': word[-3:],              # word suffix 
                'word[-2:]': word[-2:],
                'word.isupper()': word.isupper(),    # word shape
                'word.istitle()': word.istitle(),
                'word.isdigit()': word.isdigit(),
                'postag': postag,                    # POS tag
                'postag[:2]': postag[:2],
               }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({'-1:word.lower()': word1.lower(),
                         '-1:word.istitle()': word1.istitle(),
                         '-1:word.isupper()': word1.isupper(),
                         '-1:postag': postag1,
                         '-1:postag[:2]': postag1[:2],
                       })
    else:
        features['BOS'] = True                      # BOS

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({'+1:word.lower()': word1.lower(),
                         '+1:word.istitle()': word1.istitle(),
                         '+1:word.isupper()': word1.isupper(),
                         '+1:postag': postag1,
                         '+1:postag[:2]': postag1[:2],
                       })
    else:
        features['EOS'] = True                     # EOS

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def setup(data_sents):
    return [sent2features(s) for s in data_sents], [sent2labels(s) for s in data_sents]
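
To see what kind of dictionary such an extractor yields, here is a self-contained sketch on a made-up two-word sentence. `toy_features` is a hypothetical, reduced helper written for this illustration (it omits the bias and the neighbouring-word features); it is not the notebook's `word2features`.

```python
def toy_features(sent, i):
    """Reduced, illustrative per-token feature dictionary.

    `sent` is a list of (word, POS, label) tuples.
    """
    word, postag = sent[i][0], sent[i][1]
    features = {
        "word.lower()": word.lower(),      # word identity
        "word[-3:]": word[-3:],            # word suffixes
        "word[-2:]": word[-2:],
        "word.isupper()": word.isupper(),  # word shape
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
        "postag": postag,                  # POS tag
    }
    if i == 0:
        features["BOS"] = True             # first token of the sentence
    if i == len(sent) - 1:
        features["EOS"] = True             # last token of the sentence
    return features


# Made-up sentence, for illustration only.
toy_sent = [("London", "NNP", "B-location"), ("calling", "VBG", "O")]
first = toy_features(toy_sent, 0)
last = toy_features(toy_sent, 1)
print(first)
print(last)
```

String-valued features such as the lowercased word are treated as categorical by the CRF, while boolean flags such as `BOS` act as indicators, so one dictionary per token is all the model needs.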

In [7]:
# set up datasets
X_train,y_train = setup(Train_sents)
X_test, y_test  = setup(Test_sents)
X_dev,  y_dev   = setup(Dev_sents)

In [None]:
# training
baseline = CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100, all_possible_transitions=True)
baseline.fit(X_train, y_train)
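
With `all_possible_transitions=True`, transition features are generated for every ordered pair of labels, including pairs that never occur in the training data. The sketch below only enumerates those pairs for this task's label set to show how many there are; the names `entity_types`, `tagset` and `all_transitions` are illustrative and do not touch the model.

```python
from itertools import product

# The 13 labels of this task: 6 entity types with B-/I- prefixes, plus 'O'.
entity_types = ["corporation", "creative-work", "group",
                "location", "person", "product"]
tagset = ["O"] + [f"{prefix}-{etype}"
                  for etype in entity_types for prefix in ("B", "I")]

# Every ordered (previous label, current label) pair, observed or not.
all_transitions = list(product(tagset, repeat=2))

print(len(tagset))                           # 13
print(len(all_transitions))                  # 169
print(("O", "I-person") in all_transitions)  # True
```

A pair such as `('O', 'I-person')` should not appear in well-formed annotations, and giving it its own feature lets the model learn a negative weight for it.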

In [9]:
# evaluate
y_pred = baseline.predict(X_test)

labels = list(baseline.classes_)
labels.remove('O')
sorted_labels = sorted(labels, key = lambda name: (name[1:], name[0]))
print(metrics.flat_classification_report(y_test, y_pred, labels=sorted_labels, digits=4))

                 precision    recall  f1-score   support

  B-corporation     0.0000    0.0000    0.0000        66
  I-corporation     0.0000    0.0000    0.0000        22
B-creative-work     0.3333    0.0352    0.0637       142
I-creative-work     0.2963    0.0367    0.0653       218
        B-group     0.3000    0.0364    0.0649       165
        I-group     0.3571    0.0714    0.1190        70
     B-location     0.3846    0.2333    0.2905       150
     I-location     0.2308    0.0638    0.1000        94
       B-person     0.5514    0.1375    0.2201       429
       I-person     0.5472    0.2214    0.3152       131
      B-product     0.6000    0.0236    0.0455       127
      I-product     0.3750    0.0476    0.0845       126

      micro avg     0.4297    0.0931    0.1530      1740
      macro avg     0.3313    0.0756    0.1141      1740
   weighted avg     0.4009    0.0931    0.1422      1740



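The baseline recognizes entity tags poorly: the micro-averaged F1 score over B-tags and I-tags is only 0.1530. Recall is particularly low (0.0931 micro-averaged, against a precision of 0.4297), and both corporation tags score zero on every metric.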
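
To make the effect of low recall concrete, the short sketch below recomputes F1 as the harmonic mean of precision and recall, using the micro-averaged figures from the report. The helper `f1` is defined here for illustration only.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Micro-averaged precision and recall from the baseline report.
micro_f1 = f1(0.4297, 0.0931)
print(round(micro_f1, 4))  # 0.153, in line with the reported 0.1530

# Corporation tags: precision and recall are both zero.
print(f1(0.0, 0.0))        # 0.0
```

Because the harmonic mean is dominated by the smaller of its two inputs, a recall below 0.1 keeps F1 low even though precision is above 0.4.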

# Hyperparameter Optimization

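Allow up to $1000$ iterations and run a randomized search over the training algorithm and the regularization hyperparameters of the CRF model. The candidate algorithms are **Gradient Descent with the L-BFGS method** and **Stochastic Gradient Descent with L2 regularization**. The coefficients are sampled from exponential distributions, $c_1 \sim \mathrm{Exp}(2)$ (mean $0.5$) and $c_2 \sim \mathrm{Exp}(20)$ (mean $0.05$), i.e. with densities proportional to $\exp(-2t)$ and $\exp(-20t)$. Candidates are scored by weighted F1 over the entity labels using 5-fold cross-validation on the training set, with the development set passed to each fit as hold-out data, and the best-scoring combination is kept.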
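
The two sampling distributions for $c_1$ and $c_2$ are exponential with rates 2 and 20, which in the scale parameterization of `scipy.stats.expon` correspond to `scale=0.5` and `scale=0.05` (scale is the reciprocal of the rate). The sketch below uses the standard library's `random.expovariate` as a stand-in for SciPy, so it is an illustration of the distributions rather than of the search itself.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# random.expovariate takes the rate; scipy.stats.expon takes scale = 1 / rate.
c1_samples = [rng.expovariate(2.0) for _ in range(100_000)]   # scale 0.5
c2_samples = [rng.expovariate(20.0) for _ in range(100_000)]  # scale 0.05

c1_mean = sum(c1_samples) / len(c1_samples)
c2_mean = sum(c2_samples) / len(c2_samples)
print(f"mean c1 ~ {c1_mean:.3f}, mean c2 ~ {c2_mean:.3f}")
```

Most draws fall well below the mean, so the search spends more of its budget on weak regularization, and on much weaker L2 than L1 penalties.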

In [10]:
import scipy.stats
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

In [None]:
# randomized search over hyperparameters
crf = CRF(max_iterations=1000, all_possible_transitions=True)
params_space = {'algorithm': ['lbfgs', 'l2sgd'],
                'c1': scipy.stats.expon(scale=0.5),   # mean 0.5
                'c2': scipy.stats.expon(scale=0.05),  # mean 0.05
               }

rs = RandomizedSearchCV(crf, params_space, 
                        cv = 5,
                        verbose = 1,
                        n_jobs = -1,
                        n_iter = 50,
                        scoring = make_scorer(metrics.flat_f1_score, average='weighted', labels=labels)
                       )
rs.fit(X=X_train, y=y_train, X_dev=X_dev, y_dev=y_dev)

Fitting 5 folds for each of 50 candidates, totalling 250 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed: 69.5min


In [50]:
# optimization results
print('Best Hyperparameters:', rs.best_params_)
print('Best Cross-Validation Score:', rs.best_score_)
print('Model Size: {:0.2f}M'.format(rs.best_estimator_.size_ / 10**6))

best params: {'c1': 0.0016658821336182827, 'c2': 0.00670123074384953}
best CV score: 0.39981334218537723
model size: 0.63M


In [None]:
# evaluate on test set, refitting with the best hyperparameters found by the search
optmz = CRF(**rs.best_params_, max_iterations=1000, all_possible_transitions=True)
optmz.fit(X_train, y_train)
y_pred = optmz.predict(X_test)
labels = list(optmz.classes_)
labels.remove('O')
sorted_labels = sorted(labels, key = lambda name: (name[1:], name[0]))
print(metrics.flat_classification_report(y_test, y_pred, labels=sorted_labels, digits=4))

# Experiments with Features