## Team members:

> Gao Siwei A0123534N


> Li Yuejun A0213494E


> Liang Zhixi A0119487R


> Wei Yuan A0213480N


> Yan Yixuan A0119431M


## Supervised Learning for Entity and Aspect Mining

This notebook introduces Conditional Random Fields (CRF) for entity and aspect mining. Recall that entity and aspect mining involves three main tasks:
1. Extraction of entity 
2. Extraction of aspects associated with the entity
3. Sentiment classification

In this notebook, we use CRF for the second task. 

### Conditional Random Fields
CRF is a machine learning technique that works on sequences and is very popular in natural language processing (NLP), e.g. in named entity recognition (NER), part-of-speech (POS) tagging and word sense disambiguation.

A CRF is a type of Markov random field conditioned on the observed sequence. Unlike a hidden Markov model (HMM), its features may depend on observations beyond the adjacent words.
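For reference, a linear-chain CRF is usually written as below. The symbols ($f_k$ for feature functions, $\lambda_k$ for their weights, $Z(x)$ for the normaliser) are standard textbook notation and are assumptions here, not names used elsewhere in this notebook:

$$
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
$$

Each $f_k$ may inspect the whole input sequence $x$, which is why features of the previous and next word can be used when tagging position $t$.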

Earlier, we introduced several heuristic techniques for the extraction of aspects. These include dependency parsing and looking at syntactic relations (like 'of', 'from', etc.). In this notebook, we try to integrate linguistic features, e.g. the POS information of words, into the ML model.

In [None]:
from itertools import chain
import nltk
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
import sklearn
import pandas as pd

!pip install python-crfsuite
import pycrfsuite

print(sklearn.__version__)

Collecting python-crfsuite
  Downloading python_crfsuite-0.9.7-cp37-cp37m-manylinux1_x86_64.whl (743 kB)

Our labelled data set is in IOB format with 3 columns. The first column is the actual word, the second is its POS tag, and the third is the label: B-A (beginning of an aspect), I-A (inside an aspect) or O (other). We write a short function to convert it into the form expected by the pycrfsuite library, an accessible library for running CRFs.

The function word2features extracts features for each token in a sentence. The original version used only the POS of the individual tokens; the version below also uses word-level features of the token and its neighbours. The function is adapted from https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html
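To make the IOB layout concrete, here is a minimal sketch that parses a small in-memory sample. The helper name `read_iob` and the sample text are hypothetical (the second sentence is made up for illustration); only the three-column layout with blank lines between sentences is taken from the description above.

```python
import io

# Three whitespace-separated columns per token; a blank line ends a sentence.
SAMPLE_IOB = """\
But CC O
the DT O
staff NN B-A
was VBD O

Great JJ O
food NN B-A
"""

def read_iob(handle):
    """Group (word, POS, tag) rows into sentences."""
    sentences, current = [], []
    for line in handle:
        fields = tuple(line.split())
        if fields:
            current.append(fields)
        elif current:
            sentences.append(current)
            current = []
    if current:  # keep the last sentence when there is no trailing blank line
        sentences.append(current)
    return sentences

sents = read_iob(io.StringIO(SAMPLE_IOB))
print(len(sents))   # number of sentences
print(sents[0][2])  # third token of the first sentence
```

Each sentence ends up as a list of `(word, POS, tag)` tuples, which is the shape the feature functions in this notebook expect.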

In [None]:
from google.colab import drive
import os
drive.mount('/content/gdrive')
os.chdir('/content/gdrive/My Drive/Colab Notebooks/4/')

Mounted at /content/gdrive


In [None]:
%%time

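def createCRFSet(fname):
    """Read an IOB file into a list of sentences, each a list of (word, POS, tag) tuples."""
    train_sents = []
    tt_sents = []

    # read each line as a tuple; blank lines become empty tuples
    with open(fname, encoding="utf-8") as fp:
        t_sents = [tuple(line.split()) for line in fp]

    # group the tuples into sentences, using empty tuples as separators
    for t in t_sents:
        if len(t) != 0:
            tt_sents.append(t)
        else:
            train_sents.append(tt_sents)
            tt_sents = []

    # keep the final sentence if the file does not end with a blank line
    if tt_sents:
        train_sents.append(tt_sents)

    return train_sents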

train_sents = createCRFSet("./Restaurants_Train.iob")
test_sents = createCRFSet("./Restaurants_Test.iob")
#test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

CPU times: user 57.1 ms, sys: 12.3 ms, total: 69.4 ms
Wall time: 1.89 s


In [None]:
train_sents[0]

[('But', 'CC', 'O'),
 ('the', 'DT', 'O'),
 ('staff', 'NN', 'B-A'),
 ('was', 'VBD', 'O'),
 ('so', 'RB', 'O'),
 ('horrible', 'JJ', 'O'),
 ('to', 'TO', 'O'),
 ('us', 'PRP', 'O'),
 ('.', '.', 'O')]

In [None]:
import string
print("Hello".isupper())
print("Hello".istitle())
print("S23".isdigit())
print("Hello" in string.punctuation)
print(len([x for x in "Hello"[1:] if x.isupper()])>0)

False
True
False
False
False


In [None]:

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = [  # for all words
        'bias',
        'word.lower()=' + word.lower(), 
        'word[:4]='+ word[:4], # first 4, prefix
        'word[:3]='+ word[:3],
        'word[:2]='+ word[:2],
        'word[-4:]='+ word[-4:], # last 4, suffix
        'word[-3:]='+ word[-3:], 
        'word[-2:]='+ word[-2:],
        'word.isupper()='+ str(word.isupper()), 
        'word.islower()='+ str(word.islower()), 
        'word.istitle()='+ str(word.istitle()), 
        'word.isdigit()='+ str(word.isdigit()),
        'word.ispunctuation='+ str((word in string.punctuation)),
        'word.length='+ str(len(word)),
        'wordmixedcap='+  str(len([x for x in word[1:] if x.isupper()])>0),
        'postag=' + postag,
        'postag[:2]='+ postag[:2],  # first 2 char of the pos
        'distfromsentbegin=' + str(i)
    ]
    if i > 0: # if not BOS, check previous word
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:word.lower()='+ word1.lower(),
            '-1:word[:4]='+ word1[:4], # first 4, prefix
            '-1:word[:3]='+ word1[:3],
            '-1:word[:2]='+ word1[:2],
            '-1:word[-4:]='+ word1[-4:], # last 4, suffix
            '-1:word[-3:]='+ word1[-3:], 
            '-1:word[-2:]='+ word1[-2:],
            '-1:word.istitle()='+ str(word1.istitle()),
            '-1:word.isupper()='+ str(word1.isupper()),
            '-1:word.islower()='+ str(word1.islower()),
            '-1:word.isdigit()='+ str(word1.isdigit()),
            '-1:word.ispunctuation='+ str((word1 in string.punctuation)),
            '-1:word.length='+ str(len(word1)),
            '-1:wordmixedcap='+  str(len([x for x in word1[1:] if x.isupper()])>0),
            '-1:postag=' + postag1,
            '-1:postag[:2]='+postag1[:2],
            '-1:distfromsentbegin=' + str(i)  # note: uses i, so this repeats distfromsentbegin
        ])
    else:
        features.append('BOS')  # beginning of sentence
   
        
    if i < len(sent)-1:  # if not EOS, check next word
        word2 = sent[i+1][0]
        postag2 = sent[i+1][1]
        features.extend([
            '+1:word.lower()='+ word2.lower(),
            '+1:word[:4]='+ word2[:4], # first 4, prefix
            '+1:word[:3]='+ word2[:3],
            '+1:word[:2]='+ word2[:2],
            '+1:word[-4:]='+ word2[-4:], # last 4, suffix
            '+1:word[-3:]='+ word2[-3:], 
            '+1:word[-2:]='+ word2[-2:],
            '+1:word.istitle()='+ str(word2.istitle()),
            '+1:word.isupper()='+ str(word2.isupper()),
            '+1:word.islower()='+ str(word2.islower()),
            '+1:word.isdigit()='+ str(word2.isdigit()),
            '+1:word.ispunctuation='+ str((word2 in string.punctuation)),
            '+1:word.length='+ str(len(word2)),
            '+1:wordmixedcap='+  str(len([x for x in word2[1:] if x.isupper()])>0),
            '+1:postag=' + postag2,
            '+1:postag[:2]='+ postag2[:2],
            '+1:distfromsentbegin=' + str(i)  # note: uses i, so this repeats distfromsentbegin
        ])
    else:
        features.append('EOS')  # end of sentence
                
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

Note the features for one of the sentences: 'To be completely fair, the only redeeming factor was the food which was above average, but couldn't make up for all the other deficiencies of Teodora'. The features of the previous and next word, including their POS tags, are used alongside those of the word itself.

In [None]:
# data before feature extraction, changed to dataframe for easy printing.
df_1 = pd.DataFrame(train_sents[1], columns=["Word","POS","Entity or Aspect Tag"])

df_1


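| | Word | POS | Entity or Aspect Tag |
|---|---|---|---|
| 0 | To | TO | O |
| 1 | be | VB | O |
| 2 | completely | RB | O |
| 3 | fair | JJ | O |
| 4 | , | , | O |
| 5 | the | DT | O |
| 6 | only | JJ | O |
| 7 | redeeming | NN | O |
| 8 | factor | NN | O |
| 9 | was | VBD | O |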


In [None]:
# To observe what the training set looks like after feature extraction
# df_2 = pd.DataFrame(sent2features(train_sents[1]), columns=["Bias constant","POS", "POS Before","POS after" ])
df_2 = pd.DataFrame(sent2features(train_sents[1]))
df_2

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51
0,bias,word.lower()=to,word[:4]=To,word[:3]=To,word[:2]=To,word[-4:]=To,word[-3:]=To,word[-2:]=To,word.isupper()=False,word.islower()=False,word.istitle()=True,word.isdigit()=False,word.ispunctuation=False,word.length=2,wordmixedcap=False,postag=TO,postag[:2]=TO,distfromsentbegin=0,BOS,+1:word.lower()=be,+1:word[:4]=be,+1:word[:3]=be,+1:word[:2]=be,+1:word[-4:]=be,+1:word[-3:]=be,+1:word[-2:]=be,+1:word.istitle()=False,+1:word.isupper()=False,+1:word.islower()=True,+1:word.isdigit()=False,+1:word.ispunctuation=False,+1:word.length=2,+1:wordmixedcap=False,+1:postag=VB,+1:postag[:2]=VB,+1:distfromsentbegin=0,,,,,,,,,,,,,,,,
1,bias,word.lower()=be,word[:4]=be,word[:3]=be,word[:2]=be,word[-4:]=be,word[-3:]=be,word[-2:]=be,word.isupper()=False,word.islower()=True,word.istitle()=False,word.isdigit()=False,word.ispunctuation=False,word.length=2,wordmixedcap=False,postag=VB,postag[:2]=VB,distfromsentbegin=1,-1:word.lower()=to,-1:word[:4]=To,-1:word[:3]=To,-1:word[:2]=To,-1:word[-4:]=To,-1:word[-3:]=To,-1:word[-2:]=To,-1:word.istitle()=True,-1:word.isupper()=False,-1:word.islower()=False,-1:word.isdigit()=False,-1:word.ispunctuation=False,-1:word.length=2,-1:wordmixedcap=False,-1:postag=TO,-1:postag[:2]=TO,-1:distfromsentbegin=1,+1:word.lower()=completely,+1:word[:4]=comp,+1:word[:3]=com,+1:word[:2]=co,+1:word[-4:]=tely,+1:word[-3:]=ely,+1:word[-2:]=ly,+1:word.istitle()=False,+1:word.isupper()=False,+1:word.islower()=True,+1:word.isdigit()=False,+1:word.ispunctuation=False,+1:word.length=10,+1:wordmixedcap=False,+1:postag=RB,+1:postag[:2]=RB,+1:distfromsentbegin=1
2,bias,word.lower()=completely,word[:4]=comp,word[:3]=com,word[:2]=co,word[-4:]=tely,word[-3:]=ely,word[-2:]=ly,word.isupper()=False,word.islower()=True,word.istitle()=False,word.isdigit()=False,word.ispunctuation=False,word.length=10,wordmixedcap=False,postag=RB,postag[:2]=RB,distfromsentbegin=2,-1:word.lower()=be,-1:word[:4]=be,-1:word[:3]=be,-1:word[:2]=be,-1:word[-4:]=be,-1:word[-3:]=be,-1:word[-2:]=be,-1:word.istitle()=False,-1:word.isupper()=False,-1:word.islower()=True,-1:word.isdigit()=False,-1:word.ispunctuation=False,-1:word.length=2,-1:wordmixedcap=False,-1:postag=VB,-1:postag[:2]=VB,-1:distfromsentbegin=2,+1:word.lower()=fair,+1:word[:4]=fair,+1:word[:3]=fai,+1:word[:2]=fa,+1:word[-4:]=fair,+1:word[-3:]=air,+1:word[-2:]=ir,+1:word.istitle()=False,+1:word.isupper()=False,+1:word.islower()=True,+1:word.isdigit()=False,+1:word.ispunctuation=False,+1:word.length=4,+1:wordmixedcap=False,+1:postag=JJ,+1:postag[:2]=JJ,+1:distfromsentbegin=2
3,bias,word.lower()=fair,word[:4]=fair,word[:3]=fai,word[:2]=fa,word[-4:]=fair,word[-3:]=air,word[-2:]=ir,word.isupper()=False,word.islower()=True,word.istitle()=False,word.isdigit()=False,word.ispunctuation=False,word.length=4,wordmixedcap=False,postag=JJ,postag[:2]=JJ,distfromsentbegin=3,-1:word.lower()=completely,-1:word[:4]=comp,-1:word[:3]=com,-1:word[:2]=co,-1:word[-4:]=tely,-1:word[-3:]=ely,-1:word[-2:]=ly,-1:word.istitle()=False,-1:word.isupper()=False,-1:word.islower()=True,-1:word.isdigit()=False,-1:word.ispunctuation=False,-1:word.length=10,-1:wordmixedcap=False,-1:postag=RB,-1:postag[:2]=RB,-1:distfromsentbegin=3,"+1:word.lower()=,","+1:word[:4]=,","+1:word[:3]=,","+1:word[:2]=,","+1:word[-4:]=,","+1:word[-3:]=,","+1:word[-2:]=,",+1:word.istitle()=False,+1:word.isupper()=False,+1:word.islower()=False,+1:word.isdigit()=False,+1:word.ispunctuation=True,+1:word.length=1,+1:wordmixedcap=False,"+1:postag=,","+1:postag[:2]=,",+1:distfromsentbegin=3
4,bias,"word.lower()=,","word[:4]=,","word[:3]=,","word[:2]=,","word[-4:]=,","word[-3:]=,","word[-2:]=,",word.isupper()=False,word.islower()=False,word.istitle()=False,word.isdigit()=False,word.ispunctuation=True,word.length=1,wordmixedcap=False,"postag=,","postag[:2]=,",distfromsentbegin=4,-1:word.lower()=fair,-1:word[:4]=fair,-1:word[:3]=fai,-1:word[:2]=fa,-1:word[-4:]=fair,-1:word[-3:]=air,-1:word[-2:]=ir,-1:word.istitle()=False,-1:word.isupper()=False,-1:word.islower()=True,-1:word.isdigit()=False,-1:word.ispunctuation=False,-1:word.length=4,-1:wordmixedcap=False,-1:postag=JJ,-1:postag[:2]=JJ,-1:distfromsentbegin=4,+1:word.lower()=the,+1:word[:4]=the,+1:word[:3]=the,+1:word[:2]=th,+1:word[-4:]=the,+1:word[-3:]=the,+1:word[-2:]=he,+1:word.istitle()=False,+1:word.isupper()=False,+1:word.islower()=True,+1:word.isdigit()=False,+1:word.ispunctuation=False,+1:word.length=3,+1:wordmixedcap=False,+1:postag=DT,+1:postag[:2]=DT,+1:distfromsentbegin=4
5,bias,word.lower()=the,word[:4]=the,word[:3]=the,word[:2]=th,word[-4:]=the,word[-3:]=the,word[-2:]=he,word.isupper()=False,word.islower()=True,word.istitle()=False,word.isdigit()=False,word.ispunctuation=False,word.length=3,wordmixedcap=False,postag=DT,postag[:2]=DT,distfromsentbegin=5,"-1:word.lower()=,","-1:word[:4]=,","-1:word[:3]=,","-1:word[:2]=,","-1:word[-4:]=,","-1:word[-3:]=,","-1:word[-2:]=,",-1:word.istitle()=False,-1:word.isupper()=False,-1:word.islower()=False,-1:word.isdigit()=False,-1:word.ispunctuation=True,-1:word.length=1,-1:wordmixedcap=False,"-1:postag=,","-1:postag[:2]=,",-1:distfromsentbegin=5,+1:word.lower()=only,+1:word[:4]=only,+1:word[:3]=onl,+1:word[:2]=on,+1:word[-4:]=only,+1:word[-3:]=nly,+1:word[-2:]=ly,+1:word.istitle()=False,+1:word.isupper()=False,+1:word.islower()=True,+1:word.isdigit()=False,+1:word.ispunctuation=False,+1:word.length=4,+1:wordmixedcap=False,+1:postag=JJ,+1:postag[:2]=JJ,+1:distfromsentbegin=5
6,bias,word.lower()=only,word[:4]=only,word[:3]=onl,word[:2]=on,word[-4:]=only,word[-3:]=nly,word[-2:]=ly,word.isupper()=False,word.islower()=True,word.istitle()=False,word.isdigit()=False,word.ispunctuation=False,word.length=4,wordmixedcap=False,postag=JJ,postag[:2]=JJ,distfromsentbegin=6,-1:word.lower()=the,-1:word[:4]=the,-1:word[:3]=the,-1:word[:2]=th,-1:word[-4:]=the,-1:word[-3:]=the,-1:word[-2:]=he,-1:word.istitle()=False,-1:word.isupper()=False,-1:word.islower()=True,-1:word.isdigit()=False,-1:word.ispunctuation=False,-1:word.length=3,-1:wordmixedcap=False,-1:postag=DT,-1:postag[:2]=DT,-1:distfromsentbegin=6,+1:word.lower()=redeeming,+1:word[:4]=rede,+1:word[:3]=red,+1:word[:2]=re,+1:word[-4:]=ming,+1:word[-3:]=ing,+1:word[-2:]=ng,+1:word.istitle()=False,+1:word.isupper()=False,+1:word.islower()=True,+1:word.isdigit()=False,+1:word.ispunctuation=False,+1:word.length=9,+1:wordmixedcap=False,+1:postag=NN,+1:postag[:2]=NN,+1:distfromsentbegin=6
7,bias,word.lower()=redeeming,word[:4]=rede,word[:3]=red,word[:2]=re,word[-4:]=ming,word[-3:]=ing,word[-2:]=ng,word.isupper()=False,word.islower()=True,word.istitle()=False,word.isdigit()=False,word.ispunctuation=False,word.length=9,wordmixedcap=False,postag=NN,postag[:2]=NN,distfromsentbegin=7,-1:word.lower()=only,-1:word[:4]=only,-1:word[:3]=onl,-1:word[:2]=on,-1:word[-4:]=only,-1:word[-3:]=nly,-1:word[-2:]=ly,-1:word.istitle()=False,-1:word.isupper()=False,-1:word.islower()=True,-1:word.isdigit()=False,-1:word.ispunctuation=False,-1:word.length=4,-1:wordmixedcap=False,-1:postag=JJ,-1:postag[:2]=JJ,-1:distfromsentbegin=7,+1:word.lower()=factor,+1:word[:4]=fact,+1:word[:3]=fac,+1:word[:2]=fa,+1:word[-4:]=ctor,+1:word[-3:]=tor,+1:word[-2:]=or,+1:word.istitle()=False,+1:word.isupper()=False,+1:word.islower()=True,+1:word.isdigit()=False,+1:word.ispunctuation=False,+1:word.length=6,+1:wordmixedcap=False,+1:postag=NN,+1:postag[:2]=NN,+1:distfromsentbegin=7
8,bias,word.lower()=factor,word[:4]=fact,word[:3]=fac,word[:2]=fa,word[-4:]=ctor,word[-3:]=tor,word[-2:]=or,word.isupper()=False,word.islower()=True,word.istitle()=False,word.isdigit()=False,word.ispunctuation=False,word.length=6,wordmixedcap=False,postag=NN,postag[:2]=NN,distfromsentbegin=8,-1:word.lower()=redeeming,-1:word[:4]=rede,-1:word[:3]=red,-1:word[:2]=re,-1:word[-4:]=ming,-1:word[-3:]=ing,-1:word[-2:]=ng,-1:word.istitle()=False,-1:word.isupper()=False,-1:word.islower()=True,-1:word.isdigit()=False,-1:word.ispunctuation=False,-1:word.length=9,-1:wordmixedcap=False,-1:postag=NN,-1:postag[:2]=NN,-1:distfromsentbegin=8,+1:word.lower()=was,+1:word[:4]=was,+1:word[:3]=was,+1:word[:2]=wa,+1:word[-4:]=was,+1:word[-3:]=was,+1:word[-2:]=as,+1:word.istitle()=False,+1:word.isupper()=False,+1:word.islower()=True,+1:word.isdigit()=False,+1:word.ispunctuation=False,+1:word.length=3,+1:wordmixedcap=False,+1:postag=VBD,+1:postag[:2]=VB,+1:distfromsentbegin=8
9,bias,word.lower()=was,word[:4]=was,word[:3]=was,word[:2]=wa,word[-4:]=was,word[-3:]=was,word[-2:]=as,word.isupper()=False,word.islower()=True,word.istitle()=False,word.isdigit()=False,word.ispunctuation=False,word.length=3,wordmixedcap=False,postag=VBD,postag[:2]=VB,distfromsentbegin=9,-1:word.lower()=factor,-1:word[:4]=fact,-1:word[:3]=fac,-1:word[:2]=fa,-1:word[-4:]=ctor,-1:word[-3:]=tor,-1:word[-2:]=or,-1:word.istitle()=False,-1:word.isupper()=False,-1:word.islower()=True,-1:word.isdigit()=False,-1:word.ispunctuation=False,-1:word.length=6,-1:wordmixedcap=False,-1:postag=NN,-1:postag[:2]=NN,-1:distfromsentbegin=9,+1:word.lower()=the,+1:word[:4]=the,+1:word[:3]=the,+1:word[:2]=th,+1:word[-4:]=the,+1:word[-3:]=the,+1:word[-2:]=he,+1:word.istitle()=False,+1:word.isupper()=False,+1:word.islower()=True,+1:word.isdigit()=False,+1:word.ispunctuation=False,+1:word.length=3,+1:wordmixedcap=False,+1:postag=DT,+1:postag[:2]=DT,+1:distfromsentbegin=9


In [None]:
%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

CPU times: user 1.04 s, sys: 157 ms, total: 1.2 s
Wall time: 1.2 s


In [None]:
%%time
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

CPU times: user 1.4 s, sys: 6.43 ms, total: 1.4 s
Wall time: 1.41 s


In [None]:

trainer.set_params({
    'c1': 0.5,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})
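As a rough guide to what `c1` and `c2` control, L-BFGS training with both penalties can be thought of as minimising a regularised negative log-likelihood. The formula below is a sketch in standard notation ($\lambda$ for the feature weights, $n$ indexing training sentences); the exact constant factors are an assumption and may differ in the CRFsuite implementation:

$$
\min_{\lambda} \; -\sum_{n} \log p\big(y^{(n)} \mid x^{(n)}; \lambda\big) \; + \; c_1 \lVert \lambda \rVert_1 \; + \; c_2 \lVert \lambda \rVert_2^2
$$

A larger `c1` drives more weights to exactly zero, which reduces the number of active features, while a larger `c2` shrinks all weights towards zero.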

In [None]:
trainer.params()

['feature.minfreq',
 'feature.possible_states',
 'feature.possible_transitions',
 'c1',
 'c2',
 'max_iterations',
 'num_memories',
 'epsilon',
 'period',
 'delta',
 'linesearch',
 'max_linesearch']

In [None]:
%%time
# Train the model and save the trained CRF model. 
trainer.train('CRF_ABSA.crfsuite')

CPU times: user 4.16 s, sys: 18 ms, total: 4.17 s
Wall time: 4.19 s


In [None]:
trainer.logparser.last_iteration

{'active_features': 11253,
 'error_norm': 469.769573,
 'feature_norm': 50.461573,
 'linesearch_step': 0.0625,
 'linesearch_trials': 5,
 'loss': 2783.04016,
 'num': 50,
 'scores': {},
 'time': 0.303}

In [None]:
# to use the trained model
tagger = pycrfsuite.Tagger()
tagger.open('CRF_ABSA.crfsuite')

<contextlib.closing at 0x7ff563498090>

In [None]:
#Let's try it on one test sentence
example_sent = test_sents[6]
print(' '.join(sent2tokens(example_sent)), end='\n\n')

print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
print("Correct:  ", ' '.join(sent2labels(example_sent)))

Straight-forward , no surprises , very decent Japanese food .

Predicted: O O O O O O O B-A I-A O
Correct:   O O O O O O O B-A I-A O
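The tag sequence can be turned back into aspect phrases by joining each B-A token with the I-A tokens that follow it. The helper `extract_aspects` below is a hypothetical sketch, not part of pycrfsuite; it uses the example sentence and tags printed above.

```python
def extract_aspects(tokens, tags):
    """Return the aspect phrases marked by B-A / I-A tags."""
    aspects, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-A":
            if current:                  # a new aspect starts right after another
                aspects.append(" ".join(current))
            current = [token]
        elif tag == "I-A" and current:   # continue the open aspect
            current.append(token)
        else:                            # "O" (or a stray I-A) closes any open aspect
            if current:
                aspects.append(" ".join(current))
            current = []
    if current:                          # aspect that runs to the end of the sentence
        aspects.append(" ".join(current))
    return aspects

tokens = "Straight-forward , no surprises , very decent Japanese food .".split()
tags = "O O O O O O O B-A I-A O".split()
print(extract_aspects(tokens, tags))  # ['Japanese food']
```

In the notebook, the same helper could be called with `sent2tokens(example_sent)` and `tagger.tag(sent2features(example_sent))`.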


In [None]:
def bio_classification_report(y_true, y_pred):
    """
    Classification report for a list of BIO-encoded sequences.
    It computes token-level metrics and discards "O" labels.
    
    Note that it requires scikit-learn 0.15+ (or a version from github master)
    to calculate averages properly!
    """
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))
        
    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}
    
    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels = [class_indices[cls] for cls in tagset],
        target_names = tagset,
    )

In [None]:
%%time
y_pred = [tagger.tag(xseq) for xseq in X_test]


CPU times: user 267 ms, sys: 0 ns, total: 267 ms
Wall time: 271 ms


In [None]:
print(bio_classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         B-A       0.87      0.79      0.83      1135
         I-A       0.85      0.61      0.71       538

   micro avg       0.87      0.74      0.80      1673
   macro avg       0.86      0.70      0.77      1673
weighted avg       0.87      0.74      0.79      1673
 samples avg       0.10      0.10      0.10      1673



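The report above is token-level: B-A and I-A tokens are scored separately, so a multi-word aspect with one wrong token still earns partial credit. A stricter, span-level view counts an aspect as correct only if its whole extent matches. The sketch below is one way to compute that; the function names and the toy data are hypothetical and the numbers it prints are not results from this notebook.

```python
def bio_spans(tags):
    """Return the set of (start, end) index pairs of aspect spans in one tag sequence."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if start is not None and tag != "I-A":
            spans.add((start, i))
            start = None
        if tag == "B-A":
            start = i
    return spans

def span_prf(y_true, y_pred):
    """Micro-averaged precision, recall and F1 over exactly matching spans."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        gold_spans, pred_spans = bio_spans(true_tags), bio_spans(pred_tags)
        tp += len(gold_spans & pred_spans)
        n_true += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# toy data: the second prediction misses the I-A token, so its span does not match
gold = [["O", "B-A", "I-A", "O"], ["B-A", "I-A"]]
pred = [["O", "B-A", "I-A", "O"], ["B-A", "O"]]
print(span_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

On the real data this would be called as `span_prf(y_test, y_pred)`.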


We can check the learned transition weights between labels; some transitions are far more likely than others. The following output shows that B-A -> I-A is very likely (as in "Japanese" (B-A) "food" (I-A) in the example above).

In [None]:

from collections import Counter
info = tagger.info()

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(info.transitions).most_common(8))

print("\nTop unlikely transitions:")
print_transitions(Counter(info.transitions).most_common()[-8:])

Top likely transitions:
O      -> O       1.551267
I-A    -> I-A     1.184515
B-A    -> I-A     1.079228
O      -> B-A     0.877105
B-A    -> NN      -0.001173
B-A    -> O       -0.176186
O      -> NN      -0.297322
NN     -> I-A     -0.411481

Top unlikely transitions:
B-A    -> O       -0.176186
O      -> NN      -0.297322
NN     -> I-A     -0.411481
I-A    -> O       -0.786301
NN     -> O       -1.227021
I-A    -> B-A     -5.098276
B-A    -> B-A     -5.776205
O      -> I-A     -9.092515
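The strongly negative weight for O -> I-A matches the IOB convention that an I-A tag should only follow B-A or I-A. A small check like the one below can confirm whether a tag sequence respects that convention; `invalid_bio_transitions` is a hypothetical helper written for this illustration.

```python
def invalid_bio_transitions(tags):
    """Return (position, previous_tag, tag) for every I-A not preceded by B-A or I-A."""
    problems = []
    previous = "O"  # a sentence implicitly starts outside any aspect
    for i, tag in enumerate(tags):
        if tag == "I-A" and previous not in ("B-A", "I-A"):
            problems.append((i, previous, tag))
        previous = tag
    return problems

print(invalid_bio_transitions(["O", "B-A", "I-A", "O"]))  # []
print(invalid_bio_transitions(["O", "I-A", "I-A", "O"]))  # [(1, 'O', 'I-A')]
```

Running it over each sequence in `y_pred` would show how often the trained model still produces such a transition.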


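We can also check which features are the most (or least) correlated with the aspect tags. With POS-only features, the top positive features for "B-A" are postag=NN or postag=NNS, i.e. the word is a noun. With the expanded feature set used here, lexical features such as word.lower()=priced rank highest.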

In [None]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-6s %s" % (weight, label, attr))    

print("Top positive:")
print_state_features(Counter(info.state_features).most_common(10))

print("\nTop negative:")
print_state_features(Counter(info.state_features).most_common()[-10:])

Top positive:
6.905116 B-A    word.lower()=priced
4.892806 B-A    word.lower()=bill
3.945973 B-A    word.lower()=entertaining
3.707035 O      EOS
3.480503 B-A    -1:word.lower()=conclusion
3.445495 B-A    word[:4]=port
3.391120 B-A    word.lower()=vegetable
3.360985 B-A    word.lower()=prices
3.334948 O      word[:2]=un
3.196075 O      word[:4]=rest

Top negative:
-2.489338 O      word[:3]=bee
-2.547552 O      word[-3:]=mon
-2.587079 O      word[-4:]=shes
-2.640685 O      word[-2:]=ta
-2.643922 O      word[:3]=sna
-2.734180 O      word[:4]=sand
-2.776047 O      word[-2:]=il
-2.879191 O      word[:4]=deco
-2.921460 O      word[-2:]=ef
-3.885902 O      word[-3:]=nic
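Because the global ranking mixes labels (the negative list above contains only O features), it can help to look at the top features per label. The sketch below assumes a mapping shaped like `info.state_features`, i.e. `{(attribute, label): weight}`; the helper name is hypothetical and the sample holds a few weights copied from the output above.

```python
from collections import defaultdict

def top_features_by_label(state_features, n=3):
    """Group {(attribute, label): weight} by label, keeping the n largest weights per label."""
    grouped = defaultdict(list)
    for (attr, label), weight in state_features.items():
        grouped[label].append((weight, attr))
    return {label: sorted(items, reverse=True)[:n] for label, items in grouped.items()}

# a few weights copied from the output above, standing in for info.state_features
sample = {
    ("word.lower()=priced", "B-A"): 6.905116,
    ("word.lower()=bill", "B-A"): 4.892806,
    ("EOS", "O"): 3.707035,
    ("word[:2]=un", "O"): 3.334948,
    ("word[-3:]=nic", "O"): -3.885902,
}
for label, items in top_features_by_label(sample, n=2).items():
    print(label, items)
```

Passing `info.state_features` instead of `sample` would list the strongest indicators for each of B-A, I-A and O separately.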


Obviously, the result using just POS as features is not good. Now try training a better model by adding other features, like word info, pre/post word, case information, etc.

***1. Report your best set of features and how it affects precision / recall.***

The best feature set consists of the features below, computed for the word itself, the word before it and the word after it:

- POS tag
- word itself
- first 4 characters of word
- first 3 characters of word
- first 2 characters of word
- last 4 characters of word
- last 3 characters of word
- last 2 characters of word
- whether word is all upper case
- whether word is all lower case
- whether word is title-cased (starts with an upper-case letter)
- whether word is digit
- first 2 characters of the POS tag
- whether word is punctuation
- length of word
- whether word has an upper-case letter after its first character (mixed case)
- distance of word from the start of the sentence (token index)
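Since the same template is applied to the previous, current and next word, the feature function can also be written as a loop over offsets. The sketch below is a shortened, hypothetical variant (it covers only some of the listed features and is not the exact function used for the reported results); it makes it easy to try a wider window such as two words on either side.

```python
import string

def token_features(word, postag, prefix=""):
    """Features of a single token; `prefix` marks its position relative to the target."""
    return [
        prefix + "word.lower()=" + word.lower(),
        prefix + "word[:3]=" + word[:3],
        prefix + "word[-3:]=" + word[-3:],
        prefix + "word.istitle()=" + str(word.istitle()),
        prefix + "word.isdigit()=" + str(word.isdigit()),
        prefix + "word.ispunctuation=" + str(word in string.punctuation),
        prefix + "postag=" + postag,
        prefix + "postag[:2]=" + postag[:2],
    ]

def window_features(sent, i, offsets=(-1, 0, 1)):
    """Apply the same template to the target word and its neighbours."""
    features = ["bias"]
    for offset in offsets:
        j = i + offset
        if 0 <= j < len(sent):
            word, postag = sent[j][0], sent[j][1]
            prefix = "" if offset == 0 else "%+d:" % offset  # e.g. "-1:" or "+1:"
            features.extend(token_features(word, postag, prefix))
    if i == 0:
        features.append("BOS")
    if i == len(sent) - 1:
        features.append("EOS")
    return features

sent = [("But", "CC", "O"), ("the", "DT", "O"), ("staff", "NN", "B-A")]
print(window_features(sent, 2))
```

Calling `window_features(sent, i, offsets=(-2, -1, 0, 1, 2))` adds the features of the words two positions away without duplicating any code.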

**Before**

              precision    recall  f1-score   support

         B-A       0.62      0.36      0.46      1135
         I-A       0.55      0.23      0.32       538

**After**

              precision    recall  f1-score   support

         B-A       0.87      0.79      0.83      1135
         I-A       0.85      0.61      0.71       538


***2. Does using more features lead to better recall / precision?***


1.   To some extent yes: adding more features on the surrounding words (case info, POS tag info) does improve the performance.
2.   However, there seems to be an overall cap on performance when using this approach.
3.   More features do not always lead to better results. Some features do not improve the result, e.g. the set of features for the words two positions before / after the target word.



***3. What are the most informative features that lead to the best model? Does it tell you anything intuitively?***

**Top positive:**

1. word.lower()=priced
2. word.lower()=atmosphere
3. word.lower()=waiting
4. BOS
5. word.lower()=reservation

**Top negative:**

1. word.lower()=serving
2. word[-3:]=lad
3. word.lower()=drink
4. word[-2:]=sa
5. word.lower()=sandwich
 
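These features show which words the model relies on to recognise aspects. The top positive features point to price, atmosphere, waiting time and reservations, while the top negative features relate to food, drink and service (serving, drink, sandwich). Note that these weights measure how strongly a feature is associated with a tag, not the sentiment of the review, so they do not by themselves show which aspects reviewers liked or disliked.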

***4. Identify the challenges and difficulties that you face in doing this assignment. Also suggest how the model can be improved.***

We could try a different training algorithm, for example Stochastic Gradient Descent instead of the default L-BFGS.

We could also search over hyperparameter sets with a grid search. The hyperparameters include (but are not limited to):
1. L1 and L2 regularization coefficients
2. maximum number of iterations (more iterations do not necessarily produce a better model; for example, 200 iterations seemed to overfit)
3. epsilon
4. delta
..
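A grid search over these hyperparameters could be sketched as follows. The helper `grid_search`, the grid values and the placeholder scorer `dummy_score` are all hypothetical; in practice the scorer would set the parameters with `trainer.set_params(params)`, train, tag a held-out validation set (not the test set) and return its F1. To our knowledge python-crfsuite also accepts an `algorithm` argument on `Trainer` (for example `'l2sgd'`) for trying SGD, but treat that as an assumption to check against the library documentation.

```python
from itertools import product

def grid_search(param_grid, train_and_score):
    """Try every combination in `param_grid`; return the best (params, score) pair."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(params)
        if score > best_score:  # ties keep the first combination seen
            best_params, best_score = params, score
    return best_params, best_score

# illustrative values only, not tuned settings
param_grid = {
    "c1": [0.1, 0.5, 1.0],
    "c2": [1e-3, 1e-2],
    "max_iterations": [50, 100],
}

def dummy_score(params):
    # placeholder so the sketch runs on its own; replace with real
    # training plus evaluation on a validation set
    return -abs(params["c1"] - 0.5) - params["c2"]

print(grid_search(param_grid, dummy_score))
```

With 3 x 2 x 2 = 12 combinations and a training time of a few seconds each, an exhaustive search of this size is cheap for this data set.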
