## Supervised Learning for Entity and Aspect Mining

This notebook introduces Conditional Random Fields (CRFs) for entity and aspect mining. Recall that entity and aspect mining involves three main tasks:
1. Extraction of the entity
2. Extraction of the aspects associated with the entity
3. Sentiment classification

In this notebook, we use a CRF for the second task.

### Conditional Random Fields
A CRF is a machine learning technique that works on sequences and is very popular in natural language processing (NLP), e.g. in named entity recognition (NER), part-of-speech (POS) tagging and word sense disambiguation.

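A linear-chain CRF is closely related to the hidden Markov model (HMM). The main difference is that the CRF is discriminative: it models the labels conditioned on the observed words, so its features may depend on words beyond the adjacent ones.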
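
For reference, the linear-chain CRF usually used for tagging defines the probability of a label sequence $y = (y_1, \dots, y_T)$ given the words $x$ as shown below. The notation ($f_k$ for feature functions, $\lambda_k$ for their weights, $Z(x)$ for the normaliser) is the common textbook form and is an assumption here, not something taken from this notebook.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Each $f_k$ may look at the previous label, the current label and any part of the input sentence, which is what lets us plug in linguistic features such as POS tags.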

Earlier, we introduced several heuristic techniques based on dependency relations for the extraction of aspects. In this notebook, we try to integrate linguistic features, e.g. the POS information of words, into the ML model.

# Import Libraries

In [17]:
!pip install python-crfsuite



In [38]:
from itertools import chain
import nltk
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
import sklearn
import pandas as pd
import pycrfsuite
from collections import Counter

print(sklearn.__version__)

1.0.2


# Dataset Preparation

Our labelled data set is in IOB format with three columns: the first column is the word, the second is its POS tag, and the third is the label, which is B-A (beginning of an aspect), I-A (inside an aspect) or O (other). We write a short function to convert it into the form expected by the pycrfsuite library (installed as python-crfsuite), one of the more accessible libraries for running CRFs.

In [19]:
# Read a dataset in IOB format

def createCRFSet(fname):
    """Read an IOB file and return a list of sentences.

    Each sentence is a list of (word, POS, label) tuples; a blank line
    marks the end of a sentence.
    """
    sents = []
    current = []
    with open(fname, encoding="utf-8") as fp:
        for line in fp:
            row = tuple(line.split())
            if len(row) != 0:
                current.append(row)
            else:
                # a blank line closes the current sentence
                sents.append(current)
                current = []

    # keep the last sentence if the file does not end with a blank line
    if current:
        sents.append(current)
    return sents

In [20]:
train_sents = createCRFSet("./Restaurants_Train.iob")
train_sents[0]

[('But', 'CC', 'O'),
 ('the', 'DT', 'O'),
 ('staff', 'NN', 'B-A'),
 ('was', 'VBD', 'O'),
 ('so', 'RB', 'O'),
 ('horrible', 'JJ', 'O'),
 ('to', 'TO', 'O'),
 ('us', 'PRP', 'O'),
 ('.', '.', 'O')]

In [21]:
test_sents = createCRFSet("./Restaurants_Test.iob")
test_sents[0]

[('The', 'DT', 'O'),
 ('bread', 'NN', 'B-A'),
 ('is', 'VBZ', 'O'),
 ('top', 'JJ', 'O'),
 ('notch', 'NN', 'O'),
 ('as', 'RB', 'O'),
 ('well', 'RB', 'O'),
 ('.', '.', 'O')]
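
To make the labels easier to read, the B-A / I-A runs in a sentence can be grouped back into aspect terms. The helper below is a minimal sketch; `extract_aspects` is a hypothetical name, not part of the notebook or of pycrfsuite, and it assumes sentences in the `(word, POS, label)` form shown above.

```python
def extract_aspects(sent):
    """Group B-A / I-A runs of one IOB sentence into aspect terms.

    `sent` is a list of (token, postag, label) tuples.
    An I-A that does not follow a B-A or I-A is ignored.
    """
    aspects, current = [], []
    for token, _postag, label in sent:
        if label == "B-A":
            # a new aspect starts; close any aspect still open
            if current:
                aspects.append(" ".join(current))
            current = [token]
        elif label == "I-A" and current:
            current.append(token)
        else:
            if current:
                aspects.append(" ".join(current))
            current = []
    if current:
        aspects.append(" ".join(current))
    return aspects


sample = [('The', 'DT', 'O'), ('bread', 'NN', 'B-A'), ('is', 'VBZ', 'O'),
          ('top', 'JJ', 'O'), ('notch', 'NN', 'O'), ('as', 'RB', 'O'),
          ('well', 'RB', 'O'), ('.', '.', 'O')]
print(extract_aspects(sample))  # ['bread']
```

The same helper can be applied to predicted labels by zipping the tokens and POS tags with the tagger output.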

# Feature Extraction

Each tuple contains only the word and its POS tag as information for predicting the label, and these two pieces of information are not enough on their own. We therefore define a function called "word2features" to extract further features from the sentence. The function is adapted from https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html.

For this tutorial, we use the POS tags of the previous word and the next word as additional features for predicting the label.

In [22]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = [  # for all words
        'bias',
        'postag=' + postag,
        #'word.lower()=' + word.lower(), # lowercase word as feature
        #'word[-3:]='+ word[-3:], # word ending as feature
        #'word[-2:]='+ word[-2:], # word ending as feature
        #'word.isupper()='+ str(word.isupper()), # lexical information as feature
        #'word.istitle()='+ str(word.istitle()), # lexical information as feature
        #'word.isdigit()='+ str(word.isdigit()), # lexical information as feature
        #'postag[:2]='+ postag[:2], # part of POS tag as feature
    ]
    if i > 0: # if not BOS, check previous word
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:postag=' + postag1,
            #'-1:word.lower()='+ word1.lower(),
            #'-1:word.istitle()='+ str(word1.istitle()),
            #'-1:word.isupper()='+ str(word1.isupper()),
            #'-1:postag[:2]='+postag1[:2],
        ])
    else:
        features.append('BOS')  # beginning of sentence
        
    if i < len(sent)-1:  # if not EOS, check next word
        word2 = sent[i+1][0]
        postag2 = sent[i+1][1]
        features.extend([
            '+1:postag=' + postag2,
            #'+1:word.lower()='+ word2.lower(),
            #'+1:word.istitle()='+ str(word2.istitle()),
            #'+1:word.isupper()='+ str(word2.isupper()),
            #'+1:postag[:2]='+ postag2[:2],
        ])
    else:
        features.append('EOS')
                
    return features


def get_sentence_features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def get_sentence_labels(sent):
    return [label for token, postag, label in sent]

def get_sentence_tokens(sent):
    return [token for token, postag, label in sent]

Note the features for one of the sentences: 'To be completely fair, the only redeeming factor was the food which was above average, but couldn't make up for all the other deficiencies of Teodora'. The POS tags of the current word, the word before and the word after are used as features.

In [23]:
# data before feature extraction, changed to dataframe for easy printing.
df_1 = pd.DataFrame(train_sents[1], columns=["Word","POS","Entity or Aspect Tag"])
df_1

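| | Word | POS | Entity or Aspect Tag |
|---|---|---|---|
| 0 | To | TO | O |
| 1 | be | VB | O |
| 2 | completely | RB | O |
| 3 | fair | JJ | O |
| 4 | , | , | O |
| 5 | the | DT | O |
| 6 | only | JJ | O |
| 7 | redeeming | NN | O |
| 8 | factor | NN | O |
| 9 | was | VBD | O |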


In [24]:
# To observe what the training set looks like after feature extraction
df_2 = pd.DataFrame(get_sentence_features(train_sents[1]), columns=["Bias constant","POS", "POS Before","POS after" ])
df_2

| | Bias constant | POS | POS Before | POS after |
|---|---|---|---|---|
| 0 | bias | postag=TO | BOS | +1:postag=VB |
| 1 | bias | postag=VB | -1:postag=TO | +1:postag=RB |
| 2 | bias | postag=RB | -1:postag=VB | +1:postag=JJ |
| 3 | bias | postag=JJ | -1:postag=RB | +1:postag=, |
| 4 | bias | postag=, | -1:postag=JJ | +1:postag=DT |
| 5 | bias | postag=DT | -1:postag=, | +1:postag=JJ |
| 6 | bias | postag=JJ | -1:postag=DT | +1:postag=NN |
| 7 | bias | postag=NN | -1:postag=JJ | +1:postag=NN |
| 8 | bias | postag=NN | -1:postag=NN | +1:postag=VBD |
| 9 | bias | postag=VBD | -1:postag=NN | +1:postag=DT |


# Train-test split

In [25]:
X_train = [get_sentence_features(s) for s in train_sents]
y_train = [get_sentence_labels(s) for s in train_sents]

X_test = [get_sentence_features(s) for s in test_sents]
y_test = [get_sentence_labels(s) for s in test_sents]

# Model Training

In [26]:
# Feed each training sequence (features and labels) to the trainer
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

In [27]:
# set parameters for trainer

trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})
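
In terms of the training objective, `c1` and `c2` weight two penalty terms that are added to the negative log-likelihood of the training data. Schematically, and ignoring any constant scaling that CRFsuite may apply internally (the exact form written here is an assumption, not taken from the notebook):

```latex
\min_{\lambda} \; -\sum_{i=1}^{N} \log p\big(y^{(i)} \mid x^{(i)}\big)
\;+\; c_1 \lVert \lambda \rVert_1
\;+\; c_2 \lVert \lambda \rVert_2^2
```

Here $\lambda$ is the vector of feature weights and $N$ is the number of training sentences. A larger `c1` tends to push more weights to exactly zero, giving a sparser model, while a larger `c2` shrinks all weights smoothly.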

In [28]:
trainer.params()

['feature.minfreq',
 'feature.possible_states',
 'feature.possible_transitions',
 'c1',
 'c2',
 'max_iterations',
 'num_memories',
 'epsilon',
 'period',
 'delta',
 'linesearch',
 'max_linesearch']

In [29]:
%%time
# Train the model and save the trained CRF model. 
trainer.train('CRF_ABSA.crfsuite')

CPU times: user 793 ms, sys: 202 µs, total: 793 ms
Wall time: 803 ms


In [30]:
# Information about last iteration of model
trainer.logparser.last_iteration

{'active_features': 241,
 'error_norm': 161.739801,
 'feature_norm': 16.557075,
 'linesearch_step': 1.0,
 'linesearch_trials': 1,
 'loss': 8270.838023,
 'num': 50,
 'scores': {},
 'time': 0.014}

# Testing

In [31]:
# to use the trained model
tagger = pycrfsuite.Tagger()
tagger.open('CRF_ABSA.crfsuite')

<contextlib.closing at 0x7f8569473a50>

In [32]:
#Let's try it on one test sentence
example_sent = test_sents[6]

print(' '.join(get_sentence_tokens(example_sent)), end='\n\n')

print("Predicted:", ' '.join(tagger.tag(get_sentence_features(example_sent))))

print("Correct:  ", ' '.join(get_sentence_labels(example_sent)))

Straight-forward , no surprises , very decent Japanese food .

Predicted: O O O B-A O O O O O O
Correct:   O O O O O O O B-A I-A O
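
`tagger.tag` returns the single highest-scoring label sequence for a sentence. For linear-chain CRFs this is normally found with Viterbi decoding. The sketch below illustrates the idea only: `viterbi` is a hypothetical helper, the transition weights are rounded from values of the kind a trained model produces, and the per-token state scores are made up for illustration.

```python
def viterbi(state_scores, trans, labels):
    """Return the highest-scoring label sequence.

    state_scores: one dict per token, mapping label -> score
    trans: dict mapping (prev_label, label) -> score
    Missing entries are treated as a score of 0.0.
    """
    # best[t][y] = best score of any label path ending in y at position t
    best = [{y: state_scores[0].get(y, 0.0) for y in labels}]
    back = []  # back[t-1][y] = best previous label when position t has label y
    for t in range(1, len(state_scores)):
        best.append({})
        back.append({})
        for y in labels:
            prev = max(labels, key=lambda p: best[t - 1][p] + trans.get((p, y), 0.0))
            best[t][y] = (best[t - 1][prev] + trans.get((prev, y), 0.0)
                          + state_scores[t].get(y, 0.0))
            back[t - 1][y] = prev
    # follow the back-pointers from the best final label
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-A", "I-A"]
# illustrative transition weights
trans = {("O", "O"): 1.9, ("O", "B-A"): 1.0, ("O", "I-A"): -6.6,
         ("B-A", "O"): 0.2, ("B-A", "B-A"): -5.0, ("B-A", "I-A"): 2.6,
         ("I-A", "O"): -0.7, ("I-A", "B-A"): -4.2, ("I-A", "I-A"): 2.7}
# made-up state scores for the tokens "very", "Japanese", "food", "."
state_scores = [{"O": 2.0}, {"O": 0.5, "B-A": 3.0}, {"B-A": 1.0, "I-A": 2.0}, {"O": 4.0}]
print(viterbi(state_scores, trans, labels))  # ['O', 'B-A', 'I-A', 'O']
```

Because the score of a whole path includes the transition weights, a strongly negative weight such as O -> I-A makes it very unlikely that an I-A is predicted without a preceding B-A.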


In [34]:
def bio_classification_report(y_true, y_pred):
    """
    Classification report for a list of BIO-encoded sequences.
    It computes token-level metrics and discards "O" labels.
    
    Note that it requires scikit-learn 0.15+ (or a version from github master)
    to calculate averages properly!
    """
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))
        
    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}
    
    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels = [class_indices[cls] for cls in tagset],
        target_names = tagset,
    )

In [35]:
y_pred = [tagger.tag(xseq) for xseq in X_test]

In [42]:
X_test

[[['bias', 'postag=DT', 'BOS', '+1:postag=NN'],
  ['bias', 'postag=NN', '-1:postag=DT', '+1:postag=VBZ'],
  ['bias', 'postag=VBZ', '-1:postag=NN', '+1:postag=JJ'],
  ['bias', 'postag=JJ', '-1:postag=VBZ', '+1:postag=NN'],
  ['bias', 'postag=NN', '-1:postag=JJ', '+1:postag=RB'],
  ['bias', 'postag=RB', '-1:postag=NN', '+1:postag=RB'],
  ['bias', 'postag=RB', '-1:postag=RB', '+1:postag=.'],
  ['bias', 'postag=.', '-1:postag=RB', 'EOS']],
 [['bias', 'postag=PRP', 'BOS', '+1:postag=VBP'],
  ['bias', 'postag=VBP', '-1:postag=PRP', '+1:postag=TO'],
  ['bias', 'postag=TO', '-1:postag=VBP', '+1:postag=VB'],
  ['bias', 'postag=VB', '-1:postag=TO', '+1:postag=PRP'],
  ['bias', 'postag=PRP', '-1:postag=VB', '+1:postag=VBP'],
  ['bias', 'postag=VBP', '-1:postag=PRP', '+1:postag=CD'],
  ['bias', 'postag=CD', '-1:postag=VBP', '+1:postag=IN'],
  ['bias', 'postag=IN', '-1:postag=CD', '+1:postag=DT'],
  ['bias', 'postag=DT', '-1:postag=IN', '+1:postag=JJS'],
  ['bias', 'postag=JJS', '-1:postag=DT', '+1

In [36]:
print(bio_classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         B-A       0.62      0.36      0.46      1135
         I-A       0.55      0.23      0.32       538

   micro avg       0.60      0.32      0.42      1673
   macro avg       0.59      0.30      0.39      1673
weighted avg       0.60      0.32      0.41      1673
 samples avg       0.04      0.04      0.04      1673
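
The report above scores individual tokens. Aspect terms are often also evaluated as whole spans, where a prediction counts only if its start and end both match the gold aspect. The following is a minimal sketch of such an exact-match score; `iob_spans` and `span_prf` are hypothetical helper names, and the function works directly on lists of label sequences such as `y_test` and `y_pred`.

```python
def iob_spans(labels):
    """Return the set of (start, end) index pairs (end exclusive) of B-A/I-A runs."""
    spans, start = set(), None
    # the extra "O" is a sentinel that closes a span ending at the last token
    for i, label in enumerate(list(labels) + ["O"]):
        if start is not None and label != "I-A":
            spans.add((start, i))
            start = None
        if label == "B-A":
            start = i
    return spans


def span_prf(y_true, y_pred):
    """Exact-match span precision, recall and F1 over a list of sentences."""
    tp = n_true = n_pred = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        true_spans, pred_spans = iob_spans(true_seq), iob_spans(pred_seq)
        tp += len(true_spans & pred_spans)
        n_true += len(true_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# two toy sentences: the first prediction misses the gold span, the second matches it
toy_true = [["O", "O", "O", "O", "O", "O", "O", "B-A", "I-A", "O"], ["O", "B-A", "O"]]
toy_pred = [["O", "O", "O", "B-A", "O", "O", "O", "O", "O", "O"], ["O", "B-A", "O"]]
print(span_prf(toy_true, toy_pred))  # (0.5, 0.5, 0.5)
```

Span-level scores are typically lower than token-level ones, because a multi-word aspect that is only partly tagged earns no credit.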





* The tagger also exposes the learned weights, which can tell us something about the model. Like an HMM, a CRF models the transitions between the y labels, although it stores them as unnormalised weights rather than probabilities (which is why some values are negative). We can print the highest- and lowest-weighted transitions.

* Some transitions between labels are more likely than others. The following example shows that B-A -> I-A is very likely (as in iPhone (B-A) size (I-A)).

* The transitions also suggest that there are some errors in the data: "NN", which is a POS tag (a feature), appears among the labels.

In [39]:
info = tagger.info()

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(info.transitions).most_common(8))

print("\nTop unlikely transitions:")
print_transitions(Counter(info.transitions).most_common()[-8:])

Top likely transitions:
I-A    -> I-A     2.657962
B-A    -> I-A     2.559635
O      -> O       1.920100
O      -> B-A     1.000955
B-A    -> O       0.152442
NN     -> B-A     -0.244032
I-A    -> O       -0.658764
O      -> NN      -0.980371

Top unlikely transitions:
B-A    -> O       0.152442
NN     -> B-A     -0.244032
I-A    -> O       -0.658764
O      -> NN      -0.980371
NN     -> O       -2.090161
I-A    -> B-A     -4.247481
B-A    -> B-A     -5.006722
O      -> I-A     -6.641027
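
The stray NN label in the transitions above suggests that some rows of the data files do not follow the expected three-column layout. A quick way to locate such rows is to scan the raw lines for anything whose last column is not one of O, B-A or I-A. This is a minimal sketch; `find_suspect_rows` and the toy lines are hypothetical, and the cause of the malformed rows in the actual files is not known here.

```python
EXPECTED_LABELS = {"O", "B-A", "I-A"}


def find_suspect_rows(lines, expected=EXPECTED_LABELS):
    """Return (line_number, raw_line) for non-blank rows that do not have
    exactly three whitespace-separated columns or whose last column is
    not an expected label."""
    suspects = []
    for n, raw in enumerate(lines, start=1):
        cols = raw.split()
        if not cols:
            continue  # blank lines separate sentences
        if len(cols) != 3 or cols[-1] not in expected:
            suspects.append((n, raw.rstrip("\n")))
    return suspects


# made-up lines for illustration; rows 4 and 5 are deliberately malformed
toy_lines = ["The DT O\n", "bread NN B-A\n", "\n", "is VBZ NN\n", "good JJ\n"]
print(find_suspect_rows(toy_lines))  # [(4, 'is VBZ NN'), (5, 'good JJ')]

# On the real data one could run, for example:
# with open("./Restaurants_Train.iob", encoding="utf-8") as fp:
#     print(find_suspect_rows(fp))
```

Fixing or dropping the flagged rows before training would remove the spurious NN label from the model.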


We can also check which features are most (or least) strongly associated with each tag. The top positive features for "B-A" are postag=NN and postag=NNS, that is, the word is a noun.

In [40]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-6s %s" % (weight, label, attr))    

print("Top positive:")
print_state_features(Counter(info.state_features).most_common(20))

#print("\nTop negative:")
#print_state_features(Counter(info.state_features).most_common()[-20:])

Top positive:
3.756987 O      postag=,
3.561562 O      postag=PRP
3.432838 O      postag=.
2.797307 O      postag=WDT
2.301108 O      postag=PRP$
2.278891 O      EOS
2.209918 O      BOS
1.866367 I-A    -1:postag=PRP
1.826179 I-A    postag=SYM
1.767864 O      +1:postag=CD
1.727703 O      postag=JJS
1.655197 O      postag=WP
1.535133 O      bias
1.472674 B-A    postag=NN
1.412390 B-A    postag=NNS
1.379069 NN     postag=.
1.361568 NN     EOS
1.342975 I-A    postag=NN
1.224582 B-A    BOS
1.190439 O      postag=VBZ
