## Supervised Learning for Entity and Aspect Mining

This notebook introduces Conditional Random Fields (CRF) for entity and aspect mining. Recall that entity and aspect mining involves three main tasks:
1. Extraction of entity 
2. Extraction of aspects associated with the entity
3. Sentiment classification

In this notebook, we use CRF for the second task. 

### Conditional Random Fields
CRF is a machine learning technique that works on sequences and is very popular in natural language processing (NLP), e.g. in named entity recognition (NER), part-of-speech (POS) tagging and word sense disambiguation.

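A linear-chain CRF can be viewed as the discriminative counterpart of the hidden Markov model (HMM): it models the label sequence conditioned on the input, so its features may depend on words beyond the adjacent ones.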
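
For reference, a linear-chain CRF is commonly written as the conditional distribution below. The symbols ($f_k$ for feature functions, $\lambda_k$ for their learned weights, $Z(x)$ for the normaliser) are the usual textbook notation and are assumptions here, not names used elsewhere in this notebook.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Each feature function may inspect the whole input sequence $x$, which is how information such as the POS tag of a neighbouring word can enter the model.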

Earlier, we introduced several heuristic techniques based on dependency relations for extracting aspects. In this notebook, we try to integrate linguistic features, e.g. the POS tags of words, into the ML model.

In [None]:
from itertools import chain
import nltk
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
import sklearn
import pandas as pd

!pip install python-crfsuite
import pycrfsuite

print(sklearn.__version__)

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-crfsuite
  Downloading python_crfsuite-0.9.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (965 kB)
Installing collected packages: python-crfsuite
Successfully installed python-crfsuite-0.9.8
1.0.2


# Data Preparation
Our labelled data set is in IOB format with three columns. The first column is the word itself, the second is its POS tag, and the third is the aspect label: B-A (beginning of an aspect), I-A (inside an aspect) or O (outside any aspect). We write some simple code to convert it into the form expected by the pycrfsuite library, one of the more accessible libraries for running CRFs.

The function word2features extracts features for each token in a sentence; in this case, just the POS tags of the token and its immediate neighbours. The function is adapted from https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html
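
To make the three-column IOB layout concrete, here is a minimal sketch that parses a tiny hand-written sample. The sample text and the helper name `read_iob` are hypothetical and only illustrate the format; they are not taken from the restaurant data set.

```python
import io

# A tiny hypothetical IOB sample: one token per line as "word POS tag",
# with a blank line separating sentences.
sample = """\
the DT O
staff NN B-A
was VBD O
rude JJ O

great JJ O
sushi NN B-A
rolls NNS I-A
"""

def read_iob(lines):
    """Group whitespace-separated (word, POS, tag) tuples into sentences."""
    sentences, current = [], []
    for line in lines:
        fields = tuple(line.split())
        if fields:
            current.append(fields)
        elif current:
            # a blank line closes the current sentence
            sentences.append(current)
            current = []
    if current:
        # keep a final sentence that has no trailing blank line
        sentences.append(current)
    return sentences

sents = read_iob(io.StringIO(sample))
print(len(sents))
print(sents[1])
```

The second sentence shows a two-token aspect: `sushi` opens it with B-A and `rolls` continues it with I-A.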

In [None]:
# Read the tuples line by line from the file and group them into sentences

def createCRFSet(fname):
    train_sents = []
    t_sent = []

    # the context manager closes the file once it has been read
    with open(fname, encoding="utf-8") as fp:
        for line in fp:
            t = tuple(line.split())
            if len(t) != 0:
                # non-empty line: one (word, POS, label) tuple
                t_sent.append(t)
            elif t_sent:
                # blank line: end of the current sentence
                train_sents.append(t_sent)
                t_sent = []

    # keep the last sentence if the file does not end with a blank line
    if t_sent:
        train_sents.append(t_sent)

    return train_sents

train_sents = createCRFSet("./Restaurants_Train.iob")
test_sents = createCRFSet("./Restaurants_Test.iob")
print(len(train_sents))
print(len(test_sents))


3041
800


In [None]:
train_sents[0]

[('But', 'CC', 'O'),
 ('the', 'DT', 'O'),
 ('staff', 'NN', 'B-A'),
 ('was', 'VBD', 'O'),
 ('so', 'RB', 'O'),
 ('horrible', 'JJ', 'O'),
 ('to', 'TO', 'O'),
 ('us', 'PRP', 'O'),
 ('.', '.', 'O')]

In [None]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = [  # for all words
        'bias',
        'postag=' + postag,
        #'word.lower()=' + word.lower(),
        #'word[-3:]='+ word[-3:],
        #'word[-2:]='+ word[-2:],
        #'word.isupper()='+ str(word.isupper()),
        #'word.istitle()='+ str(word.istitle()),
        #'word.isdigit()='+ str(word.isdigit()),
        #'postag[:2]='+ postag[:2],
    ]
    if i > 0: # if not BOS, check previous word
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:postag=' + postag1,
            #'-1:word.lower()='+ word1.lower(),
            #'-1:word.istitle()='+ str(word1.istitle()),
            #'-1:word.isupper()='+ str(word1.isupper()),
            #'-1:postag[:2]='+postag1[:2],
        ])
    else:
        features.append('BOS')  # beginning of sentence
        
    if i < len(sent)-1:  # if not EOS, check next word
        word2 = sent[i+1][0]
        postag2 = sent[i+1][1]
        features.extend([
            '+1:postag=' + postag2,
            #'+1:word.lower()='+ word2.lower(),
            #'+1:word.istitle()='+ str(word2.istitle()),
            #'+1:word.isupper()='+ str(word2.isupper()),
            #'+1:postag[:2]='+ postag2[:2],
        ])
    else:
        features.append('EOS')
                
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

Note the features for one of the sentences: 'To be completely fair, the only redeeming factor was the food which was above average, but couldn't make up for all the other deficiencies of Teodora'. The POS tags of the current, previous and next words are used as features.

In [None]:
# data before feature extraction, changed to dataframe for easy printing.
df_1 = pd.DataFrame(train_sents[1], columns=["Word","POS","Entity or Aspect Tag"])

df_1


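| | Word | POS | Entity or Aspect Tag |
|---|---|---|---|
| 0 | To | TO | O |
| 1 | be | VB | O |
| 2 | completely | RB | O |
| 3 | fair | JJ | O |
| 4 | , | , | O |
| 5 | the | DT | O |
| 6 | only | JJ | O |
| 7 | redeeming | NN | O |
| 8 | factor | NN | O |
| 9 | was | VBD | O |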


In [None]:
# To observe what the training set looks like after feature extraction
df_2 = pd.DataFrame(sent2features(train_sents[1]), columns=["Bias constant","POS", "POS Before","POS after" ])
df_2

| | Bias constant | POS | POS Before | POS after |
|---|---|---|---|---|
| 0 | bias | postag=TO | BOS | +1:postag=VB |
| 1 | bias | postag=VB | -1:postag=TO | +1:postag=RB |
| 2 | bias | postag=RB | -1:postag=VB | +1:postag=JJ |
| 3 | bias | postag=JJ | -1:postag=RB | +1:postag=, |
| 4 | bias | postag=, | -1:postag=JJ | +1:postag=DT |
| 5 | bias | postag=DT | -1:postag=, | +1:postag=JJ |
| 6 | bias | postag=JJ | -1:postag=DT | +1:postag=NN |
| 7 | bias | postag=NN | -1:postag=JJ | +1:postag=NN |
| 8 | bias | postag=NN | -1:postag=NN | +1:postag=VBD |
| 9 | bias | postag=VBD | -1:postag=NN | +1:postag=DT |


In [None]:
%%time
# Now process the inputs and outputs for both the train set and the test set.
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

CPU times: user 223 ms, sys: 14 ms, total: 237 ms
Wall time: 239 ms


# Train and use the CRF model

Create a Trainer and load the train set data.

In [None]:
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

In [None]:

trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})

In [None]:
trainer.params()

['feature.minfreq',
 'feature.possible_states',
 'feature.possible_transitions',
 'c1',
 'c2',
 'max_iterations',
 'num_memories',
 'epsilon',
 'period',
 'delta',
 'linesearch',
 'max_linesearch']

In [None]:
%%time
# Train the model and save the trained CRF model. 
trainer.train('CRF_ABSA.crfsuite')

CPU times: user 877 ms, sys: 4.18 ms, total: 881 ms
Wall time: 884 ms


In [None]:
# the final state of the model
trainer.logparser.last_iteration

{'active_features': 244,
 'error_norm': 126.872569,
 'feature_norm': 17.409161,
 'linesearch_step': 1.0,
 'linesearch_trials': 1,
 'loss': 8229.181667,
 'num': 50,
 'scores': {},
 'time': 0.014}

In [None]:
# create a Tagger to use the trained model
tagger = pycrfsuite.Tagger()
tagger.open('CRF_ABSA.crfsuite')

<contextlib.closing at 0x7f30abc9be90>

In [None]:
# call tag() to tag a sentence. Let's try it on one test sentence
example_sent = test_sents[6]
print(' '.join(sent2tokens(example_sent)), end='\n\n')

print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
print("Correct:  ", ' '.join(sent2labels(example_sent)))

Straight-forward , no surprises , very decent Japanese food .

Predicted: O O O B-A O O O O O O
Correct:   O O O O O O O B-A I-A O
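
To read the tag sequences above as aspect terms, the labels can be grouped back into spans. The following is a minimal sketch; the helper name `extract_aspects` is hypothetical, and dropping an I-A that does not follow a B-A is an assumption made for this illustration.

```python
def extract_aspects(tokens, labels):
    """Collect aspect terms from parallel token and IOB label lists."""
    aspects, current = [], []
    for token, label in zip(tokens, labels):
        if label == 'B-A':
            # a new aspect starts here; close any aspect still open
            if current:
                aspects.append(' '.join(current))
            current = [token]
        elif label == 'I-A' and current:
            # continue the aspect that is currently open
            current.append(token)
        else:
            # 'O' (or a stray 'I-A') closes any open aspect
            if current:
                aspects.append(' '.join(current))
            current = []
    if current:
        aspects.append(' '.join(current))
    return aspects

tokens = "Straight-forward , no surprises , very decent Japanese food .".split()
correct = "O O O O O O O B-A I-A O".split()
predicted = "O O O B-A O O O O O O".split()

print(extract_aspects(tokens, correct))
print(extract_aspects(tokens, predicted))
```

For this sentence the gold labels yield the aspect "Japanese food", while the predicted labels yield "surprises", so the POS-only model picked a noun that is not an aspect and missed the real one.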


# Evaluate the model
Overall accuracy is not meaningful because the majority of the labels in the data are 'O'. It is more helpful to look at the precision and recall for each aspect tag, 'B-A' and 'I-A'.
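
As a small illustration of why accuracy misleads here, the sketch below scores a trivial tagger that labels every token 'O' on the gold labels of the test sentence shown earlier. The helper `token_scores` is a hypothetical name, and returning 0.0 when a tag is never predicted is an assumption of this sketch.

```python
def token_scores(y_true, y_pred, tag):
    """Token-level precision and recall for a single tag."""
    tp = sum(t == tag and p == tag for t, p in zip(y_true, y_pred))
    predicted = sum(p == tag for p in y_pred)
    actual = sum(t == tag for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

y_true = "O O O O O O O B-A I-A O".split()
all_o = ['O'] * len(y_true)   # a trivial tagger that never predicts an aspect

accuracy = sum(t == p for t, p in zip(y_true, all_o)) / len(y_true)
print("accuracy:", accuracy)
print("B-A precision/recall:", token_scores(y_true, all_o, 'B-A'))
```

The trivial tagger is right on 8 of the 10 tokens, yet it finds no aspect at all, which is exactly what per-tag precision and recall expose.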

In [None]:
def bio_classification_report(y_true, y_pred):
    """
    Classification report for a list of BIO-encoded sequences.
    It computes token-level metrics and discards "O" labels.
    
    Note that it requires scikit-learn 0.15+ (or a version from github master)
    to calculate averages properly!
    """
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))
        
    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}
    
    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels = [class_indices[cls] for cls in tagset],
        target_names = tagset,
    )

In [None]:
%%time
y_pred = [tagger.tag(xseq) for xseq in X_test]

CPU times: user 32.5 ms, sys: 0 ns, total: 32.5 ms
Wall time: 80.3 ms


In [None]:
print(bio_classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         B-A       0.63      0.36      0.46      1135
         I-A       0.56      0.23      0.32       538

   micro avg       0.61      0.32      0.42      1673
   macro avg       0.59      0.30      0.39      1673
weighted avg       0.60      0.32      0.42      1673
 samples avg       0.04      0.04      0.04      1673





# What has the model learned?
We can inspect the learned transition weights between labels; a higher weight means the model considers that transition more likely. The following output shows that B-A -> I-A is very likely (as in iPhone (B-A) size (I-A)).
The transitions also suggest that there might be some errors in the data. Can you spot them?

In [None]:

from collections import Counter
info = tagger.info()

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(info.transitions).most_common(8))

print("\nTop unlikely transitions:")
print_transitions(Counter(info.transitions).most_common()[-8:])

Top likely transitions:
I-A    -> I-A     1.720269
B-A    -> I-A     1.234421
O      -> O       1.213212
O      -> B-A     0.304503
B-A    -> O       -0.622497
I-A    -> O       -1.054049
I-A    -> B-A     -4.500151
B-A    -> B-A     -5.729769

Top unlikely transitions:
B-A    -> I-A     1.234421
O      -> O       1.213212
O      -> B-A     0.304503
B-A    -> O       -0.622497
I-A    -> O       -1.054049
I-A    -> B-A     -4.500151
B-A    -> B-A     -5.729769
O      -> I-A     -7.799991
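
To see how these weights act on a whole label sequence, the sketch below sums the transition weights along two candidate sequences. The weights are copied from the output above; the helper `transition_score` is a hypothetical name. It deliberately ignores the state features, so it is only part of the score a CRF would assign, and the values are unnormalised scores rather than probabilities.

```python
# Transition weights copied from the printed output.
transitions = {
    ('I-A', 'I-A'): 1.720269,
    ('B-A', 'I-A'): 1.234421,
    ('O', 'O'): 1.213212,
    ('O', 'B-A'): 0.304503,
    ('B-A', 'O'): -0.622497,
    ('I-A', 'O'): -1.054049,
    ('I-A', 'B-A'): -4.500151,
    ('B-A', 'B-A'): -5.729769,
    ('O', 'I-A'): -7.799991,
}

def transition_score(labels):
    """Sum the transition weights along a label sequence (state features ignored)."""
    return sum(transitions[pair] for pair in zip(labels, labels[1:]))

well_formed = ['O', 'B-A', 'I-A', 'O']   # the aspect opens with B-A
ill_formed = ['O', 'I-A', 'I-A', 'O']    # the aspect opens with I-A

print(transition_score(well_formed))
print(transition_score(ill_formed))
```

The sequence that opens an aspect with I-A scores far lower, mainly because of the strongly negative O -> I-A weight.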


We can also check which features are most (or least) strongly associated with each tag. Among the top positive features for "B-A" are postag=NNS and postag=NN, that is, the word is a noun.

In [None]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-6s %s" % (weight, label, attr))    

print("Top positive:")
print_state_features(Counter(info.state_features).most_common(20))

print("\nTop negative:")
print_state_features(Counter(info.state_features).most_common()[-20:])

Top positive:
5.041641 O      BOS
4.443006 O      postag=.
4.163987 B-A    BOS
3.608665 O      postag=PRP
3.367746 O      postag=,
2.739086 O      postag=WDT
2.076323 O      EOS
1.826382 O      postag=WP
1.641934 O      postag=JJS
1.595854 B-A    postag=NNS
1.485715 B-A    postag=NN
1.445405 O      +1:postag=CD
1.397235 O      postag=PRP$
1.183854 O      postag=VBZ
1.178977 O      postag=:
1.172939 I-A    postag=NN
1.168433 B-A    postag=VBN
1.167737 B-A    postag=VBG
1.150206 O      bias
1.124446 I-A    +1:postag=VBZ

Top negative:
-0.618222 I-A    -1:postag=PRP$
-0.646275 I-A    +1:postag=NNP
-0.651469 O      +1:postag=,
-0.694333 O      +1:postag=WDT
-0.697250 O      postag=VB
-0.737102 O      postag=SYM
-0.750187 O      +1:postag=VBZ
-0.811463 B-A    postag=IN
-0.856655 I-A    bias
-0.883636 O      +1:postag=VBP
-0.975711 B-A    bias
-1.074608 O      postag=VBN
-1.132826 O      +1:postag=VBD
-1.146991 O      postag=NNP
-1.393457 I-A    postag=VBD
-1.458857 I-A    postag=VBZ
-1.5396

The results using only POS tags as features are not good: the F1-scores above are 0.46 for B-A and 0.32 for I-A. Now try training a better model by adding other features, such as the word itself, the previous and next words, case information, etc.
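
One possible starting point is sketched below. It enables several of the word-level features that are commented out in word2features. The function name `word2features_rich` and the particular choice of features are assumptions for illustration, and whether they improve the scores has not been verified here.

```python
def word2features_rich(sent, i):
    """POS features plus word-level features for token i of a sentence."""
    word, postag = sent[i][0], sent[i][1]
    features = [
        'bias',
        'postag=' + postag,
        'postag[:2]=' + postag[:2],           # coarse POS class, e.g. NN for NNS
        'word.lower()=' + word.lower(),       # the word itself, case-folded
        'word[-3:]=' + word[-3:],             # suffix
        'word.istitle()=' + str(word.istitle()),
        'word.isdigit()=' + str(word.isdigit()),
    ]
    if i > 0:
        # previous token: POS tag and lowercased word
        prev_word, prev_pos = sent[i - 1][0], sent[i - 1][1]
        features.extend([
            '-1:postag=' + prev_pos,
            '-1:word.lower()=' + prev_word.lower(),
        ])
    else:
        features.append('BOS')  # beginning of sentence
    if i < len(sent) - 1:
        # next token: POS tag and lowercased word
        next_word, next_pos = sent[i + 1][0], sent[i + 1][1]
        features.extend([
            '+1:postag=' + next_pos,
            '+1:word.lower()=' + next_word.lower(),
        ])
    else:
        features.append('EOS')  # end of sentence
    return features

# A short hand-written sentence in the same (word, POS, label) form as the data.
sent = [('the', 'DT', 'O'), ('staff', 'NN', 'B-A'), ('was', 'VBD', 'O')]
feats = word2features_rich(sent, 1)
print(feats)
```

To try it, swap this function into sent2features, rebuild X_train and X_test, retrain, and compare the classification report with the POS-only one.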