<a href="https://colab.research.google.com/github/Haavi97/ITS8040-NLSP/blob/master/Default_independent_project_2021.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task description

The "default" project topic is about Aspect-Based Sentiment Analysis (ABSA). By "default" it is meant that you are supposed to choose this topic if you don't have any preferences about the topic yourself. You are free to choose any other sufficiently challenging topic, but please consult with the lecturer.

The task comes from the SemEval 2016 Shared Task 5: http://alt.qcri.org/semeval2016/task5/

There are different subtasks and slots in this task. You can limit you work on Task 1 Slot 1: Aspect Category Detection, and you can further focus on the restaurant domain only.

The task is: given a sentence taken from a restaurant review, your system has to decide about which aspect category of the restaurant this sentence is, if any. The number of different aspect categories is fixed. One sentnce can correspond to zero, one or more aspect categories.

Some examples:

*   "I was very disappointed with this restaurant." -> RESTAURANT#GENERAL
*   "I’ve asked a cart attendant for a lotus leaf wrapped rice and she replied back rice and just walked away." -> SERVICE#GENERAL
*   "Chow fun was dry; pork shu mai was more than usually greasy and had to share a table with loud and rude family." -> FOOD#QUALITY, AMBIENCE#GENERAL

The Slot 3 of the task would be finding the polarity (negative, neutral, positive) of the aspect, but you don't have to implement this.

Since the data and evaluation tools for this Shared task come in quite complex form, I have implemented very basic data loading for you, together with a simplest possible sklearn-based implementation for this task.





I have packed the data for the restaurant domain for this subtask on my site. Let's download it:

In [1]:
! wget --no-check-certificate https://www.phon.ioc.ee/~tanela/tmp/absa-en-restaurant.zip

--2021-06-02 11:28:09--  https://www.phon.ioc.ee/~tanela/tmp/absa-en-restaurant.zip
Resolving www.phon.ioc.ee (www.phon.ioc.ee)... 193.40.251.126
Connecting to www.phon.ioc.ee (www.phon.ioc.ee)|193.40.251.126|:443... connected.
  Issued certificate has expired.
	requested host name ‘www.phon.ioc.ee’.
HTTP request sent, awaiting response... 200 OK
Length: 136921 (134K) [application/zip]
Saving to: ‘absa-en-restaurant.zip’


2021-06-02 11:28:09 (881 KB/s) - ‘absa-en-restaurant.zip’ saved [136921/136921]



In [2]:
! unzip absa-en-restaurant.zip

Archive:  absa-en-restaurant.zip
  inflating: ABSA16_Restaurants_Train_SB1_v2.xml  
  inflating: EN_REST_SB1_TEST.xml.gold  


There are two files: training and test files. Both are in XML formats. Let's  take a peek:

In [3]:
#! head -20 ABSA16_Restaurants_Train_SB1_v2.xml  

In [4]:
#! head -20 ABSA16_Restaurants_Train_SB1_v2.xml  

There is a lot of information in this XML: each sentence has a list of Opinions, where each Opinion consists of category, polarity and the target word or phrase. We are really interested only in the raw sentence and the corresponding opinion categories.

There are many ways to parse XML in Python. In this example we use a method that parses XML to Python dict, which is we will then process via loops.

In [5]:
! pip install xmltodict

Collecting xmltodict
  Downloading https://files.pythonhosted.org/packages/28/fd/30d5c1d3ac29ce229f6bdc40bbc20b28f716e8b363140c26eff19122d8a5/xmltodict-0.12.0-py2.py3-none-any.whl
Installing collected packages: xmltodict
Successfully installed xmltodict-0.12.0


In [6]:
import xmltodict as xd


with open('EN_REST_SB1_TEST.xml.gold','rb') as f:
    d = xd.parse(f)

In [7]:
d["Reviews"]["Review"][0]["sentences"]["sentence"][0]

OrderedDict([('@id', 'en_BlueRibbonSushi_478218171:0'),
             ('text', 'Yum!'),
             ('Opinions',
              OrderedDict([('Opinion',
                            OrderedDict([('@target', 'NULL'),
                                         ('@category', 'FOOD#QUALITY'),
                                         ('@polarity', 'positive'),
                                         ('@from', '0'),
                                         ('@to', '0')]))]))])

Here is the function that parses the XML and returns a list containing sentences and the corresponding list of categories:

In [8]:
def read_data(filename):
  result = []
  with open(filename,'rb') as f:
    d = xd.parse(f, force_list=('sentence', 'Opinion'))
  for review in d["Reviews"]["Review"]:
    #print(review)
    for sentence in review["sentences"]["sentence"]:
      
      text = sentence["text"]
      opinion_cats = []
      
      if "Opinions" in sentence and sentence["Opinions"] is not None:
        opinions = sentence["Opinions"]["Opinion"]
        for opinion in opinions:
          opinion_cats.append(opinion["@category"])
      result.append((text, opinion_cats))
  return result



In [9]:
train_data = read_data("ABSA16_Restaurants_Train_SB1_v2.xml")
test_data = read_data("EN_REST_SB1_TEST.xml.gold")

In [10]:
print(len(train_data), len(test_data))

2000 676


In [11]:
#train_data[:5]

# Pipelines

## Multilabel binarizer

So, each sentence can have zero or more categories. This task is called multi-label classification, as opposed to single-label classification where each sample corresponds to one and only one category.

The sklearn package has some useful utilities for multi-label classification tasks.

In [12]:
from sklearn.preprocessing import MultiLabelBinarizer

MultiLabelBinarizer builds a mapping from multi-label labels to IDs, and also constructs a label matrix for our training ans test data.

In [13]:
mlb = MultiLabelBinarizer()
train_labels = mlb.fit_transform([set(sample[1]) for sample in train_data])

In [14]:
train_labels[0:5]

array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])

In [15]:
mlb.classes_

array(['AMBIENCE#GENERAL', 'DRINKS#PRICES', 'DRINKS#QUALITY',
       'DRINKS#STYLE_OPTIONS', 'FOOD#PRICES', 'FOOD#QUALITY',
       'FOOD#STYLE_OPTIONS', 'LOCATION#GENERAL', 'RESTAURANT#GENERAL',
       'RESTAURANT#MISCELLANEOUS', 'RESTAURANT#PRICES', 'SERVICE#GENERAL'],
      dtype=object)

In [16]:
#print(test_data)

In [17]:
test_labels = mlb.transform([set(sample[1]) for sample in test_data])

In [18]:
test_labels[0:5]

array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])

So, our labels are now stored as binary matrices -- exactly as we need them.

Let's also construct a list of training inputs, as sklearn likes it:

In [19]:
train_text = [review[0] for review in train_data]
test_text = [review[0] for review in test_data]

In [20]:
train_text[0:5]

['Judging from previous posts this used to be a good place, but not any longer.',
 'We, there were four of us, arrived at noon - the place was empty - and the staff acted like we were imposing on them and they were very rude.',
 'They never brought us complimentary noodles, ignored repeated requests for sugar, and threw our dishes on the table.',
 'The food was lousy - too sweet or too salty and the portions tiny.',
 'After all that, they complained to me about the small tip.']

## Data shape

Printing data to have a clear picture.

In [21]:
print('Train text data is of type: {}'.format(type(train_text)))
print('Train label data is of type: {}'.format(type(train_labels)))
print('Train text size: {}'.format(len(train_text)))
print('Test text size: {}'.format(len(test_text)))
print('Train labels size: {}'.format(train_labels.shape))
print('Test labels size: {}'.format(test_labels.shape))
n_labels = test_labels.shape[1]
print('Number of labels: {}'.format(n_labels))
TAGS = mlb.classes_
print('\nThis are the different labels:')
print(TAGS)

Train text data is of type: <class 'list'>
Train label data is of type: <class 'numpy.ndarray'>
Train text size: 2000
Test text size: 676
Train labels size: (2000, 12)
Test labels size: (676, 12)
Number of labels: 12

This are the different labels:
['AMBIENCE#GENERAL' 'DRINKS#PRICES' 'DRINKS#QUALITY'
 'DRINKS#STYLE_OPTIONS' 'FOOD#PRICES' 'FOOD#QUALITY' 'FOOD#STYLE_OPTIONS'
 'LOCATION#GENERAL' 'RESTAURANT#GENERAL' 'RESTAURANT#MISCELLANEOUS'
 'RESTAURANT#PRICES' 'SERVICE#GENERAL']


## OneVsRestClassifier

Now we can train a simple multi-label classifier.

We use Sklearn's OneVsRestClassifier to do this. This basically builds a seperate base classifier for each of our labels.

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

text_clf = Pipeline([
     ('vect', CountVectorizer()),
     ('tfidf', TfidfTransformer()),
     ('clf',  OneVsRestClassifier(LogisticRegression()))
 ])

In [23]:
text_clf.fit(train_text, train_labels )

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=Non...
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 OneVsRestClassifier(estimator=LogisticRegression(C=1.0,
                                                                  class_weight=None,
                      

Classifier is trained. Let's first see how can we apply it on test data. What comes out when we feed it some test data?

In [24]:
test_predictions = text_clf.predict(test_text)

In [25]:
test_predictions[0:5]

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [26]:
test_labels[0:5]

array([[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]])

So, the output of our classifier is also a binary matrix. We can evaluate the performance of our model using F1 measure (which is a geometric mean of precision and recall). 

F1 measure can be computed for each label independently and then merged using either macro or micro averaging. We will use micro-averaging that sums up the individual true positives, false positives, and false negatives of the system for different sets and the apply them to get the statistics. 



In [27]:
from sklearn.metrics import f1_score, classification_report

In [28]:
f1_score(test_labels,
         test_predictions,
         average='micro')

0.4503441494591937

In [29]:
print(classification_report(test_labels, test_predictions, target_names=mlb.classes_, zero_division=0))

                          precision    recall  f1-score   support

        AMBIENCE#GENERAL       1.00      0.21      0.35        57
           DRINKS#PRICES       0.00      0.00      0.00         3
          DRINKS#QUALITY       0.00      0.00      0.00        21
    DRINKS#STYLE_OPTIONS       0.00      0.00      0.00        12
             FOOD#PRICES       0.00      0.00      0.00        22
            FOOD#QUALITY       0.78      0.54      0.64       226
      FOOD#STYLE_OPTIONS       0.00      0.00      0.00        48
        LOCATION#GENERAL       0.00      0.00      0.00        13
      RESTAURANT#GENERAL       0.84      0.23      0.36       142
RESTAURANT#MISCELLANEOUS       0.00      0.00      0.00        33
       RESTAURANT#PRICES       0.00      0.00      0.00        21
         SERVICE#GENERAL       0.95      0.43      0.60       145

               micro avg       0.84      0.31      0.45       743
               macro avg       0.30      0.12      0.16       743
        

It's now your task to improve this.

You can try a lot of things to make the classifier more accurate. Of course, it's very recommended to try different DNN based approaches. 

Note that there are only 2000 training sentences. This makes the task basically an excercise of transfer learning. You can try using pre-trained word embeddings, pre-trained models like BERT, etc.

Note that the labels have some structure in them. Each label consists of two parts: e.g. FOOD#QUALITY consists of FOOD and QUALITY. Maybe try splitting the labels into two parts and predicting each part independanty? Of course, you need to glue them back up when doing evaluation.

Multi-label classification can be easily handled with neural network models with a model that has one output for each label, but instead of simple cross-entropy, binary cross-entropy has to be used for optimization.

Multi-label classification is a quite popular task and you can find a lot of tutorials on the internet. For example, many many text classification where a text can be assigned many tags (as in StackOverflow) are multi-label tasks.


Check the slides of Lecture 1 on how the final project should look like, what parts it should contain and how it will be graded.




## Failing with OneVsOne

In [30]:
def get_index(l):
  try:
    return list(l).index(1)
  except: 
    #print(l)
    return 13

In [31]:
import numpy as np

def reverse(l):
  result = []
  for e in l:
    zeros = np.zeros(12, int)
    if e != 13:
      zeros[e] = 1
    result.append(zeros)
  return np.array(result)


In [32]:
from sklearn.multiclass import OneVsOneClassifier

text_clf_1vs1 = Pipeline([
     ('vect', CountVectorizer()),
     ('tfidf', TfidfTransformer()),
     ('clf',  OneVsOneClassifier(LogisticRegression()))
 ])
train_labels_1vs1 = list(map(lambda x: get_index(x), train_labels))
text_clf_1vs1.fit(train_text, train_labels_1vs1)

test_predictions = text_clf_1vs1.predict(test_text)

test_predictions = reverse(test_predictions)
print(test_predictions)
#test_labels_1vs1 = list(map(lambda x: get_index(x), test_labels))
#print(test_predictions)
f1 = f1_score(test_labels,
         test_predictions,
         average='micro')
print('F1: {}'.format(f1))
print(classification_report(test_labels, test_predictions, target_names=mlb.classes_, zero_division=0))

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
F1: 0.5303703703703704
                          precision    recall  f1-score   support

        AMBIENCE#GENERAL       0.79      0.47      0.59        57
           DRINKS#PRICES       0.00      0.00      0.00         3
          DRINKS#QUALITY       0.00      0.00      0.00        21
    DRINKS#STYLE_OPTIONS       0.00      0.00      0.00        12
             FOOD#PRICES       0.00      0.00      0.00        22
            FOOD#QUALITY       0.53      0.88      0.66       226
      FOOD#STYLE_OPTIONS       0.00      0.00      0.00        48
        LOCATION#GENERAL       0.00      0.00      0.00        13
      RESTAURANT#GENERAL       0.61      0.56      0.58       142
RESTAURANT#MISCELLANEOUS       0.00      0.00      0.00        33
       RESTAURANT#PRICES       0.00      0.00      0.00        21
         SERVICE#GENERAL       0.85      0.37      0.51       14

## OutputCode

In [33]:
from sklearn.multiclass import OutputCodeClassifier
from sklearn.preprocessing import FunctionTransformer

text_clf_occ = Pipeline([
     ('vect', CountVectorizer()),
     ('tfidf', TfidfTransformer()), 
     ('to_dense', FunctionTransformer(lambda x: x.todense(), accept_sparse=True)),
     ('clf',  OutputCodeClassifier(LogisticRegression()))
 ])
#train_labels_1vs1 = list(map(lambda x: get_index(x), train_labels))
#text_clf_occ.fit(train_text, train_labels)

#test_predictions = text_clf_occ.predict(test_text)

#test_predictions = reverse(test_predictions)
print(test_predictions)
#test_labels_1vs1 = list(map(lambda x: get_index(x), test_labels))
#print(test_predictions)
f1 = f1_score(test_labels,
         test_predictions,
         average='micro')
print('F1: {}'.format(f1))
print(classification_report(test_labels, test_predictions, target_names=mlb.classes_, zero_division=0))

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
F1: 0.5303703703703704
                          precision    recall  f1-score   support

        AMBIENCE#GENERAL       0.79      0.47      0.59        57
           DRINKS#PRICES       0.00      0.00      0.00         3
          DRINKS#QUALITY       0.00      0.00      0.00        21
    DRINKS#STYLE_OPTIONS       0.00      0.00      0.00        12
             FOOD#PRICES       0.00      0.00      0.00        22
            FOOD#QUALITY       0.53      0.88      0.66       226
      FOOD#STYLE_OPTIONS       0.00      0.00      0.00        48
        LOCATION#GENERAL       0.00      0.00      0.00        13
      RESTAURANT#GENERAL       0.61      0.56      0.58       142
RESTAURANT#MISCELLANEOUS       0.00      0.00      0.00        33
       RESTAURANT#PRICES       0.00      0.00      0.00        21
         SERVICE#GENERAL       0.85      0.37      0.51       14

# BERT approach
  

## Requirements

In [34]:
!pip install transformers

Collecting transformers
[?25l  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)
[K     |████████████████████████████████| 2.3MB 12.2MB/s 
Collecting tokenizers<0.11,>=0.10.1
[?25l  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)
[K     |████████████████████████████████| 3.3MB 34.9MB/s 
[?25hCollecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting sacremoses
[?25l  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
[K     |

In [35]:
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

import torch 

device = 'cpu'
if torch.cuda.is_available():
  device = torch.device('cuda')

print(device)

cuda


## BERT tokenizer

In [36]:
!pip install transformers



In [37]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=213450.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=29.0, style=ProgressStyle(description_w…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=435797.0, style=ProgressStyle(descripti…




## TextDataset

In [38]:
class TextLabelDataset (Dataset):
    def __init__(self, text, labels, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.text = text
        self.labels = labels
        self.max_len = max_len
        
    def __len__(self):
        return len(self.text)
    
    def __getitem__(self, item_idx):
        text = self.text[item_idx]
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length= self.max_len,
            padding = 'max_length',
            return_token_type_ids= False,
            return_attention_mask= True,
            truncation=True,
            return_tensors = 'pt'
          )
        
        input_ids = inputs['input_ids'].flatten()
        attn_mask = inputs['attention_mask'].flatten()
               
        return {
          'input_ids': input_ids ,
          'attention_mask': attn_mask,
          'label_ids':torch.tensor(self.labels[item_idx],dtype= torch.float)
        }

In [39]:
all_text = train_text + test_text
all_text_lengths = map(lambda x: len(tokenizer.tokenize(x)), all_text)
MAX_LEN = max(all_text_lengths) + 2
print(MAX_LEN)

110


In [40]:
train_dataset = TextLabelDataset(train_text, train_labels, tokenizer, MAX_LEN)
test_dataset = TextLabelDataset(test_text, test_labels, tokenizer, MAX_LEN)

## BERT Model

In [41]:
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel

class MyBertModel(nn.Module):
    def __init__(self, num_classes, device='cpu', finetuning=False):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-cased', 
                                              return_dict=True)

        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)
        self.sig = nn.Sigmoid()

        self.device = device
        self.finetuning = finetuning

    def forward(self, x, attn):
        '''
        x: (N, T). int64
        Returns
        enc: (N, T, num_classes)
        '''
        x = x.to(self.device)
        # feed input tokens through BERT
        if self.training and self.finetuning:
            # print("->bert.train()")
            self.bert.train()

            outputs = self.bert(input_ids=x,attention_mask=attn)
        else:
            self.bert.eval()
            with torch.no_grad():
              outputs = self.bert(x)
        #print('outputs.pooler_output shape:')
        #print(outputs.pooler_output.shape)
        # feed BERT encodings through a classification layer
        logits = self.fc(outputs.pooler_output)
        #print('logits shape:')
        #print(logits.shape)
        
        return self.sig(logits)
    
    def to_labels(self, result):
        to_return = torch.empty_like(result)
        threshold = 0.5
        for i in range(result.shape[0]):
            for j in range(result.shape[1]):
                if result[i][j]>threshold:
                    to_return[i][j] = 1
                else:
                    to_return[i][j] = 0
        return to_return


## Train function

In [42]:
!pip install sklearn_crfsuite
import sklearn_crfsuite
import sklearn_crfsuite.metrics

Collecting sklearn_crfsuite
  Downloading https://files.pythonhosted.org/packages/25/74/5b7befa513482e6dee1f3dd68171a6c9dfc14c0eaa00f885ffeba54fe9b0/sklearn_crfsuite-0.3.6-py2.py3-none-any.whl
Collecting python-crfsuite>=0.8.3
[?25l  Downloading https://files.pythonhosted.org/packages/79/47/58f16c46506139f17de4630dbcfb877ce41a6355a1bbf3c443edb9708429/python_crfsuite-0.9.7-cp37-cp37m-manylinux1_x86_64.whl (743kB)
[K     |████████████████████████████████| 747kB 12.4MB/s 
Installing collected packages: python-crfsuite, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.7 sklearn-crfsuite-0.3.6


In [43]:
def train(model, num_epochs, train_iter, dev_iter):
  criterion = nn.BCELoss()
  optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

  for epoch in range(1, num_epochs+1):
    print("Epoch %d" % epoch)
    
    for i, batch in enumerate(train_iter):
        model.train()
        x = batch["input_ids"]
        y = batch["label_ids"]
        attn = batch["attention_mask"]

        #print(x.shape)
        #print(y.shape)

        optimizer.zero_grad()
        x = x.to(device)
        attn = attn.to(device)
        logits = model(x, attn) # logits: (N, T, TAGS), y: (N, T)
        #print(logits[0])

        logits = logits.view(-1, logits.shape[-1]) # (N*T, TAGS)
        y = y.to(device)
        #y = y.view(-1)  # (N*T,)
        #logits = logits.view(-1)
        #print(logits.shape)
        #print(y.shape)

        loss = criterion(logits, y)
        loss.backward()

        optimizer.step()


        if i % 10 == 0: # monitoring
            print(f"step: {i}, loss: {loss.item()}")

        if i % 100 == 0: # let's evaluate more frequently than every epoch
            evaluate("test set", dev_iter, model)

## Evaluate function

In [56]:
from sklearn.metrics import classification_report

def evaluate(dataset_name, data_iter, model, full_report=False):
  
  model.eval()
  not_started = True
  with torch.no_grad():
    for batch in data_iter:
      x = batch["input_ids"]
      y = batch["label_ids"]
      attn = batch["attention_mask"]
      x = x.to(device)
      y = y.to(device)

      logits = model(x, attn)
      y_pred = logits

      if not_started:
        y_seq = y
        y_pred_seq = logits
        not_started = False
      else:
        #print(y_seq.shape)
        #print(y_pred_seq.shape)
        y_seq = torch.cat((y_seq,y), dim=0)
        y_pred_seq = torch.cat((y_pred_seq, y_pred), dim=0)
  
  y_pred_seq = model.to_labels(y_pred_seq)
  y_seq = y_seq.cpu()
  y_pred_seq = y_pred_seq.cpu()
  accuracy = sklearn_crfsuite.metrics.flat_accuracy_score(y_seq, y_pred_seq)
  
  print('  Evaluation on {} -  acc: {:.4f}%'.format(dataset_name, accuracy))
  if full_report:
    print(classification_report(y_seq, y_pred_seq, target_names=list(TAGS)))

## Model call

In [45]:
model = MyBertModel(n_labels, device, True).to(device)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=570.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=435779157.0, style=ProgressStyle(descri…




Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Batching

In [46]:
# we use a small batch size, as BERT needs a lot of memory
batch_size = 16

train_iter = DataLoader(dataset=train_dataset,
                                 batch_size=batch_size,
                                 shuffle=True,
                                 num_workers=2)
dev_iter = DataLoader(dataset=test_dataset,
                                 batch_size=batch_size,
                                 shuffle=False,
                                 num_workers=2)

## Training

In [52]:
train(model, 10, train_iter, dev_iter)

Epoch 1
step: 0, loss: 0.19493016600608826
  Evaluation on test set -  acc: 0.9141%
step: 10, loss: 0.21719703078269958
step: 20, loss: 0.19440731406211853
step: 30, loss: 0.14981821179389954
step: 40, loss: 0.1721893846988678
step: 50, loss: 0.1284503936767578
step: 60, loss: 0.17334114015102386
step: 70, loss: 0.16674762964248657
step: 80, loss: 0.1575440615415573
step: 90, loss: 0.21302422881126404
step: 100, loss: 0.13550278544425964
  Evaluation on test set -  acc: 0.9377%
step: 110, loss: 0.14882510900497437
step: 120, loss: 0.11877729743719101
Epoch 2
step: 0, loss: 0.09526744484901428
  Evaluation on test set -  acc: 0.9366%
step: 10, loss: 0.19732323288917542
step: 20, loss: 0.07471096515655518
step: 30, loss: 0.11449578404426575
step: 40, loss: 0.06444206833839417
step: 50, loss: 0.08811011165380478
step: 60, loss: 0.13567900657653809
step: 70, loss: 0.05856955796480179
step: 80, loss: 0.04596608877182007
step: 90, loss: 0.11922075599431992
step: 100, loss: 0.1120020970702171

## Evaluation

In [57]:
evaluate("dev", dev_iter, model, full_report=True)

  Evaluation on dev -  acc: 0.9026%
                          precision    recall  f1-score   support

        AMBIENCE#GENERAL       0.81      0.61      0.70        57
           DRINKS#PRICES       0.00      0.00      0.00         3
          DRINKS#QUALITY       0.52      0.57      0.55        21
    DRINKS#STYLE_OPTIONS       0.42      0.67      0.52        12
             FOOD#PRICES       0.62      0.23      0.33        22
            FOOD#QUALITY       0.75      0.69      0.72       226
      FOOD#STYLE_OPTIONS       0.67      0.25      0.36        48
        LOCATION#GENERAL       0.78      0.54      0.64        13
      RESTAURANT#GENERAL       0.28      0.91      0.43       142
RESTAURANT#MISCELLANEOUS       0.36      0.12      0.18        33
       RESTAURANT#PRICES       0.18      0.90      0.30        21
         SERVICE#GENERAL       0.81      0.72      0.77       145

               micro avg       0.48      0.66      0.56       743
               macro avg       0.52   

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
