# Terminology project 

### AGUIAR Mathilde NIAOURI Dimitra

#### M2 NLP

# Goal and tools 


In this notebook we are using BERT model (bert-base-uncased) to predict IOB tagging on the NLP domain. 
To do so we use the BERT hosted on HuggingFace.

In [46]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### Imports and packages 

In [47]:
!pip3 install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [48]:
!pip install seqeval

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [49]:
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertConfig, BertForTokenClassification
import pandas as pd
from pandas import DataFrame
import glob
from sklearn.metrics import accuracy_score
import torch.nn as nn

To use your GPU

In [50]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
print(device)

cpu


## Data preprocessing 



### Important issue to address:

BERT tokenize sub-words and not words. That's why we need strategies to deal with annotation on subwords instead of words. 

### ⚠ Change the paths to yours 

### Create all the dataframes of train, test and dev data

In [51]:
def data_in_df(path):
  all_files = glob.glob(path + "/*.iob")
  l = []
  cols = ['Token', 'Tag']

  for filename in all_files:
      df = pd.read_csv(filename, names = cols,on_bad_lines='skip', sep='\s+', engine='python')
      l.append(df)

  df = pd.concat(l, axis=0, ignore_index=True)
  return df

In [52]:
# Train df 
df_train = data_in_df('/content/drive/MyDrive/Terminology_project/curated/training_data/iob')
# Dev df 
df_dev = data_in_df('/content/drive/MyDrive/Terminology_project/curated/dev_data/iob')
# Test df 
df_test = data_in_df('/content/drive/MyDrive/Terminology_project/curated/test_data/iob_md')

In [53]:
print("Number of tokens in the dataset", len(df_train['Token']))
print("Number of tokens in the dataset",len(df_dev['Token']))
print("Number of tokens in the dataset",len(df_test['Token']))

Number of tokens in the dataset 19356
Number of tokens in the dataset 2222
Number of tokens in the dataset 2772


In [55]:
# Clean the tag columns 
def clean_tags(df):
  # clean the tag column
  print(df['Tag'])
  df['Tag'] = df['Tag'].replace(['        I','    I','II'], 'I')
  df['Tag'] = df['Tag'].replace(['        O','O ','0','o'], 'O')
  df['Tag'] = df['Tag'].replace(['   B','b'], 'B')
  return df

In [56]:
df_train = clean_tags(df_train)
# Dev df 
df_dev = clean_tags(df_dev)
# Test df 
df_test = clean_tags(df_test)

0            O
1            O
2            O
3            O
4            O
         ...  
19351    B-NLP
19352    I-NLP
19353    I-NLP
19354    I-NLP
19355        O
Name: Tag, Length: 19356, dtype: object
0       O
1       O
2       O
3       O
4       O
       ..
2217    O
2218    O
2219    O
2220    O
2221    O
Name: Tag, Length: 2222, dtype: object
0           O
1           O
2           O
3       B-NLP
4       I-NLP
        ...  
2767        O
2768        O
2769    B-NLP
2770        O
2771        O
Name: Tag, Length: 2772, dtype: object


### Reconstruct sentences and sequence of tags

In [60]:
def reconstruct_sent(df):
    sentences = []
    tags = []
    tmp_words = []
    tmp_tag = []
    idx = 0
    for i in df['Token'].values:
        if i!= ".":
            tmp_words.append(str(i)+" ")
            tag = df['Tag'].iloc[idx]
            tmp_tag.append(str(tag))
        else:
            tmp_words.append(str(i))
            sentence = ''.join(tmp_words)
            sentences.append(sentence)
            tmp_words = []
            tmp_tag.append("O")
            tags_seq = ','.join(tmp_tag)
            tags.append(tags_seq)
            tmp_tag = []
        idx+=1


    df = pd.DataFrame(list(zip(sentences, tags)), columns =['Sentence', 'Tags'])
    return df


In [61]:
# Train df with sentences instead of tokens
df_train_sents = reconstruct_sent(df_train)
# Dev df with sentences instead of tokens
df_dev_sents = reconstruct_sent(df_dev)
# Test df with sentences instead of tokens
df_test_sents = reconstruct_sent(df_test)

In [62]:
print("Number of sentences in the dataset", len(df_train_sents['Sentence']))
print("Number of tokens in the dataset",len(df_dev_sents['Sentence']))
print("Number of tokens in the dataset",len(df_test_sents['Sentence']))

Number of sentences in the dataset 692
Number of tokens in the dataset 82
Number of tokens in the dataset 105


In [65]:
label2id = {'B-NLP':1, 'I-NLP':2, 'O':0}  #{k: v for v, k in enumerate(df_train.Tags.unique())}  , 'nan':3
id2label = {1:'B-NLP', 2:'I-NLP', 0:'O'}  #{v: k for v, k in enumerate(df_train.Tags.unique())}  , 3:'nan'
label2id

{'B-NLP': 1, 'I-NLP': 2, 'O': 0}

## Tokenization

We try to preserve the alignments with the labels and their subwords

We use the BERT base uncased tokenizer 

In [67]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [68]:
def tokenize_and_preserve_labels(sentence, text_labels, tokenizer):
    """
    Word piece tokenization makes it difficult to match word labels
    back up with individual word pieces. This function tokenizes each
    word one at a time so that it is easier to preserve the correct
    label for each subword. It is, of course, a bit slower in processing
    time, but it will help our model achieve higher accuracy.
    """

    tokenized_sentence = []
    labels = []

    sentence = sentence.strip()

    for word, label in zip(sentence.split(), text_labels.split(",")):

        # Tokenize the word and count # of subwords the word is broken into
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)

        # Add the tokenized word to the final tokenized word list
        tokenized_sentence.extend(tokenized_word)

        # Add the same label to the new list of labels `n_subwords` times
        labels.extend([label] * n_subwords)

    return tokenized_sentence, labels


In [69]:
### This cell is for info purpose only ###
# For the train data
tokenize_and_preserve_labels(df_train_sents['Sentence'].iloc[20], df_train_sents['Tags'].iloc[20], tokenizer)
# For the dev data
tokenize_and_preserve_labels(df_dev_sents['Sentence'].iloc[20], df_dev_sents['Tags'].iloc[20], tokenizer)

(['furthermore',
  ',',
  'the',
  'empirical',
  'formula',
  '##e',
  'based',
  'on',
  'the',
  'results',
  'can',
  'be',
  'used',
  'to',
  'predict',
  'the',
  'parameter',
  'in',
  'esa',
  'to',
  'avoid',
  'parameter',
  'estimation',
  'that',
  'is',
  'usually',
  'time',
  '-',
  'consuming',
  '.'],
 ['O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'B-NLP',
  'O',
  'B-NLP',
  'O',
  'O',
  'B-NLP',
  'I-NLP',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O',
  'O'])

Quick demo of a tokenized example from our dataframe 

# Pytorch Datsets and Dataloaders preparation

## Pytorch Dataset classes

In [70]:
class dataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        # step 1: tokenize (and adapt corresponding labels)
        sentence = self.data.Sentence[index]  
        word_labels = self.data.Tags[index]  
        tokenized_sentence, labels = tokenize_and_preserve_labels(sentence, word_labels, self.tokenizer)
        
        # step 2: add special tokens (and corresponding labels)
        tokenized_sentence = ["[CLS]"] + tokenized_sentence + ["[SEP]"] # add special tokens
        labels.insert(0, "O") # add outside label for [CLS] token
        labels.insert(-1, "O") # add outside label for [SEP] token

        # step 3: truncating/padding
        maxlen = self.max_len

        if (len(tokenized_sentence) > maxlen):
          # truncate
          tokenized_sentence = tokenized_sentence[:maxlen]
          labels = labels[:maxlen]
        else:
          # pad
          tokenized_sentence = tokenized_sentence + ['[PAD]'for _ in range(maxlen - len(tokenized_sentence))]
          labels = labels + ["O" for _ in range(maxlen - len(labels))]

        # step 4: obtain the attention mask
        attn_mask = [1 if tok != '[PAD]' else 0 for tok in tokenized_sentence]
        
        # step 5: convert tokens to input ids
        ids = self.tokenizer.convert_tokens_to_ids(tokenized_sentence)

        label_ids = [label2id[label] for label in labels]
        
        return {
              'ids': torch.tensor(ids, dtype=torch.long),
              'mask': torch.tensor(attn_mask, dtype=torch.long),
              'targets': torch.tensor(label_ids, dtype=torch.long)
        } 
    
    def __len__(self):
        return self.len

### Hyperparameters that help define the dataset

In [71]:
MAX_LEN = 64 
TRAIN_BATCH_SIZE = 4
VALID_BATCH_SIZE = 2
EPOCHS = 1 
LEARNING_RATE = 1e-05  
MAX_GRAD_NORM = 10 

In [72]:
training_set = dataset(df_train_sents, tokenizer, MAX_LEN)
dev_set = dataset(df_dev_sents, tokenizer, MAX_LEN)
testing_set = dataset(df_test_sents, tokenizer, MAX_LEN)

In [73]:
training_set[0]

{'ids': tensor([  101,  1999,  2023,  3720,  2057,  6848,  2195, 12046,  2015,  1997,
          2522,  5886, 10127,  4225,  2478,  2415,  2075,  3399,  1998,  8556,
          1996,  6179,  2791,  1997,  2107, 12046,  2015,  2005,  2592, 13063,
          1999,  6882,  3793,  4245,  1012,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0]),
 'mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 'targets': tensor([0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 2, 0, 0, 1, 1, 2, 0, 0, 0, 0, 0, 0,
         0, 1, 1, 0, 0, 0, 0, 1, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])}

In [74]:
testing_set[26]

{'ids': tensor([  101,  1996,  3463,  1997,  5164, 16293,  2015,  2006,  5046,  3128,
          2653,  8720,  2024,  2036,  6022,  2488,  2084,  1996,  2110,  1011,
          1997,  1011,  1996,  1011,  2396,  3921,  1012,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0]),
 'mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 'targets': tensor([0, 0, 0, 0, 1, 2, 2, 0, 0, 1, 2, 2, 0, 0, 0, 0, 0, 0, 1, 2, 2, 2, 2, 2,
         2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])}

In [75]:
# print the first 30 tokens and corresponding labels
for token, label in zip(tokenizer.convert_ids_to_tokens(training_set[0]["ids"][:30]), training_set[0]["targets"][:30]):
  print('{0:10}  {1}'.format(token, id2label[label.item()]))

[CLS]       O
in          O
this        O
article     O
we          O
discuss     O
several     O
metric      B-NLP
##s         B-NLP
of          I-NLP
co          I-NLP
##her       I-NLP
##ence      I-NLP
defined     O
using       O
center      B-NLP
##ing       B-NLP
theory      I-NLP
and         O
investigate  O
the         O
useful      O
##ness      O
of          O
such        O
metric      B-NLP
##s         B-NLP
for         O
information  O
ordering    O


## Pytorch Dataloaders

In [76]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

# Model definition

In [78]:
model = BertForTokenClassification.from_pretrained('bert-base-uncased',  
                                                   num_labels=len(id2label),
                                                   id2label=id2label,
                                                   label2id=label2id)
model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwis

In [79]:
ids = training_set[0]["ids"].unsqueeze(0)
mask = training_set[0]["mask"].unsqueeze(0)
targets = training_set[0]["targets"].unsqueeze(0)
ids = ids.to(device)
mask = mask.to(device)
targets = targets.to(device)
outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
initial_loss = outputs[0]
initial_loss

tensor(1.0953, grad_fn=<NllLossBackward0>)

In [80]:
tr_logits = outputs[1]
tr_logits.shape

torch.Size([1, 64, 3])

In [81]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)

# Training 

## Training function

In [82]:
# Defining the training function on the 80% of the dataset for tuning the bert model
def train(epoch):
    tr_loss, tr_accuracy = 0, 0
    nb_tr_examples, nb_tr_steps = 0, 0
    tr_preds, tr_labels = [], []
    # put model in training mode
    model.train()
    
    for idx, batch in enumerate(training_loader):
        
        ids = batch['ids'].to(device, dtype = torch.long)
        mask = batch['mask'].to(device, dtype = torch.long)
        targets = batch['targets'].to(device, dtype = torch.long)

        outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
        loss, tr_logits = outputs.loss, outputs.logits
        tr_loss += loss.item()

        nb_tr_steps += 1
        nb_tr_examples += targets.size(0)
        
        if idx % 100==0:
            loss_step = tr_loss/nb_tr_steps
            print(f"Training loss per 100 training steps: {loss_step}")
           
        # compute training accuracy
        flattened_targets = targets.view(-1) # shape (batch_size * seq_len,)
        active_logits = tr_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
        flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
        # now, use mask to determine where we should compare predictions with targets (includes [CLS] and [SEP] token predictions)
        active_accuracy = mask.view(-1) == 1 # active accuracy is also of shape (batch_size * seq_len,)
        targets = torch.masked_select(flattened_targets, active_accuracy)
        predictions = torch.masked_select(flattened_predictions, active_accuracy)
        
        tr_preds.extend(predictions)
        tr_labels.extend(targets)
        
        tmp_tr_accuracy = accuracy_score(targets.cpu().numpy(), predictions.cpu().numpy())
        tr_accuracy += tmp_tr_accuracy
    
        # gradient clipping
        torch.nn.utils.clip_grad_norm_(
            parameters=model.parameters(), max_norm=MAX_GRAD_NORM
        )
        
        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    epoch_loss = tr_loss / nb_tr_steps
    tr_accuracy = tr_accuracy / nb_tr_steps
    print(f"Training loss epoch: {epoch_loss}")
    print(f"Training accuracy epoch: {tr_accuracy}")


In [83]:
for epoch in range(EPOCHS):
    print(f"Training epoch: {epoch + 1}")
    train(epoch)

Training epoch: 1
Training loss per 100 training steps: 1.1123508214950562
Training loss per 100 training steps: 0.3489703205552432
Training loss epoch: 0.2839571420598581
Training accuracy epoch: 0.8072301541917124


# Model Evaluation

## Eval function

In [84]:
def valid(model, testing_loader):
    # put model in evaluation mode
    model.eval()
    
    eval_loss, eval_accuracy = 0, 0
    nb_eval_examples, nb_eval_steps = 0, 0
    eval_preds, eval_labels = [], []
    
    with torch.no_grad():
        for idx, batch in enumerate(testing_loader):
            
            ids = batch['ids'].to(device, dtype = torch.long)
            mask = batch['mask'].to(device, dtype = torch.long)
            targets = batch['targets'].to(device, dtype = torch.long)
            
            outputs = model(input_ids=ids, attention_mask=mask, labels=targets)
            loss, eval_logits = outputs.loss, outputs.logits
            
            eval_loss += loss.item()

            nb_eval_steps += 1
            nb_eval_examples += targets.size(0)
        
            if idx % 100==0:
                loss_step = eval_loss/nb_eval_steps
                print(f"Validation loss per 100 evaluation steps: {loss_step}")
              
            # compute evaluation accuracy
            flattened_targets = targets.view(-1) # shape (batch_size * seq_len,)
            active_logits = eval_logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
            flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size * seq_len,)
            # now, use mask to determine where we should compare predictions with targets (includes [CLS] and [SEP] token predictions)
            active_accuracy = mask.view(-1) == 1 # active accuracy is also of shape (batch_size * seq_len,)
            targets = torch.masked_select(flattened_targets, active_accuracy)
            predictions = torch.masked_select(flattened_predictions, active_accuracy)
            
            eval_labels.extend(targets)
            eval_preds.extend(predictions)
            
            tmp_eval_accuracy = accuracy_score(targets.cpu().numpy(), predictions.cpu().numpy())
            eval_accuracy += tmp_eval_accuracy
    
    print(eval_labels)
    print(eval_preds)

    labels = [id2label[id.item()] for id in eval_labels]
    predictions = [id2label[id.item()] for id in eval_preds]

    print(labels)
    print(predictions)
    
    eval_loss = eval_loss / nb_eval_steps
    eval_accuracy = eval_accuracy / nb_eval_steps
    print(f"Validation Loss: {eval_loss}")
    print(f"Validation Accuracy: {eval_accuracy}")

    return labels, predictions


In [85]:
labels, predictions = valid(model, testing_loader)

Validation loss per 100 evaluation steps: 0.2417120337486267
[tensor(0), tensor(1), tensor(1), tensor(1), tensor(2), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(1), tensor(1), tensor(1), tensor(1), tensor(1), tensor(2), tensor(2), tensor(2), tensor(2), tensor(2), tensor(2), tensor(0), tensor(1), tensor(1), tensor(1), tensor(1), tensor(0), tensor(0), tensor(0), tensor(1), tensor(2), tensor(2), tensor(0), tensor(0), tensor(0), tensor(0), tensor(1), tensor(0), tensor(1), tensor(1), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(1), tensor(1), tensor(1), tensor(0), tensor(0), tensor(1), tensor(2), tensor(2), tensor(2), tensor(0), tensor(0), tensor(0), tensor(1), tensor(1), tensor(1), tensor(2), tensor(2), tensor(2), tensor(0), tensor(1), tensor(1), tensor(2), tensor(2), tensor(2), tensor(0), tensor(0), tensor(0), tensor(1), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(0), tensor(1), tensor(2), tensor(2), tensor(2), ten

# Metrics

In [87]:
from seqeval.metrics import classification_report

print(labels)
print(predictions)

print(classification_report([labels], [predictions]))


['O', 'B-NLP', 'B-NLP', 'B-NLP', 'I-NLP', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-NLP', 'B-NLP', 'B-NLP', 'B-NLP', 'B-NLP', 'I-NLP', 'I-NLP', 'I-NLP', 'I-NLP', 'I-NLP', 'I-NLP', 'O', 'B-NLP', 'B-NLP', 'B-NLP', 'B-NLP', 'O', 'O', 'O', 'B-NLP', 'I-NLP', 'I-NLP', 'O', 'O', 'O', 'O', 'B-NLP', 'O', 'B-NLP', 'B-NLP', 'O', 'O', 'O', 'O', 'O', 'B-NLP', 'B-NLP', 'B-NLP', 'O', 'O', 'B-NLP', 'I-NLP', 'I-NLP', 'I-NLP', 'O', 'O', 'O', 'B-NLP', 'B-NLP', 'B-NLP', 'I-NLP', 'I-NLP', 'I-NLP', 'O', 'B-NLP', 'B-NLP', 'I-NLP', 'I-NLP', 'I-NLP', 'O', 'O', 'O', 'B-NLP', 'O', 'O', 'O', 'O', 'O', 'O', 'B-NLP', 'I-NLP', 'I-NLP', 'I-NLP', 'O', 'B-NLP', 'B-NLP', 'B-NLP', 'O', 'O', 'O', 'O', 'O', 'B-NLP', 'B-NLP', 'O', 'B-NLP', 'B-NLP', 'O', 'O', 'B-NLP', 'B-NLP', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-NLP', 'I-NLP', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-NLP', 'B-NLP', 'B-NLP', 'B-NLP', 'I-NLP', 'I-

In [88]:
from transformers import pipeline

pipe = pipeline("token-classification", model=model, tokenizer=tokenizer)
res =[]

for s in df_test_sents['Sentence'].values:
    print("SENT",s)
    res.append(pipe(s))


print(res)


SENT Recent work in natural language generation has begun to take linguistic variation into account , developing algorithms that are capable of modifying the system 's linguistic style based either on the user 's linguistic style or other factors , such as personality or politeness .
SENT While stylistic control has traditionally relied on handcrafted rules , statistical methods are likely to be needed for generation systems to scale to the production of the large range of variation observed in human dialogues .
SENT Previous work on statistical natural language generation ( SNLG ) has shown that the grammaticality and naturalness of generated utterances can be optimized from data ; however these data - driven methods have not been shown to produce stylistic variation that is perceived by humans in the way that the system intended .
SENT This paper describes Personage , a highly parameterizable language generator whose parameters are based on psychological findings about the linguistic

In [89]:
sentence = "Second , we define entropy - based measures that estimate the correspondence of target - language phrases to translationese , thereby eliminating the need to annotate the parallel corpus with information pertaining to the direction of translation ."

inputs = tokenizer(sentence, padding='max_length', truncation=True, max_length=MAX_LEN, return_tensors="pt")

# move to gpu
ids = inputs["input_ids"].to(device)
mask = inputs["attention_mask"].to(device)
# forward pass
outputs = model(ids, mask)
logits = outputs[0]

active_logits = logits.view(-1, model.num_labels) # shape (batch_size * seq_len, num_labels)
flattened_predictions = torch.argmax(active_logits, axis=1) # shape (batch_size*seq_len,) - predictions at the token level

tokens = tokenizer.convert_ids_to_tokens(ids.squeeze().tolist())
token_predictions = [id2label[i] for i in flattened_predictions.cpu().numpy()]
wp_preds = list(zip(tokens, token_predictions)) # list of tuples. Each tuple = (wordpiece, prediction)

word_level_predictions = []
for pair in wp_preds:
  if (pair[0].startswith(" ##")) or (pair[0] in ['[CLS]', '[SEP]', '[PAD]']):
    # skip prediction
    continue
  else:
    word_level_predictions.append(pair[1])

# we join tokens, if they are not special ones
str_rep = " ".join([t[0] for t in wp_preds if t[0] not in ['[CLS]', '[SEP]', '[PAD]']]).replace(" ##", "")
print(str_rep)
print(word_level_predictions)



second , we define entropy - based measures that estimate the correspondence of target - language phrases to translationese , thereby eliminating the need to annotate the parallel corpus with information pertaining to the direction of translation .
['O', 'O', 'O', 'O', 'B-NLP', 'I-NLP', 'I-NLP', 'I-NLP', 'O', 'O', 'O', 'O', 'O', 'B-NLP', 'I-NLP', 'I-NLP', 'I-NLP', 'O', 'B-NLP', 'I-NLP', 'O', 'O', 'O', 'O', 'O', 'O', 'B-NLP', 'B-NLP', 'O', 'O', 'B-NLP', 'I-NLP', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-NLP', 'O']


In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
# confusion matrix
ref_labels=['B-NLP', 'I-NLP', 'O']
conf_mat = confusion_matrix(predictions, labels, labels=ref_labels)
sns.set(font_scale=1)
x_axis_labels = ref_labels
y_axis_labels = ref_labels
matrix = sns.heatmap(conf_mat, annot=True, fmt='d', linewidths=.5, cmap='flare', xticklabels=x_axis_labels,
                     yticklabels=y_axis_labels)
matrix.set(xlabel='predicted', ylabel='actual')