# 1. Information about the submission

## 1.1 Name and number of the assignment 

**Assignment №1 Comparison Identification**

## 1.2 Student name

**Insaf Ashrapov**

## 1.3 Codalab user ID

**Insafq, team name Sber2**

# 2. Technical Report

## 2.1 Methodology 

**Task description** 

Named-entity recognition (NER) is a natural language processing task in which entities in a given text, such as the names of people, places, or organizations, are identified and categorized. NER algorithms can be used to vet email for malicious content or to gauge audience interests on social media. It is also useful for search engines, helping users quickly find related articles on a topic. *(generated by ChatGPT)*

In our "Comparison Identification Challenge" we face a classical NER problem, where transformers should easily outperform other approaches.
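To make the tagging scheme concrete, here is a minimal sketch of how a BIO tag sequence with the challenge's three labels (Object, Aspect, Predicate) maps to entity spans. The sentence and the helper name `bio_to_spans` are hypothetical and only serve as an illustration; they are not part of the submitted solution.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collect (label, start, end_exclusive) spans from a BIO tag sequence."""
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        # An I- tag continues the open span only if the label matches.
        if prefix == "I" and name == label:
            continue
        # Anything else closes the currently open span first.
        if label is not None:
            spans.append((label, start, i))
            label, start = None, None
        # A B- tag (or a stray I- tag) opens a new span.
        if prefix in ("B", "I"):
            label, start = name, i
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical sentence, not taken from the dataset.
tokens = ["Python", "is", "easier", "to", "learn", "than", "C++"]
tags = ["B-Object", "O", "B-Predicate", "O", "B-Aspect", "O", "B-Object"]
spans = bio_to_spans(tags)
for label, start, end in spans:
    print(label, tokens[start:end])
```

Entity-level metrics such as the F1 reported later are computed over spans like these rather than over individual tokens.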


***Examined approaches***

The following did not work or had no noticeable effect:
1. LSTM: poor metric. FastText embeddings would most likely improve the score significantly.
2. Early stopping: the validation loss is a poor indicator of model quality.
3. Longer training: training longer than in the final solution gave worse results and caused overfitting.
4. xlm-roberta-large: gave the worst result due to heavy overfitting.
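Since the validation loss tracked model quality poorly, one variant that was not tried here would be to stop on the validation F1 instead. The sketch below is a minimal, framework-independent version of that idea; the class name, the patience value, and the per-epoch scores are all made up for illustration.

```python
class MetricEarlyStopping:
    """Stop when a higher-is-better metric has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_rounds = 0

    def should_stop(self, value: float) -> bool:
        if value > self.best + self.min_delta:
            # New best score: remember it and reset the counter.
            self.best = value
            self.bad_rounds = 0
        else:
            self.bad_rounds += 1
        return self.bad_rounds >= self.patience


# Made-up per-epoch validation F1 values, for illustration only.
f1_per_epoch = [0.61, 0.68, 0.67, 0.66, 0.70]
stopper = MetricEarlyStopping(patience=2)
stopped_at = None
for epoch, f1 in enumerate(f1_per_epoch, start=1):
    if stopper.should_stop(f1):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

With the Hugging Face `Trainer`, a similar effect should be achievable with `EarlyStoppingCallback` together with `metric_for_best_model="f1"` and `load_best_model_at_end=True`, although that configuration was not tested in this report.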

## 2.2 Discussion of results

The final solution for the Named Entity Recognition (NER) task relies entirely on Transformers. Transformer models are powerful neural network architectures that can learn context and extract meaningful information from natural language data. With enough training and fine-tuning, a transformer can recognize entities in a given text corpus accurately.
xlm-roberta-large-finetuned-conll03-english

Method | dev | test
--- | --- | ---
xlm-roberta-large fine-tune | 0.13 | -
LSTM, 5 epochs | 0.20 | -
LSTM, 300 epochs | 0.3 | -
bert-base-uncased NER fine-tune | 0.686 | 0.801


# 3. Code

## 3.1 Requirements + imports

In [None]:
!pip install spacy
!pip install transformers
!pip install sentencepiece
!pip install datasets evaluate transformers[sentencepiece]
!pip install seqeval
# and some other your dependencies

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import gc
import os
from typing import Tuple

import joblib
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import transformers
from datasets import Dataset, DatasetDict, load_metric
from sklearn import model_selection, preprocessing
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
# Only DataLoader is imported from torch so that `Dataset` keeps referring to datasets.Dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
from tqdm.auto import trange
from transformers import AdamW, Trainer, TrainingArguments
from transformers import get_linear_schedule_with_warmup
from transformers.file_utils import cached_property

#to suppress warnings 
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'

def cleanup():
    gc.collect()
    torch.cuda.empty_cache()

## 3.2 Download the data

In [None]:
!wget --no-cache --backups=1 "https://raw.githubusercontent.com/s-nlp/semantic-role-labelling/main/dev_no_answers.tsv"
!wget --no-cache --backups=1 "https://raw.githubusercontent.com/s-nlp/semantic-role-labelling/main/test_no_answers.tsv"
!wget --no-cache --backups=1 "https://raw.githubusercontent.com/s-nlp/semantic-role-labelling/main/train.tsv"
# if some needed file is not in the public domain use google drive or other free hosting to make them available

--2022-12-20 21:05:55--  https://raw.githubusercontent.com/s-nlp/semantic-role-labelling/main/dev_no_answers.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43071 (42K) [text/plain]
Saving to: ‘dev_no_answers.tsv’


2022-12-20 21:05:56 (50.0 MB/s) - ‘dev_no_answers.tsv’ saved [43071/43071]

--2022-12-20 21:05:56--  https://raw.githubusercontent.com/s-nlp/semantic-role-labelling/main/test_no_answers.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58177 (57K) [text/plain]
Saving to: ‘test_no_answers.tsv’


2022-12-20 21:05:5

## 3.3 Preprocessing 

In [None]:
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


def read_dataset(filename, splitter="\t"):
    """Read a word<splitter>tag file; blank lines separate sentences."""
    data = []
    sentence = []
    tags = []
    sentences = []
    tagses = []
    with open(filename) as f:
        for line in f:
            if not line.isspace():
                word, tag = line.split(splitter)
                sentence.append(word)
                tags.append(tag.strip())
            else:
                # Store the finished sentence, then start fresh buffers.
                data.append((sentence, tags))
                sentences.append(sentence)
                tagses.append(tags)
                sentence = []
                tags = []
    return data, sentences, tagses


def read_dev_test(filename, splitter="\t"):
    data = []
    sentence = []

    with open(filename) as f:
        for line in f:
            if not line.isspace():
                word = line
                sentence.append(word.replace('\n',''))
            else:
                data.append(sentence)
                sentence = []
    return data
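The readers above expect one token per line, a tab between word and tag, and a blank line between sentences. The sketch below parses a small in-memory sample in that layout; the sample text and the helper name `parse_tagged` are hypothetical, and unlike the notebook functions it also flushes a final sentence that is not followed by a blank line.

```python
import io
from typing import List, Tuple


def parse_tagged(stream, splitter: str = "\t") -> List[Tuple[List[str], List[str]]]:
    """Parse 'word<splitter>tag' lines; blank lines separate sentences."""
    data, words, tags = [], [], []
    for line in stream:
        if line.strip():
            word, tag = line.rstrip("\n").split(splitter)
            words.append(word)
            tags.append(tag)
        elif words:
            data.append((words, tags))
            words, tags = [], []
    if words:  # last sentence when the input has no trailing blank line
        data.append((words, tags))
    return data


# Hypothetical two-sentence sample in the same layout as train.tsv.
sample = "Python\tB-Object\nis\tO\nfaster\tB-Predicate\n\nthan\tO\nRuby\tB-Object\n"
parsed = parse_tagged(io.StringIO(sample))
for words, tags in parsed:
    print(list(zip(words, tags)))
```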

In [None]:
training_data, train_sentences, train_tags = read_dataset("train.tsv")
test_data, test_sen, test_tag = read_dataset("test_no_answers.tsv", splitter="\n")
dev_data, dev_sen, dev_tags = read_dataset("dev_no_answers.tsv", splitter="\n")

word_to_ix = {}

# For each words-list (sentence) and tags-list in each tuple of training_data
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:  # word has not been assigned an index yet
            word_to_ix[word] = len(word_to_ix)  # Assign each word with a unique index
            
for sent, tags in test_data:
    for word in sent:
        if word not in word_to_ix:  # word has not been assigned an index yet
            word_to_ix[word] = len(word_to_ix)  # Assign each word with a unique index

for sent, tags in dev_data:
    for word in sent:
        if word not in word_to_ix:  # word has not been assigned an index yet
            word_to_ix[word] = len(word_to_ix)  # Assign each word with a unique index

tag_to_ix = {
    "O": 0,
    "B-Object": 1,
    "I-Object": 2,
    "B-Aspect": 3,
    "I-Aspect": 4,
    "B-Predicate": 5,
    "I-Predicate": 6
}  # Assign each tag with a unique index

idx_to_tag = dict(map(reversed, tag_to_ix.items()))

EMBEDDING_DIM = 32
HIDDEN_DIM = 64
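The cell above adds dev and test words to the vocabulary, so those words receive embeddings that are never updated during training. A common alternative, sketched below under the assumption of a hypothetical `<unk>` placeholder token, is to build the vocabulary from the training data only and map every unseen word to a shared index.

```python
from typing import Dict, Iterable, List

UNK = "<unk>"  # hypothetical placeholder token, not part of the original notebook


def build_vocab(sentences: Iterable[List[str]]) -> Dict[str, int]:
    """Assign an index to each training word; index 0 is reserved for unknown words."""
    vocab = {UNK: 0}
    for sent in sentences:
        for word in sent:
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map words to indices, falling back to the UNK index for unseen words."""
    return [vocab.get(word, vocab[UNK]) for word in sentence]


# Toy data for illustration only.
train = [["Python", "is", "fast"], ["Ruby", "is", "slow"]]
vocab = build_vocab(train)
ids = encode(["Go", "is", "fast"], vocab)
print(ids)
```

For the unknown-word embedding to be useful it would also have to be seen during training, for example by replacing rare training words with `<unk>`; that step is omitted here.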

## 3.4 Training baseline

In [None]:
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)

In [None]:
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)

for epoch in tqdm(range(300)):  # 300 epochs, as in the "LSTM 300 epochs" row of the results table
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()


### Inference

In [None]:
with open("out_test.tsv", "w") as w:
    with torch.no_grad():
        for sentence in tqdm(test_data):
            inputs = prepare_sequence(sentence[0], word_to_ix)
            tag_scores = model(inputs)
            tags = [idx_to_tag[int(i)] for i in tag_scores.argmax(dim=-1)]
            for i, y in zip(sentence[0], tags):
                w.write(f"{i}\t{y}\n")
            w.write("\n")
            
with open("out_dev.tsv", "w") as w:
    with torch.no_grad():
        for sentence in tqdm(dev_data):
            inputs = prepare_sequence(sentence[0], word_to_ix)
            tag_scores = model(inputs)
            tags = [idx_to_tag[int(i)] for i in tag_scores.argmax(dim=-1)]
            for i, y in zip(sentence[0], tags):
                w.write(f"{i}\t{y}\n")
            w.write("\n")
            


In [None]:
!zip out.zip out_test.tsv
!zip out_dev.zip out_dev.tsv

updating: out_test.tsv (deflated 74%)
updating: out_dev.tsv (deflated 72%)


## Hugging Face baseline

### Preprocessing labels

In [None]:
from torch import Tensor
from typing import Optional


def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels

def tokenize_and_align_labels(examples, label_all_tokens=False):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples['tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        label_ids = [label_list.index(idx) if isinstance(idx, str) else idx for idx in label_ids]

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs



def compute_metrics(p):
    predictions, labels, inputs = p.predictions, p.label_ids, p.inputs
    predictions = np.argmax(predictions, axis=2)

    # send only the first token of each word to the evaluation
    true_predictions = []
    true_labels = []
    for prediction, label, tokens in zip(predictions, labels, inputs):
        true_predictions.append([])
        true_labels.append([])
        # distinct loop names avoid shadowing the `p` argument
        for (pred_id, label_id, token_id) in zip(prediction, label, tokens):
            if label_id != -100 and not tokenizer.convert_ids_to_tokens(int(token_id)).startswith('##'):
                true_predictions[-1].append(label_list[pred_id])
                true_labels[-1].append(label_list[label_id])

    results = metric.compute(predictions=true_predictions, references=true_labels, zero_division=0)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
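The alignment performed in `tokenize_and_align_labels` with `label_all_tokens=False` can be illustrated on a toy example without loading a tokenizer. The sketch below keeps the label on the first sub-token of each word and masks everything else with `-100`, the index that PyTorch's cross-entropy loss ignores by default. The `word_ids` list is a hypothetical tokenization written by hand, and `align_first_subtoken` is an illustrative helper name.

```python
from typing import List, Optional

IGNORE = -100  # index skipped by PyTorch's cross-entropy loss by default


def align_first_subtoken(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Keep the label on the first sub-token of each word and mask everything else."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:          # special tokens such as [CLS] / [SEP]
            aligned.append(IGNORE)
        elif word_id != previous:    # first sub-token of a word
            aligned.append(word_labels[word_id])
        else:                        # continuation sub-token of the same word
            aligned.append(IGNORE)
        previous = word_id
    return aligned


# Hypothetical tokenization: [CLS] play ##station is better [SEP]
word_ids = [None, 0, 0, 1, 2, None]
word_labels = [1, 0, 5]  # B-Object, O, B-Predicate in the tag_to_ix mapping
aligned = align_first_subtoken(word_ids, word_labels)
print(aligned)
```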


### Dataset creation

In [None]:
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import DataCollatorForTokenClassification, pipeline

model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


metric = load_metric("seqeval")

training_data, sentences, tag = read_dataset("train.tsv")
enc_tag = []

label_list = list(tag_to_ix.keys())
for val in tag:
  enc_tag.append([tag_to_ix.get(item)  for item in val])

train_sentences, test_sentences, train_tag, test_tag = model_selection.train_test_split(
                                      sentences, 
                                      tag, 
                                      random_state=42, 
                                      test_size=0.1
                                    )

ner_data = DatasetDict({
    'train': Dataset.from_pandas(pd.DataFrame({'tokens': train_sentences, 'tags': train_tag})),
    'test': Dataset.from_pandas(pd.DataFrame({'tokens': test_sentences, 'tags': test_tag}))
})
tokenized_datasets = ner_data.map(tokenize_and_align_labels, batched=True)


In [None]:
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list), ignore_mismatched_sizes=True)
model.config.id2label = dict(enumerate(label_list))
model.config.label2id = {v: k for k, v in model.config.id2label.items()}


args = TrainingArguments(
    "ner",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
    weight_decay=0.01,
    save_strategy='no',
    report_to='none',
    include_inputs_for_metrics=True,
)

data_collator = DataCollatorForTokenClassification(tokenizer)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

### Fit predict

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train() 

In [None]:
trainer.evaluate()

The following columns in the evaluation set don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: tokens, tags. If tokens, tags are not expected by `BertForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 234
  Batch size = 16


{'eval_loss': 0.1575445532798767,
 'eval_precision': 0.8196981731532963,
 'eval_recall': 0.9036777583187391,
 'eval_f1': 0.8596418159100374,
 'eval_accuracy': 0.9506791720569211,
 'eval_runtime': 1.3921,
 'eval_samples_per_second': 168.096,
 'eval_steps_per_second': 10.775,
 'epoch': 4.0}
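As a quick sanity check on the output above, the reported F1 is the harmonic mean of the reported precision and recall. The helper name `f1_from` is illustrative; the numbers are copied from the evaluation output.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the trainer.evaluate() output.
precision = 0.8196981731532963
recall = 0.9036777583187391
f1 = f1_from(precision, recall)
print(round(f1, 4))  # 0.8596
```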

### Inference

In [None]:
from transformers import pipeline
device = 'cuda' if torch.cuda.is_available() else 'cpu'

pipe = pipeline(model=model.to('cpu'), tokenizer=tokenizer, task='ner', aggregation_strategy='max', 
                ignore_labels=[],
                grouped_entities=False, device='cpu')

test_data = read_dev_test("test_no_answers.tsv", splitter="\n")
dev_data = read_dev_test("dev_no_answers.tsv", splitter="\n")

with open("out_test_bert.tsv", "w") as w:
    with torch.no_grad():
        for list_sentence in tqdm(test_data):
            text = " ".join(list_sentence)
            output = pd.DataFrame(pipe(text))
            output = output[~output['word'].str.startswith("##")]
            tags = output['entity'].values
            len_output = 0 

            while len_output < len(tags) and len_output < len(list_sentence):
              w.write(f"{list_sentence[len_output]}\t{tags[len_output]}\n")
              len_output = len_output + 1

            while len_output < len(list_sentence):
              w.write(f"{list_sentence[len_output]}\t{list(tag_to_ix.keys())[0]}\n")
              len_output = len_output + 1

            w.write("\n")

with open("out_dev_bert.tsv", "w") as w:
    with torch.no_grad():
        for list_sentence in tqdm(dev_data):
            text = " ".join(list_sentence)
            output = pd.DataFrame(pipe(text))
            output = output[~output['word'].str.startswith("##")]
            tags = output['entity'].values
            len_output = 0 

            while len_output < len(tags) and len_output < len(list_sentence):
              w.write(f"{list_sentence[len_output]}\t{tags[len_output]}\n")
              len_output = len_output + 1

            while len_output < len(list_sentence):
              w.write(f"{list_sentence[len_output]}\t{list(tag_to_ix.keys())[0]}\n")
              len_output = len_output + 1

            w.write("\n")

!zip out_test_xlm.zip out_test_bert.tsv
!zip out_dev_xlm.zip out_dev_bert.tsv

100%|██████████| 360/360 [00:48<00:00,  7.47it/s]
100%|██████████| 283/283 [00:42<00:00,  6.65it/s]


updating: out_test_bert.tsv (deflated 75%)
updating: out_dev_bert.tsv (deflated 74%)
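The inference loops above recover word-level tags by dropping pieces that start with `##` and then pairing the remaining tags with the words by position. This can drift when a word is split into several pieces that do not carry the `##` prefix, as BERT's tokenizer does around punctuation. A possible alternative, sketched here as an untested assumption rather than the method used for the submission, is to map token predictions back to words through the `word_ids()` of a fast tokenizer called with `is_split_into_words=True`. The helper below shows only the mapping step on hand-written inputs.

```python
from typing import List, Optional


def tags_per_word(
    word_ids: List[Optional[int]],
    token_tags: List[str],
    n_words: int,
    default: str = "O",
) -> List[str]:
    """Take the tag of the first sub-token of every word; pad missing words with `default`."""
    tags = [default] * n_words
    seen = set()
    for word_id, tag in zip(word_ids, token_tags):
        # Skip special tokens, continuation pieces, and out-of-range ids.
        if word_id is None or word_id in seen or word_id >= n_words:
            continue
        seen.add(word_id)
        tags[word_id] = tag
    return tags


# Hypothetical case: the second word is split into three pieces without a "##" prefix.
word_ids = [None, 0, 1, 1, 1, 2, None]
token_tags = ["O", "B-Object", "O", "O", "O", "B-Predicate", "O"]
result = tags_per_word(word_ids, token_tags, n_words=3)
print(result)
```

Words lost to truncation simply keep the default tag, which mirrors the fallback to `O` in the loops above.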
