In [None]:
!pip install transformers datasets

## Fetching Dataset
We need the Wiki D/Similar dataset (wiki-d-similar.zip) for training or evaluation.

It should be in the same directory as this notebook.

You can get this dataset from [Sentence Transformers](https://github.com/m3hrdadfi/sentence-transformers).

If you are using Colab, you can use the next cell to upload the file directly, or to mount your Google Drive if you have already uploaded the file there. Otherwise, you can skip the next cell.


In [None]:
# 1. Upload Directly
#from google.colab import files
#files.upload()
#
# 2. Mounting Google Drive
#from google.colab import drive
#drive.mount('/content/gdrive')
#!cp gdrive/MyDrive/wiki-d-similar.zip .

Extract the dataset from the zip archive.

In [4]:
!mkdir nli
!cp wiki-d-similar.zip nli/
!7z x nli/wiki-d-similar.zip -onli/


7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,2 CPUs Intel(R) Xeon(R) CPU @ 2.20GHz (406F0),ASM,AES-NI)

Scanning the drive for archives:
  0M Scan         1 file, 32603485 bytes (32 MiB)

Extracting archive: nli/wiki-d-similar.zip
--
Path = nli/wiki-d-similar.zip
Type = zip
Physical Size = 32603485
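
If `7z` is not available in your environment, the same extraction could probably be done with Python's standard `zipfile` module. The following is a minimal sketch and not part of the original notebook; the helper name `extract_archive` is hypothetical, and the archive and folder names in the usage comment are the ones used above.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Extract every member of a zip archive into target_dir and return their names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`, safe to re-run
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return archive.namelist()


# Usage, assuming the archive sits next to the notebook:
# extract_archive("wiki-d-similar.zip", "nli")
```

Because the target directory is created with `exist_ok=True`, the cell can be re-run without an error about an existing folder.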

The next cell loads the dataset. Since the fields are separated by tabs rather than commas, we have to declare this with ```delimiter="\t"```.

In [5]:
from datasets import load_dataset

data_files = {"train": "train.csv", "test": "test.csv", "dev": "dev.csv"}

dataset = load_dataset("nli/wiki-d-similar", data_files=data_files, delimiter="\t")

Using custom data configuration wiki-d-similar-132f132706a1ec58


Downloading and preparing dataset csv/wiki-d-similar to /root/.cache/huggingface/datasets/csv/wiki-d-similar-132f132706a1ec58/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/wiki-d-similar-132f132706a1ec58/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.
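
To see why the delimiter matters, here is a small sketch that uses only Python's standard `csv` module. The column names are the ones that appear in this notebook; the two sample rows and the helper `read_rows` are made up for illustration, and this is not how `load_dataset` works internally.

```python
import csv
import io

# Hypothetical two-row sample in the same tab-separated layout.
SAMPLE = (
    "Sentence1\tSentence2\tLabel\n"
    "first sentence\tsecond sentence\tsimilar\n"
    "another sentence\tunrelated text\tdissimilar\n"
)


def read_rows(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# With the tab delimiter, each row has three separate columns.
rows = read_rows(SAMPLE, delimiter="\t")
print(rows[0]["Label"])

# With a comma delimiter, the whole header line is read as one column name,
# so a lookup such as row["Label"] would fail.
wrong = read_rows(SAMPLE, delimiter=",")
print(list(wrong[0].keys()))
```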



The labels in the wikinli dataset are "similar" and "dissimilar".

We have to map them to their corresponding ids; the function in the next cell does this:


In [6]:
def label2id(example):
  if example["Label"] == "similar":
    example["label"] = 1
  else:
    example["label"] = 0
  return example
dataset = dataset.map(label2id)

  0%|          | 0/126628 [00:00<?, ?ex/s]

  0%|          | 0/5497 [00:00<?, ?ex/s]

  0%|          | 0/5277 [00:00<?, ?ex/s]
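
Note that the `else` branch in `label2id` assigns 0 to every value other than "similar", including misspelled or missing labels. A stricter variant, sketched here under the assumption that the two labels named above are the only valid ones, looks the label up in a dictionary and fails on anything unexpected. The names `LABEL_TO_ID` and `label2id_strict` are hypothetical.

```python
LABEL_TO_ID = {"dissimilar": 0, "similar": 1}


def label2id_strict(example):
    """Return a copy of the example with an integer `label`; raise on unknown labels."""
    raw = example["Label"]
    if raw not in LABEL_TO_ID:
        raise ValueError(f"Unexpected label: {raw!r}")
    return {**example, "label": LABEL_TO_ID[raw]}


print(label2id_strict({"Label": "similar"}))
```

It could be passed to `dataset.map` in the same way as `label2id`.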

There are a few things to be aware of when feeding data into the model, such as tokenizing the sentence pairs and padding each batch to a common length. Fortunately, the tokenizer and data collator from the transformers library take care of this; we just need to initialize them properly.

In [7]:
from transformers import AutoTokenizer, DataCollatorWithPadding

checkpoint = "HooshvareLab/bert-fa-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["Sentence1"], example["Sentence2"], truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True).shuffle(seed=42)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
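
As a rough illustration of the dynamic padding that the data collator performs, the sketch below pads every sequence in a batch to the length of the longest one and builds the matching attention mask. It is written in plain Python for clarity and is not the implementation of `DataCollatorWithPadding`; the function `pad_batch`, the default pad id of 0, and the token ids are assumptions.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to the longest length in the batch.

    Returns the padded ids and an attention mask (1 = real token, 0 = padding).
    """
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * padding)
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids: the second sequence is one token shorter than the first.
batch = pad_batch([[2, 15, 9, 4], [2, 7, 4]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch, rather than to one global maximum length, keeps batches of short sentences small.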



We load both models so that we can evaluate them side by side.

In [9]:
from transformers import AutoModelForSequenceClassification
m3hrdadfi_model = AutoModelForSequenceClassification.from_pretrained('m3hrdadfi/bert-fa-base-uncased-wikinli', num_labels=2)
haddad_model = AutoModelForSequenceClassification.from_pretrained('demoversion/bert-fa-base-uncased-haddad-wikinli', num_labels=2)


## Evaluation

The function defined below returns the requested metric for the chosen split ('dev' or 'test'). There may be an easier way to do this, but I couldn't find the proper functions and had to do it manually.

In [10]:
import torch
from torch.utils.data import DataLoader
from datasets import load_metric

def calculate_metric(model, metric_name, dataset_name):
  # Keep only the columns the model expects: drop the raw text and metadata.
  tokenized_datasets2 = tokenized_datasets.remove_columns(["Sentence1", "Sentence2", 'Article Title', 'Article Link', 'Label'])
  # The model's forward pass expects the target column to be named "labels".
  tokenized_datasets2 = tokenized_datasets2.rename_column("label", "labels")
  tokenized_datasets2.set_format("torch")
  # Run on the GPU when one is available.
  device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
  model.to(device)
  eval_dataloader = DataLoader(
      tokenized_datasets2[dataset_name], batch_size=32, collate_fn=data_collator
  )

  metric = load_metric(metric_name)
  model.eval()  # switch off dropout for evaluation
  for batch in eval_dataloader:
      batch = {k: v.to(device) for k, v in batch.items()}
      with torch.no_grad():
          outputs = model(**batch)

      # The predicted class is the index of the largest logit.
      logits = outputs.logits
      predictions = torch.argmax(logits, dim=-1)
      metric.add_batch(predictions=predictions, references=batch["labels"])

  return metric.compute()
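
The core of the loop is turning logits into predictions and comparing them with the references. A framework-free sketch of that step, for the accuracy case, is given below. `argmax` and `batch_accuracy` are illustrative helpers, not functions from `torch` or `datasets`, and the logits are made-up numbers.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def batch_accuracy(logits, references):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up logits for four sentence pairs: [score for class 0, score for class 1].
logits = [[0.2, 1.3], [2.1, -0.4], [0.1, 0.3], [1.5, 0.9]]
references = [1, 0, 0, 0]
print(batch_accuracy(logits, references))
```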

We can now easily evaluate our models. As the output below shows, the dev accuracy of the m3hrdadfi model (about 77.89%) closely matches the ```77.84``` reported in the [Sentence Transformers](https://github.com/m3hrdadfi/sentence-transformers) repository.


In [11]:
calculate_metric(m3hrdadfi_model, "accuracy", "dev")


{'accuracy': 0.7788516202387721}

In [12]:
calculate_metric(haddad_model, "accuracy", "dev")

{'accuracy': 0.786242183058556}

To summarize the metrics on different splits, we can use pandas to create a table like this:

In [18]:
import pandas as pd

def create_metrics_table(models, metrics_name, datasets_name):
  # Build one row per model, with one column per (split, metric) pair.
  arr_dicts = []
  for model in models:
    row = dict()
    row['Model'] = model.config.name_or_path
    for d_name in datasets_name:
      for m_name in metrics_name:
        row[d_name+'_'+m_name] = calculate_metric(model, m_name, d_name)[m_name]
    arr_dicts.append(row)
  return pd.DataFrame(arr_dicts)

create_metrics_table([m3hrdadfi_model, haddad_model], ['accuracy', 'f1'], ['dev', 'test'])


| | Model | dev_accuracy | dev_f1 | test_accuracy | test_f1 |
|---|---|---|---|---|---|
| 0 | m3hrdadfi/bert-fa-base-uncased-wikinli | 0.778852 | 0.775793 | 0.766418 | 0.75991 |
| 1 | demoversion/bert-fa-base-uncased-haddad-wikinli | 0.786242 | 0.797487 | 0.77042 | 0.785666 |
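
For reference, F1 combines precision and recall for the positive class. The sketch below computes it by hand, assuming a binary setting in which label 1 ("similar") is the positive class. `binary_f1` is a hypothetical helper and the labels are made up, so it only illustrates the formula and does not reproduce the numbers in the table.

```python
def binary_f1(predictions, references, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives
    if tp == 0:
        return 0.0  # no true positives: precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up predictions and references for four sentence pairs.
print(binary_f1([1, 0, 1, 1], [1, 0, 0, 1]))
```

Unlike accuracy, F1 ignores true negatives, which is why the two columns can rank models differently.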
