# ParsBERT (v2.0)
A Transformer-based Model for Persian Language Understanding

We rebuilt the vocabulary and fine-tuned ParsBERT v1.1 on new Persian corpora so that ParsBERT can be used in a wider range of applications.


## Persian NER [ARMAN, PEYMA]

This task aims to extract named entities from text, such as names, and label them with the appropriate **NER** classes, such as locations, organizations, etc. The datasets used for this task contain sentences annotated in the **IOB** format. In this format, tokens that are not part of an entity are tagged as **"O"**, the **"B"** tag marks the first word of an entity, and the **"I"** tag marks the remaining words of the same entity. Both **"B"** and **"I"** tags are followed by a hyphen (or underscore) and then the entity category. Therefore, the **NER task is a multi-class token classification problem that assigns a label to each token of a raw input text**.
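
As a rough illustration of the tagging scheme, the following sketch groups a tag sequence into entity spans. The helper name `iob_to_spans` is hypothetical and not part of ParsBERT or of this notebook. It accepts both hyphen and underscore separators and, as a simplifying assumption, treats a stray `I` tag as the start of a new entity.

```python
def iob_to_spans(tags):
    """Group IOB tags into (entity_type, start, end) spans; `end` is exclusive."""
    spans = []
    start, current = None, None
    # A trailing "O" sentinel closes an entity that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        # Split "B-loc" or "B_LOC" into the prefix ("B") and the entity type ("loc").
        prefix, _, etype = tag.replace("_", "-").partition("-")
        if prefix == "I" and etype == current:
            continue  # still inside the current entity
        if current is not None:
            spans.append((current, start, i))  # close the open entity
            start, current = None, None
        if prefix in ("B", "I"):
            start, current = i, etype  # open a new entity


    return spans


tags = ["O", "O", "B_ORG", "O", "O", "B_LOC", "O", "O"]
print(iob_to_spans(tags))  # [('ORG', 2, 3), ('LOC', 5, 6)]
```

Each span gives the entity category and the token range it covers, which is the unit that entity-level metrics compare.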

There are two primary datasets used in Persian NER: **ARMAN** and **PEYMA**.

In [1]:
!nvidia-smi
!lscpu

Mon Aug 16 11:59:39 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   71C    P8    34W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install transformers==4.7.0
!pip install hazm==0.7.0
!pip install seqeval==1.2.2

Collecting transformers==4.7.0
  Downloading transformers-4.7.0-py3-none-any.whl (2.5 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting huggingface-hub==0.0.8
  Downloading huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: tokenizers, sacremoses, huggingface-hub, transformers
Successfully installed huggingface-hub-0.0.8 sacremoses-0.0.45 tokenizers-0.10.3 transformers-4.7.0
Collecting hazm==0.7.0
  Downloading hazm-0.7.0-py3-none-any.whl (316 kB)
Collecting nltk==3.3
  Downloading nltk-3.3.0.zip (1.4

In [3]:
!pip install PyDrive
import os
import IPython.display as ipd
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)



In [4]:
import os
import gc
import ast
import time
import hazm
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

import transformers
from transformers import AutoTokenizer, AutoConfig
from transformers import AutoModelForTokenClassification

from IPython.display import display, HTML, clear_output
from ipywidgets import widgets, Layout

from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from seqeval.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

print()
print('numpy', np.__version__)
print('pandas', pd.__version__)
print('transformers', transformers.__version__)
print('torch', torch.__version__)
print()

# If there's a GPU available...
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")


numpy 1.19.5
pandas 1.1.5
transformers 4.7.0
torch 1.9.0+cu102

There are 1 GPU(s) available.
We will use the GPU: Tesla K80


In [5]:
class NER:
    def __init__(self, model_name):
        self.normalizer = hazm.Normalizer()
        self.model_name = model_name
        self.config = AutoConfig.from_pretrained(self.model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(self.model_name)
        # self.labels = list(self.config.label2id.keys())
        self.id2label = self.config.id2label

    @staticmethod
    def load_ner_data(file_path, word_index, tag_index, delimiter, join=False):
        dataset, labels = [], []
        with open(file_path, encoding="utf8") as infile:
            sample_text, sample_label = [], []
            for line in infile:
                parts = line.strip().split(delimiter)
                if len(parts) > 1:
                    word, tag = parts[word_index], parts[tag_index]
                    if not word:
                        continue
                    sample_text.append(word)
                    sample_label.append(tag)
                else:
                    # end of sample
                    if sample_text and sample_label:
                        if join:
                            dataset.append(' '.join(sample_text))
                            labels.append(' '.join(sample_label))
                        else:
                            dataset.append(sample_text)
                            labels.append(sample_label)
                    sample_text, sample_label = [], []
        if sample_text and sample_label:
            if join:
                dataset.append(' '.join(sample_text))
                labels.append(' '.join(sample_label))
            else:
                dataset.append(sample_text)
                labels.append(sample_label)
        return dataset, labels

    def load_test_datasets(self, dataset_name, dataset_dir, **kwargs):
        if dataset_name.lower() == "peyma":
            ner_file_path = dataset_dir + 'test.txt'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            return self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter='|',
                                      join=kwargs.get('join', False))
        elif dataset_name.lower() == "arman":
            dataset, labels = [], []
            for i in range(1, 4):
                ner_file_path = dataset_dir + f'test_fold{i}.txt'
                if not os.path.exists(ner_file_path):
                    print(ner_file_path)
                dataset_part, labels_part = self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter=' ',
                                                               join=kwargs.get('join', False))
                dataset += dataset_part
                labels += labels_part
            return dataset, labels
        elif dataset_name.lower() == "hooshvare-peyman+arman+wikiann":
            ner_file_path = dataset_dir + 'test.csv'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            data = pd.read_csv(ner_file_path, delimiter="\t")
            sentences, sentences_tags = data['tokens'].values.tolist(), data['ner_tags'].values.tolist()
            sentences = [ast.literal_eval(ss) for ss in sentences]
            sentences_tags = [ast.literal_eval(ss) for ss in sentences_tags]
            print(f'test part:\n #sentences: {len(sentences)}, #sentences_tags: {len(sentences_tags)}')
            return sentences, sentences_tags

    def load_datasets(self, dataset_name, dataset_dir, **kwargs):
        if dataset_name.lower() == "farsiyar":
            dataset, labels = [], []
            for i in range(1, 6):
                ner_file_path = dataset_dir + f'Persian-NER-part{i}.txt'
                if not os.path.exists(ner_file_path):
                    print(ner_file_path)
                dataset_part, labels_part = self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter='\t',
                                                               join=kwargs.get('join', False))
                dataset += dataset_part
                labels += labels_part
            return dataset, labels
        elif dataset_name.lower() == "wikiann":
            ner_file_path = dataset_dir + 'wikiann-fa.bio'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            dataset_all, labels_all = self.load_ner_data(ner_file_path, word_index=0, tag_index=-1, delimiter=' ',
                                                         join=kwargs.get('join', False))
            print(f'all data: #data: {len(dataset_all)}, #labels: {len(labels_all)}')

            try:
                _, data_test, _, label_test = train_test_split(dataset_all, labels_all, test_size=0.1, random_state=1,
                                                               stratify=labels_all)
                print("with stratify")
            except Exception:
                _, data_test, _, label_test = train_test_split(dataset_all, labels_all, test_size=0.1, random_state=1)
                print("without stratify")
            print(f'test part:\n #data: {len(data_test)}, #labels: {len(label_test)}')
            return dataset_all, labels_all, data_test, label_test

    def ner_inference(self, input_text, device, max_length):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong: the model, tokenizer, or label map is missing!')
            return

        pt_batch = self.tokenizer(
            [self.normalizer.normalize(sequence) for sequence in input_text],
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt"
        )

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        pt_batch = pt_batch.to(device)
        # Gradients are not needed for inference.
        with torch.no_grad():
            pt_outputs = self.model(**pt_batch)
        pt_predictions = torch.argmax(pt_outputs.logits, dim=-1)
        pt_predictions = pt_predictions.cpu().detach().numpy().tolist()

        output_predictions = []
        for i, sequence in enumerate(input_text):
            tokens = self.tokenizer.tokenize(self.tokenizer.decode(self.tokenizer.encode(sequence)))
            predictions = [(token, self.id2label[prediction]) for token, prediction in
                           zip(tokens, pt_predictions[i])]
            output_predictions.append(predictions)
        return output_predictions

    def ner_evaluation(self, input_text, input_labels, device, batch_size=4):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong: the model, tokenizer, or label map is missing!')
            return

        max_len = 0
        tokenized_texts, new_labels = [], []
        for sentence, sentence_label in zip(input_text, input_labels):
            if isinstance(sentence, str):
                sentence = sentence.strip().split()
            if len(sentence) != len(sentence_label):
                print('Something went wrong! A sentence and its labels differ in length.')
                return
            tokenized_sentence, new_sentence_label = [], []
            for word, label in zip(sentence, sentence_label):
                # Tokenize the word and count # of subwords the word is broken into
                tokenized_word = self.tokenizer.tokenize(word)
                n_subwords = len(tokenized_word)

                # Add the tokenized word to the final tokenized word list
                tokenized_sentence.extend(tokenized_word)
                # Add the same label to the new list of labels `n_subwords` times
                new_sentence_label.extend([label] * n_subwords)

            max_len = max(max_len, len(tokenized_sentence))
            tokenized_texts.append(tokenized_sentence)
            new_labels.append(new_sentence_label)

        max_len = min(max_len, self.config.max_position_embeddings)
        print("max_len:", max_len)
        input_ids = pad_sequences([self.tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                                  maxlen=max_len, dtype="long", value=self.config.pad_token_id,
                                  truncating="post", padding="post")
        del tokenized_texts
        input_labels = pad_sequences([[self.config.label2id.get(l) for l in lab] for lab in new_labels],
                                     maxlen=max_len, value=self.config.label2id.get('O'), padding="post",
                                     dtype="long", truncating="post")
        del new_labels

        train_data = TensorDataset(torch.tensor(input_ids), torch.tensor(input_labels))
        data_loader = DataLoader(train_data, batch_size=batch_size)
        # data_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)
        print("#samples:", len(input_ids))
        print("#batch:", len(data_loader))

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        total_loss, total_time = 0, 0
        output_predictions = []
        print("Start to evaluate test data ...")
        for step, batch in enumerate(data_loader):
            b_input_ids, b_labels = batch

            # move tensors to GPU if CUDA is available
            b_input_ids = b_input_ids.to(device)
            b_labels = b_labels.to(device)

            # Because `labels` is provided, the output contains the loss in addition to the logits.
            with torch.no_grad():
                start = time.monotonic()
                outputs = self.model(b_input_ids, labels=b_labels)
                end = time.monotonic()
                total_time += end - start
                print(f'inference time for step {step}: {end - start}')
            # get the loss
            total_loss += outputs.loss.item()

            b_predictions = torch.argmax(outputs.logits, dim=2)
            b_predictions = b_predictions.cpu().detach().numpy().tolist()
            b_labels = b_labels.cpu().detach().numpy().tolist()

            for i, sample in enumerate(b_input_ids):
                sample_input = list(sample)
                # remove pad tokens
                while sample_input[-1] == self.config.pad_token_id:
                    sample_input.pop()
                # tokens = self.tokenizer.tokenize(self.tokenizer.decode(sample_input))
                tokens = [self.tokenizer.decode([t]) for t in sample_input]
                sample_true_labels = [self.id2label[e] for e in b_labels[i][:len(sample_input)]]
                sample_predictions = [self.id2label[e] for e in b_predictions[i][:len(sample_input)]]
                output_predictions.append(
                    [(t, sample_true_labels[j], sample_predictions[j]) for j, t in enumerate(tokens)])

        # Calculate the average loss over the evaluation batches.
        avg_loss = total_loss / len(data_loader)
        print("average loss:", avg_loss)
        print("total inference time:", total_time)
        print("total inference time / #samples:", total_time / len(input_ids))

        return output_predictions

    def ner_evaluation_2(self, input_text, input_labels, device, batch_size=4):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong: the model, tokenizer, or label map is missing!')
            return

        print("len(input_text):", len(input_text))
        print("len(input_labels):", len(input_labels))
        c = 0
        max_len = 0
        tokenized_texts, new_labels = [], []
        for sentence, sentence_label in zip(input_text, input_labels):
            if isinstance(sentence, str):
                sentence = sentence.strip().split()
            if len(sentence) != len(sentence_label):
                print('Something went wrong! A sentence and its labels differ in length.')
                return
            tokenized_words = self.tokenizer(sentence, padding=False, add_special_tokens=False).input_ids
            tokenized_sentence_ids, new_sentence_label = [], []
            for i, tokenized_word in enumerate(tokenized_words):
                # Add the tokenized word to the final tokenized word list
                tokenized_sentence_ids += tokenized_word
                # Add the same label to the new list of labels `number of subwords` times
                new_sentence_label.extend([self.config.label2id.get(sentence_label[i])] * len(tokenized_word))

            max_len = max(max_len, len(tokenized_sentence_ids))
            tokenized_texts.append(tokenized_sentence_ids)
            new_labels.append(new_sentence_label)
            c += 1
            if c % 10000 == 0:
                print("c:", c)
        max_len = min(max_len, self.config.max_position_embeddings)
        print("max_len:", max_len)
        input_ids = pad_sequences(tokenized_texts, maxlen=max_len, dtype="long", value=self.config.pad_token_id,
                                  truncating="post", padding="post")
        del tokenized_texts
        input_labels = pad_sequences(new_labels, maxlen=max_len, value=self.config.label2id.get('O'), padding="post",
                                     dtype="long", truncating="post")
        del new_labels

        train_data = TensorDataset(torch.tensor(input_ids), torch.tensor(input_labels))
        data_loader = DataLoader(train_data, batch_size=batch_size)
        # data_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)
        print("#samples:", len(input_ids))
        print("#batch:", len(data_loader))

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        total_time = 0
        output_predictions = []
        print("Start to evaluate test data ...")
        for step, batch in enumerate(data_loader):
            b_input_ids, b_labels = batch

            # move tensors to GPU if CUDA is available
            b_input_ids = b_input_ids.to(device)
            b_labels = b_labels.to(device)

            # Because `labels` is provided, the output contains the loss in addition to the logits.
            with torch.no_grad():
                start = time.monotonic()
                outputs = self.model(b_input_ids, labels=b_labels)
                end = time.monotonic()
                total_time += end - start
                print(f'inference time for step {step}: {end - start}')

            b_predictions = torch.argmax(outputs.logits, dim=2)
            b_predictions = b_predictions.cpu().detach().numpy().tolist()
            b_labels = b_labels.cpu().detach().numpy().tolist()

            for i, sample in enumerate(b_input_ids):
                sample_input = list(sample)
                # remove pad tokens
                while sample_input[-1] == self.config.pad_token_id:
                    sample_input.pop()
                # tokens = self.tokenizer.tokenize(self.tokenizer.decode(sample_input))
                tokens = [self.tokenizer.decode([t]) for t in sample_input]
                sample_true_labels = [self.id2label[e] for e in b_labels[i][:len(sample_input)]]
                sample_predictions = [self.id2label[e] for e in b_predictions[i][:len(sample_input)]]
                output_predictions.append(
                    [(t, sample_true_labels[j], sample_predictions[j]) for j, t in enumerate(tokens)])

        print("total inference time:", total_time)
        print("total inference time / #samples:", total_time / len(input_ids))

        return output_predictions

    def check_input_label_consistency(self, labels):
        model_labels = self.config.label2id.keys()
        dataset_labels = set()
        for l in labels:
            dataset_labels.update(set(l))
        print("model labels:", model_labels)
        print("dataset labels:", dataset_labels)
        print("intersection:", set(model_labels).intersection(dataset_labels))
        print("model_labels-dataset_labels:", list(set(model_labels) - set(dataset_labels)))
        print("dataset_labels-model_labels:", list(set(dataset_labels) - set(model_labels)))
        if list(set(dataset_labels) - set(model_labels)):
            return False
        return True

    @staticmethod
    def resolve_input_label_consistency(labels, label_translation_map):
        for i, sentence_labels in enumerate(labels):
            for j, label in enumerate(sentence_labels):
                labels[i][j] = label_translation_map.get(label)
        return labels

    @staticmethod
    def evaluate_prediction_results(labels, output_predictions):
        dataset_labels = set()
        for label in labels:
            dataset_labels.update(set(label))

        true_labels, predictions = [], []
        for sample_output in output_predictions:
            sample_true_labels = []
            sample_predicted_labels = []
            for token, true_label, predicted_label in sample_output:
                sample_true_labels.append(true_label)
                if predicted_label in dataset_labels:
                    sample_predicted_labels.append(predicted_label)
                else:
                    sample_predicted_labels.append('O')
            true_labels.append(sample_true_labels)
            predictions.append(sample_predicted_labels)

        print("Test Accuracy: {}".format(accuracy_score(true_labels, predictions)))
        print("Test Precision: {}".format(precision_score(true_labels, predictions)))
        print("Test Recall: {}".format(recall_score(true_labels, predictions)))
        print("Test F1-Score: {}".format(f1_score(true_labels, predictions)))
        print("Test classification Report:\n{}".format(classification_report(true_labels, predictions, digits=10)))
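
Both evaluation methods in the class above repeat each word-level label once per subword piece before feeding the model. The sketch below isolates that step, assuming a hypothetical `toy_tokenize` function as a stand-in for the real WordPiece tokenizer (it simply cuts words into fixed-size chunks); `align_labels` is likewise an illustrative name, not a method of the class.

```python
def toy_tokenize(word, max_piece=4):
    """Hypothetical stand-in for a WordPiece tokenizer: fixed-size chunks."""
    if not word:
        return []
    pieces = [word[i:i + max_piece] for i in range(0, len(word), max_piece)]
    # Continuation pieces carry the "##" prefix, as in BERT vocabularies.
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def align_labels(words, labels, tokenize=toy_tokenize):
    """Repeat each word-level label once for every subword of that word."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    tokens, token_labels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        token_labels.extend([label] * len(pieces))
    return tokens, token_labels


tokens, token_labels = align_labels(["Tehran", "is", "big"], ["B-loc", "O", "O"])
print(tokens)        # ['Tehr', '##an', 'is', 'big']
print(token_labels)  # ['B-loc', 'B-loc', 'O', 'O']
```

Note that with this scheme a `B-` tag is repeated on continuation pieces, so the reported scores are computed over subword tokens rather than over the original words.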


In [6]:
model_name='HooshvareLab/bert-fa-base-uncased-ner-arman'
ner_model = NER(model_name)


In [7]:
print(ner_model.config)

BertConfig {
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "B-event",
    "1": "B-fac",
    "2": "B-loc",
    "3": "B-org",
    "4": "B-pers",
    "5": "B-pro",
    "6": "I-event",
    "7": "I-fac",
    "8": "I-loc",
    "9": "I-org",
    "10": "I-pers",
    "11": "I-pro",
    "12": "O"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-event": 0,
    "B-fac": 1,
    "B-loc": 2,
    "B-org": 3,
    "B-pers": 4,
    "B-pro": 5,
    "I-event": 6,
    "I-fac": 7,
    "I-loc": 8,
    "I-org": 9,
    "I-pers": 10,
    "I-pro": 11,
    "O": 12
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version":

#### Sample Inference:

In [8]:
texts = [
    "مدیرکل محیط زیست استان البرز با بیان اینکه با بیان اینکه موضوع شیرابه‌های زباله‌های انتقال یافته در منطقه حلقه دره خطری برای این استان است، گفت: در این مورد گزارشاتی در ۲۵ مرداد ۱۳۹۷ تقدیم مدیران استان شده است.",
    "به گزارش خبرگزاری تسنیم از کرج، حسین محمدی در نشست خبری مشترک با معاون خدمات شهری شهرداری کرج که با حضور مدیرعامل سازمان‌های پسماند، پارک‌ها و فضای سبز و نماینده منابع طبیعی در سالن کنفرانس شهرداری کرج برگزار شد، اظهار داشت: ۸۰٪  جمعیت استان البرز در کلانشهر کرج زندگی می‌کنند.",
    "وی افزود: با همکاری‌های مشترک بین اداره کل محیط زیست و شهرداری کرج برنامه‌های مشترکی برای حفاظت از محیط زیست در شهر کرج در دستور کار قرار گرفته که این اقدامات آثار مثبتی داشته و تاکنون نزدیک به ۱۰۰ میلیارد هزینه جهت خریداری اکس-ریس صورت گرفته است.",
]

In [9]:
inference_output = ner_model.ner_inference(texts, device, ner_model.config.max_position_embeddings)

In [10]:
print(inference_output)

[[('[CLS]', 'O'), ('مدیرکل', 'O'), ('محیط', 'B-org'), ('زیست', 'I-org'), ('استان', 'I-org'), ('البرز', 'I-org'), ('با', 'O'), ('بیان', 'O'), ('اینکه', 'O'), ('با', 'O'), ('بیان', 'O'), ('اینکه', 'O'), ('موضوع', 'O'), ('شیرابه', 'O'), ('##های', 'O'), ('زبالههای', 'O'), ('انتقال', 'O'), ('یافته', 'O'), ('در', 'O'), ('منطقه', 'O'), ('حلقه', 'I-loc'), ('دره', 'O'), ('خطری', 'O'), ('برای', 'O'), ('این', 'O'), ('استان', 'O'), ('است', 'O'), ('،', 'O'), ('گفت', 'O'), (':', 'O'), ('در', 'O'), ('این', 'O'), ('مورد', 'O'), ('گزارشاتی', 'O'), ('در', 'O'), ('۲۵', 'O'), ('مرداد', 'O'), ('۱۳۹۷', 'O'), ('تقدیم', 'O'), ('مدیران', 'O'), ('استان', 'O'), ('شده', 'O'), ('است', 'O'), ('.', 'O'), ('[SEP]', 'O')], [('[CLS]', 'O'), ('به', 'O'), ('گزارش', 'O'), ('خبرگزاری', 'B-org'), ('تسنیم', 'I-org'), ('از', 'O'), ('کرج', 'B-loc'), ('،', 'O'), ('حسین', 'B-pers'), ('محمدی', 'I-pers'), ('در', 'O'), ('نشست', 'O'), ('خبری', 'O'), ('مشترک', 'O'), ('با', 'O'), ('معاون', 'O'), ('خدمات', 'O'), ('شهری', 'O'), ('شهردار
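
The output above is at the subword level: it contains the special tokens `[CLS]` and `[SEP]` as well as continuation pieces such as `##های`. A minimal sketch of one way to collapse it back to whole words follows. The function `merge_wordpieces` is hypothetical, and keeping the label of the first piece is an assumption made here for simplicity, not something the notebook prescribes.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(pairs):
    """Merge '##' continuation pieces into the preceding word.

    `pairs` is a list of (token, label) tuples for one sentence.
    The merged word keeps the label predicted for its first piece.
    """
    merged = []
    for token, label in pairs:
        if token in SPECIAL_TOKENS:
            continue  # drop [CLS], [SEP], and padding
        if token.startswith("##") and merged:
            word, first_label = merged[-1]
            merged[-1] = (word + token[2:], first_label)
        else:
            merged.append((token, label))
    return merged


sample = [("[CLS]", "O"), ("Al", "B-loc"), ("##borz", "I-loc"),
          ("province", "O"), ("[SEP]", "O")]
print(merge_wordpieces(sample))  # [('Alborz', 'B-loc'), ('province', 'O')]
```

It could be applied to each sentence as `[merge_wordpieces(s) for s in inference_output]`.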

In [11]:
#@title Live Playground { display-mode: "form" }

css_is_load = False
css = """<style>
.ner-box {
    direction: rtl;
    font-size: 18px !important;
    line-height: 20px !important;
    margin: 0 0 15px;
    padding: 10px;
    text-align: justify;
    color: #343434 !important;
}
.token, .token span {
    display: inline-block !important;
    padding: 2px;
    margin: 2px 0;
}
.token.token-ner {
    background-color: #f6cd61;
    font-weight: bold;
    color: #000;
}
.token.token-ner .ner-label {
    color: #9a1f40;
    margin: 0px 2px;
}
</style>"""

if not css_is_load:
    display(HTML(css))
    css_is_load = True

submit_wd = widgets.Button(description='Send', disabled=False, button_style='success', tooltip='Submit')
text_wd = widgets.Textarea(placeholder='Please enter your text ...', rows=5, layout=Layout(width='90%'))
output_wd = widgets.Output()

display(HTML("""
<h2>Test NER model</h2>
<p style="padding: 2px 20px; margin: 0 0 20px;">
</p>
<br /><br />
"""))

display(text_wd)
display(submit_wd)
display(output_wd)

def submit_text(sender):
    with output_wd:
        clear_output(wait=True)
        text = text_wd.value
        _output = ner_model.ner_inference([text], device, ner_model.config.max_position_embeddings)
        # print(_output)
        pred_sequence = []
        for token, label in _output[0]:
            if token not in ['[CLS]', '[SEP]']:
                if label != 'O':
                    pred_sequence.append(
                        '<span class="token token-ner">%s<span class="ner-label">%s</span></span>' 
                        % (token, label))
                else:
                    pred_sequence.append(
                        '<span class="token">%s</span>' 
                        % token)
            
        html = '<p class="ner-box">%s</p>' % ' '.join(pred_sequence) 
        display(HTML(html))

submit_wd.on_click(submit_text)

Textarea(value='', layout=Layout(width='90%'), placeholder='Please enter your text ...', rows=5)

Button(button_style='success', description='Send', style=ButtonStyle(), tooltip='Submit')

Output()

#### PEYMA dataset:
The PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens, of which 41,148 tokens are tagged with one of seven classes:

- Organization
- Money
- Location
- Date
- Time
- Person
- Percent

|     Label    |   #   |
|:------------:|:-----:|
| Organization | 16964 |
|     Money    |  2037 |
|   Location   |  8782 |
|     Date     |  4259 |
|     Time     |  732  |
|    Person    |  7675 |
|    Percent   |  699  |

Download
You can download the dataset from [here](https://hooshvare.github.io/docs/datasets/ner), which leads to the following Google Drive file from HooshvareLab:

In [12]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1WZxpFRtEs5HZWyWQ2Pyg9CCuIBs1Kmvx'})
download.GetContentFile('peyma.zip')
!ls

adc.json  peyma.zip  sample_data


In [13]:
!unzip peyma.zip
!ls
!ls peyma

Archive:  peyma.zip
   creating: peyma/
  inflating: peyma/dev.txt           
  inflating: peyma/test.txt          
  inflating: peyma/train.txt         
adc.json  peyma  peyma.zip  sample_data
dev.txt  test.txt  train.txt


In [14]:
sentences, labels = ner_model.load_test_datasets(dataset_name="peyma", dataset_dir="./peyma/")
print(len(sentences), len(labels))
print(sentences[0])
print(labels[0])

1026 1026
['کنایه', 'سرلشگر', 'فیروزآبادی', 'به', 'پادشاه', 'عربستان', 'و', 'پسرش']
['O', 'O', 'B_ORG', 'O', 'O', 'B_LOC', 'O', 'O']


In [15]:
all_labels = [_ for sl in labels for _ in sl]
label_count = {l: all_labels.count(l) for l in set(all_labels)}
for l, c in label_count.items():
  print(l, c)

B_PCT 36
B_TIM 16
B_LOC 595
I_PER 297
I_ORG 1104
B_PER 434
I_PCT 40
O 28215
I_LOC 211
B_ORG 667
B_MON 26
I_MON 65
I_DAT 236
I_TIM 24
B_DAT 208


In [16]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B_PCT', 'B_LOC', 'I_PER', 'I_ORG', 'B_MON', 'I_TIM', 'B_DAT', 'B_TIM', 'B_PER', 'I_PCT', 'O', 'B_ORG', 'I_LOC', 'I_MON', 'I_DAT'}
intersection: {'O'}
model_labels-dataset_labels: ['B-fac', 'B-loc', 'B-org', 'I-fac', 'B-event', 'I-event', 'B-pers', 'B-pro', 'I-loc', 'I-pro', 'I-org', 'I-pers']
dataset_labels-model_labels: ['B_PCT', 'B_TIM', 'B_LOC', 'I_PER', 'I_ORG', 'B_PER', 'I_PCT', 'B_ORG', 'B_MON', 'I_LOC', 'I_MON', 'I_DAT', 'I_TIM', 'B_DAT']
False


In [17]:
label_translate = {
    'B_ORG': 'B-org', 
    'I_ORG': 'I-org',
    'B_LOC': 'B-loc',
    'I_LOC': 'I-loc',
    'B_PER': 'B-pers', 
    'I_PER': 'I-pers',
    'O': 'O',
    # this model does not support the following entity types, so they are mapped to 'O'
    'B_DAT': 'O', 
    'I_DAT': 'O', 
    'B_PCT': 'O', 
    'I_PCT': 'O', 
    'B_TIM': 'O', 
    'I_TIM': 'O', 
    'B_MON': 'O', 
    'I_MON': 'O'
}
labels = ner_model.resolve_input_label_consistency(labels, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B-loc', 'B-org', 'O', 'B-pers', 'I-loc', 'I-org', 'I-pers'}
intersection: {'B-loc', 'B-org', 'O', 'B-pers', 'I-loc', 'I-org', 'I-pers'}
model_labels-dataset_labels: ['B-fac', 'I-fac', 'B-event', 'I-event', 'B-pro', 'I-pro']
dataset_labels-model_labels: []
True
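
`resolve_input_label_consistency` looks labels up with `dict.get`, which yields `None` for any label missing from the map and rewrites the input lists in place. The following sketch shows a stricter variant under the assumption that failing loudly on an unmapped label is preferable; `translate_labels` and its `default` parameter are hypothetical names introduced only for this example.

```python
def translate_labels(labels, mapping, default=None):
    """Return a translated copy of `labels` (a list of per-sentence label lists).

    Labels missing from `mapping` are replaced by `default` when one is given;
    otherwise a KeyError is raised instead of silently inserting None.
    """
    translated = []
    for sentence_labels in labels:
        row = []
        for label in sentence_labels:
            if label in mapping:
                row.append(mapping[label])
            elif default is not None:
                row.append(default)
            else:
                raise KeyError(f"no translation for label {label!r}")
        translated.append(row)
    return translated


mapping = {"B_ORG": "B-org", "B_LOC": "B-loc", "O": "O"}
print(translate_labels([["O", "B_ORG", "B_DAT"]], mapping, default="O"))
# [['O', 'B-org', 'O']]
```

Because the function builds new lists, the original dataset labels stay available for a later comparison.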


In [18]:
!nvidia-smi
!lscpu

Mon Aug 16 12:02:44 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   74C    P0    74W / 149W |   1139MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [19]:
inference_output_peyma = ner_model.ner_evaluation(sentences, labels, device, batch_size=512)

max_len: 151
#samples: 1026
#batch: 3
Start to evaluate test data ...
inference time for step 0: 0.03475158399999145
inference time for step 1: 0.012477101999991191
inference time for step 2: 0.013099202000034893
average loss: 0.07054187109073003
total inference time: 0.06032788800001754
total inference time / #samples: 5.87991111111282e-05


In [20]:
for sample_output in inference_output_peyma[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

کنایه	O	O
سرلشگر	O	O
فیروزابادی	B-org	O
به	O	O
پادشاه	O	O
عربستان	B-loc	B-loc
و	O	O
پسرش	O	O

ريیس	O	O
سابق	O	O
ستاد	B-org	B-org
کل	I-org	O
نیروهای	I-org	B-org
مسلح	I-org	I-org
با	O	O
بیان	O	O
اینکه	O	O
ال	O	B-pers
سعود	O	I-pers
با	O	O
حمایت	O	O
همه	O	O
جانبه	O	O
غرب	O	O
بر	O	O
سرزمین	B-loc	O
حجاز	I-loc	B-loc
حاکم	O	O
شد	O	O
گفت	O	O
:	O	O
غرب	O	O
با	O	O
حاکم	O	O
کرد	O	O
##د	O	O
ال	O	B-pers
سعود	O	I-pers
بر	O	O
حجاز	B-loc	B-loc
هدفی	O	O
جز	O	O
##ناب	O	O
##ودی	O	O
اسلام	O	O
نداشته	O	O
و	O	O
این	O	O
نقشه	O	O
انگلیس	B-loc	B-loc
بود	O	O
.	O	O

سرلشگر	O	O
حسن	B-pers	B-pers
فیروزابادی	I-pers	I-pers
روز	O	O
دوشنبه	O	O
درحاشیه	O	O
ايین	O	B-event
ختم	O	I-event
مادر	O	I-event
حیدر	B-pers	B-pers
مصلحی	I-pers	I-pers
درجمع	O	O
خبرنگاران	O	O
درباره	O	O
موضوع	O	O
یمن	B-loc	B-org
افزود	O	O
:	O	O
ماهیت	O	O
انچه	O	O
در	O	O
یمن	B-loc	O
اتفاق	O	O
می	O	O
افتد	O	O
وهابیت	O	O
است	O	O
وهابیت	O	O
یک	O	O
مذهب	O	O
انگلیسی	O	B-org
است	O	O
.	O	O

وی	O	O
ادامه	O	O
داد	O	O
:	O	O
وقتی	O	O
که	O	O
انقلاب	O	O
اسلامی	O	O


In [21]:
ner_model.evaluate_prediction_results(labels, inference_output_peyma)

Test Accuracy: 0.9513726845850258
Test Precision: 0.6438824333561176
Test Recall: 0.5175824175824176
Test F1-Score: 0.5738653670423394
Test classification Report:
              precision    recall  f1-score   support

         loc  0.8644501279 0.5568369028 0.6773547094       607
         org  0.5365853659 0.5530726257 0.5447042641       716
        pers  0.6227544910 0.4185110664 0.5006016847       497

   micro avg  0.6438824334 0.5175824176 0.5738653670      1820
   macro avg  0.6745966616 0.5094735316 0.5742202194      1820
weighted avg  0.6694644679 0.5175824176 0.5769019775      1820
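
Accuracy (about 0.95) is much higher than F1 (about 0.57) because most tokens are tagged `O`. The report lists entity types rather than individual tags, which suggests the scores are computed per entity span. That is an assumption about `evaluate_prediction_results`. The sketch below shows how the two metrics can diverge, using hypothetical helpers `extract_entities` and `score`. It is not the notebook's implementation and ignores stray `I-` tags that do not continue an entity.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one IOB-tagged sentence."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel closes a trailing entity
        begins = tag.startswith('B-')
        continues = tag.startswith('I-') and etype == tag[2:]
        if etype is not None and not continues:
            entities.add((etype, start, i))  # end index is exclusive
            start, etype = None, None
        if begins:
            start, etype = i, tag[2:]
    return entities


def score(true_lists, pred_lists):
    """Return (token accuracy, entity precision, entity recall, entity F1)."""
    tp = n_true = n_pred = correct = n_tokens = 0
    for true_tags, pred_tags in zip(true_lists, pred_lists):
        t, p = extract_entities(true_tags), extract_entities(pred_tags)
        tp += len(t & p)  # a span counts only if type and boundaries both match
        n_true += len(t)
        n_pred += len(p)
        correct += sum(a == b for a, b in zip(true_tags, pred_tags))
        n_tokens += len(true_tags)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return correct / n_tokens, precision, recall, f1


# One organisation split into two predicted spans, plus a sentence of 'O' tokens.
true_tags = [['B-org', 'I-org', 'I-org', 'I-org', 'O', 'O'], ['O'] * 14]
pred_tags = [['B-org', 'O', 'B-org', 'I-org', 'O', 'O'], ['O'] * 14]
accuracy, precision, recall, f1 = score(true_tags, pred_tags)
print(accuracy, precision, recall, f1)
```

In this toy case 18 of 20 tokens are correct, yet no predicted span matches the reference span exactly, so the entity-level scores are all zero.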



In [22]:
output_file_name = "ner_peyma_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_peyma:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()
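
The output file written above holds one `token<TAB>true_label<TAB>predicted_label` line per token, with a blank line between sentences. The sketch below shows one way to read such a file back for later analysis. The function name `read_token_tsv` is hypothetical, and an in-memory string stands in for the real file.

```python
import io


def read_token_tsv(handle):
    """Parse tab-separated token lines into a list of sentences."""
    sentences, current = [], []
    for line in handle:
        line = line.rstrip('\n')
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token, true_label, predicted_label = line.split('\t')
        current.append((token, true_label, predicted_label))
    if current:  # file may not end with a blank line
        sentences.append(current)
    return sentences


sample = io.StringIO("a\tO\tO\nb\tB-loc\tO\n\nc\tO\tO\n\n")
parsed = read_token_tsv(sample)
print(len(parsed), parsed[0][1])
```

For a real file, `open(output_file_name, encoding='utf8')` can be passed in place of the `StringIO` object.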

#### Arman dataset:
The ARMAN dataset holds 7,682 sentences with 250,015 tokens tagged over six different classes:

1. Organization
2. Location
3. Facility
4. Event
5. Product
6. Person


|     Label    |   #   |
|:------------:|:-----:|
| Organization | 30108 |
|   Location   | 12924 |
|   Facility   |  4458 |
|     Event    |  7557 |
|    Product   |  4389 |
|    Person    | 15645 |

**Download**

You can download the dataset from the [PersianNER repository](https://github.com/HaniehP/PersianNER).


In [23]:
!wget https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
!ls

--2021-08-16 12:03:22--  https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
Resolving github.com (github.com)... 52.192.72.89
Connecting to github.com (github.com)|52.192.72.89|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/HaniehP/PersianNER/master/ArmanPersoNERCorpus.zip [following]
--2021-08-16 12:03:23--  https://raw.githubusercontent.com/HaniehP/PersianNER/master/ArmanPersoNERCorpus.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1931170 (1.8M) [application/zip]
Saving to: ‘ArmanPersoNERCorpus.zip’


2021-08-16 12:03:24 (17.5 MB/s) - ‘ArmanPersoNERCorpus.zip’ saved [1931170/1931170]

adc.json							   peyma
ArmanPersoNERCorpus.zip						   peyma.zip
ner_peyma_Hoos

In [24]:
!unzip ArmanPersoNERCorpus.zip -d arman
!ls

Archive:  ArmanPersoNERCorpus.zip
  inflating: arman/test_fold1.txt    
  inflating: arman/ReadMe.txt        
  inflating: arman/train_fold3.txt   
  inflating: arman/train_fold2.txt   
  inflating: arman/train_fold1.txt   
  inflating: arman/test_fold3.txt    
  inflating: arman/test_fold2.txt    
adc.json							   peyma
arman								   peyma.zip
ArmanPersoNERCorpus.zip						   sample_data
ner_peyma_HooshvareLab-bert-fa-base-uncased-ner-arman_outputs.txt


In [25]:
sentences, labels = ner_model.load_test_datasets(dataset_name="arman", dataset_dir="./arman/")
print(len(sentences), len(labels))
print(sentences[0])
print(labels[0])

7681 7681
['افقی', ':', '0', 'ـ', 'از', 'عوامل', 'دوران', 'پهلوی', 'و', 'نخست\u200cوزیر', 'ایران', 'در', 'سالهای', 'ابتدائی', 'دهه', 'چهل', 'خورشیدی', 'كه', 'جلد', 'سوم', 'یادداشتهایش', 'هم', 'چندی', 'پیش', 'در', 'تهران', 'منتشر', 'شد', '0', 'ـ', 'پرستاری', 'از', 'ناخوش\u200cاحوال', 'ـ', 'پوشاک', 'و', 'جامه', 'ـ', 'فانتزی', 'و', 'شیک', '0', 'ـ', 'در', 'حال', 'وزیدن', 'ـ', 'اطلاعیه', 'ـ', 'پایتخت', 'جمهوری', 'استونی', 'در', 'حوضه', 'بالتیک', '0', 'ـ', 'علم', 'راهبرد', 'مؤسسه', 'و', 'سازمان', 'ـ', 'نوعی', 'شمع', '0', 'ـ', 'حرف', 'جمع', 'مؤنث', 'ـ', 'در', 'ایران', 'به', 'تولیدکننده', 'کتاب', 'اطلاق', 'می\u200cشود', 'ـ', 'از', 'شهرهای', 'باختری', 'افغانستان', 'كه', 'تا', 'عصر', 'ناصرالدین\u200cشاه', 'جزئی', 'از', 'خراسان', 'بود', 'ـ', 'ویتامین', 'انعقاد', '0', 'ـ', 'سبزی', 'غده\u200cای', 'ـ', 'دوستی', 'و', 'محبت', 'ـ', 'داستان', 'بلند', 'ـ', 'شهری', 'در', 'آلمان', '0', 'ـ', 'سلول', 'بدن', 'موجودات', 'ـ', 'از', 'انواع', 'کالباس', '0', 'ـ', 'حاشیه', 'و', 'هامش', 'ـ', 'پیدا', 'نشدنی', 'ـ', 'خ

In [26]:
all_labels = [_ for sl in labels for _ in sl]
label_count = {l: all_labels.count(l) for l in set(all_labels)}
for l, c in label_count.items():
  print(l, c)

B-fac 550
B-loc 3408
B-org 4533
I-fac 936
B-event 580
I-event 1939
O 224969
B-pers 3275
B-pro 724
I-loc 900
I-pro 739
I-org 5503
I-pers 1940
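
The counting cell calls `list.count` once per distinct tag, which rescans the whole tag list each time. As an alternative sketch (not the notebook's code), `collections.Counter` gives the same counts in a single pass. The helper name `count_labels` is hypothetical.

```python
from collections import Counter


def count_labels(label_lists):
    """Count every tag across all sentences in one pass."""
    return Counter(tag for sentence in label_lists for tag in sentence)


# Toy data (assumed for illustration).
toy_labels = [['O', 'B-loc', 'I-loc'], ['O', 'O', 'B-pers']]
counts = count_labels(toy_labels)
for tag, n in counts.most_common():
    print(tag, n)
```

`most_common()` lists the tags from most to least frequent, which makes the dominance of `O` easy to see.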


In [27]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B-fac', 'B-event', 'I-pro', 'I-loc', 'I-org', 'I-pers', 'B-loc', 'B-org', 'I-fac', 'I-event', 'O', 'B-pers', 'B-pro'}
intersection: {'B-fac', 'B-loc', 'B-org', 'I-fac', 'B-event', 'I-event', 'O', 'I-pro', 'B-pers', 'I-loc', 'B-pro', 'I-org', 'I-pers'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True


batch size = 256 -> inference time for one batch is about 205 s

batch size = 512 -> inference time for one batch is about 410 s

batch size = 1024 -> crash

In [28]:
!nvidia-smi
!lscpu

Mon Aug 16 12:03:25 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   71C    P0    75W / 149W |   5371MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [29]:
inference_output_arman = ner_model.ner_evaluation(sentences, labels, device, batch_size=512)

max_len: 250
#samples: 7681
#batch: 16
Start to evaluate test data ...
inference time for step 0: 0.032480051999982607
inference time for step 1: 0.012913157000014053
inference time for step 2: 0.0124648330000241
inference time for step 3: 0.012672528000052807
inference time for step 4: 0.012154097999996338
inference time for step 5: 0.012007724999989478
inference time for step 6: 0.012704886000051374
inference time for step 7: 0.012395798000056857
inference time for step 8: 0.013242373000025509
inference time for step 9: 0.012160130999973262
inference time for step 10: 0.012828786999989461
inference time for step 11: 0.011924707999924067
inference time for step 12: 0.015028388000018822
inference time for step 13: 0.012573750000001382
inference time for step 14: 0.012307706999990842
inference time for step 15: 0.011902733000056287
average loss: 0.029666764079593122
total inference time: 0.22176165400014725
total inference time / #samples: 2.887145606042797e-05
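
The batch count and the per-sample time reported above follow from simple arithmetic. The sketch below reproduces them. The helper `batch_stats` is hypothetical and takes the sample count, batch size, and total time from the log.

```python
import math


def batch_stats(n_samples, batch_size, total_seconds):
    """Return (number of batches, seconds per sample)."""
    # The last batch may be partial, so round up.
    n_batches = math.ceil(n_samples / batch_size)
    return n_batches, total_seconds / n_samples


n_batches, per_sample = batch_stats(7681, 512, 0.22176165400014725)
print(n_batches, per_sample)
```

With 7681 samples and a batch size of 512 this gives 16 batches, the last of which holds a single sample.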


In [30]:
for sample_output in inference_output_arman[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

افقی	O	O
:	O	O
[UNK]	O	O
[UNK]	O	O
از	O	O
عوامل	O	O
دوران	O	O
پهلوی	O	O
و	O	O
نخستوزیر	O	O
ایران	B-loc	B-loc
در	O	O
سالهای	O	O
ابتدايی	O	O
دهه	O	O
چهل	O	O
خورشیدی	O	O
[UNK]	O	O
جلد	O	O
سوم	O	O
یادداشتهایش	O	O
هم	O	O
چندی	O	O
پیش	O	O
در	O	O
تهران	B-loc	B-loc
منتشر	O	O
شد	O	O
[UNK]	O	O
[UNK]	O	O
پرستاری	O	O
از	O	O
ناخوش	O	O
##احوال	O	O
[UNK]	O	O
پوشاک	O	O
و	O	O
جامه	O	O
[UNK]	O	O
فانتزی	O	O
و	O	O
شیک	O	O
[UNK]	O	O
[UNK]	O	O
در	O	O
حال	O	O
وزیدن	O	O
[UNK]	O	O
اطلاعیه	O	O
[UNK]	O	O
پایتخت	O	O
جمهوری	O	O
استونی	B-loc	B-loc
در	I-loc	I-loc
حوضه	I-loc	I-loc
بالتیک	I-loc	I-loc
[UNK]	O	O
[UNK]	O	O
علم	O	O
راهبرد	O	O
موسسه	O	O
و	O	O
سازمان	O	O
[UNK]	O	O
نوعی	O	O
شمع	O	O
[UNK]	O	O
[UNK]	O	O
حرف	O	O
جمع	O	O
مونث	O	O
[UNK]	O	O
در	O	O
ایران	B-loc	B-loc
به	O	O
تولیدکننده	O	O
کتاب	O	O
اطلاق	O	O
میشود	O	O
[UNK]	O	O
از	O	O
شهرهای	O	O
باختری	O	O
افغانستان	B-loc	B-loc
[UNK]	O	O
تا	O	O
عصر	O	O
ناصرالدینشاه	B-pers	B-pers
جزيی	O	O
از	O	O
خراسان	B-loc	B-loc
بود	O	O
[UNK]	O	O
ویتامین	O	O
انعقاد	O	O
[UNK]	O	O
[U

In [31]:
ner_model.evaluate_prediction_results(labels, inference_output_arman)

Test Accuracy: 0.9797272449865814
Test Precision: 0.8433774293646905
Test Recall: 0.7440386139327138
Test F1-Score: 0.7905997626975925
Test classification Report:
              precision    recall  f1-score   support

       event  0.7415185784 0.7806122449 0.7605633803       588
         fac  0.8392226148 0.8377425044 0.8384819064       567
         loc  0.8665018541 0.7952353942 0.8293404318      3526
         org  0.8637798827 0.8078464459 0.8348773842      4741
        pers  0.8203377650 0.6242821985 0.7090062112      3657
         pro  0.7845394737 0.5947630923 0.6765957447       802

   micro avg  0.8433774294 0.7440386139 0.7905997627     13881
   macro avg  0.8193166948 0.7400803134 0.7748108431     13881
weighted avg  0.8422659731 0.7440386139 0.7881639688     13881



In [32]:
output_file_name = "ner_arman_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_arman:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### Arman+Peyma

In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1WZxpFRtEs5HZWyWQ2Pyg9CCuIBs1Kmvx'})
download.GetContentFile('peyma.zip')
!ls

In [None]:
!unzip peyma.zip
!ls
!ls peyma

In [33]:
sentences_peyma, labels_peyma = ner_model.load_test_datasets(dataset_name="peyma", dataset_dir="./peyma/")
print(len(sentences_peyma), len(labels_peyma))
print(sentences_peyma[0])
print(labels_peyma[0])

1026 1026
['کنایه', 'سرلشگر', 'فیروزآبادی', 'به', 'پادشاه', 'عربستان', 'و', 'پسرش']
['O', 'O', 'B_ORG', 'O', 'O', 'B_LOC', 'O', 'O']


In [34]:
is_consistent = ner_model.check_input_label_consistency(labels_peyma)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B_PCT', 'B_LOC', 'I_PER', 'I_ORG', 'B_MON', 'I_TIM', 'B_DAT', 'B_TIM', 'B_PER', 'I_PCT', 'O', 'B_ORG', 'I_LOC', 'I_MON', 'I_DAT'}
intersection: {'O'}
model_labels-dataset_labels: ['B-fac', 'B-loc', 'B-org', 'I-fac', 'B-event', 'I-event', 'B-pers', 'B-pro', 'I-loc', 'I-pro', 'I-org', 'I-pers']
dataset_labels-model_labels: ['B_PCT', 'B_TIM', 'B_LOC', 'I_PER', 'I_ORG', 'B_PER', 'I_PCT', 'B_ORG', 'B_MON', 'I_LOC', 'I_MON', 'I_DAT', 'I_TIM', 'B_DAT']
False


In [35]:
label_translate = {
    'B_ORG': 'B-org', 
    'I_ORG': 'I-org',
    'B_LOC': 'B-loc',
    'I_LOC': 'I-loc',
    'B_PER': 'B-pers', 
    'I_PER': 'I-pers',
    'O': 'O',
    # the model does not support the following entity types, so map them to 'O'
    'B_DAT': 'O', 
    'I_DAT': 'O', 
    'B_PCT': 'O', 
    'I_PCT': 'O', 
    'B_TIM': 'O', 
    'I_TIM': 'O', 
    'B_MON': 'O', 
    'I_MON': 'O'
}
labels_peyma = ner_model.resolve_input_label_consistency(labels_peyma, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_peyma)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B-loc', 'B-org', 'O', 'B-pers', 'I-loc', 'I-org', 'I-pers'}
intersection: {'B-loc', 'B-org', 'O', 'B-pers', 'I-loc', 'I-org', 'I-pers'}
model_labels-dataset_labels: ['B-fac', 'I-fac', 'B-event', 'I-event', 'B-pro', 'I-pro']
dataset_labels-model_labels: []
True


In [None]:
!wget https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
!ls

In [None]:
!unzip ArmanPersoNERCorpus.zip -d arman
!ls

In [36]:
sentences_arman, labels_arman = ner_model.load_test_datasets(dataset_name="arman", dataset_dir="./arman/")
print(len(sentences_arman), len(labels_arman))
print(sentences_arman[0])
print(labels_arman[0])

7681 7681
['افقی', ':', '0', 'ـ', 'از', 'عوامل', 'دوران', 'پهلوی', 'و', 'نخست\u200cوزیر', 'ایران', 'در', 'سالهای', 'ابتدائی', 'دهه', 'چهل', 'خورشیدی', 'كه', 'جلد', 'سوم', 'یادداشتهایش', 'هم', 'چندی', 'پیش', 'در', 'تهران', 'منتشر', 'شد', '0', 'ـ', 'پرستاری', 'از', 'ناخوش\u200cاحوال', 'ـ', 'پوشاک', 'و', 'جامه', 'ـ', 'فانتزی', 'و', 'شیک', '0', 'ـ', 'در', 'حال', 'وزیدن', 'ـ', 'اطلاعیه', 'ـ', 'پایتخت', 'جمهوری', 'استونی', 'در', 'حوضه', 'بالتیک', '0', 'ـ', 'علم', 'راهبرد', 'مؤسسه', 'و', 'سازمان', 'ـ', 'نوعی', 'شمع', '0', 'ـ', 'حرف', 'جمع', 'مؤنث', 'ـ', 'در', 'ایران', 'به', 'تولیدکننده', 'کتاب', 'اطلاق', 'می\u200cشود', 'ـ', 'از', 'شهرهای', 'باختری', 'افغانستان', 'كه', 'تا', 'عصر', 'ناصرالدین\u200cشاه', 'جزئی', 'از', 'خراسان', 'بود', 'ـ', 'ویتامین', 'انعقاد', '0', 'ـ', 'سبزی', 'غده\u200cای', 'ـ', 'دوستی', 'و', 'محبت', 'ـ', 'داستان', 'بلند', 'ـ', 'شهری', 'در', 'آلمان', '0', 'ـ', 'سلول', 'بدن', 'موجودات', 'ـ', 'از', 'انواع', 'کالباس', '0', 'ـ', 'حاشیه', 'و', 'هامش', 'ـ', 'پیدا', 'نشدنی', 'ـ', 'خ

In [37]:
is_consistent = ner_model.check_input_label_consistency(labels_arman)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B-fac', 'B-event', 'I-pro', 'I-loc', 'I-org', 'I-pers', 'B-loc', 'B-org', 'I-fac', 'I-event', 'O', 'B-pers', 'B-pro'}
intersection: {'B-fac', 'B-loc', 'B-org', 'I-fac', 'B-event', 'I-event', 'O', 'I-pro', 'B-pers', 'I-loc', 'B-pro', 'I-org', 'I-pers'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True


In [38]:
sentences = sentences_arman + sentences_peyma
labels = labels_arman + labels_peyma
print(len(sentences), len(labels))

8707 8707


In [39]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B-fac', 'B-event', 'I-pro', 'I-loc', 'I-org', 'I-pers', 'B-loc', 'B-org', 'I-fac', 'I-event', 'O', 'B-pers', 'B-pro'}
intersection: {'B-fac', 'B-loc', 'B-org', 'I-fac', 'B-event', 'I-event', 'O', 'I-pro', 'B-pers', 'I-loc', 'B-pro', 'I-org', 'I-pers'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True


In [40]:
!nvidia-smi
!lscpu

Mon Aug 16 12:09:39 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   71C    P0    75W / 149W |   9307MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [41]:
inference_output = ner_model.ner_evaluation_2(sentences, labels, device, batch_size=512)

len(input_text): 8707
len(input_labels): 8707
max_len: 250
#samples: 8707
#batch: 18
Start to evaluate test data ...
inference time for step 0: 0.02873741499990956
inference time for step 1: 0.012553055000012137
inference time for step 2: 0.01980998199996975
inference time for step 3: 0.012507681999977649
inference time for step 4: 0.012001875000009932
inference time for step 5: 0.012841079000054378
inference time for step 6: 0.012138939000010396
inference time for step 7: 0.012984970000047724
inference time for step 8: 0.01218204900010278
inference time for step 9: 0.012023419999991347
inference time for step 10: 0.012724564999984977
inference time for step 11: 0.01372300900004575
inference time for step 12: 0.012074024999947142
inference time for step 13: 0.012537339000118664
inference time for step 14: 0.012428005999936431
inference time for step 15: 0.012786659999846961
inference time for step 16: 0.013383745000055569
inference time for step 17: 0.013382437000018399
total inference

In [42]:
for sample_output in inference_output[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

افقی	O	O
:	O	O
[UNK]	O	O
[UNK]	O	O
از	O	O
عوامل	O	O
دوران	O	O
پهلوی	O	O
و	O	O
نخستوزیر	O	O
ایران	B-loc	B-loc
در	O	O
سالهای	O	O
ابتدايی	O	O
دهه	O	O
چهل	O	O
خورشیدی	O	O
[UNK]	O	O
جلد	O	O
سوم	O	O
یادداشتهایش	O	O
هم	O	O
چندی	O	O
پیش	O	O
در	O	O
تهران	B-loc	B-loc
منتشر	O	O
شد	O	O
[UNK]	O	O
[UNK]	O	O
پرستاری	O	O
از	O	O
ناخوش	O	O
##احوال	O	O
[UNK]	O	O
پوشاک	O	O
و	O	O
جامه	O	O
[UNK]	O	O
فانتزی	O	O
و	O	O
شیک	O	O
[UNK]	O	O
[UNK]	O	O
در	O	O
حال	O	O
وزیدن	O	O
[UNK]	O	O
اطلاعیه	O	O
[UNK]	O	O
پایتخت	O	O
جمهوری	O	O
استونی	B-loc	B-loc
در	I-loc	I-loc
حوضه	I-loc	I-loc
بالتیک	I-loc	I-loc
[UNK]	O	O
[UNK]	O	O
علم	O	O
راهبرد	O	O
موسسه	O	O
و	O	O
سازمان	O	O
[UNK]	O	O
نوعی	O	O
شمع	O	O
[UNK]	O	O
[UNK]	O	O
حرف	O	O
جمع	O	O
مونث	O	O
[UNK]	O	O
در	O	O
ایران	B-loc	B-loc
به	O	O
تولیدکننده	O	O
کتاب	O	O
اطلاق	O	O
میشود	O	O
[UNK]	O	O
از	O	O
شهرهای	O	O
باختری	O	O
افغانستان	B-loc	B-loc
[UNK]	O	O
تا	O	O
عصر	O	O
ناصرالدینشاه	B-pers	B-pers
جزيی	O	O
از	O	O
خراسان	B-loc	B-loc
بود	O	O
[UNK]	O	O
ویتامین	O	O
انعقاد	O	O
[UNK]	O	O
[U

In [43]:
ner_model.evaluate_prediction_results(labels, inference_output)

Test Accuracy: 0.9753622639299824
Test Precision: 0.8142536069020517
Test Recall: 0.7153047576587479
Test F1-Score: 0.7615786261612533
Test classification Report:
              precision    recall  f1-score   support

       event  0.6810089021 0.7806122449 0.7274167987       588
         fac  0.7799671593 0.8377425044 0.8078231293       567
         loc  0.8666482606 0.7594967336 0.8095422308      4133
         org  0.8191881919 0.7729521715 0.7953988309      5457
        pers  0.7957996769 0.5929224844 0.6795420058      4154
         pro  0.7406832298 0.5947630923 0.6597510373       802

   micro avg  0.8142536069 0.7153047577 0.7615786262     15701
   macro avg  0.7805492368 0.7230815385 0.7465790055     15701
weighted avg  0.8148921499 0.7153047577 0.7594436072     15701



In [44]:
output_file_name = "ner_arman-and-peyma_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### WikiAnn

Dataset source: https://elisa-ie.github.io/wikiann/

In [45]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1QOG15HU8VfZvJUNKos024xI-OGm0zhEX'})
download.GetContentFile('fa.tar.gz')
!ls

adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner_arman-and-peyma_HooshvareLab-bert-fa-base-uncased-ner-arman_outputs.txt
ner_arman_HooshvareLab-bert-fa-base-uncased-ner-arman_outputs.txt
ner_peyma_HooshvareLab-bert-fa-base-uncased-ner-arman_outputs.txt
peyma
peyma.zip
sample_data


In [46]:
!tar -zxvf fa.tar.gz
!ls

README.txt
wikiann-fa.bio
adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner_arman-and-peyma_HooshvareLab-bert-fa-base-uncased-ner-arman_outputs.txt
ner_arman_HooshvareLab-bert-fa-base-uncased-ner-arman_outputs.txt
ner_peyma_HooshvareLab-bert-fa-base-uncased-ner-arman_outputs.txt
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio


In [47]:
sentences_all, labels_all, sentences_test, labels_test = ner_model.load_datasets(dataset_name="wikiann", dataset_dir="./")
print(len(sentences_all), len(labels_all))
print(len(sentences_test), len(labels_test))
print(sentences_test[0])
print(labels_test[0])

all data: #data: 272266, #labels: 272266


  return array(a, dtype, copy=False, order=order)


without stratify
test part:
 #data: 27227, #labels: 27227
272266 272266
27227 27227
['**', 'زاغی', 'نوک\u200cزرد', ',', "''Pica", 'nuttalli', "''"]
['O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O']


In [48]:
is_consistent = ner_model.check_input_label_consistency(labels_test)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'I-LOC', 'I-ORG', 'O', 'I-PER', 'B-PER', 'B-LOC', 'B-ORG'}
intersection: {'O'}
model_labels-dataset_labels: ['B-fac', 'B-loc', 'B-org', 'I-fac', 'B-event', 'I-event', 'B-pers', 'B-pro', 'I-loc', 'I-pro', 'I-org', 'I-pers']
dataset_labels-model_labels: ['I-LOC', 'I-ORG', 'I-PER', 'B-PER', 'B-LOC', 'B-ORG']
False


In [49]:
label_translate = {
    'B-ORG': 'B-org', 
    'I-ORG': 'I-org',
    'B-LOC': 'B-loc',
    'I-LOC': 'I-loc',
    'B-PER': 'B-pers', 
    'I-PER': 'I-pers',
    'O': 'O'
}
labels_test = ner_model.resolve_input_label_consistency(labels_test, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_test)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B-loc', 'B-org', 'O', 'B-pers', 'I-loc', 'I-org', 'I-pers'}
intersection: {'B-loc', 'B-org', 'O', 'B-pers', 'I-loc', 'I-org', 'I-pers'}
model_labels-dataset_labels: ['B-fac', 'I-fac', 'B-event', 'I-event', 'B-pro', 'I-pro']
dataset_labels-model_labels: []
True


In [50]:
!nvidia-smi
!lscpu

Mon Aug 16 12:16:24 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P0    76W / 149W |   9683MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [51]:
inference_output_wikiann = ner_model.ner_evaluation_2(sentences_test, labels_test, device, batch_size=512)

len(input_text): 27227
len(input_labels): 27227
c: 10000
c: 20000
max_len: 94
#samples: 27227
#batch: 54
Start to evaluate test data ...
inference time for step 0: 0.018234672000062346
inference time for step 1: 0.012716102000013052
inference time for step 2: 0.012263766999922154
inference time for step 3: 0.012753950999922381
inference time for step 4: 0.01214802200001941
inference time for step 5: 0.011837105999802588
inference time for step 6: 0.012844358999927863
inference time for step 7: 0.012200356999983342
inference time for step 8: 0.012293451000005007
inference time for step 9: 0.012253789999931541
inference time for step 10: 0.012777887999845916
inference time for step 11: 0.014679486999966684
inference time for step 12: 0.0119244190000245
inference time for step 13: 0.012087271999916993
inference time for step 14: 0.012314983999885953
inference time for step 15: 0.012108959999977742
inference time for step 16: 0.012556309000046895
inference time for step 17: 0.0144746979999

In [52]:
for sample_output in inference_output_wikiann[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

*	O	O
*	O	O
زاغی	B-loc	O
نوک	I-loc	O
##زرد	I-loc	O
,	O	O
'	O	O
'	O	O
pic	O	O
##a	O	O
nut	O	O
##ta	O	O
##ll	O	O
##i	O	O
'	O	O
'	O	O

تغییر	O	O
##مسیر	O	O
مک	B-loc	O
##ویل	B-loc	O
،	B-loc	O
داکوتای	I-loc	O
شمالی	I-loc	O

وست	B-loc	O
یونیورسیتی	I-loc	O
پلیس	I-loc	O
،	I-loc	O
تگزاس	I-loc	O

تغییر	O	O
##مسیر	O	O
دلت	B-pers	O
##ف	B-pers	O
فون	I-pers	O
لیل	I-pers	O
##نس	I-pers	O
##رون	I-pers	O

تغییر	O	O
##مسیر	O	O
نیروگاههای	B-org	O
زنجیرهای	I-org	O
یاسوج	I-org	O
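
In the samples above, a word split into several word pieces appears to carry its word-level tag on every piece (for example, both pieces of one word are shown as `B-loc`). The sketch below illustrates that expansion under this assumption. `expand_labels` and `toy_wordpiece` are hypothetical names. The toy splitter only imitates a WordPiece tokenizer and is not the ParsBERT tokenizer.

```python
def toy_wordpiece(word, max_len=4):
    """Split a word into fixed-size pieces, marking continuations with '##'."""
    if len(word) <= max_len:
        return [word]
    rest = [word[i:i + max_len] for i in range(max_len, len(word), max_len)]
    return [word[:max_len]] + ['##' + piece for piece in rest]


def expand_labels(words, labels, tokenize):
    """Repeat each word-level tag once per piece produced by `tokenize`."""
    tokens, token_labels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        token_labels.extend([label] * len(pieces))
    return tokens, token_labels


words = ['West', 'University', 'Place']
labels = ['B-loc', 'I-loc', 'I-loc']
tokens, token_labels = expand_labels(words, labels, toy_wordpiece)
print(list(zip(tokens, token_labels)))
```

Copying a `B-` tag onto continuation pieces produces several consecutive `B-` tags for one word, which matters if spans are later rebuilt from the piece-level tags.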



In [53]:
ner_model.evaluate_prediction_results(labels_test, inference_output_wikiann)

Test Accuracy: 0.4319473597320818
Test Precision: 0.25437856087782235
Test Recall: 0.06732004244150332
Test F1-Score: 0.10646471783096352
Test classification Report:
              precision    recall  f1-score   support

         loc  0.2138578275 0.0569408262 0.0899357602     18809
         org  0.3599608738 0.0980200657 0.1540823447     11263
        pers  0.1682109765 0.0411006618 0.0660601819      5742

   micro avg  0.2543785609 0.0673200424 0.1064647178     35814
   macro avg  0.2473432259 0.0653538512 0.1033594289     35814
weighted avg  0.2524866987 0.0673200424 0.1062810277     35814



In [54]:
output_file_name = "ner_wikiann_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_wikiann:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### Hooshvare - Arman+Peyma+WikiAnn

Dataset source: https://github.com/hooshvare/parsner

In [55]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1fC2WGlpqumUTaT9Dr_U1jO2no3YMKFJ4'})
download.GetContentFile('ner-v1.zip')
!ls

adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner_arman-and-peyma_HooshvareLab-bert-fa-base-uncased-ner-arman_outputs.txt
ner_arman_HooshvareLab-bert-fa-base-uncased-ner-arman_outputs.txt
ner_peyma_HooshvareLab-bert-fa-base-uncased-ner-arman_outputs.txt
ner-v1.zip
ner_wikiann_HooshvareLab-bert-fa-base-uncased-ner-arman_outputs.txt
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio


In [56]:
!unzip ner-v1.zip
!ls
!ls ner

Archive:  ner-v1.zip
   creating: ner/
  inflating: ner/valid.csv           
  inflating: ner/ner.csv             
  inflating: ner/test.csv            
  inflating: ner/train.csv           
adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner
ner_arman-and-peyma_HooshvareLab-bert-fa-base-uncased-ner-arman_outputs.txt
ner_arman_HooshvareLab-bert-fa-base-uncased-ner-arman_outputs.txt
ner_peyma_HooshvareLab-bert-fa-base-uncased-ner-arman_outputs.txt
ner-v1.zip
ner_wikiann_HooshvareLab-bert-fa-base-uncased-ner-arman_outputs.txt
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio
ner.csv  test.csv  train.csv  valid.csv


In [57]:
sentences_paw, labels_paw = ner_model.load_test_datasets(dataset_name="hooshvare-peyman+arman+wikiann", dataset_dir="./ner/")
print(len(sentences_paw), len(labels_paw))
print(sentences_paw[0])
print(labels_paw[0])

test part:
 #sentences: 6049, #sentences_tags: 6049
6049 6049
['همچنین', 'عملیات', 'لرزه\u200cنگاری', 'دوبعدی', 'نیز', 'با', 'فعالیت', 'مستمر', 'چهار', 'گروه', 'کاری', 'در', 'مناطقی', 'که', 'از', 'نظر', 'اکتشافی', 'مورد', 'نظر', 'بود', '،', 'به', 'پایان', 'رسید', 'که', 'نتایج', 'آن', 'در', 'حال', 'بررسی', 'است', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [58]:
is_consistent = ner_model.check_input_label_consistency(labels_paw)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'I-PCT', 'I-TIM', 'I-LOC', 'B-EVE', 'I-MON', 'B-MON', 'B-PCT', 'B-PRO', 'B-ORG', 'I-FAC', 'I-DAT', 'I-ORG', 'I-PRO', 'B-DAT', 'O', 'I-PER', 'I-EVE', 'B-TIM', 'B-PER', 'B-FAC', 'B-LOC'}
intersection: {'O'}
model_labels-dataset_labels: ['B-fac', 'B-loc', 'B-org', 'I-fac', 'B-event', 'I-event', 'B-pers', 'B-pro', 'I-loc', 'I-pro', 'I-org', 'I-pers']
dataset_labels-model_labels: ['I-PCT', 'I-TIM', 'B-EVE', 'B-PCT', 'B-PRO', 'I-DAT', 'I-PRO', 'I-PER', 'I-EVE', 'B-FAC', 'I-LOC', 'I-MON', 'B-MON', 'I-FAC', 'B-ORG', 'I-ORG', 'B-DAT', 'B-TIM', 'B-PER', 'B-LOC']
False
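
The check above reports that the dataset and the model share only the `O` tag, because the dataset uses upper-case class names (`B-PER`, `B-LOC`, ...) while the model uses lower-case ones (`B-pers`, `B-loc`, ...). A minimal sketch of such a comparison with plain Python sets follows. `compare_label_sets` is a hypothetical helper written for illustration, not the notebook's `ner_model.check_input_label_consistency`, and the rule "consistent means every dataset label is known to the model" is an assumption.

```python
def compare_label_sets(model_labels, dataset_label_sequences):
    """Compare the tag inventory of a model with the tags found in a dataset."""
    model_set = set(model_labels)
    # Flatten the per-sentence tag lists into one set of distinct tags.
    dataset_set = {tag for seq in dataset_label_sequences for tag in seq}
    return {
        "intersection": model_set & dataset_set,
        "model_only": model_set - dataset_set,
        "dataset_only": dataset_set - model_set,
        # Assumed criterion: no dataset tag is unknown to the model.
        "consistent": dataset_set <= model_set,
    }


# Toy data: the tags differ only in case and naming, as in the output above.
report = compare_label_sets(
    ["B-loc", "I-loc", "O"],
    [["B-LOC", "I-LOC", "O"], ["O"]],
)
print(report["intersection"])
print(report["dataset_only"])
print(report["consistent"])
```
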


In [59]:
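# Map the dataset's tag names onto the model's tag names.
# Classes the model does not predict (MON, DAT, PCT, TIM) are mapped to 'O'.
label_translate = {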
    'B-LOC': 'B-loc', 
    'I-LOC': 'I-loc', 
    'B-EVE': 'B-event', 
    'I-EVE': 'I-event', 
    'B-ORG': 'B-org', 
    'I-ORG': 'I-org', 
    'B-MON': 'O', 
    'I-MON': 'O', 
    'B-DAT': 'O', 
    'I-DAT': 'O', 
    'B-PRO': 'B-pro', 
    'I-PRO': 'I-pro',
    'B-FAC': 'B-fac', 
    'I-FAC': 'I-fac', 
    'B-PCT': 'O', 
    'I-PCT': 'O', 
    'B-PER': 'B-pers', 
    'I-PER': 'I-pers', 
    'B-TIM': 'O', 
    'I-TIM': 'O', 
    'O': 'O'
}
labels_paw = ner_model.resolve_input_label_consistency(labels_paw, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_paw)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B-fac', 'B-event', 'I-pro', 'I-loc', 'I-org', 'I-pers', 'B-loc', 'B-org', 'I-fac', 'I-event', 'O', 'B-pers', 'B-pro'}
intersection: {'B-fac', 'B-loc', 'B-org', 'I-fac', 'B-event', 'I-event', 'O', 'I-pro', 'B-pers', 'I-loc', 'B-pro', 'I-org', 'I-pers'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True
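
After translation, both label sets match. The remapping step can be pictured as a dictionary lookup applied to every tag of every sentence. The sketch below assumes that behaviour; `translate_labels` and `example_mapping` are hypothetical names for illustration, not the implementation of `ner_model.resolve_input_label_consistency`.

```python
def translate_labels(label_sequences, mapping):
    """Return new tag sequences with every tag replaced through `mapping`."""
    # A KeyError here would reveal a dataset tag missing from the mapping.
    return [[mapping[tag] for tag in seq] for seq in label_sequences]


# A small subset of the mapping defined above, for illustration only.
example_mapping = {
    "B-PER": "B-pers",
    "I-PER": "I-pers",
    "B-DAT": "O",
    "I-DAT": "O",
    "O": "O",
}

translated = translate_labels(
    [["B-PER", "I-PER", "O", "B-DAT", "I-DAT"]],
    example_mapping,
)
print(translated)
```

Tags mapped to `O` no longer count as entities, so those classes do not appear in any entity-level report computed on the translated labels.
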


In [60]:
!nvidia-smi
!lscpu

Mon Aug 16 12:23:47 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   71C    P0    74W / 149W |   4073MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [61]:
inference_output = ner_model.ner_evaluation_2(sentences_paw, labels_paw, device, batch_size=256)

len(input_text): 6049
len(input_labels): 6049
max_len: 448
#samples: 6049
#batch: 24
Start to evaluate test data ...
inference time for step 0: 0.029542523999907644
inference time for step 1: 0.014591774999871632
inference time for step 2: 0.01334347500005606
inference time for step 3: 0.012860247999924468
inference time for step 4: 0.01230657699989024
inference time for step 5: 0.01310279599988462
inference time for step 6: 0.012792242999921655
inference time for step 7: 0.013016703999937818
inference time for step 8: 0.01241631600009896
inference time for step 9: 0.01235773500002324
inference time for step 10: 0.013430374999870764
inference time for step 11: 0.013497061999942161
inference time for step 12: 0.012122470999884172
inference time for step 13: 0.012319961999992302
inference time for step 14: 0.012281212999823765
inference time for step 15: 0.01319177999994281
inference time for step 16: 0.013084445000004052
inference time for step 17: 0.024944275999814636
inference time fo

In [62]:
# Print the first five sentences, one token per line: token, true label, predicted label.
for sample_output in inference_output[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

همچنین	O	O
عملیات	O	O
لرزهنگاری	O	O
دوبعدی	O	O
نیز	O	O
با	O	O
فعالیت	O	O
مستمر	O	O
چهار	O	O
گروه	O	O
کاری	O	O
در	O	O
مناطقی	O	O
که	O	O
از	O	O
نظر	O	O
اکتشافی	O	O
مورد	O	O
نظر	O	O
بود	O	O
،	O	O
به	O	O
پایان	O	O
رسید	O	O
که	O	O
نتایج	O	O
ان	O	O
در	O	O
حال	O	O
بررسی	O	O
است	O	O
.	O	O

محدث	B-pers	O
در	O	O
مورد	O	O
مشارکت	O	O
شرکتهای	O	O
خارجی	O	O
در	O	O
فعالیتهای	O	O
اکتشافی	O	O
کشور	O	O
گفت	O	O
:	O	O
تاکنون	O	O
چند	O	O
منطقه	O	O
اکتشافی	O	O
را	O	O
برای	O	O
مشارکت	O	O
و	O	O
سرمایهگذاری	O	O
شرکتهای	O	O
خارجی	O	O
اعلام	O	O
کردهایم	O	O
و	O	O
در	O	O
حال	O	O
مذاکره	O	O
با	O	O
طرفهای	O	O
خارجی	O	O
هستیم	O	O
و	O	O
انتظار	O	O
میرود	O	O
تا	O	O
اخر	O	O
امسال	O	O
بتوانیم	O	O
چند	O	O
قرارداد	O	O
را	O	O
نهایی	O	O
کنیم	O	O
.	O	O

مدیر	O	O
امور	B-org	B-org
اکتشاف	I-org	O
شرکت	I-org	B-org
ملی	I-org	I-org
نفت	I-org	I-org
فرو	O	O
##افتادگی	O	O
دزفول	B-loc	B-loc
و	O	O
منطقه	B-loc	B-loc
گسل	I-loc	I-loc
کازرون	I-loc	I-loc
تا	O	O
بالارو	B-loc	O
##د	B-loc	O
در	O	O
اطراف	O	O
لرستان	B-loc	B-loc
را	O	O
مستعدترین	O

In [63]:
ner_model.evaluate_prediction_results(labels_paw, inference_output)

Test Accuracy: 0.9665831120129709
Test Precision: 0.722788142382969
Test Recall: 0.6314368563143685
Test F1-Score: 0.6740313800832533
Test classification Report:
              precision    recall  f1-score   support

       event  0.4471910112 0.7713178295 0.5661450925       258
         fac  0.6081871345 0.8093385214 0.6944908180       257
         loc  0.8632590651 0.6671168130 0.7526185488      2962
         org  0.6772036474 0.6708822644 0.6740281349      3321
        pers  0.7607922803 0.5282087447 0.6235171696      2836
         pro  0.5124378109 0.5613079019 0.5357607282       367

   micro avg  0.7227881424 0.6314368563 0.6740313801     10001
   macro avg  0.6448451583 0.6680286791 0.6410934153     10001
weighted avg  0.7422575364 0.6314368563 0.6756496383     10001
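
The summary figures in the report can be cross-checked by hand. The sketch below recomputes the micro-averaged F1 as the harmonic mean of the reported precision and recall, and the macro-averaged F1 as the unweighted mean of the six per-class F1 scores. All numbers are copied from the output above; the variable names are illustrative.

```python
# Micro-averaged precision and recall, as printed above.
precision = 0.722788142382969
recall = 0.6314368563143685

# F1 is the harmonic mean of precision and recall.
micro_f1 = 2 * precision * recall / (precision + recall)

# Per-class F1 scores from the classification report.
per_class_f1 = {
    "event": 0.5661450925,
    "fac": 0.6944908180,
    "loc": 0.7526185488,
    "org": 0.6740281349,
    "pers": 0.6235171696,
    "pro": 0.5357607282,
}

# Macro average: every class counts equally, regardless of support.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

print(micro_f1)
print(macro_f1)
```

The macro average is lower than the micro average here because the small classes (`event`, `pro`) score worse than the large ones (`loc`, `org`, `pers`) and the macro average does not weight by support.
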



In [64]:
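# Write one token per line as "token<TAB>true_label<TAB>predicted_label";
# a blank line separates consecutive sentences.
output_file_name = "ner_arman-and-peyma-and-wikiann_{}_outputs.txt".format(model_name.replace('/','-'))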
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
# Authenticate the Colab user, then upload the results file to Google Drive.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()