# ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Google’s BERT architecture with the same configurations as **BERT-Base**.

Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)

**All the models (downstream tasks) are uncased** and trained with whole word masking. (Coming soon, stay tuned.)



## Persian NER [ARMAN, PEYMA, COMPOSITE]

This task aims to extract named entities from the text, such as names, and label them with the appropriate **NER** classes such as locations, organizations, etc. The datasets used for this task contain sentences that are annotated in **IOB** format. In this format, tokens that are not part of an entity are tagged as **"O"**, the **"B"** tag marks the first word of an entity, and the **"I"** tag marks the remaining words of the same entity. Both **"B"** and **"I"** tags are followed by a hyphen (or underscore) and then the entity category. The **NER task is therefore a multi-class token classification problem that labels each token of a raw input text**.
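
To make the IOB scheme concrete, here is a minimal sketch that groups tagged tokens back into entity spans. The helper name `iob_to_spans` and the English sample tokens are illustrative assumptions and are not part of the ParsBERT code; the separator handling (hyphen or underscore) follows the description above.

```python
import re


def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans (illustrative helper)."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        # Split "B-pers" or "B_PER" into prefix and entity type; "O" carries no type.
        match = re.fullmatch(r"([BI])[-_](.+)", tag)
        prefix, ent_type = (match.group(1), match.group(2)) if match else ("O", None)
        # Close the open span on "O", on a new "B", or when the entity type changes.
        if current_tokens and (prefix != "I" or ent_type != current_type):
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix == "B" or (prefix == "I" and not current_tokens):
            # A stray "I" with no open span is treated leniently as a span start.
            current_type, current_tokens = ent_type, [token]
        elif prefix == "I":
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Hossein", "Mohammadi", "visited", "Karaj"]
tags = ["B-pers", "I-pers", "O", "B-loc"]
print(iob_to_spans(tokens, tags))
# [('pers', 'Hossein Mohammadi'), ('loc', 'Karaj')]
```

Treating a stray "I" tag as a span start is a design choice of this sketch; a stricter parser could reject such sequences instead.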

There are two primary datasets used in Persian NER, **ARMAN** and **PEYMA**. In ParsBERT, we prepared NER models for both datasets as well as for a combination of both.


In [1]:
!nvidia-smi
!lscpu

Mon Aug 16 04:58:24 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install transformers==4.7.0
!pip install hazm==0.7.0
!pip install seqeval==1.2.2

Collecting transformers==4.7.0
  Downloading transformers-4.7.0-py3-none-any.whl (2.5 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub==0.0.8
  Downloading huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Installing collected packages: tokenizers, sacremoses, huggingface-hub, transformers
Successfully installed huggingface-hub-0.0.8 sacremoses-0.0.45 tokenizers-0.10.3 transformers-4.7.0
Collecting hazm==0.7.0
  Downloading hazm-0.7.0-py3-none-any.whl (316 kB)
Collecting libwapiti>=0.2.1
  Downloading libwapiti-0.

In [3]:
!pip install PyDrive
import os
import IPython.display as ipd
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)



In [4]:
import os
import gc
import ast
import time
import hazm
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

import transformers
from transformers import AutoTokenizer, AutoConfig
from transformers import AutoModelForTokenClassification

from IPython.display import display, HTML, clear_output
from ipywidgets import widgets, Layout

from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from seqeval.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

print()
print('numpy', np.__version__)
print('pandas', pd.__version__)
print('transformers', transformers.__version__)
print('torch', torch.__version__)
print()

# If there's a GPU available...
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")


numpy 1.19.5
pandas 1.1.5
transformers 4.7.0
torch 1.9.0+cu102

There are 1 GPU(s) available.
We will use the GPU: Tesla K80


In [5]:
class NER:
    def __init__(self, model_name):
        self.normalizer = hazm.Normalizer()
        self.model_name = model_name
        self.config = AutoConfig.from_pretrained(self.model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(self.model_name)
        # self.labels = list(self.config.label2id.keys())
        self.id2label = self.config.id2label

    @staticmethod
    def load_ner_data(file_path, word_index, tag_index, delimiter, join=False):
        dataset, labels = [], []
        with open(file_path, encoding="utf8") as infile:
            sample_text, sample_label = [], []
            for line in infile:
                parts = line.strip().split(delimiter)
                if len(parts) > 1:
                    word, tag = parts[word_index], parts[tag_index]
                    if not word:
                        continue
                    sample_text.append(word)
                    sample_label.append(tag)
                else:
                    # end of sample
                    if sample_text and sample_label:
                        if join:
                            dataset.append(' '.join(sample_text))
                            labels.append(' '.join(sample_label))
                        else:
                            dataset.append(sample_text)
                            labels.append(sample_label)
                    sample_text, sample_label = [], []
        if sample_text and sample_label:
            if join:
                dataset.append(' '.join(sample_text))
                labels.append(' '.join(sample_label))
            else:
                dataset.append(sample_text)
                labels.append(sample_label)
        return dataset, labels

    def load_test_datasets(self, dataset_name, dataset_dir, **kwargs):
        if dataset_name.lower() == "peyma":
            ner_file_path = dataset_dir + 'test.txt'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            return self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter='|',
                                      join=kwargs.get('join', False))
        elif dataset_name.lower() == "arman":
            dataset, labels = [], []
            for i in range(1, 4):
                ner_file_path = dataset_dir + f'test_fold{i}.txt'
                if not os.path.exists(ner_file_path):
                    print(ner_file_path)
                dataset_part, labels_part = self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter=' ',
                                                               join=kwargs.get('join', False))
                dataset += dataset_part
                labels += labels_part
            return dataset, labels
        elif dataset_name.lower() == "hooshvare-peyman+arman+wikiann":
            ner_file_path = dataset_dir + 'test.csv'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            data = pd.read_csv(ner_file_path, delimiter="\t")
            sentences, sentences_tags = data['tokens'].values.tolist(), data['ner_tags'].values.tolist()
            sentences = [ast.literal_eval(ss) for ss in sentences]
            sentences_tags = [ast.literal_eval(ss) for ss in sentences_tags]
            print(f'test part:\n #sentences: {len(sentences)}, #sentences_tags: {len(sentences_tags)}')
            return sentences, sentences_tags

    def load_datasets(self, dataset_name, dataset_dir, **kwargs):
        if dataset_name.lower() == "farsiyar":
            dataset, labels = [], []
            for i in range(1, 6):
                # f-string is required so that {i} is replaced by the part number
                ner_file_path = dataset_dir + f'Persian-NER-part{i}.txt'
                if not os.path.exists(ner_file_path):
                    print(ner_file_path)
                dataset_part, labels_part = self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter='\t',
                                                               join=kwargs.get('join', False))
                dataset += dataset_part
                labels += labels_part
            return dataset, labels
        elif dataset_name.lower() == "wikiann":
            ner_file_path = dataset_dir + 'wikiann-fa.bio'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            dataset_all, labels_all = self.load_ner_data(ner_file_path, word_index=0, tag_index=-1, delimiter=' ',
                                                         join=kwargs.get('join', False))
            print(f'all data: #data: {len(dataset_all)}, #labels: {len(labels_all)}')

            try:
                _, data_test, _, label_test = train_test_split(dataset_all, labels_all, test_size=0.1, random_state=1,
                                                               stratify=labels_all)
                print("with stratify")
            except Exception:
                # stratification failed; fall back to a plain random split
                _, data_test, _, label_test = train_test_split(dataset_all, labels_all, test_size=0.1, random_state=1)
                print("without stratify")
            print(f'test part:\n #data: {len(data_test)}, #labels: {len(label_test)}')
            return dataset_all, labels_all, data_test, label_test

    def ner_inference(self, input_text, device, max_length):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong! The model, tokenizer, or label map is missing.')
            return

        pt_batch = self.tokenizer(
            [self.normalizer.normalize(sequence) for sequence in input_text],
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt"
        )

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        pt_batch = pt_batch.to(device)
        # Inference only: disable gradient tracking to save memory.
        with torch.no_grad():
            pt_outputs = self.model(**pt_batch)
        pt_predictions = torch.argmax(pt_outputs.logits, dim=-1)
        pt_predictions = pt_predictions.cpu().detach().numpy().tolist()

        output_predictions = []
        for i, sequence in enumerate(input_text):
            tokens = self.tokenizer.tokenize(self.tokenizer.decode(self.tokenizer.encode(sequence)))
            predictions = [(token, self.id2label[prediction]) for token, prediction in
                           zip(tokens, pt_predictions[i])]
            output_predictions.append(predictions)
        return output_predictions

    def ner_evaluation(self, input_text, input_labels, device, batch_size=4):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong! The model, tokenizer, or label map is missing.')
            return

        max_len = 0
        tokenized_texts, new_labels = [], []
        for sentence, sentence_label in zip(input_text, input_labels):
            if type(sentence) == str:
                sentence = sentence.strip().split()
            if len(sentence) != len(sentence_label):
                print('Something went wrong! The sentence and its label sequence differ in length.')
                return
            tokenized_sentence, new_sentence_label = [], []
            for word, label in zip(sentence, sentence_label):
                # Tokenize the word and count # of subwords the word is broken into
                tokenized_word = self.tokenizer.tokenize(word)
                n_subwords = len(tokenized_word)

                # Add the tokenized word to the final tokenized word list
                tokenized_sentence.extend(tokenized_word)
                # Add the same label to the new list of labels `n_subwords` times
                new_sentence_label.extend([label] * n_subwords)

            max_len = max(max_len, len(tokenized_sentence))
            tokenized_texts.append(tokenized_sentence)
            new_labels.append(new_sentence_label)

        max_len = min(max_len, self.config.max_position_embeddings)
        print("max_len:", max_len)
        input_ids = pad_sequences([self.tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                                  maxlen=max_len, dtype="long", value=self.config.pad_token_id,
                                  truncating="post", padding="post")
        del tokenized_texts
        input_labels = pad_sequences([[self.config.label2id.get(l) for l in lab] for lab in new_labels],
                                     maxlen=max_len, value=self.config.label2id.get('O'), padding="post",
                                     dtype="long", truncating="post")
        del new_labels

        train_data = TensorDataset(torch.tensor(input_ids), torch.tensor(input_labels))
        data_loader = DataLoader(train_data, batch_size=batch_size)
        # data_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)
        print("#samples:", len(input_ids))
        print("#batch:", len(data_loader))

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        total_loss, total_time = 0, 0
        output_predictions = []
        print("Start to evaluate test data ...")
        for step, batch in enumerate(data_loader):
            b_input_ids, b_labels = batch

            # move tensors to GPU if CUDA is available
            b_input_ids = b_input_ids.to(device)
            b_labels = b_labels.to(device)

            # This will return the loss (rather than the model output) because we have provided the `labels`.
            with torch.no_grad():
                start = time.monotonic()
                outputs = self.model(b_input_ids, labels=b_labels)
                end = time.monotonic()
                total_time += end - start
                print(f'inference time for step {step}: {end - start}')
            # get the loss
            total_loss += outputs.loss.item()

            b_predictions = torch.argmax(outputs.logits, dim=2)
            b_predictions = b_predictions.cpu().detach().numpy().tolist()
            b_labels = b_labels.cpu().detach().numpy().tolist()

            for i, sample in enumerate(b_input_ids):
                sample_input = list(sample)
                # remove pad tokens
                while sample_input[-1] == self.config.pad_token_id:
                    sample_input.pop()
                # tokens = self.tokenizer.tokenize(self.tokenizer.decode(sample_input))
                tokens = [self.tokenizer.decode([t]) for t in sample_input]
                sample_true_labels = [self.id2label[e] for e in b_labels[i][:len(sample_input)]]
                sample_predictions = [self.id2label[e] for e in b_predictions[i][:len(sample_input)]]
                output_predictions.append(
                    [(t, sample_true_labels[j], sample_predictions[j]) for j, t in enumerate(tokens)])

        # Calculate the average loss over the evaluated batches.
        avg_loss = total_loss / len(data_loader)
        print("average loss:", avg_loss)
        print("total inference time:", total_time)
        print("total inference time / #samples:", total_time / len(input_ids))

        return output_predictions

    def ner_evaluation_2(self, input_text, input_labels, device, batch_size=4):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong! The model, tokenizer, or label map is missing.')
            return

        print("len(input_text):", len(input_text))
        print("len(input_labels):", len(input_labels))
        c = 0
        max_len = 0
        tokenized_texts, new_labels = [], []
        for sentence, sentence_label in zip(input_text, input_labels):
            if type(sentence) == str:
                sentence = sentence.strip().split()
            if len(sentence) != len(sentence_label):
                print('Something went wrong! The sentence and its label sequence differ in length.')
                return
            tokenized_words = self.tokenizer(sentence, padding=False, add_special_tokens=False).input_ids
            tokenized_sentence_ids, new_sentence_label = [], []
            for i, tokenized_word in enumerate(tokenized_words):
                # Add the tokenized word to the final tokenized word list
                tokenized_sentence_ids += tokenized_word
                # Add the same label to the new list of labels `number of subwords` times
                new_sentence_label.extend([self.config.label2id.get(sentence_label[i])] * len(tokenized_word))

            max_len = max(max_len, len(tokenized_sentence_ids))
            tokenized_texts.append(tokenized_sentence_ids)
            new_labels.append(new_sentence_label)
            c += 1
            if c % 10000 == 0:
                print("c:", c)
        max_len = min(max_len, self.config.max_position_embeddings)
        print("max_len:", max_len)
        input_ids = pad_sequences(tokenized_texts, maxlen=max_len, dtype="long", value=self.config.pad_token_id,
                                  truncating="post", padding="post")
        del tokenized_texts
        input_labels = pad_sequences(new_labels, maxlen=max_len, value=self.config.label2id.get('O'), padding="post",
                                     dtype="long", truncating="post")
        del new_labels

        train_data = TensorDataset(torch.tensor(input_ids), torch.tensor(input_labels))
        data_loader = DataLoader(train_data, batch_size=batch_size)
        # data_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)
        print("#samples:", len(input_ids))
        print("#batch:", len(data_loader))

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        total_time = 0
        output_predictions = []
        print("Start to evaluate test data ...")
        for step, batch in enumerate(data_loader):
            b_input_ids, b_labels = batch

            # move tensors to GPU if CUDA is available
            b_input_ids = b_input_ids.to(device)
            b_labels = b_labels.to(device)

            # This will return the loss (rather than the model output) because we have provided the `labels`.
            with torch.no_grad():
                start = time.monotonic()
                outputs = self.model(b_input_ids, labels=b_labels)
                end = time.monotonic()
                total_time += end - start
                print(f'inference time for step {step}: {end - start}')

            b_predictions = torch.argmax(outputs.logits, dim=2)
            b_predictions = b_predictions.cpu().detach().numpy().tolist()
            b_labels = b_labels.cpu().detach().numpy().tolist()

            for i, sample in enumerate(b_input_ids):
                sample_input = list(sample)
                # remove pad tokens
                while sample_input[-1] == self.config.pad_token_id:
                    sample_input.pop()
                # tokens = self.tokenizer.tokenize(self.tokenizer.decode(sample_input))
                tokens = [self.tokenizer.decode([t]) for t in sample_input]
                sample_true_labels = [self.id2label[e] for e in b_labels[i][:len(sample_input)]]
                sample_predictions = [self.id2label[e] for e in b_predictions[i][:len(sample_input)]]
                output_predictions.append(
                    [(t, sample_true_labels[j], sample_predictions[j]) for j, t in enumerate(tokens)])

        print("total inference time:", total_time)
        print("total inference time / #samples:", total_time / len(input_ids))

        return output_predictions

    def check_input_label_consistency(self, labels):
        model_labels = self.config.label2id.keys()
        dataset_labels = set()
        for l in labels:
            dataset_labels.update(set(l))
        print("model labels:", model_labels)
        print("dataset labels:", dataset_labels)
        print("intersection:", set(model_labels).intersection(dataset_labels))
        print("model_labels-dataset_labels:", list(set(model_labels) - set(dataset_labels)))
        print("dataset_labels-model_labels:", list(set(dataset_labels) - set(model_labels)))
        if list(set(dataset_labels) - set(model_labels)):
            return False
        return True

    @staticmethod
    def resolve_input_label_consistency(labels, label_translation_map):
        for i, sentence_labels in enumerate(labels):
            for j, label in enumerate(sentence_labels):
                labels[i][j] = label_translation_map.get(label)
        return labels

    @staticmethod
    def evaluate_prediction_results(labels, output_predictions):
        dataset_labels = set()
        for label in labels:
            dataset_labels.update(set(label))

        true_labels, predictions = [], []
        for sample_output in output_predictions:
            sample_true_labels = []
            sample_predicted_labels = []
            for token, true_label, predicted_label in sample_output:
                sample_true_labels.append(true_label)
                if predicted_label in dataset_labels:
                    sample_predicted_labels.append(predicted_label)
                else:
                    sample_predicted_labels.append('O')
            true_labels.append(sample_true_labels)
            predictions.append(sample_predicted_labels)

        print("Test Accuracy: {}".format(accuracy_score(true_labels, predictions)))
        print("Test Precision: {}".format(precision_score(true_labels, predictions)))
        print("Test Recall: {}".format(recall_score(true_labels, predictions)))
        print("Test F1-Score: {}".format(f1_score(true_labels, predictions)))
        print("Test classification Report:\n{}".format(classification_report(true_labels, predictions, digits=10)))
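
The evaluation methods above repeat each word-level label once for every subword piece the tokenizer produces. The following self-contained sketch isolates that step. `toy_tokenize` is a hypothetical stand-in for the real WordPiece tokenizer (it simply cuts words into fixed-size chunks), and `expand_labels` is an illustrative name, not a method of the `NER` class.

```python
def toy_tokenize(word, max_piece=4):
    """Hypothetical stand-in for WordPiece: cut a word into chunks of max_piece characters."""
    if not word:
        return []
    pieces = [word[i:i + max_piece] for i in range(0, len(word), max_piece)]
    # Mark continuation pieces with "##", mimicking the WordPiece convention.
    return [pieces[0]] + ["##" + p for p in pieces[1:]]


def expand_labels(words, labels, tokenize=toy_tokenize):
    """Repeat each word-level label once per subword piece of that word."""
    tokens, token_labels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        token_labels.extend([label] * len(pieces))
    return tokens, token_labels


tokens, token_labels = expand_labels(["Tehran", "is", "big"], ["B-loc", "O", "O"])
print(tokens)        # ['Tehr', '##an', 'is', 'big']
print(token_labels)  # ['B-loc', 'B-loc', 'O', 'O']
```

Because a `B-` label is copied onto every piece of a word, the reference labels and the predictions are compared at subword level rather than at word level.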


In [6]:
model_name='HooshvareLab/bert-base-parsbert-armanner-uncased'
ner_model = NER(model_name)


In [7]:
print(ner_model.config)

BertConfig {
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "B-event",
    "1": "B-fac",
    "2": "B-loc",
    "3": "B-org",
    "4": "B-pers",
    "5": "B-pro",
    "6": "I-event",
    "7": "I-fac",
    "8": "I-loc",
    "9": "I-org",
    "10": "I-pers",
    "11": "I-pro",
    "12": "O"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-event": 0,
    "B-fac": 1,
    "B-loc": 2,
    "B-org": 3,
    "B-pers": 4,
    "B-pro": 5,
    "I-event": 6,
    "I-fac": 7,
    "I-loc": 8,
    "I-org": 9,
    "I-pers": 10,
    "I-pro": 11,
    "O": 12
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version":

#### Sample Inference:

In [8]:
texts = [
    "مدیرکل محیط زیست استان البرز با بیان اینکه با بیان اینکه موضوع شیرابه‌های زباله‌های انتقال یافته در منطقه حلقه دره خطری برای این استان است، گفت: در این مورد گزارشاتی در ۲۵ مرداد ۱۳۹۷ تقدیم مدیران استان شده است.",
    "به گزارش خبرگزاری تسنیم از کرج، حسین محمدی در نشست خبری مشترک با معاون خدمات شهری شهرداری کرج که با حضور مدیرعامل سازمان‌های پسماند، پارک‌ها و فضای سبز و نماینده منابع طبیعی در سالن کنفرانس شهرداری کرج برگزار شد، اظهار داشت: ۸۰٪  جمعیت استان البرز در کلانشهر کرج زندگی می‌کنند.",
    "وی افزود: با همکاری‌های مشترک بین اداره کل محیط زیست و شهرداری کرج برنامه‌های مشترکی برای حفاظت از محیط زیست در شهر کرج در دستور کار قرار گرفته که این اقدامات آثار مثبتی داشته و تاکنون نزدیک به ۱۰۰ میلیارد هزینه جهت خریداری اکس-ریس صورت گرفته است.",
]

In [9]:
inference_output = ner_model.ner_inference(texts, device, ner_model.config.max_position_embeddings)

In [10]:
print(inference_output)

[[('[CLS]', 'O'), ('مدیرکل', 'O'), ('محیط', 'B-org'), ('زیست', 'I-org'), ('استان', 'I-org'), ('البرز', 'I-org'), ('با', 'O'), ('بیان', 'O'), ('اینکه', 'O'), ('با', 'O'), ('بیان', 'O'), ('اینکه', 'O'), ('موضوع', 'O'), ('شیرابه', 'O'), ('##های', 'O'), ('زبالههای', 'O'), ('انتقال', 'O'), ('یافته', 'O'), ('در', 'O'), ('منطقه', 'O'), ('حلقه', 'I-loc'), ('دره', 'O'), ('خطری', 'O'), ('برای', 'O'), ('این', 'O'), ('استان', 'O'), ('است', 'O'), ('،', 'O'), ('گفت', 'O'), (':', 'O'), ('در', 'O'), ('این', 'O'), ('مورد', 'O'), ('گزارشاتی', 'O'), ('در', 'O'), ('۲۵', 'O'), ('مرداد', 'O'), ('۱۳۹۷', 'O'), ('تقدیم', 'O'), ('مدیران', 'O'), ('استان', 'O'), ('شده', 'O'), ('است', 'O'), ('.', 'O'), ('[SEP]', 'O')], [('[CLS]', 'O'), ('به', 'O'), ('گزارش', 'O'), ('خبرگزاری', 'B-org'), ('تسنیم', 'I-org'), ('از', 'O'), ('کرج', 'B-loc'), ('،', 'O'), ('حسین', 'B-pers'), ('محمدی', 'I-pers'), ('در', 'O'), ('نشست', 'O'), ('خبری', 'O'), ('مشترک', 'O'), ('با', 'O'), ('معاون', 'O'), ('خدمات', 'O'), ('شهری', 'O'), ('شهردار

In [11]:
#@title Live Playground { display-mode: "form" }

css_is_load = False
css = """<style>
.ner-box {
    direction: rtl;
    font-size: 18px !important;
    line-height: 20px !important;
    margin: 0 0 15px;
    padding: 10px;
    text-align: justify;
    color: #343434 !important;
}
.token, .token span {
    display: inline-block !important;
    padding: 2px;
    margin: 2px 0;
}
.token.token-ner {
    background-color: #f6cd61;
    font-weight: bold;
    color: #000;
}
.token.token-ner .ner-label {
    color: #9a1f40;
    margin: 0px 2px;
}
</style>"""

if not css_is_load:
    display(HTML(css))
    css_is_load = True

submit_wd = widgets.Button(description='Send', disabled=False, button_style='success', tooltip='Submit')
text_wd = widgets.Textarea(placeholder='Please enter your text ...', rows=5, layout=Layout(width='90%'))
output_wd = widgets.Output()

display(HTML("""
<h2>Test NER model</h2>
<p style="padding: 2px 20px; margin: 0 0 20px;">
</p>
<br /><br />
"""))

display(text_wd)
display(submit_wd)
display(output_wd)

def submit_text(sender):
    with output_wd:
        clear_output(wait=True)
        text = text_wd.value
        _output = ner_model.ner_inference([text], device, ner_model.config.max_position_embeddings)
        # print(_output)
        pred_sequence = []
        for token, label in _output[0]:
            if token not in ['[CLS]', '[SEP]']:
                if label != 'O':
                    pred_sequence.append(
                        '<span class="token token-ner">%s<span class="ner-label">%s</span></span>' 
                        % (token, label))
                else:
                    pred_sequence.append(
                        '<span class="token">%s</span>' 
                        % token)
            
        html = '<p class="ner-box">%s</p>' % ' '.join(pred_sequence) 
        display(HTML(html))

submit_wd.on_click(submit_text)


#### PEYMA dataset:
The PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens, of which 41,148 tokens are tagged with seven different classes:

- Organization
- Money
- Location
- Date
- Time
- Person
- Percent

|     Label    |   #   |
|:------------:|:-----:|
| Organization | 16964 |
|     Money    |  2037 |
|   Location   |  8782 |
|     Date     |  4259 |
|     Time     |  732  |
|    Person    |  7675 |
|    Percent   |  699  |
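
The counts in the table add up to 41,148, so they appear to be per-token counts (every `B` and `I` token counted once). Under that assumption, a tally of this kind could be computed from IOB tag sequences roughly as follows. `count_tagged_tokens` is an illustrative helper, and the sample tag sequences are made up for the demonstration.

```python
from collections import Counter


def count_tagged_tokens(label_sequences):
    """Count non-"O" tokens per entity class across IOB tag sequences."""
    counts = Counter()
    for tags in label_sequences:
        for tag in tags:
            if tag != "O":
                # Drop the two-character "B_" / "I_" (or "B-" / "I-") prefix.
                counts[tag[2:]] += 1
    return counts


sample = [
    ["O", "B_ORG", "I_ORG", "O"],
    ["B_LOC", "O", "B_PER", "I_PER", "I_PER"],
]
print(count_tagged_tokens(sample))
# Counter({'PER': 3, 'ORG': 2, 'LOC': 1})
```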

Download
You can download the dataset from [here](https://hooshvare.github.io/docs/datasets/ner), which leads to the following Google Drive file of HooshvareLab:

In [12]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1WZxpFRtEs5HZWyWQ2Pyg9CCuIBs1Kmvx'})
download.GetContentFile('peyma.zip')
!ls

adc.json  peyma.zip  sample_data


In [13]:
!unzip peyma.zip
!ls
!ls peyma

Archive:  peyma.zip
   creating: peyma/
  inflating: peyma/dev.txt           
  inflating: peyma/test.txt          
  inflating: peyma/train.txt         
adc.json  peyma  peyma.zip  sample_data
dev.txt  test.txt  train.txt


In [14]:
sentences, labels = ner_model.load_test_datasets(dataset_name="peyma", dataset_dir="./peyma/")
print(len(sentences), len(labels))
print(sentences[0])
print(labels[0])

1026 1026
['کنایه', 'سرلشگر', 'فیروزآبادی', 'به', 'پادشاه', 'عربستان', 'و', 'پسرش']
['O', 'O', 'B_ORG', 'O', 'O', 'B_LOC', 'O', 'O']


In [15]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B_LOC', 'I_ORG', 'B_DAT', 'O', 'B_MON', 'I_PCT', 'I_TIM', 'I_PER', 'I_MON', 'B_PER', 'I_DAT', 'B_PCT', 'I_LOC', 'B_ORG', 'B_TIM'}
intersection: {'O'}
model_labels-dataset_labels: ['B-fac', 'I-pro', 'I-event', 'B-org', 'B-pro', 'B-event', 'I-pers', 'I-fac', 'B-loc', 'B-pers', 'I-org', 'I-loc']
dataset_labels-model_labels: ['B_LOC', 'I_MON', 'B_DAT', 'I_ORG', 'B_PER', 'B_MON', 'I_TIM', 'I_DAT', 'B_PCT', 'I_LOC', 'I_PCT', 'B_ORG', 'I_PER', 'B_TIM']
False


In [16]:
label_translate = {
    'B_ORG': 'B-org', 
    'I_ORG': 'I-org',
    'B_LOC': 'B-loc',
    'I_LOC': 'I-loc',
    'B_PER': 'B-pers', 
    'I_PER': 'I-pers',
    'O': 'O',
    # this model can not support the following entities
    'B_DAT': 'O', 
    'I_DAT': 'O', 
    'B_PCT': 'O', 
    'I_PCT': 'O', 
    'B_TIM': 'O', 
    'I_TIM': 'O', 
    'B_MON': 'O', 
    'I_MON': 'O'
}
labels = ner_model.resolve_input_label_consistency(labels, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B-org', 'O', 'I-pers', 'B-pers', 'B-loc', 'I-org', 'I-loc'}
intersection: {'B-org', 'O', 'I-pers', 'B-pers', 'B-loc', 'I-org', 'I-loc'}
model_labels-dataset_labels: ['B-fac', 'I-pro', 'I-event', 'B-pro', 'B-event', 'I-fac']
dataset_labels-model_labels: []
True


In [17]:
!nvidia-smi
!lscpu

Mon Aug 16 05:01:33 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P0    69W / 149W |   1139MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [18]:
inference_output_peyma = ner_model.ner_evaluation(sentences, labels, device, batch_size=512)

max_len: 151
#samples: 1026
#batch: 3
Start to evaluate test data ...
inference time for step 0: 0.036388676000001396
inference time for step 1: 0.012967459000037707
inference time for step 2: 0.014209151999921232
average loss: 0.05863246818383535
total inference time: 0.06356528699996034
total inference time / #samples: 6.195447076019525e-05
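
The `#batch` and per-sample figures in the log above follow from simple arithmetic on the sample count, the batch size, and the total time. The short check below uses values copied from that log; the variable names are illustrative only.

```python
import math

num_samples = 1026                 # "#samples" in the log above
batch_size = 512                   # value passed to ner_evaluation
total_time = 0.06356528699996034   # "total inference time" in the log above

# The last batch is partial, so the batch count is rounded up.
num_batches = math.ceil(num_samples / batch_size)
time_per_sample = total_time / num_samples

print(num_batches)
print(time_per_sample)
```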


In [19]:
for sample_output in inference_output_peyma[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

کنایه	O	O
سرلشگر	O	O
فیروزابادی	B-org	B-pers
به	O	O
پادشاه	O	O
عربستان	B-loc	B-loc
و	O	O
پسرش	O	O

ريیس	O	O
سابق	O	O
ستاد	B-org	B-org
کل	I-org	I-org
نیروهای	I-org	I-org
مسلح	I-org	I-org
با	O	O
بیان	O	O
اینکه	O	O
ال	O	B-pers
سعود	O	I-pers
با	O	O
حمایت	O	O
همه	O	O
جانبه	O	O
غرب	O	O
بر	O	O
سرزمین	B-loc	O
حجاز	I-loc	B-loc
حاکم	O	O
شد	O	O
گفت	O	O
:	O	O
غرب	O	O
با	O	O
حاکم	O	O
کردد	O	O
ال	O	B-pers
سعود	O	I-pers
بر	O	O
حجاز	B-loc	B-loc
هدفی	O	O
جز	O	O
##ناب	O	O
##ودی	O	O
اسلام	O	O
نداشته	O	O
و	O	O
این	O	O
نقشه	O	O
انگلیس	B-loc	B-loc
بود	O	O
.	O	O

سرلشگر	O	O
حسن	B-pers	B-pers
فیروزابادی	I-pers	I-pers
روز	O	O
دوشنبه	O	O
درحاشیه	O	O
ايین	O	O
ختم	O	O
مادر	O	O
حیدر	B-pers	B-pers
مصلحی	I-pers	I-pers
درجمع	O	O
خبرنگاران	O	O
درباره	O	O
موضوع	O	O
یمن	B-loc	B-loc
افزود	O	O
:	O	O
ماهیت	O	O
انچه	O	O
در	O	O
یمن	B-loc	B-loc
اتفاق	O	O
می	O	O
افتد	O	O
وهابیت	O	O
است	O	O
وهابیت	O	O
یک	O	O
مذهب	O	O
انگلیسی	O	O
است	O	O
.	O	O

وی	O	O
ادامه	O	O
داد	O	O
:	O	O
وقتی	O	O
که	O	O
انقلاب	O	O
اسلامی	O	O
به	O	O
پیروزی	O	

In [20]:
ner_model.evaluate_prediction_results(labels, inference_output_peyma)

Test Accuracy: 0.9560639279944612
Test Precision: 0.6765873015873016
Test Recall: 0.56738768718802
Test F1-Score: 0.6171945701357466
Test classification Report:
              precision    recall  f1-score   support

         loc  0.8220140515 0.5782537068 0.6789168279       607
         org  0.6005706134 0.5921237693 0.5963172805       711
        pers  0.6536458333 0.5175257732 0.5776754891       485

   micro avg  0.6765873016 0.5673876872 0.6171945701      1803
   macro avg  0.6920768328 0.5626344164 0.6176365325      1803
weighted avg  0.6893990375 0.5673876872 0.6191107671      1803
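
The summary rows of the report can be reproduced from the per-class rows, which helps when reading them: the macro average is an unweighted mean over classes, the weighted average uses the support as weights, and each F1 is the harmonic mean of its precision and recall. A small check with the numbers copied from the report above (variable names are illustrative):

```python
# (precision, recall, f1, support) per entity type, copied from the report above.
rows = {
    "loc": (0.8220140515, 0.5782537068, 0.6789168279, 607),
    "org": (0.6005706134, 0.5921237693, 0.5963172805, 711),
    "pers": (0.6536458333, 0.5175257732, 0.5776754891, 485),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total_support = sum(row[3] for row in rows.values())
macro_f1 = sum(row[2] for row in rows.values()) / len(rows)
weighted_precision = sum(row[0] * row[3] for row in rows.values()) / total_support
micro_f1 = f1(0.6765873016, 0.5673876872)  # micro-averaged precision and recall

print(total_support, macro_f1, weighted_precision, micro_f1)
```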



In [21]:
output_file_name = "ner_peyma_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_peyma:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()
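
The output file written above holds one `token<TAB>true_label<TAB>predicted_label` line per token, with a blank line after each sentence. For later analysis it can be read back with a few lines of Python. This is a sketch: `read_token_label_file` is a hypothetical helper, and the demo file is made up for the round trip.

```python
import os
import tempfile


def read_token_label_file(path):
    """Parse blank-line-separated blocks of token<TAB>true<TAB>predicted lines."""
    samples, current = [], []
    with open(path, encoding="utf8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                # A blank line closes the current sentence.
                if current:
                    samples.append(current)
                    current = []
                continue
            token, true_label, predicted_label = line.split("\t")
            current.append((token, true_label, predicted_label))
    if current:
        samples.append(current)
    return samples


# Round trip on a tiny, made-up sample written in the same layout.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_outputs.txt")
    with open(demo_path, "w", encoding="utf8") as handle:
        handle.write("a\tO\tO\nb\tB-loc\tB-loc\n\nc\tO\tB-org\n\n")
    demo_samples = read_token_label_file(demo_path)

print(demo_samples)
```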

#### Arman dataset:
The ARMAN dataset holds 7,682 sentences with 250,015 tokens tagged over six different classes.

1. Organization
2. Location
3. Facility
4. Event
5. Product
6. Person


|     Label    |   #   |
|:------------:|:-----:|
| Organization | 30108 |
|   Location   | 12924 |
|   Facility   |  4458 |
|     Event    |  7557 |
|    Product   |  4389 |
|    Person    | 15645 |

**Download**

You can download the dataset from the [PersianNER repository](https://github.com/HaniehP/PersianNER).


In [22]:
!wget https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
!ls

--2021-08-16 05:02:08--  https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
Resolving github.com (github.com)... 52.192.72.89
Connecting to github.com (github.com)|52.192.72.89|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/HaniehP/PersianNER/master/ArmanPersoNERCorpus.zip [following]
--2021-08-16 05:02:09--  https://raw.githubusercontent.com/HaniehP/PersianNER/master/ArmanPersoNERCorpus.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1931170 (1.8M) [application/zip]
Saving to: ‘ArmanPersoNERCorpus.zip’


2021-08-16 05:02:09 (23.8 MB/s) - ‘ArmanPersoNERCorpus.zip’ saved [1931170/1931170]

adc.json
ArmanPersoNERCorpus.zip
ner_peyma_HooshvareLab-bert-base-parsbert-arman

In [23]:
!unzip ArmanPersoNERCorpus.zip -d arman
!ls

Archive:  ArmanPersoNERCorpus.zip
  inflating: arman/test_fold1.txt    
  inflating: arman/ReadMe.txt        
  inflating: arman/train_fold3.txt   
  inflating: arman/train_fold2.txt   
  inflating: arman/train_fold1.txt   
  inflating: arman/test_fold3.txt    
  inflating: arman/test_fold2.txt    
adc.json
arman
ArmanPersoNERCorpus.zip
ner_peyma_HooshvareLab-bert-base-parsbert-armanner-uncased_outputs.txt
peyma
peyma.zip
sample_data


In [24]:
sentences, labels = ner_model.load_test_datasets(dataset_name="arman", dataset_dir="./arman/")
print(len(sentences), len(labels))
print(sentences[0])
print(labels[0])

7681 7681
['افقی', ':', '0', 'ـ', 'از', 'عوامل', 'دوران', 'پهلوی', 'و', 'نخست\u200cوزیر', 'ایران', 'در', 'سالهای', 'ابتدائی', 'دهه', 'چهل', 'خورشیدی', 'كه', 'جلد', 'سوم', 'یادداشتهایش', 'هم', 'چندی', 'پیش', 'در', 'تهران', 'منتشر', 'شد', '0', 'ـ', 'پرستاری', 'از', 'ناخوش\u200cاحوال', 'ـ', 'پوشاک', 'و', 'جامه', 'ـ', 'فانتزی', 'و', 'شیک', '0', 'ـ', 'در', 'حال', 'وزیدن', 'ـ', 'اطلاعیه', 'ـ', 'پایتخت', 'جمهوری', 'استونی', 'در', 'حوضه', 'بالتیک', '0', 'ـ', 'علم', 'راهبرد', 'مؤسسه', 'و', 'سازمان', 'ـ', 'نوعی', 'شمع', '0', 'ـ', 'حرف', 'جمع', 'مؤنث', 'ـ', 'در', 'ایران', 'به', 'تولیدکننده', 'کتاب', 'اطلاق', 'می\u200cشود', 'ـ', 'از', 'شهرهای', 'باختری', 'افغانستان', 'كه', 'تا', 'عصر', 'ناصرالدین\u200cشاه', 'جزئی', 'از', 'خراسان', 'بود', 'ـ', 'ویتامین', 'انعقاد', '0', 'ـ', 'سبزی', 'غده\u200cای', 'ـ', 'دوستی', 'و', 'محبت', 'ـ', 'داستان', 'بلند', 'ـ', 'شهری', 'در', 'آلمان', '0', 'ـ', 'سلول', 'بدن', 'موجودات', 'ـ', 'از', 'انواع', 'کالباس', '0', 'ـ', 'حاشیه', 'و', 'هامش', 'ـ', 'پیدا', 'نشدنی', 'ـ', 'خ

In [25]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B-fac', 'B-org', 'O', 'B-event', 'I-fac', 'I-pro', 'I-event', 'B-pro', 'I-pers', 'B-pers', 'B-loc', 'I-org', 'I-loc'}
intersection: {'B-fac', 'I-pro', 'I-event', 'B-org', 'O', 'B-pro', 'B-event', 'I-pers', 'I-fac', 'B-loc', 'B-pers', 'I-org', 'I-loc'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True


batch size=256 -> inference time for one batch is about 205 s

batch size=512 -> inference time for one batch is about 410 s

batch size=1024 -> crash

In [26]:
!nvidia-smi
!lscpu

Mon Aug 16 05:02:10 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   63C    P0    73W / 149W |   5371MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [27]:
inference_output_arman = ner_model.ner_evaluation(sentences, labels, device, batch_size=512)

max_len: 253
#samples: 7681
#batch: 16
Start to evaluate test data ...
inference time for step 0: 0.027324765000003026
inference time for step 1: 0.01264835500001027
inference time for step 2: 0.01241664900021533
inference time for step 3: 0.012448098000049868
inference time for step 4: 0.013267498999994132
inference time for step 5: 0.012465513000051942
inference time for step 6: 0.01224539800000457
inference time for step 7: 0.01264471800004685
inference time for step 8: 0.0126662979998855
inference time for step 9: 0.012539189000108308
inference time for step 10: 0.011976841999967291
inference time for step 11: 0.012759697000092274
inference time for step 12: 0.013020728000128656
inference time for step 13: 0.012121246999868163
inference time for step 14: 0.013405248000026404
inference time for step 15: 0.012123001999952976
average loss: 0.01598202728200704
total inference time: 0.21607324600040556
total inference time / #samples: 2.8130874365369818e-05


In [28]:
for sample_output in inference_output_arman[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

افقی	O	O
:	O	O
[UNK]	O	O
[UNK]	O	O
از	O	O
عوامل	O	O
دوران	O	O
پهلوی	O	O
و	O	O
نخستوزیر	O	O
ایران	B-loc	B-loc
در	O	O
سالهای	O	O
ابتدايی	O	O
دهه	O	O
چهل	O	O
خورشیدی	O	O
[UNK]	O	O
جلد	O	O
سوم	O	O
یادداشتهای	O	O
##ش	O	O
هم	O	O
چندی	O	O
پیش	O	O
در	O	O
تهران	B-loc	B-loc
منتشر	O	O
شد	O	O
[UNK]	O	O
[UNK]	O	O
پرستاری	O	O
از	O	O
ناخوش	O	O
##احوال	O	O
[UNK]	O	O
پوشاک	O	O
و	O	O
جامه	O	O
[UNK]	O	O
فانتزی	O	O
و	O	O
شیک	O	O
[UNK]	O	O
[UNK]	O	O
در	O	O
حال	O	O
وزیدن	O	O
[UNK]	O	O
اطلاعیه	O	O
[UNK]	O	O
پایتخت	O	O
جمهوری	O	O
استونی	B-loc	B-loc
در	I-loc	I-loc
حوضه	I-loc	I-loc
بالتیک	I-loc	I-loc
[UNK]	O	O
[UNK]	O	O
علم	O	O
راهبرد	O	O
موسسه	O	O
و	O	O
سازمان	O	O
[UNK]	O	O
نوعی	O	O
شمع	O	O
[UNK]	O	O
[UNK]	O	O
حرف	O	O
جمع	O	O
مونث	O	O
[UNK]	O	O
در	O	O
ایران	B-loc	B-loc
به	O	O
تولیدکننده	O	O
کتاب	O	O
اطلاق	O	O
میشود	O	O
[UNK]	O	O
از	O	O
شهرهای	O	O
باختری	O	O
افغانستان	B-loc	B-loc
[UNK]	O	O
تا	O	O
عصر	O	O
ناصرالدینشاه	B-pers	B-pers
جزيی	O	O
از	O	O
خراسان	B-loc	B-loc
بود	O	O
[UNK]	O	O
ویتامین	O	O
انعقاد	O	O
[UNK]

In [29]:
ner_model.evaluate_prediction_results(labels, inference_output_arman)

Test Accuracy: 0.9835790215540025
Test Precision: 0.8638760444552608
Test Recall: 0.7705499276410999
Test F1-Score: 0.8145485141698857
Test classification Report:
              precision    recall  f1-score   support

       event  0.7631578947 0.7435897436 0.7532467532       585
         fac  0.8619469027 0.8589065256 0.8604240283       567
         loc  0.8927796715 0.8152235427 0.8522407928      3534
         org  0.8930846224 0.8356747552 0.8634264350      4698
        pers  0.8334452653 0.6822429907 0.7503022975      3638
         pro  0.7411167513 0.5488721805 0.6306695464       798

   micro avg  0.8638760445 0.7705499276 0.8145485142     13820
   macro avg  0.8309218513 0.7474182897 0.7850516422     13820
weighted avg  0.8617547916 0.7705499276 0.8125600712     13820



In [30]:
output_file_name = "ner_arman_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_arman:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### Arman+Peyma

In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1WZxpFRtEs5HZWyWQ2Pyg9CCuIBs1Kmvx'})
download.GetContentFile('peyma.zip')
!ls

In [None]:
!unzip peyma.zip
!ls
!ls peyma

In [31]:
sentences_peyma, labels_peyma = ner_model.load_test_datasets(dataset_name="peyma", dataset_dir="./peyma/")
print(len(sentences_peyma), len(labels_peyma))
print(sentences_peyma[0])
print(labels_peyma[0])

1026 1026
['کنایه', 'سرلشگر', 'فیروزآبادی', 'به', 'پادشاه', 'عربستان', 'و', 'پسرش']
['O', 'O', 'B_ORG', 'O', 'O', 'B_LOC', 'O', 'O']


In [32]:
is_consistent = ner_model.check_input_label_consistency(labels_peyma)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B_LOC', 'I_ORG', 'B_DAT', 'O', 'B_MON', 'I_PCT', 'I_TIM', 'I_PER', 'I_MON', 'B_PER', 'I_DAT', 'B_PCT', 'I_LOC', 'B_ORG', 'B_TIM'}
intersection: {'O'}
model_labels-dataset_labels: ['B-fac', 'I-pro', 'I-event', 'B-org', 'B-pro', 'B-event', 'I-pers', 'I-fac', 'B-loc', 'B-pers', 'I-org', 'I-loc']
dataset_labels-model_labels: ['B_LOC', 'I_MON', 'B_DAT', 'I_ORG', 'B_PER', 'B_MON', 'I_TIM', 'I_DAT', 'B_PCT', 'I_LOC', 'I_PCT', 'B_ORG', 'I_PER', 'B_TIM']
False


In [33]:
label_translate = {
    'B_ORG': 'B-org', 
    'I_ORG': 'I-org',
    'B_LOC': 'B-loc',
    'I_LOC': 'I-loc',
    'B_PER': 'B-pers', 
    'I_PER': 'I-pers',
    'O': 'O',
    # the model has no classes for the following entity types, so they map to 'O'
    'B_DAT': 'O', 
    'I_DAT': 'O', 
    'B_PCT': 'O', 
    'I_PCT': 'O', 
    'B_TIM': 'O', 
    'I_TIM': 'O', 
    'B_MON': 'O', 
    'I_MON': 'O'
}
labels_peyma = ner_model.resolve_input_label_consistency(labels_peyma, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_peyma)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B-org', 'O', 'I-pers', 'B-pers', 'B-loc', 'I-org', 'I-loc'}
intersection: {'B-org', 'O', 'I-pers', 'B-pers', 'B-loc', 'I-org', 'I-loc'}
model_labels-dataset_labels: ['B-fac', 'I-pro', 'I-event', 'B-pro', 'B-event', 'I-fac']
dataset_labels-model_labels: []
True


In [None]:
!wget https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
!ls

In [None]:
!unzip ArmanPersoNERCorpus.zip -d arman
!ls

In [34]:
sentences_arman, labels_arman = ner_model.load_test_datasets(dataset_name="arman", dataset_dir="./arman/")
print(len(sentences_arman), len(labels_arman))
print(sentences_arman[0])
print(labels_arman[0])

7681 7681
['افقی', ':', '0', 'ـ', 'از', 'عوامل', 'دوران', 'پهلوی', 'و', 'نخست\u200cوزیر', 'ایران', 'در', 'سالهای', 'ابتدائی', 'دهه', 'چهل', 'خورشیدی', 'كه', 'جلد', 'سوم', 'یادداشتهایش', 'هم', 'چندی', 'پیش', 'در', 'تهران', 'منتشر', 'شد', '0', 'ـ', 'پرستاری', 'از', 'ناخوش\u200cاحوال', 'ـ', 'پوشاک', 'و', 'جامه', 'ـ', 'فانتزی', 'و', 'شیک', '0', 'ـ', 'در', 'حال', 'وزیدن', 'ـ', 'اطلاعیه', 'ـ', 'پایتخت', 'جمهوری', 'استونی', 'در', 'حوضه', 'بالتیک', '0', 'ـ', 'علم', 'راهبرد', 'مؤسسه', 'و', 'سازمان', 'ـ', 'نوعی', 'شمع', '0', 'ـ', 'حرف', 'جمع', 'مؤنث', 'ـ', 'در', 'ایران', 'به', 'تولیدکننده', 'کتاب', 'اطلاق', 'می\u200cشود', 'ـ', 'از', 'شهرهای', 'باختری', 'افغانستان', 'كه', 'تا', 'عصر', 'ناصرالدین\u200cشاه', 'جزئی', 'از', 'خراسان', 'بود', 'ـ', 'ویتامین', 'انعقاد', '0', 'ـ', 'سبزی', 'غده\u200cای', 'ـ', 'دوستی', 'و', 'محبت', 'ـ', 'داستان', 'بلند', 'ـ', 'شهری', 'در', 'آلمان', '0', 'ـ', 'سلول', 'بدن', 'موجودات', 'ـ', 'از', 'انواع', 'کالباس', '0', 'ـ', 'حاشیه', 'و', 'هامش', 'ـ', 'پیدا', 'نشدنی', 'ـ', 'خ

In [35]:
is_consistent = ner_model.check_input_label_consistency(labels_arman)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B-fac', 'B-org', 'O', 'B-event', 'I-fac', 'I-pro', 'I-event', 'B-pro', 'I-pers', 'B-pers', 'B-loc', 'I-org', 'I-loc'}
intersection: {'B-fac', 'I-pro', 'I-event', 'B-org', 'O', 'B-pro', 'B-event', 'I-pers', 'I-fac', 'B-loc', 'B-pers', 'I-org', 'I-loc'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True


In [36]:
sentences = sentences_arman + sentences_peyma
labels = labels_arman + labels_peyma
print(len(sentences), len(labels))

8707 8707


In [37]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B-fac', 'B-org', 'O', 'B-event', 'I-fac', 'I-pro', 'I-event', 'B-pro', 'I-pers', 'B-pers', 'B-loc', 'I-org', 'I-loc'}
intersection: {'B-fac', 'I-pro', 'I-event', 'B-org', 'O', 'B-pro', 'B-event', 'I-pers', 'I-fac', 'B-loc', 'B-pers', 'I-org', 'I-loc'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True


In [38]:
!nvidia-smi
!lscpu

Mon Aug 16 05:08:58 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   71C    P0    72W / 149W |   9439MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [39]:
inference_output = ner_model.ner_evaluation_2(sentences, labels, device, batch_size=512)

len(input_text): 8707
len(input_labels): 8707
max_len: 253
#samples: 8707
#batch: 18
Start to evaluate test data ...
inference time for step 0: 0.03488754699992569
inference time for step 1: 0.014062674999877345
inference time for step 2: 0.01432582900019952
inference time for step 3: 0.012635022000040408
inference time for step 4: 0.017613605000178723
inference time for step 5: 0.015364161999968928
inference time for step 6: 0.013217842999893037
inference time for step 7: 0.013404925000031653
inference time for step 8: 0.012368070999855263
inference time for step 9: 0.012939779000134877
inference time for step 10: 0.013083449000077962
inference time for step 11: 0.013501607000080185
inference time for step 12: 0.012628675999849293
inference time for step 13: 0.013119349000135117
inference time for step 14: 0.016228339000008418
inference time for step 15: 0.012945755999908215
inference time for step 16: 0.012465872000120726
inference time for step 17: 0.01364022500001738
total inferenc

In [40]:
for sample_output in inference_output[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

افقی	O	O
:	O	O
[UNK]	O	O
[UNK]	O	O
از	O	O
عوامل	O	O
دوران	O	O
پهلوی	O	O
و	O	O
نخستوزیر	O	O
ایران	B-loc	B-loc
در	O	O
سالهای	O	O
ابتدايی	O	O
دهه	O	O
چهل	O	O
خورشیدی	O	O
[UNK]	O	O
جلد	O	O
سوم	O	O
یادداشتهای	O	O
##ش	O	O
هم	O	O
چندی	O	O
پیش	O	O
در	O	O
تهران	B-loc	B-loc
منتشر	O	O
شد	O	O
[UNK]	O	O
[UNK]	O	O
پرستاری	O	O
از	O	O
ناخوش	O	O
##احوال	O	O
[UNK]	O	O
پوشاک	O	O
و	O	O
جامه	O	O
[UNK]	O	O
فانتزی	O	O
و	O	O
شیک	O	O
[UNK]	O	O
[UNK]	O	O
در	O	O
حال	O	O
وزیدن	O	O
[UNK]	O	O
اطلاعیه	O	O
[UNK]	O	O
پایتخت	O	O
جمهوری	O	O
استونی	B-loc	B-loc
در	I-loc	I-loc
حوضه	I-loc	I-loc
بالتیک	I-loc	I-loc
[UNK]	O	O
[UNK]	O	O
علم	O	O
راهبرد	O	O
موسسه	O	O
و	O	O
سازمان	O	O
[UNK]	O	O
نوعی	O	O
شمع	O	O
[UNK]	O	O
[UNK]	O	O
حرف	O	O
جمع	O	O
مونث	O	O
[UNK]	O	O
در	O	O
ایران	B-loc	B-loc
به	O	O
تولیدکننده	O	O
کتاب	O	O
اطلاق	O	O
میشود	O	O
[UNK]	O	O
از	O	O
شهرهای	O	O
باختری	O	O
افغانستان	B-loc	B-loc
[UNK]	O	O
تا	O	O
عصر	O	O
ناصرالدینشاه	B-pers	B-pers
جزيی	O	O
از	O	O
خراسان	B-loc	B-loc
بود	O	O
[UNK]	O	O
ویتامین	O	O
انعقاد	O	O
[UNK]

In [41]:
ner_model.evaluate_prediction_results(labels, inference_output)

Test Accuracy: 0.9795462458932185
Test Precision: 0.8362657091561939
Test Recall: 0.7453754080522307
Test F1-Score: 0.7882090158386355
Test classification Report:
              precision    recall  f1-score   support

       event  0.7073170732 0.7435897436 0.7250000000       585
         fac  0.7957516340 0.8589065256 0.8261238338       567
         loc  0.8841931943 0.7780729292 0.8277456647      4141
         org  0.8538794801 0.8016269181 0.8269285782      5409
        pers  0.8137869293 0.6614115935 0.7297297297      4123
         pro  0.7008000000 0.5488721805 0.6156008433       798

   micro avg  0.8362657092 0.7453754081 0.7882090158     15623
   macro avg  0.7926213851 0.7320799817 0.7585214416     15623
weighted avg  0.8359170261 0.7453754081 0.7868536030     15623
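
As a sanity check on the concatenation, the per-class support in this combined report should equal the sum of the supports in the separate ARMAN and PEYMA reports earlier in this notebook. The numbers below are copied from those reports; the variable names are illustrative.

```python
arman_support = {"event": 585, "fac": 567, "loc": 3534, "org": 4698, "pers": 3638, "pro": 798}
peyma_support = {"loc": 607, "org": 711, "pers": 485}

# PEYMA contributes nothing to the classes that were mapped away from it.
combined_support = {
    label: count + peyma_support.get(label, 0)
    for label, count in arman_support.items()
}

print(combined_support)
print(sum(combined_support.values()))
```

The classes missing from PEYMA (`event`, `fac`, `pro`) keep their ARMAN support, which matches the unchanged recall for those rows.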



In [42]:
output_file_name = "ner_arman-and-peyma_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### WikiAnn

https://elisa-ie.github.io/wikiann/

In [43]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1QOG15HU8VfZvJUNKos024xI-OGm0zhEX'})
download.GetContentFile('fa.tar.gz')
!ls

adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner_arman-and-peyma_HooshvareLab-bert-base-parsbert-armanner-uncased_outputs.txt
ner_arman_HooshvareLab-bert-base-parsbert-armanner-uncased_outputs.txt
ner_peyma_HooshvareLab-bert-base-parsbert-armanner-uncased_outputs.txt
peyma
peyma.zip
sample_data


In [44]:
!tar -zxvf fa.tar.gz
!ls

README.txt
wikiann-fa.bio
adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner_arman-and-peyma_HooshvareLab-bert-base-parsbert-armanner-uncased_outputs.txt
ner_arman_HooshvareLab-bert-base-parsbert-armanner-uncased_outputs.txt
ner_peyma_HooshvareLab-bert-base-parsbert-armanner-uncased_outputs.txt
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio


In [45]:
sentences_all, labels_all, sentences_test, labels_test = ner_model.load_datasets(dataset_name="wikiann", dataset_dir="./")
print(len(sentences_all), len(labels_all))
print(len(sentences_test), len(labels_test))
print(sentences_test[0])
print(labels_test[0])

all data: #data: 272266, #labels: 272266


  return array(a, dtype, copy=False, order=order)


without stratify
test part:
 #data: 27227, #labels: 27227
272266 272266
27227 27227
['**', 'زاغی', 'نوک\u200cزرد', ',', "''Pica", 'nuttalli', "''"]
['O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O']


In [46]:
is_consistent = ner_model.check_input_label_consistency(labels_test)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'I-PER', 'B-PER', 'O', 'I-ORG', 'B-LOC', 'B-ORG', 'I-LOC'}
intersection: {'O'}
model_labels-dataset_labels: ['B-fac', 'I-pro', 'I-event', 'B-org', 'B-pro', 'B-event', 'I-pers', 'I-fac', 'B-loc', 'B-pers', 'I-org', 'I-loc']
dataset_labels-model_labels: ['I-PER', 'B-PER', 'B-LOC', 'B-ORG', 'I-ORG', 'I-LOC']
False


In [47]:
label_translate = {
    'B-PER': 'B-pers',
    'I-PER': 'I-pers',
    'B-LOC': 'B-loc',
    'I-LOC': 'I-loc',
    'B-ORG': 'B-org',
    'I-ORG': 'I-org',
    'O': 'O'
}
labels_test = ner_model.resolve_input_label_consistency(labels_test, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_test)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B-org', 'O', 'I-pers', 'B-pers', 'B-loc', 'I-org', 'I-loc'}
intersection: {'B-org', 'O', 'I-pers', 'B-pers', 'B-loc', 'I-org', 'I-loc'}
model_labels-dataset_labels: ['B-fac', 'I-pro', 'I-event', 'B-pro', 'B-event', 'I-fac']
dataset_labels-model_labels: []
True


In [48]:
!nvidia-smi
!lscpu

Mon Aug 16 05:19:33 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    72W / 149W |   9439MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [49]:
inference_output_wikiann = ner_model.ner_evaluation_2(sentences_test, labels_test, device, batch_size=512)

len(input_text): 27227
len(input_labels): 27227
c: 10000
c: 20000
max_len: 95
#samples: 27227
#batch: 54
Start to evaluate test data ...
inference time for step 0: 0.01783539899997777
inference time for step 1: 0.012523351000254479
inference time for step 2: 0.012162539000200923
inference time for step 3: 0.012708841999938159
inference time for step 4: 0.012974639999811188
inference time for step 5: 0.01279431700004352
inference time for step 6: 0.013126357000146527
inference time for step 7: 0.015500496999720781
inference time for step 8: 0.011868716999742901
inference time for step 9: 0.012690967000253295
inference time for step 10: 0.011829684000076668
inference time for step 11: 0.013426444999822706
inference time for step 12: 0.013086638999993738
inference time for step 13: 0.012879284000064217
inference time for step 14: 0.01232760299990332
inference time for step 15: 0.01269249300003139
inference time for step 16: 0.012312423999901512
inference time for step 17: 0.01507743499996

In [50]:
for sample_output in inference_output_wikiann[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

*	O	O
*	O	O
زاغی	B-loc	O
نوک	I-loc	O
##زرد	I-loc	O
,	O	O
'	O	O
'	O	O
pic	O	O
##a	O	O
nut	O	O
##ta	O	O
##ll	O	O
##i	O	O
'	O	O
'	O	O

تغییر	O	O
##مسیر	O	O
مک	B-loc	B-loc
##ویل	B-loc	I-loc
،	B-loc	O
داکوتای	I-loc	B-loc
شمالی	I-loc	I-loc

وست	B-loc	O
یونیورسیتی	I-loc	O
پلیس	I-loc	O
،	I-loc	O
تگزاس	I-loc	B-loc

تغییر	O	O
##مسیر	O	O
دلت	B-pers	O
##ف	B-pers	O
فون	I-pers	O
لیل	I-pers	I-pro
##نس	I-pers	O
##رون	I-pers	O

تغییر	O	O
##مسیر	O	O
نیروگاههای	B-org	O
زنجیرهای	I-org	O
یاسوج	I-org	O
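
In the samples printed above, a word that the tokenizer splits into several pieces (for example `مک` / `##ویل`) carries the same true tag on every piece. The following is a minimal sketch of that kind of alignment; it assumes the word-level tag is simply copied to each piece. `expand_labels_to_pieces` and the toy splitter are hypothetical and stand in for the notebook's tokenizer and evaluation code.

```python
def expand_labels_to_pieces(words, labels, split_word):
    """Repeat each word's tag for every sub-token that `split_word` returns."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        word_pieces = split_word(word)
        pieces.extend(word_pieces)
        # Copy the word-level tag unchanged to all of its pieces.
        piece_labels.extend([label] * len(word_pieces))
    return pieces, piece_labels


# Toy splitter standing in for a real WordPiece tokenizer (made-up vocabulary).
toy_splits = {"McVille": ["mc", "##ville"]}


def toy_split(word):
    return toy_splits.get(word, [word])


pieces, piece_labels = expand_labels_to_pieces(
    ["McVille", "Dakota"], ["B-loc", "I-loc"], toy_split
)
print(list(zip(pieces, piece_labels)))
```

Note that copying a `B-` tag to several pieces yields consecutive `B-` tags, which a strict IOB reading treats as separate entities.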



In [51]:
ner_model.evaluate_prediction_results(labels_test, inference_output_wikiann)

Test Accuracy: 0.4682374703956127
Test Precision: 0.17987630580531158
Test Recall: 0.09447507779658242
Test F1-Score: 0.12388366890380315
Test classification Report:
              precision    recall  f1-score   support

         loc  0.1286806883 0.0695319764 0.0902810383     19358
         org  0.3364599092 0.0979563073 0.1517363717     11352
        pers  0.1831628926 0.1693112762 0.1759649123      5924

   micro avg  0.1798763058 0.0944750778 0.1238836689     36634
   macro avg  0.2161011634 0.1122665200 0.1393274408     36634
weighted avg  0.2018766891 0.0944750778 0.1231803180     36634



In [52]:
output_file_name = "ner_wikiann_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_wikiann:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### Hooshvare - Arman+Peyma+WikiAnn

https://github.com/hooshvare/parsner

In [53]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1fC2WGlpqumUTaT9Dr_U1jO2no3YMKFJ4'})
download.GetContentFile('ner-v1.zip')
!ls

adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner_arman-and-peyma_HooshvareLab-bert-base-parsbert-armanner-uncased_outputs.txt
ner_arman_HooshvareLab-bert-base-parsbert-armanner-uncased_outputs.txt
ner_peyma_HooshvareLab-bert-base-parsbert-armanner-uncased_outputs.txt
ner-v1.zip
ner_wikiann_HooshvareLab-bert-base-parsbert-armanner-uncased_outputs.txt
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio


In [54]:
!unzip ner-v1.zip
!ls
!ls ner

Archive:  ner-v1.zip
   creating: ner/
  inflating: ner/valid.csv           
  inflating: ner/ner.csv             
  inflating: ner/test.csv            
  inflating: ner/train.csv           
adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner
ner_arman-and-peyma_HooshvareLab-bert-base-parsbert-armanner-uncased_outputs.txt
ner_arman_HooshvareLab-bert-base-parsbert-armanner-uncased_outputs.txt
ner_peyma_HooshvareLab-bert-base-parsbert-armanner-uncased_outputs.txt
ner-v1.zip
ner_wikiann_HooshvareLab-bert-base-parsbert-armanner-uncased_outputs.txt
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio
ner.csv  test.csv  train.csv  valid.csv


In [55]:
sentences_paw, labels_paw = ner_model.load_test_datasets(dataset_name="hooshvare-peyman+arman+wikiann", dataset_dir="./ner/")
print(len(sentences_paw), len(labels_paw))
print(sentences_paw[0])
print(labels_paw[0])

test part:
 #sentences: 6049, #sentences_tags: 6049
6049 6049
['همچنین', 'عملیات', 'لرزه\u200cنگاری', 'دوبعدی', 'نیز', 'با', 'فعالیت', 'مستمر', 'چهار', 'گروه', 'کاری', 'در', 'مناطقی', 'که', 'از', 'نظر', 'اکتشافی', 'مورد', 'نظر', 'بود', '،', 'به', 'پایان', 'رسید', 'که', 'نتایج', 'آن', 'در', 'حال', 'بررسی', 'است', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [56]:
is_consistent = ner_model.check_input_label_consistency(labels_paw)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'I-PER', 'B-TIM', 'B-PER', 'I-PRO', 'O', 'I-MON', 'B-FAC', 'B-PRO', 'B-ORG', 'I-ORG', 'B-DAT', 'B-MON', 'I-FAC', 'I-TIM', 'B-LOC', 'B-EVE', 'B-PCT', 'I-DAT', 'I-LOC', 'I-PCT', 'I-EVE'}
intersection: {'O'}
model_labels-dataset_labels: ['B-fac', 'I-pro', 'I-event', 'B-org', 'B-pro', 'B-event', 'I-pers', 'I-fac', 'B-loc', 'B-pers', 'I-org', 'I-loc']
dataset_labels-model_labels: ['I-PRO', 'B-PRO', 'I-ORG', 'B-ORG', 'B-DAT', 'I-FAC', 'I-TIM', 'B-EVE', 'B-PCT', 'I-LOC', 'I-EVE', 'I-PER', 'B-TIM', 'B-PER', 'I-MON', 'B-FAC', 'B-MON', 'B-LOC', 'I-DAT', 'I-PCT']
False
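
The check fails because the comparison is case-sensitive and the two tag inventories use different names (`B-PER` vs `B-pers`, `B-EVE` vs `B-event`), so only `O` is shared. The internals of `check_input_label_consistency` are not shown in this notebook. The following is a minimal sketch of an equivalent set comparison, with a hypothetical helper name and with "consistent" assumed to mean that every dataset tag is known to the model:

```python
def compare_label_sets(model_labels, dataset_labels):
    """Hypothetical stand-in for the set comparison printed above."""
    model_set = set(model_labels)
    # dataset_labels is a list of per-sentence tag lists, so flatten it first.
    dataset_set = {tag for sentence in dataset_labels for tag in sentence}
    return {
        "intersection": model_set & dataset_set,
        "model_only": sorted(model_set - dataset_set),
        "dataset_only": sorted(dataset_set - model_set),
        # Assumption: consistency means the dataset uses no tag unknown to the model.
        "consistent": dataset_set <= model_set,
    }


report = compare_label_sets(
    ["B-loc", "I-loc", "B-pers", "I-pers", "O"],
    [["B-LOC", "I-LOC", "O"], ["B-PER", "O"]],
)
print(report["intersection"])
print(report["dataset_only"])
print(report["consistent"])
```
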


In [57]:
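# Map the dataset's tag names onto the model's label set. Entity types the model
# does not cover (MON, DAT, PCT, TIM) are collapsed to 'O'.
label_translate = {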
    'B-LOC': 'B-loc', 
    'I-LOC': 'I-loc', 
    'B-EVE': 'B-event', 
    'I-EVE': 'I-event', 
    'B-ORG': 'B-org', 
    'I-ORG': 'I-org', 
    'B-MON': 'O', 
    'I-MON': 'O', 
    'B-DAT': 'O', 
    'I-DAT': 'O', 
    'B-PRO': 'B-pro', 
    'I-PRO': 'I-pro',
    'B-FAC': 'B-fac', 
    'I-FAC': 'I-fac', 
    'B-PCT': 'O', 
    'I-PCT': 'O', 
    'B-TIM': 'O', 
    'I-TIM': 'O', 
    'B-PER': 'B-pers', 
    'I-PER': 'I-pers', 
    'O': 'O'
}
labels_paw = ner_model.resolve_input_label_consistency(labels_paw, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_paw)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B-fac', 'B-org', 'O', 'B-event', 'I-fac', 'I-pro', 'I-event', 'B-pro', 'I-pers', 'B-pers', 'B-loc', 'I-org', 'I-loc'}
intersection: {'B-fac', 'I-pro', 'I-event', 'B-org', 'O', 'B-pro', 'B-event', 'I-pers', 'I-fac', 'B-loc', 'B-pers', 'I-org', 'I-loc'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True
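
After remapping, the dataset tags match the model's 13 labels exactly. The remapping itself amounts to a per-tag dictionary lookup. Below is a minimal sketch of that idea; `translate_labels` is a hypothetical name, and the real `resolve_input_label_consistency` may behave differently, for example when a tag has no translation:

```python
def translate_labels(labels, mapping):
    """Rename every tag in a list of per-sentence tag lists via `mapping`."""
    missing = {tag for sentence in labels for tag in sentence if tag not in mapping}
    if missing:
        # Assumption: fail loudly rather than silently keeping unknown tags.
        raise KeyError("no translation for tags: {}".format(sorted(missing)))
    return [[mapping[tag] for tag in sentence] for sentence in labels]


mapping = {"B-PER": "B-pers", "I-PER": "I-pers", "B-DAT": "O", "I-DAT": "O", "O": "O"}
translated = translate_labels([["B-PER", "I-PER", "O", "B-DAT", "I-DAT"]], mapping)
print(translated)
```

Collapsing a type to `O` has a side effect on evaluation: date, time, money, and percent entities in the gold data are treated as non-entities, so they neither help nor hurt the reported scores.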


In [58]:
!nvidia-smi
!lscpu

Mon Aug 16 05:28:08 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P0    72W / 149W |   3897MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [59]:
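# Run batched inference on the test set; the result holds, per sentence,
# (token, true_label, predicted_label) triples (see the next cell).
inference_output = ner_model.ner_evaluation_2(sentences_paw, labels_paw, device, batch_size=256)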

len(input_text): 6049
len(input_labels): 6049
max_len: 448
#samples: 6049
#batch: 24
Start to evaluate test data ...
inference time for step 0: 0.028818677999879583
inference time for step 1: 0.014001434999954654
inference time for step 2: 0.012747524000133126
inference time for step 3: 0.013768861000244215
inference time for step 4: 0.013666740000189748
inference time for step 5: 0.012375746999623516
inference time for step 6: 0.01341209100019114
inference time for step 7: 0.012407443000029161
inference time for step 8: 0.012912472000152775
inference time for step 9: 0.013608700999611756
inference time for step 10: 0.012945288999617333
inference time for step 11: 0.013774329000170837
inference time for step 12: 0.012789027000053466
inference time for step 13: 0.01334026599988647
inference time for step 14: 0.013002834999952029
inference time for step 15: 0.012671259999933682
inference time for step 16: 0.018540628999744513
inference time for step 17: 0.013168282000151521
inference tim
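
The log reports 24 batches for 6049 sentences at `batch_size=256`, which is the ceiling of 6049 / 256. How `ner_evaluation_2` slices its input is not shown here, so the helper below is only an illustrative sketch of fixed-size batching with a shorter final batch:

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most `batch_size` items (hypothetical helper)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


num_samples, batch_size = 6049, 256
num_batches = math.ceil(num_samples / batch_size)
sizes = [len(batch) for batch in iter_batches(list(range(num_samples)), batch_size)]
print(num_batches, sizes[-1])  # 24 161
```
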

In [60]:
for sample_output in inference_output[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

همچنین	O	O
عملیات	O	O
لرزهنگاری	O	O
دوبعدی	O	O
نیز	O	O
با	O	O
فعالیت	O	O
مستمر	O	O
چهار	O	O
گروه	O	O
کاری	O	O
در	O	O
مناطقی	O	O
که	O	O
از	O	O
نظر	O	O
اکتشافی	O	O
مورد	O	O
نظر	O	O
بود	O	O
،	O	O
به	O	O
پایان	O	O
رسید	O	O
که	O	O
نتایج	O	O
ان	O	O
در	O	O
حال	O	O
بررسی	O	O
است	O	O
.	O	O

محدث	B-pers	O
در	O	O
مورد	O	O
مشارکت	O	O
شرکتهای	O	O
خارجی	O	O
در	O	O
فعالیتهای	O	O
اکتشافی	O	O
کشور	O	O
گفت	O	O
:	O	O
تاکنون	O	O
چند	O	O
منطقه	O	O
اکتشافی	O	O
را	O	O
برای	O	O
مشارکت	O	O
و	O	O
سرمایهگذاری	O	O
شرکتهای	O	O
خارجی	O	O
اعلام	O	O
کردهایم	O	O
و	O	O
در	O	O
حال	O	O
مذاکره	O	O
با	O	O
طرفهای	O	O
خارجی	O	O
هستیم	O	O
و	O	O
انتظار	O	O
میرود	O	O
تا	O	O
اخر	O	O
امسال	O	O
بتوانیم	O	O
چند	O	O
قرارداد	O	O
را	O	O
نهایی	O	O
کنیم	O	O
.	O	O

مدیر	O	O
امور	B-org	B-org
اکتشاف	I-org	I-org
شرکت	I-org	I-org
ملی	I-org	I-org
نفت	I-org	I-org
فرو	O	O
##افتادگی	O	O
دزفول	B-loc	B-loc
و	O	O
منطقه	B-loc	B-loc
گسل	I-loc	I-loc
کازرون	I-loc	I-loc
تا	O	O
بالارو	B-loc	B-loc
##د	B-loc	I-loc
در	O	O
اطراف	O	O
لرستان	B-loc	B-loc
را	O	O
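
Rows whose token starts with `##` are WordPiece continuation pieces, so one word can span several rows. To read the output at word level, the pieces can be glued back together. The sketch below uses a hypothetical helper and assumes the labels of the first piece represent the whole word; the notebook does not state how labels are assigned to continuation pieces:

```python
def merge_wordpieces(rows):
    """Merge '##' continuation rows into the preceding row.

    rows: list of (token, true_label, predicted_label) tuples.
    Assumption: the first piece's labels stand for the whole word.
    """
    merged = []
    for token, true_label, predicted_label in rows:
        if token.startswith("##") and merged:
            prev_token, prev_true, prev_pred = merged[-1]
            merged[-1] = (prev_token + token[2:], prev_true, prev_pred)
        else:
            merged.append((token, true_label, predicted_label))
    return merged


rows = [("فرو", "O", "O"), ("##افتادگی", "O", "O"), ("دزفول", "B-loc", "B-loc")]
words = merge_wordpieces(rows)
for word, true_label, predicted_label in words:
    print("{}\t{}\t{}".format(word, true_label, predicted_label))
```
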

In [61]:
ner_model.evaluate_prediction_results(labels_paw, inference_output)

Test Accuracy: 0.9709172482552343
Test Precision: 0.7479296653431651
Test Recall: 0.6612175308394344
Test F1-Score: 0.7019056744384116
Test classification Report:
              precision    recall  f1-score   support

       event  0.4957264957 0.6744186047 0.5714285714       258
         fac  0.6063218391 0.8210116732 0.6975206612       257
         loc  0.8546387345 0.6751097602 0.7543396226      2961
         org  0.7083983765 0.6877841770 0.6979390957      3299
        pers  0.7860652077 0.6223479491 0.6946911387      2828
         pro  0.5373134328 0.4891304348 0.5120910384       368

   micro avg  0.7479296653 0.6612175308 0.7019056744      9971
   macro avg  0.6647440144 0.6616337665 0.6546683547      9971
weighted avg  0.7594060211 0.6612175308 0.7036233199      9971
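
The three average rows can be reproduced from the per-class rows. The macro average is the unweighted mean over classes, the weighted average weights each class by its support, and the micro F1 is the harmonic mean of the micro precision and recall. The sketch below recomputes them from the numbers printed above; no library call is involved, and the variable names are illustrative:

```python
# (precision, recall, support) per entity type, copied from the report above.
report = {
    "event": (0.4957264957, 0.6744186047, 258),
    "fac": (0.6063218391, 0.8210116732, 257),
    "loc": (0.8546387345, 0.6751097602, 2961),
    "org": (0.7083983765, 0.6877841770, 3299),
    "pers": (0.7860652077, 0.6223479491, 2828),
    "pro": (0.5373134328, 0.4891304348, 368),
}

total_support = sum(support for _, _, support in report.values())

# Macro: every class counts equally, regardless of how many entities it has.
macro_precision = sum(p for p, _, _ in report.values()) / len(report)
macro_recall = sum(r for _, r, _ in report.values()) / len(report)

# Weighted: each class contributes in proportion to its support.
weighted_precision = sum(p * s for p, _, s in report.values()) / total_support
weighted_recall = sum(r * s for _, r, s in report.values()) / total_support

# Micro F1 from the micro-averaged precision and recall printed in the report.
micro_precision, micro_recall = 0.7479296653, 0.6612175308
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

print(total_support)
print(round(macro_precision, 6), round(macro_recall, 6))
print(round(weighted_precision, 6), round(weighted_recall, 6))
print(round(micro_f1, 6))
```

The weighted recall equals the micro recall because both reduce to the number of correctly found entities divided by the total support. The macro scores sit lower than the micro scores here because the small classes (`event`, `pro`) have the weakest precision.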

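
The gap between the accuracy (about 0.97) and the F1-score (about 0.70) is typical when accuracy is counted per token and precision, recall, and F1 are counted per entity. Most tokens are `O`, and an entity is only correct if its type and both boundaries match. The per-type report layout suggests entity-level scoring, but the scoring code is not shown in this notebook, so the following is a sketch under that assumption, with hypothetical helper names:

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in an IOB sequence (end exclusive)."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel flushes the last span
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current
        if current is not None and not continues:
            spans.add((current, start, i))
            current = None
        # A stray 'I-' tag without a preceding 'B-' is treated leniently as a start.
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, entity_type
    return spans


true_tags = ["O", "B-org", "I-org", "I-org", "O", "B-loc"]
pred_tags = ["O", "B-org", "I-org", "O", "O", "B-loc"]

true_spans = extract_entities(true_tags)
pred_spans = extract_entities(pred_tags)
correct = len(true_spans & pred_spans)

token_accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)
precision = correct / len(pred_spans)
recall = correct / len(true_spans)
print(token_accuracy, precision, recall)
```

A single wrong token costs one sixth of the token accuracy in this toy example, but it invalidates the whole `org` entity, so entity-level precision and recall both drop to 0.5.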


In [62]:
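# Colab/PyDrive imports for the upload step (harmless if already imported earlier in the notebook).
from google.colab import auth
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from oauth2client.client import GoogleCredentials

# Write one "token<TAB>true_label<TAB>predicted_label" row per token,
# with a blank line between sentences.
output_file_name = "ner_arman-and-peyma-and-wikiann_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')

# Authenticate the Colab user and upload the results file to Google Drive.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()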