# ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Google’s BERT architecture with the same configurations as **BERT-Base**.

Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)

**All the models (downstream tasks) are uncased** and trained with whole word masking. (Coming soon, stay tuned.)



## Persian NER [ARMAN, PEYMA, COMPOSITE]

This task aims to extract named entities from the text, such as names, and label them with the appropriate **NER** classes, such as locations, organizations, etc. The datasets used for this task contain sentences annotated in the **IOB** format. In this format, tokens that are not part of an entity are tagged as **"O"**, the **"B"** tag marks the first token of an entity, and the **"I"** tag marks the remaining tokens of the same entity. Both **"B"** and **"I"** tags are followed by a hyphen (or underscore) and then the entity category. Therefore, the **NER task is a multi-class token classification problem that labels each token of a raw input text**.
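
The IOB scheme can be decoded back into entity spans. The following is a minimal sketch and not part of the ParsBERT code: `iob_to_spans` is a hypothetical helper name, and treating a stray `I` tag (one with no preceding `B` of the same category) as the start of a new entity is an assumption made here for leniency.

```python
from typing import List, Tuple


def iob_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB tags into (category, start, end) spans; `end` is exclusive."""
    spans = []
    start, category = None, None
    for i, tag in enumerate(tags):
        # "B_ORG" or "B-ORG" -> prefix "B", label "ORG"; "O" has no label.
        prefix, label = ("O", None) if tag == "O" else (tag[0], tag[2:])
        if prefix == "I" and label == category:
            continue  # still inside the currently open entity
        if category is not None:
            spans.append((category, start, i))  # close the open entity
        if prefix in ("B", "I"):
            start, category = i, label  # open a new entity (stray "I" included)
        else:
            start, category = None, None
    if category is not None:
        spans.append((category, start, len(tags)))  # entity running to the end
    return spans


print(iob_to_spans(["O", "O", "B_ORG", "O", "O", "B_LOC", "O", "O"]))
# [('ORG', 2, 3), ('LOC', 5, 6)]
print(iob_to_spans(["B_ORG", "I_ORG", "I_ORG", "I_ORG", "O"]))
# [('ORG', 0, 4)]
```

Because the end index is exclusive, `tokens[start:end]` returns the tokens of each entity.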

There are two primary datasets used in Persian NER: **ARMAN** and **PEYMA**. In ParsBERT, we prepared NER models for both datasets as well as for a combination of the two.


In [None]:
!nvidia-smi
!lscpu

In [2]:
!pip install transformers==4.7.0
!pip install hazm==0.7.0
!pip install seqeval==1.2.2

Collecting transformers==4.7.0
  Downloading transformers-4.7.0-py3-none-any.whl (2.5 MB)
Collecting huggingface-hub==0.0.8
  Downloading huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: tokenizers, sacremoses, huggingface-hub, transformers
Successfully installed huggingface-hub-0.0.8 sacremoses-0.0.45 tokenizers-0.10.3 transformers-4.7.0
Collecting hazm==0.7.0
  Downloading hazm-0.7.0-py3-none-any.whl (316 kB)
Collecting nltk==3.3
  Downloading nltk-3.3.0.zip (1.4

In [3]:
!pip install PyDrive
import os
import IPython.display as ipd
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)



In [4]:
import os
import gc
import ast
import time
import hazm
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

import transformers
from transformers import AutoTokenizer, AutoConfig
from transformers import AutoModelForTokenClassification

from IPython.display import display, HTML, clear_output
from ipywidgets import widgets, Layout

from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from seqeval.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

print()
print('numpy', np.__version__)
print('pandas', pd.__version__)
print('transformers', transformers.__version__)
print('torch', torch.__version__)
print()

# If there's a GPU available...
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")


numpy 1.19.5
pandas 1.1.5
transformers 4.7.0
torch 1.9.0+cu102

There are 1 GPU(s) available.
We will use the GPU: Tesla K80


In [5]:
class NER:
    def __init__(self, model_name):
        self.normalizer = hazm.Normalizer()
        self.model_name = model_name
        self.config = AutoConfig.from_pretrained(self.model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(self.model_name)
        # self.labels = list(self.config.label2id.keys())
        self.id2label = self.config.id2label

    @staticmethod
    def load_ner_data(file_path, word_index, tag_index, delimiter, join=False):
        dataset, labels = [], []
        with open(file_path, encoding="utf8") as infile:
            sample_text, sample_label = [], []
            for line in infile:
                parts = line.strip().split(delimiter)
                if len(parts) > 1:
                    word, tag = parts[word_index], parts[tag_index]
                    if not word:
                        continue
                    sample_text.append(word)
                    sample_label.append(tag)
                else:
                    # end of sample
                    if sample_text and sample_label:
                        if join:
                            dataset.append(' '.join(sample_text))
                            labels.append(' '.join(sample_label))
                        else:
                            dataset.append(sample_text)
                            labels.append(sample_label)
                    sample_text, sample_label = [], []
        if sample_text and sample_label:
            if join:
                dataset.append(' '.join(sample_text))
                labels.append(' '.join(sample_label))
            else:
                dataset.append(sample_text)
                labels.append(sample_label)
        return dataset, labels

    def load_test_datasets(self, dataset_name, dataset_dir, **kwargs):
        if dataset_name.lower() == "peyma":
            ner_file_path = dataset_dir + 'test.txt'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            return self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter='|',
                                      join=kwargs.get('join', False))
        elif dataset_name.lower() == "arman":
            dataset, labels = [], []
            for i in range(1, 4):
                ner_file_path = dataset_dir + f'test_fold{i}.txt'
                if not os.path.exists(ner_file_path):
                    print(ner_file_path)
                dataset_part, labels_part = self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter=' ',
                                                               join=kwargs.get('join', False))
                dataset += dataset_part
                labels += labels_part
            return dataset, labels
        elif dataset_name.lower() == "hooshvare-peyman+arman+wikiann":
            ner_file_path = dataset_dir + 'test.csv'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            data = pd.read_csv(ner_file_path, delimiter="\t")
            sentences, sentences_tags = data['tokens'].values.tolist(), data['ner_tags'].values.tolist()
            sentences = [ast.literal_eval(ss) for ss in sentences]
            sentences_tags = [ast.literal_eval(ss) for ss in sentences_tags]
            print(f'test part:\n #sentences: {len(sentences)}, #sentences_tags: {len(sentences_tags)}')
            return sentences, sentences_tags

    def load_datasets(self, dataset_name, dataset_dir, **kwargs):
        if dataset_name.lower() == "farsiyar":
            dataset, labels = [], []
            for i in range(1, 6):
                ner_file_path = dataset_dir + f'Persian-NER-part{i}.txt'
                if not os.path.exists(ner_file_path):
                    print(ner_file_path)
                dataset_part, labels_part = self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter='\t',
                                                               join=kwargs.get('join', False))
                dataset += dataset_part
                labels += labels_part
            return dataset, labels
        elif dataset_name.lower() == "wikiann":
            ner_file_path = dataset_dir + 'wikiann-fa.bio'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            dataset_all, labels_all = self.load_ner_data(ner_file_path, word_index=0, tag_index=-1, delimiter=' ',
                                                         join=kwargs.get('join', False))
            print(f'all data: #data: {len(dataset_all)}, #labels: {len(labels_all)}')

            try:
                _, data_test, _, label_test = train_test_split(dataset_all, labels_all, test_size=0.1, random_state=1,
                                                               stratify=labels_all)
                print("with stratify")
            except Exception:  # fall back to a plain split if stratification is not possible
                _, data_test, _, label_test = train_test_split(dataset_all, labels_all, test_size=0.1, random_state=1)
                print("without stratify")
            print(f'test part:\n #data: {len(data_test)}, #labels: {len(label_test)}')
            return dataset_all, labels_all, data_test, label_test

    def ner_inference(self, input_text, device, max_length):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong: the model, tokenizer, or label map is not loaded!')
            return

        pt_batch = self.tokenizer(
            [self.normalizer.normalize(sequence) for sequence in input_text],
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt"
        )

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        pt_batch = pt_batch.to(device)
        with torch.no_grad():  # inference only, so gradients are not needed
            pt_outputs = self.model(**pt_batch)
        pt_predictions = torch.argmax(pt_outputs.logits, dim=-1)
        pt_predictions = pt_predictions.cpu().detach().numpy().tolist()

        output_predictions = []
        for i, sequence in enumerate(input_text):
            tokens = self.tokenizer.tokenize(self.tokenizer.decode(self.tokenizer.encode(sequence)))
            predictions = [(token, self.id2label[prediction]) for token, prediction in
                           zip(tokens, pt_predictions[i])]
            output_predictions.append(predictions)
        return output_predictions

    def ner_evaluation(self, input_text, input_labels, device, batch_size=4):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong: the model, tokenizer, or label map is not loaded!')
            return

        max_len = 0
        tokenized_texts, new_labels = [], []
        for sentence, sentence_label in zip(input_text, input_labels):
            if type(sentence) == str:
                sentence = sentence.strip().split()
            if len(sentence) != len(sentence_label):
                print('Something went wrong: a sentence and its label sequence differ in length!')
                return
            tokenized_sentence, new_sentence_label = [], []
            for word, label in zip(sentence, sentence_label):
                # Tokenize the word and count # of subwords the word is broken into
                tokenized_word = self.tokenizer.tokenize(word)
                n_subwords = len(tokenized_word)

                # Add the tokenized word to the final tokenized word list
                tokenized_sentence.extend(tokenized_word)
                # Add the same label to the new list of labels `n_subwords` times
                new_sentence_label.extend([label] * n_subwords)

            max_len = max(max_len, len(tokenized_sentence))
            tokenized_texts.append(tokenized_sentence)
            new_labels.append(new_sentence_label)

        max_len = min(max_len, self.config.max_position_embeddings)
        print("max_len:", max_len)
        input_ids = pad_sequences([self.tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                                  maxlen=max_len, dtype="long", value=self.config.pad_token_id,
                                  truncating="post", padding="post")
        del tokenized_texts
        input_labels = pad_sequences([[self.config.label2id.get(l) for l in lab] for lab in new_labels],
                                     maxlen=max_len, value=self.config.label2id.get('O'), padding="post",
                                     dtype="long", truncating="post")
        del new_labels

        train_data = TensorDataset(torch.tensor(input_ids), torch.tensor(input_labels))
        data_loader = DataLoader(train_data, batch_size=batch_size)
        # data_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)
        print("#samples:", len(input_ids))
        print("#batch:", len(data_loader))

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        total_loss, total_time = 0, 0
        output_predictions = []
        print("Start to evaluate test data ...")
        for step, batch in enumerate(data_loader):
            b_input_ids, b_labels = batch

            # move tensors to GPU if CUDA is available
            b_input_ids = b_input_ids.to(device)
            b_labels = b_labels.to(device)

            # Because `labels` is provided, the output contains the loss in addition to the logits.
            with torch.no_grad():
                start = time.monotonic()
                outputs = self.model(b_input_ids, labels=b_labels)
                end = time.monotonic()
                total_time += end - start
                print(f'inference time for step {step}: {end - start}')
            # get the loss
            total_loss += outputs.loss.item()

            b_predictions = torch.argmax(outputs.logits, dim=2)
            b_predictions = b_predictions.cpu().detach().numpy().tolist()
            b_labels = b_labels.cpu().detach().numpy().tolist()

            for i, sample in enumerate(b_input_ids):
                sample_input = list(sample)
                # remove pad tokens
                while sample_input and sample_input[-1] == self.config.pad_token_id:
                    sample_input.pop()
                # tokens = self.tokenizer.tokenize(self.tokenizer.decode(sample_input))
                tokens = [self.tokenizer.decode([t]) for t in sample_input]
                sample_true_labels = [self.id2label[e] for e in b_labels[i][:len(sample_input)]]
                sample_predictions = [self.id2label[e] for e in b_predictions[i][:len(sample_input)]]
                output_predictions.append(
                    [(t, sample_true_labels[j], sample_predictions[j]) for j, t in enumerate(tokens)])

        # Calculate the average loss over the evaluation batches.
        avg_loss = total_loss / len(data_loader)
        print("average loss:", avg_loss)
        print("total inference time:", total_time)
        print("total inference time / #samples:", total_time / len(input_ids))

        return output_predictions

    def ner_evaluation_2(self, input_text, input_labels, device, batch_size=4):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong: the model, tokenizer, or label map is not loaded!')
            return

        print("len(input_text):", len(input_text))
        print("len(input_labels):", len(input_labels))
        c = 0
        max_len = 0
        tokenized_texts, new_labels = [], []
        for sentence, sentence_label in zip(input_text, input_labels):
            if type(sentence) == str:
                sentence = sentence.strip().split()
            if len(sentence) != len(sentence_label):
                print('Something went wrong: a sentence and its label sequence differ in length!')
                return
            tokenized_words = self.tokenizer(sentence, padding=False, add_special_tokens=False).input_ids
            tokenized_sentence_ids, new_sentence_label = [], []
            for i, tokenized_word in enumerate(tokenized_words):
                # Add the tokenized word to the final tokenized word list
                tokenized_sentence_ids += tokenized_word
                # Add the same label to the new list of labels `number of subwords` times
                new_sentence_label.extend([self.config.label2id.get(sentence_label[i])] * len(tokenized_word))

            max_len = max(max_len, len(tokenized_sentence_ids))
            tokenized_texts.append(tokenized_sentence_ids)
            new_labels.append(new_sentence_label)
            c += 1
            if c % 10000 == 0:
                print("c:", c)
        max_len = min(max_len, self.config.max_position_embeddings)
        print("max_len:", max_len)
        input_ids = pad_sequences(tokenized_texts, maxlen=max_len, dtype="long", value=self.config.pad_token_id,
                                  truncating="post", padding="post")
        del tokenized_texts
        input_labels = pad_sequences(new_labels, maxlen=max_len, value=self.config.label2id.get('O'), padding="post",
                                     dtype="long", truncating="post")
        del new_labels

        train_data = TensorDataset(torch.tensor(input_ids), torch.tensor(input_labels))
        data_loader = DataLoader(train_data, batch_size=batch_size)
        # data_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)
        print("#samples:", len(input_ids))
        print("#batch:", len(data_loader))

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        total_time = 0
        output_predictions = []
        print("Start to evaluate test data ...")
        for step, batch in enumerate(data_loader):
            b_input_ids, b_labels = batch

            # move tensors to GPU if CUDA is available
            b_input_ids = b_input_ids.to(device)
            b_labels = b_labels.to(device)

            # Because `labels` is provided, the output contains the loss in addition to the logits.
            with torch.no_grad():
                start = time.monotonic()
                outputs = self.model(b_input_ids, labels=b_labels)
                end = time.monotonic()
                total_time += end - start
                print(f'inference time for step {step}: {end - start}')

            b_predictions = torch.argmax(outputs.logits, dim=2)
            b_predictions = b_predictions.cpu().detach().numpy().tolist()
            b_labels = b_labels.cpu().detach().numpy().tolist()

            for i, sample in enumerate(b_input_ids):
                sample_input = list(sample)
                # remove pad tokens
                while sample_input and sample_input[-1] == self.config.pad_token_id:
                    sample_input.pop()
                # tokens = self.tokenizer.tokenize(self.tokenizer.decode(sample_input))
                tokens = [self.tokenizer.decode([t]) for t in sample_input]
                sample_true_labels = [self.id2label[e] for e in b_labels[i][:len(sample_input)]]
                sample_predictions = [self.id2label[e] for e in b_predictions[i][:len(sample_input)]]
                output_predictions.append(
                    [(t, sample_true_labels[j], sample_predictions[j]) for j, t in enumerate(tokens)])

        print("total inference time:", total_time)
        print("total inference time / #samples:", total_time / len(input_ids))

        return output_predictions

    def check_input_label_consistency(self, labels):
        model_labels = self.config.label2id.keys()
        dataset_labels = set()
        for l in labels:
            dataset_labels.update(set(l))
        print("model labels:", model_labels)
        print("dataset labels:", dataset_labels)
        print("intersection:", set(model_labels).intersection(dataset_labels))
        print("model_labels-dataset_labels:", list(set(model_labels) - set(dataset_labels)))
        print("dataset_labels-model_labels:", list(set(dataset_labels) - set(model_labels)))
        if list(set(dataset_labels) - set(model_labels)):
            return False
        return True

    @staticmethod
    def resolve_input_label_consistency(labels, label_translation_map):
        for i, sentence_labels in enumerate(labels):
            for j, label in enumerate(sentence_labels):
                labels[i][j] = label_translation_map.get(label)
        return labels

    @staticmethod
    def evaluate_prediction_results(labels, output_predictions):
        dataset_labels = set()
        for label in labels:
            dataset_labels.update(set(label))

        true_labels, predictions = [], []
        for sample_output in output_predictions:
            sample_true_labels = []
            sample_predicted_labels = []
            for token, true_label, predicted_label in sample_output:
                sample_true_labels.append(true_label)
                if predicted_label in dataset_labels:
                    sample_predicted_labels.append(predicted_label)
                else:
                    sample_predicted_labels.append('O')
            true_labels.append(sample_true_labels)
            predictions.append(sample_predicted_labels)

        print("Test Accuracy: {}".format(accuracy_score(true_labels, predictions)))
        print("Test Precision: {}".format(precision_score(true_labels, predictions)))
        print("Test Recall: {}".format(recall_score(true_labels, predictions)))
        print("Test F1-Score: {}".format(f1_score(true_labels, predictions)))
        print("Test classification Report:\n{}".format(classification_report(true_labels, predictions, digits=10)))


In [6]:
model_name='HooshvareLab/bert-base-parsbert-peymaner-uncased'
ner_model = NER(model_name)


In [7]:
print(ner_model.config)

BertConfig {
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "B_DAT",
    "1": "B_LOC",
    "2": "B_MON",
    "3": "B_ORG",
    "4": "B_PCT",
    "5": "B_PER",
    "6": "B_TIM",
    "7": "I_DAT",
    "8": "I_LOC",
    "9": "I_MON",
    "10": "I_ORG",
    "11": "I_PCT",
    "12": "I_PER",
    "13": "I_TIM",
    "14": "O"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B_DAT": 0,
    "B_LOC": 1,
    "B_MON": 2,
    "B_ORG": 3,
    "B_PCT": 4,
    "B_PER": 5,
    "B_TIM": 6,
    "I_DAT": 7,
    "I_LOC": 8,
    "I_MON": 9,
    "I_ORG": 10,
    "I_PCT": 11,
    "I_PER": 12,
    "I_TIM": 13,
    "O": 14
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pos

#### Sample Inference:

In [None]:
texts = [
    "مدیرکل محیط زیست استان البرز با بیان اینکه با بیان اینکه موضوع شیرابه‌های زباله‌های انتقال یافته در منطقه حلقه دره خطری برای این استان است، گفت: در این مورد گزارشاتی در ۲۵ مرداد ۱۳۹۷ تقدیم مدیران استان شده است.",
    "به گزارش خبرگزاری تسنیم از کرج، حسین محمدی در نشست خبری مشترک با معاون خدمات شهری شهرداری کرج که با حضور مدیرعامل سازمان‌های پسماند، پارک‌ها و فضای سبز و نماینده منابع طبیعی در سالن کنفرانس شهرداری کرج برگزار شد، اظهار داشت: ۸۰٪  جمعیت استان البرز در کلانشهر کرج زندگی می‌کنند.",
    "وی افزود: با همکاری‌های مشترک بین اداره کل محیط زیست و شهرداری کرج برنامه‌های مشترکی برای حفاظت از محیط زیست در شهر کرج در دستور کار قرار گرفته که این اقدامات آثار مثبتی داشته و تاکنون نزدیک به ۱۰۰ میلیارد هزینه جهت خریداری اکس-ریس صورت گرفته است.",
]

In [9]:
inference_output = ner_model.ner_inference(texts, device, ner_model.config.max_position_embeddings)

In [10]:
print(inference_output)

[[('[CLS]', 'O'), ('مدیرکل', 'O'), ('محیط', 'B_ORG'), ('زیست', 'I_ORG'), ('استان', 'I_ORG'), ('البرز', 'I_ORG'), ('با', 'O'), ('بیان', 'O'), ('اینکه', 'O'), ('با', 'O'), ('بیان', 'O'), ('اینکه', 'O'), ('موضوع', 'O'), ('شیرابه', 'O'), ('##های', 'O'), ('زبالههای', 'O'), ('انتقال', 'O'), ('یافته', 'O'), ('در', 'O'), ('منطقه', 'B_LOC'), ('حلقه', 'I_LOC'), ('دره', 'I_LOC'), ('خطری', 'O'), ('برای', 'O'), ('این', 'O'), ('استان', 'O'), ('است', 'O'), ('،', 'O'), ('گفت', 'O'), (':', 'O'), ('در', 'O'), ('این', 'O'), ('مورد', 'O'), ('گزارشاتی', 'O'), ('در', 'O'), ('۲۵', 'B_DAT'), ('مرداد', 'I_DAT'), ('۱۳۹۷', 'I_DAT'), ('تقدیم', 'O'), ('مدیران', 'O'), ('استان', 'O'), ('شده', 'O'), ('است', 'O'), ('.', 'O'), ('[SEP]', 'O')], [('[CLS]', 'O'), ('به', 'O'), ('گزارش', 'O'), ('خبرگزاری', 'B_ORG'), ('تسنیم', 'I_ORG'), ('از', 'O'), ('کرج', 'B_LOC'), ('،', 'O'), ('حسین', 'B_PER'), ('محمدی', 'I_PER'), ('در', 'O'), ('نشست', 'O'), ('خبری', 'O'), ('مشترک', 'O'), ('با', 'O'), ('معاون', 'O'), ('خدمات', 'O'), ('شهر

In [11]:
#@title Live Playground { display-mode: "form" }

css_is_load = False
css = """<style>
.ner-box {
    direction: rtl;
    font-size: 18px !important;
    line-height: 20px !important;
    margin: 0 0 15px;
    padding: 10px;
    text-align: justify;
    color: #343434 !important;
}
.token, .token span {
    display: inline-block !important;
    padding: 2px;
    margin: 2px 0;
}
.token.token-ner {
    background-color: #f6cd61;
    font-weight: bold;
    color: #000;
}
.token.token-ner .ner-label {
    color: #9a1f40;
    margin: 0px 2px;
}
</style>"""

if not css_is_load:
    display(HTML(css))
    css_is_load = True

submit_wd = widgets.Button(description='Send', disabled=False, button_style='success', tooltip='Submit')
text_wd = widgets.Textarea(placeholder='Please enter your text ...', rows=5, layout=Layout(width='90%'))
output_wd = widgets.Output()

display(HTML("""
<h2>Test NER model</h2>
<p style="padding: 2px 20px; margin: 0 0 20px;">
</p>
<br /><br />
"""))

display(text_wd)
display(submit_wd)
display(output_wd)

def submit_text(sender):
    with output_wd:
        clear_output(wait=True)
        text = text_wd.value
        _output = ner_model.ner_inference([text], device, ner_model.config.max_position_embeddings)
        # print(_output)
        pred_sequence = []
        for token, label in _output[0]:
            if token not in ['[CLS]', '[SEP]']:
                if label != 'O':
                    pred_sequence.append(
                        '<span class="token token-ner">%s<span class="ner-label">%s</span></span>' 
                        % (token, label))
                else:
                    pred_sequence.append(
                        '<span class="token">%s</span>' 
                        % token)
            
        html = '<p class="ner-box">%s</p>' % ' '.join(pred_sequence) 
        display(HTML(html))

submit_wd.on_click(submit_text)

Textarea(value='', layout=Layout(width='90%'), placeholder='Please enter your text ...', rows=5)

Button(button_style='success', description='Send', style=ButtonStyle(), tooltip='Submit')

Output()

#### PEYMA dataset:
The PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens, of which 41,148 tokens are tagged with seven different classes:

- Organization
- Money
- Location
- Date
- Time
- Person
- Percent

|     Label    |   #   |
|:------------:|:-----:|
| Organization | 16964 |
|     Money    |  2037 |
|   Location   |  8782 |
|     Date     |  4259 |
|     Time     |  732  |
|    Person    |  7675 |
|    Percent   |  699  |
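
As a quick consistency check on the figures above, the per-class counts can be summed and compared with the stated number of tagged tokens. This is a small illustrative sketch; the variable names are made up here, and the numbers are copied from the table and the sentence preceding it.

```python
# Per-class tagged-token counts, copied from the table above.
peyma_label_counts = {
    "Organization": 16964,
    "Money": 2037,
    "Location": 8782,
    "Date": 4259,
    "Time": 732,
    "Person": 7675,
    "Percent": 699,
}
TOTAL_TOKENS = 302_530  # total tokens in PEYMA, as stated above

tagged_tokens = sum(peyma_label_counts.values())
tagged_share = tagged_tokens / TOTAL_TOKENS

print(tagged_tokens)           # 41148
print(f"{tagged_share:.1%}")   # 13.6%
```

Roughly one token in seven carries an entity tag, so the `O` class dominates. This is one reason token-level accuracy tends to look much higher than entity-level precision, recall, and F1.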

Download
You can download the dataset from [here](https://hooshvare.github.io/docs/datasets/ner), which leads to the following Google Drive file of HooshvareLab:

In [12]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1WZxpFRtEs5HZWyWQ2Pyg9CCuIBs1Kmvx'})
download.GetContentFile('peyma.zip')
!ls

adc.json  peyma.zip  sample_data


In [13]:
!unzip peyma.zip
!ls
!ls peyma

Archive:  peyma.zip
   creating: peyma/
  inflating: peyma/dev.txt           
  inflating: peyma/test.txt          
  inflating: peyma/train.txt         
adc.json  peyma  peyma.zip  sample_data
dev.txt  test.txt  train.txt


In [14]:
sentences, labels = ner_model.load_test_datasets(dataset_name="peyma", dataset_dir="./peyma/")
print(len(sentences), len(labels))
print(sentences[0])
print(labels[0])

1026 1026
['کنایه', 'سرلشگر', 'فیروزآبادی', 'به', 'پادشاه', 'عربستان', 'و', 'پسرش']
['O', 'O', 'B_ORG', 'O', 'O', 'B_LOC', 'O', 'O']


In [15]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'I_PER', 'B_PER', 'I_DAT', 'B_DAT', 'B_LOC', 'I_MON', 'I_PCT', 'I_LOC', 'O', 'B_TIM', 'I_ORG', 'B_ORG', 'I_TIM', 'B_PCT', 'B_MON'}
intersection: {'I_PER', 'B_PER', 'I_DAT', 'I_ORG', 'B_DAT', 'B_LOC', 'I_MON', 'B_ORG', 'I_TIM', 'B_PCT', 'I_PCT', 'I_LOC', 'O', 'B_MON', 'B_TIM'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True


In [16]:
!nvidia-smi
!lscpu

Sun Aug 15 09:27:01 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P0    27W /  70W |   1760MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [17]:
inference_output_peyma = ner_model.ner_evaluation(sentences, labels, device, batch_size=512)

max_len: 151
#samples: 1026
#batch: 3
Start to evaluate test data ...
inference time for step 0: 0.023682450999984894
inference time for step 1: 0.008425877999997056
inference time for step 2: 0.009098503999979357
average loss: 0.039171495785315834
total inference time: 0.04120683299996131
total inference time / #samples: 4.016260526312018e-05


In [18]:
for sample_output in inference_output_peyma[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

کنایه	O	O
سرلشگر	O	O
فیروزابادی	B_ORG	B_PER
به	O	O
پادشاه	O	O
عربستان	B_LOC	B_LOC
و	O	O
پسرش	O	O

ريیس	O	O
سابق	O	O
ستاد	B_ORG	B_ORG
کل	I_ORG	I_ORG
نیروهای	I_ORG	I_ORG
مسلح	I_ORG	I_ORG
با	O	O
بیان	O	O
اینکه	O	O
ال	O	O
سعود	O	O
با	O	O
حمایت	O	O
همه	O	O
جانبه	O	O
غرب	O	O
بر	O	O
سرزمین	B_LOC	O
حجاز	I_LOC	I_LOC
حاکم	O	O
شد	O	O
گفت	O	O
:	O	O
غرب	O	O
با	O	O
حاکم	O	O
کردد	O	O
ال	O	O
سعود	O	O
بر	O	O
حجاز	B_LOC	B_LOC
هدفی	O	O
جز	O	O
##ناب	O	O
##ودی	O	O
اسلام	O	O
نداشته	O	O
و	O	O
این	O	O
نقشه	O	O
انگلیس	B_LOC	B_LOC
بود	O	O
.	O	O

سرلشگر	O	O
حسن	B_PER	B_PER
فیروزابادی	I_PER	I_PER
روز	O	B_DAT
دوشنبه	O	I_DAT
درحاشیه	O	O
ايین	O	O
ختم	O	O
مادر	O	O
حیدر	B_PER	B_PER
مصلحی	I_PER	I_PER
درجمع	O	O
خبرنگاران	O	O
درباره	O	O
موضوع	O	O
یمن	B_LOC	B_LOC
افزود	O	O
:	O	O
ماهیت	O	O
انچه	O	O
در	O	O
یمن	B_LOC	B_LOC
اتفاق	O	O
می	O	O
افتد	O	O
وهابیت	O	O
است	O	O
وهابیت	O	O
یک	O	O
مذهب	O	O
انگلیسی	O	O
است	O	O
.	O	O

وی	O	O
ادامه	O	O
داد	O	O
:	O	O
وقتی	O	O
که	O	O
انقلاب	O	O
اسلامی	O	O
به	O	O
پیروزی	O	O
رسید	O	O
،	O	O
برنا

In [19]:
ner_model.evaluate_prediction_results(labels, inference_output_peyma)

Test Accuracy: 0.9823736441264712
Test Precision: 0.8795950985615344
Test Recall: 0.7784064120697785
Test F1-Score: 0.8259129564782391
Test classification Report:
              precision    recall  f1-score   support

        _DAT  0.8955223881 0.8181818182 0.8551068884       220
        _LOC  0.9219015280 0.8945634267 0.9080267559       607
        _MON  1.0000000000 0.9615384615 0.9803921569        26
        _ORG  0.9103554869 0.8284106892 0.8674521355       711
        _PCT  0.7777777778 0.5600000000 0.6511627907        50
        _PER  0.7500000000 0.5628865979 0.6431095406       485
        _TIM  0.8666666667 0.5909090909 0.7027027027        22

   micro avg  0.8795950986 0.7784064121 0.8259129565      2121
   macro avg  0.8746034068 0.7452128692 0.8011361387      2121
weighted avg  0.8729738140 0.7784064121 0.8210608896      2121
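
To make the summary rows easier to read, the sketch below recomputes two of them from the numbers printed above: the F1-score as the harmonic mean of precision and recall, and the macro-averaged precision as the unweighted mean of the per-class precisions. The function name `f1_score` and the dictionary are illustrative only; the notebook itself relies on `evaluate_prediction_results`.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class precision values copied from the PEYMA report above.
peyma_precision = {
    "_DAT": 0.8955223881,
    "_LOC": 0.9219015280,
    "_MON": 1.0000000000,
    "_ORG": 0.9103554869,
    "_PCT": 0.7777777778,
    "_PER": 0.7500000000,
    "_TIM": 0.8666666667,
}

# Macro average: every class counts equally, regardless of its support.
macro_precision = sum(peyma_precision.values()) / len(peyma_precision)

print(f1_score(0.8795950985615344, 0.7784064120697785))
print(macro_precision)
```

Because the macro average ignores support, small classes such as `_TIM` (22 entities) influence it as much as `_ORG` (711 entities).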



In [20]:
output_file_name = "ner_peyma_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_peyma:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()
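
The output file written above holds one `token<TAB>true_label<TAB>predicted_label` triple per line, with a blank line between sentences. A minimal sketch for reading such a file back is shown below; `parse_ner_output` and `count_mismatches` are hypothetical helpers, and the sample string is a shortened excerpt of the PEYMA output printed earlier.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def parse_ner_output(text: str) -> List[List[Triple]]:
    """Split tab-separated output into sentences of (token, true, predicted)."""
    sentences: List[List[Triple]] = []
    current: List[Triple] = []
    for line in text.split("\n"):
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token, true_label, predicted_label = line.split("\t")
        current.append((token, true_label, predicted_label))
    if current:
        sentences.append(current)
    return sentences


def count_mismatches(sentences: List[List[Triple]]) -> int:
    """Number of tokens whose predicted label differs from the true label."""
    return sum(1 for s in sentences for _, t, p in s if t != p)


sample = "کنایه\tO\tO\nفیروزابادی\tB_ORG\tB_PER\n\nبه\tO\tO\n"
parsed = parse_ner_output(sample)
print(len(parsed), count_mismatches(parsed))
```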

#### Arman dataset:
The ARMAN dataset holds 7,682 sentences with 250,015 tokens tagged over six different classes.

1. Organization
2. Location
3. Facility
4. Event
5. Product
6. Person


|     Label    |   #   |
|:------------:|:-----:|
| Organization | 30108 |
|   Location   | 12924 |
|   Facility   |  4458 |
|     Event    |  7557 |
|    Product   |  4389 |
|    Person    | 15645 |

**Download**

You can download the dataset from the [PersianNER repository](https://github.com/HaniehP/PersianNER).


In [21]:
!wget https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
!ls

--2021-08-15 09:27:47--  https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/HaniehP/PersianNER/master/ArmanPersoNERCorpus.zip [following]
--2021-08-15 09:27:47--  https://raw.githubusercontent.com/HaniehP/PersianNER/master/ArmanPersoNERCorpus.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1931170 (1.8M) [application/zip]
Saving to: ‘ArmanPersoNERCorpus.zip’


2021-08-15 09:27:47 (44.7 MB/s) - ‘ArmanPersoNERCorpus.zip’ saved [1931170/1931170]

adc.json
ArmanPersoNERCorpus.zip
ner_peyma_HooshvareLab-bert-base-parsbert-peyma

In [22]:
!unzip ArmanPersoNERCorpus.zip -d arman
!ls

Archive:  ArmanPersoNERCorpus.zip
  inflating: arman/test_fold1.txt    
  inflating: arman/ReadMe.txt        
  inflating: arman/train_fold3.txt   
  inflating: arman/train_fold2.txt   
  inflating: arman/train_fold1.txt   
  inflating: arman/test_fold3.txt    
  inflating: arman/test_fold2.txt    
adc.json
arman
ArmanPersoNERCorpus.zip
ner_peyma_HooshvareLab-bert-base-parsbert-peymaner-uncased_outputs.txt
peyma
peyma.zip
sample_data


In [23]:
sentences, labels = ner_model.load_test_datasets(dataset_name="arman", dataset_dir="./arman/")
print(len(sentences), len(labels))
print(sentences[0])
print(labels[0])

7681 7681
['افقی', ':', '0', 'ـ', 'از', 'عوامل', 'دوران', 'پهلوی', 'و', 'نخست\u200cوزیر', 'ایران', 'در', 'سالهای', 'ابتدائی', 'دهه', 'چهل', 'خورشیدی', 'كه', 'جلد', 'سوم', 'یادداشتهایش', 'هم', 'چندی', 'پیش', 'در', 'تهران', 'منتشر', 'شد', '0', 'ـ', 'پرستاری', 'از', 'ناخوش\u200cاحوال', 'ـ', 'پوشاک', 'و', 'جامه', 'ـ', 'فانتزی', 'و', 'شیک', '0', 'ـ', 'در', 'حال', 'وزیدن', 'ـ', 'اطلاعیه', 'ـ', 'پایتخت', 'جمهوری', 'استونی', 'در', 'حوضه', 'بالتیک', '0', 'ـ', 'علم', 'راهبرد', 'مؤسسه', 'و', 'سازمان', 'ـ', 'نوعی', 'شمع', '0', 'ـ', 'حرف', 'جمع', 'مؤنث', 'ـ', 'در', 'ایران', 'به', 'تولیدکننده', 'کتاب', 'اطلاق', 'می\u200cشود', 'ـ', 'از', 'شهرهای', 'باختری', 'افغانستان', 'كه', 'تا', 'عصر', 'ناصرالدین\u200cشاه', 'جزئی', 'از', 'خراسان', 'بود', 'ـ', 'ویتامین', 'انعقاد', '0', 'ـ', 'سبزی', 'غده\u200cای', 'ـ', 'دوستی', 'و', 'محبت', 'ـ', 'داستان', 'بلند', 'ـ', 'شهری', 'در', 'آلمان', '0', 'ـ', 'سلول', 'بدن', 'موجودات', 'ـ', 'از', 'انواع', 'کالباس', '0', 'ـ', 'حاشیه', 'و', 'هامش', 'ـ', 'پیدا', 'نشدنی', 'ـ', 'خ

In [24]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'B-org', 'B-fac', 'B-loc', 'I-pro', 'B-pro', 'B-pers', 'O', 'I-pers', 'I-loc', 'B-event', 'I-event', 'I-org', 'I-fac'}
intersection: {'O'}
model_labels-dataset_labels: ['I_PER', 'B_PER', 'I_ORG', 'I_DAT', 'B_DAT', 'B_LOC', 'B_ORG', 'I_MON', 'I_TIM', 'B_PCT', 'I_PCT', 'I_LOC', 'B_MON', 'B_TIM']
dataset_labels-model_labels: ['B-org', 'B-fac', 'I-loc', 'B-event', 'I-event', 'B-loc', 'I-pro', 'B-pers', 'B-pro', 'I-org', 'I-fac', 'I-pers']
False


In [25]:
label_translate = {
    'B-org': 'B_ORG', 
    'I-org': 'I_ORG',
    'B-loc': 'B_LOC',
    'I-loc': 'I_LOC',
    'B-pers': 'B_PER', 
    'I-pers': 'I_PER',
    'O': 'O',
    # the model has no class for the following entity types, so they are mapped to 'O'
    'B-pro': 'O', 
    'I-pro': 'O', 
    'B-fac': 'O', 
    'I-fac': 'O',  
    'B-event': 'O', 
    'I-event': 'O'
}
labels = ner_model.resolve_input_label_consistency(labels, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'I_PER', 'B_PER', 'I_ORG', 'B_ORG', 'B_LOC', 'I_LOC', 'O'}
intersection: {'I_PER', 'B_PER', 'I_ORG', 'B_ORG', 'B_LOC', 'I_LOC', 'O'}
model_labels-dataset_labels: ['I_DAT', 'B_DAT', 'I_MON', 'I_TIM', 'B_PCT', 'I_PCT', 'B_MON', 'B_TIM']
dataset_labels-model_labels: []
True
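
The two outputs above suggest that a dataset counts as consistent when every dataset label is known to the model, even if the model has extra labels the dataset never uses. The sketch below illustrates that logic under this assumption; `check_label_consistency` and `translate_labels` are hypothetical stand-ins, and the notebook's own `check_input_label_consistency` and `resolve_input_label_consistency` may be implemented differently.

```python
from typing import Dict, Iterable, List, Set, Tuple


def check_label_consistency(
    model_labels: Iterable[str], label_sequences: List[List[str]]
) -> Tuple[bool, Set[str]]:
    """Return (is_consistent, dataset labels the model does not know)."""
    dataset_labels = {label for seq in label_sequences for label in seq}
    unknown = dataset_labels - set(model_labels)
    return len(unknown) == 0, unknown


def translate_labels(
    label_sequences: List[List[str]], mapping: Dict[str, str]
) -> List[List[str]]:
    """Rewrite every label through the mapping; a missing key raises KeyError."""
    return [[mapping[label] for label in seq] for seq in label_sequences]


model_labels = ["B_LOC", "I_LOC", "B_PER", "I_PER", "O"]
arman_like = [["B-loc", "I-loc", "O"], ["B-pers", "B-fac", "O"]]
mapping = {"B-loc": "B_LOC", "I-loc": "I_LOC", "B-pers": "B_PER",
           "B-fac": "O", "O": "O"}

print(check_label_consistency(model_labels, arman_like))
translated = translate_labels(arman_like, mapping)
print(check_label_consistency(model_labels, translated))
```

Mapping unsupported types to `O` keeps the evaluation runnable, but it also means the model is never credited or penalised for facility, product, or event entities.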


batch size=256 -> inference time for one batch is about 205 s

batch size=512 -> inference time for one batch is about 410 s

batch size=1024 -> crash
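
The notes above do not say why the largest batch size failed. One plausible cause, stated here purely as an assumption, is GPU memory: in a BERT-Base model (12 attention heads) the attention score tensor of a single layer has `batch_size * heads * seq_len * seq_len` elements, so it grows linearly with the batch size. The rough estimate below uses `seq_len=253` (the `max_len` logged for the ARMAN run) and 4 bytes per element (float32, also an assumption); it ignores every other activation and the model weights.

```python
def attention_score_elements(batch_size: int, seq_len: int, num_heads: int = 12) -> int:
    """Element count of one layer's attention score tensor."""
    return batch_size * num_heads * seq_len * seq_len


BYTES_PER_FLOAT32 = 4

for batch_size in (256, 512, 1024):
    elements = attention_score_elements(batch_size, seq_len=253)
    gib = elements * BYTES_PER_FLOAT32 / 2**30
    print(f"batch_size={batch_size}: {elements} elements, about {gib:.2f} GiB per layer")
```

This is only a lower-bound style estimate for one tensor; actual memory use depends on the framework, on padding, and on how many intermediate tensors are kept alive at once.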

In [26]:
!nvidia-smi
!lscpu

Sun Aug 15 09:27:49 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   52C    P0    27W /  70W |   5992MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [27]:
inference_output_arman = ner_model.ner_evaluation(sentences, labels, device, batch_size=512)

max_len: 253
#samples: 7681
#batch: 16
Start to evaluate test data ...
inference time for step 0: 0.035851421000018036
inference time for step 1: 0.007916221000016321
inference time for step 2: 0.009458945999995194
inference time for step 3: 0.00902973000000884
inference time for step 4: 0.009614173000045412
inference time for step 5: 0.008412049000014576
inference time for step 6: 0.009326251999993929
inference time for step 7: 0.010476018000019849
inference time for step 8: 0.008374076999984936
inference time for step 9: 0.00807537800000091
inference time for step 10: 0.009049629999992703
inference time for step 11: 0.008163679999995566
inference time for step 12: 0.008784402999992835
inference time for step 13: 0.008739642000023196
inference time for step 14: 0.008432559999960176
inference time for step 15: 0.008200567000017145
average loss: 0.06543258321471512
total inference time: 0.16790474700007962
total inference time / #samples: 2.185975094389788e-05


In [28]:
for sample_output in inference_output_arman[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

افقی	O	O
:	O	O
[UNK]	O	O
[UNK]	O	O
از	O	O
عوامل	O	O
دوران	O	O
پهلوی	O	B_PER
و	O	O
نخستوزیر	O	O
ایران	B_LOC	B_LOC
در	O	O
سالهای	O	O
ابتدايی	O	O
دهه	O	B_DAT
چهل	O	I_DAT
خورشیدی	O	I_DAT
[UNK]	O	O
جلد	O	O
سوم	O	O
یادداشتهای	O	O
##ش	O	O
هم	O	O
چندی	O	O
پیش	O	O
در	O	O
تهران	B_LOC	B_LOC
منتشر	O	O
شد	O	O
[UNK]	O	O
[UNK]	O	O
پرستاری	O	O
از	O	O
ناخوش	O	O
##احوال	O	O
[UNK]	O	O
پوشاک	O	O
و	O	O
جامه	O	O
[UNK]	O	O
فانتزی	O	O
و	O	O
شیک	O	O
[UNK]	O	O
[UNK]	O	O
در	O	O
حال	O	O
وزیدن	O	O
[UNK]	O	O
اطلاعیه	O	O
[UNK]	O	O
پایتخت	O	O
جمهوری	O	B_LOC
استونی	B_LOC	I_LOC
در	I_LOC	O
حوضه	I_LOC	B_LOC
بالتیک	I_LOC	I_LOC
[UNK]	O	O
[UNK]	O	O
علم	O	O
راهبرد	O	I_ORG
موسسه	O	O
و	O	O
سازمان	O	O
[UNK]	O	O
نوعی	O	O
شمع	O	O
[UNK]	O	O
[UNK]	O	O
حرف	O	O
جمع	O	O
مونث	O	O
[UNK]	O	O
در	O	O
ایران	B_LOC	B_LOC
به	O	O
تولیدکننده	O	O
کتاب	O	O
اطلاق	O	O
میشود	O	O
[UNK]	O	O
از	O	O
شهرهای	O	O
باختری	O	B_LOC
افغانستان	B_LOC	I_LOC
[UNK]	O	O
تا	O	O
عصر	O	O
ناصرالدینشاه	B_PER	B_PER
جزيی	O	O
از	O	O
خراسان	B_LOC	B_LOC
بود	O	O
[UNK]	O	O
ویتامی

In [29]:
ner_model.evaluate_prediction_results(labels, inference_output_arman)

Test Accuracy: 0.957992208453241
Test Precision: 0.6112642094385119
Test Recall: 0.5979780960404381
Test F1-Score: 0.6045481645515715
Test classification Report:
              precision    recall  f1-score   support

        _LOC  0.5160607809 0.7591963780 0.6144509332      3534
        _ORG  0.6217207334 0.4691358025 0.5347567633      4698
        _PER  0.7709205021 0.6077515118 0.6796802951      3638

   micro avg  0.6112642094 0.5979780960 0.6045481646     11870
   macro avg  0.6362340055 0.6120278974 0.6096293306     11870
weighted avg  0.6359908671 0.5979780960 0.6029009087     11870
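
The `weighted avg` row differs from `macro avg` because each class is weighted by its support. As a small worked example (variable names are illustrative), the weighted precision above can be recomputed from the three per-class rows:

```python
# (precision, support) pairs copied from the ARMAN report above.
arman_rows = {
    "_LOC": (0.5160607809, 3534),
    "_ORG": (0.6217207334, 4698),
    "_PER": (0.7709205021, 3638),
}

total_support = sum(support for _, support in arman_rows.values())

# Weighted average: each class contributes in proportion to its support.
weighted_precision = (
    sum(precision * support for precision, support in arman_rows.values())
    / total_support
)

print(total_support)
print(weighted_precision)
```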



In [30]:
output_file_name = "ner_arman_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_arman:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### Arman+Peyma

In [41]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1WZxpFRtEs5HZWyWQ2Pyg9CCuIBs1Kmvx'})
download.GetContentFile('peyma.zip')
!ls

adc.json
ner_arman_HooshvareLab-bert-base-parsbert-peymaner-uncased_outputs.txt
ner_peyma_HooshvareLab-bert-base-parsbert-peymaner-uncased_outputs.txt
peyma.zip
sample_data


In [42]:
!unzip peyma.zip
!ls
!ls peyma

Archive:  peyma.zip
   creating: peyma/
  inflating: peyma/dev.txt           
  inflating: peyma/test.txt          
  inflating: peyma/train.txt         
adc.json
ner_arman_HooshvareLab-bert-base-parsbert-peymaner-uncased_outputs.txt
ner_peyma_HooshvareLab-bert-base-parsbert-peymaner-uncased_outputs.txt
peyma
peyma.zip
sample_data
dev.txt  test.txt  train.txt


In [43]:
sentences_peyma, labels_peyma = ner_model.load_test_datasets(dataset_name="peyma", dataset_dir="./peyma/")
print(len(sentences_peyma), len(labels_peyma))
print(sentences_peyma[0])
print(labels_peyma[0])

1026 1026
['کنایه', 'سرلشگر', 'فیروزآبادی', 'به', 'پادشاه', 'عربستان', 'و', 'پسرش']
['O', 'O', 'B_ORG', 'O', 'O', 'B_LOC', 'O', 'O']


In [44]:
is_consistent = ner_model.check_input_label_consistency(labels_peyma)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'I_PER', 'B_PER', 'I_DAT', 'B_DAT', 'B_LOC', 'I_MON', 'I_PCT', 'I_LOC', 'O', 'B_TIM', 'I_ORG', 'B_ORG', 'I_TIM', 'B_PCT', 'B_MON'}
intersection: {'I_PER', 'B_PER', 'I_DAT', 'I_ORG', 'B_DAT', 'B_LOC', 'I_MON', 'B_ORG', 'I_TIM', 'B_PCT', 'I_PCT', 'I_LOC', 'O', 'B_MON', 'B_TIM'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True


In [45]:
!wget https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
!ls

--2021-08-15 09:40:28--  https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/HaniehP/PersianNER/master/ArmanPersoNERCorpus.zip [following]
--2021-08-15 09:40:28--  https://raw.githubusercontent.com/HaniehP/PersianNER/master/ArmanPersoNERCorpus.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1931170 (1.8M) [application/zip]
Saving to: ‘ArmanPersoNERCorpus.zip’


2021-08-15 09:40:28 (49.7 MB/s) - ‘ArmanPersoNERCorpus.zip’ saved [1931170/1931170]

adc.json
ArmanPersoNERCorpus.zip
ner_arman_HooshvareLab-bert-base-parsbert-peyma

In [46]:
!unzip ArmanPersoNERCorpus.zip -d arman
!ls

Archive:  ArmanPersoNERCorpus.zip
  inflating: arman/test_fold1.txt    
  inflating: arman/ReadMe.txt        
  inflating: arman/train_fold3.txt   
  inflating: arman/train_fold2.txt   
  inflating: arman/train_fold1.txt   
  inflating: arman/test_fold3.txt    
  inflating: arman/test_fold2.txt    
adc.json
arman
ArmanPersoNERCorpus.zip
ner_arman_HooshvareLab-bert-base-parsbert-peymaner-uncased_outputs.txt
ner_peyma_HooshvareLab-bert-base-parsbert-peymaner-uncased_outputs.txt
peyma
peyma.zip
sample_data


In [47]:
sentences_arman, labels_arman = ner_model.load_test_datasets(dataset_name="arman", dataset_dir="./arman/")
print(len(sentences_arman), len(labels_arman))
print(sentences_arman[0])
print(labels_arman[0])

7681 7681
['افقی', ':', '0', 'ـ', 'از', 'عوامل', 'دوران', 'پهلوی', 'و', 'نخست\u200cوزیر', 'ایران', 'در', 'سالهای', 'ابتدائی', 'دهه', 'چهل', 'خورشیدی', 'كه', 'جلد', 'سوم', 'یادداشتهایش', 'هم', 'چندی', 'پیش', 'در', 'تهران', 'منتشر', 'شد', '0', 'ـ', 'پرستاری', 'از', 'ناخوش\u200cاحوال', 'ـ', 'پوشاک', 'و', 'جامه', 'ـ', 'فانتزی', 'و', 'شیک', '0', 'ـ', 'در', 'حال', 'وزیدن', 'ـ', 'اطلاعیه', 'ـ', 'پایتخت', 'جمهوری', 'استونی', 'در', 'حوضه', 'بالتیک', '0', 'ـ', 'علم', 'راهبرد', 'مؤسسه', 'و', 'سازمان', 'ـ', 'نوعی', 'شمع', '0', 'ـ', 'حرف', 'جمع', 'مؤنث', 'ـ', 'در', 'ایران', 'به', 'تولیدکننده', 'کتاب', 'اطلاق', 'می\u200cشود', 'ـ', 'از', 'شهرهای', 'باختری', 'افغانستان', 'كه', 'تا', 'عصر', 'ناصرالدین\u200cشاه', 'جزئی', 'از', 'خراسان', 'بود', 'ـ', 'ویتامین', 'انعقاد', '0', 'ـ', 'سبزی', 'غده\u200cای', 'ـ', 'دوستی', 'و', 'محبت', 'ـ', 'داستان', 'بلند', 'ـ', 'شهری', 'در', 'آلمان', '0', 'ـ', 'سلول', 'بدن', 'موجودات', 'ـ', 'از', 'انواع', 'کالباس', '0', 'ـ', 'حاشیه', 'و', 'هامش', 'ـ', 'پیدا', 'نشدنی', 'ـ', 'خ

In [48]:
is_consistent = ner_model.check_input_label_consistency(labels_arman)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'B-org', 'B-fac', 'B-loc', 'I-pro', 'B-pro', 'B-pers', 'O', 'I-pers', 'I-loc', 'B-event', 'I-event', 'I-org', 'I-fac'}
intersection: {'O'}
model_labels-dataset_labels: ['I_PER', 'B_PER', 'I_ORG', 'I_DAT', 'B_DAT', 'B_LOC', 'B_ORG', 'I_MON', 'I_TIM', 'B_PCT', 'I_PCT', 'I_LOC', 'B_MON', 'B_TIM']
dataset_labels-model_labels: ['B-org', 'B-fac', 'I-loc', 'B-event', 'I-event', 'B-loc', 'I-pro', 'B-pers', 'B-pro', 'I-org', 'I-fac', 'I-pers']
False


In [49]:
label_translate = {
    'B-org': 'B_ORG', 
    'I-org': 'I_ORG',
    'B-loc': 'B_LOC',
    'I-loc': 'I_LOC',
    'B-pers': 'B_PER', 
    'I-pers': 'I_PER',
    'O': 'O',
    # the model has no class for the following entity types, so they are mapped to 'O'
    'B-pro': 'O', 
    'I-pro': 'O', 
    'B-fac': 'O', 
    'I-fac': 'O',  
    'B-event': 'O', 
    'I-event': 'O'
}
labels_arman = ner_model.resolve_input_label_consistency(labels_arman, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_arman)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'I_PER', 'B_PER', 'I_ORG', 'B_ORG', 'B_LOC', 'I_LOC', 'O'}
intersection: {'I_PER', 'B_PER', 'I_ORG', 'B_ORG', 'B_LOC', 'I_LOC', 'O'}
model_labels-dataset_labels: ['I_DAT', 'B_DAT', 'I_MON', 'I_TIM', 'B_PCT', 'I_PCT', 'B_MON', 'B_TIM']
dataset_labels-model_labels: []
True


In [50]:
sentences = sentences_arman + sentences_peyma
labels = labels_arman + labels_peyma
print(len(sentences), len(labels))

8707 8707


In [51]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'I_PER', 'B_PER', 'I_DAT', 'B_DAT', 'B_LOC', 'I_MON', 'I_PCT', 'I_LOC', 'O', 'B_TIM', 'I_ORG', 'B_ORG', 'I_TIM', 'B_PCT', 'B_MON'}
intersection: {'I_PER', 'B_PER', 'I_DAT', 'I_ORG', 'B_DAT', 'B_LOC', 'I_MON', 'B_ORG', 'I_TIM', 'B_PCT', 'I_PCT', 'I_LOC', 'O', 'B_MON', 'B_TIM'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True


In [52]:
!nvidia-smi
!lscpu

Sun Aug 15 09:40:59 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   56C    P0    28W /  70W |  10058MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [53]:
inference_output = ner_model.ner_evaluation(sentences, labels, device, batch_size=512)

max_len: 253
#samples: 8707
#batch: 18
Start to evaluate test data ...
inference time for step 0: 0.27028885000004266
inference time for step 1: 0.008836540999936915
inference time for step 2: 0.008172119999926508
inference time for step 3: 0.008350670999789145
inference time for step 4: 0.00986519999992197
inference time for step 5: 0.007859092999979111
inference time for step 6: 0.008275308000065706
inference time for step 7: 0.008499794000044858
inference time for step 8: 0.008008896000092136
inference time for step 9: 0.011382102999959898
inference time for step 10: 0.008269735000112632
inference time for step 11: 0.009032732999912696
inference time for step 12: 0.008434502000000066
inference time for step 13: 0.009660967999934655
inference time for step 14: 0.008057922000034523
inference time for step 15: 0.007929693000050975
inference time for step 16: 0.009151141000074858
inference time for step 17: 0.008403879000070447
average loss: 0.061549732772012554
total inference time: 0.

In [54]:
for sample_output in inference_output[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

افقی	O	O
:	O	O
[UNK]	O	O
[UNK]	O	O
از	O	O
عوامل	O	O
دوران	O	O
پهلوی	O	B_PER
و	O	O
نخستوزیر	O	O
ایران	B_LOC	B_LOC
در	O	O
سالهای	O	O
ابتدايی	O	O
دهه	O	B_DAT
چهل	O	I_DAT
خورشیدی	O	I_DAT
[UNK]	O	O
جلد	O	O
سوم	O	O
یادداشتهای	O	O
##ش	O	O
هم	O	O
چندی	O	O
پیش	O	O
در	O	O
تهران	B_LOC	B_LOC
منتشر	O	O
شد	O	O
[UNK]	O	O
[UNK]	O	O
پرستاری	O	O
از	O	O
ناخوش	O	O
##احوال	O	O
[UNK]	O	O
پوشاک	O	O
و	O	O
جامه	O	O
[UNK]	O	O
فانتزی	O	O
و	O	O
شیک	O	O
[UNK]	O	O
[UNK]	O	O
در	O	O
حال	O	O
وزیدن	O	O
[UNK]	O	O
اطلاعیه	O	O
[UNK]	O	O
پایتخت	O	O
جمهوری	O	B_LOC
استونی	B_LOC	I_LOC
در	I_LOC	O
حوضه	I_LOC	B_LOC
بالتیک	I_LOC	I_LOC
[UNK]	O	O
[UNK]	O	O
علم	O	O
راهبرد	O	I_ORG
موسسه	O	O
و	O	O
سازمان	O	O
[UNK]	O	O
نوعی	O	O
شمع	O	O
[UNK]	O	O
[UNK]	O	O
حرف	O	O
جمع	O	O
مونث	O	O
[UNK]	O	O
در	O	O
ایران	B_LOC	B_LOC
به	O	O
تولیدکننده	O	O
کتاب	O	O
اطلاق	O	O
میشود	O	O
[UNK]	O	O
از	O	O
شهرهای	O	O
باختری	O	B_LOC
افغانستان	B_LOC	I_LOC
[UNK]	O	O
تا	O	O
عصر	O	O
ناصرالدینشاه	B_PER	B_PER
جزيی	O	O
از	O	O
خراسان	B_LOC	B_LOC
بود	O	O
[UNK]	O	O
ویتامی

In [55]:
ner_model.evaluate_prediction_results(labels, inference_output)

Test Accuracy: 0.9431514570671116
Test Precision: 0.5661012559886055
Test Recall: 0.6249731970552498
Test F1-Score: 0.5940822774059857
Test classification Report:
              precision    recall  f1-score   support

        _DAT  0.1174951582 0.8272727273 0.2057659695       220
        _LOC  0.5571502680 0.7780729292 0.6493349456      4141
        _MON  0.0825082508 0.9615384615 0.1519756839        26
        _ORG  0.6669851887 0.5161767425 0.5819697759      5409
        _PCT  0.1348837209 0.5800000000 0.2188679245        50
        _PER  0.7683493342 0.6017463012 0.6749183896      4123
        _TIM  0.0718232044 0.5909090909 0.1280788177        22

   micro avg  0.5661012560 0.6249731971 0.5940822774     13991
   macro avg  0.3427421608 0.6936737504 0.3729873581     13991
weighted avg  0.6517836392 0.6249731971 0.6205732299     13991
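
The very low precision for `_DAT`, `_MON`, `_PCT`, and `_TIM` in this combined run is worth unpacking. The translated ARMAN labels contain none of these classes, so a plausible reading (an interpretation, not something the notebook states) is that every date, money, percent, or time span the model predicts inside an ARMAN sentence is scored as a false positive. The sketch below recovers approximate entity counts from a report row, using precision = TP / predicted and recall = TP / support; `counts_from_report` is a hypothetical helper.

```python
from typing import Tuple


def counts_from_report(precision: float, recall: float, support: int) -> Tuple[int, int, int]:
    """Return (true positives, predicted entities, false positives) for one class."""
    true_positives = round(recall * support)
    predicted = round(true_positives / precision)
    return true_positives, predicted, predicted - true_positives


# Rows copied from the combined ARMAN+PEYMA report above.
print("_DAT", counts_from_report(0.1174951582, 0.8272727273, 220))
print("_MON", counts_from_report(0.0825082508, 0.9615384615, 26))
```

Recall for these classes stays close to the PEYMA-only figures, while the number of predicted entities far exceeds the support, which is consistent with the sample output above where `دهه چهل خورشیدی` is tagged `O` in the gold labels but predicted as a date.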



In [56]:
output_file_name = "ner_arman-and-peyma_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### WikiAnn

https://elisa-ie.github.io/wikiann/

In [8]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1QOG15HU8VfZvJUNKos024xI-OGm0zhEX'})
download.GetContentFile('fa.tar.gz')
!ls

adc.json  fa.tar.gz  sample_data


In [9]:
!tar -zxvf fa.tar.gz
!ls

README.txt
wikiann-fa.bio
adc.json  fa.tar.gz  README.txt  sample_data  wikiann-fa.bio


In [10]:
sentences_all, labels_all, sentences_test, labels_test = ner_model.load_datasets(dataset_name="wikiann", dataset_dir="./")
print(len(sentences_all), len(labels_all))
print(len(sentences_test), len(labels_test))
print(sentences_test[0])
print(labels_test[0])

all data: #data: 272266, #labels: 272266
without stratify
test part:
 #data: 27227, #labels: 27227
272266 272266
27227 27227
['**', 'زاغی', 'نوک\u200cزرد', ',', "''Pica", 'nuttalli', "''"]
['O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O']
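
The logged sizes are consistent with a non-stratified 10% hold-out: 27,227 is 10% of 272,266 rounded up. The body of `load_datasets` is not shown, so the sketch below is only an assumption about how such a split could be made with the standard library; `holdout_split`, `test_fraction`, and `seed` are hypothetical names, and the actual loader may use a different library or seed.

```python
import math
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(
    items: Sequence[T], test_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle indices once and hold out ceil(test_fraction * n) items for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * len(items))
    test_indices = indices[:n_test]
    train_indices = indices[n_test:]
    return [items[i] for i in train_indices], [items[i] for i in test_indices]


train_part, test_part = holdout_split(list(range(272266)))
print(len(train_part), len(test_part))
```

To keep sentences and labels aligned, the same index lists would have to be applied to both sequences rather than splitting them independently.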


In [11]:
is_consistent = ner_model.check_input_label_consistency(labels_test)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'I-LOC', 'I-PER', 'I-ORG', 'B-LOC', 'B-ORG', 'O', 'B-PER'}
intersection: {'O'}
model_labels-dataset_labels: ['B_LOC', 'I_PCT', 'I_MON', 'B_MON', 'B_TIM', 'I_PER', 'B_PCT', 'I_DAT', 'I_LOC', 'I_ORG', 'B_PER', 'I_TIM', 'B_DAT', 'B_ORG']
dataset_labels-model_labels: ['I-LOC', 'I-PER', 'I-ORG', 'B-LOC', 'B-ORG', 'B-PER']
False


In [12]:
label_translate = {
    'B-LOC': 'B_LOC',
    'I-LOC': 'I_LOC',
    'B-PER': 'B_PER',
    'I-PER': 'I_PER',
    'B-ORG': 'B_ORG',
    'I-ORG': 'I_ORG',
    'O': 'O'
}
labels_test = ner_model.resolve_input_label_consistency(labels_test, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_test)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'B_LOC', 'I_PER', 'I_ORG', 'I_LOC', 'B_PER', 'O', 'B_ORG'}
intersection: {'B_LOC', 'I_PER', 'I_ORG', 'I_LOC', 'B_PER', 'O', 'B_ORG'}
model_labels-dataset_labels: ['I_PCT', 'I_MON', 'B_MON', 'B_TIM', 'B_PCT', 'I_DAT', 'I_TIM', 'B_DAT']
dataset_labels-model_labels: []
True


In [13]:
!nvidia-smi
!lscpu

Sun Aug 15 11:24:49 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8    27W / 149W |      3MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [14]:
inference_output_wikiann = ner_model.ner_evaluation_2(sentences_test, labels_test, device, batch_size=512)

len(input_text): 27227
len(input_labels): 27227
c: 10000
c: 20000
max_len: 95
#samples: 27227
#batch: 54
Start to evaluate test data ...
inference time for step 0: 0.15348140200001126
inference time for step 1: 0.011534449999999197
inference time for step 2: 0.02137657999998055
inference time for step 3: 0.012426166000011563
inference time for step 4: 0.012180353000019295
inference time for step 5: 0.012017796000009184
inference time for step 6: 0.012487973000020247
inference time for step 7: 0.011977507999972659
inference time for step 8: 0.011768055999937133
inference time for step 9: 0.011619251999945845
inference time for step 10: 0.012271614000042064
inference time for step 11: 0.012232092999965971
inference time for step 12: 0.012742526000010912
inference time for step 13: 0.011910495999927662
inference time for step 14: 0.012445649000028425
inference time for step 15: 0.012251264999918021
inference time for step 16: 0.012051760000076683
inference time for step 17: 0.011700829000

In [15]:
for sample_output in inference_output_wikiann[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

*	O	O
*	O	O
زاغی	B_LOC	O
نوک	I_LOC	O
##زرد	I_LOC	O
,	O	O
'	O	O
'	O	O
pic	O	O
##a	O	O
nut	O	O
##ta	O	O
##ll	O	O
##i	O	O
'	O	O
'	O	O

تغییر	O	O
##مسیر	O	O
مک	B_LOC	B_LOC
##ویل	B_LOC	I_LOC
،	B_LOC	O
داکوتای	I_LOC	B_LOC
شمالی	I_LOC	I_LOC

وست	B_LOC	O
یونیورسیتی	I_LOC	B_ORG
پلیس	I_LOC	I_ORG
،	I_LOC	O
تگزاس	I_LOC	B_LOC

تغییر	O	O
##مسیر	O	O
دلت	B_PER	O
##ف	B_PER	O
فون	I_PER	O
لیل	I_PER	I_LOC
##نس	I_PER	O
##رون	I_PER	O

تغییر	O	O
##مسیر	O	O
نیروگاههای	B_ORG	O
زنجیرهای	I_ORG	O
یاسوج	I_ORG	B_LOC
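
In the sample above, the gold label of a word appears to be copied to each of its word pieces, so `مک`, `##ویل`, and `،` all carry `B_LOC`, and an entity-level scorer then sees several entity starts where the dataset has one. One common alternative convention, sketched here as an assumption and not as what `ner_evaluation_2` does, keeps `B_` only on the first piece and turns it into `I_` on continuation pieces. The word pieces are passed in directly, so no tokenizer is required; `align_labels_to_pieces` is a hypothetical helper.

```python
from typing import List, Tuple


def align_labels_to_pieces(
    pieces_per_word: List[List[str]], word_labels: List[str]
) -> List[Tuple[str, str]]:
    """Spread word-level IOB labels over word pieces without repeating B_ tags."""
    aligned: List[Tuple[str, str]] = []
    for pieces, label in zip(pieces_per_word, word_labels):
        for position, piece in enumerate(pieces):
            if position > 0 and label.startswith("B_"):
                # Continuation pieces stay inside the entity opened by the first piece.
                piece_label = "I_" + label[2:]
            else:
                piece_label = label
            aligned.append((piece, piece_label))
    return aligned


pieces = [["مک", "##ویل", "،"], ["داکوتای"], ["شمالی"]]
labels = ["B_LOC", "I_LOC", "I_LOC"]
for piece, piece_label in align_labels_to_pieces(pieces, labels):
    print(piece, piece_label, sep="\t")
```

Whichever convention is used, it has to be applied identically to gold and predicted labels before computing entity-level scores.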



In [16]:
ner_model.evaluate_prediction_results(labels_test, inference_output_wikiann)

Test Accuracy: 0.4953637310637026
Test Precision: 0.19731446852611184
Test Recall: 0.11953376644647049
Test F1-Score: 0.1488772162442416
Test classification Report:
              precision    recall  f1-score   support

        _LOC  0.1258883249 0.1024899266 0.1129904892     19358
        _ORG  0.5142689026 0.1539816772 0.2370008813     11352
        _PER  0.2132498352 0.1092167454 0.1444518866      5924

   micro avg  0.1973144685 0.1195337664 0.1488772162     36634
   macro avg  0.2844690209 0.1218961164 0.1648144190     36634
weighted avg  0.2603652017 0.1195337664 0.1565058926     36634



In [17]:
output_file_name = "ner_wikiann_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_wikiann:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### Hooshvare - Arman+Peyma+WikiAnn

https://github.com/hooshvare/parsner

In [60]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1fC2WGlpqumUTaT9Dr_U1jO2no3YMKFJ4'})
download.GetContentFile('ner-v1.zip')
!ls

adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner_arman-and-peyma_HooshvareLab-bert-base-parsbert-peymaner-uncased_outputs.txt
ner_arman_HooshvareLab-bert-base-parsbert-peymaner-uncased_outputs.txt
ner_peyma_HooshvareLab-bert-base-parsbert-peymaner-uncased_outputs.txt
ner-v1.zip
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio


In [61]:
!unzip ner-v1.zip
!ls
!ls ner

Archive:  ner-v1.zip
   creating: ner/
  inflating: ner/valid.csv           
  inflating: ner/ner.csv             
  inflating: ner/test.csv            
  inflating: ner/train.csv           
adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner
ner_arman-and-peyma_HooshvareLab-bert-base-parsbert-peymaner-uncased_outputs.txt
ner_arman_HooshvareLab-bert-base-parsbert-peymaner-uncased_outputs.txt
ner_peyma_HooshvareLab-bert-base-parsbert-peymaner-uncased_outputs.txt
ner-v1.zip
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio
ner.csv  test.csv  train.csv  valid.csv


In [62]:
sentences_paw, labels_paw = ner_model.load_test_datasets(dataset_name="hooshvare-peyman+arman+wikiann", dataset_dir="./ner/")
print(len(sentences_paw), len(labels_paw))
print(sentences_paw[0])
print(labels_paw[0])

test part:
 #sentences: 6049, #sentences_tags: 6049
6049 6049
['همچنین', 'عملیات', 'لرزه\u200cنگاری', 'دوبعدی', 'نیز', 'با', 'فعالیت', 'مستمر', 'چهار', 'گروه', 'کاری', 'در', 'مناطقی', 'که', 'از', 'نظر', 'اکتشافی', 'مورد', 'نظر', 'بود', '،', 'به', 'پایان', 'رسید', 'که', 'نتایج', 'آن', 'در', 'حال', 'بررسی', 'است', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [63]:
is_consistent = ner_model.check_input_label_consistency(labels_paw)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'B-PER', 'B-LOC', 'I-EVE', 'I-FAC', 'I-DAT', 'I-PCT', 'I-ORG', 'B-PRO', 'I-LOC', 'I-MON', 'O', 'B-TIM', 'I-PER', 'I-PRO', 'I-TIM', 'B-DAT', 'B-EVE', 'B-ORG', 'B-MON', 'B-FAC', 'B-PCT'}
intersection: {'O'}
model_labels-dataset_labels: ['I_PER', 'B_PER', 'I_ORG', 'I_DAT', 'B_DAT', 'B_LOC', 'B_ORG', 'I_MON', 'I_TIM', 'B_PCT', 'I_PCT', 'I_LOC', 'B_MON', 'B_TIM']
dataset_labels-model_labels: ['B-PER', 'I-EVE', 'I-FAC', 'I-DAT', 'I-ORG', 'B-PRO', 'I-LOC', 'I-MON', 'B-EVE', 'B-ORG', 'B-PCT', 'B-LOC', 'I-PCT', 'B-TIM', 'I-PER', 'I-PRO', 'I-TIM', 'B-DAT', 'B-MON', 'B-FAC']
False
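
The report above amounts to a set comparison between the label names the model knows and the label names found in the dataset. Here the only shared label is `O`, because the dataset uses hyphens (`B-LOC`) where the model uses underscores (`B_LOC`), and it also contains types the model was not trained on (`PRO`, `EVE`, `FAC`). A minimal sketch of such a check follows. `compare_label_sets` is a hypothetical helper written for illustration, and the notebook's own `check_input_label_consistency` may be implemented differently (for instance, it may define consistency as set equality rather than the subset test assumed here).

```python
def compare_label_sets(model_labels, dataset_label_seqs):
    """Compare the model's label names with the tags used in a dataset.

    Hypothetical helper for illustration; not the notebook's own method.
    """
    model_set = set(model_labels)
    # Flatten the per-sentence tag lists into one set of distinct tags.
    dataset_set = {tag for seq in dataset_label_seqs for tag in seq}
    return {
        "intersection": sorted(model_set & dataset_set),
        "model_only": sorted(model_set - dataset_set),
        "dataset_only": sorted(dataset_set - model_set),
        # Assumption: "consistent" means every dataset tag is known to the model.
        "is_consistent": dataset_set <= model_set,
    }


report = compare_label_sets(
    ["B_LOC", "I_LOC", "O"],
    [["B-LOC", "I-LOC", "O"], ["O", "O"]],
)
print(report)
```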


In [64]:
# Map the dataset's hyphenated tags onto the model's underscore-style tags.
# Entity types the model does not know (PRO, EVE, FAC) are collapsed to 'O'.
label_translate = {
    'B-LOC': 'B_LOC',
    'I-LOC': 'I_LOC',
    'B-TIM': 'B_TIM',
    'I-TIM': 'I_TIM',
    'B-PRO': 'O',
    'I-PRO': 'O',
    'B-PCT': 'B_PCT',
    'I-PCT': 'I_PCT',
    'B-ORG': 'B_ORG',
    'I-ORG': 'I_ORG',
    'B-DAT': 'B_DAT',
    'I-DAT': 'I_DAT',
    'B-EVE': 'O',
    'I-EVE': 'O',
    'B-MON': 'B_MON',
    'I-MON': 'I_MON',
    'B-PER': 'B_PER',
    'I-PER': 'I_PER',
    'B-FAC': 'O',
    'I-FAC': 'O',
    'O': 'O'
}
labels_paw = ner_model.resolve_input_label_consistency(labels_paw, label_translate)
# Re-run the check: it should now report no differences between the label sets.
is_consistent = ner_model.check_input_label_consistency(labels_paw)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'I_PER', 'B_PER', 'I_DAT', 'I_MON', 'B_LOC', 'B_DAT', 'I_PCT', 'I_LOC', 'O', 'B_TIM', 'I_ORG', 'B_ORG', 'I_TIM', 'B_PCT', 'B_MON'}
intersection: {'I_PER', 'B_PER', 'I_DAT', 'I_ORG', 'I_MON', 'B_LOC', 'B_DAT', 'B_ORG', 'I_TIM', 'B_PCT', 'I_PCT', 'I_LOC', 'O', 'B_MON', 'B_TIM'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True
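
Remapping the labels is conceptually a per-tag dictionary lookup over every sentence. The sketch below illustrates the idea. `translate_labels` and its `default` parameter are assumptions made for illustration and are not taken from the notebook's `resolve_input_label_consistency`, whose behavior for unmapped tags is not shown here.

```python
def translate_labels(label_seqs, mapping, default="O"):
    """Return a copy of label_seqs with every tag remapped through mapping.

    Hypothetical helper: tags missing from the mapping fall back to `default`.
    """
    return [[mapping.get(tag, default) for tag in seq] for seq in label_seqs]


mapping = {
    "B-LOC": "B_LOC",
    "I-LOC": "I_LOC",
    "B-FAC": "O",  # type unknown to the model, so it is treated as a non-entity
    "I-FAC": "O",
    "O": "O",
}
translated = translate_labels([["B-LOC", "I-LOC", "O", "B-FAC", "I-FAC"]], mapping)
print(translated)
```

Note that collapsing unsupported types to `O` changes the gold standard: any span the model predicts over a former `PRO`, `EVE`, or `FAC` entity is then scored as a false positive.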


In [65]:
!nvidia-smi
!lscpu

Sun Aug 15 10:04:58 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   56C    P0    28W /  70W |  11576MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [66]:
inference_output = ner_model.ner_evaluation_2(sentences_paw, labels_paw, device, batch_size=256)

len(input_text): 6049
len(input_labels): 6049
max_len: 448
#samples: 6049
#batch: 24
Start to evaluate test data ...
inference time for step 0: 0.28100503399991794
inference time for step 1: 0.010572150999905716
inference time for step 2: 0.00943420600015088
inference time for step 3: 0.008198630000151752
inference time for step 4: 0.008187182000256144
inference time for step 5: 0.00936181499992017
inference time for step 6: 0.00878835599996819
inference time for step 7: 0.008261138999841933
inference time for step 8: 0.008309651999752532
inference time for step 9: 0.008120606999909796
inference time for step 10: 0.008240725999712595
inference time for step 11: 0.008867022000231373
inference time for step 12: 0.00858608800035654
inference time for step 13: 0.008133330999953614
inference time for step 14: 0.01005096200015032
inference time for step 15: 0.008533022999927198
inference time for step 16: 0.009133687000030477
inference time for step 17: 0.00814949699997669
inference time for

In [67]:
for sample_output in inference_output[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

همچنین	O	O
عملیات	O	O
لرزهنگاری	O	O
دوبعدی	O	O
نیز	O	O
با	O	O
فعالیت	O	O
مستمر	O	O
چهار	O	O
گروه	O	O
کاری	O	O
در	O	O
مناطقی	O	O
که	O	O
از	O	O
نظر	O	O
اکتشافی	O	O
مورد	O	O
نظر	O	O
بود	O	O
،	O	O
به	O	O
پایان	O	O
رسید	O	O
که	O	O
نتایج	O	O
ان	O	O
در	O	O
حال	O	O
بررسی	O	O
است	O	O
.	O	O

محدث	B_PER	O
در	O	O
مورد	O	O
مشارکت	O	O
شرکتهای	O	O
خارجی	O	O
در	O	O
فعالیتهای	O	O
اکتشافی	O	O
کشور	O	O
گفت	O	O
:	O	O
تاکنون	O	O
چند	O	O
منطقه	O	O
اکتشافی	O	O
را	O	O
برای	O	O
مشارکت	O	O
و	O	O
سرمایهگذاری	O	O
شرکتهای	O	O
خارجی	O	O
اعلام	O	O
کردهایم	O	O
و	O	O
در	O	O
حال	O	O
مذاکره	O	O
با	O	O
طرفهای	O	O
خارجی	O	O
هستیم	O	O
و	O	O
انتظار	O	O
میرود	O	O
تا	O	O
اخر	O	B_DAT
امسال	O	B_DAT
بتوانیم	O	O
چند	O	O
قرارداد	O	O
را	O	O
نهایی	O	O
کنیم	O	O
.	O	O

مدیر	O	O
امور	B_ORG	B_ORG
اکتشاف	I_ORG	I_ORG
شرکت	I_ORG	I_ORG
ملی	I_ORG	I_ORG
نفت	I_ORG	I_ORG
فرو	O	I_ORG
##افتادگی	O	I_LOC
دزفول	B_LOC	I_ORG
و	O	O
منطقه	B_LOC	B_LOC
گسل	I_LOC	I_LOC
کازرون	I_LOC	I_LOC
تا	O	I_LOC
بالارو	B_LOC	I_LOC
##د	B_LOC	I_LOC
در	O	O
اطراف	O	O
لرستان

In [68]:
ner_model.evaluate_prediction_results(labels_paw, inference_output)

Test Accuracy: 0.9519292123629113

Test Precision: 0.6257961783439491
Test Recall: 0.6457841224196365
Test F1-Score: 0.6356330553449583
Test classification Report:
              precision    recall  f1-score   support

        _DAT  0.3022088353 0.7359413203 0.4284697509       409
        _LOC  0.6328103496 0.7764268828 0.6973005763      2961
        _MON  0.3410138249 0.7254901961 0.4639498433       102
        _ORG  0.6570796460 0.5401636860 0.5929129928      3299
        _PCT  0.4615384615 0.8210526316 0.5909090909        95
        _PER  0.7774280576 0.6113861386 0.6844813935      2828
        _TIM  0.2577319588 0.5813953488 0.3571428571        43

   micro avg  0.6257961783 0.6457841224 0.6356330553      9737
   macro avg  0.4899730191 0.6845508863 0.5450237864      9737
weighted avg  0.6627646293 0.6457841224 0.6419329228      9737
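
The gap between the token accuracy (about 0.95) and the micro-averaged F1 (about 0.64) is plausible because most tokens are tagged `O`, while the precision and recall figures appear to be entity-level scores, where a predicted span presumably counts only if both its boundaries and its type match the gold span. The following is a minimal pure-Python sketch of that kind of scoring for underscore-style tags such as `B_LOC`. The helpers `extract_entities` and `micro_prf` are hypothetical, and the sketch is not claimed to reproduce the exact numbers above.

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from tags such as B_LOC, I_LOC, O."""
    entities, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        prefix, _, cur = tag.partition("_")
        continues_span = prefix == "I" and cur == etype
        if start is not None and not continues_span:
            entities.append((etype, start, i - 1))
            start, etype = None, None
        # A B tag always opens a span; a stray I tag is leniently treated as one.
        if prefix == "B" or (prefix == "I" and start is None):
            start, etype = i, cur
    return entities


def micro_prf(true_seqs, pred_seqs):
    """Micro-averaged precision, recall, and F1 over exactly matching spans."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        gold = set(extract_entities(true_tags))
        pred = set(extract_entities(pred_tags))
        tp += len(gold & pred)
        n_true += len(gold)
        n_pred += len(pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


true_seqs = [["B_ORG", "I_ORG", "O", "B_LOC"]]
pred_seqs = [["B_ORG", "I_ORG", "O", "O"]]
precision, recall, f1 = micro_prf(true_seqs, pred_seqs)
print(precision, recall, f1)
```

In this toy example one of the two gold entities is found and nothing spurious is predicted, so precision is 1.0 and recall is 0.5, even though three of the four tokens are labeled correctly.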



In [69]:
# Write one "token<TAB>true label<TAB>predicted label" line per token,
# with a blank line separating consecutive sentences.
output_file_name = "ner_arman-and-peyma-and-wikiann_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
# Upload the results file to Google Drive.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()
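
The saved file stores one tab-separated `token`, `true_label`, `predicted_label` triple per line, with a blank line between sentences. A small reader like the one below could load it back for later analysis. `read_ner_outputs` and the demo file name are hypothetical and used only for illustration; the sketch assumes tokens never contain tab characters.

```python
import os
import tempfile


def read_ner_outputs(path):
    """Parse a token<TAB>true<TAB>predicted file with blank lines between sentences."""
    sentences, current = [], []
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                # A blank line ends the current sentence.
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, true_label, predicted_label = line.split("\t")
            current.append((token, true_label, predicted_label))
    if current:  # file may not end with a blank line
        sentences.append(current)
    return sentences


# Round-trip demo on a throwaway file written in the same layout.
rows = [
    [("منطقه", "B_LOC", "B_LOC"), ("گسل", "I_LOC", "I_LOC")],
    [("و", "O", "O")],
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_outputs.txt")
    with open(demo_path, "w", encoding="utf8") as f:
        for sentence in rows:
            for token, true_label, predicted_label in sentence:
                f.write(f"{token}\t{true_label}\t{predicted_label}\n")
            f.write("\n")
    loaded = read_ner_outputs(demo_path)
print(loaded == rows)
```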