# ParsBERT: Transformer-based Model for Persian Language Understanding
ParsBERT is a monolingual language model based on Google’s BERT architecture with the same configurations as **BERT-Base**.

Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)

**All models (including those for downstream tasks) are uncased** and trained with whole-word masking. (Coming soon, stay tuned.)



## Persian NER [ARMAN, PEYMA, COMPOSITE]

This task aims to extract named entities from text, such as names, and label them with the appropriate **NER** classes, such as locations and organizations. The datasets used for this task contain sentences annotated in the **IOB** format. In this format, tokens that are not part of an entity are tagged **"O"**, the **"B"** tag marks the first word of an entity, and the **"I"** tag marks the remaining words of the same entity. Both the **"B"** and **"I"** tags are followed by a hyphen (or underscore) and then the entity category. The **NER task is therefore a multi-class token classification problem that assigns a label to every token of a raw input text**.
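
As a minimal sketch of the IOB scheme described above, the snippet below groups word-level tags into entity spans. The helper name `iob_to_spans` and the English example sentence are illustrative assumptions and are not part of ParsBERT or its datasets.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        # Accept both "B-ORG" and "B_ORG" style separators.
        prefix, _, entity_type = tag.replace("_", "-", 1).partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A new entity starts here: close the previous one, if any.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = entity_type, [token]
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": the token is outside any entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Illustrative (made-up) example sentence.
tokens = ["Ali", "Rezaei", "works", "at", "Sharif", "University", "in", "Tehran"]
tags = ["B-person", "I-person", "O", "O",
        "B-organization", "I-organization", "O", "B-location"]
print(iob_to_spans(tokens, tags))
```

An "I" tag that does not continue the open entity is treated leniently here as the start of a new entity; a stricter parser could reject it instead.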

There are two primary datasets used in Persian NER: **ARMAN** and **PEYMA**. In ParsBERT, we prepared NER models for each dataset as well as for a combination of both.


In [1]:
!nvidia-smi
!lscpu

Mon Aug 16 07:50:10 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install transformers==4.7.0
!pip install hazm==0.7.0
!pip install seqeval==1.2.2

Collecting transformers==4.7.0
  Downloading transformers-4.7.0-py3-none-any.whl (2.5 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub==0.0.8
  Downloading huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Installing collected packages: tokenizers, sacremoses, huggingface-hub, transformers
Successfully installed huggingface-hub-0.0.8 sacremoses-0.0.45 tokenizers-0.10.3 transformers-4.7.0
Collecting hazm==0.7.0
  Downloading hazm-0.7.0-py3-none-any.whl (316 kB)
Collecting nltk==3.3
  Downloading nltk-3.3.0.zip (1

In [3]:
!pip install PyDrive
import os
import IPython.display as ipd
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)



In [4]:
import os
import gc
import ast
import time
import hazm
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

import transformers
from transformers import AutoTokenizer, AutoConfig
from transformers import AutoModelForTokenClassification

from IPython.display import display, HTML, clear_output
from ipywidgets import widgets, Layout

from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from seqeval.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

print()
print('numpy', np.__version__)
print('pandas', pd.__version__)
print('transformers', transformers.__version__)
print('torch', torch.__version__)
print()

# If there's a GPU available...
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")


numpy 1.19.5
pandas 1.1.5
transformers 4.7.0
torch 1.9.0+cu102

There are 1 GPU(s) available.
We will use the GPU: Tesla K80


In [5]:
class NER:
    def __init__(self, model_name):
        self.normalizer = hazm.Normalizer()
        self.model_name = model_name
        self.config = AutoConfig.from_pretrained(self.model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(self.model_name)
        # self.labels = list(self.config.label2id.keys())
        self.id2label = self.config.id2label

    @staticmethod
    def load_ner_data(file_path, word_index, tag_index, delimiter, join=False):
        dataset, labels = [], []
        with open(file_path, encoding="utf8") as infile:
            sample_text, sample_label = [], []
            for line in infile:
                parts = line.strip().split(delimiter)
                if len(parts) > 1:
                    word, tag = parts[word_index], parts[tag_index]
                    if not word:
                        continue
                    sample_text.append(word)
                    sample_label.append(tag)
                else:
                    # end of sample
                    if sample_text and sample_label:
                        if join:
                            dataset.append(' '.join(sample_text))
                            labels.append(' '.join(sample_label))
                        else:
                            dataset.append(sample_text)
                            labels.append(sample_label)
                    sample_text, sample_label = [], []
        if sample_text and sample_label:
            if join:
                dataset.append(' '.join(sample_text))
                labels.append(' '.join(sample_label))
            else:
                dataset.append(sample_text)
                labels.append(sample_label)
        return dataset, labels

    def load_test_datasets(self, dataset_name, dataset_dir, **kwargs):
        if dataset_name.lower() == "peyma":
            ner_file_path = dataset_dir + 'test.txt'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            return self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter='|',
                                      join=kwargs.get('join', False))
        elif dataset_name.lower() == "arman":
            dataset, labels = [], []
            for i in range(1, 4):
                ner_file_path = dataset_dir + f'test_fold{i}.txt'
                if not os.path.exists(ner_file_path):
                    print(ner_file_path)
                dataset_part, labels_part = self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter=' ',
                                                               join=kwargs.get('join', False))
                dataset += dataset_part
                labels += labels_part
            return dataset, labels
        elif dataset_name.lower() == "hooshvare-peyman+arman+wikiann":
            ner_file_path = dataset_dir + 'test.csv'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            data = pd.read_csv(ner_file_path, delimiter="\t")
            sentences, sentences_tags = data['tokens'].values.tolist(), data['ner_tags'].values.tolist()
            sentences = [ast.literal_eval(ss) for ss in sentences]
            sentences_tags = [ast.literal_eval(ss) for ss in sentences_tags]
            print(f'test part:\n #sentences: {len(sentences)}, #sentences_tags: {len(sentences_tags)}')
            return sentences, sentences_tags

    def load_datasets(self, dataset_name, dataset_dir, **kwargs):
        if dataset_name.lower() == "farsiyar":
            dataset, labels = [], []
            for i in range(1, 6):
                ner_file_path = dataset_dir + f'Persian-NER-part{i}.txt'
                if not os.path.exists(ner_file_path):
                    print(ner_file_path)
                dataset_part, labels_part = self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter='\t',
                                                               join=kwargs.get('join', False))
                dataset += dataset_part
                labels += labels_part
            return dataset, labels
        elif dataset_name.lower() == "wikiann":
            ner_file_path = dataset_dir + 'wikiann-fa.bio'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            dataset_all, labels_all = self.load_ner_data(ner_file_path, word_index=0, tag_index=-1, delimiter=' ',
                                                         join=kwargs.get('join', False))
            print(f'all data: #data: {len(dataset_all)}, #labels: {len(labels_all)}')

            try:
                _, data_test, _, label_test = train_test_split(dataset_all, labels_all, test_size=0.1, random_state=1,
                                                               stratify=labels_all)
                print("with stratify")
            except Exception:
                _, data_test, _, label_test = train_test_split(dataset_all, labels_all, test_size=0.1, random_state=1)
                print("without stratify")
            print(f'test part:\n #data: {len(data_test)}, #labels: {len(label_test)}')
            return dataset_all, labels_all, data_test, label_test

    def ner_inference(self, input_text, device, max_length):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong: the model, tokenizer, or id2label mapping is missing!')
            return

        pt_batch = self.tokenizer(
            [self.normalizer.normalize(sequence) for sequence in input_text],
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt"
        )

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        pt_batch = pt_batch.to(device)
        # Inference only: disable gradient tracking to save memory.
        with torch.no_grad():
            pt_outputs = self.model(**pt_batch)
        pt_predictions = torch.argmax(pt_outputs.logits, dim=-1)
        pt_predictions = pt_predictions.cpu().detach().numpy().tolist()

        output_predictions = []
        for i, sequence in enumerate(input_text):
            tokens = self.tokenizer.tokenize(self.tokenizer.decode(self.tokenizer.encode(sequence)))
            predictions = [(token, self.id2label[prediction]) for token, prediction in
                           zip(tokens, pt_predictions[i])]
            output_predictions.append(predictions)
        return output_predictions

    def ner_evaluation(self, input_text, input_labels, device, batch_size=4):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong: the model, tokenizer, or id2label mapping is missing!')
            return

        max_len = 0
        tokenized_texts, new_labels = [], []
        for sentence, sentence_label in zip(input_text, input_labels):
            if type(sentence) == str:
                sentence = sentence.strip().split()
            if len(sentence) != len(sentence_label):
                print('Something went wrong: a sentence and its labels have different lengths!')
                return
            tokenized_sentence, new_sentence_label = [], []
            for word, label in zip(sentence, sentence_label):
                # Tokenize the word and count # of subwords the word is broken into
                tokenized_word = self.tokenizer.tokenize(word)
                n_subwords = len(tokenized_word)

                # Add the tokenized word to the final tokenized word list
                tokenized_sentence.extend(tokenized_word)
                # Add the same label to the new list of labels `n_subwords` times
                new_sentence_label.extend([label] * n_subwords)

            max_len = max(max_len, len(tokenized_sentence))
            tokenized_texts.append(tokenized_sentence)
            new_labels.append(new_sentence_label)

        max_len = min(max_len, self.config.max_position_embeddings)
        print("max_len:", max_len)
        input_ids = pad_sequences([self.tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                                  maxlen=max_len, dtype="long", value=self.config.pad_token_id,
                                  truncating="post", padding="post")
        del tokenized_texts
        input_labels = pad_sequences([[self.config.label2id.get(l) for l in lab] for lab in new_labels],
                                     maxlen=max_len, value=self.config.label2id.get('O'), padding="post",
                                     dtype="long", truncating="post")
        del new_labels

        train_data = TensorDataset(torch.tensor(input_ids), torch.tensor(input_labels))
        data_loader = DataLoader(train_data, batch_size=batch_size)
        # data_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)
        print("#samples:", len(input_ids))
        print("#batch:", len(data_loader))

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        total_loss, total_time = 0, 0
        output_predictions = []
        print("Start to evaluate test data ...")
        for step, batch in enumerate(data_loader):
            b_input_ids, b_labels = batch

            # move tensors to GPU if CUDA is available
            b_input_ids = b_input_ids.to(device)
            b_labels = b_labels.to(device)

            # Because `labels` is provided, the output contains the loss in addition to the logits.
            with torch.no_grad():
                start = time.monotonic()
                outputs = self.model(b_input_ids, labels=b_labels)
                end = time.monotonic()
                total_time += end - start
                print(f'inference time for step {step}: {end - start}')
            # get the loss
            total_loss += outputs.loss.item()

            b_predictions = torch.argmax(outputs.logits, dim=2)
            b_predictions = b_predictions.cpu().detach().numpy().tolist()
            b_labels = b_labels.cpu().detach().numpy().tolist()

            for i, sample in enumerate(b_input_ids):
                sample_input = list(sample)
                # remove pad tokens
                while sample_input[-1] == self.config.pad_token_id:
                    sample_input.pop()
                # tokens = self.tokenizer.tokenize(self.tokenizer.decode(sample_input))
                tokens = [self.tokenizer.decode([t]) for t in sample_input]
                sample_true_labels = [self.id2label[e] for e in b_labels[i][:len(sample_input)]]
                sample_predictions = [self.id2label[e] for e in b_predictions[i][:len(sample_input)]]
                output_predictions.append(
                    [(t, sample_true_labels[j], sample_predictions[j]) for j, t in enumerate(tokens)])

        # Calculate the average loss over the evaluation batches.
        avg_eval_loss = total_loss / len(data_loader)
        print("average loss:", avg_eval_loss)
        print("total inference time:", total_time)
        print("total inference time / #samples:", total_time / len(input_ids))

        return output_predictions

    def ner_evaluation_2(self, input_text, input_labels, device, batch_size=4):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong: the model, tokenizer, or id2label mapping is missing!')
            return

        print("len(input_text):", len(input_text))
        print("len(input_labels):", len(input_labels))
        c = 0
        max_len = 0
        tokenized_texts, new_labels = [], []
        for sentence, sentence_label in zip(input_text, input_labels):
            if type(sentence) == str:
                sentence = sentence.strip().split()
            if len(sentence) != len(sentence_label):
                print('Something went wrong: a sentence and its labels have different lengths!')
                return
            tokenized_words = self.tokenizer(sentence, padding=False, add_special_tokens=False).input_ids
            tokenized_sentence_ids, new_sentence_label = [], []
            for i, tokenized_word in enumerate(tokenized_words):
                # Add the tokenized word to the final tokenized word list
                tokenized_sentence_ids += tokenized_word
                # Add the same label to the new list of labels `number of subwords` times
                new_sentence_label.extend([self.config.label2id.get(sentence_label[i])] * len(tokenized_word))

            max_len = max(max_len, len(tokenized_sentence_ids))
            tokenized_texts.append(tokenized_sentence_ids)
            new_labels.append(new_sentence_label)
            c += 1
            if c % 10000 == 0:
                print("c:", c)
        max_len = min(max_len, self.config.max_position_embeddings)
        print("max_len:", max_len)
        input_ids = pad_sequences(tokenized_texts, maxlen=max_len, dtype="long", value=self.config.pad_token_id,
                                  truncating="post", padding="post")
        del tokenized_texts
        input_labels = pad_sequences(new_labels, maxlen=max_len, value=self.config.label2id.get('O'), padding="post",
                                     dtype="long", truncating="post")
        del new_labels

        train_data = TensorDataset(torch.tensor(input_ids), torch.tensor(input_labels))
        data_loader = DataLoader(train_data, batch_size=batch_size)
        # data_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)
        print("#samples:", len(input_ids))
        print("#batch:", len(data_loader))

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        total_time = 0
        output_predictions = []
        print("Start to evaluate test data ...")
        for step, batch in enumerate(data_loader):
            b_input_ids, b_labels = batch

            # move tensors to GPU if CUDA is available
            b_input_ids = b_input_ids.to(device)
            b_labels = b_labels.to(device)

            # Because `labels` is provided, the output contains the loss in addition to the logits.
            with torch.no_grad():
                start = time.monotonic()
                outputs = self.model(b_input_ids, labels=b_labels)
                end = time.monotonic()
                total_time += end - start
                print(f'inference time for step {step}: {end - start}')

            b_predictions = torch.argmax(outputs.logits, dim=2)
            b_predictions = b_predictions.cpu().detach().numpy().tolist()
            b_labels = b_labels.cpu().detach().numpy().tolist()

            for i, sample in enumerate(b_input_ids):
                sample_input = list(sample)
                # remove pad tokens
                while sample_input[-1] == self.config.pad_token_id:
                    sample_input.pop()
                # tokens = self.tokenizer.tokenize(self.tokenizer.decode(sample_input))
                tokens = [self.tokenizer.decode([t]) for t in sample_input]
                sample_true_labels = [self.id2label[e] for e in b_labels[i][:len(sample_input)]]
                sample_predictions = [self.id2label[e] for e in b_predictions[i][:len(sample_input)]]
                output_predictions.append(
                    [(t, sample_true_labels[j], sample_predictions[j]) for j, t in enumerate(tokens)])

        print("total inference time:", total_time)
        print("total inference time / #samples:", total_time / len(input_ids))

        return output_predictions

    def check_input_label_consistency(self, labels):
        model_labels = self.config.label2id.keys()
        dataset_labels = set()
        for l in labels:
            dataset_labels.update(set(l))
        print("model labels:", model_labels)
        print("dataset labels:", dataset_labels)
        print("intersection:", set(model_labels).intersection(dataset_labels))
        print("model_labels-dataset_labels:", list(set(model_labels) - set(dataset_labels)))
        print("dataset_labels-model_labels:", list(set(dataset_labels) - set(model_labels)))
        if list(set(dataset_labels) - set(model_labels)):
            return False
        return True

    @staticmethod
    def resolve_input_label_consistency(labels, label_translation_map):
        for i, sentence_labels in enumerate(labels):
            for j, label in enumerate(sentence_labels):
                labels[i][j] = label_translation_map.get(label)
        return labels

    @staticmethod
    def evaluate_prediction_results(labels, output_predictions):
        dataset_labels = set()
        for label in labels:
            dataset_labels.update(set(label))

        true_labels, predictions = [], []
        for sample_output in output_predictions:
            sample_true_labels = []
            sample_predicted_labels = []
            for token, true_label, predicted_label in sample_output:
                sample_true_labels.append(true_label)
                if predicted_label in dataset_labels:
                    sample_predicted_labels.append(predicted_label)
                else:
                    sample_predicted_labels.append('O')
            true_labels.append(sample_true_labels)
            predictions.append(sample_predicted_labels)

        print("Test Accuracy: {}".format(accuracy_score(true_labels, predictions)))
        print("Test Precision: {}".format(precision_score(true_labels, predictions)))
        print("Test Recall: {}".format(recall_score(true_labels, predictions)))
        print("Test F1-Score: {}".format(f1_score(true_labels, predictions)))
        print("Test classification Report:\n{}".format(classification_report(true_labels, predictions, digits=10)))
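
`seqeval` scores precision, recall, and F1 at the entity level, so a predicted entity counts as correct only when both its type and its boundaries match the gold annotation, whereas `accuracy_score` is computed per token. The sketch below illustrates that idea in plain Python. It is a simplified approximation for intuition, not seqeval's implementation, and `extract_entities` and `entity_prf` are hypothetical helper names.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one IOB tag sequence."""
    entities, start, current = set(), None, None
    # The trailing "O" is a sentinel that closes an entity ending the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if current is not None and not continues:
            entities.add((current, start, i - 1))
            current = None
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, etype
    return entities


def entity_prf(true_sequences, pred_sequences):
    """Micro-averaged entity-level precision, recall, and F1."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(true_sequences, pred_sequences):
        gold, pred = extract_entities(true_tags), extract_entities(pred_tags)
        tp += len(gold & pred)  # exact match on type and boundaries
        n_true += len(gold)
        n_pred += len(pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up example: one of two gold entities is recovered.
y_true = [["B-person", "I-person", "O", "B-location"]]
y_pred = [["B-person", "I-person", "O", "O"]]
print(entity_prf(y_true, y_pred))
```

Because the evaluation methods above repeat each word's label over all of its subwords, entity boundaries in this notebook are measured in subword positions rather than word positions.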


In [6]:
model_name='HooshvareLab/bert-base-parsbert-ner-uncased'
ner_model = NER(model_name)


In [7]:
print(ner_model.config)

BertConfig {
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "B-date",
    "1": "B-event",
    "2": "B-facility",
    "3": "B-location",
    "4": "B-money",
    "5": "B-organization",
    "6": "B-percent",
    "7": "B-person",
    "8": "B-product",
    "9": "B-time",
    "10": "I-date",
    "11": "I-event",
    "12": "I-facility",
    "13": "I-location",
    "14": "I-money",
    "15": "I-organization",
    "16": "I-percent",
    "17": "I-person",
    "18": "I-product",
    "19": "I-time",
    "20": "O"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-date": 0,
    "B-event": 1,
    "B-facility": 2,
    "B-location": 3,
    "B-money": 4,
    "B-organization": 5,
    "B-percent": 6,
    "B-person": 7,
    "B-product": 8,
    "B-time": 9,
    "I-date": 10,
    "I-event"

#### Sample Inference:

In [8]:
texts = [
    "مدیرکل محیط زیست استان البرز با بیان اینکه با بیان اینکه موضوع شیرابه‌های زباله‌های انتقال یافته در منطقه حلقه دره خطری برای این استان است، گفت: در این مورد گزارشاتی در ۲۵ مرداد ۱۳۹۷ تقدیم مدیران استان شده است.",
    "به گزارش خبرگزاری تسنیم از کرج، حسین محمدی در نشست خبری مشترک با معاون خدمات شهری شهرداری کرج که با حضور مدیرعامل سازمان‌های پسماند، پارک‌ها و فضای سبز و نماینده منابع طبیعی در سالن کنفرانس شهرداری کرج برگزار شد، اظهار داشت: ۸۰٪  جمعیت استان البرز در کلانشهر کرج زندگی می‌کنند.",
    "وی افزود: با همکاری‌های مشترک بین اداره کل محیط زیست و شهرداری کرج برنامه‌های مشترکی برای حفاظت از محیط زیست در شهر کرج در دستور کار قرار گرفته که این اقدامات آثار مثبتی داشته و تاکنون نزدیک به ۱۰۰ میلیارد هزینه جهت خریداری اکس-ریس صورت گرفته است.",
]

In [9]:
inference_output = ner_model.ner_inference(texts, device, ner_model.config.max_position_embeddings)

In [10]:
print(inference_output)

[[('[CLS]', 'O'), ('مدیرکل', 'O'), ('محیط', 'B-organization'), ('زیست', 'I-organization'), ('استان', 'I-organization'), ('البرز', 'I-organization'), ('با', 'O'), ('بیان', 'O'), ('اینکه', 'O'), ('با', 'O'), ('بیان', 'O'), ('اینکه', 'O'), ('موضوع', 'O'), ('شیرابه', 'O'), ('##های', 'O'), ('زبالههای', 'O'), ('انتقال', 'O'), ('یافته', 'O'), ('در', 'O'), ('منطقه', 'B-location'), ('حلقه', 'I-location'), ('دره', 'I-location'), ('خطری', 'O'), ('برای', 'O'), ('این', 'O'), ('استان', 'O'), ('است', 'O'), ('،', 'O'), ('گفت', 'O'), (':', 'O'), ('در', 'O'), ('این', 'O'), ('مورد', 'O'), ('گزارشاتی', 'O'), ('در', 'O'), ('۲۵', 'B-date'), ('مرداد', 'I-date'), ('۱۳۹۷', 'I-date'), ('تقدیم', 'O'), ('مدیران', 'O'), ('استان', 'O'), ('شده', 'O'), ('است', 'O'), ('.', 'O'), ('[SEP]', 'O')], [('[CLS]', 'O'), ('به', 'O'), ('گزارش', 'O'), ('خبرگزاری', 'B-organization'), ('تسنیم', 'I-organization'), ('از', 'O'), ('کرج', 'B-location'), ('،', 'O'), ('حسین', 'B-person'), ('محمدی', 'I-person'), ('در', 'O'), ('نشست', 'O')
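
The output above is at the WordPiece level, so continuation pieces such as `##های` and the special tokens `[CLS]` and `[SEP]` appear as separate entries. A minimal sketch for turning such a list back into word-level pairs follows. The helper name `merge_wordpieces` is hypothetical, and keeping the label of the first piece of each word is an assumed convention rather than something the notebook prescribes.

```python
def merge_wordpieces(token_label_pairs, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Merge '##' continuation pieces into words, keeping the first piece's label."""
    words = []
    for token, label in token_label_pairs:
        if token in special_tokens:
            continue  # drop special tokens added by the tokenizer
        if token.startswith("##") and words:
            # Glue the continuation piece onto the previous word.
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


# A few pairs copied from the first sentence of the output above.
sample = [
    ("[CLS]", "O"),
    ("شیرابه", "O"),
    ("##های", "O"),
    ("منطقه", "B-location"),
    ("حلقه", "I-location"),
    ("[SEP]", "O"),
]
print(merge_wordpieces(sample))
```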

In [11]:
#@title Live Playground { display-mode: "form" }

css_is_load = False
css = """<style>
.ner-box {
    direction: rtl;
    font-size: 18px !important;
    line-height: 20px !important;
    margin: 0 0 15px;
    padding: 10px;
    text-align: justify;
    color: #343434 !important;
}
.token, .token span {
    display: inline-block !important;
    padding: 2px;
    margin: 2px 0;
}
.token.token-ner {
    background-color: #f6cd61;
    font-weight: bold;
    color: #000;
}
.token.token-ner .ner-label {
    color: #9a1f40;
    margin: 0px 2px;
}
</style>"""

if not css_is_load:
    display(HTML(css))
    css_is_load = True

submit_wd = widgets.Button(description='Send', disabled=False, button_style='success', tooltip='Submit')
text_wd = widgets.Textarea(placeholder='Please enter your text ...', rows=5, layout=Layout(width='90%'))
output_wd = widgets.Output()

display(HTML("""
<h2>Test NER model</h2>
<p style="padding: 2px 20px; margin: 0 0 20px;">
</p>
<br /><br />
"""))

display(text_wd)
display(submit_wd)
display(output_wd)

def submit_text(sender):
    with output_wd:
        clear_output(wait=True)
        text = text_wd.value
        _output = ner_model.ner_inference([text], device, ner_model.config.max_position_embeddings)
        # print(_output)
        pred_sequence = []
        for token, label in _output[0]:
            if token not in ['[CLS]', '[SEP]']:
                if label != 'O':
                    pred_sequence.append(
                        '<span class="token token-ner">%s<span class="ner-label">%s</span></span>' 
                        % (token, label))
                else:
                    pred_sequence.append(
                        '<span class="token">%s</span>' 
                        % token)
            
        html = '<p class="ner-box">%s</p>' % ' '.join(pred_sequence) 
        display(HTML(html))

submit_wd.on_click(submit_text)


#### PEYMA dataset:
The PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens, of which 41,148 are tagged with one of seven classes:

- Organization
- Money
- Location
- Date
- Time
- Person
- Percent

|     Label    | Tagged tokens |
|:------------:|:-------------:|
| Organization | 16964 |
|     Money    |  2037 |
|   Location   |  8782 |
|     Date     |  4259 |
|     Time     |  732  |
|    Person    |  7675 |
|    Percent   |  699  |
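
Per-class counts like those in the table can be derived from IOB label sequences with a sketch such as the following. It assumes labels of the form `B_XXX` / `I_XXX` (a two-character prefix followed by the class name), and the two toy label sequences are made up for illustration; `count_tagged_tokens` is a hypothetical helper name.

```python
from collections import Counter


def count_tagged_tokens(label_sequences):
    """Count tokens per entity class, ignoring the 'O' tag."""
    counts = Counter()
    for sentence_labels in label_sequences:
        for label in sentence_labels:
            if label == "O":
                continue
            # Strip the two-character "B_" / "I_" prefix: "B_ORG" -> "ORG".
            counts[label[2:]] += 1
    return counts


# Toy label sequences for illustration only.
toy_labels = [
    ["O", "O", "B_ORG", "O", "O", "B_LOC", "O", "O"],
    ["B_PER", "I_PER", "O", "B_DAT", "I_DAT", "I_DAT"],
]
counts = count_tagged_tokens(toy_labels)
print(counts)
print("total tagged tokens:", sum(counts.values()))
```

As a consistency check on the table itself, its seven counts sum to 41,148, which matches the number of tagged tokens quoted above.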

Download
You can download the dataset from [here](https://hooshvare.github.io/docs/datasets/ner), which leads to the following Google Drive file from HooshvareLab:

In [12]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1WZxpFRtEs5HZWyWQ2Pyg9CCuIBs1Kmvx'})
download.GetContentFile('peyma.zip')
!ls

adc.json  peyma.zip  sample_data


In [13]:
!unzip peyma.zip
!ls
!ls peyma

Archive:  peyma.zip
   creating: peyma/
  inflating: peyma/dev.txt           
  inflating: peyma/test.txt          
  inflating: peyma/train.txt         
adc.json  peyma  peyma.zip  sample_data
dev.txt  test.txt  train.txt


In [14]:
sentences, labels = ner_model.load_test_datasets(dataset_name="peyma", dataset_dir="./peyma/")
print(len(sentences), len(labels))
print(sentences[0])
print(labels[0])

1026 1026
['کنایه', 'سرلشگر', 'فیروزآبادی', 'به', 'پادشاه', 'عربستان', 'و', 'پسرش']
['O', 'O', 'B_ORG', 'O', 'O', 'B_LOC', 'O', 'O']


In [15]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-date', 'B-event', 'B-facility', 'B-location', 'B-money', 'B-organization', 'B-percent', 'B-person', 'B-product', 'B-time', 'I-date', 'I-event', 'I-facility', 'I-location', 'I-money', 'I-organization', 'I-percent', 'I-person', 'I-product', 'I-time', 'O'])
dataset labels: {'I_MON', 'B_PER', 'B_DAT', 'B_TIM', 'O', 'B_ORG', 'B_PCT', 'B_LOC', 'I_LOC', 'I_DAT', 'I_TIM', 'I_ORG', 'I_PCT', 'B_MON', 'I_PER'}
intersection: {'O'}
model_labels-dataset_labels: ['B-event', 'I-organization', 'I-money', 'B-product', 'I-facility', 'I-product', 'B-facility', 'I-percent', 'I-person', 'B-date', 'I-event', 'I-date', 'B-organization', 'I-time', 'B-time', 'I-location', 'B-percent', 'B-location', 'B-person', 'B-money']
dataset_labels-model_labels: ['I_DAT', 'I_LOC', 'I_MON', 'B_LOC', 'B_PER', 'I_TIM', 'I_ORG', 'I_PCT', 'B_DAT', 'B_MON', 'B_TIM', 'I_PER', 'B_ORG', 'B_PCT']
False


In [16]:
label_translate = {
    'B_PER': 'B-person', 
    'I_PER': 'I-person',
    'B_LOC': 'B-location',
    'I_LOC': 'I-location',
    'B_ORG': 'B-organization',
    'I_ORG': 'I-organization', 
    'B_MON': 'B-money',
    'I_MON': 'I-money', 
    'B_DAT': 'B-date', 
    'I_DAT': 'I-date',
    'B_TIM': 'B-time',
    'I_TIM': 'I-time', 
    'B_PCT': 'B-percent',
    'I_PCT': 'I-percent',
    'O': 'O'
}
labels = ner_model.resolve_input_label_consistency(labels, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-date', 'B-event', 'B-facility', 'B-location', 'B-money', 'B-organization', 'B-percent', 'B-person', 'B-product', 'B-time', 'I-date', 'I-event', 'I-facility', 'I-location', 'I-money', 'I-organization', 'I-percent', 'I-person', 'I-product', 'I-time', 'O'])
dataset labels: {'I-date', 'B-organization', 'I-time', 'B-time', 'I-organization', 'I-money', 'I-location', 'B-percent', 'B-location', 'O', 'I-percent', 'I-person', 'B-date', 'B-person', 'B-money'}
intersection: {'I-person', 'I-date', 'B-organization', 'I-time', 'B-time', 'I-organization', 'I-money', 'I-location', 'B-date', 'B-person', 'B-percent', 'B-location', 'B-money', 'O', 'I-percent'}
model_labels-dataset_labels: ['B-event', 'B-product', 'I-facility', 'I-product', 'I-event', 'B-facility']
dataset_labels-model_labels: []
True
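
The translation step above amounts to a dictionary lookup applied to every tag of every sentence. The source of `resolve_input_label_consistency` is not shown in this notebook, so the following is only a minimal sketch of the idea; `translate_labels` is a hypothetical helper, not the notebook's implementation.

```python
from typing import Dict, List


def translate_labels(labels: List[List[str]], mapping: Dict[str, str]) -> List[List[str]]:
    """Map every tag of every sentence through `mapping`.

    A tag missing from `mapping` raises KeyError, so an incomplete
    translation table is noticed immediately instead of silently
    producing labels the model does not know.
    """
    return [[mapping[tag] for tag in sentence_labels] for sentence_labels in labels]


# One sentence tagged in PEYMA's original scheme (underscore-separated tags).
peyma_example = [['O', 'O', 'B_ORG', 'O', 'O', 'B_LOC', 'O', 'O']]

# Only the entries needed for this example; the full table is in the cell above.
mapping_subset = {'B_ORG': 'B-organization', 'B_LOC': 'B-location', 'O': 'O'}

translated = translate_labels(peyma_example, mapping_subset)
print(translated)
```

Raising on unknown tags is a design choice of this sketch: it keeps the follow-up consistency check from being the only safeguard.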


In [17]:
!nvidia-smi
!lscpu

Mon Aug 16 07:53:02 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P0    53W / 149W |   1139MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [18]:
inference_output_peyma = ner_model.ner_evaluation(sentences, labels, device, batch_size=512)

max_len: 151
#samples: 1026
#batch: 3
Start to evaluate test data ...
inference time for step 0: 0.03492030799998247
inference time for step 1: 0.012574978000031933
inference time for step 2: 0.013928769000017382
average loss: 0.06544495125611623
total inference time: 0.061424055000031785
total inference time / #samples: 5.986750000003098e-05
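
The two summary lines follow directly from the per-step timings: the total is the sum of the step times, and the per-sample figure divides that total by the number of samples. The snippet below simply redoes that arithmetic with the values printed above; it does not call the notebook's evaluation code.

```python
# Per-step inference times (seconds) copied from the PEYMA run above.
step_times = [0.03492030799998247, 0.012574978000031933, 0.013928769000017382]
num_samples = 1026

total_time = sum(step_times)             # total inference time
time_per_sample = total_time / num_samples

print("total inference time:", total_time)
print("total inference time / #samples:", time_per_sample)
```

The first step is noticeably slower than the later ones, which is often a one-time warm-up cost, so the per-sample average is slightly pessimistic for such a short run.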


In [19]:
for sample_output in inference_output_peyma[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

کنایه	O	O
سرلشگر	O	O
فیروزابادی	B-organization	B-person
به	O	O
پادشاه	O	O
عربستان	B-location	B-location
و	O	O
پسرش	O	O

ريیس	O	O
سابق	O	O
ستاد	B-organization	B-organization
کل	I-organization	I-organization
نیروهای	I-organization	I-organization
مسلح	I-organization	I-organization
با	O	O
بیان	O	O
اینکه	O	O
ال	O	O
سعود	O	O
با	O	O
حمایت	O	O
همه	O	O
جانبه	O	O
غرب	O	O
بر	O	O
سرزمین	B-location	B-location
حجاز	I-location	I-location
حاکم	O	O
شد	O	O
گفت	O	O
:	O	O
غرب	O	O
با	O	O
حاکم	O	O
کردد	O	O
ال	O	O
سعود	O	O
بر	O	O
حجاز	B-location	B-location
هدفی	O	O
جز	O	O
##ناب	O	O
##ودی	O	O
اسلام	O	O
نداشته	O	O
و	O	O
این	O	O
نقشه	O	O
انگلیس	B-location	B-location
بود	O	O
.	O	O

سرلشگر	O	O
حسن	B-person	B-person
فیروزابادی	I-person	I-person
روز	O	B-date
دوشنبه	O	O
درحاشیه	O	O
ايین	O	O
ختم	O	O
مادر	O	O
حیدر	B-person	B-person
مصلحی	I-person	I-person
درجمع	O	O
خبرنگاران	O	O
درباره	O	O
موضوع	O	O
یمن	B-location	B-location
افزود	O	O
:	O	O
ماهیت	O	O
انچه	O	O
در	O	O
یمن	B-location	B-location
اتفاق	O	O
می	O	O
افتد	O	O
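
In the printed samples, tokens that start with `##` are word pieces produced by the tokenizer, and each piece appears to carry the label of the word it was split from. The notebook's own alignment code is not shown, so the sketch below only illustrates that convention; `align_labels_to_pieces` and `toy_tokenize` are hypothetical names, and `toy_tokenize` is a stand-in for a real WordPiece tokenizer.

```python
from typing import Callable, List, Tuple


def align_labels_to_pieces(
    words: List[str],
    labels: List[str],
    tokenize: Callable[[str], List[str]],
) -> List[Tuple[str, str]]:
    """Repeat each word-level label on every piece of that word."""
    aligned = []
    for word, label in zip(words, labels):
        for piece in tokenize(word):
            aligned.append((piece, label))
    return aligned


def toy_tokenize(word: str) -> List[str]:
    """Toy splitter (assumption): words longer than 4 characters are cut
    into a 4-character head and a '##'-prefixed tail."""
    if len(word) <= 4:
        return [word]
    return [word[:4], "##" + word[4:]]


pairs = align_labels_to_pieces(["visit", "Iran"], ["O", "B-location"], toy_tokenize)
for piece, label in pairs:
    print("{}\t{}".format(piece, label))
```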


In [20]:
ner_model.evaluate_prediction_results(labels, inference_output_peyma)

Test Accuracy: 0.9661031617816755
Test Precision: 0.8095768374164811
Test Recall: 0.6855256954266855
Test F1-Score: 0.7424049017104928
Test classification Report:
              precision    recall  f1-score   support

        date  0.7032967033 0.2909090909 0.4115755627       220
    location  0.9111498258 0.8616144975 0.8856900931       607
       money  0.3913043478 0.3461538462 0.3673469388        26
organization  0.8136826783 0.7862165963 0.7997138770       711
     percent  0.8000000000 0.3200000000 0.4571428571        50
      person  0.7105943152 0.5670103093 0.6307339450       485
        time  0.5714285714 0.3636363636 0.4444444444        22

   micro avg  0.8095768374 0.6855256954 0.7424049017      2121
   macro avg  0.7002080631 0.5050772434 0.5709496740      2121
weighted avg  0.7985408712 0.6855256954 0.7283587842      2121
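
The three average rows of the report can be reproduced from the per-class rows: the macro average is the unweighted mean over classes, the weighted average uses each class's support as its weight, and the micro F1 is the harmonic mean of the overall precision and recall. The following sketch recomputes the F1 averages from the numbers printed above; it is a check of the arithmetic, not the notebook's evaluation code.

```python
# Per-class (f1-score, support) pairs copied from the PEYMA report above.
report = {
    "date":         (0.4115755627, 220),
    "location":     (0.8856900931, 607),
    "money":        (0.3673469388, 26),
    "organization": (0.7997138770, 711),
    "percent":      (0.4571428571, 50),
    "person":       (0.6307339450, 485),
    "time":         (0.4444444444, 22),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class is weighted by its number of true entities.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

# Micro average: harmonic mean of the overall precision and recall.
precision, recall = 0.8095768374164811, 0.6855256954266855
micro_f1 = 2 * precision * recall / (precision + recall)

print(total_support, macro_f1, weighted_f1, micro_f1)
```

The gap between the macro F1 (about 0.57) and the micro F1 (about 0.74) reflects the weak scores on the small classes (`date`, `money`, `percent`, `time`), which the macro average does not down-weight.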



In [21]:
# Write one token per line as "token<TAB>true label<TAB>predicted label",
# with a blank line between sentences.
output_file_name = "ner_peyma_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_peyma:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
# Authenticate the Colab user and upload the results file to Google Drive.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### Arman dataset:
The ARMAN dataset holds 7,682 sentences with 250,015 tokens tagged over six different classes.

1. Organization
2. Location
3. Facility
4. Event
5. Product
6. Person


|     Label    |   #   |
|:------------:|:-----:|
| Organization | 30108 |
|   Location   | 12924 |
|   Facility   |  4458 |
|     Event    |  7557 |
|    Product   |  4389 |
|    Person    | 15645 |

**Download**

You can download the dataset from the [PersianNER repository](https://github.com/HaniehP/PersianNER).


In [22]:
!wget https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
!ls

--2021-08-16 07:53:35--  https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/HaniehP/PersianNER/master/ArmanPersoNERCorpus.zip [following]
--2021-08-16 07:53:35--  https://raw.githubusercontent.com/HaniehP/PersianNER/master/ArmanPersoNERCorpus.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1931170 (1.8M) [application/zip]
Saving to: ‘ArmanPersoNERCorpus.zip’


2021-08-16 07:53:35 (66.6 MB/s) - ‘ArmanPersoNERCorpus.zip’ saved [1931170/1931170]

adc.json							   peyma
ArmanPersoNERCorpus.zip						   peyma.zip
ner_peyma_Hoos

In [23]:
!unzip ArmanPersoNERCorpus.zip -d arman
!ls

Archive:  ArmanPersoNERCorpus.zip
  inflating: arman/test_fold1.txt    
  inflating: arman/ReadMe.txt        
  inflating: arman/train_fold3.txt   
  inflating: arman/train_fold2.txt   
  inflating: arman/train_fold1.txt   
  inflating: arman/test_fold3.txt    
  inflating: arman/test_fold2.txt    
adc.json							   peyma
arman								   peyma.zip
ArmanPersoNERCorpus.zip						   sample_data
ner_peyma_HooshvareLab-bert-base-parsbert-ner-uncased_outputs.txt


In [24]:
sentences, labels = ner_model.load_test_datasets(dataset_name="arman", dataset_dir="./arman/")
print(len(sentences), len(labels))
print(sentences[0])
print(labels[0])

7681 7681
['افقی', ':', '0', 'ـ', 'از', 'عوامل', 'دوران', 'پهلوی', 'و', 'نخست\u200cوزیر', 'ایران', 'در', 'سالهای', 'ابتدائی', 'دهه', 'چهل', 'خورشیدی', 'كه', 'جلد', 'سوم', 'یادداشتهایش', 'هم', 'چندی', 'پیش', 'در', 'تهران', 'منتشر', 'شد', '0', 'ـ', 'پرستاری', 'از', 'ناخوش\u200cاحوال', 'ـ', 'پوشاک', 'و', 'جامه', 'ـ', 'فانتزی', 'و', 'شیک', '0', 'ـ', 'در', 'حال', 'وزیدن', 'ـ', 'اطلاعیه', 'ـ', 'پایتخت', 'جمهوری', 'استونی', 'در', 'حوضه', 'بالتیک', '0', 'ـ', 'علم', 'راهبرد', 'مؤسسه', 'و', 'سازمان', 'ـ', 'نوعی', 'شمع', '0', 'ـ', 'حرف', 'جمع', 'مؤنث', 'ـ', 'در', 'ایران', 'به', 'تولیدکننده', 'کتاب', 'اطلاق', 'می\u200cشود', 'ـ', 'از', 'شهرهای', 'باختری', 'افغانستان', 'كه', 'تا', 'عصر', 'ناصرالدین\u200cشاه', 'جزئی', 'از', 'خراسان', 'بود', 'ـ', 'ویتامین', 'انعقاد', '0', 'ـ', 'سبزی', 'غده\u200cای', 'ـ', 'دوستی', 'و', 'محبت', 'ـ', 'داستان', 'بلند', 'ـ', 'شهری', 'در', 'آلمان', '0', 'ـ', 'سلول', 'بدن', 'موجودات', 'ـ', 'از', 'انواع', 'کالباس', '0', 'ـ', 'حاشیه', 'و', 'هامش', 'ـ', 'پیدا', 'نشدنی', 'ـ', 'خ

In [25]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-date', 'B-event', 'B-facility', 'B-location', 'B-money', 'B-organization', 'B-percent', 'B-person', 'B-product', 'B-time', 'I-date', 'I-event', 'I-facility', 'I-location', 'I-money', 'I-organization', 'I-percent', 'I-person', 'I-product', 'I-time', 'O'])
dataset labels: {'I-loc', 'B-event', 'B-fac', 'I-fac', 'I-org', 'B-pro', 'O', 'I-pro', 'B-org', 'B-pers', 'B-loc', 'I-pers', 'I-event'}
intersection: {'O', 'B-event', 'I-event'}
model_labels-dataset_labels: ['I-person', 'I-date', 'B-organization', 'I-time', 'B-time', 'I-organization', 'I-money', 'B-date', 'B-product', 'I-location', 'B-person', 'B-percent', 'I-facility', 'I-product', 'B-location', 'B-money', 'B-facility', 'I-percent']
dataset_labels-model_labels: ['I-loc', 'I-pro', 'B-org', 'B-fac', 'I-fac', 'I-org', 'B-pro', 'B-pers', 'B-loc', 'I-pers']
False
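
The report printed by the consistency check reads like a set comparison between the model's label inventory and the tags found in the dataset: their intersection, the two differences, and a final verdict that is `True` once no dataset tag is unknown to the model. The sketch below reproduces that logic under this assumption; `compare_label_sets` is a hypothetical helper, not the notebook's `check_input_label_consistency`.

```python
from typing import Dict, Iterable, List


def compare_label_sets(model_labels: Iterable[str], dataset_labels: List[List[str]]) -> Dict[str, object]:
    """Compare the model's tag inventory with the tags used in a dataset."""
    model_set = set(model_labels)
    dataset_set = {tag for sentence in dataset_labels for tag in sentence}
    return {
        "intersection": model_set & dataset_set,
        "model_only": model_set - dataset_set,
        "dataset_only": dataset_set - model_set,
        # Consistent means: every dataset tag is one the model can predict.
        "consistent": dataset_set <= model_set,
    }


# Reduced example: ARMAN-style short tags against long model tags.
model_labels = ['B-event', 'I-event', 'B-location', 'I-location', 'O']
dataset_labels = [['O', 'B-loc', 'I-loc'], ['B-event', 'I-event', 'O']]

result = compare_label_sets(model_labels, dataset_labels)
print(result["consistent"], sorted(result["dataset_only"]))
```

Extra labels on the model side do not make the check fail, which matches the outputs in this notebook: the verdict is `True` whenever `dataset_labels-model_labels` is empty.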


In [26]:
label_translate = {
    'B-org': 'B-organization', 
    'I-org': 'I-organization',
    'B-loc': 'B-location',
    'I-loc': 'I-location',
    'B-pers': 'B-person', 
    'I-pers': 'I-person',
    'B-event': 'B-event', 
    'I-event': 'I-event',
    'B-pro': 'B-product', 
    'I-pro': 'I-product', 
    'B-fac': 'B-facility', 
    'I-fac': 'I-facility',
    'O': 'O'
}
labels = ner_model.resolve_input_label_consistency(labels, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-date', 'B-event', 'B-facility', 'B-location', 'B-money', 'B-organization', 'B-percent', 'B-person', 'B-product', 'B-time', 'I-date', 'I-event', 'I-facility', 'I-location', 'I-money', 'I-organization', 'I-percent', 'I-person', 'I-product', 'I-time', 'O'])
dataset labels: {'B-organization', 'B-event', 'I-organization', 'I-location', 'B-product', 'B-location', 'I-facility', 'I-product', 'O', 'B-facility', 'I-person', 'B-person', 'I-event'}
intersection: {'I-person', 'B-organization', 'B-event', 'I-organization', 'I-location', 'B-product', 'B-person', 'B-location', 'I-facility', 'I-product', 'I-event', 'O', 'B-facility'}
model_labels-dataset_labels: ['I-date', 'I-time', 'B-time', 'I-money', 'B-date', 'B-percent', 'B-money', 'I-percent']
dataset_labels-model_labels: []
True


batch size=256 -> inference time for one batch is about 205 s

batch size=512 -> inference time for one batch is about 410 s

batch size=1024 -> crash

In [27]:
!nvidia-smi
!lscpu

Mon Aug 16 07:53:36 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P0    56W / 149W |   5371MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [28]:
inference_output_arman = ner_model.ner_evaluation(sentences, labels, device, batch_size=512)

max_len: 253
#samples: 7681
#batch: 16
Start to evaluate test data ...
inference time for step 0: 0.02770143099996858
inference time for step 1: 0.012351124000019809
inference time for step 2: 0.012560876000009102
inference time for step 3: 0.013245058999984849
inference time for step 4: 0.012573863000000074
inference time for step 5: 0.01375251800004662
inference time for step 6: 0.016405250000048
inference time for step 7: 0.013139903000023878
inference time for step 8: 0.013329993000070317
inference time for step 9: 0.013186048999955347
inference time for step 10: 0.012356480999983432
inference time for step 11: 0.015441280000004554
inference time for step 12: 0.013178325000012592
inference time for step 13: 0.013221824999959608
inference time for step 14: 0.013505151999993359
inference time for step 15: 0.013313620999952036
average loss: 0.19347388856112957
total inference time: 0.22926275000003216
total inference time / #samples: 2.9848034110146093e-05
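
The `#batch` value printed at the start of each run is consistent with ceiling division of the sample count by the batch size: 1,026 PEYMA sentences give 3 batches of 512, and 7,681 ARMAN sentences give 16. A minimal sketch of that batching, with `num_batches` and `iter_batches` as hypothetical helper names (the notebook's own batching code is not shown):

```python
import math
from typing import Iterator, List, Sequence


def num_batches(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to cover all samples (last one may be short)."""
    return math.ceil(num_samples / batch_size)


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


print(num_batches(1026, 512), num_batches(7681, 512))

# Batch sizes for an ARMAN-sized input: 15 full batches plus a final short one.
batch_sizes: List[int] = [len(batch) for batch in iter_batches(list(range(7681)), 512)]
print(len(batch_sizes), batch_sizes[-1])
```

With 7,681 samples the final batch contains a single sentence, so its step time says little about throughput at the full batch size.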


In [29]:
for sample_output in inference_output_arman[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

افقی	O	O
:	O	O
[UNK]	O	O
[UNK]	O	O
از	O	O
عوامل	O	O
دوران	O	O
پهلوی	O	O
و	O	O
نخستوزیر	O	O
ایران	B-location	B-location
در	O	O
سالهای	O	O
ابتدايی	O	O
دهه	O	O
چهل	O	I-date
خورشیدی	O	I-date
[UNK]	O	O
جلد	O	O
سوم	O	O
یادداشتهای	O	O
##ش	O	O
هم	O	O
چندی	O	O
پیش	O	O
در	O	O
تهران	B-location	B-location
منتشر	O	O
شد	O	O
[UNK]	O	O
[UNK]	O	O
پرستاری	O	O
از	O	O
ناخوش	O	O
##احوال	O	O
[UNK]	O	O
پوشاک	O	O
و	O	O
جامه	O	O
[UNK]	O	O
فانتزی	O	O
و	O	O
شیک	O	O
[UNK]	O	O
[UNK]	O	O
در	O	O
حال	O	O
وزیدن	O	O
[UNK]	O	O
اطلاعیه	O	O
[UNK]	O	O
پایتخت	O	O
جمهوری	O	O
استونی	B-location	B-location
در	I-location	O
حوضه	I-location	I-location
بالتیک	I-location	I-location
[UNK]	O	O
[UNK]	O	O
علم	O	O
راهبرد	O	O
موسسه	O	O
و	O	O
سازمان	O	O
[UNK]	O	O
نوعی	O	O
شمع	O	O
[UNK]	O	O
[UNK]	O	O
حرف	O	O
جمع	O	O
مونث	O	O
[UNK]	O	O
در	O	O
ایران	B-location	B-location
به	O	O
تولیدکننده	O	O
کتاب	O	O
اطلاق	O	O
میشود	O	O
[UNK]	O	O
از	O	O
شهرهای	O	O
باختری	O	O
افغانستان	B-location	B-location
[UNK]	O	O
تا	O	O
عصر	O	O
ناصرالدینشاه	B-person	B-per

In [30]:
ner_model.evaluate_prediction_results(labels, inference_output_arman)

Test Accuracy: 0.9753489555803527
Test Precision: 0.7600182149362478
Test Recall: 0.7246020260492041
Test F1-Score: 0.7418876870647504
Test classification Report:
              precision    recall  f1-score   support

       event  0.6283367556 0.5230769231 0.5708955224       585
    facility  0.7905982906 0.6525573192 0.7149758454       567
    location  0.7245627002 0.8322014714 0.7746608719      3534
organization  0.8556495769 0.7318007663 0.7888939881      4698
      person  0.7155705453 0.6998350742 0.7076153419      3638
     product  0.7047781570 0.5175438596 0.5968208092       798

   micro avg  0.7600182149 0.7246020260 0.7418876871     13820
   macro avg  0.7365826709 0.6595025690 0.6923103965     13820
weighted avg  0.7642511679 0.7246020260 0.7405071115     13820



In [31]:
output_file_name = "ner_arman_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_arman:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### Arman+Peyma

In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1WZxpFRtEs5HZWyWQ2Pyg9CCuIBs1Kmvx'})
download.GetContentFile('peyma.zip')
!ls

In [None]:
!unzip peyma.zip
!ls
!ls peyma

In [32]:
sentences_peyma, labels_peyma = ner_model.load_test_datasets(dataset_name="peyma", dataset_dir="./peyma/")
print(len(sentences_peyma), len(labels_peyma))
print(sentences_peyma[0])
print(labels_peyma[0])

1026 1026
['کنایه', 'سرلشگر', 'فیروزآبادی', 'به', 'پادشاه', 'عربستان', 'و', 'پسرش']
['O', 'O', 'B_ORG', 'O', 'O', 'B_LOC', 'O', 'O']


In [33]:
is_consistent = ner_model.check_input_label_consistency(labels_peyma)
print(is_consistent)

model labels: dict_keys(['B-date', 'B-event', 'B-facility', 'B-location', 'B-money', 'B-organization', 'B-percent', 'B-person', 'B-product', 'B-time', 'I-date', 'I-event', 'I-facility', 'I-location', 'I-money', 'I-organization', 'I-percent', 'I-person', 'I-product', 'I-time', 'O'])
dataset labels: {'I_MON', 'B_PER', 'B_DAT', 'B_TIM', 'O', 'B_ORG', 'B_PCT', 'B_LOC', 'I_LOC', 'I_DAT', 'I_TIM', 'I_ORG', 'I_PCT', 'B_MON', 'I_PER'}
intersection: {'O'}
model_labels-dataset_labels: ['B-event', 'I-organization', 'I-money', 'B-product', 'I-facility', 'I-product', 'B-facility', 'I-percent', 'I-person', 'B-date', 'I-event', 'I-date', 'B-organization', 'I-time', 'B-time', 'I-location', 'B-percent', 'B-location', 'B-person', 'B-money']
dataset_labels-model_labels: ['I_DAT', 'I_LOC', 'I_MON', 'B_LOC', 'B_PER', 'I_TIM', 'I_ORG', 'I_PCT', 'B_DAT', 'B_MON', 'B_TIM', 'I_PER', 'B_ORG', 'B_PCT']
False


In [34]:
label_translate = {
    'B_PER': 'B-person', 
    'I_PER': 'I-person',
    'B_LOC': 'B-location',
    'I_LOC': 'I-location',
    'B_ORG': 'B-organization',
    'I_ORG': 'I-organization', 
    'B_MON': 'B-money',
    'I_MON': 'I-money', 
    'B_DAT': 'B-date', 
    'I_DAT': 'I-date',
    'B_TIM': 'B-time',
    'I_TIM': 'I-time', 
    'B_PCT': 'B-percent',
    'I_PCT': 'I-percent',
    'O': 'O'
}
labels_peyma = ner_model.resolve_input_label_consistency(labels_peyma, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_peyma)
print(is_consistent)

model labels: dict_keys(['B-date', 'B-event', 'B-facility', 'B-location', 'B-money', 'B-organization', 'B-percent', 'B-person', 'B-product', 'B-time', 'I-date', 'I-event', 'I-facility', 'I-location', 'I-money', 'I-organization', 'I-percent', 'I-person', 'I-product', 'I-time', 'O'])
dataset labels: {'I-date', 'B-organization', 'I-time', 'B-time', 'I-organization', 'I-money', 'I-location', 'B-percent', 'B-location', 'O', 'I-percent', 'I-person', 'B-date', 'B-person', 'B-money'}
intersection: {'I-person', 'I-date', 'B-organization', 'I-time', 'B-time', 'I-organization', 'I-money', 'I-location', 'B-date', 'B-person', 'B-percent', 'B-location', 'B-money', 'O', 'I-percent'}
model_labels-dataset_labels: ['B-event', 'B-product', 'I-facility', 'I-product', 'I-event', 'B-facility']
dataset_labels-model_labels: []
True


In [None]:
!wget https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
!ls

In [None]:
!unzip ArmanPersoNERCorpus.zip -d arman
!ls

In [35]:
sentences_arman, labels_arman = ner_model.load_test_datasets(dataset_name="arman", dataset_dir="./arman/")
print(len(sentences_arman), len(labels_arman))
print(sentences_arman[0])
print(labels_arman[0])

7681 7681
['افقی', ':', '0', 'ـ', 'از', 'عوامل', 'دوران', 'پهلوی', 'و', 'نخست\u200cوزیر', 'ایران', 'در', 'سالهای', 'ابتدائی', 'دهه', 'چهل', 'خورشیدی', 'كه', 'جلد', 'سوم', 'یادداشتهایش', 'هم', 'چندی', 'پیش', 'در', 'تهران', 'منتشر', 'شد', '0', 'ـ', 'پرستاری', 'از', 'ناخوش\u200cاحوال', 'ـ', 'پوشاک', 'و', 'جامه', 'ـ', 'فانتزی', 'و', 'شیک', '0', 'ـ', 'در', 'حال', 'وزیدن', 'ـ', 'اطلاعیه', 'ـ', 'پایتخت', 'جمهوری', 'استونی', 'در', 'حوضه', 'بالتیک', '0', 'ـ', 'علم', 'راهبرد', 'مؤسسه', 'و', 'سازمان', 'ـ', 'نوعی', 'شمع', '0', 'ـ', 'حرف', 'جمع', 'مؤنث', 'ـ', 'در', 'ایران', 'به', 'تولیدکننده', 'کتاب', 'اطلاق', 'می\u200cشود', 'ـ', 'از', 'شهرهای', 'باختری', 'افغانستان', 'كه', 'تا', 'عصر', 'ناصرالدین\u200cشاه', 'جزئی', 'از', 'خراسان', 'بود', 'ـ', 'ویتامین', 'انعقاد', '0', 'ـ', 'سبزی', 'غده\u200cای', 'ـ', 'دوستی', 'و', 'محبت', 'ـ', 'داستان', 'بلند', 'ـ', 'شهری', 'در', 'آلمان', '0', 'ـ', 'سلول', 'بدن', 'موجودات', 'ـ', 'از', 'انواع', 'کالباس', '0', 'ـ', 'حاشیه', 'و', 'هامش', 'ـ', 'پیدا', 'نشدنی', 'ـ', 'خ

In [36]:
is_consistent = ner_model.check_input_label_consistency(labels_arman)
print(is_consistent)

model labels: dict_keys(['B-date', 'B-event', 'B-facility', 'B-location', 'B-money', 'B-organization', 'B-percent', 'B-person', 'B-product', 'B-time', 'I-date', 'I-event', 'I-facility', 'I-location', 'I-money', 'I-organization', 'I-percent', 'I-person', 'I-product', 'I-time', 'O'])
dataset labels: {'I-loc', 'B-event', 'B-fac', 'I-fac', 'I-org', 'B-pro', 'O', 'I-pro', 'B-org', 'B-pers', 'B-loc', 'I-pers', 'I-event'}
intersection: {'O', 'B-event', 'I-event'}
model_labels-dataset_labels: ['I-person', 'I-date', 'B-organization', 'I-time', 'B-time', 'I-organization', 'I-money', 'B-date', 'B-product', 'I-location', 'B-person', 'B-percent', 'I-facility', 'I-product', 'B-location', 'B-money', 'B-facility', 'I-percent']
dataset_labels-model_labels: ['I-loc', 'I-pro', 'B-org', 'B-fac', 'I-fac', 'I-org', 'B-pro', 'B-pers', 'B-loc', 'I-pers']
False


In [37]:
label_translate = {
    'B-org': 'B-organization', 
    'I-org': 'I-organization',
    'B-loc': 'B-location',
    'I-loc': 'I-location',
    'B-pers': 'B-person', 
    'I-pers': 'I-person',
    'B-event': 'B-event', 
    'I-event': 'I-event',
    'B-pro': 'B-product', 
    'I-pro': 'I-product', 
    'B-fac': 'B-facility', 
    'I-fac': 'I-facility',
    'O': 'O'
}
labels_arman = ner_model.resolve_input_label_consistency(labels_arman, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_arman)
print(is_consistent)

model labels: dict_keys(['B-date', 'B-event', 'B-facility', 'B-location', 'B-money', 'B-organization', 'B-percent', 'B-person', 'B-product', 'B-time', 'I-date', 'I-event', 'I-facility', 'I-location', 'I-money', 'I-organization', 'I-percent', 'I-person', 'I-product', 'I-time', 'O'])
dataset labels: {'B-organization', 'B-event', 'I-organization', 'I-location', 'B-product', 'B-location', 'I-facility', 'I-product', 'O', 'B-facility', 'I-person', 'B-person', 'I-event'}
intersection: {'I-person', 'B-organization', 'B-event', 'I-organization', 'I-location', 'B-product', 'B-person', 'B-location', 'I-facility', 'I-product', 'I-event', 'O', 'B-facility'}
model_labels-dataset_labels: ['I-date', 'I-time', 'B-time', 'I-money', 'B-date', 'B-percent', 'B-money', 'I-percent']
dataset_labels-model_labels: []
True


In [38]:
sentences = sentences_arman + sentences_peyma
labels = labels_arman + labels_peyma
print(len(sentences), len(labels))

8707 8707


In [39]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-date', 'B-event', 'B-facility', 'B-location', 'B-money', 'B-organization', 'B-percent', 'B-person', 'B-product', 'B-time', 'I-date', 'I-event', 'I-facility', 'I-location', 'I-money', 'I-organization', 'I-percent', 'I-person', 'I-product', 'I-time', 'O'])
dataset labels: {'I-date', 'B-organization', 'B-event', 'I-time', 'B-time', 'I-organization', 'I-money', 'I-location', 'B-product', 'B-percent', 'B-location', 'I-facility', 'I-product', 'O', 'B-facility', 'I-percent', 'I-person', 'B-date', 'B-person', 'I-event', 'B-money'}
intersection: {'B-event', 'I-organization', 'I-money', 'B-product', 'I-facility', 'I-product', 'O', 'B-facility', 'I-percent', 'I-person', 'B-date', 'I-event', 'I-date', 'B-organization', 'I-time', 'B-time', 'I-location', 'B-percent', 'B-location', 'B-person', 'B-money'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True


In [40]:
!nvidia-smi
!lscpu

Mon Aug 16 08:02:25 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P0    57W / 149W |   9441MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [41]:
inference_output = ner_model.ner_evaluation(sentences, labels, device, batch_size=512)

max_len: 253
#samples: 8707
#batch: 18
Start to evaluate test data ...
inference time for step 0: 0.02850413800001661
inference time for step 1: 0.013322699000013927
inference time for step 2: 0.013239560999977584
inference time for step 3: 0.012876112999947509
inference time for step 4: 0.012286604000109946
inference time for step 5: 0.016279452000162564
inference time for step 6: 0.01294892400005665
inference time for step 7: 0.013680276000059166
inference time for step 8: 0.01359000999991622
inference time for step 9: 0.013171607999993284
inference time for step 10: 0.013327593999974852
inference time for step 11: 0.01317566100010481
inference time for step 12: 0.012725634999924296
inference time for step 13: 0.012499025000124675
inference time for step 14: 0.013813787000117372
inference time for step 15: 0.01261525699987942
inference time for step 16: 0.013569928000151776
inference time for step 17: 0.014423615000168866
average loss: 0.18995342983139885
total inference time: 0.2560

In [42]:
for sample_output in inference_output[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

افقی	O	O
:	O	O
[UNK]	O	O
[UNK]	O	O
از	O	O
عوامل	O	O
دوران	O	O
پهلوی	O	O
و	O	O
نخستوزیر	O	O
ایران	B-location	B-location
در	O	O
سالهای	O	O
ابتدايی	O	O
دهه	O	O
چهل	O	I-date
خورشیدی	O	I-date
[UNK]	O	O
جلد	O	O
سوم	O	O
یادداشتهای	O	O
##ش	O	O
هم	O	O
چندی	O	O
پیش	O	O
در	O	O
تهران	B-location	B-location
منتشر	O	O
شد	O	O
[UNK]	O	O
[UNK]	O	O
پرستاری	O	O
از	O	O
ناخوش	O	O
##احوال	O	O
[UNK]	O	O
پوشاک	O	O
و	O	O
جامه	O	O
[UNK]	O	O
فانتزی	O	O
و	O	O
شیک	O	O
[UNK]	O	O
[UNK]	O	O
در	O	O
حال	O	O
وزیدن	O	O
[UNK]	O	O
اطلاعیه	O	O
[UNK]	O	O
پایتخت	O	O
جمهوری	O	O
استونی	B-location	B-location
در	I-location	O
حوضه	I-location	I-location
بالتیک	I-location	I-location
[UNK]	O	O
[UNK]	O	O
علم	O	O
راهبرد	O	O
موسسه	O	O
و	O	O
سازمان	O	O
[UNK]	O	O
نوعی	O	O
شمع	O	O
[UNK]	O	O
[UNK]	O	O
حرف	O	O
جمع	O	O
مونث	O	O
[UNK]	O	O
در	O	O
ایران	B-location	B-location
به	O	O
تولیدکننده	O	O
کتاب	O	O
اطلاق	O	O
میشود	O	O
[UNK]	O	O
از	O	O
شهرهای	O	O
باختری	O	O
افغانستان	B-location	B-location
[UNK]	O	O
تا	O	O
عصر	O	O
ناصرالدینشاه	B-person	B-per

In [43]:
ner_model.evaluate_prediction_results(labels, inference_output)

Test Accuracy: 0.9719239215497002
Test Precision: 0.7515315196627363
Test Recall: 0.7157016498337619
Test F1-Score: 0.7331791016001541
Test classification Report:
              precision    recall  f1-score   support

        date  0.4077669903 0.1909090909 0.2600619195       220
       event  0.6194331984 0.5230769231 0.5671918443       585
    facility  0.7773109244 0.6525573192 0.7094918504       567
    location  0.7435071904 0.8365129196 0.7872727273      4141
       money  0.0405405405 0.2307692308 0.0689655172        26
organization  0.8494439692 0.7343316694 0.7877045117      5409
     percent  0.1800000000 0.1800000000 0.1800000000        50
      person  0.7131645570 0.6832403590 0.6978818283      4123
     product  0.6941176471 0.5175438596 0.5929648241       798
        time  0.3333333333 0.4545454545 0.3846153846        22

   micro avg  0.7515315197 0.7157016498 0.7331791016     15941
   macro avg  0.5358618351 0.5003486826 0.5036150407     15941
weighted avg  0.757668207

In [44]:
output_file_name = "ner_arman-and-peyma_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### WikiAnn

https://elisa-ie.github.io/wikiann/

In [45]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1QOG15HU8VfZvJUNKos024xI-OGm0zhEX'})
download.GetContentFile('fa.tar.gz')
!ls

adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner_arman-and-peyma_HooshvareLab-bert-base-parsbert-ner-uncased_outputs.txt
ner_arman_HooshvareLab-bert-base-parsbert-ner-uncased_outputs.txt
ner_peyma_HooshvareLab-bert-base-parsbert-ner-uncased_outputs.txt
peyma
peyma.zip
sample_data


In [46]:
!tar -zxvf fa.tar.gz
!ls

README.txt
wikiann-fa.bio
adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner_arman-and-peyma_HooshvareLab-bert-base-parsbert-ner-uncased_outputs.txt
ner_arman_HooshvareLab-bert-base-parsbert-ner-uncased_outputs.txt
ner_peyma_HooshvareLab-bert-base-parsbert-ner-uncased_outputs.txt
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio


In [47]:
sentences_all, labels_all, sentences_test, labels_test = ner_model.load_datasets(dataset_name="wikiann", dataset_dir="./")
print(len(sentences_all), len(labels_all))
print(len(sentences_test), len(labels_test))
print(sentences_test[0])
print(labels_test[0])

all data: #data: 272266, #labels: 272266


  return array(a, dtype, copy=False, order=order)


without stratify
test part:
 #data: 27227, #labels: 27227
272266 272266
27227 27227
['**', 'زاغی', 'نوک\u200cزرد', ',', "''Pica", 'nuttalli', "''"]
['O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O']
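
The printed counts are consistent with a plain random 10% hold-out without stratification by label: 27,227 is one tenth of 272,266, rounded up. The splitting code inside `load_datasets` is not shown, so the following is only a sketch under that assumption; `random_split`, `test_fraction`, and `seed` are hypothetical names.

```python
import math
import random
from typing import List, Tuple


def random_split(
    sentences: List[list],
    labels: List[list],
    test_fraction: float = 0.1,
    seed: int = 0,
) -> Tuple[List[list], List[list], List[list], List[list]]:
    """Shuffle sentence indices and hold out a fraction as the test set.

    Sentences and labels are split with the same indices so they stay paired.
    No stratification: the label distribution of the test set is left to chance.
    """
    n_test = math.ceil(test_fraction * len(sentences))
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [sentences[i] for i in train_idx],
        [labels[i] for i in train_idx],
        [sentences[i] for i in test_idx],
        [labels[i] for i in test_idx],
    )


# Size check against the counts printed above.
expected_test_size = math.ceil(0.1 * 272266)
print(expected_test_size)

# Tiny usage example with 25 dummy sentences.
toy_sentences = [["w{}".format(i)] for i in range(25)]
toy_labels = [["O"] for _ in range(25)]
train_s, train_l, test_s, test_l = random_split(toy_sentences, toy_labels)
print(len(train_s), len(test_s))
```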


In [48]:
is_consistent = ner_model.check_input_label_consistency(labels_test)
print(is_consistent)

model labels: dict_keys(['B-date', 'B-event', 'B-facility', 'B-location', 'B-money', 'B-organization', 'B-percent', 'B-person', 'B-product', 'B-time', 'I-date', 'I-event', 'I-facility', 'I-location', 'I-money', 'I-organization', 'I-percent', 'I-person', 'I-product', 'I-time', 'O'])
dataset labels: {'B-LOC', 'I-ORG', 'B-PER', 'O', 'B-ORG', 'I-PER', 'I-LOC'}
intersection: {'O'}
model_labels-dataset_labels: ['B-event', 'I-organization', 'I-money', 'B-product', 'I-facility', 'I-product', 'B-facility', 'I-percent', 'I-person', 'B-date', 'I-event', 'I-date', 'B-organization', 'I-time', 'B-time', 'I-location', 'B-percent', 'B-location', 'B-person', 'B-money']
dataset_labels-model_labels: ['B-LOC', 'I-ORG', 'B-PER', 'B-ORG', 'I-PER', 'I-LOC']
False


In [49]:
label_translate = {
    'B-LOC': 'B-location',
    'I-LOC': 'I-location',
    'B-PER': 'B-person',
    'I-PER': 'I-person',
    'B-ORG': 'B-organization',
    'I-ORG': 'I-organization',
    'O': 'O'
}
labels_test = ner_model.resolve_input_label_consistency(labels_test, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_test)
print(is_consistent)

model labels: dict_keys(['B-date', 'B-event', 'B-facility', 'B-location', 'B-money', 'B-organization', 'B-percent', 'B-person', 'B-product', 'B-time', 'I-date', 'I-event', 'I-facility', 'I-location', 'I-money', 'I-organization', 'I-percent', 'I-person', 'I-product', 'I-time', 'O'])
dataset labels: {'I-person', 'B-organization', 'I-organization', 'B-person', 'I-location', 'B-location', 'O'}
intersection: {'I-person', 'B-organization', 'I-organization', 'I-location', 'B-person', 'B-location', 'O'}
model_labels-dataset_labels: ['I-date', 'B-event', 'I-time', 'B-time', 'I-money', 'B-date', 'B-product', 'B-percent', 'I-facility', 'I-product', 'I-event', 'B-money', 'B-facility', 'I-percent']
dataset_labels-model_labels: []
True


In [50]:
!nvidia-smi
!lscpu

Mon Aug 16 08:13:13 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   63C    P0    58W / 149W |   9439MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [51]:
inference_output_wikiann = ner_model.ner_evaluation(sentences_test, labels_test, device, batch_size=512)

max_len: 95
#samples: 27227
#batch: 54
Start to evaluate test data ...
inference time for step 0: 0.023989707000055205
inference time for step 1: 0.012486017000128413
inference time for step 2: 0.012027557000010347
inference time for step 3: 0.012387160000116637
inference time for step 4: 0.01225049399999989
inference time for step 5: 0.012856169000087903
inference time for step 6: 0.012444744000049468
inference time for step 7: 0.013113630999896486
inference time for step 8: 0.013032027000008384
inference time for step 9: 0.013570710000067265
inference time for step 10: 0.012283849000141345
inference time for step 11: 0.012784522999936598
inference time for step 12: 0.012605644000132088
inference time for step 13: 0.012410196000018914
inference time for step 14: 0.013294407000103092
inference time for step 15: 0.012627677999944353
inference time for step 16: 0.012863186000004134
inference time for step 17: 0.012220810999906462
inference time for step 18: 0.012622571999827414
inference
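
The `#batch: 54` line in the log is consistent with ceiling division of the sample count by the batch size. A minimal check, assuming the final partial batch is kept rather than dropped (the helper name `num_batches` is illustrative):

```python
import math


def num_batches(num_samples, batch_size):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(num_samples / batch_size)


samples, batch_size = 27227, 512
batches = num_batches(samples, batch_size)
# Size of the final batch; fall back to a full batch when the division is exact.
last_batch = samples % batch_size or batch_size
print(batches, last_batch)
```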

In [52]:
for sample_output in inference_output_wikiann[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

*	O	O
*	O	O
زاغی	B-location	O
نوک	I-location	O
##زرد	I-location	O
,	O	O
'	O	O
'	O	O
pic	O	O
##a	O	O
nut	O	O
##ta	O	O
##ll	O	O
##i	O	O
'	O	O
'	O	O

تغییر	O	O
##مسیر	O	O
مک	B-location	B-person
##ویل	B-location	I-person
،	B-location	O
داکوتای	I-location	B-location
شمالی	I-location	I-location

وست	B-location	O
یونیورسیتی	I-location	O
پلیس	I-location	O
،	I-location	O
تگزاس	I-location	B-location

تغییر	O	O
##مسیر	O	O
دلت	B-person	O
##ف	B-person	O
فون	I-person	O
لیل	I-person	O
##نس	I-person	O
##رون	I-person	O

تغییر	O	O
##مسیر	O	O
نیروگاههای	B-organization	O
زنجیرهای	I-organization	O
یاسوج	I-organization	B-location



In [53]:
ner_model.evaluate_prediction_results(labels_test, inference_output_wikiann)

Test Accuracy: 0.5095925221770143
Test Precision: 0.18525429930479326
Test Recall: 0.13820494622481846
Test F1-Score: 0.15830779813645177
Test classification Report:
              precision    recall  f1-score   support

    location  0.1208080577 0.1090505217 0.1146285838     19358
organization  0.5073932092 0.1632311487 0.2470007998     11352
      person  0.1771437782 0.1855165429 0.1812335092      5924

   micro avg  0.1852542993 0.1382049462 0.1583077981     36634
   macro avg  0.2684483484 0.1525994044 0.1809542976     36634
weighted avg  0.2497114657 0.1382049462 0.1664180956     36634
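
The summary rows of the report can be reproduced from the per-class rows. The sketch below assumes the usual definitions: F1 is the harmonic mean of precision and recall, the macro average is the unweighted mean over classes, and the weighted average uses support as the weight. How `evaluate_prediction_results` computes them internally is not shown in this section, so the helper names here are illustrative.

```python
# (precision, recall, f1, support) per entity type, copied from the report above.
report = {
    'location': (0.1208080577, 0.1090505217, 0.1146285838, 19358),
    'organization': (0.5073932092, 0.1632311487, 0.2470007998, 11352),
    'person': (0.1771437782, 0.1855165429, 0.1812335092, 5924),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_average(rows, column):
    """Unweighted mean of one column over all classes."""
    return sum(row[column] for row in rows) / len(rows)


def weighted_average(rows, column):
    """Mean of one column over all classes, weighted by support."""
    total_support = sum(row[3] for row in rows)
    return sum(row[column] * row[3] for row in rows) / total_support


rows = list(report.values())
micro_f1 = f1_score(0.18525429930479326, 0.13820494622481846)
macro_precision = macro_average(rows, 0)
macro_f1 = macro_average(rows, 2)
weighted_precision = weighted_average(rows, 0)
weighted_recall = weighted_average(rows, 1)
print(micro_f1, macro_precision, macro_f1, weighted_precision, weighted_recall)
```

The support-weighted recall comes out equal to the micro-averaged recall, because both reduce to total correct entities divided by total true entities.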



In [54]:
# Save the predictions: one "token<TAB>true label<TAB>predicted label" line per token,
# with a blank line between sentences.
output_file_name = "ner_wikiann_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_wikiann:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
# Authenticate in Colab and upload the results file to Google Drive.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()
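
The file written above holds one tab-separated `token`, true label, predicted label triple per line, with a blank line after each sentence. A minimal sketch for reading such a file back into per-sentence lists, for example to re-score the predictions later, might look like this. The function name `read_ner_outputs` is hypothetical, and the sketch assumes tokens never contain a tab character.

```python
def read_ner_outputs(lines):
    """Group tab-separated (token, true label, predicted label) lines into sentences.

    `lines` is any iterable of strings, such as an open text file.
    A blank line closes the current sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        token, true_label, predicted_label = line.split('\t')
        current.append((token, true_label, predicted_label))
    if current:  # file did not end with a blank line
        sentences.append(current)
    return sentences


# In-memory lines in the same layout as the output file.
demo_lines = [
    'تگزاس\tB-location\tB-location\n',
    ',\tO\tO\n',
    '\n',
    'یاسوج\tI-organization\tB-location\n',
    '\n',
]
parsed = read_ner_outputs(demo_lines)
print(len(parsed), parsed[0][0])
```

With a real file, the same function can be called as `read_ner_outputs(open(output_file_name, encoding='utf8'))`.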

#### Hooshvare - Arman+Peyma+WikiAnn

https://github.com/hooshvare/parsner

In [55]:
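# Authenticate in Colab and download the dataset archive from Google Drive by file ID.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1fC2WGlpqumUTaT9Dr_U1jO2no3YMKFJ4'})
download.GetContentFile('ner-v1.zip')
# List the working directory to confirm that ner-v1.zip arrived.
!ls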

adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner_arman-and-peyma_HooshvareLab-bert-base-parsbert-ner-uncased_outputs.txt
ner_arman_HooshvareLab-bert-base-parsbert-ner-uncased_outputs.txt
ner_peyma_HooshvareLab-bert-base-parsbert-ner-uncased_outputs.txt
ner-v1.zip
ner_wikiann_HooshvareLab-bert-base-parsbert-ner-uncased_outputs.txt
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio


In [56]:
!unzip ner-v1.zip
!ls
!ls ner

Archive:  ner-v1.zip
   creating: ner/
  inflating: ner/valid.csv           
  inflating: ner/ner.csv             
  inflating: ner/test.csv            
  inflating: ner/train.csv           
adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner
ner_arman-and-peyma_HooshvareLab-bert-base-parsbert-ner-uncased_outputs.txt
ner_arman_HooshvareLab-bert-base-parsbert-ner-uncased_outputs.txt
ner_peyma_HooshvareLab-bert-base-parsbert-ner-uncased_outputs.txt
ner-v1.zip
ner_wikiann_HooshvareLab-bert-base-parsbert-ner-uncased_outputs.txt
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio
ner.csv  test.csv  train.csv  valid.csv


In [57]:
sentences_paw, labels_paw = ner_model.load_test_datasets(dataset_name="hooshvare-peyman+arman+wikiann", dataset_dir="./ner/")
print(len(sentences_paw), len(labels_paw))
print(sentences_paw[0])
print(labels_paw[0])

test part:
 #sentences: 6049, #sentences_tags: 6049
6049 6049
['همچنین', 'عملیات', 'لرزه\u200cنگاری', 'دوبعدی', 'نیز', 'با', 'فعالیت', 'مستمر', 'چهار', 'گروه', 'کاری', 'در', 'مناطقی', 'که', 'از', 'نظر', 'اکتشافی', 'مورد', 'نظر', 'بود', '،', 'به', 'پایان', 'رسید', 'که', 'نتایج', 'آن', 'در', 'حال', 'بررسی', 'است', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [58]:
is_consistent = ner_model.check_input_label_consistency(labels_paw)
print(is_consistent)

model labels: dict_keys(['B-date', 'B-event', 'B-facility', 'B-location', 'B-money', 'B-organization', 'B-percent', 'B-person', 'B-product', 'B-time', 'I-date', 'I-event', 'I-facility', 'I-location', 'I-money', 'I-organization', 'I-percent', 'I-person', 'I-product', 'I-time', 'O'])
dataset labels: {'B-LOC', 'B-DAT', 'I-EVE', 'B-TIM', 'B-EVE', 'I-PER', 'O', 'I-ORG', 'I-PCT', 'I-DAT', 'I-PRO', 'B-PER', 'B-FAC', 'B-MON', 'I-FAC', 'I-MON', 'B-PCT', 'B-PRO', 'B-ORG', 'I-LOC', 'I-TIM'}
intersection: {'O'}
model_labels-dataset_labels: ['B-event', 'I-organization', 'I-money', 'B-product', 'I-facility', 'I-product', 'B-facility', 'I-percent', 'I-person', 'B-date', 'I-event', 'I-date', 'B-organization', 'I-time', 'B-time', 'I-location', 'B-percent', 'B-location', 'B-person', 'B-money']
dataset_labels-model_labels: ['B-LOC', 'I-EVE', 'B-TIM', 'B-EVE', 'I-PRO', 'B-MON', 'I-MON', 'B-PCT', 'I-DAT', 'B-DAT', 'I-PER', 'I-ORG', 'I-PCT', 'B-PER', 'B-FAC', 'I-FAC', 'B-PRO', 'B-ORG', 'I-LOC', 'I-TIM']
Fal

In [59]:
label_translate = {
    'B-MON': 'B-money', 
    'I-MON': 'I-money', 
    'B-ORG': 'B-organization', 
    'I-ORG': 'I-organization', 
    'B-LOC': 'B-location', 
    'I-LOC': 'I-location', 
    'B-PER': 'B-person', 
    'I-PER': 'I-person', 
    'B-DAT': 'B-date', 
    'I-DAT': 'I-date', 
    'B-TIM': 'B-time', 
    'I-TIM': 'I-time', 
    'B-FAC': 'B-facility', 
    'I-FAC': 'I-facility', 
    'B-PCT': 'B-percent', 
    'I-PCT': 'I-percent', 
    'B-EVE': 'B-event', 
    'I-EVE': 'I-event', 
    'B-PRO': 'B-product', 
    'I-PRO': 'I-product', 
    'O': 'O'
}
labels_paw = ner_model.resolve_input_label_consistency(labels_paw, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_paw)
print(is_consistent)

model labels: dict_keys(['B-date', 'B-event', 'B-facility', 'B-location', 'B-money', 'B-organization', 'B-percent', 'B-person', 'B-product', 'B-time', 'I-date', 'I-event', 'I-facility', 'I-location', 'I-money', 'I-organization', 'I-percent', 'I-person', 'I-product', 'I-time', 'O'])
dataset labels: {'I-date', 'B-organization', 'B-event', 'I-time', 'B-time', 'I-organization', 'I-money', 'I-location', 'B-product', 'B-percent', 'B-location', 'I-facility', 'I-product', 'O', 'B-facility', 'I-percent', 'I-person', 'B-date', 'B-person', 'I-event', 'B-money'}
intersection: {'B-event', 'I-organization', 'I-money', 'B-product', 'I-facility', 'I-product', 'O', 'B-facility', 'I-percent', 'I-person', 'B-date', 'I-event', 'I-date', 'B-organization', 'I-time', 'B-time', 'I-location', 'B-percent', 'B-location', 'B-person', 'B-money'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True
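
Because every entry in the mapping above follows the same `B-`/`I-` pattern, the dictionary can also be generated from a table of short and long entity names, which avoids typos in the 21 hand-written entries. A sketch, where `SHORT_TO_LONG` and `build_label_translate` are illustrative names rather than part of the notebook's wrapper:

```python
# Short dataset codes and the long names used by the model, taken from the cell above.
SHORT_TO_LONG = {
    'MON': 'money', 'ORG': 'organization', 'LOC': 'location', 'PER': 'person',
    'DAT': 'date', 'TIM': 'time', 'FAC': 'facility', 'PCT': 'percent',
    'EVE': 'event', 'PRO': 'product',
}


def build_label_translate(short_to_long):
    """Expand entity-name pairs into a full IOB tag mapping, keeping 'O' unchanged."""
    mapping = {'O': 'O'}
    for short, long_name in short_to_long.items():
        for prefix in ('B', 'I'):
            mapping['{}-{}'.format(prefix, short)] = '{}-{}'.format(prefix, long_name)
    return mapping


generated_label_translate = build_label_translate(SHORT_TO_LONG)
print(len(generated_label_translate))
print(generated_label_translate['B-PCT'], generated_label_translate['I-EVE'])
```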


In [60]:
!nvidia-smi
!lscpu

Mon Aug 16 08:25:06 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   68C    P0    57W / 149W |   4111MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [61]:
inference_output = ner_model.ner_evaluation_2(sentences_paw, labels_paw, device, batch_size=256)

len(input_text): 6049
len(input_labels): 6049
max_len: 448
#samples: 6049
#batch: 24
Start to evaluate test data ...
inference time for step 0: 0.029246019000311207
inference time for step 1: 0.015161274000092817
inference time for step 2: 0.012310294000144495
inference time for step 3: 0.013281282999741961
inference time for step 4: 0.013457263999953284
inference time for step 5: 0.013609144999918499
inference time for step 6: 0.01371802999983629
inference time for step 7: 0.013282042999890109
inference time for step 8: 0.013736756000071182
inference time for step 9: 0.01343041300015102
inference time for step 10: 0.013843698000073346
inference time for step 11: 0.01328383300005953
inference time for step 12: 0.015214668999760761
inference time for step 13: 0.013380304999827786
inference time for step 14: 0.013858869000159757
inference time for step 15: 0.012648330000047281
inference time for step 16: 0.013361322000037035
inference time for step 17: 0.012890236000203004
inference time

In [62]:
for sample_output in inference_output[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

همچنین	O	O
عملیات	O	O
لرزهنگاری	O	O
دوبعدی	O	O
نیز	O	O
با	O	O
فعالیت	O	O
مستمر	O	O
چهار	O	O
گروه	O	O
کاری	O	O
در	O	O
مناطقی	O	O
که	O	O
از	O	O
نظر	O	O
اکتشافی	O	O
مورد	O	O
نظر	O	O
بود	O	O
،	O	O
به	O	O
پایان	O	O
رسید	O	O
که	O	O
نتایج	O	O
ان	O	O
در	O	O
حال	O	O
بررسی	O	O
است	O	O
.	O	O

محدث	B-person	O
در	O	O
مورد	O	O
مشارکت	O	O
شرکتهای	O	O
خارجی	O	O
در	O	O
فعالیتهای	O	O
اکتشافی	O	O
کشور	O	O
گفت	O	O
:	O	O
تاکنون	O	O
چند	O	O
منطقه	O	O
اکتشافی	O	O
را	O	O
برای	O	O
مشارکت	O	O
و	O	O
سرمایهگذاری	O	O
شرکتهای	O	O
خارجی	O	O
اعلام	O	O
کردهایم	O	O
و	O	O
در	O	O
حال	O	O
مذاکره	O	O
با	O	O
طرفهای	O	O
خارجی	O	O
هستیم	O	O
و	O	O
انتظار	O	O
میرود	O	O
تا	O	O
اخر	O	O
امسال	O	O
بتوانیم	O	O
چند	O	O
قرارداد	O	O
را	O	O
نهایی	O	O
کنیم	O	O
.	O	O

مدیر	O	O
امور	B-organization	B-organization
اکتشاف	I-organization	I-organization
شرکت	I-organization	I-organization
ملی	I-organization	I-organization
نفت	I-organization	I-organization
فرو	O	O
##افتادگی	O	O
دزفول	B-location	B-location
و	O	O
منطقه	B-location	B-location
گسل	I-l

In [63]:
ner_model.evaluate_prediction_results(labels_paw, inference_output)

Test Accuracy: 0.9635493519441675
Test Precision: 0.7158046866935067
Test Recall: 0.6529190207156309
Test F1-Score: 0.6829172206628256
Test classification Report:
              precision    recall  f1-score   support

        date  0.2608695652 0.0733496333 0.1145038168       409
       event  0.5280373832 0.4379844961 0.4788135593       258
    facility  0.6163793103 0.5564202335 0.5848670757       257
    location  0.7280728376 0.8101992570 0.7669437340      2961
       money  0.1940298507 0.2549019608 0.2203389831       102
organization  0.7730547550 0.6505001516 0.7065020576      3299
     percent  0.3243243243 0.1263157895 0.1818181818        95
      person  0.7352024922 0.6676096181 0.6997776130      2828
     product  0.5616438356 0.4456521739 0.4969696970       368
        time  0.5416666667 0.3023255814 0.3880597015        43

   micro avg  0.7158046867 0.6529190207 0.6829172207     10620
   macro avg  0.5263281021 0.4325258895 0.4638594420     10620
weighted avg  0.703126289

In [64]:
# Save the predictions: one "token<TAB>true label<TAB>predicted label" line per token,
# with a blank line between sentences.
output_file_name = "ner_arman-and-peyma-and-wikiann_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
# Authenticate in Colab and upload the results file to Google Drive.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()