# ALBERT Persian
A Lite BERT for Self-supervised Learning of Language Representations for the Persian Language

[ALBERT-Persian](https://github.com/m3hrdadfi/albert-persian) is the first attempt at ALBERT for the Persian language. The model was trained based on Google's ALBERT BASE Version 2.0 on texts in various writing styles from numerous subjects (e.g., scientific, novels, news), comprising more than 3.9M documents, 73M sentences, and 1.3B words, in the same way as we did for ParsBERT.



## Persian NER [ARMAN, PEYMA]

This task aims to extract named entities from text, such as names, and to label them with the appropriate **NER** classes, such as locations, organizations, etc. The datasets used for this task contain sentences annotated in the **IOB** format. In this format, tokens that are not part of an entity are tagged as **"O"**, the **"B"** tag marks the first word of an entity, and the **"I"** tag marks the remaining words of the same entity. Both **"B"** and **"I"** tags are followed by a hyphen (or underscore) and then the entity category. Therefore, the **NER task is a multi-class token classification problem that assigns a label to every token of a raw input text**.
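
To make the IOB scheme concrete, here is a minimal sketch that groups tagged tokens back into entities. The helper name `iob_to_spans` and the example tags are illustrative assumptions, not part of the datasets' tooling; both the hyphen (`B-org`) and underscore (`B_ORG`) separators are accepted.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) pairs (hypothetical helper)."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        # Accept both "B-org" and "B_ORG" style separators.
        prefix, _, entity_type = tag.replace("_", "-").partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A new entity starts: close the previous one, if any.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = entity_type, [token]
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": outside any entity, so close the open entity, if any.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Illustrative tags, chosen by hand for this example.
tokens = ["به", "گزارش", "خبرگزاری", "تسنیم", "از", "کرج"]
tags = ["O", "O", "B-org", "I-org", "O", "B-loc"]
spans = iob_to_spans(tokens, tags)
print(spans)  # [('org', 'خبرگزاری تسنیم'), ('loc', 'کرج')]
```

An `I` tag that does not continue an open entity of the same type is treated as the start of a new entity here; that is one possible convention, not necessarily the one used by the dataset authors.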

There are two primary datasets used in Persian NER: **ARMAN** and **PEYMA**. In ParsBERT, we prepared NER for both datasets as well as for a combination of the two.


In [1]:
!nvidia-smi
!lscpu

Mon Aug 16 14:15:31 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install hazm==0.7.0
!pip install seqeval==1.2.2
!pip install sentencepiece==0.1.96
!pip install transformers==4.7.0

Collecting hazm==0.7.0
  Downloading hazm-0.7.0-py3-none-any.whl (316 kB)
Collecting nltk==3.3
  Downloading nltk-3.3.0.zip (1.4 MB)
Collecting libwapiti>=0.2.1
  Downloading libwapiti-0.2.1.tar.gz (233 kB)
Building wheels for collected packages: nltk, libwapiti
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.3-py3-none-any.whl size=1394482 sha256=f781c9d1a85eeb0cc26d2dac902303d741d0d0c0368550a740311fdbfaa3abef
  Stored in directory: /root/.cache/pip/wheels/9b/fd/0c/d92302c876e5de87ebd7fc0979d82edb93e2d8d768bf71fac4
  Building wheel for libwapiti (setup.py) ... done
  Created wheel for libwapiti: filename=libwapiti-0.2.1-cp37-cp37m-linux_x86_64.whl size=154591 sha256=11b3bb791d7194f1497286693688f72b3f5c867b74826d0fa29e12c467dbb29b
 

In [3]:
!pip install PyDrive
import os
import IPython.display as ipd
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)



In [4]:
import os
import gc
import ast
import time
import hazm
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

import transformers
from transformers import AutoTokenizer, AutoConfig
from transformers import AutoModelForTokenClassification

from IPython.display import display, HTML, clear_output
from ipywidgets import widgets, Layout

from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from seqeval.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

print()
print('numpy', np.__version__)
print('pandas', pd.__version__)
print('transformers', transformers.__version__)
print('torch', torch.__version__)
print()

# If there's a GPU available...
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")


numpy 1.19.5
pandas 1.1.5
transformers 4.7.0
torch 1.9.0+cu102

There are 1 GPU(s) available.
We will use the GPU: Tesla T4


In [5]:
class NER:
    def __init__(self, model_name):
        self.normalizer = hazm.Normalizer()
        self.model_name = model_name
        self.config = AutoConfig.from_pretrained(self.model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(self.model_name)
        # self.labels = list(self.config.label2id.keys())
        self.id2label = self.config.id2label

    @staticmethod
    def load_ner_data(file_path, word_index, tag_index, delimiter, join=False):
        dataset, labels = [], []
        with open(file_path, encoding="utf8") as infile:
            sample_text, sample_label = [], []
            for line in infile:
                parts = line.strip().split(delimiter)
                if len(parts) > 1:
                    word, tag = parts[word_index], parts[tag_index]
                    if not word:
                        continue
                    sample_text.append(word)
                    sample_label.append(tag)
                else:
                    # end of sample
                    if sample_text and sample_label:
                        if join:
                            dataset.append(' '.join(sample_text))
                            labels.append(' '.join(sample_label))
                        else:
                            dataset.append(sample_text)
                            labels.append(sample_label)
                    sample_text, sample_label = [], []
        if sample_text and sample_label:
            if join:
                dataset.append(' '.join(sample_text))
                labels.append(' '.join(sample_label))
            else:
                dataset.append(sample_text)
                labels.append(sample_label)
        return dataset, labels

    def load_test_datasets(self, dataset_name, dataset_dir, **kwargs):
        if dataset_name.lower() == "peyma":
            ner_file_path = dataset_dir + 'test.txt'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            return self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter='|',
                                      join=kwargs.get('join', False))
        elif dataset_name.lower() == "arman":
            dataset, labels = [], []
            for i in range(1, 4):
                ner_file_path = dataset_dir + f'test_fold{i}.txt'
                if not os.path.exists(ner_file_path):
                    print(ner_file_path)
                dataset_part, labels_part = self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter=' ',
                                                               join=kwargs.get('join', False))
                dataset += dataset_part
                labels += labels_part
            return dataset, labels
        elif dataset_name.lower() == "hooshvare-peyman+arman+wikiann":
            ner_file_path = dataset_dir + 'test.csv'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            data = pd.read_csv(ner_file_path, delimiter="\t")
            sentences, sentences_tags = data['tokens'].values.tolist(), data['ner_tags'].values.tolist()
            sentences = [ast.literal_eval(ss) for ss in sentences]
            sentences_tags = [ast.literal_eval(ss) for ss in sentences_tags]
            print(f'test part:\n #sentences: {len(sentences)}, #sentences_tags: {len(sentences_tags)}')
            return sentences, sentences_tags

    def load_datasets(self, dataset_name, dataset_dir, **kwargs):
        if dataset_name.lower() == "farsiyar":
            dataset, labels = [], []
            for i in range(1, 6):
                ner_file_path = dataset_dir + f'Persian-NER-part{i}.txt'
                if not os.path.exists(ner_file_path):
                    print(ner_file_path)
                dataset_part, labels_part = self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter='\t',
                                                               join=kwargs.get('join', False))
                dataset += dataset_part
                labels += labels_part
            return dataset, labels
        elif dataset_name.lower() == "wikiann":
            ner_file_path = dataset_dir + 'wikiann-fa.bio'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            dataset_all, labels_all = self.load_ner_data(ner_file_path, word_index=0, tag_index=-1, delimiter=' ',
                                                         join=kwargs.get('join', False))
            print(f'all data: #data: {len(dataset_all)}, #labels: {len(labels_all)}')

            try:
                _, data_test, _, label_test = train_test_split(dataset_all, labels_all, test_size=0.1, random_state=1,
                                                               stratify=labels_all)
                print("with stratify")
            # Fall back to a plain random split if stratification is not possible.
            except Exception:
                _, data_test, _, label_test = train_test_split(dataset_all, labels_all, test_size=0.1, random_state=1)
                print("without stratify")
            print(f'test part:\n #data: {len(data_test)}, #labels: {len(label_test)}')
            return dataset_all, labels_all, data_test, label_test

    def ner_inference(self, input_text, device, max_length):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong!')
            return

        pt_batch = self.tokenizer(
            [self.normalizer.normalize(sequence) for sequence in input_text],
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt"
        )

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        pt_batch = pt_batch.to(device)
        # Inference only: disable gradient tracking to save memory and time.
        with torch.no_grad():
            pt_outputs = self.model(**pt_batch)
        pt_predictions = torch.argmax(pt_outputs.logits, dim=-1)
        pt_predictions = pt_predictions.cpu().detach().numpy().tolist()

        output_predictions = []
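        for i in range(len(input_text)):
            # Recover the tokens from the ids that were actually fed to the model, so that
            # tokens and predictions stay aligned; padding positions (on the right) are dropped.
            n_tokens = int(pt_batch["attention_mask"][i].sum())
            tokens = self.tokenizer.convert_ids_to_tokens(pt_batch["input_ids"][i][:n_tokens].tolist())
            predictions = [(token, self.id2label[prediction]) for token, prediction in
                           zip(tokens, pt_predictions[i])]
            output_predictions.append(predictions)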
        return output_predictions

    def ner_evaluation(self, input_text, input_labels, device, batch_size=4):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong!')
            return

        max_len = 0
        tokenized_texts, new_labels = [], []
        for sentence, sentence_label in zip(input_text, input_labels):
            if isinstance(sentence, str):
                sentence = sentence.strip().split()
            if len(sentence) != len(sentence_label):
                print('Something went wrong! A sentence and its labels have different lengths!')
                return
            tokenized_sentence, new_sentence_label = [], []
            for word, label in zip(sentence, sentence_label):
                # Tokenize the word and count # of subwords the word is broken into
                tokenized_word = self.tokenizer.tokenize(word)
                n_subwords = len(tokenized_word)

                # Add the tokenized word to the final tokenized word list
                tokenized_sentence.extend(tokenized_word)
                # Add the same label to the new list of labels `n_subwords` times
                new_sentence_label.extend([label] * n_subwords)

            max_len = max(max_len, len(tokenized_sentence))
            tokenized_texts.append(tokenized_sentence)
            new_labels.append(new_sentence_label)

        max_len = min(max_len, self.config.max_position_embeddings)
        print("max_len:", max_len)
        input_ids = pad_sequences([self.tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                                  maxlen=max_len, dtype="long", value=self.config.pad_token_id,
                                  truncating="post", padding="post")
        del tokenized_texts
        input_labels = pad_sequences([[self.config.label2id.get(l) for l in lab] for lab in new_labels],
                                     maxlen=max_len, value=self.config.label2id.get('O'), padding="post",
                                     dtype="long", truncating="post")
        del new_labels

        train_data = TensorDataset(torch.tensor(input_ids), torch.tensor(input_labels))
        data_loader = DataLoader(train_data, batch_size=batch_size)
        # data_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)
        print("#samples:", len(input_ids))
        print("#batch:", len(data_loader))

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        total_loss, total_time = 0, 0
        output_predictions = []
        print("Start to evaluate test data ...")
        for step, batch in enumerate(data_loader):
            b_input_ids, b_labels = batch

            # move tensors to GPU if CUDA is available
            b_input_ids = b_input_ids.to(device)
            b_labels = b_labels.to(device)

            # Because `labels` is provided, the output contains the loss in addition to the logits.
            with torch.no_grad():
                start = time.monotonic()
                outputs = self.model(b_input_ids, labels=b_labels)
                end = time.monotonic()
                total_time += end - start
                print(f'inference time for step {step}: {end - start}')
            # get the loss
            total_loss += outputs.loss.item()

            b_predictions = torch.argmax(outputs.logits, dim=2)
            b_predictions = b_predictions.cpu().detach().numpy().tolist()
            b_labels = b_labels.cpu().detach().numpy().tolist()

            for i, sample in enumerate(b_input_ids):
                sample_input = sample.tolist()
                # remove trailing pad tokens (guarding against an all-padding sample)
                while sample_input and sample_input[-1] == self.config.pad_token_id:
                    sample_input.pop()
                # tokens = self.tokenizer.tokenize(self.tokenizer.decode(sample_input))
                tokens = [self.tokenizer.decode([t]) for t in sample_input]
                sample_true_labels = [self.id2label[e] for e in b_labels[i][:len(sample_input)]]
                sample_predictions = [self.id2label[e] for e in b_predictions[i][:len(sample_input)]]
                output_predictions.append(
                    [(t, sample_true_labels[j], sample_predictions[j]) for j, t in enumerate(tokens)])

        # Calculate the average loss over the evaluated batches.
        avg_loss = total_loss / len(data_loader)
        print("average loss:", avg_loss)
        print("total inference time:", total_time)
        print("total inference time / #samples:", total_time / len(input_ids))

        return output_predictions

    def ner_evaluation_2(self, input_text, input_labels, device, batch_size=4):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong!')
            return

        print("len(input_text):", len(input_text))
        print("len(input_labels):", len(input_labels))
        c = 0
        max_len = 0
        tokenized_texts, new_labels = [], []
        for sentence, sentence_label in zip(input_text, input_labels):
            if isinstance(sentence, str):
                sentence = sentence.strip().split()
            if len(sentence) != len(sentence_label):
                print('Something went wrong! A sentence and its labels have different lengths!')
                return
            tokenized_words = self.tokenizer(sentence, padding=False, add_special_tokens=False).input_ids
            tokenized_sentence_ids, new_sentence_label = [], []
            for i, tokenized_word in enumerate(tokenized_words):
                # Add the tokenized word to the final tokenized word list
                tokenized_sentence_ids += tokenized_word
                # Add the same label to the new list of labels `number of subwords` times
                new_sentence_label.extend([self.config.label2id.get(sentence_label[i])] * len(tokenized_word))

            max_len = max(max_len, len(tokenized_sentence_ids))
            tokenized_texts.append(tokenized_sentence_ids)
            new_labels.append(new_sentence_label)
            c += 1
            if c % 10000 == 0:
                print("c:", c)
        max_len = min(max_len, self.config.max_position_embeddings)
        print("max_len:", max_len)
        input_ids = pad_sequences(tokenized_texts, maxlen=max_len, dtype="long", value=self.config.pad_token_id,
                                  truncating="post", padding="post")
        del tokenized_texts
        input_labels = pad_sequences(new_labels, maxlen=max_len, value=self.config.label2id.get('O'), padding="post",
                                     dtype="long", truncating="post")
        del new_labels

        train_data = TensorDataset(torch.tensor(input_ids), torch.tensor(input_labels))
        data_loader = DataLoader(train_data, batch_size=batch_size)
        # data_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)
        print("#samples:", len(input_ids))
        print("#batch:", len(data_loader))

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        total_time = 0
        output_predictions = []
        print("Start to evaluate test data ...")
        for step, batch in enumerate(data_loader):
            b_input_ids, b_labels = batch

            # move tensors to GPU if CUDA is available
            b_input_ids = b_input_ids.to(device)
            b_labels = b_labels.to(device)

            # Because `labels` is provided, the output contains the loss in addition to the logits.
            with torch.no_grad():
                start = time.monotonic()
                outputs = self.model(b_input_ids, labels=b_labels)
                end = time.monotonic()
                total_time += end - start
                print(f'inference time for step {step}: {end - start}')

            b_predictions = torch.argmax(outputs.logits, dim=2)
            b_predictions = b_predictions.cpu().detach().numpy().tolist()
            b_labels = b_labels.cpu().detach().numpy().tolist()

            for i, sample in enumerate(b_input_ids):
                sample_input = sample.tolist()
                # remove trailing pad tokens (guarding against an all-padding sample)
                while sample_input and sample_input[-1] == self.config.pad_token_id:
                    sample_input.pop()
                # tokens = self.tokenizer.tokenize(self.tokenizer.decode(sample_input))
                tokens = [self.tokenizer.decode([t]) for t in sample_input]
                sample_true_labels = [self.id2label[e] for e in b_labels[i][:len(sample_input)]]
                sample_predictions = [self.id2label[e] for e in b_predictions[i][:len(sample_input)]]
                output_predictions.append(
                    [(t, sample_true_labels[j], sample_predictions[j]) for j, t in enumerate(tokens)])

        print("total inference time:", total_time)
        print("total inference time / #samples:", total_time / len(input_ids))

        return output_predictions

    def check_input_label_consistency(self, labels):
        model_labels = self.config.label2id.keys()
        dataset_labels = set()
        for l in labels:
            dataset_labels.update(set(l))
        print("model labels:", model_labels)
        print("dataset labels:", dataset_labels)
        print("intersection:", set(model_labels).intersection(dataset_labels))
        print("model_labels-dataset_labels:", list(set(model_labels) - set(dataset_labels)))
        print("dataset_labels-model_labels:", list(set(dataset_labels) - set(model_labels)))
        if list(set(dataset_labels) - set(model_labels)):
            return False
        return True

    @staticmethod
    def resolve_input_label_consistency(labels, label_translation_map):
        for i, sentence_labels in enumerate(labels):
            for j, label in enumerate(sentence_labels):
                labels[i][j] = label_translation_map.get(label)
        return labels

    @staticmethod
    def evaluate_prediction_results(labels, output_predictions):
        dataset_labels = set()
        for label in labels:
            dataset_labels.update(set(label))

        true_labels, predictions = [], []
        for sample_output in output_predictions:
            sample_true_labels = []
            sample_predicted_labels = []
            for token, true_label, predicted_label in sample_output:
                sample_true_labels.append(true_label)
                if predicted_label in dataset_labels:
                    sample_predicted_labels.append(predicted_label)
                else:
                    sample_predicted_labels.append('O')
            true_labels.append(sample_true_labels)
            predictions.append(sample_predicted_labels)

        print("Test Accuracy: {}".format(accuracy_score(true_labels, predictions)))
        print("Test Precision: {}".format(precision_score(true_labels, predictions)))
        print("Test Recall: {}".format(recall_score(true_labels, predictions)))
        print("Test F1-Score: {}".format(f1_score(true_labels, predictions)))
        print("Test classification Report:\n{}".format(classification_report(true_labels, predictions, digits=10)))


In [6]:
model_name='m3hrdadfi/albert-fa-base-v2-ner-arman'
ner_model = NER(model_name)


In [7]:
print(ner_model.config)

AlbertConfig {
  "architectures": [
    "AlbertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "id2label": {
    "0": "B-event",
    "1": "B-fac",
    "2": "B-loc",
    "3": "B-org",
    "4": "B-pers",
    "5": "B-pro",
    "6": "I-event",
    "7": "I-fac",
    "8": "I-loc",
    "9": "I-org",
    "10": "I-pers",
    "11": "I-pro",
    "12": "O"
  },
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "label2id": {
    "B-event": 0,
    "B-fac": 1,
    "B-loc": 2,
    "B-org": 3,
    "B-pers": 4,
    "B-pro": 5,
    "I-event": 6,
    "I-fac": 7,
    "I-loc": 8,
    "I-org": 9,
    "I-pers": 10,
    "I-pro": 11,
    "O": 12
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "n

#### Sample Inference:

In [8]:
texts = [
    "مدیرکل محیط زیست استان البرز با بیان اینکه با بیان اینکه موضوع شیرابه‌های زباله‌های انتقال یافته در منطقه حلقه دره خطری برای این استان است، گفت: در این مورد گزارشاتی در ۲۵ مرداد ۱۳۹۷ تقدیم مدیران استان شده است.",
    "به گزارش خبرگزاری تسنیم از کرج، حسین محمدی در نشست خبری مشترک با معاون خدمات شهری شهرداری کرج که با حضور مدیرعامل سازمان‌های پسماند، پارک‌ها و فضای سبز و نماینده منابع طبیعی در سالن کنفرانس شهرداری کرج برگزار شد، اظهار داشت: ۸۰٪  جمعیت استان البرز در کلانشهر کرج زندگی می‌کنند.",
    "وی افزود: با همکاری‌های مشترک بین اداره کل محیط زیست و شهرداری کرج برنامه‌های مشترکی برای حفاظت از محیط زیست در شهر کرج در دستور کار قرار گرفته که این اقدامات آثار مثبتی داشته و تاکنون نزدیک به ۱۰۰ میلیارد هزینه جهت خریداری اکس-ریس صورت گرفته است.",
]

In [9]:
inference_output = ner_model.ner_inference(texts, device, ner_model.config.max_position_embeddings)

In [10]:
print(inference_output)

[[('[CLS]', 'O'), ('▁مدیرکل', 'O'), ('▁محیط', 'B-org'), ('▁زیست', 'I-org'), ('▁استان', 'I-org'), ('▁البرز', 'I-org'), ('▁با', 'O'), ('▁بیان', 'O'), ('▁اینکه', 'O'), ('▁با', 'O'), ('▁بیان', 'O'), ('▁اینکه', 'O'), ('▁موضوع', 'O'), ('▁شیر', 'O'), ('ابه', 'O'), ('▁های', 'O'), ('▁زباله', 'O'), ('▁های', 'O'), ('▁انتقال', 'O'), ('▁یافته', 'O'), ('▁در', 'O'), ('▁منطقه', 'O'), ('▁حلقه', 'O'), ('▁در', 'O'), ('ه', 'O'), ('▁خطری', 'O'), ('▁برای', 'O'), ('▁این', 'O'), ('▁استان', 'O'), ('▁است', 'O'), ('،', 'O'), ('▁گفت', 'O'), (':', 'O'), ('▁در', 'O'), ('▁این', 'O'), ('▁مورد', 'O'), ('▁گزارشات', 'O'), ('ی', 'O'), ('▁در', 'O'), ('▁۲۵', 'O'), ('▁مرداد', 'O'), ('▁۱۳۹۷', 'O'), ('▁تقدیم', 'O'), ('▁مدیران', 'O'), ('▁استان', 'O'), ('▁شده', 'O'), ('▁است', 'O'), ('.', 'O'), ('[SEP]', 'O')], [('[CLS]', 'O'), ('▁به', 'O'), ('▁گزارش', 'O'), ('▁خبرگزاری', 'B-org'), ('▁تسنیم', 'I-org'), ('▁از', 'O'), ('▁کرج', 'B-loc'), ('،', 'O'), ('▁حسین', 'B-pers'), ('▁محمدی', 'I-pers'), ('▁در', 'O'), ('▁نشست', 'O'), ('▁خبری', 
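
The predictions above are at the SentencePiece sub-word level, where a leading `▁` marks the start of a new word. If word-level output is preferred, the pieces can be merged back together. The following is only a sketch: `merge_subwords` is a hypothetical helper, and keeping the label of the first piece of each word is an assumption, not something the model or the notebook prescribes.

```python
def merge_subwords(token_label_pairs, special_tokens=("[CLS]", "[SEP]", "<pad>")):
    """Merge SentencePiece pieces into words, keeping the first piece's label."""
    words = []
    for token, label in token_label_pairs:
        if token in special_tokens:
            continue
        if token.startswith("\u2581") or not words:
            # "▁" (U+2581) marks the start of a new word.
            words.append([token.lstrip("\u2581"), label])
        else:
            # Continuation piece: glue it to the current word, keep that word's label.
            words[-1][0] += token
    return [(word, label) for word, label in words]


# A few (token, label) pairs in the same shape as the inference output.
pairs = [
    ("[CLS]", "O"), ("▁مدیرکل", "O"), ("▁محیط", "B-org"), ("▁زیست", "I-org"),
    ("▁شیر", "O"), ("ابه", "O"), ("[SEP]", "O"),
]
merged = merge_subwords(pairs)
print(merged)
```

Note that punctuation pieces without a leading `▁` (such as `،`) are attached to the preceding word under this scheme.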

In [11]:
#@title Live Playground { display-mode: "form" }

css_is_load = False
css = """<style>
.ner-box {
    direction: rtl;
    font-size: 18px !important;
    line-height: 20px !important;
    margin: 0 0 15px;
    padding: 10px;
    text-align: justify;
    color: #343434 !important;
}
.token, .token span {
    display: inline-block !important;
    padding: 2px;
    margin: 2px 0;
}
.token.token-ner {
    background-color: #f6cd61;
    font-weight: bold;
    color: #000;
}
.token.token-ner .ner-label {
    color: #9a1f40;
    margin: 0px 2px;
}
</style>"""

if not css_is_load:
    display(HTML(css))
    css_is_load = True

submit_wd = widgets.Button(description='Send', disabled=False, button_style='success', tooltip='Submit')
text_wd = widgets.Textarea(placeholder='Please enter you text ...', rows=5, layout=Layout(width='90%'))
output_wd = widgets.Output()

display(HTML("""
<h2>Test NER model</h2>
<p style="padding: 2px 20px; margin: 0 0 20px;">
</p>
<br /><br />
"""))

display(text_wd)
display(submit_wd)
display(output_wd)

def submit_text(sender):
    with output_wd:
        clear_output(wait=True)
        text = text_wd.value
        _output = ner_model.ner_inference([text], device, ner_model.config.max_position_embeddings)
        # print(_output)
        pred_sequence = []
        for token, label in _output[0]:
            if token not in ['[CLS]', '[SEP]']:
                if label != 'O':
                    pred_sequence.append(
                        '<span class="token token-ner">%s<span class="ner-label">%s</span></span>' 
                        % (token, label))
                else:
                    pred_sequence.append(
                        '<span class="token">%s</span>' 
                        % token)
            
        html = '<p class="ner-box">%s</p>' % ' '.join(pred_sequence) 
        display(HTML(html))

submit_wd.on_click(submit_text)

Textarea(value='', layout=Layout(width='90%'), placeholder='Please enter you text ...', rows=5)

Button(button_style='success', description='Send', style=ButtonStyle(), tooltip='Submit')

Output()

#### PEYMA dataset:
The PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens, of which 41,148 tokens are tagged with seven different classes:

- Organization
- Money
- Location
- Date
- Time
- Person
- Percent

|     Label    |   #   |
|:------------:|:-----:|
| Organization | 16964 |
|     Money    |  2037 |
|   Location   |  8782 |
|     Date     |  4259 |
|     Time     |  732  |
|    Person    |  7675 |
|    Percent   |  699  |
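
Per-class token counts like the ones in this table can be derived from IOB-tagged label sequences. The sketch below assumes labels of the form `B_ORG` / `I_ORG` (the form used in the PEYMA files loaded later in this notebook); `count_entity_tokens` is a hypothetical helper, and the toy input is not meant to reproduce the numbers above.

```python
from collections import Counter


def count_entity_tokens(label_sequences):
    """Count tagged tokens per entity class, ignoring the B/I prefix and 'O'."""
    counts = Counter()
    for labels in label_sequences:
        for label in labels:
            if label == "O":
                continue
            # "B_ORG" -> "ORG"; "I-loc" -> "loc"
            entity_class = label.replace("-", "_").split("_", 1)[1]
            counts[entity_class] += 1
    return counts


# Toy label sequences, for illustration only.
toy_labels = [
    ["O", "O", "B_ORG", "O", "O", "B_LOC", "O", "O"],
    ["B_ORG", "I_ORG", "I_ORG", "O"],
]
toy_counts = count_entity_tokens(toy_labels)
print(toy_counts)
```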

Download
You can download the dataset from [here](https://hooshvare.github.io/docs/datasets/ner), which leads to the following Google Drive file of HooshvareLab:

In [12]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1WZxpFRtEs5HZWyWQ2Pyg9CCuIBs1Kmvx'})
download.GetContentFile('peyma.zip')
!ls

adc.json  peyma.zip  sample_data


In [13]:
!unzip peyma.zip
!ls
!ls peyma

Archive:  peyma.zip
   creating: peyma/
  inflating: peyma/dev.txt           
  inflating: peyma/test.txt          
  inflating: peyma/train.txt         
adc.json  peyma  peyma.zip  sample_data
dev.txt  test.txt  train.txt


In [14]:
sentences, labels = ner_model.load_test_datasets(dataset_name="peyma", dataset_dir="./peyma/")
print(len(sentences), len(labels))
print(sentences[0])
print(labels[0])

1026 1026
['کنایه', 'سرلشگر', 'فیروزآبادی', 'به', 'پادشاه', 'عربستان', 'و', 'پسرش']
['O', 'O', 'B_ORG', 'O', 'O', 'B_LOC', 'O', 'O']


In [15]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'O', 'B_LOC', 'I_TIM', 'B_PCT', 'B_MON', 'I_LOC', 'B_DAT', 'I_PER', 'I_ORG', 'I_MON', 'B_PER', 'I_PCT', 'B_ORG', 'I_DAT', 'B_TIM'}
intersection: {'O'}
model_labels-dataset_labels: ['I-pro', 'I-loc', 'B-event', 'I-event', 'B-pers', 'B-org', 'B-fac', 'I-fac', 'I-org', 'B-pro', 'B-loc', 'I-pers']
dataset_labels-model_labels: ['I_PER', 'I_ORG', 'B_LOC', 'I_TIM', 'B_PCT', 'I_MON', 'B_MON', 'B_PER', 'I_PCT', 'B_ORG', 'I_DAT', 'B_TIM', 'I_LOC', 'B_DAT']
False


In [16]:
label_translate = {
    'B_ORG': 'B-org', 
    'I_ORG': 'I-org', 
    'B_MON': 'O', 
    'I_MON': 'O', 
    'B_TIM': 'O', 
    'I_TIM': 'O', 
    'B_PER': 'B-pers', 
    'I_PER': 'I-pers', 
    'B_LOC': 'B-loc', 
    'I_LOC': 'I-loc', 
    'B_DAT': 'O', 
    'I_DAT': 'O', 
    'B_PCT': 'O', 
    'I_PCT': 'O', 
    'O': 'O'
}
labels = ner_model.resolve_input_label_consistency(labels, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'I-loc', 'O', 'B-org', 'I-org', 'B-pers', 'B-loc', 'I-pers'}
intersection: {'I-loc', 'O', 'B-pers', 'B-org', 'I-org', 'B-loc', 'I-pers'}
model_labels-dataset_labels: ['I-pro', 'B-event', 'I-event', 'I-fac', 'B-fac', 'B-pro']
dataset_labels-model_labels: []
True


In [17]:
!nvidia-smi
!lscpu

Mon Aug 16 14:18:22 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P0    28W /  70W |   1230MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [18]:
inference_output_peyma = ner_model.ner_evaluation(sentences, labels, device, batch_size=512)

max_len: 157
#samples: 1026
#batch: 3
Start to evaluate test data ...
inference time for step 0: 0.027959314999975504
inference time for step 1: 0.009731066000028932
inference time for step 2: 0.011429601000031653
average loss: 0.07772300268212955
total inference time: 0.04911998200003609
total inference time / #samples: 4.7875226120892875e-05
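
The summary figures in this log follow from the per-step values. The sketch below reproduces them from the numbers printed above; the variable names are illustrative.

```python
import math

num_samples = 1026
batch_size = 512
# Per-step times copied from the log above (seconds).
step_times = [0.027959314999975504, 0.009731066000028932, 0.011429601000031653]

# The last batch is partial, so the batch count is rounded up.
num_batches = math.ceil(num_samples / batch_size)
total_time = sum(step_times)
per_sample = total_time / num_samples

print(num_batches, total_time, per_sample)
```

The same rounding explains the batch counts in the later runs, for example `ceil(7681 / 512) = 16` for ARMAN. The notebook does not show how the timer is placed. CUDA kernels are launched asynchronously, so if the timer stops without synchronizing with the GPU, these figures may understate the real compute time.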


In [19]:
for sample_output in inference_output_peyma[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

کنایه	O	O
سرلشگر	O	O
فیروز	B-org	O
ابادی	B-org	O
به	O	O
پادشاه	O	O
عربستان	B-loc	B-org
و	O	O
پسرش	O	O

ر	O	O
<unk>	O	O
یس	O	O
سابق	O	O
ستاد	B-org	B-org
کل	I-org	I-org
نیروهای	I-org	I-org
مسلح	I-org	I-org
با	O	O
بیان	O	O
اینکه	O	O
ال	O	B-pers
سعود	O	I-org
با	O	O
حمایت	O	O
همه	O	O
جانبه	O	O
غرب	O	O
بر	O	O
سرزمین	B-loc	O
حجاز	I-loc	B-loc
حاکم	O	O
شد	O	O
گفت	O	O
:	O	O
غرب	O	O
با	O	O
حاکم	O	O
کرد	O	O
د	O	O
ال	O	B-pers
سعود	O	I-org
بر	O	O
حجاز	B-loc	B-loc
هدفی	O	O
جز	O	O
نابود	O	O
ی	O	O
اسلام	O	O
نداشته	O	O
و	O	O
این	O	O
نقشه	O	O
انگلیس	B-loc	B-org
بود	O	O
	O	O
.	O	O

سرلشگر	O	O
حسن	B-pers	B-pers
فیروز	I-pers	I-pers
ابادی	I-pers	O
روز	O	O
دوشنبه	O	O
در	O	O
حاشیه	O	O
ا	O	O
<unk>	O	O
ین	O	O
ختم	O	O
مادر	O	O
حیدر	B-pers	B-pers
مصلح	I-pers	I-pers
ی	I-pers	I-pers
در	O	O
جمع	O	O
خبرنگاران	O	O
درباره	O	O
موضوع	O	O
یمن	B-loc	O
افزود	O	O
:	O	O
ماهیت	O	O
	O	O
انچه	O	O
در	O	O
یمن	B-loc	O
اتفاق	O	O
می	O	O
افتد	O	O
وهابیت	O	O
است	O	O
وهابیت	O	O
یک	O	O
مذهب	O	O
انگلیسی	O	O
است	O	O
	O	O
.	O	O

وی	O	O
ادامه

In [20]:
ner_model.evaluate_prediction_results(labels, inference_output_peyma)

Test Accuracy: 0.9323801648483555
Test Precision: 0.5377906976744186
Test Recall: 0.2812975164723771
Test F1-Score: 0.36938435940099834
Test classification Report:
              precision    recall  f1-score   support

         loc  0.7509727626 0.2928679818 0.4213973799       659
         org  0.4683333333 0.3731739708 0.4153732446       753
        pers  0.4628571429 0.1443850267 0.2201086957       561

   micro avg  0.5377906977 0.2812975165 0.3693843594      1973
   macro avg  0.5607210796 0.2701423264 0.3522931067      1973
weighted avg  0.5611803891 0.2812975165 0.3618641180      1973
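
The averages in the report can be recomputed from its per-class rows. The sketch below checks the F1 formula and the macro and weighted precision against the printed values; it uses only the numbers shown above.

```python
# Per-class rows copied from the PEYMA report above: (precision, recall, support).
report = {
    "loc": (0.7509727626, 0.2928679818, 659),
    "org": (0.4683333333, 0.3731739708, 753),
    "pers": (0.4628571429, 0.1443850267, 561),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


total_support = sum(s for _, _, s in report.values())
# Macro average: every class counts equally.
macro_precision = sum(p for p, _, _ in report.values()) / len(report)
# Weighted average: every class counts in proportion to its support.
weighted_precision = sum(p * s for p, _, s in report.values()) / total_support
# Overall F1 from the overall precision and recall printed above.
micro_f1 = f1(0.5377906976744186, 0.2812975164723771)

print(total_support, macro_precision, weighted_precision, micro_f1)
```

The report lists entity types rather than `B-`/`I-` tags, which suggests entity-level scoring. The accuracy figure is presumably token-level and dominated by the frequent `O` tag, which would explain how it can reach 0.93 while F1 stays near 0.37.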



In [21]:
# Write one token per line as token<TAB>true label<TAB>predicted label;
# a blank line separates sentences.
output_file_name = "ner_peyma_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_peyma:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
# Authenticate the Colab session and upload the file to Google Drive.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()
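
The output files use a simple layout: one `token<TAB>true label<TAB>predicted label` line per token and a blank line between sentences. A minimal sketch of reading such a file back for later analysis follows; `read_predictions` is a hypothetical helper, and the demo writes a small temporary file instead of using the real outputs.

```python
import os
import tempfile


def read_predictions(path):
    """Parse a token/true/predicted TSV file into a list of sentences."""
    sentences, current = [], []
    with open(path, encoding="utf8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                # A blank line closes the current sentence.
                if current:
                    sentences.append(current)
                    current = []
                continue
            # Tokens may be empty strings, so split on tabs only.
            token, true_label, predicted_label = line.split("\t")
            current.append((token, true_label, predicted_label))
    if current:
        sentences.append(current)
    return sentences


sample = [[("تهران", "B-loc", "B-loc"), ("", "O", "O")], [("و", "O", "O")]]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ner_outputs.txt")
    with open(path, "w", encoding="utf8") as handle:
        for sentence in sample:
            for token, true_label, predicted_label in sentence:
                handle.write("{}\t{}\t{}\n".format(token, true_label, predicted_label))
            handle.write("\n")
    restored = read_predictions(path)
```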

#### Arman dataset:
The ARMAN dataset holds 7,682 sentences with 250,015 tokens tagged over six different classes.

1. Organization
2. Location
3. Facility
4. Event
5. Product
6. Person


|     Label    |   #   |
|:------------:|:-----:|
| Organization | 30108 |
|   Location   | 12924 |
|   Facility   |  4458 |
|     Event    |  7557 |
|    Product   |  4389 |
|    Person    | 15645 |

**Download**
You can download the dataset from [here](https://github.com/HaniehP/PersianNER)


In [22]:
!wget https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
!ls

--2021-08-16 14:18:49--  https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/HaniehP/PersianNER/master/ArmanPersoNERCorpus.zip [following]
--2021-08-16 14:18:50--  https://raw.githubusercontent.com/HaniehP/PersianNER/master/ArmanPersoNERCorpus.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1931170 (1.8M) [application/zip]
Saving to: ‘ArmanPersoNERCorpus.zip’


2021-08-16 14:18:50 (41.9 MB/s) - ‘ArmanPersoNERCorpus.zip’ saved [1931170/1931170]

adc.json						     peyma
ArmanPersoNERCorpus.zip					     peyma.zip
ner_peym

In [23]:
!unzip ArmanPersoNERCorpus.zip -d arman
!ls

Archive:  ArmanPersoNERCorpus.zip
  inflating: arman/test_fold1.txt    
  inflating: arman/ReadMe.txt        
  inflating: arman/train_fold3.txt   
  inflating: arman/train_fold2.txt   
  inflating: arman/train_fold1.txt   
  inflating: arman/test_fold3.txt    
  inflating: arman/test_fold2.txt    
adc.json						     peyma
arman							     peyma.zip
ArmanPersoNERCorpus.zip					     sample_data
ner_peyma_m3hrdadfi-albert-fa-base-v2-ner-arman_outputs.txt


In [24]:
sentences, labels = ner_model.load_test_datasets(dataset_name="arman", dataset_dir="./arman/")
print(len(sentences), len(labels))
print(sentences[0])
print(labels[0])

7681 7681
['افقی', ':', '0', 'ـ', 'از', 'عوامل', 'دوران', 'پهلوی', 'و', 'نخست\u200cوزیر', 'ایران', 'در', 'سالهای', 'ابتدائی', 'دهه', 'چهل', 'خورشیدی', 'كه', 'جلد', 'سوم', 'یادداشتهایش', 'هم', 'چندی', 'پیش', 'در', 'تهران', 'منتشر', 'شد', '0', 'ـ', 'پرستاری', 'از', 'ناخوش\u200cاحوال', 'ـ', 'پوشاک', 'و', 'جامه', 'ـ', 'فانتزی', 'و', 'شیک', '0', 'ـ', 'در', 'حال', 'وزیدن', 'ـ', 'اطلاعیه', 'ـ', 'پایتخت', 'جمهوری', 'استونی', 'در', 'حوضه', 'بالتیک', '0', 'ـ', 'علم', 'راهبرد', 'مؤسسه', 'و', 'سازمان', 'ـ', 'نوعی', 'شمع', '0', 'ـ', 'حرف', 'جمع', 'مؤنث', 'ـ', 'در', 'ایران', 'به', 'تولیدکننده', 'کتاب', 'اطلاق', 'می\u200cشود', 'ـ', 'از', 'شهرهای', 'باختری', 'افغانستان', 'كه', 'تا', 'عصر', 'ناصرالدین\u200cشاه', 'جزئی', 'از', 'خراسان', 'بود', 'ـ', 'ویتامین', 'انعقاد', '0', 'ـ', 'سبزی', 'غده\u200cای', 'ـ', 'دوستی', 'و', 'محبت', 'ـ', 'داستان', 'بلند', 'ـ', 'شهری', 'در', 'آلمان', '0', 'ـ', 'سلول', 'بدن', 'موجودات', 'ـ', 'از', 'انواع', 'کالباس', '0', 'ـ', 'حاشیه', 'و', 'هامش', 'ـ', 'پیدا', 'نشدنی', 'ـ', 'خ

In [25]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'I-pro', 'O', 'I-loc', 'B-event', 'I-event', 'B-org', 'B-pers', 'I-fac', 'B-fac', 'I-org', 'B-pro', 'B-loc', 'I-pers'}
intersection: {'I-pro', 'I-loc', 'B-event', 'O', 'I-event', 'B-org', 'B-pers', 'I-fac', 'B-fac', 'I-org', 'B-pro', 'B-loc', 'I-pers'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True


batch size=256 -> inference time for one batch is about 205 s

batch size=512 -> inference time for one batch is about 410 s

batch size=1024 -> crash

In [26]:
!nvidia-smi
!lscpu

Mon Aug 16 14:18:51 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P0    29W /  70W |   8138MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [27]:
inference_output_arman = ner_model.ner_evaluation(sentences, labels, device, batch_size=512)

max_len: 336
#samples: 7681
#batch: 16
Start to evaluate test data ...
inference time for step 0: 0.2841411040000139
inference time for step 1: 2.063989807999974
inference time for step 2: 0.00996770900002275
inference time for step 3: 0.010155363999956535
inference time for step 4: 0.01008237299998882
inference time for step 5: 0.009979308999959358
inference time for step 6: 0.009784642999989046
inference time for step 7: 0.010944468000047891
inference time for step 8: 0.009798693999982788
inference time for step 9: 0.009733252999978959
inference time for step 10: 0.009970598999984759
inference time for step 11: 0.009885033999921689
inference time for step 12: 0.009982058999980836
inference time for step 13: 0.009979889999954139
inference time for step 14: 0.010104045000048245
inference time for step 15: 0.036855937000041195
average loss: 0.04800332710146904
total inference time: 2.515354288999845
total inference time / #samples: 0.00032747744942062816


In [28]:
for sample_output in inference_output_arman[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

افق	O	O
ی	O	O
:	O	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
از	O	O
عوامل	O	O
دوران	O	O
پهلوی	O	O
و	O	O
نخست	O	O
وزیر	O	O
ایران	B-loc	B-org
در	O	O
سالهای	O	O
ابتدا	O	O
<unk>	O	O
ی	O	O
دهه	O	O
چهل	O	O
خورشیدی	O	O
	O	O
<unk>	O	O
ه	O	O
جلد	O	O
سوم	O	O
یادداشت	O	O
هایش	O	O
هم	O	O
چندی	O	O
پیش	O	O
در	O	O
تهران	B-loc	B-loc
منتشر	O	O
شد	O	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
پرستاری	O	O
از	O	O
ناخوش	O	O
احوال	O	O
	O	O
<unk>	O	O
پوشاک	O	O
و	O	O
جامه	O	O
	O	O
<unk>	O	O
فانتزی	O	O
و	O	O
شیک	O	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
در	O	O
حال	O	O
و	O	O
زیدن	O	O
	O	O
<unk>	O	O
اطلاعی	O	O
ه	O	O
	O	O
<unk>	O	O
پایتخت	O	O
جمهوری	O	B-org
استونی	B-loc	B-loc
در	I-loc	O
حوضه	I-loc	B-loc
بالتیک	I-loc	I-loc
	O	O
<unk>	O	O
	O	O
<unk>	O	O
علم	O	O
راهبرد	O	O
موسسه	O	O
و	O	O
سازمان	O	O
	O	O
<unk>	O	O
نوعی	O	O
شمع	O	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
حرف	O	O
جمع	O	O
مونث	O	O
	O	O
<unk>	O	O
در	O	O
ایران	B-loc	B-loc
به	O	O
تولیدکننده	O	O
کتاب	O	O
اطلاق	O	O
می	O	O
شود	O	O
	O	O
<unk>	O	O
از	O	O
شهرهای	O	O
باختری	O	O
افغانستان	B-loc	I-loc
	O

In [29]:
ner_model.evaluate_prediction_results(labels, inference_output_arman)

Test Accuracy: 0.9363070417572091
Test Precision: 0.6254681647940075
Test Recall: 0.2858682558803018
Test F1-Score: 0.3923940475154469
Test classification Report:
              precision    recall  f1-score   support

       event  0.2425249169 0.1018131102 0.1434184676       717
         fac  0.5292307692 0.2838283828 0.3694951665       606
         loc  0.7159090909 0.3163444640 0.4387950548      3983
         org  0.6122846929 0.3568857738 0.4509334610      5279
        pers  0.6650062267 0.2513532596 0.3648163962      4249
         pro  0.3714285714 0.0553780618 0.0963855422       939

   micro avg  0.6254681648 0.2858682559 0.3923940475     15773
   macro avg  0.5227307113 0.2276005087 0.3106406814     15773
weighted avg  0.6183163571 0.2858682559 0.3864549831     15773



In [30]:
# Write one token per line as token<TAB>true label<TAB>predicted label;
# a blank line separates sentences.
output_file_name = "ner_arman_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_arman:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
# Authenticate the Colab session and upload the file to Google Drive.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### Arman+Peyma

In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1WZxpFRtEs5HZWyWQ2Pyg9CCuIBs1Kmvx'})
download.GetContentFile('peyma.zip')
!ls

In [None]:
!unzip peyma.zip
!ls
!ls peyma

In [31]:
sentences_peyma, labels_peyma = ner_model.load_test_datasets(dataset_name="peyma", dataset_dir="./peyma/")
print(len(sentences_peyma), len(labels_peyma))
print(sentences_peyma[0])
print(labels_peyma[0])

1026 1026
['کنایه', 'سرلشگر', 'فیروزآبادی', 'به', 'پادشاه', 'عربستان', 'و', 'پسرش']
['O', 'O', 'B_ORG', 'O', 'O', 'B_LOC', 'O', 'O']


In [32]:
is_consistent = ner_model.check_input_label_consistency(labels_peyma)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'O', 'B_LOC', 'I_TIM', 'B_PCT', 'B_MON', 'I_LOC', 'B_DAT', 'I_PER', 'I_ORG', 'I_MON', 'B_PER', 'I_PCT', 'B_ORG', 'I_DAT', 'B_TIM'}
intersection: {'O'}
model_labels-dataset_labels: ['I-pro', 'I-loc', 'B-event', 'I-event', 'B-pers', 'B-org', 'B-fac', 'I-fac', 'I-org', 'B-pro', 'B-loc', 'I-pers']
dataset_labels-model_labels: ['I_PER', 'I_ORG', 'B_LOC', 'I_TIM', 'B_PCT', 'I_MON', 'B_MON', 'B_PER', 'I_PCT', 'B_ORG', 'I_DAT', 'B_TIM', 'I_LOC', 'B_DAT']
False


In [33]:
label_translate = {
    'B_ORG': 'B-org', 
    'I_ORG': 'I-org', 
    'B_MON': 'O', 
    'I_MON': 'O', 
    'B_TIM': 'O', 
    'I_TIM': 'O', 
    'B_PER': 'B-pers', 
    'I_PER': 'I-pers', 
    'B_LOC': 'B-loc', 
    'I_LOC': 'I-loc', 
    'B_DAT': 'O', 
    'I_DAT': 'O', 
    'B_PCT': 'O', 
    'I_PCT': 'O', 
    'O': 'O'
}
labels_peyma = ner_model.resolve_input_label_consistency(labels_peyma, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_peyma)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'I-loc', 'O', 'B-org', 'I-org', 'B-pers', 'B-loc', 'I-pers'}
intersection: {'I-loc', 'O', 'B-pers', 'B-org', 'I-org', 'B-loc', 'I-pers'}
model_labels-dataset_labels: ['I-pro', 'B-event', 'I-event', 'I-fac', 'B-fac', 'B-pro']
dataset_labels-model_labels: []
True


In [None]:
!wget https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
!ls

In [None]:
!unzip ArmanPersoNERCorpus.zip -d arman
!ls

In [34]:
sentences_arman, labels_arman = ner_model.load_test_datasets(dataset_name="arman", dataset_dir="./arman/")
print(len(sentences_arman), len(labels_arman))
print(sentences_arman[0])
print(labels_arman[0])

7681 7681
['افقی', ':', '0', 'ـ', 'از', 'عوامل', 'دوران', 'پهلوی', 'و', 'نخست\u200cوزیر', 'ایران', 'در', 'سالهای', 'ابتدائی', 'دهه', 'چهل', 'خورشیدی', 'كه', 'جلد', 'سوم', 'یادداشتهایش', 'هم', 'چندی', 'پیش', 'در', 'تهران', 'منتشر', 'شد', '0', 'ـ', 'پرستاری', 'از', 'ناخوش\u200cاحوال', 'ـ', 'پوشاک', 'و', 'جامه', 'ـ', 'فانتزی', 'و', 'شیک', '0', 'ـ', 'در', 'حال', 'وزیدن', 'ـ', 'اطلاعیه', 'ـ', 'پایتخت', 'جمهوری', 'استونی', 'در', 'حوضه', 'بالتیک', '0', 'ـ', 'علم', 'راهبرد', 'مؤسسه', 'و', 'سازمان', 'ـ', 'نوعی', 'شمع', '0', 'ـ', 'حرف', 'جمع', 'مؤنث', 'ـ', 'در', 'ایران', 'به', 'تولیدکننده', 'کتاب', 'اطلاق', 'می\u200cشود', 'ـ', 'از', 'شهرهای', 'باختری', 'افغانستان', 'كه', 'تا', 'عصر', 'ناصرالدین\u200cشاه', 'جزئی', 'از', 'خراسان', 'بود', 'ـ', 'ویتامین', 'انعقاد', '0', 'ـ', 'سبزی', 'غده\u200cای', 'ـ', 'دوستی', 'و', 'محبت', 'ـ', 'داستان', 'بلند', 'ـ', 'شهری', 'در', 'آلمان', '0', 'ـ', 'سلول', 'بدن', 'موجودات', 'ـ', 'از', 'انواع', 'کالباس', '0', 'ـ', 'حاشیه', 'و', 'هامش', 'ـ', 'پیدا', 'نشدنی', 'ـ', 'خ

In [35]:
is_consistent = ner_model.check_input_label_consistency(labels_arman)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'I-pro', 'O', 'I-loc', 'B-event', 'I-event', 'B-org', 'B-pers', 'I-fac', 'B-fac', 'I-org', 'B-pro', 'B-loc', 'I-pers'}
intersection: {'I-pro', 'I-loc', 'B-event', 'O', 'I-event', 'B-org', 'B-pers', 'I-fac', 'B-fac', 'I-org', 'B-pro', 'B-loc', 'I-pers'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True


In [36]:
sentences = sentences_arman + sentences_peyma
labels = labels_arman + labels_peyma
print(len(sentences), len(labels))

8707 8707


In [37]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'I-pro', 'O', 'I-loc', 'B-event', 'I-event', 'B-org', 'B-pers', 'I-fac', 'B-fac', 'I-org', 'B-pro', 'B-loc', 'I-pers'}
intersection: {'I-pro', 'I-loc', 'B-event', 'O', 'I-event', 'B-org', 'B-pers', 'I-fac', 'B-fac', 'I-org', 'B-pro', 'B-loc', 'I-pers'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True


In [38]:
!nvidia-smi
!lscpu

Mon Aug 16 14:24:23 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P0    32W /  70W |   1250MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [39]:
inference_output = ner_model.ner_evaluation(sentences, labels, device, batch_size=256)

max_len: 336
#samples: 8707
#batch: 35
Start to evaluate test data ...
inference time for step 0: 0.01855940500001907
inference time for step 1: 0.009843100000011873
inference time for step 2: 0.00986980200002563
inference time for step 3: 0.010030331999928421
inference time for step 4: 0.00997533800000383
inference time for step 5: 0.009905415000048379
inference time for step 6: 0.009560296000017843
inference time for step 7: 0.00978597499999978
inference time for step 8: 0.009569354999939605
inference time for step 9: 0.009718388000010236
inference time for step 10: 0.00986461499996949
inference time for step 11: 0.009801041999935478
inference time for step 12: 0.009667519999993601
inference time for step 13: 0.011084438000011687
inference time for step 14: 0.009890981000012289
inference time for step 15: 0.009637065999982042
inference time for step 16: 0.00957005499992647
inference time for step 17: 0.009519446000012977
inference time for step 18: 0.00978489100009483
inference time 

In [40]:
for sample_output in inference_output[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

افق	O	O
ی	O	O
:	O	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
از	O	O
عوامل	O	O
دوران	O	O
پهلوی	O	O
و	O	O
نخست	O	O
وزیر	O	O
ایران	B-loc	B-org
در	O	O
سالهای	O	O
ابتدا	O	O
<unk>	O	O
ی	O	O
دهه	O	O
چهل	O	O
خورشیدی	O	O
	O	O
<unk>	O	O
ه	O	O
جلد	O	O
سوم	O	O
یادداشت	O	O
هایش	O	O
هم	O	O
چندی	O	O
پیش	O	O
در	O	O
تهران	B-loc	B-loc
منتشر	O	O
شد	O	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
پرستاری	O	O
از	O	O
ناخوش	O	O
احوال	O	O
	O	O
<unk>	O	O
پوشاک	O	O
و	O	O
جامه	O	O
	O	O
<unk>	O	O
فانتزی	O	O
و	O	O
شیک	O	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
در	O	O
حال	O	O
و	O	O
زیدن	O	O
	O	O
<unk>	O	O
اطلاعی	O	O
ه	O	O
	O	O
<unk>	O	O
پایتخت	O	O
جمهوری	O	B-org
استونی	B-loc	B-loc
در	I-loc	O
حوضه	I-loc	B-loc
بالتیک	I-loc	I-loc
	O	O
<unk>	O	O
	O	O
<unk>	O	O
علم	O	O
راهبرد	O	O
موسسه	O	O
و	O	O
سازمان	O	O
	O	O
<unk>	O	O
نوعی	O	O
شمع	O	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
حرف	O	O
جمع	O	O
مونث	O	O
	O	O
<unk>	O	O
در	O	O
ایران	B-loc	B-loc
به	O	O
تولیدکننده	O	O
کتاب	O	O
اطلاق	O	O
می	O	O
شود	O	O
	O	O
<unk>	O	O
از	O	O
شهرهای	O	O
باختری	O	O
افغانستان	B-loc	I-loc
	O

In [41]:
ner_model.evaluate_prediction_results(labels, inference_output)

Test Accuracy: 0.9352220364852096
Test Precision: 0.6142367906066536
Test Recall: 0.2829933506142229
Test F1-Score: 0.38747010261553894
Test classification Report:
              precision    recall  f1-score   support

       event  0.2232415902 0.1018131102 0.1398467433       717
         fac  0.5043988270 0.2838283828 0.3632523759       606
         loc  0.7241379310 0.3121499354 0.4362486828      4642
         org  0.5908839779 0.3546087533 0.4432242022      6032
        pers  0.6526980482 0.2363825364 0.3470695971      4810
         pro  0.3586206897 0.0553780618 0.0959409594       939

   micro avg  0.6142367906 0.2829933506 0.3874701026     17746
   macro avg  0.5089968440 0.2240267966 0.3042637601     17746
weighted avg  0.6123978801 0.2829933506 0.3819727911     17746



In [42]:
# Write one token per line as token<TAB>true label<TAB>predicted label;
# a blank line separates sentences.
output_file_name = "ner_arman-and-peyma_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
# Authenticate the Colab session and upload the file to Google Drive.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### WikiAnn dataset:

In [43]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1QOG15HU8VfZvJUNKos024xI-OGm0zhEX'})
download.GetContentFile('fa.tar.gz')
!ls

adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner_arman-and-peyma_m3hrdadfi-albert-fa-base-v2-ner-arman_outputs.txt
ner_arman_m3hrdadfi-albert-fa-base-v2-ner-arman_outputs.txt
ner_peyma_m3hrdadfi-albert-fa-base-v2-ner-arman_outputs.txt
peyma
peyma.zip
sample_data


In [44]:
!tar -zxvf fa.tar.gz
!ls

README.txt
wikiann-fa.bio
adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner_arman-and-peyma_m3hrdadfi-albert-fa-base-v2-ner-arman_outputs.txt
ner_arman_m3hrdadfi-albert-fa-base-v2-ner-arman_outputs.txt
ner_peyma_m3hrdadfi-albert-fa-base-v2-ner-arman_outputs.txt
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio


In [45]:
sentences_all, labels_all, sentences_test, labels_test = ner_model.load_datasets(dataset_name="wikiann", dataset_dir="./")
print(len(sentences_all), len(labels_all))
print(len(sentences_test), len(labels_test))
print(sentences_test[0])
print(labels_test[0])

all data: #data: 272266, #labels: 272266


  return array(a, dtype, copy=False, order=order)


without stratify
test part:
 #data: 27227, #labels: 27227
272266 272266
27227 27227
['**', 'زاغی', 'نوک\u200cزرد', ',', "''Pica", 'nuttalli', "''"]
['O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O']


In [46]:
is_consistent = ner_model.check_input_label_consistency(labels_test)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'I-PER', 'O', 'I-ORG', 'B-ORG', 'B-LOC', 'B-PER', 'I-LOC'}
intersection: {'O'}
model_labels-dataset_labels: ['I-pro', 'I-loc', 'B-event', 'I-event', 'B-pers', 'B-org', 'B-fac', 'I-fac', 'I-org', 'B-pro', 'B-loc', 'I-pers']
dataset_labels-model_labels: ['I-PER', 'I-ORG', 'B-ORG', 'B-LOC', 'B-PER', 'I-LOC']
False


In [47]:
label_translate = {
    'B-ORG': 'B-org', 
    'I-ORG': 'I-org',
    'B-LOC': 'B-loc',
    'I-LOC': 'I-loc',
    'B-PER': 'B-pers', 
    'I-PER': 'I-pers',
    'O': 'O'
}
labels_test = ner_model.resolve_input_label_consistency(labels_test, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_test)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'I-loc', 'O', 'B-org', 'B-pers', 'I-org', 'B-loc', 'I-pers'}
intersection: {'I-loc', 'O', 'B-pers', 'B-org', 'I-org', 'B-loc', 'I-pers'}
model_labels-dataset_labels: ['I-pro', 'B-event', 'I-event', 'I-fac', 'B-fac', 'B-pro']
dataset_labels-model_labels: []
True


In [48]:
!nvidia-smi
!lscpu

Mon Aug 16 14:30:44 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P0    32W /  70W |   8010MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [49]:
inference_output_wikiann = ner_model.ner_evaluation_2(sentences_test, labels_test, device, batch_size=512)

len(input_text): 27227
len(input_labels): 27227
c: 10000
c: 20000
max_len: 133
#samples: 27227
#batch: 54
Start to evaluate test data ...
inference time for step 0: 0.038403795000022
inference time for step 1: 0.012896017999992182
inference time for step 2: 0.009986004999973375
inference time for step 3: 0.009825083000123414
inference time for step 4: 0.010110297999972317
inference time for step 5: 0.009672426999941308
inference time for step 6: 0.009888551999893025
inference time for step 7: 0.009757372999956715
inference time for step 8: 0.009500666000121782
inference time for step 9: 0.009857393999936903
inference time for step 10: 0.009779161999858843
inference time for step 11: 0.009614136999971379
inference time for step 12: 0.010209060999841313
inference time for step 13: 0.009889075000046432
inference time for step 14: 0.009589998000137712
inference time for step 15: 0.00953738299995166
inference time for step 16: 0.011048679999930755
inference time for step 17: 0.0096241430001

In [50]:
for sample_output in inference_output_wikiann[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

**	O	O
زاغ	B-loc	O
ی	B-loc	O
نوک	I-loc	O
زرد	I-loc	O
,	O	O
"	O	O
p	O	O
ica	O	O
	O	O
nut	O	O
ta	O	O
lli	O	O
"	O	O

تغییر	O	O
مسیر	O	O
مک	B-loc	O
ویل	B-loc	O
،	B-loc	O
داکوتا	I-loc	O
ی	I-loc	O
شمالی	I-loc	O

و	B-loc	O
ست	B-loc	O
یونیورسیت	I-loc	O
ی	I-loc	O
پلیس	I-loc	O
،	I-loc	O
تگزاس	I-loc	O

تغییر	O	O
مسیر	O	O
دل	B-pers	O
تف	B-pers	O
فون	I-pers	O
لیل	I-pers	O
نسر	I-pers	O
ون	I-pers	O

تغییر	O	O
مسیر	O	O
نیروگاه	B-org	O
های	B-org	O
زنجیره	I-org	O
ای	I-org	O
یاسوج	I-org	O



In [51]:
ner_model.evaluate_prediction_results(labels_test, inference_output_wikiann)

Test Accuracy: 0.346973119918487
Test Precision: 0.3301707779886148
Test Recall: 0.0123258559622196
Test F1-Score: 0.02376453984657759
Test classification Report:
              precision    recall  f1-score   support

         loc  0.3724247227 0.0105192480 0.0204605807     22340
         org  0.2973333333 0.0174150722 0.0329029878     12805
        pers  0.3200000000 0.0088827203 0.0172856178      7205

   micro avg  0.3301707780 0.0123258560 0.0237645398     42350
   macro avg  0.3299193520 0.0122723469 0.0235497288     42350
weighted avg  0.3408009832 0.0123258560 0.0236825268     42350



In [52]:
# Write one token per line as token<TAB>true label<TAB>predicted label;
# a blank line separates sentences.
output_file_name = "ner_wikiann_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_wikiann:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
# Authenticate the Colab session and upload the file to Google Drive.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### Hooshvare - Arman+Peyma+WikiAnn

https://github.com/hooshvare/parsner

In [53]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1fC2WGlpqumUTaT9Dr_U1jO2no3YMKFJ4'})
download.GetContentFile('ner-v1.zip')
!ls

adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner_arman-and-peyma_m3hrdadfi-albert-fa-base-v2-ner-arman_outputs.txt
ner_arman_m3hrdadfi-albert-fa-base-v2-ner-arman_outputs.txt
ner_peyma_m3hrdadfi-albert-fa-base-v2-ner-arman_outputs.txt
ner-v1.zip
ner_wikiann_m3hrdadfi-albert-fa-base-v2-ner-arman_outputs.txt
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio


In [54]:
!unzip ner-v1.zip
!ls
!ls ner

Archive:  ner-v1.zip
   creating: ner/
  inflating: ner/valid.csv           
  inflating: ner/ner.csv             
  inflating: ner/test.csv            
  inflating: ner/train.csv           
adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner
ner_arman-and-peyma_m3hrdadfi-albert-fa-base-v2-ner-arman_outputs.txt
ner_arman_m3hrdadfi-albert-fa-base-v2-ner-arman_outputs.txt
ner_peyma_m3hrdadfi-albert-fa-base-v2-ner-arman_outputs.txt
ner-v1.zip
ner_wikiann_m3hrdadfi-albert-fa-base-v2-ner-arman_outputs.txt
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio
ner.csv  test.csv  train.csv  valid.csv


In [55]:
sentences_paw, labels_paw = ner_model.load_test_datasets(dataset_name="hooshvare-peyman+arman+wikiann", dataset_dir="./ner/")
print(len(sentences_paw), len(labels_paw))
print(sentences_paw[0])
print(labels_paw[0])

test part:
 #sentences: 6049, #sentences_tags: 6049
6049 6049
['همچنین', 'عملیات', 'لرزه\u200cنگاری', 'دوبعدی', 'نیز', 'با', 'فعالیت', 'مستمر', 'چهار', 'گروه', 'کاری', 'در', 'مناطقی', 'که', 'از', 'نظر', 'اکتشافی', 'مورد', 'نظر', 'بود', '،', 'به', 'پایان', 'رسید', 'که', 'نتایج', 'آن', 'در', 'حال', 'بررسی', 'است', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [56]:
is_consistent = ner_model.check_input_label_consistency(labels_paw)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'B-TIM', 'I-PER', 'I-DAT', 'O', 'I-ORG', 'I-PCT', 'B-DAT', 'B-FAC', 'I-FAC', 'I-EVE', 'I-LOC', 'B-PCT', 'I-PRO', 'B-MON', 'B-PRO', 'B-ORG', 'B-LOC', 'B-EVE', 'I-MON', 'B-PER', 'I-TIM'}
intersection: {'O'}
model_labels-dataset_labels: ['I-pro', 'I-loc', 'B-event', 'I-event', 'B-pers', 'B-org', 'B-fac', 'I-fac', 'I-org', 'B-pro', 'B-loc', 'I-pers']
dataset_labels-model_labels: ['I-DAT', 'I-ORG', 'B-DAT', 'I-FAC', 'I-EVE', 'I-PRO', 'B-LOC', 'B-EVE', 'I-MON', 'B-TIM', 'I-PER', 'I-PCT', 'B-FAC', 'B-PCT', 'I-LOC', 'B-MON', 'B-PRO', 'B-ORG', 'B-PER', 'I-TIM']
False
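
The mismatch reported above comes down to set arithmetic over tag names: the model's label set and the dataset's label set share only `O`. A minimal sketch of such a check follows. `compare_label_sets` is a hypothetical helper, not the notebook's `check_input_label_consistency`, and it assumes that "consistent" means every dataset tag is known to the model.

```python
def compare_label_sets(model_labels, dataset_label_sequences):
    """Hypothetical helper: compare model labels with the tags found in a dataset."""
    model_set = set(model_labels)
    # Flatten the per-sentence tag lists into one set of distinct tags.
    dataset_set = {tag for sentence in dataset_label_sequences for tag in sentence}
    return {
        "intersection": model_set & dataset_set,
        "model_only": sorted(model_set - dataset_set),
        "dataset_only": sorted(dataset_set - model_set),
        # Assumption: consistency means the dataset uses no tag unknown to the model.
        "is_consistent": dataset_set <= model_set,
    }


report = compare_label_sets(
    ["B-pers", "I-pers", "O"],
    [["B-PER", "I-PER", "O"], ["O", "O"]],
)
print(report["dataset_only"])   # ['B-PER', 'I-PER']
print(report["is_consistent"])  # False
```

Tag names are compared case-sensitively, so `B-PER` and `B-pers` count as different labels until they are translated.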


In [57]:
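# Map the dataset's tag names onto the model's label set.
# Entity types the model does not predict (PCT, DAT, MON, TIM) collapse to 'O'.
label_translate = {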
    'B-PER': 'B-pers', 
    'I-PER': 'I-pers', 
    'B-PCT': 'O', 
    'I-PCT': 'O', 
    'B-DAT': 'O', 
    'I-DAT': 'O', 
    'B-PRO': 'B-pro', 
    'I-PRO': 'I-pro',
    'B-MON': 'O', 
    'I-MON': 'O', 
    'B-TIM': 'O', 
    'I-TIM': 'O', 
    'B-FAC': 'B-fac', 
    'I-FAC': 'I-fac', 
    'B-EVE': 'B-event', 
    'I-EVE': 'I-event', 
    'B-LOC': 'B-loc', 
    'I-LOC': 'I-loc', 
    'B-ORG': 'B-org', 
    'I-ORG': 'I-org', 
    'O': 'O'
}
labels_paw = ner_model.resolve_input_label_consistency(labels_paw, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_paw)
print(is_consistent)

model labels: dict_keys(['B-event', 'B-fac', 'B-loc', 'B-org', 'B-pers', 'B-pro', 'I-event', 'I-fac', 'I-loc', 'I-org', 'I-pers', 'I-pro', 'O'])
dataset labels: {'I-pro', 'O', 'I-loc', 'B-event', 'I-event', 'B-org', 'B-pers', 'I-org', 'B-fac', 'I-fac', 'B-pro', 'B-loc', 'I-pers'}
intersection: {'I-pro', 'I-loc', 'B-event', 'O', 'I-event', 'B-org', 'B-pers', 'I-org', 'B-fac', 'I-fac', 'B-pro', 'B-loc', 'I-pers'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True
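
The translation step can be pictured as a per-tag dictionary lookup. The sketch below is an assumption about what `resolve_input_label_consistency` might do, not its actual implementation. `translate_labels` is a hypothetical name, and it raises on unmapped tags so that a missing entry cannot pass unnoticed.

```python
def translate_labels(label_sequences, mapping):
    """Hypothetical helper: rewrite every tag through `mapping`."""
    translated = []
    for sentence in label_sequences:
        # mapping[tag] raises KeyError if a tag has no translation.
        translated.append([mapping[tag] for tag in sentence])
    return translated


toy_mapping = {
    "B-PER": "B-pers",
    "I-PER": "I-pers",
    "B-DAT": "O",  # entity type the model does not predict
    "I-DAT": "O",
    "O": "O",
}
result = translate_labels([["B-PER", "I-PER", "O", "B-DAT", "I-DAT"]], toy_mapping)
print(result)  # [['B-pers', 'I-pers', 'O', 'O', 'O']]
```

Collapsing the unsupported types to `O` means they no longer count as entities in the evaluation, so the scores cover only the six types the model predicts.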


In [58]:
!nvidia-smi
!lscpu

Mon Aug 16 14:37:42 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P0    32W /  70W |   6922MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [59]:
inference_output = ner_model.ner_evaluation_2(sentences_paw, labels_paw, device, batch_size=256)

len(input_text): 6049
len(input_labels): 6049
max_len: 512
#samples: 6049
#batch: 24
Start to evaluate test data ...
inference time for step 0: 0.20471088799990866
inference time for step 1: 0.010605428999951982
inference time for step 2: 0.010438328999953228
inference time for step 3: 0.00972552099983659
inference time for step 4: 0.009970867999982147
inference time for step 5: 0.009801550000020143
inference time for step 6: 0.009837374999960957
inference time for step 7: 0.009708599000077811
inference time for step 8: 0.009966488000145546
inference time for step 9: 0.01003615899981014
inference time for step 10: 0.010557189999872207
inference time for step 11: 0.009808259999999791
inference time for step 12: 0.009732859999985521
inference time for step 13: 0.00956010899994908
inference time for step 14: 0.009612392000008185
inference time for step 15: 0.009890806000157681
inference time for step 16: 0.00973875400018187
inference time for step 17: 0.009648479000134103
inference time f
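
The `#batch: 24` figure in the log follows from the sample count and the batch size: 6049 samples in batches of 256 give 23 full batches plus one partial batch. A small sketch of that arithmetic, with `batch_slices` as a hypothetical helper:

```python
import math


def batch_slices(num_samples, batch_size):
    """Return (start, end) index pairs covering all samples in order."""
    num_batches = math.ceil(num_samples / batch_size)
    return [
        (i * batch_size, min((i + 1) * batch_size, num_samples))
        for i in range(num_batches)
    ]


slices = batch_slices(6049, 256)
print(len(slices))                   # 24
print(slices[-1])                    # (5888, 6049)
print(slices[-1][1] - slices[-1][0])  # 161 samples in the last batch
```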

In [60]:
# Print token, true label, and predicted label for the first five sentences.
for sample_output in inference_output[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

همچنین	O	O
عملیات	O	O
	O	O
لرزه	O	O
نگاری	O	O
دو	O	O
بعدی	O	O
نیز	O	O
با	O	O
فعالیت	O	O
مستمر	O	O
چهار	O	O
گروه	O	O
کاری	O	O
در	O	O
مناطق	O	O
ی	O	O
که	O	O
از	O	O
نظر	O	O
اکتشاف	O	O
ی	O	O
مورد	O	O
نظر	O	O
بود	O	O
،	O	O
به	O	O
پایان	O	O
رسید	O	O
که	O	O
نتایج	O	O
ان	O	O
در	O	O
حال	O	O
بررسی	O	O
است	O	O
	O	O
.	O	O

محدث	B-pers	O
در	O	O
مورد	O	O
مشارکت	O	O
شرکتهای	O	O
خارجی	O	O
در	O	O
فعالیتهای	O	O
اکتشاف	O	O
ی	O	O
کشور	O	O
گفت	O	O
:	O	O
تاکنون	O	O
چند	O	O
منطقه	O	O
اکتشاف	O	O
ی	O	O
را	O	O
برای	O	O
مشارکت	O	O
و	O	O
سرمایه	O	O
گذاری	O	O
شرکتهای	O	O
خارجی	O	O
اعلام	O	O
کرده	O	O
ایم	O	O
و	O	O
در	O	O
حال	O	O
مذاکره	O	O
با	O	O
طرفهای	O	O
خارجی	O	O
هستیم	O	O
و	O	O
انتظار	O	O
می	O	O
رود	O	O
تا	O	O
اخر	O	O
امسال	O	O
ب	O	O
توانیم	O	O
چند	O	O
قرارداد	O	O
را	O	O
نهایی	O	O
کنیم	O	O
	O	O
.	O	O

مدیر	O	O
امور	B-org	O
اکتشاف	I-org	O
شرکت	I-org	B-org
ملی	I-org	I-org
نفت	I-org	I-org
فرو	O	O
افتادگی	O	O
دزفول	B-loc	B-loc
و	O	O
منطقه	B-loc	B-loc
گسل	I-loc	I-loc
کازرون	I-loc	B-loc
تا	O	O
بالا	B-loc	O
رود	B-loc
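
The listing above prints sub-word pieces rather than whole words (for example `مناطقی` appears as `مناطق` and `ی`), and each piece seems to carry the label of the word it came from. A minimal sketch of that expansion, under the assumption that the word label is copied unchanged to every piece; `expand_labels_to_pieces` and the toy splitter are hypothetical stand-ins for the model's real tokenizer:

```python
def expand_labels_to_pieces(words, labels, split_word):
    """Repeat each word-level label for every sub-word piece of that word."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        for piece in split_word(word):
            pieces.append(piece)
            piece_labels.append(label)
    return pieces, piece_labels


# Toy splitter standing in for the model's real tokenizer (assumption).
TOY_SPLITS = {"مناطقی": ["مناطق", "ی"], "اکتشافی": ["اکتشاف", "ی"]}


def toy_split(word):
    return TOY_SPLITS.get(word, [word])


# The labels are illustrative only; in the first test sentence this word is 'O'.
pieces, piece_labels = expand_labels_to_pieces(
    ["در", "مناطقی"], ["O", "B-loc"], toy_split
)
print(piece_labels)  # ['O', 'B-loc', 'B-loc']
```

If labels are copied this way, a `B-` tag can repeat inside a single word, and an entity-level scorer would then count two entities. Rewriting continuation pieces to the matching `I-` tag is a common alternative.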

In [61]:
ner_model.evaluate_prediction_results(labels_paw, inference_output)

Test Accuracy: 0.9367223496014849
Test Precision: 0.5516513056835638
Test Recall: 0.2651347360649686
Test F1-Score: 0.3581401146846173
Test classification Report:
              precision    recall  f1-score   support

       event  0.0765765766 0.0532915361 0.0628465804       319
         fac  0.3859649123 0.2435424354 0.2986425339       271
         loc  0.7358934169 0.2943573668 0.4205105240      3190
         org  0.4700598802 0.3145392101 0.3768861454      3494
        pers  0.6497777778 0.2311097060 0.3409514925      3163
         pro  0.2763157895 0.0526315789 0.0884210526       399

   micro avg  0.5516513057 0.2651347361 0.3581401147     10836
   macro avg  0.4324313922 0.1982453055 0.2647097215     10836
weighted avg  0.5799566954 0.2651347361 0.3574158841     10836
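
Accuracy is presumably computed over all tokens, most of which are `O`, so it stays near 0.94 even though entity recall is low. The averaged F1 rows can be re-derived from the numbers in the report. The sketch below copies the values from the table above, and `f1_score` is a local helper, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (f1, support) per entity type, copied from the classification report.
per_class = {
    "event": (0.0628465804, 319),
    "fac": (0.2986425339, 271),
    "loc": (0.4205105240, 3190),
    "org": (0.3768861454, 3494),
    "pers": (0.3409514925, 3163),
    "pro": (0.0884210526, 399),
}

micro_f1 = f1_score(0.5516513057, 0.2651347361)
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)
total_support = sum(support for _, support in per_class.values())
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print(round(micro_f1, 4))     # 0.3581
print(round(macro_f1, 4))     # 0.2647
print(round(weighted_f1, 4))  # 0.3574
print(total_support)          # 10836
```

The macro average weights all six types equally, so the weak `event` and `pro` scores pull it well below the micro and weighted averages.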



In [62]:
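from google.colab import auth
from oauth2client.client import GoogleCredentials
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

# Write one "token<TAB>true_label<TAB>predicted_label" line per token,
# with a blank line between sentences.
output_file_name = "ner_arman-and-peyma-and-wikiann_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')

# Authenticate the Colab user and upload the results file to Google Drive.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()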
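
The saved file uses one tab-separated `token`, `true_label`, `predicted_label` triple per line and a blank line between sentences. The following sketch shows one way to read it back for later analysis. `read_predictions` is a hypothetical helper, and the demo writes a small temporary file rather than using the real output.

```python
import os
import tempfile


def read_predictions(path):
    """Parse the tab-separated output file back into a list of sentences."""
    sentences, current = [], []
    with open(path, encoding="utf8") as handle:
        for raw in handle:
            # Strip only the newline: a token may be an empty string,
            # so leading tabs must be preserved.
            line = raw.rstrip("\n")
            if not line:
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, true_label, predicted_label = line.split("\t")
            current.append((token, true_label, predicted_label))
    if current:
        sentences.append(current)
    return sentences


# Demo on a temporary file with two sentences, one containing an empty token.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf8", suffix=".txt", delete=False
) as tmp:
    tmp.write("a\tO\tO\n\tO\tO\n\nb\tB-loc\tO\n\n")
    tmp_path = tmp.name

parsed = read_predictions(tmp_path)
os.remove(tmp_path)
print(len(parsed))  # 2
```

Stripping only the trailing newline matters here because the printed samples include tokens that render as empty, which a full `strip()` would turn into malformed two-field lines.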