# ALBERT Persian
A Lite BERT for Self-supervised Learning of Language Representations for the Persian Language

[ALBERT-Persian](https://github.com/m3hrdadfi/albert-persian) is the first attempt at an ALBERT model for the Persian language. The model was trained based on Google's ALBERT BASE Version 2.0 over a corpus covering various writing styles and numerous subjects (e.g., scientific, novels, news), with more than 3.9M documents, 73M sentences, and 1.3B words, following the same approach we used for ParsBERT.



## Persian NER [ARMAN, PEYMA]

This task aims to extract named entities from text, such as names, and label them with the appropriate **NER** classes, such as locations, organizations, etc. The datasets used for this task contain sentences annotated in **IOB** format. In this format, tokens that are not part of an entity are tagged as **"O"**, the **"B"** tag marks the first word of an entity, and the **"I"** tag marks the remaining words of the same entity. Both **"B"** and **"I"** tags are followed by a hyphen (or underscore) and then the entity category. Therefore, the **NER task is a multi-class token classification problem that labels each token of a raw input text**.
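
To make the IOB scheme concrete, here is a minimal sketch that groups tagged tokens back into entity spans. The helper name `iob_to_spans` is hypothetical and not part of this notebook; the sample tokens and tags are taken from the inference output shown later in this notebook.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        # Split "B_ORG" or "I-ORG" into prefix and entity type; "O" has no type.
        prefix, _, entity_type = tag.replace("-", "_").partition("_")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A new entity starts: flush the one in progress, if any.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = entity_type, [token]
        elif prefix == "I":
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O" closes any open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["خبرگزاری", "تسنیم", "از", "کرج"]
tags = ["B_ORG", "I_ORG", "O", "B_LOC"]
print(iob_to_spans(tokens, tags))
# → [('ORG', 'خبرگزاری تسنیم'), ('LOC', 'کرج')]
```

The sketch treats an "I" tag whose category differs from the open entity as the start of a new entity, which is one common convention rather than the only possible one.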

There are two primary datasets used in Persian NER: **ARMAN** and **PEYMA**. In ParsBERT, we prepared NER models for both datasets as well as for a combination of the two.


In [1]:
!nvidia-smi
!lscpu

Mon Aug 16 13:54:55 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P8    31W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!pip install hazm==0.7.0
!pip install seqeval==1.2.2
!pip install sentencepiece==0.1.96
!pip install transformers==4.7.0

Collecting hazm==0.7.0
  Downloading hazm-0.7.0-py3-none-any.whl (316 kB)
     |████████████████████████████████| 316 kB 4.2 MB/s
Collecting libwapiti>=0.2.1
  Downloading libwapiti-0.2.1.tar.gz (233 kB)
     |████████████████████████████████| 233 kB 38.2 MB/s
Collecting nltk==3.3
  Downloading nltk-3.3.0.zip (1.4 MB)
     |████████████████████████████████| 1.4 MB 42.3 MB/s
Building wheels for collected packages: nltk, libwapiti
  Building wheel for nltk (setup.py) ... done
  Created wheel for nltk: filename=nltk-3.3-py3-none-any.whl size=1394486 sha256=846547c51aa4658c1bf23e38843c8ae241e3adec7a98759e1bd4df7ce67442c4
  Stored in directory: /root/.cache/pip/wheels/9b/fd/0c/d92302c876e5de87ebd7fc0979d82edb93e2d8d768bf71fac4
  Building wheel for libwapiti (setup.py) ... done
  Created wheel for libwapiti: filename=libwapiti-0.2.1-cp37-cp37m-linux_x86_64.whl size=154592 sha256=2097aac07c6167b9cefd9a052f9a90842c1b167e306b9bf64d5165fe0f85babc
 

In [3]:
!pip install PyDrive
import os
import IPython.display as ipd
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)



In [4]:
import os
import gc
import ast
import time
import hazm
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

import transformers
from transformers import AutoTokenizer, AutoConfig
from transformers import AutoModelForTokenClassification

from IPython.display import display, HTML, clear_output
from ipywidgets import widgets, Layout

from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from seqeval.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

print()
print('numpy', np.__version__)
print('pandas', pd.__version__)
print('transformers', transformers.__version__)
print('torch', torch.__version__)
print()

# If there's a GPU available...
if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")


numpy 1.19.5
pandas 1.1.5
transformers 4.7.0
torch 1.9.0+cu102

There are 1 GPU(s) available.
We will use the GPU: Tesla K80


In [5]:
class NER:
    def __init__(self, model_name):
        self.normalizer = hazm.Normalizer()
        self.model_name = model_name
        self.config = AutoConfig.from_pretrained(self.model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(self.model_name)
        # self.labels = list(self.config.label2id.keys())
        self.id2label = self.config.id2label

    @staticmethod
    def load_ner_data(file_path, word_index, tag_index, delimiter, join=False):
        dataset, labels = [], []
        with open(file_path, encoding="utf8") as infile:
            sample_text, sample_label = [], []
            for line in infile:
                parts = line.strip().split(delimiter)
                if len(parts) > 1:
                    word, tag = parts[word_index], parts[tag_index]
                    if not word:
                        continue
                    sample_text.append(word)
                    sample_label.append(tag)
                else:
                    # end of sample
                    if sample_text and sample_label:
                        if join:
                            dataset.append(' '.join(sample_text))
                            labels.append(' '.join(sample_label))
                        else:
                            dataset.append(sample_text)
                            labels.append(sample_label)
                    sample_text, sample_label = [], []
        if sample_text and sample_label:
            if join:
                dataset.append(' '.join(sample_text))
                labels.append(' '.join(sample_label))
            else:
                dataset.append(sample_text)
                labels.append(sample_label)
        return dataset, labels

    def load_test_datasets(self, dataset_name, dataset_dir, **kwargs):
        if dataset_name.lower() == "peyma":
            ner_file_path = dataset_dir + 'test.txt'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            return self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter='|',
                                      join=kwargs.get('join', False))
        elif dataset_name.lower() == "arman":
            dataset, labels = [], []
            for i in range(1, 4):
                ner_file_path = dataset_dir + f'test_fold{i}.txt'
                if not os.path.exists(ner_file_path):
                    print(ner_file_path)
                dataset_part, labels_part = self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter=' ',
                                                               join=kwargs.get('join', False))
                dataset += dataset_part
                labels += labels_part
            return dataset, labels
        elif dataset_name.lower() == "hooshvare-peyman+arman+wikiann":
            ner_file_path = dataset_dir + 'test.csv'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            data = pd.read_csv(ner_file_path, delimiter="\t")
            sentences, sentences_tags = data['tokens'].values.tolist(), data['ner_tags'].values.tolist()
            sentences = [ast.literal_eval(ss) for ss in sentences]
            sentences_tags = [ast.literal_eval(ss) for ss in sentences_tags]
            print(f'test part:\n #sentences: {len(sentences)}, #sentences_tags: {len(sentences_tags)}')
            return sentences, sentences_tags

    def load_datasets(self, dataset_name, dataset_dir, **kwargs):
        if dataset_name.lower() == "farsiyar":
            dataset, labels = [], []
            for i in range(1, 6):
                # f-string so that {i} is replaced by the part number
                ner_file_path = dataset_dir + f'Persian-NER-part{i}.txt'
                if not os.path.exists(ner_file_path):
                    print(ner_file_path)
                dataset_part, labels_part = self.load_ner_data(ner_file_path, word_index=0, tag_index=1, delimiter='\t',
                                                               join=kwargs.get('join', False))
                dataset += dataset_part
                labels += labels_part
            return dataset, labels
        elif dataset_name.lower() == "wikiann":
            ner_file_path = dataset_dir + 'wikiann-fa.bio'
            if not os.path.exists(ner_file_path):
                print(ner_file_path)
                exit(1)
            dataset_all, labels_all = self.load_ner_data(ner_file_path, word_index=0, tag_index=-1, delimiter=' ',
                                                         join=kwargs.get('join', False))
            print(f'all data: #data: {len(dataset_all)}, #labels: {len(labels_all)}')

            try:
                _, data_test, _, label_test = train_test_split(dataset_all, labels_all, test_size=0.1, random_state=1,
                                                               stratify=labels_all)
                print("with stratify")
            except Exception:
                # Fall back to an unstratified split when stratification is not possible.
                _, data_test, _, label_test = train_test_split(dataset_all, labels_all, test_size=0.1, random_state=1)
                print("without stratify")
            print(f'test part:\n #data: {len(data_test)}, #labels: {len(label_test)}')
            return dataset_all, labels_all, data_test, label_test

    def ner_inference(self, input_text, device, max_length):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong!')
            return

        pt_batch = self.tokenizer(
            [self.normalizer.normalize(sequence) for sequence in input_text],
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt"
        )

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        pt_batch = pt_batch.to(device)
        # Inference only: skip gradient tracking to save memory.
        with torch.no_grad():
            pt_outputs = self.model(**pt_batch)
        pt_predictions = torch.argmax(pt_outputs.logits, dim=-1)
        pt_predictions = pt_predictions.cpu().detach().numpy().tolist()

        output_predictions = []
        for i, sequence in enumerate(input_text):
            tokens = self.tokenizer.tokenize(self.tokenizer.decode(self.tokenizer.encode(sequence)))
            predictions = [(token, self.id2label[prediction]) for token, prediction in
                           zip(tokens, pt_predictions[i])]
            output_predictions.append(predictions)
        return output_predictions

    def ner_evaluation(self, input_text, input_labels, device, batch_size=4):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong!')
            return

        max_len = 0
        tokenized_texts, new_labels = [], []
        for sentence, sentence_label in zip(input_text, input_labels):
            if type(sentence) == str:
                sentence = sentence.strip().split()
            if len(sentence) != len(sentence_label):
                print('Something went wrong! The sentence and its labels differ in length!')
                return
            tokenized_sentence, new_sentence_label = [], []
            for word, label in zip(sentence, sentence_label):
                # Tokenize the word and count # of subwords the word is broken into
                tokenized_word = self.tokenizer.tokenize(word)
                n_subwords = len(tokenized_word)

                # Add the tokenized word to the final tokenized word list
                tokenized_sentence.extend(tokenized_word)
                # Add the same label to the new list of labels `n_subwords` times
                new_sentence_label.extend([label] * n_subwords)

            max_len = max(max_len, len(tokenized_sentence))
            tokenized_texts.append(tokenized_sentence)
            new_labels.append(new_sentence_label)

        max_len = min(max_len, self.config.max_position_embeddings)
        print("max_len:", max_len)
        input_ids = pad_sequences([self.tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                                  maxlen=max_len, dtype="long", value=self.config.pad_token_id,
                                  truncating="post", padding="post")
        del tokenized_texts
        input_labels = pad_sequences([[self.config.label2id.get(l) for l in lab] for lab in new_labels],
                                     maxlen=max_len, value=self.config.label2id.get('O'), padding="post",
                                     dtype="long", truncating="post")
        del new_labels

        train_data = TensorDataset(torch.tensor(input_ids), torch.tensor(input_labels))
        data_loader = DataLoader(train_data, batch_size=batch_size)
        # data_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)
        print("#samples:", len(input_ids))
        print("#batch:", len(data_loader))

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        total_loss, total_time = 0, 0
        output_predictions = []
        print("Start to evaluate test data ...")
        for step, batch in enumerate(data_loader):
            b_input_ids, b_labels = batch

            # move tensors to GPU if CUDA is available
            b_input_ids = b_input_ids.to(device)
            b_labels = b_labels.to(device)

            # Passing `labels` makes the model return the loss together with the logits.
            with torch.no_grad():
                start = time.monotonic()
                outputs = self.model(b_input_ids, labels=b_labels)
                end = time.monotonic()
                total_time += end - start
                print(f'inference time for step {step}: {end - start}')
            # get the loss
            total_loss += outputs.loss.item()

            b_predictions = torch.argmax(outputs.logits, dim=2)
            b_predictions = b_predictions.cpu().detach().numpy().tolist()
            b_labels = b_labels.cpu().detach().numpy().tolist()

            for i, sample in enumerate(b_input_ids):
                sample_input = list(sample)
                # remove pad tokens
                while sample_input[-1] == self.config.pad_token_id:
                    sample_input.pop()
                # tokens = self.tokenizer.tokenize(self.tokenizer.decode(sample_input))
                tokens = [self.tokenizer.decode([t]) for t in sample_input]
                sample_true_labels = [self.id2label[e] for e in b_labels[i][:len(sample_input)]]
                sample_predictions = [self.id2label[e] for e in b_predictions[i][:len(sample_input)]]
                output_predictions.append(
                    [(t, sample_true_labels[j], sample_predictions[j]) for j, t in enumerate(tokens)])

        # Calculate the average loss over the evaluation batches.
        avg_loss = total_loss / len(data_loader)
        print("average loss:", avg_loss)
        print("total inference time:", total_time)
        print("total inference time / #samples:", total_time / len(input_ids))

        return output_predictions

    def ner_evaluation_2(self, input_text, input_labels, device, batch_size=4):
        if not self.model or not self.tokenizer or not self.id2label:
            print('Something went wrong!')
            return

        print("len(input_text):", len(input_text))
        print("len(input_labels):", len(input_labels))
        c = 0
        max_len = 0
        tokenized_texts, new_labels = [], []
        for sentence, sentence_label in zip(input_text, input_labels):
            if type(sentence) == str:
                sentence = sentence.strip().split()
            if len(sentence) != len(sentence_label):
                print('Something went wrong! The sentence and its labels differ in length!')
                return
            tokenized_words = self.tokenizer(sentence, padding=False, add_special_tokens=False).input_ids
            tokenized_sentence_ids, new_sentence_label = [], []
            for i, tokenized_word in enumerate(tokenized_words):
                # Add the tokenized word to the final tokenized word list
                tokenized_sentence_ids += tokenized_word
                # Add the same label to the new list of labels `number of subwords` times
                new_sentence_label.extend([self.config.label2id.get(sentence_label[i])] * len(tokenized_word))

            max_len = max(max_len, len(tokenized_sentence_ids))
            tokenized_texts.append(tokenized_sentence_ids)
            new_labels.append(new_sentence_label)
            c += 1
            if c % 10000 == 0:
                print("c:", c)
        max_len = min(max_len, self.config.max_position_embeddings)
        print("max_len:", max_len)
        input_ids = pad_sequences(tokenized_texts, maxlen=max_len, dtype="long", value=self.config.pad_token_id,
                                  truncating="post", padding="post")
        del tokenized_texts
        input_labels = pad_sequences(new_labels, maxlen=max_len, value=self.config.label2id.get('O'), padding="post",
                                     dtype="long", truncating="post")
        del new_labels

        train_data = TensorDataset(torch.tensor(input_ids), torch.tensor(input_labels))
        data_loader = DataLoader(train_data, batch_size=batch_size)
        # data_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=batch_size)
        print("#samples:", len(input_ids))
        print("#batch:", len(data_loader))

        gc.collect()
        torch.cuda.empty_cache()
        # Tell pytorch to run this model on the GPU.
        if device.type != 'cpu':
            self.model.cuda()

        total_time = 0
        output_predictions = []
        print("Start to evaluate test data ...")
        for step, batch in enumerate(data_loader):
            b_input_ids, b_labels = batch

            # move tensors to GPU if CUDA is available
            b_input_ids = b_input_ids.to(device)
            b_labels = b_labels.to(device)

            # Passing `labels` makes the model return the loss together with the logits.
            with torch.no_grad():
                start = time.monotonic()
                outputs = self.model(b_input_ids, labels=b_labels)
                end = time.monotonic()
                total_time += end - start
                print(f'inference time for step {step}: {end - start}')

            b_predictions = torch.argmax(outputs.logits, dim=2)
            b_predictions = b_predictions.cpu().detach().numpy().tolist()
            b_labels = b_labels.cpu().detach().numpy().tolist()

            for i, sample in enumerate(b_input_ids):
                sample_input = list(sample)
                # remove pad tokens
                while sample_input[-1] == self.config.pad_token_id:
                    sample_input.pop()
                # tokens = self.tokenizer.tokenize(self.tokenizer.decode(sample_input))
                tokens = [self.tokenizer.decode([t]) for t in sample_input]
                sample_true_labels = [self.id2label[e] for e in b_labels[i][:len(sample_input)]]
                sample_predictions = [self.id2label[e] for e in b_predictions[i][:len(sample_input)]]
                output_predictions.append(
                    [(t, sample_true_labels[j], sample_predictions[j]) for j, t in enumerate(tokens)])

        print("total inference time:", total_time)
        print("total inference time / #samples:", total_time / len(input_ids))

        return output_predictions

    def check_input_label_consistency(self, labels):
        model_labels = self.config.label2id.keys()
        dataset_labels = set()
        for l in labels:
            dataset_labels.update(set(l))
        print("model labels:", model_labels)
        print("dataset labels:", dataset_labels)
        print("intersection:", set(model_labels).intersection(dataset_labels))
        print("model_labels-dataset_labels:", list(set(model_labels) - set(dataset_labels)))
        print("dataset_labels-model_labels:", list(set(dataset_labels) - set(model_labels)))
        if list(set(dataset_labels) - set(model_labels)):
            return False
        return True

    @staticmethod
    def resolve_input_label_consistency(labels, label_translation_map):
        for i, sentence_labels in enumerate(labels):
            for j, label in enumerate(sentence_labels):
                labels[i][j] = label_translation_map.get(label)
        return labels

    @staticmethod
    def evaluate_prediction_results(labels, output_predictions):
        dataset_labels = set()
        for label in labels:
            dataset_labels.update(set(label))

        true_labels, predictions = [], []
        for sample_output in output_predictions:
            sample_true_labels = []
            sample_predicted_labels = []
            for token, true_label, predicted_label in sample_output:
                sample_true_labels.append(true_label)
                if predicted_label in dataset_labels:
                    sample_predicted_labels.append(predicted_label)
                else:
                    sample_predicted_labels.append('O')
            true_labels.append(sample_true_labels)
            predictions.append(sample_predicted_labels)

        print("Test Accuracy: {}".format(accuracy_score(true_labels, predictions)))
        print("Test Precision: {}".format(precision_score(true_labels, predictions)))
        print("Test Recall: {}".format(recall_score(true_labels, predictions)))
        print("Test F1-Score: {}".format(f1_score(true_labels, predictions)))
        print("Test classification Report:\n{}".format(classification_report(true_labels, predictions, digits=10)))
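
The evaluation methods above expand word-level labels to subword level by repeating each word's label once per subword. A minimal sketch of that step follows, assuming a hypothetical `toy_tokenize` function that stands in for the real tokenizer (it simply cuts words into fixed-size character chunks and is not how the ALBERT tokenizer works).

```python
def align_labels(words, labels, tokenize):
    """Repeat each word-level label once per subword produced by `tokenize`."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    subwords, subword_labels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        subwords.extend(pieces)
        # Same label for every piece of the word, including a repeated "B" tag.
        subword_labels.extend([label] * len(pieces))
    return subwords, subword_labels


def toy_tokenize(word, size=4):
    # Hypothetical stand-in for a subword tokenizer: fixed-size character chunks.
    return [word[i:i + size] for i in range(0, len(word), size)]


words = ["Tehran", "is", "large"]
labels = ["B_LOC", "O", "O"]
print(align_labels(words, labels, toy_tokenize))
# → (['Tehr', 'an', 'is', 'larg', 'e'], ['B_LOC', 'B_LOC', 'O', 'O', 'O'])
```

Note that a word tagged "B" yields several "B" subword labels under this scheme, so entity-level metrics computed on subwords may count one word as several entities.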


In [6]:
model_name='m3hrdadfi/albert-fa-base-v2-ner-peyma'
ner_model = NER(model_name)

In [7]:
print(ner_model.config)

AlbertConfig {
  "architectures": [
    "AlbertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu_new",
  "hidden_dropout_prob": 0,
  "hidden_size": 768,
  "id2label": {
    "0": "B_DAT",
    "1": "B_LOC",
    "2": "B_MON",
    "3": "B_ORG",
    "4": "B_PCT",
    "5": "B_PER",
    "6": "B_TIM",
    "7": "I_DAT",
    "8": "I_LOC",
    "9": "I_MON",
    "10": "I_ORG",
    "11": "I_PCT",
    "12": "I_PER",
    "13": "I_TIM",
    "14": "O"
  },
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 3072,
  "label2id": {
    "B_DAT": 0,
    "B_LOC": 1,
    "B_MON": 2,
    "B_ORG": 3,
    "B_PCT": 4,
    "B_PER": 5,
    "B_TIM": 6,
    "I_DAT": 7,
    "I_LOC": 8,
    "I_MON": 9,
    "I_ORG": 10,
    "I_PCT": 11,
    "I_PER": 12,
    "I_TIM": 13,
    "O": 14
  },
  "layer_norm_eps": 1e-12,
  "m

#### Sample Inference:

In [8]:
texts = [
    "مدیرکل محیط زیست استان البرز با بیان اینکه با بیان اینکه موضوع شیرابه‌های زباله‌های انتقال یافته در منطقه حلقه دره خطری برای این استان است، گفت: در این مورد گزارشاتی در ۲۵ مرداد ۱۳۹۷ تقدیم مدیران استان شده است.",
    "به گزارش خبرگزاری تسنیم از کرج، حسین محمدی در نشست خبری مشترک با معاون خدمات شهری شهرداری کرج که با حضور مدیرعامل سازمان‌های پسماند، پارک‌ها و فضای سبز و نماینده منابع طبیعی در سالن کنفرانس شهرداری کرج برگزار شد، اظهار داشت: ۸۰٪  جمعیت استان البرز در کلانشهر کرج زندگی می‌کنند.",
    "وی افزود: با همکاری‌های مشترک بین اداره کل محیط زیست و شهرداری کرج برنامه‌های مشترکی برای حفاظت از محیط زیست در شهر کرج در دستور کار قرار گرفته که این اقدامات آثار مثبتی داشته و تاکنون نزدیک به ۱۰۰ میلیارد هزینه جهت خریداری اکس-ریس صورت گرفته است.",
]

In [9]:
inference_output = ner_model.ner_inference(texts, device, ner_model.config.max_position_embeddings)

In [10]:
print(inference_output)

[[('[CLS]', 'O'), ('▁مدیرکل', 'O'), ('▁محیط', 'B_ORG'), ('▁زیست', 'I_ORG'), ('▁استان', 'I_ORG'), ('▁البرز', 'I_ORG'), ('▁با', 'O'), ('▁بیان', 'O'), ('▁اینکه', 'O'), ('▁با', 'O'), ('▁بیان', 'O'), ('▁اینکه', 'O'), ('▁موضوع', 'O'), ('▁شیر', 'O'), ('ابه', 'O'), ('▁های', 'O'), ('▁زباله', 'O'), ('▁های', 'O'), ('▁انتقال', 'O'), ('▁یافته', 'O'), ('▁در', 'O'), ('▁منطقه', 'B_LOC'), ('▁حلقه', 'I_LOC'), ('▁در', 'I_LOC'), ('ه', 'I_LOC'), ('▁خطری', 'O'), ('▁برای', 'O'), ('▁این', 'O'), ('▁استان', 'O'), ('▁است', 'O'), ('،', 'O'), ('▁گفت', 'O'), (':', 'O'), ('▁در', 'O'), ('▁این', 'O'), ('▁مورد', 'O'), ('▁گزارشات', 'O'), ('ی', 'O'), ('▁در', 'O'), ('▁۲۵', 'B_DAT'), ('▁مرداد', 'I_DAT'), ('▁۱۳۹۷', 'I_DAT'), ('▁تقدیم', 'O'), ('▁مدیران', 'O'), ('▁استان', 'O'), ('▁شده', 'O'), ('▁است', 'O'), ('.', 'I_ORG'), ('[SEP]', 'O')], [('[CLS]', 'O'), ('▁به', 'O'), ('▁گزارش', 'O'), ('▁خبرگزاری', 'B_ORG'), ('▁تسنیم', 'I_ORG'), ('▁از', 'O'), ('▁کرج', 'B_LOC'), ('،', 'O'), ('▁حسین', 'B_PER'), ('▁محمدی', 'I_PER'), ('▁در', 'O

In [11]:
#@title Live Playground { display-mode: "form" }

css_is_load = False
css = """<style>
.ner-box {
    direction: rtl;
    font-size: 18px !important;
    line-height: 20px !important;
    margin: 0 0 15px;
    padding: 10px;
    text-align: justify;
    color: #343434 !important;
}
.token, .token span {
    display: inline-block !important;
    padding: 2px;
    margin: 2px 0;
}
.token.token-ner {
    background-color: #f6cd61;
    font-weight: bold;
    color: #000;
}
.token.token-ner .ner-label {
    color: #9a1f40;
    margin: 0px 2px;
}
</style>"""

if not css_is_load:
    display(HTML(css))
    css_is_load = True

submit_wd = widgets.Button(description='Send', disabled=False, button_style='success', tooltip='Submit')
text_wd = widgets.Textarea(placeholder='Please enter your text ...', rows=5, layout=Layout(width='90%'))
output_wd = widgets.Output()

display(HTML("""
<h2>Test NER model</h2>
<p style="padding: 2px 20px; margin: 0 0 20px;">
</p>
<br /><br />
"""))

display(text_wd)
display(submit_wd)
display(output_wd)

def submit_text(sender):
    with output_wd:
        clear_output(wait=True)
        text = text_wd.value
        _output = ner_model.ner_inference([text], device, ner_model.config.max_position_embeddings)
        # print(_output)
        pred_sequence = []
        for token, label in _output[0]:
            if token not in ['[CLS]', '[SEP]']:
                if label != 'O':
                    pred_sequence.append(
                        '<span class="token token-ner">%s<span class="ner-label">%s</span></span>' 
                        % (token, label))
                else:
                    pred_sequence.append(
                        '<span class="token">%s</span>' 
                        % token)
            
        html = '<p class="ner-box">%s</p>' % ' '.join(pred_sequence) 
        display(HTML(html))

submit_wd.on_click(submit_text)

#### PEYMA dataset:
The PEYMA dataset includes 7,145 sentences with a total of 302,530 tokens, of which 41,148 tokens are tagged with seven different classes:

- Organization
- Money
- Location
- Date
- Time
- Person
- Percent

|     Label    |   #   |
|:------------:|:-----:|
| Organization | 16964 |
|     Money    |  2037 |
|   Location   |  8782 |
|     Date     |  4259 |
|     Time     |  732  |
|    Person    |  7675 |
|    Percent   |  699  |
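
As a quick sanity check on these figures, the sketch below sums the per-label counts from the table and compares them with the token totals stated above. The variable names are illustrative only.

```python
# Per-label token counts copied from the table above.
label_counts = {
    "Organization": 16964,
    "Money": 2037,
    "Location": 8782,
    "Date": 4259,
    "Time": 732,
    "Person": 7675,
    "Percent": 699,
}
total_tokens = 302_530  # total tokens in PEYMA, as stated above

tagged_tokens = sum(label_counts.values())
print("tagged tokens:", tagged_tokens)  # → tagged tokens: 41148
print(f"share of all tokens: {tagged_tokens / total_tokens:.1%}")

# Per-label share of the tagged tokens, largest class first.
for label, count in sorted(label_counts.items(), key=lambda kv: -kv[1]):
    print(f"{label:<12} {count:>6} {count / tagged_tokens:6.1%}")
```

The counts add up to the 41,148 tagged tokens quoted above, so most tokens in the dataset carry the "O" tag.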

Download
You can download the dataset from [here](https://hooshvare.github.io/docs/datasets/ner), which leads to the following Google Drive file from HooshvareLab:

In [12]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1WZxpFRtEs5HZWyWQ2Pyg9CCuIBs1Kmvx'})
download.GetContentFile('peyma.zip')
!ls

adc.json  peyma.zip  sample_data


In [13]:
!unzip peyma.zip
!ls
!ls peyma

Archive:  peyma.zip
   creating: peyma/
  inflating: peyma/dev.txt           
  inflating: peyma/test.txt          
  inflating: peyma/train.txt         
adc.json  peyma  peyma.zip  sample_data
dev.txt  test.txt  train.txt


In [14]:
sentences, labels = ner_model.load_test_datasets(dataset_name="peyma", dataset_dir="./peyma/")
print(len(sentences), len(labels))
print(sentences[0])
print(labels[0])

1026 1026
['کنایه', 'سرلشگر', 'فیروزآبادی', 'به', 'پادشاه', 'عربستان', 'و', 'پسرش']
['O', 'O', 'B_ORG', 'O', 'O', 'B_LOC', 'O', 'O']
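
For reference, `load_test_datasets` reads PEYMA with `delimiter='|'`, one `word|tag` pair per line and a blank line between sentences. The sketch below mirrors that parsing on an in-memory string. The function name `parse_delimited_ner` and the sample string are hypothetical; the sample is made up from a few tokens of the sentence printed above and is not an excerpt of the real file.

```python
def parse_delimited_ner(text, delimiter="|"):
    """Split 'word<delimiter>tag' lines into sentences; a blank line ends a sentence."""
    sentences, labels = [], []
    words, tags = [], []
    for line in text.splitlines():
        parts = line.strip().split(delimiter)
        if len(parts) > 1:
            # Skip rows whose word field is empty, as the loader above does.
            if parts[0]:
                words.append(parts[0])
                tags.append(parts[1])
        elif words:
            # Blank (or delimiter-free) line: close the current sentence.
            sentences.append(words)
            labels.append(tags)
            words, tags = [], []
    if words:
        # Flush the last sentence when the text does not end with a blank line.
        sentences.append(words)
        labels.append(tags)
    return sentences, labels


sample = "کنایه|O\nسرلشگر|O\n\nپادشاه|O\nعربستان|B_LOC\n"
sample_sentences, sample_labels = parse_delimited_ner(sample)
print(sample_sentences)
print(sample_labels)
```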


In [15]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'B_ORG', 'I_ORG', 'I_DAT', 'B_LOC', 'B_MON', 'B_TIM', 'B_DAT', 'I_PCT', 'B_PER', 'B_PCT', 'I_MON', 'O', 'I_LOC', 'I_PER', 'I_TIM'}
intersection: {'B_DAT', 'B_PER', 'B_ORG', 'I_DAT', 'B_LOC', 'B_PCT', 'I_MON', 'B_MON', 'O', 'I_PER', 'B_TIM', 'I_LOC', 'I_ORG', 'I_TIM', 'I_PCT'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True


In [16]:
!nvidia-smi
!lscpu

Mon Aug 16 13:57:11 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P0    70W / 149W |    609MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [17]:
inference_output_peyma = ner_model.ner_evaluation(sentences, labels, device, batch_size=512)

max_len: 157
#samples: 1026
#batch: 3
Start to evaluate test data ...
inference time for step 0: 0.0404351499999791
inference time for step 1: 0.014057037999975819
inference time for step 2: 0.014984594999987166
average loss: 0.0954966905216376
total inference time: 0.06947678299994209
total inference time / #samples: 6.771616276797474e-05
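
The batch count and the per-sample time in this log follow from simple arithmetic. A small sketch, using only numbers printed in this notebook (the helper name `batch_count` is hypothetical):

```python
import math


def batch_count(num_samples, batch_size):
    """Number of batches needed when the last batch may be partial."""
    return math.ceil(num_samples / batch_size)


print(batch_count(1026, 512))  # 3, as in "#batch: 3"

step_times = [0.0404351499999791, 0.014057037999975819, 0.014984594999987166]
total_time = sum(step_times)
print(total_time)         # close to the logged total inference time
print(total_time / 1026)  # close to the logged time per sample
```

The per-step times logged here are far smaller than the per-batch figures noted further down in this notebook (about 205 s at batch size 256). The timing code is not shown, so it is unclear which part of the forward pass each logged value covers.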


In [18]:
for sample_output in inference_output_peyma[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

کنایه	O	O
سرلشگر	O	O
فیروز	B_ORG	O
ابادی	B_ORG	O
به	O	O
پادشاه	O	O
عربستان	B_LOC	O
و	O	O
پسرش	O	O

ر	O	O
<unk>	O	O
یس	O	O
سابق	O	O
ستاد	B_ORG	O
کل	I_ORG	O
نیروهای	I_ORG	O
مسلح	I_ORG	O
با	O	O
بیان	O	O
اینکه	O	O
ال	O	O
سعود	O	O
با	O	O
حمایت	O	O
همه	O	O
جانبه	O	O
غرب	O	O
بر	O	O
سرزمین	B_LOC	O
حجاز	I_LOC	O
حاکم	O	O
شد	O	O
گفت	O	O
:	O	O
غرب	O	O
با	O	O
حاکم	O	O
کرد	O	O
د	O	O
ال	O	O
سعود	O	O
بر	O	O
حجاز	B_LOC	O
هدفی	O	O
جز	O	O
نابود	O	O
ی	O	O
اسلام	O	O
نداشته	O	O
و	O	O
این	O	O
نقشه	O	O
انگلیس	B_LOC	O
بود	O	O
	O	O
.	O	O

سرلشگر	O	O
حسن	B_PER	O
فیروز	I_PER	O
ابادی	I_PER	O
روز	O	O
دوشنبه	O	O
در	O	O
حاشیه	O	O
ا	O	O
<unk>	O	O
ین	O	O
ختم	O	O
مادر	O	O
حیدر	B_PER	B_PER
مصلح	I_PER	O
ی	I_PER	O
در	O	O
جمع	O	O
خبرنگاران	O	O
درباره	O	O
موضوع	O	O
یمن	B_LOC	B_LOC
افزود	O	O
:	O	O
ماهیت	O	O
	O	O
انچه	O	O
در	O	O
یمن	B_LOC	B_LOC
اتفاق	O	O
می	O	O
افتد	O	O
وهابیت	O	O
است	O	O
وهابیت	O	O
یک	O	O
مذهب	O	O
انگلیسی	O	O
است	O	O
	O	O
.	O	O

وی	O	O
ادامه	O	O
داد	O	O
:	O	O
وقتی	O	O
که	O	O
انقلاب	O	O
اسلامی	O	O
به	O	O
پیروز
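
In the output above, a word that the tokenizer splits into several pieces (for example `فیروزآبادی`) carries its label on every piece. A minimal sketch of that propagation follows. The function `propagate_labels` and the toy tokenizer are hypothetical stand-ins and not the notebook's actual code.

```python
def propagate_labels(words, labels, tokenize):
    """Copy each word-level label onto every subword piece of that word."""
    pieces, piece_labels = [], []
    for word, label in zip(words, labels):
        for piece in tokenize(word):
            pieces.append(piece)
            piece_labels.append(label)
    return pieces, piece_labels


# Toy tokenizer: a fixed lookup table that mimics one split seen in the output.
SPLITS = {"فیروزآبادی": ["فیروز", "ابادی"]}


def toy_tokenize(word):
    return SPLITS.get(word, [word])


pieces, piece_labels = propagate_labels(
    ["سرلشگر", "فیروزآبادی"], ["O", "B_ORG"], toy_tokenize
)
print(list(zip(pieces, piece_labels)))
```

With this scheme a split first word yields two consecutive `B_` tags, which an entity-level scorer may count as two entities. Whether that affects the scores reported here depends on how the evaluation code handles it.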

In [19]:
ner_model.evaluate_prediction_results(labels, inference_output_peyma)

Test Accuracy: 0.9000773560244338
Test Precision: 0.5889763779527559
Test Recall: 0.15714285714285714
Test F1-Score: 0.2480928689883914
Test classification Report:
              precision    recall  f1-score   support

        _DAT  0.3148148148 0.0682730924 0.1122112211       249
        _LOC  0.7663043478 0.2139605463 0.3345195730       659
        _MON  0.2500000000 0.0625000000 0.1000000000        48
        _ORG  0.5620689655 0.2164674635 0.3125599233       753
        _PCT  0.5000000000 0.0459770115 0.0842105263        87
        _PER  0.5301204819 0.0784313725 0.1366459627       561
        _TIM  0.5000000000 0.0869565217 0.1481481481        23

   micro avg  0.5889763780 0.1571428571 0.2480928690      2380
   macro avg  0.4890440872 0.1103665726 0.1754707649      2380
weighted avg  0.5760583931 0.1571428571 0.2419910602      2380
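
The reported F1-score is the harmonic mean of the reported precision and recall. A short sketch that reproduces it from the two printed values:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision = 0.5889763779527559
recall = 0.15714285714285714
print(f1_score(precision, recall))  # approximately 0.2481, as in the report
```

Accuracy is much higher than F1 here. A likely reason is that accuracy is computed per token, where the `O` tag dominates, while the report appears to be scored per entity. The evaluation code is not shown, so this is an interpretation.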



In [20]:
output_file_name = "ner_peyma_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_peyma:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()
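
The file written above has one `token<TAB>true_label<TAB>predicted_label` row per token and a blank line between samples. A sketch of reading it back follows; the helper `read_outputs` is hypothetical, and the demo uses a temporary file rather than the real output file.

```python
import os
import tempfile


def read_outputs(path):
    """Parse the tab-separated output file back into a list of samples."""
    samples, current = [], []
    with open(path, encoding="utf8") as f:
        for line in f:
            # Strip only the newline: a token may be an empty string.
            line = line.rstrip("\n")
            if not line:
                if current:
                    samples.append(current)
                    current = []
                continue
            token, true_label, predicted_label = line.split("\t")
            current.append((token, true_label, predicted_label))
    if current:
        samples.append(current)
    return samples


# Demo on a small temporary file in the same layout.
with tempfile.NamedTemporaryFile("w", encoding="utf8", suffix=".txt", delete=False) as tmp:
    tmp.write("a\tO\tO\n\tO\tO\n\nb\tB_LOC\tO\n\n")
    demo_path = tmp.name

samples = read_outputs(demo_path)
os.remove(demo_path)
print(samples)
```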

#### Arman dataset:
The ARMAN dataset holds 7,682 sentences with 250,015 tokens, tagged over six different classes.

1. Organization
2. Location
3. Facility
4. Event
5. Product
6. Person


|     Label    |   #   |
|:------------:|:-----:|
| Organization | 30108 |
|   Location   | 12924 |
|   Facility   |  4458 |
|     Event    |  7557 |
|    Product   |  4389 |
|    Person    | 15645 |

**Download**

You can download the dataset from the [PersianNER repository](https://github.com/HaniehP/PersianNER).


In [21]:
!wget https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
!ls

--2021-08-16 13:57:55--  https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
Resolving github.com (github.com)... 52.69.186.44
Connecting to github.com (github.com)|52.69.186.44|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/HaniehP/PersianNER/master/ArmanPersoNERCorpus.zip [following]
--2021-08-16 13:57:55--  https://raw.githubusercontent.com/HaniehP/PersianNER/master/ArmanPersoNERCorpus.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1931170 (1.8M) [application/zip]
Saving to: ‘ArmanPersoNERCorpus.zip’


2021-08-16 13:57:56 (26.7 MB/s) - ‘ArmanPersoNERCorpus.zip’ saved [1931170/1931170]

adc.json						     peyma
ArmanPersoNERCorpus.zip					     peyma.zip
ner_peyma_m3

In [22]:
!unzip ArmanPersoNERCorpus.zip -d arman
!ls

Archive:  ArmanPersoNERCorpus.zip
  inflating: arman/test_fold1.txt    
  inflating: arman/ReadMe.txt        
  inflating: arman/train_fold3.txt   
  inflating: arman/train_fold2.txt   
  inflating: arman/train_fold1.txt   
  inflating: arman/test_fold3.txt    
  inflating: arman/test_fold2.txt    
adc.json						     peyma
arman							     peyma.zip
ArmanPersoNERCorpus.zip					     sample_data
ner_peyma_m3hrdadfi-albert-fa-base-v2-ner-peyma_outputs.txt


In [23]:
sentences, labels = ner_model.load_test_datasets(dataset_name="arman", dataset_dir="./arman/")
print(len(sentences), len(labels))
print(sentences[0])
print(labels[0])

7681 7681
['افقی', ':', '0', 'ـ', 'از', 'عوامل', 'دوران', 'پهلوی', 'و', 'نخست\u200cوزیر', 'ایران', 'در', 'سالهای', 'ابتدائی', 'دهه', 'چهل', 'خورشیدی', 'كه', 'جلد', 'سوم', 'یادداشتهایش', 'هم', 'چندی', 'پیش', 'در', 'تهران', 'منتشر', 'شد', '0', 'ـ', 'پرستاری', 'از', 'ناخوش\u200cاحوال', 'ـ', 'پوشاک', 'و', 'جامه', 'ـ', 'فانتزی', 'و', 'شیک', '0', 'ـ', 'در', 'حال', 'وزیدن', 'ـ', 'اطلاعیه', 'ـ', 'پایتخت', 'جمهوری', 'استونی', 'در', 'حوضه', 'بالتیک', '0', 'ـ', 'علم', 'راهبرد', 'مؤسسه', 'و', 'سازمان', 'ـ', 'نوعی', 'شمع', '0', 'ـ', 'حرف', 'جمع', 'مؤنث', 'ـ', 'در', 'ایران', 'به', 'تولیدکننده', 'کتاب', 'اطلاق', 'می\u200cشود', 'ـ', 'از', 'شهرهای', 'باختری', 'افغانستان', 'كه', 'تا', 'عصر', 'ناصرالدین\u200cشاه', 'جزئی', 'از', 'خراسان', 'بود', 'ـ', 'ویتامین', 'انعقاد', '0', 'ـ', 'سبزی', 'غده\u200cای', 'ـ', 'دوستی', 'و', 'محبت', 'ـ', 'داستان', 'بلند', 'ـ', 'شهری', 'در', 'آلمان', '0', 'ـ', 'سلول', 'بدن', 'موجودات', 'ـ', 'از', 'انواع', 'کالباس', '0', 'ـ', 'حاشیه', 'و', 'هامش', 'ـ', 'پیدا', 'نشدنی', 'ـ', 'خ

In [24]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'B-pro', 'I-org', 'B-fac', 'I-pro', 'I-fac', 'B-pers', 'B-event', 'I-loc', 'I-pers', 'O', 'B-org', 'I-event', 'B-loc'}
intersection: {'O'}
model_labels-dataset_labels: ['B_DAT', 'B_PER', 'B_ORG', 'B_PCT', 'B_LOC', 'I_DAT', 'I_MON', 'B_MON', 'I_PER', 'B_TIM', 'I_LOC', 'I_ORG', 'I_TIM', 'I_PCT']
dataset_labels-model_labels: ['B-event', 'B-pro', 'I-org', 'B-fac', 'I-pro', 'I-loc', 'I-fac', 'I-pers', 'B-pers', 'B-org', 'I-event', 'B-loc']
False


In [25]:
label_translate = {
    'B-pers': 'B_PER', 
    'I-pers': 'I_PER', 
    'B-event': 'O',
    'I-event': 'O', 
    'B-loc': 'B_LOC', 
    'I-loc': 'I_LOC', 
    'B-org': 'B_ORG', 
    'I-org': 'I_ORG', 
    'B-pro': 'O', 
    'I-pro': 'O', 
    'B-fac': 'O', 
    'I-fac': 'O', 
    'O': 'O'
}
labels = ner_model.resolve_input_label_consistency(labels, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'B_PER', 'B_ORG', 'I_ORG', 'B_LOC', 'O', 'I_LOC', 'I_PER'}
intersection: {'B_PER', 'B_ORG', 'B_LOC', 'O', 'I_PER', 'I_LOC', 'I_ORG'}
model_labels-dataset_labels: ['I_DAT', 'B_PCT', 'I_MON', 'B_MON', 'B_TIM', 'B_DAT', 'I_TIM', 'I_PCT']
dataset_labels-model_labels: []
True
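
The translation step above maps each ARMAN tag to a tag the model knows, and collapses the unsupported categories (event, product, facility) to `O`. The body of `resolve_input_label_consistency` is not shown, so this is only a sketch of the same idea with a hypothetical function name:

```python
def translate_labels(label_sequences, mapping):
    """Rewrite every tag through `mapping`; an unmapped tag raises KeyError."""
    return [[mapping[tag] for tag in seq] for seq in label_sequences]


# A subset of the mapping used above, for illustration.
arman_to_model = {
    "B-pers": "B_PER",
    "I-pers": "I_PER",
    "B-loc": "B_LOC",
    "I-loc": "I_LOC",
    "B-fac": "O",
    "I-fac": "O",
    "O": "O",
}

translated = translate_labels(
    [["B-pers", "I-pers", "O", "B-fac", "I-fac", "B-loc"]],
    arman_to_model,
)
print(translated)  # [['B_PER', 'I_PER', 'O', 'O', 'O', 'B_LOC']]
```

Raising on an unmapped tag is a deliberate choice in this sketch, so that a missing entry is noticed instead of being silently scored as a mismatch.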


batch size=256 -> inference time for one batch is about 205 s

batch size=512 -> inference time for one batch is about 410 s

batch size=1024 -> crash

In [26]:
!nvidia-smi
!lscpu

Mon Aug 16 13:57:57 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P0    73W / 149W |   7517MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [28]:
inference_output_arman = ner_model.ner_evaluation(sentences, labels, device, batch_size=256)

max_len: 336
#samples: 7681
#batch: 31
Start to evaluate test data ...
inference time for step 0: 0.032970910000017284
inference time for step 1: 0.015531383999984882
inference time for step 2: 0.014728606999995009
inference time for step 3: 0.014132066000001942
inference time for step 4: 0.014132545999984814
inference time for step 5: 0.019810608999989654
inference time for step 6: 0.014099981000015305
inference time for step 7: 0.014522559999988971
inference time for step 8: 0.014899057999969045
inference time for step 9: 0.0136983069999701
inference time for step 10: 0.014005396000015935
inference time for step 11: 0.014272190999918166
inference time for step 12: 0.01382738999996036
inference time for step 13: 0.0139554219999809
inference time for step 14: 0.015440815999909319
inference time for step 15: 0.015264982000076088
inference time for step 16: 0.01459742700001243
inference time for step 17: 0.014215785000033065
inference time for step 18: 0.01502711000000545
inference time 

In [29]:
for sample_output in inference_output_arman[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

افق	O	O
ی	O	O
:	O	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
از	O	O
عوامل	O	O
دوران	O	O
پهلوی	O	O
و	O	O
نخست	O	O
وزیر	O	O
ایران	B_LOC	O
در	O	O
سالهای	O	O
ابتدا	O	O
<unk>	O	O
ی	O	O
دهه	O	O
چهل	O	O
خورشیدی	O	O
	O	O
<unk>	O	O
ه	O	O
جلد	O	O
سوم	O	O
یادداشت	O	O
هایش	O	O
هم	O	O
چندی	O	O
پیش	O	O
در	O	O
تهران	B_LOC	O
منتشر	O	O
شد	O	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
پرستاری	O	O
از	O	O
ناخوش	O	O
احوال	O	O
	O	O
<unk>	O	O
پوشاک	O	O
و	O	O
جامه	O	O
	O	O
<unk>	O	O
فانتزی	O	O
و	O	O
شیک	O	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
در	O	O
حال	O	O
و	O	O
زیدن	O	O
	O	O
<unk>	O	O
اطلاعی	O	O
ه	O	O
	O	O
<unk>	O	O
پایتخت	O	O
جمهوری	O	O
استونی	B_LOC	O
در	I_LOC	O
حوضه	I_LOC	O
بالتیک	I_LOC	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
علم	O	O
راهبرد	O	O
موسسه	O	O
و	O	O
سازمان	O	O
	O	O
<unk>	O	O
نوعی	O	O
شمع	O	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
حرف	O	O
جمع	O	O
مونث	O	O
	O	O
<unk>	O	O
در	O	O
ایران	B_LOC	O
به	O	O
تولیدکننده	O	O
کتاب	O	O
اطلاق	O	O
می	O	O
شود	O	O
	O	O
<unk>	O	O
از	O	O
شهرهای	O	O
باختری	O	O
افغانستان	B_LOC	O
	O	O
<unk>	O	O
ه	O	O
تا	O	O
عصر	O	

In [30]:
ner_model.evaluate_prediction_results(labels, inference_output_arman)

Test Accuracy: 0.9251874473874826
Test Precision: 0.33372827804107424
Test Recall: 0.06254163274369033
Test F1-Score: 0.10534189366078664
Test classification Report:
              precision    recall  f1-score   support

        _LOC  0.4598825832 0.1180015064 0.1878121878      3983
        _ORG  0.1796554553 0.0414851298 0.0674053555      5279
        _PER  0.5360824742 0.0367145211 0.0687224670      4249

   micro avg  0.3337282780 0.0625416327 0.1053418937     13511
   macro avg  0.3918735042 0.0654003857 0.1079800034     13511
weighted avg  0.3743562956 0.0625416327 0.1033151194     13511
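
The `macro avg` and `weighted avg` rows can be recomputed from the per-class rows of the report above. A short sketch using the printed F1 scores and supports:

```python
# (f1-score, support) per class, copied from the report above.
rows = {
    "_LOC": (0.1878121878, 3983),
    "_ORG": (0.0674053555, 5279),
    "_PER": (0.0687224670, 4249),
}

# Macro average: every class counts equally.
macro_f1 = sum(f1 for f1, _ in rows.values()) / len(rows)

# Weighted average: every class counts in proportion to its support.
total_support = sum(support for _, support in rows.values())
weighted_f1 = sum(f1 * support for f1, support in rows.values()) / total_support

print(total_support)  # 13511
print(macro_f1)       # approximately 0.10798
print(weighted_f1)    # approximately 0.10332
```

The two averages are close here because the three classes have similar support. They diverge when small classes score very differently from large ones.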



In [31]:
output_file_name = "ner_arman_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_arman:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### Arman+Peyma

In [None]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1WZxpFRtEs5HZWyWQ2Pyg9CCuIBs1Kmvx'})
download.GetContentFile('peyma.zip')
!ls

In [None]:
!unzip peyma.zip
!ls
!ls peyma

In [32]:
sentences_peyma, labels_peyma = ner_model.load_test_datasets(dataset_name="peyma", dataset_dir="./peyma/")
print(len(sentences_peyma), len(labels_peyma))
print(sentences_peyma[0])
print(labels_peyma[0])

1026 1026
['کنایه', 'سرلشگر', 'فیروزآبادی', 'به', 'پادشاه', 'عربستان', 'و', 'پسرش']
['O', 'O', 'B_ORG', 'O', 'O', 'B_LOC', 'O', 'O']


In [33]:
is_consistent = ner_model.check_input_label_consistency(labels_peyma)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'B_ORG', 'I_ORG', 'I_DAT', 'B_LOC', 'B_MON', 'B_TIM', 'B_DAT', 'I_PCT', 'B_PER', 'B_PCT', 'I_MON', 'O', 'I_LOC', 'I_PER', 'I_TIM'}
intersection: {'B_DAT', 'B_PER', 'B_ORG', 'I_DAT', 'B_LOC', 'B_PCT', 'I_MON', 'B_MON', 'O', 'I_PER', 'B_TIM', 'I_LOC', 'I_ORG', 'I_TIM', 'I_PCT'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True


In [None]:
!wget https://github.com/HaniehP/PersianNER/raw/master/ArmanPersoNERCorpus.zip
!ls

In [None]:
!unzip ArmanPersoNERCorpus.zip -d arman
!ls

In [34]:
sentences_arman, labels_arman = ner_model.load_test_datasets(dataset_name="arman", dataset_dir="./arman/")
print(len(sentences_arman), len(labels_arman))
print(sentences_arman[0])
print(labels_arman[0])

7681 7681
['افقی', ':', '0', 'ـ', 'از', 'عوامل', 'دوران', 'پهلوی', 'و', 'نخست\u200cوزیر', 'ایران', 'در', 'سالهای', 'ابتدائی', 'دهه', 'چهل', 'خورشیدی', 'كه', 'جلد', 'سوم', 'یادداشتهایش', 'هم', 'چندی', 'پیش', 'در', 'تهران', 'منتشر', 'شد', '0', 'ـ', 'پرستاری', 'از', 'ناخوش\u200cاحوال', 'ـ', 'پوشاک', 'و', 'جامه', 'ـ', 'فانتزی', 'و', 'شیک', '0', 'ـ', 'در', 'حال', 'وزیدن', 'ـ', 'اطلاعیه', 'ـ', 'پایتخت', 'جمهوری', 'استونی', 'در', 'حوضه', 'بالتیک', '0', 'ـ', 'علم', 'راهبرد', 'مؤسسه', 'و', 'سازمان', 'ـ', 'نوعی', 'شمع', '0', 'ـ', 'حرف', 'جمع', 'مؤنث', 'ـ', 'در', 'ایران', 'به', 'تولیدکننده', 'کتاب', 'اطلاق', 'می\u200cشود', 'ـ', 'از', 'شهرهای', 'باختری', 'افغانستان', 'كه', 'تا', 'عصر', 'ناصرالدین\u200cشاه', 'جزئی', 'از', 'خراسان', 'بود', 'ـ', 'ویتامین', 'انعقاد', '0', 'ـ', 'سبزی', 'غده\u200cای', 'ـ', 'دوستی', 'و', 'محبت', 'ـ', 'داستان', 'بلند', 'ـ', 'شهری', 'در', 'آلمان', '0', 'ـ', 'سلول', 'بدن', 'موجودات', 'ـ', 'از', 'انواع', 'کالباس', '0', 'ـ', 'حاشیه', 'و', 'هامش', 'ـ', 'پیدا', 'نشدنی', 'ـ', 'خ

In [35]:
is_consistent = ner_model.check_input_label_consistency(labels_arman)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'B-pro', 'I-org', 'B-fac', 'I-pro', 'I-fac', 'B-pers', 'B-event', 'I-loc', 'I-pers', 'O', 'B-org', 'I-event', 'B-loc'}
intersection: {'O'}
model_labels-dataset_labels: ['B_DAT', 'B_PER', 'B_ORG', 'B_PCT', 'B_LOC', 'I_DAT', 'I_MON', 'B_MON', 'I_PER', 'B_TIM', 'I_LOC', 'I_ORG', 'I_TIM', 'I_PCT']
dataset_labels-model_labels: ['B-event', 'B-pro', 'I-org', 'B-fac', 'I-pro', 'I-loc', 'I-fac', 'I-pers', 'B-pers', 'B-org', 'I-event', 'B-loc']
False


In [36]:
label_translate = {
    'B-pers': 'B_PER', 
    'I-pers': 'I_PER', 
    'B-event': 'O',
    'I-event': 'O', 
    'B-loc': 'B_LOC', 
    'I-loc': 'I_LOC', 
    'B-org': 'B_ORG', 
    'I-org': 'I_ORG', 
    'B-pro': 'O', 
    'I-pro': 'O', 
    'B-fac': 'O', 
    'I-fac': 'O', 
    'O': 'O'
}
labels_arman = ner_model.resolve_input_label_consistency(labels_arman, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_arman)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'B_PER', 'B_ORG', 'I_ORG', 'B_LOC', 'O', 'I_LOC', 'I_PER'}
intersection: {'B_PER', 'B_ORG', 'B_LOC', 'O', 'I_PER', 'I_LOC', 'I_ORG'}
model_labels-dataset_labels: ['I_DAT', 'B_PCT', 'I_MON', 'B_MON', 'B_TIM', 'B_DAT', 'I_TIM', 'I_PCT']
dataset_labels-model_labels: []
True


In [37]:
sentences = sentences_arman + sentences_peyma
labels = labels_arman + labels_peyma
print(len(sentences), len(labels))

8707 8707


In [38]:
is_consistent = ner_model.check_input_label_consistency(labels)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'B_ORG', 'I_DAT', 'B_LOC', 'B_MON', 'B_TIM', 'B_DAT', 'I_PCT', 'B_PER', 'B_PCT', 'I_MON', 'O', 'I_PER', 'I_LOC', 'I_ORG', 'I_TIM'}
intersection: {'B_PER', 'B_ORG', 'I_ORG', 'I_DAT', 'B_LOC', 'B_PCT', 'I_MON', 'B_MON', 'O', 'I_PER', 'B_TIM', 'I_LOC', 'B_DAT', 'I_TIM', 'I_PCT'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True


In [39]:
!nvidia-smi
!lscpu

Mon Aug 16 14:08:06 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   71C    P0    78W / 149W |  10967MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [40]:
inference_output = ner_model.ner_evaluation(sentences, labels, device, batch_size=256)

max_len: 336
#samples: 8707
#batch: 35
Start to evaluate test data ...
inference time for step 0: 0.06385357600004227
inference time for step 1: 0.01540072800003145
inference time for step 2: 0.014590150999993057
inference time for step 3: 0.014473282000039944
inference time for step 4: 0.014762500000074397
inference time for step 5: 0.014058994000038183
inference time for step 6: 0.014505207999945924
inference time for step 7: 0.014534344000026067
inference time for step 8: 0.014117662999979075
inference time for step 9: 0.014306441999906383
inference time for step 10: 0.014848192000044946
inference time for step 11: 0.015012852999916504
inference time for step 12: 0.014388229000132924
inference time for step 13: 0.014794212000197149
inference time for step 14: 0.014481970000133515
inference time for step 15: 0.014420552000046882
inference time for step 16: 0.01497991200017168
inference time for step 17: 0.014516256999968391
inference time for step 18: 0.01426652500003911
inference ti

In [41]:
for sample_output in inference_output[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

افق	O	O
ی	O	O
:	O	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
از	O	O
عوامل	O	O
دوران	O	O
پهلوی	O	O
و	O	O
نخست	O	O
وزیر	O	O
ایران	B_LOC	O
در	O	O
سالهای	O	O
ابتدا	O	O
<unk>	O	O
ی	O	O
دهه	O	O
چهل	O	O
خورشیدی	O	O
	O	O
<unk>	O	O
ه	O	O
جلد	O	O
سوم	O	O
یادداشت	O	O
هایش	O	O
هم	O	O
چندی	O	O
پیش	O	O
در	O	O
تهران	B_LOC	O
منتشر	O	O
شد	O	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
پرستاری	O	O
از	O	O
ناخوش	O	O
احوال	O	O
	O	O
<unk>	O	O
پوشاک	O	O
و	O	O
جامه	O	O
	O	O
<unk>	O	O
فانتزی	O	O
و	O	O
شیک	O	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
در	O	O
حال	O	O
و	O	O
زیدن	O	O
	O	O
<unk>	O	O
اطلاعی	O	O
ه	O	O
	O	O
<unk>	O	O
پایتخت	O	O
جمهوری	O	O
استونی	B_LOC	O
در	I_LOC	O
حوضه	I_LOC	O
بالتیک	I_LOC	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
علم	O	O
راهبرد	O	O
موسسه	O	O
و	O	O
سازمان	O	O
	O	O
<unk>	O	O
نوعی	O	O
شمع	O	O
	O	O
<unk>	O	O
	O	O
<unk>	O	O
حرف	O	O
جمع	O	O
مونث	O	O
	O	O
<unk>	O	O
در	O	O
ایران	B_LOC	O
به	O	O
تولیدکننده	O	O
کتاب	O	O
اطلاق	O	O
می	O	O
شود	O	O
	O	O
<unk>	O	O
از	O	O
شهرهای	O	O
باختری	O	O
افغانستان	B_LOC	O
	O	O
<unk>	O	O
ه	O	O
تا	O	O
عصر	O	

In [42]:
ner_model.evaluate_prediction_results(labels, inference_output)

Test Accuracy: 0.9190202776364561
Test Precision: 0.31491041603401154
Test Recall: 0.06525706374677491
Test F1-Score: 0.10811092577147624
Test classification Report:
              precision    recall  f1-score   support

        _DAT  0.0274914089 0.0321285141 0.0296296296       249
        _LOC  0.4894736842 0.1202068074 0.1930127983      4642
        _MON  0.0517241379 0.0625000000 0.0566037736        48
        _ORG  0.2081887578 0.0497347480 0.0802890405      6032
        _PCT  0.0000000000 0.0000000000 0.0000000000        87
        _PER  0.5217391304 0.0349272349 0.0654715511      4810
        _TIM  0.0000000000 0.0000000000 0.0000000000        23

   micro avg  0.3149104160 0.0652570637 0.1081109258     15891
   macro avg  0.1855167313 0.0427853292 0.0607152562     15891
weighted avg  0.3805188324 0.0652570637 0.1073111712     15891



In [43]:
output_file_name = "ner_arman-and-peyma_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### WikiAnn dataset:

In [44]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1QOG15HU8VfZvJUNKos024xI-OGm0zhEX'})
download.GetContentFile('fa.tar.gz')
!ls

adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner_arman-and-peyma_m3hrdadfi-albert-fa-base-v2-ner-peyma_outputs.txt
ner_arman_m3hrdadfi-albert-fa-base-v2-ner-peyma_outputs.txt
ner_peyma_m3hrdadfi-albert-fa-base-v2-ner-peyma_outputs.txt
peyma
peyma.zip
sample_data


In [45]:
!tar -zxvf fa.tar.gz
!ls

README.txt
wikiann-fa.bio
adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner_arman-and-peyma_m3hrdadfi-albert-fa-base-v2-ner-peyma_outputs.txt
ner_arman_m3hrdadfi-albert-fa-base-v2-ner-peyma_outputs.txt
ner_peyma_m3hrdadfi-albert-fa-base-v2-ner-peyma_outputs.txt
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio


In [46]:
sentences_all, labels_all, sentences_test, labels_test = ner_model.load_datasets(dataset_name="wikiann", dataset_dir="./")
print(len(sentences_all), len(labels_all))
print(len(sentences_test), len(labels_test))
print(sentences_test[0])
print(labels_test[0])

all data: #data: 272266, #labels: 272266




without stratify
test part:
 #data: 27227, #labels: 27227
272266 272266
27227 27227
['**', 'زاغی', 'نوک\u200cزرد', ',', "''Pica", 'nuttalli', "''"]
['O', 'B-LOC', 'I-LOC', 'O', 'O', 'O', 'O']
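
The log above reports a test part of 27,227 out of 272,266 samples, taken "without stratify". The splitting code inside `load_datasets` is not shown. The sketch below therefore assumes a plain random hold-out with a 10% test fraction (inferred from the two counts) and uses a hypothetical helper name.

```python
import math
import random


def holdout_split(items, labels, test_fraction=0.1, seed=0):
    """Random, non-stratified split into train and test parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    # Assumption: the test size is rounded up.
    n_test = math.ceil(test_fraction * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([items[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([items[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


print(math.ceil(0.1 * 272266))  # 27227, matching the logged test size

items = [f"sentence {i}" for i in range(10)]
labels = [f"label {i}" for i in range(10)]
(train_x, train_y), (test_x, test_y) = holdout_split(items, labels)
print(len(train_x), len(test_x))  # 9 1
```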


In [47]:
is_consistent = ner_model.check_input_label_consistency(labels_test)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'B-LOC', 'I-LOC', 'I-ORG', 'I-PER', 'O', 'B-ORG', 'B-PER'}
intersection: {'O'}
model_labels-dataset_labels: ['B_DAT', 'B_PER', 'B_ORG', 'B_PCT', 'B_LOC', 'I_DAT', 'I_MON', 'B_MON', 'I_PER', 'B_TIM', 'I_LOC', 'I_ORG', 'I_TIM', 'I_PCT']
dataset_labels-model_labels: ['B-LOC', 'I-LOC', 'I-ORG', 'I-PER', 'B-ORG', 'B-PER']
False


In [48]:
label_translate = {
    'B-ORG': 'B_ORG', 
    'I-ORG': 'I_ORG',
    'B-LOC': 'B_LOC',
    'I-LOC': 'I_LOC',
    'B-PER': 'B_PER', 
    'I-PER': 'I_PER',
    'O': 'O'
}
labels_test = ner_model.resolve_input_label_consistency(labels_test, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_test)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'B_PER', 'B_ORG', 'B_LOC', 'O', 'I_PER', 'I_LOC', 'I_ORG'}
intersection: {'B_PER', 'B_ORG', 'I_ORG', 'B_LOC', 'O', 'I_LOC', 'I_PER'}
model_labels-dataset_labels: ['I_DAT', 'B_PCT', 'I_MON', 'B_MON', 'B_TIM', 'B_DAT', 'I_TIM', 'I_PCT']
dataset_labels-model_labels: []
True


In [49]:
!nvidia-smi
!lscpu

Mon Aug 16 14:18:15 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   70C    P0    74W / 149W |  10963MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [50]:
inference_output_wikiann = ner_model.ner_evaluation_2(sentences_test, labels_test, device, batch_size=512)

len(input_text): 27227
len(input_labels): 27227
c: 10000
c: 20000
max_len: 133
#samples: 27227
#batch: 54
Start to evaluate test data ...
inference time for step 0: 0.054613161000133914
inference time for step 1: 0.014763162000008379
inference time for step 2: 0.014771948000088742
inference time for step 3: 0.014427125000111118
inference time for step 4: 0.019951488000060635
inference time for step 5: 0.015165906999982326
inference time for step 6: 0.014400830999875325
inference time for step 7: 0.013868322999996963
inference time for step 8: 0.014751146999969933
inference time for step 9: 0.01600926299988714
inference time for step 10: 0.014473411999915697
inference time for step 11: 0.014959512000132236
inference time for step 12: 0.015638147999879948
inference time for step 13: 0.014213175000122646
inference time for step 14: 0.01576823899995361
inference time for step 15: 0.014616603999911604
inference time for step 16: 0.024368340000137323
inference time for step 17: 0.01407612999

In [51]:
for sample_output in inference_output_wikiann[:5]:
  for token, true_label, predicted_label in sample_output:
    print('{}\t{}\t{}'.format(token, true_label, predicted_label))
  print()

**	O	O
زاغ	B_LOC	O
ی	B_LOC	O
نوک	I_LOC	O
زرد	I_LOC	O
,	O	O
"	O	O
p	O	O
ica	O	O
	O	O
nut	O	O
ta	O	O
lli	O	O
"	O	O

تغییر	O	O
مسیر	O	O
مک	B_LOC	O
ویل	B_LOC	O
،	B_LOC	O
داکوتا	I_LOC	O
ی	I_LOC	O
شمالی	I_LOC	O

و	B_LOC	O
ست	B_LOC	O
یونیورسیت	I_LOC	O
ی	I_LOC	O
پلیس	I_LOC	O
،	I_LOC	O
تگزاس	I_LOC	O

تغییر	O	O
مسیر	O	O
دل	B_PER	O
تف	B_PER	O
فون	I_PER	O
لیل	I_PER	O
نسر	I_PER	O
ون	I_PER	O

تغییر	O	O
مسیر	O	O
نیروگاه	B_ORG	O
های	B_ORG	O
زنجیره	I_ORG	O
ای	I_ORG	O
یاسوج	I_ORG	O



In [52]:
ner_model.evaluate_prediction_results(labels_test, inference_output_wikiann)

Test Accuracy: 0.34559090662334896
Test Precision: 0.4666666666666667
Test Recall: 0.01487603305785124
Test F1-Score: 0.02883295194508009
Test classification Report:
              precision    recall  f1-score   support

        _LOC  0.6077210461 0.0218442256 0.0421725792     22340
        _ORG  0.2269129288 0.0067161265 0.0130461165     12805
        _PER  0.3333333333 0.0077723803 0.0151905602      7205

   micro avg  0.4666666667 0.0148760331 0.0288329519     42350
   macro avg  0.3893224361 0.0121109108 0.0234697519     42350
weighted avg  0.4458978722 0.0148760331 0.0287754174     42350



In [53]:
output_file_name = "ner_wikiann_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output_wikiann:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()

#### Hooshvare - Arman+Peyma+WikiAnn

https://github.com/hooshvare/parsner

In [54]:
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
download = drive.CreateFile({'id': '1fC2WGlpqumUTaT9Dr_U1jO2no3YMKFJ4'})
download.GetContentFile('ner-v1.zip')
!ls

adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner_arman-and-peyma_m3hrdadfi-albert-fa-base-v2-ner-peyma_outputs.txt
ner_arman_m3hrdadfi-albert-fa-base-v2-ner-peyma_outputs.txt
ner_peyma_m3hrdadfi-albert-fa-base-v2-ner-peyma_outputs.txt
ner-v1.zip
ner_wikiann_m3hrdadfi-albert-fa-base-v2-ner-peyma_outputs.txt
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio


In [55]:
!unzip ner-v1.zip
!ls
!ls ner

Archive:  ner-v1.zip
   creating: ner/
  inflating: ner/valid.csv           
  inflating: ner/ner.csv             
  inflating: ner/test.csv            
  inflating: ner/train.csv           
adc.json
arman
ArmanPersoNERCorpus.zip
fa.tar.gz
ner
ner_arman-and-peyma_m3hrdadfi-albert-fa-base-v2-ner-peyma_outputs.txt
ner_arman_m3hrdadfi-albert-fa-base-v2-ner-peyma_outputs.txt
ner_peyma_m3hrdadfi-albert-fa-base-v2-ner-peyma_outputs.txt
ner-v1.zip
ner_wikiann_m3hrdadfi-albert-fa-base-v2-ner-peyma_outputs.txt
peyma
peyma.zip
README.txt
sample_data
wikiann-fa.bio
ner.csv  test.csv  train.csv  valid.csv


In [56]:
sentences_paw, labels_paw = ner_model.load_test_datasets(dataset_name="hooshvare-peyman+arman+wikiann", dataset_dir="./ner/")
print(len(sentences_paw), len(labels_paw))
print(sentences_paw[0])
print(labels_paw[0])

test part:
 #sentences: 6049, #sentences_tags: 6049
6049 6049
['همچنین', 'عملیات', 'لرزه\u200cنگاری', 'دوبعدی', 'نیز', 'با', 'فعالیت', 'مستمر', 'چهار', 'گروه', 'کاری', 'در', 'مناطقی', 'که', 'از', 'نظر', 'اکتشافی', 'مورد', 'نظر', 'بود', '،', 'به', 'پایان', 'رسید', 'که', 'نتایج', 'آن', 'در', 'حال', 'بررسی', 'است', '.']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [57]:
is_consistent = ner_model.check_input_label_consistency(labels_paw)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'I-TIM', 'B-LOC', 'I-ORG', 'B-PCT', 'B-MON', 'B-DAT', 'B-TIM', 'I-PRO', 'I-MON', 'I-LOC', 'B-PRO', 'B-FAC', 'I-PER', 'I-PCT', 'O', 'B-EVE', 'B-ORG', 'I-FAC', 'I-DAT', 'I-EVE', 'B-PER'}
intersection: {'O'}
model_labels-dataset_labels: ['B_DAT', 'B_PER', 'B_ORG', 'B_PCT', 'B_LOC', 'I_DAT', 'I_MON', 'B_MON', 'I_PER', 'B_TIM', 'I_LOC', 'I_ORG', 'I_TIM', 'I_PCT']
dataset_labels-model_labels: ['B-LOC', 'I-ORG', 'B-PCT', 'B-MON', 'B-DAT', 'B-TIM', 'I-PRO', 'I-MON', 'I-LOC', 'B-FAC', 'I-DAT', 'B-PER', 'I-TIM', 'B-PRO', 'I-PER', 'I-PCT', 'B-EVE', 'B-ORG', 'I-FAC', 'I-EVE']
False


In [58]:
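# Map the dataset's hyphenated tags onto the model's underscore tags.
# PRO, EVE and FAC are not among the model labels, so they collapse to 'O'.
label_translate = {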
    'B-ORG': 'B_ORG',
    'I-ORG': 'I_ORG', 
    'B-LOC': 'B_LOC', 
    'I-LOC': 'I_LOC', 
    'B-TIM': 'B_TIM', 
    'I-TIM': 'I_TIM', 
    'B-DAT': 'B_DAT', 
    'I-DAT': 'I_DAT', 
    'B-PRO': 'O', 
    'I-PRO': 'O', 
    'B-EVE': 'O', 
    'I-EVE': 'O', 
    'B-FAC': 'O', 
    'I-FAC': 'O', 
    'B-MON': 'B_MON', 
    'I-MON': 'I_MON', 
    'B-PER': 'B_PER', 
    'I-PER': 'I_PER', 
    'B-PCT': 'B_PCT', 
    'I-PCT': 'I_PCT', 
    'O': 'O'
}
labels_paw = ner_model.resolve_input_label_consistency(labels_paw, label_translate)
is_consistent = ner_model.check_input_label_consistency(labels_paw)
print(is_consistent)

model labels: dict_keys(['B_DAT', 'B_LOC', 'B_MON', 'B_ORG', 'B_PCT', 'B_PER', 'B_TIM', 'I_DAT', 'I_LOC', 'I_MON', 'I_ORG', 'I_PCT', 'I_PER', 'I_TIM', 'O'])
dataset labels: {'B_ORG', 'I_ORG', 'I_DAT', 'B_LOC', 'B_MON', 'B_TIM', 'B_DAT', 'I_PCT', 'B_PER', 'B_PCT', 'I_MON', 'O', 'I_LOC', 'I_PER', 'I_TIM'}
intersection: {'B_DAT', 'B_PER', 'B_ORG', 'I_DAT', 'B_LOC', 'B_PCT', 'I_MON', 'B_MON', 'O', 'I_PER', 'B_TIM', 'I_LOC', 'I_ORG', 'I_TIM', 'I_PCT'}
model_labels-dataset_labels: []
dataset_labels-model_labels: []
True
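
The consistency check and the label translation above can be sketched as follows. This is only an illustration: `check_label_consistency` and `translate_labels` are hypothetical helpers, not the code behind `ner_model.check_input_label_consistency` or `ner_model.resolve_input_label_consistency`, which is not shown in this notebook. The sketch assumes that "consistent" means every dataset tag is known to the model.

```python
from typing import Dict, Iterable, List, Set


def check_label_consistency(model_labels: Iterable[str],
                            dataset_tags: List[List[str]]) -> bool:
    """Return True when every tag in the dataset is a label the model knows."""
    model_set: Set[str] = set(model_labels)
    dataset_set: Set[str] = {tag for sentence in dataset_tags for tag in sentence}
    print("model_labels-dataset_labels:", sorted(model_set - dataset_set))
    print("dataset_labels-model_labels:", sorted(dataset_set - model_set))
    # Assumption: the model may know labels that never occur in the dataset.
    return dataset_set <= model_set


def translate_labels(dataset_tags: List[List[str]],
                     mapping: Dict[str, str]) -> List[List[str]]:
    """Rewrite every tag through the mapping; unknown tags raise KeyError."""
    return [[mapping[tag] for tag in sentence] for sentence in dataset_tags]


# Toy data (not taken from the datasets used in this notebook).
model_labels = ["B_LOC", "I_LOC", "B_PER", "I_PER", "O"]
tags = [["B-PER", "I-PER", "O"], ["B-EVE", "I-EVE", "O", "B-LOC"]]
mapping = {
    "B-PER": "B_PER", "I-PER": "I_PER",
    "B-LOC": "B_LOC", "I-LOC": "I_LOC",
    "B-EVE": "O", "I-EVE": "O",  # entity type the model does not know
    "O": "O",
}

before = check_label_consistency(model_labels, tags)
translated = translate_labels(tags, mapping)
after = check_label_consistency(model_labels, translated)
print(before, after)
```

Mapping unsupported entity types to `O` keeps the evaluation restricted to the classes the model can actually predict.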


In [59]:
!nvidia-smi
!lscpu

Mon Aug 16 14:29:14 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   71C    P0    75W / 149W |  10497MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [62]:
inference_output = ner_model.ner_evaluation_2(sentences_paw, labels_paw, device, batch_size=128)

len(input_text): 6049
len(input_labels): 6049
max_len: 512
#samples: 6049
#batch: 48
Start to evaluate test data ...
inference time for step 0: 0.02853205400015213
inference time for step 1: 0.01493689600010839
inference time for step 2: 0.014254252999762684
inference time for step 3: 0.014426007999645662
inference time for step 4: 0.016489972000272246
inference time for step 5: 0.01377309300005436
inference time for step 6: 0.015014910999980202
inference time for step 7: 0.014477529000032519
inference time for step 8: 0.014668965000055323
inference time for step 9: 0.014497504000246408
inference time for step 10: 0.015001515000221843
inference time for step 11: 0.014542757000072015
inference time for step 12: 0.0150684549998914
inference time for step 13: 0.014418241999919701
inference time for step 14: 0.01458116199955839
inference time for step 15: 0.014508458999898721
inference time for step 16: 0.014937352999822906
inference time for step 17: 0.01476886000000377
inference time for
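
The `#batch: 48` line in the log follows from the sample count and the batch size. A minimal check of that arithmetic, assuming the last, partial batch is kept rather than dropped:

```python
import math

num_samples = 6049   # from the log above
batch_size = 128     # value passed to ner_evaluation_2

# Ceiling division: a final partial batch still counts as one batch.
num_batches = math.ceil(num_samples / batch_size)

# Number of samples left for the final batch.
last_batch_size = num_samples - (num_batches - 1) * batch_size

print(num_batches, last_batch_size)
```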

In [63]:
# Show the first five samples: token, true label and predicted label per line.
for sample_output in inference_output[:5]:
  for token, true_label, predicted_label in sample_output:
    print(f'{token}\t{true_label}\t{predicted_label}')
  print()

همچنین	O	O
عملیات	O	O
	O	O
لرزه	O	O
نگاری	O	O
دو	O	O
بعدی	O	O
نیز	O	O
با	O	O
فعالیت	O	O
مستمر	O	O
چهار	O	O
گروه	O	O
کاری	O	O
در	O	O
مناطق	O	O
ی	O	O
که	O	O
از	O	O
نظر	O	O
اکتشاف	O	O
ی	O	O
مورد	O	O
نظر	O	O
بود	O	O
،	O	O
به	O	O
پایان	O	O
رسید	O	O
که	O	O
نتایج	O	O
ان	O	O
در	O	O
حال	O	O
بررسی	O	O
است	O	O
	O	O
.	O	O

محدث	B_PER	O
در	O	O
مورد	O	O
مشارکت	O	O
شرکتهای	O	O
خارجی	O	O
در	O	O
فعالیتهای	O	O
اکتشاف	O	O
ی	O	O
کشور	O	O
گفت	O	O
:	O	O
تاکنون	O	O
چند	O	O
منطقه	O	O
اکتشاف	O	O
ی	O	O
را	O	O
برای	O	O
مشارکت	O	O
و	O	O
سرمایه	O	O
گذاری	O	O
شرکتهای	O	O
خارجی	O	O
اعلام	O	O
کرده	O	O
ایم	O	O
و	O	O
در	O	O
حال	O	O
مذاکره	O	O
با	O	O
طرفهای	O	O
خارجی	O	O
هستیم	O	O
و	O	O
انتظار	O	O
می	O	O
رود	O	O
تا	O	O
اخر	O	O
امسال	O	O
ب	O	O
توانیم	O	O
چند	O	O
قرارداد	O	O
را	O	O
نهایی	O	O
کنیم	O	O
	O	O
.	O	O

مدیر	O	O
امور	B_ORG	O
اکتشاف	I_ORG	O
شرکت	I_ORG	O
ملی	I_ORG	O
نفت	I_ORG	O
فرو	O	O
افتادگی	O	O
دزفول	B_LOC	O
و	O	O
منطقه	B_LOC	O
گسل	I_LOC	O
کازرون	I_LOC	O
تا	O	O
بالا	B_LOC	O
رود	B_LOC	O
در	O	O
اطراف	O	O
لرستان	B_

In [64]:
ner_model.evaluate_prediction_results(labels_paw, inference_output)

Test Accuracy: 0.9134709029370018
Test Precision: 0.2663738019169329
Test Recall: 0.06185662617082444
Test F1-Score: 0.10039888612929931
Test classification Report:
              precision    recall  f1-score   support

        _DAT  0.0743243243 0.0210727969 0.0328358209       522
        _LOC  0.5236768802 0.1178683386 0.1924257932      3190
        _MON  0.0454545455 0.0054644809 0.0097560976       183
        _ORG  0.1011857708 0.0366342301 0.0537928136      3494
        _PCT  0.0000000000 0.0000000000 0.0000000000       172
        _PER  0.5246478873 0.0471071767 0.0864519872      3163
        _TIM  0.4000000000 0.0338983051 0.0625000000        59

   micro avg  0.2663738019 0.0618566262 0.1003988861     10783
   macro avg  0.2384699154 0.0374350469 0.0625375018     10783
weighted avg  0.3481636908 0.0618566262 0.1018131477     10783
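
The token-level accuracy (0.913) is far above the entity-level F1-score (0.100). One plausible reading is that most tokens carry the `O` tag, so predicting `O` almost everywhere still scores well per token while missing most entities. The sketch below illustrates that effect on toy data. `extract_spans` and `score` are hypothetical helpers, not the implementation behind `ner_model.evaluate_prediction_results`, which is not shown here. The sketch assumes underscore-separated tags (`B_LOC`, `I_LOC`) and counts a predicted entity as correct only when its span and type match exactly.

```python
from typing import Dict, List, Set, Tuple

# (sentence index, start, end exclusive, entity type)
Span = Tuple[int, int, int, str]


def extract_spans(tags: List[str], sent_id: int) -> Set[Span]:
    """Collect entity spans from one sentence of B_/I_/O tags."""
    spans: Set[Span] = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a final span
        prefix, _, label = tag.partition("_")
        # Close the open span on O, on any B tag, or on an I tag of another type.
        if start is not None and (prefix != "I" or label != ent_type):
            spans.add((sent_id, start, i, ent_type))
            start, ent_type = None, None
        # Open a span on B, or on an I tag that has no span to continue.
        if prefix == "B" or (prefix == "I" and start is None):
            start, ent_type = i, label
    return spans


def score(true_tags: List[List[str]], pred_tags: List[List[str]]) -> Dict[str, float]:
    """Token accuracy plus exact-match entity precision, recall and F1."""
    total = correct = 0
    true_spans: Set[Span] = set()
    pred_spans: Set[Span] = set()
    for sid, (t, p) in enumerate(zip(true_tags, pred_tags)):
        total += len(t)
        correct += sum(a == b for a, b in zip(t, p))
        true_spans |= extract_spans(t, sid)
        pred_spans |= extract_spans(p, sid)
    hits = len(true_spans & pred_spans)
    precision = hits / len(pred_spans) if pred_spans else 0.0
    recall = hits / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / total, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data: three entities, the prediction finds only one of them.
true_tags = [["O"] * 8 + ["B_LOC", "I_LOC"],
             ["B_PER"] + ["O"] * 9,
             ["B_ORG", "I_ORG"] + ["O"] * 8]
pred_tags = [["O"] * 10,
             ["O"] * 10,
             ["B_ORG", "I_ORG"] + ["O"] * 8]

result = score(true_tags, pred_tags)
print(result)
```

In this toy case 27 of 30 tokens are tagged correctly, yet only one of three entities is recovered, which is the same pattern as in the report above on a smaller scale.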



In [65]:
# Write one "token<TAB>true label<TAB>predicted label" line per token,
# with a blank line between samples.
output_file_name = "ner_arman-and-peyma-and-wikiann_{}_outputs.txt".format(model_name.replace('/','-'))
with open(output_file_name, "w", encoding='utf8') as output_file:
  for sample_output in inference_output:
    for token, true_label, predicted_label in sample_output:
      output_file.write('{}\t{}\t{}\n'.format(token, true_label, predicted_label))
    output_file.write('\n')

# Authenticate in Colab and upload the results file to Google Drive.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
upload = drive.CreateFile({'title': output_file_name})
upload.SetContentFile(output_file_name)
upload.Upload()
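
The outputs file written above can be read back for later analysis. The sketch below is a hypothetical reader, `read_outputs`, that assumes exactly the layout produced by the cell above: three tab-separated fields per line and a blank line after each sample. It splits on tabs only, because some tokens in the printed output are empty strings.

```python
import io
from typing import Iterable, List, Tuple

Row = Tuple[str, str, str]  # (token, true label, predicted label)


def read_outputs(lines: Iterable[str]) -> List[List[Row]]:
    """Parse token/true/predicted rows grouped into samples by blank lines."""
    samples: List[List[Row]] = []
    current: List[Row] = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "":
            # A blank line ends the current sample.
            if current:
                samples.append(current)
                current = []
            continue
        # Split on tabs only so that an empty token is preserved.
        token, true_label, predicted_label = line.split("\t")
        current.append((token, true_label, predicted_label))
    if current:
        samples.append(current)
    return samples


# Toy content in the same layout; the second row has an empty token.
text = "a\tO\tO\n\tO\tO\nb\tB_PER\tO\n\nc\tO\tO\n\n"
parsed = read_outputs(io.StringIO(text))
print(parsed)
```

With a real file, the same function can be fed an open file handle, for example `read_outputs(open(output_file_name, encoding='utf8'))`.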