In [None]:
!pip3 install transformers seqeval
!pip3 install torch torchvision torchaudio
!pip3 install accelerate
!pip3 install datasets ipywidgets

In [None]:
import transformers

print(transformers.__version__)

4.38.2


In [None]:
from transformers.utils import send_example_telemetry

send_example_telemetry("token_classification_notebook", framework="pytorch")

# Fine-tuning a model on a token classification task

In [None]:
task = "ner" # Should be one of "ner", "pos" or "chunk"
model_checkpoint = "roberta-base"
batch_size = 16

## Loading the dataset

We will use the [🤗 Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need for evaluation (to compare our model to the benchmark). Both steps are handled by the functions `load_dataset` and `load_metric`.

In [None]:
from datasets import load_dataset, load_metric

In [None]:
# datasets = load_dataset("conll2003")
# datasets = load_dataset("conll2012_ontonotesv5", "english_v12")

datasets = load_dataset("./englishv12/")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Downloading data:   0%|          | 0.00/194M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/115812 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/15680 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/12217 [00:00<?, ? examples/s]

The `datasets` object itself is a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict), which contains one key each for the training, validation, and test sets.

Since the labels are lists of `ClassLabel`, the actual names of the labels are nested in the `feature` attribute of the object above:

In [None]:
label_list = datasets["train"].features[f"{task}_tags"].feature.names
label_list

['O',
 'B-PERSON',
 'I-PERSON',
 'B-NORP',
 'I-NORP',
 'B-FAC',
 'I-FAC',
 'B-ORG',
 'I-ORG',
 'B-GPE',
 'I-GPE',
 'B-LOC',
 'I-LOC',
 'B-PRODUCT',
 'I-PRODUCT',
 'B-DATE',
 'I-DATE',
 'B-TIME',
 'I-TIME',
 'B-PERCENT',
 'I-PERCENT',
 'B-MONEY',
 'I-MONEY',
 'B-QUANTITY',
 'I-QUANTITY',
 'B-ORDINAL',
 'I-ORDINAL',
 'B-CARDINAL',
 'I-CARDINAL',
 'B-EVENT',
 'I-EVENT',
 'B-WORK_OF_ART',
 'I-WORK_OF_ART',
 'B-LAW',
 'I-LAW',
 'B-LANGUAGE',
 'I-LANGUAGE']
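
The list follows the BIO scheme: a single `O` tag, then a `B-`/`I-` pair for every entity type. As a rough sketch of how such a list turns into lookups, the snippet below repeats only the first few labels; the names `label2id`, `id2label`, and `entity_types` are illustrative and not taken from the notebook.

```python
# Illustrative subset of the label list printed above.
label_list = ["O", "B-PERSON", "I-PERSON", "B-NORP", "I-NORP"]

# Hypothetical helper mappings between tag names and integer ids.
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for label, i in label2id.items()}

# Entity types are whatever follows the "B-" prefix.
entity_types = [label[2:] for label in label_list if label.startswith("B-")]

print(label2id["B-NORP"])   # 3
print(id2label[2])          # I-PERSON
print(entity_types)         # ['PERSON', 'NORP']
```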

## Preprocessing the data

Before we can feed those texts to our model, we need to preprocess them. This is done by a 🤗 Transformers `Tokenizer`, which (as the name indicates) tokenizes the inputs (including converting the tokens to their corresponding IDs in the pretrained vocabulary), puts them in the format the model expects, and generates the other inputs the model requires.

To do all of this, we instantiate our tokenizer with the `AutoTokenizer.from_pretrained` method, which will ensure:

- we get a tokenizer that corresponds to the model architecture we want to use,
- we download the vocabulary used when pretraining this specific checkpoint.

That vocabulary will be cached, so it's not downloaded again the next time we run the cell.

In [None]:
from transformers import AutoTokenizer

# add add_prefix_space=True for Roberta
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, add_prefix_space=True)



The following assertion ensures that our tokenizer is a fast tokenizer (backed by Rust) from the 🤗 Tokenizers library. Those fast tokenizers are available for almost all models, and we will need some of the special features they have for our preprocessing.

In [None]:
import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

You can check which type of models have a fast tokenizer available and which don't on the [big table of models](https://huggingface.co/transformers/index.html#bigtable).

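With the usual recipe, the labels of all special tokens are set to -100 (the index that PyTorch's cross-entropy loss ignores by default) and the labels of all other tokens to the label of the word they come from. Another strategy is to set the label only on the first token obtained from a given word and give the ignored label to the other subtokens of that word. Note that the preprocessing function in this notebook writes 0 instead of -100, matching the `ignore_index=0` used by the loss in the HGN model below; since 0 is also the id of the `O` tag, tokens labeled `O` do not contribute to the loss. Both strategies are available here; just change the value of the following flag: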

In [None]:
label_all_tokens = True
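
To make the two strategies concrete, here is a minimal sketch on hand-written inputs. The helper `align_labels`, the toy `word_ids`, and the `ignore_id` parameter are assumptions for illustration; `word_ids` mimics what a fast tokenizer reports, with `None` standing for special tokens. The preprocessing function defined later in this notebook uses 0 as the ignored value, so pass `ignore_id=0` to mimic it.

```python
def align_labels(word_ids, word_labels, label_all_tokens, ignore_id=-100):
    """Spread word-level labels over sub-word tokens."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: always ignored.
            aligned.append(ignore_id)
        elif word_id != previous:
            # First sub-token of a word: always labeled.
            aligned.append(word_labels[word_id])
        else:
            # Later sub-tokens: labeled only when label_all_tokens is set.
            aligned.append(word_labels[word_id] if label_all_tokens else ignore_id)
        previous = word_id
    return aligned

# Three words; the second one is split into two sub-tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [1, 0, 9]

print(align_labels(word_ids, word_labels, label_all_tokens=True))
# [-100, 1, 0, 0, 9, -100]
print(align_labels(word_ids, word_labels, label_all_tokens=False))
# [-100, 1, 0, -100, 9, -100]
```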

We're now ready to write the function that will preprocess our samples. We feed them to the `tokenizer` with the arguments `truncation=True` (to truncate texts that are longer than the maximum length allowed by the model) and `is_split_into_words=True` (because our inputs are already split into words). Then we align the labels with the token IDs using the strategy we picked:

In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples[f"{task}_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to 0 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(0)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or 0, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else 0)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

This function works with one or several examples. In the case of several examples, the tokenizer returns a list of lists for each key. To apply it to every split of the dataset at once, we use the `map` method of the `datasets` object:

In [None]:
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/115812 [00:00<?, ? examples/s]

Map:   0%|          | 0/15680 [00:00<?, ? examples/s]

Map:   0%|          | 0/12217 [00:00<?, ? examples/s]

Even better, the results are automatically cached by the 🤗 Datasets library to avoid spending time on this step the next time you run your notebook. The 🤗 Datasets library is normally smart enough to detect when the function you pass to `map` has changed (and thus that the cached data should not be used). For instance, it will properly detect if you change the task in the first cell and rerun the notebook. 🤗 Datasets warns you when it uses cached files; you can pass `load_from_cache_file=False` in the call to `map` to ignore the cached files and force the preprocessing to be applied again.

Note that we passed `batched=True` to encode the texts in batches. This lets us take full advantage of the fast tokenizer we loaded earlier, which uses multi-threading to process the texts in a batch concurrently.
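
With `batched=True`, the mapped function receives a dict of columns (each a list covering the whole batch) rather than one example at a time. The following pure-Python sketch imitates that calling convention; `toy_map` and `count_tokens` are hypothetical stand-ins and not part of 🤗 Datasets.

```python
def toy_map(rows, fn, batch_size=2):
    """Call fn on column-wise batches and attach the new columns row by row."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        for i, row in enumerate(chunk):
            new_row = dict(row)
            for key, column in result.items():
                new_row[key] = column[i]
            out.append(new_row)
    return out

def count_tokens(batch):
    # One output value per example in the batch.
    return {"n_tokens": [len(tokens) for tokens in batch["tokens"]]}

rows = [{"tokens": ["EU", "rejects"]}, {"tokens": ["Peter"]}, {"tokens": ["a", "b", "c"]}]
mapped = toy_map(rows, count_tokens)
print([row["n_tokens"] for row in mapped])  # [2, 1, 3]
```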

## Define HGN model

In [None]:
import math

import torch
import torch.nn as nn
from transformers import AutoConfig, AutoModel

class BiLSTM(nn.Module):
    def __init__(self, hidden_size):
        super(BiLSTM, self).__init__()
        #self.setup_seed(seed)
        self.forward_lstm = nn.LSTM(hidden_size, hidden_size//2, num_layers=1, bidirectional=False, batch_first=True)
        self.backward_lstm = nn.LSTM(hidden_size, hidden_size//2, num_layers=1, bidirectional=False, batch_first=True)

    def forward(self, x):
        # Run one LSTM left-to-right and the other over the time-reversed input.
        out1, _ = self.forward_lstm(x)
        # torch.flip keeps the device and dtype of x, so this also works on CPU.
        reverse_x = torch.flip(x, dims=[1])
        out2, _ = self.backward_lstm(reverse_x)

        # out2 stays in reversed time order; windows_sequence only reads the last step.
        output = torch.cat((out1, out2), 2)
        # The second element is a placeholder so the result unpacks like nn.LSTM's output.
        return output, (1, 1)

class HGNER(nn.Module):
    def __init__(self, args, num_labels,hidden_dropout_prob=0.1,windows_list=None):
        super(HGNER, self).__init__()


        config = AutoConfig.from_pretrained(args.bert_model)
        self.bert = AutoModel.from_pretrained(args.bert_model)


        self.dropout = nn.Dropout(hidden_dropout_prob)
        self.num_labels = num_labels


        self.use_bilstm = args.use_bilstm


        self.use_multiple_window = args.use_multiple_window
        self.windows_list = windows_list
        self.connect_type = args.connect_type
        connect_type = args.connect_type
        self.d_model = args.d_model
        self.num_labels = num_labels


        if self.use_multiple_window and self.windows_list is not None:
            if self.use_bilstm:
                self.bilstm_layers = nn.ModuleList([BiLSTM(self.d_model) for _ in self.windows_list])

            else:
                self.bilstm_layers = nn.ModuleList([nn.LSTM(self.d_model, self.d_model, num_layers=1, bidirectional=False, batch_first=True) for _ in self.windows_list])

            if connect_type=='dot-att':
                self.linear = nn.Linear(self.d_model, self.num_labels)
            elif connect_type=='mlp-att':
                self.linear = nn.Linear(self.d_model, self.num_labels)
                self.Q = nn.Linear(self.d_model * (len(windows_list) + 1), self.d_model)
        else:
            self.linear = nn.Linear(self.d_model, self.num_labels)


    def windows_sequence(self, sequence_output, windows, lstm_layer):
        batch_size, max_len, feat_dim = sequence_output.shape
        # Allocate on the same device as the input instead of hard-coding 'cuda'.
        local_final = torch.zeros([batch_size, max_len, feat_dim], dtype=torch.float32, device=sequence_output.device)
        for i in range(max_len):
            # Collect position i and up to windows // 2 neighbours on each side.
            index_list = []
            for u in range(1, windows // 2 + 1):
                if i - u >= 0:
                    index_list.append(i - u)
                if i + u <= max_len - 1:
                    index_list.append(i + u)
            index_list.append(i)
            index_list.sort()
            temp = sequence_output[:, index_list, :]
            out, (h, b) = lstm_layer(temp)
            # The last LSTM step summarises the window around position i.
            local_f = out[:, -1, :]
            local_final[:, i, :] = local_f
        return local_final



    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None,valid_ids=None,attention_mask_label=None):

        sequence_output = self.bert(input_ids, token_type_ids= token_type_ids, attention_mask=attention_mask,head_mask=None)[0]
        batch_size,max_len,feat_dim = sequence_output.shape
        valid_output = torch.zeros(batch_size, max_len, feat_dim, dtype=torch.float32, device=sequence_output.device)

        for i in range(batch_size):
            jj = -1
            for j in range(max_len):
                    if valid_ids[i][j].item() == 1:
                        jj += 1
                        valid_output[i][jj] = sequence_output[i][j]
        sequence_output = self.dropout(valid_output)


        if self.use_multiple_window:
            multiple_windows = []

            for i, window in enumerate(self.windows_list):
                # bilstm_layers holds BiLSTM or plain nn.LSTM modules, so it is used in both cases.
                local_final = self.windows_sequence(sequence_output, window, self.bilstm_layers[i])
                multiple_windows.append(local_final)


            if self.connect_type == 'dot-att':
                multi_local_features = torch.stack(multiple_windows, dim=2)
                sequence_output = sequence_output.unsqueeze(dim=2)
                d_k = sequence_output.size(-1)
                attn = torch.matmul(sequence_output, multi_local_features.permute(0, 1, 3, 2)) / math.sqrt(d_k)
                attn = torch.softmax(attn, dim=-1)
                # Squeeze only the singleton query axis so a batch of size 1 keeps its batch dimension.
                local_features = torch.matmul(attn, multi_local_features).squeeze(2)
                sequence_output = sequence_output.squeeze(2)
                sequence_output = sequence_output + local_features
            elif self.connect_type == 'mlp-att':
                multiple_windows.append(sequence_output)
                multi_features = torch.cat(multiple_windows, dim=-1)
                multi_local_features = torch.stack(multiple_windows, dim=2)
                query = self.Q(multi_features)
                d_k = query.size(-1)
                query = query.unsqueeze(dim=2)
                attn = torch.matmul(query, multi_local_features.permute(0, 1, 3, 2)) / math.sqrt(d_k)
                attn = torch.softmax(attn, dim=-1)
                sequence_output = torch.matmul(attn, multi_local_features).squeeze(2)


        logits = self.linear(sequence_output)

        if labels is not None:

            loss_fct = nn.CrossEntropyLoss(ignore_index=0)
            # Only keep active parts of the loss
            #attention_mask_label = None
            if attention_mask_label is not None:
                active_loss = attention_mask_label.view(-1) == 1
                active_logits = logits.view(-1, self.num_labels)[active_loss]
                active_labels = labels.view(-1)[active_loss]
                loss = loss_fct(active_logits, active_labels)
            else:
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            return loss
        else:

            return logits
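
The index arithmetic inside `windows_sequence` is easier to follow in isolation. The sketch below reproduces only that part in plain Python, without tensors; `window_indices` is a hypothetical helper, not a method of the model.

```python
def window_indices(i, window, max_len):
    """Positions fed to the LSTM for the token at position i."""
    indices = [i]
    for u in range(1, window // 2 + 1):
        if i - u >= 0:
            indices.append(i - u)
        if i + u <= max_len - 1:
            indices.append(i + u)
    return sorted(indices)

print(window_indices(0, 3, 6))  # [0, 1]
print(window_indices(2, 5, 6))  # [0, 1, 2, 3, 4]
print(window_indices(5, 7, 6))  # [2, 3, 4, 5]
print(window_indices(3, 1, 6))  # [3]
```

Windows are clipped at the sequence boundaries, so tokens near the edges see fewer neighbours, and a window size of 1 reduces to the token itself.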


## Run script

### Import package and set arguments.

In [None]:
from __future__ import absolute_import, division, print_function

import argparse
import csv
import json
import logging
import os
import random
import sys
import time
import re
import numpy as np
import torch
import torch.nn.functional as F

from torch.optim import AdamW  # the AdamW class shipped with transformers is deprecated
from transformers import WEIGHTS_NAME, AutoConfig, AutoModel, AutoTokenizer, PreTrainedModel, get_linear_schedule_with_warmup

from torch import nn
from torch.utils.data import (DataLoader, RandomSampler, SequentialSampler,
                              TensorDataset)
from torch.utils.data.distributed import DistributedSampler

from tqdm import tqdm, trange

from seqeval.metrics import classification_report

logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.INFO)
logger = logging.getLogger("HGNER")

def setup_seed(seed):
    """Seed every random number generator used in this notebook."""
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    random.seed(seed)
    np.random.seed(seed)
    # Trade speed for reproducible cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.random.manual_seed(seed)


parser = argparse.ArgumentParser()

## Required parameters
parser.add_argument("--train_data_dir",
                    default=None,
                    type=str,
                    help="The input data dir. Should contain the .tsv files (or other data files) for the task.")

parser.add_argument("--dev_data_dir",
                    default=None,
                    type=str,
                    help="The input data dir. Should contain the .tsv files (or other data files) for the task.")

parser.add_argument("--test_data_dir",
                    default=None,
                    type=str,
                    help="The input data dir. Should contain the .tsv files (or other data files) for the task.")

parser.add_argument("--gpu_id",
                    default='0',
                    nargs='+',
                    type=str,
                    help="IDs of the GPUs to expose through CUDA_VISIBLE_DEVICES.")

parser.add_argument("--use_crf",
                    action='store_true',
                    help="Whether use crf")

parser.add_argument("--bert_model", default=model_checkpoint, type=str,
                    help="Bert pre-trained model selected in the list: bert-base-uncased, "
                    "bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, "
                    "bert-base-multilingual-cased, bert-base-chinese.")
parser.add_argument("--task_name",
                    default=task,
                    type=str,
                    help="The name of the task to train.")
parser.add_argument("--output_dir",
                    default='./output',
                    type=str,
                    help="The output directory where the model predictions and checkpoints will be written.")


## Other parameters
parser.add_argument("--cache_dir",
                    default="",
                    type=str,
                    help="Where do you want to store the pre-trained models downloaded from s3")

parser.add_argument("--label_list",
                    default=["O"],
                    type=str,
                    nargs='+',
                    help="The list of label names used for tagging.")

parser.add_argument("--max_seq_length",
                    default=128,
                    type=int,
                    help="The maximum total input sequence length after WordPiece tokenization. \n"
                          "Sequences longer than this will be truncated, and sequences shorter \n"
                          "than this will be padded.")
parser.add_argument("--do_train",
                    default=True,
                    action='store_true',
                    help="Whether to run training.")
parser.add_argument("--do_eval",
                    action='store_true',
                    help="Whether to run eval or not.")

parser.add_argument("--do_predict",
                    action='store_true',
                    help="Whether to run prediction or not.")

parser.add_argument("--eval_on",
                    default="dev",
                    help="Whether to run eval on the dev set or test set.")
parser.add_argument("--do_lower_case",
                    action='store_true',
                    help="Set this flag if you are using an uncased model.")
parser.add_argument("--train_batch_size",
                    default=batch_size,
                    type=int,
                    help="Total batch size for training.")
parser.add_argument("--eval_batch_size",
                    default=batch_size,
                    type=int,
                    help="Total batch size for eval.")
parser.add_argument("--learning_rate",
                    default=2e-5,
                    type=float,
                    help="The initial learning rate for Adam.")
parser.add_argument("--num_train_epochs",
                    default=3.0,
                    type=float,
                    help="Total number of training epochs to perform.")
parser.add_argument("--warmup_proportion",
                    default=0.1,
                    type=float,
                    help="Proportion of training to perform linear learning rate warmup for. "
                          "E.g., 0.1 = 10%% of training.")
parser.add_argument("--weight_decay", default=0.01, type=float,
                    help="Weight decay if we apply some.")
parser.add_argument("--adam_epsilon", default=1e-8, type=float,
                    help="Epsilon for Adam optimizer.")
parser.add_argument("--max_grad_norm", default=1.0, type=float,
                    help="Max gradient norm.")
parser.add_argument("--no_cuda",
                    action='store_true',
                    help="Whether not to use CUDA when available")
parser.add_argument("--local_rank",
                    type=int,
                    default=-1,
                    help="local_rank for distributed training on gpus")
parser.add_argument('--seed',
                    type=int,
                    default=42,
                    help="random seed for initialization")
parser.add_argument('--gradient_accumulation_steps',
                    type=int,
                    default=1,
                    help="Number of updates steps to accumulate before performing a backward/update pass.")
parser.add_argument('--fp16',
                    action='store_true',
                    help="Whether to use 16-bit float precision instead of 32-bit")
parser.add_argument('--fp16_opt_level', type=str, default='O1',
                    help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
                          "See details at https://nvidia.github.io/apex/amp.html")
parser.add_argument('--loss_scale',
                    type=float, default=0,
                    help="Loss scaling to improve fp16 numeric stability. Only used when fp16 set to True.\n"
                          "0 (default value): dynamic loss scaling.\n"
                          "Positive power of 2: static loss scaling value.\n")
parser.add_argument('--server_ip', type=str, default='', help="Can be used for distant debugging.")
parser.add_argument('--server_port', type=str, default='', help="Can be used for distant debugging.")

parser.add_argument("--hidden_dropout_prob",
                    default=0.1,
                    type=float,
                    help="hidden_dropout_prob")

parser.add_argument("--window_size",
                    default=-1,
                    type=int,
                    help="window_size")

parser.add_argument("--d_model",
                    default=768,
                    type=int,
                    help="pre-trained model size")

#####
parser.add_argument("--use_bilstm",
                    default=True,
                    action='store_true',
                    help="Use a BiLSTM for each window instead of a single forward LSTM.")


parser.add_argument("--use_single_window",
                    action='store_true',
                    help="Whether to use a single window.")
parser.add_argument("--use_multiple_window",
                    default=True,
                    action='store_true',
                    help="Whether to use multiple window sizes.")

parser.add_argument("--use_global_lstm",
                    action='store_true',
                    help="Whether to use a global LSTM.")

parser.add_argument("--use_n_gram",
                    action='store_true',
                    help="Whether to use n-gram features.")
parser.add_argument('--windows_list', type=str, default='1qq3qq5qq7', help="Window sizes joined by 'qq', e.g. '1qq3qq5qq7'.")
parser.add_argument('--connect_type', type=str, default='dot-att', help="How window features are combined: 'dot-att' or 'mlp-att'.")


parser.add_argument('-f', help="compatible with notebook.")

args = parser.parse_args()
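
Two details of this parser are easy to miss: a `store_true` flag declared with `default=True` can never be switched off from the command line, and `--windows_list` packs several integers into one string separated by `qq`. The trimmed-down sketch below shows both; the parser named `demo` holds only two of the arguments, and `parse_args` is given an explicit list so that it also runs outside a notebook.

```python
import argparse

demo = argparse.ArgumentParser()
demo.add_argument("--use_bilstm", default=True, action="store_true")
demo.add_argument("--windows_list", type=str, default="1qq3qq5qq7")

defaults = demo.parse_args([])
explicit = demo.parse_args(["--use_bilstm", "--windows_list", "3qq5"])

# store_true combined with default=True yields True in both cases.
print(defaults.use_bilstm, explicit.use_bilstm)  # True True

# Same split as the one applied when the model is built.
print([int(k) for k in defaults.windows_list.split("qq")])  # [1, 3, 5, 7]
print([int(k) for k in explicit.windows_list.split("qq")])  # [3, 5]
```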



### Set environment

In [None]:
def remove_dir(path):
    if os.path.isdir(path):
        for subpath in os.listdir(path):
            subpath = os.path.join(path, subpath)
            if os.path.isdir(subpath):
                remove_dir(subpath)
            else:
                os.remove(subpath)
        os.rmdir(path)

if args.do_train:
    if os.path.exists(args.output_dir):
        remove_dir(args.output_dir)  # Delete the existing output directory and everything in it
    os.makedirs(args.output_dir)  # Create a fresh output directory


handler = logging.FileHandler(args.output_dir+'/log.txt', encoding='UTF-8')
logger.addHandler(handler)

# Join the requested GPU ids into a comma-separated list.
gpu_ids = ','.join(str(ids) for ids in args.gpu_id)

os.environ["CUDA_VISIBLE_DEVICES"] = gpu_ids

if args.local_rank == -1 or args.no_cuda:
    device = torch.device("cuda" if torch.cuda.is_available() and not args.no_cuda else "cpu")
    n_gpu = torch.cuda.device_count()
else:
    torch.cuda.set_device(args.local_rank)
    device = torch.device("cuda", args.local_rank)
    n_gpu = 1
    # Initializes the distributed backend which will take care of synchronizing nodes/GPUs
    torch.distributed.init_process_group(backend='nccl')

logger.info("device: {} n_gpu: {}, distributed training: {}, 16-bits training: {}".format(
    device, n_gpu, bool(args.local_rank != -1), args.fp16))

if args.gradient_accumulation_steps < 1:
    raise ValueError("Invalid gradient_accumulation_steps parameter: {}, should be >= 1".format(
                        args.gradient_accumulation_steps))

args.train_batch_size = args.train_batch_size // args.gradient_accumulation_steps

setup_seed(args.seed)

task_name = args.task_name.lower()
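
The hand-written `remove_dir` at the top of this cell walks the directory tree recursively. Assuming no special handling is needed, the standard library's `shutil.rmtree` does the same job. The sketch below resets a directory inside a temporary folder; the path `output` and the helper `reset_dir` are illustrative.

```python
import os
import shutil
import tempfile

def reset_dir(path):
    """Delete path (if present) with everything in it, then recreate it empty."""
    if os.path.isdir(path):
        shutil.rmtree(path)
    os.makedirs(path)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "output")
    # Simulate leftovers from a previous run.
    os.makedirs(os.path.join(out_dir, "nested"))
    with open(os.path.join(out_dir, "nested", "old.txt"), "w") as fh:
        fh.write("stale")

    reset_dir(out_dir)
    remaining = os.listdir(out_dir)
    print(remaining)  # []
```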

### Define utility

In [None]:
class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id, valid_ids=None, label_mask=None, domain_label=None, seq_len=None):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id
        self.valid_ids = valid_ids
        self.label_mask = label_mask
        self.domain_label = domain_label
        self.seq_len = seq_len

def convert_dataset_to_features(datasets_part:str, label_list, max_seq_length, tokenizer):
    """Loads a data file into a list of `InputFeatures`s."""
    # datasets_part = "train" or "validation" or "test"
    dataset = tokenized_datasets[datasets_part]
    orig_dataset = datasets[datasets_part]

    label_map = {label : i for i, label in enumerate(label_list,0)}

    features = []
    ori_sents = []
    for i in range(dataset.num_rows):
        sentence = dataset[i]
        orig_sentence = orig_dataset[i]

        ori_sents.append(orig_sentence['tokens'])

        input_ids = sentence['input_ids']
        input_mask = [1] * len(input_ids)
        segment_ids = [0] * len(input_ids)
        label_ids = sentence['labels']
        valid = [1 if ele % 2 or ele == 0 else 0 for ele in label_ids]
        label_mask = [1] * len(label_ids)
        seq_len = []
        seq_len.append(len(orig_sentence['tokens']))

        if len(input_ids) >= max_seq_length - 1:
            input_ids = input_ids[0:(max_seq_length - 2)]
            input_mask = input_mask[0:(max_seq_length - 2)]
            segment_ids = segment_ids[0:(max_seq_length - 2)]
            label_ids = label_ids[0:(max_seq_length - 2)]
            valid = valid[0:(max_seq_length - 2)]
            label_mask = label_mask[0:(max_seq_length - 2)]

        while len(input_ids) < max_seq_length:
            input_ids.append(0)
            input_mask.append(0)
            segment_ids.append(0)
            label_ids.append(0)
            valid.append(1)
            label_mask.append(0)
        while len(label_ids) < max_seq_length:
            label_ids.append(0)
            label_mask.append(0)
        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length
        assert len(label_ids) == max_seq_length
        assert len(valid) == max_seq_length
        assert len(label_mask) == max_seq_length

        features.append(
                InputFeatures(input_ids=input_ids,
                              input_mask=input_mask,
                              segment_ids=segment_ids,
                              label_id=label_ids,
                              valid_ids=valid,
                              label_mask=label_mask,
                              seq_len=seq_len))
    return features, ori_sents
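
The truncation and padding rule in `convert_dataset_to_features` can be checked on a short list. This sketch isolates it; `pad_or_truncate` is a hypothetical helper, and `pad_id=0` mirrors the zeros appended in the function above.

```python
def pad_or_truncate(ids, max_seq_length, pad_id=0):
    """Return (ids, mask), each exactly max_seq_length items long."""
    ids = list(ids)
    # Same rule as above: long inputs are cut to max_seq_length - 2 items.
    if len(ids) >= max_seq_length - 1:
        ids = ids[: max_seq_length - 2]
    mask = [1] * len(ids)
    padding = max_seq_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

print(pad_or_truncate([5, 6, 7], 6))
# ([5, 6, 7, 0, 0, 0], [1, 1, 1, 0, 0, 0])
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7, 8], 6))
# ([1, 2, 3, 4, 0, 0], [1, 1, 1, 1, 0, 0])
```

A truncated sequence therefore always ends with at least two padding positions.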

### Preprocess data

In [None]:

num_train_optimization_steps = 0
if args.do_train:
    num_train_optimization_steps = int(
        datasets["train"].num_rows / args.train_batch_size / args.gradient_accumulation_steps) * args.num_train_epochs

    if args.local_rank != -1:
        num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()

if args.local_rank not in [-1, 0]:
    torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

num_labels = len(label_list)
logger.info(args)

# Prepare model
model = HGNER(args,
              hidden_dropout_prob=args.hidden_dropout_prob,
              num_labels=num_labels,
              windows_list = [int(k) for k in args.windows_list.split('qq')] if args.windows_list else args.window_size,
              )

n_params = sum([p.nelement() for p in model.parameters()])
print('n_params',n_params)

if args.local_rank == 0:
    torch.distributed.barrier()  # Make sure only the first process in distributed training will download model & vocab

model.to(device)

logger.info(model)

param_optimizer = list(model.named_parameters())
no_decay = ['bias','LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]
warmup_steps = int(args.warmup_proportion * num_train_optimization_steps)
optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=num_train_optimization_steps)

# multi-gpu training (should be after apex fp16 initialization)
if n_gpu > 1:
    model = torch.nn.DataParallel(model)

if args.local_rank != -1:
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank],
                                                      output_device=args.local_rank,
                                                      find_unused_parameters=True)

global_step = 0
nb_tr_steps = 0
tr_loss = 0
label_map = {i : label for i, label in enumerate(label_list,0)}


logger.info("*** Label map ***")
logger.info(label_map)
logger.info("*******************************************")


best_epoch = -1
best_p = -1
best_r = -1
best_f = -1
best_test_f = -1
best_eval_f = -1
if args.do_train:
    #load train data
    train_features,_ = convert_dataset_to_features(
        "train", label_list, args.max_seq_length, tokenizer)
    logger.info("***** Running training *****")
    logger.info("  Num examples = %d", datasets["train"].num_rows)
    logger.info("  Batch size = %d", args.train_batch_size)
    logger.info("  Num steps = %d", num_train_optimization_steps)
    all_input_ids = torch.tensor([f.input_ids for f in train_features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in train_features], dtype=torch.long)
    all_segment_ids = torch.tensor([f.segment_ids for f in train_features], dtype=torch.long)
    all_label_ids = torch.tensor([f.label_id for f in train_features], dtype=torch.long)
    all_valid_ids = torch.tensor([f.valid_ids for f in train_features], dtype=torch.long)
    all_lmask_ids = torch.tensor([f.label_mask for f in train_features], dtype=torch.long)
    all_seq_lens = torch.tensor([f.seq_len for f in train_features], dtype=torch.long)
    train_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids,all_valid_ids,all_lmask_ids, all_seq_lens)

    #load valid data

    eval_features,_ = convert_dataset_to_features("validation", label_list, args.max_seq_length, tokenizer)
    all_input_ids = torch.tensor([f.input_ids for f in eval_features], dtype=torch.long)
    all_input_mask = torch.tensor([f.input_mask for f in eval_features], dtype=torch.long)
    all_segment_ids = torch.tensor([f.segment_ids for f in eval_features], dtype=torch.long)
    all_label_ids = torch.tensor([f.label_id for f in eval_features], dtype=torch.long)
    all_valid_ids = torch.tensor([f.valid_ids for f in eval_features], dtype=torch.long)
    all_lmask_ids = torch.tensor([f.label_mask for f in eval_features], dtype=torch.long)
    all_seq_lens = torch.tensor([f.seq_len for f in eval_features], dtype=torch.long)
    eval_data = TensorDataset(all_input_ids, all_input_mask, all_segment_ids, all_label_ids,all_valid_ids,all_lmask_ids, all_seq_lens)

    #load test data
    test_features,_ = convert_dataset_to_features("test", label_list, args.max_seq_length, tokenizer)
    all_input_ids_dev = torch.tensor([f.input_ids for f in test_features], dtype=torch.long)
    all_input_mask_dev = torch.tensor([f.input_mask for f in test_features], dtype=torch.long)
    all_segment_ids_dev = torch.tensor([f.segment_ids for f in test_features], dtype=torch.long)
    all_label_ids_dev = torch.tensor([f.label_id for f in test_features], dtype=torch.long)
    all_valid_ids_dev = torch.tensor([f.valid_ids for f in test_features], dtype=torch.long)
    all_lmask_ids_dev = torch.tensor([f.label_mask for f in test_features], dtype=torch.long)
    all_seq_lens_dev = torch.tensor([f.seq_len for f in test_features], dtype=torch.long)
    test_data = TensorDataset(all_input_ids_dev, all_input_mask_dev, all_segment_ids_dev, all_label_ids_dev, all_valid_ids_dev, all_lmask_ids_dev, all_seq_lens_dev)

    # The evaluation loops run on every process, so the sequential samplers are
    # created unconditionally; creating them only on rank 0 would leave
    # `test_sampler` and `eval_sampler` undefined on the other ranks.
    test_sampler = SequentialSampler(test_data)
    test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=args.eval_batch_size)

    eval_sampler = SequentialSampler(eval_data)
    eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=args.eval_batch_size)

    if args.local_rank == -1:
        train_sampler = RandomSampler(train_data)
    else:
        train_sampler = DistributedSampler(train_data)
    train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=args.train_batch_size)
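
The `Num steps` value logged above should agree with how often the training loop actually calls `optimizer.step()`. How `num_train_optimization_steps` is computed is not shown in this cell, so the following is only a minimal sketch of the arithmetic. It assumes a single process, a `DataLoader` that keeps the last partial batch (the default `drop_last=False`), and an optimizer that steps only when `(step + 1) % gradient_accumulation_steps == 0`, as in the training cell. The function name and the numbers are illustrative and not part of the notebook.

```python
import math


def count_optimizer_steps(num_examples, batch_size, accumulation_steps, num_epochs):
    """Count optimizer.step() calls for a loop that steps every `accumulation_steps` batches."""
    # Batches per epoch when the last partial batch is kept (DataLoader default).
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # Leftover batches at the end of an epoch do not trigger the step condition.
    steps_per_epoch = batches_per_epoch // accumulation_steps
    return steps_per_epoch * num_epochs


# Illustrative numbers only (hypothetical, not taken from the dataset).
total_steps = count_optimizer_steps(num_examples=1000, batch_size=16,
                                    accumulation_steps=2, num_epochs=3)
print(total_steps)  # 93
```

If the scheduler is built with a different total than the loop produces, the learning-rate schedule ends too early or never reaches its final value, so it may be worth checking the two numbers against each other.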


### Train

In [None]:
if args.do_train:
    test_f1 = []
    dev_f1 = []

    for epoch_ in trange(int(args.num_train_epochs), desc="Epoch"):
        model.train()
        tr_loss = 0
        nb_tr_examples, nb_tr_steps = 0, 0
        for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")):
            batch = tuple(t.to(device) for t in batch)
            input_ids, input_mask, segment_ids, label_ids, valid_ids, l_mask, seq_len = batch
            loss = model(input_ids, segment_ids, input_mask, label_ids, valid_ids, l_mask)

            if n_gpu > 1:
                loss = loss.mean() # mean() to average on multi-gpu.
            if args.gradient_accumulation_steps > 1:
                loss = loss / args.gradient_accumulation_steps

            # Backpropagate on every batch; gradients accumulate until optimizer.step().
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

            tr_loss += loss.item()
            nb_tr_examples += input_ids.size(0)
            nb_tr_steps += 1
            if (step + 1) % args.gradient_accumulation_steps == 0:
                optimizer.step()
                scheduler.step()  # Update learning rate schedule
                model.zero_grad()
                global_step += 1
        # eval in each epoch.
        model.eval()
        eval_loss, eval_accuracy = 0, 0
        nb_eval_steps, nb_eval_examples = 0, 0
        y_true = []
        y_pred = []
        label_map = {i : label for i, label in enumerate(label_list,0)}
        for input_ids, input_mask, segment_ids, label_ids,valid_ids,l_mask, seq_len in eval_dataloader:
            input_ids = input_ids.to(device)
            input_mask = input_mask.to(device)
            segment_ids = segment_ids.to(device)
            valid_ids = valid_ids.to(device)
            label_ids = label_ids.to(device)
            l_mask = l_mask.to(device)
            seq_len = seq_len.to(device)
            #domain_l = domain_l.to(device)

            with torch.no_grad():
                logits = model(input_ids, segment_ids, input_mask,valid_ids=valid_ids,attention_mask_label=l_mask)#, seq_len=seq_len)


            if not args.use_crf:
                logits = torch.argmax(F.log_softmax(logits, dim=2), dim=2)
            logits = logits.detach().cpu().numpy()
            label_ids = label_ids.to('cpu').numpy()
            input_mask = input_mask.to('cpu').numpy()


            for i, label in enumerate(label_ids):
                temp_1 = []
                temp_2 = []
                for j,m in enumerate(label):
                    if j == 0:
                        continue
                    elif label_ids[i][j] == len(label_map):
                        y_true.append(temp_1)
                        y_pred.append(temp_2)
                        break
                    else:

                        temp_1.append(label_map[label_ids[i][j]])
                        try:
                            temp_2.append(label_map[logits[i][j]])
                        except (KeyError, IndexError):
                            # Prediction outside the label set or beyond the decoded length.
                            temp_2.append('O')

        report = classification_report(y_true, y_pred,digits=4)
        logger.info("\n******evaluate on the dev data*******")
        logger.info("\n%s", report)
        temp = report.split('\n')[-3]
        f_eval = float(temp.split()[-2])  # F1 column of the selected average row
        dev_f1.append(f_eval)

        output_eval_file = os.path.join(args.output_dir, "eval_results.txt")


        #if os.path.exists(output_eval_file):
        with open(output_eval_file, "a") as writer:
            writer.write('*******************epoch*******'+str(epoch_)+'\n')
            writer.write(report+'\n')


        y_true = []
        y_pred = []
        label_map = {i : label for i, label in enumerate(label_list,0)}
        for input_ids, input_mask, segment_ids, label_ids,valid_ids,l_mask, seq_len in test_dataloader:
            input_ids = input_ids.to(device)
            input_mask = input_mask.to(device)
            segment_ids = segment_ids.to(device)
            valid_ids = valid_ids.to(device)
            label_ids = label_ids.to(device)
            l_mask = l_mask.to(device)
            #domain_l = domain_l.to(device)

            with torch.no_grad():
                logits = model(input_ids, segment_ids, input_mask,valid_ids=valid_ids,attention_mask_label=l_mask)
                shape = logits.shape
                if len(shape) < 3:
                    logits = logits.unsqueeze(dim=0)

            if not args.use_crf:
                logits = torch.argmax(F.log_softmax(logits,dim=2),dim=2)
            logits = logits.detach().cpu().numpy()
            label_ids = label_ids.to('cpu').numpy()
            input_mask = input_mask.to('cpu').numpy()

            for i, label in enumerate(label_ids):
                temp_1 = []
                temp_2 = []
                for j,m in enumerate(label):
                    if j == 0:
                        continue
                    elif label_ids[i][j] == len(label_map):
                        y_true.append(temp_1)
                        y_pred.append(temp_2)
                        break
                    else:
                        temp_1.append(label_map[label_ids[i][j]])
                        try:
                            temp_2.append(label_map[logits[i][j]])
                        except (KeyError, IndexError):
                            # Prediction outside the label set or beyond the decoded length.
                            temp_2.append('O')

        report = classification_report(y_true, y_pred,digits=4)

        logger.info("\n******evaluate on the test data*******")
        logger.info("\n%s", report)
        temp = report.split('\n')[-3]
        f_test = float(temp.split()[-2])  # F1 column of the selected average row
        test_f1.append(f_test)



        output_eval_file_t = os.path.join(args.output_dir, "test_results.txt")


        #if os.path.exists(output_eval_file):
        with open(output_eval_file_t, "a") as writer2:
            writer2.write('*******************epoch*******'+str(epoch_)+'\n')
            writer2.write(report+'\n')



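    # Write the per-epoch F1 scores (test, dev), followed by a summary test F1.
    # `best_test_f` is never updated inside the epoch loop, so it is taken here as
    # the test F1 at the epoch with the highest dev F1 (an assumed reading of the name).
    if dev_f1:
        best_epoch = max(range(len(dev_f1)), key=lambda e: dev_f1[e])
        best_eval_f = dev_f1[best_epoch]
        best_test_f = test_f1[best_epoch]
    output_f1_test = os.path.join(args.output_dir, "f1_score_epoch.txt")
    with open(output_f1_test, "w") as writer1:
        for i, j in zip(test_f1, dev_f1):
            writer1.write(str(i) + '\t' + str(j) + '\n')
        writer1.write('\n')
        writer1.write(str(best_test_f))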
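
The dev and test loops above decode label ids in the same way: position 0 is skipped, a label id equal to `len(label_map)` marks the end of the sentence, and a prediction that is not in the label map falls back to `'O'`. The sketch below isolates that logic on plain Python lists so it can be checked without a model. The helper name `decode_sequences` and the toy ids are hypothetical, and the claim that position 0 holds the leading special token is an assumption about `convert_dataset_to_features`.

```python
def decode_sequences(label_ids, pred_ids, label_list):
    """Turn padded id rows into tag sequences, mirroring the evaluation loops."""
    label_map = dict(enumerate(label_list))
    end_id = len(label_map)  # sentinel id that marks the end of a sentence
    y_true, y_pred = [], []
    for gold_row, pred_row in zip(label_ids, pred_ids):
        gold_tags, pred_tags = [], []
        for j, gold in enumerate(gold_row):
            if j == 0:
                continue  # position 0 is skipped (assumed: leading special token)
            if gold == end_id:
                y_true.append(gold_tags)
                y_pred.append(pred_tags)
                break
            gold_tags.append(label_map[gold])
            # Predictions outside the label set fall back to 'O'.
            pred_tags.append(label_map.get(pred_row[j], "O"))
    return y_true, y_pred


# Toy example with a three-label subset; the ids are made up for illustration.
labels = ["O", "B-PERSON", "I-PERSON"]
gold = [[0, 1, 2, 0, 3, 0]]
pred = [[0, 1, 0, 9, 3, 0]]
y_true, y_pred = decode_sequences(gold, pred, labels)
print(y_true)  # [['B-PERSON', 'I-PERSON', 'O']]
print(y_pred)  # [['B-PERSON', 'O', 'O']]
```

As in the loops above, a row that never reaches the sentinel id (for example a sentence truncated at `max_seq_length`) is dropped from both lists, which is worth keeping in mind when comparing scores across sequence lengths.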