## Load the Custom Dataset

In [1]:
import json
 
with open('ner-data-techs (2).json', 'r') as f:
    data = json.load(f)

In [2]:
data

[['two large, open source software systems are analyzed from the vantage point of complex adaptive systems theory.',
  {'entities': [[87, 103, 'TECH']]}],
 ['a distributed algorithm, based on the methodology of algebraic traceback developed by dean et al, is proposed which can completely determine a path of d nodes/routers using o(d) marked packets, and subsequently determine the changes in its topology using o(log d) marked packets with high probability.',
  {'entities': [[159, 166, 'TECH']]}],
 ['the advent of web 3.0, claiming for personalization in interactive systems (lassila & hendler, 2007), and the need for systems capable of interacting in a more natural way in the future society flooded with computer systems and devices (harper et al.,',
  {'entities': [[207, 223, 'TECH']]}],
 ['for the purpose of using common sense knowledge in the development and design of computer systems, it is necessary to provide an architecture that allows it.',
  {'entities': [[81, 97, 'TECH']]}],
 ['
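Each record pairs a lowercased sentence with a list of `[start, end, label]` character offsets. A minimal sketch of reading a span back out of a record, using the first record shown above (the helper name `entity_texts` is illustrative and not part of the notebook):

```python
def entity_texts(record):
    """Return the substring covered by every [start, end, label] span."""
    text, annotations = record
    return [text[start:end] for start, end, _label in annotations["entities"]]


# First record from the dataset output above.
record = [
    "two large, open source software systems are analyzed from the vantage "
    "point of complex adaptive systems theory.",
    {"entities": [[87, 103, "TECH"]]},
]

print(entity_texts(record))
```

The end offset is exclusive, so `text[start:end]` yields the entity without the following character.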

In [3]:
raw_text = """
[40"](display_size) [LED](display_type) TV
Specifications: [16″](display_size) HD READY [LED](display_type) TV.
[1 Year](warranty) Warranty"""

## Convert Data to Required Format

In [4]:
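TRAIN_DATA = []

# Rebuild every sentence with its entities marked up as [value](LABEL).
for i in data:
    strr = []
    for loc in i[1]["entities"]:  # loc = [start, end, label]
        for c in range(len(i[0])):
            if loc[0] == c:
                # Open the bracket in front of the first entity character.
                strr.append("[")
                strr.append(i[0][c])
            elif loc[1] == c:
                # Close the bracket at the end offset. The character at that
                # offset is written between "]" and "(LABEL) ".
                strr.append("]")
                strr.append(i[0][c])
                strr.append("("+loc[-1]+") ")
            else:
                strr.append(i[0][c])
    
    TRAIN_DATA.append("".join(strr))

# Drop a "." or " " that ended up between "]" and "("; other characters
# such as "," or "-" stay there, as the output below shows.
TRAIN_DATA = "\n".join(TRAIN_DATA).replace("].(", "](").replace("] (", "](")
TRAIN_DATA[0:1000]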

'two large, open source software systems are analyzed from the vantage point of complex [adaptive systems](TECH) theory.\na distributed algorithm, based on the methodology of algebraic traceback developed by dean et al, is proposed which can completely determine a path of d nodes/[routers](TECH) using o(d) marked packets, and subsequently determine the changes in its topology using o(log d) marked packets with high probability.\nthe advent of web 3.0, claiming for personalization in interactive systems (lassila & hendler, 2007), and the need for systems capable of interacting in a more natural way in the future society flooded with [computer systems](TECH) and devices (harper et al.,\nfor the purpose of using common sense knowledge in the development and design of [computer systems],(TECH)  it is necessary to provide an architecture that allows it.\nfor a given network, we derive a mathematical condition on how small $p_e$ should be so that only single [edge network]-(TECH) errors need
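The output above contains fragments such as `[computer systems],(TECH)`, where the character that follows the entity ends up between the bracket and the label. A possible alternative, sketched here under the assumption that entity spans do not overlap, is to slice the text by offsets instead of walking it character by character (`to_bracket_markup` is a hypothetical helper, not the notebook's code):

```python
def to_bracket_markup(text, entities):
    """Wrap each (start, end, label) span as [value](LABEL).

    Assumes the spans do not overlap; they are processed left to right.
    """
    parts = []
    cursor = 0
    for start, end, label in sorted(entities):
        parts.append(text[cursor:start])                # text before the entity
        parts.append(f"[{text[start:end]}]({label})")   # the marked-up entity
        cursor = end
    parts.append(text[cursor:])                         # remaining tail
    return "".join(parts)


sample = "design of computer systems, it is necessary"
marked = to_bracket_markup(sample, [[10, 26, "TECH"]])
print(marked)
```

This keeps punctuation outside the markup and handles several entities in one sentence. The tokenizer defined later still decides how a token like `[computer systems](TECH),` is split, so the two steps would need to be checked together.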

In [5]:
from transformers import AutoTokenizer
import os

model_name = os.getcwd()+"//distilbert-base-uncased"  # local copy of the checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)

## Tokenization and Label Encoding

In [6]:
import re
def get_tokens_with_entities(raw_text: str):
    raw_tokens = re.split(r"\s(?![^\[]*\])", raw_text)
    entity_value_pattern = r"\[(?P<value>.+?)\]\((?P<entity>.+?)\)"
    entity_value_pattern_compiled = re.compile(entity_value_pattern, flags=re.I|re.M)

    tokens_with_entities = []

    for raw_token in raw_tokens:
        match = entity_value_pattern_compiled.match(raw_token)
        if match:
            raw_entity_name, raw_entity_value = match.group("entity"), match.group("value")

            for i, raw_entity_token in enumerate(re.split(r"\s", raw_entity_value)):
                entity_prefix = "B" if i == 0 else "I"
                entity_name = f"{entity_prefix}-{raw_entity_name}"
                tokens_with_entities.append((raw_entity_token, entity_name))
        else:
            tokens_with_entities.append((raw_token, "O"))

    return tokens_with_entities


class NERDataMaker:
    def __init__(self, texts):
        self.unique_entities = []
        self.processed_texts = []

        temp_processed_texts = []
        for text in texts:
            tokens_with_entities = get_tokens_with_entities(text)
            for _, ent in tokens_with_entities:
                if ent not in self.unique_entities:
                    self.unique_entities.append(ent)
            temp_processed_texts.append(tokens_with_entities)

        self.unique_entities.sort(key=lambda ent: ent if ent != "O" else "")

        for tokens_with_entities in temp_processed_texts:
            self.processed_texts.append([(t, self.unique_entities.index(ent)) for t, ent in tokens_with_entities])

    @property
    def id2label(self):
        return dict(enumerate(self.unique_entities))

    @property
    def label2id(self):
        return {v:k for k, v in self.id2label.items()}

    def __len__(self):
        return len(self.processed_texts)

    def __getitem__(self, idx):
        def _process_tokens_for_one_text(id, tokens_with_encoded_entities):
            ner_tags = []
            tokens = []
            for t, ent in tokens_with_encoded_entities:
                ner_tags.append(ent)
                tokens.append(t)

            return {
                "id": id,
                "ner_tags": ner_tags,
                "tokens": tokens
            }

        tokens_with_encoded_entities = self.processed_texts[idx]
        if isinstance(idx, int):
            return _process_tokens_for_one_text(idx, tokens_with_encoded_entities)
        else:
            return [_process_tokens_for_one_text(i+(idx.start or 0), tee) for i, tee in enumerate(tokens_with_encoded_entities)]

    def as_hf_dataset(self, tokenizer):
        from datasets import Dataset, Features, Value, ClassLabel, Sequence
        def tokenize_and_align_labels(examples):
            tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

            labels = []
            for i, label in enumerate(examples["ner_tags"]):
                word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
                previous_word_idx = None
                label_ids = []
                for word_idx in word_ids:  # Set the special tokens to -100.
                    if word_idx is None:
                        label_ids.append(-100)
                    elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                        label_ids.append(label[word_idx])
                    else:
                        label_ids.append(-100)
                    previous_word_idx = word_idx
                labels.append(label_ids)

            tokenized_inputs["labels"] = labels
            return tokenized_inputs

        ids, ner_tags, tokens = [], [], []
        for i, pt in enumerate(self.processed_texts):
            ids.append(i)
            pt_tokens,pt_tags = list(zip(*pt))
            ner_tags.append(pt_tags)
            tokens.append(pt_tokens)
        data = {
            "id": ids,
            "ner_tags": ner_tags,
            "tokens": tokens
        }
        features = Features({
            "tokens": Sequence(Value("string")),
            "ner_tags": Sequence(ClassLabel(names=self.unique_entities)),
            "id": Value("int32")
        })
        ds = Dataset.from_dict(data, features)
        tokenized_ds = ds.map(tokenize_and_align_labels, batched=True)
        return tokenized_ds
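The alignment step inside `tokenize_and_align_labels` can be tried without a tokenizer. The sketch below repeats the same rule on a hand-written `word_ids` list: special tokens and non-first sub-word tokens get `-100`, so the loss ignores them. The word ids and label ids are made-up illustration values, not output from the notebook.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first sub-token only."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:                  # special token such as [CLS] or [SEP]
            aligned.append(ignore_index)
        elif word_id != previous_word_id:    # first sub-token of a word
            aligned.append(word_labels[word_id])
        else:                                # later sub-token of the same word
            aligned.append(ignore_index)
        previous_word_id = word_id
    return aligned


# Hypothetical split of three words into four sub-tokens plus two special tokens.
word_ids = [None, 0, 1, 2, 2, None]
word_labels = [1, 2, 2]  # illustrative label ids, one per word
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Because later sub-tokens are never supervised, the model's predictions on them are not constrained by training, which matters when predictions are grouped back into words.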

In [7]:
dm = NERDataMaker(TRAIN_DATA.split("\n"))
print(f"total examples = {len(dm)}")
print(dm[0:3])

total examples = 21380
[{'id': 0, 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0], 'tokens': ['two', 'large,', 'open', 'source', 'software', 'systems', 'are', 'analyzed', 'from', 'the', 'vantage', 'point', 'of', 'complex', 'adaptive', 'systems', 'theory.']}, {'id': 1, 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'tokens': ['a', 'distributed', 'algorithm,', 'based', 'on', 'the', 'methodology', 'of', 'algebraic', 'traceback', 'developed', 'by', 'dean', 'et', 'al,', 'is', 'proposed', 'which', 'can', 'completely', 'determine', 'a', 'path', 'of', 'd', 'nodes/[routers](TECH)', 'using', 'o(d)', 'marked', 'packets,', 'and', 'subsequently', 'determine', 'the', 'changes', 'in', 'its', 'topology', 'using', 'o(log', 'd)', 'marked', 'packets', 'with', 'high', 'probability.']}, {'id': 2, 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
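In the output above, record 1 has every tag set to 0 even though its token list contains `nodes/[routers](TECH)`. This is consistent with `re.match`, which only matches at the start of a token. A small check of that behaviour, using the same pattern as `get_tokens_with_entities`:

```python
import re

entity_value_pattern = re.compile(
    r"\[(?P<value>.+?)\]\((?P<entity>.+?)\)", flags=re.I | re.M
)

token = "nodes/[routers](TECH)"

anchored = entity_value_pattern.match(token)    # must match at position 0
anywhere = entity_value_pattern.search(token)   # may match later in the token

print(anchored)
print(anywhere.group("value"), anywhere.group("entity"))
```

Switching to `search` would also require a decision about the text around the match (here `nodes/`), so this is a diagnostic rather than a drop-in fix.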

In [5]:
from transformers import AutoTokenizer, DataCollatorForTokenClassification, AutoModelForTokenClassification, TrainingArguments, Trainer

## Model Initialization

In [9]:
from transformers import pipeline

model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(dm.unique_entities), id2label=dm.id2label, label2id=dm.label2id, ignore_mismatched_sizes=True)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

Some weights of the model checkpoint at D:\UPWORK\Eyong\NER Spacy Model\Task NER//distilbert-base-uncased were not used when initializing DistilBertForTokenClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at D:\UPWORK\Eyong\NER Spacy Model\Task NER//distilbert-base-uncased and are new

## Training and Checkpointing

In [10]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=40,
    weight_decay=0.01,
)

train_ds = dm.as_hf_dataset(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=train_ds, # evaluates on the training set; for demonstration only
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

  0%|          | 0/22 [00:00<?, ?ba/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
The following columns in the training set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: tokens, id, ner_tags. If tokens, id, ner_tags are not expected by `DistilBertForTokenClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 21380
  Num Epochs = 40
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 53480


Saving model checkpoint to ./results\checkpoint-500
Configuration saved in ./results\checkpoint-500\config.json
Model weights saved in ./results\checkpoint-500\pytorch_model.bin
tokenizer config file saved in ./results\checkpoint-500\tokenizer_config.json
Special tokens file saved in ./results\checkpoint-500\special_tokens_map.json


KeyboardInterrupt: 

### Training was interrupted manually because a full run (40 epochs, 53,480 optimization steps) takes a very long time; the partially trained model is already usable
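The step count reported in the training log can be reproduced from the settings, which gives a sense of how far a 500-step checkpoint is from a full run. A rough sketch, assuming a single device and no gradient accumulation, as the log indicates:

```python
import math

num_examples = 21380   # "Num examples" in the log
batch_size = 16        # per_device_train_batch_size
num_epochs = 40        # num_train_epochs

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)

# Share of one epoch covered when checkpoint-500 was written.
checkpoint_step = 500
epoch_fraction = round(checkpoint_step / steps_per_epoch, 2)
print(epoch_fraction)
```

The checkpoint at step 500 therefore covers well under half of a single pass over the data.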

## Final Output

In [20]:
from transformers import pipeline
test = """two large, open source software systems are analyzed from the vantage point of complex adaptive systems theory."""
pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple") # pass device=0 if using gpu
pipe(test)

[{'entity_group': 'TECH',
  'score': 0.96347094,
  'word': 'adaptive systems',
  'start': 87,
  'end': 103}]

In [21]:
test = """a distributed algorithm, based on the methodology of algebraic traceback developed by dean et al, \
          is proposed which can completely determine a path of d nodes/ routers using o(d) marked packets, and subsequently \
          determine the changes in its topology using o(log d) marked packets with high probability"""
pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple") # pass device=0 if using gpu
pipe(test)

[{'entity_group': 'TECH',
  'score': 0.98494405,
  'word': 'route',
  'start': 170,
  'end': 175}]

In [23]:
test = "it is, however, well known that public key cryptography demands considerable computing resources and that rsa encryption is much faster than rsa decryption"
pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple") # pass device=0 if using gpu
pipe(test)

[{'entity_group': 'TECH',
  'score': 0.75266695,
  'word': 'public key crypt',
  'start': 32,
  'end': 48}]

In [27]:
test = "this report deals with security in wireless sensor networks (wsns), especially in network layer."
pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple") # pass device=0 if using gpu
pipe(test)

[{'entity_group': 'TECH',
  'score': 0.94492906,
  'word': 'w',
  'start': 61,
  'end': 62}]

In [28]:
test = "in the development of recent mobile devices like software defined radio (sdr) secure method of software downloading is found necessary for reconfiguration."
pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple") # pass device=0 if using gpu
pipe(test)

[{'entity_group': 'TECH',
  'score': 0.96436954,
  'word': 'software defined',
  'start': 49,
  'end': 65}]

## Model Evaluation

In [1]:
# Load the model from disk
import os
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_name_disk = os.getcwd()+"//results//checkpoint-500"
model = AutoModelForTokenClassification.from_pretrained(model_name_disk)
tokenizer = AutoTokenizer.from_pretrained(model_name_disk)

In [None]:
pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple") # pass device=0 if using gpu
print(pipe("routers - Model Loaded"))

## Some Example Tests

In [24]:
import random

for i in range(1, 20):
    print(pipe(random.choice(data)[0]))
    print("------------------------------------------------------------------------------------")

[{'entity_group': 'TECH', 'score': 0.9955695, 'word': 'gps', 'start': 250, 'end': 253}]
------------------------------------------------------------------------------------
[{'entity_group': 'TECH', 'score': 0.9929993, 'word': 'data security', 'start': 70, 'end': 83}]
------------------------------------------------------------------------------------
[{'entity_group': 'TECH', 'score': 0.9836524, 'word': 'humanoid robot', 'start': 83, 'end': 97}, {'entity_group': 'TECH', 'score': 0.96176255, 'word': 'humanoid robot', 'start': 202, 'end': 216}]
------------------------------------------------------------------------------------
[{'entity_group': 'TECH', 'score': 0.9842491, 'word': 'network architecture', 'start': 112, 'end': 132}]
------------------------------------------------------------------------------------
[]
------------------------------------------------------------------------------------
[{'entity_group': 'TECH', 'score': 0.9983916, 'word': 'network architecture', 'start': 

## Accuracy

In [13]:
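actual_tag = []
predicted_tag = []
for i in data:
    # Gold label: the text of the first annotated entity in the record.
    actual_tag.append(i[0][i[1]['entities'][0][0]:i[1]['entities'][0][1]])
    
    try:
        # Prediction: the first entity returned by the pipeline.
        predicted_tag.append(pipe(i[0])[0]['word'])
    except Exception:
        # No entity returned (empty list) or the pipeline failed on this record.
        predicted_tag.append("Not Detected")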

In [17]:
correct_terms = 0

for i in range(len(actual_tag)):
    if actual_tag[i] == predicted_tag[i]:
        correct_terms = correct_terms + 1
        
print("The Accuracy of the System is: "+str(round((correct_terms/len(actual_tag))*100, 2))+"%")

The Accuracy of the System is: 84.16%
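The figure above is an exact-match rate on the first annotated entity of each record, measured on the same data the model was trained on. A minimal sketch of the same metric as a reusable function, together with a hypothetical hold-out split that would give a less optimistic estimate (`exact_match_accuracy` and `split_records` are illustrative names, and the toy lists are made up):

```python
import random


def exact_match_accuracy(actual, predicted):
    """Percentage of positions where the predicted string equals the gold string."""
    if not actual:
        return 0.0
    correct = sum(a == p for a, p in zip(actual, predicted))
    return round(100 * correct / len(actual), 2)


def split_records(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and cut off a held-out test part."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy values for illustration only.
actual = ["gps", "wsns", "iptv", "data mining"]
predicted = ["gps", "w", "internet protocol", "data mining"]
print(exact_match_accuracy(actual, predicted))

train_part, test_part = split_records(list(range(10)))
print(len(train_part), len(test_part))
```

Training on `train_part` and scoring only `test_part` would keep the evaluation records out of training.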


## Records Where the Model Makes Mistakes

In [20]:
already_done = []
for i in range(len(actual_tag)):
    if actual_tag[i] != predicted_tag[i]:
        if actual_tag[i] not in already_done:
            print("Actual tag is ["+actual_tag[i]+"], while predicted tag is ["+predicted_tag[i]+"]")
            already_done.append(actual_tag[i])

Actual tag is [edge network], while predicted tag is [Not Detected]
Actual tag is [expert systems], while predicted tag is [based]
Actual tag is [route optimization], while predicted tag is [mobile network]
Actual tag is [mobile networks], while predicted tag is [mobile]
Actual tag is [computer systems], while predicted tag is [access control]
Actual tag is [data mining], while predicted tag is [)]
Actual tag is [gps], while predicted tag is [phone]
Actual tag is [programming languages], while predicted tag is [programming language]
Actual tag is [streaming applications], while predicted tag is [video]
Actual tag is [proxy server], while predicted tag is [Not Detected]
Actual tag is [cognitive network], while predicted tag is [cognitive]
Actual tag is [cellular system], while predicted tag is [)]
Actual tag is [wsns], while predicted tag is [w]
Actual tag is [iptv], while predicted tag is [internet protocol]
Actual tag is [digital signature], while predicted tag is [##mark]
Actual tag 

**Many of these errors should become less frequent when the model is trained for more iterations; training was interrupted early, so this is only a very basic model.**
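Several of the mismatches above are partial words (`w` for `wsns`, `##mark`). That is consistent with only the first sub-word token of each word being labelled during training while the pipeline groups tokens with `aggregation_strategy="simple"`. One possible workaround, sketched here as plain post-processing, is to widen each predicted span to word boundaries. The helper `expand_to_word` is hypothetical and not part of transformers.

```python
def expand_to_word(text, start, end):
    """Grow a character span until it covers whole alphanumeric words."""
    while start > 0 and text[start - 1].isalnum():
        start -= 1
    while end < len(text) and text[end].isalnum():
        end += 1
    return start, end


# Sentence and predicted span (61, 62) taken from the "Final Output" section.
text = ("this report deals with security in wireless sensor networks (wsns), "
        "especially in network layer.")
start, end = expand_to_word(text, 61, 62)
print(text[start:end])
```

The pipeline's word-level aggregation strategies (for example `aggregation_strategy="first"`) may be a cleaner alternative; that has not been tried in this notebook.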

## Save Models to Disk for Later Use

In [13]:
# Save to disk
tokenizer.save_pretrained("local-pt-checkpoint")
model.save_pretrained("local-pt-checkpoint")

tokenizer config file saved in local-pt-checkpoint\tokenizer_config.json
Special tokens file saved in local-pt-checkpoint\special_tokens_map.json
Configuration saved in local-pt-checkpoint\config.json
Model weights saved in local-pt-checkpoint\pytorch_model.bin
