#**Sources**

* [Named Entity Recognition using roBERTa Base Large Language Model - Medium](https://christiangrech.medium.com/named-entity-recognition-using-roberta-base-large-language-model-67110c82d7e2)


* [NER annotator website](https://tecoholic.github.io/ner-annotator/)

* [Custom NER in spacy tutorial](https://github.com/PradipNichite/Youtube-Tutorials/blob/main/Spacy_Custom_NER_Youtube.ipynb)


#**Installs & Imports**

In [None]:
!pip install transformers -q
!pip install huggingface_hub -q
!pip install wikipedia -q
!pip install spacy spacy-transformers -q
!python -m spacy download en_core_web_trf -q
!python -m spacy download en_core_web_lg -q

In [None]:
from spacy import displacy
from spacy import load
from spacy.tokens import DocBin
from spacy.util import filter_spans
from spacy import blank

In [None]:
import json
from tqdm import tqdm
from random import randrange
from transformers import pipeline
from huggingface_hub import InferenceClient
import wikipedia
import pandas

#**RoBERTa NER**

In [None]:
def display_labeled_entities(nlp_output):
  displacy.render(nlp_output, style="ent", jupyter=True)
  displacy.render(nlp_output, style="dep", jupyter=True)
  for entity in nlp_output.ents:
    print(entity.text, "- labeled as", entity.label_)

In [None]:
nlp_roberta = load("en_core_web_trf")

In [None]:
prompt = "What is the prevalence of stroke in Columbia, Missouri, for the last 3 years?"

In [None]:
prompt_roberta_NER = nlp_roberta(prompt)

In [None]:
display_labeled_entities(prompt_roberta_NER)

Columbia - labeled as GPE
Missouri - labeled as GPE
the last 3 years - labeled as DATE


#**Prepare training data**

In [None]:
with open("health.txt", "w", encoding="utf-8") as file:
  file.write(wikipedia.page("health").content)

In [None]:
wikipedia.page("healthcare").content

'Health care, or healthcare, is the improvement of health via the prevention, diagnosis, treatment, amelioration or cure of disease, illness, injury, and other physical and mental impairments in people. Health care is delivered by health professionals and allied health fields. Medicine, dentistry, pharmacy, midwifery, nursing, optometry, audiology, psychology, occupational therapy, physical therapy, athletic training, and other health professions all constitute health care. The term includes work done in providing primary care, secondary care, tertiary care, and public health.\nAccess to healthcare may vary across countries, communities, and individuals, influenced by social and economic conditions and health policies. Providing health care services means "the timely use of personal health services to achieve the best possible health outcomes". Factors to consider in terms of healthcare access include financial limitations (such as insurance coverage), geographical and logistical barri

In [None]:
ISO_dataframe = pandas.read_json("/content/drive/MyDrive/College/Research/Su2024/ISO CategoriesWiki.json")

In [None]:
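def write_to_txt(dataframe) -> list:
  # Write each category's Wikipedia text to its own .txt file and
  # return the names of the categories that could not be written
  error_rows = []
  for _, row in dataframe.iterrows():
    category = row["Cat Name"]
    try:
      with open(f"/content/drive/MyDrive/Su2024/SpacyISO/{category}.txt", "w", encoding='utf-8') as file:
        file.write(row["Wikipedia"])
    except Exception: # e.g. an invalid file name or a missing Wikipedia text
      print(category)
      error_rows.append(category)
  return error_rows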

In [None]:
error_rows = write_to_txt(ISO_dataframe)

In [None]:
error_rows = ['information pertaining to earth sciences Examples: geophysical features and processes, geology, minerals, sciences dealing with the composition, structure and origin of the earth’s rocks, risks of earthquakes, volcanic activity, landslides, gravity information, soils, permafrost, hydrogeology, erosion', 'political and administrative boundaries', 'revenue', 'altitude', 'slope', 'pollution', 'sciences dealing with composition', 'gravity', 'erosion', 'substance abuse', 'water utilization plans', 'recreational areas and activities', 'airports/airstrips', 'tunnels nautical charts', 'data communication']
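Several of the failed names above contain `/` or are very long, both of which can make `open()` fail when the category is used directly as a file name. This is only one possible cause, since other rows may have failed for a different reason such as a missing Wikipedia text. The sketch below shows one way to derive a safer file name. `safe_filename` and its `max_length` parameter are hypothetical and not part of the original notebook.

```python
import re

def safe_filename(category: str, max_length: int = 100) -> str:
    """Turn a category name into a string that is safer to use as a file name."""
    # Replace path separators and other characters that common file systems reject.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", category)
    # Collapse runs of whitespace (including newlines) and trim both ends.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # Truncate very long names; max_length is an arbitrary, assumed limit.
    return cleaned[:max_length]

print(safe_filename("airports/airstrips"))  # airports_airstrips
```

If a helper like this were used in `write_to_txt`, the same helper would also have to be applied wherever the files are read back, so that the names match.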

In [None]:
def generate_random_color() -> str:
  # Generate a random integer in the range [0, 2**24), one value per 24-bit RGB color
  color = randrange(0, 2**24)
  # Format it as a six-digit hexadecimal string; zero padding keeps small values valid (e.g. "#0000ff")
  return f"#{color:06x}"
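A hex color is normally written with six digits, so small random values need zero padding. The sketch below contrasts plain `hex()` output with padded formatting. `to_hex_color` is a hypothetical helper used only for illustration.

```python
from random import randrange

def to_hex_color(value: int) -> str:
    """Format an integer in [0, 2**24) as a six-digit hex color string."""
    return f"#{value:06x}"

# hex() drops leading zeros, so small values give a string that is too short.
print(hex(255).replace("0x", "#"))  # #ff
# Zero padding always yields "#" followed by six hexadecimal digits.
print(to_hex_color(255))            # #0000ff
print(to_hex_color(randrange(0, 2**24)))
```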

In [None]:
def write_to_json(data, filename):
  with open(f"./SpacyISO/{filename}.json", "w", encoding='utf-8') as file:
    json.dump(data, file)

In [None]:
def read_from_json(filename):
  with open(f"./SpacyISO/{filename}.json", "r", encoding='utf-8') as file:
    data = json.load(file)
  return data

In [None]:
def remove_empty_entities(data):
  data["annotations"] = [label for label in data["annotations"] if label[1]["entities"]]
  return data

In [None]:
def list_to_tuple(data):
  for annotation in data["annotations"]:
    entity_labels = annotation[1]["entities"]
    tuple_labels = []
    for entity_label in entity_labels:
      tuple_labels.append(tuple(entity_label))
    annotation[1]["entities"] = tuple_labels
  return data

In [None]:
def to_tuple(data):
  for i in range(0, len(data)):
    data[i]["entities"] = [(entity[0], entity[1], entity[2]) for entity in data[i]["entities"]]
  return data

In [None]:
def list_to_dict(data):
  annotations = []
  for annotation in data["annotations"]:
    dictionary = {"text": annotation[0], "entities": annotation[1]["entities"]}
    annotations.append(dictionary)
  data["annotations"] = annotations
  return data

In [None]:
def spacy_train_format(ISO_Category: str):
  data = read_from_json(ISO_Category)
  data = remove_empty_entities(data)
  data = list_to_tuple(data)
  data = list_to_dict(data)
  output_name = data["classes"][0] # Use the entity type name from "classes" as the JSON file's name
  write_to_json(data["annotations"], output_name) # Write only the "annotations" item
  data = read_from_json(output_name) # Read back the JSON file that was just created
  return data
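The helper functions above assume an export with a `"classes"` list and an `"annotations"` list of `[text, {"entities": [[start, end, label], ...]}]` pairs. The sketch below runs the same steps (drop unlabeled texts, convert labels to tuples, build dictionaries) on a small invented sample. The sample data and the name `to_training_examples` are assumptions for illustration, not taken from the original annotation files.

```python
# Invented sample in the shape that the helper functions above expect.
export = {
    "classes": ["WEATHER"],
    "annotations": [
        ["It snowed in Bakersfield.", {"entities": [[3, 9, "WEATHER"]]}],
        ["No labels here.", {"entities": []}],
    ],
}

def to_training_examples(data: dict) -> list:
    """Combine the three conversion steps in a single pass."""
    examples = []
    for text, labels in data["annotations"]:
        if not labels["entities"]:
            continue  # skip texts that have no labeled entities
        entities = [tuple(entity) for entity in labels["entities"]]
        examples.append({"text": text, "entities": entities})
    return examples

examples = to_training_examples(export)
print(examples)
# [{'text': 'It snowed in Bakersfield.', 'entities': [(3, 9, 'WEATHER')]}]
```

Note that the tuples become lists again once the data is written to JSON, which is presumably why `to_tuple` is applied again after the file is read back.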

In [None]:
def add_roberta_labels(data):
  roberta = load("en_core_web_trf") # Requires the en_core_web_trf package installed above
  data = to_tuple(data)
  for i in range(0, len(data)):
    labeled_document = roberta(data[i]["text"])
    for labeled in labeled_document.ents:
      data[i]["entities"].append((labeled.start_char, labeled.end_char, labeled.label_))
  return data

In [None]:
def append_train_json(data):
  train_data = read_from_json("train")
  for example in data:
    train_data.append(example)
  write_to_json(train_data, "train")
  return read_from_json("train")
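`append_train_json` expects `train.json` to exist already, and running it twice on the same category appends the same examples twice. The sketch below is one possible variant that starts from an empty list when the file is missing and skips texts that are already stored. `append_examples` and the temporary path are hypothetical and only serve the illustration.

```python
import json
import tempfile
from pathlib import Path

def append_examples(new_examples: list, path: Path) -> list:
    """Append examples to a JSON list on disk, skipping texts already present."""
    # Start from an empty list if the file has not been created yet.
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    seen = {example["text"] for example in existing}
    for example in new_examples:
        if example["text"] not in seen:
            existing.append(example)
            seen.add(example["text"])
    path.write_text(json.dumps(existing), encoding="utf-8")
    return existing

with tempfile.TemporaryDirectory() as folder:
    train_path = Path(folder) / "train.json"
    batch = [{"text": "It snowed.", "entities": [[3, 9, "WEATHER"]]}]
    append_examples(batch, train_path)             # creates the file
    combined = append_examples(batch, train_path)  # second run adds nothing
    print(len(combined))  # 1
```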

In [None]:
def process_train_data(data):
  data = add_roberta_labels(data)
  data = append_train_json(data) # Add to json with all other ISO category training data, returns all training examples from all ISO categories
  return data

In [None]:
ISO_name = "weather"

In [None]:
data = spacy_train_format(ISO_name)
data = process_train_data(data)
data = read_from_json("train")

#**Store spacy DocBin**

In [None]:
doc_bin_path = "./train_folder"
nlp = load(doc_bin_path)
doc_bin = DocBin()



In [None]:
for train_example in tqdm(data):
  entities = []
  document = nlp.make_doc(train_example["text"])
  entity_labels = train_example["entities"]
  for start, end, entity_label in entity_labels:
    span = document.char_span(start, end, label=entity_label, alignment_mode="contract")
    if span is None:
      print("Skipping", (start, end, entity_label))
    else:
      entities.append(span)
  filtered_entities = filter_spans(entities)
  document.ents = filtered_entities
  doc_bin.add(document)
doc_bin.to_disk(doc_bin_path + "/train.spacy")

100%|██████████| 484/484 [00:00<00:00, 505.11it/s]
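Because the hand-made labels are combined with the labels from `en_core_web_trf`, two labels can cover the same characters. `filter_spans` resolves this by preferring the longest span, and the first one when lengths are equal. The sketch below imitates that rule on plain `(start, end, label)` character offsets. It is an approximation for illustration only: spaCy works on token spans, and `filter_overlapping` is a hypothetical name.

```python
def filter_overlapping(entities: list) -> list:
    """Keep the longest non-overlapping (start, end, label) tuples."""
    # Longest span first; ties go to the span that starts earlier.
    ordered = sorted(entities, key=lambda e: (e[1] - e[0], -e[0]), reverse=True)
    kept = []
    for start, end, label in ordered:
        # Keep the span only if it does not overlap any span kept so far.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

# Offsets refer to the text "It snowed in Bakersfield California".
entities = [(3, 9, "WEATHER"), (13, 24, "GPE"), (13, 35, "GPE")]
print(filter_overlapping(entities))
# [(3, 9, 'WEATHER'), (13, 35, 'GPE')]
```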


In [None]:
!python -m spacy init fill-config /content/drive/MyDrive/Su2024/SpacyISO/base_config.cfg config.cfg

✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [None]:
!python -m spacy train config.cfg --output ./ --paths.train /content/drive/MyDrive/Su2024/SpacyISO/train.spacy --paths.dev /content/drive/MyDrive/Su2024/SpacyISO/train.spacy

ℹ Saving to output directory: .
ℹ Using CPU
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     36.45    0.00    0.00    0.00    0.00
  0     200        565.47   5080.06   18.99   30.95   13.70    0.19
  1     400       1272.35   4642.35   38.76   45.12   33.98    0.39
  1     600        995.38   4026.25   37.36   51.11   29.44    0.37
  2     800       1548.49   4689.26   56.25   60.79   52.35    0.56
  3    1000       3601.85   5105.33   62.25   61.50   63.03    0.62
  4    1200        620.06   4649.31   70.64   72.49   68.88    0.71
  6    1400        522.40   4832.18   74.99   74.44   75.56    0.75
  8    1600        993.30   5060.21   81.
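The training command above passes the same `train.spacy` file to both `--paths.train` and `--paths.dev`, so the ENTS_F, ENTS_P and ENTS_R columns are measured on data the model has already seen. A held-out split gives a more realistic estimate. The sketch below shows one way to split the example list before it is converted to `DocBin` files. `split_examples`, the 20% fraction and the placeholder examples are assumptions for illustration.

```python
import random

def split_examples(examples: list, dev_fraction: float = 0.2, seed: int = 0):
    """Shuffle a copy of the examples and split it into (train, dev) lists."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    dev_size = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[dev_size:], shuffled[:dev_size]

# Placeholder examples; 484 matches the number of examples in the progress bar above.
examples = [{"text": f"example {i}", "entities": []} for i in range(484)]
train_examples, dev_examples = split_examples(examples)
print(len(train_examples), len(dev_examples))  # 388 96
```

Each list would then be written to its own file, for example `train.spacy` and `dev.spacy`, and passed to `--paths.train` and `--paths.dev` respectively.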

In [None]:
trained_ner = load("model-best")

In [None]:
document_to_label = """Number of times it snowed in Bakersfield California and then what is the rate of employment given salary less than 10000?"""
labeled_document = trained_ner(document_to_label) # Label with the newly trained pipeline loaded above

In [None]:
colors = {
    "LABOR": generate_random_color(),
    "MENTAL AND PHYSICAL HEALTH": generate_random_color(),
    "CLIMATE": generate_random_color(),
    "DEMOGRAPHIC": generate_random_color(),
    "INDUSTRY": generate_random_color(),
    "EDUCATION": generate_random_color(),
    "SCHOOL DISTRICT": generate_random_color(),
    "WEATHER": generate_random_color(),
}
options = {"colors": colors}
displacy.render(labeled_document, style="ent", options=options, jupyter=True)