# Named Entity Recognition (NER) Data Preparation

This notebook demonstrates the process of preparing data for Named Entity Recognition (NER) using the `datasets` library. The workflow includes:

1. **Loading the Dataset**: The dataset is loaded using the `load_dataset` function from the `datasets` library.
2. **Normalization and Parsing**: Helper functions normalize tokens, locate an entity's tokens within the sentence, and parse each example into entities and their labels.
3. **Dataset Processing**: Each example gains a `tokenized_text` field (whitespace-split tokens) and a `ner` field holding `[start, end, label]` spans with inclusive token indices.
4. **Data Splitting**: The processed data is shuffled and split 90/10 into training and testing sets.
5. **Saving the Data**: The training and testing sets are saved as JSON files for further use.

The resulting records carry whitespace tokens and token-level entity spans, which makes them ready for training GLiNER models.

In [1]:
from datasets import load_dataset
import ast
import string

def normalize_token(token):
    # Strip leading/trailing punctuation and lowercase the token.
    # Punctuation inside the token (e.g. "70,757") is left untouched.
    return token.strip(string.punctuation).lower()

def find_sublist_indices_norm(main_list, sub_list):
    """
    Return (start_index, end_index) if the normalized sub_list occurs 
    consecutively in the normalized main_list, else return None.
    """
    norm_main = [normalize_token(t) for t in main_list]
    norm_sub = [normalize_token(t) for t in sub_list]
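    n = len(norm_main)
    m = len(norm_sub)
    # An empty entity would otherwise match at index 0 and yield the span (0, -1).
    if m == 0:
        return None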
    for i in range(n - m + 1):
        if norm_main[i : i + m] == norm_sub:
            return i, i + m - 1
    return None

def parse_example(example):
    # Split the input text by whitespace.
    text = example["input"]
    tokens = text.split()
    
    # Convert the output field (a string representation of a list) to an actual list.
    # Malformed annotation strings fall back to an empty list (no entities).
    try:
        annotations = ast.literal_eval(example["output"])
    except Exception:
        annotations = []
    
    spans = []
    for ann in annotations:
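        # Each annotation is expected to have the format: "Entity text <> Label"
        if not isinstance(ann, str) or "<>" not in ann:
            continue
        # Split on the first "<>" only, so a stray separator cannot break unpacking.
        ent_text, ent_label = [x.strip() for x in ann.split("<>", 1)]
        # Split the entity text on whitespace, the same way the input text is tokenized.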
        ent_tokens = ent_text.split()
        indices = find_sublist_indices_norm(tokens, ent_tokens)
        if indices is not None:
            start, end = indices
            spans.append([start, end, ent_label])
    
    # Add new keys as expected by GLiNER.
    example["tokenized_text"] = tokens
    example["ner"] = spans
    return example
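
To make the matching behaviour concrete, the sketch below restates the two helpers in compact form and probes a few edge cases. It assumes tokens come from whitespace splitting, as in `parse_example`; the sample sentence is made up for illustration and is not taken from the dataset.

```python
import string

def normalize_token(token):
    # Strip punctuation from both ends only, then lowercase.
    return token.strip(string.punctuation).lower()

def find_sublist_indices_norm(main_list, sub_list):
    # Compact restatement of the helper above: first match, inclusive end index.
    norm_main = [normalize_token(t) for t in main_list]
    norm_sub = [normalize_token(t) for t in sub_list]
    m = len(norm_sub)
    for i in range(len(norm_main) - m + 1):
        if norm_main[i : i + m] == norm_sub:
            return i, i + m - 1
    return None

# Illustrative sentence (an assumption, not a dataset row).
tokens = "Published by MIT Press, in the U.S. (1997).".split()

edge = normalize_token("Press,")   # trailing comma removed
inner = normalize_token("U.S.")    # inner period kept: only the edges are stripped
span = find_sublist_indices_norm(tokens, ["MIT", "Press"])
missing = find_sublist_indices_norm(tokens, ["Oxford", "Press"])

print(edge, inner, span, missing)  # press u.s (2, 3) None
```

Because only edge punctuation is removed, an entity whose surface form differs from the text in its inner punctuation will not align and is dropped without any warning.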

In [2]:

# Load the NuNER dataset from the Hugging Face Hub.
# Each row has an "input" column (text) and an "output" column (annotations).
ds = load_dataset(
    "numind/NuNER",
    split="entity",  # name of the split to load from this dataset
)

ds

Dataset({
    features: ['input', 'output'],
    num_rows: 1000000
})

In [3]:
example = ds[0]

print(example)

{'input': 'State University of New York Press, 1997.', 'output': "['State University of New York Press <> Publisher']"}
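
The `output` field is a string that looks like a Python list, so it has to be parsed before the entities can be used. The sketch below shows that step in isolation on the row printed above. It uses `str.partition` rather than the `split` call in `parse_example`; that is a swapped-in alternative chosen here because it never raises when the separator appears more than once.

```python
import ast

raw_output = "['State University of New York Press <> Publisher']"

# literal_eval only evaluates Python literals, so it cannot run arbitrary code.
annotations = ast.literal_eval(raw_output)

pairs = []
for ann in annotations:
    # partition splits on the first "<>" and returns an empty separator if absent.
    ent_text, sep, ent_label = ann.partition("<>")
    if sep:
        pairs.append((ent_text.strip(), ent_label.strip()))

print(pairs)  # [('State University of New York Press', 'Publisher')]
```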


In [4]:
parse_example(example)

{'input': 'State University of New York Press, 1997.',
 'output': "['State University of New York Press <> Publisher']",
 'tokenized_text': ['State',
  'University',
  'of',
  'New',
  'York',
  'Press,',
  '1997.'],
 'ner': [[0, 5, 'Publisher']]}
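
The span `[0, 5, 'Publisher']` uses an inclusive end index, which is easy to misread as a Python slice bound. A small sanity check, sketched here with a hypothetical `span_text` helper that is not part of the notebook, rebuilds the entity text from the span.

```python
def span_text(tokens, span):
    # Hypothetical helper: `end` is inclusive, so slice up to end + 1.
    start, end, _label = span
    return " ".join(tokens[start : end + 1])

tokens = ["State", "University", "of", "New", "York", "Press,", "1997."]
ner = [[0, 5, "Publisher"]]

recovered = [(span_text(tokens, s), s[2]) for s in ner]
print(recovered)  # [('State University of New York Press,', 'Publisher')]
```

The recovered text keeps the trailing comma of `Press,` because tokens are produced by whitespace splitting; punctuation is ignored only while matching, not removed from `tokenized_text`.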

In [5]:
parse_example(ds[10])

{'input': 'According to the 2015 census, it has a population of 70,757 people.',
 'output': "['2015 census <> Time', '70,757 people <> Quantity']",
 'tokenized_text': ['According',
  'to',
  'the',
  '2015',
  'census,',
  'it',
  'has',
  'a',
  'population',
  'of',
  '70,757',
  'people.'],
 'ner': [[3, 4, 'Time'], [10, 11, 'Quantity']]}
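
`find_sublist_indices_norm` returns on the first match, so an entity mentioned twice in a sentence produces one span. Whether every mention should be labelled depends on the annotation intent, which the dataset output shown here does not settle. If all mentions are wanted, a variant along the following lines could be used; `find_all_sublist_indices` is a hypothetical name and the sentence is invented for illustration.

```python
import string

def normalize_token(token):
    # Same normalization as in the notebook: edge punctuation off, lowercase.
    return token.strip(string.punctuation).lower()

def find_all_sublist_indices(main_list, sub_list):
    # Hypothetical variant: collect every match instead of stopping at the first.
    norm_main = [normalize_token(t) for t in main_list]
    norm_sub = [normalize_token(t) for t in sub_list]
    m = len(norm_sub)
    if m == 0:
        return []
    return [
        (i, i + m - 1)
        for i in range(len(norm_main) - m + 1)
        if norm_main[i : i + m] == norm_sub
    ]

tokens = "Paris is lovely, and Paris is crowded.".split()
matches = find_all_sublist_indices(tokens, ["Paris"])
print(matches)  # [(0, 0), (4, 4)]
```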

In [6]:
parse_example(ds[10987])

{'input': 'This, of course, makes the new Senator worthy of closer inspection in our latest edition of Watch of the Week.',
 'output': "['Senator <> Political figure', 'Watch of the Week <> Media feature']",
 'tokenized_text': ['This,',
  'of',
  'course,',
  'makes',
  'the',
  'new',
  'Senator',
  'worthy',
  'of',
  'closer',
  'inspection',
  'in',
  'our',
  'latest',
  'edition',
  'of',
  'Watch',
  'of',
  'the',
  'Week.'],
 'ner': [[6, 6, 'Political figure'], [16, 19, 'Media feature']]}

In [7]:
from tqdm import tqdm

gliner_data = [
    parse_example(example) for example in tqdm(ds) if example["input"]
]

100%|██████████| 1000000/1000000 [02:13<00:00, 7516.52it/s]


In [8]:
len(gliner_data)

999997
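
The count is three short of the 1,000,000 rows loaded because the comprehension skips rows whose `input` is empty. Entities that fail to align are a separate, silent loss: `parse_example` drops them without a warning. A rough way to measure that loss is sketched below with a hypothetical `alignment_stats` helper and two toy records; it assumes the records keep their original `output` string alongside the `ner` spans, as they do in this notebook.

```python
import ast

def alignment_stats(records):
    # Hypothetical helper: compare annotated entity count with aligned span count.
    annotated = aligned = 0
    for rec in records:
        try:
            anns = ast.literal_eval(rec["output"])
        except (ValueError, SyntaxError):
            anns = []
        annotated += sum(1 for a in anns if isinstance(a, str) and "<>" in a)
        aligned += len(rec["ner"])
    return annotated, aligned

# Toy records for illustration; the second one simulates a failed alignment.
toy = [
    {"output": "['2015 census <> Time', '70,757 people <> Quantity']",
     "ner": [[3, 4, "Time"], [10, 11, "Quantity"]]},
    {"output": "['New York City <> Location']", "ner": []},
]

annotated, aligned = alignment_stats(toy)
print(annotated, aligned)  # 3 2
```

Running the same function over `gliner_data` would show how many annotated entities survive the token matching step.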

In [9]:
import random
random.seed(41)

random.shuffle(gliner_data)

In [10]:
train = gliner_data[: int(len(gliner_data) * .9)]
test = gliner_data[int(len(gliner_data) * .9): ]

len(train), len(test)

(899997, 100000)
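
The sizes follow from `int(999997 * .9)`, which truncates 899997.3 to 899997 and leaves 100000 records for the test set. The shuffle and the split can also be wrapped in one function, sketched below as a hypothetical `split_train_test` helper; unlike the cells above it shuffles a copy, so the original list order is preserved.

```python
import random

def split_train_test(records, train_fraction=0.9, seed=41):
    # Hypothetical helper: shuffle a copy so the caller's list is left untouched.
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in data with the same length as gliner_data.
records = list(range(999_997))
train, test = split_train_test(records)
print(len(train), len(test))  # 899997 100000
```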

In [None]:
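import json
import os

# Create the output directory if needed; open() does not create missing folders.
os.makedirs("data/data/numer", exist_ok=True)

with open("data/data/numer/train.json", "w") as file:
    json.dump(train, file)

with open("data/data/numer/test.json", "w") as file:
    json.dump(test, file)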
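
Before training on the saved files, it can be worth reloading them and checking that every span is well formed. The sketch below shows one possible round-trip check on a single toy record written to a temporary directory; `validate_records` is a hypothetical helper, and the bounds rule it enforces (inclusive `start <= end` inside the token list) is inferred from the outputs shown earlier in this notebook.

```python
import json
import tempfile
from pathlib import Path

def validate_records(records):
    # Hypothetical check: every span must be [start, end, label] within token bounds.
    for rec in records:
        n = len(rec["tokenized_text"])
        for start, end, label in rec["ner"]:
            assert 0 <= start <= end < n, (start, end, n)
            assert isinstance(label, str)
    return len(records)

sample = [
    {"tokenized_text": ["the", "new", "Senator"],
     "ner": [[2, 2, "Political figure"]]},
]

# Write and reload through a temporary file to mimic the save step.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    reloaded = json.loads(path.read_text(encoding="utf-8"))

checked = validate_records(reloaded)
print(checked)  # 1
```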