## NEREL:

Each entity is represented by a string of the following format:
```html
"<id>\t<type> <start> <stop>\t<text>", where
<id> is the entity id,
<type> is one of the entity types,
<start> is the position of the first character of the entity in the text,
<stop> is the position of the last character of the entity in the text, plus 1 (an exclusive end),
<text> is the entity text.
```
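As a quick illustration of the entity format, the sketch below splits one entity string and checks that slicing the document with `<start>` and `<stop>` recovers the mention. The function name `parse_entity_line` and the sample strings are hypothetical, not taken from the dataset. The `;` handling assumes that discontinuous entities list several `start stop` pairs, which is what the conversion code later in this notebook expects.

```python
from typing import List, Tuple


def parse_entity_line(line: str) -> Tuple[str, str, List[Tuple[int, int]], str]:
    """Split one NEREL entity string into (id, type, spans, text)."""
    entity_id, type_and_offsets, text = line.split("\t", 2)
    entity_type, offsets = type_and_offsets.split(" ", 1)
    spans = []
    # Assumption: discontinuous entities separate "start stop" pairs with ';'.
    for pair in offsets.split(";"):
        start, stop = map(int, pair.split())
        spans.append((start, stop))
    return entity_id, entity_type, spans, text


# Hypothetical sample document and entity line.
doc_text = "Bishkek is the capital of Kyrgyzstan."
entity_id, entity_type, spans, mention = parse_entity_line("T1\tCITY 0 7\tBishkek")

# <stop> is exclusive, so plain slicing recovers the mention.
start, stop = spans[0]
recovered = doc_text[start:stop]
print(entity_id, entity_type, spans, recovered)
```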

Each relation is represented by a string of the following format:
```html
"<id>\t<type> Arg1:<arg1_id> Arg2:<arg2_id>", where
<id> is the relation id,
<type> is one of the relation types,
<arg1_id> and <arg2_id> are entity ids.
```
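A relation only carries entity ids, so it is usually resolved against the entities before use. The sketch below shows one way to do that. `parse_relation_line`, the sample relation string, and the `entity_text` lookup are all hypothetical.

```python
def parse_relation_line(line: str) -> dict:
    """Split one NEREL relation string into its id, type and argument ids."""
    relation_id, body = line.split("\t", 1)
    relation_type, arg1, arg2 = body.split()
    return {
        "id": relation_id,
        "type": relation_type,
        # "Arg1:T1" -> "T1"
        "arg1_id": arg1.split(":", 1)[1],
        "arg2_id": arg2.split(":", 1)[1],
    }


# Hypothetical lookup from entity id to mention text.
entity_text = {"T1": "Bishkek", "T2": "Kyrgyzstan"}

relation = parse_relation_line("R1\tLOCATED_IN Arg1:T1 Arg2:T2")
triple = (
    entity_text[relation["arg1_id"]],
    relation["type"],
    entity_text[relation["arg2_id"]],
)
print(triple)
```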

Each link is represented by a string of the following format:
```html
"<id>\tReference <ent_id> <link>\t<text>", where
<id> is the link id,
<ent_id> is an entity id,
<link> is a reference to a knowledge base entity (example: "Wikidata:Q1879675" if the link exists, otherwise "Wikidata:NULL"),
<text> is the name of the entity in the knowledge base if the link exists, otherwise an empty string.
```
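The `Wikidata:NULL` convention means unlinked entities have to be filtered explicitly. A minimal sketch, assuming the field layout described above: `parse_link_line` is a hypothetical helper, and the name "Example name" is a placeholder rather than the real label of that Wikidata item.

```python
def parse_link_line(line: str) -> dict:
    """Split one NEREL link string; kb_id is None when the entity is unlinked."""
    parts = line.split("\t")
    link_id = parts[0]
    _keyword, ent_id, reference = parts[1].split()
    kb, _, kb_id = reference.partition(":")
    # The name field may be empty (or missing) for unlinked entities.
    name = parts[2] if len(parts) > 2 else ""
    return {
        "id": link_id,
        "ent_id": ent_id,
        "kb": kb,
        "kb_id": None if kb_id == "NULL" else kb_id,
        "name": name,
    }


linked = parse_link_line("N1\tReference T1 Wikidata:Q1879675\tExample name")
unlinked = parse_link_line("N2\tReference T2 Wikidata:NULL\t")
print(linked["kb_id"], unlinked["kb_id"])
```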

## RELIK:

### Entity Linking

All your data should have the following structure:

```json
{
  "doc_id": int,  # Unique identifier for the document
  "doc_text": txt,  # Text of the document
  "doc_span_annotations": # Char level annotations
    [
      [start, end, label],
      [start, end, label],
      ...
    ]
}
```
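Since the annotations here are character offsets into `doc_text`, a small sanity check can catch off-by-one errors before training. The sketch below is only an illustration: `check_el_record` and the sample record are hypothetical, and it assumes `end` is exclusive, as `<stop>` is in NEREL.

```python
def check_el_record(record: dict) -> list:
    """Return a list of problems found in one entity-linking record."""
    problems = []
    text = record["doc_text"]
    for start, end, _label in record["doc_span_annotations"]:
        # Assumption: end is exclusive, so end may equal len(text).
        if not (0 <= start < end <= len(text)):
            problems.append(f"span ({start}, {end}) is outside the text")
        elif text[start:end] != text[start:end].strip():
            problems.append(f"span ({start}, {end}) has surrounding whitespace")
    return problems


# Hypothetical record.
record = {
    "doc_id": 0,
    "doc_text": "Bishkek is the capital of Kyrgyzstan.",
    "doc_span_annotations": [[0, 7, "Bishkek"], [26, 36, "Kyrgyzstan"]],
}
print(check_el_record(record))
```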

### Relation extraction

```json
{
  "doc_id": int,  # Unique identifier for the document
  "doc_words": list[txt], # Tokenized text of the document
  "doc_span_annotations": # Token level annotations of mentions (label is optional)
    [
      [start, end, label],
      [start, end, label],
      ...
    ],
  "doc_triplet_annotations": # Triplet annotations
  [
    {
      "subject": [start, end, label], # label is optional
      "relation": name, # type is optional
      "object": [start, end, label], # label is optional
    },
    {
      "subject": [start, end, label], # label is optional
      "relation": name, # type is optional
      "object": [start, end, label], # label is optional
    },
  ]
}
```
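Because the spans in this format index into `doc_words` rather than into characters, a bounds check is a cheap way to catch records whose offsets were left at character level. The sketch below is hypothetical (`check_re_record` and both sample records are made up) and assumes the token-level `end` is exclusive.

```python
def check_re_record(record: dict) -> list:
    """Return a list of problems found in one relation-extraction record."""
    n_tokens = len(record["doc_words"])
    problems = []

    def in_bounds(span) -> bool:
        # Assumption: end is exclusive, so end may equal the token count.
        start, end = span[0], span[1]
        return 0 <= start < end <= n_tokens

    for span in record.get("doc_span_annotations", []):
        if not in_bounds(span):
            problems.append(f"mention {span[:2]} is outside the {n_tokens} tokens")
    for triplet in record.get("doc_triplet_annotations", []):
        for role in ("subject", "object"):
            if not in_bounds(triplet[role]):
                problems.append(f"{role} {triplet[role][:2]} is outside the {n_tokens} tokens")
    return problems


words = ["Bishkek", "is", "the", "capital", "of", "Kyrgyzstan", "."]
good = {
    "doc_id": 0,
    "doc_words": words,
    "doc_span_annotations": [[0, 1, "CITY"], [5, 6, "COUNTRY"]],
    "doc_triplet_annotations": [
        {"subject": [0, 1, "CITY"], "relation": "LOCATED_IN", "object": [5, 6, "COUNTRY"]}
    ],
}
# Same document, but the second mention still uses character offsets.
bad = {
    "doc_id": 1,
    "doc_words": words,
    "doc_span_annotations": [[0, 1, "CITY"], [26, 36, "COUNTRY"]],
    "doc_triplet_annotations": [
        {"subject": [0, 1, "CITY"], "relation": "LOCATED_IN", "object": [26, 36, "COUNTRY"]}
    ],
}
print(check_re_record(good))
print(check_re_record(bad))
```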

### Retriever:

We train the retriever in two steps. First, we "pre-train" it on the BLINK dataset (Wu et al., 2019), and then we "fine-tune" it on AIDA (Hoffart et al., 2011).

#### Data Preparation

The retriever requires a dataset in a format similar to DPR: a jsonl file where each line is a dictionary with the following keys:

```json
{
  "question": "....",
  "positive_ctxs": [{
    "title": "...",
    "text": "...."
  }],
  "negative_ctxs": [{
    "title": "...",
    "text": "...."
  }],
  "hard_negative_ctxs": [{
    "title": "...",
    "text": "...."
  }]
}
```
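To make the jsonl layout concrete, the sketch below builds one record with the keys listed above and serializes it as a single line. The helper `make_retriever_record` and all strings are placeholders invented for the example. What goes into `question` and the context fields depends on the task.

```python
import json


def make_retriever_record(question, positives, negatives=(), hard_negatives=()):
    """Build one DPR-style record; each context is a (title, text) pair."""

    def as_ctx(pair):
        title, text = pair
        return {"title": title, "text": text}

    return {
        "question": question,
        "positive_ctxs": [as_ctx(p) for p in positives],
        "negative_ctxs": [as_ctx(n) for n in negatives],
        "hard_negative_ctxs": [as_ctx(h) for h in hard_negatives],
    }


# Placeholder content.
record = make_retriever_record(
    question="Bishkek is the capital of Kyrgyzstan.",
    positives=[("Bishkek", "Placeholder description of the positive passage.")],
    hard_negatives=[("Osh", "Placeholder description of a hard negative passage.")],
)

# One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
line = json.dumps(record, ensure_ascii=False)
print(line)
```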

The retriever also needs an index to search for the documents. The documents to index can be either a JSONL file or a TSV file similar to DPR:

- jsonl: each line is a JSON object with the following keys: id, text, metadata
- tsv: each line is a tab-separated string with the id and text columns, followed by any other columns, which are stored in the metadata field

jsonl example:
```json
{
  "id": "...",
  "text": "...",
  "metadata": ["{...}"]
},
...
```

tsv example:

id \t text \t any other column
...
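The two document formats carry the same information, so one can be converted into the other. A minimal sketch follows, with several assumptions: `tsv_to_jsonl` is a hypothetical helper, the extra columns are stored as a plain list under `metadata` (the exact shape the indexer expects is not stated here), and a header row is skipped only when `has_header=True`.

```python
import json


def tsv_to_jsonl(tsv_text: str, has_header: bool = False) -> str:
    """Convert 'id <tab> text <tab> extra...' rows into JSONL lines."""
    rows = [line.split("\t") for line in tsv_text.splitlines() if line.strip()]
    if has_header:
        rows = rows[1:]
    records = []
    for row in rows:
        if len(row) < 2:
            raise ValueError(f"expected at least id and text, got {row!r}")
        doc_id, text, *extra = row
        # Assumption: remaining columns go into "metadata" as a list.
        records.append({"id": doc_id, "text": text, "metadata": extra})
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Made-up rows.
sample_tsv = "D1\tFirst document text\textra-a\nD2\tSecond document text\textra-b"
jsonl = tsv_to_jsonl(sample_tsv)
print(jsonl)
```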

In [27]:
from datasets import load_dataset
nerel_dataset = load_dataset('MalakhovIlya/NEREL', trust_remote_code=True)

In [28]:
# Access the features
print(nerel_dataset['train'].features)

# Access the data
print(nerel_dataset['train'][0])

{'id': Value(dtype='int32', id=None), 'text': Value(dtype='string', id=None), 'entities': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'relations': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'links': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
{'id': 0, 'text': 'Пулеметы, автоматы и снайперские винтовки изъяты в арендуемом американцами доме в Бишкеке\n\n05/08/2008 10:35\n\nБИШКЕК, 5 августа /Новости-Грузия/. Правоохранительные органы Киргизии обнаружили в доме, арендуемом гражданами США в Бишкеке, пулеметы, автоматы и снайперские винтовки, сообщает во вторник пресс-служба МВД Киргизии.\n\n"В ходе проведения оперативно-профилактического мероприятия под кодовым названием "Арсенал" в новостройке Ынтымак, в доме, принадлежащем 66-летнему гражданину Киргизии и арендуемом гражданами США, обнаружены и изъяты: шесть крупнокалиберных пулеметов с оптическим прицелом и с приборами ночного видения, 26 автоматов калибра 5

In [31]:
import json

# Function to parse NEREL entities
def parse_entities(entities):
    parsed_entities = []
    for entity in entities:
        # Format: "<id>\t<type> <start> <stop>[;<start> <stop>...]\t<text>"
        entity_id, type_and_offsets, text = entity.split('\t', 2)
        entity_type, offsets = type_and_offsets.split(' ', 1)
        entity_segments = []
        # A single segment is just the one-pair case of the ';'-separated list
        for pair in offsets.split(';'):
            start, end = map(int, pair.split())
            entity_segments.append({
                "start": start,
                "end": end,
                "type": entity_type,
                "text": text
            })
        parsed_entities.append({
            "id": entity_id,
            "segments": entity_segments
        })
    return parsed_entities

# Function to parse NEREL relations
def parse_relations(relations):
    parsed_relations = []
    for relation in relations:
        parts = relation.split('\t')
        relation_id = parts[0]
        relation_type = parts[1].split()[0]
        arg1_id = parts[1].split()[1].split(':')[1]
        arg2_id = parts[1].split()[2].split(':')[1]
        parsed_relations.append({
            "id": relation_id,
            "type": relation_type,
            "arg1_id": arg1_id,
            "arg2_id": arg2_id
        })
    return parsed_relations

# Function to parse NEREL links
def parse_links(links):
    parsed_links = []
    for link in links:
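        parts = link.split('\t')
        link_id = parts[0]
        ent_id = parts[1].split()[1]
        wikidata_link = parts[1].split()[2]
        # The name is empty for unlinked entities; tolerate a missing trailing field
        text = parts[2] if len(parts) > 2 else ''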
        parsed_links.append({
            "id": link_id,
            "ent_id": ent_id,
            "link": wikidata_link,
            "text": text
        })
    return parsed_links

# Function to transform NEREL dataset to RELIK format for Entity Linking
def transform_nerel_to_relik_entity_linking(nerel_dataset):
    relik_data = []
    for doc_id, example in enumerate(nerel_dataset):
        entities = parse_entities(example['entities'])

        relik_example = {
            "doc_id": doc_id,
            "doc_text": example['text'],
            "doc_span_annotations": [
                [segment["start"], segment["end"], segment["type"]]
                for entity in entities for segment in entity["segments"]
            ]
        }
        relik_data.append(relik_example)

    return relik_data

# Function to transform NEREL dataset to RELIK format for Relation Extraction
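def transform_nerel_to_relik_relation_extraction(nerel_dataset):
    relik_data = []
    for doc_id, example in enumerate(nerel_dataset):
        text = example['text']
        entities = parse_entities(example['entities'])
        relations = parse_relations(example['relations'])

        # Whitespace tokens and their character offsets [start, end)
        words = text.split()
        offsets = []
        position = 0
        for word in words:
            start = text.index(word, position)
            position = start + len(word)
            offsets.append((start, position))

        # NEREL offsets are character-based, while RELIK expects token-level
        # spans here. Map a character segment to the tokens it overlaps.
        # Assumption: the token-level end index is exclusive, like NEREL's <stop>.
        def to_token_span(segment):
            covered = [
                i for i, (t_start, t_end) in enumerate(offsets)
                if t_start < segment["end"] and t_end > segment["start"]
            ]
            return [covered[0], covered[-1] + 1, segment["type"]]

        # Create a mapping from entity IDs to their details
        entity_map = {entity["id"]: entity for entity in entities}

        relik_example = {
            "doc_id": doc_id,
            "doc_words": words,  # Tokenized text
            "doc_span_annotations": [
                to_token_span(segment)
                for entity in entities for segment in entity["segments"]
            ],
            "doc_triplet_annotations": [
                {
                    "subject": to_token_span(segment1),
                    "relation": relation["type"],
                    "object": to_token_span(segment2)
                }
                for relation in relations
                for segment1 in entity_map[relation["arg1_id"]]["segments"]
                for segment2 in entity_map[relation["arg2_id"]]["segments"]
            ]
        }
        relik_data.append(relik_example)

    return relik_data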

# Transform and save the datasets for each split
splits = ['train', 'test', 'dev']
for split in splits:
    # Transform the dataset to RELIK format for Entity Linking
    relik_entity_linking_dataset = transform_nerel_to_relik_entity_linking(nerel_dataset[split])
    # Transform the dataset to RELIK format for Relation Extraction
    relik_relation_extraction_dataset = transform_nerel_to_relik_relation_extraction(nerel_dataset[split])

    # Save the datasets to files
    with open(f'/kaggle/working/data_{split}_entity_linking.json', 'w', encoding='utf-8') as f:
        json.dump(relik_entity_linking_dataset, f, indent=2, ensure_ascii=False)

    with open(f'/kaggle/working/data_{split}_relation_extraction.json', 'w', encoding='utf-8') as f:
        json.dump(relik_relation_extraction_dataset, f, indent=2, ensure_ascii=False)
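
The entity-linking transformation above labels each span with its NEREL entity type, and `parse_links` is never called. If the span label should instead identify the knowledge base entry, one option is to join entities with their links and keep only the linked ones. This is a sketch under assumptions: `linked_span_annotations` is a hypothetical helper, the Wikidata id in the sample is made up, and whether RELIK expects ids or entity titles as labels is not confirmed here.

```python
def linked_span_annotations(entities, links):
    """Return [start, end, kb_id] for every segment of every linked entity."""
    # Map entity id -> KB id, skipping "Wikidata:NULL" links.
    labels = {}
    for link in links:
        parts = link.split("\t")
        _keyword, ent_id, reference = parts[1].split()
        kb_id = reference.split(":", 1)[1]
        if kb_id != "NULL":
            labels[ent_id] = kb_id

    spans = []
    for entity in entities:
        entity_id, type_and_offsets, _text = entity.split("\t", 2)
        if entity_id not in labels:
            continue
        offsets = type_and_offsets.split(" ", 1)[1]
        for pair in offsets.split(";"):
            start, end = map(int, pair.split())
            spans.append([start, end, labels[entity_id]])
    return spans


# Made-up sample; "Q0000001" is a placeholder identifier.
sample_entities = ["T1\tCITY 0 7\tBishkek", "T2\tCOUNTRY 26 36\tKyrgyzstan"]
sample_links = [
    "N1\tReference T1 Wikidata:Q0000001\tBishkek",
    "N2\tReference T2 Wikidata:NULL\t",
]
annotations = linked_span_annotations(sample_entities, sample_links)
print(annotations)
```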