# GenIE: Generative Information Extraction

---

## Table of Contents
1. [How to download the required artefacts?](#Download)
2. [How to load the models?](#Loading-the-Models)
3. [How to run inference?](#Inference)
    - [Unconstrained Generation](#Unconstrained-Generation)
    - [Constrained Generation](#Constrainted-Generation)
    - [Extracting the Wikidata Disambiguated Triplet Sets](#Extracting-the-Wikidata-Disambiguated-Triplet-Sets)
4. [Loading models and running inference with Hydra](#Loading-Models-and-Running-Inference-with-Hydra)
5. [How to load and use the datasets?](#Loading-Datasets)
6. Optional
    1. [How to constrain the model with a custom set of strings?](#Constructing-Prefix-Tries-for-A-Custom-Set-of-Strings)
    2. [Loading and Using the WikidataID2Name Dictionaries](#Loading-and-Using-the-WikidataID2Name-Dictionaries)

---

## Download

The data that we release consists of:

1. **Pre-trained Model(s)**
    - Wiki-NRE (W): [Random Initialization](https://zenodo.org/record/6139236/files/genie_w.ckpt)
    - Rebel (R): [Random Initialization](https://zenodo.org/record/6139236/files/genie_r.ckpt) – [Pretrained Language Model](https://zenodo.org/record/6139236/files/genie_plm_r.ckpt) – [Pretrained Entity Linker (GENRE)](https://zenodo.org/record/6139236/files/genie_genre_r.ckpt)
    - Rebel + Wiki-NRE (R+W): [Random Initialization](https://zenodo.org/record/6139236/files/genie_rw.ckpt)
2. [**Prefix Trees (tries) for Constrained Generation**](https://zenodo.org/record/6139236/files/tries.zip)
    - relation trie
    - entity trie
3. **Datasets** \[Not required for inference\] 
    - [Rebel](https://zenodo.org/record/6139236/files/rebel.zip)
    - [FewRel](https://zenodo.org/record/6139236/files/fewrel.zip)
    - [Wikipedia-NRE](https://zenodo.org/record/6139236/files/wikipedia_nre.zip)
    - [Geo-NRE](https://zenodo.org/record/6139236/files/geo_nre.zip)
4. [**World Definitions**](https://zenodo.org/record/6139236/files/world_definitions.zip) \[Not required for inference\] 
5. **Mapping between Unique Names and Wikidata Identifiers** ([used by GenIE](https://zenodo.org/record/6139236/files/surface_form_dicts.zip), [full snapshot](https://zenodo.org/record/6139236/files/surface_form_dicts_from_snapshot.zip)) \[Optional. Necessary for processing data\] 
    - relation name to Wikidata ID (and vice versa)
    - entity name to Wikidata ID (and vice versa)

You can download the data by executing the <code>download_data.sh</code> script. To omit some files, comment out the corresponding lines of the script.

Alternatively, you can access the data [here](https://zenodo.org/record/6139236#.YhJdiJPMJhH).

In [1]:
# If you are using a different directory for your data, update the path below
DATA_DIR="../data"

# To download the data, uncomment and run the following line
#!bash ../download_data.sh $DATA_DIR

# If your working directory is not the GenIE folder, add the path to it to sys.path so that the library can be imported
import os
import sys

sys.path.append("../")
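
Before loading anything, it can help to confirm that the download put the files where the later cells expect them. The following is a minimal sketch, not part of GenIE: `missing_artefacts` is a hypothetical helper, and the relative paths are simply the ones used by the cells in this notebook.

```python
from pathlib import Path


def missing_artefacts(data_dir, relative_paths):
    """Return the relative paths that do not exist under data_dir."""
    root = Path(data_dir)
    return [p for p in relative_paths if not (root / p).exists()]


# Relative paths used by the loading cells in this notebook
EXPECTED = [
    "models/genie_genre_r.ckpt",
    "tries/large/entity_trie.pickle",
    "tries/large/relation_trie.pickle",
    "tries/small/entity_trie.pickle",
    "tries/small/relation_trie.pickle",
]

# "../data" mirrors the DATA_DIR default above; change it if you use another directory
missing = missing_artefacts("../data", EXPECTED)
print("Missing artefacts:", missing or "none")
```

If the list is not empty, re-run the download script or adjust the data directory before continuing.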

## Loading the Models

In [2]:
"""Load the Model"""
from genie.models import GeniePL

ckpt_name = "genie_genre_r.ckpt"
path_to_checkpoint = os.path.join(DATA_DIR, 'models', ckpt_name)
model = GeniePL.load_from_checkpoint(checkpoint_path=path_to_checkpoint)

In [3]:
"""Load the Prefix Tries"""
from genie.constrained_generation import Trie

# Large schema tries (correspond to Rebel; see the paper for details) 
entity_trie_path = os.path.join(DATA_DIR, "tries/large/entity_trie.pickle")
entity_trie = Trie.load(entity_trie_path)

relation_trie_path = os.path.join(DATA_DIR, "tries/large/relation_trie.pickle")
relation_trie = Trie.load(relation_trie_path)

large_schema_tries = {'entity_trie': entity_trie, 'relation_trie': relation_trie}

# Small schema tries (correspond to Wiki-NRE; see the paper for details) 
entity_trie_path = os.path.join(DATA_DIR, "tries/small/entity_trie.pickle")
entity_trie = Trie.load(entity_trie_path)

relation_trie_path = os.path.join(DATA_DIR, "tries/small/relation_trie.pickle")
relation_trie = Trie.load(relation_trie_path)

small_schema_tries = {'entity_trie': entity_trie, 'relation_trie': relation_trie}

To construct a prefix trie for your custom set of strings, see [this section](#Constructing-Prefix-Tries-for-A-Custom-Set-of-Strings).
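
For intuition about what such a trie stores, here is a toy sketch. It is an assumption-laden illustration and not GenIE's `Trie` class: `PrefixTrie` and `next_tokens` are hypothetical names, and words stand in for the token IDs a real trie would hold.

```python
class PrefixTrie:
    """Toy prefix trie over token sequences (illustrative; not GenIE's Trie class)."""

    def __init__(self, sequences=()):
        self.root = {}
        for sequence in sequences:
            self.add(sequence)

    def add(self, sequence):
        """Insert one sequence, creating child nodes as needed."""
        node = self.root
        for token in sequence:
            node = node.setdefault(token, {})

    def next_tokens(self, prefix):
        """Return the sorted tokens that may follow `prefix`, or [] if the prefix is unknown."""
        node = self.root
        for token in prefix:
            if token not in node:
                return []
            node = node[token]
        return sorted(node)


# Words stand in for token IDs to keep the example readable
trie = PrefixTrie([["New", "York"], ["New", "Jersey"], ["Phoenix"]])
print(trie.next_tokens([]))       # ['New', 'Phoenix']
print(trie.next_tokens(["New"]))  # ['Jersey', 'York']
```

The sketch has no end-of-sequence marker, so it cannot tell a complete name from a prefix of a longer one; a practical trie would typically store a terminating token at the end of every sequence.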

## Inference

For inference, use the `model.sample` function.

Under the hood, **GenIE** uses HuggingFace's `generate` function, so `model.sample` accepts the same generation parameters. By default, inference uses the generation parameters stored as the model's defaults, but you can override them in the call, as shown in the examples below.
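
The override semantics can be pictured as a dictionary merge in which call-time values win. The sketch below is only an illustration: `resolve_generation_parameters` is a hypothetical helper, not a GenIE or HuggingFace function, and the default values are made up. The check mirrors the beam-search requirement that `num_return_sequences` must not exceed `num_beams`.

```python
def resolve_generation_parameters(defaults, **overrides):
    """Return a new dict in which call-time overrides take precedence over defaults."""
    params = {**defaults, **overrides}
    num_beams = params.get("num_beams", 1)
    num_return_sequences = params.get("num_return_sequences", 1)
    # Beam search cannot return more sequences than it keeps beams
    if num_return_sequences > num_beams:
        raise ValueError("num_return_sequences must be <= num_beams")
    return params


# Hypothetical defaults, for illustration only; the real ones come with the model
hypothetical_defaults = {"num_beams": 5, "num_return_sequences": 1}

params = resolve_generation_parameters(
    hypothetical_defaults, num_beams=10, output_scores=True
)
print(params)  # {'num_beams': 10, 'num_return_sequences': 1, 'output_scores': True}
```

Building a new dictionary leaves the defaults untouched, so the same defaults can be reused across calls with different overrides.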

In [None]:
# A single-sentence input for a quick test (overwritten by the multi-sentence list below)
sentences = ["Prior to KTRK, Carson was an anchor for FOX-owned KSAZ in Phoenix, Arizona."]
sentences = ["Since the omicron wave crested in January, multiple studies and datasets have demonstrated that the mRNA vaccines are not nearly as effective against this variant as they were against earlier variants or the original virus.",
             "That loss of effectiveness seems to be particularly stark in children age 5 to 11.",
             "While the original clinical trial data released in November reported an efficacy of 90.7 percent against infection, a report published on April 26 by the Centers for Disease Control and Prevention found that two doses of the Pfizer vaccine were only 31 percent effective at preventing omicron infection in 5- to 11-year-olds.",
             "In another study, which has not yet been peer-reviewed, the New York State Department of Health found that effectiveness against omicron infection absolutely tanked in this age group — down to just 12 percent. "]

----

### Unconstrained Generation

In [None]:
override_models_default_hf_generation_parameters = {
    "num_beams": 10,
    "num_return_sequences": 2,
    "return_dict_in_generate": True,
    "output_scores": True,
    "seed": 123
}

output = model.sample(sentences, 
                      **override_models_default_hf_generation_parameters)

output

<a id="Constrainted-Generation"></a>
### Constrained Generation

To constrain the generation, pass the entity trie and the relation trie as the `entity_trie` and `relation_trie` arguments of `model.sample`.
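
One common way to turn a trie into a decoding constraint is HuggingFace's `prefix_allowed_tokens_fn` hook, which receives `(batch_id, input_ids)` and returns the token IDs allowed at the next step. The sketch below is a simplified illustration and not GenIE's actual constraint logic: `make_prefix_allowed_tokens_fn`, the nested-dict trie, and the token IDs are all hypothetical, and it assumes `input_ids` holds only the tokens of the name generated so far.

```python
def make_prefix_allowed_tokens_fn(trie, eos_token_id):
    """Build a callback that only allows continuations present in a nested-dict trie."""

    def prefix_allowed_tokens_fn(batch_id, input_ids):
        node = trie
        for token in input_ids:
            token = int(token)  # also works for 0-d tensor elements
            if token not in node:
                # Unknown prefix: force the sequence to end
                return [eos_token_id]
            node = node[token]
        # A leaf has no children, so the only valid move is to stop
        return sorted(node) or [eos_token_id]

    return prefix_allowed_tokens_fn


# Toy trie over made-up token IDs: sequences [5, 7, 2] and [5, 9, 2], where 2 is EOS
toy_trie = {5: {7: {2: {}}, 9: {2: {}}}}
allowed = make_prefix_allowed_tokens_fn(toy_trie, eos_token_id=2)

print(allowed(0, []))      # [5]
print(allowed(0, [5]))     # [7, 9]
print(allowed(0, [5, 7]))  # [2]
```

Restricting each step to the trie's children means every finished sequence is one of the stored names, which is what makes the generated entities and relations valid members of the schema.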

#### Small Schema Constrained Generation

In [None]:
"""Small Schema Constrained Generation"""

override_models_default_hf_generation_parameters = {
    "num_beams": 10,
    "num_return_sequences": 2,
    "return_dict_in_generate": True,
    "output_scores": True,
    "seed": 123
}

output = model.sample(sentences, 
                      **small_schema_tries, 
                      **override_models_default_hf_generation_parameters)

output

#### Large Schema Constrained Generation

In [None]:
"""Large Schema Constrained Generation"""

override_models_default_hf_generation_parameters = {
    "num_beams": 10,
    "num_return_sequences": 2,
    "return_dict_in_generate": True,
    "output_scores": True,
    "seed": 123
}

output = model.sample(sentences,
                      **large_schema_tries, 
                      **override_models_default_hf_generation_parameters)

output

----