# [Generative Knowledge Graph Construction: A Review](https://arxiv.org/pdf/2210.12714.pdf)  
by Hongbin Ye, Ningyu Zhang, Hui Chen, Huajun Chen

## Discrimination vs Generation models

| Discrimination models | Generation models |
| --- | --- |
| Predict the possible label of an input sentence based on its characteristics | Autoregressively generate the result of linearized triplets given an input sentence |
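
As a rough illustration of the contrast in the table, the sketch below represents one annotation in both styles. The sentence, the BIO tag set, and the `|` / `;` separators are assumptions made for this example and are not taken from the paper.

```python
# Hypothetical sentence and annotation, chosen only for illustration.
tokens = ["Obama", "was", "born", "in", "Honolulu"]

# Discriminative view: the model assigns one label per token (BIO tagging)
# and classifies the relation for a candidate entity pair.
bio_tags = ["B-PER", "O", "O", "O", "B-LOC"]
relation_label = "born_in"


def linearize_triples(triples):
    """Generative view: serialize (head, relation, tail) triples into one
    target string that a seq2seq decoder would emit token by token."""
    return " ; ".join(f"{head} | {rel} | {tail}" for head, rel, tail in triples)


target = linearize_triples([("Obama", "born_in", "Honolulu")])
print(target)  # Obama | born_in | Honolulu
```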


## Advantages of Generation models
**Unified Architecture:** Generative models can handle different KGC tasks with a universal architecture, freeing them from the constraints of specialized models (NER, relation extraction, etc.).  
**Semantic Utilization:** They leverage rich semantic information from labels or text, enhancing the understanding of structured knowledge.  
**Flexibility:** These models offer flexibility in organizing information, which is beneficial for cross-task applications.  
**Efficiency:** Generative models can be pre-trained on multiple tasks, facilitating knowledge sharing and transition from traditional understanding to structured understanding.    

## Generation based methods
1. Copy-based sequence  
2. Structure-linearized sequence  
3. Label-augmented sequence  
4. Indice-based sequence  
5. Blank-based sequence  

### Copy based sequence
<img src="img/gen_kg_construction/copy_based_sequence.png" width="550" height="550">

The model copies the head entity from the input sentence and then the tail entity. Relations are generated from the target vocabulary, which is restricted to the set of special relation tokens.

**Pro:** This paradigm prevents the model from generating ambiguous or hallucinated entities.
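
The restriction on the output vocabulary can be sketched as follows. This is a minimal illustration, not the CopyRE implementation: the relation tokens, the single-token entities, and the relation-head-tail ordering are all assumptions.

```python
# Assumed relation vocabulary: one special token per relation type.
RELATION_TOKENS = {"<born_in>", "<capital_of>"}


def allowed_output_vocab(sentence_tokens):
    """The decoder may emit only words copied from the input or relation tokens."""
    return set(sentence_tokens) | RELATION_TOKENS


def build_target(triples):
    """Flatten (head, relation, tail) triples into relation, head, tail order."""
    target = []
    for head, relation, tail in triples:
        target.extend([relation, head, tail])
    return target


def is_valid(target, sentence_tokens):
    """Reject any sequence containing a token outside the restricted vocabulary."""
    vocab = allowed_output_vocab(sentence_tokens)
    return all(token in vocab for token in target)


sentence = "Paris is the capital of France".split()
target = build_target([("Paris", "<capital_of>", "France")])
print(target)                                                    # ['<capital_of>', 'Paris', 'France']
print(is_valid(target, sentence))                                # True
print(is_valid(["<capital_of>", "Paris", "Germany"], sentence))  # False
```

Because "Germany" never appears in the input, the last sequence falls outside the restricted vocabulary, which is how the copy restriction rules out hallucinated entities.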

References:

1. CopyRE (ACL 2018)
2. CopyRRL (EMNLP 2019): to identify a reasonable triple extraction order, it casts triplet generation as a reinforcement learning process, enabling the copy mechanism to follow an efficient generation order.
3. CopyMTL (AAAI 2020): maps the head and tail entities to a fused feature space for entity replication through an additional nonlinear layer, which strengthens the stability of the mechanism.
4. TEMPGEN (EMNLP 2021): proposes a top-k copy mechanism to alleviate the computational complexity of entity pairs.
5. Seq2rel (BioNLP-ACL 2022)

### Structure linearize sequence
- This paradigm uses structural knowledge and label semantics, which makes it well suited to a unified output format.  
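
A minimal sketch of linearization, assuming a bracketed format of the form `(type trigger (role argument) ...)`. The exact format differs between papers, and the event record below is hypothetical.

```python
def linearize_event(event):
    """Serialize one event as (type trigger (role argument) ...)."""
    body = f"{event['type']} {event['trigger']}"
    args = " ".join(f"({role} {text})" for role, text in event["arguments"])
    return f"({body} {args})" if args else f"({body})"


def linearize_events(events):
    """Wrap all events of a sentence in one outer pair of brackets."""
    return "(" + " ".join(linearize_event(event) for event in events) + ")"


# Hypothetical event record.
events = [
    {
        "type": "Transport",
        "trigger": "returned",
        "arguments": [("Artifact", "the man"), ("Destination", "Los Angeles")],
    }
]
print(linearize_events(events))
# ((Transport returned (Artifact the man) (Destination Los Angeles)))
```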

<img src="img/gen_kg_construction/struct_linearized.png" width="550" height="550">



**Key works:**    
1. Lu et al. (ACL 2021) proposed an end-to-end event extraction model based on **T5**. The model linearizes the extracted knowledge structure as its output. **The event schema is used to constrain the decoding space and ensure semantic and structural legitimacy**.
2. Lou et al. (ACL 2021) reformulated event detection as a **Seq2Seq task with a Multi-Layer Bidirectional Network**.
3. Zhang et al. (Audio Speech Lang Trans 2021b) and Ye et al. (AAAI 2021) introduced a **contrastive learning framework with batch dynamic attention masking**. It addresses the semantic contradictions that can lead generative architectures to produce unreliable sequences.
4. Cabot and Navigli (EMNLP 2021) employed a **triplet decomposition method for relation extraction**, which is flexible and can be adapted to unified domains or longer documents.
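
The schema constraint mentioned for Lu et al. can be pictured as masking the decoder's candidates at each step. The sketch below is a simplified stand-in, not the paper's decoding algorithm: the schema, the scores, and the helper names are hypothetical.

```python
# Hypothetical event schema: event type -> roles it may take.
SCHEMA = {
    "Transport": {"Artifact", "Destination", "Origin"},
    "Attack": {"Attacker", "Target"},
}


def allowed_labels(event_type=None):
    """At an event-type position any type is allowed; at a role position
    only the roles that the schema defines for that event type."""
    if event_type is None:
        return set(SCHEMA)
    return SCHEMA[event_type]


def constrained_argmax(scores, allowed):
    """Pick the highest-scoring token among the allowed ones."""
    candidates = {token: score for token, score in scores.items() if token in allowed}
    return max(candidates, key=candidates.get)


# Made-up decoder scores at a role position inside a Transport event.
scores = {"Attacker": 0.6, "Destination": 0.3, "Artifact": 0.1}
print(constrained_argmax(scores, allowed_labels("Transport")))  # Destination
```

Although "Attacker" has the highest raw score, it is not a legal role for a Transport event, so the constrained choice falls back to "Destination".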


### Label augmented Sequence
- This paradigm **uses extra markers to indicate specific entities or relationships**. The output sequence copies all words in the input sentence, which reduces ambiguity. Square brackets or other identifiers mark the span of each entity of interest, with the relevant labels separated by "|" inside the brackets. Labels are written in natural-language words to leverage the knowledge of the pre-trained model.   
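
A minimal sketch of this output format, assuming non-overlapping, sorted spans with exclusive end offsets and a single label per span. The sentence and labels are hypothetical.

```python
import re


def augment(tokens, spans):
    """Copy every input token and wrap each (start, end_exclusive, label)
    span as '[ mention | label ]'."""
    out, cursor = [], 0
    for start, end, label in spans:
        out.extend(tokens[cursor:start])
        out.append("[ " + " ".join(tokens[start:end]) + " | " + label + " ]")
        cursor = end
    out.extend(tokens[cursor:])
    return " ".join(out)


def parse(augmented):
    """Recover (mention, label) pairs from an augmented string."""
    pairs = re.findall(r"\[([^\[\]|]+)\|([^\[\]|]+)\]", augmented)
    return [(mention.strip(), label.strip()) for mention, label in pairs]


tokens = "Tolkien wrote The Hobbit".split()
text = augment(tokens, [(0, 1, "person"), (2, 4, "book")])
print(text)         # [ Tolkien | person ] wrote [ The Hobbit | book ]
print(parse(text))  # [('Tolkien', 'person'), ('The Hobbit', 'book')]
```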

<img src="img/gen_kg_construction/label_augmented.png" width="550" height="550">  


### Indice based sequence
This paradigm generates the **indices of the words of interest in the input text directly and encodes class labels as label indices.** Because the output is strictly restricted, it will not generate indices for entities that do not exist in the input text, except for relation labels.  
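
One way to picture the index encoding is to emit each span as a start index, an end index, and a label index placed just past the sentence length. This offset scheme, the label set, and the inclusive end offsets are assumptions for illustration.

```python
LABELS = ["PER", "LOC"]  # hypothetical label set


def encode(spans, n_tokens):
    """Encode (start, end_inclusive, label) spans as a flat index sequence.
    Label indices are offset by n_tokens so they never collide with word indices."""
    seq = []
    for start, end, label in spans:
        seq.extend([start, end, n_tokens + LABELS.index(label)])
    return seq


def decode(seq, tokens):
    """Map an index sequence back to (mention, label) pairs, rejecting any
    index that does not point into the input text or the label set."""
    n = len(tokens)
    out = []
    for i in range(0, len(seq), 3):
        start, end, label_idx = seq[i:i + 3]
        if not (0 <= start <= end < n and n <= label_idx < n + len(LABELS)):
            raise ValueError(f"invalid indices: {seq[i:i + 3]}")
        out.append((" ".join(tokens[start:end + 1]), LABELS[label_idx - n]))
    return out


tokens = "Barack Obama was born in Honolulu".split()
seq = encode([(0, 1, "PER"), (5, 5, "LOC")], len(tokens))
print(seq)                  # [0, 1, 6, 5, 5, 7]
print(decode(seq, tokens))  # [('Barack Obama', 'PER'), ('Honolulu', 'LOC')]
```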


<img src="img/gen_kg_construction/indice_based_sequence.png" width="550" height="550">   


### Blank based sequence
This paradigm **uses templates to define the appropriate order and relationship for the generated spans**.  


<img src="img/gen_kg_construction/blank_based_sequence.png" width="550" height="550">  


**Key works:**  
1. Du et al. (NAACL HLT 2021) explore a blank-based form for event extraction tasks that includes special tokens representing event information such as event types.
2. Li et al. (NAACL 2021) frame document-level event argument extraction as conditional generation given a template and introduce document-level information to aid the generation process. The template is a text describing an event type, with blank placeholders for the argument roles.
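
The template idea can be sketched as filling placeholders to build the gold output and aligning generated text against the template to read arguments back. The template, role names, and `<argN>` placeholder syntax below are hypothetical and not taken from either paper.

```python
import re

# Hypothetical template for a purchase event and its placeholder-to-role map.
TEMPLATE = "<arg1> bought <arg2> from <arg3>"
ROLES = {"<arg1>": "Buyer", "<arg2>": "Artifact", "<arg3>": "Seller"}


def fill(template, values):
    """Build the gold output: replace each placeholder that has a value and
    leave the others untouched to signal a missing argument."""
    return re.sub(r"<arg\d+>", lambda m: values.get(m.group(0), m.group(0)), template)


def extract(template, generated):
    """Align generated text with the template and return role -> argument."""
    parts = re.split(r"(<arg\d+>)", template)
    pattern = "".join(
        f"(?P<{part[1:-1]}>.+?)" if re.fullmatch(r"<arg\d+>", part) else re.escape(part)
        for part in parts
    )
    match = re.fullmatch(pattern, generated)
    if match is None:
        return {}
    return {
        ROLES[f"<{name}>"]: text
        for name, text in match.groupdict().items()
        if text != f"<{name}>"  # an unfilled placeholder means no argument
    }


gold = fill(TEMPLATE, {"<arg1>": "Alice", "<arg2>": "a car"})
print(gold)                     # Alice bought a car from <arg3>
print(extract(TEMPLATE, gold))  # {'Buyer': 'Alice', 'Artifact': 'a car'}
```

Writing a fluent template per event type is the manual effort that the template cost criterion refers to.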

## Evaluation criteria
- **Semantic utilization** refers to the degree to which the model leverages the semantics of the labels.
- **Search space** refers to the vocabulary space searched by the decoder.
- **Application scope** refers to the range of KGC tasks to which the paradigm can be applied.
- **Template cost** refers to the cost of constructing the input and golden output text. Note: **Most paradigms require only linear concatenation; the blank-based paradigm, however, requires more manual effort to make the template semantically fluent.**

#### Comparative analysis
<img src="img/gen_kg_construction/comparitive_analysis.png" width="550" height="550">  


## Observations
**Tasks:** entity/relation extraction and event extraction


1. **Structure-based and label-based methods achieve extraction performance comparable to that of the discrimination models on NYT datasets**. The authors attribute this to their better use of label semantics and structural knowledge compared with other generation methods.
2. Although the discrimination methods perform well, the generation methods have improved more rapidly in recent years, so the authors expect them to have a greater application scope in the near future.