# Introduction

Large-scale language models (LMs) such as BERT are optimized to predict masked-out textual inputs and have notably advanced performance on a range of downstream NLP tasks. Recently, LMs have also gained attention for their purported ability to yield structured pieces of knowledge directly from their parameters. This is promising because current knowledge bases (KBs) such as Wikidata and ConceptNet are part of the backbone of the Semantic Web ecosystem, yet are inherently incomplete. In the seminal LAMA paper, the authors showed that LMs could rank correct object tokens highly when given an input prompt specifying the subject-entity and relation. Despite much follow-up work reporting further advancements, the prospect of using LMs for knowledge base construction remains unexplored.

We invite participants to present solutions that use **LMs for KB construction** without prior information on the cardinality of relations, i.e., for a given subject-relation pair, the total number of possible object-entities is not known in advance. We require participants to submit a system that takes an input consisting of a subject-entity and a relation, uses an LM depending on the choice of track (BERT-type or open), generates subject-relation-object tuples, and makes an actual accept/reject decision for each generated triple. Finally, we evaluate the resulting KBs using the established F1-score metric (the harmonic mean of precision and recall).
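As a rough illustration of the metric, the following is a minimal sketch of a set-based precision, recall, and F1 computation for a single subject-relation pair. The function name `precision_recall_f1` and the toy entity lists are assumptions made for this example; the official `evaluate.py` script may normalize strings or aggregate scores differently.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 for one subject-relation pair.

    Hypothetical helper for illustration only; not the official scorer.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Assumption: an empty prediction or gold set yields a score of 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy, made-up predictions and ground truth (not taken from the dataset)
p, r, f1 = precision_recall_f1(
    ["france", "poland", "spain"],
    ["france", "poland", "austria", "denmark"],
)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, a system that accepts too many candidates (low precision) or too few (low recall) is penalized in both cases.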

**NOTE:** Before continuing, follow the steps given in README.md to install the required Python packages and to download the dataset and supporting Python scripts.

### LM Probing

The Knowledge Base Construction from Language Models (LM-KBC) pipeline has the following important modules:

1. Choosing the subject-entity (e.g., Germany) and relation (e.g., CountryBordersWithCountry)
2. Creating a prompt (e.g., "_Germany shares border with [MASK]_.", a masked prompt for BERT-type masked language models)
3. Probing an existing language model using the above prompt as an input
4. Obtaining the LM's output, i.e., the object-entities for the [MASK] position in the input prompt, ranked by likelihood
5. Applying a selection criterion to the LM's output to keep only the factually correct object-entities for the given subject-entity and relation
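The five steps above can be sketched end to end as follows. This is only an illustration: `make_prompt`, `fake_probe`, and `select_objects` are hypothetical names, and `fake_probe` returns hard-coded, made-up scores in place of a real masked LM so that the example runs without downloading a model. The `token_str` and `score` keys mirror the fields read from the LM output later in this notebook.

```python
def make_prompt(subject_entity, mask_token="[MASK]"):
    # Step 2: fill a fixed template for the CountryBordersWithCountry relation
    return f"{subject_entity} shares border with {mask_token}."


def fake_probe(prompt):
    # Steps 3-4 stand-in: a real system would query a masked LM here.
    # These candidates and scores are invented for illustration.
    return [
        {"token_str": "Poland", "score": 0.21},
        {"token_str": "France", "score": 0.62},
        {"token_str": "it", "score": 0.05},
    ]


def select_objects(candidates, threshold):
    # Step 5: rank by likelihood, then keep candidates at or above the threshold
    # (assumption: inclusive comparison; the baseline script may differ)
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    return [c["token_str"] for c in ranked if c["score"] >= threshold]


# Step 1: choose the subject-entity (the relation is implied by the template)
subject_entity = "Germany"
prompt = make_prompt(subject_entity)
candidates = fake_probe(prompt)
selected = select_objects(candidates, threshold=0.5)
print(prompt)
print(selected)
```

Keeping each step in its own function makes it easy to swap in a different LM, prompt template, or selection rule independently.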

<font color='blue'>Participants can propose solutions that either improve the performance of these modules over the given baseline system or introduce a new idea for generating the object-entities, with the goal of beating the baseline F1-score of 14.21% on the hidden test dataset. Below we explain how some of these modules affect the LM's output when probed.</font>

In [1]:
from pathlib import Path
import pandas as pd 
from ast import literal_eval
from IPython.display import display

from transformers import logging
logging.set_verbosity_error()



In [2]:
### importing all the functions from the baseline.py and evaluate.py scripts

from baseline import *
from evaluate import *


In [None]:
### Assume the following subject entity and relation from here on
subject_entity = 'Singapore'
relation = 'CountryBordersWithCountry'

#### Effect of Language Model

Let's see how the output object-entities vary across three different pre-trained LMs: BERT-base, BERT-large, and RoBERTa-base.

In [None]:
### probing the three different LMs on the chosen subject-entity and relation
probe_lm('bert-base-cased', 100, relation, [subject_entity], Path('./prompt_output_bert_base/'))
probe_lm('bert-large-cased', 100, relation, [subject_entity], Path('./prompt_output_bert_large/'))
probe_lm('roberta-base', 100, relation, [subject_entity], Path('./prompt_output_roberta_base/'))

In [None]:
### setting the probability threshold to 0.5 (our selection criterion)
### and then running the baseline function on all three models
prob_threshold = 0.5
baseline(Path('./prompt_output_bert_base/'), prob_threshold, [relation], Path('./bert_base_output/'))
baseline(Path('./prompt_output_bert_large/'), prob_threshold, [relation], Path('./bert_large_output/'))
baseline(Path('./prompt_output_roberta_base/'), prob_threshold, [relation], Path('./roberta_base_output/'))

In [None]:
### retrieving the ground truth labels from the given train dataset for the chosen subject-entity and relation
df = pd.read_csv('./train/' + relation + '.csv')
### the ObjectEntity column stores lists as strings, so parse them back into Python lists
df['ObjectEntity'] = df['ObjectEntity'].apply(literal_eval)
df = df[df['SubjectEntity'] == subject_entity]
ground_truth = df['ObjectEntity'].tolist()
print('Ground truth object-entities are: ', ground_truth)

### retrieving the outputs obtained after running the baseline function
bert_base_output = pd.read_csv('./bert_base_output/' + relation + '.csv')['ObjectEntity'].tolist()
bert_large_output = pd.read_csv('./bert_large_output/' + relation + '.csv')['ObjectEntity'].tolist()
roberta_base_output = pd.read_csv('./roberta_base_output/' + relation + '.csv')['ObjectEntity'].tolist()
print('bert_base_output: ', bert_base_output)
print('bert_large_output: ', bert_large_output)
print('roberta_base_output: ', roberta_base_output)

<font color='blue'>**Observation**: From the above output, we see that the choice of the pre-trained language model has a direct effect on the generated output. Participants can try to further fine-tune the BERT model (for track 1) on this task or experiment with other existing pre-trained LMs (for track 2).</font>

#### Effect of prompt formulation

Let's see how the output object-entities vary when different prompt structures are used with the BERT-large LM.

In [None]:
### helper function: converts the raw LM outputs for one prompt into a sorted dataframe
def get_results(probe_outputs, prompt):
    results = []
    for sequence in probe_outputs:
        results.append(
            {
                "Prompt": prompt,
                "SubjectEntity": subject_entity,
                "Relation": relation,
                "ObjectEntity": sequence["token_str"],
                "Probability": round(sequence["score"], 4),
            }
        )
    ### sorting the candidates by descending probability for each subject-entity
    results_df = pd.DataFrame(results).sort_values(
        by=["SubjectEntity", "Probability"], ascending=[True, False]
    )
    return results_df

In [None]:
### initializing the bert-large model
bert_large, bert_large_masked_token = initialize_lm('bert-large-cased', 100)

### getting the sample prompt defined in the baseline.py script ({subject_entity} shares border with [MASK].)
sample_prompt = create_prompt(subject_entity, relation, bert_large_masked_token)

In [None]:
### creating different prompts:

### 1. ({subject_entity} borders [MASK].)
prompt1 = "{} borders {}.".format(subject_entity, bert_large_masked_token)

### 2. ({subject_entity} borders [MASK], which is a country.)
prompt2 = "{} borders {}, which is a country.".format(subject_entity, bert_large_masked_token)

In [None]:
### probing the BERT-large LM using the three different prompts for the same subject-entity and relation
sample_prompt_output = bert_large(sample_prompt)
prompt1_output = bert_large(prompt1)
prompt2_output = bert_large(prompt2)

In [None]:
### storing the received output in a pandas dataframe
sample_prompt_results = get_results(sample_prompt_output, sample_prompt)
prompt1_results = get_results(prompt1_output, prompt1)
prompt2_results = get_results(prompt2_output, prompt2)

In [None]:
for i in [sample_prompt_results.head(3), prompt1_results.head(3), prompt2_results.head(3)]:
    display(i)

<font color='blue'>**Observation**: From the above output, we see that the prompt used for probing affects the quality of the generated output. Participants can propose a solution that automatically designs better prompts for this task.</font>
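One simple way to organize such experiments is to keep the prompt variants as templates and fill them programmatically. The sketch below is an illustration only: `PROMPT_TEMPLATES` and `build_prompts` are hypothetical names that are not part of `baseline.py`. The mask token is passed in explicitly because it differs between model families (for example, `[MASK]` for BERT and `<mask>` for RoBERTa).

```python
# Hypothetical templates for the CountryBordersWithCountry relation;
# {subject} and {mask} are placeholders filled in by build_prompts
PROMPT_TEMPLATES = [
    "{subject} shares border with {mask}.",
    "{subject} borders {mask}.",
    "{subject} borders {mask}, which is a country.",
]


def build_prompts(subject_entity, mask_token, templates=PROMPT_TEMPLATES):
    """Return one filled-in prompt per template."""
    return [
        template.format(subject=subject_entity, mask=mask_token)
        for template in templates
    ]


prompts = build_prompts("Singapore", "[MASK]")
for prompt in prompts:
    print(prompt)
```

Each resulting prompt could then be passed to the LM and the outputs compared side by side, as done for the three prompts above.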

#### Effect of selection criteria

Let's see how choosing different probability thresholds affects the generated output object-entities.

In [None]:
### probing the BERT-large model on the chosen subject-entity and relation
probe_lm('bert-large-cased', 100, relation, [subject_entity], Path('./prompt_output_bert_large/'))

In [None]:
### initializing different probability thresholds
prob_threshold1 = 0.1
prob_threshold2 = 0.5
prob_threshold3 = 0.9

### running the baseline function on the above three thresholds
baseline(Path('./prompt_output_bert_large/'), prob_threshold1, [relation], Path('./thres1_output/'))
baseline(Path('./prompt_output_bert_large/'), prob_threshold2, [relation], Path('./thres2_output/'))
baseline(Path('./prompt_output_bert_large/'), prob_threshold3, [relation], Path('./thres3_output/'))

In [None]:
thres1_result = pd.read_csv('./thres1_output/'+relation+'.csv')
thres2_result = pd.read_csv('./thres2_output/'+relation+'.csv')
thres3_result = pd.read_csv('./thres3_output/'+relation+'.csv')

In [None]:
for i in [thres1_result.head(3), thres2_result.head(3), thres3_result.head(3)]:
    display(i)

<font color='blue'>**Observation**: From the above output, we see that changing the threshold leads to very different performance scores. When the threshold is 0.1, the F1-score would be 0.5 (1 of the 2 generated object-entities is correct, giving a precision of 0.5, and 1 of the 2 ground-truth object-entities is selected, giving a recall of 0.5); for a threshold of 0.9, however, the F1-score would be 0. Participants can propose a solution that uses a better thresholding mechanism or that further calibrates the LM's likelihoods for this task.</font>
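A simple starting point for choosing a threshold is to sweep over a few candidate values on training data and keep the one with the highest F1-score. The following is a minimal sketch under stated assumptions: `f1_at_threshold` is a hypothetical helper, and the candidate probabilities and gold set are toy values invented for this example, not actual model output or dataset labels.

```python
def f1_at_threshold(candidates, gold, threshold):
    """F1 of the candidates whose probability is at or above the threshold."""
    selected = {token for token, prob in candidates if prob >= threshold}
    true_positives = len(selected & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy (token, probability) pairs and gold object-entities, made up for illustration
candidates = [("malaysia", 0.55), ("china", 0.15), ("indonesia", 0.08), ("it", 0.02)]
gold = {"malaysia", "indonesia"}

# Evaluate the same three thresholds used in this section
scores = {t: f1_at_threshold(candidates, gold, t) for t in (0.1, 0.5, 0.9)}
best_threshold = max(scores, key=scores.get)
for threshold, score in scores.items():
    print(f"threshold={threshold}: F1={score:.3f}")
print("best threshold:", best_threshold)
```

In practice the sweep would be run over all subject-entities of a relation in the training set, and a separate threshold per relation may work better than a single global one.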