**LMs and Knowledge Bases** 

In this exercise, we are asked to focus on three relation types: "place of birth" (POB), "date of birth" (DOB), and "place of death" (POD). This solution is designed for POD, but the same approach carries over to POB and DOB.

First we install the transformers library because we are going to use BERT (https://en.wikipedia.org/wiki/BERT_(language_model)).

In [1]:
%pip install datasets "transformers[sentencepiece]"

Note: you may need to restart the kernel to use updated packages.


Now we read the Google-RE files. To do so, we manually upload them from our local machine to Google Colab: open the file browser on the left and drag the files into the sample_data folder.

Then we read the content of the files using the pandas package (https://pandas.pydata.org). As with any other Python package, we first have to import it.

In [2]:
import pandas as pd

Now let's read the content of the file.

In [3]:
# Local copies uploaded to Colab's sample_data folder.
# Alternatively, uncomment the URL variants to load the files directly from GitHub.
pod = pd.read_json('/content/sample_data/place_of_death_test.jsonl', lines=True)
#pod = pd.read_json('https://raw.githubusercontent.com/MMesgar/summer_school_2022/main/sample_data/place_of_death_test.jsonl', lines=True)
pob = pd.read_json('/content/sample_data/place_of_birth_test.jsonl', lines=True)
#pob = pd.read_json('https://raw.githubusercontent.com/MMesgar/summer_school_2022/main/sample_data/place_of_birth_test.jsonl', lines=True)
dob = pd.read_json('/content/sample_data/date_of_birth_test.jsonl', lines=True)
#dob = pd.read_json('https://raw.githubusercontent.com/MMesgar/summer_school_2022/main/sample_data/date_of_birth_test.jsonl', lines=True)

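If this cell fails with `ValueError: Expected object or value`, pandas most likely could not find a file at the given path. Check where the files were uploaded, or use the commented-out URLs instead.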
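To make the ``lines=True`` argument concrete: a `.jsonl` file stores one JSON object per line, and each line becomes one row of the dataframe. The following minimal sketch parses such content with the standard library only. The two records are made up for illustration and only mimic the 'sub_label' and 'obj_label' fields used later; they are not taken from the Google-RE files.

```python
import json

# Hypothetical JSON Lines content: one JSON object per line.
# These two records are illustrative only, not real dataset entries.
sample_jsonl = (
    '{"sub_label": "Person A", "obj_label": "Paris"}\n'
    '{"sub_label": "Person B", "obj_label": "London"}\n'
)

# Parse every non-empty line into a dict; each dict corresponds to one row.
records = [json.loads(line) for line in sample_jsonl.splitlines() if line.strip()]

print(len(records))
print(records[0]["sub_label"], "->", records[0]["obj_label"])
```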

Let's see how many facts this dataset contains.

In [None]:
num_facts = len(pod)
print(f"Number of POD (place of death) facts is: {num_facts}")

Let's take a look at the first five rows of the pod dataframe. The ``head()`` method returns them for us.

In [None]:
pod.head()

Look at the column names; two of them matter here: 'sub_label' and 'obj_label'. Together they state which person died in which city.

In [None]:
pod_data_samples = pod[['sub_label','obj_label']]

In [None]:
pod_data_samples.head()

Now we create two lists: the subjects (the persons) and their corresponding objects (the cities where they died).

In [None]:
subjects = pod['sub_label'].to_list()
reference_objects = pod['obj_label'].to_list()

Now we are ready to evaluate BERT. The idea is to give each subject to BERT and see what it returns as the city where the person died. 

Let's tell the pipeline to use BERT.

In [None]:
from transformers import pipeline
unmasker = pipeline("fill-mask", model="bert-base-uncased")

For each subject in the list, we fill the template ``{s} died in [MASK].``, give it to BERT, and take the token it predicts for the mask as the object.

In [None]:
predictions = []
for s in subjects:
  # The pipeline returns candidates sorted by score; [0] is the most likely one.
  predicted_obj = unmasker(f"{s} died in [MASK].")[0]["token_str"]
  predictions.append(predicted_obj)
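For a prompt with a single `[MASK]`, the fill-mask pipeline returns a list of candidate dictionaries ordered from most to least likely, each with the keys `score`, `token`, `token_str`, and `sequence`. The loop above keeps only the first one. The sketch below shows how more than one candidate could be extracted; `mock_output` is a hand-written stand-in with invented scores and token ids, and `top_k_tokens` is a hypothetical helper, not part of transformers.

```python
# Hand-written stand-in for the list returned by unmasker("... died in [MASK].").
# Scores, token ids and tokens are invented for illustration.
mock_output = [
    {"score": 0.30, "token": 1001, "token_str": "paris",
     "sequence": "person a died in paris."},
    {"score": 0.20, "token": 1002, "token_str": "london",
     "sequence": "person a died in london."},
    {"score": 0.10, "token": 1003, "token_str": "rome",
     "sequence": "person a died in rome."},
]


def top_k_tokens(candidates, k=1):
    """Return the token strings of the k highest-scoring candidates."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    return [c["token_str"].strip() for c in ranked[:k]]


print(top_k_tokens(mock_output))       # top-1, as used in the loop above
print(top_k_tokens(mock_output, k=2))  # top-2
```

With the real pipeline, the number of returned candidates is controlled by its `top_k` argument. Keep in mind that a single `[MASK]` can only be filled with one vocabulary token, so objects that BERT splits into several tokens cannot be predicted exactly with this template.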

Let's look at 10 predicted labels and 10 reference objects. We can easily compare them manually. Right?

In [None]:
print(predictions[:10])
print(reference_objects[:10])

Although we can manually compare a short list of predicted and reference labels, we need code that compares the lists automatically when they contain many elements.
The following code performs such a comparison:

In [None]:
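correct = 0
# Compare each top-1 prediction with its reference, ignoring case
# (bert-base-uncased produces lowercase tokens).
for predicted_label, ref_label in zip(predictions, reference_objects):
  if predicted_label.lower() == ref_label.lower():
    correct += 1

# p@1: percentage of facts whose top-1 prediction matches the reference.
p_at_1 = 100 * correct / len(predictions)
print(f"number of facts is: {num_facts}, p@1 = {p_at_1:.2f}%")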
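The same comparison can be generalised from p@1 to p@k: a fact counts as correct if the reference object appears anywhere among the first k predicted objects. The following is a minimal sketch under the assumption that each fact comes with a ranked list of predicted strings; `precision_at_k` is a hypothetical helper name and the toy data is invented.

```python
def precision_at_k(ranked_predictions, references, k=1):
    """Percentage of facts whose reference is among the top-k predictions."""
    if len(ranked_predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = 0
    for candidates, ref in zip(ranked_predictions, references):
        # Case-insensitive match against the first k candidates only.
        top_k = [c.lower() for c in candidates[:k]]
        if ref.lower() in top_k:
            hits += 1
    return 100.0 * hits / len(references)


# Toy data, invented for illustration: three facts, two candidates each.
ranked = [["paris", "london"], ["rome", "berlin"], ["vienna", "madrid"]]
refs = ["Paris", "Berlin", "Oslo"]

print(f"p@1 = {precision_at_k(ranked, refs, k=1):.2f}%")
print(f"p@2 = {precision_at_k(ranked, refs, k=2):.2f}%")
```

With `k=1` this reduces to the exact-match comparison above; larger values of k can only keep the score the same or raise it.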

*Additional exercises*

1. Can you improve the accuracy of the predictions made by BERT?
2. What happens if we consider not only the first predicted object but, for example, the first five (p@5)? How does this change the accuracy of the given model, or of your own?
3. Try out the other two sample data sets, POB (place of birth) and DOB (date of birth). What do you have to change?
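As a possible starting point for the third exercise, the prompt can be made relation-specific by keeping one template per relation. This is only a sketch: the "pod" entry mirrors the template used in this notebook, while the "pob" and "dob" phrasings and the `build_prompt` helper are assumptions to experiment with, not prescribed by the exercise.

```python
# One prompt template per relation. Only "pod" comes from this notebook;
# the other two are hypothetical phrasings.
TEMPLATES = {
    "pod": "{s} died in [MASK].",
    "pob": "{s} was born in [MASK].",
    "dob": "{s} was born in the year [MASK].",
}


def build_prompt(relation, subject):
    """Fill the subject into the template of the given relation."""
    return TEMPLATES[relation].format(s=subject)


print(build_prompt("pob", "Person A"))
print(build_prompt("dob", "Person A"))
```

The "dob" template asks for a year because a single `[MASK]` yields a single token. Whether that matches the format of the reference objects in the DOB file should be checked against the data before comparing.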