**LMs and Knowledge Bases** 

This notebook is designed to be run in Google Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg#left)](https://colab.research.google.com/github/MMesgar/summer_school_2022/blob/main/LM_POD.ipynb)

If you want to save your progress online, a Google account is required.
Before you start, you should change the Google Colab runtime type under Runtime->Change runtime type->Hardware accelerator to GPU.

In this exercise, we focus on three relation types: "place of birth (POB)", "date of birth (DOB)", and "place of death (POD)". This solution is designed for POD. However, you should be able to design a similar solution for POB and DOB as well.

First we install the transformers library because we are going to use BERT (https://en.wikipedia.org/wiki/BERT_(language_model)), a machine learning model for natural language processing.

In [None]:
%pip install datasets transformers[sentencepiece] --quiet

We read the content of the file using the Pandas package (https://pandas.pydata.org) in Python. To use Pandas we first import it.
Then we read the contents of the place of death test file from GitHub.

In [None]:
import pandas as pd

pod = pd.read_json('https://raw.githubusercontent.com/MMesgar/summer_school_2022/main/sample_data/place_of_death_test.jsonl', lines=True)
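
The `lines=True` argument tells Pandas that the file is in JSON Lines format, where every line is a separate JSON object. The following is a minimal sketch of that format using only Python's standard `json` module. The two records are hand-written illustrations and are not taken from the dataset; only the `sub_label` and `obj_label` keys mirror column names used later in this notebook.

```python
import json

# Two hand-written records in JSON Lines form (illustrative, not from the dataset).
jsonl_text = (
    '{"sub_label": "Albert Einstein", "obj_label": "Princeton"}\n'
    '{"sub_label": "Marie Curie", "obj_label": "Passy"}\n'
)

# Each non-empty line is parsed on its own as a JSON object.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

for record in records:
    print(record["sub_label"], "->", record["obj_label"])
```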

Let's see how many facts this dataset contains.

In [None]:
num_facts = len(pod)
print(f"Number of POD (place of death) facts is: {num_facts}")

Let's take a look at the first 5 rows of the pod dataframe. The ``head()`` method does this for us.

In [None]:
pod.head()

Look at the column names. Two of them are important for us: 'sub_label' and 'obj_label'. These columns show which person died in which city, so we reduce the dataset to these two:

In [None]:
pod_data_samples = pod[['sub_label','obj_label']]
pod_data_samples.head()

Now we create a list of subjects (the persons) and a list of their corresponding objects (the cities where they died).

In [None]:
subjects = pod['sub_label'].to_list()
reference_objects = pod['obj_label'].to_list()

Now we are ready to use BERT. The idea is to give each subject to BERT and see what it returns as the city where the person died. 

Next, we create what is called a pipeline, which chains the required data processing steps, and tell it to use BERT to fill in a ``[MASK]`` placeholder.

In [None]:
from transformers import pipeline
unmasker = pipeline("fill-mask", model="bert-base-uncased")

Now we use the first subject in a sentence and see what BERT predicts for the placeholder ``[MASK]``.

In [None]:
predicted_obj = unmasker(f"{subjects[0]} died in [MASK].")
for obj in predicted_obj:
  print(obj)

As you can see, BERT gives us 5 different possibilities, starting with the most likely one. You can also see that none of the predicted labels actually match the reference label.
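
Each entry returned by the fill-mask pipeline is a dictionary that includes, among other keys, the predicted token (`token_str`) and its probability (`score`). A minimal sketch of turning such a result into a ranked list of tokens follows. The `fake_output` list is a hand-written stand-in with placeholder values so that the snippet runs without loading a model, and `ranked_tokens` is a hypothetical helper name.

```python
def ranked_tokens(candidates):
    """Return (token, score) pairs sorted from most to least likely."""
    ordered = sorted(candidates, key=lambda c: c["score"], reverse=True)
    return [(c["token_str"].strip(), c["score"]) for c in ordered]

# Placeholder stand-in for the output of `unmasker(...)`; the values are invented.
fake_output = [
    {"token_str": "city_b", "score": 0.20},
    {"token_str": "city_a", "score": 0.50},
    {"token_str": "city_c", "score": 0.05},
]

for token, score in ranked_tokens(fake_output):
    print(f"{token}: {score:.2f}")
```

With the real pipeline, the same helper could be called as `ranked_tokens(predicted_obj)`.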

Now we define a loop that goes through the list of all subjects and makes a prediction for each of them. For now, we just use the most likely prediction and take its ``token_str`` as the predicted object.

In [None]:
predictions = []
for s in subjects:
  predicted_obj = unmasker(f"{s} died in [MASK].")[0]["token_str"]
  predictions.append(predicted_obj)
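
Since the prompt is an ordinary string, the same loop can serve other relation types by swapping the sentence. The sketch below builds prompts from one template per relation. The wording of the POB and DOB templates and the `build_prompts` helper are assumptions made for illustration, not part of the original exercise.

```python
# Hypothetical prompt templates, one per relation type.
TEMPLATES = {
    "POD": "{subject} died in [MASK].",
    "POB": "{subject} was born in [MASK].",
    "DOB": "{subject} was born in the year [MASK].",
}

def build_prompts(subjects, relation):
    """Fill the template of the given relation with each subject name."""
    template = TEMPLATES[relation]
    return [template.format(subject=s) for s in subjects]

# Example subject names chosen for illustration only.
prompts = build_prompts(["Albert Einstein", "Marie Curie"], "POD")
print(prompts)
```

Each resulting prompt can then be passed to `unmasker` in the same way as in the loop above.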

Now we can evaluate the performance of BERT. Let's look at the first 10 predicted labels and their corresponding reference objects. We can easily compare them manually, can't we?

In [None]:
from tabulate import tabulate
table = [["Subject", "Prediction", "Reference"]]

for i in range(10):
  table.append([subjects[i], predictions[i], reference_objects[i]])
# Treat the first row of `table` as the column headers.
print(tabulate(table, headers="firstrow"))

Although we can compare a short list of predicted and reference labels manually, for longer lists it is better to let code do the comparison automatically.
The following code performs such a comparison and shows us what percentage of the labels were predicted correctly (precision at 1, or p@1):

In [None]:
correct_1 = 0.0
for i in range(num_facts):
  predicted_label = predictions[i].lower()
  ref_label = reference_objects[i].lower()
  if predicted_label == ref_label:
    correct_1 += 1

p_at_1 = correct_1 / len(predictions)
p_at_1 = p_at_1 * 100
print(f"number of facts is: {num_facts}, p@1 = {p_at_1:.2f}%")
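
The comparison above can be generalized to more than one candidate per fact. The following is a minimal sketch of a precision-at-k function, assuming each fact comes with a ranked list of candidate strings. The `precision_at_k` name and the toy inputs are illustrative placeholders, not model output.

```python
def precision_at_k(ranked_predictions, references, k):
    """Percentage of facts whose reference appears among the top-k candidates."""
    if not references:
        return 0.0
    hits = 0
    for candidates, reference in zip(ranked_predictions, references):
        # Compare case-insensitively, as in the p@1 cell above.
        top_k = [c.strip().lower() for c in candidates[:k]]
        if reference.strip().lower() in top_k:
            hits += 1
    return 100.0 * hits / len(references)

# Toy input: one ranked candidate list per fact (placeholder strings).
ranked = [["paris", "london", "rome"], ["berlin", "vienna", "munich"]]
refs = ["London", "Prague"]

print(f"p@1 = {precision_at_k(ranked, refs, 1):.2f}%")
print(f"p@3 = {precision_at_k(ranked, refs, 3):.2f}%")
```

With the pipeline, the ranked list for one subject could be built as `[c["token_str"] for c in unmasker(prompt)]`.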

*Additional Exercises*
1. Can you improve the accuracy of the predictions made by BERT?
2. BERT does not give us just one prediction; it ranks several candidates by the probability it assigns to them. What if we consider not only the first predicted object but, for example, the first 5 (p@5)? Adjust the code to get p@5 in addition to p@1. How do these two differ?
3. Now you can try the other two given sample datasets, first POB (place of birth), and after that DOB (date of birth).

In [None]:
#pob = pd.read_json('https://raw.githubusercontent.com/MMesgar/summer_school_2022/main/sample_data/place_of_birth_test.jsonl', lines=True)

In [None]:
#dob = pd.read_json('https://raw.githubusercontent.com/MMesgar/summer_school_2022/main/sample_data/date_of_birth_test.jsonl', lines=True)