**LMs and Knowledge Bases**
This notebook explains a simple solution for Practice II of Lecture 4. In the exercise, we are asked to focus on three relation types: "place of birth (POB)", "date of birth (DOB)", and "place of death (POD)". This solution is designed for POD. However, you should be able to design a similar solution for POB and DOB by following the same steps.

First we need to install the transformers library because we are going to use BERT.

In [42]:
!pip install datasets transformers[sentencepiece]



Now we should read the Google RE file. To do so, we manually upload the file from our local machine to Google Colab.

Then we read the content of the file using the pandas package.
As with any other Python package, we must import pandas before we can use it.

In [43]:
import pandas as pd

Now let's read the content of the file. 

In [44]:
pod = pd.read_json('/content/sample_data/place_of_death_test.jsonl', lines=True)
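As a rough illustration of what `lines=True` means, the sketch below parses JSON Lines text with only the standard library: every non-empty line is treated as one JSON object. The helper name `read_jsonl` and the two-record sample string are made up for this illustration; only the `sub_label` and `obj_label` field names come from the dataset.

```python
import io
import json


def read_jsonl(handle):
    """Parse a JSON Lines stream: one JSON object per non-empty line."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


# Illustrative sample only; the real file has many more fields per record.
sample = io.StringIO(
    '{"sub_label": "Diego de Arroyo", "obj_label": "Madrid"}\n'
    '{"sub_label": "Honoré Tournély", "obj_label": "Paris"}\n'
)
records = read_jsonl(sample)
print(len(records), records[0]["obj_label"])
```

In the notebook itself, pandas does this parsing for us and additionally arranges the records into a dataframe with one column per field.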

Let's see how many facts this dataset contains.

In [63]:
num_facts = len(pod)
print(f"Number of POD (place of death) facts is: {num_facts}")

Number of POD (place of death) facts is: 766


Let's take a look at the first 5 rows of the pod dataframe. The ``head()`` method does this for us.

In [45]:
pod.head()

Unnamed: 0,pred,sub,obj,evidences,judgments,sub_w,sub_label,sub_aliases,obj_w,obj_label,obj_aliases,uuid,masked_sentences
0,/people/deceased_person/place_of_death,/m/0205jm,/m/06mzp,[{'url': 'http://en.wikipedia.org/wiki/John_Re...,"[{'rater': '17966044108931836156', 'judgment':...",Q3182510,John Renshaw Starr,[],Q39,Switzerland,"[Swiss Confederation, CH, SUI, Suisse, Schweiz...",61aad52c-4256-468a-a9ae-55e3fa4dc44e,[After the war John Starr opened a night-club ...
1,/people/deceased_person/place_of_death,/m/0c4031r,/m/056_y,[{'url': 'http://en.wikipedia.org/wiki/Diego_d...,"[{'rater': '3633697795227880988', 'judgment': ...",Q5274857,Diego de Arroyo,[],Q2807,Madrid,[City of Madrid],97483332-dd08-45d5-8181-a4438756c351,[Arroyo died at [MASK] in 1551 .]
2,/people/deceased_person/place_of_death,/m/0bmf_6s,/m/0sn4f,[{'url': 'http://en.wikipedia.org/wiki/Art_Mur...,"[{'rater': '14404876356854644346', 'judgment':...",,Art Murakowski,[],Q856860,Hammond,"[Hammond, Indiana]",9ed48bca-4f61-412f-80c2-42bd44a63d58,[Murakowski died in 1985 at age 60 at his home...
3,/people/deceased_person/place_of_death,/m/05tmcf,/m/0jdtt,[{'url': 'http://en.wikipedia.org/wiki/Laurent...,"[{'rater': '3633697795227880988', 'judgment': ...",Q1808061,Laurent Belissen,[],Q23482,Marseille,"[Bay of Marseille, Massaliotes, Massalia, Mars...",10fb9157-33b5-4254-a27f-6abec0e5ce50,[Belissen remained in [MASK]s until his death .]
4,/people/deceased_person/place_of_death,/m/03qkk0z,/m/05qtj,[{'url': 'http://en.wikipedia.org/wiki/Honor%C...,"[{'rater': '8841266254638695693', 'judgment': ...",Q3140264,Honoré Tournély,[Honore Tournely],Q90,Paris,"[City of Light, Paris, France]",2bad5d83-437c-432b-a65d-d2aaedec8e9a,[Tournély died at [MASK] .]


Look at the column names. Two columns are important for us: 'sub_label' and 'obj_label'. Together they show which person died in which place (usually a city).

In [10]:
pod_data_samples = pod[['sub_label','obj_label']]

In [11]:
pod_data_samples.head()

Unnamed: 0,sub_label,obj_label
0,John Renshaw Starr,Switzerland
1,Diego de Arroyo,Madrid
2,Art Murakowski,Hammond
3,Laurent Belissen,Marseille
4,Honoré Tournély,Paris


Now we create a list of subjects, which are persons, and a list of their corresponding objects, which are the places where those persons died.

In [30]:
subjects = pod['sub_label'].to_list()
reference_objects = pod['obj_label'].to_list()

Now we are ready to evaluate BERT. The idea is to give each subject to BERT and see what it returns as the city in which the person died. 

Let's tell the pipeline to use BERT.

In [17]:
from transformers import pipeline
unmasker = pipeline("fill-mask", model="bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


For each subject in the list, we fill the template ``{s} died in [MASK].`` with the subject's name. We give the resulting sentence to BERT and take the top-ranked token it predicts for ``[MASK]`` as the object.

In [51]:
predictions = []
for s in subjects:
  # The pipeline returns candidates sorted by score; [0] is the top-ranked one.
  predicted_obj = unmasker(f"{s} died in [MASK].")[0]["token_str"]
  predictions.append(predicted_obj)
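To adapt the loop to the other two relations, only the template needs to change. The sketch below keeps one template per relation in a dictionary and builds the prompts from it. The POB and DOB wordings, the `TEMPLATES` dictionary, and the `build_prompt` helper are assumptions made for illustration (the DOB wording additionally assumes the reference object is a year); only the POD template comes from this notebook.

```python
# Hypothetical templates, one per relation. Only the POD wording is used in
# this notebook; the POB and DOB wordings are illustrative guesses.
TEMPLATES = {
    "POD": "{s} died in [MASK].",
    "POB": "{s} was born in [MASK].",
    "DOB": "{s} was born in the year [MASK].",
}


def build_prompt(subject, relation):
    """Fill the template of the given relation with the subject's name."""
    return TEMPLATES[relation].format(s=subject)


# Two subjects taken from the first rows of the dataframe.
example_subjects = ["Diego de Arroyo", "Honoré Tournély"]
prompts = [build_prompt(s, "POD") for s in example_subjects]
print(prompts)
```

Each prompt contains exactly one `[MASK]`, so it can be passed to `unmasker` in the same way as the f-string in the loop above.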

Let's look at the first 10 predicted labels and the first 10 reference objects. A list this short is easy to compare by eye.

In [54]:
print(predictions[:10])
print(reference_objects[:10])

['office', 'madrid', 'office', 'paris', 'paris', 'office', 'office', 'moscow', 'london', 'office']
['Switzerland', 'Madrid', 'Hammond', 'Marseille', 'Paris', 'Greenwich', 'California', 'Moscow', 'Brighton', 'Tacoma']


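Although we can manually compare a short list of predicted labels and reference labels, we need code to compare the lists automatically when they contain many items. The following loop does this comparison. It lowercases both labels, because bert-base-uncased returns lowercase tokens, and reports precision at 1 (p@1), the percentage of facts whose top prediction matches the reference: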

In [64]:
correct = 0.0
for i in range(num_facts):
  predicted_label = predictions[i].lower()
  ref_label = reference_objects[i].lower()
  if predicted_label == ref_label:
    correct += 1

p_at_1 = correct / len(predictions)
p_at_1 = p_at_1 * 100
print(f"number of facts is: {num_facts}, p@1 = {p_at_1:.2f}%")

number of facts is: 766, p@1 = 10.31%
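The same comparison can be wrapped in a reusable function, which is convenient when repeating the evaluation for POB and DOB. The following is a minimal sketch, not part of the original solution; the function name `precision_at_1` and the extra whitespace stripping are choices made here. It is applied to the ten predictions and references printed earlier in this notebook.

```python
def precision_at_1(predictions, references):
    """Percentage of positions where prediction equals reference, ignoring case."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(predictions)


# The first ten predictions and reference objects shown above.
sample_predictions = ['office', 'madrid', 'office', 'paris', 'paris',
                      'office', 'office', 'moscow', 'london', 'office']
sample_references = ['Switzerland', 'Madrid', 'Hammond', 'Marseille', 'Paris',
                     'Greenwich', 'California', 'Moscow', 'Brighton', 'Tacoma']

score = precision_at_1(sample_predictions, sample_references)
print(f"p@1 on the first 10 facts = {score:.2f}%")
```

On these ten facts, three predictions match (Madrid, Paris, and Moscow). Such a small sample is not representative of the whole dataset, where the score reported above is 10.31%.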
