## BetaE Demo
Group members: Di Mo, Siyu Zhang 

### Overview of BetaE
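BetaE embeds both queries and entities as Beta distributions and uses these embeddings for multi-hop logical reasoning over knowledge graphs (KGs).

BetaE supports projection, intersection, and negation, with union handled through DNF or De Morgan's laws, so it can handle first-order logic queries built from these operations. The Beta distribution also lets it model the uncertainty of a query.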

![BetaE approach](BetaE.png)
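To make the idea of a Beta embedding more concrete, here is a minimal sketch of a single embedding dimension. It is illustrative only and is not the repository's code: the helper names are hypothetical, variance is used here as a simple proxy for uncertainty, and the negation rule (taking the reciprocal of both shape parameters) follows our reading of the BetaE paper.

```python
# Illustrative sketch: one embedding dimension is a Beta(alpha, beta) distribution.
# All function names here are hypothetical helpers, not part of the BetaE code base.

def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def beta_variance(alpha: float, beta: float) -> float:
    """Variance of a Beta(alpha, beta) distribution (a simple proxy for uncertainty)."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))

def negate(alpha: float, beta: float) -> tuple:
    """Negation as we understand it from the paper: reciprocal of both parameters."""
    return 1.0 / alpha, 1.0 / beta

# Larger parameters give a more concentrated (less uncertain) distribution.
print(beta_variance(20.0, 20.0) < beta_variance(1.0, 1.0))
# Negation turns large parameters into small ones, and vice versa.
print(negate(4.0, 2.0))
```

In the model each query or entity has many such dimensions (the embedding dimension in the hyperparameter table), each with its own pair of parameters.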

### Experiment

#### Dataset Description
The KG data (FB15k, FB15k-237, NELL995) can be downloaded [here](http://snap.stanford.edu/betae/KG_data.zip). 

Each folder in the archive holds one KG and contains the following files:
- `train.txt/valid.txt/test.txt`: KG edges
- `id2rel/rel2id/ent2id/id2ent.pkl`: dicts mapping between ids and KG entities/relations
- `train-queries/valid-queries/test-queries.pkl`: `defaultdict(set)`, each key represents a query structure, and the value represents the instantiated queries
- `train-answers.pkl`: `defaultdict(set)`, each key represents a query, and the value represents the answers obtained in the training graph (edges in `train.txt`)
- `valid-easy-answers/test-easy-answers.pkl`: `defaultdict(set)`, each key represents a query, and the value represents the answers obtained in the training graph (edges in `train.txt`) / valid graph (edges in `train.txt`+`valid.txt`)
- `valid-hard-answers/test-hard-answers.pkl`: `defaultdict(set)`, each key represents a query, and the value represents the **additional** answers obtained in the validation graph (edges in `train.txt`+`valid.txt`) / test graph (edges in `train.txt`+`valid.txt`+`test.txt`)
- `mid2name.tsv`: maps Freebase mids to entity names. It can be downloaded [here](https://drive.google.com/file/d/1RJoOll7p7kLgQoyGKdggehRYY5hyryW3/view?usp=sharing)
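The split between easy and hard answers can be illustrated on a toy graph. The sketch below uses made-up entity and relation ids and a hypothetical `answers_1p` helper; it only shows how the definitions above relate to each other and is not how the dataset files were actually generated.

```python
# Toy graphs: sets of (head, relation, tail) triples. All ids are made up.
train_edges = {(0, 10, 1), (0, 10, 2)}
valid_edges = {(0, 10, 3)}
test_edges = {(0, 10, 4)}

def answers_1p(edges, query):
    """Answer a 1p query (entity, (relation,)) by direct lookup in a set of edges."""
    head, (rel,) = query
    return {t for h, r, t in edges if h == head and r == rel}

query = (0, (10,))
train_answers = answers_1p(train_edges, query)
valid_all = answers_1p(train_edges | valid_edges, query)
test_all = answers_1p(train_edges | valid_edges | test_edges, query)

# Easy answers are already reachable in the smaller graph;
# hard answers are the additional ones that need the new edges.
valid_easy, valid_hard = train_answers, valid_all - train_answers
test_easy, test_hard = valid_all, test_all - valid_all

print(valid_easy, valid_hard)
print(test_easy, test_hard)
```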

Note that `mid2name` does not cover every entity id. This is probably because Freebase is deprecated and the `mid2name` mapping we use is older than the dataset.
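Because some mids are missing from the mapping, a lookup that falls back to the raw mid avoids a `KeyError`. This is a small sketch with a hypothetical helper name and made-up toy dictionaries; the real `id2ent` and `mid2name` are loaded from the files listed above.

```python
def safe_id2name(ent_id, id2ent, mid2name):
    """Return the entity name, or the raw Freebase mid if no name is known."""
    mid = id2ent[ent_id]
    return mid2name.get(mid, mid)

# Toy data for illustration only (these mids and names are made up).
toy_id2ent = {0: "/m/aaa", 1: "/m/bbb"}
toy_mid2name = {"/m/aaa": "Example Entity"}

print(safe_id2name(0, toy_id2ent, toy_mid2name))  # name found
print(safe_id2name(1, toy_id2ent, toy_mid2name))  # falls back to the mid
```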



#### Experimental Details

##### Hyper Parameters

The authors provide tuned hyperparameters for BetaE and the baseline methods (GQE, Q2B):

|   |embedding dim|learning rate|batch size|negative sample size|margin|
|---|-------------|-------------|----------|--------------------|------|
|GQE|800|0.0005|512|128|30|
|Q2B|400|0.0005|512|128|30|
|BetaE|400|0.0005|512|128|60|

In our demo, we are going to run the BetaE model with suggested parameters from the table.
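As a convenience, the table can be turned into command-line flags. The sketch below is an assumption-laden helper, not part of the repository: `HPARAMS` and `build_command` are hypothetical names, and only the flags that appear in the table are filled in (other options such as `-betam` or `-boxm` are left at their defaults).

```python
# Hyperparameters from the table, keyed by the value passed to --geo.
HPARAMS = {
    "vec":  {"-d": 800, "-lr": 0.0005, "-b": 512, "-n": 128, "-g": 30},  # GQE
    "box":  {"-d": 400, "-lr": 0.0005, "-b": 512, "-n": 128, "-g": 30},  # Q2B
    "beta": {"-d": 400, "-lr": 0.0005, "-b": 512, "-n": 128, "-g": 60},  # BetaE
}

def build_command(geo, data_path="data/FB15k-237-betae"):
    """Build a training command string for main.py (hypothetical helper)."""
    flags = " ".join(f"{flag} {value}" for flag, value in HPARAMS[geo].items())
    return f"python3 main.py --do_train --data_path {data_path} --geo {geo} {flags}"

print(build_command("beta"))
```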

Data structure of `test_queries` (generated with `create_queries.py`): each key is a query structure, as listed below.

##### Query structures
| Key: query type | structure |
| ---- | ----------- |
| 1p | ('e', ('r',)) |
| 2p | ('e', ('r', 'r')) |
| 3p | ('e', ('r', 'r', 'r')) |
| 2i | (('e', ('r',)), ('e', ('r',))) |
| 3i | (('e', ('r',)), ('e', ('r',)), ('e', ('r',))) |
| ip | ((('e', ('r',)), ('e', ('r',))), ('r',)) |
| pi | (('e', ('r', 'r')), ('e', ('r',))) |
| 2in | (('e', ('r',)), ('e', ('r', 'n'))) |
| 3in | (('e', ('r',)), ('e', ('r',)), ('e', ('r', 'n'))) |
| inp | ((('e', ('r',)), ('e', ('r', 'n'))), ('r',)) |
| pin | (('e', ('r', 'r')), ('e', ('r', 'n'))) |
| pni | (('e', ('r', 'r', 'n')), ('e', ('r',))) |
| 2u by DNF | (('e', ('r',)), ('e', ('r',)), ('u',)) |
| up by DNF | ((('e', ('r',)), ('e', ('r',)), ('u',)), ('r',)) |
| 2u by DM | ((('e', ('r', 'n')), ('e', ('r', 'n'))), ('n',)) |
| up by DM | ((('e', ('r', 'n')), ('e', ('r', 'n'))), ('n', 'r')) |

e:   entities of KGs

r:   projection relation, represented by a relation id (positive)

n:   negation, represented by the constant -2

u:   union, represented by the constant -1

DNF: transform queries into disjunctive normal form

DM:  transform queries with De Morgan's laws
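The legend above can be checked by mapping a concrete query back to its structure. The function below is a hypothetical helper written for this demo, not part of the repository; it assumes that a bare integer is an entity id and that a tuple made only of integers is a sequence of relation, negation (-2), or union (-1) operators.

```python
def to_structure(query):
    """Map a concrete query to its structure template of 'e', 'r', 'n', 'u'."""
    if isinstance(query, int):
        return 'e'  # a bare integer is an anchor entity
    if all(isinstance(x, int) for x in query):
        # an operator tuple: -2 is negation, -1 is union, anything else a relation
        return tuple('n' if x == -2 else 'u' if x == -1 else 'r' for x in query)
    return tuple(to_structure(x) for x in query)

print(to_structure((967, (35,))))
print(to_structure(((967, (35,)), (8734, (351, -2)))))
print(to_structure((((2192, (421,)), (6102, (55,)), (-1,)), (54,))))
```

The three calls correspond to the `1p`, `2in`, and `up` (by DNF) rows of the table.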

##### Query Example


**Load queries dict**

In [15]:
# use the FB15k-237-betae test queries as an example
import pickle
import os
data_path = "data/FB15k-237-betae"
test_queries = pickle.load(open(os.path.join(data_path, "test-queries.pkl"), 'rb'))
test_hard_answers = pickle.load(open(os.path.join(data_path, "test-hard-answers.pkl"), 'rb'))
test_easy_answers = pickle.load(open(os.path.join(data_path, "test-easy-answers.pkl"), 'rb'))

In [16]:
# show some queries with structure 2in: (('e', ('r',)), ('e', ('r', 'n')))
# -1 is Union, -2 is Negation
list(test_queries[(('e', ('r',)), ('e', ('r', 'n')))])[0:3]

[((967, (35,)), (8734, (351, -2))),
 ((3183, (44,)), (592, (49, -2))),
 ((1002, (35,)), (8638, (152, -2)))]

In [17]:
# These queries are not very human-readable, so let's define a few helpers to decode them.
# First, load the necessary dictionaries.
import csv

# Load the dict mapping entity ids to Freebase mids
id2ent = pickle.load(open(os.path.join(data_path, "id2ent.pkl"), 'rb'))
# Load the Freebase mid <-> name mappings
mid2name = {}
name2mid = {}
with open("data/mid2name.tsv") as fd:
    rd = csv.reader(fd, delimiter="\t", quotechar='"')
    for row in rd:
        mid2name[row[0]] = row[1]
        name2mid[row[1]] = row[0]


# Load the dict mapping relation ids to relation names
id2rel = pickle.load(open(os.path.join(data_path, "id2rel.pkl"), 'rb'))

In [112]:
# Define a decoder that converts a query to text
def id2name(id:int):
    return mid2name[id2ent[id]]
def query2text(query_type: tuple, query: tuple, res:str):
    n = len(query_type)
    res=res+'('
    for i in range(n):
        if query_type[i] == 'e':
            res=res+id2name(query[i])+","
        elif query_type[i] == 'r':
            res=res+id2rel[query[i]]+","
        elif query_type[i] == 'n':
            res=res+"Negation"
        elif query_type[i] == 'u':
            res=res+"Union"
        elif isinstance(query_type[i], tuple):
            res = query2text(query_type[i], query[i], res)
            if i<n-1:
                res+=',\n'
    res = res+')'
    # print(res)
    return res

In [116]:
# Convert a query to a string, given its query type and the query itself
qt = ('e', ('r',))
q = (7330, (38,))
r = query2text(qt, q,'')
print(r)

(Jet Lee,(+/award/award_nominee/award_nominations./award/award_nomination/award,))


In [None]:
# Now let's see the answers of this query

In [117]:
for e in test_easy_answers[q]:
    print(id2name(e))

for e in test_hard_answers[q]:
    print(id2name(e))

Hong Kong Film Award for Best Film
MTV Movie Award for Best Fight
MTV Movie Award for Best Villain
Hundred Flowers Award for Best Actor


#### Run Experiment 
We can use a bash script to run the commands.

##### Argument Overview
* '--cuda', action='store_true', help='use GPU'

* '--do_train', action='store_true', help="do train"
* '--do_valid', action='store_true', help="do valid"
* '--do_test', action='store_true', help="do test"

* '--data_path', type=str, default=None, help="KG data path"
* '-n', '--negative_sample_size', default=128, type=int, help="negative entities sampled per query"
* '-d', '--hidden_dim', default=500, type=int, help="embedding dimension"
* '-g', '--gamma', default=12.0, type=float, help="margin in the loss"
* '-b', '--batch_size', default=1024, type=int, help="batch size of queries"
* '--test_batch_size', default=1, type=int, help='valid/test batch size'
* '-lr', '--learning_rate', default=0.0001, type=float
* '-cpu', '--cpu_num', default=10, type=int, help="used to speed up torch.dataloader"
* '-save', '--save_path', default=None, type=str, help="no need to set manually, will configure automatically"
* '--max_steps', default=100000, type=int, help="maximum iterations to train"
* '--warm_up_steps', default=None, type=int, help="no need to set manually, will configure automatically"

* '--save_checkpoint_steps', default=50000, type=int, help="save checkpoints every xx steps"
* '--valid_steps', default=10000, type=int, help="evaluate validation queries every xx steps"
* '--log_steps', default=100, type=int, help='train log every xx steps'
* '--test_log_steps', default=1000, type=int, help='valid/test log every xx steps'

* '--nentity', type=int, default=0, help='DO NOT MANUALLY SET'
* '--nrelation', type=int, default=0, help='DO NOT MANUALLY SET'

* '--geo', default='vec', type=str, choices=['vec', 'box', 'beta'], help='the reasoning model, vec for GQE, box for Query2box, beta for BetaE'
* '--print_on_screen', action='store_true'

* '--tasks', default='1p.2p.3p.2i.3i.ip.pi.2in.3in.inp.pin.pni.2u.up', type=str, help="tasks connected by dot, refer to the BetaE paper for detailed meaning and structure of each task"
* '--seed', default=0, type=int, help="random seed"
* '-betam', '--beta_mode', default="(1600,2)", type=str, help='(hidden_dim,num_layer) for BetaE relational projection'
* '-boxm', '--box_mode', default="(none,0.02)", type=str, help='(offset activation,center_reg) for Query2box, center_reg balances the in_box dist and out_box dist'
* '--prefix', default=None, type=str, help='prefix of the log path'
* '--checkpoint_path', default=None, type=str, help='path for loading the checkpoints'
* '-evu', '--evaluate_union', default="DNF", type=str, choices=['DNF', 'DM'], help='the way to evaluate union queries, transform it to disjunctive normal form (DNF) or use the De Morgan\'s laws (DM)'
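Two of these arguments are passed as strings and need to be decoded: `--tasks` is a dot-separated list and `--beta_mode` is a tuple written as text. The sketch below shows one way to decode them; the helper names are hypothetical and this is not necessarily how `main.py` does it. Note that `ast.literal_eval` only handles purely literal values, so it would not work for the `--box_mode` default, which contains the bare word `none`.

```python
import ast

def parse_tasks(tasks: str) -> list:
    """Split the dot-separated task string into a list of task names."""
    return tasks.split(".")

def parse_mode(mode: str) -> tuple:
    """Parse a tuple written as a string, e.g. "(1600,2)" -> (1600, 2)."""
    return ast.literal_eval(mode)

tasks = parse_tasks("1p.2p.3p.2i.3i.ip.pi.2in.3in.inp.pin.pni.2u.up")
hidden_dim, num_layers = parse_mode("(1600,2)")

print(len(tasks), tasks[:3])
print(hidden_dim, num_layers)
```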


##### Training Example

In [1]:
!CUDA_VISIBLE_DEVICES=0 python3 main.py --cuda --do_train --do_valid --do_test \
  --data_path data/FB15k-237-betae -n 128 -b 512 -d 400 -g 60 \
  -lr 0.0005 --max_steps 450001 --cpu_num 1 --geo beta --valid_steps 15000 \
  -betam "(1600,2)"

overwritting args.save_path
logging to logs/FB15k-237-betae/1p.2p.3p.2i.3i.ip.pi.2in.3in.inp.pin.pni.2u.up/beta/g-60.0-mode-(1600,2)/2022.05.31-15:11:11


##### Evaluation Example

In [None]:
!CUDA_VISIBLE_DEVICES=0 python3 main.py --cuda --do_test \
  --data_path data/FB15k-237-betae -n 128 -b 512 -d 400 -g 60 \
  -lr 0.0005 --max_steps 450001 --cpu_num 1 --geo beta --valid_steps 15000 \
  -betam "(1600,2)" --checkpoint_path /home_ssd2/dimo/BetaE/logs/FB15k-237-betae/1p.2p.3p.2i.3i.ip.pi.2in.3in.inp.pin.pni.2u.up/beta/g-60.0-mode-\(1600,2\)/2022.05.16-16:08:50
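The checkpoint path ends in a timestamped run folder, as the training log above shows. If several runs exist, the most recent one can be picked by name, because timestamps in the `YYYY.MM.DD-HH:MM:SS` form sort chronologically as plain strings. The helpers below are hypothetical and assume that every entry in the parent folder is such a run folder.

```python
import os

def pick_latest(run_names):
    """Return the most recent run name; relies on lexicographic timestamp order."""
    return max(run_names)

def latest_run_dir(parent):
    """Return the path of the most recent run folder inside `parent`."""
    return os.path.join(parent, pick_latest(os.listdir(parent)))

runs = ["2022.05.16-16:08:50", "2022.06.05-16:49:08", "2022.05.31-15:11:11"]
print(pick_latest(runs))
```

The returned path can then be passed to `--checkpoint_path`.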

##### Visualize Results
Use TensorBoard to visualize the results.

In [41]:
%load_ext tensorboard
%tensorboard --logdir logs
# reload_ext tensorboard


Reusing TensorBoard on port 6006 (pid 80877), started 1 day, 1:33:04 ago. (Use '!kill 80877' to kill it.)

#### Results on dataset FB15k-237, Union operation by DNF
GPU: GeForce RTX 2080 Ti Rev. A

|    |1p   |2p   |3p   |2i   |3i   |ip   |pi   |2in  |3in  |inp  |pin  |pni  |2u   |up   |avg  |
|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|MRR |33.74|8.39 | 8.04|21.08|33.64| 8.66|16.40| 4.61| 5.85| 6.37| 3.17| 3.16| 9.42| 7.31|12.13|
|H@1 |24.14|3.62 | 3.65|11.74|23.07| 3.92| 9.16| 1.52| 1.67| 2.34| 0.64| 0.86| 4.55| 2.82| 6.69|
|H@3 |37.10|8.23 | 7.90|23.03|37.41| 8.69|17.31| 3.64| 5.10| 5.66| 2.37| 2.20| 9.17| 7.03|12.49|
|H@10|53.42|17.40|16.41|40.90|55.89|17.54|31.15| 9.70|13.15|13.97| 7.18| 6.64|18.64|15.85|22.70|
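The `avg` column is the unweighted mean over the 14 query structures. As a quick sanity check, the MRR row can be averaged directly; the values below are copied from the table above.

```python
from statistics import mean

# MRR per query structure, copied from the table above.
mrr_by_structure = {
    "1p": 33.74, "2p": 8.39, "3p": 8.04, "2i": 21.08, "3i": 33.64,
    "ip": 8.66, "pi": 16.40, "2in": 4.61, "3in": 5.85, "inp": 6.37,
    "pin": 3.17, "pni": 3.16, "2u": 9.42, "up": 7.31,
}

avg_mrr = mean(mrr_by_structure.values())
print(round(avg_mrr, 2))
```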


##### The paper's results on dataset FB15k-237 Union operation by DNF
|    |1p   |2p   |3p   |2i   |3i   |ip   |pi   |2in  |3in  |inp  |pin  |pni  |2u   |up   |avg  |
|----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
|MRR | 39.0| 10.9| 10.0| 28.8| 42.5|12.6 | 22.4|5.1  | 7.9 | 7.4 | 3.6 | 3.4 | 12.4| 9.7 |     |
|H@1 |28.9 |5.5  | 4.9 | 18.3 |31.7| 6.7 | 14.0|     |     |     |     |     |  6.3|  4.6|     |
|H@10|     |     |     |     |     |     |     | 11.3|17.3 |16.0 | 8.1 | 7.0 |     |     |     |
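For reference, MRR and Hits@K are both functions of the rank that the model assigns to each correct answer. The sketch below uses the standard definitions with 1-based ranks and made-up example ranks; how the repository filters and aggregates ranks over answers and queries is not shown here, and the tables report the values as percentages.

```python
def mrr(ranks):
    """Mean reciprocal rank; ranks are 1-based (1 = best)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of answers ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Made-up ranks of three correct answers, for illustration only.
ranks = [1, 3, 12]
print(f"MRR  = {mrr(ranks):.4f}")
for k in (1, 3, 10):
    print(f"H@{k} = {hits_at_k(ranks, k):.4f}")
```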


### Inference Example

In [11]:
import torch
from torch.utils.data import DataLoader
from models import KGReasoning
from dataloader import TestDataset, TrainDataset, SingledirectionalOneShotIterator
import pickle
from collections import defaultdict
from util import flatten_query, list2tuple, parse_time, set_global_seed, eval_tuple

In [12]:
# Load model
from main import query_name_dict
N_ENTITY = 14505
N_RELATION = 474

model = KGReasoning(nentity=N_ENTITY, nrelation=N_RELATION, hidden_dim=400, gamma=60, 
                    geo='beta', beta_mode=(1600,2), query_name_dict=query_name_dict)
# NOTE: nentity=14505, nrelation=474 must match dataset stats
checkpoint_path = "logs/FB15k-237-betae/1p.2p.3p.2i.3i.ip.pi.2in.3in.inp.pin.pni.2u.up/beta/g-60.0-mode-(1600,2)/2022.06.05-16:49:08"
checkpoint = torch.load(os.path.join(checkpoint_path, 'checkpoint'))
init_step = checkpoint['step']
model.load_state_dict(checkpoint['model_state_dict'])
model.cuda()

KGReasoning(
  (center_net): BetaIntersection(
    (layer1): Linear(in_features=800, out_features=800, bias=True)
    (layer2): Linear(in_features=800, out_features=400, bias=True)
  )
  (projection_net): BetaProjection(
    (layer1): Linear(in_features=1200, out_features=1600, bias=True)
    (layer0): Linear(in_features=1600, out_features=800, bias=True)
    (layer2): Linear(in_features=1600, out_features=1600, bias=True)
  )
)

In [108]:
def inference_query(qt, q):
    print("==========Query is")
    print(query2text(query_type=qt,query=q,res=""))
    # prepare data
    infer_data = defaultdict(set)
    infer_data[qt] = {q}
    infer_queries = flatten_query(infer_data)
    infer_dataloader = DataLoader(
        TestDataset(
            infer_queries, 
            N_ENTITY, 
            N_RELATION, 
        ), 
        batch_size=1,
        num_workers=1, 
        collate_fn=TestDataset.collate_fn
    )


    # Use the model to find results
    model.eval()
    with torch.no_grad():
        for negative_sample, queries, queries_unflatten, query_structures in infer_dataloader:
            batch_queries_dict = defaultdict(list)
            batch_idxs_dict = defaultdict(list)
            for i, query in enumerate(queries):
                batch_queries_dict[query_structures[i]].append(query)
                batch_idxs_dict[query_structures[i]].append(i)
            for query_structure in batch_queries_dict:
                batch_queries_dict[query_structure] = torch.LongTensor(batch_queries_dict[query_structure]).cuda()
            negative_sample = negative_sample.cuda()

            _, negative_logit, _, idxs = model(None, negative_sample, None, batch_queries_dict, batch_idxs_dict)
            queries_unflatten = [queries_unflatten[i] for i in idxs]
            query_structures = [query_structures[i] for i in idxs]
            argsort = torch.argsort(negative_logit, dim=1, descending=True) # the results entities 
            ranking = argsort.clone().to(torch.float)
            ranking = ranking.scatter_(1, argsort, model.batch_entity_range.cuda()) # achieve the ranking of all entities
    # Top-10 answers predicted by the trained model
    print("\n==========Top-10 answers from the trained model")
    for i in range(10):
        res_id = int(argsort.cpu()[0][i])
        print(id2name(res_id))
    print("\n==========Test answers")
    hard_answer = test_hard_answers[q]
    easy_answer = test_easy_answers[q]
    # Print a handful of easy answers, skipping entities missing from mid2name
    easy_i = 0
    for ans_id in easy_answer:
        if easy_i>10:
            break
        try:
            print(id2name(ans_id))
            easy_i+=1
        except KeyError:
            pass
    # Same for the hard answers
    hard_i=0
    for ans_id in hard_answer:
        if hard_i>10:
            break
        try:
            print(id2name(ans_id))
            hard_i+=1
        except KeyError:
            pass
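The `argsort` and `scatter_` step in `inference_query` turns a score per entity into a rank per entity. The following plain-Python analogue, with a hypothetical helper name and made-up scores, shows the same idea without tensors: `order` lists entity ids from best to worst, and `ranking[entity_id]` is that entity's 0-based position in `order`.

```python
def ranks_from_scores(scores):
    """Return (order, ranking) for a list of scores, higher score = better."""
    # order[j] is the entity id at position j (like torch.argsort, descending)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # ranking[entity_id] = position of that entity (what scatter_ fills in)
    ranking = [0] * len(scores)
    for position, entity_id in enumerate(order):
        ranking[entity_id] = position
    return order, ranking

order, ranking = ranks_from_scores([0.1, 0.9, 0.5])
print(order)    # entity ids from best to worst
print(ranking)  # 0-based rank of each entity id
```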

In [118]:
# Example query 1: awards that Jet Lee has been nominated for
qt = ('e', ('r',))
q = (7330, (38,))
inference_query(qt=qt,q=q)

(Jet Lee,(+/award/award_nominee/award_nominations./award/award_nomination/award,))

MTV Movie Award for Best Fight
Razzie Award for Worst Supporting Actor
Hong Kong Film Award for Best Film
Hundred Flowers Award for Best Actor
Writers Guild of America Award for Best Original Screenplay
MTV Movie Award for Best Villain
Palme d’Or
Hong Kong Film Awards for Best Action Choreography
MTV Movie Award for Best Kiss
Bafta award for best original screenplay

Hong Kong Film Award for Best Film
MTV Movie Award for Best Fight
MTV Movie Award for Best Villain
Hundred Flowers Award for Best Actor


In [119]:
# Example query 2: find the parent organization of The Taito Corporation (the relation is the inverse, "-", of the child relation)
qt = ('e', ('r',))
q= (13632, (415,))
inference_query(qt=qt,q=q)

(The Taito Corporation,(-/organization/organization/child./organization/organization_relationship/child,))

Koei Tecmo Holdings Co., Ltd.
JOCX
NAMCO BANDAI Mirai-Kenkyusho
Presidency College, Kolkata
Bharatiya Janta Party
Sony Computer Entertainment Asia
Oregroun
UK Royal Navy
Microsoft Inc
G. Washington University

Sukueia Enikkusu


In [120]:
qt = ((('e', ('r',)), ('e', ('r',)), ('u',)), ('r',))
# q = random.choice(tuple(test_queries[qt]))
q = (((2192, (421,)), (6102, (55,)), (-1,)), (54,))
inference_query(qt=qt,q=q)

(((The Grammy Awards,(-/time/event/instance_of_recurring_event,)),
(Thomas, Rob,(-/award/award_ceremony/awards_presented./award/award_honor/award_winner,)),
(Union)),
(+/award/award_ceremony/awards_presented./award/award_honor/award_winner,))

3 Quid
T Bone Burnett
List of people likened to Bob Dylan
V.S.O.P. (group)
Springsteen
Allison Krausse
LVs & Autotune
Slim shady
Alicia Cook
Sheryl Suzanne Crow

Sheryl Suzanne Crow
Rod Steward
The temptations
Haggard, Merle
New Age (Kylie Minogue album)
Radio head
List of awards and nominations received by Korn
Clint Black
Lorca Cohen
Christopher Haden-Guest, 5th Baron Haden-Guest
Manny Maroquinn
Chase Chad
Whitney houston
Graham Greene (actor)
Trevor horn
Coalbear
Fifty cent
Smashing Pumpkins (band)
Gyorgy Stern
Karen Astley
The Bruce Hornsby Trio
Teddy Landau


In [121]:
qt= (('e', ('r',)), ('e', ('r', 'n')))
q=((967, (35,)), (8734, (351, -2)))
inference_query(qt=qt,q=q)

((Comedy performer,(-/people/person/profession,)),
(Shemp howard,(-/people/person/sibling_s./people/sibling_relationship/sibling,Negation)))

Tina Faye
Frissbeetarianism
Elizabeth Ann Powel
Will Arnet
Carey Fisher
Amy Pohler
Hank azaria
Give the Jew Girl Toys
Jason sudeikas
Kirsten Wigg

Deborah messing
Roberto Benini
Laura Deibel
Mel Blank
Merv the Perv
Seth Rogan
Paula Sutor
Marxist of the groucho variety
Louie De Palma
Bernice Frankel
Lori Wolf
John Ritter
Christopher Graham Collins
MC Gainey
Clams on the Half-Shell Revue
Michael Rappaport
Thomas Lennon (actor and screenwriter)
Reiser Paul
Anna Kay Faris
Johny Lever
Maurice Lamarche
No one will ever believe you
