
# Rule suggestions


This notebook presents some possible rules that could be extracted, paired with their appropriate datasets.

---
### **Rule 1**: (A, "parents", B) and (A, "parents", C) and (B, "gender", "male") --> (C, "gender", "female")
"If some A has different parents B and C, and B is male, then C is female."


**Dataset**: fb13

This rule assumes that parents are always one male and one female, which does not hold in general, but I assume that the embedding will support it.


---
### **Rule 2**: (A, "daughterOf", B) and (B, "motherOf", C) --> (A, "sisterOf", C)
"If some A is the daughter of some B, and that B is a mother of C, then A is a sister of C."

**Dataset**: kinship

***Problem***: dataset is synthetic and quite small.

---
### **Rule 3**: (A, "wasBornIn", B) and (A, "livesIn", B) --> (A, "isCitizenOf", B)
"If some A was born in B and lives in B, then they are a citizen of B."

***Problem***: no data that fulfills rule.

**Dataset**: yago3_10

---
### **Rule 4**: (A, "livesIn", B) and (B, "hasCapital", C) --> (A, "livesIn", C)
"If some A lives in B, and C is the capital of B, then A lives in C."

***Problem***: no data that fulfills rule. There are no true consequents of the rule in the dataset. SO if we have two tr

**Dataset**: yago3_10

---
### **Rule 5**: (A, "hasCapital", B) and (C, "isLeaderOf", B) --> (C, "isCitizenOf", B)
"If A is the capital of B and C is the leader of B, then C is a citizen of B." (You cannot be a leader of a country without being a citizen of that country.)

***Problematic***: B is not necessarily a country in this dataset. Could have written (B, "hasCurrency", A), but then we encounter the same problem.

**Dataset**: yago3_10


---
### **Rule 6**: (A, "playsFor", B) --> (A, "hasGender", "male")
"If A plays for B, then A is a male."

This is an odd rule which is supported by the yago3_10 dataset. There are 334684 triplets that can be used to train the rule. 99% of the data is of players that are men, so the rule could be learnt from this dataset.


**Dataset**: yago3_10

In [8]:
import numpy as np
import ampligraph
import tensorflow as tf
from ampligraph.datasets import load_yago3_10
from ampligraph.evaluation import evaluate_performance
from ampligraph.evaluation import train_test_split_no_unseen 
from ampligraph.evaluation import mr_score, mrr_score, hits_at_n_score
from ampligraph.latent_features import save_model
from signature_tools import subset_by_signature, subset_by_strict_signature, subset_by_frequency, most_frequent_objects, most_frequent_predicates, most_frequent_targets

## Yago3_10

In [2]:
yago = load_yago3_10()
yago = np.concatenate([yago['train'], yago['valid'], yago['test']]) # combine the split data

In [3]:
most_frequent_objects(yago, n = 2)

array([[264, 'Frankfurt_Airport'],
       [259, 'Amsterdam_Airport_Schiphol']], dtype=object)

In [4]:
most_frequent_predicates(yago, n =30)

array([[377143, 'isAffiliatedTo'],
       [324048, 'playsFor'],
       [89495, 'isLocatedIn'],
       [66764, 'hasGender'],
       [45410, 'wasBornIn'],
       [32479, 'actedIn'],
       [32338, 'isConnectedTo'],
       [24277, 'hasWonPrize'],
       [10801, 'influences'],
       [9340, 'diedIn'],
       [7827, 'hasMusicalRole'],
       [7432, 'graduatedFrom'],
       [7006, 'created'],
       [6102, 'wroteMusicFor'],
       [5530, 'directed'],
       [5190, 'participatedIn'],
       [5111, 'hasChild'],
       [5099, 'happenedIn'],
       [3795, 'isMarriedTo'],
       [3482, 'isCitizenOf'],
       [3419, 'worksAt'],
       [3114, 'edited'],
       [3011, 'livesIn'],
       [2587, 'hasCapital'],
       [2186, 'isPoliticianOf'],
       [1320, 'dealsWith'],
       [966, 'isLeaderOf'],
       [921, 'hasAcademicAdvisor'],
       [733, 'owns'],
       [558, 'hasNeighbor']], dtype=object)

In [5]:
most_frequent_targets(yago, n = 2)

array([[61599, 'male'],
       [12309, 'United_States']], dtype=object)

### **Rule 3**: (A, "wasBornIn", B) and (A, "livesIn", B) --> (A, "isCitizenOf", B)

In [10]:
# extract triplets with relevant predicates for rule
born_subset = subset_by_signature(yago, [], ["wasBornIn"], [])
lives_subset = subset_by_signature(yago, [], ["livesIn"], [])
citizen_subset = subset_by_signature(yago, [], ["isCitizenOf"], [])

# column 0 holds the head entities (called "objects" here), column 2 the tail entities (called "subjects" here)
born_objects = born_subset[:,0]
born_subjects = born_subset[:,2]
lives_objects = lives_subset[:,0]
lives_subjects = lives_subset[:,2]
citizen_objects = citizen_subset[:,0]
citizen_subjects = citizen_subset[:,2]

# extract the objects that are common for wasBornIn, livesIn and isCitizenOf predicates
born_lives_objects = np.intersect1d(born_objects, lives_objects)
born_citizen_objects = np.intersect1d(born_objects, citizen_objects)
lives_citizen_objects = np.intersect1d(lives_objects, citizen_objects)
born_lives_citizen_objects_incomplete = np.intersect1d(born_lives_objects, born_citizen_objects)
born_lives_citizen_objects = np.intersect1d(born_lives_citizen_objects_incomplete, lives_citizen_objects)

# extract the subjects that are common for wasBornIn, livesIn and isCitizenOf predicates
born_lives_subjects = np.intersect1d(born_subjects, lives_subjects)
born_citizen_subjects = np.intersect1d(born_subjects, citizen_subjects)
lives_citizen_subjects = np.intersect1d(lives_subjects, citizen_subjects)
born_lives_citizen_subjects_incomplete = np.intersect1d(born_lives_subjects, born_citizen_subjects)
born_lives_citizen_subjects = np.intersect1d(born_lives_citizen_subjects_incomplete, lives_citizen_subjects)

# extract triplets that share objects and subjects across all the relevant predicates
born_filtered = subset_by_strict_signature(born_subset, list(born_lives_citizen_objects), [], list(born_lives_citizen_subjects))
lives_filtered = subset_by_strict_signature(lives_subset, list(born_lives_citizen_objects), [], list(born_lives_citizen_subjects))
citizen_filtered = subset_by_strict_signature(citizen_subset, list(born_lives_citizen_objects), [], list(born_lives_citizen_subjects))

# final dataset to be used to learn rule 3
born_lives_citizen_dataset = np.concatenate((born_filtered, lives_filtered, citizen_filtered))
born_lives_citizen_dataset

array([], shape=(0, 3), dtype=object)

No data that fulfills rule.
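As a cross-check that does not depend on `signature_tools`, the rule can also be tested by intersecting (A, B) pairs directly. The sketch below is a minimal illustration on hypothetical toy triples; `rule3_support` and `toy_triples` are illustrative names that do not appear elsewhere in this notebook. It assumes each row is a (head, predicate, tail) triple, so the `yago` array could be passed in the same way.

```python
def rule3_support(triples):
    """Return (body_pairs, supporting_pairs) for Rule 3.

    body_pairs: (A, B) pairs with both (A, wasBornIn, B) and (A, livesIn, B).
    supporting_pairs: body pairs for which (A, isCitizenOf, B) also holds.
    """
    born, lives, citizen = set(), set(), set()
    for head, predicate, tail in triples:
        if predicate == "wasBornIn":
            born.add((head, tail))
        elif predicate == "livesIn":
            lives.add((head, tail))
        elif predicate == "isCitizenOf":
            citizen.add((head, tail))
    body_pairs = born & lives
    return body_pairs, body_pairs & citizen


# Hypothetical toy data, for illustration only.
toy_triples = [
    ("Ann", "wasBornIn", "Norway"),
    ("Ann", "livesIn", "Norway"),
    ("Ann", "isCitizenOf", "Norway"),
    ("Bob", "wasBornIn", "Oslo"),
    ("Bob", "livesIn", "Bergen"),
    ("Bob", "isCitizenOf", "Norway"),
]

body, support = rule3_support(toy_triples)
print(len(body), len(support))
```

Looking at `body` and `support` separately shows whether the antecedent itself never occurs or only the consequent is missing.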

### **Rule 4**: (A, "livesIn", B) and (B, "hasCapital", C) --> (A, "livesIn", C)


In [15]:
# extract triplets with relevant predicates for rule
capital_subset = subset_by_signature(yago, [], ["hasCapital"], [])
lives_subset = subset_by_signature(yago, [], ["livesIn"], [])

# extract the objects and subjects that appear in the relevant triplets
capital_subjects = capital_subset[:,2]
lives_subjects = lives_subset[:,2]
capital_objects = capital_subset[:,0]
lives_objects = lives_subset[:,0]

# extract entities that are tails of both predicates, and hasCapital heads that are also livesIn tails
capital_and_lives_subjects = np.intersect1d(capital_subjects, lives_subjects)
capital_objects_and_lives_subjects = np.intersect1d(capital_objects, lives_subjects)

# extract triplets that have subjects as objects of other predicates or vice versa
lives_with_capital_objects_as_subjects = subset_by_signature(lives_subset, [], [], list(capital_objects))
capitals_with_lives_subjects_as_objects = subset_by_signature(capital_subset, list(lives_subjects), [], [])

# find possible subjects and objects for a livesIn triplet that could be the consequent of a true example of the rule.
A = lives_with_capital_objects_as_subjects[:,0]
C = capitals_with_lives_subjects_as_objects[:,2]

# true consequents of rule
A_livesIn_C = subset_by_strict_signature(lives_subset, list(A), [], list(C))

In [16]:
len(lives_with_capital_objects_as_subjects)

1343

In [17]:
len(capitals_with_lives_subjects_as_objects)

239

In [19]:
len(A_livesIn_C)

0

No true consequents of rule in dataset.
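The check above builds the candidate sets for `A` and `C` independently. A stricter variant joins on the shared variable `B`, so that only (A, C) pairs connected through the same B are considered. The following is a minimal sketch on hypothetical toy triples; `rule4_candidates` is an illustrative helper, not part of `signature_tools`.

```python
from collections import defaultdict


def rule4_candidates(triples):
    """Join (A, livesIn, B) with (B, hasCapital, C) on B.

    Returns (predicted, confirmed): the (A, C) pairs implied by the rule body,
    and the subset for which (A, livesIn, C) is already present in the data.
    """
    capitals_of = defaultdict(set)  # B -> set of C
    lives_in = set()                # set of (A, B)
    for head, predicate, tail in triples:
        if predicate == "hasCapital":
            capitals_of[head].add(tail)
        elif predicate == "livesIn":
            lives_in.add((head, tail))
    predicted = {(a, c) for a, b in lives_in for c in capitals_of.get(b, ())}
    confirmed = predicted & lives_in
    return predicted, confirmed


# Hypothetical toy data, for illustration only.
toy_triples = [
    ("Norway", "hasCapital", "Oslo"),
    ("Ann", "livesIn", "Norway"),
    ("Ann", "livesIn", "Oslo"),
    ("Bob", "livesIn", "Norway"),
]

predicted, confirmed = rule4_candidates(toy_triples)
print(sorted(predicted))
print(sorted(confirmed))
```

An empty `confirmed` set with a non-empty `predicted` set would correspond to the situation found here: the rule body matches, but the dataset contains no true consequents.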

### **Rule 5**: (A, "hasCapital", B) and (C, "isLeaderOf", B) --> (C, "isCitizenOf", B)

In [None]:
# extract triplets with relevant predicates for rule
capital_subset = subset_by_signature(yago, [], ["hasCapital"], [])
leader_subset = subset_by_signature(yago, [], ["isLeaderOf"], [])
citizen_subset = subset_by_signature(yago, [], ["isCitizenOf"], [])

# extract the objects and subjects that appear in the relevant triplets
capital_subjects = capital_subset[:,2]
leader_subjects = leader_subset[:,2]
leader_objects = leader_subset[:,0]
citizen_objects = citizen_subset[:,0]
citizen_subjects = citizen_subset[:,2]

# extract the objects and subjects that appear in multiple of the relevant predicates
capital_and_leader_subjects = np.intersect1d(capital_subjects, leader_subjects)
citizen_and_leader_subjects = np.intersect1d(citizen_subjects, leader_subjects)
citizen_and_leader_objects = np.intersect1d(citizen_objects, leader_objects)
capital_citizen_and_leader_subjects = np.intersect1d(capital_and_leader_subjects, citizen_and_leader_subjects)

# extract triplets that share subjects across all the relevant predicates
capitals_with_leaders = subset_by_signature(capital_subset, [], [], list(capital_citizen_and_leader_subjects))
leader_of_B_is_citizen_of_B = subset_by_signature(leader_subset, [], [], list(capital_citizen_and_leader_subjects))
citizen_of_B_is_leader_of_B = subset_by_signature(citizen_subset, [], [], list(capital_citizen_and_leader_subjects))

# final dataset to be used to learn rule 5
capital_leader_citizen_dataset = np.concatenate((capitals_with_leaders, leader_of_B_is_citizen_of_B, citizen_of_B_is_leader_of_B))
capital_leader_citizen_dataset

### **Rule 6**: (A, "playsFor", B) --> (A, "hasGender", "male")

In [None]:
playsFor_subset = subset_by_signature(yago, [], ["playsFor"], [])
hasGender_subset = subset_by_signature(yago, [], ["hasGender"], [])
playsFor_objects = playsFor_subset[:,0]
hasGender_objects = hasGender_subset[:,0]

# extract objects that appear in a playsFor and hasGender triplet
playsFor_and_gender_objects = np.intersect1d(playsFor_objects, hasGender_objects)

players_with_gender = subset_by_signature(playsFor_subset, list(playsFor_and_gender_objects), [], [])
genders_that_are_players = subset_by_signature(hasGender_subset, list(playsFor_and_gender_objects), [], [])

# final dataset to be used to learn the gendered players rule
gendered_players = np.concatenate((players_with_gender, genders_that_are_players))
print("Size of dataset that can be used to learn the rule:", len(gendered_players))

print(most_frequent_targets(genders_that_are_players, n = 2))

In [None]:
genders = []
for triplet in players_with_gender:
    player = triplet[0]
    genders.append(genders_that_are_players[genders_that_are_players[:,0]==player, :][0][2])

unique, counts = np.unique(genders, return_counts=True)
frequencies = np.asarray((unique, counts)).T
frequencies

99% of the datapoints support the rule, out of a dataset of size 334684.
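The share of supporting data points can also be computed with a dictionary lookup instead of scanning the gender array once per `playsFor` triplet. This is a sketch on hypothetical toy triples; `rule6_confidence` is an illustrative name. It counts per `playsFor` triplet, as the loop above does, and assumes each entity has at most one `hasGender` value.

```python
def rule6_confidence(triples, gender="male"):
    """Return (n, fraction) over playsFor triplets whose head has a known gender.

    n: number of playsFor triplets whose head also has a hasGender triplet.
    fraction: share of those triplets whose head has the given gender.
    """
    # Assumption: one hasGender value per entity (a later value overwrites an earlier one).
    gender_of = {h: t for h, p, t in triples if p == "hasGender"}
    observed = [gender_of[h] for h, p, _ in triples
                if p == "playsFor" and h in gender_of]
    if not observed:
        return 0, 0.0
    return len(observed), observed.count(gender) / len(observed)


# Hypothetical toy data, for illustration only.
toy_triples = [
    ("A", "playsFor", "X"),
    ("A", "playsFor", "Y"),
    ("A", "hasGender", "male"),
    ("B", "playsFor", "X"),
    ("B", "hasGender", "female"),
    ("C", "playsFor", "Z"),      # no gender known: ignored
    ("D", "hasGender", "male"),  # not a player: ignored
]

n, confidence = rule6_confidence(toy_triples)
print(n, round(confidence, 2))
```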

Not enough data to support this rule. This dataset doesn't even contain a single example of three triplets that fulfill the rule.

# 2. Defining train and test datasets

As is typical in machine learning, we need to split our dataset into training and test (and sometimes validation) datasets. 

Unlike the standard method of randomly sampling N points to make up the test set, our data points are pairs of entities linked by a relationship, so we need to take care that every entity (and relation) in the test set is also represented by at least one triple in the training set. 

To accomplish this, AmpliGraph provides the [`train_test_split_no_unseen`](https://docs.ampligraph.org/en/latest/generated/ampligraph.evaluation.train_test_split_no_unseen.html#train-test-split-no-unseen) function.  

For the sake of example, we will create a small test set that includes only 100 triples:

In [None]:
from ampligraph.evaluation import train_test_split_no_unseen 

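# `subset` is assumed to be the (n, 3) array of triples selected for a rule above, e.g. `gendered_players`
X_train, X_test = train_test_split_no_unseen(subset, test_size=100) 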

Our data is now split into train/test sets. If we also need a validation set, we can repeat the same procedure on the test set (adjusting the split sizes). 
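To make the "no unseen entities" requirement concrete, the sketch below checks the property on a hypothetical toy split. It does not reimplement `train_test_split_no_unseen`; `unseen_in_test` is an illustrative helper that reports which test entities or relations never occur in the training triples.

```python
def unseen_in_test(train, test):
    """Return (entities, relations) that occur in test but not in train."""
    train_entities = {e for h, _, t in train for e in (h, t)}
    train_relations = {p for _, p, _ in train}
    test_entities = {e for h, _, t in test for e in (h, t)}
    test_relations = {p for _, p, _ in test}
    return test_entities - train_entities, test_relations - train_relations


# Hypothetical toy split, for illustration only.
train = [("a", "r", "b"), ("b", "r", "c"), ("c", "s", "a")]
good_test = [("a", "r", "c")]
bad_test = [("a", "r", "d")]  # "d" never appears in train

print(unseen_in_test(train, good_test))
print(unseen_in_test(train, bad_test))
```

Applied to the real split, `unseen_in_test(X_train, X_test)` would be expected to return two empty sets.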

In [None]:
print('Train set size: ', X_train.shape)
print('Test set size: ', X_test.shape)

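In [None]:
# row-wise membership test; a plain `[...] in subset` compares element-wise and can give false positives
(subset == np.array(["Trondheim_Airport", "hasGender", "Brisbane_Airport"], dtype=object)).all(axis=1).any()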

---
# 3. Training a model 

AmpliGraph has implemented [several Knowledge Graph Embedding models](https://docs.ampligraph.org/en/latest/ampligraph.latent_features.html#knowledge-graph-embedding-models) (TransE, ComplEx, DistMult, HolE), but to begin with we're just going to use the [ComplEx](https://docs.ampligraph.org/en/latest/generated/ampligraph.latent_features.ComplEx.html#ampligraph.latent_features.ComplEx) model, so let's import that:

In [None]:
from ampligraph.latent_features import ComplEx

Let's go through the parameters to understand what's going on:

- **`k`** : the dimensionality of the embedding space.
- **`eta`** ($\eta$) : the number of negative (false) triples generated at training time for each positive (true) triple.
- **`batches_count`** : the number of batches into which the training set is split during the training loop. If you run into low-memory issues, setting this to a higher number may help.
- **`epochs`** : the number of epochs to train the model for.
- **`optimizer`** : the Adam optimizer, with a learning rate of 1e-3 set via the *optimizer_params* kwarg.
- **`loss`** : multiclass negative log-likelihood (`multiclass_nll`); no *loss_params* are passed, so its defaults apply.
- **`regularizer`** : $L_p$ regularization with $p=3$ and $\lambda$ = 1e-5, set via the *regularizer_params* kwarg. 

Now we can instantiate the model:


In [None]:
model = ComplEx(batches_count=100, 
                seed=0, 
                epochs=200, 
                k=150, 
                eta=5,
                optimizer='adam', 
                optimizer_params={'lr':1e-3},
                loss='multiclass_nll', 
                regularizer='LP', 
                regularizer_params={'p':3, 'lambda':1e-5}, 
                verbose=True)

## Filtering negatives

AmpliGraph aims to follow scikit-learn's ease-of-use design philosophy and simplify everything down to **`fit`**, **`evaluate`**, and **`predict`** functions. 

However, there are some knowledge graph specific steps we must take to ensure our model can be trained and evaluated correctly. The first of these is defining the filter that will be used to ensure that no *negative* statements generated by the corruption procedure are actually positives. This is simply done by concatenating our train and test sets. Now when negative triples are generated by the corruption strategy, we can check that they aren't actually true statements.  
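A minimal sketch of the idea, on hypothetical toy data: corrupt the tail of a triple with every known entity, then drop any corruption that is itself a known positive. This illustrates the filtering principle only and is not AmpliGraph's internal implementation; `filtered_tail_corruptions` is an illustrative name.

```python
def filtered_tail_corruptions(triple, entities, positives):
    """Replace the tail with every other entity, skipping known true triples."""
    head, predicate, tail = triple
    known = {tuple(p) for p in positives}
    return [
        (head, predicate, e)
        for e in entities
        if e != tail and (head, predicate, e) not in known
    ]


# Hypothetical toy data, for illustration only.
entities = ["a", "b", "c", "d"]
positives = [("a", "r", "b"), ("a", "r", "c")]

negatives = filtered_tail_corruptions(("a", "r", "b"), entities, positives)
print(negatives)
```

Without the filter, `("a", "r", "c")` would be treated as a negative even though it is a true statement, which would unfairly penalise a model that scores it highly.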


In [None]:
positives_filter = subset

## Fitting the model

Once you run the next cell the model will train. 

On a modern laptop this should take ~3 minutes (although your mileage may vary, especially if you've changed any of the hyper-parameters above).

In [None]:
tf.logging.set_verbosity(tf.logging.ERROR)

model.fit(X_train, early_stopping = False)

---
# 5.  Saving and restoring a model

Before we go any further, let's save the best model found so that we can restore it in future.

In [None]:
from ampligraph.latent_features import save_model, restore_model

In [None]:
save_model(model, './airports_subset.pkl')

This will save the model in the current working directory as `airports_subset.pkl`.

.. we can then delete the model .. 

In [None]:
#del model

.. and then restore it from disk! Ta-da! 

In [None]:
#model = restore_model('./airports_subset.pkl')

And let's just double check that the model we restored has been fit:

In [None]:
if model.is_fitted:
    print('The model is fit!')
else:
    print('The model is not fit! Did you skip a step?')

## Running evaluation

In [None]:
ranks = evaluate_performance(X_test, 
                             model=model, 
                             filter_triples=positives_filter,   # Corruption strategy filter defined above 
                             use_default_protocol=True, # corrupt subj and obj separately while evaluating
                             verbose=True)


The ***ranks*** returned by the evaluate_performance function indicate the rank at which the test set triple was found when performing link prediction using the model. 

For example, given the triple:

    <House Stark of Winterfell, IN_REGION The North>
    
The model returns a rank of 7. This tells us that while it's not the highest likelihood true statement (which would be given a rank 1), it's pretty likely.
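As a rough illustration of how such a rank arises: the model scores the true triple and each of its corruptions, and the rank is one plus the number of corruptions that score higher. The scores below are made up, and `rank_of_true_triple` is an illustrative helper rather than an AmpliGraph function; ties are simply resolved in favour of the true triple here, which is a simplifying assumption.

```python
def rank_of_true_triple(true_score, corruption_scores):
    """Rank 1 means no corruption scored higher than the true triple."""
    return 1 + sum(1 for s in corruption_scores if s > true_score)


# Made-up scores, for illustration only (higher means more plausible).
corruption_scores = [2.1, 0.3, 1.7, -0.5, 3.0, 0.9]

rank = rank_of_true_triple(1.5, corruption_scores)
print(rank)  # three corruptions (2.1, 1.7, 3.0) score higher, so the rank is 4
```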


## Metrics

Let's compute some evaluation metrics and print them out.

We're going to use the mrr_score (mean reciprocal rank) and hits_at_n_score functions. 

- ***mrr_score***:  computes the mean of the reciprocals of the elements of a vector of ranks.
- ***hits_at_n_score***: computes the fraction of elements of a vector of ranks that make it into the top n positions.


In [None]:
mrr = mrr_score(ranks)
print("MRR: %.2f" % (mrr))

hits_10 = hits_at_n_score(ranks, n=10)
print("Hits@10: %.2f" % (hits_10))
hits_3 = hits_at_n_score(ranks, n=3)
print("Hits@3: %.2f" % (hits_3))
hits_1 = hits_at_n_score(ranks, n=1)
print("Hits@1: %.2f" % (hits_1))

Now, how do we interpret those numbers? 

[Hits@N](http://docs.ampligraph.org/en/1.0.3/generated/ampligraph.evaluation.hits_at_n_score.html#ampligraph.evaluation.hits_at_n_score) indicates how often, on average, a true triple was ranked in the top N. For example, a Hits@3 of 0.53 would mean that we guessed the correct subject or object 53% of the time when considering the three best-ranked triples. The choice of which N makes more sense depends on the application.

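The [Mean Reciprocal Rank (MRR)](http://docs.ampligraph.org/en/latest/generated/ampligraph.evaluation.mrr_score.html) is another popular metric to assess the predictive power of a model. It averages 1/rank over all test triples, so it lies between 0 and 1, and higher values mean that true triples tend to be ranked closer to the top.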
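Both metrics are simple functions of the rank vector. The sketch below recomputes them in plain Python for a made-up list of ranks, as a way to sanity-check the definitions. The helper names are illustrative; in the notebook itself, `mrr_score` and `hits_at_n_score` should be used.

```python
def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_n(ranks, n):
    """Fraction of ranks that are less than or equal to n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)


# Made-up ranks, for illustration only.
example_ranks = [1, 2, 4, 10, 20]

print("MRR: %.2f" % mean_reciprocal_rank(example_ranks))    # (1 + 0.5 + 0.25 + 0.1 + 0.05) / 5
print("Hits@10: %.2f" % hits_at_n(example_ranks, 10))
print("Hits@3: %.2f" % hits_at_n(example_ranks, 3))
print("Hits@1: %.2f" % hits_at_n(example_ranks, 1))
```

Because of the reciprocal, a single triple ranked first contributes as much to the MRR as ten triples ranked tenth, so the metric is dominated by how often the model gets the very top positions right.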