## BERT Knowledge Representation
We want to compute, at each step of a user session (i.e. after each clicked document), how the user's internal knowledge representation changes. Below are a few methods for doing so, using BERT embeddings as a starting point.
A few assumptions:
- The user starts with an empty knowledge representation
- The user READS every clicked document, and its content is added to their knowledge

Document embeddings for BERT can come in a few different forms. Check `Compute_BERT_embeddings.ipynb` for how we compute each:
- SUM: sum of the embeddings for each sentence of the document
- MEAN: mean of the embeddings for each sentence of the document
- TRUNC: embedding of the document truncated to its first 384 tokens
- maxp_pairwise: comparing all sentences of the document with all sentences of the Wikipedia topic, keep only the document sentence with the highest similarity to any Wikipedia sentence
- maxp_sum: keep only the document sentence with the highest similarity to the SUM of the Wikipedia sentences
- maxp_mean: keep only the document sentence with the highest similarity to the MEAN of the Wikipedia sentences
- maxp_trunc: keep only the document sentence with the highest similarity to the truncated Wikipedia document
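
The aggregation options above can be sketched roughly as follows. This is only an illustration on toy vectors, not the code from `Compute_BERT_embeddings.ipynb`: the helper names (`l2_normalize`, `aggregate_sentences`, `maxp`, `maxp_pairwise`) are hypothetical, cosine similarity is assumed as the similarity measure, and the 4-dimensional arrays stand in for real 768-dimensional BERT sentence embeddings.

```python
import numpy as np


def l2_normalize(v):
    """Scale v to unit length; leave an all-zero vector unchanged."""
    norm = np.linalg.norm(v)
    return v if norm == 0 else v / norm


def aggregate_sentences(sentence_embs, how):
    """Collapse a (n_sentences, dim) array into one document vector."""
    if how == "sum":
        return sentence_embs.sum(axis=0)
    if how == "mean":
        return sentence_embs.mean(axis=0)
    raise ValueError(f"unknown aggregation: {how}")


def maxp(sentence_embs, reference):
    """Return the sentence embedding most similar (cosine) to `reference`."""
    ref = l2_normalize(reference)
    sims = np.array([np.dot(l2_normalize(s), ref) for s in sentence_embs])
    return sentence_embs[int(np.argmax(sims))]


def maxp_pairwise(doc_sentence_embs, wiki_sentence_embs):
    """Return the document sentence closest to any Wikipedia sentence."""
    doc_unit = np.array([l2_normalize(s) for s in doc_sentence_embs])
    wiki_unit = np.array([l2_normalize(s) for s in wiki_sentence_embs])
    sims = doc_unit @ wiki_unit.T  # shape: (n_doc_sentences, n_wiki_sentences)
    best_doc_sentence = int(np.argmax(sims.max(axis=1)))
    return doc_sentence_embs[best_doc_sentence]


# Toy data standing in for real sentence embeddings (assumption: dim 4, not 768).
doc = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 0.0]])
wiki = np.array([[0.0, 0.9, 0.1, 0.0],
                 [0.0, 0.8, 0.0, 0.2]])

doc_sum = aggregate_sentences(doc, "sum")
doc_mean = aggregate_sentences(doc, "mean")
doc_maxp_sum = maxp(doc, aggregate_sentences(wiki, "sum"))
doc_maxp_pairwise = maxp_pairwise(doc, wiki)
```

In this toy case both maxp variants select the second document sentence, since it is the one that overlaps most with the Wikipedia vectors.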
    
These are the ways we can compute the evolution of the user's knowledge. Each is compared against the Wikipedia text aggregated with the same method.

- MEAN: collect all of the clicked documents; the MEAN of their embeddings is the final knowledge
- SUM: as the user clicks on documents, SUM their embeddings
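
A minimal sketch of the two update rules, tracked click by click, might look like the following. The random vectors, the dimension of 8, and the names `clicked_docs`, `topic`, `sum_trajectory` and `mean_trajectory` are all assumptions made for illustration; real inputs would be the 768-dimensional document and Wikipedia embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
clicked_docs = rng.normal(size=(5, 8))  # 5 hypothetical clicked documents, dim 8
topic = rng.normal(size=8)              # hypothetical Wikipedia topic embedding
topic /= np.linalg.norm(topic)

knowledge_sum = np.zeros(8)   # the user starts with empty knowledge
knowledge_mean = np.zeros(8)
sum_trajectory, mean_trajectory = [], []

for n, emb in enumerate(clicked_docs, start=1):
    knowledge_sum += emb                           # SUM update
    knowledge_mean += (emb - knowledge_mean) / n   # incremental MEAN update
    # Cosine similarity to the topic after each click.
    sum_trajectory.append(np.dot(knowledge_sum / np.linalg.norm(knowledge_sum), topic))
    mean_trajectory.append(np.dot(knowledge_mean / np.linalg.norm(knowledge_mean), topic))
```

One thing worth noting: when both rules use the same per-document embeddings, the mean is just the sum divided by the number of clicks, so after L2 normalization the two vectors point in the same direction and their cosine similarity to the topic agrees up to floating-point error. The two rules only diverge if the vectors are compared without normalization.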


In [41]:
import json
import pickle
import urllib.parse
from collections import defaultdict

import numpy as np

from tqdm.auto import tqdm


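def normalize_vector(v):
    """Scale v to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v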


with open("../data/logs_with_position.json") as f:
    dataset = json.load(f)


embeddings = {
    "docs_mean": pickle.load(open("../data/docs_mean_embeddings.pkl", "rb")),
    "docs_sum": pickle.load(open("../data/docs_sum_embeddings.pkl", "rb")),
    "docs_trunc": pickle.load(open("../data/docs_trunc_embeddings.pkl", "rb")),
    "maxp_pairwise": pickle.load(open("../data/docs_maxp_pairwise_embeddings.pkl", "rb")),
    "maxp_sum": pickle.load(open("../data/docs_maxp_sum_embeddings.pkl", "rb")),
    "maxp_mean": pickle.load(open("../data/docs_maxp_mean_embeddings.pkl", "rb")),
    "maxp_trunc": pickle.load(open("../data/docs_maxp_trunc_embeddings.pkl", "rb")),
}

In [None]:
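# Sanity check of the incremental-mean update used for the MEAN knowledge:
# after the loop, rolling_avg should equal a.mean().
a = np.random.randn(12)
rolling_avg = 0.0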


for n, i in enumerate(a):
    rolling_avg += (i - rolling_avg) / (n + 1)

True
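
The update in the loop above follows from rewriting the batch mean recursively. As a short derivation, with $\mu_n$ denoting the mean of the first $n$ values $x_1, \dots, x_n$ (notation introduced here, not in the original):

$$
\mu_n = \frac{1}{n}\sum_{i=1}^{n} x_i
      = \frac{(n-1)\,\mu_{n-1} + x_n}{n}
      = \mu_{n-1} + \frac{x_n - \mu_{n-1}}{n}
$$

Because `enumerate` counts from zero, the code divides by `n + 1` rather than `n`.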

In [40]:
u['topic_title']

'Subprime mortgage crisis'

In [43]:
topic

{}

In [46]:
final_knowledges

{('SUM_KNOWLEDGE', 'docs_mean', 'sum'): 0.9309203166983557,
 ('SUM_KNOWLEDGE', 'docs_mean', 'mean'): 0.9309202647969933,
 ('SUM_KNOWLEDGE', 'docs_mean', 'trunc'): 0.6968966156696906,
 ('SUM_KNOWLEDGE', 'docs_sum', 'sum'): 0.9309203168534019,
 ('SUM_KNOWLEDGE', 'docs_sum', 'mean'): 0.9309202649520396,
 ('SUM_KNOWLEDGE', 'docs_sum', 'trunc'): 0.6968966149985361,
 ('SUM_KNOWLEDGE', 'docs_trunc', 'sum'): 0.838700938961882,
 ('SUM_KNOWLEDGE', 'docs_trunc', 'mean'): 0.8387008915413234,
 ('SUM_KNOWLEDGE', 'docs_trunc', 'trunc'): 0.753863944267095,
 ('SUM_KNOWLEDGE', 'maxp_pairwise', 'sum'): 0.8587660841259157,
 ('SUM_KNOWLEDGE', 'maxp_pairwise', 'mean'): 0.8587660353205061,
 ('SUM_KNOWLEDGE', 'maxp_pairwise', 'trunc'): 0.7096784202961963,
 ('SUM_KNOWLEDGE', 'maxp_sum', 'sum'): 0.910286177926849,
 ('SUM_KNOWLEDGE', 'maxp_sum', 'mean'): 0.9102861266757349,
 ('SUM_KNOWLEDGE', 'maxp_sum', 'trunc'): 0.7517972642223243,
 ('SUM_KNOWLEDGE', 'maxp_mean', 'sum'): 0.910286177926849,
 ('SUM_KNOWLEDGE', '

In [48]:
methods = list(embeddings.keys())
users_knowledge_MEAN = []  # add final score for the user
users_knowledge_SUM = []
# (knowledge method, document embedding, Wikipedia embedding) -> per-user similarities
final_knowledges = defaultdict(list)

missing_docs = set()

wikipedia_embeddings = {
    "sum": pickle.load(open("../data/wikipedia_sum_embeddings.pkl", "rb")),
    "mean": pickle.load(open("../data/wikipedia_mean_embeddings.pkl", "rb")),
    "trunc": pickle.load(open("../data/wikipedia_trunc_embeddings.pkl", "rb")),
}
wikipedia_sum = {}
wikipedia_mean = {}
wikipedia_trunc = {}

for u in dataset:
    user_knowledge_mean = {k: np.zeros(768) for k in methods}
    user_knowledge_sum = {k: np.zeros(768) for k in methods}
    topic = urllib.parse.quote(u['topic_title'])
    clicks = 0
    for d in u["clicks"]:
        url = d["url"]
        # skip documents with no embedding or an all-zero embedding
        if url not in embeddings["docs_mean"] or not np.any(embeddings["docs_mean"][url]):
            missing_docs.add(url)
            continue
        clicks += 1
        for method in methods:
            emb = embeddings[method][url]
            # incremental mean: mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n
            user_knowledge_mean[method] += (emb - user_knowledge_mean[method]) / clicks
            user_knowledge_sum[method] += emb  # normalize at the end
    # normalize and compute final similarity
    for method in methods:
        user_knowledge_mean[method] = normalize_vector(user_knowledge_mean[method])
        user_knowledge_sum[method] = normalize_vector(user_knowledge_sum[method])
        for emb_type in wikipedia_embeddings.keys():
            wiki_emb = wikipedia_embeddings[emb_type][topic]
            final_knowledges[("SUM", method, emb_type)].append(np.dot(user_knowledge_sum[method], wiki_emb))
            final_knowledges[("MEAN", method, emb_type)].append(np.dot(user_knowledge_mean[method], wiki_emb))
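
Since `final_knowledges` holds one similarity per user for each key, a natural next step is to reduce each list to a single number. The sketch below shows one way to do that and is not taken from the notebook: `example_knowledges` and its values are made-up stand-ins, and averaging with `np.nanmean` is an assumption chosen so that NaN entries, should any be present, are skipped.

```python
from collections import defaultdict

import numpy as np

# Hypothetical stand-in for the per-user similarity lists built above.
example_knowledges = defaultdict(list)
example_knowledges[("SUM", "docs_mean", "sum")].extend([0.91, 0.95, float("nan")])
example_knowledges[("MEAN", "docs_mean", "sum")].extend([0.90, 0.96])

# Average over users, ignoring NaN entries.
summary = {
    key: float(np.nanmean(values))
    for key, values in example_knowledges.items()
}
```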
