The goal of this notebook is to evaluate the embeddings we have created. Specifically, we are interested in how well, both qualitatively and quantitatively, our embeddings match questions with their context in the statement portion of the earnings call transcript. Since we divided the statement portion of each earnings call transcript into chunks of at most 64 words, this reduces to matching each question with the statement chunk that is most "similar" to it under some predefined similarity metric (e.g. cosine similarity).

In [4]:
import numpy as np
import pickle

In [2]:
# Change this to point to your embeddings
FILE_PATH = 'embeddings/bert_embeddings.pickle'

In [6]:
with open(FILE_PATH, 'rb') as f:
    transcript_embeddings = pickle.load(f)

In [7]:
def cosine_sim(u, v):
    # Flatten both embeddings to 1-D so the function is not tied to 768-dimensional vectors
    u = np.ravel(u)
    v = np.ravel(v)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

In [8]:
# Define this as you see fit
sim_func = cosine_sim
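
As an illustration of what "as you see fit" could mean, here is a minimal sketch of an alternative similarity function. The name `neg_euclidean_sim` is hypothetical and does not appear elsewhere in this notebook. It returns the negative Euclidean distance, so that larger values still mean "more similar", which is what the matching loop assumes when it keeps the highest score.

```python
import numpy as np

def neg_euclidean_sim(u, v):
    """Hypothetical alternative to cosine_sim: negative Euclidean distance.

    The sign is flipped so that a higher score means a closer match.
    """
    u = np.ravel(u)  # flatten to 1-D, whatever the stored shape
    v = np.ravel(v)
    return -np.linalg.norm(u - v)
```

To try it, one would set `sim_func = neg_euclidean_sim`. Unlike cosine similarity, this score depends on vector magnitude, so the two functions may rank chunks differently.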

In [10]:
q_to_chunk = []  # one dict per transcript: question index -> index of best-matching chunk
for transcript in transcript_embeddings:
    chunk_embeddings, q_a_embeddings = transcript[0], transcript[1]
    curr = {}
    for q_a_idx, q_a in enumerate(q_a_embeddings):
        if q_a[1] == 1:  # flag 1 marks an answer; only questions are matched
            continue
        q_embedding = q_a[0]

        curr_best_score, curr_best_chunk = None, None
        for chunk_idx in range(chunk_embeddings.shape[0]):
            score = sim_func(q_embedding, chunk_embeddings[chunk_idx])
            if curr_best_score is None or score > curr_best_score:
                curr_best_score, curr_best_chunk = score, chunk_idx
        curr[q_a_idx] = curr_best_chunk
    q_to_chunk.append(curr)
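
The inner loop over chunks can also be written as a single matrix operation when the similarity is cosine similarity. The sketch below is one possible way to do that, not part of the original notebook: the helper name `best_chunk_index` is hypothetical, and it assumes there is at least one chunk, that no embedding is the zero vector, and that each chunk embedding can be flattened to a 1-D vector.

```python
import numpy as np

def best_chunk_index(q_embedding, chunk_embeddings):
    """Return the index of the chunk with the highest cosine similarity to the question.

    Hypothetical helper; assumes a non-empty set of chunks and no zero vectors.
    """
    q = np.ravel(q_embedding).astype(float)
    # Flatten each chunk embedding to one row: (n_chunks, dim)
    chunks = np.asarray(chunk_embeddings, dtype=float).reshape(len(chunk_embeddings), -1)

    # Normalise to unit length so a dot product equals cosine similarity
    q = q / np.linalg.norm(q)
    chunks = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)

    scores = chunks @ q  # one cosine score per chunk
    return int(np.argmax(scores))  # first index wins on ties, like the loop


# Toy usage with made-up 2-D vectors
toy_chunks = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_best = best_chunk_index([2.0, 2.0], toy_chunks)
```

Because of floating-point rounding, near-ties could in principle resolve differently from the explicit loop, so it is worth comparing the two on a few transcripts before swapping one for the other.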

In [12]:
# Now we load the original transcripts
with open('data/transcripts.pickle', 'rb') as f:
    transcripts = pickle.load(f)

In [13]:
for i, mapping in enumerate(q_to_chunk):
    for q_idx, chunk_idx in mapping.items():
        print("QUESTION TEXT:\n")
        print(transcripts[i][3][q_idx][0])
        # "ANSWER TEXT" is the best-matching statement chunk, not the answer given in the Q&A
        print("\nANSWER TEXT:\n")
        print(transcripts[i][2][chunk_idx] + '\n')
        print('#' * 75)

QUESTION TEXT:

operator instructions first question comes line bruce geller dghm please proceed

ANSWER TEXT:

thank operator good morning thank joining us conference call fourth fiscal quarter full year ended october 1 2016 call today michael weinstein chairman ceo vinny pascal chief operating officer yet obtained copy press release issued newswire yesterday available website review full text press release along associated financial tables please go homepage begin however like read safe harbor statement need remind everyone part discussion afternoon

###########################################################################
QUESTION TEXT:

hi good morning guys

ANSWER TEXT:

greetings welcome ark restaurants fourth quarter full year 2016 results conference time participants mode session follow formal presentation operator instructions reminder conference recorded would like turn conference host bob stewart president chief financial officer thank may begin

##########################