This notebook uses a pretrained BERT model to embed the earnings call transcripts that we parsed and stored in the transcript data structure described in the FormTranscripts notebook.

In [1]:
import numpy as np
import pandas as pd
import pickle
from nltk.tokenize import word_tokenize
from bert_serving.client import BertClient
from tqdm import tqdm

In [2]:
# We'll begin by loading the training data along with our id_to_row mapping
with open('data/transcripts_train.pickle', 'rb') as f:
    transcripts = pickle.load(f)
with open('data/id_to_row.pickle', 'rb') as f:
    id_to_row = pickle.load(f)

Start the BERT server by running the following from the command line (assuming you're in the cs224u-project directory):

bert-serving-start -model_dir ../cs224u/data/bert/uncased_L-12_H-768_A-12/ -pooling_strategy NONE -max_seq_len NONE -show_tokens_to_client

In [3]:
# Instantiate a BertClient object
bc = BertClient()

We'll begin with a toy example:

In [4]:
result = bc.encode(["Hello, world.", "This is a test.", "One encoding per sentence."])
print(result.shape)
print(result[0])

(3, 7, 768)
[[-0.6583949   0.00830614  0.03765348 ... -0.45243877 -0.00460901
   0.31208587]
 [ 0.03465232  0.61453384  0.73173016 ... -0.93620336 -0.38065237
  -0.26721236]
 [-1.3928384   0.4704191   1.006837   ... -0.9887997   0.00509461
   0.5722854 ]
 ...
 [-1.1544309  -0.2005329   0.15407975 ...  0.07570094  0.13651484
  -0.60283816]
 [ 0.04261789  0.01326546 -0.02783578 ...  0.00655589 -0.04553343
   0.0079642 ]
 [-0.          0.          0.         ... -0.         -0.
   0.        ]]




In [5]:
# We'll max-pool over the token axis (axis=1), taking the max of each of the 768 dimensions, to get a single 768-dim representation for each sentence
result = np.max(result, axis=1)
print(result.shape)

(3, 768)
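
The pooling step above reduces over the token axis, so any all-zero row (like the last row of the printed output, which looks like padding) also takes part in the max. The following is a minimal sketch on synthetic data, not part of the original pipeline, comparing a plain max with a masked max. It assumes that an all-zero row marks a padding position; the array values and the names `token_vecs`, `plain`, and `masked` are made up for illustration.

```python
import numpy as np

# Synthetic stand-in for the encoder output: 2 sentences, 4 token positions, 3 dims.
# Trailing all-zero rows play the role of padding (an assumption for this sketch).
token_vecs = np.array([
    [[-1.0, -2.0, -0.5], [-0.3, -1.5, -0.7], [-0.9, -0.1, -2.0], [0.0, 0.0, 0.0]],
    [[0.5, -1.0, 2.0], [1.5, -3.0, 0.1], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
])

# Plain max over the token axis, as in the cell above.
plain = token_vecs.max(axis=1)

# Masked max: positions whose vector is entirely zero are excluded from the max.
is_pad = np.all(token_vecs == 0, axis=2)  # shape (2, 4)
masked = np.where(is_pad[:, :, None], -np.inf, token_vecs).max(axis=1)

print(plain[0])   # padding zeros win whenever every real value is negative
print(masked[0])  # max taken over real token positions only
```

With a plain max, a dimension whose real values are all negative is clipped to 0 by the padding row. Whether that matters for the downstream evaluation is not something this notebook measures.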


We now need to do some preprocessing. Specifically, we need to chunk our question and answer text just as we did with the statements (see the FormTranscripts notebook).

In [6]:
CHUNK_SZ = 64

In [7]:
def create_chunks(tokens):
    '''
    Form a list of strings with at most CHUNK_SZ words each
    '''
    result = []
    for i in range(0, len(tokens), CHUNK_SZ):
        offset = min(CHUNK_SZ, len(tokens) - i)
        curr_chunk = tokens[i:i + offset]
        curr_str = ' '.join(curr_chunk)
        result.append(curr_str)
    return result
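
As a quick check of the chunking behaviour, here is a small self-contained sketch. It re-implements `create_chunks` in compressed form with a hypothetical `chunk_sz` parameter, and uses synthetic tokens rather than `word_tokenize` output so that it runs without any NLTK data.

```python
CHUNK_SZ = 64

def create_chunks(tokens, chunk_sz=CHUNK_SZ):
    # Slicing clamps at the end of the list, so the last chunk may be shorter.
    return [' '.join(tokens[i:i + chunk_sz]) for i in range(0, len(tokens), chunk_sz)]

# 150 synthetic tokens (illustrative only).
tokens = ['tok%d' % i for i in range(150)]
chunks = create_chunks(tokens)

print(len(chunks))                       # 3
print([len(c.split()) for c in chunks])  # [64, 64, 22]
print(create_chunks([]))                 # []
```

Note that an empty token list yields no chunks at all, which is worth keeping in mind if any question or answer text in the data is empty.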

In [8]:
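# Tokenize each question/answer and replace its raw text with a list of chunk strings.
# Note: this overwrites transcripts in place, so run this cell only once per load.
for i in range(len(transcripts)):
    curr_qna = transcripts[i][3]  # list of (text, label) pairs
    for idx, elem in enumerate(curr_qna):
        curr_tokens = word_tokenize(elem[0])
        transcripts[i][3][idx] = (create_chunks(curr_tokens), elem[1])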

Now we're ready to encode our training data! We will create a list transcript_embeddings with one entry per transcript, each of the form [ [statement chunk 1 embedding, statement chunk 2 embedding, ...], [(Q1 embedding, 0), (A1 embedding, 1), ...] ]. Each statement chunk keeps its own embedding, while each question or answer gets a single embedding obtained by max-pooling over its chunk embeddings.

In [27]:
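def embed_statement_chunks(chunks):
    '''
    Return a list with one max-pooled 768-dim embedding per statement chunk
    '''
    embeddings = bc.encode(chunks)
    embeddings = np.max(embeddings, axis=1)  # max over the token axis
    return list(embeddings)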

In [28]:
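def embed_questions_and_answers(qna):
    '''
    Return one (embedding, label) pair per question or answer in qna
    '''
    result = []
    for elem in qna:
        sents = list(elem[0])  # chunk strings for this question or answer
        embeddings = bc.encode(sents)  # (num_chunks, num_tokens, 768)
        embeddings = np.max(embeddings, axis=1)  # max over tokens: one vector per chunk
        embedding = embeddings.max(axis=0)  # max over chunks: one embedding per question or answer
        result.append((embedding, elem[1]))
    return result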
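
The two-stage pooling used for questions and answers (first over tokens, then over chunks) can be tried without a running BERT server. The sketch below swaps in `fake_encode`, a hypothetical stand-in for `bc.encode` that returns random vectors padded with zeros; the function names, the 4-dim vectors, and the toy Q&A text are all assumptions for illustration.

```python
import numpy as np

def fake_encode(chunks, dim=4):
    """Hypothetical stand-in for bc.encode: (num_chunks, max_tokens, dim), zero-padded."""
    max_len = max(len(c.split()) for c in chunks)
    rng = np.random.default_rng(0)  # fixed seed so repeated calls agree
    out = np.zeros((len(chunks), max_len, dim))
    for i, c in enumerate(chunks):
        n = len(c.split())
        out[i, :n] = rng.normal(size=(n, dim))
    return out

def embed_qna(qna, encode=fake_encode):
    result = []
    for chunks, label in qna:
        token_vecs = encode(chunks)          # (num_chunks, num_tokens, dim)
        chunk_vecs = token_vecs.max(axis=1)  # max over tokens
        result.append((chunk_vecs.max(axis=0), label))  # max over chunks
    return result

# Toy Q&A list in the same (chunks, label) layout, with 0 = question and 1 = answer.
qna = [(["how did margins trend", "this quarter"], 0), (["margins improved"], 1)]
pooled = embed_qna(qna)
print([(vec.shape, label) for vec, label in pooled])  # [((4,), 0), ((4,), 1)]
```

Because both stages are elementwise maxima, the result equals a single max over the token and chunk axes together, so the order of the two reductions does not change the final embedding.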

In [43]:
NUM_SAMPLES = 100  # embed only the first NUM_SAMPLES transcripts

In [44]:
transcript_embeddings = []
for i in tqdm(range(NUM_SAMPLES)):
    statement_embeddings = embed_statement_chunks(transcripts[i][2])
    qna_embeddings = embed_questions_and_answers(transcripts[i][3])
    curr_embeddings = [statement_embeddings, qna_embeddings]
    transcript_embeddings.append(curr_embeddings)

100%|██████████| 100/100 [12:24<00:00,  9.69s/it]


In [46]:
with open('embeddings/bert_embeddings.pickle', 'wb') as f:
    pickle.dump(transcript_embeddings, f)
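
Before moving on, it can help to confirm that a saved file round-trips with the expected layout. The sketch below does this on a tiny synthetic list written to a temporary directory, so it does not touch embeddings/bert_embeddings.pickle; the file name `bert_embeddings_demo.pickle` and the placeholder vectors are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np

DIM = 768

# One synthetic transcript: [statement chunk embeddings, [(embedding, label), ...]]
sample = [[
    [np.zeros(DIM), np.ones(DIM)],
    [(np.zeros(DIM), 0), (np.ones(DIM), 1)],
]]

# Hypothetical demo path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'bert_embeddings_demo.pickle')
with open(path, 'wb') as f:
    pickle.dump(sample, f)
with open(path, 'rb') as f:
    loaded = pickle.load(f)

statement_embeddings, qna_embeddings = loaded[0]
print(len(statement_embeddings), statement_embeddings[0].shape)  # 2 (768,)
print([(vec.shape, label) for vec, label in qna_embeddings])     # [((768,), 0), ((768,), 1)]
```

The same unpacking should apply to each entry of the real file, given the structure built in the loop above.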

Now let's see how well we can do at contextualizing questions using these pretrained BERT embeddings by evaluating them in the EvaluateEmbeddings notebook!