This notebook converts the raw earnings call data into data structures suited to the NLU models we will apply to build a pseudo question-answering system for earnings call transcripts.

Specifically, we create the following two data structures:

1) A list of the following form:

[ [company name, time, [statement chunk 1, statement chunk 2, ...], [(Q1, 0), (A1, 1), (Q2, 0), (A2, 1), ...] ] ]

2) A dictionary of the following form:

{time : (statements row, Q&A row)}

Note that in the first data structure, we chunk the statements into strings of at most 64 words each (after removing stop words and punctuation), but we do not chunk the text of the questions or answers. We use the value 0 to denote a question and 1 to denote an answer.

Finally, note that the second data structure functions as an efficient reference that allows us to quickly look up all the information we have about a given earnings call.
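
To make the two shapes concrete, here is a small sketch with made-up values (the company name, time, and text below are hypothetical and not taken from the data set). It only illustrates how the nested list and the lookup dictionary are meant to be read.

```python
# Toy illustration of the two data structures; all values are invented.
transcripts_example = [
    [
        "Example Corp.",                  # company name (hypothetical)
        "2017-01-01 00:00:00",            # publication time, used as the ID
        ["first statement chunk", "second statement chunk"],
        [("what drove margins", 0), ("mainly lower input costs", 1)],
    ]
]

# time -> (row in the statements table, row in the Q&A table)
id_to_row_example = {"2017-01-01 00:00:00": (0, 0)}

company, time, chunks, qna_pairs = transcripts_example[0]

# Label 0 marks a question, label 1 marks an answer.
questions = [text for text, label in qna_pairs if label == 0]
answers = [text for text, label in qna_pairs if label == 1]

# The dictionary gives the starting rows for a given call.
statements_row, qna_row = id_to_row_example[time]
```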

In [4]:
import numpy as np
import pandas as pd
import pickle
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [7]:
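# Assumes data/transcripts.pickle already exists (it is written near the end of this notebook).
# Split into train and test: the first 70% of the transcripts, in stored order, form the training set.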

with open('data/transcripts.pickle', 'rb') as f:
    transcripts_full = pickle.load(f)

split_idx = int(len(transcripts_full) * 0.7)
transcripts_train = transcripts_full[:split_idx]
transcripts_test = transcripts_full[split_idx:]

assert((len(transcripts_train) + len(transcripts_test)) == len(transcripts_full))
print("There are %d training examples and %d test examples" % (len(transcripts_train), len(transcripts_test)))

with open('data/transcripts_train.pickle', 'wb') as f:
    pickle.dump(transcripts_train, f)
    
with open('data/transcripts_test.pickle', 'wb') as f:
    pickle.dump(transcripts_test, f)

There are 1586 training examples and 680 test examples
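
The split above is purely positional: nothing is shuffled, so the training set is simply the first 70% of the list. A minimal sketch of the same arithmetic as a reusable helper (the function name `sequential_split` is our own, not part of the notebook):

```python
def sequential_split(items, train_fraction=0.7):
    """Split a list into a leading train part and a trailing test part."""
    split_idx = int(len(items) * train_fraction)
    return items[:split_idx], items[split_idx:]


# With 2266 items this reproduces the counts printed above.
train_part, test_part = sequential_split(list(range(2266)))
sizes = (len(train_part), len(test_part))
```

If the transcripts are stored in publication order, a positional split like this keeps the test calls later in time than the training calls.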


We begin by reading in the statements and Q&A sessions:

In [8]:
statements = pd.read_pickle('data/statements_mini')
qna = pd.read_pickle('data/qna_mini')

In [9]:
print(statements.head())

   num      file-company-name file-exchange file-symbol    file-published-on  \
0    1  Ark Restaurants Corp.        NASDAQ        ARKR  2017-01-01 06:49:04   
1    1  Ark Restaurants Corp.        NASDAQ        ARKR  2017-01-01 06:49:04   
2    1  Ark Restaurants Corp.        NASDAQ        ARKR  2017-01-01 06:49:04   
3    1  Ark Restaurants Corp.        NASDAQ        ARKR  2017-01-01 06:49:04   
4    2         UniFirst Corp.          NYSE         UNF  2017-01-04 13:33:06   

                                          file-title           company-name  \
0  Ark Restaurants' (ARKR) CEO Michael Weinstein ...  Ark Restaurants Corp.   
1  Ark Restaurants' (ARKR) CEO Michael Weinstein ...  Ark Restaurants Corp.   
2  Ark Restaurants' (ARKR) CEO Michael Weinstein ...  Ark Restaurants Corp.   
3  Ark Restaurants' (ARKR) CEO Michael Weinstein ...  Ark Restaurants Corp.   
4  UniFirst's (UNF) CEO Ronald Croatti on Q1 2017...  UniFirst Corporation.   

                 quarter-year-event symbol-e

In [10]:
print(qna.head())

   num      file-company-name file-exchange file-symbol    file-published-on  \
0    1  Ark Restaurants Corp.        NASDAQ        ARKR  2017-01-01 06:49:04   
1    1  Ark Restaurants Corp.        NASDAQ        ARKR  2017-01-01 06:49:04   
2    1  Ark Restaurants Corp.        NASDAQ        ARKR  2017-01-01 06:49:04   
3    1  Ark Restaurants Corp.        NASDAQ        ARKR  2017-01-01 06:49:04   
4    1  Ark Restaurants Corp.        NASDAQ        ARKR  2017-01-01 06:49:04   

                                          file-title           company-name  \
0  Ark Restaurants' (ARKR) CEO Michael Weinstein ...  Ark Restaurants Corp.   
1  Ark Restaurants' (ARKR) CEO Michael Weinstein ...  Ark Restaurants Corp.   
2  Ark Restaurants' (ARKR) CEO Michael Weinstein ...  Ark Restaurants Corp.   
3  Ark Restaurants' (ARKR) CEO Michael Weinstein ...  Ark Restaurants Corp.   
4  Ark Restaurants' (ARKR) CEO Michael Weinstein ...  Ark Restaurants Corp.   

                 quarter-year-event symbol-e

In [11]:
# Use the publication time as the unique identifier of each transcript
# Collect the distinct times from the statements
transcript_times = set(statements['file-published-on'])

In [12]:
# Ensure that statement and Q&A times match
qna_times = set(qna['file-published-on'])
assert transcript_times == qna_times

In [13]:
print("There are a total of %d distinct transcripts" % len(transcript_times))

There are a total of 2266 distinct transcripts
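
If the consistency check between the two tables ever fails, the assertion alone does not say which transcripts are affected. One way to find out, sketched here on toy frames (the frames and their values are invented stand-ins for `statements` and `qna`), is to look at the symmetric difference of the two ID sets.

```python
import pandas as pd

# Toy stand-ins for the statements and Q&A tables; values are made up.
statements_toy = pd.DataFrame({"file-published-on": ["t1", "t1", "t2", "t3"]})
qna_toy = pd.DataFrame({"file-published-on": ["t1", "t2", "t2", "t4"]})

statement_ids = set(statements_toy["file-published-on"])
qna_ids = set(qna_toy["file-published-on"])

# IDs that appear in exactly one of the two tables.
unmatched = statement_ids ^ qna_ids

# IDs that are safe to process because both tables contain them.
shared = statement_ids & qna_ids
```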


We will now read in the lists of analysts and executives so we can determine what's a question and what's an answer later on.

In [14]:
analysts = pd.read_csv('./data/analysts.csv')
analysts.head()

Unnamed: 0,num,file-company-name,file-exchange,file-symbol,file-published-on,file-title,company-name,quarter-year-event,symbol-exchange,symbol,datetime,analyst-company
0,1,Ark Restaurants Corp.,NASDAQ,ARKR,2017-01-01 06:49:04,Ark Restaurants' (ARKR) CEO Michael Weinstein ...,Ark Restaurants Corp.,Q4 2016 Earnings Conference Call,,ARKR,"December 30, 2016 10:00 A.M. ET",Bruce Geller - DGHM
1,2,UniFirst Corp.,NYSE,UNF,2017-01-04 13:33:06,UniFirst's (UNF) CEO Ronald Croatti on Q1 2017...,UniFirst Corporation.,Q1 2017 Earnings Conference Call,,UNF,"January 4, 2017 10:00 AM ET",John Healy - Northcoast Research
2,2,UniFirst Corp.,NYSE,UNF,2017-01-04 13:33:06,UniFirst's (UNF) CEO Ronald Croatti on Q1 2017...,UniFirst Corporation.,Q1 2017 Earnings Conference Call,,UNF,"January 4, 2017 10:00 AM ET",Kevin Steinke - Barrington Research Associates...
3,2,UniFirst Corp.,NYSE,UNF,2017-01-04 13:33:06,UniFirst's (UNF) CEO Ronald Croatti on Q1 2017...,UniFirst Corporation.,Q1 2017 Earnings Conference Call,,UNF,"January 4, 2017 10:00 AM ET",Joe Box - KeyBanc Capital Markets
4,2,UniFirst Corp.,NYSE,UNF,2017-01-04 13:33:06,UniFirst's (UNF) CEO Ronald Croatti on Q1 2017...,UniFirst Corporation.,Q1 2017 Earnings Conference Call,,UNF,"January 4, 2017 10:00 AM ET","Justin Hauke - Robert W. Baird & Company, Inc."


In [15]:
executives = pd.read_csv('./data/executives.csv')
executives.head()

Unnamed: 0,num,file-company-name,file-exchange,file-symbol,file-published-on,file-title,company-name,quarter-year-event,symbol-exchange,symbol,datetime,executive-positions
0,1,Ark Restaurants Corp.,NASDAQ,ARKR,2017-01-01 06:49:04,Ark Restaurants' (ARKR) CEO Michael Weinstein ...,Ark Restaurants Corp.,Q4 2016 Earnings Conference Call,,ARKR,"December 30, 2016 10:00 A.M. ET",Bob Stewart - President and Chief Financial Of...
1,1,Ark Restaurants Corp.,NASDAQ,ARKR,2017-01-01 06:49:04,Ark Restaurants' (ARKR) CEO Michael Weinstein ...,Ark Restaurants Corp.,Q4 2016 Earnings Conference Call,,ARKR,"December 30, 2016 10:00 A.M. ET",Michael Weinstein - Chairman and Chief Executi...
2,2,UniFirst Corp.,NYSE,UNF,2017-01-04 13:33:06,UniFirst's (UNF) CEO Ronald Croatti on Q1 2017...,UniFirst Corporation.,Q1 2017 Earnings Conference Call,,UNF,"January 4, 2017 10:00 AM ET",Ronald Croatti - President and CEO
3,2,UniFirst Corp.,NYSE,UNF,2017-01-04 13:33:06,UniFirst's (UNF) CEO Ronald Croatti on Q1 2017...,UniFirst Corporation.,Q1 2017 Earnings Conference Call,,UNF,"January 4, 2017 10:00 AM ET",Steven Sintros - SVP and CFO
4,3,"Resources Connection, Inc.",NASDAQ,RECN,2017-01-04 23:16:06,Resources Connection's (RECN) CEO Kate Duchene...,"Resources Connection, Inc.",Q2 2017 Earnings Conference Call,,RECN,"January 4, 2017 17:00 ET",Alice Washington - Interim General Counsel


In [16]:
# We will use set representations of analysts and executives later on
# Note that we store both with and without company since formatting may vary
analyst_set = set([])
for i in range(analysts.shape[0]):
    curr_analyst = str(analysts.iloc[i]['analyst-company'])
    analyst_set.add(curr_analyst)
    curr_analyst = curr_analyst.split(' - ')
    if len(curr_analyst) > 1:
        analyst_set.add(curr_analyst[0])

In [17]:
exec_set = set([])
for i in range(executives.shape[0]):
    curr_exec = str(executives.iloc[i]['executive-positions'])
    exec_set.add(curr_exec)
    curr_exec = curr_exec.split(' - ')
    if len(curr_exec) > 1:
        exec_set.add(curr_exec[0])
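
The two cells above follow the same pattern, so they could share one helper. The sketch below is one possible formulation (the name `build_speaker_set` and the sample entries are hypothetical); it stores each entry both in full and, when a ' - ' separator is present, as the bare name before it.

```python
def build_speaker_set(entries):
    """Collect each entry as-is and, if it has a ' - ' separator, the name part too."""
    speakers = set()
    for entry in entries:
        entry = str(entry)
        speakers.add(entry)
        name, separator, _ = entry.partition(" - ")
        if separator:
            speakers.add(name)
    return speakers


# Invented sample entries in the 'Name - Company' style used by the CSV files.
sample_speakers = build_speaker_set(["Jane Doe - Example Research", "John Roe"])
```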

We are finally ready to form the transcripts data structure.

In [18]:
stop_words = set(stopwords.words('english'))

In [19]:
CHUNK_SZ = 64

In [20]:
def create_chunks(tokens):
    '''
    Form a list of strings with at most CHUNK_SZ words each
    '''
    result = []
    for i in range(0, len(tokens), CHUNK_SZ):
        offset = min(CHUNK_SZ, len(tokens) - i)
        curr_chunk = tokens[i:i + offset]
        curr_str = ' '.join(curr_chunk)
        result.append(curr_str)
    return result
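
To see what the chunking does, here is a small standalone variant that takes the chunk size as a parameter (the name `chunk_tokens` is ours; the notebook's `create_chunks` uses the global `CHUNK_SZ` instead). The last chunk is simply shorter when the token count is not a multiple of the chunk size.

```python
def chunk_tokens(tokens, chunk_size):
    """Join consecutive tokens into strings of at most chunk_size words."""
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


# Seven tokens with a chunk size of three give chunks of 3, 3 and 1 words.
example_chunks = chunk_tokens(["a", "b", "c", "d", "e", "f", "g"], 3)
```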

In [21]:
def form_statement(row):
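    '''
    Starting at the given row, collects all statement rows that belong to one transcript.
    Returns (chunks, next_row), where chunks is a list of the form [chunk1, chunk2, ...]
    and each chunk is a string containing at most CHUNK_SZ words (we remove stop words
    and punctuation); next_row is the first row of the following transcript.
    '''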
    transcript_id = statements.iloc[row]['file-published-on']
    result = []
    
    while(row < statements.shape[0] and statements.iloc[row]['file-published-on'] == transcript_id):
        curr_text = str(statements.iloc[row]['content'])
        curr_tokens = word_tokenize(curr_text)
        curr_tokens = [tok.lower() for tok in curr_tokens if (tok.lower() not in stop_words and tok.isalnum())]
        # Ignore single token (e.g. NaN)
        if len(curr_tokens) > 1:
            result.extend(create_chunks(curr_tokens))
        row += 1

    return result, row
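
The filtering step can be illustrated without NLTK. The sketch below is a simplified stand-in, not the notebook's pipeline: it uses a regular expression instead of `word_tokenize` and a tiny hand-picked stop word set instead of the NLTK English list, so its output will differ from the real one on many inputs. It also shows why the single-token check catches missing content: `str(NaN)` becomes the lone token `nan`.

```python
import re

# Hypothetical mini stop word list; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"the", "a", "to", "of", "and", "we", "is", "in"}


def normalize(text):
    """Lowercase alphanumeric tokens and drop the toy stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [tok.lower() for tok in tokens if tok.lower() not in TOY_STOP_WORDS]


normalized = normalize("We expect revenue to grow 5% in Q4, and margins to hold.")

# A missing value turned into a string yields exactly one token.
missing_tokens = normalize(str(float("nan")))
```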

In [22]:
def is_question(row):
    '''
    Returns True if the speaker of the given Q&A row is in the analyst set, False otherwise.
    '''
    speaker = qna.iloc[row]['name']
    return speaker in analyst_set

In [23]:
def form_qna(row):
    '''
    Starting at the given row, collects all Q&A rows that belong to one transcript.
    Returns (pairs, next_row), where pairs is a list of the form
    [(Q1, 0), (A1, 1), (Q2, 0), (A2, 1), ...]: the element at index 1 of each tuple
    is 0 for a question and 1 for an answer, and next_row is the first row of the
    following transcript.

    Note that we do not create chunks for the questions and answers; we treat them
    as monolithic entities because of the way our training procedure is defined later on.
    '''
    transcript_id = qna.iloc[row]['file-published-on']
    result = []
    
    while(row < qna.shape[0] and qna.iloc[row]['file-published-on'] == transcript_id):
        curr_text = str(qna.iloc[row]['content'])
        curr_tokens = word_tokenize(curr_text)
        curr_tokens = [tok.lower() for tok in curr_tokens if (tok.lower() not in stop_words and tok.isalnum())]
        # Ignore single token (e.g. NaN)
        if len(curr_tokens) > 1:
            curr_text = ' '.join(curr_tokens)
            if is_question(row): result.append((curr_text, 0))
            else: result.append((curr_text, 1))
        row += 1
        
    return result, row
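
The labeling rule is binary: a turn is a question if its speaker is a known analyst, and an answer otherwise. The sketch below (the helper `label_turns` and the sample turns are hypothetical) makes one consequence visible: any speaker who is not in the analyst set, such as a call operator, is labeled as an answer.

```python
def label_turns(turns, analyst_names):
    """Label each (speaker, text) turn with 0 for analysts and 1 for everyone else."""
    return [
        (text, 0 if speaker in analyst_names else 1)
        for speaker, text in turns
    ]


# Invented speakers and text, for illustration only.
toy_analysts = {"Jane Doe"}
toy_turns = [
    ("Operator", "first question comes from jane doe"),
    ("Jane Doe", "what drove margins quarter"),
    ("John Roe", "mainly lower input costs"),
]
labeled = label_turns(toy_turns, toy_analysts)
```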

In [32]:
def form_transcripts():
    result = []
    statement_row, qna_row = 0, 0
    for i in range(len(transcript_times)):
        # Get name and time; print both values whenever the statements and Q&A tables disagree
        company_name = statements.iloc[statement_row]['company-name']
        if not(qna.iloc[qna_row]['company-name'] == company_name):
            print(company_name)
            print(qna.iloc[qna_row]['company-name'])
        transcript_time = statements.iloc[statement_row]['file-published-on']
        if not(qna.iloc[qna_row]['file-published-on'] == transcript_time):
            print(transcript_time)
            print(qna.iloc[qna_row]['file-published-on'])
        
        curr_statement, statement_row = form_statement(statement_row)
        curr_qna, qna_row = form_qna(qna_row)
        curr_transcript = [company_name, transcript_time, curr_statement, curr_qna]
        result.append(curr_transcript)
    return result

In [33]:
transcripts = form_transcripts()

nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
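
The `nan` lines above are printed in pairs by the company-name check in `form_transcripts`. A likely explanation, assuming both tables have a missing company name for those calls, is that NaN never compares equal to itself, so the check reports a mismatch even though the two values are both missing. The sketch below demonstrates this and shows one possible comparison that treats two missing values as consistent (the helper name `same_or_both_missing` is ours).

```python
import math


def is_missing(value):
    """True for float NaN, the value pandas uses for missing entries."""
    return isinstance(value, float) and math.isnan(value)


def same_or_both_missing(left, right):
    """Equality check that treats two missing values as a match."""
    if is_missing(left) or is_missing(right):
        return is_missing(left) and is_missing(right)
    return left == right


nan_a, nan_b = float("nan"), float("nan")
plain_equality = nan_a == nan_b                    # NaN is never equal to NaN
tolerant_equality = same_or_both_missing(nan_a, nan_b)
```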


In [34]:
def get_statements_row(time, row):
    '''
    Skips the rows of the transcript published at `time`. Returns (next_time, row)
    for the following transcript, or (None, row) once the table is exhausted.
    '''
    n_rows = statements.shape[0]
    while row < n_rows and statements.iloc[row]['file-published-on'] == time:
        row += 1
    # Avoid indexing past the last row after the final transcript
    next_time = statements.iloc[row]['file-published-on'] if row < n_rows else None
    return next_time, row

In [35]:
def get_qna_row(time, row):
    '''
    Same as get_statements_row, but for the Q&A table.
    '''
    n_rows = qna.shape[0]
    while row < n_rows and qna.iloc[row]['file-published-on'] == time:
        row += 1
    # Avoid indexing past the last row after the final transcript
    next_time = qna.iloc[row]['file-published-on'] if row < n_rows else None
    return next_time, row

In [36]:
def form_id_to_row():
    '''
    We form a mapping from ID (i.e. transcript time) to a tuple of the
    form (row in statements, row in Q&A) for easy lookups later on
    '''
    result = {}
    curr_transcript_time = statements.iloc[0]['file-published-on']
    result[curr_transcript_time] = (0, 0)
    statements_row, qna_row = 0, 0
    for i in range(len(transcript_times)):
        next_transcript_time, statements_row = get_statements_row(curr_transcript_time, statements_row)
        check, qna_row = get_qna_row(curr_transcript_time, qna_row)
        # The two tables should move on to the same transcript; print both if they do not
        if check != next_transcript_time:
            print(check)
            print(next_transcript_time)
        if next_transcript_time is None:
            # Reached the end of the statements table; nothing left to index
            break
        curr_transcript_time = next_transcript_time
        result[curr_transcript_time] = (statements_row, qna_row)
    return result

In [37]:
id_to_row = form_id_to_row()
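
An alternative way to arrive at the same kind of mapping, sketched here on toy frames, is to record the first position at which each ID appears in each table and then pair the two positions. This is not the notebook's method; the helper name `first_row_by_id` and the toy data are hypothetical, and the sketch assumes every ID occurs in both tables.

```python
import pandas as pd


def first_row_by_id(df, id_col="file-published-on"):
    """Map each ID to the position of its first row in the frame."""
    first = {}
    for position, value in enumerate(df[id_col].tolist()):
        first.setdefault(value, position)
    return first


# Toy stand-ins for the statements and Q&A tables; values are made up.
statements_toy = pd.DataFrame({"file-published-on": ["t1", "t1", "t2", "t3", "t3"]})
qna_toy = pd.DataFrame({"file-published-on": ["t1", "t2", "t2", "t3"]})

statement_first = first_row_by_id(statements_toy)
qna_first = first_row_by_id(qna_toy)

# ID -> (first row in statements, first row in Q&A)
id_to_row_toy = {t: (statement_first[t], qna_first[t]) for t in statement_first}
```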

In [38]:
print(transcripts[1300])

['Cummins, Inc.', '2017-02-09 15:24:33', ['good day ladies gentlemen welcome q4 2016 cummins earnings conference call time participants mode later conduct session instructions follow time would like turn call mark smith vice president finance operations please go ahead', 'thank good morning everyone welcome teleconference today discuss cummins results fourth quarter 2016 joining today chairman chief executive officer tom linebarger chief financial officer pat ward president chief operating officer rich freeland start please note information hear given today consist statements within meaning securities exchange act 1934 statements express forecast expectations hopes beliefs intentions strategies regarding future actual future results could differ materially projected', 'statements number risks uncertainties information regarding risks uncertainties available disclosure statement slide deck filings sec particularly risk factors section recently filed annual report form subsequently filed

In [39]:
print(id_to_row['2017-02-09 15:24:33'])

(7378, 91406)


In [40]:
print(statements.iloc[7378])

num                                                                 324
file-company-name                                          Cummins Inc.
file-exchange                                                      NYSE
file-symbol                                                         CMI
file-published-on                                   2017-02-09 15:24:33
file-title            Cummins (CMI) Q4 2016 Results - Earnings Call ...
company-name                                              Cummins, Inc.
quarter-year-event                                Q4 2016 Earnings Call
symbol-exchange                                                     NaN
symbol                                                              CMI
datetime                                  February 09, 2017 10:00 am ET
order                                                                 1
name                                                       Presentation
content                                                         

In [41]:
print(qna.iloc[91406])

num                                                                 324
file-company-name                                          Cummins Inc.
file-exchange                                                      NYSE
file-symbol                                                         CMI
file-published-on                                   2017-02-09 15:24:33
file-title            Cummins (CMI) Q4 2016 Results - Earnings Call ...
company-name                                              Cummins, Inc.
quarter-year-event                                Q4 2016 Earnings Call
symbol-exchange                                                     NaN
symbol                                                              CMI
datetime                                  February 09, 2017 10:00 am ET
order                                                                 1
name                                                           Operator
content               Our first question comes from Tim Thein wi

In [42]:
with open('data/transcripts.pickle', 'wb') as f:
    pickle.dump(transcripts, f)
with open('data/id_to_row.pickle', 'wb') as f:
    pickle.dump(id_to_row, f)

In [43]:
with open('data/transcripts.pickle', 'rb') as f:
    test_transcripts_load = pickle.load(f)
with open('data/id_to_row.pickle', 'rb') as f:
    test_id_to_row_load = pickle.load(f)

In [44]:
print(test_transcripts_load[1300])

['Cummins, Inc.', '2017-02-09 15:24:33', ['good day ladies gentlemen welcome q4 2016 cummins earnings conference call time participants mode later conduct session instructions follow time would like turn call mark smith vice president finance operations please go ahead', 'thank good morning everyone welcome teleconference today discuss cummins results fourth quarter 2016 joining today chairman chief executive officer tom linebarger chief financial officer pat ward president chief operating officer rich freeland start please note information hear given today consist statements within meaning securities exchange act 1934 statements express forecast expectations hopes beliefs intentions strategies regarding future actual future results could differ materially projected', 'statements number risks uncertainties information regarding risks uncertainties available disclosure statement slide deck filings sec particularly risk factors section recently filed annual report form subsequently filed

In [45]:
print(test_id_to_row_load['2017-02-09 15:24:33'])

(7378, 91406)
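
The reload above is checked by eye on a single entry. A stricter check, sketched here on a small made-up object in a temporary directory, is to compare the reloaded object with the original for equality.

```python
import os
import pickle
import tempfile

# Invented object with the same nesting as one transcripts entry.
original = [["Example Corp.", "2017-01-01 00:00:00", ["chunk one"], [("question text", 0)]]]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "transcripts.pickle")
    with open(path, "wb") as f:
        pickle.dump(original, f)
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

# Lists, strings and ints compare by value; note that pickle does not
# preserve every type distinction in all cases, so this is a sanity check only.
round_trip_ok = reloaded == original
```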
