# Earnings Call Project: MORE
<br>
CIS 831 Deep Learning – Term Project<br>
Kansas State University
<br><br>
James Chapman<br>
John Woods<br>
Nathan Diehl<br>
<br>

### This notebook featurizes the text data from the earnings calls with RoBERTa and SentenceTransformer.

RoBERTa documentation can be found at https://huggingface.co/FacebookAI/roberta-large


[SentenceTransformer](https://www.sbert.net/) is used with the following three models from Hugging Face.
- [finance-embeddings-investopedia](https://huggingface.co/FinLang/finance-embeddings-investopedia)
- [bge-m3-financial-matryoshka](https://huggingface.co/haophancs/bge-m3-financial-matryoshka)
- [bge-base-financial-matryoshka](https://huggingface.co/philschmid/bge-base-financial-matryoshka)

The features produced by this notebook are stored in the "data/data_prep" directory as the following CSVs.
- RoBERTa_features (1024 features)
- MAEC_RoBERTa_features (1024 features)
- RoBERTa_features2 (1024 features)
- MAEC_RoBERTa_features2 (1024 features)
- investopedia_features (768 features)
- MAEC_investopedia_features (768 features)
- bge_features (1024 features)
- MAEC_bge_features (1024 features)
- bge_base_features (768 features)
- MAEC_bge_base_features (768 features)

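As a rough illustration of how one of these files might be read back, the sketch below builds a tiny stand-in CSV in memory with the same column layout (feature columns named `<model_name>_<j>`, followed by the key columns) and splits it into a feature matrix and keys. The three-feature toy data, the company name, and the helper `split_features` are assumptions made for illustration; the real files have 768 or 1024 feature columns. Because the notebook writes dates as `YYYYMMDD`, reading `Date` as a string keeps pandas from parsing it as an integer.

```python
import io
import pandas as pd

# Toy stand-in for data/data_prep/investopedia_features.csv (3 features instead of 768).
toy_csv = io.StringIO(
    "investopedia_features_0,investopedia_features_1,investopedia_features_2,Company,Date,Sentence_num\n"
    "0.1,0.2,0.3,Example Corp,20170125,1\n"
    "0.4,0.5,0.6,Example Corp,20170125,2\n"
)

def split_features(df, model_name):
    """Hypothetical helper: separate embedding columns from the key columns."""
    feature_cols = [c for c in df.columns if c.startswith(f"{model_name}_")]
    key_cols = [c for c in df.columns if c not in feature_cols]
    return df[feature_cols].to_numpy(dtype="float32"), df[key_cols]

df = pd.read_csv(toy_csv, dtype={"Date": str})
X, keys = split_features(df, "investopedia_features")
print(X.shape)             # (2, 3)
print(list(keys.columns))  # ['Company', 'Date', 'Sentence_num']
```

The trailing underscore in the prefix match keeps `RoBERTa_features` from also picking up `RoBERTa_features2` columns.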

In [1]:
import pandas as pd
import numpy as np
import os
from datetime import datetime
from tqdm import tqdm
from transformers import RobertaModel, RobertaTokenizer
from sentence_transformers import SentenceTransformer
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [2]:
MAEC_dir = 'data/MAEC/MAEC_Dataset' # https://github.com/Earnings-Call-Dataset/MAEC-A-Multimodal-Aligned-Earnings-Conference-Call-Dataset-for-Financial-Risk-Prediction

############# too big for GitHub ########################
############# stored on local disk ######################
original_data_dir = r"D:\original_dataset" # https://github.com/GeminiLn/EarningsCall_Dataset 
MAEC_audio_dir = r"D:\MAEC_audio" 
# The audio link in the MAEC GitHub repository does not work.
# I emailed the authors, and they sent another link.
# There are roughly half a million files, but they total only 19 GB.
# https://drive.google.com/file/d/1m1GRCHgKn9Vz9IFMC_SpCog6uP3-gFgY/view?usp=drive_link 

In [3]:
# Loop through the directory. Each folder represents one earnings conference call and is named "CompanyName_Date".
filename_data = []
for filename in os.listdir(original_data_dir):
    company_name, date_str = filename.rsplit('_', 1)
    date_str = date_str.split('.')[0] 
    date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    filename_data.append([company_name, date])
filename_data = pd.DataFrame(filename_data, columns=["Company", "Date"])
company_ticker = pd.read_csv('data/data_prep/company_ticker.csv')
filename_data = filename_data.merge(company_ticker, on="Company", how="left")

# Loop through the MAEC directory. Each folder represents one earnings conference call and is named "Date_Ticker".
MAEC_filename_data = []
for filename in os.listdir(MAEC_dir):
    date_str, ticker = filename.rsplit('_', 1)
    date_str = date_str.split('.')[0] 
    date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    MAEC_filename_data.append([ticker, date])
MAEC_filename_data = pd.DataFrame(MAEC_filename_data, columns=["Ticker", "Date"])

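The two loops above differ only in which side of the underscore holds the date. A minimal sketch of that parsing as a single function is shown below; the helper name `parse_call_folder`, its `date_first` flag, and the folder names in the usage lines are made up for illustration and are not taken from either dataset.

```python
from datetime import datetime

def parse_call_folder(name, date_first=False):
    """Hypothetical helper: split a folder name into (identifier, 'YYYY-MM-DD')."""
    if date_first:
        # MAEC layout: Date_Ticker
        date_str, identifier = name.split("_", 1)
    else:
        # original dataset layout: CompanyName_Date (company names may contain underscores)
        identifier, date_str = name.rsplit("_", 1)
        date_str = date_str.split(".")[0]  # drop a file extension, if any
    date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    return identifier, date

print(parse_call_folder("Example Corp_20170125"))          # ('Example Corp', '2017-01-25')
print(parse_call_folder("20170125_XYZ", date_first=True))  # ('XYZ', '2017-01-25')
```

`strptime` raises `ValueError` on a folder name that does not end (or start) with an eight-digit date, which makes stray files in the directory easy to spot.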
In [4]:
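def apply_model(model, model_name, num_features):
    """Embed every line of each original-dataset transcript and save the rows to CSV."""
    print(f'Applying {model_name} to the original dataset …')
    # one column per embedding dimension, followed by the key columns
    columns = [f'{model_name}_{j}' for j in range(num_features)] + ['Company', 'Date', 'Sentence_num']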
    features = []
    errors = []
    for Company, Date in tqdm(filename_data[['Company', 'Date']].values):
        Date = Date.replace('-', '')
        text_path = os.path.join(original_data_dir, f"{Company}_{Date}", "TextSequence.txt")
        try:
            with open(text_path, 'r', encoding='utf-8', errors='replace') as file:
                for i, line in enumerate(file, start=1):
                    # apply model
                    sentence_embedding = model(line.strip())
                    features_row = np.concatenate([sentence_embedding.flatten(), [Company, Date, i]])
                    features.append(features_row)
        except KeyboardInterrupt: break
        except Exception as e:
            errors.append((Company, Date, str(e)))
    features = np.array(features, dtype=object)
    features = pd.DataFrame(features, columns=columns)
    features.info(verbose=False)
    
    print(f"Number of errors: {len(errors)}")
    print(errors)
    ###############################################
    features.to_csv(f'data/data_prep/{model_name}.csv', index=False)
    ###############################################

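def apply_model_MAEC(model, model_name, num_features):
    """Embed every line of each MAEC transcript and save the rows to CSV."""
    print(f'Applying {model_name} to the MAEC dataset …')
    # one column per embedding dimension, followed by the key columns
    columns = [f'{model_name}_{j}' for j in range(num_features)] + ['Ticker', 'Date', 'Sentence_num']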
    features = []
    errors = []
    for Ticker, Date in tqdm(MAEC_filename_data[['Ticker', 'Date']].values):
        Date = Date.replace('-', '')
        text_path = os.path.join(MAEC_audio_dir, f"{Date}_{Ticker}", "text.txt")
        try:
            with open(text_path, 'r', encoding='utf-8', errors='replace') as file:
                for i, line in enumerate(file, start=1):
                    # apply model
                    sentence_embedding = model(line.strip())
                    features_row = np.concatenate([sentence_embedding.flatten(), [Ticker, Date, i]])
                    features.append(features_row)
        except KeyboardInterrupt: break
        except Exception as e:
            errors.append((Ticker, Date, str(e)))
    features = np.array(features, dtype=object)
    features = pd.DataFrame(features, columns=columns)
    features.info(verbose=False)
    
    print(f"Number of errors: {len(errors)}")
    print(errors)
    ###############################################
    features.to_csv(f'data/data_prep/MAEC_{model_name}.csv', index=False)
    ###############################################

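Both functions above share the same inner loop: read a transcript line by line, embed each line, and append the key columns plus the 1-based line number. A minimal sketch of that loop in isolation follows. It is one possible variant, not the notebook's code: it builds each row as a plain Python list so the embedding values stay floats next to the string keys. The helper `featurize_lines`, the stand-in encoder `toy_embed`, and the sample transcript are assumptions for illustration.

```python
import io

def featurize_lines(lines, embed, keys):
    """Hypothetical helper: one row per line = embedding values + keys + line number."""
    rows = []
    for i, line in enumerate(lines, start=1):
        vector = [float(x) for x in embed(line.strip())]
        rows.append(vector + list(keys) + [i])
    return rows

def toy_embed(sentence):
    # stand-in for a real encoder: (character count, word count)
    return [len(sentence), len(sentence.split())]

transcript = io.StringIO("Good morning everyone.\nRevenue grew this quarter.\n")
rows = featurize_lines(transcript, toy_embed, keys=("Example Corp", "20170125"))
print(rows[0])  # [22.0, 3.0, 'Example Corp', '20170125', 1]
```

Any callable that maps a string to a flat sequence of numbers, such as the `get_RoBERTa` or `get_Sentence_Transformer` wrappers after `.flatten()`, could take the place of `toy_embed`.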

# RoBERTa features from meeting transcript text files

RoBERTa documentation can be found at https://huggingface.co/FacebookAI/roberta-large

### The following code is adapted from
[GitHub HTML Encoder](https://github.com/YangLinyi/HTML-Hierarchical-Transformer-based-Multi-task-Learning-for-Volatility-Prediction/blob/master/Model/Token-Level%20Encoder/HuggingFace-Roberta-Token-Encoder.py)

In [5]:

model = RobertaModel.from_pretrained('roberta-large').to(device)
tokenizer = RobertaTokenizer.from_pretrained('roberta-large')

def get_RoBERTa(sentence):
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
        # embedding of the first token (<s>, RoBERTa's counterpart to [CLS]) as the sentence-level representation
        cls_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy()
    # shape (1, 1024)
    return cls_embedding


def get_RoBERTa_with_averaging(sentence):
    inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # average pooling over all token embeddings -> shape (1, 1024)
    token_embeddings = outputs.last_hidden_state
    sentence_embedding = torch.mean(token_embeddings, dim=1).cpu().numpy()
    return sentence_embedding

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

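Because each call above passes a single sentence, the tokenizer adds no padding, and `torch.mean` over `dim=1` averages exactly that sentence's tokens. If several sentences were batched with `padding=True`, padded positions would also enter the mean unless they are masked out. The sketch below shows a masked mean under that assumption. It uses NumPy arrays as stand-ins for `last_hidden_state` and `attention_mask`, and the function name `masked_mean` is hypothetical.

```python
import numpy as np

def masked_mean(token_embeddings, attention_mask):
    """Average token embeddings over real (non-padding) positions only."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (B, T, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (B, H)
    counts = np.clip(mask.sum(axis=1), 1.0, None)                    # (B, 1), avoids divide-by-zero
    return summed / counts

# One sequence of three positions with hidden size 2; the last position is padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(masked_mean(hidden, mask))  # [[2. 3.]]
```

An unmasked mean over the same toy input would be pulled toward the padding values, which is why the mask matters once inputs are batched.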

In [6]:
# RoBERTa
apply_model(get_RoBERTa, 'RoBERTa_features', 1024 )
apply_model_MAEC(get_RoBERTa, 'RoBERTa_features', 1024)
# RoBERTa_with_averaging
apply_model(get_RoBERTa_with_averaging, 'RoBERTa_features2', 1024 )
apply_model_MAEC(get_RoBERTa_with_averaging, 'RoBERTa_features2', 1024)

Applying RoBERTa_features to the original dataset …


100%|██████████| 572/572 [12:21<00:00,  1.30s/it]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89722 entries, 0 to 89721
Columns: 1027 entries, RoBERTa_features_0 to Sentence_num
dtypes: object(1027)
memory usage: 703.0+ MB
Number of errors: 0
[]
Applying RoBERTa_features to the MAEC dataset …


100%|██████████| 3443/3443 [55:33<00:00,  1.03it/s]  


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394277 entries, 0 to 394276
Columns: 1027 entries, RoBERTa_features_0 to Sentence_num
dtypes: object(1027)
memory usage: 3.0+ GB
Number of errors: 0
[]
Applying RoBERTa_features2 to the original dataset …


100%|██████████| 572/572 [11:53<00:00,  1.25s/it]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89722 entries, 0 to 89721
Columns: 1027 entries, RoBERTa_features2_0 to Sentence_num
dtypes: object(1027)
memory usage: 703.0+ MB
Number of errors: 0
[]
Applying RoBERTa_features2 to the MAEC dataset …


100%|██████████| 3443/3443 [53:02<00:00,  1.08it/s]  


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394277 entries, 0 to 394276
Columns: 1027 entries, RoBERTa_features2_0 to Sentence_num
dtypes: object(1027)
memory usage: 3.0+ GB
Number of errors: 0
[]


# SentenceTransformers

## 1. FinLang/finance-embeddings-investopedia

In [7]:
torch.cuda.empty_cache()
Sentence_Transformer_model = SentenceTransformer("FinLang/finance-embeddings-investopedia", device=device) # embedding shape: (768,)
def get_Sentence_Transformer(sentence):
    sentence_embedding = Sentence_Transformer_model.encode(sentence)
    return sentence_embedding.flatten()
# investopedia
apply_model(get_Sentence_Transformer, 'investopedia_features', 768)
apply_model_MAEC(get_Sentence_Transformer, 'investopedia_features', 768)


Applying investopedia_features to the original dataset …


100%|██████████| 572/572 [10:37<00:00,  1.11s/it]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89722 entries, 0 to 89721
Columns: 771 entries, investopedia_features_0 to Sentence_num
dtypes: object(771)
memory usage: 527.8+ MB
Number of errors: 0
[]
Applying investopedia_features to the MAEC dataset …


100%|██████████| 3443/3443 [41:49<00:00,  1.37it/s] 


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394277 entries, 0 to 394276
Columns: 771 entries, investopedia_features_0 to Sentence_num
dtypes: object(771)
memory usage: 2.3+ GB
Number of errors: 0
[]

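The wrapper above encodes one line per call. `SentenceTransformer.encode` also accepts a list of sentences and returns one row per sentence, so a transcript could be encoded in chunks instead. The sketch below shows only the chunking logic, with a stand-in `toy_encode` so it runs without downloading a model. The helper name `encode_in_batches` and the batch size are assumptions, and whether batching shortens the run times reported here has not been measured.

```python
import numpy as np

def encode_in_batches(sentences, encode, batch_size=64):
    """Hypothetical helper: call `encode` on chunks of sentences and stack the results."""
    chunks = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        chunks.append(np.asarray(encode(batch)))
    return np.vstack(chunks)

def toy_encode(batch):
    # stand-in for Sentence_Transformer_model.encode: one 2-d vector per sentence
    return np.array([[len(s), s.count(" ") + 1] for s in batch], dtype="float32")

lines = ["Good morning.", "Thank you all for joining.", "Next question please."]
embeddings = encode_in_batches(lines, toy_encode, batch_size=2)
print(embeddings.shape)  # (3, 2)
```

With a real model, `Sentence_Transformer_model.encode` would be passed in place of `toy_encode`; row `k` of the result corresponds to line `k + 1` of the transcript.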

## 2. haophancs/bge-m3-financial-matryoshka

In [8]:
torch.cuda.empty_cache()
Sentence_Transformer_model = SentenceTransformer("haophancs/bge-m3-financial-matryoshka", device=device) # embedding shape: (1024,)
def get_Sentence_Transformer(sentence):
    sentence_embedding = Sentence_Transformer_model.encode(sentence)
    return sentence_embedding.flatten()
# bge
apply_model(get_Sentence_Transformer, 'bge_features', 1024)
apply_model_MAEC(get_Sentence_Transformer, 'bge_features', 1024)

Applying bge_features to the original dataset …


100%|██████████| 572/572 [18:57<00:00,  1.99s/it]  


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89722 entries, 0 to 89721
Columns: 1027 entries, bge_features_0 to Sentence_num
dtypes: object(1027)
memory usage: 703.0+ MB
Number of errors: 0
[]
Applying bge_features to the MAEC dataset …


100%|██████████| 3443/3443 [1:15:20<00:00,  1.31s/it]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394277 entries, 0 to 394276
Columns: 1027 entries, bge_features_0 to Sentence_num
dtypes: object(1027)
memory usage: 3.0+ GB
Number of errors: 0
[]


## 3. philschmid/bge-base-financial-matryoshka

In [9]:
torch.cuda.empty_cache()
Sentence_Transformer_model = SentenceTransformer("philschmid/bge-base-financial-matryoshka", device=device) # embedding shape: (768,)
def get_Sentence_Transformer(sentence):
    sentence_embedding = Sentence_Transformer_model.encode(sentence)
    return sentence_embedding.flatten()
# bge_base
apply_model(get_Sentence_Transformer, 'bge_base_features', 768)
apply_model_MAEC(get_Sentence_Transformer, 'bge_base_features', 768)

Applying bge_base_features to the original dataset …


100%|██████████| 572/572 [09:20<00:00,  1.02it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89722 entries, 0 to 89721
Columns: 771 entries, bge_base_features_0 to Sentence_num
dtypes: object(771)
memory usage: 527.8+ MB
Number of errors: 0
[]
Applying bge_base_features to the MAEC dataset …


100%|██████████| 3443/3443 [40:15<00:00,  1.43it/s] 


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394277 entries, 0 to 394276
Columns: 771 entries, bge_base_features_0 to Sentence_num
dtypes: object(771)
memory usage: 2.3+ GB
Number of errors: 0
[]
