# Earnings Call Project: Data Cleaning
<br>
CIS 831 Deep Learning – Term Project<br>
Kansas State University
<br><br>
James Chapman<br>
John Woods<br>
Nathan Diehl<br>
<br>

This notebook creates the data used for training and testing.
- Calculates the targets for both datasets
- Corrects the Praat features for both datasets
- Combines the features (GloVe & Praat) or (RoBERTa & Praat) with the targets
- Saves 3 NumPy files specifically for HTML (RoBERTa & Praat)
    - TEXT_emb
    - LABEL_emb
    - LABEL_emb_b
- Note: 7 meetings were removed because we could not find stock data for them (Yahoo or Alpha Vantage)<br>

The rest of the data from this notebook is stored in the "data" directory as the following CSVs
- original_dataset
- MAEC_dataset

In [4]:
import sys
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    !pip install yfinance
    from google.colab import drive
    drive.mount('/content/gdrive')
    %cd gdrive/My Drive/831

In [5]:
import pandas as pd
import numpy as np
import requests
import time
import json
import csv
import re
import os
from datetime import datetime
from tqdm import tqdm

tqdm.pandas()

In [6]:
MAEC_dir = 'data/MAEC/MAEC_Dataset' # https://github.com/Earnings-Call-Dataset/MAEC-A-Multimodal-Aligned-Earnings-Conference-Call-Dataset-for-Financial-Risk-Prediction

############# too big for GitHub ########################
############# stored on local disk ######################
original_data_dir = r"D:\original_dataset" # https://github.com/GeminiLn/EarningsCall_Dataset 
MAEC_audio_dir = r"D:\MAEC_audio" 
# The MAEC GitHub links to the audio data, but that link does not work.
# I emailed the authors, and they sent another link.
# There are roughly half a million files, but they total only 19 GB.
# https://drive.google.com/file/d/1m1GRCHgKn9Vz9IFMC_SpCog6uP3-gFgY/view?usp=drive_link 

# from Webscraping
alpha_dir = 'data/data_prep/alpha_data/{}.csv' #.format(ticker) # raw Alpha Vantage data saved per ticker, so it does not have to be downloaded again
yahoo_data = pd.read_csv('data/data_prep/yahoo_data.csv', index_col=0)
alpha_data = pd.read_csv('data/data_prep/alpha_data.csv', index_col=0)
MAEC_yahoo_data = pd.read_csv('data/data_prep/MAEC_yahoo_data.csv', index_col=0)
MAEC_alpha_data = pd.read_csv('data/data_prep/MAEC_alpha_data.csv', index_col=0)

yahoo_data.index = pd.to_datetime(yahoo_data.index)
alpha_data.index = pd.to_datetime(alpha_data.index)
MAEC_yahoo_data.index = pd.to_datetime(MAEC_yahoo_data.index)
MAEC_alpha_data.index = pd.to_datetime(MAEC_alpha_data.index)

# from Feature Engineering
glove_features = pd.read_csv('data/data_prep/glove_features.csv')
praat_features = pd.read_csv('data/data_prep/praat_features.csv', low_memory=False)
MAEC_glove_features = pd.read_csv('data/data_prep/MAEC_glove_features.csv')
MAEC_praat_features = pd.read_csv('data/data_prep/MAEC_praat_features.csv', low_memory=False)

# from Feature Engineering RoBERTa
RoBERTa_features = pd.read_csv('data/data_prep/RoBERTa_features.csv', low_memory=False)
MAEC_RoBERTa_features = pd.read_csv('data/data_prep/MACE_RoBERTa_features.csv', low_memory=False)

FileNotFoundError: [Errno 2] No such file or directory: 'data/data_prep/RoBERTa_features.csv'

In [None]:
# Loop through the directory, each folder represents an earnings conference call; the folders are named as "CompanyName_Date".
filename_data = []
for filename in os.listdir(original_data_dir):
    company_name, date_str = filename.rsplit('_', 1)
    date_str = date_str.split('.')[0] 
    date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    filename_data.append([company_name, date])
filename_data = pd.DataFrame(filename_data, columns=["Company", "Date"])
company_ticker = pd.read_csv('data/data_prep/company_ticker.csv')
filename_data = filename_data.merge(company_ticker, on="Company", how="left")

# Loop through the directory; each folder represents an earnings conference call, and the folders are named as "Date_Ticker".
MAEC_filename_data = []
for filename in os.listdir(MAEC_dir):
    date_str, ticker = filename.rsplit('_', 1)
    date_str = date_str.split('.')[0] 
    date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    MAEC_filename_data.append([ticker, date])
MAEC_filename_data = pd.DataFrame(MAEC_filename_data, columns=["Ticker", "Date"])
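The two loops above differ only in which side of the last underscore holds the date. A minimal sketch of that parsing as one helper follows; `parse_call_folder` and the folder names are hypothetical and used only for illustration.

```python
from datetime import datetime

def parse_call_folder(name, date_first=False):
    """Split an earnings-call folder name into (identifier, ISO date).

    date_first=False handles "CompanyName_YYYYMMDD" (original dataset);
    date_first=True handles "YYYYMMDD_Ticker" (MAEC).
    """
    left, right = name.rsplit("_", 1)  # split on the LAST underscore only
    ident, date_str = (right, left) if date_first else (left, right)
    date_str = date_str.split(".")[0]  # drop a file extension, if any
    iso = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    return ident, iso

# Hypothetical folder names, not taken from either dataset.
print(parse_call_folder("Some_Company Inc._20170302"))
print(parse_call_folder("20170302_XYZ", date_first=True))
```

Splitting on the last underscore matters because company names may themselves contain underscores.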

# Add TARGET of the regression

**n-day volatility predictions**: The prediction target is the log of the average volatility (the standard deviation of daily returns) over the following n days.<br>

$$
v[0,n] = \ln \left( \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (r_i - \bar{r})^2 } \right)
$$

Where:
- \( r_i \) is the stock return on day \(i\),
- \( \bar{r} \) is the average stock return over \(n\) days.

The stock return \(r_i\) is defined as:

$$
r_i = \frac{P_i - P_{i-1}}{P_{i-1}}
$$

Where \(P_i\) is the adjusted closing price of the stock on day \(i\).

For **single-day log volatility**, we estimate it using the **daily log absolute return**:

$$
v_n = \ln \left( \left| \frac{P_n - P_{n-1}}{P_{n-1}} \right| \right)
$$

Where:
- \(P_n\) is the adjusted closing price of the stock on day \(n\),
- \(P_{n-1}\) is the adjusted closing price on the previous day.

Our multi-task learning objective is to **simultaneously predict** these two quantities:
- \(v[0,n]\): The average volatility over \(n\) days (the main task).
- \(v_n\): The single-day volatility (the auxiliary task).
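As a worked example of the two targets, the sketch below computes both from a short, made-up price series. The prices are hypothetical. It also shows one detail worth knowing: the formula divides by n (population standard deviation), whereas `Series.std()`, which the cells in this notebook call with its default, divides by n - 1.

```python
import numpy as np
import pandas as pd

# Hypothetical adjusted closing prices: the day before the call plus n = 3 days.
prices = pd.Series([100.0, 102.0, 99.0, 101.0])

returns = prices.pct_change().dropna()  # r_i = (P_i - P_{i-1}) / P_{i-1}

# Formula as written: population standard deviation (divide by n).
v_window_population = np.log(returns.std(ddof=0))
# pandas default: sample standard deviation (divide by n - 1).
v_window_sample = np.log(returns.std())

# Single-day log volatility for the last day in the window.
v_single = np.log(abs(returns.iloc[-1]))

print(v_window_population, v_window_sample, v_single)
```

The two window values differ by the constant 0.5 * ln(n / (n - 1)), so the choice shifts every target of a given horizon by the same amount.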


In [None]:
targets = filename_data.copy()

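# Yahoo (missing 9 companies)
def add_n_day(row, n_day):
    """Return (log volatility over the window, log |return| on the window's last day)."""
    Ticker = row['Ticker']
    # These 9 tickers have no Yahoo data; inf marks them for the Alpha Vantage fallback.
    if Ticker in ['GGP', 'CA', 'STI', 'FLT', 'NLSN', 'WRK','RTN', 'UTX', 'DISH']:
        return float('inf'), float('inf')
    Date = pd.to_datetime(row['Date'])
    # Window: one calendar day before the call through n_day calendar days after it.
    start = Date - pd.Timedelta(days=(1))
    end = Date + pd.Timedelta(days=(n_day))
    data = yahoo_data.loc[start:end, f"{Ticker}_Adj Close"]  
    stock_return = data.pct_change().dropna() # ri =(Pi − Pi−1)/Pi−1
    std_dev = stock_return.std() # sample standard deviation (ddof=1)
    # 0 marks a window with too few prices to compute a volatility.
    if pd.isna(std_dev) or std_dev == 0:
        return 0, 0
    else:
        # 1e-10 avoids log(0) when the last return is exactly zero
        return np.log(std_dev), np.log(abs(stock_return.iloc[-1]) + 1e-10)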

targets['3_day'], targets['3_day_single'] = zip(*targets.apply(lambda row: add_n_day(row, 3), axis=1))
targets['7_day'], targets['7_day_single'] = zip(*targets.apply(lambda row: add_n_day(row, 7), axis=1))
targets['15_day'], targets['15_day_single'] = zip(*targets.apply(lambda row: add_n_day(row, 15), axis=1))
targets['30_day'], targets['30_day_single'] = zip(*targets.apply(lambda row: add_n_day(row, 30), axis=1))
targets.info(verbose=True)
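The window in `add_n_day` is measured in calendar days, so the number of price rows it returns depends on where weekends fall. The sketch below contrasts that with a window counted in trading rows. It assumes the price table holds one row per business day; the price series and call date are hypothetical.

```python
import pandas as pd

# Hypothetical price series with one row per business day (holidays ignored).
idx = pd.bdate_range("2017-03-01", "2017-03-31")
prices = pd.Series(range(100, 100 + len(idx)), index=idx, dtype=float)

call_date = pd.Timestamp("2017-03-03")  # a Friday
n = 3

# Calendar-day window, as in add_n_day: the weekend rows simply do not exist.
calendar = prices.loc[call_date - pd.Timedelta(days=1): call_date + pd.Timedelta(days=n)]

# Trading-day window: one row before the call through n rows after it.
pos = prices.index.get_loc(call_date)
trading = prices.iloc[pos - 1: pos + n + 1]

print(len(calendar), "rows in the calendar window")
print(len(trading), "rows in the trading-day window")
```

For a Friday call with n = 3, the calendar window holds only three prices (two returns), which is one reason short-horizon targets can come out as 0.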

In [None]:
def add_n_day(row, n_day):
    Ticker = row['Ticker']
    Date = pd.to_datetime(row['Date'])
    start = Date - pd.Timedelta(days=1)
    end = Date + pd.Timedelta(days=n_day)
    # using Alpha Vantage
    data = alpha_data.loc[start:end, Ticker]
    stock_return = data.pct_change().dropna() # ri =(Pi − Pi−1)/Pi−1
    std_dev = stock_return.std()
    if pd.isna(std_dev) or std_dev == 0:
        return 0, 0
    else:
        return np.log(std_dev), np.log(abs(stock_return.iloc[-1]) + 1e-10)

targets['3_day_alpha'], targets['3_day_single_alpha'] = zip(*targets.apply(lambda row: add_n_day(row, 3), axis=1))
targets['7_day_alpha'], targets['7_day_single_alpha'] = zip(*targets.apply(lambda row: add_n_day(row, 7), axis=1))
targets['15_day_alpha'], targets['15_day_single_alpha'] = zip(*targets.apply(lambda row: add_n_day(row, 15), axis=1))
targets['30_day_alpha'], targets['30_day_single_alpha'] = zip(*targets.apply(lambda row: add_n_day(row, 30), axis=1))

Here I compare the targets computed from Yahoo data with those computed from Alpha Vantage data. Why do they differ? How does each source calculate its dividend adjustment?

In [None]:
targets['3_day_diff'] = targets['3_day_alpha'] - targets['3_day']
targets['7_day_diff'] = targets['7_day_alpha'] - targets['7_day']
targets['15_day_diff'] = targets['15_day_alpha'] - targets['15_day']
targets['30_day_diff'] = targets['30_day_alpha'] - targets['30_day']

targets['3_day_pct_change'] = abs(((targets['3_day_alpha'] - targets['3_day']) / targets['3_day_alpha']) * 100)
targets['7_day_pct_change'] = abs(((targets['7_day_alpha'] - targets['7_day']) / targets['7_day_alpha']) * 100)
targets['15_day_pct_change'] = abs(((targets['15_day_alpha'] - targets['15_day']) / targets['15_day_alpha']) * 100)
targets['30_day_pct_change'] = abs(((targets['30_day_alpha'] - targets['30_day']) / targets['30_day_alpha']) * 100)
 
targets['3_day_single_diff'] = targets['3_day_single_alpha'] - targets['3_day_single']
targets['7_day_single_diff'] = targets['7_day_single_alpha'] - targets['7_day_single']
targets['15_day_single_diff'] = targets['15_day_single_alpha'] - targets['15_day_single']
targets['30_day_single_diff'] = targets['30_day_single_alpha'] - targets['30_day_single']

targets['3_day_single_pct_change'] = abs(((targets['3_day_single_alpha'] - targets['3_day_single']) / targets['3_day_single_alpha']) * 100)
targets['7_day_single_pct_change'] = abs(((targets['7_day_single_alpha'] - targets['7_day_single']) / targets['7_day_single_alpha']) * 100)
targets['15_day_single_pct_change'] = abs(((targets['15_day_single_alpha'] - targets['15_day_single']) / targets['15_day_single_alpha']) * 100)
targets['30_day_single_pct_change'] = abs(((targets['30_day_single_alpha'] - targets['30_day_single']) / targets['30_day_single_alpha']) * 100)

# investigate discrepancies: largest single-day disagreement first, ties broken by the 30-day window
targets = targets.sort_values(by=['30_day_single_pct_change', '30_day_pct_change'], ascending=False)
targets.to_csv('data/data_prep/temp.csv', index=False) 
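The comparison columns in the cell above follow one pattern per horizon, so they could also be produced in a loop. This is a minimal sketch under that assumption; `add_comparison_columns` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd

def add_comparison_columns(df, horizons=(3, 7, 15, 30)):
    """Add `<col>_diff` and `<col>_pct_change` for each Yahoo / Alpha Vantage pair."""
    out = df.copy()
    for n in horizons:
        for col in (f"{n}_day", f"{n}_day_single"):
            alpha = out[f"{col}_alpha"]
            diff = alpha - out[col]
            out[f"{col}_diff"] = diff
            # Absolute difference as a percentage of the Alpha Vantage value.
            out[f"{col}_pct_change"] = (diff / alpha * 100).abs()
    return out

# Toy frame with a single horizon.
toy = pd.DataFrame({
    "3_day": [-4.0, -3.5],
    "3_day_alpha": [-4.0, -3.0],
    "3_day_single": [-5.0, -6.0],
    "3_day_single_alpha": [-5.5, -6.0],
})
compared = add_comparison_columns(toy, horizons=(3,))
print(compared[["3_day_diff", "3_day_pct_change"]])
```

The column names match the ones written out by hand, so the later `drop` call would work unchanged.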

In [None]:
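# Replace missing Yahoo targets (inf or 0) with the values computed from Alpha Vantage.
# The *_single columns are updated first because the condition reads the not-yet-overwritten n-day column.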
targets['3_day_single'] = np.where(targets['3_day'] == float('inf'), targets['3_day_single_alpha'], targets['3_day_single'])
targets['7_day_single'] = np.where(targets['7_day'] == float('inf'), targets['7_day_single_alpha'], targets['7_day_single'])
targets['15_day_single'] = np.where(targets['15_day'] == float('inf'), targets['15_day_single_alpha'], targets['15_day_single'])
targets['30_day_single'] = np.where(targets['30_day'] == float('inf'), targets['30_day_single_alpha'], targets['30_day_single'])
targets['3_day'] = np.where(targets['3_day'] == float('inf'), targets['3_day_alpha'], targets['3_day'])
targets['7_day'] = np.where(targets['7_day'] == float('inf'), targets['7_day_alpha'], targets['7_day'])
targets['15_day'] = np.where(targets['15_day'] == float('inf'), targets['15_day_alpha'], targets['15_day'])
targets['30_day'] = np.where(targets['30_day'] == float('inf'), targets['30_day_alpha'], targets['30_day'])

targets['3_day_single'] = np.where(targets['3_day'] == 0, targets['3_day_single_alpha'], targets['3_day_single'])
targets['7_day_single'] = np.where(targets['7_day'] == 0, targets['7_day_single_alpha'], targets['7_day_single'])
targets['15_day_single'] = np.where(targets['15_day'] == 0, targets['15_day_single_alpha'], targets['15_day_single'])
targets['30_day_single'] = np.where(targets['30_day'] == 0, targets['30_day_single_alpha'], targets['30_day_single'])
targets['3_day'] = np.where(targets['3_day'] == 0, targets['3_day_alpha'], targets['3_day'])
targets['7_day'] = np.where(targets['7_day'] == 0, targets['7_day_alpha'], targets['7_day'])
targets['15_day'] = np.where(targets['15_day'] == 0, targets['15_day_alpha'], targets['15_day'])
targets['30_day'] = np.where(targets['30_day'] == 0, targets['30_day_alpha'], targets['30_day'])

print('Number of rows without data-', len(targets[((targets['3_day'] == 0) & 
                                                    (targets['7_day'] == 0) & 
                                                    (targets['15_day'] == 0) & 
                                                    (targets['30_day'] == 0))]))
print('----------------------------------------------------------------------')

# delete?
# targets = targets[~((targets['3_day'] == 0) & 
#                 (targets['7_day'] == 0) & 
#                 (targets['15_day'] == 0) & 
#                 (targets['30_day'] == 0))]
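The fallback above depends on statement order: each `*_single` column has to be updated before its n-day column is overwritten. A sketch that computes the mask once per horizon removes that dependency. `fill_from_alpha` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd

def fill_from_alpha(df, horizons=(3, 7, 15, 30)):
    """Replace Yahoo targets flagged as missing (inf or 0) with Alpha Vantage values."""
    out = df.copy()
    for n in horizons:
        main, single = f"{n}_day", f"{n}_day_single"
        # Compute the mask once, before either column is overwritten.
        missing = (out[main] == np.inf) | (out[main] == 0)
        out[single] = out[single].where(~missing, out[f"{single}_alpha"])
        out[main] = out[main].where(~missing, out[f"{main}_alpha"])
    return out

toy = pd.DataFrame({
    "3_day": [np.inf, 0.0, -4.0],
    "3_day_single": [np.inf, 0.0, -5.0],
    "3_day_alpha": [-3.0, -3.5, -4.2],
    "3_day_single_alpha": [-6.0, -6.5, -5.1],
})
filled = fill_from_alpha(toy, horizons=(3,))
print(filled[["3_day", "3_day_single"]])
```

Rows with a usable Yahoo value keep it; only the flagged rows take the Alpha Vantage value.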

In [None]:
targets = targets.drop(['3_day_alpha', '3_day_single_alpha', '7_day_alpha', '7_day_single_alpha', '15_day_alpha', 
                        '15_day_single_alpha', '30_day_alpha', '30_day_single_alpha', '3_day_diff', '7_day_diff', 
                        '15_day_diff', '30_day_diff', '3_day_pct_change', '7_day_pct_change', '15_day_pct_change', 
                        '30_day_pct_change', '3_day_single_diff', '7_day_single_diff', '15_day_single_diff', 
                        '30_day_single_diff', '3_day_single_pct_change', '7_day_single_pct_change', 
                        '15_day_single_pct_change', '30_day_single_pct_change'], axis=1)
targets = targets.sort_values(by='3_day', ascending=True)
targets.info(verbose=True)
### save ############################################
targets.to_csv('data/data_prep/targets.csv', index=False)
#####################################################

# MAEC dataset TARGETS

[paper](https://dl.acm.org/doi/10.1145/3340531.3412879)
[GitHub](https://github.com/Earnings-Call-Dataset/MAEC-A-Multimodal-Aligned-Earnings-Conference-Call-Dataset-for-Financial-Risk-Prediction/tree/master)

In [None]:
MAEC_targets = MAEC_filename_data.copy()

# Yahoo
def add_n_day(row, n_day):
    Ticker = row['Ticker']
    Date = pd.to_datetime(row['Date'])
    start = Date - pd.Timedelta(days=(1))
    end = Date + pd.Timedelta(days=(n_day))
    data = MAEC_yahoo_data.loc[start:end, f"{Ticker}_Adj Close"] 
    stock_return = data.pct_change().dropna() # ri =(Pi − Pi−1)/Pi−1
    std_dev = stock_return.std()
    if pd.isna(std_dev) or std_dev == 0:
        return 0, 0
    else:
        return np.log(std_dev), np.log(abs(stock_return.iloc[-1]) + 1e-10)

MAEC_targets['3_day'], MAEC_targets['3_day_single'] = zip(*MAEC_targets.apply(lambda row: add_n_day(row, 3), axis=1))
MAEC_targets['7_day'], MAEC_targets['7_day_single'] = zip(*MAEC_targets.apply(lambda row: add_n_day(row, 7), axis=1))
MAEC_targets['15_day'], MAEC_targets['15_day_single'] = zip(*MAEC_targets.apply(lambda row: add_n_day(row, 15), axis=1))
MAEC_targets['30_day'], MAEC_targets['30_day_single'] = zip(*MAEC_targets.apply(lambda row: add_n_day(row, 30), axis=1))
MAEC_targets.info(verbose=True)

In [None]:
# 58 tickers that did not return data from Alpha Vantage
bad_tickers = ['GPS', 'JCP', 'TUP', 'BBT', 'MDP', 'LL', 'ABC', 'PKI', 'HFC', 'HSC', 'CBB', 'ILG', 'JCOM', 'EBIX', 'ENDP', 
               'BIG', 'ASNA', 'IVC', 'BCOR', 'INT', 'FRED', 'CAMP', 'ELY', 'COG', 'CLD', 'CRY', 'PLT', 'FRAN', 'ADS', 'CPSI', 
               'FBHS', 'CHFC', 'UIHC', 'OFC', 'TMST', 'FTD', 'SWM', 'WPG', 'WLTW', 'AKRX', 'BLL', 'DF', 'TLRD', 'SPN', 'CLI', 
               'ESV', 'RCII', 'ANTM', 'RE', 'NCR', 'NEWM', 'PEI', 'LCI', 'ERA', 'ACOR', 'FB', 'AAXN', 'NLS']
def add_n_day(row, n_day):
    Ticker = row['Ticker']
    if Ticker in bad_tickers:
        return float('inf'), float('inf')
    Date = pd.to_datetime(row['Date'])
    start = Date  # note: this window has no one-day lookback, unlike the Yahoo window for MAEC
    end = Date + pd.Timedelta(days=n_day)
    # using Alpha Vantage
    data = MAEC_alpha_data.loc[start:end, Ticker]
    stock_return = data.pct_change().dropna() # ri =(Pi − Pi−1)/Pi−1
    std_dev = stock_return.std()
    if pd.isna(std_dev) or std_dev == 0:
        return 0, 0
    else:
        return np.log(std_dev), np.log(abs(stock_return.iloc[-1]) + 1e-10)

MAEC_targets['3_day_alpha'], MAEC_targets['3_day_single_alpha'] = zip(*MAEC_targets.apply(lambda row: add_n_day(row, 3), axis=1))
MAEC_targets['7_day_alpha'], MAEC_targets['7_day_single_alpha'] = zip(*MAEC_targets.apply(lambda row: add_n_day(row, 7), axis=1))
MAEC_targets['15_day_alpha'], MAEC_targets['15_day_single_alpha'] = zip(*MAEC_targets.apply(lambda row: add_n_day(row, 15), axis=1))
MAEC_targets['30_day_alpha'], MAEC_targets['30_day_single_alpha'] = zip(*MAEC_targets.apply(lambda row: add_n_day(row, 30), axis=1))

In [None]:
MAEC_targets['3_day_diff'] = MAEC_targets['3_day_alpha'] - MAEC_targets['3_day']
MAEC_targets['7_day_diff'] = MAEC_targets['7_day_alpha'] - MAEC_targets['7_day']
MAEC_targets['15_day_diff'] = MAEC_targets['15_day_alpha'] - MAEC_targets['15_day']
MAEC_targets['30_day_diff'] = MAEC_targets['30_day_alpha'] - MAEC_targets['30_day']

MAEC_targets['3_day_pct_change'] = abs(((MAEC_targets['3_day_alpha'] - MAEC_targets['3_day']) / MAEC_targets['3_day_alpha']) * 100)
MAEC_targets['7_day_pct_change'] = abs(((MAEC_targets['7_day_alpha'] - MAEC_targets['7_day']) / MAEC_targets['7_day_alpha']) * 100)
MAEC_targets['15_day_pct_change'] = abs(((MAEC_targets['15_day_alpha'] - MAEC_targets['15_day']) / MAEC_targets['15_day_alpha']) * 100)
MAEC_targets['30_day_pct_change'] = abs(((MAEC_targets['30_day_alpha'] - MAEC_targets['30_day']) / MAEC_targets['30_day_alpha']) * 100)
 
MAEC_targets['3_day_single_diff'] = MAEC_targets['3_day_single_alpha'] - MAEC_targets['3_day_single']
MAEC_targets['7_day_single_diff'] = MAEC_targets['7_day_single_alpha'] - MAEC_targets['7_day_single']
MAEC_targets['15_day_single_diff'] = MAEC_targets['15_day_single_alpha'] - MAEC_targets['15_day_single']
MAEC_targets['30_day_single_diff'] = MAEC_targets['30_day_single_alpha'] - MAEC_targets['30_day_single']

MAEC_targets['3_day_single_pct_change'] = abs(((MAEC_targets['3_day_single_alpha'] - MAEC_targets['3_day_single']) / MAEC_targets['3_day_single_alpha']) * 100)
MAEC_targets['7_day_single_pct_change'] = abs(((MAEC_targets['7_day_single_alpha'] - MAEC_targets['7_day_single']) / MAEC_targets['7_day_single_alpha']) * 100)
MAEC_targets['15_day_single_pct_change'] = abs(((MAEC_targets['15_day_single_alpha'] - MAEC_targets['15_day_single']) / MAEC_targets['15_day_single_alpha']) * 100)
MAEC_targets['30_day_single_pct_change'] = abs(((MAEC_targets['30_day_single_alpha'] - MAEC_targets['30_day_single']) / MAEC_targets['30_day_single_alpha']) * 100)
# investigate discrepancies
MAEC_targets = MAEC_targets.sort_values(by='30_day_pct_change', ascending=False)
MAEC_targets.to_csv('data/data_prep/MAEC_temp.csv', index=False)

In [None]:
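# Replace missing Yahoo targets (inf or 0) with the values computed from Alpha Vantage.
# The *_single columns are updated first because the condition reads the not-yet-overwritten n-day column.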
MAEC_targets['3_day_single'] = np.where(MAEC_targets['3_day'] == float('inf'), MAEC_targets['3_day_single_alpha'], MAEC_targets['3_day_single'])
MAEC_targets['7_day_single'] = np.where(MAEC_targets['7_day'] == float('inf'), MAEC_targets['7_day_single_alpha'], MAEC_targets['7_day_single'])
MAEC_targets['15_day_single'] = np.where(MAEC_targets['15_day'] == float('inf'), MAEC_targets['15_day_single_alpha'], MAEC_targets['15_day_single'])
MAEC_targets['30_day_single'] = np.where(MAEC_targets['30_day'] == float('inf'), MAEC_targets['30_day_single_alpha'], MAEC_targets['30_day_single'])
MAEC_targets['3_day'] = np.where(MAEC_targets['3_day'] == float('inf'), MAEC_targets['3_day_alpha'], MAEC_targets['3_day'])
MAEC_targets['7_day'] = np.where(MAEC_targets['7_day'] == float('inf'), MAEC_targets['7_day_alpha'], MAEC_targets['7_day'])
MAEC_targets['15_day'] = np.where(MAEC_targets['15_day'] == float('inf'), MAEC_targets['15_day_alpha'], MAEC_targets['15_day'])
MAEC_targets['30_day'] = np.where(MAEC_targets['30_day'] == float('inf'), MAEC_targets['30_day_alpha'], MAEC_targets['30_day'])

MAEC_targets['3_day_single'] = np.where(MAEC_targets['3_day'] == 0, MAEC_targets['3_day_single_alpha'], MAEC_targets['3_day_single'])
MAEC_targets['7_day_single'] = np.where(MAEC_targets['7_day'] == 0, MAEC_targets['7_day_single_alpha'], MAEC_targets['7_day_single'])
MAEC_targets['15_day_single'] = np.where(MAEC_targets['15_day'] == 0, MAEC_targets['15_day_single_alpha'], MAEC_targets['15_day_single'])
MAEC_targets['30_day_single'] = np.where(MAEC_targets['30_day'] == 0, MAEC_targets['30_day_single_alpha'], MAEC_targets['30_day_single'])
MAEC_targets['3_day'] = np.where(MAEC_targets['3_day'] == 0, MAEC_targets['3_day_alpha'], MAEC_targets['3_day'])
MAEC_targets['7_day'] = np.where(MAEC_targets['7_day'] == 0, MAEC_targets['7_day_alpha'], MAEC_targets['7_day'])
MAEC_targets['15_day'] = np.where(MAEC_targets['15_day'] == 0, MAEC_targets['15_day_alpha'], MAEC_targets['15_day'])
MAEC_targets['30_day'] = np.where(MAEC_targets['30_day'] == 0, MAEC_targets['30_day_alpha'], MAEC_targets['30_day'])

MAEC_targets.replace([float('inf'), -float('inf')], 0, inplace=True)
print('Number of rows without data-', len(MAEC_targets[((MAEC_targets['3_day'] == 0) & 
                                                        (MAEC_targets['7_day'] == 0) & 
                                                        (MAEC_targets['15_day'] == 0)& 
                                                        (MAEC_targets['30_day'] == 0))]))
print('----------------------------------------------------------------------')

# delete?
# MAEC_targets = MAEC_targets[~((MAEC_targets['3_day'] == 0) & 
#                 (MAEC_targets['7_day'] == 0) & 
#                 (MAEC_targets['15_day'] == 0) & 
#                 (MAEC_targets['30_day'] == 0))]

In [None]:
MAEC_targets = MAEC_targets.drop(['3_day_alpha', '3_day_single_alpha', '7_day_alpha', '7_day_single_alpha', '15_day_alpha', 
                        '15_day_single_alpha', '30_day_alpha', '30_day_single_alpha', '3_day_diff', '7_day_diff', 
                        '15_day_diff', '30_day_diff', '3_day_pct_change', '7_day_pct_change', '15_day_pct_change', 
                        '30_day_pct_change', '3_day_single_diff', '7_day_single_diff', '15_day_single_diff', 
                        '30_day_single_diff', '3_day_single_pct_change', '7_day_single_pct_change', 
                        '15_day_single_pct_change', '30_day_single_pct_change'], axis=1)
MAEC_targets = MAEC_targets.sort_values(by='7_day', ascending=True)
MAEC_targets.info(verbose=True)
### save ############################################
MAEC_targets.to_csv('data/data_prep/MAEC_targets.csv', index=False)
#####################################################

# Clean Praat features

- Replace '--undefined--' data
- Convert to numeric float64
- Compare our Praat data (MAEC), which was calculated from MP3 files, to the Praat data provided with MAEC on [GitHub](https://github.com/Earnings-Call-Dataset/MAEC-A-Multimodal-Aligned-Earnings-Conference-Call-Dataset-for-Financial-Risk-Prediction/tree/master)
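The first two steps can be sketched on a toy frame as follows. This variant uses `errors="coerce"` so that any non-numeric marker becomes NaN, which is an assumption and differs from the explicit `replace` used in the cells of this section; `clean_praat` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

def clean_praat(df, id_cols):
    """Coerce Praat columns to float64 and fill gaps with each column's median."""
    out = df.copy()
    num_cols = out.columns.difference(id_cols)
    # errors="coerce" turns any non-numeric marker (e.g. "--undefined--") into NaN.
    out[num_cols] = out[num_cols].apply(pd.to_numeric, errors="coerce")
    # fillna with a Series fills each column with its own median.
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    return out

toy = pd.DataFrame({
    "Company": ["A", "A", "A"],
    "Mean pitch": ["120.5", "--undefined--", "130.5"],
    "Jitter": ["0.02", "0.04", "--undefined-- "],
})
cleaned = clean_praat(toy, id_cols=["Company"])
print(cleaned)
```

Coercion is more forgiving than listing every marker variant, at the cost of silently hiding unexpected strings, so checking the NaN count before imputing may be worthwhile.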

In [None]:
praat_features = praat_features.replace(['--undefined--', '--undefined-', '--undefined-- ',  '--'], np.nan)

# Convert all columns to float64 except the identifier columns ('Company', 'Date', 'audio_file')
cols_to_convert = praat_features.columns.difference(['Company', 'Date', 'audio_file'])
praat_features[cols_to_convert] = praat_features[cols_to_convert].apply(pd.to_numeric)
# impute median on all NULL (--undefined--)
praat_features[cols_to_convert] = praat_features[cols_to_convert].apply(
    lambda col: col.fillna(col.median()) if pd.api.types.is_numeric_dtype(col) else col
)
praat_features.info(verbose=True)

In [None]:
MAEC_praat_features = MAEC_praat_features.replace(['--undefined--', '--undefined-', '--undefined-- ',  '--'], np.nan)

# Convert all columns to float64 except the identifier columns ('Ticker', 'Date', 'audio_file')
cols_to_convert = MAEC_praat_features.columns.difference(['Ticker', 'Date', 'audio_file'])
MAEC_praat_features[cols_to_convert] = MAEC_praat_features[cols_to_convert].apply(pd.to_numeric)
# impute median on all NULL (--undefined--)
MAEC_praat_features[cols_to_convert] = MAEC_praat_features[cols_to_convert].apply(
    lambda col: col.fillna(col.median()) if pd.api.types.is_numeric_dtype(col) else col
)
MAEC_praat_features.info(verbose=True)

Compare against the Praat features provided in the MAEC GitHub repository

In [None]:
def each_row(row):
    """Load the provided features.csv for one meeting and tag it with its identifiers."""
    Ticker = row['Ticker']
    Date = row['Date'].replace('-', '') 

    features_df = pd.read_csv(os.path.join(MAEC_dir, f"{Date}_{Ticker}", "features.csv"))
    features_df['Ticker'] = Ticker
    features_df['Date'] = Date
    features_df['Sentence_num'] = range(1, len(features_df) + 1)
    return features_df

# Collect one frame per meeting, then concatenate once (avoids re-copying the growing frame).
frames = [each_row(row) for _, row in tqdm(MAEC_filename_data.iterrows(), total=len(MAEC_filename_data))]
MAEC_GitHub_features = pd.concat(frames, ignore_index=True)

MAEC_GitHub_features = MAEC_GitHub_features.replace(['--undefined--', '--undefined-', '--undefined-- ',  '--'], np.nan)
# Convert all columns into float64 except 'Ticker', 'Date', 'audio_file' 
cols_to_convert = MAEC_GitHub_features.columns.difference(['Ticker', 'Date', 'audio_file'])
MAEC_GitHub_features[cols_to_convert] = MAEC_GitHub_features[cols_to_convert].apply(pd.to_numeric)
# impute median on all NULL (--undefined--)
MAEC_GitHub_features[cols_to_convert] = MAEC_GitHub_features[cols_to_convert].apply(
    lambda col: col.fillna(col.median()) if pd.api.types.is_numeric_dtype(col) else col
)
MAEC_GitHub_features.to_csv('data/MAEC_GitHub_features.csv', index=False)
MAEC_GitHub_features.info(verbose=True)

# Compare
These are very close, which is reassuring: it suggests that the Praat features we computed for the original dataset are sound as well.
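One way to put a number on "very close" is a per-feature summary of the differences between the two tables. The sketch below assumes both frames share the key columns with matching dtypes; `column_agreement`, the column name, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def column_agreement(ours, reference, keys):
    """Per-feature mean absolute difference and Pearson r for rows matched on `keys`."""
    merged = ours.merge(reference, on=keys, suffixes=("_ours", "_ref"))
    shared = [c for c in ours.columns if c not in keys and c in reference.columns]
    rows = []
    for col in shared:
        a, b = merged[f"{col}_ours"], merged[f"{col}_ref"]
        rows.append({"feature": col,
                     "mean_abs_diff": (a - b).abs().mean(),
                     "pearson_r": a.corr(b)})
    return pd.DataFrame(rows)

keys = ["Ticker", "Date", "Sentence_num"]
ours = pd.DataFrame({"Ticker": ["XYZ"] * 3, "Date": [20170302] * 3,
                     "Sentence_num": [1, 2, 3],
                     "Mean pitch": [120.0, 130.0, 141.0]})
reference = ours.assign(**{"Mean pitch": [120.5, 129.5, 141.0]})

report = column_agreement(ours, reference, keys)
print(report)
```

Sorting the report by `mean_abs_diff` would surface the features where the MP3-derived values drift furthest from the provided ones.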

In [None]:
MAEC_praat_features

In [None]:
MAEC_GitHub_features

# Final dataset

In [None]:

praat_features['Date'] = praat_features['Date'].astype('Int64')

##################################################################
# if you want to merge GloVe features 
# Merge GloVe and Praat features (each sentence is 327 features)
# features = pd.merge(praat_features, glove_features, how="left",on = ['Company','Date','Sentence_num'])
##################################################################
# Merge RoBERTa and Praat features (each sentence is 1051 features)
features = pd.merge(praat_features, RoBERTa_features, how="left",on = ['Company','Date','Sentence_num'])
features = features.drop(['Shimmer apq11','Audio Length','audio_file'], axis=1)

# The maximum sentences per meeting is 523 
# Ensure each meeting has 523 sentences, add zeros to the end
def add_zero_padding(group):
    complete_index = pd.Index(np.arange(1, 524), name='Sentence_num')
    group = group.set_index('Sentence_num').reindex(complete_index).reset_index()
    
    # .ffill()/.bfill() replace the deprecated fillna(method=...) form
    group['Company'] = group['Company'].ffill().bfill()
    group['Date'] = group['Date'].ffill().bfill()
    
    group.fillna(0.0, inplace=True)
    return group

features = features.groupby(['Company', 'Date']).apply(add_zero_padding).reset_index(drop=True)
features.fillna(0, inplace=True)
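The padding step exists so that every meeting contributes the same number of rows and the table can later be reshaped into a 3-D array. A minimal sketch of an alternative that builds the padded array directly is shown here; `pad_to_tensor` is a hypothetical helper, it assumes `Sentence_num` is a 1-based integer, and the toy frame is made up.

```python
import numpy as np
import pandas as pd

def pad_to_tensor(df, keys, max_len, order_col="Sentence_num"):
    """Return (meeting_keys, array) with array shaped (n_meetings, max_len, n_features)."""
    feature_cols = [c for c in df.columns if c not in keys + [order_col]]
    meeting_keys, blocks = [], []
    for key, group in df.groupby(keys, sort=True):
        block = np.zeros((max_len, len(feature_cols)))  # zero padding by default
        idx = group[order_col].to_numpy() - 1            # convert to 0-based row positions
        block[idx] = np.nan_to_num(group[feature_cols].to_numpy(dtype=float))
        meeting_keys.append(key)
        blocks.append(block)
    return meeting_keys, np.stack(blocks)

toy = pd.DataFrame({
    "Company": ["A", "A", "B"],
    "Date": [20170101, 20170101, 20170202],
    "Sentence_num": [1, 2, 1],
    "f1": [1.0, 2.0, 3.0],
    "f2": [0.1, 0.2, 0.3],
})
keys_out, tensor = pad_to_tensor(toy, ["Company", "Date"], max_len=4)
print(tensor.shape)
```

Returning the meeting keys alongside the array keeps the row order explicit, which helps when the labels are sorted separately.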

In [None]:
# Merge the targets 
# (will be duplicated 523 times, for each meeting, but that's okay for now)
# we need the date and company columns to sort
targets['Date'] = targets['Date'].astype(str).str.replace('-', '').astype('Int64')
original_dataset = pd.merge(features, targets, how="left",on = ['Company','Date'])

# Fill in null
rows_with_nulls = original_dataset[original_dataset.isnull().any(axis=1)]
# print(rows_with_nulls)
original_dataset.fillna(0, inplace=True)

# dump rows without targets
original_dataset = original_dataset.drop(['Ticker'], axis=1)
original_dataset = original_dataset[~((original_dataset['3_day'] == 0) & 
                                    (original_dataset['7_day'] == 0) & 
                                    (original_dataset['15_day'] == 0) & 
                                    (original_dataset['30_day'] == 0))]

original_dataset = original_dataset.sort_values(by=['Date', 'Company', 'Sentence_num'], ascending=[True, True, True])
TEXT_emb = original_dataset.drop(['Company','Date','Sentence_num','3_day','3_day_single','7_day','7_day_single',
                                 '15_day','15_day_single','30_day','30_day_single'], axis=1).to_numpy()
# for HTML
# FEATURES: reshape the NumPy array to (565 meetings, 523 sentences (with padding), features per sentence)
# GloVe & Praat (327 features per sentence):
# TEXT_emb = TEXT_emb.reshape(565, 523, 327)
# RoBERTa & Praat (1051 features per sentence):
TEXT_emb = TEXT_emb.reshape(565, 523, 1051)

# targets to numpy & dump rows without targets
targets = targets[~((targets['3_day'] == 0) & 
                    (targets['7_day'] == 0) & 
                    (targets['15_day'] == 0) & 
                    (targets['30_day'] == 0))]

targets = targets.copy().sort_values(by=['Date', 'Company'], ascending=[True, True])
LABEL_emb_b = targets['3_day_single'].to_numpy() # (565,) one label per meeting
LABEL_emb = targets['3_day'].to_numpy() # (565,) one label per meeting

np.save('data/TEXT_emb.npy', TEXT_emb)
np.save('data/LABEL_emb.npy', LABEL_emb)
np.save('data/LABEL_emb_b.npy', LABEL_emb_b)

# original_dataset = original_dataset.groupby(['Company', 'Date'])
original_dataset.info(verbose=True)
###### save ############################################
original_dataset.to_csv('data/original_dataset.csv', index=False)
########################################################
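Because the features and the labels are filtered and sorted in separate steps, a quick consistency check on the arrays before training can catch a mismatch early. The following is a sketch only; `check_alignment` is a hypothetical helper, and small random arrays stand in for the saved files.

```python
import numpy as np

def check_alignment(text_emb, label_emb, label_emb_b):
    """Return the number of meetings, or raise ValueError if the arrays disagree."""
    if text_emb.ndim != 3:
        raise ValueError(f"expected a 3-D feature array, got shape {text_emb.shape}")
    n_meetings = text_emb.shape[0]
    for name, labels in (("LABEL_emb", label_emb), ("LABEL_emb_b", label_emb_b)):
        if labels.shape != (n_meetings,):
            raise ValueError(f"{name} has shape {labels.shape}, expected ({n_meetings},)")
        if not np.isfinite(labels).all():
            raise ValueError(f"{name} contains NaN or inf")
    return n_meetings

# Small random stand-ins for the saved arrays (4 meetings, 5 sentences, 6 features).
rng = np.random.default_rng(0)
toy_text = rng.normal(size=(4, 5, 6))
toy_labels = rng.normal(size=4)
toy_labels_b = rng.normal(size=4)
print(check_alignment(toy_text, toy_labels, toy_labels_b))
```

With the real files, the three arguments would come from `np.load` on the `.npy` paths written above.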

# Final MAEC dataset

In [None]:

MAEC_praat_features['Date'] = MAEC_praat_features['Date'].astype('Int64')

##################################################################
# if you want to merge GloVe features 
# Merge GloVe and Praat features (each sentence is 327 features)
# MAEC_features = pd.merge(MAEC_praat_features, MAEC_glove_features, how="left",on = ['Ticker','Date','Sentence_num'])
##################################################################
# Merge RoBERTa and Praat features (each sentence is 1051 features)
MAEC_features = pd.merge(MAEC_praat_features, MAEC_RoBERTa_features, how="left",on = ['Ticker','Date','Sentence_num'])
MAEC_features = MAEC_features.drop(['Shimmer apq11','Audio Length','audio_file'], axis=1)

# The maximum sentences per meeting is 497
# Ensure each meeting has 497 sentences, add zeros to the end
def add_zero_padding(group):
    complete_index = pd.Index(np.arange(1, 498), name='Sentence_num')
    group = group.set_index('Sentence_num').reindex(complete_index).reset_index()
    
    # MAEC meetings are keyed by 'Ticker' (there is no 'Company' column)
    group['Ticker'] = group['Ticker'].ffill().bfill()
    group['Date'] = group['Date'].ffill().bfill()
    
    group.fillna(0.0, inplace=True)
    return group

MAEC_features = MAEC_features.groupby(['Ticker', 'Date']).apply(add_zero_padding).reset_index(drop=True)
MAEC_features.fillna(0, inplace=True)

In [None]:
# Merge the targets 
# (will be duplicated 497 times, for each meeting, but that's okay for now)
# we need the date and Ticker columns to sort
MAEC_targets['Date'] = MAEC_targets['Date'].astype(str).str.replace('-', '').astype('Int64')
MAEC_dataset = pd.merge(MAEC_features, MAEC_targets, how="left",on = ['Ticker','Date'])

# Fill in null
rows_with_nulls = MAEC_dataset[MAEC_dataset.isnull().any(axis=1)]
# print(rows_with_nulls)
MAEC_dataset.fillna(0, inplace=True)

# dump rows without targets ('Ticker' is kept here because the sort on it comes later in this cell)
MAEC_dataset = MAEC_dataset[~((MAEC_dataset['3_day'] == 0) & 
                                    (MAEC_dataset['7_day'] == 0) & 
                                    (MAEC_dataset['15_day'] == 0) & 
                                    (MAEC_dataset['30_day'] == 0))]

MAEC_dataset = MAEC_dataset.sort_values(by=['Date', 'Ticker', 'Sentence_num'], ascending=[True, True, True])
TEXT_emb = MAEC_dataset.drop(['Ticker','Date','Sentence_num','3_day','3_day_single','7_day','7_day_single',
                                 '15_day','15_day_single','30_day','30_day_single'], axis=1).to_numpy()
# for HTML
# FEATURES: reshape the NumPy array to (n meetings, 497 sentences (with padding), 1051 features)
# -1 lets NumPy infer the number of MAEC meetings instead of reusing the original dataset's 565
TEXT_emb = TEXT_emb.reshape(-1, 497, 1051)

# targets to numpy & dump rows without targets
MAEC_targets = MAEC_targets[~((MAEC_targets['3_day'] == 0) & 
                            (MAEC_targets['7_day'] == 0) & 
                            (MAEC_targets['15_day'] == 0) & 
                            (MAEC_targets['30_day'] == 0))]

MAEC_targets = MAEC_targets.copy().sort_values(by=['Date', 'Ticker'], ascending=[True, True])
LABEL_emb_b = MAEC_targets['3_day_single'].to_numpy() # (n_meetings,) one label per meeting
LABEL_emb = MAEC_targets['3_day'].to_numpy() # (n_meetings,) one label per meeting

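# the arrays in this cell are named TEXT_emb, LABEL_emb and LABEL_emb_b; only the file names carry the MAEC_ prefix
np.save('data/MAEC_TEXT_emb.npy', TEXT_emb)
np.save('data/MAEC_LABEL_emb.npy', LABEL_emb)
np.save('data/MAEC_LABEL_emb_b.npy', LABEL_emb_b)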

# MAEC_dataset = MAEC_dataset.groupby(['Company', 'Date'])
MAEC_dataset.info(verbose=True)
###### save ############################################
MAEC_dataset.to_csv('data/MAEC_dataset.csv', index=False)
########################################################