# Earnings Call Project: Data Cleaning
<br>
CIS 831 Deep Learning – Term Project<br>
Kansas State University
<br><br>
James Chapman<br>
John Woods<br>
Nathan Diehl<br>
<br>

## This notebook uses the target data from the [GitHub repository](https://github.com/hankniu01/KeFVP/tree/main) for the paper [KeFVP: Knowledge-enhanced Financial Volatility Prediction](https://aclanthology.org/2023.findings-emnlp.770.pdf)

This notebook creates the data used for training and testing. It:
- Investigates the KeFVP targets and compares them to ours
- Grabs the targets from KeFVP
- Corrects the Praat features for both datasets
- Combines audio features (Praat), text features (6 different sets), and targets
- Saves NumPy files specifically for HTML (for each audio/text pair), 9 per prediction horizon (3, 7, 15, and 30 days; the feature arrays are shared across horizons)
    - train (features, targets, secondary_targets)
    - validation (features, targets, secondary_targets)
    - test (features, targets, secondary_targets)
- KeFVP also dropped about the same number of meetings as we did, but using their targets lets us compare our results to KeFVP directly

- 12 meetings were removed from our data because we could not find stock data (Yahoo or Alpha Vantage)
- 218 meetings were removed from MAEC
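
As a rough illustration of how the saved files fit together, the sketch below loads one split for one horizon and checks that features and targets agree on the number of meetings. The helper `load_split` and the toy directory are assumptions made for this example; only the file-name pattern comes from the save step in this notebook.

```python
import os
import tempfile

import numpy as np


def load_split(directory, split, horizon):
    """Load features, main targets and secondary targets for one split.

    Hypothetical helper: `split` is 'train', 'val' or 'test' and `horizon`
    is 3, 7, 15 or 30, following the file names written by this notebook.
    """
    features = np.load(os.path.join(directory, f"{split}_features.npy"))
    targets = np.load(os.path.join(directory, f"{split}_targets_{horizon}.npy"))
    secondary = np.load(
        os.path.join(directory, f"{split}_secondary_targets_{horizon}.npy")
    )
    # The first axis of `features` indexes meetings, so all three must agree.
    if not (len(features) == len(targets) == len(secondary)):
        raise ValueError("features and targets disagree on the number of meetings")
    return features, targets, secondary


# Toy demonstration with made-up shapes (4 meetings, 5 sentences, 3 features).
demo_dir = tempfile.mkdtemp()
np.save(os.path.join(demo_dir, "train_features.npy"), np.zeros((4, 5, 3)))
np.save(os.path.join(demo_dir, "train_targets_3.npy"), np.zeros(4))
np.save(os.path.join(demo_dir, "train_secondary_targets_3.npy"), np.zeros(4))

X, y, y_aux = load_split(demo_dir, "train", 3)
print(X.shape, y.shape, y_aux.shape)
```

The length check is cheap and catches the case where a split's features and targets were built from different frames.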


In [2]:
import pandas as pd
import numpy as np
import os
from datetime import datetime
from tqdm import tqdm

tqdm.pandas()

In [3]:
MAEC_dir = 'data/MAEC/MAEC_Dataset' # https://github.com/Earnings-Call-Dataset/MAEC-A-Multimodal-Aligned-Earnings-Conference-Call-Dataset-for-Financial-Risk-Prediction

############# too big for GitHub ########################
############# stored on local disk ######################
original_data_dir = r"D:\original_dataset" # https://github.com/GeminiLn/EarningsCall_Dataset 
MAEC_audio_dir = r"D:\MAEC_audio" 

# from Feature Engineering
glove_features = pd.read_csv('data/data_prep/glove_features.csv')
praat_features = pd.read_csv('data/data_prep/praat_features.csv', low_memory=False)
MAEC_glove_features = pd.read_csv('data/data_prep/MAEC_glove_features.csv')
MAEC_praat_features = pd.read_csv('data/data_prep/MAEC_praat_features.csv', low_memory=False)

# from Feature Engineering MORE
RoBERTa_features = pd.read_csv('data/data_prep/RoBERTa_features.csv', low_memory=False)
MAEC_RoBERTa_features = pd.read_csv('data/data_prep/MAEC_RoBERTa_features.csv', low_memory=False)
# Roberta with averages
RoBERTa_features2 = pd.read_csv('data/data_prep/RoBERTa_features2.csv', low_memory=False)
MAEC_RoBERTa_features2 = pd.read_csv('data/data_prep/MAEC_RoBERTa_features2.csv', low_memory=False)
# Sentence Transformers
investopedia_features = pd.read_csv('data/data_prep/investopedia_features.csv', low_memory=False)
MAEC_investopedia_features = pd.read_csv('data/data_prep/MAEC_investopedia_features.csv', low_memory=False)
bge_features = pd.read_csv('data/data_prep/bge_features.csv', low_memory=False)
MAEC_bge_features = pd.read_csv('data/data_prep/MAEC_bge_features.csv', low_memory=False)
bge_base_features = pd.read_csv('data/data_prep/bge_base_features.csv', low_memory=False)
MAEC_bge_base_features = pd.read_csv('data/data_prep/MAEC_bge_base_features.csv', low_memory=False)

In [4]:
# Loop through the directory; each folder represents an earnings conference call and is named "CompanyName_Date".
filename_data = []
for filename in os.listdir(original_data_dir):
    company_name, date_str = filename.rsplit('_', 1)
    date_str = date_str.split('.')[0] 
    date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    filename_data.append([company_name, date])
filename_data = pd.DataFrame(filename_data, columns=["Company", "Date"])
company_ticker = pd.read_csv('data/data_prep/company_ticker.csv')
filename_data = filename_data.merge(company_ticker, on="Company", how="left")

# Loop through the directory; each folder represents an earnings conference call and is named "Date_Ticker".
MAEC_filename_data = []
for filename in os.listdir(MAEC_dir):
    date_str, ticker = filename.rsplit('_', 1)
    date_str = date_str.split('.')[0] 
    date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    MAEC_filename_data.append([ticker, date])
MAEC_filename_data = pd.DataFrame(MAEC_filename_data, columns=["Ticker", "Date"])

# Add TARGET of the regression

**n-day volatility predictions**: The predicted average volatility over the following n days.<br>

$$
v[0,n] = \ln \left( \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (r_i - \bar{r})^2 } \right)
$$

Where:
- \( r_i \) is the stock return on day \(i\),
- \( \bar{r} \) is the average stock return over \(n\) days.

The stock return \(r_i\) is defined as:

$$
r_i = \frac{P_i - P_{i-1}}{P_{i-1}}
$$

Where \(P_i\) is the adjusted closing price of the stock on day \(i\).

For **single-day log volatility**, we estimate it using the **daily log absolute return**:

$$
v_n = \ln \left( \left| \frac{P_n - P_{n-1}}{P_{n-1}} \right| \right)
$$

Where:
- \(P_n\) is the adjusted closing price of the stock on day \(n\),
- \(P_{n-1}\) is the adjusted closing price on the previous day.

Our multi-task learning objective is to **simultaneously predict** these two quantities:
- \(v[0,n]\): The average volatility over \(n\) days (the main task).
- \(v_n\): The single-day volatility (the auxiliary task).
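
The two formulas above can be sketched in NumPy as follows. This is only an illustration of the definitions, not the code KeFVP used to produce its target files; the function names and the price series are made up, and day 0 is assumed to be the call date.

```python
import numpy as np


def simple_returns(prices):
    """r_i = (P_i - P_{i-1}) / P_{i-1} for a 1-D sequence of adjusted closes."""
    prices = np.asarray(prices, dtype=float)
    return (prices[1:] - prices[:-1]) / prices[:-1]


def n_day_log_volatility(prices, n):
    """v[0,n]: log of the standard deviation of the n returns after day 0.

    prices[0] is the adjusted close on day 0, so n returns need n + 1 prices.
    """
    returns = simple_returns(prices[: n + 1])
    # np.std divides by n (ddof=0), matching the 1/n in the formula.
    return float(np.log(np.std(returns)))


def single_day_log_volatility(prices, n):
    """v_n: log of the absolute return on day n."""
    return float(np.log(abs((prices[n] - prices[n - 1]) / prices[n - 1])))


# Made-up prices for illustration only.
prices = [100.0, 102.0, 101.0, 103.0]
print(n_day_log_volatility(prices, 3))
print(single_day_log_volatility(prices, 3))
```

Note that a day with an unchanged price has a zero return, so its single-day log volatility is negative infinity; any real pipeline would need a policy for that case.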


In [12]:

def get_KeFVP(n_days_filename, single_day_filename):
    n_days_data = pd.read_csv(f'data/KeFVP_price_data/{n_days_filename}.csv')
    n_days_data['Date'] = pd.to_datetime(n_days_data[['year', 'month', 'day']])
    n_days_data.drop_duplicates(subset=['Date', 'ticker'], inplace=True)
    # drop this row specifically; there is no transcript/audio for it
    n_days_data = n_days_data[~((n_days_data['Date'] == '2017-10-31') & 
                                        (n_days_data['name'] == 'American Tower Corp A'))]
    # single day volatility
    single_day_data = pd.read_csv(f'data/KeFVP_price_data/{single_day_filename}.csv')
    single_day_data['Date'] = pd.to_datetime(single_day_data[['year', 'month', 'day']])
    single_day_data.drop_duplicates(subset=['Date', 'ticker'], inplace=True)
    single_day_data = single_day_data[['future_Single_3', 'future_Single_7', 'future_Single_15', 
                                       'future_Single_30', 'Date', 'ticker', 'name']].copy()# , 'name'
    # merge
    target_data = n_days_data[['future_3', 'future_7', 'future_15', 'future_30', 'Date', 'ticker', 'name']].copy()# , 'name'
    target_data = pd.merge(target_data, single_day_data, how="left",on = ['Date','ticker', 'name'])
    target_data = target_data.rename(columns={'ticker':'Ticker', 
                                              'name':'Company',
                                            'future_3':'3_day',
                                            'future_7':'7_day',
                                            'future_15':'15_day',
                                            'future_30':'30_day',
                                            'future_Single_3':'3_day_single',
                                            'future_Single_7':'7_day_single',
                                            'future_Single_15':'15_day_single',
                                            'future_Single_30':'30_day_single'})
    print('should all be the same,- ', len(n_days_data), len(single_day_data), len(target_data))
    return target_data


Investigate the KeFVP target data and compare it to our data from Yahoo/Alpha Vantage.
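
One way to make this comparison explicit is a merge with `indicator=True`, which labels each row by the side it came from. The sketch below uses two small made-up frames as stand-ins for the KeFVP and original company/ticker tables; the column suffixes are chosen for this example only.

```python
import pandas as pd

# Toy stand-ins for the two company/ticker tables (values are made up).
kefvp = pd.DataFrame({"Company": ["A Corp", "B Inc"], "Ticker": ["AAA", "OLD"]})
ours = pd.DataFrame(
    {"Company": ["A Corp", "B Inc", "C Ltd"], "Ticker": ["AAA", "NEW", "CCC"]}
)

combo = ours.merge(
    kefvp, on="Company", how="left", suffixes=("_ours", "_kefvp"), indicator=True
)

# Companies we have that the other table does not.
missing = combo.loc[combo["_merge"] == "left_only", "Company"].tolist()

# Companies present in both whose ticker symbols differ (e.g. after a rename).
both = combo[combo["_merge"] == "both"]
renamed = both.loc[both["Ticker_ours"] != both["Ticker_kefvp"], "Company"].tolist()

print(missing)
print(renamed)
```

Separating "missing" from "renamed" avoids relying on NaN comparisons, which always evaluate as unequal and would otherwise mix the two cases together.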

In [13]:

train_n_days_filename = 'train_split_Avg_Series_WITH_LOG'
train_single_day_filename = 'train_split_SeriesSingleDayVol3'
val_n_days_filename = 'val_split_Avg_Series_WITH_LOG'
val_single_day_filename = 'val_split_SeriesSingleDayVol3'
test_n_days_filename = 'test_split_Avg_Series_WITH_LOG'
test_single_day_filename = 'test_split_SeriesSingleDayVol3'

train_targets = get_KeFVP(train_n_days_filename, train_single_day_filename)
val_targets = get_KeFVP(val_n_days_filename, val_single_day_filename)
test_targets = get_KeFVP(test_n_days_filename, test_single_day_filename)

# merge all  train/validate/test
EC_dataset = pd.concat([train_targets, val_targets, test_targets], ignore_index=True)

# grab all unique ticker/company
KeFVP_ticker_Company = EC_dataset[['Ticker','Company']].copy()
KeFVP_ticker_Company['Ticker'] = KeFVP_ticker_Company['Ticker'].str.upper()
KeFVP_ticker_Company.drop_duplicates(inplace=True)
print(len(KeFVP_ticker_Company))

# original unique ticker/company
company_ticker = pd.read_csv('data/data_prep/company_ticker.csv')
print(len(company_ticker))
combo = pd.merge(KeFVP_ticker_Company, company_ticker, how="right",on = ['Company'])
print(len(combo))
print(len(combo[combo['Ticker_y']!=combo['Ticker_x']]))
problems = combo[combo['Ticker_y']!=combo['Ticker_x']].copy()
problems.dropna(inplace=True)
problems

should all be the same,-  391 391 391
should all be the same,-  56 56 56
should all be the same,-  111 112 111
272
280
280
21


Unnamed: 0,Ticker_x,Company,Ticker_y
18,ADS,Alliance Data Systems,BFH
24,ABC,AmerisourceBergen Corp,COR
27,ANTM,Anthem Inc.,ELV
35,BLL,Ball Corp,BALL
54,CTL,CenturyLink Inc,LUMN
58,XEC,Cimarex Energy,CTRA
95,FB,"Facebook, Inc.",META
103,FBHS,Fortune Brands Home & Security,FBIN
106,GPS,Gap Inc.,GAP
177,MYL,Mylan N.V.,VTRS


In [14]:
# companies that are not included in the KeFVP  data
combo[combo.isnull().any(axis=1)]

Unnamed: 0,Ticker_x,Company,Ticker_y
110,,General Growth Properties Inc.,GGP
120,,Harris Corporation,LHX
135,,Ingersoll-Rand PLC,IR
160,,Lowe's Cos.,LOW
170,,Michael Kors Holdings,CPRI
222,,SCANA Corp,SCG
235,,Symantec Corp.,GEN
264,,Viacom Inc.,VIACOM


In [15]:
# COMPANY matches between the 2 data sets
# this is good because the tickers are different sometimes (mergers/name changes)
print(KeFVP_ticker_Company[~KeFVP_ticker_Company['Company'].isin(company_ticker.Company.unique())])
company_ticker[~company_ticker['Company'].isin(KeFVP_ticker_Company.Company.unique())]

Empty DataFrame
Columns: [Ticker, Company]
Index: []


Unnamed: 0,Company,Ticker
110,General Growth Properties Inc.,GGP
120,Harris Corporation,LHX
135,Ingersoll-Rand PLC,IR
160,Lowe's Cos.,LOW
170,Michael Kors Holdings,CPRI
222,SCANA Corp,SCG
235,Symantec Corp.,GEN
264,Viacom Inc.,VIACOM


# Clean Praat features

- Replace '--undefined--' data
- Convert to numeric float64
- Compare our Praat data (MAEC), which was calculated from MP3 files, to the provided Praat data (MAEC) [GitHub](https://github.com/Earnings-Call-Dataset/MAEC-A-Multimodal-Aligned-Earnings-Conference-Call-Dataset-for-Financial-Risk-Prediction/tree/master)
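
The first two steps can be sketched on a toy frame as shown here. The helper `clean_praat` is hypothetical and differs from the cells in this notebook in one respect: it uses `errors="coerce"`, so any non-numeric string becomes NaN rather than only the listed placeholder spellings. The values are made up.

```python
import pandas as pd


def clean_praat(df, id_cols):
    """Return a copy with Praat placeholders turned into NaN and medians imputed."""
    df = df.copy()
    value_cols = df.columns.difference(id_cols)
    # errors="coerce" turns any non-numeric string (including every
    # '--undefined--' variant) into NaN instead of raising.
    df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")
    # Fill each column's missing values with that column's median.
    df[value_cols] = df[value_cols].fillna(df[value_cols].median())
    return df


# Made-up values for illustration.
raw = pd.DataFrame(
    {
        "Company": ["A", "A", "A"],
        "Mean pitch": ["100.0", "--undefined--", "300.0"],
        "Mean intensity": [60.0, 62.0, 64.0],
    }
)
cleaned = clean_praat(raw, id_cols=["Company"])
print(cleaned)
```

Coercion is more forgiving than an explicit replacement list, but it also hides unexpected strings silently, so counting the NaNs before imputing is a sensible extra check.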

In [16]:
praat_features = praat_features.replace(['--undefined--', '--undefined-', '--undefined-- ',  '--'], np.nan)

# Convert all columns to numeric except the identifier columns (Company, Date, audio_file)
cols_to_convert = praat_features.columns.difference(['Company', 'Date', 'audio_file'])
praat_features[cols_to_convert] = praat_features[cols_to_convert].progress_apply(pd.to_numeric)
# impute median on all NULL (--undefined--)
praat_features[cols_to_convert] = praat_features[cols_to_convert].progress_apply(
    lambda col: col.fillna(col.median()) if pd.api.types.is_numeric_dtype(col) else col
)
praat_features.info(verbose=True)

100%|██████████| 30/30 [00:00<00:00, 70.14it/s]
100%|██████████| 30/30 [00:00<00:00, 728.24it/s]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89722 entries, 0 to 89721
Data columns (total 33 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Mean pitch                    89722 non-null  float64
 1   Standard deviation            89722 non-null  float64
 2   Minimum pitch                 89722 non-null  float64
 3   Maximum pitch                 89722 non-null  float64
 4   Number of pulses              89722 non-null  float64
 5   Number of periods             89722 non-null  float64
 6   Mean period                   89722 non-null  float64
 7   Mean intensity                89722 non-null  float64
 8   Minimum intensity             89722 non-null  float64
 9   Maximum intensity             89722 non-null  float64
 10  Standard deviation of period  89722 non-null  float64
 11  Fraction of unvoiced          89722 non-null  float64
 12  Number of voice breaks        89722 non-null  float64
 13  D




# Final datasets


In [17]:
max_sent_len = 523

# The maximum sentences per meeting is 523 
# Ensure each meeting has 523 sentences, add zeros to the end
def add_zero_padding(group):
    complete_index = pd.Index(np.arange(1, (max_sent_len +1)), name='Sentence_num')
    group = group.set_index('Sentence_num').reindex(complete_index).reset_index()
    group['Date'] = group['Date'].ffill().bfill()
    group['Company'] = group['Company'].ffill().bfill()
    group.fillna(0.0, inplace=True)
    return group

def combine_audio_text_targets(audio_features, text_features, num_features, save_directory, 
                               train_n_days_filename, train_single_day_filename, val_n_days_filename, 
                               val_single_day_filename, test_n_days_filename, test_single_day_filename ):
    
    features = pd.merge(audio_features, text_features, how="left",on = ['Company','Date','Sentence_num'])
    features = features.drop(['Shimmer apq11','Audio Length','audio_file'], axis=1)
    features = features.groupby(['Company', 'Date'], group_keys=False).progress_apply(add_zero_padding).reset_index(drop=True)
    features.fillna(0, inplace=True)

    # match the targets to the features
    # (target values will be duplicated 523 times, for each meeting, but that's okay for now)
    # we need the date and company columns to sort
    train_targets = get_KeFVP(train_n_days_filename, train_single_day_filename)
    val_targets = get_KeFVP(val_n_days_filename, val_single_day_filename)
    test_targets = get_KeFVP(test_n_days_filename, test_single_day_filename)

    train_targets['Date'] = train_targets['Date'].astype(str).str.replace('-', '').astype('Int64')
    val_targets['Date'] = val_targets['Date'].astype(str).str.replace('-', '').astype('Int64')
    test_targets['Date'] = test_targets['Date'].astype(str).str.replace('-', '').astype('Int64')

    train_dataset = pd.merge(train_targets, features, how="left",on = ['Company','Date'])
    val_dataset = pd.merge(val_targets, features, how="left",on = ['Company','Date'])
    test_dataset = pd.merge(test_targets, features, how="left",on = ['Company','Date'])

    train_dataset.fillna(0, inplace=True)
    val_dataset.fillna(0, inplace=True)
    test_dataset.fillna(0, inplace=True)

    train_dataset = train_dataset.sort_values(by=['Date', 'Company', 'Sentence_num'], ascending=[True, True, True])
    val_dataset = val_dataset.sort_values(by=['Date', 'Company', 'Sentence_num'], ascending=[True, True, True])
    test_dataset = test_dataset.sort_values(by=['Date', 'Company', 'Sentence_num'], ascending=[True, True, True])

    train_dataset['Date'] = pd.to_datetime(train_dataset['Date'].astype(str), format='%Y%m%d')
    val_dataset['Date'] = pd.to_datetime(val_dataset['Date'].astype(str), format='%Y%m%d')
    test_dataset['Date'] = pd.to_datetime(test_dataset['Date'].astype(str), format='%Y%m%d')
    
    train_features = train_dataset.drop(['Ticker','Company','Date','Sentence_num','3_day','3_day_single','7_day',
                                         '7_day_single', '15_day','15_day_single','30_day','30_day_single'], axis=1).to_numpy()
    val_features = val_dataset.drop(['Ticker','Company','Date','Sentence_num','3_day','3_day_single','7_day',
                                      '7_day_single', '15_day','15_day_single','30_day','30_day_single'], axis=1).to_numpy()
    test_features = test_dataset.drop(['Ticker','Company','Date','Sentence_num','3_day','3_day_single','7_day',
                                       '7_day_single', '15_day','15_day_single','30_day','30_day_single'], axis=1).to_numpy()
    
    # Keep Date and Company while de-duplicating; different meetings can share identical target values.
    target_cols = ['3_day','3_day_single','7_day','7_day_single', '15_day','15_day_single','30_day','30_day_single']
    train_targets = train_dataset[['Date', 'Company'] + target_cols].drop_duplicates()[target_cols].copy()
    val_targets = val_dataset[['Date', 'Company'] + target_cols].drop_duplicates()[target_cols].copy()
    test_targets = test_dataset[['Date', 'Company'] + target_cols].drop_duplicates()[target_cols].copy()

    #############################  save  ######################################################################
    if not os.path.exists(f'data/{save_directory}'):
        os.makedirs(f'data/{save_directory}')

    # FEATURES- Reshape the NumPy array to have dimensions ( # meetings, 523 sentences (with padding), num_features)
    np.save(f'data/{save_directory}/train_features.npy', train_features.reshape(int(len(train_features)/max_sent_len), max_sent_len, num_features))
    np.save(f'data/{save_directory}/val_features.npy', val_features.reshape(int(len(val_features)/max_sent_len), max_sent_len, num_features))
    np.save(f'data/{save_directory}/test_features.npy', test_features.reshape(int(len(test_features)/max_sent_len), max_sent_len, num_features))
    
    np.save(f'data/{save_directory}/train_targets_3.npy', train_targets['3_day'].to_numpy())
    np.save(f'data/{save_directory}/val_targets_3.npy', val_targets['3_day'].to_numpy())
    np.save(f'data/{save_directory}/test_targets_3.npy', test_targets['3_day'].to_numpy())
    np.save(f'data/{save_directory}/train_secondary_targets_3.npy', train_targets['3_day_single'].to_numpy())
    np.save(f'data/{save_directory}/val_secondary_targets_3.npy', val_targets['3_day_single'].to_numpy())
    np.save(f'data/{save_directory}/test_secondary_targets_3.npy', test_targets['3_day_single'].to_numpy())

    np.save(f'data/{save_directory}/train_targets_7.npy', train_targets['7_day'].to_numpy())
    np.save(f'data/{save_directory}/val_targets_7.npy', val_targets['7_day'].to_numpy())
    np.save(f'data/{save_directory}/test_targets_7.npy', test_targets['7_day'].to_numpy())
    np.save(f'data/{save_directory}/train_secondary_targets_7.npy', train_targets['7_day_single'].to_numpy())
    np.save(f'data/{save_directory}/val_secondary_targets_7.npy', val_targets['7_day_single'].to_numpy())
    np.save(f'data/{save_directory}/test_secondary_targets_7.npy', test_targets['7_day_single'].to_numpy())

    np.save(f'data/{save_directory}/train_targets_15.npy', train_targets['15_day'].to_numpy())
    np.save(f'data/{save_directory}/val_targets_15.npy', val_targets['15_day'].to_numpy())
    np.save(f'data/{save_directory}/test_targets_15.npy', test_targets['15_day'].to_numpy())
    np.save(f'data/{save_directory}/train_secondary_targets_15.npy', train_targets['15_day_single'].to_numpy())
    np.save(f'data/{save_directory}/val_secondary_targets_15.npy', val_targets['15_day_single'].to_numpy())
    np.save(f'data/{save_directory}/test_secondary_targets_15.npy', test_targets['15_day_single'].to_numpy())

    np.save(f'data/{save_directory}/train_targets_30.npy', train_targets['30_day'].to_numpy())
    np.save(f'data/{save_directory}/val_targets_30.npy', val_targets['30_day'].to_numpy())
    np.save(f'data/{save_directory}/test_targets_30.npy', test_targets['30_day'].to_numpy())
    np.save(f'data/{save_directory}/train_secondary_targets_30.npy', train_targets['30_day_single'].to_numpy())
    np.save(f'data/{save_directory}/val_secondary_targets_30.npy', val_targets['30_day_single'].to_numpy())
    np.save(f'data/{save_directory}/test_secondary_targets_30.npy', test_targets['30_day_single'].to_numpy())
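
The padding and reshaping steps are easier to follow on a tiny example. The sketch below pads two made-up meetings to a common length and reshapes the flat rows into a `(meetings, sentences, features)` array; `MAX_LEN`, `pad_meeting` and the feature names are stand-ins chosen for this illustration.

```python
import numpy as np
import pandas as pd

MAX_LEN = 4  # toy stand-in for max_sent_len


def pad_meeting(group, max_len=MAX_LEN):
    """Reindex one meeting to sentences 1..max_len, zero-filling the gaps."""
    full_index = pd.Index(np.arange(1, max_len + 1), name="Sentence_num")
    return (
        group.set_index("Sentence_num")
        .reindex(full_index, fill_value=0.0)
        .reset_index()
    )


# Two made-up meetings with 2 and 3 sentences and 2 feature columns each.
meetings = {
    "A": pd.DataFrame({"Sentence_num": [1, 2], "f1": [1.0, 2.0], "f2": [3.0, 4.0]}),
    "B": pd.DataFrame(
        {"Sentence_num": [1, 2, 3], "f1": [5.0, 6.0, 7.0], "f2": [8.0, 9.0, 1.0]}
    ),
}
padded = pd.concat([pad_meeting(g) for g in meetings.values()], ignore_index=True)

flat = padded[["f1", "f2"]].to_numpy()
# -1 lets NumPy infer the number of meetings; reshape raises a ValueError
# if the rows do not divide evenly into (MAX_LEN, 2) blocks.
features = flat.reshape(-1, MAX_LEN, 2)
print(features.shape)
```

Because `reindex` keeps only the labels it is given, a meeting with sentence numbers above the maximum would be silently truncated, so the maximum has to be at least the longest meeting for no data to be lost.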
        

In [18]:
# praat_features['Date'] = praat_features['Date'].astype('Int64')

# these will be different for the MAEC data
train_n_days_filename = 'train_split_Avg_Series_WITH_LOG'
train_single_day_filename = 'train_split_SeriesSingleDayVol3'
val_n_days_filename = 'val_split_Avg_Series_WITH_LOG'
val_single_day_filename = 'val_split_SeriesSingleDayVol3'
test_n_days_filename = 'test_split_Avg_Series_WITH_LOG'
test_single_day_filename = 'test_split_SeriesSingleDayVol3'

combine_audio_text_targets(praat_features, RoBERTa_features, 1051, 'KeFVP_RoBERTa',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)


100%|██████████| 572/572 [00:06<00:00, 89.85it/s] 


should all be the same,-  391 391 391
should all be the same,-  56 56 56
should all be the same,-  111 112 111


In [19]:
# grouped_counts = test_features.groupby(['Date', 'Company']).size().reset_index(name='Occurrences')

# min_occurrences = grouped_counts[grouped_counts['Occurrences'] <500]

# min_occurrences
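
A runnable version of this kind of check could look like the sketch below, which counts rows per meeting in a DataFrame and reports the meetings whose count differs from the expected padded length. The helper name and the toy frame are assumptions for illustration; it would need a DataFrame that still carries the identifier columns, not the final NumPy arrays.

```python
import pandas as pd


def meetings_with_wrong_length(df, keys, expected_len):
    """Return the meetings whose row count differs from `expected_len`."""
    counts = df.groupby(keys).size().reset_index(name="Occurrences")
    return counts[counts["Occurrences"] != expected_len]


# Made-up frame: meeting A has 3 rows, meeting B only 2.
toy = pd.DataFrame(
    {
        "Date": ["2017-01-01"] * 3 + ["2017-01-02"] * 2,
        "Company": ["A"] * 3 + ["B"] * 2,
        "Sentence_num": [1, 2, 3, 1, 2],
    }
)
bad = meetings_with_wrong_length(toy, ["Date", "Company"], expected_len=3)
print(bad)
```

An empty result means every meeting has exactly the expected number of rows, which is the precondition for the reshape into a three-dimensional array.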


In [20]:
combine_audio_text_targets(praat_features, RoBERTa_features2, 1051, 'KeFVP_RoBERTa2',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)

100%|██████████| 572/572 [00:06<00:00, 94.43it/s] 


should all be the same,-  391 391 391
should all be the same,-  56 56 56
should all be the same,-  111 112 111


In [21]:
combine_audio_text_targets(praat_features, investopedia_features, 795, 'KeFVP_investopedia',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)

100%|██████████| 572/572 [00:04<00:00, 135.33it/s]


should all be the same,-  391 391 391
should all be the same,-  56 56 56
should all be the same,-  111 112 111


In [22]:
combine_audio_text_targets(praat_features, bge_features, 1051, 'KeFVP_bge',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)

100%|██████████| 572/572 [00:05<00:00, 97.01it/s] 


should all be the same,-  391 391 391
should all be the same,-  56 56 56
should all be the same,-  111 112 111


In [23]:
combine_audio_text_targets(praat_features, bge_base_features, 795, 'KeFVP_bge_base',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)

100%|██████████| 572/572 [00:04<00:00, 126.64it/s]


should all be the same,-  391 391 391
should all be the same,-  56 56 56
should all be the same,-  111 112 111


In [24]:
combine_audio_text_targets(praat_features, glove_features, 327, 'KeFVP_glove',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)

100%|██████████| 572/572 [00:02<00:00, 283.06it/s]


should all be the same,-  391 391 391
should all be the same,-  56 56 56
should all be the same,-  111 112 111


# Final MAEC dataset

In [5]:
MAEC_praat_features = MAEC_praat_features.replace(['--undefined--', '--undefined-', '--undefined-- ',  '--'], np.nan)

# Convert all columns to numeric except the identifier columns (Ticker, Date, audio_file)
cols_to_convert = MAEC_praat_features.columns.difference(['Ticker', 'Date', 'audio_file'])
MAEC_praat_features[cols_to_convert] = MAEC_praat_features[cols_to_convert].progress_apply(pd.to_numeric)
# impute median on all NULL (--undefined--)
MAEC_praat_features[cols_to_convert] = MAEC_praat_features[cols_to_convert].progress_apply(
    lambda col: col.fillna(col.median()) if pd.api.types.is_numeric_dtype(col) else col
)
MAEC_praat_features.info(verbose=True)

100%|██████████| 30/30 [00:01<00:00, 19.41it/s]
100%|██████████| 30/30 [00:00<00:00, 196.09it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394277 entries, 0 to 394276
Data columns (total 33 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   Mean pitch                    394277 non-null  float64
 1   Standard deviation            394277 non-null  float64
 2   Minimum pitch                 394277 non-null  float64
 3   Maximum pitch                 394277 non-null  float64
 4   Number of pulses              394277 non-null  int64  
 5   Number of periods             394277 non-null  int64  
 6   Mean period                   394277 non-null  float64
 7   Mean intensity                394277 non-null  float64
 8   Minimum intensity             394277 non-null  float64
 9   Maximum intensity             394277 non-null  float64
 10  Standard deviation of period  394277 non-null  float64
 11  Fraction of unvoiced          394277 non-null  float64
 12  Number of voice breaks        394277 non-nul

In [6]:

def get_KeFVP(n_days_filename, single_day_filename):
    n_days_data = pd.read_csv(f'data/KeFVP_price_data/{n_days_filename}.csv')
    n_days_data['Date'] = pd.to_datetime(n_days_data['time'], format='%Y-%m-%d')
    n_days_data.drop_duplicates(subset=['Date', 'ticker'], inplace=True)
    # drop this row specifically; there is no transcript/audio for it
    # n_days_data = n_days_data[~((n_days_data['Date'] == '2017-10-31') & 
    #                                     (n_days_data['name'] == 'American Tower Corp A'))]
    # single day volatility
    single_day_data = pd.read_csv(f'data/KeFVP_price_data/{single_day_filename}.csv')
    single_day_data['Date'] = pd.to_datetime(single_day_data['time'], format='%Y-%m-%d')
    single_day_data.drop_duplicates(subset=['Date', 'ticker'], inplace=True)
    single_day_data = single_day_data[['future_Single_3', 'future_Single_7', 'future_Single_15', 
                                       'future_Single_30', 'Date', 'ticker']].copy()# , 'name'
    # merge
    target_data = n_days_data[['future_3', 'future_7', 'future_15', 'future_30', 'Date', 'ticker']].copy()# , 'name'
    target_data = pd.merge(target_data, single_day_data, how="left",on = ['Date','ticker'])
    target_data = target_data.rename(columns={'ticker':'Ticker', 
                                            'future_3':'3_day',
                                            'future_7':'7_day',
                                            'future_15':'15_day',
                                            'future_30':'30_day',
                                            'future_Single_3':'3_day_single',
                                            'future_Single_7':'7_day_single',
                                            'future_Single_15':'15_day_single',
                                            'future_Single_30':'30_day_single'})
    print('should all be the same,- ', len(n_days_data), len(single_day_data), len(target_data))
    return target_data


In [7]:
# these will be different for 2016
train_n_days_filename = 'maec/15/maec15_train_avg_val'
train_single_day_filename = 'maec/15/maec15_train_single_val'
val_n_days_filename = 'maec/15/maec15_dev_avg_val'
val_single_day_filename = 'maec/15/maec15_dev_single_val'
test_n_days_filename = 'maec/15/maec15_test_avg_val'
test_single_day_filename = 'maec/15/maec15_test_single_val'

train_targets = get_KeFVP(train_n_days_filename, train_single_day_filename)
val_targets = get_KeFVP(val_n_days_filename, val_single_day_filename)
test_targets = get_KeFVP(test_n_days_filename, test_single_day_filename)

# merge all  train/validate/test
MAEC_dataset = pd.concat([train_targets, val_targets, test_targets], ignore_index=True)

# grab all unique ticker/company
KeFVP_ticker_Company = MAEC_dataset[['Ticker','Date']].copy()
KeFVP_ticker_Company['Ticker'] = KeFVP_ticker_Company['Ticker'].str.upper()
KeFVP_ticker_Company.drop_duplicates(inplace=True)
print(len(KeFVP_ticker_Company))

should all be the same,-  535 535 535
should all be the same,-  76 76 76
should all be the same,-  154 154 154
765


In [8]:

max_sent_len = 500 

# Each meeting is padded to max_sent_len (500) sentences
# Ensure each meeting has 500 sentences, add zeros to the end
def add_zero_padding(group):
    complete_index = pd.Index(np.arange(1, (max_sent_len + 1)), name='Sentence_num')
    group = group.set_index('Sentence_num').reindex(complete_index).reset_index()
    group['Date'] = group['Date'].ffill().bfill()
    group['Ticker'] = group['Ticker'].ffill().bfill()
    group.fillna(0.0, inplace=True)
    return group

def combine_audio_text_targets(audio_features, text_features, num_features, save_directory, 
                               train_n_days_filename, train_single_day_filename, val_n_days_filename, 
                               val_single_day_filename, test_n_days_filename, test_single_day_filename ):
    # combine & add_zero_padding
    features = pd.merge(audio_features, text_features, how="left",on = ['Ticker','Date','Sentence_num'])
    features = features.drop(['Shimmer apq11','Audio Length','audio_file'], axis=1)
    features = features.groupby(['Ticker', 'Date'], group_keys=False).progress_apply(add_zero_padding).reset_index(drop=True)
    features.fillna(0, inplace=True)

    # match the targets to the features
    # (target values will be duplicated 500 times, for each meeting, but that's okay for now)
    # we need the date and company columns to sort
    train_targets = get_KeFVP(train_n_days_filename, train_single_day_filename)
    val_targets = get_KeFVP(val_n_days_filename, val_single_day_filename)
    test_targets = get_KeFVP(test_n_days_filename, test_single_day_filename)

    train_targets['Date'] = train_targets['Date'].astype(str).str.replace('-', '').astype('Int64')
    val_targets['Date'] = val_targets['Date'].astype(str).str.replace('-', '').astype('Int64')
    test_targets['Date'] = test_targets['Date'].astype(str).str.replace('-', '').astype('Int64')

    train_dataset = pd.merge(train_targets, features, how="left",on = ['Ticker','Date'])
    val_dataset = pd.merge(val_targets, features, how="left",on = ['Ticker','Date'])
    test_dataset = pd.merge(test_targets, features, how="left",on = ['Ticker','Date'])

    train_dataset.fillna(0, inplace=True)
    val_dataset.fillna(0, inplace=True)
    test_dataset.fillna(0, inplace=True)

    train_dataset = train_dataset.sort_values(by=['Date', 'Ticker', 'Sentence_num'], ascending=[True, True, True])
    val_dataset = val_dataset.sort_values(by=['Date', 'Ticker', 'Sentence_num'], ascending=[True, True, True])
    test_dataset = test_dataset.sort_values(by=['Date', 'Ticker', 'Sentence_num'], ascending=[True, True, True])

    train_dataset['Date'] = pd.to_datetime(train_dataset['Date'].astype(str), format='%Y%m%d')
    val_dataset['Date'] = pd.to_datetime(val_dataset['Date'].astype(str), format='%Y%m%d')
    test_dataset['Date'] = pd.to_datetime(test_dataset['Date'].astype(str), format='%Y%m%d')

    train_features = train_dataset.drop(['Ticker','Date','Sentence_num','3_day','3_day_single','7_day',
                                         '7_day_single', '15_day','15_day_single','30_day','30_day_single'], axis=1).to_numpy()
    val_features = val_dataset.drop(['Ticker','Date','Sentence_num','3_day','3_day_single','7_day',
                                      '7_day_single', '15_day','15_day_single','30_day','30_day_single'], axis=1).to_numpy()
    test_features = test_dataset.drop(['Ticker','Date','Sentence_num','3_day','3_day_single','7_day',
                                       '7_day_single', '15_day','15_day_single','30_day','30_day_single'], axis=1).to_numpy()

    # must include date and ticker, some targets are identical
    train_targets = train_dataset[['Date', 'Ticker', '3_day','3_day_single','7_day','7_day_single', '15_day','15_day_single','30_day','30_day_single']].copy()
    val_targets = val_dataset[['Date', 'Ticker', '3_day','3_day_single','7_day','7_day_single', '15_day','15_day_single','30_day','30_day_single']].copy()
    test_targets = test_dataset[['Date', 'Ticker', '3_day','3_day_single','7_day','7_day_single', '15_day','15_day_single','30_day','30_day_single']].copy()

    train_targets.drop_duplicates(inplace=True)
    val_targets.drop_duplicates(inplace=True)
    test_targets.drop_duplicates(inplace=True)

    train_targets = train_targets.drop(['Ticker','Date'], axis=1)
    val_targets = val_targets.drop(['Ticker','Date'], axis=1)
    test_targets = test_targets.drop(['Ticker','Date'], axis=1)

    #############################  save  ######################################################################
    os.makedirs(f'data/{save_directory}', exist_ok=True)

    # FEATURES - reshape the NumPy array to (# meetings, max_sent_len sentences (with padding), num_features)
    np.save(f'data/{save_directory}/train_features.npy', train_features.reshape(int(len(train_features)/max_sent_len), max_sent_len, num_features))
    np.save(f'data/{save_directory}/val_features.npy', val_features.reshape(int(len(val_features)/max_sent_len), max_sent_len, num_features))
    np.save(f'data/{save_directory}/test_features.npy', test_features.reshape(int(len(test_features)/max_sent_len), max_sent_len, num_features))
    
    # TARGETS - one primary (n-day) and one secondary (single-day) file per split and horizon
    split_targets = {'train': train_targets, 'val': val_targets, 'test': test_targets}
    for horizon in (3, 7, 15, 30):
        for split_name, targets in split_targets.items():
            np.save(f'data/{save_directory}/{split_name}_targets_{horizon}.npy',
                    targets[f'{horizon}_day'].to_numpy())
            np.save(f'data/{save_directory}/{split_name}_secondary_targets_{horizon}.npy',
                    targets[f'{horizon}_day_single'].to_numpy())
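
The function above relies on two things lining up: the sentence rows must reshape cleanly into one block per meeting, and the deduplicated targets must keep exactly one row per meeting. A minimal sketch of both checks on toy data follows. The column names `f0` and `f1`, the tickers, and the values are made up for illustration; only the `Date`/`Ticker`/`3_day` layout mirrors the code above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 2 meetings, each padded to 3 sentences, 2 features per sentence.
max_sent_len, num_features = 3, 2
toy = pd.DataFrame({
    'Ticker': ['AAA'] * 3 + ['BBB'] * 3,
    'Date': pd.to_datetime(['2015-01-05'] * 3 + ['2015-01-06'] * 3),
    'Sentence_num': [0, 1, 2] * 2,
    'f0': np.arange(6, dtype=float),
    'f1': np.arange(6, dtype=float) * 10,
    '3_day': [-3.5] * 6,  # both meetings happen to share the same target value
})

features = toy[['f0', 'f1']].to_numpy()
# Guard against silent truncation before reshaping to (meetings, sentences, features).
assert len(features) % max_sent_len == 0, 'rows are not a multiple of max_sent_len'
features_3d = features.reshape(-1, max_sent_len, num_features)

# Deduplicating on the target alone collapses the two meetings into one row ...
targets_without_keys = toy[['3_day']].drop_duplicates()
# ... while keeping Date and Ticker during deduplication preserves one row per meeting.
targets_with_keys = (toy[['Date', 'Ticker', '3_day']]
                     .drop_duplicates()
                     .drop(['Ticker', 'Date'], axis=1))

print(features_3d.shape, len(targets_without_keys), len(targets_with_keys))
# (2, 3, 2) 1 2
```

If the number of target rows ever differs from the first dimension of the reshaped features, features and targets are no longer aligned meeting by meeting.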
        
    

In [9]:
MAEC_praat_features['Date'] = MAEC_praat_features['Date'].astype('Int64')

combine_audio_text_targets(MAEC_praat_features, MAEC_RoBERTa_features, 1051, 'KeFVP_MAEC15_RoBERTa',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)


100%|██████████| 3443/3443 [00:36<00:00, 93.33it/s] 


should all be the same,-  535 535 535
should all be the same,-  76 76 76
should all be the same,-  154 154 154


In [10]:
combine_audio_text_targets(MAEC_praat_features, MAEC_RoBERTa_features2, 1051, 'KeFVP_MAEC15_RoBERTa2',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)

100%|██████████| 3443/3443 [00:34<00:00, 100.43it/s]


should all be the same,-  535 535 535
should all be the same,-  76 76 76
should all be the same,-  154 154 154


In [11]:
combine_audio_text_targets(MAEC_praat_features, MAEC_investopedia_features, 795, 'KeFVP_MAEC15_investopedia',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)

100%|██████████| 3443/3443 [00:25<00:00, 134.82it/s]


should all be the same,-  535 535 535
should all be the same,-  76 76 76
should all be the same,-  154 154 154


In [12]:
combine_audio_text_targets(MAEC_praat_features, MAEC_bge_features, 1051, 'KeFVP_MAEC15_bge',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)

100%|██████████| 3443/3443 [00:33<00:00, 102.76it/s]


should all be the same,-  535 535 535
should all be the same,-  76 76 76
should all be the same,-  154 154 154


In [13]:
combine_audio_text_targets(MAEC_praat_features, MAEC_bge_base_features, 795, 'KeFVP_MAEC15_bge_base',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)

100%|██████████| 3443/3443 [00:25<00:00, 133.97it/s]


should all be the same,-  535 535 535
should all be the same,-  76 76 76
should all be the same,-  154 154 154


In [14]:
combine_audio_text_targets(MAEC_praat_features, MAEC_glove_features, 327, 'KeFVP_MAEC15_glove',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)

100%|██████████| 3443/3443 [00:12<00:00, 274.79it/s]


should all be the same,-  535 535 535
should all be the same,-  76 76 76
should all be the same,-  154 154 154


# 2016

In [15]:
def get_KeFVP(n_days_filename, single_day_filename):
    n_days_data = pd.read_csv(f'data/KeFVP_price_data/{n_days_filename}.csv')
    ######################
    n_days_data = n_days_data.loc[:, ~n_days_data.columns.str.contains('^Unnamed')]  # Drop unnamed columns
    n_days_data.replace([float('inf'), float('-inf')], 0, inplace=True)  # Replace infinity with 0
    ######################
    n_days_data['Date'] = pd.to_datetime(n_days_data['time'], format='%Y-%m-%d')
    n_days_data.drop_duplicates(subset=['Date', 'ticker'], inplace=True)
    # drop this row specifically; there is no transcript/audio for it
    # n_days_data = n_days_data[~((n_days_data['Date'] == '2017-10-31') & 
    #                                     (n_days_data['name'] == 'American Tower Corp A'))]
    # single day volatility
    single_day_data = pd.read_csv(f'data/KeFVP_price_data/{single_day_filename}.csv')
    ######################
    single_day_data = single_day_data.loc[:, ~single_day_data.columns.str.contains('^Unnamed')]  # Drop unnamed columns
    single_day_data.replace([float('inf'), float('-inf')], 0, inplace=True)  # Replace infinity with 0
    ######################
    single_day_data['Date'] = pd.to_datetime(single_day_data['time'], format='%Y-%m-%d')
    single_day_data.drop_duplicates(subset=['Date', 'ticker'], inplace=True)
    single_day_data = single_day_data[['future_Single_3', 'future_Single_7', 'future_Single_15', 
                                       'future_Single_30', 'Date', 'ticker']].copy()# , 'name'
    # merge
    target_data = n_days_data[['future_3', 'future_7', 'future_15', 'future_30', 'Date', 'ticker']].copy()# , 'name'
    target_data = pd.merge(target_data, single_day_data, how="left",on = ['Date','ticker'])
    target_data = target_data.rename(columns={'ticker':'Ticker', 
                                            'future_3':'3_day',
                                            'future_7':'7_day',
                                            'future_15':'15_day',
                                            'future_30':'30_day',
                                            'future_Single_3':'3_day_single',
                                            'future_Single_7':'7_day_single',
                                            'future_Single_15':'15_day_single',
                                            'future_Single_30':'30_day_single'})
    print('should all be the same,- ', len(n_days_data), len(single_day_data), len(target_data))
    return target_data
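
The printed "should all be the same" counts are a manual check that the left merge neither drops nor duplicates meetings. As a hedged alternative, pandas can enforce this during the merge itself. The sketch below uses small made-up tables in place of the KeFVP CSV files; the `validate` and `indicator` arguments are standard `pd.merge` parameters, but using them here is a suggestion, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the n-day and single-day KeFVP target tables.
n_days = pd.DataFrame({
    'Date': pd.to_datetime(['2016-02-01', '2016-02-02']),
    'ticker': ['AAA', 'BBB'],
    'future_3': [-3.1, np.inf],
})
single_day = pd.DataFrame({
    'Date': pd.to_datetime(['2016-02-01', '2016-02-02']),
    'ticker': ['AAA', 'BBB'],
    'future_Single_3': [-2.9, -4.0],
})

# Same cleaning step as in get_KeFVP: infinite values become 0.
n_days = n_days.replace([np.inf, -np.inf], 0)

# validate='one_to_one' raises pandas.errors.MergeError if a (Date, ticker) key repeats;
# indicator=True adds a '_merge' column showing which rows found a match.
merged = pd.merge(n_days, single_day, how='left', on=['Date', 'ticker'],
                  validate='one_to_one', indicator=True)
unmatched = merged[merged['_merge'] != 'both']

print(len(n_days), len(single_day), len(merged), len(unmatched))
# 2 2 2 0
```

A non-empty `unmatched` frame would point to meetings that have n-day targets but no single-day targets.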

In [16]:

train_n_days_filename = 'maec/16/maec16_train_avg_val'
train_single_day_filename = 'maec/16/maec16_train_single_val'
val_n_days_filename = 'maec/16/maec16_dev_avg_val'
val_single_day_filename = 'maec/16/maec16_dev_single_val'
test_n_days_filename = 'maec/16/maec16_test_avg_val'
test_single_day_filename = 'maec/16/maec16_test_single_val'

train_targets = get_KeFVP(train_n_days_filename, train_single_day_filename)
val_targets = get_KeFVP(val_n_days_filename, val_single_day_filename)
test_targets = get_KeFVP(test_n_days_filename, test_single_day_filename)

# merge all  train/validate/test
MAEC_dataset = pd.concat([train_targets, val_targets, test_targets], ignore_index=True)

# grab all unique ticker/date pairs (one per meeting)
KeFVP_ticker_Company = MAEC_dataset[['Ticker','Date']].copy()
KeFVP_ticker_Company['Ticker'] = KeFVP_ticker_Company['Ticker'].str.upper()
KeFVP_ticker_Company.drop_duplicates(inplace=True)
print(len(KeFVP_ticker_Company))


should all be the same,-  980 980 980
should all be the same,-  140 140 140
should all be the same,-  280 280 280
1400
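
The total of 1400 equals 980 + 140 + 280, which suggests that no ticker/date pair appears in more than one split. A small sketch of how that could be checked explicitly is shown below. The three toy frames are hypothetical stand-ins for `train_targets`, `val_targets`, and `test_targets`.

```python
import pandas as pd

# Hypothetical toy splits keyed by (Ticker, Date).
train = pd.DataFrame({'Ticker': ['aaa', 'bbb'],
                      'Date': pd.to_datetime(['2016-01-04', '2016-01-05'])})
val = pd.DataFrame({'Ticker': ['ccc'], 'Date': pd.to_datetime(['2016-01-06'])})
test = pd.DataFrame({'Ticker': ['ddd'], 'Date': pd.to_datetime(['2016-01-07'])})

combined = pd.concat([train, val, test], ignore_index=True)
combined['Ticker'] = combined['Ticker'].str.upper()

# Any (Ticker, Date) pair appearing more than once would indicate overlap between splits.
overlap = combined[combined.duplicated(subset=['Ticker', 'Date'], keep=False)]

print(len(combined), len(overlap))
# 4 0
```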


In [17]:
combine_audio_text_targets(MAEC_praat_features, MAEC_RoBERTa_features, 1051, 'KeFVP_MAEC16_RoBERTa',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)

100%|██████████| 3443/3443 [00:33<00:00, 103.47it/s]


should all be the same,-  980 980 980
should all be the same,-  140 140 140
should all be the same,-  280 280 280


In [18]:
combine_audio_text_targets(MAEC_praat_features, MAEC_RoBERTa_features2, 1051, 'KeFVP_MAEC16_RoBERTa2',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)

100%|██████████| 3443/3443 [00:33<00:00, 101.59it/s]


should all be the same,-  980 980 980
should all be the same,-  140 140 140
should all be the same,-  280 280 280


In [19]:
combine_audio_text_targets(MAEC_praat_features, MAEC_investopedia_features, 795, 'KeFVP_MAEC16_investopedia',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)

100%|██████████| 3443/3443 [00:25<00:00, 136.54it/s]


should all be the same,-  980 980 980
should all be the same,-  140 140 140
should all be the same,-  280 280 280


In [20]:
combine_audio_text_targets(MAEC_praat_features, MAEC_bge_features, 1051, 'KeFVP_MAEC16_bge',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)

100%|██████████| 3443/3443 [00:33<00:00, 104.12it/s]


should all be the same,-  980 980 980
should all be the same,-  140 140 140
should all be the same,-  280 280 280


In [21]:
combine_audio_text_targets(MAEC_praat_features, MAEC_bge_base_features, 795, 'KeFVP_MAEC16_bge_base',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)

100%|██████████| 3443/3443 [00:26<00:00, 128.79it/s]


should all be the same,-  980 980 980
should all be the same,-  140 140 140
should all be the same,-  280 280 280


In [22]:
combine_audio_text_targets(MAEC_praat_features, MAEC_glove_features, 327, 'KeFVP_MAEC16_glove',
                           train_n_days_filename, train_single_day_filename, val_n_days_filename,
                           val_single_day_filename, test_n_days_filename, test_single_day_filename 
)

100%|██████████| 3443/3443 [00:12<00:00, 269.44it/s]


should all be the same,-  980 980 980
should all be the same,-  140 140 140
should all be the same,-  280 280 280
