# Earnings Call Project: Data Cleaning
<br>
CIS 831 Deep Learning – Term Project<br>
Kansas State University
<br><br>
James Chapman<br>
John Woods<br>
Nathan Diehl<br>
<br>

## This notebook uses the target data from the [GitHub repository](https://github.com/hankniu01/KeFVP/tree/main) for the paper [KeFVP: Knowledge-enhanced Financial Volatility Prediction](https://aclanthology.org/2023.findings-emnlp.770.pdf)

This notebook creates the data used for training and testing:
- Calculates the targets for both datasets
- Corrects the Praat features for both datasets
- Combines audio features (Praat), text features (6 different feature sets), and targets
- Saves 9 NumPy files specifically for HTML (for each audio/text pair)
    - train (features, targets, secondary_targets)
    - validation (features, targets, secondary_targets)
    - test (features, targets, secondary_targets)
- Note that 12 meetings were removed because we could not find stock data (Yahoo or Alpha Vantage)<br>
- 218 meetings were removed from MAEC <br>

The remaining data from this notebook is stored in the "data" directory as the following CSVs (for each audio/text pair)
- MAEC_dataset
- MAEC_dataset

In [1]:
import sys
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    #!pip install 
    from google.colab import drive
    drive.mount('/content/gdrive')
    %cd gdrive/My Drive/831

In [2]:
import pandas as pd
import numpy as np
import os
from datetime import datetime
from tqdm import tqdm

tqdm.pandas()

In [3]:
MAEC_dir = 'data/MAEC/MAEC_Dataset' # https://github.com/Earnings-Call-Dataset/MAEC-A-Multimodal-Aligned-Earnings-Conference-Call-Dataset-for-Financial-Risk-Prediction

############# too big for GitHub ########################
############# stored on local disk ######################
original_data_dir = r"D:\original_dataset" # https://github.com/GeminiLn/EarningsCall_Dataset 
MAEC_audio_dir = r"D:\MAEC_audio" 
# The MAEC GitHub repository has a link for the audio data, but it does not work.
# I emailed the authors, and they sent another link.
# There are roughly half a million files, but they total only 19 GB.
# https://drive.google.com/file/d/1m1GRCHgKn9Vz9IFMC_SpCog6uP3-gFgY/view?usp=drive_link 

# from Webscraping
alpha_dir = 'data/data_prep/alpha_data/{}.csv' #.format(ticker) # I saved the raw Alpha Vantage data, so I don't have to download it again
yahoo_data = pd.read_csv('data/data_prep/yahoo_data.csv', index_col=0)
alpha_data = pd.read_csv('data/data_prep/alpha_data.csv', index_col=0)
MAEC_yahoo_data = pd.read_csv('data/data_prep/MAEC_yahoo_data.csv', index_col=0)
MAEC_alpha_data = pd.read_csv('data/data_prep/MAEC_alpha_data.csv', index_col=0)

yahoo_data.index = pd.to_datetime(yahoo_data.index)
alpha_data.index = pd.to_datetime(alpha_data.index)
MAEC_yahoo_data.index = pd.to_datetime(MAEC_yahoo_data.index)
MAEC_alpha_data.index = pd.to_datetime(MAEC_alpha_data.index)

# from Feature Engineering
glove_features = pd.read_csv('data/data_prep/glove_features.csv')
praat_features = pd.read_csv('data/data_prep/praat_features.csv', low_memory=False)
MAEC_glove_features = pd.read_csv('data/data_prep/MAEC_glove_features.csv')
MAEC_praat_features = pd.read_csv('data/data_prep/MAEC_praat_features.csv', low_memory=False)

# from Feature Engineering MORE
RoBERTa_features = pd.read_csv('data/data_prep/RoBERTa_features.csv', low_memory=False)
MAEC_RoBERTa_features = pd.read_csv('data/data_prep/MAEC_RoBERTa_features.csv', low_memory=False)
# Roberta with averages
RoBERTa_features2 = pd.read_csv('data/data_prep/RoBERTa_features2.csv', low_memory=False)
MAEC_RoBERTa_features2 = pd.read_csv('data/data_prep/MAEC_RoBERTa_features2.csv', low_memory=False)

investopedia_features = pd.read_csv('data/data_prep/investopedia_features.csv', low_memory=False)
MAEC_investopedia_features = pd.read_csv('data/data_prep/MAEC_investopedia_features.csv', low_memory=False)
bge_features = pd.read_csv('data/data_prep/bge_features.csv', low_memory=False)
MAEC_bge_features = pd.read_csv('data/data_prep/MAEC_bge_features.csv', low_memory=False)
bge_base_features = pd.read_csv('data/data_prep/bge_base_features.csv', low_memory=False)
MAEC_bge_base_features = pd.read_csv('data/data_prep/MAEC_bge_base_features.csv', low_memory=False)

In [4]:
# Loop through the directory, each folder represents an earnings conference call; the folders are named as "CompanyName_Date".
filename_data = []
for filename in os.listdir(original_data_dir):
    company_name, date_str = filename.rsplit('_', 1)
    date_str = date_str.split('.')[0] 
    date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    filename_data.append([company_name, date])
filename_data = pd.DataFrame(filename_data, columns=["Company", "Date"])
company_ticker = pd.read_csv('data/data_prep/company_ticker.csv')
filename_data = filename_data.merge(company_ticker, on="Company", how="left")

# Loop through the MAEC directory; each folder represents an earnings conference call and is named "Date_Ticker".
MAEC_filename_data = []
for filename in os.listdir(MAEC_dir):
    date_str, ticker = filename.rsplit('_', 1)
    date_str = date_str.split('.')[0] 
    date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    MAEC_filename_data.append([ticker, date])
MAEC_filename_data = pd.DataFrame(MAEC_filename_data, columns=["Ticker", "Date"])

# Add TARGET of the regression

**n-day volatility predictions**: The predicted average volatility over the following n days.<br>

$$
v[0,n] = \ln \left( \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (r_i - \bar{r})^2 } \right)
$$

Where:
- $r_i$ is the stock return on day $i$,
- $\bar{r}$ is the average stock return over $n$ days.

The stock return $r_i$ is defined as:

$$
r_i = \frac{P_i - P_{i-1}}{P_{i-1}}
$$

Where $P_i$ is the adjusted closing price of the stock on day $i$.

For **single-day log volatility**, we estimate it using the **daily log absolute return**:

$$
v_n = \ln \left( \left| \frac{P_n - P_{n-1}}{P_{n-1}} \right| \right)
$$

Where:
- $P_n$ is the adjusted closing price of the stock on day $n$,
- $P_{n-1}$ is the adjusted closing price on the previous day.

Our multi-task learning objective is to **simultaneously predict** these two quantities:
- $v[0,n]$: The average volatility over $n$ days (the main task).
- $v_n$: The single-day volatility (the auxiliary task).
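As a rough illustration of the two formulas above, the sketch below computes both targets from a short series of adjusted closing prices. The function names and the price values are made up for this example, and the indexing convention (`prices[0]` is the close on the day of the call, `prices[1..n]` are the following days) is an assumption that may not match the KeFVP files exactly.

```python
import numpy as np

def daily_returns(prices):
    """Simple returns r_i = (P_i - P_{i-1}) / P_{i-1}."""
    prices = np.asarray(prices, dtype=float)
    return (prices[1:] - prices[:-1]) / prices[:-1]

def n_day_log_volatility(prices, n):
    """v[0,n]: log of the (population) standard deviation of r_1 .. r_n."""
    r = daily_returns(prices[: n + 1])          # exactly n returns
    return np.log(np.sqrt(np.mean((r - r.mean()) ** 2)))

def single_day_log_volatility(prices, n):
    """v_n: log of the absolute return on day n."""
    p = np.asarray(prices, dtype=float)
    return np.log(np.abs((p[n] - p[n - 1]) / p[n - 1]))

# Hypothetical adjusted closing prices: call day followed by three trading days
prices = [100.0, 102.0, 101.0, 103.5]
v_avg = n_day_log_volatility(prices, n=3)
v_single = single_day_log_volatility(prices, n=3)
print(v_avg, v_single)
```

Note that a day with exactly zero return makes the single-day target `log(0)`, which evaluates to negative infinity, so such days would need separate handling.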


In [5]:

dev_price_label = pd.read_csv('data/KeFVP_price_data/dev_price_label.csv')
test_price_label = pd.read_csv('data/KeFVP_price_data/test_price_label.csv')
test_split_Avg_Series_WITH_LOG = pd.read_csv('data/KeFVP_price_data/test_split_Avg_Series_WITH_LOG.csv')
test_split_SeriesSingleDayVol3 = pd.read_csv('data/KeFVP_price_data/test_split_SeriesSingleDayVol3.csv')
train_price_label = pd.read_csv('data/KeFVP_price_data/train_price_label.csv')
train_split_Avg_Series_WITH_LOG = pd.read_csv('data/KeFVP_price_data/train_split_Avg_Series_WITH_LOG.csv')
train_split_SeriesSingleDayVol3 = pd.read_csv('data/KeFVP_price_data/train_split_SeriesSingleDayVol3.csv')
val_split_Avg_Series_WITH_LOG = pd.read_csv('data/KeFVP_price_data/val_split_Avg_Series_WITH_LOG.csv')
val_split_SeriesSingleDayVol3 = pd.read_csv('data/KeFVP_price_data/val_split_SeriesSingleDayVol3.csv')

maec15_dev_avg_val = pd.read_csv('data/KeFVP_price_data/maec/15/maec15_dev_avg_val.csv')
maec15_dev_price_label = pd.read_csv('data/KeFVP_price_data/maec/15/maec15_dev_price_label.csv')
maec15_dev_single_val = pd.read_csv('data/KeFVP_price_data/maec/15/maec15_dev_single_val.csv')
maec15_test_avg_val = pd.read_csv('data/KeFVP_price_data/maec/15/maec15_test_avg_val.csv')
maec15_test_price_label = pd.read_csv('data/KeFVP_price_data/maec/15/maec15_test_price_label.csv')
maec15_test_single_val = pd.read_csv('data/KeFVP_price_data/maec/15/maec15_test_single_val.csv')
maec15_train_avg_val = pd.read_csv('data/KeFVP_price_data/maec/15/maec15_train_avg_val.csv')
maec15_train_price_label = pd.read_csv('data/KeFVP_price_data/maec/15/maec15_train_price_label.csv')
maec15_train_single_val = pd.read_csv('data/KeFVP_price_data/maec/15/maec15_train_single_val.csv')

maec16_dev_avg_val = pd.read_csv('data/KeFVP_price_data/maec/16/maec16_dev_avg_val.csv')
maec16_dev_price_label = pd.read_csv('data/KeFVP_price_data/maec/16/maec16_dev_price_label.csv')
maec16_dev_single_val = pd.read_csv('data/KeFVP_price_data/maec/16/maec16_dev_single_val.csv')
maec16_test_avg_val = pd.read_csv('data/KeFVP_price_data/maec/16/maec16_test_avg_val.csv')
maec16_test_price_label = pd.read_csv('data/KeFVP_price_data/maec/16/maec16_test_price_label.csv')
maec16_test_single_val = pd.read_csv('data/KeFVP_price_data/maec/16/maec16_test_single_val.csv')
maec16_train_avg_val = pd.read_csv('data/KeFVP_price_data/maec/16/maec16_train_avg_val.csv')
maec16_train_price_label = pd.read_csv('data/KeFVP_price_data/maec/16/maec16_train_price_label.csv')
maec16_train_single_val = pd.read_csv('data/KeFVP_price_data/maec/16/maec16_train_single_val.csv')


In [6]:
# work in progress

# Clean Praat features

- Replace '--undefined--' data
- Convert to numeric float64
- Compare our Praat data (MAEC), which was calculated from MP3 files, to the Praat data provided with MAEC on [GitHub](https://github.com/Earnings-Call-Dataset/MAEC-A-Multimodal-Aligned-Earnings-Conference-Call-Dataset-for-Financial-Risk-Prediction/tree/master)
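The cleaning steps listed above can be tried on a small toy table first. This is only a sketch: the rows are invented, and it uses `errors="coerce"` in `pd.to_numeric` as an alternative to listing every placeholder string explicitly, which is a variation on the approach used in the cells below rather than the same code.

```python
import numpy as np
import pandas as pd

# Invented rows that mimic Praat output with undefined measurements
toy = pd.DataFrame({
    "Mean pitch": ["122.5", "--undefined--", "117.2"],
    "Mean HNR": ["3.4", "4.9", "--undefined--"],
    "audio_file": ["a.mp3", "b.mp3", "c.mp3"],
})

# Every column except the identifier column should become numeric
numeric_cols = toy.columns.difference(["audio_file"])

# Anything that cannot be parsed as a number becomes NaN
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Replace each NaN with the median of its column
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

print(toy)
```

Coercing silently hides unexpected strings, so counting the NaN values before imputing is a cheap way to see how much data is being filled in.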

In [None]:
praat_features = praat_features.replace(['--undefined--', '--undefined-', '--undefined-- ',  '--'], np.nan)

# Convert all columns to numeric except the identifier columns
cols_to_convert = praat_features.columns.difference(['Company', 'Date', 'audio_file'])
praat_features[cols_to_convert] = praat_features[cols_to_convert].progress_apply(pd.to_numeric)
# Impute the column median for every missing value (originally '--undefined--')
praat_features[cols_to_convert] = praat_features[cols_to_convert].progress_apply(
    lambda col: col.fillna(col.median()) if pd.api.types.is_numeric_dtype(col) else col
)
praat_features.info(verbose=True)

100%|██████████| 30/30 [00:00<00:00, 80.07it/s]
100%|██████████| 30/30 [00:00<00:00, 899.50it/s]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89722 entries, 0 to 89721
Data columns (total 33 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Mean pitch                    89722 non-null  float64
 1   Standard deviation            89722 non-null  float64
 2   Minimum pitch                 89722 non-null  float64
 3   Maximum pitch                 89722 non-null  float64
 4   Number of pulses              89722 non-null  float64
 5   Number of periods             89722 non-null  float64
 6   Mean period                   89722 non-null  float64
 7   Mean intensity                89722 non-null  float64
 8   Minimum intensity             89722 non-null  float64
 9   Maximum intensity             89722 non-null  float64
 10  Standard deviation of period  89722 non-null  float64
 11  Fraction of unvoiced          89722 non-null  float64
 12  Number of voice breaks        89722 non-null  float64
 13  D




In [None]:
MAEC_praat_features = MAEC_praat_features.replace(['--undefined--', '--undefined-', '--undefined-- ',  '--'], np.nan)

# Convert all columns to numeric except the identifier columns
cols_to_convert = MAEC_praat_features.columns.difference(['Ticker', 'Date', 'audio_file'])
MAEC_praat_features[cols_to_convert] = MAEC_praat_features[cols_to_convert].progress_apply(pd.to_numeric)
# Impute the column median for every missing value (originally '--undefined--')
MAEC_praat_features[cols_to_convert] = MAEC_praat_features[cols_to_convert].progress_apply(
    lambda col: col.fillna(col.median()) if pd.api.types.is_numeric_dtype(col) else col
)
MAEC_praat_features.info(verbose=True)

100%|██████████| 30/30 [00:01<00:00, 22.53it/s]
100%|██████████| 30/30 [00:00<00:00, 207.92it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394277 entries, 0 to 394276
Data columns (total 33 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   Mean pitch                    394277 non-null  float64
 1   Standard deviation            394277 non-null  float64
 2   Minimum pitch                 394277 non-null  float64
 3   Maximum pitch                 394277 non-null  float64
 4   Number of pulses              394277 non-null  int64  
 5   Number of periods             394277 non-null  int64  
 6   Mean period                   394277 non-null  float64
 7   Mean intensity                394277 non-null  float64
 8   Minimum intensity             394277 non-null  float64
 9   Maximum intensity             394277 non-null  float64
 10  Standard deviation of period  394277 non-null  float64
 11  Fraction of unvoiced          394277 non-null  float64
 12  Number of voice breaks        394277 non-nul

Load the Praat features provided in the MAEC GitHub repository for comparison.

In [None]:
# Read the provided features.csv of every meeting into a list,
# then concatenate once (faster than growing a DataFrame inside the loop).
frames = []
for row in tqdm(MAEC_filename_data.itertuples(index=False), total=len(MAEC_filename_data)):
    Ticker = row.Ticker
    Date = row.Date.replace('-', '')

    features_df = pd.read_csv(f"data/MAEC/MAEC_Dataset/{Date}_{Ticker}/features.csv")
    features_df['Ticker'] = Ticker
    features_df['Date'] = Date
    features_df['Sentence_num'] = range(1, len(features_df) + 1)
    frames.append(features_df)

MAEC_GitHub_features = pd.concat(frames, ignore_index=True)

MAEC_GitHub_features = MAEC_GitHub_features.replace(['--undefined--', '--undefined-', '--undefined-- ',  '--'], np.nan)
# Convert all columns into float64 except 'Ticker', 'Date', 'audio_file' 
cols_to_convert = MAEC_GitHub_features.columns.difference(['Ticker', 'Date', 'audio_file'])
MAEC_GitHub_features[cols_to_convert] = MAEC_GitHub_features[cols_to_convert].progress_apply(pd.to_numeric)
# Impute the column median for every missing value (originally '--undefined--')
MAEC_GitHub_features[cols_to_convert] = MAEC_GitHub_features[cols_to_convert].progress_apply(
    lambda col: col.fillna(col.median()) if pd.api.types.is_numeric_dtype(col) else col
)
MAEC_GitHub_features.info(verbose=True)

100%|██████████| 3443/3443 [02:20<00:00, 24.58it/s]
100%|██████████| 30/30 [00:00<00:00, 33.65it/s]
100%|██████████| 30/30 [00:00<00:00, 209.28it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394277 entries, 0 to 394276
Data columns (total 32 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   Mean pitch                    394277 non-null  float64
 1   Standard deviation            394277 non-null  float64
 2   Minimum pitch                 394277 non-null  float64
 3   Maximum pitch                 394277 non-null  float64
 4   Mean intensity                394277 non-null  float64
 5   Minimum intensity             394277 non-null  float64
 6   Maximum intensity             394277 non-null  float64
 7   Number of pulses              394277 non-null  float64
 8   Number of periods             394277 non-null  float64
 9   Mean period                   394277 non-null  float64
 10  Standard deviation of period  394277 non-null  float64
 11  Fraction of unvoiced          394277 non-null  float64
 12  Number of voice breaks        394277 non-nul

# Compare
The two sets of values are very close, which gives some assurance that the Praat features for the original dataset are also okay.
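One way to put a number on "very close" is to measure, per column, the fraction of rows that agree within a tolerance. The sketch below is hypothetical: `compare_feature_tables` is not part of the notebook, the two small tables are invented, and it assumes both tables are row-aligned (same meetings and sentences in the same order).

```python
import numpy as np
import pandas as pd

def compare_feature_tables(ours, provided, rtol=1e-3):
    """Per shared numeric column, the fraction of rows that agree within rtol."""
    shared = [
        c for c in ours.columns
        if c in provided.columns
        and pd.api.types.is_numeric_dtype(ours[c])
        and pd.api.types.is_numeric_dtype(provided[c])
    ]
    return pd.Series({
        c: np.isclose(
            ours[c].to_numpy(dtype=float),
            provided[c].to_numpy(dtype=float),
            rtol=rtol,
        ).mean()
        for c in shared
    })

# Invented values standing in for the two Praat tables
ours = pd.DataFrame({
    "Mean pitch": [122.509, 137.390],
    "Mean HNR": [3.356, 4.893],
    "Ticker": ["LMAT", "LMAT"],
})
provided = pd.DataFrame({
    "Mean pitch": [122.509, 137.391],
    "Mean HNR": [3.356, 5.500],
    "Ticker": ["LMAT", "LMAT"],
})

agreement = compare_feature_tables(ours, provided)
print(agreement)
```

A column with low agreement is worth inspecting by hand, since the cause may be a labeling or ordering difference between the sources rather than a real difference in the measurements.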

In [None]:
MAEC_praat_features

Unnamed: 0,Mean pitch,Standard deviation,Minimum pitch,Maximum pitch,Number of pulses,Number of periods,Mean period,Mean intensity,Minimum intensity,Maximum intensity,...,Shimmer apq11,Shimmer dda,Mean autocorrelation,Mean NHR,Mean HNR,Audio Length,Ticker,Date,Sentence_num,audio_file
0,122.509,4.582,119.282,132.486,15,14,0.008145,48.617273,33.245858,59.811569,...,30.665,18.235,0.675386,0.524708,3.356,0.370979,LMAT,20150225,2,LMAT_20150225_f000002100.mp3
1,137.390,76.065,91.597,572.261,103,96,0.007461,42.840412,17.956942,61.970121,...,17.935,29.632,0.712884,0.482183,4.893,2.506979,LMAT,20150225,3,LMAT_20150225_f000002101.mp3
2,117.168,11.177,91.897,150.239,283,266,0.008527,43.585335,14.603231,68.719015,...,12.035,23.836,0.717230,0.468501,4.744,5.170979,LMAT,20150225,4,LMAT_20150225_f000002102.mp3
3,117.238,11.779,97.850,145.271,170,162,0.008542,42.735277,15.718385,63.319417,...,17.816,23.452,0.741878,0.407407,5.231,3.250979,LMAT,20150225,5,LMAT_20150225_f000002103.mp3
4,136.303,70.402,98.310,577.893,342,319,0.007383,44.440119,22.618467,62.993807,...,14.846,20.482,0.752116,0.394099,5.627,5.338979,LMAT,20150225,6,LMAT_20150225_f000002104.mp3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
394272,133.531,14.505,105.107,157.856,42,40,0.007475,59.591205,-15.285724,80.196919,...,6.825,10.370,0.800006,0.290069,6.972,0.743379,BKS,20180621,24,BKS_20180621_f000039100.mp3
394273,128.154,49.849,95.974,567.649,396,377,0.007820,67.647350,45.119193,80.862428,...,13.618,15.330,0.823743,0.265119,8.079,4.452766,BKS,20180621,25,BKS_20180621_f000039101.mp3
394274,130.803,12.533,102.188,160.576,315,304,0.007669,62.469106,38.641636,80.928417,...,11.752,14.277,0.824435,0.265622,8.167,4.452766,BKS,20180621,26,BKS_20180621_f000039102.mp3
394275,116.219,11.366,87.046,137.937,157,148,0.008614,61.795049,36.740310,80.262575,...,13.835,13.291,0.814544,0.285962,8.096,2.075624,BKS,20180621,27,BKS_20180621_f000039103.mp3


In [None]:
MAEC_GitHub_features

Unnamed: 0,Mean pitch,Standard deviation,Minimum pitch,Maximum pitch,Mean intensity,Minimum intensity,Maximum intensity,Number of pulses,Number of periods,Mean period,...,Shimmer apq5,Shimmer apq11,Shimmer dda,Mean autocorrelation,Mean NHR,Mean HNR,Audio Length,Ticker,Date,Sentence_num
0,122.509,4.582,119.282,132.486,48.617273,59.811569,33.245858,15.0,14.0,0.008145,...,11.513,30.665,18.235,0.675386,0.524708,3.356,0.370979,LMAT,20150225,1
1,137.390,76.065,91.597,572.261,42.840412,61.970121,17.956942,103.0,96.0,0.007461,...,12.248,17.935,29.632,0.712884,0.482183,4.893,2.506979,LMAT,20150225,2
2,117.168,11.177,91.897,150.239,43.585335,68.719015,14.603231,283.0,266.0,0.008527,...,10.633,12.035,23.836,0.717230,0.468501,4.744,5.170979,LMAT,20150225,3
3,117.238,11.779,97.850,145.271,42.735277,63.319417,15.718385,170.0,162.0,0.008542,...,9.040,17.816,23.452,0.741878,0.407407,5.231,3.250979,LMAT,20150225,4
4,136.303,70.402,98.310,577.893,44.440119,62.993807,22.618467,342.0,319.0,0.007383,...,8.630,14.846,20.482,0.752116,0.394099,5.627,5.338979,LMAT,20150225,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
394272,133.531,14.505,105.107,157.856,59.591205,80.196919,-15.285724,42.0,40.0,0.007475,...,4.486,6.825,10.370,0.800006,0.290069,6.972,0.743379,BKS,20180621,24
394273,128.154,49.849,95.974,567.649,67.647350,80.862428,45.119193,396.0,377.0,0.007820,...,7.697,13.618,15.330,0.823743,0.265119,8.079,4.452766,BKS,20180621,25
394274,130.803,12.533,102.188,160.576,62.469106,80.928417,38.641636,315.0,304.0,0.007669,...,5.904,11.752,14.277,0.824435,0.265622,8.167,4.452766,BKS,20180621,26
394275,116.219,11.366,87.046,137.937,61.795049,80.262575,36.740310,157.0,148.0,0.008614,...,7.349,13.835,13.291,0.814544,0.285962,8.096,2.075624,BKS,20180621,27


# Final datasets


In [None]:
max_sent_len = 523

# The maximum sentences per meeting is 523 
# Ensure each meeting has 523 sentences, add zeros to the end
def add_zero_padding(group):
    complete_index = pd.Index(np.arange(1, (max_sent_len +1)), name='Sentence_num')
    group = group.set_index('Sentence_num').reindex(complete_index).reset_index()
    group['Date'] = group['Date'].ffill().bfill()
    group['Company'] = group['Company'].ffill().bfill()
    group.fillna(0.0, inplace=True)
    return group

def combine_audio_text_targets(audio_features, text_features, targets, num_features, save_directory):
    # combine & add_zero_padding
    features = pd.merge(audio_features, text_features, how="left",on = ['Company','Date','Sentence_num'])
    features = features.drop(['Shimmer apq11','Audio Length','audio_file'], axis=1)
    features = features.groupby(['Company', 'Date'], group_keys=False).progress_apply(add_zero_padding).reset_index(drop=True)
    features.fillna(0, inplace=True)

    # match the targets to the features
    # (target values will be duplicated 523 times, for each meeting, but that's okay for now)
    # we need the date and company columns to sort
    targets['Date'] = targets['Date'].astype(str).str.replace('-', '').astype('Int64')
    original_dataset = pd.merge(features, targets, how="left",on = ['Company','Date'])
    rows_with_nulls = original_dataset[original_dataset.isnull().any(axis=1)]
    original_dataset.fillna(0, inplace=True)
    print('Number of rows with NULL values -   ', len(rows_with_nulls))

    # dump rows without targets
    original_dataset = original_dataset.drop(['Ticker'], axis=1)
    original_dataset = original_dataset[~((original_dataset['3_day'] == 0) & 
                                        (original_dataset['7_day'] == 0) & 
                                        (original_dataset['15_day'] == 0) & 
                                        (original_dataset['30_day'] == 0))]
    original_dataset = original_dataset.sort_values(by=['Date', 'Company', 'Sentence_num'], ascending=[True, True, True])
    original_dataset.info(verbose=False)

    # train/val/test split FEATURES ONLY
    original_dataset['Date'] = pd.to_datetime(original_dataset['Date'].astype(str), format='%Y%m%d')
    train_features = original_dataset[original_dataset['Date'] <= '2017-08-02']
    val_features = original_dataset[(original_dataset['Date'] >= '2017-08-03') & (original_dataset['Date'] < '2017-10-24')]
    test_features = original_dataset[original_dataset['Date'] >= '2017-10-24']
    # Identifier and target columns are not model inputs
    non_feature_cols = ['Company','Date','Sentence_num','3_day','3_day_single','7_day','7_day_single', '15_day','15_day_single','30_day','30_day_single']
    train_features = train_features.drop(non_feature_cols, axis=1).to_numpy()
    val_features = val_features.drop(non_feature_cols, axis=1).to_numpy()
    test_features = test_features.drop(non_feature_cols, axis=1).to_numpy()

    # now for the TARGETS - we need 4 sets of targets (3, 7, 15, 30) each with primary/secondary targets 
    targets = targets[~((targets['3_day'] == 0) & 
                        (targets['7_day'] == 0) & 
                        (targets['15_day'] == 0) & 
                        (targets['30_day'] == 0))] # dump rows without targets
    targets = targets.copy().sort_values(by=['Date', 'Company'], ascending=[True, True])
    targets['Date'] = pd.to_datetime(targets['Date'].astype(str), format='%Y%m%d')
    # train/val/test split
    train_targets = targets[targets['Date'] <= '2017-08-02']
    val_targets = targets[(targets['Date'] >= '2017-08-03') & (targets['Date'] < '2017-10-24')]
    test_targets = targets[targets['Date'] >= '2017-10-24']

    #############################  save  ######################################################################
    # FEATURES- Reshape the NumPy array to have dimensions ( # meetings, 523 sentences (with padding), num_features)
    np.save(f'data/{save_directory}/train_features.npy', train_features.reshape(int(len(train_features)/max_sent_len), max_sent_len, num_features))
    np.save(f'data/{save_directory}/val_features.npy', val_features.reshape(int(len(val_features)/max_sent_len), max_sent_len, num_features))
    np.save(f'data/{save_directory}/test_features.npy', test_features.reshape(int(len(test_features)/max_sent_len), max_sent_len, num_features))
    
    # TARGETS - one primary and one secondary (single-day) array per horizon and per split
    target_splits = {'train': train_targets, 'val': val_targets, 'test': test_targets}
    for n in (3, 7, 15, 30):
        for split_name, split_targets in target_splits.items():
            np.save(f'data/{save_directory}/{split_name}_targets_{n}.npy', split_targets[f'{n}_day'].to_numpy())
            np.save(f'data/{save_directory}/{split_name}_secondary_targets_{n}.npy', split_targets[f'{n}_day_single'].to_numpy())

        

In [None]:
praat_features['Date'] = praat_features['Date'].astype('Int64')
combine_audio_text_targets(praat_features, RoBERTa_features, targets, 1051, 'RoBERTa')

100%|██████████| 572/572 [00:05<00:00, 106.24it/s]


Number of rows with NULL values -    0
<class 'pandas.core.frame.DataFrame'>
Index: 292880 entries, 73743 to 75311
Columns: 1062 entries, Sentence_num to 30_day_single
dtypes: Int64(1), float64(1059), int32(1), object(1)
memory usage: 2.3+ GB


In [None]:
combine_audio_text_targets(praat_features, RoBERTa_features2, targets, 1051, 'RoBERTa2')

100%|██████████| 572/572 [00:06<00:00, 88.56it/s] 


Number of rows with NULL values -    0
<class 'pandas.core.frame.DataFrame'>
Index: 292880 entries, 73743 to 75311
Columns: 1062 entries, Sentence_num to 30_day_single
dtypes: Int64(1), float64(1059), int32(1), object(1)
memory usage: 2.3+ GB


In [None]:
combine_audio_text_targets(praat_features, investopedia_features, targets, 795, 'investopedia')

100%|██████████| 572/572 [00:05<00:00, 105.00it/s]


Number of rows with NULL values -    0
<class 'pandas.core.frame.DataFrame'>
Index: 292880 entries, 73743 to 75311
Columns: 806 entries, Sentence_num to 30_day_single
dtypes: Int64(1), float64(803), int32(1), object(1)
memory usage: 1.8+ GB


In [None]:
combine_audio_text_targets(praat_features, bge_features, targets, 1051, 'bge')

100%|██████████| 572/572 [00:07<00:00, 71.51it/s]


Number of rows with NULL values -    0
<class 'pandas.core.frame.DataFrame'>
Index: 292880 entries, 73743 to 75311
Columns: 1062 entries, Sentence_num to 30_day_single
dtypes: Int64(1), float64(1059), int32(1), object(1)
memory usage: 2.3+ GB


In [None]:
combine_audio_text_targets(praat_features, bge_base_features, targets, 795, 'bge_base')

100%|██████████| 572/572 [00:03<00:00, 146.25it/s]


Number of rows with NULL values -    0
<class 'pandas.core.frame.DataFrame'>
Index: 292880 entries, 73743 to 75311
Columns: 806 entries, Sentence_num to 30_day_single
dtypes: Int64(1), float64(803), int32(1), object(1)
memory usage: 1.8+ GB


In [None]:
combine_audio_text_targets(praat_features, glove_features, targets, 327, 'glove')

100%|██████████| 572/572 [00:01<00:00, 292.00it/s]


Number of rows with NULL values -    0
<class 'pandas.core.frame.DataFrame'>
Index: 292880 entries, 73743 to 75311
Columns: 338 entries, Sentence_num to 30_day_single
dtypes: Int64(1), float64(335), int32(1), object(1)
memory usage: 756.7+ MB
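The padding and reshaping used for the feature arrays above can be checked on a toy example. This is a minimal sketch with invented data: `max_len = 4` stands in for `max_sent_len`, `pad_meeting` is a simplified version of `add_zero_padding`, and the groups are concatenated with an explicit loop instead of `groupby(...).apply`.

```python
import numpy as np
import pandas as pd

max_len = 4  # toy stand-in for max_sent_len

# Two invented meetings: company A has 2 sentences, company B has 3
toy = pd.DataFrame({
    "Company": ["A", "A", "B", "B", "B"],
    "Sentence_num": [1, 2, 1, 2, 3],
    "f1": [0.1, 0.2, 0.3, 0.4, 0.5],
    "f2": [1.0, 2.0, 3.0, 4.0, 5.0],
})

def pad_meeting(group):
    """Reindex one meeting to sentences 1..max_len and zero-fill the new rows."""
    full_index = pd.Index(np.arange(1, max_len + 1), name="Sentence_num")
    padded = group.set_index("Sentence_num").reindex(full_index).reset_index()
    padded["Company"] = padded["Company"].ffill().bfill()
    return padded.fillna(0.0)

padded = pd.concat(
    [pad_meeting(group) for _, group in toy.groupby("Company")],
    ignore_index=True,
)

# (meetings, sentences per meeting, features per sentence)
features = padded[["f1", "f2"]].to_numpy().reshape(-1, max_len, 2)
print(features.shape)  # (2, 4, 2)
```

Because `reindex` keeps only the requested sentence numbers, a meeting with more than `max_len` sentences would be truncated rather than padded.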


# Final MAEC dataset

In [None]:
max_sent_len = 500 

# Ensure each meeting has exactly max_sent_len sentences:
# shorter meetings are zero-padded at the end, and sentences beyond max_sent_len are dropped by the reindex
def add_zero_padding(group):
    complete_index = pd.Index(np.arange(1, (max_sent_len + 1)), name='Sentence_num')
    group = group.set_index('Sentence_num').reindex(complete_index).reset_index()
    group['Date'] = group['Date'].ffill().bfill()
    group['Ticker'] = group['Ticker'].ffill().bfill()
    group.fillna(0.0, inplace=True)
    return group

def combine_audio_text_targets(audio_features, text_features, targets, num_features, save_directory):
    # combine & add_zero_padding
    features = pd.merge(audio_features, text_features, how="left",on = ['Ticker','Date','Sentence_num'])
    features = features.drop(['Shimmer apq11','Audio Length','audio_file'], axis=1)
    features = features.groupby(['Ticker', 'Date'], group_keys=False).progress_apply(add_zero_padding).reset_index(drop=True)
    features.fillna(0, inplace=True)

    # match the targets to the features
    # (target values will be duplicated 500 times, for each meeting, but that's okay for now)
    # we need the date and ticker columns to sort
    targets['Date'] = targets['Date'].astype(str).str.replace('-', '').astype('Int64')
    MAEC_dataset = pd.merge(features, targets, how="left",on = ['Ticker','Date'])
    rows_with_nulls = MAEC_dataset[MAEC_dataset.isnull().any(axis=1)]
    MAEC_dataset.fillna(0, inplace=True)
    print('Number of rows with NULL values -   ', len(rows_with_nulls))

    # dump rows without targets
    MAEC_dataset = MAEC_dataset[~((MAEC_dataset['3_day'] == 0) & 
                                        (MAEC_dataset['7_day'] == 0) & 
                                        (MAEC_dataset['15_day'] == 0) & 
                                        (MAEC_dataset['30_day'] == 0))]
    MAEC_dataset = MAEC_dataset.sort_values(by=['Date', 'Ticker', 'Sentence_num'], ascending=[True, True, True])
    MAEC_dataset.info(verbose=False)

    # train/val/test split FEATURES ONLY
    MAEC_dataset['Date'] = pd.to_datetime(MAEC_dataset['Date'].astype(str), format='%Y%m%d')
    train_features_2015 = MAEC_dataset[MAEC_dataset['Date'] <= '2015-10-21']
    val_features_2015 = MAEC_dataset[(MAEC_dataset['Date'] >= '2015-10-22') & (MAEC_dataset['Date'] < '2015-10-28')]
    test_features_2015 = MAEC_dataset[(MAEC_dataset['Date'] >= '2015-10-29') & (MAEC_dataset['Date'] < '2016-01-01')]
    # 2016
    train_features_2016 = MAEC_dataset[(MAEC_dataset['Date'] >= '2016-01-01') & (MAEC_dataset['Date'] < '2016-08-03')]
    val_features_2016 = MAEC_dataset[(MAEC_dataset['Date'] >= '2016-08-03') & (MAEC_dataset['Date'] < '2016-08-12')]
    test_features_2016 = MAEC_dataset[(MAEC_dataset['Date'] >= '2016-08-12') & (MAEC_dataset['Date'] < '2017-01-01')]
    # 2017
    train_features_2017 = MAEC_dataset[(MAEC_dataset['Date'] >= '2017-01-01') & (MAEC_dataset['Date'] < '2017-11-07')]
    val_features_2017 = MAEC_dataset[(MAEC_dataset['Date'] >= '2017-11-07') & (MAEC_dataset['Date'] < '2018-02-12')]
    test_features_2017 = MAEC_dataset[(MAEC_dataset['Date'] >= '2018-02-15') & (MAEC_dataset['Date'] < '2019-01-01')]

    # Identifier and target columns are not model inputs
    non_feature_cols = ['Ticker','Date','Sentence_num','3_day','3_day_single','7_day','7_day_single', '15_day','15_day_single','30_day','30_day_single']
    train_features_2015 = train_features_2015.drop(non_feature_cols, axis=1).to_numpy()
    val_features_2015 = val_features_2015.drop(non_feature_cols, axis=1).to_numpy()
    test_features_2015 = test_features_2015.drop(non_feature_cols, axis=1).to_numpy()
    # 2016
    train_features_2016 = train_features_2016.drop(non_feature_cols, axis=1).to_numpy()
    val_features_2016 = val_features_2016.drop(non_feature_cols, axis=1).to_numpy()
    test_features_2016 = test_features_2016.drop(non_feature_cols, axis=1).to_numpy()
    # 2017
    train_features_2017 = train_features_2017.drop(non_feature_cols, axis=1).to_numpy()
    val_features_2017 = val_features_2017.drop(non_feature_cols, axis=1).to_numpy()
    test_features_2017 = test_features_2017.drop(non_feature_cols, axis=1).to_numpy()

    # now for the TARGETS - we need 4 sets of targets (3, 7, 15, 30) each with primary/secondary targets 
    targets = targets[~((targets['3_day'] == 0) & 
                        (targets['7_day'] == 0) & 
                        (targets['15_day'] == 0) & 
                        (targets['30_day'] == 0))] # drop rows without targets
    targets = targets.copy().sort_values(by=['Date', 'Ticker'], ascending=[True, True])
    targets['Date'] = pd.to_datetime(targets['Date'].astype(str), format='%Y%m%d')
    # train/val/test split - same date boundaries as the feature split above (including the uncovered
    # dates 2015-10-28 and 2018-02-12..14), so targets stay row-aligned with the reshaped features
    # 2015
    train_targets_2015 = targets[targets['Date'] <= '2015-10-21']
    val_targets_2015 = targets[(targets['Date'] >= '2015-10-22') & (targets['Date'] < '2015-10-28')]
    test_targets_2015 = targets[(targets['Date'] >= '2015-10-29') & (targets['Date'] < '2016-01-01')]
    # 2016
    train_targets_2016 = targets[(targets['Date'] >= '2016-01-01') & (targets['Date'] < '2016-08-03')]
    val_targets_2016 = targets[(targets['Date'] >= '2016-08-03') & (targets['Date'] < '2016-08-12')]
    test_targets_2016 = targets[(targets['Date'] >= '2016-08-12') & (targets['Date'] < '2017-01-01')]
    # 2017 (this split also covers 2018)
    train_targets_2017 = targets[(targets['Date'] >= '2017-01-01') & (targets['Date'] < '2017-11-07')]
    val_targets_2017 = targets[(targets['Date'] >= '2017-11-07') & (targets['Date'] < '2018-02-12')]
    test_targets_2017 = targets[(targets['Date'] >= '2018-02-15') & (targets['Date'] < '2019-01-01')]

    #############################  save  ######################################################################
    # FEATURES - reshape each flat (rows, num_features) array to (# meetings, # sentences (with padding), num_features)
    feature_splits = {
        'train_features_2015': train_features_2015, 'val_features_2015': val_features_2015, 'test_features_2015': test_features_2015,
        'train_features_2016': train_features_2016, 'val_features_2016': val_features_2016, 'test_features_2016': test_features_2016,
        'train_features_2017': train_features_2017, 'val_features_2017': val_features_2017, 'test_features_2017': test_features_2017,
    }
    for name, features in feature_splits.items():
        n_meetings = len(features) // max_sent_len  # every meeting contributes max_sent_len (padded) rows
        np.save(f'data/{save_directory}/{name}.npy', features.reshape(n_meetings, max_sent_len, num_features))

    ##################################################################################################
    # TARGETS - for every split and year, save the primary ('<n>_day') and secondary ('<n>_day_single')
    # targets at each horizon, e.g. train_targets_2015_3.npy and train_secondary_targets_2015_3.npy
    target_splits = {
        ('train', 2015): train_targets_2015, ('val', 2015): val_targets_2015, ('test', 2015): test_targets_2015,
        ('train', 2016): train_targets_2016, ('val', 2016): val_targets_2016, ('test', 2016): test_targets_2016,
        ('train', 2017): train_targets_2017, ('val', 2017): val_targets_2017, ('test', 2017): test_targets_2017,
    }
    for (split, year), split_targets in target_splits.items():
        for horizon in (3, 7, 15, 30):
            np.save(f'data/{save_directory}/{split}_targets_{year}_{horizon}.npy',
                    split_targets[f'{horizon}_day'].to_numpy())
            np.save(f'data/{save_directory}/{split}_secondary_targets_{year}_{horizon}.npy',
                    split_targets[f'{horizon}_day_single'].to_numpy())
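
The feature save step relies on every meeting occupying exactly `max_sent_len` consecutive rows after sorting by date, ticker, and sentence number. The following is a minimal sketch of that reshape on toy data; the sizes (3 meetings, 4 sentences, 2 features) are hypothetical and chosen only for illustration, they are not the values used in this notebook.

```python
import numpy as np

# Hypothetical toy sizes, for illustration only.
n_meetings, max_sent_len, num_features = 3, 4, 2

# Flat table: one row per (meeting, sentence). Rows are assumed to be sorted so
# that each meeting's max_sent_len rows (padding included) are contiguous.
flat = np.arange(n_meetings * max_sent_len * num_features, dtype=float)
flat = flat.reshape(n_meetings * max_sent_len, num_features)

# Same kind of reshape as the save step: (rows, features) -> (meetings, sentences, features).
cube = flat.reshape(len(flat) // max_sent_len, max_sent_len, num_features)

print(cube.shape)

# Sentence 1 of meeting 2 is row 2 * max_sent_len + 1 of the flat table.
assert np.array_equal(cube[2, 1], flat[2 * max_sent_len + 1])
```

If the rows were not sorted this way, or a meeting had fewer than `max_sent_len` rows, the reshape would either fail or silently mix sentences from different meetings, which is why the padding and sorting happen before the split.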




   

In [None]:
MAEC_praat_features['Date'] = MAEC_praat_features['Date'].astype('Int64')
combine_audio_text_targets(MAEC_praat_features, MAEC_RoBERTa_features, MAEC_targets, 1051, 'MAEC_RoBERTa')

100%|██████████| 3443/3443 [00:35<00:00, 97.80it/s] 


Number of rows with NULL values -    0
<class 'pandas.core.frame.DataFrame'>
Index: 1612500 entries, 934500 to 234999
Columns: 1062 entries, Sentence_num to 30_day_single
dtypes: Int64(1), float64(1059), int32(1), object(1)
memory usage: 12.8+ GB
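
The combined frame above is almost entirely `float64` columns, which is where most of the reported memory goes. One possible way to reduce it, assuming the downstream model tolerates single precision (an assumption, not something this notebook verifies), is to downcast the float columns to `float32`. A small sketch on a toy frame with hypothetical column names:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined feature frame (column names are hypothetical).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 8)), columns=[f'feat_{i}' for i in range(8)])
before = df.memory_usage(deep=True).sum()

# Downcast only the float64 columns; other dtypes are left untouched.
float_cols = df.select_dtypes(include='float64').columns
df = df.astype({col: 'float32' for col in float_cols})
after = df.memory_usage(deep=True).sum()

print(f'{before} bytes -> {after} bytes')
```

Whether the precision loss matters for the volatility targets is not examined here, so it may be safer to downcast only the feature columns and leave the target columns as they are.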


In [None]:
combine_audio_text_targets(MAEC_praat_features, MAEC_RoBERTa_features2, MAEC_targets, 1051, 'MAEC_RoBERTa2')

100%|██████████| 3443/3443 [00:38<00:00, 90.11it/s] 


Number of rows with NULL values -    0
<class 'pandas.core.frame.DataFrame'>
Index: 1612500 entries, 934500 to 234999
Columns: 1062 entries, Sentence_num to 30_day_single
dtypes: Int64(1), float64(1059), int32(1), object(1)
memory usage: 12.8+ GB


In [None]:
combine_audio_text_targets(MAEC_praat_features, MAEC_investopedia_features, MAEC_targets, 795, 'MAEC_investopedia')

100%|██████████| 3443/3443 [00:28<00:00, 119.03it/s]


Number of rows with NULL values -    0
<class 'pandas.core.frame.DataFrame'>
Index: 1612500 entries, 934500 to 234999
Columns: 806 entries, Sentence_num to 30_day_single
dtypes: Int64(1), float64(803), int32(1), object(1)
memory usage: 9.7+ GB


In [None]:
combine_audio_text_targets(MAEC_praat_features, MAEC_bge_features, MAEC_targets, 1051, 'MAEC_bge')

100%|██████████| 3443/3443 [00:40<00:00, 85.11it/s] 


Number of rows with NULL values -    0
<class 'pandas.core.frame.DataFrame'>
Index: 1612500 entries, 934500 to 234999
Columns: 1062 entries, Sentence_num to 30_day_single
dtypes: Int64(1), float64(1059), int32(1), object(1)
memory usage: 12.8+ GB


In [None]:
combine_audio_text_targets(MAEC_praat_features, MAEC_bge_base_features, MAEC_targets, 795, 'MAEC_bge_base')

100%|██████████| 3443/3443 [00:28<00:00, 120.67it/s]


Number of rows with NULL values -    0
<class 'pandas.core.frame.DataFrame'>
Index: 1612500 entries, 934500 to 234999
Columns: 806 entries, Sentence_num to 30_day_single
dtypes: Int64(1), float64(803), int32(1), object(1)
memory usage: 9.7+ GB


In [None]:
combine_audio_text_targets(MAEC_praat_features, MAEC_glove_features, MAEC_targets, 327, 'MAEC_glove')

100%|██████████| 3443/3443 [00:13<00:00, 251.55it/s]


Number of rows with NULL values -    0
<class 'pandas.core.frame.DataFrame'>
Index: 1612500 entries, 934500 to 234999
Columns: 338 entries, Sentence_num to 30_day_single
dtypes: Int64(1), float64(335), int32(1), object(1)
memory usage: 4.1+ GB
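
After the files are written, a quick sanity check is to reload one split and confirm that the features and targets agree on the number of meetings. The sketch below writes small stand-in arrays to a temporary directory so it runs on its own; in practice the paths would point at the files under `data/<save_directory>/`, and the toy shapes are hypothetical.

```python
import os
import tempfile
import numpy as np

# Stand-in files with hypothetical toy shapes: 5 meetings, 4 sentences, 2 features.
tmp = tempfile.mkdtemp()
np.save(os.path.join(tmp, 'train_features_2015.npy'), np.zeros((5, 4, 2)))
np.save(os.path.join(tmp, 'train_targets_2015_3.npy'), np.zeros(5))
np.save(os.path.join(tmp, 'train_secondary_targets_2015_3.npy'), np.zeros(5))

# mmap_mode='r' maps the file read-only instead of loading it fully into memory,
# which can help with the larger feature arrays.
features = np.load(os.path.join(tmp, 'train_features_2015.npy'), mmap_mode='r')
targets = np.load(os.path.join(tmp, 'train_targets_2015_3.npy'))
secondary = np.load(os.path.join(tmp, 'train_secondary_targets_2015_3.npy'))

# Assumes one primary and one secondary target per meeting, so the first
# dimensions should match.
assert features.shape[0] == targets.shape[0] == secondary.shape[0]
print(features.shape, targets.shape, secondary.shape)
```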
