# Earnings Call Project: Data Cleaning
<br>
CIS 831 Deep Learning – Term Project<br>
Kansas State University
<br><br>
James Chapman<br>
John Woods<br>
Nathan Diehl<br>
<br>

This notebook creates the data used for training and testing. It:
- Calculates the targets for both datasets
- Corrects the Praat features for both datasets
- Combines features (GloVe & Praat) or (RoBERTa & Praat) with the targets
- Saves 9 NumPy files specifically for HTML (RoBERTa & Praat)
    - train (features, targets, secondary_targets)
    - validation (features, targets, secondary_targets)
    - test (features, targets, secondary_targets)
- Note: 7 meetings were removed from the original dataset because we could not find stock data for them (Yahoo or Alpha Vantage)
- 218 meetings were removed from MAEC for the same reason

The rest of the data from this notebook is stored in the "data" directory as the following CSVs:
- original_dataset
- MAEC_dataset

In [3]:
import sys
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    !pip install yfinance
    from google.colab import drive
    drive.mount('/content/gdrive')
    %cd gdrive/My Drive/831

In [4]:
import pandas as pd
import numpy as np
import requests
import time
import json
import csv
import re
import os
from datetime import datetime
from tqdm import tqdm

tqdm.pandas()

In [5]:
MAEC_dir = 'data/MAEC/MAEC_Dataset' # https://github.com/Earnings-Call-Dataset/MAEC-A-Multimodal-Aligned-Earnings-Conference-Call-Dataset-for-Financial-Risk-Prediction

############# too big for GitHub ########################
############# stored on local disk ######################
original_data_dir = r"D:\original_dataset" # https://github.com/GeminiLn/EarningsCall_Dataset 
MAEC_audio_dir = r"D:\MAEC_audio" 
# The audio-data link in the MAEC GitHub repository does not work.
# I emailed the authors, and they sent another link.
# There are roughly half a million files, but they total only 19 GB.
# https://drive.google.com/file/d/1m1GRCHgKn9Vz9IFMC_SpCog6uP3-gFgY/view?usp=drive_link 

# from Webscraping
alpha_dir = 'data/data_prep/alpha_data/{}.csv' # used as alpha_dir.format(ticker); the raw Alpha Vantage data is cached here so it does not have to be downloaded again
yahoo_data = pd.read_csv('data/data_prep/yahoo_data.csv', index_col=0)
alpha_data = pd.read_csv('data/data_prep/alpha_data.csv', index_col=0)
MAEC_yahoo_data = pd.read_csv('data/data_prep/MAEC_yahoo_data.csv', index_col=0)
MAEC_alpha_data = pd.read_csv('data/data_prep/MAEC_alpha_data.csv', index_col=0)

yahoo_data.index = pd.to_datetime(yahoo_data.index)
alpha_data.index = pd.to_datetime(alpha_data.index)
MAEC_yahoo_data.index = pd.to_datetime(MAEC_yahoo_data.index)
MAEC_alpha_data.index = pd.to_datetime(MAEC_alpha_data.index)

# from Feature Engineering
glove_features = pd.read_csv('data/data_prep/glove_features.csv')
praat_features = pd.read_csv('data/data_prep/praat_features.csv', low_memory=False)
MAEC_glove_features = pd.read_csv('data/data_prep/MAEC_glove_features.csv')
MAEC_praat_features = pd.read_csv('data/data_prep/MAEC_praat_features.csv', low_memory=False)

# from Feature Engineering RoBERTa
RoBERTa_features = pd.read_csv('data/data_prep/RoBERTa_features.csv', low_memory=False)
# MACE_RoBERTa_features = pd.read_csv('data/data_prep/MACE_RoBERTa_features.csv', low_memory=False)

In [6]:
# Loop through the directory, each folder represents an earnings conference call; the folders are named as "CompanyName_Date".
filename_data = []
for filename in os.listdir(original_data_dir):
    company_name, date_str = filename.rsplit('_', 1)
    date_str = date_str.split('.')[0] 
    date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    filename_data.append([company_name, date])
filename_data = pd.DataFrame(filename_data, columns=["Company", "Date"])
company_ticker = pd.read_csv('data/data_prep/company_ticker.csv')
filename_data = filename_data.merge(company_ticker, on="Company", how="left")

# Loop through the directory; each folder represents an earnings conference call, and the folders are named "Date_Ticker".
MAEC_filename_data = []
for filename in os.listdir(MAEC_dir):
    date_str, ticker = filename.rsplit('_', 1)
    date_str = date_str.split('.')[0] 
    date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    MAEC_filename_data.append([ticker, date])
MAEC_filename_data = pd.DataFrame(MAEC_filename_data, columns=["Ticker", "Date"])

In [7]:
filename_data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 572 entries, 0 to 571
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Company  572 non-null    object
 1   Date     572 non-null    object
 2   Ticker   572 non-null    object
dtypes: object(3)
memory usage: 13.5+ KB
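
The folder-name parsing used to build `filename_data` can be checked in isolation. The sketch below mirrors the `rsplit('_', 1)` logic from the cell above; the helper name `parse_call_folder` and the example folder names are illustrative assumptions, not names taken from the dataset.

```python
from datetime import datetime
from typing import Tuple

def parse_call_folder(name: str) -> Tuple[str, str]:
    """Split 'CompanyName_YYYYMMDD' into (company, 'YYYY-MM-DD')."""
    # Split on the LAST underscore only, so company names may contain underscores.
    company, date_str = name.rsplit("_", 1)
    # Drop a file extension if one is present.
    date_str = date_str.split(".")[0]
    date = datetime.strptime(date_str, "%Y%m%d").strftime("%Y-%m-%d")
    return company, date

# Hypothetical folder names, used only to exercise the helper.
print(parse_call_folder("3M Company_20170425"))
print(parse_call_folder("A_B Corp_20170101.txt"))
```

Splitting from the right matters because only the trailing date is guaranteed not to contain an underscore.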


# Add TARGET of the regression

**n-day volatility predictions**: The predicted average volatility over the following n days.<br>

$$
v[0,n] = \ln \left( \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (r_i - \bar{r})^2 } \right)
$$

Where:
- $r_i$ is the stock return on day $i$,
- $\bar{r}$ is the average stock return over $n$ days.

The stock return $r_i$ is defined as:

$$
r_i = \frac{P_i - P_{i-1}}{P_{i-1}}
$$

Where $P_i$ is the adjusted closing price of the stock on day $i$.

For **single-day log volatility**, we estimate it using the **daily log absolute return**:

$$
v_n = \ln \left( \left| \frac{P_n - P_{n-1}}{P_{n-1}} \right| \right)
$$

Where:
- $P_n$ is the adjusted closing price of the stock on day $n$,
- $P_{n-1}$ is the adjusted closing price on the previous day.

Our multi-task learning objective is to **simultaneously predict** these two quantities:
- $v[0,n]$: The average volatility over $n$ days (the main task).
- $v_n$: The single-day volatility (the auxiliary task).
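
As a worked example of the two definitions, the sketch below evaluates both targets on a made-up price series with plain NumPy. It follows the formulas literally (a $1/n$ average over $n$ returns, which needs $n+1$ prices). Two things in the code cells differ from this and are worth keeping in mind: pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), and slicing `n_day` rows of prices yields `n_day - 1` returns. The function name `volatility_targets` and the prices are assumptions for illustration; the `eps` offset matches the `1e-10` used in the cells.

```python
import numpy as np

def volatility_targets(prices, n, eps=1e-10):
    """Return (v[0,n], v_n) from prices P_0..P_n, following the formulas literally."""
    prices = np.asarray(prices, dtype=float)[: n + 1]
    # r_i = (P_i - P_{i-1}) / P_{i-1}, for i = 1..n
    returns = np.diff(prices) / prices[:-1]
    # Population form: 1/n * sum (r_i - r_bar)^2, then sqrt, then ln
    avg_vol = np.log(np.sqrt(np.mean((returns - returns.mean()) ** 2)))
    # Single-day target: ln |r_n| (eps avoids log(0) when the price did not move)
    single = np.log(np.abs(returns[-1]) + eps)
    return avg_vol, single

# Made-up prices whose daily returns are +2%, -2%, +2%.
prices = [100.0, 102.0, 99.96, 101.9592]
avg_vol, single = volatility_targets(prices, n=3)
print(avg_vol, single)

# For comparison: the sample (ddof=1) version is always slightly larger.
returns = np.diff(prices) / np.asarray(prices)[:-1]
print(np.log(np.std(returns, ddof=1)))
```

For short windows such as $n = 3$ the gap between the `ddof=0` and `ddof=1` versions is not negligible, so it is worth deciding which one the targets are meant to follow.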


In [9]:
targets = filename_data.copy()

# Yahoo (missing 9 companies)
def add_n_day(row, n_day):
    Ticker = row['Ticker']
    if Ticker in ['GGP', 'CA', 'STI', 'FLT', 'NLSN', 'WRK','RTN', 'UTX', 'DISH']:
        return float('inf'), float('inf')
    Date = pd.to_datetime(row['Date'])
    Date_index = yahoo_data.index.get_loc(Date)
    data = yahoo_data.iloc[Date_index:(Date_index + n_day)][f"{Ticker}_Adj Close"]  
    # # calendar days? 
    # start = Date # - pd.Timedelta(days=(1))
    # end = Date + pd.Timedelta(days=(n_day))
    # data = yahoo_data.loc[start:end, f"{Ticker}_Adj Close"]  
    stock_return = data.pct_change().dropna() # ri =(Pi − Pi−1)/Pi−1
    std_dev = stock_return.std()
    if pd.isna(std_dev) or std_dev == 0:
        return 0, 0
    else:
        return np.log(std_dev), np.log(abs(stock_return.iloc[-1]) + 1e-10) #log 0 is undefined

targets['3_day'], targets['3_day_single'] = zip(*targets.apply(lambda row: add_n_day(row, 3), axis=1))
targets['7_day'], targets['7_day_single'] = zip(*targets.apply(lambda row: add_n_day(row, 7), axis=1))
targets['15_day'], targets['15_day_single'] = zip(*targets.apply(lambda row: add_n_day(row, 15), axis=1))
targets['30_day'], targets['30_day_single'] = zip(*targets.apply(lambda row: add_n_day(row, 30), axis=1))
targets.info(verbose=True)
targets

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 572 entries, 0 to 571
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company        572 non-null    object 
 1   Date           572 non-null    object 
 2   Ticker         572 non-null    object 
 3   3_day          572 non-null    float64
 4   3_day_single   572 non-null    float64
 5   7_day          572 non-null    float64
 6   7_day_single   572 non-null    float64
 7   15_day         572 non-null    float64
 8   15_day_single  572 non-null    float64
 9   30_day         572 non-null    float64
 10  30_day_single  572 non-null    float64
dtypes: float64(8), object(3)
memory usage: 49.3+ KB


| | Company | Date | Ticker | 3_day | 3_day_single | 7_day | 7_day_single | 15_day | 15_day_single | 30_day | 30_day_single |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3M Company | 2017-04-25 | MMM | -5.404551 | -5.168630 | -5.227834 | -5.185492 | -5.259718 | -5.222715 | -5.102879 | -5.539662 |
| 1 | 3M Company | 2017-07-25 | MMM | -5.318414 | -5.273632 | -5.186644 | -4.512109 | -5.216313 | -4.998475 | -5.094012 | -4.368167 |
| 2 | A.O. Smith Corp | 2017-07-26 | AOS | -4.988524 | -4.954916 | -4.684553 | -5.768826 | -4.814348 | -5.552145 | -4.790013 | -8.623037 |
| 3 | Abbott Laboratories | 2017-10-18 | ABT | -6.790460 | -5.164792 | -5.008287 | -8.621272 | -4.663901 | -5.904620 | -4.776023 | -4.846676 |
| 4 | AbbVie Inc. | 2017-04-27 | ABBV | -4.936198 | -4.804931 | -5.347910 | -5.635156 | -4.685676 | -3.810418 | -4.899575 | -8.837683 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 567 | Xerox | 2017-08-01 | XRX | -4.445102 | -4.281294 | -4.816060 | -8.079787 | -4.456054 | -4.962510 | -4.514521 | -3.768720 |
| 568 | Xilinx | 2017-04-26 | AMD | -3.586806 | -3.750975 | -2.295418 | -3.578718 | -2.547102 | -2.150187 | -2.800488 | -3.537225 |
| 569 | XL Group | 2017-10-24 | AXS | -4.062756 | -3.593155 | -4.458956 | -7.907852 | -4.724728 | -4.653762 | -4.581437 | -4.306009 |
| 570 | Yum! Brands Inc | 2017-02-09 | YUM | -5.248287 | -6.426674 | -5.382302 | -7.444556 | -4.815293 | -5.204771 | -5.058510 | -6.114364 |


In [10]:
# using Alpha Vantage
def add_n_day(row, n_day):
    Ticker = row['Ticker']
    Date = pd.to_datetime(row['Date'])
    Date_index = alpha_data.index.get_loc(Date)
    data = alpha_data.iloc[Date_index:(Date_index + n_day)][Ticker]  
    # # calendar days
    # start = Date #- pd.Timedelta(days=1)
    # end = Date + pd.Timedelta(days=n_day)
    # data = alpha_data.loc[start:end, Ticker]
    stock_return = data.pct_change().dropna() # ri =(Pi − Pi−1)/Pi−1
    std_dev = stock_return.std()
    if pd.isna(std_dev) or std_dev == 0:
        return 0, 0
    else:
        return np.log(std_dev), np.log(abs(stock_return.iloc[-1]) + 1e-10)

targets['3_day_alpha'], targets['3_day_single_alpha'] = zip(*targets.apply(lambda row: add_n_day(row, 3), axis=1))
targets['7_day_alpha'], targets['7_day_single_alpha'] = zip(*targets.apply(lambda row: add_n_day(row, 7), axis=1))
targets['15_day_alpha'], targets['15_day_single_alpha'] = zip(*targets.apply(lambda row: add_n_day(row, 15), axis=1))
targets['30_day_alpha'], targets['30_day_single_alpha'] = zip(*targets.apply(lambda row: add_n_day(row, 30), axis=1))

Here I compare the targets computed from Yahoo data with those computed from Alpha Vantage data. Open questions: why do they differ, and how does each source calculate its dividend adjustment?
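
The comparison columns follow one pattern per horizon, so they can also be generated in a loop. This is a minimal sketch under the assumption that the frame already holds `{n}_day`, `{n}_day_single` and their `_alpha` counterparts, as in the cells here; the helper name `add_comparison_columns` and the toy numbers are made up for illustration.

```python
import pandas as pd

def add_comparison_columns(df, horizons=(3, 7, 15, 30)):
    """Add '<col>_diff' and '<col>_pct_change' for every Yahoo/alpha column pair."""
    out = df.copy()
    for n in horizons:
        for suffix in ("", "_single"):
            base = f"{n}_day{suffix}"          # Yahoo-based target
            alpha = f"{base}_alpha"            # Alpha Vantage-based target
            out[f"{base}_diff"] = out[alpha] - out[base]
            # Absolute percentage difference relative to the alpha value
            out[f"{base}_pct_change"] = ((out[alpha] - out[base]) / out[alpha]).abs() * 100
    return out

# Toy values only; the first row agrees, the second does not.
toy = pd.DataFrame({
    "3_day": [-5.0, -4.0],
    "3_day_alpha": [-5.0, -4.4],
    "3_day_single": [-6.0, -5.0],
    "3_day_single_alpha": [-6.0, -5.5],
})
result = add_comparison_columns(toy, horizons=(3,))
print(result[["3_day_diff", "3_day_pct_change", "3_day_single_pct_change"]])
```

Because the targets are log values, a percentage difference is taken relative to a logarithm and grows large when the reference value is close to zero, so the raw `_diff` columns are the easier ones to interpret.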

In [12]:
targets['3_day_diff'] = targets['3_day_alpha'] - targets['3_day']
targets['7_day_diff'] = targets['7_day_alpha'] - targets['7_day']
targets['15_day_diff'] = targets['15_day_alpha'] - targets['15_day']
targets['30_day_diff'] = targets['30_day_alpha'] - targets['30_day']

targets['3_day_pct_change'] = abs(((targets['3_day_alpha'] - targets['3_day']) / targets['3_day_alpha']) * 100)
targets['7_day_pct_change'] = abs(((targets['7_day_alpha'] - targets['7_day']) / targets['7_day_alpha']) * 100)
targets['15_day_pct_change'] = abs(((targets['15_day_alpha'] - targets['15_day']) / targets['15_day_alpha']) * 100)
targets['30_day_pct_change'] = abs(((targets['30_day_alpha'] - targets['30_day']) / targets['30_day_alpha']) * 100)
 
targets['3_day_single_diff'] = targets['3_day_single_alpha'] - targets['3_day_single']
targets['7_day_single_diff'] = targets['7_day_single_alpha'] - targets['7_day_single']
targets['15_day_single_diff'] = targets['15_day_single_alpha'] - targets['15_day_single']
targets['30_day_single_diff'] = targets['30_day_single_alpha'] - targets['30_day_single']

targets['3_day_single_pct_change'] = abs(((targets['3_day_single_alpha'] - targets['3_day_single']) / targets['3_day_single_alpha']) * 100)
targets['7_day_single_pct_change'] = abs(((targets['7_day_single_alpha'] - targets['7_day_single']) / targets['7_day_single_alpha']) * 100)
targets['15_day_single_pct_change'] = abs(((targets['15_day_single_alpha'] - targets['15_day_single']) / targets['15_day_single_alpha']) * 100)
targets['30_day_single_pct_change'] = abs(((targets['30_day_single_alpha'] - targets['30_day_single']) / targets['30_day_single_alpha']) * 100)

# investigate discrepancies
targets = targets.sort_values(by='30_day_pct_change', ascending=False)
targets = targets.sort_values(by='30_day_single_pct_change', ascending=False)
targets.to_csv('data/data_prep/temp.csv', index=False) 

In [13]:
# Replace Yahoo sentinels (inf = skipped ticker, 0 = no usable data) with the Alpha Vantage values
targets['3_day_single'] = np.where(targets['3_day'] == float('inf'), targets['3_day_single_alpha'], targets['3_day_single'])
targets['7_day_single'] = np.where(targets['7_day'] == float('inf'), targets['7_day_single_alpha'], targets['7_day_single'])
targets['15_day_single'] = np.where(targets['15_day'] == float('inf'), targets['15_day_single_alpha'], targets['15_day_single'])
targets['30_day_single'] = np.where(targets['30_day'] == float('inf'), targets['30_day_single_alpha'], targets['30_day_single'])
targets['3_day'] = np.where(targets['3_day'] == float('inf'), targets['3_day_alpha'], targets['3_day'])
targets['7_day'] = np.where(targets['7_day'] == float('inf'), targets['7_day_alpha'], targets['7_day'])
targets['15_day'] = np.where(targets['15_day'] == float('inf'), targets['15_day_alpha'], targets['15_day'])
targets['30_day'] = np.where(targets['30_day'] == float('inf'), targets['30_day_alpha'], targets['30_day'])

targets['3_day_single'] = np.where(targets['3_day'] == 0, targets['3_day_single_alpha'], targets['3_day_single'])
targets['7_day_single'] = np.where(targets['7_day'] == 0, targets['7_day_single_alpha'], targets['7_day_single'])
targets['15_day_single'] = np.where(targets['15_day'] == 0, targets['15_day_single_alpha'], targets['15_day_single'])
targets['30_day_single'] = np.where(targets['30_day'] == 0, targets['30_day_single_alpha'], targets['30_day_single'])
targets['3_day'] = np.where(targets['3_day'] == 0, targets['3_day_alpha'], targets['3_day'])
targets['7_day'] = np.where(targets['7_day'] == 0, targets['7_day_alpha'], targets['7_day'])
targets['15_day'] = np.where(targets['15_day'] == 0, targets['15_day_alpha'], targets['15_day'])
targets['30_day'] = np.where(targets['30_day'] == 0, targets['30_day_alpha'], targets['30_day'])

print('Number of rows without data-', len(targets[((targets['3_day'] == 0) & 
                                                    (targets['7_day'] == 0) & 
                                                    (targets['15_day'] == 0) & 
                                                    (targets['30_day'] == 0))]))
print('----------------------------------------------------------------------')

# delete?
# targets = targets[~((targets['3_day'] == 0) & 
#                 (targets['7_day'] == 0) & 
#                 (targets['15_day'] == 0) & 
#                 (targets['30_day'] == 0))]

Number of rows without data- 7
----------------------------------------------------------------------
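
The fallback rule in cell 13 is the same for every horizon: where the Yahoo-based target is a sentinel (`inf` for a skipped ticker, `0` for no usable data), take the Alpha Vantage value instead. The sketch below expresses that rule as a loop on a toy frame. The helper name `fill_from_alpha` and the numbers are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

def fill_from_alpha(df, horizons=(3, 7, 15, 30)):
    """Replace sentinel Yahoo targets with the corresponding '_alpha' values."""
    out = df.copy()
    for n in horizons:
        main, single = f"{n}_day", f"{n}_day_single"
        # Sentinels written by add_n_day: +inf (skipped ticker) or 0 (no usable data)
        missing = (out[main] == np.inf) | (out[main] == 0)
        # Fill the auxiliary target first, since the mask is based on the main target
        out.loc[missing, single] = out.loc[missing, f"{single}_alpha"]
        out.loc[missing, main] = out.loc[missing, f"{main}_alpha"]
    return out

toy = pd.DataFrame({
    "3_day": [np.inf, -5.0, 0.0],
    "3_day_single": [np.inf, -5.2, 0.0],
    "3_day_alpha": [-4.5, -5.1, 0.0],          # last row: missing in both sources
    "3_day_single_alpha": [-4.7, -5.3, 0.0],
})
filled = fill_from_alpha(toy, horizons=(3,))
print(filled[["3_day", "3_day_single"]])
print("rows without data:", int((filled["3_day"] == 0).sum()))
```

Rows that are still `0` after the fill have no data in either source, which is what the count printed above reports.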


In [14]:
targets = targets.drop(['3_day_alpha', '3_day_single_alpha', '7_day_alpha', '7_day_single_alpha', '15_day_alpha', 
                        '15_day_single_alpha', '30_day_alpha', '30_day_single_alpha', '3_day_diff', '7_day_diff', 
                        '15_day_diff', '30_day_diff', '3_day_pct_change', '7_day_pct_change', '15_day_pct_change', 
                        '30_day_pct_change', '3_day_single_diff', '7_day_single_diff', '15_day_single_diff', 
                        '30_day_single_diff', '3_day_single_pct_change', '7_day_single_pct_change', 
                        '15_day_single_pct_change', '30_day_single_pct_change'], axis=1)
targets = targets.sort_values(by='3_day', ascending=True)
targets.info(verbose=True)
### save ############################################
targets.to_csv('data/data_prep/targets.csv', index=False)
#####################################################

<class 'pandas.core.frame.DataFrame'>
Index: 572 entries, 37 to 273
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Company        572 non-null    object 
 1   Date           572 non-null    object 
 2   Ticker         572 non-null    object 
 3   3_day          572 non-null    float64
 4   3_day_single   572 non-null    float64
 5   7_day          572 non-null    float64
 6   7_day_single   572 non-null    float64
 7   15_day         572 non-null    float64
 8   15_day_single  572 non-null    float64
 9   30_day         572 non-null    float64
 10  30_day_single  572 non-null    float64
dtypes: float64(8), object(3)
memory usage: 53.6+ KB


# MAEC dataset TARGETS

[paper](https://dl.acm.org/doi/10.1145/3340531.3412879)
[GitHub](https://github.com/Earnings-Call-Dataset/MAEC-A-Multimodal-Aligned-Earnings-Conference-Call-Dataset-for-Financial-Risk-Prediction/tree/master)

In [16]:
MAEC_targets = MAEC_filename_data.copy()

# Yahoo (missing 9 companies)
def add_n_day(row, n_day):
    Ticker = row['Ticker']
    Date = pd.to_datetime(row['Date'])
    Date_index = MAEC_yahoo_data.index.get_loc(Date)
    data = MAEC_yahoo_data.iloc[Date_index:(Date_index + n_day)][f"{Ticker}_Adj Close"]  
    # # calendar days? 
    # start = Date # - pd.Timedelta(days=(1))
    # end = Date + pd.Timedelta(days=(n_day))
    # data = MAEC_yahoo_data.loc[start:end, f"{Ticker}_Adj Close"]  
    stock_return = data.pct_change().dropna() # ri =(Pi − Pi−1)/Pi−1
    std_dev = stock_return.std()
    if pd.isna(std_dev) or std_dev == 0:
        return 0, 0
    else:
        return np.log(std_dev), np.log(abs(stock_return.iloc[-1]) + 1e-10)

MAEC_targets['3_day'], MAEC_targets['3_day_single'] = zip(*MAEC_targets.apply(lambda row: add_n_day(row, 3), axis=1))
MAEC_targets['7_day'], MAEC_targets['7_day_single'] = zip(*MAEC_targets.apply(lambda row: add_n_day(row, 7), axis=1))
MAEC_targets['15_day'], MAEC_targets['15_day_single'] = zip(*MAEC_targets.apply(lambda row: add_n_day(row, 15), axis=1))
MAEC_targets['30_day'], MAEC_targets['30_day_single'] = zip(*MAEC_targets.apply(lambda row: add_n_day(row, 30), axis=1))
MAEC_targets.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3443 entries, 0 to 3442
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Ticker         3443 non-null   object 
 1   Date           3443 non-null   object 
 2   3_day          3443 non-null   float64
 3   3_day_single   3443 non-null   float64
 4   7_day          3443 non-null   float64
 5   7_day_single   3443 non-null   float64
 6   15_day         3443 non-null   float64
 7   15_day_single  3443 non-null   float64
 8   30_day         3443 non-null   float64
 9   30_day_single  3443 non-null   float64
dtypes: float64(8), object(2)
memory usage: 269.1+ KB


In [17]:
# 58 tickers that did not return data from Alpha Vantage
bad_tickers = ['GPS', 'JCP', 'TUP', 'BBT', 'MDP', 'LL', 'ABC', 'PKI', 'HFC', 'HSC', 'CBB', 'ILG', 'JCOM', 'EBIX', 'ENDP', 
               'BIG', 'ASNA', 'IVC', 'BCOR', 'INT', 'FRED', 'CAMP', 'ELY', 'COG', 'CLD', 'CRY', 'PLT', 'FRAN', 'ADS', 'CPSI', 
               'FBHS', 'CHFC', 'UIHC', 'OFC', 'TMST', 'FTD', 'SWM', 'WPG', 'WLTW', 'AKRX', 'BLL', 'DF', 'TLRD', 'SPN', 'CLI', 
               'ESV', 'RCII', 'ANTM', 'RE', 'NCR', 'NEWM', 'PEI', 'LCI', 'ERA', 'ACOR', 'FB', 'AAXN', 'NLS']
def add_n_day(row, n_day):
    Ticker = row['Ticker']
    if Ticker in bad_tickers:
        return float('inf'), float('inf')
    Date = pd.to_datetime(row['Date'])
    Date_index = MAEC_alpha_data.index.get_loc(Date)
    data = MAEC_alpha_data.iloc[Date_index:(Date_index + n_day)][Ticker]  
    # # calendar days
    # start = Date #- pd.Timedelta(days=1)
    # end = Date + pd.Timedelta(days=n_day)
    # data = alpha_data.loc[start:end, Ticker]
    stock_return = data.pct_change().dropna() # ri =(Pi − Pi−1)/Pi−1
    std_dev = stock_return.std()
    if pd.isna(std_dev) or std_dev == 0:
        return 0, 0
    else:
        return np.log(std_dev), np.log(abs(stock_return.iloc[-1]) + 1e-10)

MAEC_targets['3_day_alpha'], MAEC_targets['3_day_single_alpha'] = zip(*MAEC_targets.apply(lambda row: add_n_day(row, 3), axis=1))
MAEC_targets['7_day_alpha'], MAEC_targets['7_day_single_alpha'] = zip(*MAEC_targets.apply(lambda row: add_n_day(row, 7), axis=1))
MAEC_targets['15_day_alpha'], MAEC_targets['15_day_single_alpha'] = zip(*MAEC_targets.apply(lambda row: add_n_day(row, 15), axis=1))
MAEC_targets['30_day_alpha'], MAEC_targets['30_day_single_alpha'] = zip(*MAEC_targets.apply(lambda row: add_n_day(row, 30), axis=1))

In [18]:
MAEC_targets['3_day_diff'] = MAEC_targets['3_day_alpha'] - MAEC_targets['3_day']
MAEC_targets['7_day_diff'] = MAEC_targets['7_day_alpha'] - MAEC_targets['7_day']
MAEC_targets['15_day_diff'] = MAEC_targets['15_day_alpha'] - MAEC_targets['15_day']
MAEC_targets['30_day_diff'] = MAEC_targets['30_day_alpha'] - MAEC_targets['30_day']

MAEC_targets['3_day_pct_change'] = abs(((MAEC_targets['3_day_alpha'] - MAEC_targets['3_day']) / MAEC_targets['3_day_alpha']) * 100)
MAEC_targets['7_day_pct_change'] = abs(((MAEC_targets['7_day_alpha'] - MAEC_targets['7_day']) / MAEC_targets['7_day_alpha']) * 100)
MAEC_targets['15_day_pct_change'] = abs(((MAEC_targets['15_day_alpha'] - MAEC_targets['15_day']) / MAEC_targets['15_day_alpha']) * 100)
MAEC_targets['30_day_pct_change'] = abs(((MAEC_targets['30_day_alpha'] - MAEC_targets['30_day']) / MAEC_targets['30_day_alpha']) * 100)
 
MAEC_targets['3_day_single_diff'] = MAEC_targets['3_day_single_alpha'] - MAEC_targets['3_day_single']
MAEC_targets['7_day_single_diff'] = MAEC_targets['7_day_single_alpha'] - MAEC_targets['7_day_single']
MAEC_targets['15_day_single_diff'] = MAEC_targets['15_day_single_alpha'] - MAEC_targets['15_day_single']
MAEC_targets['30_day_single_diff'] = MAEC_targets['30_day_single_alpha'] - MAEC_targets['30_day_single']

MAEC_targets['3_day_single_pct_change'] = abs(((MAEC_targets['3_day_single_alpha'] - MAEC_targets['3_day_single']) / MAEC_targets['3_day_single_alpha']) * 100)
MAEC_targets['7_day_single_pct_change'] = abs(((MAEC_targets['7_day_single_alpha'] - MAEC_targets['7_day_single']) / MAEC_targets['7_day_single_alpha']) * 100)
MAEC_targets['15_day_single_pct_change'] = abs(((MAEC_targets['15_day_single_alpha'] - MAEC_targets['15_day_single']) / MAEC_targets['15_day_single_alpha']) * 100)
MAEC_targets['30_day_single_pct_change'] = abs(((MAEC_targets['30_day_single_alpha'] - MAEC_targets['30_day_single']) / MAEC_targets['30_day_single_alpha']) * 100)
# investigate discrepancies
MAEC_targets = MAEC_targets.sort_values(by='30_day_pct_change', ascending=False)
MAEC_targets.to_csv('data/data_prep/MAEC_temp.csv', index=False)

In [19]:
# Replace Yahoo sentinels (0 = no usable data) with the Alpha Vantage values
MAEC_targets['3_day_single'] = np.where(MAEC_targets['3_day'] == float('inf'), MAEC_targets['3_day_single_alpha'], MAEC_targets['3_day_single'])
MAEC_targets['7_day_single'] = np.where(MAEC_targets['7_day'] == float('inf'), MAEC_targets['7_day_single_alpha'], MAEC_targets['7_day_single'])
MAEC_targets['15_day_single'] = np.where(MAEC_targets['15_day'] == float('inf'), MAEC_targets['15_day_single_alpha'], MAEC_targets['15_day_single'])
MAEC_targets['30_day_single'] = np.where(MAEC_targets['30_day'] == float('inf'), MAEC_targets['30_day_single_alpha'], MAEC_targets['30_day_single'])
MAEC_targets['3_day'] = np.where(MAEC_targets['3_day'] == float('inf'), MAEC_targets['3_day_alpha'], MAEC_targets['3_day'])
MAEC_targets['7_day'] = np.where(MAEC_targets['7_day'] == float('inf'), MAEC_targets['7_day_alpha'], MAEC_targets['7_day'])
MAEC_targets['15_day'] = np.where(MAEC_targets['15_day'] == float('inf'), MAEC_targets['15_day_alpha'], MAEC_targets['15_day'])
MAEC_targets['30_day'] = np.where(MAEC_targets['30_day'] == float('inf'), MAEC_targets['30_day_alpha'], MAEC_targets['30_day'])

MAEC_targets['3_day_single'] = np.where(MAEC_targets['3_day'] == 0, MAEC_targets['3_day_single_alpha'], MAEC_targets['3_day_single'])
MAEC_targets['7_day_single'] = np.where(MAEC_targets['7_day'] == 0, MAEC_targets['7_day_single_alpha'], MAEC_targets['7_day_single'])
MAEC_targets['15_day_single'] = np.where(MAEC_targets['15_day'] == 0, MAEC_targets['15_day_single_alpha'], MAEC_targets['15_day_single'])
MAEC_targets['30_day_single'] = np.where(MAEC_targets['30_day'] == 0, MAEC_targets['30_day_single_alpha'], MAEC_targets['30_day_single'])
MAEC_targets['3_day'] = np.where(MAEC_targets['3_day'] == 0, MAEC_targets['3_day_alpha'], MAEC_targets['3_day'])
MAEC_targets['7_day'] = np.where(MAEC_targets['7_day'] == 0, MAEC_targets['7_day_alpha'], MAEC_targets['7_day'])
MAEC_targets['15_day'] = np.where(MAEC_targets['15_day'] == 0, MAEC_targets['15_day_alpha'], MAEC_targets['15_day'])
MAEC_targets['30_day'] = np.where(MAEC_targets['30_day'] == 0, MAEC_targets['30_day_alpha'], MAEC_targets['30_day'])

MAEC_targets.replace([float('inf'), -float('inf')], 0, inplace=True)
print('Number of rows without data-', len(MAEC_targets[((MAEC_targets['3_day'] == 0) & 
                                                        (MAEC_targets['7_day'] == 0) & 
                                                        (MAEC_targets['15_day'] == 0)& 
                                                        (MAEC_targets['30_day'] == 0))]))
print('----------------------------------------------------------------------')

# delete?
# MAEC_targets = MAEC_targets[~((MAEC_targets['3_day'] == 0) & 
#                 (MAEC_targets['7_day'] == 0) & 
#                 (MAEC_targets['15_day'] == 0) & 
#                 (MAEC_targets['30_day'] == 0))]

Number of rows without data- 218
----------------------------------------------------------------------


In [20]:
MAEC_targets = MAEC_targets.drop(['3_day_alpha', '3_day_single_alpha', '7_day_alpha', '7_day_single_alpha', '15_day_alpha', 
                        '15_day_single_alpha', '30_day_alpha', '30_day_single_alpha', '3_day_diff', '7_day_diff', 
                        '15_day_diff', '30_day_diff', '3_day_pct_change', '7_day_pct_change', '15_day_pct_change', 
                        '30_day_pct_change', '3_day_single_diff', '7_day_single_diff', '15_day_single_diff', 
                        '30_day_single_diff', '3_day_single_pct_change', '7_day_single_pct_change', 
                        '15_day_single_pct_change', '30_day_single_pct_change'], axis=1)
MAEC_targets = MAEC_targets.sort_values(by='7_day', ascending=True)
MAEC_targets.info(verbose=True)
### save ############################################
MAEC_targets.to_csv('data/data_prep/MAEC_targets.csv', index=False)
#####################################################

<class 'pandas.core.frame.DataFrame'>
Index: 3443 entries, 1635 to 3441
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Ticker         3443 non-null   object 
 1   Date           3443 non-null   object 
 2   3_day          3443 non-null   float64
 3   3_day_single   3443 non-null   float64
 4   7_day          3443 non-null   float64
 5   7_day_single   3443 non-null   float64
 6   15_day         3443 non-null   float64
 7   15_day_single  3443 non-null   float64
 8   30_day         3443 non-null   float64
 9   30_day_single  3443 non-null   float64
dtypes: float64(8), object(2)
memory usage: 295.9+ KB


# Clean Praat features

- Replace '--undefined--' data
- Convert to numeric float64
- Compare our Praat features for MAEC, which we calculated from the MP3 files, with the Praat features provided in the MAEC [GitHub](https://github.com/Earnings-Call-Dataset/MAEC-A-Multimodal-Aligned-Earnings-Conference-Call-Dataset-for-Financial-Risk-Prediction/tree/master) repository
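
The cleaning steps listed above can be tried on a tiny frame first. This is a minimal sketch with made-up values; only the `'--undefined--'` markers and the replace, `pd.to_numeric`, median-fill sequence come from this notebook, and the column subset is chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows imitating raw Praat output, where failed measurements are text markers.
raw = pd.DataFrame({
    "Mean pitch": ["122.5", "--undefined--", "117.1"],
    "Mean HNR": ["3.4", "4.9", "--undefined-- "],
    "audio_file": ["a.mp3", "b.mp3", "c.mp3"],
})

# 1) Turn every known marker variant into NaN
clean = raw.replace(["--undefined--", "--undefined-", "--undefined-- ", "--"], np.nan)

# 2) Convert everything except identifier columns to numeric
num_cols = clean.columns.difference(["audio_file"])
clean[num_cols] = clean[num_cols].apply(pd.to_numeric)

# 3) Impute each column's median for the missing measurements
clean[num_cols] = clean[num_cols].fillna(clean[num_cols].median())

print(clean)
print(clean.dtypes)
```

Median imputation keeps every sentence in the dataset, at the cost of replacing an undefined measurement with a typical value from other recordings.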

In [22]:
praat_features = praat_features.replace(['--undefined--', '--undefined-', '--undefined-- ',  '--'], np.nan)

# Convert all columns into float64 except 'Company', 'Date', 'audio_file'
cols_to_convert = praat_features.columns.difference(['Company', 'Date', 'audio_file'])
praat_features[cols_to_convert] = praat_features[cols_to_convert].apply(pd.to_numeric)
# impute median on all NULL (--undefined--)
praat_features[cols_to_convert] = praat_features[cols_to_convert].apply(
    lambda col: col.fillna(col.median()) if pd.api.types.is_numeric_dtype(col) else col
)
praat_features.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89722 entries, 0 to 89721
Data columns (total 33 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Mean pitch                    89722 non-null  float64
 1   Standard deviation            89722 non-null  float64
 2   Minimum pitch                 89722 non-null  float64
 3   Maximum pitch                 89722 non-null  float64
 4   Number of pulses              89722 non-null  float64
 5   Number of periods             89722 non-null  float64
 6   Mean period                   89722 non-null  float64
 7   Mean intensity                89722 non-null  float64
 8   Minimum intensity             89722 non-null  float64
 9   Maximum intensity             89722 non-null  float64
 10  Standard deviation of period  89722 non-null  float64
 11  Fraction of unvoiced          89722 non-null  float64
 12  Number of voice breaks        89722 non-null  float64
 13  D

In [23]:
MAEC_praat_features = MAEC_praat_features.replace(['--undefined--', '--undefined-', '--undefined-- ',  '--'], np.nan)

# Convert all columns into float64 except 'Ticker', 'Date', 'audio_file'
cols_to_convert = MAEC_praat_features.columns.difference(['Ticker', 'Date', 'audio_file'])
MAEC_praat_features[cols_to_convert] = MAEC_praat_features[cols_to_convert].apply(pd.to_numeric)
# impute median on all NULL (--undefined--)
MAEC_praat_features[cols_to_convert] = MAEC_praat_features[cols_to_convert].apply(
    lambda col: col.fillna(col.median()) if pd.api.types.is_numeric_dtype(col) else col
)
MAEC_praat_features.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394277 entries, 0 to 394276
Data columns (total 33 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   Mean pitch                    394277 non-null  float64
 1   Standard deviation            394277 non-null  float64
 2   Minimum pitch                 394277 non-null  float64
 3   Maximum pitch                 394277 non-null  float64
 4   Number of pulses              394277 non-null  int64  
 5   Number of periods             394277 non-null  int64  
 6   Mean period                   394277 non-null  float64
 7   Mean intensity                394277 non-null  float64
 8   Minimum intensity             394277 non-null  float64
 9   Maximum intensity             394277 non-null  float64
 10  Standard deviation of period  394277 non-null  float64
 11  Fraction of unvoiced          394277 non-null  float64
 12  Number of voice breaks        394277 non-nul

Load the Praat features provided in the MAEC GitHub repository so they can be compared with ours.

In [25]:
github_frames = []  # collect one frame per call and concatenate once at the end
def each_row(row):
    Ticker = row['Ticker']
    Date = row['Date'].replace('-', '')  # folder names use YYYYMMDD

    features_df = pd.read_csv(f"data/MAEC/MAEC_Dataset/{Date}_{Ticker}/features.csv")
    features_df['Ticker'] = Ticker
    features_df['Date'] = Date
    features_df['Sentence_num'] = range(1, len(features_df) + 1)
    github_frames.append(features_df)

MAEC_filename_data.progress_apply(each_row, axis=1)
# A single concat avoids re-copying the growing frame on every row
MAEC_GitHub_features = pd.concat(github_frames, ignore_index=True)

MAEC_GitHub_features = MAEC_GitHub_features.replace(['--undefined--', '--undefined-', '--undefined-- ',  '--'], np.nan)
# Convert all columns into float64 except 'Ticker', 'Date', 'audio_file' 
cols_to_convert = MAEC_GitHub_features.columns.difference(['Ticker', 'Date', 'audio_file'])
MAEC_GitHub_features[cols_to_convert] = MAEC_GitHub_features[cols_to_convert].apply(pd.to_numeric)
# impute median on all NULL (--undefined--)
MAEC_GitHub_features[cols_to_convert] = MAEC_GitHub_features[cols_to_convert].apply(
    lambda col: col.fillna(col.median()) if pd.api.types.is_numeric_dtype(col) else col
)
MAEC_GitHub_features.to_csv('data/MAEC_GitHub_features.csv', index=False)
MAEC_GitHub_features.info(verbose=True)

100%|██████████| 3443/3443 [02:26<00:00, 23.43it/s]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394277 entries, 0 to 394276
Data columns (total 32 columns):
 #   Column                        Non-Null Count   Dtype  
---  ------                        --------------   -----  
 0   Mean pitch                    394277 non-null  float64
 1   Standard deviation            394277 non-null  float64
 2   Minimum pitch                 394277 non-null  float64
 3   Maximum pitch                 394277 non-null  float64
 4   Mean intensity                394277 non-null  float64
 5   Minimum intensity             394277 non-null  float64
 6   Maximum intensity             394277 non-null  float64
 7   Number of pulses              394277 non-null  float64
 8   Number of periods             394277 non-null  float64
 9   Mean period                   394277 non-null  float64
 10  Standard deviation of period  394277 non-null  float64
 11  Fraction of unvoiced          394277 non-null  float64
 12  Number of voice breaks        394277 non-nul
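
The cleaning step above (turn Praat's `--undefined--` markers into missing values, convert to numbers, impute the column median) can be illustrated on a toy frame. This is a minimal sketch with made-up values; it swaps the explicit `replace` list for `pd.to_numeric(..., errors="coerce")`, which turns any unparseable string into `NaN` and would therefore also hide other malformed values.

```python
import pandas as pd

# Toy stand-in for a features.csv file; the values are made up for illustration.
raw = pd.DataFrame({
    "Mean pitch": ["120.5", "--undefined--", "130.5"],
    "Jitter local": ["1.0", "3.0", "--undefined--"],
    "Ticker": ["LMAT", "LMAT", "LMAT"],
})

cleaned = raw.copy()
numeric_cols = cleaned.columns.difference(["Ticker"])

# Unparseable strings such as '--undefined--' become NaN.
cleaned[numeric_cols] = raw[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Fill each column's missing values with that column's median.
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())

print(cleaned)
```

The median is computed per column, so one undefined sentence does not affect the other features of that sentence.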

# Compare
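These are very close. That's good. It gives some assurance that the original dataset's Praat features are okay. One visible difference: in the GitHub copy, the values under `Minimum intensity` and `Maximum intensity` appear swapped (in the first row, 59.81 is listed as the minimum and 33.25 as the maximum).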

In [27]:
MAEC_praat_features

Unnamed: 0,Mean pitch,Standard deviation,Minimum pitch,Maximum pitch,Number of pulses,Number of periods,Mean period,Mean intensity,Minimum intensity,Maximum intensity,...,Shimmer apq11,Shimmer dda,Mean autocorrelation,Mean NHR,Mean HNR,Audio Length,Ticker,Date,Sentence_num,audio_file
0,122.509,4.582,119.282,132.486,15,14,0.008145,48.617273,33.245858,59.811569,...,30.665,18.235,0.675386,0.524708,3.356,0.370979,LMAT,20150225,2,LMAT_20150225_f000002100.mp3
1,137.390,76.065,91.597,572.261,103,96,0.007461,42.840412,17.956942,61.970121,...,17.935,29.632,0.712884,0.482183,4.893,2.506979,LMAT,20150225,3,LMAT_20150225_f000002101.mp3
2,117.168,11.177,91.897,150.239,283,266,0.008527,43.585335,14.603231,68.719015,...,12.035,23.836,0.717230,0.468501,4.744,5.170979,LMAT,20150225,4,LMAT_20150225_f000002102.mp3
3,117.238,11.779,97.850,145.271,170,162,0.008542,42.735277,15.718385,63.319417,...,17.816,23.452,0.741878,0.407407,5.231,3.250979,LMAT,20150225,5,LMAT_20150225_f000002103.mp3
4,136.303,70.402,98.310,577.893,342,319,0.007383,44.440119,22.618467,62.993807,...,14.846,20.482,0.752116,0.394099,5.627,5.338979,LMAT,20150225,6,LMAT_20150225_f000002104.mp3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
394272,133.531,14.505,105.107,157.856,42,40,0.007475,59.591205,-15.285724,80.196919,...,6.825,10.370,0.800006,0.290069,6.972,0.743379,BKS,20180621,24,BKS_20180621_f000039100.mp3
394273,128.154,49.849,95.974,567.649,396,377,0.007820,67.647350,45.119193,80.862428,...,13.618,15.330,0.823743,0.265119,8.079,4.452766,BKS,20180621,25,BKS_20180621_f000039101.mp3
394274,130.803,12.533,102.188,160.576,315,304,0.007669,62.469106,38.641636,80.928417,...,11.752,14.277,0.824435,0.265622,8.167,4.452766,BKS,20180621,26,BKS_20180621_f000039102.mp3
394275,116.219,11.366,87.046,137.937,157,148,0.008614,61.795049,36.740310,80.262575,...,13.835,13.291,0.814544,0.285962,8.096,2.075624,BKS,20180621,27,BKS_20180621_f000039103.mp3


In [28]:
MAEC_GitHub_features

Unnamed: 0,Mean pitch,Standard deviation,Minimum pitch,Maximum pitch,Mean intensity,Minimum intensity,Maximum intensity,Number of pulses,Number of periods,Mean period,...,Shimmer apq5,Shimmer apq11,Shimmer dda,Mean autocorrelation,Mean NHR,Mean HNR,Audio Length,Ticker,Date,Sentence_num
0,122.509,4.582,119.282,132.486,48.617273,59.811569,33.245858,15.0,14.0,0.008145,...,11.513,30.665,18.235,0.675386,0.524708,3.356,0.370979,LMAT,20150225,1
1,137.390,76.065,91.597,572.261,42.840412,61.970121,17.956942,103.0,96.0,0.007461,...,12.248,17.935,29.632,0.712884,0.482183,4.893,2.506979,LMAT,20150225,2
2,117.168,11.177,91.897,150.239,43.585335,68.719015,14.603231,283.0,266.0,0.008527,...,10.633,12.035,23.836,0.717230,0.468501,4.744,5.170979,LMAT,20150225,3
3,117.238,11.779,97.850,145.271,42.735277,63.319417,15.718385,170.0,162.0,0.008542,...,9.040,17.816,23.452,0.741878,0.407407,5.231,3.250979,LMAT,20150225,4
4,136.303,70.402,98.310,577.893,44.440119,62.993807,22.618467,342.0,319.0,0.007383,...,8.630,14.846,20.482,0.752116,0.394099,5.627,5.338979,LMAT,20150225,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
394272,133.531,14.505,105.107,157.856,59.591205,80.196919,-15.285724,42.0,40.0,0.007475,...,4.486,6.825,10.370,0.800006,0.290069,6.972,0.743379,BKS,20180621,24
394273,128.154,49.849,95.974,567.649,67.647350,80.862428,45.119193,396.0,377.0,0.007820,...,7.697,13.618,15.330,0.823743,0.265119,8.079,4.452766,BKS,20180621,25
394274,130.803,12.533,102.188,160.576,62.469106,80.928417,38.641636,315.0,304.0,0.007669,...,5.904,11.752,14.277,0.824435,0.265622,8.167,4.452766,BKS,20180621,26
394275,116.219,11.366,87.046,137.937,61.795049,80.262575,36.740310,157.0,148.0,0.008614,...,7.349,13.835,13.291,0.814544,0.285962,8.096,2.075624,BKS,20180621,27
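
The comparison can also be done numerically instead of by eye. The sketch below uses two tiny stand-in frames whose values are copied from the first two rows displayed above; `compare_shared_columns` is a hypothetical helper, not part of the notebook, and it assumes both frames are already row-aligned.

```python
import numpy as np
import pandas as pd


def compare_shared_columns(left, right, atol=1e-6):
    """Fraction of rows that agree within tolerance, per shared numeric column.

    Hypothetical helper; assumes both frames have the same row order.
    """
    shared = [c for c in left.columns
              if c in right.columns and pd.api.types.is_numeric_dtype(left[c])]
    return pd.Series({
        c: float(np.isclose(left[c].to_numpy(dtype=float),
                            right[c].to_numpy(dtype=float),
                            atol=atol).mean())
        for c in shared
    })


# Stand-in data copied from the first two rows shown above.
ours = pd.DataFrame({
    "Mean pitch": [122.509, 137.390],
    "Minimum intensity": [33.245858, 17.956942],
    "Maximum intensity": [59.811569, 61.970121],
})
github = pd.DataFrame({
    "Mean pitch": [122.509, 137.390],
    "Minimum intensity": [59.811569, 61.970121],
    "Maximum intensity": [33.245858, 17.956942],
})

agreement = compare_shared_columns(ours, github)
print(agreement)
```

On these two rows the pitch column agrees, while both intensity columns disagree because the same values sit under swapped labels in the GitHub copy.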


# Final dataset

In [30]:

praat_features['Date'] = praat_features['Date'].astype('Int64')

##################################################################
# if you want to merge GloVe features 
# Merge GloVe and Praat features (each sentence is 327 features)
# features = pd.merge(praat_features, glove_features, how="left",on = ['Company','Date','Sentence_num'])
##################################################################
# Merge RoBERTa and Praat features (each sentence is 1051 features)
features = pd.merge(praat_features, RoBERTa_features, how="left",on = ['Company','Date','Sentence_num'])
features = features.drop(['Shimmer apq11','Audio Length','audio_file'], axis=1)

# The maximum sentences per meeting is 523 
# Ensure each meeting has 523 sentences, add zeros to the end
def add_zero_padding(group):
    complete_index = pd.Index(np.arange(1, 524), name='Sentence_num')
    group = group.set_index('Sentence_num').reindex(complete_index).reset_index()
    
    # restore the meeting keys on the padded rows
    # (fillna(method=...) is deprecated in recent pandas; ffill()/bfill() are equivalent)
    group['Company'] = group['Company'].ffill().bfill()
    group['Date'] = group['Date'].ffill().bfill()
    
    group.fillna(0.0, inplace=True)
    return group

features = features.groupby(['Company', 'Date']).apply(add_zero_padding).reset_index(drop=True)
features.fillna(0, inplace=True)
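
To make the padding step concrete, here is a minimal sketch on toy data, assuming a padded length of 4 instead of 523. `pad_group` and `MAX_LEN` are illustrative names, and the loop over `groupby` stands in for the `groupby(...).apply(...)` call used above.

```python
import numpy as np
import pandas as pd

MAX_LEN = 4  # illustrative; the notebook pads to 523


def pad_group(group, max_len=MAX_LEN):
    """Reindex one meeting to sentences 1..max_len and zero-fill the new rows."""
    full_index = pd.Index(np.arange(1, max_len + 1), name="Sentence_num")
    group = group.set_index("Sentence_num").reindex(full_index).reset_index()
    # Restore the meeting keys on the padded rows.
    for key in ("Company", "Date"):
        group[key] = group[key].ffill().bfill()
    # Zero-fill everything that is not a meeting key.
    value_cols = group.columns.difference(["Company", "Date"])
    group[value_cols] = group[value_cols].fillna(0.0)
    return group


# Toy data: meeting A has two sentences, meeting B has one.
toy = pd.DataFrame({
    "Company": ["A", "A", "B"],
    "Date": [20170101, 20170101, 20170202],
    "Sentence_num": [1, 2, 1],
    "feat": [0.5, 0.7, 0.9],
})

padded = pd.concat(
    [pad_group(g) for _, g in toy.groupby(["Company", "Date"])],
    ignore_index=True,
)
print(padded)
```

Every meeting ends up with exactly `MAX_LEN` rows, which is what makes the later reshape into a 3-D array possible.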

In [31]:
# Merge the targets 
# (will be duplicated 523 times, for each meeting, but that's okay for now)
# we need the date and company columns to sort
targets['Date'] = targets['Date'].astype(str).str.replace('-', '').astype('Int64')
original_dataset = pd.merge(features, targets, how="left",on = ['Company','Date'])

# Fill in null
rows_with_nulls = original_dataset[original_dataset.isnull().any(axis=1)]
# print(rows_with_nulls)
original_dataset.fillna(0, inplace=True)

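# Drop the 'Ticker' column, which is not used from here on
original_dataset = original_dataset.drop(['Ticker'], axis=1)
# Dump rows without targets (all four targets are 0 after the fillna above)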
original_dataset = original_dataset[~((original_dataset['3_day'] == 0) & 
                                    (original_dataset['7_day'] == 0) & 
                                    (original_dataset['15_day'] == 0) & 
                                    (original_dataset['30_day'] == 0))]

original_dataset = original_dataset.sort_values(by=['Date', 'Company', 'Sentence_num'], ascending=[True, True, True])

# original_dataset = original_dataset.groupby(['Company', 'Date'])
original_dataset.info(verbose=True)
###### save ############################################
original_dataset.to_csv('data/original_dataset.csv', index=False)
########################################################

<class 'pandas.core.frame.DataFrame'>
Index: 295495 entries, 73743 to 75311
Data columns (total 1062 columns):
 #     Column                        Dtype  
---    ------                        -----  
 0     Sentence_num                  int32  
 1     Mean pitch                    float64
 2     Standard deviation            float64
 3     Minimum pitch                 float64
 4     Maximum pitch                 float64
 5     Number of pulses              float64
 6     Number of periods             float64
 7     Mean period                   float64
 8     Mean intensity                float64
 9     Minimum intensity             float64
 10    Maximum intensity             float64
 11    Standard deviation of period  float64
 12    Fraction of unvoiced          float64
 13    Number of voice breaks        float64
 14    Degree of voice breaks        float64
 15    Jitter local                  float64
 16    Jitter local absolute         float64
 17    Jitter rap                 

In [32]:

original_dataset['Date'] = pd.to_datetime(original_dataset['Date'].astype(str), format='%Y%m%d')
train_features = original_dataset[original_dataset['Date'] <= '2017-08-02']
val_features = original_dataset[(original_dataset['Date'] >= '2017-08-03') & (original_dataset['Date'] < '2017-10-24')]
test_features = original_dataset[original_dataset['Date'] >= '2017-10-24']

print(len(train_features))
print(len(val_features))
print(len(test_features))

# Columns that identify a row or hold targets; everything else is a model feature
non_feature_cols = ['Company','Date','Sentence_num','3_day','3_day_single','7_day','7_day_single',
                    '15_day','15_day_single','30_day','30_day_single']
train_features = train_features.drop(non_feature_cols, axis=1).to_numpy()
val_features = val_features.drop(non_feature_cols, axis=1).to_numpy()
test_features = test_features.drop(non_feature_cols, axis=1).to_numpy()

print(train_features.shape)
print(val_features.shape)
print(test_features.shape)

# for HTML
# FEATURES - reshape each NumPy array to (meetings, 523 sentences (with padding), 1051 features)
np.save('data/train_features.npy', train_features.reshape(392, 523, 1051))
np.save('data/val_features.npy', val_features.reshape(56, 523, 1051))
np.save('data/test_features.npy', test_features.reshape(117, 523, 1051))

# # targets to numpy & dump rows without targets
targets = targets[~((targets['3_day'] == 0) & 
                    (targets['7_day'] == 0) & 
                    (targets['15_day'] == 0) & 
                    (targets['30_day'] == 0))]


targets = targets.copy().sort_values(by=['Date', 'Company'], ascending=[True, True])
targets['Date'] = pd.to_datetime(targets['Date'].astype(str), format='%Y%m%d')

train_targets = targets[targets['Date'] <= '2017-08-02']
val_targets = targets[(targets['Date'] >= '2017-08-03') & (targets['Date'] < '2017-10-24')]
test_targets = targets[targets['Date'] >= '2017-10-24']

print(train_targets.shape)
print(val_targets.shape)
print(test_targets.shape)

np.save('data/train_targets.npy', train_targets['3_day'].to_numpy())
np.save('data/val_targets.npy', val_targets['3_day'].to_numpy())
np.save('data/test_targets.npy', test_targets['3_day'].to_numpy())

np.save('data/train_secondary_targets.npy', train_targets['3_day_single'].to_numpy())
np.save('data/val_secondary_targets.npy', val_targets['3_day_single'].to_numpy())
np.save('data/test_secondary_targets.npy', test_targets['3_day_single'].to_numpy())

205016
29288
61191
(205016, 1051)
(29288, 1051)
(61191, 1051)
(392, 11)
(56, 11)
(117, 11)
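
The hard-coded meeting counts in the `reshape` calls can be cross-checked against the printed row counts: each split's row count divided by the padded length of 523 should be a whole number of meetings. A minimal sketch, using the counts printed above and a small toy array (illustrative sizes) in place of the real features:

```python
import numpy as np

PADDED_LEN = 523  # sentences per meeting after zero padding

# Row counts printed by the cell above.
row_counts = {"train": 205016, "val": 29288, "test": 61191}

meetings = {}
for split, n_rows in row_counts.items():
    n_meetings, remainder = divmod(n_rows, PADDED_LEN)
    # A non-zero remainder would mean some meeting was not padded to PADDED_LEN rows.
    assert remainder == 0, f"{split}: {n_rows} is not a multiple of {PADDED_LEN}"
    meetings[split] = n_meetings

print(meetings)

# Passing -1 lets NumPy infer the meeting count instead of hard-coding it.
# Toy array: 3 meetings x 5 sentences x 2 features (illustrative sizes).
flat = np.arange(3 * 5 * 2, dtype=float).reshape(3 * 5, 2)
stacked = flat.reshape(-1, 5, 2)
print(stacked.shape)
```

The reshape groups consecutive rows, so it relies on the earlier sort by `Date`, `Company`, `Sentence_num` keeping each meeting's 523 rows contiguous.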


# Final MAEC dataset

In [34]:

MAEC_praat_features['Date'] = MAEC_praat_features['Date'].astype('Int64')

##################################################################
# if you want to merge GloVe features 
# Merge GloVe and Praat features (each sentence is 327 features)
# MAEC_features = pd.merge(MAEC_praat_features, MAEC_glove_features, how="left",on = ['Ticker','Date','Sentence_num'])
##################################################################
# Merge RoBERTa and Praat features (each sentence is 1051 features)
# NOTE: MAEC_RoBERTa_features must be loaded before this cell runs;
# it was not defined when this cell was last executed (see the NameError in the output)
MAEC_features = pd.merge(MAEC_praat_features, MAEC_RoBERTa_features, how="left",on = ['Ticker','Date','Sentence_num'])
MAEC_features = MAEC_features.drop(['Shimmer apq11','Audio Length','audio_file'], axis=1)

# The maximum sentences per meeting is 497
# Ensure each meeting has 497 sentences, add zeros to the end
def add_zero_padding(group):
    complete_index = pd.Index(np.arange(1, 498), name='Sentence_num')
    group = group.set_index('Sentence_num').reindex(complete_index).reset_index()
    
    # MAEC meetings are keyed by 'Ticker' (the merge above uses 'Ticker', not 'Company')
    group['Ticker'] = group['Ticker'].ffill().bfill()
    group['Date'] = group['Date'].ffill().bfill()
    
    group.fillna(0.0, inplace=True)
    return group

MAEC_features = MAEC_features.groupby(['Ticker', 'Date']).apply(add_zero_padding).reset_index(drop=True)
MAEC_features.fillna(0, inplace=True)

NameError: name 'MAEC_RoBERTa_features' is not defined

In [None]:
# Merge the targets 
# (will be duplicated 497 times, for each meeting, but that's okay for now)
# we need the date and Ticker columns to sort
MAEC_targets['Date'] = MAEC_targets['Date'].astype(str).str.replace('-', '').astype('Int64')
MAEC_dataset = pd.merge(MAEC_features, MAEC_targets, how="left",on = ['Ticker','Date'])

# Fill in null
rows_with_nulls = MAEC_dataset[MAEC_dataset.isnull().any(axis=1)]
# print(rows_with_nulls)
MAEC_dataset.fillna(0, inplace=True)

# dump rows without targets
# ('Ticker' is kept here: it is needed for the sort and is dropped when TEXT_emb is built)
MAEC_dataset = MAEC_dataset[~((MAEC_dataset['3_day'] == 0) & 
                                    (MAEC_dataset['7_day'] == 0) & 
                                    (MAEC_dataset['15_day'] == 0) & 
                                    (MAEC_dataset['30_day'] == 0))]

MAEC_dataset = MAEC_dataset.sort_values(by=['Date', 'Ticker', 'Sentence_num'], ascending=[True, True, True])
TEXT_emb = MAEC_dataset.drop(['Ticker','Date','Sentence_num','3_day','3_day_single','7_day','7_day_single',
                                 '15_day','15_day_single','30_day','30_day_single'], axis=1).to_numpy()
# for HTML
# FEATURES - reshape the NumPy array to (meetings, 497 sentences (with padding), 1051 features)
# -1 lets NumPy infer the number of meetings; 565 is the original dataset's
# meeting count (392 + 56 + 117) and is unlikely to match MAEC
TEXT_emb = TEXT_emb.reshape(-1, 497, 1051)

# targets to numpy & dump rows without targets
MAEC_targets = MAEC_targets[~((MAEC_targets['3_day'] == 0) & 
                            (MAEC_targets['7_day'] == 0) & 
                            (MAEC_targets['15_day'] == 0) & 
                            (MAEC_targets['30_day'] == 0))]

MAEC_targets = MAEC_targets.copy().sort_values(by=['Date', 'Ticker'], ascending=[True, True])
LABEL_emb_b = MAEC_targets['3_day_single'].to_numpy() # shape: (meetings,)
LABEL_emb = MAEC_targets['3_day'].to_numpy() # shape: (meetings,)

np.save('data/MAEC_TEXT_emb.npy', TEXT_emb)
np.save('data/MAEC_LABEL_emb.npy', LABEL_emb)
np.save('data/MAEC_LABEL_emb_b.npy', LABEL_emb_b)

# MAEC_dataset = MAEC_dataset.groupby(['Company', 'Date'])
MAEC_dataset.info(verbose=True)
###### save ############################################
MAEC_dataset.to_csv('data/MAEC_dataset.csv', index=False)
########################################################
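
Because the features and the labels are saved as separate arrays, they only line up if both were sorted by the same meeting keys and contain the same meetings. A minimal sketch of that check on toy data (the frames and values below are made up for illustration):

```python
import pandas as pd

# Toy stand-ins: two meetings with two sentences each, in shuffled order.
features = pd.DataFrame({
    "Ticker": ["B", "A", "A", "B"],
    "Date": [20180102, 20180101, 20180101, 20180102],
    "Sentence_num": [1, 2, 1, 2],
})
targets = pd.DataFrame({
    "Ticker": ["B", "A"],
    "Date": [20180102, 20180101],
    "3_day": [0.2, 0.1],
})

# Sort both tables by the same meeting keys, as the notebook does.
features = features.sort_values(["Date", "Ticker", "Sentence_num"])
targets = targets.sort_values(["Date", "Ticker"])

# One key row per meeting, in the order the meetings appear in each table.
feature_keys = features[["Date", "Ticker"]].drop_duplicates().reset_index(drop=True)
target_keys = targets[["Date", "Ticker"]].reset_index(drop=True)

aligned = feature_keys.equals(target_keys)
print(aligned)
```

If the check fails, row `i` of the label array would not describe meeting `i` of the feature array, even though the shapes might still match.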