This notebook builds five datasets for the VAR analysis. Each combines SPI logarithmic returns with sentiment scores, estimated topics, and, in some cases, extracted components:

    1. The first dataset includes SPI logarithmic returns, Loughran-McDonald (LM) sentiment scores, and estimated topics.
    2. The second dataset extends the first with components derived from topics and from topics multiplied by LM sentiment.
    3. The third dataset combines SPI logarithmic returns with extended sentiment scores and estimated topics, using all articles.
    4. The fourth dataset has the same structure as the third but is built only from articles that mention the index firms or their competitors.
    5. The fifth dataset combines SPI logarithmic returns, extended sentiment scores, estimated topics, and components based on topics multiplied by extended sentiment, again using all articles.

In [1]:
import pandas as pd
import os

In [2]:
# Working directory of the notebook (expected to be the 'finance data' folder)
drt = os.path.realpath(os.getcwd())

## The first dataset: returns, LM sentiment, topics

First, we merge the returns data with the estimated daily topics and LM sentiment. The combined dataset is then saved into a CSV file named `daily_topics_LM.csv`.

In [3]:
# Set the paths
# returns
daily_ret_import = drt + '/daily/daily_common.csv' 
# topics
daily_topics_import = drt.replace('\\finance data', '') + '\\analysis\\analysis_topics' + '/daily_topics.csv'
# LM sentiment
daily_sentiment_path = drt.replace('\\finance data', '') + '\\analysis' + '/sentiment_daily_LM.xlsx'
daily_sentiment = pd.read_excel(daily_sentiment_path)
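
The paths above mix `\\` and `/` separators, so they only resolve on Windows. A minimal sketch of a portable alternative with `pathlib` follows. It assumes the notebook runs from a folder named `finance data` that sits next to `analysis` under a common parent, which is what the `replace('\\finance data', '')` calls suggest. The helper `build_paths` and the name `project_root` are hypothetical and not part of the notebook.

```python
from pathlib import Path


def build_paths(notebook_dir):
    """Return the three input paths used for the first dataset.

    Assumption: notebook_dir is the 'finance data' folder and the
    'analysis' folder shares its parent directory.
    """
    notebook_dir = Path(notebook_dir)
    project_root = notebook_dir.parent  # hypothetical name for the shared parent
    return {
        "returns": notebook_dir / "daily" / "daily_common.csv",
        "topics": project_root / "analysis" / "analysis_topics" / "daily_topics.csv",
        "sentiment": project_root / "analysis" / "sentiment_daily_LM.xlsx",
    }


# Illustrative call with a made-up location; no files are read here.
paths = build_paths(Path("project") / "finance data")
for name, p in paths.items():
    print(name, p.as_posix())
```

`pd.read_csv` and `pd.read_excel` accept `Path` objects directly, so the resulting values could be passed to them unchanged.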

In [4]:
# Import daily SPI data:
daily_returns = pd.read_csv(daily_ret_import)
# Topics data:
daily_topics = pd.read_csv(daily_topics_import)
# Rename the 'dates_day' column to 'date':
daily_topics.rename(columns = {'dates_day':'date'}, inplace = True)
daily_sentiment.rename(columns = {'dates_day':'date'}, inplace = True)

In [5]:
# Merge together:
daily_topics = pd.merge(daily_returns, daily_topics, on=['date'], how='outer')
daily_topics = pd.merge(daily_topics, daily_sentiment, on=['date'], how='outer')
# Delete columns 'common', 'react_label', 'day', 'month', and 'year':
daily_topics = daily_topics.drop(["common", "react_label", "day", "month", 'year'], axis = 1)
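
Because both merges use `how='outer'`, a date that appears in only one source is kept, with `NaN` in the columns from the other source. The sketch below illustrates this on made-up data and uses `indicator=True` to show where each row came from. The column names `log_ret` and `topic_1` and all values are hypothetical.

```python
import pandas as pd

# Toy inputs; values and column names are illustrative only.
returns = pd.DataFrame({
    "date": ["2020-01-02", "2020-01-03", "2020-01-06"],
    "log_ret": [0.01, -0.02, 0.005],
})
topics = pd.DataFrame({
    "date": ["2020-01-02", "2020-01-04", "2020-01-06"],
    "topic_1": [0.3, 0.6, 0.1],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(returns, topics, on=["date"], how="outer", indicator=True)
print(merged)
```

Rows marked `left_only` are trading days without articles, and rows marked `right_only` are days with articles but no return.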

In [6]:
# Drop the first 7 rows and renumber the index from 0:
daily_topics = daily_topics.loc[7:,:].reset_index(drop = True)

When a trading day has log returns but no news articles, we fill the missing topic data by carrying forward the values from the previous day. This assumes that the topics discussed on the last available trading day continue to influence the market until new information becomes available.

In [7]:
# Forward-fill missing values from the previous row:
daily_topics = daily_topics.ffill()
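
To make the forward fill concrete, here is a small sketch on made-up data using `DataFrame.ffill()`, which performs this forward fill. The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are illustrative only.
df = pd.DataFrame({
    "date": ["2020-01-02", "2020-01-03", "2020-01-04", "2020-01-06"],
    "log_ret": [0.01, -0.02, np.nan, 0.005],  # no return on 2020-01-04
    "topic_1": [0.3, np.nan, 0.6, 0.1],       # no articles on 2020-01-03
})

# Each NaN is replaced by the last non-missing value above it in the same column.
filled = df.ffill()
print(filled)
```

The fill applies to every column, not only to the topics: a date that enters through the outer merge without a return inherits the previous day's return as well. If that is not intended, the fill could be restricted to the topic columns.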

In [8]:
# Save the DF:
daily_topics.to_csv('daily/daily_topics_LM.csv', index = False)

## The second dataset: returns, LM sentiment, topics, components based on topics, components based on topics multiplied by LM sentiment

Now we add the components based on topics and the components based on topics multiplied by LM sentiment. The resulting dataset is saved into a CSV file named `daily_topics_LM_components.csv`.

In [9]:
# Import daily SPI data:
daily_returns = pd.read_csv(daily_ret_import)
# Topics data:
daily_topics = pd.read_csv(daily_topics_import)
# LM sentiment:
daily_sentiment = pd.read_excel(daily_sentiment_path)
# Rename the 'dates_day' column to 'date':
daily_topics.rename(columns = {'dates_day':'date'}, inplace = True)
daily_sentiment.rename(columns = {'dates_day':'date'}, inplace = True)

# Set the paths
# components based on topics
daily_components_import = drt.replace('\\finance data', '') + '\\analysis\\VAR' + '/components.csv'
# components based on topics multiplied with sentiment
daily_components_sent_import = drt.replace('\\finance data', '') + '\\analysis\\VAR' + '/components_sent.csv'

In [10]:
# Components
daily_components = pd.read_csv(daily_components_import)
daily_components_sent = pd.read_csv(daily_components_sent_import)

# Convert date format in 'daily_components' to 'YYYY-MM-DD'
daily_components['date'] = pd.to_datetime(daily_components['date'], dayfirst=True).dt.strftime('%Y-%m-%d')
# Convert date format in 'daily_components_sent' to 'YYYY-MM-DD'
daily_components_sent['date'] = pd.to_datetime(daily_components_sent['date'], dayfirst=True).dt.strftime('%Y-%m-%d')
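
The conversion matters because `date` is merged as a string key: rows only match when both frames use the same format. A minimal sketch of the same conversion on made-up, day-first input follows. The input format `DD/MM/YYYY` and the column `comp_1` are assumptions for illustration, since the actual format of `components.csv` is not shown here.

```python
import pandas as pd

# Hypothetical day-first date strings; the real file format may differ.
components = pd.DataFrame({
    "date": ["02/01/2020", "03/01/2020", "12/02/2020", "31/03/2020"],
    "comp_1": [0.1, 0.2, 0.3, 0.4],
})

# Parse as day-first, then write back as 'YYYY-MM-DD' strings.
components["date"] = (
    pd.to_datetime(components["date"], dayfirst=True).dt.strftime("%Y-%m-%d")
)
print(components["date"].tolist())
```

Without `dayfirst=True`, an ambiguous string such as `02/01/2020` would be read as 1 February rather than 2 January.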

In [11]:
# Merge together:
daily_topics = pd.merge(daily_returns, daily_topics, on=['date'], how='outer')
daily_topics = pd.merge(daily_topics, daily_sentiment, on=['date'], how='outer')
daily_topics = pd.merge(daily_topics, daily_components, on=['date'], how='outer')
daily_topics = pd.merge(daily_topics, daily_components_sent, on=['date'], how='outer')
# Delete columns 'common', 'react_label', 'day', 'month', and 'year':
daily_topics = daily_topics.drop(["common", "react_label", "day", "month", 'year'], axis = 1)

In [12]:
# Drop the first 7 rows, then forward-fill missing values:
daily_topics = daily_topics.loc[7:,:].reset_index(drop = True)
daily_topics = daily_topics.ffill()

In [13]:
# Save the DF:
daily_topics.to_csv('daily/daily_topics_LM_components.csv', index = False)

## The third dataset: returns, extended sentiment, and topics (all articles)

In [14]:
# Set the paths 
# extended sentiment:
daily_sentiment_path = drt.replace('\\finance data', '') + '\\analysis' + '/sentiment_daily_extend_comp_rev.xlsx'
daily_sentiment = pd.read_excel(daily_sentiment_path)

In [15]:
# Import daily SPI data:
daily_returns = pd.read_csv(daily_ret_import)
# Topics data:
daily_topics = pd.read_csv(daily_topics_import)
# Rename the 'dates_day' column to 'date':
daily_topics.rename(columns = {'dates_day':'date'}, inplace = True)
daily_sentiment.rename(columns = {'dates_day':'date'}, inplace = True)

In [16]:
# Merge together:
daily_topics = pd.merge(daily_returns, daily_topics, on=['date'], how='outer')
daily_topics = pd.merge(daily_topics, daily_sentiment, on=['date'], how='outer')
# Delete columns 'common', 'react_label', 'day', 'month', and 'year':
daily_topics = daily_topics.drop(["common", "react_label", "day", "month", 'year'], axis = 1)

In [17]:
# Drop the first 7 rows, then forward-fill missing values:
daily_topics = daily_topics.loc[7:,:].reset_index(drop = True)
daily_topics = daily_topics.ffill()

In [18]:
# Save the DF:
daily_topics.to_csv('daily/daily_topics_extend_comp_rev.csv', index = False)

## The fourth dataset: returns, extended sentiment, and topics (our firms and competitors, not all articles)

In [19]:
# Set the paths
# Topics: our firms and competitors
daily_topics_import = drt.replace('\\finance data', '') + '\\analysis\\analysis_topics' + '/daily_topics_our_comp.csv'
# Extended sentiment: our firms and competitors
daily_sentiment_path = drt.replace('\\finance data', '') + '\\analysis' + '/sentiment_daily_extend_our_comp.xlsx'
daily_sentiment = pd.read_excel(daily_sentiment_path)

In [20]:
# Import daily SPI data:
daily_returns = pd.read_csv(daily_ret_import)
# Topics data:
daily_topics = pd.read_csv(daily_topics_import)
# Rename the 'dates_day' column to 'date':
daily_topics.rename(columns = {'dates_day':'date'}, inplace = True)
daily_sentiment.rename(columns = {'dates_day':'date'}, inplace = True)

In [21]:
# Merge together:
daily_topics = pd.merge(daily_returns, daily_topics, on=['date'], how='left')
daily_topics = pd.merge(daily_topics, daily_sentiment, on=['date'], how='outer')
daily_topics = daily_topics.drop(["common", "react_label", "day", "month", 'year'], axis = 1)
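
The first merge in this section uses `how='left'`, whereas the other sections use `how='outer'`. The sketch below shows the difference on made-up data: a left merge keeps only the dates present in the returns, while an outer merge also keeps dates that appear only in the topics. Column names and values are hypothetical.

```python
import pandas as pd

# Toy inputs; values and column names are illustrative only.
returns = pd.DataFrame({
    "date": ["2020-01-02", "2020-01-03"],
    "log_ret": [0.01, -0.02],
})
topics = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02"],
    "topic_1": [0.4, 0.3],
})

left = pd.merge(returns, topics, on=["date"], how="left")
outer = pd.merge(returns, topics, on=["date"], how="outer")

print(left)   # only the dates found in `returns`
print(outer)  # union of the dates from both frames
```

Since the second merge with the sentiment data is still an outer merge, dates that appear only in the sentiment file can enter the fourth dataset even though topic-only dates cannot.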

In [22]:
# Drop the first 13 rows, then forward-fill missing values:
daily_topics = daily_topics.loc[13:,:].reset_index(drop = True)
daily_topics = daily_topics.ffill()

In [23]:
# Save the DF:
daily_topics.to_csv('daily/daily_topics_extend_our_comp.csv', index = False)

## The fifth dataset: returns, extended sentiment, topics, and components based on topics multiplied by extended sentiment (all articles)

In [24]:
# Set the paths
daily_topics_import = drt.replace('\\finance data', '') + '\\analysis\\analysis_topics' + '/daily_topics.csv'
# Extended sentiment
daily_sentiment_path = drt.replace('\\finance data', '') + '\\analysis' + '/sentiment_daily_extend_comp_rev.xlsx'
daily_sentiment = pd.read_excel(daily_sentiment_path)
# Components based on topics multiplied with extended sentiment
daily_components_sent_extend_import = drt.replace('\\finance data', '') + '\\analysis\\VAR' + '/components_sent_extend.csv'

In [25]:
# Import daily SPI data:
daily_returns = pd.read_csv(daily_ret_import)
# Topics data:
daily_topics = pd.read_csv(daily_topics_import)
daily_topics.rename(columns = {'dates_day':'date'}, inplace = True)
daily_sentiment.rename(columns = {'dates_day':'date'}, inplace = True)
# Components based on topics multiplied with extended sentiment:
daily_components_sent_extend = pd.read_csv(daily_components_sent_extend_import)

In [26]:
# Merge together:
daily_topics = pd.merge(daily_returns, daily_topics, on=['date'], how='outer')
daily_topics = pd.merge(daily_topics, daily_sentiment, on=['date'], how='outer')
daily_topics = pd.merge(daily_topics, daily_components_sent_extend, on=['date'], how='outer')
daily_topics = daily_topics.drop(["common", "react_label", "day", "month", 'year'], axis = 1)

In [27]:
# Drop the first 7 rows, then forward-fill missing values:
daily_topics = daily_topics.loc[7:,:].reset_index(drop = True)
daily_topics = daily_topics.ffill()

In [28]:
# Save the DF:
daily_topics.to_csv('daily/daily_topics_extend_comp_rev_components.csv', index = False)
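
All five sections repeat the same sequence: merge on `date`, drop the helper columns, trim the leading rows, and forward-fill. A sketch of how this could be factored into one function is given below. The function `build_dataset` and its parameters are hypothetical and not part of the notebook, and the toy data is made up.

```python
from functools import reduce

import pandas as pd

DROP_COLS = ["common", "react_label", "day", "month", "year"]


def build_dataset(frames, skip_rows, how="outer", drop_cols=DROP_COLS):
    """Merge frames on 'date', drop helper columns, trim, and forward-fill."""
    merged = reduce(
        lambda left, right: pd.merge(left, right, on=["date"], how=how), frames
    )
    # errors="ignore" lets the sketch run when a helper column is absent.
    merged = merged.drop(columns=drop_cols, errors="ignore")
    # Positional trim; equals .loc[skip_rows:, :] on the default integer index.
    merged = merged.iloc[skip_rows:].reset_index(drop=True)
    return merged.ffill()


# Toy usage with illustrative data.
returns = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "log_ret": [0.0, 0.01, -0.02],
    "common": [1, 1, 1],
})
topics = pd.DataFrame({
    "date": ["2020-01-02", "2020-01-03"],
    "topic_1": [0.3, None],
})
result = build_dataset([returns, topics], skip_rows=1)
print(result)
```

This sketch applies a single `how` to every merge, so it does not reproduce the fourth dataset, which combines a left merge with an outer merge. It also leaves out the date conversion applied to the component files in the second dataset.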