In this notebook, we generate sign-adjusted topic series. For each day and each topic, we identify the 11 articles with the highest proportion of that topic and check whether the majority of them have a positive/no clear tone or a negative tone. The topic receives a sign of +1 for that day if positive/no clear tone articles predominate and -1 if negative articles predominate. We then multiply the topic's value for that day by this sign.

First, we load the datasets from Handelsblatt, SZ, Welt, and dpa.

In [1]:
import os
import pandas as pd
from ast import literal_eval
from datetime import datetime
startTime = datetime.now()

# Set the path variable to point to the 'newspaper_data_processing' directory
path = os.getcwd().replace('\\newspaper_analysis\\topics', '\\newspaper_data_processing')

# Load pre-processed 'dpa' dataset from a CSV file
dpa = pd.read_csv(path + '\\dpa\\' + 'dpa_prepro_final.csv', encoding = 'utf-8', sep=';', index_col = 0,  keep_default_na=False,
                   dtype = {'rubrics': 'str', 
                            'source': 'str',
                            'keywords': 'str',
                            'title': 'str',
                            'city': 'str',
                            'genre': 'str',
                            'wordcount': 'str'},
                  converters = {'paragraphs': literal_eval})

# Keep only the article texts and their respective publication dates
dpa = dpa[['texts', 'day', 'month', 'year']]

# Load pre-processed 'SZ' dataset from a CSV file
sz = pd.read_csv(path + '\\SZ\\' + 'sz_prepro_final.csv', encoding = 'utf-8-sig', sep=';', index_col = 0, dtype = {'newspaper': 'str',
                                                                                                 'newspaper_2': 'str',
                                                                                                 'quelle_texts': 'str',
                                                                                                 'page': 'str',
                                                                                                 'rubrics': 'str'})
sz.page = sz.page.fillna('')
sz.newspaper = sz.newspaper.fillna('')
sz.newspaper_2 = sz.newspaper_2.fillna('')
sz.rubrics = sz.rubrics.fillna('')
sz.quelle_texts = sz.quelle_texts.fillna('')

# Keep only the article texts and their respective publication dates
sz = sz[['texts', 'day', 'month', 'year']]

# Load pre-processed 'Handelsblatt' dataset from a CSV file
hb = pd.read_csv(path + '\\Handelsblatt\\' + 'hb_prepro_final.csv', encoding = 'utf-8-sig', sep=';', index_col = 0, dtype = {'kicker': 'str',
                                                                                                 'page': 'str',
                                                                                                 'series_title': 'str',
                                                                                                 'rubrics': 'str'})
hb.page = hb.page.fillna('')
hb.series_title = hb.series_title.fillna('')
hb.kicker = hb.kicker.fillna('')
hb.rubrics = hb.rubrics.fillna('')

# Keep only the article texts and their respective publication dates
hb = hb[['texts', 'day', 'month', 'year']]

# Load pre-processed 'Welt' dataset from a CSV file
welt = pd.read_csv(path + '\\Welt\\' + 'welt_prepro_final.csv', encoding = 'utf-8-sig', sep=';', index_col = 0, dtype = {'newspaper': 'str',
                                                                                                 'rubrics': 'str',
                                                                                                 'title': 'str'})
welt.title = welt.title.fillna('')
welt.rubrics = welt.rubrics.fillna('')

# Keep only the article texts and their respective publication dates
welt = welt[['texts', 'day', 'month', 'year']]

# Concatenate the 'dpa', 'sz', 'hb', and 'welt' DataFrames into a single DataFrame 'data'
data = pd.concat([dpa, sz, hb, welt])

# The number of articles in the final dataset
print(len(data))

# Sort the data in chronological order
data = data.sort_values(['year', 'month', 'day'], ascending=[True, True, True])
# Reset the index of the DataFrame
data.reset_index(inplace=True, drop=True)
data.head()

print(datetime.now()-startTime)

3336299
0:03:41.147624


Next, we import sentiment scores, previously computed using an LSTM model for each article in the corpus.

In [2]:
import csv
import codecs

# Set the path variable to point to the 'sentiment' directory
path = os.getcwd().replace('\\topics', '') + '\\sentiment'

with codecs.open(path + "\\scores_lstm.csv", "r", encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    scores = [None if row[0] == '' else float(row[0]) for row in reader]   

We add sentiment scores as a new column to the `data` DataFrame and discard any rows with missing sentiment scores.

In [3]:
# Add the sentiment scores as a new column in the data DataFrame
data['scores'] = scores

# Remove any rows in the DataFrame where a sentiment score is missing (NaN). In this context,
# NaN means the model could not predict sentiment for the article, either because of
# formatting issues or because the article is too short (fewer than 20 tokens).
data = data.dropna(subset=['scores'])

# Reset the index of the DataFrame
data.reset_index(inplace=True, drop=True)
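
As a small illustration of this step, the sketch below attaches a list of scores containing `None` to a toy DataFrame and drops the incomplete rows. The toy frame and its values are invented; the only property carried over from the notebook is that the score list has one entry per row, in the same order as the rows.

```python
import pandas as pd

# Toy stand-in for `data`: three articles in chronological order (invented values)
toy = pd.DataFrame({'texts': ['a', 'b', 'c'], 'year': [2000, 2000, 2001]})

# One score per article; None marks an article the model could not score
toy_scores = [0.8, None, -0.3]

# Assigning a list to a column is positional, so the lengths must match
assert len(toy_scores) == len(toy)
toy['scores'] = toy_scores

# None is treated as a missing value and removed by dropna
toy = toy.dropna(subset=['scores']).reset_index(drop=True)
print(toy)
```

Because the assignment is positional, the score file has to follow the same sort order as `data` at the moment the column is added.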

Next, we add the topic distributions for each article, which were previously computed with Latent Dirichlet Allocation (LDA) in the notebook `Topic model estimation`.

In [4]:
# Load the article topics from a CSV file
article_topics = pd.read_csv('article_topic.csv', encoding='utf-8', index_col=0)

# Merge the `data` DataFrame with the `article_topics` DataFrame
data = pd.concat([data, article_topics], axis=1)
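
Since `pd.concat(..., axis=1)` joins on index labels rather than on row position, it can be worth confirming that both frames share the same index before merging. The following is a minimal sketch with invented toy frames (`left`, `right`), not part of the original pipeline, showing what a mismatch looks like.

```python
import pandas as pd

left = pd.DataFrame({'scores': [0.1, 0.2, 0.3]})                 # index 0, 1, 2
right = pd.DataFrame({'T0': [0.5, 0.6, 0.7]}, index=[0, 1, 3])   # index 0, 1, 3

# axis=1 concatenation joins on index labels (outer join by default)
merged = pd.concat([left, right], axis=1)
print(merged)

# A mismatch shows up as extra rows and NaN cells
has_mismatch = len(merged) != len(left) or merged.isna().any().any()
print(has_mismatch)
```

If such a check reports a mismatch on the real data, the rows of `article_topic.csv` and `data` do not line up one to one.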

We create a function, `get_daily_sentiment_sign`, which processes a subset of `data` corresponding to a single day and returns a series with the sentiment sign for each topic on that day. This function takes as input the number of articles with the highest proportion of each topic, which it uses to determine the sentiment. The function `process_chunk` ensures that the resulting series is transformed into a pandas DataFrame with the date as the index. The `calculate_sentiment_signs` function applies the sentiment sign calculation for a given number of top articles. To speed up the computation, we use multiprocessing to process each day's data in parallel. 

In [5]:
import multiprocessing as mp
from process_chunk import process_chunk

# Number of cores to use (leave 8 cores free, but always use at least one worker)
NUM_CORE = max(1, mp.cpu_count() - 8)

# Convert 'year', 'month', 'day' to datetime
data['date'] = pd.to_datetime(data[['year', 'month', 'day']])

def calculate_sentiment_signs(data, n_articles):
    """
    Function to calculate sentiment signs for topics based on a specified number of articles.
    """
    # Group data by 'date' to ensure each day stays together
    grouped_by_date = [group for _, group in data.groupby('date')]
    
    # Prepare arguments for starmap (pair each chunk with the value of n_articles)
    args = [(chunk, n_articles) for chunk in grouped_by_date]
    
    if __name__ == "__main__":
        # Create a multiprocessing pool
        pool = mp.Pool(NUM_CORE)

        # Process each group (one day of data) in parallel
        results = pool.starmap(process_chunk, args)

        # Concatenate the results into a single DataFrame
        daily_sentiment_signs = pd.concat(results)

        # Close and join the pool
        pool.close()
        pool.join()

    return daily_sentiment_signs
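
The module `process_chunk` is not shown in this notebook. The sketch below is one way the per-day logic described above could look; it is an illustration under assumptions, not the actual implementation. The function names ending in `_sketch`, the `topic_cols` argument, and the rule that a score below `threshold=0.0` counts as negative are all hypothetical.

```python
import pandas as pd

def get_daily_sentiment_sign_sketch(day_df, n_articles, topic_cols, threshold=0.0):
    """Return +1 or -1 per topic for one day of articles (illustrative only)."""
    signs = {}
    for topic in topic_cols:
        # The n articles with the highest proportion of this topic
        top = day_df.nlargest(n_articles, topic)
        # Assumption: a score below `threshold` counts as negative
        n_negative = (top['scores'] < threshold).sum()
        # -1 only if negative articles form a strict majority
        signs[topic] = -1 if n_negative > len(top) / 2 else 1
    return pd.Series(signs)

def process_chunk_sketch(day_df, n_articles, topic_cols):
    """Wrap the daily signs in a one-row DataFrame indexed by the date."""
    signs = get_daily_sentiment_sign_sketch(day_df, n_articles, topic_cols)
    frame = signs.to_frame().T
    frame.index = [day_df['date'].iloc[0]]
    return frame

# Invented data for a single day with four articles and two topics
day = pd.DataFrame({
    'date': pd.to_datetime(['2000-01-03'] * 4),
    'scores': [-0.9, -0.4, 0.7, 0.2],
    'T0': [0.60, 0.30, 0.06, 0.04],
    'T1': [0.05, 0.10, 0.50, 0.35],
})
result = process_chunk_sketch(day, n_articles=3, topic_cols=['T0', 'T1'])
print(result)
```

In this toy day, two of the three top articles for `T0` are negative, so `T0` gets -1, while only one of the three top articles for `T1` is negative, so `T1` gets +1. Using an odd number of articles avoids ties whenever a day has at least that many articles.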

We calculate daily sentiment signs for each topic using the top 11, 9, and 7 articles. While the final results are based on 11 articles, we also check the robustness of these results using 9 and 7 articles.

In [6]:
startTime = datetime.now() 

# Generate sentiment signs for 11, 9, and 7 articles
daily_sentiment_signs_11 = calculate_sentiment_signs(data, n_articles=11)
daily_sentiment_signs_9 = calculate_sentiment_signs(data, n_articles=9)
daily_sentiment_signs_7 = calculate_sentiment_signs(data, n_articles=7)

print(datetime.now()-startTime)

0:04:28.954358


We now load the daily topics and adjust their signs using the DataFrame `daily_sentiment_signs_11`. The result is a DataFrame in which each daily topic proportion is multiplied by the prevailing sentiment sign of that topic on that day.

In [7]:
# Load the daily topics from a CSV file
daily_topics = pd.read_csv('daily_topics.csv', encoding='utf-8')

# Convert year, month, and day into a single date column
daily_topics['date'] = pd.to_datetime(daily_topics[['year','month','day']])
daily_topics.drop(columns=['year', 'month', 'day'], inplace=True)

# Now, set 'date' as index
daily_topics.set_index('date', inplace=True)

In [8]:
# Apply sentiment signs to the daily topics
sign_adjusted_daily_topics = daily_topics.multiply(daily_sentiment_signs_11)

# Save sign-adjusted topics to a CSV file
sign_adjusted_daily_topics.to_csv('sign_adjusted_daily_topics.csv', encoding='utf-8', index=True)
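
`DataFrame.multiply` aligns the two frames on both the index and the column labels before multiplying. The toy example below, with invented dates and topic labels `T0` and `T1`, shows the element-wise product and what happens when the labels do not match; whether the labels in `daily_topics` and `daily_sentiment_signs_11` match exactly is not visible in this notebook.

```python
import pandas as pd

dates = pd.to_datetime(['2000-01-03', '2000-01-04'])
topics = pd.DataFrame({'T0': [0.2, 0.4], 'T1': [0.1, 0.3]}, index=dates)
signs = pd.DataFrame({'T0': [1, -1], 'T1': [-1, 1]}, index=dates)

# Element-wise product, aligned on both the date index and the column labels
adjusted = topics.multiply(signs)
print(adjusted)

# A label present in only one frame yields NaN rather than an error
mislabeled = topics.multiply(signs.rename(columns={'T1': 'topic_1'}))
print(mislabeled.isna().sum())
```

A quick look at the NaN counts of the product is therefore a cheap way to detect mismatched dates or column names.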

We iterate over each topic and generate a graph comparing the original and sign-adjusted values.

In [9]:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Create a directory to save the plots
os.makedirs('topics_plots', exist_ok=True)

# Define the shaded areas for recessions
recessions = [
    ("1992-01-01", "1993-12-31"),  # Post-reunification recession
    ("2001-01-01", "2001-12-31"),  # Dot-com recession
    ("2008-01-01", "2009-12-31"),  # Great Recession
    ("2011-01-01", "2013-12-31")   # European sovereign debt crisis
]

# Calculate the 180-day rolling mean for each series
daily_topics_rm = daily_topics.rolling(window=180).mean()
sign_adjusted_daily_topics_11_rm = sign_adjusted_daily_topics.rolling(window=180).mean()

# Iterate over each topic
for i in range(daily_topics.shape[1]):
    # Generate the plot
    fig, ax = plt.subplots(figsize=(12,6))
    
    # Plot original and sign-adjusted topics
    ax.plot(daily_topics_rm.index, daily_topics_rm.iloc[:, i], label='Original topic', color='black')
    ax.plot(sign_adjusted_daily_topics_11_rm.index, sign_adjusted_daily_topics_11_rm.iloc[:, i], label='Sign-adjusted topic', linestyle='--')
    
    # Add shaded areas for recessions
    for start, end in recessions:
        ax.axvspan(pd.to_datetime(start), pd.to_datetime(end), color='grey', alpha=0.3)
    
    # Add title and labels
    #ax.set_title('Topic ' + str(i))
    ax.set_xlabel('Date')
    ax.set_ylabel('Topic Proportion')
    ax.legend()
    
    # Format the x-axis to show every year
    ax.xaxis.set_major_locator(mdates.YearLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))

    # Rotate date labels for better visibility
    plt.xticks(rotation=45)
    
    # Save the plot in the 'topics_plots' directory
    plt.savefig('topics_plots/Topic_' + str(i) + '.png')
    
    # Close the figure to free memory (plt.clf() alone keeps the figure open)
    plt.close(fig)

We then repeat the process for the topics selected for the out-of-sample forecasting experiment, allowing for easier access.

In [10]:
# Create a directory to save the plots
os.makedirs('selected_topics_plots', exist_ok=True)

# List of selected topics
selected_topics = [11, 27, 52, 127, 81, 77, 74, 131, 138, 100]

# Iterate over the selected topics
for topic in selected_topics:
    # Generate the plot
    fig, ax = plt.subplots(figsize=(12,6))
    
    # Plot original and sign-adjusted topics
    ax.plot(daily_topics_rm.index, daily_topics_rm.iloc[:, topic], label='Original topic', color='black')
    ax.plot(sign_adjusted_daily_topics_11_rm.index, sign_adjusted_daily_topics_11_rm.iloc[:, topic], label='Sign-adjusted topic', linestyle='--')
    
    # Add shaded areas for recessions
    for start, end in recessions:
        ax.axvspan(pd.to_datetime(start), pd.to_datetime(end), color='grey', alpha=0.3)
    
    # Add title and labels
    #ax.set_title('Topic ' + str(topic))
    ax.set_xlabel('Date')
    ax.set_ylabel('Topic Proportion')
    ax.legend()
    
    # Format the x-axis to show every year
    ax.xaxis.set_major_locator(mdates.YearLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))

    # Rotate date labels for better visibility
    plt.xticks(rotation=45)
    
    # Save the plot in the 'selected_topics_plots' directory
    plt.savefig('selected_topics_plots/Topic_' + str(topic) + '.png')
    
    # Close the figure to free memory (plt.clf() alone keeps the figure open)
    plt.close(fig)

To check the robustness of our results to the choice of the number of top articles used to determine sentiment, we calculate the sign-adjusted topics for 7, 9, and 11 articles. We then standardize these daily series and plot them together for the selected topics.

In [11]:
# Helper that rescales a series to zero mean and unit standard deviation
def standardize_series(series):
    return (series - series.mean()) / series.std()

# Calculate the 180-day rolling mean for each series
sign_adjusted_daily_topics_9_rm = daily_topics.multiply(daily_sentiment_signs_9).rolling(window=180).mean()
sign_adjusted_daily_topics_7_rm = daily_topics.multiply(daily_sentiment_signs_7).rolling(window=180).mean()

# Standardize the sentiment series
sign_adjusted_11_rm_standardized = sign_adjusted_daily_topics_11_rm.apply(standardize_series, axis=0)
sign_adjusted_9_rm_standardized = sign_adjusted_daily_topics_9_rm.apply(standardize_series, axis=0)
sign_adjusted_7_rm_standardized = sign_adjusted_daily_topics_7_rm.apply(standardize_series, axis=0)

# Create a directory to save the plots
os.makedirs('selected_topics_plots_number_of_articles', exist_ok=True)

# Iterate over the selected topics
for topic in selected_topics:
    # Generate the plot
    fig, ax = plt.subplots(figsize=(12,6))
    
    # Plot sign-adjusted topics for 11, 9, and 7 articles
    ax.plot(sign_adjusted_11_rm_standardized.index, sign_adjusted_11_rm_standardized.iloc[:, topic], label='Sign-adjusted topic (11 articles)', linestyle='--', color='black')
    ax.plot(sign_adjusted_9_rm_standardized.index, sign_adjusted_9_rm_standardized.iloc[:, topic], label='Sign-adjusted topic (9 articles)', linestyle=':')
    ax.plot(sign_adjusted_7_rm_standardized.index, sign_adjusted_7_rm_standardized.iloc[:, topic], label='Sign-adjusted topic (7 articles)', linestyle='-.', color='green')
    
    # Add a horizontal line at zero to indicate the mean
    ax.axhline(0, color='gray', linewidth=1, linestyle='--')
    
    # Add title and labels
    ax.set_xlabel('Date')
    ax.set_ylabel('Standardized Sign-Adjusted Topic Proportion')
    ax.legend()

    # Shade the recession periods
    for start, end in recessions:
        ax.axvspan(pd.to_datetime(start), pd.to_datetime(end), color='grey', alpha=0.3)
        
    # Format the x-axis to show every year
    ax.xaxis.set_major_locator(mdates.YearLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))

    # Rotate date labels for better visibility
    plt.xticks(rotation=45)

    # Save the plot in the 'selected_topics_plots_number_of_articles' directory
    plt.savefig(f'selected_topics_plots_number_of_articles/Topic_{topic}.png')

    # Close the figure to free memory (plt.clf() alone keeps the figure open)
    plt.close(fig)
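
To make the smoothing and standardization in this cell concrete, here is a minimal sketch on an invented ten-day series with a window of 3 instead of 180. It reuses the definition of `standardize_series` from above; everything else is toy data.

```python
import numpy as np
import pandas as pd

def standardize_series(series):
    return (series - series.mean()) / series.std()

idx = pd.date_range('2000-01-01', periods=10, freq='D')
toy = pd.DataFrame({'T0': np.arange(10, dtype=float)}, index=idx)

# A window of 3 leaves the first 2 rows as NaN (window=180 leaves the first 179)
toy_rm = toy.rolling(window=3).mean()

# mean() and std() skip NaN, so the leading gap does not affect the scaling
toy_std = toy_rm.apply(standardize_series, axis=0)
print(toy_std['T0'].mean(), toy_std['T0'].std())
```

After standardization each series has mean 0 and standard deviation 1 over its non-missing values, which puts the 7-, 9-, and 11-article variants on a common scale in the plots.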

Another robustness check we perform is to determine whether sign adjustment and sentiment adjustment produce similar results. To do this, we calculate the average sentiment for each topic on each day, based on the 11 articles with the highest topic proportions. We then multiply this average sentiment by the daily topic proportions, resulting in sentiment-adjusted topics for each day. Finally, we standardize both the sign-adjusted and sentiment-adjusted topic series, and plot them together for the selected topics to compare their behavior.
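
The difference between the two adjustments can be seen on a single topic-day. The sketch below uses invented scores and an invented topic proportion, and it assumes that a score below 0 counts as negative; the actual rule is defined in the imported modules, which are not shown here.

```python
import pandas as pd

# Invented scores of the top articles for one topic on one day
top_scores = pd.Series([-0.9, -0.1, -0.2, 0.8, 0.6])
topic_share = 0.25  # invented daily topic proportion

# Sign adjustment (assumption: a score below 0 counts as negative)
sign = -1 if (top_scores < 0).sum() > len(top_scores) / 2 else 1
sign_adjusted = topic_share * sign

# Sentiment adjustment: weight by the mean score instead of its sign
sentiment_adjusted = topic_share * top_scores.mean()
print(sign_adjusted, sentiment_adjusted)
```

Here three of five articles are negative, so the sign-adjusted value is negative, while the mean score is slightly positive because two strongly positive articles outweigh the mildly negative ones. The two series can therefore disagree on individual days, which is why they are compared after smoothing and standardization.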

In [12]:
from process_chunk_average_sentiment import process_chunk_average_sentiment

def calculate_average_sentiment(data, n_articles):
    """
    Function to calculate average sentiment for topics based on a specified number of articles.
    """
    # Group data by 'date' to ensure each day stays together
    grouped_by_date = [group for _, group in data.groupby('date')]
    
    # Prepare arguments for starmap (pair each chunk with the value of n_articles)
    args = [(chunk, n_articles) for chunk in grouped_by_date]
    
    if __name__ == "__main__":
        # Create a multiprocessing pool
        pool = mp.Pool(NUM_CORE)

        # Process each group (one day of data) in parallel
        results = pool.starmap(process_chunk_average_sentiment, args)

        # Concatenate the results into a single DataFrame
        daily_average_sentiment = pd.concat(results)

        # Close and join the pool
        pool.close()
        pool.join()

    return daily_average_sentiment

In [13]:
startTime = datetime.now() 

# Generate average sentiment for 11 articles
daily_average_sentiment_11 = calculate_average_sentiment(data, n_articles=11)

print(datetime.now()-startTime)

0:01:25.519267


In [14]:
# Calculate the 180-day rolling mean for the series
sentiment_adjusted_daily_topics_11_rm = daily_topics.multiply(daily_average_sentiment_11).rolling(window=180).mean()

# Standardize the sentiment series
sentiment_adjusted_11_rm_standardized = sentiment_adjusted_daily_topics_11_rm.apply(standardize_series, axis=0)

# Create a directory to save the plots
os.makedirs('selected_topics_plots_sentiment_adjustment', exist_ok=True)

# Iterate over the selected topics
for topic in selected_topics:
    # Generate the plot
    fig, ax = plt.subplots(figsize=(12,6))
    
    # Plot sign-adjusted and sentiment-adjusted topics for 11 top articles
    ax.plot(sign_adjusted_11_rm_standardized.index, sign_adjusted_11_rm_standardized.iloc[:, topic], label='Sign-adjusted topic', linestyle=':', color = 'black')
    ax.plot(sentiment_adjusted_11_rm_standardized.index, sentiment_adjusted_11_rm_standardized.iloc[:, topic], label='Sentiment-adjusted topic', linestyle='--')
    
    # Add a horizontal line at zero to indicate the mean
    ax.axhline(0, color='gray', linewidth=1, linestyle='--')

    # Add title and labels
    ax.set_xlabel('Date')
    ax.set_ylabel('Standardized Sign- or Sentiment-Adjusted Topic Proportion')
    ax.legend()

    # Shade the recession periods
    for start, end in recessions:
        ax.axvspan(pd.to_datetime(start), pd.to_datetime(end), color='grey', alpha=0.3)
        
    # Format the x-axis to show every year
    ax.xaxis.set_major_locator(mdates.YearLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))

    # Rotate date labels for better visibility
    plt.xticks(rotation=45)

    # Save the plot in the 'selected_topics_plots_sentiment_adjustment' directory
    plt.savefig(f'selected_topics_plots_sentiment_adjustment/Topic_{topic}.png')

    # Close the figure to free memory (plt.clf() alone keeps the figure open)
    plt.close(fig)

Finally, we reformat the `sign_adjusted_daily_topics` DataFrame to include separate 'year', 'month', and 'day' columns, and save the reformatted DataFrame to a CSV file for further analysis.

In [15]:
# Reset the index
sign_adjusted_daily_topics = sign_adjusted_daily_topics.reset_index()

# Create 'year', 'month', and 'day' columns
sign_adjusted_daily_topics['year'] = sign_adjusted_daily_topics['date'].dt.year
sign_adjusted_daily_topics['month'] = sign_adjusted_daily_topics['date'].dt.month
sign_adjusted_daily_topics['day'] = sign_adjusted_daily_topics['date'].dt.day

# Drop the 'date' column, which is now redundant
sign_adjusted_daily_topics = sign_adjusted_daily_topics.drop(columns=['date'])

# Reorder the columns to have 'year', 'month', 'day' as the first three columns
cols = ['year', 'month', 'day'] + [col for col in sign_adjusted_daily_topics if col not in ['year', 'month', 'day']]
sign_adjusted_daily_topics_format = sign_adjusted_daily_topics[cols]

# Save to CSV
sign_adjusted_daily_topics_format.to_csv('sign_adjusted_daily_topics_format.csv', index=False)