# Sentiment-Adjusted Topic Series (BPW)

In this notebook, I construct sentiment-adjusted topic series. For sentiment extraction, I use the German translation by [Bannier, Pauls, and Walter (BPW)](https://link.springer.com/article/10.1007/s11573-018-0914-8) of the widely recognized dictionary by [Loughran and McDonald (2011)](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1540-6261.2010.01625.x). 

For each day and each topic, I identify the 10 articles with the highest proportion of that topic and average their sentiment measures; this average serves as the sentiment value of the topic on that day. Finally, I multiply each daily topic value by the corresponding sentiment value.
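
One way to write this procedure in symbols is given below. The notation is introduced here for illustration only and is not used elsewhere in the notebook: $\theta_{k,t}$ is the daily value of topic $k$ on day $t$, $\mathcal{A}_{k,t}$ is the set of (at most) 10 articles published on day $t$ with the highest proportion of topic $k$, and $s_a$ is the sentiment measure of article $a$.

$$
\bar{s}_{k,t} = \frac{1}{|\mathcal{A}_{k,t}|} \sum_{a \in \mathcal{A}_{k,t}} s_a,
\qquad
\tilde{\theta}_{k,t} = \theta_{k,t} \cdot \bar{s}_{k,t}
$$

Here $\bar{s}_{k,t}$ is the sentiment value of topic $k$ on day $t$ and $\tilde{\theta}_{k,t}$ is the sentiment-adjusted topic value.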

To begin, I load the datasets from Handelsblatt, SZ, Welt, and dpa.

In [1]:
import os
import pandas as pd
from ast import literal_eval

# Set the path variable to point to the 'newspaper_data_processing' directory.
path = os.getcwd().replace('\\newspaper_analysis\\topics', '\\newspaper_data_processing')

# Load pre-processed 'dpa' dataset from a CSV file.
dpa = pd.read_csv(path + '\\dpa\\' + 'dpa_prepro_final.csv', encoding = 'utf-8', sep=';', index_col = 0,  keep_default_na=False,
                   dtype = {'rubrics': 'str', 
                            'source': 'str',
                            'keywords': 'str',
                            'title': 'str',
                            'city': 'str',
                            'genre': 'str',
                            'wordcount': 'str'},
                  converters = {'paragraphs': literal_eval})

# Keep only the article texts and their respective publication dates.
dpa = dpa[['texts', 'day', 'month', 'year']]

# Load pre-processed 'SZ' dataset from a CSV file.
sz = pd.read_csv(path + '\\SZ\\' + 'sz_prepro_final.csv', encoding = 'utf-8-sig', sep=';', index_col = 0, dtype = {'newspaper': 'str',
                                                                                                 'newspaper_2': 'str',
                                                                                                 'quelle_texts': 'str',
                                                                                                 'page': 'str',
                                                                                                 'rubrics': 'str'})
sz.page = sz.page.fillna('')
sz.newspaper = sz.newspaper.fillna('')
sz.newspaper_2 = sz.newspaper_2.fillna('')
sz.rubrics = sz.rubrics.fillna('')
sz.quelle_texts = sz.quelle_texts.fillna('')

# Keep only the article texts and their respective publication dates.
sz = sz[['texts', 'day', 'month', 'year']]

# Load pre-processed 'Handelsblatt' dataset from a CSV file.
hb = pd.read_csv(path + '\\Handelsblatt\\' + 'hb_prepro_final.csv', encoding = 'utf-8-sig', sep=';', index_col = 0, dtype = {'kicker': 'str',
                                                                                                 'page': 'str',
                                                                                                 'series_title': 'str',
                                                                                                 'rubrics': 'str'})
hb.page = hb.page.fillna('')
hb.series_title = hb.series_title.fillna('')
hb.kicker = hb.kicker.fillna('')
hb.rubrics = hb.rubrics.fillna('')

# Keep only the article texts and their respective publication dates.
hb = hb[['texts', 'day', 'month', 'year']]

# Load pre-processed 'Welt' dataset from a CSV file.
welt = pd.read_csv(path + '\\Welt\\' + 'welt_prepro_final.csv', encoding = 'utf-8-sig', sep=';', index_col = 0, dtype = {'newspaper': 'str',
                                                                                                 'rubrics': 'str',
                                                                                                 'title': 'str'})
welt.title = welt.title.fillna('')
welt.rubrics = welt.rubrics.fillna('')

# Keep only the article texts and their respective publication dates.
welt = welt[['texts', 'day', 'month', 'year']]

# Concatenate the 'dpa', 'sz', 'hb', and 'welt' DataFrames into a single DataFrame 'data'.
data = pd.concat([dpa, sz, hb, welt])

# The number of articles in the final dataset.
print(len(data))

# Sort the data in chronological order.
data = data.sort_values(['year', 'month', 'day'], ascending=[True, True, True])
# Reset the index of the DataFrame
data.reset_index(inplace=True, drop=True)
data.head()

3336299


| | texts | day | month | year |
|---|---|---|---|---|
| 0 | Schalck: Milliardenkredit sicherte Zahlungsfäh... | 1 | 1 | 1991 |
| 1 | Welajati: Iran bleibt bei einem Krieg am Golf ... | 1 | 1 | 1991 |
| 2 | Bush will offenbar seinen Außenminister erneut... | 1 | 1 | 1991 |
| 3 | Sperrfrist 1. Januar 1000 HBV fordert umfassen... | 1 | 1 | 1991 |
| 4 | Schamir weist Nahost-Äußerungen des neuen EG-P... | 1 | 1 | 1991 |
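
The loading cell builds its paths by replacing backslash-separated substrings of the working directory, which only works on Windows. A possible portable variant, sketched with `pathlib`, is shown below. The helper name `sibling_data_dir` and the root folder `project` are made up for this example; only the directory and file names come from the cell above.

```python
from pathlib import Path


def sibling_data_dir(cwd: Path, levels_up: int = 2,
                     target: str = "newspaper_data_processing") -> Path:
    """Return the `target` directory located `levels_up` levels above `cwd`."""
    return cwd.parents[levels_up - 1] / target


# Made-up project root; in the notebook, `cwd` would be Path.cwd().
cwd = Path("project") / "newspaper_analysis" / "topics"

data_dir = sibling_data_dir(cwd)
dpa_file = data_dir / "dpa" / "dpa_prepro_final.csv"

print(dpa_file.as_posix())  # project/newspaper_data_processing/dpa/dpa_prepro_final.csv
```

A `Path` object can be passed directly to `pd.read_csv`, so the rest of the loading code would not need to change.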


Next, I import the sentiment scores that were previously computed for each article in the corpus using an LSTM model.

In [2]:
import csv
import codecs

# Set the path variable to point to the 'sentiment' directory
path = os.getcwd().replace('\\topics', '') + '\\sentiment'

with codecs.open(path + "\\scores_lstm.csv", "r", encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    scores = [None if row[0] == '' else float(row[0]) for row in reader]   

I add sentiment scores as a new column to the `data` DataFrame and discard any rows with missing sentiment scores.

In [3]:
# Add the sentiment scores as a new column in the data DataFrame
data['scores'] = scores

# Remove all rows with a missing sentiment score (NaN). A score is missing when the model
# could not predict a sentiment for the article, either because of formatting issues or
# because the article is too short (fewer than 20 tokens).
data = data.dropna(subset=['scores'])

# Reset the index of the DataFrame
data.reset_index(inplace=True, drop=True)
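
A toy illustration of this filtering step follows; the three-row frame and its values are made up. It shows that a `None` entry in the score list becomes a missing value in the column, that `dropna` removes the corresponding row, and that the list must contain exactly one score per row, in the same order as the rows.

```python
import pandas as pd

toy = pd.DataFrame({"texts": ["a b c", "d", "e f"]})
toy_scores = [0.8, None, 0.1]  # None marks an article without a prediction

# Assigning a plain list is positional, so the lengths have to match.
assert len(toy_scores) == len(toy)
toy["scores"] = toy_scores

# Drop the article without a score and renumber the remaining rows.
toy = toy.dropna(subset=["scores"]).reset_index(drop=True)

print(toy["texts"].tolist())  # ['a b c', 'e f']
```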

Next, I calculate the sentiment measure for each article as the difference between the number of positive and negative words, divided by the total number of words in the article.
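
Written as a formula, with symbols chosen here for illustration: $P_a$ and $N_a$ are the numbers of positive and negative dictionary words in article $a$, and $W_a$ is its total number of words.

$$
s_a = \frac{P_a - N_a}{W_a}
$$

In the code further below, articles with $W_a = 0$ receive $s_a = 0$.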

To do this, I first load the BPW dictionary and create two lists: one containing negative terms and another containing positive terms.

In [4]:
import pandas as pd

# Read the negative and positive word lists from the BPW Excel file (first column of each sheet)
bpw_neg = list(pd.read_excel('BPW_Dictionary.xlsx', sheet_name='NEG_BPW', header=None).iloc[:,0]) 
bpw_pos = list(pd.read_excel('BPW_Dictionary.xlsx', sheet_name='POS_BPW', header=None).iloc[:,0])

# The word 'falsch' is read from Excel as the boolean False; convert it back to the string
bpw_neg = ['falsch' if word is False else word for word in bpw_neg]

print(bpw_neg[:5])
print(bpw_pos[:5])

['abbau', 'abbauen', 'abbauend', 'abbauende', 'abbauendem']
['adäquat', 'adäquate', 'adäquatem', 'adäquaten', 'adäquater']


I then calculate the number of negative words in each article.

In [5]:
import multiprocessing as mp
import count_words_chunk
import numpy as np
from datetime import datetime

# Number of cores to use: leave four cores free, but always use at least one
NUM_CORE = max(1, mp.cpu_count() - 4)

# Split data into chunks for parallel processing
chunk_size = len(data.texts) // NUM_CORE + 1 
text_chunks = [data.texts[i:i + chunk_size] for i in range(0, len(data.texts), chunk_size)]

startTime = datetime.now()

if __name__ == "__main__":
    
    pool = mp.Pool(NUM_CORE)

    # Process each chunk in parallel
    nw_results = pool.starmap(count_words_chunk.count_words_chunk, [(chunk, bpw_neg) for chunk in text_chunks])

    # Close and join the pool
    pool.close()
    pool.join()

    # Combine results from all chunks
    negative_counts = np.concatenate(nw_results)

print(datetime.now() - startTime)   

0:01:10.295570
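
The module `count_words_chunk` is not shown in this notebook. As a rough idea of what such a helper could do, and not the actual implementation, the sketch below counts dictionary hits per text. The function name `count_dictionary_words`, the lowercasing, and the regex tokenization are all assumptions made for this example.

```python
import re

import numpy as np


def count_dictionary_words(texts, dictionary):
    """Count, for each text, how many tokens appear in `dictionary`."""
    vocab = set(dictionary)  # set lookup keeps the per-token check cheap
    counts = []
    for text in texts:
        # Assumed tokenization: lowercase, then split on non-word characters.
        tokens = re.findall(r"\w+", text.lower())
        counts.append(sum(token in vocab for token in tokens))
    return np.array(counts)


toy_texts = ["Der Abbau von Stellen ist falsch", "Eine adäquate Antwort"]
print(count_dictionary_words(toy_texts, ["abbau", "falsch"]))  # [2 0]
```

Returning one NumPy array per chunk matches how the results are combined with `np.concatenate` in the cell above.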


I proceed to calculate the number of positive words in each article.

In [6]:
startTime = datetime.now()

if __name__ == "__main__":
    
    pool = mp.Pool(NUM_CORE)

    # Process each chunk in parallel
    pw_results = pool.starmap(count_words_chunk.count_words_chunk, [(chunk, bpw_pos) for chunk in text_chunks])

    # Close and join the pool
    pool.close()
    pool.join()

    # Combine results from all chunks
    positive_counts = np.concatenate(pw_results)

print(datetime.now() - startTime)   

0:01:09.139934


The final statistic required for calculating sentiment is the total number of words in each article. 

In [7]:
startTime = datetime.now() 

# Import the function calculating the number of words in a text
import count_words_mp

if __name__ == "__main__":
    pool = mp.Pool(NUM_CORE)
    count_results = pool.map(count_words_mp.count_words_mp, [text for text in data['texts']]) 
    pool.close()
    pool.join()
    
print(datetime.now()-startTime)

0:00:18.499805


Finally, I calculate the sentiment of each article as the number of positive words minus the number of negative words, divided by the total word count. Articles with a word count of zero are assigned a sentiment of 0. The resulting value is added as a new column to the DataFrame.

In [8]:
# Convert count_results to a NumPy array for element-wise operations
count_results = np.array(count_results)

# Calculate sentiment: (positive_counts - negative_counts) / count_results
sentiment_measure = np.divide(
    positive_counts - negative_counts, 
    count_results, 
    out=np.zeros_like(negative_counts, dtype=float), 
    where=count_results != 0
)

# Add the sentiment to the DataFrame
data['sentiment'] = sentiment_measure
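
The same `np.divide` pattern on three made-up articles shows how the `out` and `where` arguments behave: where the word count is zero, no division takes place and the pre-filled value 0 from `out` is kept.

```python
import numpy as np

pos = np.array([3, 0, 2])        # positive dictionary words per article
neg = np.array([1, 0, 4])        # negative dictionary words per article
total = np.array([100, 0, 50])   # total words per article

sentiment = np.divide(
    pos - neg,
    total,
    out=np.zeros(len(total), dtype=float),  # default value for skipped entries
    where=total != 0,                       # divide only where the count is non-zero
)

# Values: 0.02 for the first article, 0.0 for the empty one, -0.04 for the third.
print(sentiment)
```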

Now, I incorporate the topic distributions for each article, which were previously computed using the Latent Dirichlet Allocation (LDA) algorithm in the notebook titled `Topic Model Estimation.ipynb`.

In [9]:
# Load the article topics from a CSV file
article_topics = pd.read_csv('article_topic.csv', encoding='utf-8', index_col=0)

# Append the topic columns to `data`; pd.concat with axis=1 aligns the rows on the index
data = pd.concat([data, article_topics], axis=1)
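
Because `pd.concat(..., axis=1)` aligns rows by index label, both frames need identical indices for the topic rows to land next to the right articles. The sketch below, on made-up frames, shows what happens otherwise and adds a small guard. The helper name `concat_checked` is hypothetical and not part of the notebook.

```python
import pandas as pd

left = pd.DataFrame({"sentiment": [0.1, -0.2, 0.0]})         # index 0, 1, 2
right = pd.DataFrame({"T0": [0.7, 0.2], "T1": [0.3, 0.8]})   # index 0, 1

# Rows without a partner are padded with NaN instead of raising an error.
combined = pd.concat([left, right], axis=1)
n_unmatched = int(combined["T0"].isna().sum())
print(n_unmatched)  # 1


def concat_checked(a, b):
    """Concatenate column-wise, but refuse frames whose indices differ."""
    if not a.index.equals(b.index):
        raise ValueError("indices differ; rows would be misaligned or padded with NaN")
    return pd.concat([a, b], axis=1)
```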

I define a function, `get_average_sentiment`, which calculates the average sentiment measure for each topic on a given day. This function selects a specified number of articles with the highest proportions for each topic and computes the average sentiment across these articles. The `calculate_average_sentiment` function applies this calculation to the entire dataset by grouping the data by date and processing each group in parallel using multiprocessing. The results are then combined into a single pandas DataFrame, with the topics as columns and the dates as the index.

In [10]:
from get_average_sentiment_BPW import get_average_sentiment

# Convert 'year', 'month', 'day' to datetime
data['date'] = pd.to_datetime(data[['year', 'month', 'day']])

def calculate_average_sentiment(data, n_articles):
    """
    Calculate the average sentiment per topic and day, based on the
    `n_articles` articles with the highest proportion of each topic.
    """
    # Group data by 'date' to ensure each day stays together
    grouped_by_date = [group for _, group in data.groupby('date')]
    
    # Prepare arguments for starmap (pair each group with the value of n_articles)
    args = [(group, n_articles) for group in grouped_by_date]
    
    # Process each group (one day of data) in parallel; the context manager
    # shuts the pool down once all results have been collected
    with mp.Pool(NUM_CORE) as pool:
        results = pool.starmap(get_average_sentiment, args)

    # Concatenate the results into a single DataFrame
    return pd.concat(results)

In [11]:
startTime = datetime.now() 

# Generate average sentiment for 10 articles
daily_average_sentiment_10 = calculate_average_sentiment(data, n_articles=10)

print(datetime.now()-startTime)

0:01:13.451334
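
`get_average_sentiment` is imported from a module whose source is not shown in this notebook. As a rough sketch of the selection step it is described as performing, and not the actual implementation, the hypothetical helper `average_sentiment_top_n` below picks, for one day, the `n_articles` rows with the largest share of each topic and averages their `sentiment`. The explicit `topic_cols` argument, the toy column names `T0` and `T1`, and all values are assumptions made for illustration.

```python
import pandas as pd


def average_sentiment_top_n(day_df, n_articles, topic_cols):
    """For one day, average `sentiment` over the top-n articles of each topic."""
    row = {
        topic: day_df.nlargest(n_articles, topic)["sentiment"].mean()
        for topic in topic_cols
    }
    # One row per day, indexed by that day's date, one column per topic.
    return pd.DataFrame(row, index=[day_df["date"].iloc[0]])


toy_day = pd.DataFrame({
    "date": pd.to_datetime(["1991-01-01"] * 4),
    "sentiment": [0.02, -0.04, 0.00, 0.08],
    "T0": [0.9, 0.1, 0.5, 0.2],
    "T1": [0.1, 0.9, 0.5, 0.8],
})

result = average_sentiment_top_n(toy_day, n_articles=2, topic_cols=["T0", "T1"])
print(result)
```

For `T0`, the two highest shares belong to the first and third articles (mean sentiment 0.01); for `T1`, to the second and fourth (mean sentiment 0.02). On days with fewer than `n_articles` articles, `nlargest` simply returns all of them.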


Next, I load the daily topics and adjust them using the DataFrame `daily_average_sentiment_10`. The result is a DataFrame in which each daily topic value is multiplied by the average sentiment of that topic on that day.

In [12]:
# Load the daily topics from a CSV file
daily_topics = pd.read_csv('daily_topics.csv', encoding='utf-8')

# Convert year, month, and day into a single date column
daily_topics['date'] = pd.to_datetime(daily_topics[['year','month','day']])
daily_topics.drop(columns=['year', 'month', 'day'], inplace=True)

# Now, set 'date' as index
daily_topics.set_index('date', inplace=True)

In [13]:
# Apply sentiment adjustment to the daily topics
sentiment_adjusted_daily_topics = daily_topics.multiply(daily_average_sentiment_10)
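
`DataFrame.multiply` matches cells by index and column labels, not by position. The toy frames below (made-up dates, topic names, and values) illustrate this, and also show that a column whose label differs between the two frames silently turns into NaN. This makes it worth checking that both frames use the same column labels and the same date index.

```python
import pandas as pd

dates = pd.to_datetime(["1991-01-01", "1991-01-02"])
topics = pd.DataFrame({"T0": [0.6, 0.5], "T1": [0.4, 0.5]}, index=dates)
sentiment = pd.DataFrame({"T0": [0.01, -0.02], "T1": [0.00, 0.03]}, index=dates)

# Element-wise product of cells that share the same date and topic label.
adjusted = topics.multiply(sentiment)
print(adjusted)

# If one frame labels the second topic "1" instead of "T1", nothing matches for it.
mismatched = topics.multiply(sentiment.rename(columns={"T1": "1"}))
print(mismatched.isna().all())
```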

I iterate over each topic and generate a graph comparing the original and sentiment-adjusted values.

In [14]:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import os

# Create a directory to save the plots
os.makedirs('topics_BPW_plots', exist_ok=True)

# Define the shaded areas for recessions
recessions = [
    ("1992-01-01", "1993-12-31"),  # Post-reunification recession
    ("2001-01-01", "2001-12-31"),  # Dot-com recession
    ("2008-01-01", "2009-12-31"),  # Great Recession
    ("2011-01-01", "2013-12-31")   # European sovereign debt crisis
]

# Calculate the 180-day rolling mean for each series
daily_topics_rm = daily_topics.rolling(window=180).mean()
sentiment_adjusted_daily_topics_10_rm = sentiment_adjusted_daily_topics.rolling(window=180).mean()

# Iterate over each topic
for i in range(daily_topics.shape[1]):
    # Generate the plot
    fig, ax1 = plt.subplots(figsize=(12, 6))
    
    # Plot original topics on the primary y-axis
    ax1.plot(daily_topics_rm.index, daily_topics_rm.iloc[:, i], label='Original Topic', color='black')
    ax1.set_xlabel('Date')
    ax1.set_ylabel('Original Topic Proportion', color='black')
    ax1.tick_params(axis='y', labelcolor='black')
    
    # Add shaded areas for recessions
    for start, end in recessions:
        ax1.axvspan(pd.to_datetime(start), pd.to_datetime(end), color='grey', alpha=0.3)
    
    # Create a secondary y-axis for the sentiment-adjusted topics
    ax2 = ax1.twinx()
    ax2.plot(sentiment_adjusted_daily_topics_10_rm.index, sentiment_adjusted_daily_topics_10_rm.iloc[:, i], label='Sentiment-Adjusted Topic', linestyle='--')
    ax2.set_ylabel('Sentiment-Adjusted Topic')
    ax2.tick_params(axis='y')
    
    # Add legends (one per axis)
    ax1.legend(loc='upper left')
    ax2.legend(loc='upper right')
    
    # Format the x-axis to show every year
    ax1.xaxis.set_major_locator(mdates.YearLocator())
    ax1.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
    ax1.tick_params(axis='x', rotation=45)
    
    # Save the plot in the 'topics_BPW_plots' directory
    plt.savefig('topics_BPW_plots/Topic_' + str(i) + '.png')
    
    # Clear the current figure to free memory
    plt.clf()
    
    # Close the current figure to free memory
    plt.close(fig)
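
A note on the rolling mean used above: an integer window such as `rolling(window=180)` counts rows, so it spans exactly 180 days only if the series has one row for every calendar day. The sketch below uses a made-up three-point series with a gap to contrast an integer window with a time-based one. Whether the daily topic series contains gaps is not something this notebook shows, so this is only a hedged illustration.

```python
import pandas as pd

idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-10"])
series = pd.Series([1.0, 2.0, 6.0], index=idx)

# Integer window: the last two rows, regardless of how far apart they are.
by_rows = series.rolling(window=2).mean()

# Offset window: all rows within the last two calendar days.
by_time = series.rolling(window="2D").mean()

print(by_rows.tolist())  # [nan, 1.5, 4.0]
print(by_time.tolist())  # [1.0, 1.5, 6.0]
```

With an integer window the first `window - 1` values are NaN by default, which is why the plotted lines start roughly half a year after the first observation.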

In [15]:
# Reset the index
sentiment_adjusted_daily_topics = sentiment_adjusted_daily_topics.reset_index()

# Create 'year', 'month', and 'day' columns
sentiment_adjusted_daily_topics['year'] = sentiment_adjusted_daily_topics['date'].dt.year
sentiment_adjusted_daily_topics['month'] = sentiment_adjusted_daily_topics['date'].dt.month
sentiment_adjusted_daily_topics['day'] = sentiment_adjusted_daily_topics['date'].dt.day

# Drop the 'date' column, which is now replaced by 'year', 'month', and 'day'
sentiment_adjusted_daily_topics = sentiment_adjusted_daily_topics.drop(columns=['date'])

# Reorder the columns to have 'year', 'month', 'day' as the first three columns
cols = ['year', 'month', 'day'] + [col for col in sentiment_adjusted_daily_topics if col not in ['year', 'month', 'day']]
sentiment_adjusted_daily_topics_format = sentiment_adjusted_daily_topics[cols]

# Save sentiment-adjusted topics to a CSV file
sentiment_adjusted_daily_topics_format.to_csv('BPW_adjusted_daily_topics.csv', encoding='utf-8', index=True)
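
Writing with `index=True` stores the row index as an extra, unnamed first column. The round trip below, on a made-up one-row frame and an in-memory buffer, shows how that column appears when the file is read back without `index_col=0`. This matters whenever columns are later selected by position.

```python
import io

import pandas as pd

df = pd.DataFrame({"year": [1991], "month": [1], "day": [1], "T0": [0.006]})

buffer = io.StringIO()
df.to_csv(buffer, index=True)

# Reading without index_col keeps the saved index as a regular column.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reading with index_col=0 restores the original column layout.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))     # ['Unnamed: 0', 'year', 'month', 'day', 'T0']
print(list(restored.columns))  # ['year', 'month', 'day', 'T0']
```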

Finally, I load the BCC-adjusted topics and compute their 180-day rolling mean. For each of the selected topics, I then plot the original, the BPW-adjusted, and the BCC-adjusted series together.

In [16]:
# Load BCC-adjusted topics
bcc = pd.read_csv('sign_adjusted_daily_topics_format.csv', encoding='utf-8')
# Reconstruct date
bcc['date'] = pd.to_datetime(bcc[['year','month','day']])
bcc = bcc.set_index('date').drop(columns=['year','month','day'])

# 180-day rolling means
bcc_rm = bcc.rolling(window=180).mean()

# Ensure output directory exists
os.makedirs('selected_topics_plots_BPW_BCC', exist_ok=True)

# Recessions shading
recessions = [
    ("1992-01-01", "1993-12-31"),
    ("2001-01-01", "2001-12-31"),
    ("2008-01-01", "2009-12-31"),
    ("2011-01-01", "2013-12-31")
]

# Selected topics
selected_topics = [11, 27, 52, 127, 81, 77, 74, 131, 138, 100]

for idx in selected_topics:
    fig, ax1 = plt.subplots(figsize=(12,6))
    
    # Plot both original and sign-adjusted topic on left axis
    line1, = ax1.plot(daily_topics_rm.index, daily_topics_rm.iloc[:, idx], label='Original topic', color='black')
    line2, = ax1.plot(bcc_rm.index, bcc_rm.iloc[:, idx], label='Sign-adjusted topic (BCC)', color='tab:blue')
    
    ax1.set_xlabel('Date')
    ax1.set_ylabel('Topic Proportion', color='black')
    ax1.tick_params(axis='y', labelcolor='black')
    
    # Add shaded areas for recessions
    for start, end in recessions:
        ax1.axvspan(pd.to_datetime(start), pd.to_datetime(end), color='grey', alpha=0.3)
    
    # Secondary axis for sentiment-adjusted series (BPW)
    ax2 = ax1.twinx()
    line3, = ax2.plot(sentiment_adjusted_daily_topics_10_rm.index, sentiment_adjusted_daily_topics_10_rm.iloc[:, idx],
        label='Sentiment-adjusted topic (BPW)',
        color='tab:orange'
    )
    ax2.set_ylabel('Sentiment-Adjusted Topic')
    ax2.tick_params(axis='y')
    
    # Legend
    ax1.legend(handles=[line1, line2, line3], loc='upper right')
    
    # X-axis formatting
    ax1.xaxis.set_major_locator(mdates.YearLocator())
    ax1.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
    plt.setp(ax1.get_xticklabels(), rotation=45)
    
    # Save
    fig.savefig(f'selected_topics_plots_BPW_BCC/Topic_{idx}.png',
                bbox_inches='tight')
    plt.close(fig)