In this notebook, I construct uncertainty-adjusted topic series in three steps. For each day and each topic, I first select the 10 articles with the highest proportion of that topic. I then average the uncertainty measure across these articles, which gives the topic's uncertainty for that day. Finally, I multiply each daily topic value by the corresponding uncertainty measure.
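The three steps can be written compactly. The notation is my own shorthand and does not appear elsewhere in the project: $\theta_{a,k}$ is the proportion of topic $k$ in article $a$, $u_a$ is the uncertainty measure of article $a$, $A^{(n)}_{t,k}$ is the set of the $n$ articles published on day $t$ with the highest $\theta_{a,k}$, and $T_{t,k}$ is the daily value of topic $k$. The formula assumes that day $t$ has at least $n$ articles.

$$
U_{t,k} = \frac{1}{n} \sum_{a \in A^{(n)}_{t,k}} u_a,
\qquad
\tilde{T}_{t,k} = T_{t,k} \cdot U_{t,k},
\qquad n = 10 .
$$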

First, I load the datasets from Handelsblatt, SZ, Welt, and dpa.

In [1]:
import os
import pandas as pd
from ast import literal_eval

# Set the path variable to point to the 'newspaper_data_processing' directory.
path = os.getcwd().replace('\\nowcasting_with_text\\uncertainty', '\\newspaper_data_processing')

# Load pre-processed 'dpa' dataset from a CSV file.
dpa = pd.read_csv(path + '\\dpa\\' + 'dpa_prepro_final.csv', encoding = 'utf-8', sep=';', index_col = 0,  keep_default_na=False,
                   dtype = {'rubrics': 'str', 
                            'source': 'str',
                            'keywords': 'str',
                            'title': 'str',
                            'city': 'str',
                            'genre': 'str',
                            'wordcount': 'str'},
                  converters = {'paragraphs': literal_eval})

# Keep only the article texts and their respective publication dates.
dpa = dpa[['texts', 'day', 'month', 'year']]

# Load pre-processed 'SZ' dataset from a CSV file.
sz = pd.read_csv(path + '\\SZ\\' + 'sz_prepro_final.csv', encoding = 'utf-8-sig', sep=';', index_col = 0, dtype = {'newspaper': 'str',
                                                                                                 'newspaper_2': 'str',
                                                                                                 'quelle_texts': 'str',
                                                                                                 'page': 'str',
                                                                                                 'rubrics': 'str'})
sz.page = sz.page.fillna('')
sz.newspaper = sz.newspaper.fillna('')
sz.newspaper_2 = sz.newspaper_2.fillna('')
sz.rubrics = sz.rubrics.fillna('')
sz.quelle_texts = sz.quelle_texts.fillna('')

# Keep only the article texts and their respective publication dates.
sz = sz[['texts', 'day', 'month', 'year']]

# Load pre-processed 'Handelsblatt' dataset from a CSV file.
hb = pd.read_csv(path + '\\Handelsblatt\\' + 'hb_prepro_final.csv', encoding = 'utf-8-sig', sep=';', index_col = 0, dtype = {'kicker': 'str',
                                                                                                 'page': 'str',
                                                                                                 'series_title': 'str',
                                                                                                 'rubrics': 'str'})
hb.page = hb.page.fillna('')
hb.series_title = hb.series_title.fillna('')
hb.kicker = hb.kicker.fillna('')
hb.rubrics = hb.rubrics.fillna('')

# Keep only the article texts and their respective publication dates.
hb = hb[['texts', 'day', 'month', 'year']]

# Load pre-processed 'Welt' dataset from a CSV file.
welt = pd.read_csv(path + '\\Welt\\' + 'welt_prepro_final.csv', encoding = 'utf-8-sig', sep=';', index_col = 0, dtype = {'newspaper': 'str',
                                                                                                 'rubrics': 'str',
                                                                                                 'title': 'str'})
welt.title = welt.title.fillna('')
welt.rubrics = welt.rubrics.fillna('')

# Keep only the article texts and their respective publication dates.
welt = welt[['texts', 'day', 'month', 'year']]

# Concatenate the 'dpa', 'sz', 'hb', and 'welt' DataFrames into a single DataFrame 'data'.
data = pd.concat([dpa, sz, hb, welt])

# The number of articles in the final dataset.
print(len(data))

# Sort the data in chronological order.
data = data.sort_values(['year', 'month', 'day'], ascending=[True, True, True])
# Reset the index of the DataFrame
data.reset_index(inplace=True, drop=True)
data.head()

3336299


Unnamed: 0,texts,day,month,year
0,Schalck: Milliardenkredit sicherte Zahlungsfäh...,1,1,1991
1,Welajati: Iran bleibt bei einem Krieg am Golf ...,1,1,1991
2,Bush will offenbar seinen Außenminister erneut...,1,1,1991
3,Sperrfrist 1. Januar 1000 HBV fordert umfassen...,1,1,1991
4,Schamir weist Nahost-Äußerungen des neuen EG-P...,1,1,1991


Next, I calculate the uncertainty measure for each article by determining the ratio of uncertainty-related words to the total number of words in the article. This measure is derived from a set of 46 uncertainty terms that appeared in the corpus at least 20 times. These terms include general words like "Unsicherheit" (uncertainty) and compound terms such as "Unsicherheitsfaktor" (uncertainty factor).
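To make the ratio concrete, here is a minimal sketch on a single toy sentence. The helper modules `count_uncertainty_terms_chunk` and `count_words_mp` used below are not shown in this notebook, so the tokenisation (a simple `\w+` regex), the lower-casing, and the exact-match rule are assumptions for illustration only. The function name `uncertainty_ratio` and the toy inputs are hypothetical.

```python
import re

def uncertainty_ratio(text, terms):
    """Share of tokens in `text` that exactly match one of `terms` (case-insensitive)."""
    # Assumption: tokens are runs of word characters; the real helpers may tokenise differently.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        # Mirror the notebook's convention: an empty article gets a measure of 0.
        return 0.0
    term_set = {term.lower() for term in terms}
    hits = sum(token in term_set for token in tokens)
    return hits / len(tokens)

# Hypothetical toy inputs
toy_terms = ["Unsicherheit", "Unsicherheitsfaktor"]
toy_text = "Die Unsicherheit an den Märkten bleibt ein Unsicherheitsfaktor"

ratio = uncertainty_ratio(toy_text, toy_terms)
print(ratio)
```

The toy sentence has eight tokens, two of which are uncertainty terms, so the measure is 2/8.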

In [2]:
import pandas as pd

# Read the CSV file
df = pd.read_csv("sorted_counts.csv")

# Filter words with count >= 20
uncertainty_terms = df.loc[df['Count'] >= 20, 'Word'].tolist()

In [3]:
import multiprocessing as mp
import count_uncertainty_terms_chunk
import numpy as np
from datetime import datetime

# Number of cores to use (leave four cores free, but always use at least one)
NUM_CORE = max(1, mp.cpu_count() - 4)

# Split data into chunks for parallel processing
chunk_size = len(data.texts) // NUM_CORE + 1 
text_chunks = [data.texts[i:i + chunk_size] for i in range(0, len(data.texts), chunk_size)]

startTime = datetime.now()

if __name__ == "__main__":
    
    pool = mp.Pool(NUM_CORE)

    # Process each chunk in parallel
    uncertainty_results = pool.starmap(count_uncertainty_terms_chunk.count_uncertainty_terms_chunk, [(chunk, uncertainty_terms) for chunk in text_chunks])

    # Close and join the pool
    pool.close()
    pool.join()

    # Combine results from all chunks
    uncertainty_counts = np.concatenate(uncertainty_results)

print(datetime.now() - startTime)   

0:01:59.225366


In [4]:
startTime = datetime.now() 

# Import the function calculating the number of words in a text
import count_words_mp

if __name__ == "__main__":
    pool = mp.Pool(NUM_CORE)
    count_results = pool.map(count_words_mp.count_words_mp, [text for text in data['texts']]) 
    pool.close()
    pool.join()
    
print(datetime.now()-startTime)

0:00:38.902484


In [5]:
# Convert a list to NumPy array
count_results = np.array(count_results)

# Perform element-wise division
uncertainty_measure = np.divide(uncertainty_counts, count_results, out=np.zeros_like(uncertainty_counts, dtype=float), 
                                where=count_results != 0)

data['uncertainty'] = uncertainty_measure

Now, I incorporate the topic distributions for each article, which were previously computed using the Latent Dirichlet Allocation (LDA) algorithm in the notebook titled `Topic Model Estimation.ipynb`.
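Because `pd.concat(..., axis=1)` matches rows by index label rather than by position, the step below only works if `article_topic.csv` has the same index and row order as the sorted `data` frame. The following sketch shows one way to check this on toy data before concatenating. The frames `toy_data` and `toy_article_topics` and the column names `T0` and `T1` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for `data` and `article_topics`
toy_data = pd.DataFrame({"texts": ["first article", "second article"]})
toy_article_topics = pd.DataFrame({"T0": [0.7, 0.2], "T1": [0.3, 0.8]})

# Fail early if the two frames do not share the same index
if not toy_data.index.equals(toy_article_topics.index):
    raise ValueError("Index mismatch: topic rows would be attached to the wrong articles")

# Column-wise concatenation; rows are aligned on the index
toy_merged = pd.concat([toy_data, toy_article_topics], axis=1)
print(toy_merged)
```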

In [6]:
# Set the path variable to point to the 'topics' directory.
path = os.getcwd().replace('\\uncertainty', '\\topics')

# Load the article topics from a CSV file
article_topics = pd.read_csv(path + '\\article_topic.csv', encoding='utf-8', index_col=0)

# Append the topic columns to `data`. Rows are matched on the index,
# so both DataFrames must share the same index and row order.
data = pd.concat([data, article_topics], axis=1)

The function `get_average_uncertainty`, imported from the module of the same name, calculates the average uncertainty measure for each topic on a given day. It selects a specified number of articles with the highest proportion of each topic and averages the uncertainty measure across these articles. The `calculate_average_uncertainty` function applies this calculation to the entire dataset by grouping the data by date and processing each group in parallel using multiprocessing. The results are then combined into a single pandas DataFrame, with the topics as columns and the dates as the index.
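The module `get_average_uncertainty` is not reproduced in this notebook, so the following is only a sketch of what the per-day calculation described above might look like, not the actual implementation. The function name `average_uncertainty_top_n`, the `topic_cols` argument, and the toy data are hypothetical.

```python
import pandas as pd

def average_uncertainty_top_n(day_df, topic_cols, n_articles=10):
    """For one day of articles, average 'uncertainty' over the top-n articles of each topic."""
    averages = {
        # nlargest keeps the n rows with the highest proportion of this topic
        topic: day_df.nlargest(n_articles, topic)["uncertainty"].mean()
        for topic in topic_cols
    }
    # One row per day, indexed by the date, with one column per topic
    return pd.DataFrame(averages, index=[day_df["date"].iloc[0]])

# Hypothetical toy day with four articles and two topics
toy = pd.DataFrame({
    "date": pd.to_datetime(["1991-01-01"] * 4),
    "uncertainty": [0.00, 0.02, 0.04, 0.10],
    "T0": [0.9, 0.8, 0.1, 0.2],
    "T1": [0.1, 0.2, 0.9, 0.8],
})

toy_result = average_uncertainty_top_n(toy, ["T0", "T1"], n_articles=2)
print(toy_result)
```

With `n_articles=2`, topic `T0` is dominated by the first two articles (mean uncertainty 0.01) and topic `T1` by the last two (mean uncertainty 0.07). If a day has fewer than `n_articles` articles, `nlargest` simply returns all of them, so the average is then taken over the whole day.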

In [7]:
from get_average_uncertainty import get_average_uncertainty

# Convert 'year', 'month', 'day' to datetime
data['date'] = pd.to_datetime(data[['year', 'month', 'day']])

def calculate_average_uncertainty(data, n_articles):
    """
    Calculate the average uncertainty per topic and day, based on the
    `n_articles` articles with the highest proportion of each topic.
    """
    # Group data by 'date' so that each day is processed as one unit
    grouped_by_date = [group for _, group in data.groupby('date')]
    
    # Prepare arguments for starmap (pair each group with the value of n_articles)
    args = [(group, n_articles) for group in grouped_by_date]
    
    # Process each group (one day of data) in parallel. The context manager
    # shuts the pool down even if a worker raises an exception.
    with mp.Pool(NUM_CORE) as pool:
        results = pool.starmap(get_average_uncertainty, args)

    # Concatenate the per-day results into a single DataFrame
    return pd.concat(results)

In [8]:
startTime = datetime.now() 

# Generate average uncertainty for 5, 10, and 15 articles
daily_average_uncertainty_5 = calculate_average_uncertainty(data, n_articles=5)
daily_average_uncertainty_10 = calculate_average_uncertainty(data, n_articles=10)
daily_average_uncertainty_15 = calculate_average_uncertainty(data, n_articles=15)

print(datetime.now()-startTime)

0:09:53.736113


Next, I load the daily topics and adjust them using the DataFrame `daily_average_uncertainty_10`. In the resulting DataFrame, each daily topic value is multiplied by the average uncertainty of that topic on the same day.
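`DataFrame.multiply` aligns both operands on index and column labels before multiplying, so the adjustment only yields numbers where a date and a topic label exist in both frames. The sketch below illustrates this on toy data. The frames and the column names `T0` and `T1` are hypothetical, and the missing third day is added deliberately to show the resulting `NaN` values.

```python
import pandas as pd

dates = pd.to_datetime(["1991-01-01", "1991-01-02", "1991-01-03"])

# Hypothetical daily topic values for three days
toy_topics = pd.DataFrame({"T0": [0.6, 0.5, 0.7], "T1": [0.4, 0.5, 0.3]}, index=dates)

# Hypothetical average uncertainty: columns in a different order, last day missing
toy_uncertainty = pd.DataFrame({"T1": [0.02, 0.04], "T0": [0.01, 0.03]}, index=dates[:2])

# Element-wise product after aligning on dates and topic labels
toy_adjusted = toy_topics.multiply(toy_uncertainty)
print(toy_adjusted)
```

Column order does not matter because the labels are matched, but a day that is missing from the uncertainty frame produces a row of `NaN` values. Checking `toy_adjusted.isna().sum()` after the real multiplication is a cheap way to detect mismatched dates or column labels.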

In [9]:
# Load the daily topics from a CSV file
daily_topics = pd.read_csv(path + '\\daily_topics.csv', encoding='utf-8')

# Convert year, month, and day into a single date column
daily_topics['date'] = pd.to_datetime(daily_topics[['year','month','day']])
daily_topics.drop(columns=['year', 'month', 'day'], inplace=True)

# Now, set 'date' as index
daily_topics.set_index('date', inplace=True)

In [10]:
# Apply uncertainty adjustment to the daily topics
uncertainty_adjusted_daily_topics = daily_topics.multiply(daily_average_uncertainty_10)

I iterate over each topic and generate a graph comparing the original and uncertainty-adjusted values.

In [11]:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import os

# Create a directory to save the plots
os.makedirs('topics_plots', exist_ok=True)

# Define the shaded areas for recessions
recessions = [
    ("1992-01-01", "1993-12-31"),  # Post-reunification recession
    ("2001-01-01", "2001-12-31"),  # Dot-com recession
    ("2008-01-01", "2009-12-31"),  # Great Recession
    ("2011-01-01", "2013-12-31")   # European sovereign debt crisis
]

# Calculate the 180-day rolling mean for each series
daily_topics_rm = daily_topics.rolling(window=180).mean()
uncertainty_adjusted_daily_topics_10_rm = uncertainty_adjusted_daily_topics.rolling(window=180).mean()

# Iterate over each topic
for i in range(daily_topics.shape[1]):
    # Generate the plot
    fig, ax1 = plt.subplots(figsize=(12, 6))
    
    # Plot original topics on the primary y-axis
    ax1.plot(daily_topics_rm.index, daily_topics_rm.iloc[:, i], label='Original Topic', color='black')
    ax1.set_xlabel('Date')
    ax1.set_ylabel('Original Topic Proportion', color='black')
    ax1.tick_params(axis='y', labelcolor='black')
    
    # Add shaded areas for recessions
    for start, end in recessions:
        ax1.axvspan(pd.to_datetime(start), pd.to_datetime(end), color='grey', alpha=0.3)
    
    # Create a secondary y-axis for the uncertainty-adjusted topics
    ax2 = ax1.twinx()
    ax2.plot(uncertainty_adjusted_daily_topics_10_rm.index, uncertainty_adjusted_daily_topics_10_rm.iloc[:, i], label='Uncertainty-Adjusted Topic', linestyle='--')
    ax2.set_ylabel('Uncertainty-Adjusted Topic')
    ax2.tick_params(axis='y')
    
    # Add legends for both axes
    ax1.legend(loc='upper left')
    ax2.legend(loc='upper right')
    
    # Format the x-axis to show every year
    ax1.xaxis.set_major_locator(mdates.YearLocator())
    ax1.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
    ax1.tick_params(axis='x', rotation=45)
    
    # Save the plot in the 'topics_plots' directory
    plt.savefig('topics_plots/Topic_' + str(i) + '.png')
    
    # Clear the current figure to free memory
    plt.clf()
    
    # Close the current figure to free memory
    plt.close(fig)

To check the robustness of my results to the choice of the number of top articles used to determine uncertainty, I compare the uncertainty-adjusted topics for 5, 10, and 15 articles.

In [12]:
# Create a directory to save the plots
os.makedirs('topics_plots_number_of_articles', exist_ok=True)

# Calculate the 180-day rolling mean for each series
uncertainty_adjusted_daily_topics_5_rm = daily_topics.multiply(daily_average_uncertainty_5).rolling(window=180).mean()
uncertainty_adjusted_daily_topics_15_rm = daily_topics.multiply(daily_average_uncertainty_15).rolling(window=180).mean()

# Iterate over each topic
for i in range(daily_topics.shape[1]):
    # Generate the plot
    fig, ax1 = plt.subplots(figsize=(12, 6))
    
    # Plot original topics on the primary y-axis
    ax1.plot(daily_topics_rm.index, daily_topics_rm.iloc[:, i], label='Original Topic', color='black')
    ax1.set_xlabel('Date')
    ax1.set_ylabel('Original Topic Proportion', color='black')
    ax1.tick_params(axis='y', labelcolor='black')
    
    # Add shaded areas for recessions
    for start, end in recessions:
        ax1.axvspan(pd.to_datetime(start), pd.to_datetime(end), color='grey', alpha=0.3)
    
    # Create a secondary y-axis for the uncertainty-adjusted topics
    ax2 = ax1.twinx()
    ax2.plot(uncertainty_adjusted_daily_topics_5_rm.index, uncertainty_adjusted_daily_topics_5_rm.iloc[:, i], label='Uncertainty-Adjusted (Top 5 Articles)', linestyle='--', color='blue')
    ax2.plot(uncertainty_adjusted_daily_topics_10_rm.index, uncertainty_adjusted_daily_topics_10_rm.iloc[:, i], label='Uncertainty-Adjusted (Top 10 Articles)', linestyle='-.', color='green')
    ax2.plot(uncertainty_adjusted_daily_topics_15_rm.index, uncertainty_adjusted_daily_topics_15_rm.iloc[:, i], label='Uncertainty-Adjusted (Top 15 Articles)', linestyle=':', color='red')
    ax2.set_ylabel('Uncertainty-Adjusted Topic')
    ax2.tick_params(axis='y')
    
    # Add legends for both axes
    ax1.legend(loc='upper left')
    ax2.legend(loc='upper right')
    
    # Format the x-axis to show every year
    ax1.xaxis.set_major_locator(mdates.YearLocator())
    ax1.xaxis.set_major_formatter(mdates.DateFormatter('%Y'))
    ax1.tick_params(axis='x', rotation=45)
    
    # Save the plot in the 'topics_plots_number_of_articles' directory
    plt.savefig('topics_plots_number_of_articles/Topic_' + str(i) + '.png')
    
    # Clear the current figure to free memory
    plt.clf()
    
    # Close the current figure to free memory
    plt.close(fig)

In [13]:
# Reset the index
uncertainty_adjusted_daily_topics = uncertainty_adjusted_daily_topics.reset_index()

# Create 'year', 'month', and 'day' columns
uncertainty_adjusted_daily_topics['year'] = uncertainty_adjusted_daily_topics['date'].dt.year
uncertainty_adjusted_daily_topics['month'] = uncertainty_adjusted_daily_topics['date'].dt.month
uncertainty_adjusted_daily_topics['day'] = uncertainty_adjusted_daily_topics['date'].dt.day

# Drop the 'date' column, which is now redundant
uncertainty_adjusted_daily_topics = uncertainty_adjusted_daily_topics.drop(columns=['date'])

# Reorder the columns to have 'year', 'month', 'day' as the first three columns
cols = ['year', 'month', 'day'] + [col for col in uncertainty_adjusted_daily_topics if col not in ['year', 'month', 'day']]
uncertainty_adjusted_daily_topics_format = uncertainty_adjusted_daily_topics[cols]

# Save uncertainty-adjusted topics to a CSV file
uncertainty_adjusted_daily_topics_format.to_csv('uncertainty_adjusted_daily_topics.csv', encoding='utf-8', index=True)