We acquired our training set for sentiment analysis from Media Tenor International, a Swiss-based media analysis institute. Specifically, we received an Excel file containing details such as the article's publication date, source, title, sentiment topic (e.g., sentiment towards business cycle conditions), the number of annotators who evaluated each article as negative, neutral, or positive, and the total number of annotators per article. We used this metadata to download the full texts of articles from LexisNexis and Factiva.

During this process, we found spelling mistakes and slight variations in article titles, as well as discrepancies in publication dates, in the original metadata. We corrected these entries in the Excel file so that each downloaded article could be matched reliably to its metadata record.

In this notebook, we demonstrate the modifications made to the titles and publication dates in the Excel file to ensure they match their counterparts from the downloaded articles.

In [1]:
import pandas as pd

# Load the Excel files as dataframes
corrected_data = pd.read_excel('Daten_Wirtschaftliche_Lage.xlsx')
original_data = pd.read_excel('Daten_Wirtschaftliche_Lage_original.xlsx')
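
The element-wise comparison used later in this notebook only works if both dataframes have the same shape, column labels, and row order. As a minimal sketch, a sanity check along the following lines could be run right after loading. The helper `check_aligned` and the toy dataframes are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def check_aligned(corrected: pd.DataFrame, original: pd.DataFrame) -> None:
    """Raise ValueError if the two frames cannot be compared cell by cell.

    Hypothetical helper for illustration only.
    """
    if corrected.shape != original.shape:
        raise ValueError(f"shape mismatch: {corrected.shape} vs {original.shape}")
    if not corrected.columns.equals(original.columns):
        raise ValueError("column labels differ")
    if not corrected.index.equals(original.index):
        raise ValueError("row labels differ")


# Toy stand-ins for the two Excel files (made-up values)
a = pd.DataFrame({"date": ["01.06.2014"], "title": ["A"]})
b = pd.DataFrame({"date": ["01.06.2014"], "title": ["B"]})

check_aligned(a, b)  # passes silently: same shape, columns, and index
```

With the real data, the call would presumably be `check_aligned(corrected_data, original_data)`.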

We isolate the rows of `corrected_data` that differ from `original_data` in at least one column, which captures the modified titles and dates, and then keep only articles under the 'Konjunktur' sentiment topic with non-empty titles. The resulting dataframe, `differences`, contains the corrected rows and can be compared directly with the original data to see the exact changes.

In [2]:
# Identify rows in corrected_data where at least one column differs from original_data
differences = corrected_data[~(corrected_data == original_data).all(axis=1)]

# Further filter the differences to include only those rows where 'topicgroup' is 'Konjunktur' and 'title' is not null
differences = differences[(differences.topicgroup == 'Konjunktur') & (differences.title.notnull())]
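
One caveat with the `==` comparison: pandas treats a missing value as unequal to another missing value, so a row that is empty in the same cell in both files is flagged as different even though nothing changed. The following sketch uses toy dataframes (made-up values, not the Media Tenor data) to contrast the naive mask with a NaN-aware one. The variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for original_data / corrected_data (made-up values)
original = pd.DataFrame({
    "date": ["01.06.2014", "02.06.2014", "03.06.2014"],
    "title": ["A", np.nan, "C"],
})
corrected = pd.DataFrame({
    "date": ["01.06.2014", "02.06.2014", "04.06.2014"],
    "title": ["A", np.nan, "C"],
})

# Naive mask: NaN == NaN is False, so row 1 is flagged although it is unchanged
naive_mask = ~(corrected == original).all(axis=1)

# NaN-aware mask: a cell counts as equal if the values match or both are missing
equal_cells = (corrected == original) | (corrected.isna() & original.isna())
nan_aware_mask = ~equal_cells.all(axis=1)

naive_rows = list(corrected.index[naive_mask])
nan_aware_rows = list(corrected.index[nan_aware_mask])
```

Here only row 2 (the changed date) survives the NaN-aware mask. In the notebook, the `title.notnull()` filter removes rows with a missing title, but missing values in other columns could still cause rows to be flagged.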

As an example, in the first modified row we changed the title from "Erdogans Attacke auf unsere Einwanderungspolitik" to "Erdogans Attacke auf unsere Einwanderungsrepublik" in the corrected dataset. The article was published under the corrected title on the same day and by the same newspaper, so this correction was necessary to match it.

In [3]:
# differences.index holds row labels, so select by label with .loc
corrected_data.loc[differences.index[0]]

date                                                        01.06.2014
month                                                           201406
medium                                                            WamS
title                Erdogans Attacke auf unsere Einwanderungsrepublik
topicgroup                                                  Konjunktur
negative                                                             0
no_clear_tone                                                        0
positive                                                             1
Number_of_reports                                                    1
AverageRating                                                    100.0
Name: 129, dtype: object

In [4]:
# Same row label in the original data, for comparison
original_data.loc[differences.index[0]]

date                                                       01.06.2014
month                                                          201406
medium                                                           WamS
title                Erdogans Attacke auf unsere Einwanderungspolitik
topicgroup                                                 Konjunktur
negative                                                            0
no_clear_tone                                                       0
positive                                                            1
Number_of_reports                                                   1
AverageRating                                                   100.0
Name: 129, dtype: object
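
Rather than comparing the two printed rows by eye, the fields that differ can be extracted directly. The sketch below rebuilds a shortened version of the row shown above as two Series and uses `Series.compare` (available since pandas 1.1). The variable names and the renamed column labels are assumptions for illustration.

```python
import pandas as pd

# Shortened copies of the row shown above (corrected vs. original)
corrected_row = pd.Series({
    "date": "01.06.2014",
    "medium": "WamS",
    "title": "Erdogans Attacke auf unsere Einwanderungsrepublik",
})
original_row = pd.Series({
    "date": "01.06.2014",
    "medium": "WamS",
    "title": "Erdogans Attacke auf unsere Einwanderungspolitik",
})

# compare() keeps only the entries that differ; its columns are 'self' and 'other'
changed = corrected_row.compare(original_row)
changed.columns = ["corrected", "original"]
print(changed)
```

With the full dataframes, the same idea would presumably read `corrected_data.loc[i].compare(original_data.loc[i])` for a row label `i` taken from `differences.index`.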

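In total, the filter yields 228 modified rows, that is, 228 'Konjunktur' articles with a non-empty title whose metadata differs from the original file.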

In [5]:
len(differences)

228
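
The row count does not show whether a correction affected the title, the date, or both. A per-column breakdown could be obtained along the following lines. This is a sketch on toy dataframes with made-up values and no missing entries, not a result from the actual files.

```python
import pandas as pd

# Toy stand-ins for original_data / corrected_data (made-up values)
original = pd.DataFrame({
    "date": ["01.06.2014", "02.06.2014", "03.06.2014"],
    "title": ["A", "B", "C"],
})
corrected = pd.DataFrame({
    "date": ["01.06.2014", "05.06.2014", "03.06.2014"],
    "title": ["A", "B2", "C2"],
})

# True wherever a cell was changed
changed_cells = corrected.ne(original)

# Number of changed cells per column
per_column = changed_cells.sum()

# Number of rows with at least one changed cell
rows_changed = int(changed_cells.any(axis=1).sum())
```

In this toy case, one date and two titles were changed across two rows, so the number of changed cells (three) exceeds the number of changed rows (two).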