In this notebook, we focus on loading the full texts of articles from WamS related to business cycle conditions. All of these articles were available in Factiva. We downloaded the articles in RTF format, transformed them into TXT files, and then read them into a DataFrame. Our goal is to create a DataFrame containing the following information for each article: the name of the newspaper, the day, month, and year of publication, the title, the full text, and the sentiment based on Media Tenor's annotations.

Welt is a popular daily newspaper with a wide circulation. WamS, or Welt am Sonntag, is its Sunday edition.

## Media Tenor Dataset

To match the articles downloaded from Factiva and LexisNexis with their metadata from the Media Tenor dataset, we first need to load the Media Tenor dataset. We only retain articles with non-empty titles, as it is not possible to identify and download articles without titles.

In [1]:
import pandas as pd

# Load the dataset acquired from Media Tenor
sentiment_data = pd.read_csv('Daten_Wirtschaftliche_Lage.csv', encoding='utf-8', sep=';')

# Filter out rows with empty titles, as we cannot identify and download the articles without titles
sentiment_data = sentiment_data[sentiment_data['title'].notnull()]

# Reset the index of the DataFrame
sentiment_data = sentiment_data.reset_index(drop=True)

# Display the first few rows of the DataFrame
sentiment_data.head()

Unnamed: 0,date,month,medium,title,topicgroup,negative,no_clear_tone,positive,Number_of_reports,AverageRating
0,01.01.2014,201401,WamS,Koalition,Konjunktur,0,1,0,1,0
1,01.01.2017,201701,FAS,Habt bloß keine Angst vor China !,Internationale Wirtschaft,0,0,1,1,100
2,01.01.2017,201701,BamS,Wir leben in einer Zeit der Wohlstands-Halluzi...,Konjunktur,0,0,1,1,100
3,01.02.2015,201502,WamS,Teheran ruft,Wettbewerbsfähigkeit/Nachfrage,1,3,0,4,-25
4,01.01.2017,201701,BamS,"Geht es und wirklich so gut, wie es uns Merkel...",Internationale Wirtschaft,0,1,0,1,0


The titles in the Media Tenor dataset were manually entered, leading to potential inconsistencies in punctuation and spacing. To address this issue and ensure accurate matching with the titles of the articles we download from databases, we normalize the titles in the dataset.

In [2]:
# Import the Normalize class from the normalize module
from normalize import Normalize

# Initialize the Normalize class with the titles from the 'sentiment_data' DataFrame
normalizer = Normalize(sentiment_data.title)

# Apply the normalization process to the titles
normalized_titles = normalizer.normalized()

# Add the normalized titles to the sentiment_data DataFrame as a new column 'title_clean'
sentiment_data['title_clean'] = normalized_titles

sentiment_data.head()

Unnamed: 0,date,month,medium,title,topicgroup,negative,no_clear_tone,positive,Number_of_reports,AverageRating,title_clean
0,01.01.2014,201401,WamS,Koalition,Konjunktur,0,1,0,1,0,koalition
1,01.01.2017,201701,FAS,Habt bloß keine Angst vor China !,Internationale Wirtschaft,0,0,1,1,100,habt bloß keine angst vor china
2,01.01.2017,201701,BamS,Wir leben in einer Zeit der Wohlstands-Halluzi...,Konjunktur,0,0,1,1,100,wir leben in einer zeit der wohlstands halluzi...
3,01.02.2015,201502,WamS,Teheran ruft,Wettbewerbsfähigkeit/Nachfrage,1,3,0,4,-25,teheran ruft
4,01.01.2017,201701,BamS,"Geht es und wirklich so gut, wie es uns Merkel...",Internationale Wirtschaft,0,1,0,1,0,geht es und wirklich so gut wie es uns merkel ...
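
The `Normalize` class comes from a local module whose code is not shown in this notebook. The sketch below is a hypothetical stand-in, not the actual implementation: it assumes that normalization means lowercasing, replacing punctuation with spaces, and collapsing whitespace, which is consistent with the `title_clean` values displayed above. The function name `normalize_title` is invented for illustration.

```python
import re


def normalize_title(title: str) -> str:
    """Hypothetical stand-in for the notebook's Normalize class."""
    # Lowercase so that capitalization differences do not block a match
    lowered = title.lower()
    # Replace every character that is neither a word character nor whitespace
    # (punctuation, quotation marks, hyphens) with a space
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    # Collapse runs of whitespace and trim both ends
    return " ".join(no_punct.split())


# Titles taken from the tables shown in this notebook
examples = [
    "Habt bloß keine Angst vor China !",
    "„Größtmöglicher Nutzen beim Kochen“",
    "Industrie : Wachstum hält trotz Krise an",
]
normalized_examples = [normalize_title(t) for t in examples]
```

Applying the same function to both the Media Tenor titles and the downloaded titles is what makes the later merge on `title_clean` robust to spacing and punctuation differences.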


We restrict the dataset to annotated WamS articles on business cycle conditions, since these are the articles we downloaded from the databases.

In [3]:
# Keep only WamS articles related to business cycle conditions (Konjunktur)
is_wams = sentiment_data['medium'] == 'WamS'
is_konjunktur = sentiment_data['topicgroup'] == 'Konjunktur'
sentiment_data = sentiment_data[is_wams & is_konjunktur]

# Reset the index of the DataFrame and discard the old index
sentiment_data = sentiment_data.reset_index(drop=True)

We filter the Media Tenor dataset to only keep articles where there was agreement between annotators on sentiment. Articles without annotator agreement (i.e., where `sentiment` is `NaN`) are removed.

In [4]:
from sentiment import sentiment

# Apply the 'sentiment' function to each row of the DataFrame and create a new 'sentiment' column
sentiment_data['sentiment'] = sentiment_data.apply(sentiment, axis=1)

# Remove articles where there is no annotator agreement (i.e., sentiment is NaN)
sentiment_data = sentiment_data[sentiment_data['sentiment'].notnull()]

# Reset the index of the DataFrame again and remove the old index column
sentiment_data = sentiment_data.reset_index(drop=True)

# Display the first few rows of the filtered DataFrame to verify the results
sentiment_data.head()

Unnamed: 0,date,month,medium,title,topicgroup,negative,no_clear_tone,positive,Number_of_reports,AverageRating,title_clean,sentiment
0,01.01.2014,201401,WamS,Koalition,Konjunktur,0,1,0,1,0,koalition,0.0
1,01.02.2015,201502,WamS,„Größtmöglicher Nutzen beim Kochen“,Konjunktur,0,1,0,1,0,größtmöglicher nutzen beim kochen,0.0
2,01.02.2015,201502,WamS,Rebellion gegen Merkel,Konjunktur,2,0,0,2,-100,rebellion gegen merkel,-1.0
3,01.01.2017,201701,WamS,Alles wird gut,Konjunktur,0,0,2,2,100,alles wird gut,1.0
4,01.03.2020,202003,WamS,Deutschland bereitet sich auf Corona-Pandemie vor,Konjunktur,3,0,0,3,-100,deutschland bereitet sich auf corona pandemie vor,-1.0
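
The `sentiment` helper is imported from a module that is not shown here, so its exact rule is unknown. The sketch below is one possible rule, written under a loudly stated assumption: annotators "agree" when exactly one of the counts `negative`, `no_clear_tone`, and `positive` is non-zero. This assumption reproduces the labels in the rows displayed above, but the real function may define agreement differently. The name `agreed_sentiment` is hypothetical.

```python
import math


def agreed_sentiment(row):
    """Return -1.0, 0.0, or 1.0 if all reports share one tone, else NaN.

    ASSUMPTION: agreement means exactly one of the three counts is non-zero.
    The notebook's own `sentiment` function may use a different rule.
    """
    counts = {
        -1.0: row["negative"],
        0.0: row["no_clear_tone"],
        1.0: row["positive"],
    }
    # Collect the labels that at least one report was assigned to
    labels_used = [label for label, n in counts.items() if n > 0]
    if len(labels_used) == 1:
        return labels_used[0]
    # Mixed tones (or no reports at all): no agreed label
    return math.nan


rows = [
    {"negative": 2, "no_clear_tone": 0, "positive": 0},  # unanimous negative
    {"negative": 0, "no_clear_tone": 1, "positive": 0},  # unanimous neutral
    {"negative": 0, "no_clear_tone": 0, "positive": 2},  # unanimous positive
    {"negative": 1, "no_clear_tone": 3, "positive": 0},  # mixed tones
]
labels = [agreed_sentiment(r) for r in rows]
```

Because the function only indexes the row by column name, it could also be passed to `DataFrame.apply(..., axis=1)` in the same way as the notebook's `sentiment` function.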


## Load and Match Articles from Factiva

Next, we load the WamS articles downloaded from Factiva and match them with their metadata. As a first step, we convert the RTF files to TXT format. The RTF files are stored in `MediaTenor_LexisNexis_Factiva/WamS_Konjunktur_rtf`, and the converted TXT files are written to `MediaTenor_LexisNexis_Factiva/WamS_Konjunktur_txt`.

In [5]:
import os

# Import the function for converting RTF to TXT
from convert_rtf_to_txt import convert_rtf_to_txt

# Define paths for WamS RTF and TXT directories
wams_rtf_path = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'WamS_Konjunktur_rtf')
wams_txt_path = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'WamS_Konjunktur_txt')

# Convert RTF files to TXT format for WamS
convert_rtf_to_txt(wams_rtf_path, wams_txt_path)

After converting the RTF files to TXT format, we edited the TXT files to correct the spelling and punctuation of several titles, which is important for matching them with the metadata from the Media Tenor dataset. For example, the title "Wer **kann ' s** besser" was corrected to "Wer **kann's** besser".

Once the TXT files were ready, we used the function `extract_article_data_wams` to load the text of the articles along with the journal's name, date of publication, title, and file name into a dictionary called `article_data`.

In [6]:
import extract_article_data_wams

# Read and extract relevant information from TXT files in WamS directory.
article_data = extract_article_data_wams.extract_article_data_wams(wams_txt_path)

We use the `article_data` dictionary to create a DataFrame `wams` that includes columns for the journal's name, publication date (day, month, and year), article title, text, and file name.

In [7]:
# Create a DataFrame from the collected data
wams = pd.DataFrame({
    'journal': article_data['journal'],
    'day': article_data['day'],
    'month': article_data['month'],
    'year': article_data['year'],
    'title': article_data['title'],
    'text': article_data['text'],
    'file': article_data['file']
})

wams.head()

Unnamed: 0,journal,day,month,year,title,text,file
0,WamS,3,Mai,2020,Regierung kritisiert Gerichte für Urteile gege...,"Kanzleramtschef Braun beklagt ""Herausforderung...",Factiva-20200813-1103.txt
1,WamS,3,Mai,2020,Was war denn das jetzt ?,"Zu den Dingen, die es vor der Corona-Epidemie ...",Factiva-20200813-1105.txt
2,WamS,3,Mai,2020,""" Wir können stolz sein """,Kanzleramtschef Helge Braun (CDU) verteidigt d...,Factiva-20200813-1105_1.txt
3,WamS,3,Mai,2020,Krise mit Ansage,PiS-Chef Kaczynski möchte trotz Lockdown eine ...,Factiva-20200813-1107.txt
4,WamS,3,Mai,2020,Die neuen Leiden der Generation Z,Wer in jungen Jahren eine schwere Rezession mi...,Factiva-20200813-1108.txt


To match the full texts of the loaded articles with their sentiment annotations from the Media Tenor dataset, we proceed in several steps. First, we build a date string in the same `DD.MM.YYYY` format used in the `sentiment_data` DataFrame. Next, we remove duplicate articles that were mistakenly downloaded twice and normalize the titles in the same way as the Media Tenor titles. We then merge the Factiva articles with their sentiment annotations using an inner join on the normalized title and the date, so only articles present in both sources are kept. Finally, we sort the resulting DataFrame `data_match` chronologically and retain only the relevant columns. This process matches **468** WamS articles from Factiva with their sentiment annotations.

In [8]:
# Map German month names to two-digit month numbers
name_to_number = {
    'Januar': '01', 'Februar': '02', 'März': '03', 'April': '04', 'Mai': '05',
    'Juni': '06', 'Juli': '07', 'August': '08', 'September': '09', 'Oktober': '10',
    'November': '11', 'Dezember': '12'
}

# Transform month names into month numbers
wams['month_num'] = wams['month'].map(name_to_number)

# Pad single-digit day numbers to two digits (e.g. '3' -> '03')
wams['day'] = wams['day'].astype(str).str.zfill(2)

# Combine day, month, and year into a date string
wams['date'] = wams.apply(lambda row: f"{row['day']}.{row['month_num']}.{row['year']}", axis=1)

# Drop duplicated articles, keeping the first occurrence
# We have duplicates because we mistakenly downloaded the same article twice
wams = wams.drop_duplicates(['text', 'year', 'month', 'day'], keep='first')

# Reset the index of the DataFrame
wams = wams.reset_index(drop=True)

# Initialize the Normalize class with the titles from the 'wams' DataFrame
normalizer = Normalize(wams.title)

# Apply the normalization process to the titles
normalized_titles = normalizer.normalized()

# Add the normalized titles to the 'wams' DataFrame as a new column 'title_clean'
wams['title_clean'] = normalized_titles

# Merge with sentiment_data on title_clean and date
data_match = pd.merge(sentiment_data, wams, how='inner', on=['title_clean', 'date'])

# Rename the 'month_num' column to 'month'
# (the overlapping 'month' columns of the two inputs become 'month_x' and 'month_y')
data_match = data_match.rename(columns={'month_num': 'month'})

# 'year' exists only in 'wams', so it keeps its name after the merge and needs no renaming

# Convert year, month, and day to integers
data_match['year'] = data_match['year'].astype(int)
data_match['month'] = data_match['month'].astype(int)
data_match['day'] = data_match['day'].astype(int)

# Sort the data in chronological order
data_match = data_match.sort_values(['year', 'month', 'day'], ascending=[True, True, True])

# Reset the index of the DataFrame
data_match = data_match.reset_index(drop=True)

# Rename the 'title_y' column to 'title' to reflect the title from the Factiva dataset
data_match = data_match.rename(columns={'title_y': 'title'})

# Select only the required columns
data_match = data_match[['journal', 'day', 'month', 'year', 'title', 'text', 'sentiment', 'file', 'title_clean']]

# Print the number of articles
num_articles = len(data_match)

print(f"Number of articles: {num_articles}")

# Display the first few rows of the final matched dataset
data_match.head()

Number of articles: 468


Unnamed: 0,journal,day,month,year,title,text,sentiment,file,title_clean
0,WamS,8,7,2012,Die nächste allgemeine Verunsicherung,Deutschlands führende Ökonomen senken ihre Wac...,-1.0,Factiva-20200817-1247.txt,die nächste allgemeine verunsicherung
1,WamS,16,9,2012,Ist der Euro jetzt gerettet ?,Nach dem Urteil des Bundesverfassungsgerichts ...,1.0,Factiva-20200817-1244.txt,ist der euro jetzt gerettet
2,WamS,22,10,2012,So stark ist Deutschland wirklich,Eine noch unveröffentlichte Studie zeigt: Der ...,-1.0,Factiva-20200817-1243.txt,so stark ist deutschland wirklich
3,WamS,28,10,2012,Industrie : Wachstum hält trotz Krise an,Trotz der Belastungen durch die Euro-Krise ble...,1.0,Factiva-20200817-1243_1.txt,industrie wachstum hält trotz krise an
4,WamS,11,11,2012,""" Eine wahre Explosion der Kreativität ""","Der Chef der Werbeagentur Saatchi & Saatchi, K...",-1.0,Factiva-20200817-1240_1.txt,eine wahre explosion der kreativität
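
An inner merge silently drops annotations that have no matching article, so it can be useful to inspect what was left out. The sketch below is not part of the original workflow: it uses small toy DataFrames (titles borrowed from the tables above, file names invented) and the `indicator=True` option of `pd.merge` to list annotations without a match. The names `annotations`, `articles`, `check`, and `unmatched` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for `sentiment_data` and `wams`; file names are invented
annotations = pd.DataFrame({
    "title_clean": ["alles wird gut", "koalition", "rebellion gegen merkel"],
    "date": ["01.01.2017", "01.01.2014", "01.02.2015"],
    "sentiment": [1.0, 0.0, -1.0],
})
articles = pd.DataFrame({
    "title_clean": ["alles wird gut", "rebellion gegen merkel"],
    "date": ["01.01.2017", "01.02.2015"],
    "file": ["a.txt", "b.txt"],
})

# A left merge keeps every annotation; `indicator=True` adds a `_merge` column
# that is "both" for matched rows and "left_only" for unmatched ones
check = pd.merge(annotations, articles, how="left",
                 on=["title_clean", "date"], indicator=True)

# Annotations for which no downloaded article was found
unmatched = check[check["_merge"] == "left_only"]

# Number of annotations that found a matching article
n_matched = int((check["_merge"] == "both").sum())
```

Inspecting `unmatched` on the real data would show which titles still differ after normalization or which dates were built differently in the two sources.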


As the final step, the DataFrame `data_match` is saved as a CSV file named `wams.csv`.

In [9]:
# Drop the 'title_clean' column as it is no longer needed
data_match = data_match.drop(columns=['title_clean'])

# Save the DataFrame to a CSV file
data_match.to_csv('wams.csv', encoding='utf-8-sig', sep=';')
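
To read the saved file back, the same separator and encoding need to be passed to `pd.read_csv`. The sketch below illustrates this round trip on a small toy DataFrame written to a temporary directory, so it does not touch `wams.csv`. The toy rows and the file name `wams_demo.csv` are invented for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for `data_match`; the rows are invented
toy = pd.DataFrame({
    "journal": ["WamS", "WamS"],
    "day": [8, 16],
    "month": [7, 9],
    "year": [2012, 2012],
    "title": ["Beispiel mit Umlauten: äöü", "Zweiter Titel"],
    "text": ["Text mit Semikolon; und mehr", "Noch ein Text"],
    "sentiment": [-1.0, 1.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "wams_demo.csv")

    # Same options as in the notebook: semicolon separator, UTF-8 with BOM
    toy.to_csv(path, encoding="utf-8-sig", sep=";")

    # `index_col=0` turns the unnamed first column back into the index
    restored = pd.read_csv(path, encoding="utf-8-sig", sep=";", index_col=0)
```

Fields that contain a semicolon are quoted on writing, so the article text survives the round trip even though `;` is also the separator.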