We downloaded the full texts of articles on business cycle conditions from LexisNexis and Factiva. The articles were annotated by Media Tenor and come from eight sources: Spiegel (1,020 annotations), BILD (875 annotations), Focus (730 annotations), WamS (488 annotations), Capital (411 annotations), FAS (392 annotations), BamS (176 annotations), and Manager Magazin (16 annotations).

In this notebook, we load the full texts of the BILD and BamS articles on business cycle conditions. All of these articles were available in Factiva. We downloaded them in RTF format, converted them to TXT files, and then read them into a DataFrame. Our goal is a DataFrame that contains the following information for each article: the name of the newspaper (BILD or BamS), the day, month, and year of publication, the title, the full text, and the sentiment based on Media Tenor's annotations.

BILD is a popular daily newspaper with a wide circulation. BamS, or Bild am Sonntag, is its Sunday edition.

---

As a first step, we convert the RTF files of the BamS and BILD articles downloaded from Factiva into TXT format. The RTF files are stored in `MediaTenor_LexisNexis_Factiva/BamS_Konjunktur_rtf` and `MediaTenor_LexisNexis_Factiva/BILD_Konjunktur_rtf`. The converted TXT files are written to `MediaTenor_LexisNexis_Factiva/BamS_Konjunktur_txt` and `MediaTenor_LexisNexis_Factiva/BILD_Konjunktur_txt`.

In [1]:
import os
from striprtf.striprtf import rtf_to_text

def convert_rtf_to_txt(input_directory, output_directory):
    """
    Convert RTF files in the input directory to TXT files in the output directory.
    Skip temporary lock files starting with ~$ and remove any that are left
    in the output directory from earlier runs.
    """
    # Ensure the output directory exists
    os.makedirs(output_directory, exist_ok=True)

    # Iterate through all RTF files in the input directory
    for filename in sorted(os.listdir(input_directory)):
        # Skip non-RTF files and temporary lock files (~$...), which are not valid RTF
        if not filename.lower().endswith(".rtf") or filename.startswith("~$"):
            continue

        # Read the RTF file (platform default encoding) and convert it to plain text
        with open(os.path.join(input_directory, filename), "r") as file:
            text = rtf_to_text(file.read())

        # Swap only the extension, so ".rtf" elsewhere in the name is left alone
        new_filename = os.path.splitext(filename)[0] + ".txt"

        # Write the converted text to the new TXT file
        with open(os.path.join(output_directory, new_filename), "w", encoding="utf-8") as new_file:
            new_file.write(text)

    # Remove any temporary files starting with ~$
    for filename in os.listdir(output_directory):
        if filename.startswith("~$"):
            os.remove(os.path.join(output_directory, filename))

# Define paths for BamS
bams_rtf_path = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'BamS_Konjunktur_rtf')
bams_txt_path = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'BamS_Konjunktur_txt')

# Define paths for BILD
bild_rtf_path = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'BILD_Konjunktur_rtf')
bild_txt_path = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'BILD_Konjunktur_txt')

# Convert RTF files to TXT for BamS
convert_rtf_to_txt(bams_rtf_path, bams_txt_path)

# Convert RTF files to TXT for BILD
convert_rtf_to_txt(bild_rtf_path, bild_txt_path)
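
After the conversion, it can be useful to confirm that every RTF file produced a TXT file. The following is a minimal sketch and not part of the original notebook: the helper name `missing_conversions` is hypothetical, and the demo runs on a throwaway temporary directory with made-up file names rather than on the real folders.

```python
import tempfile
from pathlib import Path

def missing_conversions(rtf_dir, txt_dir):
    """Return the stems of RTF files that have no TXT file with the same stem."""
    rtf_stems = {
        p.stem for p in Path(rtf_dir).glob("*.rtf")
        if not p.name.startswith("~$")  # ignore temporary lock files
    }
    txt_stems = {p.stem for p in Path(txt_dir).glob("*.txt")}
    return sorted(rtf_stems - txt_stems)

# Demo on temporary directories (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    rtf_dir = Path(tmp, "rtf")
    txt_dir = Path(tmp, "txt")
    rtf_dir.mkdir()
    txt_dir.mkdir()
    for name in ("a.rtf", "b.rtf", "~$a.rtf"):
        (rtf_dir / name).write_text("{\\rtf1 demo}")
    (txt_dir / "a.txt").write_text("demo")
    missing = missing_conversions(rtf_dir, txt_dir)

print(missing)
```

On the real data, a call such as `missing_conversions(bams_rtf_path, bams_txt_path)` would be expected to return an empty list if the conversion covered every file.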

For BamS and BILD, we manually reviewed each TXT file to ensure that it contained a single article. Files often contained several articles in one document; in those cases, we manually selected the annotated article. During this review, we also added the sentiment from the Excel file provided by Media Tenor, matching each downloaded article to its annotation by title, publication date, and source. For some documents, we also removed trailing content that was not part of the main article, such as photo captions, editor notes, or background details.
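
The matching itself was done by hand. Purely as an illustration, the sketch below shows how candidate matches on source, date, and title could be shortlisted with a pandas merge before a manual check. All column names (`journal`, `date`, `title`, `title_key`), the helper `normalize_title`, and the two toy tables are assumptions; the layout of the Media Tenor Excel file is not described here.

```python
import pandas as pd

# Toy stand-in for the annotation sheet (column names are assumptions)
annotations = pd.DataFrame({
    "journal": ["BILD", "BamS"],
    "date": pd.to_datetime(["2013-04-10", "2013-04-14"]),
    "title": ["RÖSLER SICHER; Konjunktur nimmt Fahrt auf",
              "Wie lange wird es Opel noch geben ?"],
    "sentiment": [1, 1],
})

# Toy stand-in for the downloaded articles (title spelling varies on purpose)
articles = pd.DataFrame({
    "journal": ["BILD", "BamS", "BILD"],
    "date": pd.to_datetime(["2013-04-10", "2013-04-14", "2013-04-16"]),
    "title": ["Rösler sicher; Konjunktur nimmt  Fahrt auf ",
              "Wie lange wird es Opel noch geben ?",
              "POLITIK & WIRTSCHAFT"],
})

def normalize_title(title):
    """Lower-case the title and collapse whitespace so trivial differences do not block a match."""
    return " ".join(title.casefold().split())

for frame in (annotations, articles):
    frame["title_key"] = frame["title"].map(normalize_title)

# Left join keeps every article; the indicator column flags articles without an annotation
matched = articles.merge(
    annotations[["journal", "date", "title_key", "sentiment"]],
    on=["journal", "date", "title_key"],
    how="left",
    indicator=True,
)
unmatched = matched[matched["_merge"] == "left_only"]
print(unmatched[["journal", "date", "title"]])
```

Articles that end up in `unmatched` would still need to be resolved manually, for example when the annotated title differs substantially from the downloaded one.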

Once the TXT files were ready, we used the function `extract_article_data` to load the text of the articles along with the newspaper's name, date of publication, title, sentiment, and file name into a dictionary called `article_data`.

In [2]:
import extract_article_data

# Read and extract relevant information from TXT files in BamS and BILD directories.
article_data = extract_article_data.extract_article_data(bams_txt_path, bild_txt_path)
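
Before building a DataFrame from `article_data`, a quick consistency check can catch files that were parsed incompletely. This is a sketch under assumptions: the key names are taken from the DataFrame construction in this notebook, each value is assumed to be a list with one entry per article, and both the helper `check_article_data` and the toy values are hypothetical.

```python
EXPECTED_KEYS = ("journal", "day", "month", "year", "titles", "texts", "sentiment", "file")

def check_article_data(article_data):
    """Return the number of articles, or raise if keys are missing or list lengths differ."""
    missing = [key for key in EXPECTED_KEYS if key not in article_data]
    if missing:
        raise KeyError(f"missing keys: {missing}")
    lengths = {key: len(article_data[key]) for key in EXPECTED_KEYS}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"unequal list lengths: {lengths}")
    return lengths[EXPECTED_KEYS[0]]

# Toy dictionary with placeholder values (not real parser output)
toy_article_data = {
    "journal": ["BILD", "BamS"],
    "day": ["6", "14"],
    "month": ["April", "April"],
    "year": ["2013", "2013"],
    "titles": ["placeholder title 1", "placeholder title 2"],
    "texts": ["placeholder text 1", "placeholder text 2"],
    "sentiment": [-1, 1],
    "file": ["file_1.txt", "file_2.txt"],
}

n_articles = check_article_data(toy_article_data)
print(n_articles)
```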

Finally, we create a DataFrame from the `article_data` dictionary. This DataFrame includes columns for the newspaper's name, publication date (day, month, and year), article title, text, sentiment, and file name. We convert the month names to numbers, sort the DataFrame in chronological order, drop articles that were downloaded twice, and save the resulting dataset to a semicolon-separated CSV file named `bams_bild.csv`.
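
The chronological ordering can also be expressed through a single datetime column, which has the side effect of validating each date. The sketch below is an alternative shown on a toy frame, not the approach used in this notebook; the `date` column and the `german_months` dictionary are introduced only for illustration.

```python
import pandas as pd

# Toy frame with the same date columns as the real dataset
toy = pd.DataFrame({
    "journal": ["BamS", "BILD", "BILD"],
    "day": [14, 6, 10],
    "month": ["April", "April", "April"],
    "year": [2013, 2013, 2013],
})

german_months = {
    "Januar": 1, "Februar": 2, "März": 3, "April": 4,
    "Mai": 5, "Juni": 6, "Juli": 7, "August": 8,
    "September": 9, "Oktober": 10, "November": 11, "Dezember": 12,
}
toy["month"] = toy["month"].map(german_months)

# Month names missing from the mapping become NaN; surface them before converting
unmapped = toy[toy["month"].isna()]
assert unmapped.empty, f"unmapped month names in rows {list(unmapped.index)}"

# pandas can assemble datetimes from columns named year, month, and day
toy["date"] = pd.to_datetime(toy[["year", "month", "day"]])
toy = toy.sort_values("date").reset_index(drop=True)
print(toy[["journal", "date"]])
```

A single `date` column also makes later steps such as filtering by period or aggregating by month more direct than working with three integer columns.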

In [3]:
import pandas as pd

# Create a DataFrame from the article_data dictionary
bams_bild = pd.DataFrame({
    'journal': article_data['journal'],
    'day': article_data['day'],
    'month': article_data['month'],
    'year': article_data['year'],
    'title': article_data['titles'],
    'text': article_data['texts'],
    'sentiment': article_data['sentiment'],
    'file': article_data['file']
})

# Map German and English month names to numbers
# (names spelled identically in both languages are listed once)
month_mapping = {
    'Januar': 1, 'Februar': 2, 'März': 3, 'April': 4,
    'Mai': 5, 'Juni': 6, 'Juli': 7, 'August': 8,
    'September': 9, 'Oktober': 10, 'November': 11, 'Dezember': 12,
    'January': 1, 'February': 2, 'March': 3,
    'May': 5, 'June': 6, 'July': 7,
    'October': 10, 'December': 12
}
bams_bild['month'] = bams_bild['month'].map(month_mapping)

# Convert day, month, and year to integers
bams_bild['day'] = bams_bild['day'].astype(int)
bams_bild['month'] = bams_bild['month'].astype(int)
bams_bild['year'] = bams_bild['year'].astype(int)

# Sort the data in chronological order
bams_bild = bams_bild.sort_values(['year', 'month', 'day'], ascending=[True, True, True])

# Drop duplicated articles, keeping the first occurrence
# We have duplicates because we mistakenly downloaded the same article twice
bams_bild = bams_bild.drop_duplicates(['text', 'year', 'month', 'day'], keep='first')

# Reset the index of the DataFrame
bams_bild = bams_bild.reset_index(drop=True)

# Print the number of articles from BamS and BILD
num_bams_articles = len(bams_bild[bams_bild['journal'] == 'BamS'])
num_bild_articles = len(bams_bild[bams_bild['journal'] == 'BILD'])
total_articles = len(bams_bild)

print(f"Number of articles from BamS: {num_bams_articles}")
print(f"Number of articles from BILD: {num_bild_articles}")
print(f"Total number of articles: {total_articles}")

# Save to CSV file
bams_bild.to_csv('bams_bild.csv', encoding='utf-8-sig', sep=';')

# Display the DataFrame to verify the results
bams_bild.head()

Number of articles from BamS: 146
Number of articles from BILD: 564
Total number of articles: 710


Unnamed: 0,journal,day,month,year,title,text,sentiment,file
0,BILD,6,4,2013,Gipfeltreffen der Gefrusteten.,SPD-Kanzlerkandidat Peer Steinbrück hat miese ...,-1,Factiva-20200814-1023.txt
1,BILD,10,4,2013,RÖSLER SICHER; Konjunktur nimmt Fahrt auf,Berlin - Wirtschaftsminister Philipp Rösler (4...,1,Factiva-20200814-1021.txt
2,BamS,14,4,2013,BLITZ-UMFRAGE VOR DEM PARTEITAG; Selbst SPD -...,"""Wir sind noch nicht im Wahlkampfmodus"". Augs...",0,Factiva-20200810-1444.txt
3,BamS,14,4,2013,Wie lange wird es Opel noch geben ?,General-Motors-Chef Dan Akerson und sein neuer...,1,Factiva-20200810-1445.txt
4,BILD,16,4,2013,POLITIK & WIRTSCHAFT,Droht Pierer lebenslang? Athen - Wegen der Sc...,-1,Factiva-20200814-1021 (1).txt
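
When the saved file is read back, the separator, the encoding, and the index column need to match the settings used in `to_csv`. The following sketch demonstrates this on a small toy frame written to a temporary file; the file name `toy.csv` is hypothetical, and the real file would be `bams_bild.csv`.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({
    "journal": ["BILD", "BILD"],
    "title": ["Gipfeltreffen der Gefrusteten.",
              "RÖSLER SICHER; Konjunktur nimmt Fahrt auf"],
    "sentiment": [-1, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    # Same settings as in the notebook: BOM-prefixed UTF-8 and semicolon separator
    toy.to_csv(path, encoding="utf-8-sig", sep=";")
    # index_col=0 restores the saved index instead of loading it as a data column
    reloaded = pd.read_csv(path, encoding="utf-8-sig", sep=";", index_col=0)
    # Without index_col, the saved index appears as a column named "Unnamed: 0"
    naive = pd.read_csv(path, encoding="utf-8-sig", sep=";")

print(reloaded.columns.tolist())
print(naive.columns.tolist())
```

Titles that contain a semicolon, such as the second one above, are quoted by `to_csv` and therefore survive the round trip intact.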
