In this notebook, we perform the following steps:

1. **Match Articles from Spiegel's website**: 
   - We start by matching 505 articles scraped from Spiegel's website in the notebook **Web scraping from Spiegel.ipynb** with their metadata from the dataset acquired from Media Tenor, including their sentiment annotations.
   
2. **Identify and Download Missing Articles**:
   - We identify 100 articles published between 2011 and 2016 that were annotated by Media Tenor but were not available online and appeared only in the print edition of the journal. We then attempt to download them from Factiva or LexisNexis, depending on availability.

3. **Load and Match Articles from Factiva**: 
   - We begin by downloading 443 RTF files of Spiegel articles from Factiva. These RTF files are then converted to TXT format. After the conversion, we load the articles from the TXT files and match them with their metadata.

4. **Load and Match Articles from LexisNexis**: 
   - Similarly, we download 72 RTF files of Spiegel articles from LexisNexis. These RTF files are converted to TXT format, and the articles are then loaded and matched with their sentiment annotations.

5. **Combine All Articles**: 
   - Finally, we combine all these articles into one dataset and save it as a CSV file.

## Media Tenor dataset

To match the articles scraped from Spiegel's website or downloaded from Factiva and LexisNexis with their metadata from the Media Tenor dataset, we first need to load the Media Tenor dataset. We only retain articles with non-empty titles, as it is not possible to identify and download articles without titles.

In [1]:
import pandas as pd

# Load the dataset acquired from Media Tenor
sentiment_data = pd.read_csv('Daten_Wirtschaftliche_Lage.csv', encoding='utf-8', sep=';')

# Filter out rows with empty titles, as we cannot identify and download the articles without titles
sentiment_data = sentiment_data[sentiment_data['title'].notnull()]

# Reset the index of the DataFrame
sentiment_data = sentiment_data.reset_index(drop=True)

# Display the first few rows of the DataFrame
sentiment_data.head()

Unnamed: 0,date,month,medium,title,topicgroup,negative,no_clear_tone,positive,Number_of_reports,AverageRating
0,01.01.2014,201401,WamS,Koalition,Konjunktur,0,1,0,1,0
1,01.01.2017,201701,FAS,Habt bloß keine Angst vor China !,Internationale Wirtschaft,0,0,1,1,100
2,01.01.2017,201701,BamS,Wir leben in einer Zeit der Wohlstands-Halluzi...,Konjunktur,0,0,1,1,100
3,01.02.2015,201502,WamS,Teheran ruft,Wettbewerbsfähigkeit/Nachfrage,1,3,0,4,-25
4,01.01.2017,201701,BamS,"Geht es und wirklich so gut, wie es uns Merkel...",Internationale Wirtschaft,0,1,0,1,0


The titles in the Media Tenor dataset were manually entered, leading to potential inconsistencies in punctuation and spacing. To address this issue and ensure accurate matching with the titles of the articles we scrape from the website or download from databases, we normalize the titles in the dataset.

In [2]:
# Import the Normalize class from the normalize module
from normalize import Normalize

# Initialize the Normalize class with the titles from the sentiment_data DataFrame
normalizer = Normalize(sentiment_data.title)

# Apply the normalization process to the titles
normalized_titles = normalizer.normalized()

# Add the normalized titles to the sentiment_data DataFrame as a new column 'title_clean'
sentiment_data['title_clean'] = normalized_titles

sentiment_data.head()

Unnamed: 0,date,month,medium,title,topicgroup,negative,no_clear_tone,positive,Number_of_reports,AverageRating,title_clean
0,01.01.2014,201401,WamS,Koalition,Konjunktur,0,1,0,1,0,koalition
1,01.01.2017,201701,FAS,Habt bloß keine Angst vor China !,Internationale Wirtschaft,0,0,1,1,100,habt bloß keine angst vor china
2,01.01.2017,201701,BamS,Wir leben in einer Zeit der Wohlstands-Halluzi...,Konjunktur,0,0,1,1,100,wir leben in einer zeit der wohlstands halluzi...
3,01.02.2015,201502,WamS,Teheran ruft,Wettbewerbsfähigkeit/Nachfrage,1,3,0,4,-25,teheran ruft
4,01.01.2017,201701,BamS,"Geht es und wirklich so gut, wie es uns Merkel...",Internationale Wirtschaft,0,1,0,1,0,geht es und wirklich so gut wie es uns merkel ...
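
The `Normalize` class lives in a separate module, and its implementation is not shown in this notebook. Judging only from the output above, it appears to lowercase the titles, drop punctuation (treating hyphens as spaces), and collapse whitespace. A minimal sketch of such a rule, using the hypothetical helper name `normalize_title`, might look like this:

```python
import re

def normalize_title(title: str) -> str:
    """Hypothetical stand-in for Normalize; the real rules may differ."""
    # Lowercase so that capitalization differences do not block a match
    title = title.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space
    # (\w also matches "_", so underscores are listed explicitly)
    title = re.sub(r"[^\w\s]|_", " ", title)
    # Collapse runs of whitespace and trim both ends
    return re.sub(r"\s+", " ", title).strip()

print(normalize_title("Habt bloß keine Angst vor China !"))
print(normalize_title("Die Nicht-Regierung"))
```

Because Python's `\w` is Unicode-aware for `str` patterns, umlauts and `ß` survive this step, which matters for German titles.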


We restrict the dataset to annotated Spiegel articles in the topic group business cycle conditions (`Konjunktur`), as these are the articles we scraped from the website or downloaded from the databases.

In [3]:
# Filter the dataset to include only articles from Spiegel
sentiment_data = sentiment_data[sentiment_data['medium'] == 'Spiegel']

# Reset the index of the DataFrame and remove the old index column
sentiment_data = sentiment_data.reset_index(drop=True)

# Further filter the dataset to include only articles related to the business cycle conditions (Konjunktur)
sentiment_data = sentiment_data[sentiment_data['topicgroup'] == 'Konjunktur']

# Reset the index of the DataFrame again and remove the old index column
sentiment_data = sentiment_data.reset_index(drop=True)

We filter the Media Tenor dataset to only keep articles where there was agreement between annotators on sentiment. Articles without annotator agreement (i.e., where `sentiment` is `NaN`) are removed.

In [4]:
from sentiment import sentiment

# Apply the 'sentiment' function to each row of the DataFrame and create a new 'sentiment' column
sentiment_data['sentiment'] = sentiment_data.apply(lambda row: sentiment(row), axis=1)

# Remove articles where there is no annotator agreement (i.e., sentiment is NaN)
sentiment_data = sentiment_data[sentiment_data['sentiment'].notnull()]

# Reset the index of the DataFrame again and remove the old index column
sentiment_data = sentiment_data.reset_index(drop=True)

# Display the first few rows of the filtered DataFrame to verify the results
sentiment_data.head()

Unnamed: 0,date,month,medium,title,topicgroup,negative,no_clear_tone,positive,Number_of_reports,AverageRating,title_clean,sentiment
0,01.02.2020,202002,Spiegel,Keim der Angst,Konjunktur,1,2,0,3,-3333,keim der angst,0.0
1,01.02.2020,202002,Spiegel,»Du musst die Gesellschaft verändern wollen«,Konjunktur,1,0,0,1,-100,du musst die gesellschaft verändern wollen,-1.0
2,01.04.2017,201704,Spiegel,Eine Stunde Applaus. Und dann?,Konjunktur,0,1,0,1,0,eine stunde applaus und dann,0.0
3,01.08.2015,201508,Spiegel,Chinesische Heuschrecke,Konjunktur,1,0,0,1,-100,chinesische heuschrecke,-1.0
4,01.07.2013,201307,Spiegel,Spirale nach unten,Konjunktur,1,0,0,1,-100,spirale nach unten,-1.0
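
The `sentiment` function is imported from a separate module and is not shown here. The rows displayed above are consistent with a simple plurality rule: the label with the most annotations wins, and a tie yields `NaN`. The sketch below illustrates that rule under this assumption; `majority_sentiment` is a hypothetical name, and the actual agreement criterion may be stricter.

```python
import numpy as np
import pandas as pd

def majority_sentiment(row):
    """Hypothetical agreement rule; the project's `sentiment` function may differ."""
    counts = {-1.0: row["negative"], 0.0: row["no_clear_tone"], 1.0: row["positive"]}
    best = max(counts.values())
    winners = [label for label, n in counts.items() if n == best]
    # A unique winner counts as agreement; a tie is treated as no agreement
    return winners[0] if len(winners) == 1 else np.nan

# Toy annotation counts; the last row is a deliberate tie
toy = pd.DataFrame({
    "negative":      [1, 1, 2, 1],
    "no_clear_tone": [2, 0, 1, 1],
    "positive":      [0, 0, 0, 0],
})
toy["sentiment"] = toy.apply(majority_sentiment, axis=1)
print(toy)
```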


## Match Articles from Spiegel's website

Next, we load the articles that we scraped from Spiegel's website and saved in two files: `spiegel_2011_2015.csv` and `spiegel_2016.csv`. We combine all the articles into a single DataFrame named `spiegel`.

In [5]:
import os

# Define the paths for the CSV files
path_spiegel_2011_2015 = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'spiegel_2011_2015.csv')
path_spiegel_2016 = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'spiegel_2016.csv')

# Load the articles scraped from Spiegel's website
spiegel_2011_2015 = pd.read_csv(path_spiegel_2011_2015, encoding='utf-8', sep=';', names=["date", "text", "title", "title_clean"])
spiegel_2016 = pd.read_csv(path_spiegel_2016, encoding='utf-8', sep=';', names=["date", "text", "title", "title_clean"])

# Combine all the articles into one DataFrame with a fresh, continuous index
spiegel = pd.concat([spiegel_2011_2015, spiegel_2016], ignore_index=True)

# Display the first few rows of the combined DataFrame
spiegel.head()

Unnamed: 0,date,text,title,title_clean
0,03.01.2011,Im freien Fall. Ein Aufstand der Landwirte in ...,Im freien Fall,im freien fall
1,10.01.2011,Im Niemandsland. Die Bürgersteige von Pétionvi...,Im Niemandsland,im niemandsland
2,24.01.2011,Die Reifeprüfung. Um eine schnelle Schlagzeile...,Die Reifeprüfung,die reifeprüfung
3,24.01.2011,"Jurassic Park. Warum ein Falter 3,3 Millionen ...",Jurassic Park,jurassic park
4,31.01.2011,Die Machtfrage. Eine Karriere im SPIEGEL zu pl...,Die Machtfrage,die machtfrage


Now we merge these articles with their metadata from the Media Tenor dataset. We use an inner join on the `title_clean` and `date` columns to ensure that only articles present in both datasets are included. The final merged DataFrame includes columns for the journal's name, publication date (day, month, and year), article title, text, sentiment, and file name.

In [6]:
# Merge articles scraped from Spiegel's website with their metadata from the Media Tenor dataset
data_match_scraped = pd.merge(sentiment_data, spiegel, how='inner', on=['title_clean', 'date'])

# Rename the 'medium' column to 'journal'
data_match_scraped = data_match_scraped.rename(columns={'medium': 'journal'})

# Split 'date' into 'day', 'month', and 'year'
data_match_scraped['date'] = pd.to_datetime(data_match_scraped['date'], format='%d.%m.%Y')
data_match_scraped['day'] = data_match_scraped['date'].dt.day
data_match_scraped['month'] = data_match_scraped['date'].dt.month
data_match_scraped['year'] = data_match_scraped['date'].dt.year

# Create 'file' column that contains the name of the CSV file
data_match_scraped['file'] = data_match_scraped['date'].apply(lambda x: 'spiegel_2011_2015.csv' if x.year <= 2015 else 'spiegel_2016.csv')

# Rename the 'title_y' column to 'title' to reflect the title from the Spiegel dataset
data_match_scraped = data_match_scraped.rename(columns={'title_y': 'title'})

# Reorder columns
columns_order = ['journal', 'day', 'month', 'year', 'title', 'text', 'sentiment', 'file', 'title_clean']

# Select and reorder the columns
data_match_scraped = data_match_scraped[columns_order]

# Sort the data in chronological order
data_match_scraped = data_match_scraped.sort_values(['year', 'month', 'day'], ascending=[True, True, True])

# Reset the index of the DataFrame
data_match_scraped = data_match_scraped.reset_index(drop=True)

# Display the first few rows of the merged DataFrame
data_match_scraped.head()

Unnamed: 0,journal,day,month,year,title,text,sentiment,file,title_clean
0,Spiegel,21,3,2011,Schockwellen aus Fernost,Schockwellen aus Fernost. Die Katastrophe in J...,-1.0,spiegel_2011_2015.csv,schockwellen aus fernost
1,Spiegel,30,5,2011,Kalter Krieg,Kalter Krieg. Sanfte Umschuldung oder Weiter-s...,-1.0,spiegel_2011_2015.csv,kalter krieg
2,Spiegel,6,6,2011,Wieder am Abgrund,Wieder am Abgrund. Die Griechen brauchen noch ...,-1.0,spiegel_2011_2015.csv,wieder am abgrund
3,Spiegel,27,6,2011,Die Nicht-Regierung,Die Nicht-Regierung. Angela Merkel bekommt ihr...,1.0,spiegel_2011_2015.csv,die nicht regierung
4,Spiegel,4,7,2011,„Große Defizite“,„Große Defizite“. Bundesfinanzminister Wolfgan...,-1.0,spiegel_2011_2015.csv,große defizite
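
An inner join silently drops every annotated article that has no scraped counterpart. One way to inspect what falls out is a left join with `indicator=True`, which adds a `_merge` column marking each row as `both` or `left_only`. The sketch below uses a few toy rows with shortened texts, not the real data:

```python
import pandas as pd

# Toy stand-ins for sentiment_data and spiegel
annotations = pd.DataFrame({
    "title_clean": ["kalter krieg", "wieder am abgrund", "reaktorkatastrophe"],
    "date": ["30.05.2011", "06.06.2011", "28.03.2011"],
    "sentiment": [-1.0, -1.0, -1.0],
})
scraped = pd.DataFrame({
    "title_clean": ["kalter krieg", "wieder am abgrund"],
    "date": ["30.05.2011", "06.06.2011"],
    "text": ["Kalter Krieg. ...", "Wieder am Abgrund. ..."],
})

# The left join keeps every annotation; _merge records whether a scraped text was found
check = pd.merge(annotations, scraped, how="left",
                 on=["title_clean", "date"], indicator=True)
print(check["_merge"].value_counts())

# Annotated articles without a scraped text
unmatched = check.loc[check["_merge"] == "left_only", ["title_clean", "date"]]
print(unmatched)
```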


## Identify and Download Missing Articles

The following code identifies 100 articles published between 2011 and 2016 that were annotated by Media Tenor but were not available online and appeared only in the print edition of the journal. Because these articles could not be scraped from Spiegel's website, we attempted to download them from Factiva or LexisNexis, depending on availability.

In [7]:
def extract_year(row):
    
    '''A function that extracts the year from the date'''
    
    return row['date'].split('.')[2]

# List of years to exclude
year_exclude = ["2017", "2018", "2019", "2020"]

# Extract the year and add it as a new column in the DataFrame
sentiment_data['year'] = sentiment_data.apply(lambda row: extract_year(row), axis=1)

# Filter the DataFrame to include only articles from 2011 to 2016
sentiment_data_2011_2016 = sentiment_data[~sentiment_data['year'].isin(year_exclude)]

# Reset the index of the DataFrame and remove the old index column
sentiment_data_2011_2016 = sentiment_data_2011_2016.reset_index(drop=True)

# Identify articles that still need to be downloaded
to_download = sentiment_data_2011_2016[~sentiment_data_2011_2016.title_clean.isin(data_match_scraped.title_clean)]

# Reset the index of the DataFrame and remove the old index column
to_download = to_download.reset_index(drop=True)

# Convert the 'date' column to datetime format for accurate sorting
to_download['date'] = pd.to_datetime(to_download['date'], format='%d.%m.%Y')

# Sort 'to_download' based on 'date'
to_download = to_download.sort_values(by='date')

# Reset the index of the DataFrame and remove the old index column
to_download = to_download.reset_index(drop=True)

# Save the result to a CSV file
to_download.to_csv('to_download_spiegel.csv', encoding='utf-8-sig', sep=',')

print(f"Number of articles to download: {len(to_download)}")

# Display the first few rows of the DataFrame to verify the result
to_download.head()

Number of articles to download: 100


Unnamed: 0,date,month,medium,title,topicgroup,negative,no_clear_tone,positive,Number_of_reports,AverageRating,title_clean,sentiment,year
0,2011-02-14,201102,Spiegel,keine 5 Zeilen,Konjunktur,0,0,1,1,100,keine 5 zeilen,1.0,2011
1,2011-03-28,201103,Spiegel,Reaktorkatastrophe,Konjunktur,1,0,0,1,-100,reaktorkatastrophe,-1.0,2011
2,2011-04-23,201104,Spiegel,Geldsegen für Schäuble,Konjunktur,0,0,1,1,100,geldsegen für schäuble,1.0,2011
3,2011-10-17,201110,Spiegel,Showdown um Mitternacht,Konjunktur,2,1,0,3,-6667,showdown um mitternacht,-1.0,2011
4,2011-11-14,201111,Spiegel,Teure Riester-Pflege,Konjunktur,0,0,1,1,100,teure riester pflege,1.0,2011
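
The cell above selects the period by excluding a list of years and compares titles only. As an alternative sketch (not the approach used in this notebook), the dates can be parsed once and filtered with `Series.between`, and the comparison can use both `title_clean` and `date` through a left join with `indicator=True`. The toy rows below are illustrative:

```python
import pandas as pd

# Toy stand-ins for sentiment_data and data_match_scraped
annotated = pd.DataFrame({
    "title_clean": ["kalter krieg", "reaktorkatastrophe", "keim der angst"],
    "date": ["30.05.2011", "28.03.2011", "01.02.2020"],
})
matched = pd.DataFrame({"title_clean": ["kalter krieg"], "date": ["30.05.2011"]})

# Parse the dates once, then keep 2011 to 2016 (between is inclusive on both ends)
parsed = pd.to_datetime(annotated["date"], format="%d.%m.%Y")
in_range = annotated[parsed.dt.year.between(2011, 2016)]

# Anti-join on both keys: keep rows that have no match among the scraped articles
flagged = in_range.merge(matched, how="left",
                         on=["title_clean", "date"], indicator=True)
still_missing = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")
print(still_missing)
```

Matching on both keys would treat two different articles that share a title as distinct, at the cost of being sensitive to date discrepancies between sources.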


## Load and Match Articles from Factiva

Next, we focus on loading Spiegel articles downloaded from Factiva and matching them with their metadata. In our first step, we convert the RTF files into TXT format. All the RTF files are stored in `MediaTenor_LexisNexis_Factiva/Spiegel_Konjunktur_Factiva_rtf`. The converted TXT files are stored in `MediaTenor_LexisNexis_Factiva/Spiegel_Konjunktur_Factiva_txt`.

In [8]:
# Import the function for converting RTF to TXT
from convert_rtf_to_txt import convert_rtf_to_txt

# Define paths for Spiegel RTF and TXT directories
spiegel_rtf_path = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'Spiegel_Konjunktur_Factiva_rtf')
spiegel_txt_path = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'Spiegel_Konjunktur_Factiva_txt')

# Convert RTF files to TXT format for Spiegel
convert_rtf_to_txt(spiegel_rtf_path, spiegel_txt_path)
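
The `convert_rtf_to_txt` helper is defined in a separate module and is not shown here. A batch conversion of this kind could be organized roughly as follows. This is a sketch only: `convert_folder` and its `to_text` parameter are hypothetical names, the actual RTF parsing is left to a callable supplied by an RTF library of your choice, and the input encoding is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Callable

def convert_folder(src_dir, dst_dir, to_text: Callable[[str], str]) -> int:
    """Hypothetical batch converter; not the project's convert_rtf_to_txt."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted = 0
    for rtf_file in sorted(src.glob("*.rtf")):
        # Assumption: the files can be read as UTF-8; undecodable bytes are skipped
        raw = rtf_file.read_text(encoding="utf-8", errors="ignore")
        (dst / f"{rtf_file.stem}.txt").write_text(to_text(raw), encoding="utf-8")
        converted += 1
    return converted

# Demo with a placeholder converter that returns the raw content unchanged
with tempfile.TemporaryDirectory() as tmp:
    rtf_dir = Path(tmp) / "rtf"
    rtf_dir.mkdir()
    (rtf_dir / "example.rtf").write_text(r"{\rtf1 Hallo}", encoding="utf-8")
    n_converted = convert_folder(rtf_dir, Path(tmp) / "txt", to_text=lambda raw: raw)
    out_names = sorted(p.name for p in (Path(tmp) / "txt").iterdir())

print(n_converted, out_names)
```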

After converting the RTF files to TXT format, we manually edited the TXT files. Specifically, we corrected the spelling and punctuation of several titles, which is important for matching them with the metadata from the Media Tenor dataset. For example:
- "Corona macht **' s** nötig" was corrected to "Corona macht**\'s** nötig"
- "Der Spalt zwischen Befürwortern und Gegnern der **CoronaMaßnahmen** wird größer" was corrected to "Der Spalt zwischen Befürwortern und Gegnern der **Corona-Maßnahmen** wird größer"

Once the TXT files were ready, we used the function `extract_article_data_spiegel_factiva` to load the text of the articles along with the journal's name, date of publication, title, and file name into a dictionary called `article_data`.

In [9]:
import extract_article_data_spiegel_factiva

# Read and extract relevant information from TXT files in Spiegel directory.
article_data = extract_article_data_spiegel_factiva.extract_article_data_spiegel_factiva(spiegel_txt_path)

We use the `article_data` dictionary to create a DataFrame `spiegel_factiva` that includes columns for the journal's name, publication date (day, month, and year), article title, text, and file name.

In [10]:
# Create a DataFrame from the collected data
spiegel_factiva = pd.DataFrame({
    'journal': article_data['journal'],
    'day': article_data['day'],
    'month': article_data['month'],
    'year': article_data['year'],
    'title': article_data['title'],
    'text': article_data['text'],
    'file': article_data['file']
})

spiegel_factiva.head()

Unnamed: 0,journal,day,month,year,title,text,file
0,Spiegel,2,Mai,2020,15 Milliarden Euro weniger pro Woche,"McKinsey hat ausgerechnet, wie stark die Wirts...",Factiva-20200814-1107 (1).txt
1,Spiegel,2,Mai,2020,"Sonne , Strand und leer",Trotz Viruskrise: Die Tourismusindustrie in Sü...,Factiva-20200814-1107.txt
2,Spiegel,2,Mai,2020,Die Wachablösung,Die gefährliche Rivalität zwischen den USA und...,Factiva-20200814-1109.txt
3,Spiegel,2,Mai,2020,Der Corona - Graben,Der Spalt zwischen Befürwortern und Gegnern de...,Factiva-20200814-1110 (1).txt
4,Spiegel,9,Mai,2020,Im Corona - Wunderland,"Bericht aus dem Land, das im Umgang mit Corona...",Factiva-20200814-1110 (2).txt


To match the full texts of the loaded articles with their sentiment annotations from the Media Tenor dataset, we follow several key steps. First, we create a date in the same format as in the `sentiment_data` DataFrame. Next, we normalize the titles to ensure accurate matching. We also remove any duplicate articles that were mistakenly downloaded twice. After pre-processing, we merge the articles loaded from Factiva with their sentiment annotations from the Media Tenor dataset. We then sort the final DataFrame `data_match_factiva` in chronological order and retain only the relevant columns. Through this process, we successfully matched 443 Spiegel articles from Factiva with their sentiment annotations.

In [11]:
# Create dictionary to transform month name into month number
name_to_number = {
    u'Januar': '01', u'Februar': '02', u'M\xe4rz': '03', u'April': '04', u'Mai': '05',
    u'Juni': '06', u'Juli': '07', u'August': '08', u'September': '09', u'Oktober': '10',
    u'November': '11', u'Dezember': '12'
}

# Transform month names into month numbers
spiegel_factiva['month_num'] = spiegel_factiva['month'].map(name_to_number)

# Create dictionary to transform single-digit day numbers
day_transform = {u'1': '01', u'2': '02', u'3': '03', u'4': '04', u'5': '05', u'6': '06', u'7': '07', u'8': '08', u'9': '09'}

# Transform single-digit day numbers into two-digit format
spiegel_factiva['day'] = spiegel_factiva['day'].map(lambda d: day_transform.get(d, d))

# Combine day, month, and year into a date string
spiegel_factiva['date'] = spiegel_factiva.apply(lambda row: f"{row['day']}.{row['month_num']}.{row['year']}", axis=1)

# Drop duplicated articles, keeping the first occurrence
# We have duplicates because we mistakenly downloaded the same article twice
spiegel_factiva = spiegel_factiva.drop_duplicates(['text', 'year', 'month', 'day'], keep='first')

# Reset the index of the DataFrame
spiegel_factiva = spiegel_factiva.reset_index(drop=True)

# Initialize the Normalize class with the titles from the spiegel_factiva DataFrame
normalizer = Normalize(spiegel_factiva.title)

# Apply the normalization process to the titles
normalized_titles = normalizer.normalized()

# Add the normalized titles to the spiegel_factiva DataFrame as a new column 'title_clean'
spiegel_factiva['title_clean'] = normalized_titles

# Merge with sentiment_data on title_clean and date
data_match_factiva = pd.merge(sentiment_data, spiegel_factiva, how='inner', on=['title_clean', 'date'])

# Rename the 'month_num' column to 'month'
data_match_factiva = data_match_factiva.rename(columns={'month_num': 'month'})

# Rename the 'year_y' column to 'year'
data_match_factiva = data_match_factiva.rename(columns={'year_y': 'year'})

# Convert year, month, and day to integers
data_match_factiva['year'] = data_match_factiva['year'].astype(int)
data_match_factiva['month'] = data_match_factiva['month'].astype(int)
data_match_factiva['day'] = data_match_factiva['day'].astype(int)

# Sort the data in chronological order
data_match_factiva = data_match_factiva.sort_values(['year', 'month', 'day'], ascending=[True, True, True])

# Reset the index of the DataFrame
data_match_factiva = data_match_factiva.reset_index(drop=True)

# Rename the 'title_y' column to 'title' to reflect the title from the Factiva dataset
data_match_factiva = data_match_factiva.rename(columns={'title_y': 'title'})

# Select only the required columns
data_match_factiva = data_match_factiva[['journal', 'day', 'month', 'year', 'title', 'text', 'sentiment', 'file', 'title_clean']]

# Print the number of articles from Factiva
num_factiva_articles = len(data_match_factiva)

print(f"Number of articles from Factiva: {num_factiva_articles}")

# Display the first few rows of the final matched dataset
data_match_factiva.head()

Number of articles from Factiva: 443


Unnamed: 0,journal,day,month,year,title,text,sentiment,file,title_clean
0,Spiegel,23,5,2015,Hausmitteilung,Sie sind etwa so hoch wie die Außenmauern des ...,0.0,Factiva-20200827-1114.txt,hausmitteilung
1,Spiegel,12,9,2015,Der serbische Premier Vucic über die Flüchtlin...,Der serbische Ministerpräsident Aleksandar Vuč...,1.0,Factiva-20200827-1056.txt,der serbische premier vucic über die flüchtlin...
2,Spiegel,10,10,2015,Reisen hilft,Tourismus leistet einen Beitrag zur Wirtschaft...,1.0,Factiva-20200827-1050.txt,reisen hilft
3,Spiegel,24,10,2015,""" Wer nicht zahlen kann , der hat Pech """,Griechenland: Exrichter Leandros Rakintzis sol...,-1.0,Factiva-20200827-1049 (1).txt,wer nicht zahlen kann der hat pech
4,Spiegel,19,11,2015,Kaum Wirkung,Sanktionen gegen Russland haben nur begrenzte ...,-1.0,Factiva-20200827-1049.txt,kaum wirkung
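
The date construction above can also be written with vectorized string methods. The sketch below assumes that `day`, `month`, and `year` arrive as strings, as in the Factiva output, and it raises an error if a month name is missing from the mapping instead of silently producing an unusable date. The variable names are illustrative.

```python
import pandas as pd

month_numbers = {
    "Januar": "01", "Februar": "02", "März": "03", "April": "04",
    "Mai": "05", "Juni": "06", "Juli": "07", "August": "08",
    "September": "09", "Oktober": "10", "November": "11", "Dezember": "12",
}

# Toy rows in the shape produced by the extraction step
articles = pd.DataFrame({
    "day": ["2", "19"],
    "month": ["Mai", "November"],
    "year": ["2020", "2015"],
})

month_num = articles["month"].map(month_numbers)
if month_num.isna().any():
    # An unmapped month name would otherwise yield NaN and never match
    raise ValueError("Unmapped month names: "
                     f"{articles.loc[month_num.isna(), 'month'].unique()}")

# zfill pads single-digit days to two digits, giving the DD.MM.YYYY format
articles["date"] = (
    articles["day"].astype(str).str.zfill(2) + "." + month_num + "." + articles["year"].astype(str)
)
print(articles["date"].tolist())
```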


## Load and Match Articles from LexisNexis

In this section, our focus is on loading Spiegel articles that were downloaded from LexisNexis and matching them with their sentiment annotations. We begin by converting the RTF files into TXT format. The original RTF files are located in `MediaTenor_LexisNexis_Factiva/Spiegel_Konjunktur_LexisNexis_rtf`, and the resulting TXT files are stored in `MediaTenor_LexisNexis_Factiva/Spiegel_Konjunktur_LexisNexis_txt`.

In [12]:
# Define paths for Spiegel RTF and TXT directories
spiegel_lexisnexis_rtf_path = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'Spiegel_Konjunktur_LexisNexis_rtf')
spiegel_lexisnexis_txt_path = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'Spiegel_Konjunktur_LexisNexis_txt')

# Convert RTF files to TXT format for Spiegel
convert_rtf_to_txt(spiegel_lexisnexis_rtf_path, spiegel_lexisnexis_txt_path)

Once the RTF files were converted to TXT format, we made some adjustments. Before extracting titles from the 'Search Terms' in our TXT files, we corrected a few titles to ensure they matched those in the Media Tenor dataset. For example, "**es** muss jetzt schnell gehen" was corrected to "**Es** muss jetzt schnell gehen" to ensure accurate matching.

After preparing the TXT files, we used the `extract_article_data_spiegel_lexisnexis` function to load the articles' text, along with the journal name, publication date, title, and file name, into a dictionary called `article_data_lexisnexis`.

In [13]:
import extract_article_data_spiegel_lexisnexis

# Read and extract relevant information from TXT files in Spiegel directory.
article_data_lexisnexis = extract_article_data_spiegel_lexisnexis.extract_article_data_spiegel_lexisnexis(spiegel_lexisnexis_txt_path)

We use the `article_data_lexisnexis` dictionary to create a DataFrame `spiegel_lexisnexis` that includes columns for the journal's name, publication date (day, month, and year), article title, text, and file name.

In [14]:
# Create a DataFrame from the collected data
spiegel_lexisnexis = pd.DataFrame({
    'journal': article_data_lexisnexis['journal'],
    'day': article_data_lexisnexis['day'],
    'month': article_data_lexisnexis['month'],
    'year': article_data_lexisnexis['year'],
    'title': article_data_lexisnexis['title'],
    'text': article_data_lexisnexis['text'],
    'file': article_data_lexisnexis['file']
})

spiegel_lexisnexis.head()

Unnamed: 0,journal,day,month,year,title,text,file
0,Spiegel,5,März,2012,Zahl der Woche,besuchten im vergangenen Jahr Deutschkurse an ...,234 587 Sprachsch_ler.txt
1,Spiegel,8,Dezember,2014,37000,Fußnote. Menschen aus Eritrea haben sich in de...,37000.txt
2,Spiegel,26,Mai,2014,81 Millionäre,"Fußnote. aus dem nicht europäischen Ausland, f...",81 Million_re.txt
3,Spiegel,30,April,2012,Abschwung in Sicht,Deutschland bekommt für seine Beschäftigungspo...,Abschwung in Sicht.txt
4,Spiegel,2,Januar,2012,Arbeitgeber bestreiten Nachholbedarf,"Im Jahr 2012 werden für rund 3,6 Millionen Bes...",Arbeitgeber bestreiten Nachholbedarf.txt


To match the full texts of the loaded articles with their sentiment annotations from the Media Tenor dataset, we follow several key steps. First, we create a date in the same format as in the `sentiment_data` DataFrame. Next, we normalize the titles to ensure accurate matching. We also verify that there are no duplicate articles. After pre-processing, we merge the articles loaded from LexisNexis with their sentiment annotations from the Media Tenor dataset. We then sort the final DataFrame `data_match_lexisnexis` in chronological order and retain only the relevant columns. Through this process, we successfully matched 72 Spiegel articles from LexisNexis with their sentiment annotations.

In [15]:
# Transform month names into month numbers
spiegel_lexisnexis['month_num'] = spiegel_lexisnexis['month'].map(name_to_number)

# Transform single-digit day numbers into two-digit format
spiegel_lexisnexis['day'] = spiegel_lexisnexis['day'].map(lambda d: day_transform.get(d, d))

# Combine day, month, and year into a date string
spiegel_lexisnexis['date'] = spiegel_lexisnexis.apply(lambda row: f"{row['day']}.{row['month_num']}.{row['year']}", axis=1)

# Initialize the Normalize class with the titles from the spiegel_lexisnexis DataFrame
normalizer = Normalize(spiegel_lexisnexis.title)

# Apply the normalization process to the titles
normalized_titles = normalizer.normalized()

# Add the normalized titles to the spiegel_lexisnexis DataFrame as a new column 'title_clean'
spiegel_lexisnexis['title_clean'] = normalized_titles

# Merge with sentiment_data on title_clean and date
data_match_lexisnexis = pd.merge(sentiment_data, spiegel_lexisnexis, how='inner', on=['title_clean', 'date'])

# Rename the 'month_num' column to 'month'
data_match_lexisnexis = data_match_lexisnexis.rename(columns={'month_num': 'month'})

# Rename the 'year_y' column to 'year'
data_match_lexisnexis = data_match_lexisnexis.rename(columns={'year_y': 'year'})

# Convert year, month, and day to integers
data_match_lexisnexis['year'] = data_match_lexisnexis['year'].astype(int)
data_match_lexisnexis['month'] = data_match_lexisnexis['month'].astype(int)
data_match_lexisnexis['day'] = data_match_lexisnexis['day'].astype(int)

# Sort the data in chronological order
data_match_lexisnexis = data_match_lexisnexis.sort_values(['year', 'month', 'day'], ascending=[True, True, True])

# Reset the index of the DataFrame
data_match_lexisnexis = data_match_lexisnexis.reset_index(drop=True)

# Rename the 'title_y' column to 'title' to reflect the title from the LexisNexis dataset
data_match_lexisnexis = data_match_lexisnexis.rename(columns={'title_y': 'title'})

# Select only the required columns
data_match_lexisnexis = data_match_lexisnexis[['journal', 'day', 'month', 'year', 'title', 'text', 'sentiment', 'file', 'title_clean']]

# Print the number of articles from LexisNexis
num_lexisnexis_articles = len(data_match_lexisnexis)

print(f"Number of articles from LexisNexis: {num_lexisnexis_articles}")

# Display the last few rows of the final matched dataset
data_match_lexisnexis.tail()

Number of articles from LexisNexis: 72


Unnamed: 0,journal,day,month,year,title,text,sentiment,file,title_clean
67,Spiegel,25,7,2015,Bremsen Banken das Wachstum?,Die Samstagsfrage. Zu den vergleichsweise sich...,-1.0,Bremsen Banken das Wachstum_.txt,bremsen banken das wachstum
68,Spiegel,26,9,2015,Skeptischer IWF,Wachstum. Der Internationale Währungsfonds (IW...,-1.0,Skeptischer IWF.txt,skeptischer iwf
69,Spiegel,30,7,2016,Der zerplatzte Traum,Am Vorabend der Olympischen Spiele befindet si...,-1.0,Der zerplatzte Traum.txt,der zerplatzte traum
70,Spiegel,3,6,2017,Krieg auf den Hügeln,Einige Jahre lang war Rio das Schaufenster ein...,-1.0,Krieg auf den H_geln.txt,krieg auf den hügeln
71,Spiegel,30,9,2017,Ein Angriff auf die Demokratie,Der spanische Schriftsteller Javier Cercas übe...,-1.0,Ein Angriff auf die Demokratie.txt,ein angriff auf die demokratie
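
A duplicate check of the kind mentioned above can be done with `DataFrame.duplicated`. The sketch below runs on toy rows that contain one deliberate duplicate; the choice of `title_clean` and `date` as the key is an assumption made for illustration.

```python
import pandas as pd

# Toy rows; the first and third are duplicates on purpose
articles = pd.DataFrame({
    "title_clean": ["skeptischer iwf", "der zerplatzte traum", "skeptischer iwf"],
    "date": ["26.09.2015", "30.07.2016", "26.09.2015"],
})

# keep=False flags every member of a duplicate group, not only the repeats
dupe_mask = articles.duplicated(subset=["title_clean", "date"], keep=False)
n_dupes = int(dupe_mask.sum())
print(f"Rows involved in duplicates: {n_dupes}")

# If duplicates exist, keep the first occurrence of each
deduped = articles.drop_duplicates(subset=["title_clean", "date"], keep="first")
print(len(deduped))
```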


## Combine All Articles

As the final step, we consolidate all Spiegel articles, including those that were scraped and those downloaded from Factiva and LexisNexis, into a single DataFrame called `spiegel_all`. This combined DataFrame is then saved as a CSV file named `spiegel.csv`.

In [16]:
# Combine all articles from scraped, Factiva, and LexisNexis datasets into a single DataFrame
spiegel_all = pd.concat([data_match_scraped, data_match_factiva, data_match_lexisnexis], sort=False)

# Reset the index of the combined DataFrame
spiegel_all = spiegel_all.reset_index(drop=True)

# Print the total number of articles in the combined DataFrame
total_articles = len(spiegel_all)
print(f"Total number of articles: {total_articles}")

# Display the first few rows of the combined DataFrame to verify the merge
spiegel_all.head()

Total number of articles: 1020


Unnamed: 0,journal,day,month,year,title,text,sentiment,file,title_clean
0,Spiegel,21,3,2011,Schockwellen aus Fernost,Schockwellen aus Fernost. Die Katastrophe in J...,-1.0,spiegel_2011_2015.csv,schockwellen aus fernost
1,Spiegel,30,5,2011,Kalter Krieg,Kalter Krieg. Sanfte Umschuldung oder Weiter-s...,-1.0,spiegel_2011_2015.csv,kalter krieg
2,Spiegel,6,6,2011,Wieder am Abgrund,Wieder am Abgrund. Die Griechen brauchen noch ...,-1.0,spiegel_2011_2015.csv,wieder am abgrund
3,Spiegel,27,6,2011,Die Nicht-Regierung,Die Nicht-Regierung. Angela Merkel bekommt ihr...,1.0,spiegel_2011_2015.csv,die nicht regierung
4,Spiegel,4,7,2011,„Große Defizite“,„Große Defizite“. Bundesfinanzminister Wolfgan...,-1.0,spiegel_2011_2015.csv,große defizite


In [17]:
# Sort the combined DataFrame in chronological order
spiegel_all = spiegel_all.sort_values(['year', 'month', 'day'], ascending=[True, True, True])

# Reset the index of the DataFrame
spiegel_all = spiegel_all.reset_index(drop=True)

# Drop the 'title_clean' column as it is no longer needed
spiegel_all = spiegel_all.drop(columns=['title_clean'])

# Save the combined DataFrame to a CSV file
spiegel_all.to_csv('spiegel.csv', encoding='utf-8-sig', sep=';')
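
The reported total can be checked by hand: 505 scraped articles plus 443 from Factiva plus 72 from LexisNexis gives 1020. The sketch below shows the same consistency check in code on toy frames, followed by a round trip through the CSV format used above (semicolon separator, index written as the first column). It writes to an in-memory buffer; the frames are illustrative, not the real data.

```python
import io
import pandas as pd

# Toy stand-ins for two of the matched DataFrames
part_a = pd.DataFrame({"journal": ["Spiegel"], "day": [21], "month": [3],
                       "year": [2011], "sentiment": [-1.0]})
part_b = pd.DataFrame({"journal": ["Spiegel", "Spiegel"], "day": [23, 12],
                       "month": [5, 9], "year": [2015, 2015], "sentiment": [0.0, 1.0]})

combined = pd.concat([part_a, part_b], ignore_index=True)
# The combined length should equal the sum of the parts
assert len(combined) == len(part_a) + len(part_b)

# Write with ';' as separator, then read back, treating the first column as the index
buffer = io.StringIO()
combined.to_csv(buffer, sep=";")
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep=";", index_col=0)
print(reloaded)
```

When reading the actual file from disk, passing `encoding='utf-8-sig'` to `pd.read_csv` strips the byte order mark that the export adds.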