Since 25% of the annotated articles (1,020 annotations) came from Spiegel, we found a way to speed up the downloading process by web scraping 505 of these articles (about 50% of the Spiegel annotations) directly from Spiegel's website. At the time of scraping, content up to and including 2016 was freely available. Now, only articles up to and including 2009 are available without a subscription.

## Media Tenor dataset

First, we need to load the dataset we obtained from Media Tenor. This is essential because we will be scraping articles from Spiegel for the period 2011-2016, but only if their titles match those in our Media Tenor dataset.

In [1]:
import pandas as pd

# Load the dataset acquired from Media Tenor
sentiment_data = pd.read_csv('Daten_Wirtschaftliche_Lage.csv', encoding='utf-8', sep=';')

# Filter out rows with empty titles, as we cannot identify and download the articles without titles
sentiment_data = sentiment_data[sentiment_data['title'].notnull()]

# Reset the index of the DataFrame
sentiment_data = sentiment_data.reset_index(drop=True)

# Display the first few rows of the DataFrame
sentiment_data.head()

Unnamed: 0,date,month,medium,title,topicgroup,negative,no_clear_tone,positive,Number_of_reports,AverageRating
0,01.01.2014,201401,WamS,Koalition,Konjunktur,0,1,0,1,0
1,01.01.2017,201701,FAS,Habt bloß keine Angst vor China !,Internationale Wirtschaft,0,0,1,1,100
2,01.01.2017,201701,BamS,Wir leben in einer Zeit der Wohlstands-Halluzi...,Konjunktur,0,0,1,1,100
3,01.02.2015,201502,WamS,Teheran ruft,Wettbewerbsfähigkeit/Nachfrage,1,3,0,4,-25
4,01.01.2017,201701,BamS,"Geht es und wirklich so gut, wie es uns Merkel...",Internationale Wirtschaft,0,1,0,1,0


The titles in the Media Tenor dataset were manually entered, leading to potential inconsistencies in punctuation and spacing. To address this issue and ensure accurate matching with the titles of the articles we scrape from the website, we normalize the titles in the dataset.

In [2]:
# Import the Normalize class from the normalize module
from normalize import Normalize

# Initialize the Normalize class with the titles from the sentiment_data DataFrame
normalizer = Normalize(sentiment_data.title)

# Apply the normalization process to the titles
normalized_titles = normalizer.normalized()

# Print the first three normalized titles to verify the results
print(normalized_titles[:3])

# Add the normalized titles to the sentiment_data DataFrame as a new column 'title_clean'
sentiment_data['title_clean'] = normalized_titles

['koalition', 'habt bloß keine angst vor china', 'wir leben in einer zeit der wohlstands halluzination']
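
The `normalize` module itself is not reproduced in this notebook. As a rough illustration only, the sketch below mirrors the normalization that the scraping cells in this notebook apply to scraped titles: hyphens become spaces, the text is lowercased, punctuation and quotation marks are removed, and whitespace is collapsed. The function name `normalize_title` is hypothetical, and the actual `Normalize` class may differ in detail.

```python
import string

# Characters to drop: ASCII punctuation plus the typographic quotes
# that the scraping cells in this notebook also remove
EXCLUDE = set(string.punctuation) | {'"', '„', '“', '»', '«'}


def normalize_title(title):
    """Return a lowercase, punctuation-free title with single spaces (illustrative sketch)."""
    # Replace hyphens with spaces first so hyphenated words stay separated
    lowered = title.replace('-', ' ').lower()
    # Drop every excluded character
    stripped = ''.join(ch for ch in lowered if ch not in EXCLUDE)
    # Collapse runs of whitespace into single spaces and trim the ends
    return ' '.join(stripped.split())


print(normalize_title("Habt bloß keine Angst vor China !"))
print(normalize_title("Eine Stunde Applaus. Und dann?"))
```

Applying the same function to both the Media Tenor titles and the scraped titles keeps the two sides comparable.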


We restrict the dataset to annotated Spiegel articles on business cycle conditions (topic group `Konjunktur`), since these are the articles we aim to scrape from the website, subject to their availability.

In [3]:
# Keep only Spiegel articles related to business cycle conditions (Konjunktur)
is_spiegel = sentiment_data['medium'] == 'Spiegel'
is_business_cycle = sentiment_data['topicgroup'] == 'Konjunktur'
sentiment_data = sentiment_data[is_spiegel & is_business_cycle]

# Reset the index of the DataFrame and drop the old index
sentiment_data = sentiment_data.reset_index(drop=True)

# Display the first few rows of the filtered DataFrame to verify the results
sentiment_data.head()

Unnamed: 0,date,month,medium,title,topicgroup,negative,no_clear_tone,positive,Number_of_reports,AverageRating,title_clean
0,01.02.2020,202002,Spiegel,Keim der Angst,Konjunktur,1,2,0,3,-3333,keim der angst
1,01.02.2020,202002,Spiegel,»Du musst die Gesellschaft verändern wollen«,Konjunktur,1,0,0,1,-100,du musst die gesellschaft verändern wollen
2,01.04.2017,201704,Spiegel,Eine Stunde Applaus. Und dann?,Konjunktur,0,1,0,1,0,eine stunde applaus und dann
3,01.08.2015,201508,Spiegel,Chinesische Heuschrecke,Konjunktur,1,0,0,1,-100,chinesische heuschrecke
4,01.07.2013,201307,Spiegel,Spirale nach unten,Konjunktur,1,0,0,1,-100,spirale nach unten


## Web scraping: example

The following code demonstrates how to scrape a few articles from a specific issue of Spiegel. In the past, we used similar code to scrape articles from 2011-2016, when all these articles were publicly available without a subscription. Now, a similar approach can be used to scrape articles up to and including 2009, which are still available without a subscription.

This example focuses on scraping the first few articles from the 46th issue of Spiegel in 2008. It involves extracting article links from the issue's index page, opening each link, and retrieving the article's title, publication date, and content.

In [4]:
import codecs
import string
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Define the URL for a specific issue of Spiegel from 2008
spiegel_issue_url = "https://www.spiegel.de/spiegel/print/index-2008-46.html"

# Open the URL and create a BeautifulSoup object
spiegel_issue = urlopen(spiegel_issue_url)
spiegel_issue_soup = BeautifulSoup(spiegel_issue, 'html.parser')

# Print the first 500 characters of the HTML content for a brief overview
print("HTML Content Preview:\n", spiegel_issue_soup.prettify()[:500])

HTML Content Preview:
 <!DOCTYPE html>
<html lang="de">
 <head>
  <title>
   DER SPIEGEL 46/2008 - Inhaltsverzeichnis
  </title>
  <meta charset="utf-8"/>
  <meta content="width=device-width,initial-scale=1,user-scalable=yes" name="viewport"/>
  <meta content="true" name="MSSmartTagsPreventParsing"/>
  <meta content="no" http-equiv="imagetoolbar"/>
  <meta content="origin-when-cross-origin" name="referrer"/>
  <meta content="app-id=424881832" name="apple-itunes-app"/>
  <link href="https://www.spiegel.de/public/spon/j


Here, we demonstrate how to extract the issue number and year from the title of a Spiegel issue.

In [5]:
# Extract the title of the issue
title = spiegel_issue_soup.title.string
print(title)

# Parse details from the issue title
journal_title = title.rsplit(' ', 3)[0]
issue_number = int(title.rsplit(' ', 3)[1].rsplit('/', 1)[0])
year = int(title.rsplit(' ', 3)[1].rsplit('/', 1)[1])

# Display parsed details
print(journal_title)
print(issue_number)
print(year)

DER SPIEGEL 46/2008 - Inhaltsverzeichnis
DER SPIEGEL
46
2008
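
The chained `rsplit` calls above rely on the title having exactly the shape `DER SPIEGEL 46/2008 - Inhaltsverzeichnis`. As a hedged alternative, the same three fields could be pulled out with a regular expression that returns `None` when the title does not match. The function name `parse_issue_title` and the pattern are assumptions made for illustration, not part of the original scraping code.

```python
import re

# Journal name, then "<issue>/<year>"; assumes titles shaped like
# "DER SPIEGEL 46/2008 - Inhaltsverzeichnis"
ISSUE_PATTERN = re.compile(r"^(?P<journal>.+?)\s+(?P<issue>\d{1,2})/(?P<year>\d{4})\b")


def parse_issue_title(title):
    """Return (journal, issue_number, year), or None if the title has another shape."""
    match = ISSUE_PATTERN.search(title.strip())
    if match is None:
        return None
    return match.group("journal"), int(match.group("issue")), int(match.group("year"))


print(parse_issue_title("DER SPIEGEL 46/2008 - Inhaltsverzeichnis"))
```

Returning `None` lets the caller skip an unexpected page explicitly instead of failing with an `IndexError` or `ValueError`.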


Next, we extract the links and titles of all the articles from a particular issue. We find all the `<a>` tags in the HTML document and keep those whose `href` attribute contains `context=issue`.

In [6]:
# Extract all article links and their headings from the issue
links_journal = []
headings = []
all_links_spiegel = spiegel_issue_soup.find_all("a")

for link in all_links_spiegel:
    href = link.get("href")
    # Keep only tags that have an href containing 'context=issue'
    if href is not None and 'context=issue' in href:
        # Append the link to links_journal list
        links_journal.append(href)
        # Append the heading (the tag's title attribute) to headings list
        headings.append(link.get("title"))

# Display only the first 10 extracted links and headings for verification
print("Extracted Links (first 10):", links_journal[:10])
print("Extracted Headings (first 10):", headings[:10])

Extracted Links (first 10): ['https://www.spiegel.de/politik/das-projekt-obama-a-90d661ad-0002-0001-0000-000062013393?context=issue', 'https://www.spiegel.de/politik/das-projekt-obama-a-90d661ad-0002-0001-0000-000062013393?context=issue', 'https://www.spiegel.de/politik/im-gelobten-land-a-0eb2f40e-0002-0001-0000-000062013394?context=issue', 'https://www.spiegel.de/politik/ein-weltkrieg-ohne-krieg-a-79b5c498-0002-0001-0000-000062013395?context=issue', 'https://www.spiegel.de/politik/10-november-2008-titel-a-ea2be9dc-0002-0001-0000-000062013320?context=issue', 'https://www.spiegel.de/politik/10-november-2008-cayman-islands-a-9d2cd94a-0002-0001-0000-000062013321?context=issue', 'https://www.spiegel.de/politik/10-november-2008-eishockey-a-042588f0-0002-0001-0000-000062013322?context=issue', 'https://www.spiegel.de/politik/spaete-einheit-a-bc714e5a-0002-0001-0000-000062013340?context=issue', 'https://www.spiegel.de/politik/aufklaerer-versetzt-a-cb767ecc-0002-0001-0000-000062013341?context=i
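
The output above lists the link to "Das Projekt Obama" twice, so that article is also downloaded twice in the next step. If duplicates are not wanted, the links could be de-duplicated while keeping their original order. The helper name `unique_links` and the example URLs below are made up for illustration.

```python
def unique_links(links, headings):
    """Return (link, heading) pairs with repeated links removed, keeping first occurrences."""
    seen = set()
    unique = []
    for link, heading in zip(links, headings):
        if link not in seen:
            seen.add(link)
            unique.append((link, heading))
    return unique


# Made-up example data (not real article URLs)
example_links = [
    "https://example.org/a-1?context=issue",
    "https://example.org/a-1?context=issue",
    "https://example.org/a-2?context=issue",
]
example_headings = ["First article", "First article", "Second article"]

for link, heading in unique_links(example_links, example_headings):
    print(heading, link)
```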

Finally, we show how to download the full texts of articles, as well as their titles and publication dates, from a particular issue. We process each article link, extract the relevant details, and save each article as a TXT file. Additionally, we gather all the articles into a DataFrame.

In [7]:
import os

# Define the set of punctuation characters to exclude
exclude = set(string.punctuation)

# Lists to store extracted data
texts = []
titles = []
titles_clean = []
dates = []

# Identifier for the articles
id = 0

# Ensure the directory exists for saving articles
save_dir = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'Spiegel_scrape_2008_txt')
os.makedirs(save_dir, exist_ok=True)

# Process each article link (here we limit to the first 5 for demonstration)
for link in links_journal[:5]:
    id += 1
    id_fixed = str(id)
    
    # Open the article link
    try: 
        article = urlopen(link)
        # Create a BeautifulSoup object
        article = BeautifulSoup(article, 'html.parser')
        # Extract paragraphs with the class 'RichText'
        paragraphs = article.findAll("div", {"class": "RichText"})
        
        # Proceed only if paragraphs are found
        if paragraphs:
            # Extract the title
            title = article.find("span", {"class": "font-brandUI font-extrabold lg:text-7xl md:text-5xl sm:text-4xl leading-tight"}).find("span", {"class": "align-middle"}).get_text()
            title = title.strip().replace("\n", '')
            # Normalize the title
            # Remove specific punctuation, replace hyphens with spaces, convert to lowercase, and strip leading/trailing spaces
            title_clean = ''.join(ch for ch in title.replace('-', ' ').lower() if (ch not in exclude) and (ch not in ['"', '„', '“', '»', '«'])).strip()
            # Standardize spaces to a single space
            title_clean = " ".join(title_clean.split())
          
            # Extract the publication date
            date = article.find("time", {"class": "timeformat"}).get_text().split(',')[0]

            # Extract the text from the paragraphs
            text_new = ''

            for par in paragraphs:
                text_new = text_new + par.get_text()

            text_new = text_new.strip().replace("\n", ' ')

            if title[-1] not in ['.', '!', ':', ';','?', '"']:
                text_new = title + '. ' + text_new
            else:
                text_new = title + ' ' + text_new

        else:
            text_new = ''
            title = ''
            title_clean = ''
            date = ''
        
    # Handle any IOError (e.g., page not found)   
    except IOError:
        text_new = ''
        title = ''
        title_clean = ''
        date = ''
        
    # Add the extracted data to the lists if the title is found
    if title:
        texts.append(text_new)
        titles.append(title)
        titles_clean.append(title_clean)
        dates.append(date)
        
        # Save the article as a .txt file
        issue_string = str(issue_number)
        year_string = str(year)
        file_path = os.path.join(save_dir, f"{year_string} {issue_string} {id_fixed}.txt")
        with codecs.open(file_path, "w", "utf-8-sig") as temp:
            temp.write(text_new)
            
# Create a DataFrame from the extracted data
data = pd.DataFrame({
    'dates': dates,
    'titles': titles,
    'titles_clean': titles_clean,
    'text': texts
})

data

Unnamed: 0,dates,titles,titles_clean,text
0,09.11.2008,Das Projekt Obama,das projekt obama,Das Projekt Obama. Gewaltig sind die Herausfor...
1,09.11.2008,Das Projekt Obama,das projekt obama,Das Projekt Obama. Gewaltig sind die Herausfor...
2,09.11.2008,Im Gelobten Land,im gelobten land,Im Gelobten Land. Obamas Sieg ist der Triumph ...
3,09.11.2008,»Ein Weltkrieg ohne Krieg«,ein weltkrieg ohne krieg,»Ein Weltkrieg ohne Krieg«. Der britische Hist...
4,09.11.2008,10. November 2008 Titel,10 november 2008 titel,10. November 2008 Titel. Es waren bewegende Sz...
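
The `text` column starts with the article title, followed by a period unless the title already ends in sentence-final punctuation. The sketch below isolates that step as a small function. The name `prepend_title` is hypothetical, and unlike the cell above it uses `str.endswith`, which also tolerates an empty title.

```python
# Title endings after which no extra period is inserted (same list as in the cell above)
SENTENCE_END = ('.', '!', ':', ';', '?', '"')


def prepend_title(title, body):
    """Put the title in front of the body, adding '. ' unless the title already ends a sentence."""
    separator = ' ' if title.endswith(SENTENCE_END) else '. '
    return title + separator + body


print(prepend_title("Das Projekt Obama", "Gewaltig sind die Herausforderungen"))
print(prepend_title("Eine Stunde Applaus. Und dann?", "Text des Artikels"))
```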


## Web scraping: issues 1-52 (53), 2011-2015.

We used the following code to download the full texts of articles, along with their titles and publication dates, from issues 1-53 of the years 2011-2015. We kept an article only if its normalized title was contained in a normalized title from the Media Tenor dataset. This gave us candidate articles that potentially matched the Media Tenor metadata.

We downloaded 572 candidates from 2011-2015 and 182 candidates from 2016, totaling 754 articles. Later, we matched these articles with the Media Tenor metadata based on both title and publication date, which left 505 scraped articles that appear in the Media Tenor dataset. The matching of scraped texts with metadata is done in the notebook `Spiegel.ipynb`.

Although this code cannot be used now to scrape these articles due to the current restriction that only articles up to and including 2009 are available without a subscription, we publish this code to demonstrate our research process and to aid anyone needing articles published in 2009 or earlier.

In [42]:
import string
import os
# Import codecs in order to save articles as .txt files
import codecs
from urllib.request import urlopen
# Import the library that pulls out HTML data
from bs4 import BeautifulSoup
import pandas as pd

# Define the set of punctuation characters to exclude
exclude = set(string.punctuation)

# Define the directory to save the Excel file
save_dir = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva')

# Define the path to save the Excel file
excel_file_path = os.path.join(save_dir, 'spiegel_2011_2015.xlsx')

# Excel file where we save the data
data_excel = pd.ExcelWriter(excel_file_path)

# Directory to save text files
text_save_dir = os.path.join(save_dir, 'Spiegel_scrape_2011_2015_txt')
os.makedirs(text_save_dir, exist_ok=True)

# Common part in all the links
common = "http://www.spiegel.de/spiegel/print/index-"
magazine_year = 2010
issue = 0
number_of_rows = 0

# Iterate over the years (2011 to 2015)
for i in range(0, 5):
    magazine_year += 1
    magazine_year_fixed = str(magazine_year)
    
    # Iterate over the issues (1 to 53)
    for i in range(0, 53):
        issue += 1
        # Convert issue number to a zero-padded string
        issue_fixed = str(issue).zfill(2)
        # Construct the URL for the issue
        spiegel_issue = common + magazine_year_fixed + "-" + issue_fixed + ".html"
        
        
        try:
            # Open the URL and create a BeautifulSoup object
            spiegel_issue = urlopen(spiegel_issue)
            spiegel_issue_soup = BeautifulSoup(spiegel_issue, 'html.parser')
            
            # Extract the title of the issue
            title = spiegel_issue_soup.title.string
            # Extract issue number and year from the title
            issue_number = int(title.rsplit(' ', 3)[1].rsplit('/', 1)[0])
            year = int(title.rsplit(' ', 3)[1].rsplit('/', 1)[1])
            
            links_journal = []
            # Find all links in the HTML document
            all_links_spiegel = spiegel_issue_soup.findAll("a")

            # Extract the link of each article and the heading of each article
            for link in all_links_spiegel:
                # Check if there is a link part in the tag
                if link.get("href") is not None:
                    # Check if the link contains '/print/d-'
                    if link.get("href").find('/print/d-') != -1:
                        # Append the link to links_journal list
                        links_journal.append(link.get("href"))

            texts = []
            titles = []
            titles_clean = []
            dates = []
            # Identifier of the article in the name
            id = 0
            
            # Iterate over each article link
            for link in links_journal:
                id += 1
                # Convert id to a string for use in the text document name
                id_fixed = str(id)
                
                try:
                    # Open the article URL and create a BeautifulSoup object
                    article = urlopen(link)
                    article = BeautifulSoup(article, 'html.parser')
                    
                    # Extract the text
                    paragraphs = article.findAll("div", {"class": "RichText"})
                    if paragraphs != []:
                        title = article.find("span", {"class": "align-middle"}).get_text()
                        title = title.strip().replace("\n", '')
                        # Normalize the title
                        # Remove specific punctuation, replace hyphens with spaces, convert to lowercase, and strip leading/trailing spaces
                        title_clean = ''.join(ch for ch in title.replace('-', ' ').lower() if (ch not in exclude) and (ch not in ['"', '„', '“', '»', '«'])).strip()
                        # Standardize spaces to a single space
                        title_clean = " ".join(title_clean.split())
                                              
                        # Check if the normalized title of the current article is among the normalized titles of articles 
                        # from the MediaTenor dataset
                        if any(title_clean in s for s in sentiment_data['title_clean']):
                            
                            # Extract the publication date
                            date = article.find("time", {"class": "timeformat"}).get_text()
                            
                            # Extract the text from the paragraphs
                            text_new = ''
                            for par in paragraphs:
                                text_new = text_new + par.get_text()
                            text_new = text_new.strip().replace("\n", ' ')
                            if title[-1] not in ['.', '!', ':', ';','?', '"']:
                                text_new = title + '. ' + text_new
                            else:
                                text_new = title + ' ' + text_new
                                
                    else:
                        text_new = ''
                        title = ''
                        title_clean = ''
                        date = ''

                # Handle errors (e.g., if the page doesn't exist)
                except IOError:
                    text_new = ''
                    title = ''
                    title_clean = ''
                    date = ''

                # Add the extracted data to the lists if the title is valid and exists in sentiment_data
                if any(title_clean in s for s in sentiment_data['title_clean']) and (title != ''):
                    texts.append(text_new)
                    titles.append(title)
                    titles_clean.append(title_clean)
                    dates.append(date)
                    
                    # Save the article as a .txt file
                    issue_string = str(issue_number)
                    year_string = str(year)
                    file_path = os.path.join(text_save_dir, f"{year_string} {issue_string} {id_fixed}.txt")
                    with codecs.open(file_path, "w", "utf-8-sig") as temp:
                        temp.write(text_new) 
                        
            # Create a DataFrame from the extracted data               
            data = pd.DataFrame({
                'dates': dates,
                'text': texts,
                'titles': titles,
                'titles_clean': titles_clean
            })            

            # Save the data from the current issue to the Excel file
            data.to_excel(data_excel, sheet_name='Sheet1', header=False, startrow=number_of_rows)
            # Update the row counter
            number_of_rows = data.shape[0] + number_of_rows
            
        except IOError:
            pass
        
    # Reset the issue counter for the next year    
    issue = 0
    
# Save and close the Excel file (close() writes the file; ExcelWriter.save() was removed in pandas 2.0)
data_excel.close()
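
The two nested loops above build one index URL per year and issue number. As a compact sketch of the same URL scheme, a generator could yield the year, issue number, and URL together. The function name `issue_index_urls` and its parameters are assumptions for illustration. Note that this URL pattern reflects the site layout at the time of scraping.

```python
def issue_index_urls(first_year, last_year, max_issue=53,
                     base="http://www.spiegel.de/spiegel/print/index-"):
    """Yield (year, issue, url) for every issue index page in the given range of years."""
    for year in range(first_year, last_year + 1):
        for issue in range(1, max_issue + 1):
            # Issue numbers are zero-padded to two digits, e.g. index-2011-01.html
            yield year, issue, f"{base}{year}-{issue:02d}.html"


urls = list(issue_index_urls(2011, 2015))
print(len(urls))
print(urls[0][2])
```

Issue 53 does not exist in every year, so the request for that page still has to be wrapped in error handling, as in the cell above.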

## Web scraping: issues 1-52 (53), 2016.

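In 2016, an article without a title caused an error during scraping. When the title element is missing, `article.find("span", {"class": "align-middle"})` returns `None`, and calling `.get_text()` on it raises an `AttributeError` that the `except IOError` clause does not catch. To handle this, we added a condition that skips articles without a title element, so that only articles with both text and a title are processed. This prevents the error and ensures that every processed article can be matched with the Media Tenor dataset.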

In [43]:
# Define the directory to save the Excel file
save_dir = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva')

# Define the path to save the Excel file
excel_file_path = os.path.join(save_dir, 'spiegel_2016.xlsx')

# Excel file where we save the data
data_excel_2016 = pd.ExcelWriter(excel_file_path)

# Directory to save text files
text_save_dir = os.path.join(save_dir, 'Spiegel_scrape_2016_txt')
os.makedirs(text_save_dir, exist_ok=True)

# Common part in all the links
common = "http://www.spiegel.de/spiegel/print/index-"
magazine_year = 2015
issue = 0
number_of_rows = 0

# Iterate over the years (2016)
for i in range(0, 1):
    magazine_year += 1
    magazine_year_fixed = str(magazine_year)
    
    # Iterate over the issues (1 to 53)
    for i in range(0, 53):
        issue += 1
        # Convert issue number to a zero-padded string
        issue_fixed = str(issue).zfill(2)
        # Construct the URL for the issue
        spiegel_issue = common + magazine_year_fixed + "-" + issue_fixed + ".html"
        
        
        try:
            # Open the URL and create a BeautifulSoup object
            spiegel_issue = urlopen(spiegel_issue)
            spiegel_issue_soup = BeautifulSoup(spiegel_issue, 'html.parser')
            
            # Extract the title of the issue
            title = spiegel_issue_soup.title.string
            # Extract issue number and year from the title
            issue_number = int(title.rsplit(' ', 3)[1].rsplit('/', 1)[0])
            year = int(title.rsplit(' ', 3)[1].rsplit('/', 1)[1])
            
            links_journal = []
            # Find all links in the HTML document
            all_links_spiegel = spiegel_issue_soup.findAll("a")

            # Extract the link of each article and the heading of each article
            for link in all_links_spiegel:
                # Check if there is a link part in the tag
                if link.get("href") is not None:
                    # Check if the link contains '/print/d-'
                    if link.get("href").find('/print/d-') != -1:
                        # Append the link to links_journal list
                        links_journal.append(link.get("href"))
                        
            texts = []
            titles = []
            titles_clean = []
            dates = []
            # Identifier of the article in the name
            id = 0
            
            # Iterate over each article link
            for link in links_journal:
                id += 1
                # Convert id to a string for use in the text document name
                id_fixed = str(id)
                
                try:
                    # Open the article URL and create a BeautifulSoup object
                    article = urlopen(link)
                    article = BeautifulSoup(article, 'html.parser')
                    
                    # Extract the text
                    paragraphs = article.findAll("div", {"class": "RichText"})
                    # NEW CONDITION
                    # Check if the paragraphs list is not empty and the article has a title element with the class "align-middle"
                    if (paragraphs != []) and (article.find("span", {"class": "align-middle"}) is not None):

                        title = article.find("span", {"class": "align-middle"}).get_text()
                        title = title.strip().replace("\n", '')
                        # Normalize the title
                        # Remove specific punctuation, replace hyphens with spaces, convert to lowercase, and strip leading/trailing spaces
                        title_clean = ''.join(ch for ch in title.replace('-', ' ').lower() if (ch not in exclude) and (ch not in ['"', '„', '“', '»', '«'])).strip()
                        # Standardize spaces to a single space
                        title_clean = " ".join(title_clean.split())
                        
                        # Check if the normalized title of the current article is among the normalized titles of articles 
                        # from the MediaTenor dataset     
                        if any(title_clean in s for s in sentiment_data['title_clean']):
                            
                            # Extract the publication date
                            date = article.find("time", {"class": "timeformat"}).get_text()
                            
                            # Extract the text from the paragraphs
                            text_new = ''
                            for par in paragraphs:
                                text_new = text_new + par.get_text()
                            text_new = text_new.strip().replace("\n", ' ')
                            if title[-1] not in ['.', '!', ':', ';','?', '"']:
                                text_new = title + '. ' + text_new
                            else:
                                text_new = title + ' ' + text_new

                    else:
                        text_new = ''
                        title = ''
                        title_clean = ''
                        date = ''

                # Handle errors (e.g., if the page doesn't exist)   
                except IOError:
                    text_new = ''
                    title = ''
                    title_clean = ''
                    date = ''

                # Add the extracted data to the lists if the title is valid and exists in sentiment_data
                if any(title_clean in s for s in sentiment_data['title_clean']) and (title != ''):
                    texts.append(text_new)
                    titles.append(title)
                    titles_clean.append(title_clean)
                    dates.append(date)
                    
                    # Save the article as a .txt file
                    issue_string = str(issue_number)
                    year_string = str(year)
                    file_path = os.path.join(text_save_dir, f"{year_string} {issue_string} {id_fixed}.txt")
                    with codecs.open(file_path, "w", "utf-8-sig") as temp:
                        temp.write(text_new) 
                        
            # Create a DataFrame from the extracted data                 
            data = pd.DataFrame({
                'dates': dates,
                'text': texts,
                'titles': titles,
                'titles_clean': titles_clean
            })       

            # Save the data from the current issue to the Excel file
            data.to_excel(data_excel_2016, sheet_name='Sheet1', header=False, startrow=number_of_rows)
            # Update the row counter
            number_of_rows = data.shape[0] + number_of_rows
            
        except IOError:
            pass
    
    # Reset the issue counter for the next year
    issue = 0
    
# Save and close the Excel file (close() writes the file; ExcelWriter.save() was removed in pandas 2.0)
data_excel_2016.close()
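
Both scraping cells decide whether to keep an article with the expression `any(title_clean in s for s in sentiment_data['title_clean'])`. This is a substring test, so a short scraped title is also accepted when it merely occurs inside a longer Media Tenor title, and an empty string is contained in every string, which is why the additional `title != ''` check is needed. The sketch below reproduces that behavior on a plain list. The function name `is_candidate` and the sample titles are illustrative only.

```python
def is_candidate(title_clean, known_titles):
    """Mirror the notebook's test: a non-empty title contained in any known title."""
    return title_clean != '' and any(title_clean in known for known in known_titles)


known = ['keim der angst', 'spirale nach unten']

print(is_candidate('keim der angst', known))  # exact match
print(is_candidate('angst', known))           # substring of a longer title
print(is_candidate('', known))                # empty title is rejected by the guard
print('' in 'keim der angst')                 # without the guard, '' matches everything
```

This behavior is consistent with treating the scraped articles as candidates that are verified later by title and publication date.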