Because 18% of the annotated articles (730 annotations) came from Focus, we sped up the download by scraping 223 of these articles (31% of the Focus annotations) directly from the Focus website. Articles published between 2011 and 2019 are freely accessible without a subscription.

## Media Tenor dataset

First, we need to load the dataset provided by Media Tenor. This step is important because we will be scraping articles from Focus for the period 2011-2019, but only if their titles at least partially match those in the Media Tenor dataset.

In [1]:
import pandas as pd

# Load the dataset acquired from Media Tenor
sentiment_data = pd.read_csv('Daten_Wirtschaftliche_Lage.csv', encoding='utf-8', sep=';')

# Filter out rows with empty titles, as we cannot identify and download the articles without titles
sentiment_data = sentiment_data[sentiment_data['title'].notnull()]

# Reset the index of the DataFrame
sentiment_data = sentiment_data.reset_index(drop=True)

# Display the first few rows of the DataFrame
sentiment_data.head()

Unnamed: 0,date,month,medium,title,topicgroup,negative,no_clear_tone,positive,Number_of_reports,AverageRating
0,01.01.2014,201401,WamS,Koalition,Konjunktur,0,1,0,1,0
1,01.01.2017,201701,FAS,Habt bloß keine Angst vor China !,Internationale Wirtschaft,0,0,1,1,100
2,01.01.2017,201701,BamS,Wir leben in einer Zeit der Wohlstands-Halluzi...,Konjunktur,0,0,1,1,100
3,01.02.2015,201502,WamS,Teheran ruft,Wettbewerbsfähigkeit/Nachfrage,1,3,0,4,-25
4,01.01.2017,201701,BamS,"Geht es und wirklich so gut, wie es uns Merkel...",Internationale Wirtschaft,0,1,0,1,0


Due to the manual entry of titles in the Media Tenor dataset, there are potential inconsistencies in punctuation and spacing. To resolve this and achieve precise matching with the titles of the articles we scrape from the website, we normalize the titles in the dataset.

In [2]:
# Import the Normalize class from the normalize module
from normalize import Normalize

# Initialize the Normalize class with the titles from the sentiment_data DataFrame
normalizer = Normalize(sentiment_data.title)

# Apply the normalization process to the titles
normalized_titles = normalizer.normalized()

# Print the first three normalized titles to verify the results
print(normalized_titles[:3])

# Add the normalized titles to the sentiment_data DataFrame as a new column 'title_clean'
sentiment_data['title_clean'] = normalized_titles

['koalition', 'habt bloß keine angst vor china', 'wir leben in einer zeit der wohlstands halluzination']
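
The `normalize` module itself is not shown in this notebook. As a rough illustration only, the sketch below reproduces the normalization steps that the scraping cells later apply to scraped headings: lowercasing, replacing hyphens with spaces, dropping punctuation and quotation marks, and collapsing whitespace. The function name `normalize_title` is hypothetical, and the actual `Normalize` class may behave differently.

```python
import string

# Characters to drop: ASCII punctuation plus common German/typographic quotes.
# This set mirrors the scraping cells in this notebook; the real Normalize
# class is not shown and may use a different set (assumption).
_DROP = set(string.punctuation) | {'„', '“', '»', '«'}


def normalize_title(title):
    """Lowercase a title, strip punctuation, and collapse whitespace."""
    # Hyphens become spaces so that the parts of compound words stay separated
    lowered = title.replace('-', ' ').lower()
    kept = ''.join(ch for ch in lowered if ch not in _DROP)
    # Collapse runs of whitespace into single spaces and trim the ends
    return ' '.join(kept.split())


print(normalize_title('Habt bloß keine Angst vor China !'))
print(normalize_title('Läuft.'))
```

Applying the same function to both the Media Tenor titles and the scraped headings would keep the two sides of the comparison consistent.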


We restrict the dataset to Focus articles annotated under business cycle conditions (Konjunktur), since these are the articles we intend to scrape, subject to their availability on the website.

In [3]:
# Keep only articles from Focus that relate to business cycle conditions (Konjunktur)
is_focus = sentiment_data['medium'] == 'Focus'
is_konjunktur = sentiment_data['topicgroup'] == 'Konjunktur'

# Apply both filters at once and reset the index, dropping the old index column
sentiment_data = sentiment_data[is_focus & is_konjunktur].reset_index(drop=True)

# Display the first few rows of the filtered DataFrame to verify the results
sentiment_data.head()

Unnamed: 0,date,month,medium,title,topicgroup,negative,no_clear_tone,positive,Number_of_reports,AverageRating,title_clean
0,01.04.2017,201704,Focus,Die Kündigung,Konjunktur,1,1,0,2,-50,die kündigung
1,01.07.2017,201707,Focus,Läuft.,Konjunktur,0,0,1,1,100,läuft
2,01.04.2017,201704,Focus,Anlegen in Zeiten von Trump,Konjunktur,0,0,2,2,100,anlegen in zeiten von trump
3,01.07.2013,201307,Focus,Gewinne trotz Wackelbörse,Konjunktur,0,0,3,3,100,gewinne trotz wackelbörse
4,01.10.2011,201110,Focus,Macht Europa nicht kaputt,Konjunktur,0,1,0,1,0,macht europa nicht kaputt


## Web scraping: example

The following code illustrates the process of scraping articles from a specific issue of Focus (19th issue of 2019) that have titles at least partially matching those listed in the Media Tenor dataset. These articles are publicly accessible without a subscription. In our example, we demonstrate how to extract article URLs from the issue's index page, access each link, and retrieve the article's title, publication date, and content.

In [4]:
import codecs
import string
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Define the URL for a specific issue of Focus from 2019
focus_issue_url = "https://www.focus.de/magazin/archiv/jahrgang_2019/ausgabe_19/"

# Open the URL and create a BeautifulSoup object
focus_issue = urlopen(focus_issue_url)
focus_issue_soup = BeautifulSoup(focus_issue, 'html.parser')

# Print the first 500 characters of the HTML content for a brief overview
print("HTML Content Preview:\n", focus_issue_soup.prettify()[:500])

HTML Content Preview:
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html itemscope="" itemtype="https://schema.org/CreativeWork" lang="de" xml:lang="de" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#" xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
  <title>
   FOCUS Magazin Heft-Archiv - Alle Jahrgänge
  </title>
  <meta content="de" http-equiv="content-language"/>
  <meta content="Im FOCUS-Archiv auf FOCUS Online finden S


Here, we show how to extract the issue number and year from the URL of a Focus issue.

In [5]:
# Extract the issue number from the URL
issue_number = int(focus_issue_url.split('/')[-2].replace("ausgabe_", ''))

# Extract the year from the URL
year = int(focus_issue_url.split('/')[-3].replace("jahrgang_", ''))

# Display the extracted issue number and year
print(f"Issue Number: {issue_number}, Year: {year}")

Issue Number: 19, Year: 2019
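
As an alternative to chained `split` and `replace` calls, the same extraction can be sketched with a regular expression, which fails loudly when a URL does not follow the expected pattern. The helper name `parse_issue_url` is hypothetical, and the pattern assumes the `jahrgang_<year>/ausgabe_<issue>/` layout seen in the URL above.

```python
import re
from typing import Tuple

# Pattern for archive URLs ending in .../jahrgang_<year>/ausgabe_<issue>/
ISSUE_URL_RE = re.compile(r"/jahrgang_(\d{4})/ausgabe_(\d+)/?$")


def parse_issue_url(url: str) -> Tuple[int, int]:
    """Return (year, issue_number) parsed from a Focus archive issue URL."""
    match = ISSUE_URL_RE.search(url)
    if match is None:
        raise ValueError(f"Not a recognized issue URL: {url}")
    return int(match.group(1)), int(match.group(2))


year, issue_number = parse_issue_url(
    "https://www.focus.de/magazin/archiv/jahrgang_2019/ausgabe_19/"
)
print(f"Issue Number: {issue_number}, Year: {year}")
```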


Next, we extract the links and titles of all the articles from a particular issue. We find all the `<a>` tags in the HTML document and check if they contain a valid link. We then filter these links based on specific conditions to ensure they point to actual articles, excluding irrelevant links.

In [6]:
# Initialize lists to store article links and headings
links_journal = []
headings = []

# Find all <a> tags in the HTML document (find_all is the current name of the legacy findAll alias)
all_links_focus = focus_issue_soup.find_all("a")

# List of links to exclude
exclude_links = [
    'https://www.focus.de/politik/', 'https://www.focus.de/finanzen/', 'https://www.focus.de/kultur/',
    'https://www.focus.de/finanzen/focus-online-kooperationen-services-vergleiche-rechner_id_8615608.html',
    'https://www.focus.de/politik/deutschland/', 'https://www.focus.de/politik/sicherheitsreport/',
    'https://www.focus.de/politik/gerichte-in-deutschland/', 'https://www.focus.de/politik/ausland/',
    'https://www.focus.de/politik/videos/', 'https://www.focus.de/politik/experten/', 
    'https://www.focus.de/politik/praxistipps/', 'https://www.focus.de/politik/', 
    'https://www.focus.de/finanzen/boerse/', 'https://www.focus.de/finanzen/altersvorsorge/',
    'https://www.focus.de/finanzen/news/', 'https://www.focus.de/finanzen/banken/', 
    'https://www.focus.de/finanzen/versicherungen/', 'https://www.focus.de/finanzen/recht/', 
    'https://www.focus.de/finanzen/karriere/', 'https://www.focus.de/finanzen/experten/', 
    'https://www.focus.de/finanzen/steuern/', 'https://www.focus.de/finanzen/videos/', 
    'https://www.focus.de/finanzen/', 
    'https://www.focus.de/finanzen/boersenbriefe/finanzen100-boersenbriefe-die-wichtigsten-infos-fuer-ihren-boersenerfolg_id_7150217.html',
    'https://www.focus.de/finanzen/banken/waehrungsrechner-die-wechselkurse-am-bankschalter_aid_53741.html',
    'https://www.focus.de/finanzen/banken/kredit/tid-8141/kreditrechner_aid_210205.html', 
    'https://www.focus.de/finanzen/versicherungen/krankenversicherung/der-grosse-kassenvergleich-krankenversicherung_id_1725907.html', 
    'https://www.focus.de/finanzen/banken/kreditkarten/kreditkarten-100-angebote-im-vergleich_id_2306740.html',
    'https://www.focus.de/finanzen/steuern/gehaltsplaner/brutto-netto-rechner-was-ihnen-vom-gehalt-uebrig-bleibt_id_2297045.html', 
    'https://www.focus.de/finanzen/altersvorsorge/rente/tid-8425/rentenrechner_aid_68489.html'
]

# Extract the link and heading of each article
for link in all_links_focus:
    # Check if there is a link part in the tag
    if link.get("href") is not None:
        href = link.get("href")
        # Check if the link contains any valid section and include only those links that point to actual articles
        if (any(section in href for section in ['/magazin/archiv', '/politik/', '/kultur/', '/finanzen/', '/wissen/', 
                                    '/sport/', '/gesundheit/', '/auto/', '/reisen/']) and 
            'login/' not in href and 
            'rss.focus.de' not in href and 
            href != 'https://www.focus.de/magazin/archiv/' and 
            href not in exclude_links and 
            '.html' in href):
            # Append the link to links_journal list
            links_journal.append(href)
            # Append the cleaned heading to headings list
            remove = link.text.split(':')[0]
            headings.append(link.text.replace(remove + ':', '').strip('\xa0').replace('\ufeff', '').replace(":\xa0", "").lstrip())

# Display the first 10 extracted links and headings for verification
print("Extracted Links (first 10):", links_journal[:10])
print("Extracted Headings (first 10):", headings[:10])

Extracted Links (first 10): ['https://www.focus.de/finanzen/boerse/automatische-geldanlage-mit-einem-roboadvisor-so-finden-sie-den-besten-digitalen-anlagehelfer-fuer-ihren-vermoegensaufbau_id_9142131.html', 'https://www.focus.de/finanzen/versicherungen/schutz-fuer-teure-schaetzchen-hausratversicherung_id_1743846.html', 'https://www.focus.de/finanzen/versicherungen/bein-ab-arm-dran-unfallversicherung_id_2262359.html', 'https://www.focus.de/magazin/archiv/politik-die-aufs-und-abs-der-woche_id_10664136.html', 'https://www.focus.de/magazin/archiv/politik-afd-will-fraktionsspitze-umbauen-jeder-zweite-soll-gehen_id_10664138.html', 'https://www.focus.de/magazin/archiv/der-auferstandene-zurueck-auf-der-bildflaeche_id_10664151.html', 'https://www.focus.de/magazin/archiv/der-newcomer-zurueck-auf-der-buehne_id_10664153.html', 'https://www.focus.de/magazin/archiv/power-paare-wer-mit-wem-wer-gegen-wen-ruestungsstopp-zerschiesst-bilanz_id_10664159.html', 'https://www.focus.de/magazin/archiv/der-abst
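
The filter conditions and the heading cleanup above can also be wrapped in two small helpers, which makes them easier to test in isolation and to reuse. This is only a sketch: the names `is_article_link` and `clean_heading` are hypothetical, and the logic is copied from the cell above rather than derived from any documentation of the Focus website.

```python
# Sections whose links may point to magazine articles (taken from the filter above)
ARTICLE_SECTIONS = ('/magazin/archiv', '/politik/', '/kultur/', '/finanzen/',
                    '/wissen/', '/sport/', '/gesundheit/', '/auto/', '/reisen/')


def is_article_link(href, exclude_links=()):
    """Return True if href looks like a link to an individual article."""
    if href is None:
        return False
    return (any(section in href for section in ARTICLE_SECTIONS)
            and 'login/' not in href
            and 'rss.focus.de' not in href
            and href != 'https://www.focus.de/magazin/archiv/'
            and href not in exclude_links
            and '.html' in href)


def clean_heading(link_text):
    """Drop a leading 'Section:' prefix and stray non-breaking spaces or BOMs."""
    prefix = link_text.split(':')[0]
    return (link_text.replace(prefix + ':', '')
            .strip('\xa0').replace('\ufeff', '').replace(':\xa0', '').lstrip())


print(is_article_link(
    'https://www.focus.de/magazin/archiv/politik-die-aufs-und-abs-der-woche_id_10664136.html'))
print(clean_heading('Politik: Die Aufs und Abs der Woche'))
```

With these helpers, the loop body reduces to one `is_article_link(href, exclude_links)` check followed by `clean_heading(link.text)`.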

We also demonstrate how to extract the publication date, annotation, and main text from a specific Focus article.

In [7]:
# Define the URL for a specific article
focus_article_url = links_journal[4]

# Open the URL and create a BeautifulSoup object
focus_article = urlopen(focus_article_url)
focus_article_soup = BeautifulSoup(focus_article, 'html.parser')

# Extract the publication date of the article
date = focus_article_soup.find("div", {"class": "displayDate"}).get_text().split(' ')[1].replace(',', '')

# Extract the annotation of the article
annotation = focus_article_soup.find("div", {"class": "leadIn"}).get_text()

# Extract the paragraphs of the main text of the article
paragraphs = list(focus_article_soup.find("div", {"class": "textBlock"}).children)

# Characters that already end a sentence; text ending in anything else gets a trailing period
sentence_end = ('.', '!', ':', ';', '?', '"', "'", '…')

# Initialize the text with the annotation
text_new = annotation.strip()
if text_new and not text_new.endswith(sentence_end):
    text_new += '. '

# Process each paragraph of the main text, skipping empty or whitespace-only ones
for par in paragraphs:
    par_text = par.get_text(separator=' ').strip()
    if par_text:
        if not par_text.endswith(sentence_end):
            par_text += '.'
        text_new += ' ' + par_text

# Clean up the final text
text_new = text_new.strip()
text_new = text_new.replace("\n", '')

# Print the extracted publication date
print("Publication Date:", date)

# Print the extracted annotation
print("\nAnnotation:", "\n" + annotation.strip())

# Print the cleaned and formatted main text
print("\nMain Text:", "\n" + text_new)

Publication Date: 03.05.2019

Annotation: 
Die Wahl des neuen Vorstands muss um drei Monate vorgezogen werden, fordern unzufriedene Abgeordnete

Main Text: 
Die Wahl des neuen Vorstands muss um drei Monate vorgezogen werden, fordern unzufriedene Abgeordnete.  Weil es massive Kritik an ihrer Amtsführung gibt, stehen zahlreiche Mitglieder  der AfD -Fraktionsführung im Bundestag vor dem Aus. Mehrere Abgeordnete haben beantragt, die turnusgemäß für September geplante Neuwahl der Fraktionsführung auf den 4. Juni vorzuverlegen. Dabei könnten von den elf Vorstandsposten bis zu fünf neu besetzt werden, heißt es von AfD-Abgeordneten. „In der AfD gibt es keine Erbpacht auf Ämter und Funktionen“, sagt Fraktionschefin Alice Weidel. Deshalb gelte niemand vor einer Wahl als gesetzt. Allerdings können Weidel und Co-Chef Alexander Gauland bislang mit einer Wiederwahl rechnen. Dagegen droht zwei der fünf stellvertretenden Fraktionschefs die Abwahl: Peter Felser und Beatrix von Storch. Von Storch hatte 
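
The text assembly above follows one rule: each fragment (annotation or paragraph) gets a trailing period unless it already ends in sentence-final punctuation, and the fragments are then joined with spaces. A minimal sketch of that rule on plain strings is shown below. The name `join_paragraphs` is hypothetical, and unlike the cell above it always joins with a single space.

```python
# Characters treated as already ending a sentence (same set as in the cell above)
SENTENCE_END = ('.', '!', ':', ';', '?', '"', "'", '…')


def join_paragraphs(paragraph_texts):
    """Join text fragments, adding a period to any fragment lacking end punctuation."""
    parts = []
    for raw in paragraph_texts:
        text = raw.strip()
        if not text:
            continue  # skip empty or whitespace-only fragments
        if not text.endswith(SENTENCE_END):
            text += '.'
        parts.append(text)
    return ' '.join(parts)


print(join_paragraphs(['Erster Absatz', '\n', 'Zweiter Absatz!']))
```

Checking for an empty string before inspecting the last character avoids an `IndexError` on whitespace-only fragments.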

Finally, we demonstrate how to download the full texts, titles, and publication dates of articles from a specific issue of Focus that have titles at least partially matching those in the Media Tenor dataset. We process each article link, extract the necessary details, and save each article as a TXT file. Furthermore, we compile all the articles into a DataFrame.

In [8]:
import os

# Define the set of punctuation characters to exclude
exclude = set(string.punctuation)

# Lists to store extracted data
texts = []
titles = []
titles_clean = []
dates = []

# Identifier for the articles
id = 0

# Ensure the directory exists for saving articles
save_dir = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva', 'Focus_scrape_example_txt')
os.makedirs(save_dir, exist_ok=True)

# Iterate through the list of headings
for heading in headings:
    id += 1
    id_fixed = str(id)
    
    # Retrieve the title of the current article from the headings list
    title = headings[(id-1)]
    title = title.strip().replace("\n", '')
    # Normalize the title
    # Remove specific punctuation, replace hyphens with spaces, convert to lowercase, and strip leading/trailing spaces
    title_clean = ''.join(ch for ch in title.replace('-', ' ').lower() if (ch not in exclude) and (ch not in ['"', '„', '“', '»', '«'])).strip()
    # Standardize spaces to a single space
    title_clean = " ".join(title_clean.split())
    
    # Check if the normalized title of the current article at least partially matches 
    # any normalized title from the MediaTenor dataset
    if any(title_clean in s for s in sentiment_data['title_clean']):
        # Open the article link
        try: 
            article = urlopen(links_journal[(id-1)])
            # Create a BeautifulSoup object
            article = BeautifulSoup(article, 'html.parser')
            # Extract the paragraphs of the main text of the article
            paragraphs = list(article.find("div", {"class": "textBlock"}).children)
            
            # Proceed only if paragraphs are found
            if paragraphs:
                
                # Extract the publication date of the article
                date = article.find("div", {"class": "displayDate"}).get_text().split(' ')[1].replace(',', '')
                               
                # Initialize the text with the annotation
                text_new = ''                
                if article.find("div", {"class": "leadIn"})  is not None: 
                    # Extract the annotation of the article
                    annotation = article.find("div", {"class": "leadIn"}).get_text()
                    
                    if annotation.strip()[-1] not in ['.', '!', ':', ';', '?', '"', "'", '...', '…']:
                        text_new = text_new + annotation.strip() + '. '
                    else:
                        text_new = text_new + annotation.strip()

                # Extract the text from the paragraphs
                for par in paragraphs:
                    if par.get_text().replace('\n', ''):
                        if (par.get_text().strip()[-1] not in ['.', '!', ':', ';', '?', '"', "'", '...', '…']):
                            text_new = text_new + ' ' + par.get_text(separator=' ').strip() + '.'
                        else:
                            text_new = text_new + ' ' + par.get_text(separator=' ').strip()
                
                # Clean up the final text
                text_new = text_new.strip().replace("\n", '')
                
                if title[-1] not in ['.', '!', ':', ';','?', '"']:
                    text_new = title + '. ' + text_new
                else:
                    text_new = title + ' ' + text_new

            else:
                text_new = ''
                title = ''
                title_clean = ''
                date = ''

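        # Skip the article on download errors (e.g., page not found) or if the page
        # lacks the expected elements (find() returns None) or contains empty text
        except (IOError, AttributeError, IndexError):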
            text_new = ''
            title = ''
            title_clean = ''
            date = ''

        # Add the extracted data to the lists if the title is valid and at least partially matches 
        # any title in sentiment_data
        if any(title_clean in s for s in sentiment_data['title_clean']) and title:
            texts.append(text_new)
            titles.append(title)
            titles_clean.append(title_clean)
            dates.append(date)
            
            # Save the article as a .txt file
            issue_string = str(issue_number)
            year_string = str(year)
            file_path = os.path.join(save_dir, f"{year_string} {issue_string} {id_fixed}.txt")
            with codecs.open(file_path, "w", "utf-8-sig") as temp:
                temp.write(text_new) 

# Create a DataFrame from the extracted data
data = pd.DataFrame({
    'dates': dates,
    'titles': titles,
    'titles_clean': titles_clean,
    'text': texts
})

data

Unnamed: 0,dates,titles,titles_clean,text
0,03.05.2019,Wer nächste Woche wichtig wird,wer nächste woche wichtig wird,Wer nächste Woche wichtig wird. So. Papst Fran...
1,03.05.2019,Zeugnistag,zeugnistag,Zeugnistag. Schlapper Start Stuttgart Das Jahr...
2,03.05.2019,Neues Milliardenloch,neues milliardenloch,Neues Milliardenloch. Der Bundesregierung droh...
3,03.05.2019,Musik,musik,Musik. (Rang Vorwoche/Anzahl der Wochen). 1 V...
4,03.05.2019,"Zahlen, bitte",zahlen bitte,"Zahlen, bitte. 31 Pensionskassen von insgesamt..."


## Web scraping: issues 1-52 (53), 2011-2019

We employed the following code to download the full texts of articles, including their titles and publication dates, from issues 1-53 for the years 2011-2019. We only scraped an article if its title at least partially matched any title in the Media Tenor dataset. For instance, if the title of the scraped article was "Musik" and the Media Tenor dataset contained a title "Fahrrad, Musik & Cowboyhut," the article titled "Musik" would be scraped as a candidate. This method allowed us to collect candidate articles that potentially matched the Media Tenor metadata.
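
The partial-match rule described above is a plain substring test on normalized titles. The sketch below illustrates it with the "Musik" example. The helper name `is_candidate` and the reference list are illustrative, and the guard against empty titles is an addition that the scraping code does not contain in this form.

```python
def is_candidate(title_clean, reference_titles):
    """Return True if title_clean occurs as a substring of any reference title."""
    # Guard (an addition in this sketch): the empty string is a substring of
    # every string, so an empty normalized title would match all references.
    if not title_clean:
        return False
    return any(title_clean in ref for ref in reference_titles)


# Normalized reference titles; "Fahrrad, Musik & Cowboyhut" normalizes to the first entry
reference_titles = ['fahrrad musik cowboyhut', 'neues milliardenloch']

print(is_candidate('musik', reference_titles))   # True
print(is_candidate('sport', reference_titles))   # False
```

Because short titles such as "Musik" match many longer ones, this test deliberately over-collects, which is why the candidates need a stricter second matching step.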

We downloaded 1,746 candidate articles from 2011-2019. Subsequently, we matched these articles with the Media Tenor metadata based on both title and publication date. As a result, we successfully scraped 223 articles from the Media Tenor dataset. The scraped texts were matched with their metadata in the notebook `Focus.ipynb`.
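
The actual matching is implemented in `Focus.ipynb` and is not reproduced here. Purely as an illustration of matching on both title and date, the sketch below pairs candidates with metadata rows when the normalized titles are identical and the dates fall in the same calendar month. Both criteria are assumptions, as are the helper names `month_key` and `match_candidates` and the sample rows.

```python
from datetime import datetime


def month_key(date_str):
    """Convert a 'dd.mm.yyyy' date string to a 'yyyymm' key."""
    return datetime.strptime(date_str, '%d.%m.%Y').strftime('%Y%m')


def match_candidates(candidates, metadata):
    """Pair scraped candidates with metadata rows sharing title and month.

    Both arguments are lists of dicts with 'title_clean' and 'date' keys.
    """
    # Index the metadata by (normalized title, year-month) for fast lookup
    index = {}
    for row in metadata:
        key = (row['title_clean'], month_key(row['date']))
        index.setdefault(key, []).append(row)

    matches = []
    for cand in candidates:
        key = (cand['title_clean'], month_key(cand['date']))
        for row in index.get(key, []):
            matches.append((cand, row))
    return matches


# Made-up sample rows for illustration only
metadata = [{'title_clean': 'neues milliardenloch', 'date': '01.05.2019'}]
candidates = [
    {'title_clean': 'neues milliardenloch', 'date': '03.05.2019'},
    {'title_clean': 'musik', 'date': '03.05.2019'},
]

matches = match_candidates(candidates, metadata)
print(len(matches))
```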

In [9]:
import string
import os
# Import codecs in order to save articles as .txt files
import codecs
from urllib.request import urlopen
# Import the library that pulls out HTML data
from bs4 import BeautifulSoup
import pandas as pd

# Define the set of punctuation characters to exclude
exclude = set(string.punctuation)

# Define the directory to save the Excel file
save_dir = os.path.join(os.getcwd(), 'MediaTenor_LexisNexis_Factiva')

# Define the path to save the Excel file
excel_file_path = os.path.join(save_dir, 'focus_2011_2019.xlsx')

# Excel file where we save the data
data_excel = pd.ExcelWriter(excel_file_path)

# Directory to save text files
text_save_dir = os.path.join(save_dir, 'Focus_scrape_2011_2019_txt')
os.makedirs(text_save_dir, exist_ok=True)

# Common part in all the links
common = "https://www.focus.de/magazin/archiv/jahrgang_"
magazine_year = 2010
issue = 0
number_of_rows = 0

# Iterate over the years (2011 to 2019)
for i in range(0, 9):
    magazine_year += 1
    magazine_year_fixed = str(magazine_year)

    # Iterate over the issues (1 to 53)
    for i in range(0, 53):
        issue += 1
        # Convert issue number to a zero-padded string
        issue_fixed = str(issue).zfill(2)
        # Construct the URL for the issue
        focus_issue_url = common + magazine_year_fixed + "/ausgabe_" + issue_fixed + "/"
        
        
        try:
            # Open the URL and create a BeautifulSoup object
            focus_issue = urlopen(focus_issue_url)
            focus_issue_soup = BeautifulSoup(focus_issue, 'html.parser')
                        
            # Extract the issue number from the URL
            issue_number = int(focus_issue_url.split('/')[-2].replace("ausgabe_", ''))
            # Extract the year from the URL
            year = int(focus_issue_url.split('/')[-3].replace("jahrgang_", ''))
            
            # Initialize lists to store article links and headings
            links_journal = []
            headings = []
            # Find all <a> tags in the HTML document (find_all is the current name of the legacy findAll alias)
            all_links_focus = focus_issue_soup.find_all("a")

            # Extract the link and heading of each article
            for link in all_links_focus:
                # Check if there is a link part in the tag
                if link.get("href") is not None:
                    href = link.get("href")
                    # Check if the link contains any valid section and include only those links that point to actual articles
                    if (any(section in href for section in ['/magazin/archiv', '/politik/', '/kultur/', '/finanzen/', '/wissen/', 
                                    '/sport/', '/gesundheit/', '/auto/', '/reisen/']) and 
                        'login/' not in href and 
                        'rss.focus.de' not in href and 
                        href != 'https://www.focus.de/magazin/archiv/' and 
                        href not in exclude_links and 
                        '.html' in href):
                        # Append the link to links_journal list
                        links_journal.append(href)
                        # Append the cleaned heading to headings list
                        remove = link.text.split(':')[0]
                        headings.append(link.text.replace(remove + ':', '').strip('\xa0').replace('\ufeff', '').replace(":\xa0", "").lstrip())
                    
            texts = []
            titles = []
            titles_clean = []
            dates = []            
            # Identifier of the article in the name
            id = 0                  
            
            # Iterate over each article link
            for heading in headings:
                id += 1
                # Convert id to a string for use in the text document name
                id_fixed = str(id)
                
                # Retrieve the title of the current article from the headings list
                title = headings[(id-1)]
                title = title.strip().replace("\n", '')
                # Normalize the title
                # Remove specific punctuation, replace hyphens with spaces, convert to lowercase, and strip leading/trailing spaces
                title_clean = ''.join(ch for ch in title.replace('-', ' ').lower() if (ch not in exclude) and (ch not in ['"', '„', '“', '»', '«'])).strip()
                # Standardize spaces to a single space
                title_clean = " ".join(title_clean.split())
            
                # Check if the normalized title of the current article at least partially matches 
                # any normalized title from the MediaTenor dataset
                if any(title_clean in s for s in sentiment_data['title_clean']):
                    # Open the article link
                    try:
                        article = urlopen(links_journal[(id-1)])
                        # Create a BeautifulSoup object
                        article = BeautifulSoup(article, 'html.parser')
                        # Extract the paragraphs of the main text of the article
                        paragraphs = list(article.find("div", {"class": "textBlock"}).children)
                        
                        # Proceed only if paragraphs are found
                        if paragraphs:
                            
                            # Extract the publication date of the article
                            date = article.find("div", {"class": "displayDate"}).get_text().split(' ')[1].replace(',', '')
                            
                            # Initialize the text with the annotation
                            text_new = ''
                            if article.find("div", {"class": "leadIn"})  is not None:
                                # Extract the annotation of the article
                                annotation = article.find("div", {"class": "leadIn"}).get_text()
                                
                                if annotation.strip()[-1] not in ['.', '!', ':', ';', '?', '"', "'", '...', '…']:
                                    text_new = text_new + annotation.strip() + '. '
                                else:
                                    text_new = text_new + annotation.strip()
                                    
                            # Extract the text from the paragraphs
                            for par in paragraphs:
                                if par.get_text().replace('\n', ''):
                                    if (par.get_text().strip()[-1] not in ['.', '!', ':', ';', '?', '"', "'", '...', '…']):
                                        text_new = text_new + ' ' + par.get_text(separator=' ').strip() + '.'
                                    else:
                                        text_new = text_new + ' ' + par.get_text(separator=' ').strip()

                            # Clean up the final text
                            text_new = text_new.strip().replace("\n", '')

                            if title[-1] not in ['.', '!', ':', ';','?', '"']:
                                text_new = title + '. ' + text_new
                            else:
                                text_new = title + ' ' + text_new
                                    
                        else:
                            text_new = ''
                            title = ''
                            title_clean = ''
                            date = ''

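                    # Skip the article on download errors (e.g., page not found) or if the page
                    # lacks the expected elements (find() returns None) or contains empty text
                    except (IOError, AttributeError, IndexError):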
                        text_new = ''
                        title = ''
                        title_clean = ''
                        date = ''
                    
                    # Add the extracted data to the lists if the title is valid and at least partially matches 
                    # any title in sentiment_data
                    if any(title_clean in s for s in sentiment_data['title_clean']) and title:
                        texts.append(text_new)
                        titles.append(title)
                        titles_clean.append(title_clean)
                        dates.append(date)

                        # Save the article as a .txt file
                        issue_string = str(issue_number)
                        year_string = str(year)
                        file_path = os.path.join(text_save_dir, f"{year_string} {issue_string} {id_fixed}.txt")
                        with codecs.open(file_path, "w", "utf-8-sig") as temp:
                            temp.write(text_new) 
                
            # Create a DataFrame from the extracted data               
            data = pd.DataFrame({
                'dates': dates,
                'text': texts,
                'titles': titles,
                'titles_clean': titles_clean
            })                

            # Append the data from the current issue to the Excel sheet, below the rows written so far
            data.to_excel(data_excel, sheet_name='Sheet1', header=False, startrow=number_of_rows)
            # Update the row counter
            number_of_rows = data.shape[0] + number_of_rows

        except IOError:
            pass
        
    # Reset the issue counter for the next year    
    issue = 0
    
# Save and close the Excel file; close() writes the workbook to disk,
# so a separate save() call is unnecessary (and deprecated in recent pandas versions)
data_excel.close()
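
The manual year and issue counters in the loop above can also be expressed as a small generator that yields every potential issue URL. This is a sketch under the same assumptions as the loop (years 2011 to 2019, up to 53 issues per year, zero-padded issue numbers). The name `issue_urls` is hypothetical.

```python
def issue_urls(first_year=2011, last_year=2019, max_issue=53):
    """Yield (year, issue, url) for every potential Focus archive issue page."""
    base = "https://www.focus.de/magazin/archiv/jahrgang_"
    for year in range(first_year, last_year + 1):
        for issue in range(1, max_issue + 1):
            # Issue numbers are zero-padded to two digits, e.g. ausgabe_01
            yield year, issue, f"{base}{year}/ausgabe_{str(issue).zfill(2)}/"


all_issues = list(issue_urls())
print(len(all_issues))
print(all_issues[0][2])
```

Not every year has a 53rd issue, so requests for non-existent issue pages still need to be caught, as the `except IOError` clause in the loop above does.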
