## Thesis Data Collection & Cleaning

In [72]:
# Install the pandas library for data manipulation and analysis.
! pip install pandas

# Install the openpyxl library to read and write Excel 2010 xlsx/xlsm/xltx/xltm files.
! pip install openpyxl

# Install the requests library for making HTTP requests and beautifulsoup4 for parsing HTML and XML documents.
! pip install requests beautifulsoup4

# Install the selenium library for automating web browser interaction from Python.
! pip install selenium

# Install the webdriver_manager library, which automatically manages drivers for browsers for Selenium.
! pip install webdriver_manager

# Upgrade the selenium library to the latest version, ensuring access to the latest features and bug fixes.
! pip install -U selenium




In [151]:
# Import os module to interact with the underlying operating system; it can handle file and directory operations.
import os

# Import pandas and give it the alias 'pd'. This library is crucial for data manipulation and analysis, providing DataFrame structures.
import pandas as pd

# Import datetime from the datetime module, which provides classes for manipulating dates and times.
from datetime import datetime

# Import openpyxl to enable Python to read and write Excel 2010 xlsx/xlsm files, useful for handling Excel files.
import openpyxl

# Import BeautifulSoup from bs4, a Python library for pulling data out of HTML and XML files.
from bs4 import BeautifulSoup

# Import webdriver from selenium, which lets Python drive a real browser programmatically (used here for scraping).
from selenium import webdriver

# Import By from selenium.webdriver.common, which is used to define methods of locating elements within a page (like by ID, Xpath).
from selenium.webdriver.common.by import By

# Import Service from selenium.webdriver.firefox.service which allows the use of Firefox in Selenium through a background service.
from selenium.webdriver.firefox.service import Service

# Import GeckoDriverManager from webdriver_manager.firefox to handle the automatic management of GeckoDriver, which is required by Firefox.
from webdriver_manager.firefox import GeckoDriverManager

# Import WebDriverWait to allow Selenium to wait for certain conditions (like elements becoming available) before proceeding.
from selenium.webdriver.support.ui import WebDriverWait

# Import expected_conditions as EC, which provides a set of predefined conditions to use with WebDriverWait.
from selenium.webdriver.support import expected_conditions as EC

# Import the same Service class again under the alias FirefoxService; later cells use both names.
from selenium.webdriver.firefox.service import Service as FirefoxService

# Import NoSuchElementException and TimeoutException to catch these common exceptions when elements are not found or actions time out.
from selenium.common.exceptions import NoSuchElementException, TimeoutException

# Import nltk, a suite of libraries and programs for symbolic and statistical natural language processing.
import nltk

# Import word_tokenize from nltk.tokenize to split strings into words (tokens).
from nltk.tokenize import word_tokenize

# Import stopwords from nltk.corpus to access a list of stopwords, commonly used words that may be excluded from processing.
from nltk.corpus import stopwords

# Import WordNetLemmatizer from nltk.stem to reduce words to their base or root form (lemmatizing).
from nltk.stem import WordNetLemmatizer

# Import matplotlib.pyplot and give it the alias 'plt' for plotting.
import matplotlib.pyplot as plt

The loading cell (`In [150]`) reads each Excel export into its own pandas DataFrame with `pd.read_excel()`. Each DataFrame is named after its newspaper; a numeric suffix distinguishes multiple export files for the same newspaper.

1. **Loading DataFrames**:
   - `Trouw_df` loads from `"Trouw.xlsx"`: data from Trouw.
   - `Telegraaf_df` loads from `"De Telegraaf.xlsx"`: data from De Telegraaf.
   - `AD_df` loads from `"Algemeen Dagblad.xlsx"`: data from Algemeen Dagblad.
   - `Volkskrant1_df` and `Volkskrant2_df` load from `"De Volkskrant 1.xlsx"` and `"De Volkskrant 2.xlsx"`: two export files for De Volkskrant.
   - `NRC1_df` through `NRC4_df` load from `"NRC 1.xlsx"` to `"NRC 4.xlsx"`: four export files for NRC Handelsblad.
   - `Financieele1_df` through `Financieele5_df` load from `"Financieele 1.xlsx"` to `"Financieele 5.xlsx"`: five export files for Het Financieele Dagblad.

2. **Purpose**:
   - Keeping each file in its own DataFrame makes it possible to count and check every export separately before merging, and to compare the newspapers with one another afterwards.
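The one-call-per-file pattern described above can also be written as a loop. The following is a minimal sketch, not the notebook's own code: the helper names `excel_paths` and `load_frames` and the `folder` argument are assumptions introduced for illustration, while the file stems are the ones listed above.

```python
from pathlib import Path

# File stems as listed above; each corresponds to "<stem>.xlsx".
FILE_STEMS = [
    "Trouw", "De Telegraaf", "Algemeen Dagblad",
    "De Volkskrant 1", "De Volkskrant 2",
    "NRC 1", "NRC 2", "NRC 3", "NRC 4",
    "Financieele 1", "Financieele 2", "Financieele 3",
    "Financieele 4", "Financieele 5",
]


def excel_paths(folder, stems=FILE_STEMS):
    """Map each file stem to the path of its Excel export (hypothetical helper)."""
    return {stem: Path(folder) / f"{stem}.xlsx" for stem in stems}


def load_frames(folder):
    """Read every export into a DataFrame, keyed by file stem (hypothetical helper)."""
    import pandas as pd  # imported here so excel_paths can be used without pandas

    return {stem: pd.read_excel(path) for stem, path in excel_paths(folder).items()}


if __name__ == "__main__":
    # Only list the paths; load_frames(".") would read the files themselves.
    for stem, path in excel_paths(".").items():
        print(stem, "->", path)
```

Keeping the frames in a dictionary turns later steps, such as counting rows, into a loop over `frames.items()` instead of fourteen separate variables.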

In [149]:
# Ensure the NLTK packages are downloaded
nltk.download('punkt')  # Tokenizer model
nltk.download('stopwords')  # Stopwords list
nltk.download('wordnet')  # Lexical database for lemmatization
nltk.download('omw-1.4')  # Open Multilingual Wordnet, needed for lemmatization in multiple languages

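# `re` is needed for the digit removal below
import re

# Build the stopword set and the lemmatizer once, not on every call
# (NLTK names the corpus file 'dutch'; the capitalised 'Dutch' may fail on case-sensitive file systems)
stop_words = set(stopwords.words('dutch'))
stop_words.update(['trouw', 'volkskrant', 'financieele', 'algemeen', 'dagblad', 'nrc', 'telegraaf'])
# WordNetLemmatizer is based on the English WordNet, so most Dutch tokens are returned unchanged
lemmatizer = WordNetLemmatizer()

# Define a preprocess_text function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Tokenize text
    words = word_tokenize(text)
    # Remove digits from each token
    words = [re.sub(r'\d+', '', word) for word in words]
    # Remove punctuation, special characters, and tokens left empty by the digit removal
    words = [word for word in words if word.isalnum()]
    # Remove stopwords, including the newspaper names added above
    words = [word for word in words if word not in stop_words]
    # Lemmatize words
    words = [lemmatizer.lemmatize(word) for word in words]
    # Join words back to text
    return ' '.join(words)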

# Assuming 'updated_merged_data' is a DataFrame loaded with your data
# Create the Combined column and preprocess it
updated_merged_data["Combined"] = updated_merged_data["Headline"].fillna("") + " " + updated_merged_data["Body"].fillna("")
updated_merged_data["Combined"] = updated_merged_data["Combined"].apply(preprocess_text)

# Ensure there are no NaN values in the Combined column
updated_merged_data.dropna(subset=["Combined"], inplace=True)
updated_merged_data.reset_index(drop=True, inplace=True)

# Display the first few rows of the processed data
print(updated_merged_data.head())


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/helgegeurtjacobusmoes/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/helgegeurtjacobusmoes/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/helgegeurtjacobusmoes/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/helgegeurtjacobusmoes/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


                                            Headline   
0  Nee, kunstmatige intelligentie gaat ons niet u...  \
1  Wereldleiders zoeken grip op kunstmatige intel...   
2       Kunstmatige intelligentie is best bedreigend   
3  Mensen zijn een stuk efficiënter dan kunstmati...   
4  Bedreigt kunstmatige intelligentie ons godsbeeld?   

                                         Publication   
0           Trouw, Verdieping; Blz. 4, 5, 2044 words  \
1                  Trouw, Vandaag; Blz. 6, 528 words   
2                Trouw, Tijdgeest; Blz. 8, 576 words   
3                  Trouw, Vandaag; Blz. 3, 741 words   
4  Trouw, Religie en Filosofie; Blz. 8, 9, 1367 w...   

                                                 URL News Outlet   
0  https://advance.lexis.com/api/document?collect...       Trouw  \
1  https://advance.lexis.com/api/document?collect...       Trouw   
2  https://advance.lexis.com/api/document?collect...       Trouw   
3  https://advance.lexis.com/api/document?collect...  

In [150]:
# Load the files into DataFrames
Trouw_df = pd.read_excel("Trouw.xlsx")
Telegraaf_df = pd.read_excel("De Telegraaf.xlsx")
AD_df = pd.read_excel("Algemeen Dagblad.xlsx")
Volkskrant1_df = pd.read_excel("De Volkskrant 1.xlsx")
Volkskrant2_df = pd.read_excel("De Volkskrant 2.xlsx")
NRC1_df = pd.read_excel("NRC 1.xlsx")
NRC2_df = pd.read_excel("NRC 2.xlsx")
NRC3_df = pd.read_excel("NRC 3.xlsx")
NRC4_df = pd.read_excel("NRC 4.xlsx")
Financieele1_df = pd.read_excel("Financieele 1.xlsx")
Financieele2_df = pd.read_excel("Financieele 2.xlsx")
Financieele3_df = pd.read_excel("Financieele 3.xlsx")
Financieele4_df = pd.read_excel("Financieele 4.xlsx")
Financieele5_df = pd.read_excel("Financieele 5.xlsx")

The next cell (`In [4]`) counts the articles (rows) in each DataFrame and adds the counts up:

1. **Counting Articles**: `len()` returns the number of rows in each DataFrame:
   - `Trouw_df`
   - `Telegraaf_df`
   - `AD_df` (Algemeen Dagblad)
   - `Volkskrant1_df`, `Volkskrant2_df`
   - `NRC1_df`, `NRC2_df`, `NRC3_df`, `NRC4_df`
   - `Financieele1_df` through `Financieele5_df` (Het Financieele Dagblad)

2. **Printing Counts**: Each count is printed with a label naming the newspaper and, where applicable, the file number.

3. **Calculate Total Articles**: The individual counts are summed to give the total number of rows across all DataFrames.

4. **Print Total Articles**: The total is printed last.

In [4]:
# Count the number of articles in each DataFrame
trouw_count = len(Trouw_df)
telegraaf_count = len(Telegraaf_df)
ad_count = len(AD_df)
volkskrant1_count = len(Volkskrant1_df)
volkskrant2_count = len(Volkskrant2_df)
nrc1_count = len(NRC1_df)
nrc2_count = len(NRC2_df)
nrc3_count = len(NRC3_df)
nrc4_count = len(NRC4_df)
financieele1_count = len(Financieele1_df)
financieele2_count = len(Financieele2_df)
financieele3_count = len(Financieele3_df)
financieele4_count = len(Financieele4_df)
financieele5_count = len(Financieele5_df)

# Print the number of rows in each DataFrame
print("Length of Trouw DataFrame:", trouw_count)
print("Length of De Telegraaf DataFrame:", telegraaf_count)
print("Length of Algemeen Dagblad DataFrame:", ad_count)
print("Length of De Volkskrant 1 DataFrame:", volkskrant1_count)
print("Length of De Volkskrant 2 DataFrame:", volkskrant2_count)
print("Length of NRC 1 DataFrame:", nrc1_count)
print("Length of NRC 2 DataFrame:", nrc2_count)
print("Length of NRC 3 DataFrame:", nrc3_count)
print("Length of NRC 4 DataFrame:", nrc4_count)
print("Length of Financieele1_df 1 DataFrame:", financieele1_count)
print("Length of Financieele2_df 2 DataFrame:", financieele2_count)
print("Length of Financieele3_df 3 DataFrame:", financieele3_count)
print("Length of Financieele4_df 4 DataFrame:", financieele4_count)
print("Length of Financieele5_df 5 DataFrame:", financieele5_count)

# Calculate the total number of articles
total_articles = (trouw_count + telegraaf_count + ad_count +
                  volkskrant1_count + volkskrant2_count +
                  nrc1_count + nrc2_count + nrc3_count + nrc4_count +
                  financieele1_count + financieele2_count +
                  financieele3_count + financieele4_count + financieele5_count)

# Print the total number of articles
print("Total number of articles:", total_articles)

Length of Trouw DataFrame: 689
Length of De Telegraaf DataFrame: 809
Length of Algemeen Dagblad DataFrame: 482
Length of De Volkskrant 1 DataFrame: 1000
Length of De Volkskrant 2 DataFrame: 69
Length of NRC 1 DataFrame: 1000
Length of NRC 2 DataFrame: 150
Length of NRC 3 DataFrame: 100
Length of NRC 4 DataFrame: 83
Length of Financieele1_df 1 DataFrame: 1000
Length of Financieele2_df 2 DataFrame: 300
Length of Financieele3_df 3 DataFrame: 300
Length of Financieele4_df 4 DataFrame: 300
Length of Financieele5_df 5 DataFrame: 169
Total number of articles: 6451
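
The per-file counts printed above can be rolled up per newspaper. This is a small sketch rather than part of the original analysis: the counts are copied from the output above, and the helper `outlet_of` assumes that a trailing number in a label is only a file index.

```python
import re
from collections import defaultdict

# Row counts per file, copied from the output above (raw rows, before any cleaning).
file_counts = {
    "Trouw": 689, "De Telegraaf": 809, "Algemeen Dagblad": 482,
    "De Volkskrant 1": 1000, "De Volkskrant 2": 69,
    "NRC 1": 1000, "NRC 2": 150, "NRC 3": 100, "NRC 4": 83,
    "Financieele 1": 1000, "Financieele 2": 300, "Financieele 3": 300,
    "Financieele 4": 300, "Financieele 5": 169,
}


def outlet_of(file_label):
    """Strip a trailing file number, e.g. 'NRC 3' -> 'NRC' (hypothetical helper)."""
    return re.sub(r"\s+\d+$", "", file_label)


# Sum the counts of all files that belong to the same newspaper.
outlet_counts = defaultdict(int)
for label, count in file_counts.items():
    outlet_counts[outlet_of(label)] += count

total = sum(outlet_counts.values())
for outlet, count in outlet_counts.items():
    print(f"{outlet}: {count} rows ({count / total:.1%})")
print("Total:", total)
```

Because these are raw row counts, the per-newspaper sums are slightly higher than the cleaned counts used in the percentage cell (`In [186]`).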


In [186]:
# Define the number of articles for each news outlet
articles = {
    "Algemeen Dagblad": 481,
    "Trouw": 688,
    "De Telegraaf": 808,
    "De Volkskrant": 1067,
    "NRC": 1332,
    "Het Financieele Dagblad": 2069,
}

# Total number of articles, computed from the counts above so the two cannot drift apart (6445)
total_articles = sum(articles.values())

# Calculate and print the percentage for each news outlet
percentages = {outlet: (count / total_articles) * 100 for outlet, count in articles.items()}

for outlet, percentage in percentages.items():
    print(f"{outlet}: {percentage:.2f}%")


Algemeen Dagblad: 7.46%
Trouw: 10.67%
De Telegraaf: 12.54%
De Volkskrant: 16.56%
NRC: 20.67%
Het Financieele Dagblad: 32.10%


In [76]:
# Collect the loaded DataFrames (note: Financieele5_df is not in this list, so its rows are not part of this check)
dataframes = [Trouw_df, Telegraaf_df, AD_df, Volkskrant1_df, Volkskrant2_df,
              NRC1_df, NRC2_df, NRC3_df, NRC4_df, Financieele1_df, Financieele2_df,
              Financieele3_df, Financieele4_df]

# Concatenate all DataFrames into a single DataFrame
all_articles = pd.concat(dataframes, ignore_index=True)

# Count total duplicates (excluding the first occurrence)
duplicate_counts = all_articles.duplicated(keep='first').sum()

# Optionally, you can also view the duplicate rows
duplicates = all_articles[all_articles.duplicated(keep=False)]

# Print the number of duplicates
print("Number of duplicate articles:", duplicate_counts)

Number of duplicate articles: 6


In [6]:
def load_data_with_urls(file_path):
    # Load the workbook
    wb = openpyxl.load_workbook(file_path, data_only=True)
    sheet = wb.active

    # List to store data along with URLs
    data = []
    for row in sheet.iter_rows(min_row=2, max_row=sheet.max_row, min_col=1, max_col=2):  # Assuming headlines and publication info are in the first two columns
        headline_cell, publication_cell = row
        headline = headline_cell.value or ""  # Use empty string if headline is None
        publication = publication_cell.value or ""  # Use empty string if publication is None
        url = headline_cell.hyperlink.target if headline_cell.hyperlink else None
        data.append({
            'Headline': headline,
            'Publication': publication,
            'URL': url
        })

    # Convert list to DataFrame
    return pd.DataFrame(data)

# Load data from each Excel file using list comprehension
files = ["Trouw", "De Telegraaf", "Algemeen Dagblad", "De Volkskrant 1", "De Volkskrant 2", "NRC 1", "NRC 2", "NRC 3", "NRC 4", "Financieele 1", "Financieele 2", "Financieele 3", "Financieele 4", "Financieele 5"]
data_frames = [load_data_with_urls(f"{file}.xlsx") for file in files]

# Concatenate the dataframes
merged_df = pd.concat(data_frames, ignore_index=True)

# Check if 'Summary' column exists and drop it
if 'Summary' in merged_df.columns:
    merged_df.drop(columns='Summary', inplace=True)

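# Replace missing 'Publication' values before applying the regex
# (assign the result back; calling fillna with inplace=True on a selected column is deprecated in recent pandas)
merged_df['Publication'] = merged_df['Publication'].fillna("Unknown Publication")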

# Apply regex pattern to extract 'News Outlet', 'Type of News', and 'Word Count'
regex_pattern = r'^(.*?), (.*?)(?:; Blz\. (?:NaN|\d+))?, (\d+) words$'
merged_df[['News Outlet', 'Type of News', 'Word Count']] = merged_df['Publication'].str.extract(regex_pattern, expand=True)

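# Clean up 'Type of News': drop leftover page references, digits, and commas, then trim whitespace and a trailing semicolon
merged_df['Type of News'] = (
    merged_df['Type of News']
    .str.replace(r'Blz\. \d+', '', regex=True)
    .str.replace(r'[,0-9]+', '', regex=True)
    .str.strip()
    .str.rstrip(';')
)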

# Remove duplicates and drop rows with NaN values in critical columns
merged_df.drop_duplicates(inplace=True)
merged_df.dropna(subset=['News Outlet', 'Type of News', 'Word Count'], inplace=True)

# Display DataFrame
print(merged_df.head())

                                            Headline   
0  Nee, kunstmatige intelligentie gaat ons niet u...  \
1  Wereldleiders zoeken grip op kunstmatige intel...   
2       Kunstmatige intelligentie is best bedreigend   
3  Mensen zijn een stuk efficiënter dan kunstmati...   
4  Bedreigt kunstmatige intelligentie ons godsbeeld?   

                                         Publication   
0           Trouw, Verdieping; Blz. 4, 5, 2044 words  \
1                  Trouw, Vandaag; Blz. 6, 528 words   
2                Trouw, Tijdgeest; Blz. 8, 576 words   
3                  Trouw, Vandaag; Blz. 3, 741 words   
4  Trouw, Religie en Filosofie; Blz. 8, 9, 1367 w...   

                                                 URL News Outlet   
0  https://advance.lexis.com/api/document?collect...       Trouw  \
1  https://advance.lexis.com/api/document?collect...       Trouw   
2  https://advance.lexis.com/api/document?collect...       Trouw   
3  https://advance.lexis.com/api/document?collect...  
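
To see what the extraction pattern and the cleanup do to a single `Publication` string, the same steps can be run with the standard `re` module. This is an illustrative sketch: the function name `parse_publication` is an assumption, the sample strings are taken from the output above, and the conversion of the word count to `int` is an addition that the cell above does not perform.

```python
import re

# Same pattern as in the cell above: outlet, section, optional page, word count.
PUBLICATION_RE = re.compile(r'^(.*?), (.*?)(?:; Blz\. (?:NaN|\d+))?, (\d+) words$')


def parse_publication(publication):
    """Return (outlet, section, word_count), or None if the string does not match."""
    match = PUBLICATION_RE.match(publication)
    if match is None:
        return None
    outlet, section, words = match.groups()
    # Multi-page references ("Blz. 4, 5") end up in the section group,
    # so the cleanup removes the page reference, digits, and commas.
    section = re.sub(r'Blz\. \d+', '', section)
    section = re.sub(r'[,0-9]+', '', section)
    section = section.strip().rstrip(';')
    return outlet, section, int(words)


if __name__ == "__main__":
    for sample in [
        "Trouw, Vandaag; Blz. 6, 528 words",
        "Trouw, Verdieping; Blz. 4, 5, 2044 words",
        "Unknown Publication",
    ]:
        print(repr(sample), "->", parse_publication(sample))
```

The second sample shows why the cleanup step is needed: with more than one page number, the optional page group cannot absorb the whole reference, and the remainder is captured as part of the section.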

In [7]:
# Check for NaN values to see if the issue persists
nans_after_extraction = merged_df.isnull().sum()
print("NaN counts after extraction:", nans_after_extraction)

# Check for NaNs in any of the extracted columns
nan_rows = merged_df[merged_df[['News Outlet', 'Type of News', 'Word Count']].isna().any(axis=1)]
print("Rows with NaNs after regex extraction:")
print(nan_rows['Publication'])


NaN counts after extraction: Headline        0
Publication     0
URL             0
News Outlet     0
Type of News    0
Word Count      0
dtype: int64
Rows with NaNs after regex extraction:
Series([], Name: Publication, dtype: object)


In [8]:
# Map a label for each export file to its DataFrame
dataframes = {
    'Trouw': Trouw_df,
    'De Telegraaf': Telegraaf_df,
    'Algemeen Dagblad': AD_df,
    'De Volkskrant 1': Volkskrant1_df,
    'De Volkskrant 2': Volkskrant2_df,
    'NRC 1': NRC1_df,
    'NRC 2': NRC2_df,
    'NRC 3': NRC3_df,
    'NRC 4': NRC4_df,
    'Financieele Dagblad 1': Financieele1_df,
    'Financieele Dagblad 2': Financieele2_df,
    'Financieele Dagblad 3': Financieele3_df,
    'Financieele Dagblad 4': Financieele4_df,
    'Financieele Dagblad 5': Financieele5_df
}

# Label each DataFrame with its source file (this adds a 'Source' column to the original DataFrames in place)
for outlet, df in dataframes.items():
    df['Source'] = outlet

# Concatenate all DataFrames into a single DataFrame
all_articles = pd.concat(dataframes.values(), ignore_index=True)

# Record the initial counts of articles per source
initial_counts = all_articles['Source'].value_counts()

# Remove duplicates, keeping the first occurrence
all_articles.drop_duplicates(subset=all_articles.columns.difference(['Source']), keep='first', inplace=True)

# Record the final counts of articles per source after duplicates are removed
final_counts = all_articles['Source'].value_counts()

# Calculate the number of duplicates removed per news outlet
duplicates_removed = initial_counts - final_counts

# Print the number of duplicates removed per news outlet
print("Duplicates removed per news outlet:\n", duplicates_removed)

Duplicates removed per news outlet:
 Source
Algemeen Dagblad         0
De Telegraaf             1
De Volkskrant 1          0
De Volkskrant 2          0
Financieele Dagblad 1    0
Financieele Dagblad 2    0
Financieele Dagblad 3    0
Financieele Dagblad 4    1
Financieele Dagblad 5    0
NRC 1                    4
NRC 2                    1
NRC 3                    0
NRC 4                    0
Trouw                    0
Name: count, dtype: int64
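
The per-source duplicate count can be checked on a toy example. The sketch below uses made-up frames (`part_a`, `part_b`) and is not the notebook's code; it flags repeats with `duplicated` instead of comparing counts before and after `drop_duplicates`, and it labels the frames with `assign` so that the input frames are left unchanged.

```python
import pandas as pd

# Two hypothetical export files; the row ("A", "x") appears in both.
part_a = pd.DataFrame({"Headline": ["A", "B", "B"], "Publication": ["x", "y", "y"]})
part_b = pd.DataFrame({"Headline": ["A", "C"], "Publication": ["x", "z"]})
frames = {"Part 1": part_a, "Part 2": part_b}

# assign() returns a labelled copy, so part_a and part_b get no 'Source' column.
labelled = [df.assign(Source=name) for name, df in frames.items()]
all_rows = pd.concat(labelled, ignore_index=True)

# Compare rows on every column except the label.
content_cols = list(all_rows.columns.difference(["Source"]))
is_repeat = all_rows.duplicated(subset=content_cols, keep="first")

# Count, per source, how many rows repeat an earlier row.
removed_per_source = is_repeat.groupby(all_rows["Source"]).sum()
print(removed_per_source)
```

A repeat is attributed to the file in which the later copy appears. Leaving the input frames unlabelled also matters for any later whole-row duplicate check: once a `Source` column is present, identical articles from different files no longer compare as equal rows.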


In [9]:
# Save the DataFrame to an Excel file
merged_df.to_excel("Merged_Data.xlsx", index=False)

In [10]:
# Print the DataFrame
print(merged_df)

                                               Headline   
0     Nee, kunstmatige intelligentie gaat ons niet u...  \
1     Wereldleiders zoeken grip op kunstmatige intel...   
2          Kunstmatige intelligentie is best bedreigend   
3     Mensen zijn een stuk efficiënter dan kunstmati...   
4     Bedreigt kunstmatige intelligentie ons godsbeeld?   
...                                                 ...   
6546                                 De rauwe realiteit   
6547                            No Headline In Original   
6548               Groeten uit het hart van de hightech   
6549              De complete lijst Jonge Talenten 2019   
6550                            No Headline In Original   

                                            Publication   
0              Trouw, Verdieping; Blz. 4, 5, 2044 words  \
1                     Trouw, Vandaag; Blz. 6, 528 words   
2                   Trouw, Tijdgeest; Blz. 8, 576 words   
3                     Trouw, Vandaag; Blz. 3, 741 words

## Body & Publication Date

In [12]:
def find_geckodriver(start_path):
    # Define the name of the geckodriver file you're looking for
    geckodriver_filename = "geckodriver"  # Use "geckodriver.exe" on Windows

    # Walk through the directory
    for root, dirs, files in os.walk(start_path):
        if geckodriver_filename in files:
            return os.path.join(root, geckodriver_filename)
    return None  # Return None if not found

# Start the search in the thesis data folder on the current user's Desktop (expanduser resolves '~')
start_path = os.path.expanduser('~/Desktop/thesis data')

# Find geckodriver
geckodriver_path = find_geckodriver(start_path)

if geckodriver_path:
    print("geckodriver found at:", geckodriver_path)
else:
    print("geckodriver not found.")

geckodriver found at: /Users/helgegeurtjacobusmoes/Desktop/thesis data/geckodriver
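
The same search can be written with `pathlib`, with a fallback to the system `PATH`. This is an alternative sketch, not the code used above; the function name `find_driver` and the fallback are assumptions added for illustration.

```python
import shutil
from pathlib import Path


def find_driver(start_path, name="geckodriver"):
    """Return the first file called `name` below start_path, else the one on PATH."""
    start = Path(start_path).expanduser()
    if start.is_dir():
        # rglob walks the folder recursively, like os.walk in the cell above.
        for candidate in start.rglob(name):
            if candidate.is_file():
                return str(candidate)
    # shutil.which returns None when the driver is not on PATH either.
    return shutil.which(name)


if __name__ == "__main__":
    print(find_driver("~/Desktop/thesis data"))
```

On Windows the file is called `geckodriver.exe`, so `name` would need to be changed accordingly.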


In [13]:
# Load the Excel file into a DataFrame
excel_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Merged_Data.xlsx'
data = pd.read_excel(excel_path)

# Assuming URLs are in a column named 'URL', adjust if it's named differently
print(data['URL'].head())  # Display first few URLs to confirm correct loading

0    https://advance.lexis.com/api/document?collect...
1    https://advance.lexis.com/api/document?collect...
2    https://advance.lexis.com/api/document?collect...
3    https://advance.lexis.com/api/document?collect...
4    https://advance.lexis.com/api/document?collect...
Name: URL, dtype: object


## Full Data Scraping Code

The first version of the scraper, shown directly below, was not used as written, because Nexis would not serve the full number of articles in a single run. The later cells therefore adjust it so that each run starts at the index where the previous retrieval stopped.
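
The restart idea can be shown without Selenium. The sketch below is a simplified illustration, not the notebook's scraper: `first_unprocessed`, `scrape_resumable`, and the `fetch` callable are hypothetical names, and it assumes that every row has a URL and that a missing body is stored as `None`.

```python
def first_unprocessed(bodies):
    """Index of the first entry without a body (len(bodies) if all are done)."""
    for index, body in enumerate(bodies):
        if body is None:
            return index
    return len(bodies)


def scrape_resumable(urls, bodies, fetch):
    """Fill `bodies` in place from the first gap; stop at the first failure.

    `fetch` is any callable that maps a URL to text. Returns the index at
    which the run stopped, or len(urls) when everything was processed.
    """
    for index in range(first_unprocessed(bodies), len(urls)):
        try:
            bodies[index] = fetch(urls[index])
        except Exception as exc:  # broad on purpose: any failure ends this run
            print(f"Stopped at index {index}: {exc}")
            return index
    return len(urls)


if __name__ == "__main__":
    urls = [f"u{i}" for i in range(5)]
    bodies = [None] * len(urls)
    calls = {"n": 0}

    def flaky_fetch(url):
        # Simulate a session that fails on the third request.
        calls["n"] += 1
        if calls["n"] == 3:
            raise TimeoutError("session expired")
        return f"text of {url}"

    # Rerun until every URL has been processed.
    while scrape_resumable(urls, bodies, flaky_fetch) < len(urls):
        pass
    print(bodies)
```

Deriving the start index from the data avoids editing a hard-coded `start_index` before every rerun, at the cost of needing the partial results in memory or reloaded from the last saved file.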


def setup_driver():
    # Specify the path to GeckoDriver
    geckodriver_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/geckodriver'
    service = Service(executable_path=geckodriver_path)
    driver = webdriver.Firefox(service=service)
    return driver

def fetch_article_details(driver, url):
    driver.get(url)
    body, publication_date = None, None
    try:
        # Wait for the publication date element to be present
        publication_date_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p.SS_DocumentInfo:last-of-type"))
        )
        publication_date = publication_date_element.text if publication_date_element else "Publication date not found"

        # Get all paragraphs after the "Body" header within the article section
        body_elements = driver.find_elements(By.XPATH, "//h2[@id='JUMPTO_Body']/following-sibling::p")
        body_text_list = [element.text for element in body_elements if element.text.strip() != '']

        # Join paragraphs until the "Bekijk de oorspronkelijke pagina" ("View the original page") line is reached
        for paragraph in body_text_list:
            if "Bekijk de oorspronkelijke pagina" in paragraph:
                break
            if body:
                body += "\n" + paragraph
            else:
                body = paragraph

    except (NoSuchElementException, TimeoutException) as e:
        print(f"Error fetching details for URL {url}: {e}")

    return body, publication_date

def main():
    driver = setup_driver()

    # Load your data here
    data = pd.read_excel('/Users/helgegeurtjacobusmoes/Desktop/thesis data/Merged_Data.xlsx')

    # Add new columns for body and publication date
    data['Body'] = None
    data['Publication Date'] = None

    # Process all entries in the dataframe
    for index, row in data.iterrows():
        if pd.notna(row['URL']):
            body, publication_date = fetch_article_details(driver, row['URL'])
            data.at[index, 'Body'] = body
            data.at[index, 'Publication Date'] = publication_date
            print(f"Processed index {index}: {row['URL']}")
        else:
            print(f"URL missing for index {index}")

    driver.quit()

    # Check the dataframe's head to confirm the columns are present
    print(data.head())
    
    # Save the updated dataframe to a new Excel file
    updated_file_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Updated_Merged_Data.xlsx'
    data.to_excel(updated_file_path, index=False)

    print(f"Updated data has been saved to: {updated_file_path}")

if __name__ == "__main__":
    main()
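
The paragraph-joining step inside `fetch_article_details` does not depend on Selenium and can be tested on plain strings. A minimal sketch, assuming the paragraph texts have already been extracted; the name `assemble_body` is hypothetical, and unlike the first version above it returns an empty string rather than `None` when nothing is kept.

```python
from itertools import takewhile

# Phrase that marks the end of the article text in the code above.
STOP_PHRASE = "Bekijk de oorspronkelijke pagina"


def assemble_body(paragraphs, stop_phrase=STOP_PHRASE):
    """Join non-empty paragraphs with newlines, stopping before the stop phrase."""
    non_empty = (p for p in paragraphs if p.strip())
    kept = takewhile(lambda p: stop_phrase not in p, non_empty)
    return "\n".join(kept)


if __name__ == "__main__":
    sample = [
        "Eerste alinea.",
        "   ",
        "Tweede alinea.",
        "Bekijk de oorspronkelijke pagina: ...",
        "Tekst na de link.",
    ]
    print(assemble_body(sample))
```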

In [45]:
def setup_driver():
    # Specify the path to GeckoDriver
    geckodriver_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/geckodriver'
    service = FirefoxService(executable_path=geckodriver_path)
    driver = webdriver.Firefox(service=service)
    return driver

def fetch_article_details(driver, url):
    driver.get(url)
    body, publication_date = None, None
    try:
        # Wait for the publication date element to be present
        publication_date_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p.SS_DocumentInfo:last-of-type"))
        )
        publication_date = publication_date_element.text if publication_date_element else "Publication date not found"

        # Get all paragraphs after the "Body" header within the article section
        body_elements = driver.find_elements(By.XPATH, "//h2[@id='JUMPTO_Body']/following-sibling::p")
        body_text_list = [element.text for element in body_elements if element.text.strip() != '']

        # Join paragraphs until the "Bekijk de oorspronkelijke pagina" ("View the original page") line is reached
        body = ""
        for paragraph in body_text_list:
            if "Bekijk de oorspronkelijke pagina" in paragraph:
                break
            if body:
                body += "\n" + paragraph
            else:
                body = paragraph

    except (NoSuchElementException, TimeoutException) as e:
        print(f"Error fetching details for URL {url}: {e}")
        raise  # Re-raise the exception to handle it in the calling function

    return body, publication_date

def main():
    driver = setup_driver()
    data = pd.read_excel('/Users/helgegeurtjacobusmoes/Desktop/thesis data/Merged_Data.xlsx')
    data['Body'] = None
    data['Publication Date'] = None

    updated_file_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Updated_Merged_Data_1.xlsx'

    for index, row in data.iterrows():
        if pd.notna(row['URL']):
            try:
                body, publication_date = fetch_article_details(driver, row['URL'])
                data.at[index, 'Body'] = body
                data.at[index, 'Publication Date'] = publication_date
                # Save the data immediately after fetching each article
                data.to_excel(updated_file_path, index=False)
                print(f"Processed index {index}: {row['URL']}")
            except Exception as e:
                print(f"Failed to process index {index}: {row['URL']}. Error: {str(e)}")
                # Save the data when an error occurs
                data.to_excel(updated_file_path, index=False)
                break  # Stop at the first failure so the next run can resume from this index
        else:
            print(f"URL missing for index {index}")

    driver.quit()
    print("Data collection completed. Check the updated file.")

if __name__ == "__main__":
    main()

Processed index 0: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:69TB-PJD1-JC8X-6012-00000-00&context=1516831&sourcegroupingtype=G
Processed index 1: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:69J3-PBV1-DYRY-X54F-00000-00&context=1516831&sourcegroupingtype=G
Processed index 2: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:6874-P5H1-DYRY-X0WN-00000-00&context=1516831&sourcegroupingtype=G
Processed index 3: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:69F9-XC31-JC8X-602R-00000-00&context=1516831&sourcegroupingtype=G
Processed index 4: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:673H-PG01-DYRY-X0T0-00000-00&context=1516831&sourcegroupingtype=G
Processed index 5: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:692W-5T81-DYRY-X00S-00000-00&context=1516831&sourcegroupingtype=G
Processed index 6: https://advance.lexis.com/a

In [88]:
def setup_driver():
    # Specify the path to GeckoDriver
    geckodriver_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/geckodriver'
    service = FirefoxService(executable_path=geckodriver_path)
    driver = webdriver.Firefox(service=service)
    return driver

def fetch_article_details(driver, url):
    driver.get(url)
    body, publication_date = None, None
    try:
        # Wait for the publication date element to be present
        publication_date_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p.SS_DocumentInfo:last-of-type"))
        )
        publication_date = publication_date_element.text if publication_date_element else "Publication date not found"

        # Get all paragraphs after the "Body" header within the article section
        body_elements = driver.find_elements(By.XPATH, "//h2[@id='JUMPTO_Body']/following-sibling::p")
        body_text_list = [element.text for element in body_elements if element.text.strip() != '']

        body = ""
        for paragraph in body_text_list:
            if "Bekijk de oorspronkelijke pagina" in paragraph:
                break
            if body:
                body += "\n" + paragraph
            else:
                body = paragraph

    except (NoSuchElementException, TimeoutException) as e:
        print(f"Error fetching details for URL {url}: {e}")
        raise  # Re-raise the exception to handle it in the calling function

    return body, publication_date

def main():
    driver = setup_driver()
    data = pd.read_excel('/Users/helgegeurtjacobusmoes/Desktop/thesis data/Merged_Data.xlsx')
    data['Body'] = None
    data['Publication Date'] = None

    start_index = 228  # Start processing from the index that failed previously

    updated_file_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Updated_Merged_Data_2.xlsx'

    for index, row in data.iloc[start_index:].iterrows():
        if pd.notna(row['URL']):
            try:
                body, publication_date = fetch_article_details(driver, row['URL'])
                data.at[index, 'Body'] = body
                data.at[index, 'Publication Date'] = publication_date
                # Save the data immediately after fetching each article
                data.to_excel(updated_file_path, index=False)
                print(f"Processed index {index}: {row['URL']}")
            except Exception as e:
                print(f"Failed to process index {index}: {row['URL']}. Error: {str(e)}")
                # Save the data when an error occurs
                data.to_excel(updated_file_path, index=False)
                break  # Optionally break if you do not wish to proceed after an error
        else:
            print(f"URL missing for index {index}")

    driver.quit()
    print("Data collection completed. Check the updated file.")

if __name__ == "__main__":
    main()

Processed index 228: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:65RX-MM71-DYRY-X43Y-00000-00&context=1516831&sourcegroupingtype=G
Processed index 229: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5P5P-KNY1-JC8X-61W8-00000-00&context=1516831&sourcegroupingtype=G
Processed index 230: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:6B67-7211-DYRY-X000-00000-00&context=1516831&sourcegroupingtype=G
Processed index 231: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:6852-W5N1-JC8X-6510-00000-00&context=1516831&sourcegroupingtype=G
Processed index 232: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:692N-C9K1-JC8X-600H-00000-00&context=1516831&sourcegroupingtype=G
Processed index 233: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:62D9-DVV1-DYRY-X11P-00000-00&context=1516831&sourcegroupingtype=G
Processed index 234: https://advan
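
The paragraph-joining step inside `fetch_article_details` does not depend on the browser, so it can be isolated and checked on plain strings. The following is a minimal sketch, assuming the paragraph texts have already been extracted into a list; `join_body_paragraphs` and `STOP_MARKER` are hypothetical names that do not appear in the notebook.

```python
# Footer text the notebook uses as the cut-off point for the article body.
STOP_MARKER = "Bekijk de oorspronkelijke pagina"


def join_body_paragraphs(paragraphs, stop_marker=STOP_MARKER):
    """Join non-empty paragraphs with newlines, stopping at the first
    paragraph that contains stop_marker (hypothetical helper)."""
    kept = []
    for paragraph in paragraphs:
        if not paragraph.strip():
            continue  # skip blank or whitespace-only paragraphs
        if stop_marker in paragraph:
            break  # everything from the footer onward is discarded
        kept.append(paragraph)
    return "\n".join(kept)
```

Called with the list of `element.text` values, this should give the same `body` string as the loop in the cell above, including an empty string when no paragraphs are found.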

In [89]:
def setup_driver():
    # Specify the path to GeckoDriver
    geckodriver_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/geckodriver'
    service = FirefoxService(executable_path=geckodriver_path)
    driver = webdriver.Firefox(service=service)
    return driver

def fetch_article_details(driver, url):
    driver.get(url)
    body, publication_date = None, None
    try:
        # Wait for the publication date element to be present
        publication_date_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p.SS_DocumentInfo:last-of-type"))
        )
        # until() either returns the element or raises TimeoutException, so no fallback value is needed
        publication_date = publication_date_element.text

        # Get all paragraphs after the "Body" header within the article section
        body_elements = driver.find_elements(By.XPATH, "//h2[@id='JUMPTO_Body']/following-sibling::p")
        body_text_list = [element.text for element in body_elements if element.text.strip() != '']

        body = ""
        for paragraph in body_text_list:
            if "Bekijk de oorspronkelijke pagina" in paragraph:
                break
            if body:
                body += "\n" + paragraph
            else:
                body = paragraph

    except (NoSuchElementException, TimeoutException) as e:
        print(f"Error fetching details for URL {url}: {e}")
        raise  # Re-raise the exception to handle it in the calling function

    return body, publication_date

def main():
    driver = setup_driver()
    data = pd.read_excel('/Users/helgegeurtjacobusmoes/Desktop/thesis data/Merged_Data.xlsx')
    data['Body'] = None
    data['Publication Date'] = None

    start_index = 732  # Start processing from the index that failed previously

    updated_file_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Updated_Merged_Data_3.xlsx'

    for index, row in data.iloc[start_index:].iterrows():
        if pd.notna(row['URL']):
            try:
                body, publication_date = fetch_article_details(driver, row['URL'])
                data.at[index, 'Body'] = body
                data.at[index, 'Publication Date'] = publication_date
                # Save the data immediately after fetching each article
                data.to_excel(updated_file_path, index=False)
                print(f"Processed index {index}: {row['URL']}")
            except Exception as e:
                print(f"Failed to process index {index}: {row['URL']}. Error: {str(e)}")
                # Save the data when an error occurs
                data.to_excel(updated_file_path, index=False)
                break  # Optionally break if you do not wish to proceed after an error
        else:
            print(f"URL missing for index {index}")

    driver.quit()
    print("Data collection completed. Check the updated file.")

if __name__ == "__main__":
    main()

Processed index 732: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:68RK-DPC1-JBNC-700J-00000-00&context=1516831&sourcegroupingtype=G
Processed index 733: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:642N-B191-JBNC-7192-00000-00&context=1516831&sourcegroupingtype=G
Processed index 734: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:69SC-K1G1-JBNC-754V-00000-00&context=1516831&sourcegroupingtype=G
Processed index 735: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:63YG-0D21-DY4D-Y2V1-00000-00&context=1516831&sourcegroupingtype=G
Processed index 736: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:6B5X-1V11-DY4D-Y029-00000-00&context=1516831&sourcegroupingtype=G
Processed index 737: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:68VK-1NT1-JBNC-71RN-00000-00&context=1516831&sourcegroupingtype=G
Processed index 738: https://advan
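
The cells in this part repeat the same loop with a different `start_index` each time. One way to express that loop once is to separate the "iterate, skip missing URLs, stop at the first failure" logic from Selenium by passing the fetch function in as an argument. This is only a sketch under that assumption; `process_from` and its return shape are hypothetical and not part of the notebook.

```python
def process_from(urls, fetch, start_index=0):
    """Call fetch(url) for urls[start_index:], stopping at the first failure.

    Returns (results, failed_index): results maps row index to the fetched
    value, and failed_index is the index to resume from, or None if every
    row was processed. Hypothetical helper, not from the notebook.
    """
    results = {}
    for index in range(start_index, len(urls)):
        url = urls[index]
        if url is None:
            continue  # mirror the notebook: rows without a URL are skipped
        try:
            results[index] = fetch(url)
        except Exception as exc:
            print(f"Failed to process index {index}: {url}. Error: {exc}")
            return results, index
    return results, None
```

In the notebook's setting, `fetch` would be something like `lambda url: fetch_article_details(driver, url)`, and the returned `failed_index` would become the next run's `start_index`.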

In [90]:
def setup_driver():
    # Specify the path to GeckoDriver
    geckodriver_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/geckodriver'
    service = FirefoxService(executable_path=geckodriver_path)
    driver = webdriver.Firefox(service=service)
    return driver

def fetch_article_details(driver, url):
    driver.get(url)
    body, publication_date = None, None
    try:
        # Wait for the publication date element to be present
        publication_date_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p.SS_DocumentInfo:last-of-type"))
        )
        # until() either returns the element or raises TimeoutException, so no fallback value is needed
        publication_date = publication_date_element.text

        # Get all paragraphs after the "Body" header within the article section
        body_elements = driver.find_elements(By.XPATH, "//h2[@id='JUMPTO_Body']/following-sibling::p")
        body_text_list = [element.text for element in body_elements if element.text.strip() != '']

        body = ""
        for paragraph in body_text_list:
            if "Bekijk de oorspronkelijke pagina" in paragraph:
                break
            if body:
                body += "\n" + paragraph
            else:
                body = paragraph

    except (NoSuchElementException, TimeoutException) as e:
        print(f"Error fetching details for URL {url}: {e}")
        raise  # Re-raise the exception to handle it in the calling function

    return body, publication_date

def main():
    driver = setup_driver()
    data = pd.read_excel('/Users/helgegeurtjacobusmoes/Desktop/thesis data/Merged_Data.xlsx')
    data['Body'] = None
    data['Publication Date'] = None

    start_index = 1239  # Start processing from the index that failed previously

    updated_file_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Updated_Merged_Data_4.xlsx'

    for index, row in data.iloc[start_index:].iterrows():
        if pd.notna(row['URL']):
            try:
                body, publication_date = fetch_article_details(driver, row['URL'])
                data.at[index, 'Body'] = body
                data.at[index, 'Publication Date'] = publication_date
                # Save the data immediately after fetching each article
                data.to_excel(updated_file_path, index=False)
                print(f"Processed index {index}: {row['URL']}")
            except Exception as e:
                print(f"Failed to process index {index}: {row['URL']}. Error: {str(e)}")
                # Save the data when an error occurs
                data.to_excel(updated_file_path, index=False)
                break  # Optionally break if you do not wish to proceed after an error
        else:
            print(f"URL missing for index {index}")

    driver.quit()
    print("Data collection completed. Check the updated file.")

if __name__ == "__main__":
    main()

Processed index 1239: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:6B43-99M1-DY4D-Y00F-00000-00&context=1516831&sourcegroupingtype=G
Processed index 1240: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:686W-CX61-JBNC-736S-00000-00&context=1516831&sourcegroupingtype=G
Processed index 1241: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:61G5-8521-JBNC-7198-00000-00&context=1516831&sourcegroupingtype=G
Processed index 1242: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:69CP-XF61-DY4D-Y02K-00000-00&context=1516831&sourcegroupingtype=G
Processed index 1243: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5PMT-BRX1-JBKF-J54P-00000-00&context=1516831&sourcegroupingtype=G
Processed index 1244: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:612S-K871-JBNC-7037-00000-00&context=1516831&sourcegroupingtype=G
Processed index 1245: https:
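
Each cell sets `start_index` by hand to the index that failed in the previous run. Since every successful row is written to the checkpoint file before the next one is fetched, the resume point could also be derived from the checkpoint itself. The sketch below assumes one column of the previous checkpoint (for example 'Publication Date') has been loaded into a plain list in which unprocessed rows are `None` or NaN; `next_start_index` is a hypothetical name.

```python
def next_start_index(values):
    """Return the position right after the last filled entry, or 0 if
    nothing is filled (hypothetical helper)."""
    for position in range(len(values) - 1, -1, -1):
        value = values[position]
        # value == value is False only for NaN, which is how pandas
        # represents empty Excel cells after reading the file back.
        if value is not None and value == value:
            return position + 1
    return 0
```

With a pandas column, `Series.last_valid_index()` gives the label of the last filled row and may serve the same purpose. Rows that were skipped because the URL was missing are empty as well, so the computed index can fall slightly before the failed one; revisiting those rows is harmless.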

In [91]:
def setup_driver():
    # Specify the path to GeckoDriver
    geckodriver_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/geckodriver'
    service = FirefoxService(executable_path=geckodriver_path)
    driver = webdriver.Firefox(service=service)
    return driver

def fetch_article_details(driver, url):
    driver.get(url)
    body, publication_date = None, None
    try:
        # Wait for the publication date element to be present
        publication_date_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p.SS_DocumentInfo:last-of-type"))
        )
        # until() either returns the element or raises TimeoutException, so no fallback value is needed
        publication_date = publication_date_element.text

        # Get all paragraphs after the "Body" header within the article section
        body_elements = driver.find_elements(By.XPATH, "//h2[@id='JUMPTO_Body']/following-sibling::p")
        body_text_list = [element.text for element in body_elements if element.text.strip() != '']

        body = ""
        for paragraph in body_text_list:
            if "Bekijk de oorspronkelijke pagina" in paragraph:
                break
            if body:
                body += "\n" + paragraph
            else:
                body = paragraph

    except (NoSuchElementException, TimeoutException) as e:
        print(f"Error fetching details for URL {url}: {e}")
        raise  # Re-raise the exception to handle it in the calling function

    return body, publication_date

def main():
    driver = setup_driver()
    data = pd.read_excel('/Users/helgegeurtjacobusmoes/Desktop/thesis data/Merged_Data.xlsx')
    data['Body'] = None
    data['Publication Date'] = None

    start_index = 1747  # Start processing from the index that failed previously

    updated_file_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Updated_Merged_Data_5.xlsx'

    for index, row in data.iloc[start_index:].iterrows():
        if pd.notna(row['URL']):
            try:
                body, publication_date = fetch_article_details(driver, row['URL'])
                data.at[index, 'Body'] = body
                data.at[index, 'Publication Date'] = publication_date
                # Save the data immediately after fetching each article
                data.to_excel(updated_file_path, index=False)
                print(f"Processed index {index}: {row['URL']}")
            except Exception as e:
                print(f"Failed to process index {index}: {row['URL']}. Error: {str(e)}")
                # Save the data when an error occurs
                data.to_excel(updated_file_path, index=False)
                break  # Optionally break if you do not wish to proceed after an error
        else:
            print(f"URL missing for index {index}")

    driver.quit()
    print("Data collection completed. Check the updated file.")

if __name__ == "__main__":
    main()

Processed index 1747: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:67JY-YGD1-DYRY-X2TF-00000-00&context=1516831&sourcegroupingtype=G
Processed index 1748: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:69YX-T2J1-DYRY-X54R-00000-00&context=1516831&sourcegroupingtype=G
Processed index 1749: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:643H-8K91-JC8X-63KH-00000-00&context=1516831&sourcegroupingtype=G
Processed index 1750: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5WC0-WP31-DYRY-X0NR-00000-00&context=1516831&sourcegroupingtype=G
Processed index 1751: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:626D-JBY1-DYRY-X07N-00000-00&context=1516831&sourcegroupingtype=G
Processed index 1752: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:65KR-H751-JC8X-611B-00000-00&context=1516831&sourcegroupingtype=G
Processed index 1753: https:
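
Every run re-reads `Merged_Data.xlsx`, resets 'Body' and 'Publication Date' to `None`, and writes to its own `Updated_Merged_Data_N.xlsx`, so each output file contains only the rows fetched during that run. How the files are combined afterwards is not shown in this part. As an illustration only, the sketch below coalesces one column from several runs, assuming each run's column has been loaded into an equally long list with `None` or NaN for unfilled rows; `coalesce_columns` is a hypothetical name.

```python
def coalesce_columns(*partials):
    """Combine equally long lists row by row, keeping the first filled
    value found for each row (hypothetical helper)."""

    def is_filled(value):
        # Excludes None and NaN (NaN is the only value not equal to itself).
        return value is not None and value == value

    merged = []
    for row_values in zip(*partials):
        merged.append(next((v for v in row_values if is_filled(v)), None))
    return merged
```

On DataFrames, `DataFrame.combine_first` performs a similar fill-from-the-other-frame operation and is probably the more natural choice once the files are loaded with pandas.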

In [92]:
def setup_driver():
    # Specify the path to GeckoDriver
    geckodriver_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/geckodriver'
    service = FirefoxService(executable_path=geckodriver_path)
    driver = webdriver.Firefox(service=service)
    return driver

def fetch_article_details(driver, url):
    driver.get(url)
    body, publication_date = None, None
    try:
        # Wait for the publication date element to be present
        publication_date_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p.SS_DocumentInfo:last-of-type"))
        )
        # until() either returns the element or raises TimeoutException, so no fallback value is needed
        publication_date = publication_date_element.text

        # Get all paragraphs after the "Body" header within the article section
        body_elements = driver.find_elements(By.XPATH, "//h2[@id='JUMPTO_Body']/following-sibling::p")
        body_text_list = [element.text for element in body_elements if element.text.strip() != '']

        body = ""
        for paragraph in body_text_list:
            if "Bekijk de oorspronkelijke pagina" in paragraph:
                break
            if body:
                body += "\n" + paragraph
            else:
                body = paragraph

    except (NoSuchElementException, TimeoutException) as e:
        print(f"Error fetching details for URL {url}: {e}")
        raise  # Re-raise the exception to handle it in the calling function

    return body, publication_date

def main():
    driver = setup_driver()
    data = pd.read_excel('/Users/helgegeurtjacobusmoes/Desktop/thesis data/Merged_Data.xlsx')
    data['Body'] = None
    data['Publication Date'] = None

    start_index = 2190  # Start processing from the index that failed previously

    updated_file_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Updated_Merged_Data_6.xlsx'

    for index, row in data.iloc[start_index:].iterrows():
        if pd.notna(row['URL']):
            try:
                body, publication_date = fetch_article_details(driver, row['URL'])
                data.at[index, 'Body'] = body
                data.at[index, 'Publication Date'] = publication_date
                # Save the data immediately after fetching each article
                data.to_excel(updated_file_path, index=False)
                print(f"Processed index {index}: {row['URL']}")
            except Exception as e:
                print(f"Failed to process index {index}: {row['URL']}. Error: {str(e)}")
                # Save the data when an error occurs
                data.to_excel(updated_file_path, index=False)
                break  # Optionally break if you do not wish to proceed after an error
        else:
            print(f"URL missing for index {index}")

    driver.quit()
    print("Data collection completed. Check the updated file.")

if __name__ == "__main__":
    main()

Processed index 2190: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:69J0-B4C1-DYRY-X0MC-00000-00&context=1516831&sourcegroupingtype=G
Processed index 2191: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5WB8-VM41-JC8X-60C4-00000-00&context=1516831&sourcegroupingtype=G
Processed index 2192: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5XDV-W671-JC8X-604R-00000-00&context=1516831&sourcegroupingtype=G
Processed index 2193: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:65T6-FHD1-JC8X-62V2-00000-00&context=1516831&sourcegroupingtype=G
Processed index 2194: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5XJX-9VG1-JC8X-6491-00000-00&context=1516831&sourcegroupingtype=G
Processed index 2195: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:68Y8-D8R1-JC8X-64BX-00000-00&context=1516831&sourcegroupingtype=G
Processed index 2196: https:

In [93]:
def setup_driver():
    # Specify the path to GeckoDriver
    geckodriver_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/geckodriver'
    service = FirefoxService(executable_path=geckodriver_path)
    driver = webdriver.Firefox(service=service)
    return driver

def fetch_article_details(driver, url):
    driver.get(url)
    body, publication_date = None, None
    try:
        # Wait for the publication date element to be present
        publication_date_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p.SS_DocumentInfo:last-of-type"))
        )
        # until() either returns the element or raises TimeoutException, so no fallback value is needed
        publication_date = publication_date_element.text

        # Get all paragraphs after the "Body" header within the article section
        body_elements = driver.find_elements(By.XPATH, "//h2[@id='JUMPTO_Body']/following-sibling::p")
        body_text_list = [element.text for element in body_elements if element.text.strip() != '']

        body = ""
        for paragraph in body_text_list:
            if "Bekijk de oorspronkelijke pagina" in paragraph:
                break
            if body:
                body += "\n" + paragraph
            else:
                body = paragraph

    except (NoSuchElementException, TimeoutException) as e:
        print(f"Error fetching details for URL {url}: {e}")
        raise  # Re-raise the exception to handle it in the calling function

    return body, publication_date

def main():
    driver = setup_driver()
    data = pd.read_excel('/Users/helgegeurtjacobusmoes/Desktop/thesis data/Merged_Data.xlsx')
    data['Body'] = None
    data['Publication Date'] = None

    start_index = 2696  # Start processing from the index that failed previously

    updated_file_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Updated_Merged_Data_7.xlsx'

    for index, row in data.iloc[start_index:].iterrows():
        if pd.notna(row['URL']):
            try:
                body, publication_date = fetch_article_details(driver, row['URL'])
                data.at[index, 'Body'] = body
                data.at[index, 'Publication Date'] = publication_date
                # Save the data immediately after fetching each article
                data.to_excel(updated_file_path, index=False)
                print(f"Processed index {index}: {row['URL']}")
            except Exception as e:
                print(f"Failed to process index {index}: {row['URL']}. Error: {str(e)}")
                # Save the data when an error occurs
                data.to_excel(updated_file_path, index=False)
                break  # Optionally break if you do not wish to proceed after an error
        else:
            print(f"URL missing for index {index}")

    driver.quit()
    print("Data collection completed. Check the updated file.")

if __name__ == "__main__":
    main()

Processed index 2696: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5R35-BT91-DYRY-X06F-00000-00&context=1516831&sourcegroupingtype=G
Processed index 2697: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5MWY-T561-JC8X-63HX-00000-00&context=1516831&sourcegroupingtype=G
Processed index 2698: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:686C-88P1-DYRY-X3YT-00000-00&context=1516831&sourcegroupingtype=G
Processed index 2699: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5TP9-BF61-DYRY-X061-00000-00&context=1516831&sourcegroupingtype=G
Processed index 2700: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5PCN-T1B1-JC8X-6229-00000-00&context=1516831&sourcegroupingtype=G
Processed index 2701: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:68R3-7YR1-DYRY-X325-00000-00&context=1516831&sourcegroupingtype=G
Processed index 2702: https:

In [95]:
def setup_driver():
    # Specify the path to GeckoDriver
    geckodriver_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/geckodriver'
    service = FirefoxService(executable_path=geckodriver_path)
    driver = webdriver.Firefox(service=service)
    return driver

def fetch_article_details(driver, url):
    driver.get(url)
    body, publication_date = None, None
    try:
        # Wait for the publication date element to be present
        publication_date_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p.SS_DocumentInfo:last-of-type"))
        )
        # until() either returns the element or raises TimeoutException, so no fallback value is needed
        publication_date = publication_date_element.text

        # Get all paragraphs after the "Body" header within the article section
        body_elements = driver.find_elements(By.XPATH, "//h2[@id='JUMPTO_Body']/following-sibling::p")
        body_text_list = [element.text for element in body_elements if element.text.strip() != '']

        body = ""
        for paragraph in body_text_list:
            if "Bekijk de oorspronkelijke pagina" in paragraph:
                break
            if body:
                body += "\n" + paragraph
            else:
                body = paragraph

    except (NoSuchElementException, TimeoutException) as e:
        print(f"Error fetching details for URL {url}: {e}")
        raise  # Re-raise the exception to handle it in the calling function

    return body, publication_date

def main():
    driver = setup_driver()
    data = pd.read_excel('/Users/helgegeurtjacobusmoes/Desktop/thesis data/Merged_Data.xlsx')
    data['Body'] = None
    data['Publication Date'] = None

    start_index = 3202  # Start processing from the index that failed previously

    updated_file_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Updated_Merged_Data_8.xlsx'

    for index, row in data.iloc[start_index:].iterrows():
        if pd.notna(row['URL']):
            try:
                body, publication_date = fetch_article_details(driver, row['URL'])
                data.at[index, 'Body'] = body
                data.at[index, 'Publication Date'] = publication_date
                # Save the data immediately after fetching each article
                data.to_excel(updated_file_path, index=False)
                print(f"Processed index {index}: {row['URL']}")
            except Exception as e:
                print(f"Failed to process index {index}: {row['URL']}. Error: {str(e)}")
                # Save the data when an error occurs
                data.to_excel(updated_file_path, index=False)
                break  # Optionally break if you do not wish to proceed after an error
        else:
            print(f"URL missing for index {index}")

    driver.quit()
    print("Data collection completed. Check the updated file.")

if __name__ == "__main__":
    main()

Processed index 3202: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:69CY-0641-F03R-S03R-00000-00&context=1516831&sourcegroupingtype=G
Processed index 3203: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:68G4-GPK1-F03R-S0NG-00000-00&context=1516831&sourcegroupingtype=G
Processed index 3204: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5X7R-R7J1-DYMH-R1NT-00000-00&context=1516831&sourcegroupingtype=G
Processed index 3205: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5Y8B-52V1-DYMH-R40Y-00000-00&context=1516831&sourcegroupingtype=G
Processed index 3206: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:6B4V-D6J1-JCMP-2035-00000-00&context=1516831&sourcegroupingtype=G
Processed index 3207: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:66N5-FRF1-JCMP-202R-00000-00&context=1516831&sourcegroupingtype=G
Processed index 3208: https:

In [96]:
def setup_driver():
    # Specify the path to GeckoDriver
    geckodriver_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/geckodriver'
    service = FirefoxService(executable_path=geckodriver_path)
    driver = webdriver.Firefox(service=service)
    return driver

def fetch_article_details(driver, url):
    driver.get(url)
    body, publication_date = None, None
    try:
        # Wait for the publication date element to be present
        publication_date_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p.SS_DocumentInfo:last-of-type"))
        )
        # until() either returns the element or raises TimeoutException, so no fallback value is needed
        publication_date = publication_date_element.text

        # Get all paragraphs after the "Body" header within the article section
        body_elements = driver.find_elements(By.XPATH, "//h2[@id='JUMPTO_Body']/following-sibling::p")
        body_text_list = [element.text for element in body_elements if element.text.strip() != '']

        body = ""
        for paragraph in body_text_list:
            if "Bekijk de oorspronkelijke pagina" in paragraph:
                break
            if body:
                body += "\n" + paragraph
            else:
                body = paragraph

    except (NoSuchElementException, TimeoutException) as e:
        print(f"Error fetching details for URL {url}: {e}")
        raise  # Re-raise the exception to handle it in the calling function

    return body, publication_date

def main():
    driver = setup_driver()
    data = pd.read_excel('/Users/helgegeurtjacobusmoes/Desktop/thesis data/Merged_Data.xlsx')
    data['Body'] = None
    data['Publication Date'] = None

    start_index = 3707  # Start processing from the index that failed previously

    updated_file_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Updated_Merged_Data_9.xlsx'

    for index, row in data.iloc[start_index:].iterrows():
        if pd.notna(row['URL']):
            try:
                body, publication_date = fetch_article_details(driver, row['URL'])
                data.at[index, 'Body'] = body
                data.at[index, 'Publication Date'] = publication_date
                # Save the data immediately after fetching each article
                data.to_excel(updated_file_path, index=False)
                print(f"Processed index {index}: {row['URL']}")
            except Exception as e:
                print(f"Failed to process index {index}: {row['URL']}. Error: {str(e)}")
                # Save the data when an error occurs
                data.to_excel(updated_file_path, index=False)
                break  # Optionally break if you do not wish to proceed after an error
        else:
            print(f"URL missing for index {index}")

    driver.quit()
    print("Data collection completed. Check the updated file.")

if __name__ == "__main__":
    main()

Processed index 3707: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5W34-BX51-DYMH-R3VW-00000-00&context=1516831&sourcegroupingtype=G
Processed index 3708: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:698Y-CHC1-F03R-S024-00000-00&context=1516831&sourcegroupingtype=G
Processed index 3709: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5YK2-M711-DYMH-R039-00000-00&context=1516831&sourcegroupingtype=G
Processed index 3710: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:685K-NFB1-F03R-S0GG-00000-00&context=1516831&sourcegroupingtype=G
Processed index 3711: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5NKC-0NF1-JC5G-1131-00000-00&context=1516831&sourcegroupingtype=G
Processed index 3712: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:68GJ-F5H1-F03R-S1WK-00000-00&context=1516831&sourcegroupingtype=G
Processed index 3713: https:

In [97]:
def setup_driver():
    # Specify the path to GeckoDriver
    geckodriver_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/geckodriver'
    service = FirefoxService(executable_path=geckodriver_path)
    driver = webdriver.Firefox(service=service)
    return driver

def fetch_article_details(driver, url):
    driver.get(url)
    body, publication_date = None, None
    try:
        # Wait for the publication date element to be present
        publication_date_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p.SS_DocumentInfo:last-of-type"))
        )
        # until() either returns the element or raises TimeoutException, so no fallback value is needed
        publication_date = publication_date_element.text

        # Get all paragraphs after the "Body" header within the article section
        body_elements = driver.find_elements(By.XPATH, "//h2[@id='JUMPTO_Body']/following-sibling::p")
        body_text_list = [element.text for element in body_elements if element.text.strip() != '']

        body = ""
        for paragraph in body_text_list:
            if "Bekijk de oorspronkelijke pagina" in paragraph:
                break
            if body:
                body += "\n" + paragraph
            else:
                body = paragraph

    except (NoSuchElementException, TimeoutException) as e:
        print(f"Error fetching details for URL {url}: {e}")
        raise  # Re-raise the exception to handle it in the calling function

    return body, publication_date

def main():
    driver = setup_driver()
    data = pd.read_excel('/Users/helgegeurtjacobusmoes/Desktop/thesis data/Merged_Data.xlsx')
    data['Body'] = None
    data['Publication Date'] = None

    start_index = 4216  # Start processing from the index that failed previously

    updated_file_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Updated_Merged_Data_10.xlsx'

    for index, row in data.iloc[start_index:].iterrows():
        if pd.notna(row['URL']):
            try:
                body, publication_date = fetch_article_details(driver, row['URL'])
                data.at[index, 'Body'] = body
                data.at[index, 'Publication Date'] = publication_date
                # Save the data immediately after fetching each article
                data.to_excel(updated_file_path, index=False)
                print(f"Processed index {index}: {row['URL']}")
            except Exception as e:
                print(f"Failed to process index {index}: {row['URL']}. Error: {str(e)}")
                # Save the data when an error occurs
                data.to_excel(updated_file_path, index=False)
                break  # Optionally break if you do not wish to proceed after an error
        else:
            print(f"URL missing for index {index}")

    driver.quit()
    print("Data collection completed. Check the updated file.")

if __name__ == "__main__":
    main()

Processed index 4216: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:6BD9-81F1-F03R-S005-00000-00&context=1516831&sourcegroupingtype=G
Processed index 4217: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:64DX-7231-F03R-S2F5-00000-00&context=1516831&sourcegroupingtype=G
Processed index 4218: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5HMH-G1C1-DYRY-N4V2-00000-00&context=1516831&sourcegroupingtype=G
Processed index 4219: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5SM1-K501-JC5G-106F-00000-00&context=1516831&sourcegroupingtype=G
Processed index 4220: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5RP7-XWY1-DYMH-R2JB-00000-00&context=1516831&sourcegroupingtype=G
Processed index 4221: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:61H1-DF81-DYMH-R4H6-00000-00&context=1516831&sourcegroupingtype=G
Processed index 4222: https:

In [98]:
def setup_driver():
    # Specify the path to GeckoDriver
    geckodriver_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/geckodriver'
    service = FirefoxService(executable_path=geckodriver_path)
    driver = webdriver.Firefox(service=service)
    return driver

def fetch_article_details(driver, url):
    driver.get(url)
    body, publication_date = None, None
    try:
        # Wait up to 10 seconds for the publication date element; until() raises
        # TimeoutException if it never appears, so no fallback value is needed
        publication_date_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p.SS_DocumentInfo:last-of-type"))
        )
        publication_date = publication_date_element.text

        # Collect the non-empty paragraphs that follow the "Body" heading
        body_elements = driver.find_elements(By.XPATH, "//h2[@id='JUMPTO_Body']/following-sibling::p")
        body_text_list = [element.text for element in body_elements if element.text.strip() != '']

        # Keep the paragraphs up to the "Bekijk de oorspronkelijke pagina"
        # ("View the original page") footer and join them with newlines
        body_paragraphs = []
        for paragraph in body_text_list:
            if "Bekijk de oorspronkelijke pagina" in paragraph:
                break
            body_paragraphs.append(paragraph)
        body = "\n".join(body_paragraphs)

    except (NoSuchElementException, TimeoutException) as e:
        print(f"Error fetching details for URL {url}: {e}")
        raise  # Re-raise the exception to handle it in the calling function

    return body, publication_date

def main():
    driver = setup_driver()
    data = pd.read_excel('/Users/helgegeurtjacobusmoes/Desktop/thesis data/Merged_Data.xlsx')
    data['Body'] = None
    data['Publication Date'] = None

    start_index = 4721  # Start processing from the index that failed previously

    updated_file_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Updated_Merged_Data_11.xlsx'

    for index, row in data.iloc[start_index:].iterrows():
        if pd.notna(row['URL']):
            try:
                body, publication_date = fetch_article_details(driver, row['URL'])
                data.at[index, 'Body'] = body
                data.at[index, 'Publication Date'] = publication_date
                # Save the data immediately after fetching each article
                data.to_excel(updated_file_path, index=False)
                print(f"Processed index {index}: {row['URL']}")
            except Exception as e:
                print(f"Failed to process index {index}: {row['URL']}. Error: {e}")
                # Save the progress so far, then stop; the next run restarts from this index
                data.to_excel(updated_file_path, index=False)
                break
        else:
            print(f"URL missing for index {index}")

    driver.quit()
    print("Run finished. If a failure is logged above, restart from that index; otherwise collection is complete.")

if __name__ == "__main__":
    main()

Processed index 4721: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:603B-J9K1-DYWB-S19C-00000-00&context=1516831&sourcegroupingtype=G
Processed index 4722: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:63VP-53H1-DYMG-107F-00000-00&context=1516831&sourcegroupingtype=G
Processed index 4723: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:6BN4-8R51-DYMG-101X-00000-00&context=1516831&sourcegroupingtype=G
Processed index 4724: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:66PY-S5J1-JB53-94SW-00000-00&context=1516831&sourcegroupingtype=G
Processed index 4725: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:66GY-S3S1-JB53-94B9-00000-00&context=1516831&sourcegroupingtype=G
Processed index 4726: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:66B2-XHB1-DXM5-1567-00000-00&context=1516831&sourcegroupingtype=G
Processed index 4727: https:

In [99]:
def setup_driver():
    # Specify the path to GeckoDriver
    geckodriver_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/geckodriver'
    service = FirefoxService(executable_path=geckodriver_path)
    driver = webdriver.Firefox(service=service)
    return driver

def fetch_article_details(driver, url):
    driver.get(url)
    body, publication_date = None, None
    try:
        # Wait up to 10 seconds for the publication date element; until() raises
        # TimeoutException if it never appears, so no fallback value is needed
        publication_date_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p.SS_DocumentInfo:last-of-type"))
        )
        publication_date = publication_date_element.text

        # Collect the non-empty paragraphs that follow the "Body" heading
        body_elements = driver.find_elements(By.XPATH, "//h2[@id='JUMPTO_Body']/following-sibling::p")
        body_text_list = [element.text for element in body_elements if element.text.strip() != '']

        # Keep the paragraphs up to the "Bekijk de oorspronkelijke pagina"
        # ("View the original page") footer and join them with newlines
        body_paragraphs = []
        for paragraph in body_text_list:
            if "Bekijk de oorspronkelijke pagina" in paragraph:
                break
            body_paragraphs.append(paragraph)
        body = "\n".join(body_paragraphs)

    except (NoSuchElementException, TimeoutException) as e:
        print(f"Error fetching details for URL {url}: {e}")
        raise  # Re-raise the exception to handle it in the calling function

    return body, publication_date

def main():
    driver = setup_driver()
    data = pd.read_excel('/Users/helgegeurtjacobusmoes/Desktop/thesis data/Merged_Data.xlsx')
    data['Body'] = None
    data['Publication Date'] = None

    start_index = 5227  # Start processing from the index that failed previously

    updated_file_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Updated_Merged_Data_12.xlsx'

    for index, row in data.iloc[start_index:].iterrows():
        if pd.notna(row['URL']):
            try:
                body, publication_date = fetch_article_details(driver, row['URL'])
                data.at[index, 'Body'] = body
                data.at[index, 'Publication Date'] = publication_date
                # Save the data immediately after fetching each article
                data.to_excel(updated_file_path, index=False)
                print(f"Processed index {index}: {row['URL']}")
            except Exception as e:
                print(f"Failed to process index {index}: {row['URL']}. Error: {e}")
                # Save the progress so far, then stop; the next run restarts from this index
                data.to_excel(updated_file_path, index=False)
                break
        else:
            print(f"URL missing for index {index}")

    driver.quit()
    print("Run finished. If a failure is logged above, restart from that index; otherwise collection is complete.")

if __name__ == "__main__":
    main()

Processed index 5227: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5TG4-6YY1-DYWB-S3PP-00000-00&context=1516831&sourcegroupingtype=G
Processed index 5228: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:6BKD-H5V1-JC5D-94S8-00000-00&context=1516831&sourcegroupingtype=G
Processed index 5229: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:61JH-3C21-JCD9-247M-00000-00&context=1516831&sourcegroupingtype=G
Processed index 5230: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:67X4-N011-JC5D-90P5-00000-00&context=1516831&sourcegroupingtype=G
Processed index 5231: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:66M9-HWF1-DYMG-14WT-00000-00&context=1516831&sourcegroupingtype=G
Processed index 5232: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5VYP-HTG1-DYWB-S0Y6-00000-00&context=1516831&sourcegroupingtype=G
Processed index 5233: https:

In [100]:
def setup_driver():
    # Specify the path to GeckoDriver
    geckodriver_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/geckodriver'
    service = FirefoxService(executable_path=geckodriver_path)
    driver = webdriver.Firefox(service=service)
    return driver

def fetch_article_details(driver, url):
    driver.get(url)
    body, publication_date = None, None
    try:
        # Wait up to 10 seconds for the publication date element; until() raises
        # TimeoutException if it never appears, so no fallback value is needed
        publication_date_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p.SS_DocumentInfo:last-of-type"))
        )
        publication_date = publication_date_element.text

        # Collect the non-empty paragraphs that follow the "Body" heading
        body_elements = driver.find_elements(By.XPATH, "//h2[@id='JUMPTO_Body']/following-sibling::p")
        body_text_list = [element.text for element in body_elements if element.text.strip() != '']

        # Keep the paragraphs up to the "Bekijk de oorspronkelijke pagina"
        # ("View the original page") footer and join them with newlines
        body_paragraphs = []
        for paragraph in body_text_list:
            if "Bekijk de oorspronkelijke pagina" in paragraph:
                break
            body_paragraphs.append(paragraph)
        body = "\n".join(body_paragraphs)

    except (NoSuchElementException, TimeoutException) as e:
        print(f"Error fetching details for URL {url}: {e}")
        raise  # Re-raise the exception to handle it in the calling function

    return body, publication_date

def main():
    driver = setup_driver()
    data = pd.read_excel('/Users/helgegeurtjacobusmoes/Desktop/thesis data/Merged_Data.xlsx')
    data['Body'] = None
    data['Publication Date'] = None

    start_index = 5736  # Start processing from the index that failed previously

    updated_file_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Updated_Merged_Data_13.xlsx'

    for index, row in data.iloc[start_index:].iterrows():
        if pd.notna(row['URL']):
            try:
                body, publication_date = fetch_article_details(driver, row['URL'])
                data.at[index, 'Body'] = body
                data.at[index, 'Publication Date'] = publication_date
                # Save the data immediately after fetching each article
                data.to_excel(updated_file_path, index=False)
                print(f"Processed index {index}: {row['URL']}")
            except Exception as e:
                print(f"Failed to process index {index}: {row['URL']}. Error: {e}")
                # Save the progress so far, then stop; the next run restarts from this index
                data.to_excel(updated_file_path, index=False)
                break
        else:
            print(f"URL missing for index {index}")

    driver.quit()
    print("Run finished. If a failure is logged above, restart from that index; otherwise collection is complete.")

if __name__ == "__main__":
    main()

Processed index 5736: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:67W8-S2P1-JC5D-91XP-00000-00&context=1516831&sourcegroupingtype=G
Processed index 5737: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5F1S-2PY1-JC8W-Y293-00000-00&context=1516831&sourcegroupingtype=G
Processed index 5738: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5T0G-35R1-JCD9-2157-00000-00&context=1516831&sourcegroupingtype=G
Processed index 5739: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5F97-2KG1-JC8W-Y2XC-00000-00&context=1516831&sourcegroupingtype=G
Processed index 5740: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:65XB-P3H1-JBX9-K4C2-00000-00&context=1516831&sourcegroupingtype=G
Processed index 5741: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:65TV-5X91-JBX9-K0T0-00000-00&context=1516831&sourcegroupingtype=G
Processed index 5742: https:

In [101]:
def setup_driver():
    # Specify the path to GeckoDriver
    geckodriver_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/geckodriver'
    service = FirefoxService(executable_path=geckodriver_path)
    driver = webdriver.Firefox(service=service)
    return driver

def fetch_article_details(driver, url):
    driver.get(url)
    body, publication_date = None, None
    try:
        # Wait up to 10 seconds for the publication date element; until() raises
        # TimeoutException if it never appears, so no fallback value is needed
        publication_date_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p.SS_DocumentInfo:last-of-type"))
        )
        publication_date = publication_date_element.text

        # Collect the non-empty paragraphs that follow the "Body" heading
        body_elements = driver.find_elements(By.XPATH, "//h2[@id='JUMPTO_Body']/following-sibling::p")
        body_text_list = [element.text for element in body_elements if element.text.strip() != '']

        # Keep the paragraphs up to the "Bekijk de oorspronkelijke pagina"
        # ("View the original page") footer and join them with newlines
        body_paragraphs = []
        for paragraph in body_text_list:
            if "Bekijk de oorspronkelijke pagina" in paragraph:
                break
            body_paragraphs.append(paragraph)
        body = "\n".join(body_paragraphs)

    except (NoSuchElementException, TimeoutException) as e:
        print(f"Error fetching details for URL {url}: {e}")
        raise  # Re-raise the exception to handle it in the calling function

    return body, publication_date

def main():
    driver = setup_driver()
    data = pd.read_excel('/Users/helgegeurtjacobusmoes/Desktop/thesis data/Merged_Data.xlsx')
    data['Body'] = None
    data['Publication Date'] = None

    start_index = 6243  # Start processing from the index that failed previously

    updated_file_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Updated_Merged_Data_14.xlsx'

    for index, row in data.iloc[start_index:].iterrows():
        if pd.notna(row['URL']):
            try:
                body, publication_date = fetch_article_details(driver, row['URL'])
                data.at[index, 'Body'] = body
                data.at[index, 'Publication Date'] = publication_date
                # Save the data immediately after fetching each article
                data.to_excel(updated_file_path, index=False)
                print(f"Processed index {index}: {row['URL']}")
            except Exception as e:
                print(f"Failed to process index {index}: {row['URL']}. Error: {e}")
                # Save the progress so far, then stop; the next run restarts from this index
                data.to_excel(updated_file_path, index=False)
                break
        else:
            print(f"URL missing for index {index}")

    driver.quit()
    print("Run finished. If a failure is logged above, restart from that index; otherwise collection is complete.")

if __name__ == "__main__":
    main()

Processed index 6243: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:608N-TYC1-DYWB-S2F5-00000-00&context=1516831&sourcegroupingtype=G
Processed index 6244: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:60PM-6731-DYWB-S1CY-00000-00&context=1516831&sourcegroupingtype=G
Processed index 6245: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5FT0-YJ41-DYRY-N1CP-00000-00&context=1516831&sourcegroupingtype=G
Processed index 6246: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:65MC-WJM1-DYCC-9000-00000-00&context=1516831&sourcegroupingtype=G
Processed index 6247: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:6076-1JV1-DYWB-S0HM-00000-00&context=1516831&sourcegroupingtype=G
Processed index 6248: https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:5F0R-2B01-JC8W-Y4G1-00000-00&context=1516831&sourcegroupingtype=G
Processed index 6249: https:
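
The collection cells above differ only in `start_index` and the output file name, both edited by hand after each interrupted run. As a minimal sketch, the restart position could instead be read from the last partial file. `next_start_position` is a hypothetical helper, not part of the notebook; it assumes the partial file keeps the row order of `Merged_Data.xlsx` and that rows not yet fetched have an empty `Body`.

```python
import pandas as pd


def next_start_position(partial: pd.DataFrame, body_col: str = "Body") -> int:
    """Return the position just after the last row whose body was saved.

    Hypothetical helper: assumes `partial` has the same row order as the
    source file and that unprocessed rows have a missing value in `body_col`.
    """
    filled = partial[body_col].notna().to_numpy()
    if not filled.any():
        return 0  # nothing processed yet, start from the top
    # Position of the last filled row, plus one
    return int(filled.nonzero()[0][-1]) + 1


# Toy stand-in for a partial file written by an interrupted run
partial = pd.DataFrame(
    {
        "URL": ["u0", "u1", None, "u3", "u4"],
        "Body": [None, "text 1", None, "text 3", None],
    }
)
start_position = next_start_position(partial)
print(start_position)
```

In the notebook, `partial` would come from `pd.read_excel` on the most recent `Updated_Merged_Data_*.xlsx` file. Because the loop stops at the first failure, the returned position points at the row to retry, or at a row with a missing URL just before it, which is skipped again.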

## Cleaning data

In [178]:
# Read the 14 partial Excel files into DataFrames, dropping rows where 'Body' is missing
dfs = [
    pd.read_excel(f"Updated_Merged_Data_{i}.xlsx").dropna(subset=['Body'])
    for i in range(1, 15)
]

# Replace 'NRC.NEXT' with 'NRC' in each DataFrame
for df in dfs:
    df['News Outlet'] = df['News Outlet'].replace('NRC.NEXT', 'NRC')

# Merge the DataFrames
updated_merged_data = pd.concat(dfs, ignore_index=True)

# Remove rows that are exact duplicates across all columns
updated_merged_data = updated_merged_data.drop_duplicates()

# Save the merged DataFrame to an Excel file
updated_merged_data.to_excel("Updated_Merged_Data.xlsx", index=False)

# Save the merged DataFrame to a CSV file
updated_merged_data.to_csv("Updated_Merged_Data.csv", index=False)

# Display the merged DataFrame
updated_merged_data

,Headline,Publication,URL,News Outlet,Type of News,Word Count,Body,Publication Date
0,"Nee, kunstmatige intelligentie gaat ons niet u...","Trouw, Verdieping; Blz. 4, 5, 2044 words",https://advance.lexis.com/api/document?collect...,Trouw,Verdieping,2044,Welkom in de AI-fabriek serie\nDat kunstmatige...,7 december 2023 donderdag
1,Wereldleiders zoeken grip op kunstmatige intel...,"Trouw, Vandaag; Blz. 6, 528 words",https://advance.lexis.com/api/document?collect...,Trouw,Vandaag,528,Op het Britse landgoed Bletchley Park werden t...,3 november 2023 vrijdag
2,Kunstmatige intelligentie is best bedreigend,"Trouw, Tijdgeest; Blz. 8, 576 words",https://advance.lexis.com/api/document?collect...,Trouw,Tijdgeest,576,Of kunstmatige intelligentie nuttig is (Tijdge...,13 mei 2023 zaterdag
3,Mensen zijn een stuk efficiënter dan kunstmati...,"Trouw, Vandaag; Blz. 3, 741 words",https://advance.lexis.com/api/document?collect...,Trouw,Vandaag,741,De wereld raakte het afgelopen jaar in de ban ...,21 oktober 2023 zaterdag
4,Bedreigt kunstmatige intelligentie ons godsbeeld?,"Trouw, Religie en Filosofie; Blz. 8, 9, 1367 w...",https://advance.lexis.com/api/document?collect...,Trouw,Religie en Filosofie,1367,Theologisch elftal\n'In het begin was het Woor...,16 december 2022 vrijdag
...,...,...,...,...,...,...,...,...
6451,De rauwe realiteit,"Het Financieele Dagblad, MORGEN; Blz. 4, 2920 ...",https://advance.lexis.com/api/document?collect...,Het Financieele Dagblad,MORGEN,2920,Grootse oplossingen\nDrie stedelijke 'ontwrich...,14 oktober 2017 zaterdag 12:00 AM GMT
6452,No Headline In Original,"Het Financieele Dagblad, PAGINA 13; Blz. 13, 1...",https://advance.lexis.com/api/document?collect...,Het Financieele Dagblad,PAGINA,114,"klinkt als muziek\nDe Walkman, van Sony, is vo...",29 april 2023 zaterdag 12:00 AM GMT
6453,Groeten uit het hart van de hightech,"Het Financieele Dagblad, WEEKEND; Blz. 6, 2799...",https://advance.lexis.com/api/document?collect...,Het Financieele Dagblad,WEEKEND,2799,Het is zover voor 'onze man in San Francisco'....,20 augustus 2016 zaterdag 12:00 AM GMT
6454,De complete lijst Jonge Talenten 2019,"Het Financieele Dagblad, FD PERSOONLIJK; Arbei...",https://advance.lexis.com/api/document?collect...,Het Financieele Dagblad,FD PERSOONLIJK; Arbeidsmarkt,8007,Rebel werkte zes jaar bij zakenbank Morgan Sta...,17 januari 2019 donderdag 1:00 PM GMT
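
`drop_duplicates()` without arguments removes only rows that match in every column. If the same article could appear twice with a small difference in another column, comparing on `URL` alone would catch it. Whether such rows exist in this data set is not known from the notebook; the frame below is toy data and the URLs are placeholders.

```python
import pandas as pd

# Toy frame: rows 0 and 1 are the same article, but the publication-date
# text differs, so they are not identical rows
articles = pd.DataFrame(
    {
        "URL": [
            "https://example.org/a",
            "https://example.org/a",
            "https://example.org/b",
        ],
        "Body": ["text a", "text a", "text b"],
        "Publication Date": [
            "3 november 2023 vrijdag",
            "3 november 2023",
            "13 mei 2023 zaterdag",
        ],
    }
)

# Exact-row comparison keeps both copies of article "a"
exact = articles.drop_duplicates()

# Comparing on URL only keeps the first copy of each article
by_url = articles.drop_duplicates(subset=["URL"], keep="first")

print(len(exact), len(by_url))
```

Comparing the two lengths on the real data would show whether the stricter key changes the article counts at all.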


In [180]:
# Count the number of articles for each news outlet
articles_per_outlet = updated_merged_data.groupby('News Outlet').size()

# Print the count of articles per news outlet
print(articles_per_outlet)

News Outlet
AD/Algemeen Dagblad         481
De Telegraaf                808
Het Financieele Dagblad    2069
NRC                        1332
Trouw                       689
de Volkskrant              1067
dtype: int64


In [16]:
import re

# Load the merged data saved by the cleaning step
file_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/Updated_Merged_Data.xlsx'
data = pd.read_excel(file_path)

# Extract the four-digit year (1900-2099) from a date string such as "7 december 2023 donderdag"
def extract_year_regex(date_str):
    # Missing dates are read as NaN (a float), which re.search cannot handle
    if not isinstance(date_str, str):
        return pd.NA
    match = re.search(r'\b(19|20)\d{2}\b', date_str)
    if match:
        return int(match.group(0))
    return pd.NA

# Apply the regex function to the 'Publication Date' column
data['Publication Year'] = data['Publication Date'].apply(extract_year_regex)

# Group by 'News Outlet' and 'Publication Year' and count the number of articles
articles_per_outlet_year = data.groupby(['News Outlet', 'Publication Year']).size().unstack(fill_value=0)

# Add a 'Total' row with the number of articles per year across all outlets
articles_per_outlet_year.loc['Total'] = articles_per_outlet_year.sum()

# Keep the yearly totals as the denominator for the percentages
year_totals = articles_per_outlet_year.loc['Total']

# Calculate each outlet's share of that year's articles, in percent (the 'Total' row becomes 100.0)
percentages_per_year = (articles_per_outlet_year.div(year_totals, axis=1) * 100).round(1)

# Combine the counts and percentages into one table
combined_table_with_totals = pd.concat([articles_per_outlet_year, percentages_per_year], 
                                       keys=['Count', 'Percentage'], 
                                       axis=1)

combined_table_with_totals

,Count,Count,Count,Count,Count,Count,Count,Count,Count,Count,...,Percentage,Percentage,Percentage,Percentage,Percentage,Percentage,Percentage,Percentage,Percentage,Percentage
Publication Year,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023,...,2015,2016,2017,2018,2019,2020,2021,2022,2023,2024
News Outlet,,,,,,,,,,,,,,,,,,,,,
AD/Algemeen Dagblad,3,4,13,24,46,40,41,54,37,158,...,3.2,4.6,4.7,6.3,4.7,6.2,8.9,5.4,10.8,13.5
De Telegraaf,7,20,46,72,100,87,50,62,68,214,...,16.0,16.3,14.0,13.7,10.2,7.5,10.2,9.9,14.6,18.1
Het Financieele Dagblad,9,37,100,157,250,303,234,209,290,355,...,29.6,35.5,30.5,34.2,35.5,35.2,34.5,42.2,24.2,27.6
NRC,14,31,60,113,145,203,163,112,114,290,...,24.8,21.3,21.9,19.8,23.8,24.5,18.5,16.6,19.8,19.2
Trouw,9,12,22,40,64,88,69,71,65,201,...,9.6,7.8,7.8,8.7,10.3,10.4,11.7,9.4,13.7,10.6
de Volkskrant,20,21,41,109,127,132,108,97,114,248,...,16.8,14.5,21.2,17.3,15.5,16.2,16.0,16.6,16.9,11.0
Total,62,125,282,515,732,853,665,605,688,1466,...,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0
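
The regex above keeps only the year. If full dates are needed later, the Dutch date strings shown in the `Publication Date` column (for example `7 december 2023 donderdag`) could be parsed along these lines. This is a minimal sketch: `parse_dutch_date` and `DUTCH_MONTHS` are hypothetical names, and the pattern assumes the day, month name and year always appear in that order.

```python
import re
from datetime import date

# Dutch month names mapped to month numbers
DUTCH_MONTHS = {
    "januari": 1, "februari": 2, "maart": 3, "april": 4,
    "mei": 5, "juni": 6, "juli": 7, "augustus": 8,
    "september": 9, "oktober": 10, "november": 11, "december": 12,
}

# Day, month name, four-digit year; weekday and time suffixes are ignored
DATE_PATTERN = re.compile(r"\b(\d{1,2}) ([a-z]+) ((?:19|20)\d{2})\b", re.IGNORECASE)


def parse_dutch_date(text):
    """Return a datetime.date for strings like '7 december 2023 donderdag', else None."""
    if not isinstance(text, str):
        return None  # e.g. NaN from an empty Excel cell
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = DUTCH_MONTHS.get(month_name.lower())
    if month is None:
        return None  # month name not recognised
    try:
        return date(int(year), month, int(day))
    except ValueError:
        return None  # impossible calendar date such as day 31 in a 30-day month


print(parse_dutch_date("7 december 2023 donderdag"))
print(parse_dutch_date("14 oktober 2017 zaterdag 12:00 AM GMT"))
```

Applied with `data['Publication Date'].apply(parse_dutch_date)`, this would give one date per row; rows that return `None` are worth inspecting by hand before any time-based analysis.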


In [19]:
# Save the combined table to an Excel file with the index
output_path = '/Users/helgegeurtjacobusmoes/Desktop/thesis data/combined_table_with_totals.xlsx'
combined_table_with_totals.to_excel(output_path, index=True)

print("Table saved to:", output_path)

Table saved to: /Users/helgegeurtjacobusmoes/Desktop/thesis data/combined_table_with_totals.xlsx
