ScienceDirect PDF Extraction and Download Script
1. Purpose
This script automates the process of:
Collecting article titles, Scopus IDs, and ScienceDirect links from multiple text/JSON files.
Cleaning and normalizing those links to canonical ScienceDirect article URLs.
Comparing extracted entries with an Excel file of SCOPUS IDs to remove unwanted records.
Automating the download of PDFs from ScienceDirect using AppleScript on macOS with Safari.
2. Library Imports
The script imports:
requests, concurrent.futures – for HTTP requests and parallel processing.
pandas, numpy, PyPDF2, tqdm – for data handling, PDF reading, and progress bars.
selenium, undetected_chromedriver – for potential browser automation.
subprocess, os, time, re, ast, OrderedDict – for system calls, timing, regex, and ordered dictionaries.
3. Reading and Cleaning Input Files
get_pdf_links_with_titles(file_path) reads either:
Plain text files with lines of ScopusID, Title, Link
Or JSON files with {"title": {"scopus_id":…, "link":…}}
Extracts the Scopus ID, article title, and ScienceDirect link.
Normalizes each link to the canonical form:
https://www.sciencedirect.com/science/article/pii/<PII>
Returns a dictionary:
{
    "Article Title": {
        "scopus_id": "SCOPUS_ID",
        "link": "ScienceDirect URL"
    },
    ...
}
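The normalization step can be sketched on its own. This is a minimal illustration, not the notebook's exact code: the regex is the one used in `get_pdf_links_with_titles`, while the helper name `normalize_sciencedirect_link` and the sample query string are assumptions made for the example. Unlike the original function, which keeps the unmodified link when no PII is found, this sketch returns `None` so the caller can decide what to do.

```python
import re
from typing import Optional

# Same pattern the notebook uses to pull the PII out of an article URL.
PII_PATTERN = re.compile(r"/article/pii/([A-Za-z0-9]+)")


def normalize_sciencedirect_link(link: str) -> Optional[str]:
    """Return the canonical article URL, or None if the link has no PII."""
    match = PII_PATTERN.search(link)
    if match is None:
        return None
    return f"https://www.sciencedirect.com/science/article/pii/{match.group(1)}"


# Illustrative input: the trailing path and query string are made up.
print(normalize_sciencedirect_link(
    "https://www.sciencedirect.com/science/article/pii/S0360544224033474/pdfft?x=1"
))
```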
4. Aggregating Multiple Files
A list of input files (retrieved_links_140000–202947.txt) is iterated over.
Each file is parsed into a dictionary and merged into all_pdfs.
The total number of unique PDF/article entries is printed.
5. Removing Unwanted SCOPUS IDs
Loads Remove_Scopus.xlsx into a pandas DataFrame.
Extracts SCOPUS IDs from the second column using a regex (SCOPUS_ID:(\d+)).
Compares these IDs with the batch of articles to flag and optionally remove overlaps.
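The notebook only reports overlaps; the "optionally remove" part could look roughly like the sketch below. It uses plain Python instead of pandas so it runs anywhere. The helper names `extract_scopus_ids` and `drop_excluded` are hypothetical, and the sample cells stand in for the second column of Remove_Scopus.xlsx.

```python
import re

SCOPUS_ID_PATTERN = re.compile(r"SCOPUS_ID:(\d+)")


def extract_scopus_ids(cells):
    """Collect the numeric IDs from cells shaped like 'SCOPUS_ID:<digits>'."""
    ids = set()
    for cell in cells:
        match = SCOPUS_ID_PATTERN.search(str(cell))  # str() also copes with NaN cells
        if match:
            ids.add(match.group(1))
    return ids


def drop_excluded(entries, excluded_ids):
    """Return a copy of entries without the records whose scopus_id is excluded."""
    return {
        title: info
        for title, info in entries.items()
        if info["scopus_id"] not in excluded_ids
    }


# Stand-in data: two entries in the notebook's dictionary shape.
entries = {
    "Paper A": {"scopus_id": "85196938769",
                "link": "https://www.sciencedirect.com/science/article/pii/S0378779624005352"},
    "Paper B": {"scopus_id": "85196934728",
                "link": "https://www.sciencedirect.com/science/article/pii/S0016236124014030"},
}
excluded = extract_scopus_ids(["SCOPUS_ID:85196938769", "n/a", float("nan")])
kept = drop_excluded(entries, excluded)
print(sorted(kept))
```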
6. Slicing Data into Batches
slice_dict(d, start, end) is a helper to process large dictionaries in smaller chunks (e.g., 2,000 entries at a time).
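`slice_dict` copies all items into a list on every call. If every batch needs to be visited in turn, a generator is one alternative. The sketch below is an assumption about how that could be done, not part of the notebook; `iter_batches` is a hypothetical name.

```python
from itertools import islice


def iter_batches(d, batch_size):
    """Yield successive dictionaries of at most batch_size items from d."""
    items = iter(d.items())
    while True:
        batch = dict(islice(items, batch_size))
        if not batch:
            return
        yield batch


demo = {f"title-{i}": {"scopus_id": str(i)} for i in range(5)}
sizes = [len(batch) for batch in iter_batches(demo, 2)]
print(sizes)  # [2, 2, 1]
```

With the real data, `iter_batches(all_pdfs, 2000)` would produce the same 2,000-entry chunks that `slice_dict(all_pdfs, 0, 2000)`, `slice_dict(all_pdfs, 2000, 4000)`, and so on return.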
7. Safari Automation (macOS Only)
The script uses AppleScript via subprocess calls to automate Safari:
science_direct_open_pdf_in_safari_and_save(pdf_url)
Opens a ScienceDirect article in Safari, waits, clicks inside the page, and triggers the PDF download button using cliclick.
clear_history()
Clears Safari history periodically to prevent performance issues or popups.
new_tab()
Opens new Safari tabs as needed for large batches.
This approach bypasses login barriers if you’re already authenticated in Safari with your institution credentials.
8. File Renaming and Organization
After each download, the script:
Checks the Downloads folder for newly added PDFs within the last 10 seconds.
Cleans the article title for safe filenames (removes special characters).
Renames the file as Title-ScopusID.pdf.
Falls back to ScopusID.pdf if the title is too long for macOS.
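The naming rule can be isolated in a small helper. This is a sketch under stated assumptions: the two regular expressions are the ones in the download loop, but `build_pdf_name` is a hypothetical function, and the 255-byte limit is an assumed filename cap checked up front, whereas the notebook attempts the rename and falls back only when the OS reports that the name is too long.

```python
import re


def build_pdf_name(title, scopus_id, max_bytes=255):
    """Build 'Title-ScopusID.pdf', or 'ScopusID.pdf' if that would be too long."""
    cleaned = re.sub(r"</?inf>", "", title)                   # drop <inf> markup
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", "", cleaned).strip()  # keep letters, digits, spaces
    candidate = f"{cleaned}-{scopus_id}.pdf"
    if len(candidate.encode("utf-8")) > max_bytes:            # assumed limit, see lead-in
        return f"{scopus_id}.pdf"
    return candidate


name = build_pdf_name(
    "Fabrication of yttrium-doped Li<inf>4</inf>SiO<inf>4</inf> sorbents",
    "85196934728",
)
print(name)  # Fabrication of yttriumdoped Li4SiO4 sorbents-85196934728.pdf
```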
9. Handling Missing Entries
Compares SCOPUS IDs in the batch vs. already-downloaded PDFs in the output folder.
Builds a missing_entries dict of articles still needing download.
Loops through these and downloads them via Safari automation.


10. Workflow Summary
Input: Text/JSON files of ScienceDirect links and an Excel file of SCOPUS IDs to exclude.
Parsing: Extract titles, SCOPUS IDs, and links; normalize URLs.
Filtering: Remove unwanted SCOPUS IDs.
Batching: Slice into manageable sets of 2,000 entries.
Download Automation:
Open article link in Safari.
Click to download PDF.
Rename and save into a target folder.
Error Handling: Skip failures and continue; print issues for later review.


11. Key Advantages
Scales: Handles hundreds of thousands of entries by batching.
Institution Auth: Leverages Safari’s existing login session instead of handling cookies manually.
Safe Filenames: Cleans and renames PDFs consistently.
Recovery: Tracks missing SCOPUS IDs for re-runs.

### Report on PDF Link Extraction and Automated Download Pipeline

#### Overview
This pipeline is designed to process large batches of metadata files containing Scopus IDs, article titles, and URLs (primarily from ScienceDirect), extract valid PDF links, filter and clean the data, and automate the downloading and renaming of PDF files using Safari on macOS. It also includes mechanisms to handle missing entries, avoid duplicates, and manage browser history to ensure smooth operation.

---

#### 1. **PDF Link Extraction from Text Files**

- **Function:** `get_pdf_links_with_titles(file_path)`
- **Purpose:** Reads a text file where each line contains a Scopus ID, article title, and a URL separated by commas.
- **Alternate Function:** A variant of `get_pdf_links_with_titles` reads a JSON file, keeps the entries whose link contains `pdf.sciencedirectassets.com`, and fixes those links accordingly. It was written for a few files that failed in the first pass and can be merged into the main function if needed.
- **Process:**
  - Skips lines with insufficient parts.
  - Extracts Scopus ID, title, and link.
  - Filters links to only those containing `www.sciencedirect.com`.
  - Extracts the PII (Publisher Item Identifier) from the URL using regex and reconstructs a clean ScienceDirect article URL.
  - Stores entries in a dictionary keyed by the cleaned title, with values containing Scopus ID and link.
- **Batch Processing:** Multiple files are processed sequentially, and their dictionaries merged into a master dictionary `all_pdfs`.
- **Outcome:** A consolidated dictionary of unique PDF entries keyed by title.

---

#### 3. **Data Slicing and Cleaning**

- **Function:** `slice_dict(d, start, end)` slices the dictionary to process manageable batches (e.g., first 2000 entries).
- **Purpose:** Enables incremental processing and error isolation.
- **Data Cleaning:** 
  - Reads an Excel file (`Remove_Scopus.xlsx`) containing Scopus IDs to exclude.
  - Extracts IDs using regex and compares with batch IDs.
  - Reports any overlaps to avoid processing unwanted or invalid entries.

---

#### 4. **Automated PDF Downloading Using Safari and AppleScript**

- **Functions:**
  - `science_direct_open_pdf_in_safari_and_save(pdf_url)`: Automates Safari to open a PDF URL, simulate mouse clicks to trigger download, and close tabs.
  - `clear_history()`: Clears Safari’s browsing history and cache to prevent interference or rate limiting.
  - `new_tab()`: Opens new tabs in Safari to manage multiple downloads.
- **Implementation Details:**
  - Uses AppleScript executed via `osascript` to control Safari and system events.
  - Coordinates mouse clicks at specific screen coordinates using `cliclick` (a command-line tool for mouse control).
  - Includes delays to allow page loading and download completion.
- **Platform:** macOS-specific automation leveraging Safari and system scripting.
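The notebook interpolates the URL straight into the AppleScript source with an f-string, so a URL containing a double quote or backslash would break the script. The sketch below shows one way to quote the value first. `applescript_string`, `build_open_script`, and `run_applescript` are hypothetical helpers, and the 60-second timeout is an arbitrary assumption. Only the string-building part runs here, since `osascript` exists only on macOS.

```python
import subprocess


def applescript_string(value: str) -> str:
    """Quote a Python string as an AppleScript string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_open_script(pdf_url: str, load_delay: int = 3) -> str:
    """Build the AppleScript that opens pdf_url in Safari and waits for it to load."""
    return "\n".join([
        'tell application "Safari"',
        "    activate",
        f"    open location {applescript_string(pdf_url)}",
        f"    delay {load_delay}",
        "end tell",
    ])


def run_applescript(script: str, timeout: int = 60) -> subprocess.CompletedProcess:
    """Run a script with osascript (macOS only); raises if it fails or hangs."""
    return subprocess.run(
        ["osascript", "-e", script],
        text=True, capture_output=True, timeout=timeout, check=True,
    )


script = build_open_script("https://www.sciencedirect.com/science/article/pii/S0360544224033474")
print(script)
```

Capturing the output and using `check=True` makes a failed Safari call raise an exception that the download loop's `except` clause can log, instead of failing silently.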

---

#### 5. **Handling Missing Downloads and File Management**

- **Process:**
  - Compares Scopus IDs in the batch with those already downloaded (by parsing filenames in the target folder).
  - Identifies missing entries that need to be downloaded.
- **Download Loop:**
  - Iterates over missing entries.
  - Periodically clears browser history and opens new tabs to maintain performance and avoid blocking.
  - Downloads PDFs via the automated Safari script.
  - Waits briefly to ensure download completion.
- **File Renaming:**
  - Renames downloaded PDFs using a sanitized version of the article title combined with the Scopus ID.
  - Handles filename length issues by falling back to using only the Scopus ID.
- **Error Handling:**
  - Catches and logs exceptions during download and renaming to continue processing remaining entries.
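The comparison between the batch and the output folder can be expressed as two small functions. This is a sketch, not the notebook's code: `scopus_id_from_filename` and `find_missing` are hypothetical names, and unlike the original loop this version ignores anything that is not a `.pdf` file.

```python
import os


def scopus_id_from_filename(filename):
    """Return the Scopus ID from 'Title-<id>.pdf' or '<id>.pdf', else None."""
    stem, ext = os.path.splitext(filename)
    if ext.lower() != ".pdf":
        return None
    candidate = stem.rsplit("-", 1)[-1]  # whole stem when there is no hyphen
    return candidate if candidate.isdigit() else None


def find_missing(batch, filenames):
    """Return the batch entries whose Scopus ID has no matching file."""
    downloaded = {sid for sid in map(scopus_id_from_filename, filenames) if sid}
    return {
        title: info
        for title, info in batch.items()
        if info["scopus_id"] not in downloaded
    }


batch = {
    "Paper A": {"scopus_id": "85196938769"},
    "Paper B": {"scopus_id": "85196934728"},
    "Paper C": {"scopus_id": "85196830792"},
}
filenames = ["Paper A-85196938769.pdf", "85196830792.pdf", ".DS_Store"]
missing = find_missing(batch, filenames)
print(list(missing))  # ['Paper B']
```

In the notebook, `filenames` would come from `os.listdir(folder_path)`.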

---

#### Strengths of the Pipeline

- **Comprehensive Data Handling:** Supports both text and JSON input formats, enabling flexibility.
- **Batch Processing:** Efficiently manages large datasets by slicing and incremental processing.
- **Automated Browser Control:** Uses AppleScript and system tools to automate downloads without manual intervention.
- **Duplicate and Error Management:** Checks for existing files to avoid redundant downloads and handles errors gracefully.
- **Data Cleaning and Validation:** Cleans titles and URLs to maintain consistency and prevent errors.

---

#### Potential Areas for Improvement

- **Hardcoded Screen Coordinates:** The AppleScript relies on fixed mouse click positions, which may break if UI elements or the screen resolution change. Consider using UI scripting with element references or browser automation tools like Selenium for more robust control.
- **Platform Dependency:** The current automation is macOS and Safari-specific. For cross-platform compatibility, consider browser automation frameworks.
- **Download Verification:** Currently, the script assumes a PDF is downloaded if a new file appears in the Downloads folder. Adding checksum or file size verification could improve reliability.
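A lightweight version of the download verification suggested above could inspect the file itself before renaming it. The sketch relies on two facts about the PDF format: files start with `%PDF-` and end with an `%%EOF` marker. The function name `looks_like_complete_pdf`, the 1 KiB minimum size, and the 1 KiB tail window are assumptions chosen for illustration.

```python
import os


def looks_like_complete_pdf(path, min_bytes=1024):
    """Heuristic check: plausible size, '%PDF-' header, '%%EOF' near the end."""
    try:
        size = os.path.getsize(path)
        if size < min_bytes:
            return False
        with open(path, "rb") as fh:
            header = fh.read(5)
            fh.seek(max(size - 1024, 0))  # only inspect the last 1 KiB
            tail = fh.read()
    except OSError:
        return False  # missing or unreadable file
    return header == b"%PDF-" and b"%%EOF" in tail
```

In the download loop, a file that fails this check (for example an HTML error page saved with a `.pdf` extension, or a partially written download) could be skipped so that its Scopus ID stays in `missing_entries` for the next run.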

---

#### Summary

This pipeline automates the extraction, filtering, and downloading of academic PDFs from ScienceDirect using Scopus IDs and article metadata. It combines data parsing, batch processing, and macOS automation to handle large-scale document retrieval with minimal manual effort. With some enhancements in robustness and scalability, it can serve as a powerful tool for researchers and librarians managing extensive digital collections.


In [2]:
# Standard library
import ast
import os
import re
import subprocess
import time
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor

# Third-party packages
import numpy as np
import pandas as pd
import PyPDF2
import requests
import undetected_chromedriver as uc
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from tqdm import tqdm
from webdriver_manager.chrome import ChromeDriverManager


In [3]:
# Example of a long-form pdfft URL (session-specific query string truncated here):
# https://www.sciencedirect.com/b/s/science/article/pii/S0925838824033462/pdfft?brr=93f2d3f3bcdcf00f&crasolve=1&...

In [4]:
# Example of a signed pdf.sciencedirectassets.com URL (session-specific query string truncated here; it includes pii=S0360544224033462 and X-Amz-Expires=300):
# https://pdf.sciencedirectassets.com/271090/1-s2.0-S0360544224X00195/1-s2.0-S0360544224033462/main.pdf?X-Amz-Security-Token=...

In [5]:
# https://www.sciencedirect.com/science/article/pii/S0360544224033474

In [3]:
def get_pdf_links_with_titles(file_path):
    pdf_dict = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line.count(',') < 2:
                continue  # not enough parts, skip
            
            parts = line.split(',')
            scopus_id = parts[0].strip()
            link = parts[-1].strip() if parts[-1].strip() else None
            title = ','.join(parts[1:-1]).strip()  # join all parts between scopus_id and link
            
            if link and 'www.sciencedirect.com' in link.lower():
                m = re.search(r'/article/pii/([A-Za-z0-9]+)', link)
                if m:
                    pii = m.group(1)
                    link = f'https://www.sciencedirect.com/science/article/pii/{pii}'
                # Save with title as key
                pdf_dict[title] = {'scopus_id': scopus_id, 'link': link}
    
    return pdf_dict

if __name__ == "__main__":
    all_pdfs = {}
    
    file_paths = [
        './Downloads/retrieved_links_new/retrieved_links_140000-145000.txt',
        './Downloads/retrieved_links_new/retrieved_links_145000-150000.txt',
        './Downloads/retrieved_links_new/retrieved_links_150000-155000.txt',
        './Downloads/retrieved_links_new/retrieved_links_155000-160000.txt',
        './Downloads/retrieved_links_new/retrieved_links_160000-165000.txt',
        './Downloads/retrieved_links_new/retrieved_links_165000-170000.txt',
        './Downloads/retrieved_links_new/retrieved_links_170000-175000.txt',
        './Downloads/retrieved_links_new/retrieved_links_175000-180000.txt',
        './Downloads/retrieved_links_new/retrieved_links_180000-185000.txt',
        './Downloads/retrieved_links_new/retrieved_links_185000-190000.txt',
        './Downloads/retrieved_links_new/retrieved_links_190000-195000.txt',
        './Downloads/retrieved_links_new/retrieved_links_195000-200000.txt',
        './Downloads/retrieved_links_new/retrieved_links_200000-202947.txt'
    ]
    
    for path in file_paths:
        pdf_dict = get_pdf_links_with_titles(path)
        all_pdfs.update(pdf_dict)  # Merges new dict into main one

    print(f"Total unique PDF entries: {len(all_pdfs)}")

Total unique PDF entries: 7362


In [5]:
import re
import json

def get_pdf_links_with_titles(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)  # Load entire JSON as Python dict
    
    pdf_dict = {}

    for title, entry in data.items():
        link = entry.get("link", "")
        scopus_id = entry.get("scopus_id", "")

        if "pdf.sciencedirectassets.com" in link.lower():
            # Optionally fix the link
            link = link.replace('/b/', '/')
            pdf_dict[title] = {
                "scopus_id": scopus_id,
                "link": link
            }

    return pdf_dict



if __name__ == "__main__":
    all_pdfs = {}

    file_paths = [
        './Downloads/missing_entries_sciencedirect_1Jul.txt'
    ]
    
    for path in file_paths:
        pdf_dict = get_pdf_links_with_titles(path)
        all_pdfs.update(pdf_dict)  # Merges new dict into main one

    print(f"Total unique PDF entries: {len(all_pdfs)}")

Total unique PDF entries: 1624


In [6]:
def slice_dict(d, start, end):
    return dict(list(d.items())[start:end])

In [7]:
# A few entries have NaN DOIs; if their IDs are found in Remove_Scopus.xlsx, those IDs should be removed.

batch = slice_dict(all_pdfs, 0, 2000)
        
all_scopus_ids_in_data = {info['scopus_id'] for info in batch.values()}

remove_df = pd.read_excel('/Users/dtuser/Downloads/Remove_Scopus.xlsx')

# Extract SCOPUS IDs from the second column of the sheet (selected by position, not by column name)
remove_scopus_ids = set(remove_df.iloc[:, 1].astype(str).str.extract(r'SCOPUS_ID:(\d+)', expand=False).dropna())


# Check for overlap
overlap_ids = all_scopus_ids_in_data.intersection(remove_scopus_ids)

# Report result
if not overlap_ids:
    print("✅ No SCOPUS IDs in the batch are present in the Excel file.")
else:
    print(f"❌ {len(overlap_ids)} SCOPUS IDs found in both batch and Excel file.")
    print("Conflicting IDs:", overlap_ids)

✅ No SCOPUS IDs in the batch are present in the Excel file.


In [8]:
def science_direct_open_pdf_in_safari_and_save(pdf_url):
    apple_script = f"""

    tell application "Safari"
    activate  -- Brings Safari to the front
    open location "{pdf_url}"
    delay 3 -- Wait for the page to load and for the PDF to render
    end tell
    
    tell application "System Events"
        tell process "Safari"
            set frontmost to true  -- Ensure Safari remains the active window
            delay 1
        end tell
    
        -- Click inside Safari to open the PDF
        do shell script "cliclick c:448,188"
        do shell script "cliclick c:445,153"
        delay 1
    
        -- Click the download button inside Safari
        delay 5
        do shell script "cliclick c:799,829"
        delay 1
    end tell
    
    tell application "Safari"
        activate
        tell front window
            close current tab
        end tell
    end tell
    
    tell application "Safari"
        activate
        tell front window
            close current tab
        end tell
    end tell
    """

    subprocess.run(["osascript", "-e", apple_script], text=True)

In [9]:
def clear_history():
    apple_script = """
    tell application "Safari"
        activate  -- Bring Safari to the front
    end tell

    delay 2  -- Give Safari time to come forward

    tell application "System Events"
        tell process "Safari"
            set frontmost to true  -- Ensure Safari remains the active window
            delay 1
        end tell

        -- Click at position 1
        do shell script "cliclick c:73,1"
        delay 1.5

        -- Click at position 2
        do shell script "cliclick c:115,196"
        delay 1.5

        -- Click at position 3
        do shell script "cliclick c:865,509"
        do shell script "cliclick c:865,509"
        do shell script "cliclick c:846,350"
        do shell script "cliclick c:846,350"
        delay 3
    end tell
    """

    import subprocess
    subprocess.run(["osascript", "-e", apple_script], text=True)


In [10]:
def new_tab():
    apple_script = """
    tell application "Safari"
        activate
        if not (exists window 1) then
            make new document
        else
            tell window 1
                make new tab
            end tell
        end if
    end tell
    """
    import subprocess
    subprocess.run(["osascript", "-e", apple_script], text=True)

In [51]:
clear_history()

In [9]:
next((i for i, (title, info) in enumerate(all_pdfs.items()) if info.get('scopus_id') == '85197488848'), None)

3325

In [13]:
os.mkdir('/Users/dtuser/Downloads/sciencedirect_pdfs_28Jun-b2/')

In [14]:
# Some entries may be left out on the first try, so rerun this cell for the same batch.

batch = slice_dict(all_pdfs, 0, 2000)

folder_path = '/Users/dtuser/Downloads/sciencedirect_pdfs_28Jun-b2/'

scopus_ids_in_folder = set()
for filename in os.listdir(folder_path):
    if '-' in filename:
        # "<title>-<scopus_id>.pdf": the ID is the text after the last hyphen
        scopus_ids_in_folder.add(filename.rsplit('-', 1)[-1].split('.')[0])
    elif filename[:-4].isdigit():
        # Fallback name "<scopus_id>.pdf"
        scopus_ids_in_folder.add(filename[:-4])
        
all_scopus_ids_in_data = {info['scopus_id'] for info in batch.values()}

missing_scopus_ids = all_scopus_ids_in_data - scopus_ids_in_folder

missing_entries = {title: info for title, info in batch.items() if info['scopus_id'] in missing_scopus_ids}
len(scopus_ids_in_folder), len(missing_entries)


(0, 1624)

In [18]:
new_tab()
new_tab()

tab 5 of window id 147
tab 6 of window id 147


In [15]:
import errno

download_path = '/Users/dtuser/Downloads/sciencedirect_pdfs_28Jun-b2/'
downloads_dir = './Downloads/'  # folder Safari saves into, relative to the working directory
count = 0
clear_history()
for title, info in missing_entries.items():
    count += 1
    if count % 25 == 0:
        clear_history()
    if count % 30 == 0:
        for _ in range(4):
            new_tab()
    link = info['link']
    scopus_id = info['scopus_id']
    try:
        science_direct_open_pdf_in_safari_and_save(link)
        time.sleep(2)
        # PDFs that appeared in the Downloads folder within the last 10 seconds
        current_time = time.time()
        recent_files = [
            os.path.join(downloads_dir, f)
            for f in os.listdir(downloads_dir)
            if f.endswith(".pdf")
            and current_time - os.path.getctime(os.path.join(downloads_dir, f)) <= 10
        ]
        if not recent_files:
            continue  # nothing was downloaded for this entry

        # Build a filesystem-safe name: strip <inf> tags and special characters
        safe_title = re.sub(r'</?inf>', '', title)
        safe_title = re.sub(r'[^A-Za-z0-9 ]+', '', safe_title)
        safe_title = safe_title.strip() + '-' + scopus_id
        new_name = os.path.join(download_path, f"{safe_title}.pdf")  # Rename with title
        latest_file = max(recent_files, key=os.path.getctime)
        try:
            os.rename(latest_file, new_name)
            print(f"📂 Renamed file to: {new_name}")
        except OSError as e:
            if e.errno == errno.ENAMETOOLONG:  # errno 63 on macOS
                fallback_name = os.path.join(download_path, f"{scopus_id}.pdf")
                os.rename(latest_file, fallback_name)
                print(f"⚠️ Filename too long, saved as: {fallback_name}")
            else:
                raise  # Reraise other unexpected errors
    except Exception as e:
        print(f"❌ Error processing ScienceDirect PDF: {e}")


KeyboardInterrupt: 

In [85]:
batch

{'Optimal allocation of renewable energy systems in a weak distribution network': {'link': 'https://www.sciencedirect.com/science/article/pii/S0378779624005352',
  'scopus_id': '85196938769'},
 'Fabrication of yttriumdoped Li4SiO4 sorbents for CO2 capture and solar energy storage': {'link': 'https://www.sciencedirect.com/science/article/pii/S0016236124014030',
  'scopus_id': '85196934728'},
 'Infrared nanoimaging and nanospectroscopy of electrochemical energy storage materials and interfaces': {'link': 'https://www.sciencedirect.com/science/article/pii/S2451910324001091',
  'scopus_id': '85196830792'},
 'Generative literature analysis on the rise of prosumers and their influence on the sustainable energy transition': {'link': 'https://www.sciencedirect.com/science/article/pii/S0957178724000924',
  'scopus_id': '85196806789'},
 'Tool for optimization of sale and storage of energy in wind farms': {'link': 'https://www.sciencedirect.com/science/article/pii/S0378475423001167',
  'scopus_id