ALZHEIMER'S DISEASE BIOMARKER OBSERVATORY {BMO}:

BIOMARKER DATA CAPTURE [PART 2]

PROFESSOR: JORGE FONSECA

STUDENTS: KAJAL DHIMMAR AND SHUBHAM KOLASE

CONTACT INFORMATION OF KAJAL DHIMMAR

EMAIL: dhimmar@unlv.nevada.edu
PHONE NUMBER: 385-489-4611

PART 2: 

Explanation of the Script

This notebook searches PubMed for articles related to Alzheimer's disease, collects the PubMed Central (PMC) IDs of the articles that have one, and builds the full-text URL for each. It then experiments with retrieving and processing the full text itself, first as HTML and then as PDF. Each component is explained below.

Here is an overall summary of the script:

Purpose:
-------
The script aims to automate the retrieval of full-text articles from PMC and process these articles for further use. This is useful in research projects where accessing and analyzing numerous scientific articles is required.

Key Components:
Imports and Logging Configuration:

Imports: The script imports requests for HTTP requests, xml.etree.ElementTree for XML parsing, pandas for saving results to CSV, time for pausing between requests, and logging for progress and error messages.
Logging Configuration: Basic configuration for logging is set up to capture and display log messages.

Function to Fetch Full Text (HTML) of an Article:
------------------------------------------------

1. Function: fetch_full_text_xml(pmc_id)
2. Description: Takes a PMC ID and fetches the corresponding article page from PMC. Despite the "xml" in its name, the function returns HTML.
3. Headers: A custom User-Agent header is sent so that the request resembles one from a web browser.
4. Response Handling: If the request succeeds, the HTML content is returned; otherwise an error is logged and None is returned.

Example Usage:
-------------

Demonstrates how to use the fetch_full_text_xml function by fetching the full text of an article with a specific PMC ID and printing the first 500 characters of the retrieved content.

Additional Functions (not fully displayed in the output):
---------------------------------------------------------

Functions for fetching article summaries, generating full-text URLs, and extracting text from PDFs are also included. PDF extraction has so far worked only for a single, manually selected article; extracting PDFs for the whole result set is still in progress.

Main Function:
--------------

- Function: main(api_key, query)
- Description: This function coordinates the overall process, including fetching article summaries, retrieving PMC IDs, generating full-text URLs, and optionally downloading and processing PDFs.
- Example Usage: Demonstrates how to call the main function with an API key and a query term (e.g., "Alzheimer's disease").


Workflow:
-------------
1. Fetch Article Summaries: Using an API key and a query term, fetch summaries of relevant articles.
2. Retrieve PMC IDs: Extract PMC IDs from the fetched summaries.
3. Generate Full-Text URLs: Create URLs for accessing the full text of the articles using the retrieved PMC IDs.
4. Fetch Full Text: For each URL, fetch the full text in HTML format.
5. Log Progress: Log each step’s progress and any errors encountered.

Usage:
------
Replace placeholders such as your_api_key_here and PMC11040515 with your own NCBI API key and the PMC IDs of the PubMed articles you want.
Run the main function with the appropriate arguments to start the process.

Applications:
-------------

Research projects requiring access to a large number of scientific articles.
Data analysis and extraction from full-text scientific articles for further research or publication.

This script provides a structured and automated way to access and process scientific literature.

1. Logging Configuration
The script sets up logging to display informational messages and errors, which helps in tracking the script's progress and identifying any issues.

requests is used for making HTTP requests.
logging is used for logging the progress and any issues encountered during the execution.

In [1]:
import requests
import xml.etree.ElementTree as ET
import pandas as pd
import time
import logging

# Setup basic configuration for logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def generate_pmc_url(pmc_id):
    """
    Generate a URL for the article's landing page on PubMed Central using the PMC ID.
    """
    base_url = "https://www.ncbi.nlm.nih.gov/pmc/articles/"
    article_url = f"{base_url}{pmc_id}/"
    return article_url

def search_article_ids(query, api_key, max_articles_per_query):
    search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    all_article_ids = []
    for start in range(0, max_articles_per_query, 100):
        params = {
            "db": "pubmed",
            "term": query,
            "retmode": "json",
            "api_key": api_key,
            "retstart": start,
            "retmax": 100
        }
        response = requests.get(search_url, params=params)
        if response.status_code == 200:
            data = response.json()
            article_ids = data["esearchresult"]["idlist"]
            if not article_ids:
                break
            all_article_ids.extend(article_ids)
            time.sleep(0.33)
        else:
            logging.error(f"Failed to search article IDs: {response.status_code}")
            break
    return all_article_ids

def fetch_batch_details(article_ids, api_key):
    fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    articles_info = []
    batch_size = 100
    for i in range(0, len(article_ids), batch_size):
        batch_ids = article_ids[i:i + batch_size]
        params = {
            "db": "pubmed",
            "retmode": "xml",
            "id": ",".join(batch_ids),
            "api_key": api_key
        }
        response = requests.get(fetch_url, params=params)
        if response.status_code == 200:
            articles_info.extend(parse_article_details(response.text))
            time.sleep(0.33)
        else:
            logging.error(f"Failed to fetch article details: {response.status_code}")
    return articles_info

def parse_article_details(xml_data):
    root = ET.fromstring(xml_data)
    articles_info = []

    for article in root.findall('.//PubmedArticle'):
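        title_elem = article.find('.//ArticleTitle')
        abstract_elem = article.find('.//Abstract/AbstractText')
        article_info = {
            # itertext() also captures text nested in inline markup; fall back if the element is missing
            'title': ''.join(title_elem.itertext()) if title_elem is not None else "No title",
            'abstract': ''.join(abstract_elem.itertext()) if abstract_elem is not None else "No abstract",
        }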
        pmc_id = article.find(".//ArticleIdList/ArticleId[@IdType='pmc']")
        if pmc_id is not None:
            article_info['pmc_id'] = pmc_id.text
            article_info['pmc_url'] = generate_pmc_url(pmc_id.text)
        else:
            article_info['pmc_id'] = "Not available"
            article_info['pmc_url'] = "URL not available"
        articles_info.append(article_info)
    return articles_info

def main(api_key, query, filename="URL.csv", max_articles=100):
    article_ids = search_article_ids(query, api_key, max_articles)
    if article_ids:
        articles_info = fetch_batch_details(article_ids, api_key)
        df = pd.DataFrame(articles_info)
        df.to_csv(filename, index=False)
        logging.info(f"Found and saved details for {len(df)} articles.")
    else:
        logging.info("No articles found.")

# Example usage parameters
api_key = "your_api_key_here"  # Replace with your own NCBI API key; do not share real keys in the notebook
query = "Alzheimer's disease"
main(api_key, query)


2024-07-17 01:14:57,085 - INFO - Found and saved details for 100 articles.


1. A URL is generated only when a PMC ID is found; the PMC ID and its URL are both included in the DataFrame that is saved to the CSV file.
2. If an article has no PMC ID, the script records "Not available" for the ID and "URL not available" for the URL.

In [1]:
import requests
import xml.etree.ElementTree as ET
import time
import logging

# Setup basic configuration for logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def generate_pmc_url(pmc_id):
    """
    Generate a URL for the article's landing page on PubMed Central using the PMC ID.
    """
    base_url = "https://www.ncbi.nlm.nih.gov/pmc/articles/"
    article_url = f"{base_url}{pmc_id}/"
    return article_url

def search_article_ids(query, api_key, max_articles_per_query):
    search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    all_article_ids = []
    for start in range(0, max_articles_per_query, 100):
        params = {
            "db": "pubmed",
            "term": query,
            "retmode": "json",
            "api_key": api_key,
            "retstart": start,
            "retmax": 100
        }
        response = requests.get(search_url, params=params)
        if response.status_code == 200:
            data = response.json()
            article_ids = data["esearchresult"]["idlist"]
            if not article_ids:
                break
            all_article_ids.extend(article_ids)
            time.sleep(0.33)
        else:
            logging.error(f"Failed to search article IDs: {response.status_code}")
            break
    return all_article_ids

def fetch_batch_details(article_ids, api_key):
    fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    articles_info = []
    batch_size = 100
    for i in range(0, len(article_ids), batch_size):
        batch_ids = article_ids[i:i + batch_size]
        params = {
            "db": "pubmed",
            "retmode": "xml",
            "id": ",".join(batch_ids),
            "api_key": api_key
        }
        response = requests.get(fetch_url, params=params)
        if response.status_code == 200:
            articles_info.extend(parse_article_details(response.text))
            time.sleep(0.33)
        else:
            logging.error(f"Failed to fetch article details: {response.status_code}")
    return articles_info

def parse_article_details(xml_data):
    root = ET.fromstring(xml_data)
    urls = []

    for article in root.findall('.//PubmedArticle'):
        pmc_id = article.find(".//ArticleIdList/ArticleId[@IdType='pmc']")
        if pmc_id is not None:
            pmc_url = generate_pmc_url(pmc_id.text)
            urls.append(pmc_url)
    return urls

def main(api_key, query, max_articles=100):
    article_ids = search_article_ids(query, api_key, max_articles)
    if article_ids:
        urls = fetch_batch_details(article_ids, api_key)
        for url in urls:
            print(url)
        logging.info(f"Generated URLs for {len(urls)} articles.")
    else:
        logging.info("No articles found.")

# Example usage parameters
api_key = ""  # Fill with your actual API key
query = "Alzheimer's disease"
main(api_key, query)


2024-06-05 15:37:59,068 - INFO - Generated URLs for 59 articles.


https://www.ncbi.nlm.nih.gov/pmc/articles/10313141/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11148533/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11148366/
https://www.ncbi.nlm.nih.gov/pmc/articles/8507321/
https://www.ncbi.nlm.nih.gov/pmc/articles/7380073/
https://www.ncbi.nlm.nih.gov/pmc/articles/6279593/
https://www.ncbi.nlm.nih.gov/pmc/articles/5994507/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11149323/
https://www.ncbi.nlm.nih.gov/pmc/articles/2430603/
https://www.ncbi.nlm.nih.gov/pmc/articles/5106496/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11149856/
https://www.ncbi.nlm.nih.gov/pmc/articles/6073093/
https://www.ncbi.nlm.nih.gov/pmc/articles/5058336/
https://www.ncbi.nlm.nih.gov/pmc/articles/9502483/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11146203/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11146249/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11144909/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11145550/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1114
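
The listing above mixes IDs with and without the "PMC" prefix (for example 10313141 and PMC11148533), because the URL is built from whatever text the `pmc` ArticleId element contains. Below is a minimal sketch of a helper that normalizes the ID before the URL is built. It assumes that a bare numeric ID and its PMC-prefixed form refer to the same article; the names normalize_pmc_id and generate_normalized_pmc_url are hypothetical and not part of the original script.

```python
import re

PMC_BASE_URL = "https://www.ncbi.nlm.nih.gov/pmc/articles/"

def normalize_pmc_id(raw_id):
    """Return the ID in 'PMC<digits>' form, or None if it does not look like a PMC ID."""
    match = re.fullmatch(r"(?:PMC)?(\d+)", raw_id.strip(), flags=re.IGNORECASE)
    if match is None:
        return None
    return f"PMC{match.group(1)}"

def generate_normalized_pmc_url(raw_id):
    """Build the article URL from a normalized PMC ID; return None for unusable IDs."""
    pmc_id = normalize_pmc_id(raw_id)
    if pmc_id is None:
        return None
    return f"{PMC_BASE_URL}{pmc_id}/"

# Both spellings of an ID now produce the same style of URL
print(generate_normalized_pmc_url("10313141"))
print(generate_normalized_pmc_url("PMC11148533"))
```

Returning None for anything that is not "PMC" plus digits keeps malformed IDs out of the URL list instead of silently producing broken links.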

In [2]:
"""def fetch_full_text(pmc_id):
    
    #Fetch the full text of an article from PubMed Central using the PMC ID.
    
    full_text_url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/pdf/"
    response = requests.get(full_text_url)
    if response.status_code == 200:
        return response.content  # Returns the content of the PDF file
    else:
        logging.error(f"Failed to fetch full text for PMC ID {pmc_id}: {response.status_code}")
        return None

def parse_article_details(xml_data):
    root = ET.fromstring(xml_data)
    full_texts = []

    for article in root.findall('.//PubmedArticle'):
        pmc_id = article.find(".//ArticleIdList/ArticleId[@IdType='pmc']")
        if pmc_id is not None:
            full_text = fetch_full_text(pmc_id.text)
            if full_text:
                full_texts.append(full_text)  # Save or process the full text as needed
    return full_texts"""

In [4]:
"""def search_article_ids(query, api_key, max_articles_per_query):
    search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    all_article_ids = []
    for start in range(0, max_articles_per_query, 100):
        params = {
            "db": "pubmed",
            "term": query,
            "retmode": "json",
            "api_key": api_key,
            "retstart": start,
            "retmax": 100
        }
        response = requests.get(search_url, params=params)
        print("Search response:", response.json())  # Debugging line
        if response.status_code == 200:
            data = response.json()
            article_ids = data["esearchresult"]["idlist"]
            if not article_ids:
                break
            all_article_ids.extend(article_ids)
            time.sleep(0.33)
        else:
            logging.error(f"Failed to search article IDs: {response.status_code}")
            break
    return all_article_ids

# Add similar print statements in other functions to track the flow and data."""


In [6]:
"""def search_article_ids(query, api_key, max_articles_per_query):
    search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    all_article_ids = []
    for start in range(0, max_articles_per_query, 100):
        params = {
            "db": "pubmed",
            "term": query,
            "retmode": "json",
            "api_key": api_key,
            "retstart": start,
            "retmax": 100
        }
        response = requests.get(search_url, params=params)
        if response.status_code == 200:
            data = response.json()
            article_ids = data["esearchresult"]["idlist"]
            if not article_ids:
                break
            all_article_ids.extend(article_ids)
            time.sleep(0.33)
        else:
            logging.error(f"Failed to search article IDs: {response.status_code}")
            break
    return all_article_ids

def fetch_batch_details(article_ids, api_key):
    fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    articles_info = []
    batch_size = 100
    for i in range(0, len(article_ids), batch_size):
        batch_ids = article_ids[i:i + batch_size]
        params = {
            "db": "pubmed",
            "retmode": "xml",
            "id": ",".join(batch_ids),
            "api_key": api_key
        }
        response = requests.get(fetch_url, params=params)
        if response.status_code == 200:
            articles_info.extend(parse_article_details(response.text))
            time.sleep(0.33)
        else:
            logging.error(f"Failed to fetch article details: {response.status_code}")
    return articles_info"""


---------------------------------------------------------------------------------------------Full Text---------------------------------------------------------

First try: How the results are handled depends on whether they are URLs or the actual full-text data. PDF content is not useful when printed to the console, so saving it to a file is the better option.

The first version of the script added print statements and an option to save the PDFs. It did not work.

The requests for full-text PDFs from PubMed Central (PMC) failed with HTTP status codes 403 and 404. The 403 case is discussed after the output of the next cell.

Try 2: We modified the fetch_full_text function to request the article page instead of the PDF, hoping it would be less restricted. The intent was to get XML, but this URL serves the HTML article page.

In [7]:
import requests
import xml.etree.ElementTree as ET
import time
import logging

# Setup basic configuration for logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def generate_pmc_url(pmc_id):
    base_url = "https://www.ncbi.nlm.nih.gov/pmc/articles/"
    article_url = f"{base_url}{pmc_id}/"
    return article_url

def search_article_ids(query, api_key, max_articles_per_query):
    search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    all_article_ids = []
    for start in range(0, max_articles_per_query, 100):
        params = {
            "db": "pubmed",
            "term": query,
            "retmode": "json",
            "api_key": api_key,
            "retstart": start,
            "retmax": 100
        }
        response = requests.get(search_url, params=params)
        if response.status_code == 200:
            data = response.json()
            article_ids = data["esearchresult"]["idlist"]
            if not article_ids:
                break
            all_article_ids.extend(article_ids)
            time.sleep(0.33)
        else:
            logging.error(f"Failed to search article IDs: {response.status_code}")
            break
    return all_article_ids

def fetch_batch_details(article_ids, api_key):
    fetch_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    articles_info = []
    batch_size = 100
    for i in range(0, len(article_ids), batch_size):
        batch_ids = article_ids[i:i + batch_size]
        params = {
            "db": "pubmed",
            "retmode": "xml",
            "id": ",".join(batch_ids),
            "api_key": api_key
        }
        response = requests.get(fetch_url, params=params)
        if response.status_code == 200:
            articles_info.extend(parse_article_details(response.text))
            time.sleep(0.33)
        else:
            logging.error(f"Failed to fetch article details: {response.status_code}")
    return articles_info

def fetch_full_text(pmc_id):
    """
    Fetch the article page from PubMed Central using the PMC ID.
    Note: this URL serves the HTML article page, not XML.
    """
    full_text_url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/"
    response = requests.get(full_text_url)
    if response.status_code == 200:
        return response.text  # Returns the HTML content of the article page
    else:
        logging.error(f"Failed to fetch full text for PMC ID {pmc_id}: {response.status_code}")
        return None

def parse_article_details(xml_data):
    root = ET.fromstring(xml_data)
    full_texts = []

    for article in root.findall('.//PubmedArticle'):
        pmc_id = article.find(".//ArticleIdList/ArticleId[@IdType='pmc']")
        if pmc_id is not None:
            full_text = fetch_full_text(pmc_id.text)
            if full_text:
                full_texts.append(full_text)  # Save or process the full text as needed
    return full_texts

def main(api_key, query, max_articles=100):
    article_ids = search_article_ids(query, api_key, max_articles)
    if article_ids:
        # parse_article_details() in this cell returns page contents, not URLs
        full_texts = fetch_batch_details(article_ids, api_key)
        for full_text in full_texts:
            print(full_text[:500])  # Preview only; nothing is saved to disk here
        logging.info(f"Fetched full text for {len(full_texts)} articles.")
    else:
        logging.info("No articles found.")

# Example usage parameters
api_key = ""  # Fill with your actual API key
query = "Alzheimer's disease"
main(api_key, query)


2024-06-05 15:38:27,733 - ERROR - Failed to fetch full text for PMC ID 10313141: 403
2024-06-05 15:38:28,232 - ERROR - Failed to fetch full text for PMC ID PMC11148533: 403
2024-06-05 15:38:28,689 - ERROR - Failed to fetch full text for PMC ID PMC11148366: 403
2024-06-05 15:38:29,315 - ERROR - Failed to fetch full text for PMC ID 8507321: 403
2024-06-05 15:38:29,921 - ERROR - Failed to fetch full text for PMC ID 7380073: 403
2024-06-05 15:38:30,606 - ERROR - Failed to fetch full text for PMC ID 6279593: 403
2024-06-05 15:38:31,204 - ERROR - Failed to fetch full text for PMC ID 5994507: 403
2024-06-05 15:38:31,756 - ERROR - Failed to fetch full text for PMC ID PMC11149323: 403
2024-06-05 15:38:32,421 - ERROR - Failed to fetch full text for PMC ID 2430603: 403
2024-06-05 15:38:33,089 - ERROR - Failed to fetch full text for PMC ID 5106496: 403
2024-06-05 15:38:33,574 - ERROR - Failed to fetch full text for PMC ID PMC11149856: 403
2024-06-05 15:38:34,170 - ERROR - Failed to fetch full text

HTTP 403 (Forbidden): The server understood the request but refused to serve it. Possible reasons:

Permission Issues: Some articles may be restricted from direct download even though they have a PMC ID.
Access Control: PMC may block requests that look automated. The requests above were sent without a custom User-Agent header; the single-article attempt below adds a browser-like User-Agent and succeeds, which points to this cause.
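
Try 2 set out to fetch XML, but the URL it requests serves the HTML article page. One alternative the notebook does not try is the E-utilities efetch endpoint with db=pmc, which returns article XML for records whose full text is available that way. The sketch below is written under assumptions: the helper names build_pmc_efetch_params and extract_body_paragraphs are hypothetical, the sample XML is a made-up fragment used only to exercise the parser, and availability for any particular article is not guaranteed.

```python
import re
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def build_pmc_efetch_params(pmc_id, api_key=""):
    """Build query parameters for an efetch request against the pmc database."""
    # Send only the numeric part of the ID (assumption: the prefix is not needed)
    numeric_id = re.sub(r"^PMC", "", pmc_id.strip(), flags=re.IGNORECASE)
    params = {"db": "pmc", "id": numeric_id, "retmode": "xml"}
    if api_key:
        params["api_key"] = api_key
    return params

def extract_body_paragraphs(xml_text):
    """Collect the text of every <p> element found under a <body> element."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for p in root.findall(".//body//p"):
        text = "".join(p.itertext()).strip()
        if text:
            paragraphs.append(text)
    return paragraphs

# Made-up fragment for illustration; real responses are much larger
SAMPLE_XML = """<pmc-articleset><article><front/><body>
<sec><title>Introduction</title><p>First <italic>example</italic> paragraph.</p></sec>
<sec><p>Second paragraph.</p></sec>
</body></article></pmc-articleset>"""

print(build_pmc_efetch_params("PMC11040515"))
print(extract_body_paragraphs(SAMPLE_XML))
```

With the requests library from the earlier cells, the call would look like requests.get(EFETCH_URL, params=build_pmc_efetch_params(pmc_id, api_key), timeout=30), followed by extract_body_paragraphs(response.text).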

Next, we try fetching a single article.

In [8]:
import requests
import logging

# Setup basic configuration for logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def fetch_full_text_xml(pmc_id):
    """
    Fetch the article page from PubMed Central using the PMC ID.
    Despite the function name, the response is HTML, not XML.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    full_text_url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/"
    response = requests.get(full_text_url, headers=headers)
    if response.status_code == 200:
        logging.info(f"Successfully fetched full text for PMC ID {pmc_id}")
        return response.text  # Returns the HTML content of the article
    else:
        logging.error(f"Failed to fetch full text for PMC ID {pmc_id}: {response.status_code}")
        return None

# Example usage
pmc_id = "PMC11040515"  # Replace with your actual PMC ID
full_text = fetch_full_text_xml(pmc_id)
if full_text:
    print("Fetched Full Text:", full_text[:500])  # Print first 500 characters of the full text
else:
    print("Failed to fetch full text.")


2024-06-05 15:41:18,295 - INFO - Successfully fetched full text for PMC ID PMC11040515


Fetched Full Text: 
    
<!DOCTYPE html>




<html lang="en" >
<head >
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <!-- Mobile properties -->
    <meta name="HandheldFriendly" content="True">
    <meta name="MobileOptimized" content="320">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">

  
    <!-- Stylesheets -->
    <link rel="stylesheet" href="/pmc/static/CACHE/css/output.6fd905b0fb6f.css" type="text/css">
  
  <link rel="stylesheet" href


The script successfully fetched the HTML content of the article page from PubMed Central. The output above is the beginning of the page's HTML, which consists of metadata and stylesheet links rather than article text.

Next Steps: Extracting Useful Information
Since the content fetched is HTML, the next step involves parsing this HTML to extract meaningful data, such as the article text itself, figures, or any specific sections you are interested in.

Using BeautifulSoup for HTML Parsing
To extract information from the HTML, you can use the BeautifulSoup library from Python, which is excellent for parsing HTML and XML documents. Here's how you can modify your script to extract the article text or other elements:

In [9]:
! pip install beautifulsoup4







In [10]:
from bs4 import BeautifulSoup
import requests
import logging

# Setup basic configuration for logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def fetch_full_text_html(pmc_id):
    """
    Fetch the full text of an article from PubMed Central using the PMC ID.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    full_text_url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/"

    response = requests.get(full_text_url, headers=headers)
    print(response.url)
    if response.status_code == 200:
        logging.info(f"Successfully fetched full text for PMC ID {pmc_id}")
        return response.text
    else:
        logging.error(f"Failed to fetch full text for PMC ID {pmc_id}: {response.status_code}")
        return None

def parse_html_content(html_content):
    """
    Parse HTML content using BeautifulSoup to extract the main article text.
    """
    soup = BeautifulSoup(html_content, 'html.parser')
    article_text = soup.find_all('p')  # Assuming the main content is in <p> tags
    clean_text = ' '.join([p.get_text() for p in article_text])
    return clean_text

# Example usage
pmc_id = "PMC11040515"  # Replace with your actual PMC ID
full_text_html = fetch_full_text_html(pmc_id)
if full_text_html:
    article_text = parse_html_content(full_text_html)
    print("Article Text Extracted:", article_text[:500])  # Print first 500 characters of the article text
else:
    print("Failed to fetch full text.")


2024-06-05 15:41:30,429 - INFO - Successfully fetched full text for PMC ID PMC11040515


https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11040515/
Article Text Extracted: An official website of the United States government 
The .gov means it’s official.

            Federal government websites often end in .gov or .mil. Before
            sharing sensitive information, make sure you’re on a federal
            government site.
           
The site is secure.

            The https:// ensures that you are connecting to the
            official website and that any information you provide is encrypted
            and transmitted securely.
           

             


Adjusting the Parsing Logic:
The first attempt captured the site's generic banner text rather than the article, because it collected every <p> tag on the page. We need to inspect the page's HTML structure to find the tags and attributes that enclose the article's main text.

1. Inspect the Web Page:
Open the URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11040515/ in your web browser.
Use the "Inspect Element" feature (right-click on the article text and select "Inspect") to see how the article text is structured within the HTML.
2. Refine the BeautifulSoup Selector:
Once you identify the HTML elements that specifically contain the article text (e.g., a specific <div> class or id), update the script to fetch these elements.

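Working assumption: the article text is contained within <div> elements identified by a class (for example "article-text"). The next cell tries a <div> with class "main-content" that contains a <div> with class "content".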

In [11]:
from bs4 import BeautifulSoup
import requests
import logging

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def fetch_full_text_html(pmc_id):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    full_text_url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/"
    response = requests.get(full_text_url, headers=headers)
    if response.status_code == 200:
        logging.info(f"Successfully fetched full text for PMC ID {pmc_id}")
        return response.text
    else:
        logging.error(f"Failed to fetch full text for PMC ID {pmc_id}: {response.status_code}")
        return None

def parse_html_content(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    main_content = soup.find('div', class_="main-content")
    if main_content:
        content = main_content.find('div', class_="content")
        if content:
            paragraphs = content.find_all('p')
            article_text = ' '.join(p.get_text() for p in paragraphs)
            return article_text
        else:
            return "Specific article content not found. Check the class within 'main-content'."
    else:
        return "Main content not found. Check the main class selector."

# Example usage
pmc_id = "PMC11040515"
full_text_html = fetch_full_text_html(pmc_id)
if full_text_html:
    article_text = parse_html_content(full_text_html)
    if article_text:
        print("Article Text Extracted:", article_text[:500])
    else:
        print("Failed to extract text: Article text not found.")
else:
    print("Failed to fetch full text.")


2024-06-05 15:41:37,528 - INFO - Successfully fetched full text for PMC ID PMC11040515


Article Text Extracted: Main content not found. Check the main class selector.
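
The cell above returns its error message as if it were article text, so the caller prints "Article Text Extracted:" even though nothing was extracted. Below is a minimal sketch of a variant that tries several candidate containers in order and returns None when none of them yields text. It uses only the standard library so that it runs without extra installs; the same fallback idea carries over to BeautifulSoup. The names ParagraphCollector and extract_article_text, the class name "article-body", and the sample HTML are all hypothetical. The real class names on the PMC page still have to be found by inspecting the page.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text of <p> elements nested inside one container element."""

    def __init__(self, container_tag, container_class):
        super().__init__()
        self.container_tag = container_tag
        self.container_class = container_class
        self.depth = 0              # nesting depth inside the matched container
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            if tag == self.container_tag:
                self.depth += 1     # nested element with the same tag name
            elif tag == "p":
                self.in_paragraph = True
                self.paragraphs.append("")
        elif tag == self.container_tag and self.container_class in classes:
            self.depth = 1          # entered the container we are looking for

    def handle_endtag(self, tag):
        if self.depth == 0:
            return
        if tag == self.container_tag:
            self.depth -= 1
            self.in_paragraph = False
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data

def extract_article_text(html, candidates):
    """Try each (tag, class) pair in order; return joined text or None."""
    for tag, css_class in candidates:
        parser = ParagraphCollector(tag, css_class)
        parser.feed(html)
        text = " ".join(p.strip() for p in parser.paragraphs if p.strip())
        if text:
            return text
    return None

# Made-up page: a banner paragraph outside the article container, two inside it
SAMPLE_HTML = """
<html><body>
<div class="banner"><p>An official website of the United States government</p></div>
<div class="article-body"><div><p>First paragraph.</p></div><p>Second <b>bold</b> paragraph.</p></div>
</body></html>
"""

candidates = [("div", "main-content"), ("div", "article-body")]
print(extract_article_text(SAMPLE_HTML, candidates))
```

Returning None lets the caller's existing "if article_text:" check report the failure correctly.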


Next, we download the PDF for a single, manually chosen article.

In [12]:
import requests
import logging

def fetch_pdf(pmc_id):
    """
    Attempt to fetch the PDF file of an article using its PMC ID.
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    pdf_url = f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc_id}/pdf/"
    response = requests.get(pdf_url, headers=headers)
    if response.status_code == 200:
        pdf_filename = f"{pmc_id}.pdf"
        with open(pdf_filename, 'wb') as f:
            f.write(response.content)
        return pdf_filename
    else:
        logging.error(f"Failed to fetch PDF for PMC ID {pmc_id}: {response.status_code}")
        return None

# Example usage
pmc_id = "PMC11040515"  # Replace with the actual PMC ID
pdf_file = fetch_pdf(pmc_id)
if pdf_file:
    print(f"PDF successfully downloaded: {pdf_file}")
else:
    print("Failed to download PDF.")


PDF successfully downloaded: PMC11040515.pdf


In [13]:
! pip install pdfplumber







In [14]:
import pdfplumber

def extract_text_from_pdf(pdf_path):
    """
    Extract text from a PDF file using pdfplumber.
    """
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text += page.extract_text() or ""  # Guard against pages with no extractable text
    return text

# Example usage
pdf_path = "PMC11040515.pdf"  # Path to the downloaded PDF file
extracted_text = extract_text_from_pdf(pdf_path)
print(extracted_text[:1000])  # Print the first 1000 characters of the extracted text


https://doi.org/10.1093/braincomms/fcae113 BRAIN COMMUNICATIONS 2024, fcae113 | 1
BRAIN COMMUNICATIONS
Histologic tau lesions and magnetic resonance
imaging biomarkers differ across two
progressive supranuclear palsy variants
Francesca Orlandi,1,2 Arenn F. Carlos,1 Farwa Ali,1 Heather M. Clark,1 Joseph R. Duffy,1
Rene L. Utianski,1 Hugo Botha,1 Mary M. Machulda,3 Yehkyoung C. Stephens,1
Christopher G. Schwarz,4 Matthew L. Senjem,4,5 Clifford R. Jack,4 Federica Agosta,2,6
Massimo Filippi,2,6 Dennis W. Dickson,7 Keith A. Josephs1 and Jennifer L. Whitwell4
Progressive supranuclear palsy is a neurodegenerative disease characterized by the deposition of four-repeat tau in neuronal and glial le-
sions in the brainstem, cerebellar, subcortical and cortical brain regions. There are varying clinical presentations of progressive supra-
nuclear palsy with different neuroimaging signatures, presumed to be due to different topographical distributions and burden of tau.
The classic Richardson syndro
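
The extracted text above keeps the PDF's line layout, so words split at a line end stay broken (for example "le-" / "sions" and "supra-" / "nuclear"). Below is a minimal sketch of a cleanup step that rejoins such words. The function name join_hyphenated_line_breaks is hypothetical, and the rule is deliberately naive: it removes a hyphen plus line break only when both neighbours are lowercase letters, so a compound word that happens to break at its own hyphen would be merged incorrectly.

```python
import re

# A hyphen directly followed by a line break, with lowercase letters on both sides
_HYPHEN_BREAK = re.compile(r"(?<=[a-z])-\n(?=[a-z])")

def join_hyphenated_line_breaks(text):
    """Rejoin words that were split by a hyphen at the end of a line."""
    return _HYPHEN_BREAK.sub("", text)

sample = "neuronal and glial le-\nsions in the brainstem; progressive supra-\nnuclear palsy; four-repeat tau"
print(join_hyphenated_line_breaks(sample))
```

Hyphens inside a line, such as the one in "four-repeat", are left untouched because they are not followed by a line break.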

In [15]:

# Updated version of the URL-fetching code from the beginning of the notebook; it builds URLs from PMC IDs, but only for the PDF format.


# Alzheimer's Disease Research Project - Part 2

# Importing Required Libraries
import pdfplumber
import requests
import xml.etree.ElementTree as ET
import pandas as pd
import logging

# Searching PubMed (esearch); despite its name, this function returns a list of PubMed IDs, not summaries
def fetch_article_summaries(api_key, query):
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    params = {
        'db': 'pubmed',
        'term': query,
        'retmax': 100,  # Adjust this to fetch more or fewer articles
        'api_key': api_key,
        'retmode': 'xml'
    }
    response = requests.get(base_url, params=params)
    root = ET.fromstring(response.content)
    id_list = [id_elem.text for id_elem in root.findall('.//Id')]
    return id_list

# Fetching PMC IDs and Full Text URLs
def fetch_pmc_ids(api_key, id_list):
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
    params = {
        'db': 'pubmed',
        'id': ','.join(id_list),
        'retmode': 'xml',
        'api_key': api_key
    }
    response = requests.get(base_url, params=params)
    root = ET.fromstring(response.content)
    pmc_ids = []
    for docsum in root.findall('.//DocSum'):
        pmc_id_elem = docsum.find(".//Item[@Name='pmc']")
        if pmc_id_elem is not None:
            pmc_ids.append(pmc_id_elem.text)
    return pmc_ids

def generate_full_text_urls(pmc_ids):
    base_url = "https://www.ncbi.nlm.nih.gov/pmc/articles/"
    urls = [f"{base_url}{pmc_id}/pdf" for pmc_id in pmc_ids]
    return urls

# Extracting Text from PDFs
def extract_text_from_pdf(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text += page.extract_text() or ""  # Guard against pages with no extractable text
    return text

# Main Function
def main(api_key, query):
    logging.basicConfig(level=logging.INFO)
    logging.info("Starting the PubMed full-text fetcher script.")
    
    # Fetch article summaries
    article_summaries = fetch_article_summaries(api_key, query)
    if article_summaries:
        logging.info(f"Fetched {len(article_summaries)} article summaries.")
        
        # Fetch PMC IDs
        pmc_ids = fetch_pmc_ids(api_key, article_summaries)
        logging.info(f"Fetched {len(pmc_ids)} PMC IDs.")
        
        # Generate full-text URLs
        urls = generate_full_text_urls(pmc_ids)
        logging.info(f"Generated URLs for {len(urls)} articles.")
        
        for url in urls:
            logging.info(f"Full text URL: {url}")
            # Optionally download and process each PDF
            # Example: download_pdf(url)
    else:
        logging.info("No articles found.")

# Example Usage
api_key = ""  # Replace with your actual API key
query = "Alzheimer's disease"
main(api_key, query)


2024-06-05 15:42:08,061 - INFO - Starting the PubMed full-text fetcher script.
2024-06-05 15:42:08,547 - INFO - Fetched 100 article summaries.
2024-06-05 15:42:09,552 - INFO - Fetched 44 PMC IDs.
2024-06-05 15:42:09,552 - INFO - Generated URLs for 44 articles.
2024-06-05 15:42:09,553 - INFO - Full text URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11148533/pdf
2024-06-05 15:42:09,554 - INFO - Full text URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11148366/pdf
2024-06-05 15:42:09,554 - INFO - Full text URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11149323/pdf
2024-06-05 15:42:09,554 - INFO - Full text URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11149856/pdf
2024-06-05 15:42:09,554 - INFO - Full text URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11146203/pdf
2024-06-05 15:42:09,554 - INFO - Full text URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11146249/pdf
2024-06-05 15:42:09,554 - INFO - Full text URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11144909/p
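
The main function above only logs the PDF URLs; the download_pdf(url) call it mentions in a comment is not defined anywhere in the notebook. Below is a minimal sketch of what such a helper could look like, under stated assumptions. The names pdf_filename_from_url and looks_like_pdf, the extra http_get parameter, and the shortened User-Agent string are hypothetical. The HTTP function is passed in (for example requests.get) so that the helper itself needs nothing beyond the standard library.

```python
import os
import re

# Placeholder header; the earlier cells use a full browser User-Agent string
HEADERS = {"User-Agent": "Mozilla/5.0"}

def pdf_filename_from_url(url):
    """Derive '<PMC ID>.pdf' from a URL ending in /articles/<PMC ID>/pdf."""
    match = re.search(r"/articles/([^/]+)/pdf/?$", url)
    return f"{match.group(1)}.pdf" if match else None

def looks_like_pdf(content):
    """PDF files start with the bytes '%PDF-'; an HTML error page does not."""
    return content[:5] == b"%PDF-"

def download_pdf(url, http_get, out_dir="."):
    """Download one PDF and return the saved path, or None on any failure."""
    filename = pdf_filename_from_url(url)
    if filename is None:
        return None
    response = http_get(url, headers=HEADERS, timeout=30)
    # A 200 response can still be an HTML page, so check the content as well
    if response.status_code != 200 or not looks_like_pdf(response.content):
        return None
    path = os.path.join(out_dir, filename)
    with open(path, "wb") as f:
        f.write(response.content)
    return path

print(pdf_filename_from_url("https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11148533/pdf"))
```

Inside the loop over urls, the call would be download_pdf(url, requests.get), followed by a short time.sleep between downloads as in the earlier cells, and then extract_text_from_pdf on each path that is not None.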