# Introduction
This notebook extracts abstracts from PubMed through the NCBI E-utilities API and preprocesses the extracted text.


# Imports
- **requests**: To make HTTP requests to the PubMed API.
- **pandas**: To load and manipulate the data in tabular form.
- **csv**: To read/write data to CSV files.
- **os**: To interact with the file system.
- **xml.etree.ElementTree**: To parse and process XML responses.


In [1]:
import requests
import xml.etree.ElementTree as ET
import csv
import pandas as pd
import os

# Defining a function to search PubMed using its API for article IDs based on a list of keywords. 

1. Constructs a search query from keywords using AND logic.
2. Sends an HTTP request to PubMed's ESearch API.
3. Parses the JSON response to extract a list of PubMed IDs.

In [3]:
def search_pubmed_for_ids(keywords):
    # Construct the query using the keywords
    query = ' AND '.join(keywords)
    search_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    search_params = {'db': 'pubmed', 'term': query, 'retmode': 'json', 'retmax': 10000}

    # Perform the request; passing params lets requests URL-encode the query
    response = requests.get(search_url, params=search_params)
    
    # Check for successful response
    if response.status_code == 200:
        data = response.json()
        return data['esearchresult']['idlist']  # Return list of PubMed IDs
    else:
        print(f"Failed to retrieve PubMed IDs. Status code: {response.status_code}")
        return []

keywords = ["Neoplasms", "Antineoplastic Agents", "Adverse Effects", "Toxicity"]

# Fetch PubMed IDs based on the keywords
abstract_ids = search_pubmed_for_ids(keywords)

# Display a few PubMed IDs to confirm
print("Fetched PubMed IDs:", abstract_ids[:10])  # Display first 10 PubMed IDs
print(len(abstract_ids))

Fetched PubMed IDs: ['39410854', '39403930', '39384801', '39384243', '39382658', '39377592', '39377544', '39375818', '39375657', '39368044']
9999
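
Terms such as "Antineoplastic Agents" contain spaces, which have to be encoded before they appear in a URL. As a minimal sketch of what the encoded ESearch URL looks like, the snippet below builds it with the standard library only. The helper name `build_esearch_url` is hypothetical and not part of the notebook.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(keywords, retmax=10000):
    """Return an ESearch URL whose query joins the keywords with AND."""
    query = " AND ".join(keywords)
    params = {"db": "pubmed", "term": query, "retmode": "json", "retmax": retmax}
    # urlencode escapes spaces and other reserved characters in each value
    return f"{ESEARCH_URL}?{urlencode(params)}"

url = build_esearch_url(["Neoplasms", "Antineoplastic Agents"])
print(url)
```

`requests.get` applies comparable encoding when the same dictionary is passed through its `params=` argument.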


# Define the base URL and parameters for the EFetch API, which retrieves full article abstracts by PubMed IDs.

1. base_url: The API endpoint for fetching articles.
2. params: Default parameters for the API request (database, return type, API key).

In [4]:
base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

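# Parameters for the API request
params = {
    'db': 'pubmed',
    'rettype': 'abstract',
    'retmode': 'xml',  # request XML explicitly, since the response is parsed with ElementTree
    # Read the key from the environment instead of hard-coding it;
    # NCBI_API_KEY is an assumed variable name. requests omits the parameter when the value is None.
    'api_key': os.environ.get('NCBI_API_KEY')
}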
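
EFetch accepts a comma-separated list of IDs in the `id` parameter, so several records can be requested in one call instead of one request per ID. The following is a sketch of splitting an ID list into batches. The helper name `chunk_ids` and the batch sizes are illustrative assumptions, not values taken from this notebook.

```python
def chunk_ids(ids, batch_size=200):
    """Yield comma-separated ID strings, each covering at most batch_size IDs."""
    for start in range(0, len(ids), batch_size):
        yield ",".join(ids[start:start + batch_size])

sample_ids = ["39410854", "39403930", "39384801", "39384243", "39382658"]
batches = list(chunk_ids(sample_ids, batch_size=2))
print(batches)
```

Each batch string could be assigned to `params['id']`. The XML response would then hold one `PubmedArticle` element per record, so parsing would need to iterate over those elements rather than search the whole document at once.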

# Extracting the abstracts from PubMed using their IDs

Set up a CSV file for saving PubMed abstracts and associated MeSH headings.

1. Initialize counters for tracking total, null, and valid abstracts processed.
2. Check if the CSV file already exists to avoid overwriting.
3. Open the file in append or write mode and add a header if it's a new file.


For each PubMed ID, fetch the abstract and MeSH headings. The loop:

1. Sends a request to the EFetch API using the PubMed ID.
2. Parses the XML response to extract:
   - **AbstractText**: The article's abstract.
   - **MeshHeading/DescriptorName**: Associated MeSH headings.
3. Handles XML parsing errors and API response issues.


In [5]:
total_abstract_count = 0
null_abstract_count = 0
valid_abstract_count = 0

# Check if the CSV file already exists
file_exists = os.path.isfile('raw_abstracts_data.csv')

# Open the CSV file once to write all abstracts
with open('raw_abstracts_data.csv', mode='a' if file_exists else 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)

    # Write the header row only if the file is being created (i.e., it does not already exist)
    if not file_exists:
        writer.writerow(['PubMedID', 'Abstract', 'Mesh Headings'])
        
    # Iterate through each PubMed ID and fetch the abstract
    for pubmed_id in abstract_ids:
        params['id'] = pubmed_id
        response = requests.get(base_url, params=params)

        if response.status_code == 200:
            total_abstract_count += 1

            # Parse the XML content and handle any parsing errors
            try:
                root = ET.fromstring(response.text)

                # Find all elements with the tag 'AbstractText'
                abstract_text_elements = root.findall('.//AbstractText')

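                # Concatenate all the text from <AbstractText> tags into one string.
                # Filtering on the joined itertext() keeps sections that begin with an inline tag such as <i>.
                abstract_parts = (''.join(element.itertext()).strip() for element in abstract_text_elements)
                abstract_texts = ' '.join(part for part in abstract_parts if part)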
                
                # Find all the MeshHeadings
                mesh_heading_elements = root.findall('.//MeshHeading/DescriptorName')
                mesh_headings = ', '.join([element.text for element in mesh_heading_elements if element.text])

                # Write the PubMed ID, combined abstract, and MeSH headings into the CSV
                if abstract_texts:  # Only write if abstract is not null
                    writer.writerow([pubmed_id, abstract_texts, mesh_headings])
                    valid_abstract_count += 1
                else:
                    null_abstract_count += 1
                
            except ET.ParseError as e:
                print(f"XML parsing error for PubMed ID {pubmed_id}: {e}")

        else:
            print(f"Failed to fetch data for PubMed ID {pubmed_id}. Status code: {response.status_code}")

print('Total Accessed Abstracts:', total_abstract_count)
print('Valid abstracts:', valid_abstract_count)
print('Null abstracts:', null_abstract_count)

Total Accessed Abstracts: 9999
Valid abstracts: 9648
Null abstracts: 351
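
The parsing step inside the loop can be isolated into a function and checked against a small document without calling the API. The sketch below assumes the response follows the `PubmedArticle` layout. The function name `extract_abstract_and_mesh` and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

def extract_abstract_and_mesh(xml_text):
    """Return (abstract, mesh_headings) strings parsed from EFetch-style XML."""
    root = ET.fromstring(xml_text)
    # itertext() also collects text nested in inline tags such as <i> or <sup>
    parts = ("".join(el.itertext()).strip() for el in root.findall(".//AbstractText"))
    abstract = " ".join(part for part in parts if part)
    mesh = ", ".join(
        el.text for el in root.findall(".//MeshHeading/DescriptorName") if el.text
    )
    return abstract, mesh

# Invented sample record, shaped like an EFetch response
sample_xml = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <Abstract>
          <AbstractText Label="BACKGROUND"><i>In vitro</i> data are limited.</AbstractText>
          <AbstractText Label="RESULTS">Toxicity was dose dependent.</AbstractText>
        </Abstract>
      </Article>
      <MeshHeadingList>
        <MeshHeading><DescriptorName>Neoplasms</DescriptorName></MeshHeading>
        <MeshHeading><DescriptorName>Antineoplastic Agents</DescriptorName></MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

abstract, mesh = extract_abstract_and_mesh(sample_xml)
print(abstract)
print(mesh)
```

The first `AbstractText` element starts with an `<i>` tag, so its `.text` attribute is `None`. Joining `itertext()` still recovers the full sentence.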


# Preprocessing

Necessary libraries and data:
- **re**: For regular expressions to clean text (e.g., removing URLs and special characters).
- **nltk**: Provides natural language processing tools.
- **WordNetLemmatizer**: For reducing words to their base form.
- **stopwords and word_tokenize**: To load a list of common English words and to tokenize text into words.
- **punkt (NLTK)**: Tokenizer models used by `word_tokenize`.
- **stopwords (NLTK)**: Provides a list of common words to filter out. The list is loaded below, but `preprocess_text` does not apply it.
- **wordnet (NLTK)**: A lexical database for lemmatization.

Preprocessing steps:
1. **Convert to lowercase**: Standardizes text for consistent processing.
2. **Remove URLs**: Cleans out links from the text.
3. **Remove mathematical formulas**: Removes digit sequences that are followed by an operator such as +, -, /, or *.
4. **Remove special characters**: Keeps only letters, numbers, and spaces.
5. **Tokenize**: Splits text into individual words.
6. **Lemmatize**: Reduces words to their base or dictionary form (e.g., "studies" → "study"). `WordNetLemmatizer` treats every token as a noun unless a part-of-speech tag is passed.
7. **Rejoin Tokens**: Combines the cleaned words back into a single string.

Save the cleaned dataset to a new CSV file.

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from string import punctuation

# Ensure you have the necessary NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this resource when tokenizing
nltk.download('stopwords')
nltk.download('wordnet')

df = pd.read_csv('raw_abstracts_data.csv')

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Load stopwords
stop_words = set(stopwords.words('english'))

# Define a preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)

    # Remove mathematical formulas (assuming they include digits, symbols like +, -, /, *, etc.)
    # The hyphen comes first in the character class so it is a literal, not a '+'-to-'/' range
    text = re.sub(r'[0-9]+[-+/*^=<>]+[0-9]*', '', text)

    # Remove special characters (keep only alphanumeric and space)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # Tokenization (split into words)
    tokens = word_tokenize(text)

    # Lemmatize the tokens (convert to base forms)
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]

    # Join the cleaned tokens back into a string
    cleaned_text = ' '.join(lemmatized_tokens)
    
    return cleaned_text

# Apply preprocessing to the 'Abstract' column
df['Abstract'] = df['Abstract'].apply(preprocess_text)

# Save the cleaned data to a new CSV file
df.to_csv('preprocessed_abstracts_data.csv', index=False)
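
The regex steps of `preprocess_text` can be tried on a single string without NLTK. This sketch covers only steps 1 to 4 and then splits on whitespace to tidy the result. The sample sentence and the name `clean_text` are invented for illustration. The formula pattern lists the hyphen first in its character class so that it matches a literal `-`.

```python
import re

def clean_text(text):
    """Apply the lowercase, URL, formula, and special-character steps."""
    text = text.lower()
    # Remove URLs
    text = re.sub(r"http\S+|www\S+", "", text)
    # Remove digit-operator sequences; the leading hyphen in the class is a literal
    text = re.sub(r"[0-9]+[-+/*^=<>]+[0-9]*", "", text)
    # Keep only letters, digits, and whitespace
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())

sample = "Dose of 5+3 mg/kg (see https://example.org/trial) reduced Tumor growth by 40%."
cleaned = clean_text(sample)
print(cleaned)
```

Removing special characters joins "mg/kg" into "mgkg", so unit strings may need separate handling if they matter downstream.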
