<a href="https://colab.research.google.com/github/Palaeoprot/PRIDE/blob/main/PRIDE_metadata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **PRIDE Metadata Extraction and Ontology Mapping**

## **Project Overview**
This project aims to extract, standardise, and analyse metadata from the **PRIDE (Proteomics Identifications Database)** repository. The goal is to:
- **Retrieve metadata** for specific PRIDE datasets using the PRIDE API.
- **Store metadata** in a structured format (JSON) within predefined folders.
- **Inspect metadata fields** to identify key elements relevant to proteomics research.
- **Standardise metadata** using ontologies (e.g., PSI-MS) to ensure consistency.
- **Enable aggregation** of metadata across multiple datasets for broader analysis.
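
As a rough illustration of the aggregation goal, the sketch below collects one field from every saved metadata file into a list of records. It assumes the folder layout used by `save_metadata` in this notebook (`<output_dir>/<px_id>/<px_id>_experiment_metadata.json`). The helper name `aggregate_field` is hypothetical, and the field passed in should be whatever key the saved JSON actually contains.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def aggregate_field(output_dir: Path, field: str) -> List[Dict[str, Any]]:
    """Collect one top-level field from every saved metadata file.

    Hypothetical helper: assumes files are stored as
    <output_dir>/<px_id>/<px_id>_experiment_metadata.json.
    """
    records = []
    # sorted() gives a stable order regardless of filesystem listing order
    for metadata_file in sorted(output_dir.glob("*/*_experiment_metadata.json")):
        with open(metadata_file, encoding="utf-8") as f:
            metadata = json.load(f)
        # Only dictionaries have named fields; anything else yields None
        value = metadata.get(field) if isinstance(metadata, dict) else None
        records.append({"px_id": metadata_file.parent.name, field: value})
    return records
```

A field that is missing from a given file appears as `None` in its record, so gaps across datasets remain visible rather than being silently dropped.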

The notebook works towards these goals in the following steps:

1. **Install Required Packages**  
   - The notebook installs two key Python packages:
     - **`owlready2`**: Used for loading and working with ontologies. An ontology is a controlled vocabulary that helps standardise terms (such as instrument names) used in proteomics research.
     - **`requests`**: Allows the notebook to make HTTP requests to the PRIDE API to fetch metadata.

2. **Introduction and Project Goals**  
   - At the very beginning, there is a detailed explanation (in markdown) about the project’s purpose. This includes:
     - Extracting metadata for specific PRIDE datasets.
     - Storing the metadata in JSON format.
     - Mapping metadata fields to standardised ontology terms (using the PSI-MS ontology).
     - Preparing the data for later aggregation and analysis.

3. **Installing Additional Tools**  
   - It also installs the `pridepy` package along with `tqdm` for enhanced progress tracking when working with PRIDE data. (Although this notebook focuses on metadata extraction, `pridepy` is useful for other parts of the project.)

4. **Mounting Google Drive and Authentication**  
   - The notebook mounts your Google Drive. This is done so that the extracted metadata files can be saved in a structured directory on your Drive.
   - It then authenticates the user so that the notebook can access Google Sheets, from which the list of PRIDE IDs is read.

5. **Loading and Exploring the PSI-MS Ontology**  
   - The notebook loads the **PSI-MS ontology** from an online source. This ontology provides standardised terms for mass spectrometry and proteomics.
   - It demonstrates a few examples:
     - **Searching for specific terms:** For example, it searches for the term corresponding to an "Orbitrap" instrument and prints its label.
     - **Finding related terms:** It shows how to search for terms containing “mass spectrometer” and checks hierarchical relationships (like whether one term is a child of another).
     - **Iterating over ontology classes:** It prints all the classes (or categories) available in the ontology. This helps illustrate how metadata fields might be mapped to standardised terms.

6. **Setting Module Parameters and Preparing Directories**  
   - The notebook defines some key parameters:
     - A **folder name** (e.g., “Hominins”) which will be used to organize the data.
     - A **base directory path** on Google Drive where the experiment’s data will be stored.
   - It then creates an “experiment” folder in the specified directory to hold all the metadata files.

7. **Defining PRIDE Dataset IDs**  
   - A list of PRIDE dataset IDs (for example, “PXD003190”, “PXD003208”, etc.) is read from column A of a Google Sheet. These IDs represent the specific experiments or datasets that will be queried from the PRIDE repository.

8. **Fetching Metadata from the PRIDE API**  
   - A Python function called `fetch_pride_metadata` is defined:
     - It takes a dataset ID and constructs the appropriate URL for the PRIDE API.
     - It then sends a request to that URL. If the request is successful, the API returns metadata in JSON format.
   - The notebook iterates over the list of PRIDE IDs, uses this function to fetch metadata for each, and saves each JSON response to a file. Each file is named `<PRIDE ID>_experiment_metadata.json` and stored in a subfolder named after the PRIDE ID inside the experiment folder. IDs whose file already exists are skipped.

9. **Saving and Confirming the Extracted Metadata**  
   - After successfully fetching the metadata, the notebook writes the JSON data into separate files within the designated folder on Google Drive.
   - It prints messages to the console confirming that the metadata for each PRIDE dataset has been successfully saved.
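
The ontology mapping described in step 5 can be illustrated without loading the full ontology. The sketch below normalises free-text instrument names to PSI-MS terms through a small hand-written lookup table. Both `INSTRUMENT_TERMS` and `map_instrument` are hypothetical names, and the two accessions shown should be verified against the current PSI-MS release before use; in practice the table would be derived from the ontology loaded with `owlready2`.

```python
from typing import Dict, Optional, Tuple

# Hypothetical hand-written lookup: normalised name -> (accession, label).
# Verify the accessions against the current PSI-MS ontology before relying on them.
INSTRUMENT_TERMS: Dict[str, Tuple[str, str]] = {
    "q exactive": ("MS:1001911", "Q Exactive"),
    "ltq orbitrap": ("MS:1000449", "LTQ Orbitrap"),
}


def map_instrument(raw_name: str) -> Optional[Tuple[str, str]]:
    """Return (accession, label) for a free-text instrument name, or None."""
    # Lower-case and collapse runs of whitespace so "Q  Exactive " still matches
    key = " ".join(raw_name.lower().split())
    return INSTRUMENT_TERMS.get(key)
```

Returning `None` for unknown names keeps unmapped values explicit, so they can be reviewed by hand instead of being assigned a wrong term.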

# Install Dependencies

# Make Selections

In [1]:
# Import required libraries
import os
import json
import requests
from pathlib import Path
import logging
from google.colab import drive, auth
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
from typing import Optional, List, Dict, Any
from datetime import datetime
from tqdm import tqdm

In [2]:
# Cell 1: Initial Setup and Drive Mounting
# Mount Google Drive
mount_point = '/content/drive'

# Check whether the drive is already mounted
if os.path.isdir(mount_point):
    try:
        os.listdir(mount_point)  # Test whether the mount point is accessible
        print("Warning: Google Drive is already mounted. Please do not remount.")
    except OSError:
        # The mount point exists but is not accessible, so mount it
        drive.mount(mount_point)
else:
    drive.mount(mount_point)
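
The final cell of this notebook performs the same check with `os.path.ismount`. A minimal sketch of a helper that both cells could share is shown below. The name `needs_mount` is hypothetical, and the sketch assumes that a mounted Drive appears as a mount point at the given path.

```python
import os


def needs_mount(mount_point: str) -> bool:
    """Return True if nothing is mounted at mount_point yet."""
    return not os.path.ismount(mount_point)


# Intended use in Colab (not executed here):
#     if needs_mount('/content/drive'):
#         drive.mount('/content/drive')
```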



In [3]:
# Cell 2: Configuration and Class Definitions
# --- Module Parameters ---
shared_drive_base_dir_str = "/content/drive/Shareddrives/ZooMS_Data/PRIDE"  # @param {type:"string"}
spreadsheet_id = '127K6zdl5y46DRqUwRr-V32nUDoceaddbhG9XyozJs-4'  # @param {type:"string"}
sheet_name = 'Metadata'  # @param {type:"string"}
folder_name = 'Hominins'  # @param {type:"string"}

# Create base directory if it doesn't exist
os.makedirs(shared_drive_base_dir_str, exist_ok=True)
os.chdir(shared_drive_base_dir_str)

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

In [4]:
# Authenticate for Google Sheets access
auth.authenticate_user()

# Run Script

In [5]:
# Log progress
class ProgressLogger:
    """Helper class to manage progress reporting and logging"""

    def __init__(self, name: str):
        self.logger = logging.getLogger(name)
        self.start_time = None

    def start_section(self, section_name: str):
        """Start a new section of processing"""
        self.start_time = datetime.now()
        print("\n" + "="*50)
        print(f"Starting: {section_name}")
        print("="*50)
        self.logger.info(f"Starting {section_name}")

    def end_section(self, section_name: str):
        """End a section of processing"""
        # start_time is still None if start_section() was never called
        duration = (datetime.now() - self.start_time) if self.start_time else "unknown"
        print("\n" + "-"*50)
        print(f"Completed: {section_name}")
        print(f"Duration: {duration}")
        print("-"*50 + "\n")
        self.logger.info(f"Completed {section_name}. Duration: {duration}")

    def log_progress(self, message: str, level: str = "info"):
        """Log a progress message"""
        print(message)
        getattr(self.logger, level)(message)

def get_pride_ids_from_sheet(spreadsheet_id: str, sheet_name: str) -> List[str]:
    """Enhanced function to retrieve PRIDE IDs with progress reporting"""
    progress = ProgressLogger("SheetReader")
    progress.start_section("Reading PRIDE IDs from Google Sheet")

    try:
        service = build('sheets', 'v4')
        sheet = service.spreadsheets()
        range_name = f"{sheet_name}!A2:A"

        progress.log_progress(f"Accessing sheet: {sheet_name}")
        result = sheet.values().get(spreadsheetId=spreadsheet_id, range=range_name).execute()
        values = result.get('values', [])

        if not values:
            progress.log_progress("✗ No data found in sheet", "warning")
            return []

        pride_ids = [row[0].strip() for row in values if row and row[0].strip().startswith('PXD')]

        progress.log_progress(f"✓ Found {len(pride_ids)} PRIDE IDs:")
        for px_id in pride_ids:
            progress.log_progress(f"  - {px_id}")

        progress.end_section("Reading PRIDE IDs")
        return pride_ids

    except HttpError as err:
        progress.log_progress(f"✗ Failed to access sheet: {err}", "error")
        return []

def fetch_pride_metadata(px_id: str) -> Optional[Dict[str, Any]]:
    """Fetch metadata for a PRIDE project"""
    base_url = "https://www.ebi.ac.uk/pride/ws/archive/v3/projects"
    url = f"{base_url}/{px_id}"

    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()

        metadata = response.json()
        if not metadata:
            return None

        return metadata

    except json.JSONDecodeError as e:
        # Checked first: recent versions of requests raise a JSON error that is
        # also a RequestException, which the handler below would otherwise catch.
        print(f"✗ Error parsing JSON response for {px_id}: {str(e)}")
        return None

    except requests.exceptions.RequestException as e:
        if isinstance(e, requests.exceptions.HTTPError) and e.response.status_code == 404:
            print(f"✗ Project {px_id} not found (404)")
        else:
            print(f"✗ Error fetching {px_id}: {str(e)}")
        return None

def check_existing_metadata(px_id: str, output_dir: Path) -> bool:
    """Check if metadata file already exists for given PRIDE ID"""
    metadata_file = output_dir / px_id / f"{px_id}_experiment_metadata.json"
    return metadata_file.exists()


def save_metadata(metadata: Dict[str, Any], px_id: str, output_dir: Path, mapped: bool = False) -> bool:
    """
    Save experiment-level metadata only if it doesn't already exist.

    The file is saved in a subfolder named after the PRIDE ID, e.g.:
      <output_dir>/<px_id>/<px_id>_experiment_metadata.json
    """
    try:
        # Ensure that output_dir is the folder where experiment subfolders reside (e.g., "Hominins")
        # Create a subdirectory for this project if it doesn't exist.
        project_dir = output_dir / px_id
        project_dir.mkdir(parents=True, exist_ok=True)

        # Define the output file name inside the project subfolder.
        output_file = project_dir / f"{px_id}_experiment_metadata.json"
        logger.info(f"Computed output file path: {output_file}")

        # Check if the file already exists in the project subfolder.
        if output_file.exists():
            logger.info(f"Experiment metadata file already exists for {px_id} at {output_file}")
            return True  # File already exists, so do nothing.

        # Write the metadata into the file within the project subfolder.
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(metadata, f, indent=2, ensure_ascii=False)
        logger.info(f"Experiment metadata saved successfully for {px_id} in {project_dir}")
        return True

    except Exception as e:
        logger.error(f"Error saving metadata for {px_id}: {str(e)}")
        return False


def main(spreadsheet_id: str, sheet_name: str, base_dir: str, folder_name: str):
    """Main execution function with file organization and existence checks"""
    progress = ProgressLogger("Main")

    # Setup and initialization
    progress.start_section("Initialization")
    output_dir = Path(base_dir) / folder_name
    output_dir.mkdir(parents=True, exist_ok=True)
    progress.log_progress(f"Output directory: {output_dir}")

    # Get PRIDE IDs
    pride_ids = get_pride_ids_from_sheet(spreadsheet_id, sheet_name)
    if not pride_ids:
        progress.log_progress("✗ No valid PRIDE IDs found. Exiting.", "error")
        return

    # Process each PRIDE ID
    progress.start_section("Processing PRIDE Projects")
    stats = {
        "total": len(pride_ids),
        "skipped": 0,
        "successful": 0,
        "failed": 0
    }

    for px_id in tqdm(pride_ids, desc="Processing projects"):
        progress.log_progress(f"\nChecking {px_id}")

        # Check if metadata already exists
        if check_existing_metadata(px_id, output_dir):
            progress.log_progress(f"⏭ Skipping {px_id} - metadata already exists")
            stats["skipped"] += 1
            continue

        # Fetch metadata
        progress.log_progress(f"Fetching metadata for {px_id}")
        metadata = fetch_pride_metadata(px_id)

        if metadata:
            # Save metadata
            if save_metadata(metadata, px_id, output_dir):
                progress.log_progress(f"✓ Successfully processed {px_id}")
                stats["successful"] += 1
            else:
                progress.log_progress(f"✗ Failed to save metadata for {px_id}")
                stats["failed"] += 1
        else:
            progress.log_progress(f"✗ Failed to fetch metadata for {px_id}")
            stats["failed"] += 1

    # Final summary
    progress.start_section("Final Summary")
    progress.log_progress(f"""
Processing Complete:
- Total projects: {stats['total']}
- Already existed (skipped): {stats['skipped']}
- Successfully processed: {stats['successful']}
- Failed: {stats['failed']}
""")
    progress.end_section("Process Complete")
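
`get_pride_ids_from_sheet` accepts any cell that starts with `PXD`, so a typo such as `PXD01137` would only surface later as a failed request. A stricter check is sketched below, assuming that ProteomeXchange accessions consist of `PXD` followed by exactly six digits. The helper name `is_valid_pxd` is hypothetical and not part of the notebook.

```python
import re

# Assumption: a ProteomeXchange accession is "PXD" plus exactly six ASCII digits
PXD_PATTERN = re.compile(r"PXD[0-9]{6}")


def is_valid_pxd(value: str) -> bool:
    """Return True if value (ignoring surrounding whitespace) looks like a PXD accession."""
    return PXD_PATTERN.fullmatch(value.strip()) is not None
```

Used in place of `startswith('PXD')`, this would let malformed cells be reported when the sheet is read rather than when the request fails.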



In [6]:
if __name__ == "__main__":
    # Configure logging
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )

    # Mount Google Drive if needed
    if not os.path.ismount('/content/drive'):
        drive.mount('/content/drive')

    # Authenticate for Google Sheets access
    auth.authenticate_user()

    # Run main processing
    main(
        spreadsheet_id=spreadsheet_id,
        sheet_name=sheet_name,
        base_dir=shared_drive_base_dir_str,
        folder_name=folder_name,
    )


Starting: Initialization
Output directory: /content/drive/Shareddrives/ZooMS_Data/PRIDE/Hominins

Starting: Reading PRIDE IDs from Google Sheet
Accessing sheet: Metadata
✓ Found 8 PRIDE IDs:
  - PXD011377
  - PXD018264
  - PXD018721
  - PXD020530
  - PXD043272
  - PXD045412
  - PXD047932
  - PXD058447

--------------------------------------------------
Completed: Reading PRIDE IDs
Duration: 0:00:06.036005
--------------------------------------------------


Starting: Processing PRIDE Projects


Processing projects: 100%|██████████| 8/8 [00:00<00:00, 69.64it/s]


Checking PXD011377
⏭ Skipping PXD011377 - metadata already exists

Checking PXD018264
⏭ Skipping PXD018264 - metadata already exists

Checking PXD018721
⏭ Skipping PXD018721 - metadata already exists

Checking PXD020530
⏭ Skipping PXD020530 - metadata already exists

Checking PXD043272
⏭ Skipping PXD043272 - metadata already exists

Checking PXD045412
⏭ Skipping PXD045412 - metadata already exists

Checking PXD047932
⏭ Skipping PXD047932 - metadata already exists

Checking PXD058447
Fetching metadata for PXD058447
✗ Project PXD058447 not found (404)
✗ Failed to fetch metadata for PXD058447

Starting: Final Summary

Processing Complete:
- Total projects: 8
- Already existed (skipped): 7
- Successfully processed: 0
- Failed: 1


--------------------------------------------------
Completed: Process Complete
Duration: 0:00:00.000040
--------------------------------------------------
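
In the run above, seven projects were skipped and one (`PXD058447`) failed with a 404. To see which projects still lack a metadata file before re-running, the logic of `check_existing_metadata` can be applied to the whole list at once. The sketch below is hypothetical (the name `missing_projects` is not part of the notebook) and assumes the same file layout as `save_metadata`.

```python
from pathlib import Path
from typing import List


def missing_projects(pride_ids: List[str], output_dir: Path) -> List[str]:
    """Return the IDs that have no saved metadata file under output_dir."""
    return [
        px_id
        for px_id in pride_ids
        if not (output_dir / px_id / f"{px_id}_experiment_metadata.json").exists()
    ]
```

Calling it with the IDs read from the sheet and the `Hominins` output folder would list the projects that a further run still needs to fetch.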




