<a href="https://colab.research.google.com/github/Palaeoprot/PRIDE/blob/main/Multi_PRIDE_sheets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PRIDE Files Download Program

This Colab notebook provides an automated solution for downloading proteomics data files from the PRIDE (PRoteomics IDEntifications) database using the [pridepy](https://github.com/PRIDE-Archive/pridepy) package.

## Features

- Retrieves PRIDE project IDs from a specified Google Sheet
- Downloads essential proteomics files (.fasta, .mgf) and README files
- Supports multiple download protocols (aspera, ftp, globus)
- Organizes downloads in a structured directory hierarchy
- Integrates with Google Drive for storage

## Prerequisites

- Google Colab environment
- Access to Google Drive
- A Google Sheet containing PRIDE project IDs
- `pridepy` package (automatically installed by the notebook)

## Configuration Parameters

The following parameters can be configured in the notebook:

- `sheet_name`: Name of the worksheet containing PRIDE IDs
- `repository`: Repository name (currently set to 'PRIDE')
- `file_types`: File types to download, as a comma-separated string (set to `'mgf, fasta, txt, raw'` in the configuration cell)
- `download_raw`: Boolean flag for downloading RAW files
- `protocol`: Download protocol ('aspera', 'ftp', or 'globus')
- `folder_name`: Name of the folder where files will be stored
- `spreadsheet_id`: Google Sheets ID containing PRIDE project IDs
- `shared_drive_base_dir_str`: Base directory path in Google Drive
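
Because `file_types` is a single comma-separated string, it has to be split before it can be compared against file extensions. The following is a minimal sketch of one way to do that; `parse_file_types` is a hypothetical helper and is not part of the notebook or of `pridepy`.

```python
from typing import Set


def parse_file_types(file_types: str) -> Set[str]:
    """Split a comma-separated string into lowercase extensions without dots."""
    return {
        part.strip().lower().lstrip(".")
        for part in file_types.split(",")
        if part.strip()  # ignore empty entries such as a trailing comma
    }


wanted = parse_file_types("mgf, fasta, .TXT, raw")
print(sorted(wanted))
```

The resulting set can then be compared directly with the extension stored in `FileInfo.file_type`, which is also lowercase and has no leading dot.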

## Google Sheet Structure

The program expects a Google Sheet with:
- PRIDE project IDs in column A
- Data starting from row 2 (row 1 assumed to be headers)
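
This layout corresponds to the range `<sheet_name>!A2:A`. The sketch below shows how a response of that shape could be turned into a list of IDs. The sample response is made up for illustration, and `extract_ids` is a hypothetical helper; unlike the notebook, it also strips surrounding whitespace from each cell.

```python
from typing import Any, Dict, List


def extract_ids(response: Dict[str, Any]) -> List[str]:
    """Return the first cell of every non-empty row, trimmed of whitespace."""
    rows = response.get("values", [])
    return [row[0].strip() for row in rows if row and row[0].strip()]


# Illustrative response: one list per sheet row, empty rows as empty lists.
sample_response = {
    "range": "'PX Hominins'!A2:A1000",
    "values": [["PXD018264"], [], [" PXD000001 "]],
}
print(extract_ids(sample_response))
```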

## Usage

1. Open the notebook in Google Colab
2. Mount your Google Drive
3. Configure the parameters as needed
4. Run all cells

The program will:
1. Authenticate and access Google Drive
2. Create necessary directories
3. Retrieve PRIDE IDs from the specified Google Sheet
4. Download files for each PRIDE project
5. Organize files in project-specific folders

## File Selection

The program automatically downloads:
- `.fasta` files
- `.mgf` files
- README files
- `.raw` files (optional, controlled by `download_raw` parameter)
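
The selection rules above can be sketched as a small predicate. This is an illustration under stated assumptions, not the notebook's actual logic: `should_download` is a hypothetical name, and treating any file name that contains "readme" as a README file is an assumption.

```python
from pathlib import Path


def should_download(filename: str, download_raw: bool = False) -> bool:
    """Decide whether a file matches the selection rules listed above."""
    name = filename.lower()
    suffix = Path(name).suffix
    if suffix in {".fasta", ".mgf"}:
        return True
    if "readme" in name:  # assumption: README files are recognised by name
        return True
    # RAW files are only selected when explicitly requested.
    return download_raw and suffix == ".raw"


for candidate in ["file1.MGF", "readme.txt", "run1.raw", "results.zip"]:
    print(candidate, should_download(candidate))
```

Note that a compressed file such as `file1.mgf.gz` has the suffix `.gz` and would not be selected by this sketch.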

## Output Structure

Files are organized in the following structure:

    shared_drive_base_dir/
    └── folder_name/
        ├── PRIDE_ID_1/
        │   ├── file1.mgf
        │   ├── file2.fasta
        │   └── readme.txt
        └── PRIDE_ID_2/
            ├── file1.mgf
            └── file2.fasta
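
A per-project folder in this hierarchy can be built with `pathlib`. The sketch below uses a temporary directory in place of the Google Drive base path so that it can run anywhere; `project_dir` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def project_dir(base_dir: Path, folder_name: str, pride_id: str) -> Path:
    """Create (if needed) and return base_dir/folder_name/pride_id."""
    target = base_dir / folder_name / pride_id
    target.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    return target


with tempfile.TemporaryDirectory() as tmp:
    path = project_dir(Path(tmp), "Hominins", "PXD018264")
    created = path.is_dir()
    rel = path.relative_to(tmp).as_posix()

print(rel, created)
```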

## Error Handling

The program includes error handling for:
- Google Sheets API errors
- File download failures
- JSON parsing errors
- Directory creation issues
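
One common pattern for this kind of error handling is to catch failures per file, log them, and carry on with the remaining files. The sketch below illustrates that pattern with a stand-in download function; `download_all` and `fake_fetch` are hypothetical names, and no real transfer takes place.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("pride_download_sketch")


def download_all(
    names: Iterable[str], fetch: Callable[[str], None]
) -> Tuple[List[str], List[str]]:
    """Try every file; record failures instead of aborting the whole run."""
    succeeded: List[str] = []
    failed: List[str] = []
    for name in names:
        try:
            fetch(name)
        except Exception as err:  # report the failure and keep going
            logger.error("Failed to download %s: %s", name, err)
            failed.append(name)
        else:
            succeeded.append(name)
    return succeeded, failed


def fake_fetch(name: str) -> None:
    """Stand-in for a real download; fails for .raw files."""
    if name.endswith(".raw"):
        raise OSError("simulated transfer error")


ok, bad = download_all(["a.mgf", "b.raw", "c.fasta"], fake_fetch)
print(ok, bad)
```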

## Dependencies

- `pridepy`
- `google.colab`
- `googleapiclient`
- `pathlib`
- `subprocess`
- `json`

## Notes

- The program uses the `pridepy` command-line interface for file downloads
- Progress and errors are logged to the notebook output
- Failed downloads are reported but don't stop the entire process
- Existing files may be overwritten
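
If overwriting is a concern, a check along the following lines could be run before each download. This is a sketch only: `needs_download` is a hypothetical helper that the notebook does not include, and comparing the file size on disk with the size reported in the metadata is an assumed heuristic for detecting incomplete files.

```python
import tempfile
from pathlib import Path


def needs_download(target_dir: Path, filename: str, expected_size: int = 0) -> bool:
    """Return True if the file is missing or its size differs from the expected one."""
    target = target_dir / filename
    if not target.exists():
        return True
    # Only compare sizes when the metadata actually reported one.
    if expected_size > 0 and target.stat().st_size != expected_size:
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "file1.mgf").write_bytes(b"12345")
    checks = {
        "missing": needs_download(folder, "file2.fasta"),
        "complete": needs_download(folder, "file1.mgf", expected_size=5),
        "truncated": needs_download(folder, "file1.mgf", expected_size=10),
    }

print(checks)
```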


In [1]:
# Install or upgrade pridepy and tqdm
!pip install --upgrade pridepy tqdm

# To learn more about pridepy, uncomment any of the following:
# !pridepy --help
# !pridepy stream-files-metadata --help
# !pridepy --help | grep download


In [2]:
import subprocess
import json
import logging
from pathlib import Path
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
import pandas as pd

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

# --- Module Parameters ---
sheet_name = 'PX Hominins'  # @param {type:"string"}
repository = 'PRIDE'  # @param {type:"string"}
file_types = 'mgf, fasta, txt, raw'  # @param {type:"string"}
protocol = 'aspera'  # @param ['aspera', 'ftp', 'globus']
folder_name = 'Hominins'  # @param {type:"string"}
shared_drive_base_dir_str = "/content/drive/Shareddrives/MS_data/PRIDE"  # @param {type:"string"}
spreadsheet_id = 'put the ID of your spreadsheet here'  # @param {type:"string"}

# --- Configure Logging ---
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


In [3]:
@dataclass
class FileInfo:
    """Data class for file information."""
    filename: str
    size: int
    project_id: str
    file_type: str

    @property
    def size_in_gb(self) -> float:
        return self.size / 1e9

    @classmethod
    def from_dict(cls, data: Dict[str, Any], project_id: str) -> 'FileInfo':
        """Create FileInfo instance from PRIDE metadata dictionary."""
        # Try different size fields that might exist in PRIDE metadata
        size_fields = ['fileSize', 'publicFileSize', 'fileSizeBytes']
        file_size = 0
        for field in size_fields:
            if field in data and data[field]:
                try:
                    # Some fields might store size as string
                    file_size = int(str(data[field]).replace(',', ''))
                    break
                except (ValueError, TypeError):
                    continue

        return cls(
            filename=data["fileName"],
            size=file_size,
            project_id=project_id,
            file_type=Path(data["fileName"]).suffix.lower()[1:]
        )

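def get_pride_ids_from_sheet(spreadsheet_id: str, sheet_name: str) -> list:
    """Retrieve PRIDE IDs from column A (row 2 onward) of a Google Sheet.

    Assumes the Colab session is already authenticated, so that build()
    can pick up default credentials.
    """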
    try:
        service = build('sheets', 'v4')
        sheet = service.spreadsheets()
        range_name = f"{sheet_name}!A2:A"
        result = sheet.values().get(spreadsheetId=spreadsheet_id, range=range_name).execute()
        values = result.get('values', [])

        if not values:
            logger.warning('No data found in the Google Sheet.')
            return []

        pride_ids = [row[0] for row in values if row]
        logger.info(f"Retrieved {len(pride_ids)} PRIDE IDs")
        for pride_id in pride_ids:
            logger.info(f"Found PRIDE ID: {pride_id}")
        return pride_ids

    except HttpError as err:
        logger.error(f"Failed to retrieve data from Google Sheets: {err}")
        return []

def get_project_files(project_id: str, download_dir: Path) -> List[FileInfo]:
    """Get list of files available for a PRIDE project."""
    project_dir = download_dir / project_id
    project_dir.mkdir(parents=True, exist_ok=True)
    metadata_file = project_dir / f"{project_id}_metadata.json"

    command = [
        'pridepy',
        'stream-files-metadata',
        '-a', project_id,
        '-o', str(metadata_file)
    ]

    try:
        result = subprocess.run(
            command,
            check=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True
        )
        logger.info(f"Retrieved metadata for project {project_id}")

        with open(metadata_file, "r") as f:
            metadata = json.load(f)

        file_infos = []
        for data in metadata:
            file_info = FileInfo.from_dict(data, project_id)
            file_infos.append(file_info)
            logger.info(f"Found file: {file_info.filename} ({file_info.size_in_gb:.2f} GB)")

        return file_infos

    except (subprocess.CalledProcessError, json.JSONDecodeError, IOError) as err:
        logger.error(f"Failed to get files for project {project_id}: {err}")
        return []

def download_file(file_info: FileInfo, output_dir: Path, protocol: str = 'aspera') -> bool:
    """Download a single file using pridepy."""
    project_dir = output_dir / file_info.project_id
    project_dir.mkdir(parents=True, exist_ok=True)

    command = [
        'pridepy',
        'download-file-by-name',
        '-a', file_info.project_id,
        '-f', file_info.filename,
        '-o', str(project_dir),
        '-p', protocol
    ]

    try:
        logger.info(f"Downloading {file_info.filename} with {protocol}")
        result = subprocess.run(command, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
        logger.info(f"Successfully downloaded {file_info.filename}")
        return True
    except subprocess.CalledProcessError as e:
        logger.error(f"Failed to download {file_info.filename} with {protocol}: {e.stderr}")
        # Try FTP if Aspera fails
        if protocol == 'aspera':
            logger.info(f"Retrying with FTP: {file_info.filename}")
            return download_file(file_info, output_dir, 'ftp')
        return False

def download_files_by_type(pride_id: str, file_type: str, all_files: Dict[str, List[FileInfo]],
                          output_dir: Path, protocol: str = 'aspera'):
    """Download all files of a specific type for a PRIDE project."""
    if pride_id not in all_files:
        logger.error(f"No files found for {pride_id}")
        return

    files = all_files[pride_id]
    files_to_download = [f for f in files if f.file_type == file_type]

    if not files_to_download:
        logger.warning(f"No {file_type} files found for {pride_id}")
        return

    logger.info(f"Starting download of {len(files_to_download)} {file_type} files for {pride_id}")

    successful = 0
    for file_info in files_to_download:
        if download_file(file_info, output_dir, protocol):
            successful += 1

    logger.info(f"Successfully downloaded {successful}/{len(files_to_download)} {file_type} files for {pride_id}")

def group_files_by_type(files: List[FileInfo]) -> Dict[str, List[FileInfo]]:
    """Group files by their type."""
    grouped = {}
    for file in files:
        if file.file_type not in grouped:
            grouped[file.file_type] = []
        grouped[file.file_type].append(file)
    return grouped

def print_file_summary(pride_id: str, files: List[FileInfo]):
    """Print summary of available files for a PRIDE project."""
    print(f"\nFiles available for {pride_id}:")
    print("=" * 50)

    if not files:
        print("No files found")
        return

    grouped_files = group_files_by_type(files)

    for file_type, file_list in sorted(grouped_files.items()):
        total_size = sum(f.size_in_gb for f in file_list)
        print(f"\nFile type: .{file_type}")
        print(f"Number of files: {len(file_list)}")
        print(f"Total size: {total_size:.2f} GB")
        print("\nFiles:")
        for f in sorted(file_list, key=lambda x: x.filename):
            print(f"- {f.filename} ({f.size_in_gb:.2f} GB)")

def save_file_summary(pride_id: str, files: List[FileInfo], summary_dir: Path):
    """Save file summary to a text file."""
    summary_path = summary_dir / pride_id / "available_files_summary.txt"
    summary_path.parent.mkdir(parents=True, exist_ok=True)

    grouped_files = group_files_by_type(files)

    with open(summary_path, "w") as f:
        f.write(f"Files available for {pride_id}:\n")
        f.write("=" * 50 + "\n")

        if not files:
            f.write("\nNo files found\n")
            return

        for file_type, file_list in sorted(grouped_files.items()):
            # Use a distinct loop name so it is not confused with the file handle `f`
            total_size = sum(info.size_in_gb for info in file_list)
            f.write(f"\nFile type: .{file_type}\n")
            f.write(f"Number of files: {len(file_list)}\n")
            f.write(f"Total size: {total_size:.2f} GB\n")
            f.write("\nFiles:\n")
            for file_info in sorted(file_list, key=lambda x: x.filename):
                f.write(f"- {file_info.filename} ({file_info.size_in_gb:.2f} GB)\n")



In [4]:
# --- Main ---
def main():
    """Main execution function with verification and download steps."""
    try:
        # Get PRIDE IDs
        pride_ids = get_pride_ids_from_sheet(spreadsheet_id, sheet_name)
        if not pride_ids:
            logger.error("No PRIDE IDs found")
            return

        # Setup directories
        base_dir = Path(shared_drive_base_dir_str)
        download_dir = base_dir / folder_name
        download_dir.mkdir(parents=True, exist_ok=True)

        # First list all available files
        print("\nChecking available files for each PRIDE project...")
        all_files = {}

        for pride_id in pride_ids:
            files = get_project_files(pride_id, download_dir)
            all_files[pride_id] = files
            print_file_summary(pride_id, files)
            save_file_summary(pride_id, files, download_dir)

        # Print overall summary
        print("\nOverall Summary:")
        print("=" * 50)
        for pride_id, files in all_files.items():
            grouped = group_files_by_type(files)
            print(f"\n{pride_id}:")
            for file_type, file_list in sorted(grouped.items()):
                print(f"  .{file_type}: {len(file_list)} files")

        # Add download functionality
        print("\nWould you like to proceed with downloads? (y/n)")
        response = input().strip().lower()
        if response == 'y':
            print("\nSelect file type to download:")
            available_types = set()
            for files in all_files.values():
                for file in files:
                    available_types.add(file.file_type)

            print("Available file types:", ", ".join(sorted(available_types)))
            file_type = input("Enter file type (without dot): ").strip().lower()

            if file_type not in available_types:
                logger.error(f"Invalid file type. Must be one of: {', '.join(sorted(available_types))}")
                return

            print(f"\nDownloading {file_type} files using {protocol} protocol...")
            for pride_id in pride_ids:
                download_files_by_type(pride_id, file_type, all_files, download_dir, protocol)

        print(f"\nFile summaries have been saved under {download_dir}.")
        print(f"Configured file types: {file_types}")

    except Exception as e:
        logger.error(f"Process failed: {str(e)}")
        raise

if __name__ == "__main__":
    main()


Checking available files for each PRIDE project...

Files available for PXD018264:

File type: .raw
Number of files: 78
Total size: 201.58 GB

Files:
- 20181116_QE7_nLC11_MEM_COLLAB_FWLeakey_1499_Equid1_Trypsin1.raw (2.46 GB)
- 20181116_QE7_nLC11_MEM_COLLAB_FWLeakey_1500_Equid1_Trypsin2.raw (2.42 GB)
- 20181116_QE7_nLC11_MEM_COLLAB_FWLeakey_1501_Equid1_Trypsin3.raw (2.45 GB)
- 20181116_QE7_nLC11_MEM_COLLAB_FWLeakey_1502_Equid1_Pepsin1.raw (3.19 GB)
- 20181116_QE7_nLC11_MEM_COLLAB_FWLeakey_1503_Equid1_Pepsin2.raw (3.29 GB)
- 20181116_QE7_nLC11_MEM_COLLAB_FWLeakey_1504_Equid1_Pepsin3.raw (3.29 GB)
- 20181116_QE7_nLC11_MEM_COLLAB_FWLeakey_1505_Equid1_LysN1.raw (3.10 GB)
- 20181116_QE7_nLC11_MEM_COLLAB_FWLeakey_1506_Equid1_LysN2.raw (3.11 GB)
- 20181116_QE7_nLC11_MEM_COLLAB_FWLeakey_1507_Equid1_LysN3.raw (3.10 GB)
- 20181116_QE7_nLC11_MEM_COLLAB_FWLeakey_1508_Equid1_GluC1.raw (2.94 GB)
- 20181116_QE7_nLC11_MEM_COLLAB_FWLeakey_1509_Equid1_GluC2.raw (2.91 GB)
- 20181116_QE7_nLC11_MEM_COLLAB