<a href="https://colab.research.google.com/github/Palaeoprot/PRIDE/blob/main/Multi_PRIDE_sheets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PRIDE Files Download Program

This Colab notebook provides an automated solution for downloading proteomics data files from the PRIDE (PRoteomics IDEntifications) database using the [pridepy](https://github.com/PRIDE-Archive/pridepy) package.

## Features

- Retrieves PRIDE project IDs from a specified Google Sheet
- Downloads essential proteomics files (.fasta, .mgf) and README files
- Supports multiple download protocols (aspera, ftp, globus)
- Organizes downloads in a structured directory hierarchy by file category
- Integrates with Google Drive for storage
- Skips text files larger than 1 MB unless `download_large_text_files` is enabled
- Interactive file type selection and download tracking

## Prerequisites

- Google Colab environment
- Access to Google Drive
- Google Sheets containing PRIDE project IDs
- `pridepy` package (automatically installed by the notebook)

## Configuration Parameters

The following parameters can be configured in the notebook:

- `sheet_name`: Name of the worksheet containing PRIDE IDs
- `repository`: Repository name (currently set to 'PRIDE')
- `file_types`: File types of interest (default: 'mgf, fasta, txt, raw'); the active `main()` does not read this value and instead prompts for a file type interactively
- `protocol`: Download protocol ('aspera', 'ftp', or 'globus')
- `folder_name`: Name of the folder where files will be stored
- `download_large_text_files`: Boolean flag; when `False` (the default), `.txt` files larger than 1 MB are skipped
- `shared_drive_base_dir_str`: Base directory path in Google Drive
- `spreadsheet_id`: Google Sheets ID containing PRIDE project IDs
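
As an illustration of how a comma-separated `file_types` value could be turned into a set of extensions, here is a minimal sketch. The helper name `parse_file_types` is hypothetical; it is not part of the notebook or of `pridepy`.

```python
from typing import Set


def parse_file_types(file_types: str) -> Set[str]:
    """Split a string such as 'mgf, fasta, txt, raw' into a set of
    lowercase extensions without leading dots (hypothetical helper)."""
    cleaned = (part.strip().lstrip('.').lower() for part in file_types.split(','))
    # Drop empty entries produced by stray commas or whitespace.
    return {ext for ext in cleaned if ext}


print(sorted(parse_file_types('mgf, fasta, txt, raw')))
```

A set makes membership checks cheap and ignores accidental duplicates in the parameter string.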

## File Categories

Files are automatically organized into the following categories:

- `RAW`: Raw instrument data files (.raw, .wiff, .d)
- `PEAK`: Peak list files (.mgf, .mzml)
- `RESULT`: Analysis result files (.mzidentml, .mztab)
- `FASTA`: Sequence database files (.fasta)
- `OTHER`: Documentation and miscellaneous files (.txt, .pdf)
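
The mapping above can be sketched as a small lookup on the file extension. This is a simplified standalone version of what the notebook does inside `FileInfo.from_dict`; the names `CATEGORY_MAP` and `categorize` are introduced here for illustration only.

```python
from pathlib import Path

# Extension-to-category table, taken from the list above.
CATEGORY_MAP = {
    '.raw': 'RAW', '.wiff': 'RAW', '.d': 'RAW',
    '.mgf': 'PEAK', '.mzml': 'PEAK',
    '.mzidentml': 'RESULT', '.mztab': 'RESULT',
    '.fasta': 'FASTA',
    '.txt': 'OTHER', '.pdf': 'OTHER',
}


def categorize(filename: str) -> str:
    """Return the category for a filename; unknown extensions map to OTHER."""
    # Path.suffix returns only the final extension, so 'run.raw.gz' yields '.gz'.
    return CATEGORY_MAP.get(Path(filename).suffix.lower(), 'OTHER')


for name in ['Sample_01.RAW', 'peaks.mzML', 'db.fasta', 'run.raw.gz']:
    print(name, '->', categorize(name))
```

Because only the final suffix is inspected, compressed files such as `run.raw.gz` end up in `OTHER` in this sketch.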

## Output Structure

Files are organized in a category-based structure:
```
shared_drive_base_dir/
└── folder_name/
    └── PRIDE_ID/
        ├── RAW/
        │   └── raw_files...
        ├── PEAK/
        │   └── peak_files...
        ├── RESULT/
        │   └── result_files...
        ├── FASTA/
        │   └── fasta_files...
        └── OTHER/
            └── documentation_files...
```
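
To make the layout concrete, the sketch below assembles the destination path for a single file. The function name `destination_path` is hypothetical, and the accession and filename in the example are placeholders rather than values read from the sheet.

```python
from pathlib import PurePosixPath


def destination_path(base_dir: str, folder_name: str, pride_id: str,
                     category: str, filename: str) -> PurePosixPath:
    """Build base_dir/folder_name/PRIDE_ID/CATEGORY/filename (illustrative)."""
    return PurePosixPath(base_dir) / folder_name / pride_id / category / filename


print(destination_path(
    '/content/drive/Shareddrives/ZooMS_Data/PRIDE',  # shared_drive_base_dir_str
    'Hominins',                                      # folder_name
    'PXD000001',                                     # placeholder accession
    'PEAK',
    'sample.mgf',                                    # placeholder filename
))
```

`PurePosixPath` is used so the example only manipulates the path and never touches the file system.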

## Interactive Features

- Lists available file types with size information
- Tracks downloaded file types within the current session
- Shows progress and remaining file types
- Optional 1 MB size limit for text files (enabled by default)
- Automatic retry with FTP if Aspera download fails
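
The tracking of remaining file types can be sketched as a set difference. This is a minimal illustration under the assumption that a file type is the final extension without its dot; `remaining_types` is a hypothetical name, not a function defined in the notebook.

```python
from pathlib import Path
from typing import Iterable, Set


def remaining_types(filenames: Iterable[str], downloaded: Set[str]) -> Set[str]:
    """Return the extensions present in filenames that are not yet downloaded."""
    extensions = {Path(name).suffix.lower().lstrip('.') for name in filenames}
    extensions.discard('')  # files without an extension are ignored
    return extensions - downloaded


files = ['a.mgf', 'b.RAW', 'notes.txt', 'README']
print(sorted(remaining_types(files, {'mgf'})))  # ['raw', 'txt']
```

Computing the difference each time avoids keeping a separate counter in sync with the set of downloaded types.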

## Error Handling

The program handles the following error cases:
- Google Sheets API errors
- File download failures with protocol fallback
- JSON parsing errors
- Directory creation issues
- Size limit violations
- Invalid file type selections
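
The protocol fallback listed above can be sketched in isolation. This is a minimal illustration rather than the notebook's implementation: `download_with_fallback`, `fetch`, and `fake_fetch` are hypothetical names, and `fetch` stands in for whatever performs one download attempt and raises on failure.

```python
import subprocess
from typing import Callable, Optional, Sequence


def download_with_fallback(
    fetch: Callable[[str], None],
    protocols: Sequence[str] = ('aspera', 'ftp'),
) -> Optional[str]:
    """Try each protocol in order; return the first that succeeds, else None."""
    for protocol in protocols:
        try:
            fetch(protocol)  # assumed to raise on failure
            return protocol
        except (subprocess.CalledProcessError, OSError) as err:
            print(f"[FAILED] {protocol}: {err}")
    return None


def fake_fetch(protocol: str) -> None:
    """Stand-in downloader that pretends Aspera is unavailable."""
    if protocol == 'aspera':
        raise subprocess.CalledProcessError(1, ['fake-downloader', protocol])


print(download_with_fallback(fake_fetch))
```

Here the call returns `'ftp'` after the simulated Aspera failure. Passing the downloader in as a callable keeps the retry logic testable without network access.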

## Dependencies

- `pridepy`
- `google.colab`
- `googleapiclient`
- `pathlib`
- `subprocess`
- `json`
- `tqdm` (for progress bars)

## Notes

- The program uses the `pridepy` command-line interface for file downloads
- Progress and errors are logged to the console (no log file is configured)
- Failed downloads are reported but don't stop the entire process
- Existing files are skipped to avoid unnecessary downloads
- Text files > 1MB are skipped unless `download_large_text_files` is set to `True`
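
The text-file size rule can be expressed as a single predicate. The sketch below mirrors the logic of the notebook's `should_download_file`, inverted to answer "should this be skipped?"; the names `should_skip` and `ONE_MB` are introduced here for illustration, and 1 MB is taken as 1024 * 1024 bytes, matching the notebook's `size_in_mb`.

```python
ONE_MB = 1024 * 1024  # bytes, matching the notebook's size_in_mb conversion


def should_skip(filename: str, size_bytes: int,
                download_large_text_files: bool = False) -> bool:
    """Return True only for .txt files above 1 MB when the override is off."""
    is_text = filename.lower().endswith('.txt')
    return is_text and not download_large_text_files and size_bytes > ONE_MB


print(should_skip('notes.txt', 2 * ONE_MB))        # True
print(should_skip('notes.txt', 2 * ONE_MB, True))  # False
print(should_skip('run.raw', 5 * ONE_MB))          # False
```

Non-text files are never skipped by this rule, regardless of size.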

In [None]:
# Install pridepy and tqdm
!pip install --upgrade pridepy tqdm

# To learn more about pridepy, uncomment any of the following:
# !pridepy --help
# !pridepy stream-files-metadata --help
# !pridepy --help | grep download

In [None]:
from google.colab import auth, drive
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError
import subprocess
import json
import logging
from pathlib import Path
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
import pandas as pd

# --- Module Parameters ---
sheet_name = 'PX Hominins'  # @param {type:"string"}
repository = 'PRIDE'  # @param {type:"string"}
file_types = 'mgf, fasta, txt, raw'  # @param {type:"string"}
protocol = 'aspera'  # @param ['aspera', 'ftp', 'globus']
folder_name = 'Hominins'  # @param {type:"string"}
download_large_text_files = False  # @param {type:"boolean"}
shared_drive_base_dir_str = "/content/drive/Shareddrives/ZooMS_Data/PRIDE"  # @param {type:"string"}
spreadsheet_id = '127K6zdl5y46DRqUwRr-V32nUDoceaddbhG9XyozJs-4'  # @param {type:"string"}


# Mount Google Drive
drive.mount('/content/drive', force_remount=True)

# --- Authenticate ---
auth.authenticate_user()

# --- Configure Logging ---
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


In [None]:
@dataclass
class FileInfo:
    """Data class for file information."""
    filename: str
    size: int
    project_id: str
    file_type: str
    category: str = "OTHER"  # Default category if none specified

    @property
    def size_in_gb(self) -> float:
        """Return file size in gigabytes (binary units, consistent with size_in_mb)."""
        return self.size / (1024 ** 3)

    @property
    def size_in_mb(self) -> float:
        """Return file size in megabytes."""
        return self.size / (1024 * 1024)

    @classmethod
    def from_dict(cls, data: Dict[str, Any], project_id: str) -> 'FileInfo':
        """Create FileInfo instance from PRIDE metadata dictionary."""
        # Try different size fields that might exist in PRIDE metadata
        size_fields = ['fileSize', 'publicFileSize', 'fileSizeBytes']
        file_size = 0
        for field in size_fields:
            if field in data and data[field]:
                try:
                    file_size = int(str(data[field]).replace(',', ''))
                    break
                except (ValueError, TypeError):
                    continue

        # Determine file category based on extension
        filename = data["fileName"]
        file_ext = Path(filename).suffix.lower()

        # Map file extensions to PRIDE categories
        category_map = {
            '.raw': 'RAW',
            '.wiff': 'RAW',
            '.d': 'RAW',
            '.mgf': 'PEAK',
            '.mzml': 'PEAK',
            '.mzidentml': 'RESULT',
            '.mztab': 'RESULT',
            '.fasta': 'FASTA',
            '.txt': 'OTHER',
            '.pdf': 'OTHER'
        }

        category = category_map.get(file_ext, 'OTHER')

        # Special case for README files
        if 'readme' in filename.lower():
            category = 'OTHER'

        return cls(
            filename=filename,
            size=file_size,
            project_id=project_id,
            file_type=file_ext[1:] if file_ext else '',  # Remove the dot from extension
            category=category
        )

def should_download_file(file_info: FileInfo, download_large_text_files: bool = False) -> bool:
    """
    Determine if a file should be downloaded based on its type and size.

    Args:
        file_info: FileInfo object containing file metadata
        download_large_text_files: Flag to control downloading of large text files

    Returns:
        bool: True if file should be downloaded, False otherwise
    """
    # Always download non-text files
    if not file_info.filename.lower().endswith('.txt'):
        return True

    # For text files, check size limit unless override is set
    if not download_large_text_files and file_info.size_in_mb > 1:
        logger.info(f"Skipping large text file: {file_info.filename} ({file_info.size_in_mb:.2f} MB)")
        return False

    return True

def download_file(file_info: FileInfo, output_dir: Path, protocol: str = 'aspera',
                 download_large_text_files: bool = False) -> bool:
    """Download a single file using pridepy with size limit checks."""
    # First check if we should download this file
    if not should_download_file(file_info, download_large_text_files):
        logger.info(f"Skipping {file_info.filename} due to size restrictions")
        return False

    # Create category-based subdirectory
    project_dir = output_dir / file_info.project_id / file_info.category
    project_dir.mkdir(parents=True, exist_ok=True)

    # Check if file already exists
    file_path = project_dir / file_info.filename
    if file_path.exists():
        logger.info(f"Skipping {file_info.filename}, already present in {project_dir}")
        print(f"[SKIP] {file_info.filename} (Already exists)")
        return True

    command = [
        'pridepy',
        'download-file-by-name',
        '-a', file_info.project_id,
        '-f', file_info.filename,
        '-o', str(project_dir),
        '-p', protocol
    ]

    try:
        print(f"[DOWNLOADING] {file_info.filename} ...")
        # capture_output collects stdout/stderr so failures can be reported below
        subprocess.run(command, check=True, capture_output=True, text=True)
        print(f"[SUCCESS] {file_info.filename}")
        logger.info(f"Successfully downloaded {file_info.filename}")
        return True
    except FileNotFoundError:
        # Raised when the pridepy executable is not on PATH
        print(f"[FAILED] {file_info.filename} (pridepy command not found)")
        logger.error("pridepy command not found; run the install cell first")
        return False
    except subprocess.CalledProcessError as e:
        print(f"[FAILED] {file_info.filename} (Error: {e.stderr})")
        logger.error(f"Failed to download {file_info.filename} with {protocol}: {e.stderr}")

        # Retry once with FTP if the Aspera download failed
        if protocol == 'aspera':
            logger.info(f"Retrying with FTP: {file_info.filename}")
            return download_file(file_info, output_dir, 'ftp', download_large_text_files)
        return False

In [None]:
# --- Functions ---

def get_pride_ids_from_sheet(spreadsheet_id: str, sheet_name: str) -> list:
    """Retrieves PRIDE IDs from Google Sheet."""
    try:
        service = build('sheets', 'v4')
        sheet = service.spreadsheets()
        range_name = f"{sheet_name}!A2:A"
        result = sheet.values().get(spreadsheetId=spreadsheet_id, range=range_name).execute()
        values = result.get('values', [])

        if not values:
            logger.warning('No data found in the Google Sheet.')
            return []

        pride_ids = [row[0] for row in values if row]
        logger.info(f"Retrieved {len(pride_ids)} PRIDE IDs")
        for pride_id in pride_ids:
            logger.info(f"Found PRIDE ID: {pride_id}")
        return pride_ids

    except HttpError as err:
        logger.error(f"Failed to retrieve data from Google Sheets: {err}")
        return []

def get_project_files(project_id: str, download_dir: Path) -> List[FileInfo]:
    """Get list of files available for a PRIDE project."""
    project_dir = download_dir / project_id
    project_dir.mkdir(parents=True, exist_ok=True)
    metadata_file = project_dir / f"{project_id}_metadata.json"

    command = [
        'pridepy',
        'stream-files-metadata',
        '-a', project_id,
        '-o', str(metadata_file)
    ]

    try:
        result = subprocess.run(
            command,
            check=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True
        )
        logger.info(f"Retrieved metadata for project {project_id}")

        with open(metadata_file, "r") as f:
            metadata = json.load(f)

        file_infos = []
        for data in metadata:
            file_info = FileInfo.from_dict(data, project_id)
            file_infos.append(file_info)
            logger.info(f"Found file: {file_info.filename} ({file_info.size_in_gb:.2f} GB)")

        return file_infos

    except (subprocess.CalledProcessError, json.JSONDecodeError, IOError) as err:
        logger.error(f"Failed to get files for project {project_id}: {err}")
        return []



def group_files_by_type(files: List[FileInfo]) -> Dict[str, List[FileInfo]]:
    """Group files by their type."""
    grouped = {}
    for file in files:
        if file.file_type not in grouped:
            grouped[file.file_type] = []
        grouped[file.file_type].append(file)
    return grouped


def print_file_summary(pride_id: str, files: List[FileInfo]):
    """Print summary of available files for a PRIDE project."""
    print(f"\nFiles available for {pride_id}:")
    print("=" * 50)

    if not files:
        print("No files found")
        return

    # Group files by extension
    grouped_files = {}
    for f in files:
        ext = Path(f.filename).suffix.lower()
        if ext not in grouped_files:
            grouped_files[ext] = []
        grouped_files[ext].append(f)

    for ext, file_list in sorted(grouped_files.items()):
        total_size = sum(f.size_in_gb for f in file_list)
        print(f"\nFile type: {ext}")
        print(f"Number of files: {len(file_list)}")
        print(f"Total size: {total_size:.2f} GB")
        print("\nFiles:")
        for file_info in sorted(file_list, key=lambda x: x.filename):
            print(f"- {file_info.filename} ({file_info.size_in_gb:.2f} GB)")

def save_file_summary(pride_id: str, files: List[FileInfo], summary_dir: Path):
    """Save file summary to a text file."""
    summary_path = summary_dir / pride_id / "available_files_summary.txt"
    summary_path.parent.mkdir(parents=True, exist_ok=True)

    # Group files by extension
    grouped_files = {}
    for f in files:
        ext = Path(f.filename).suffix.lower()
        if ext not in grouped_files:
            grouped_files[ext] = []
        grouped_files[ext].append(f)

    # Use a distinct name for the file handle so it does not shadow the
    # FileInfo loop variables used elsewhere in this function.
    with open(summary_path, "w", encoding="utf-8") as out:
        out.write(f"Files available for {pride_id}:\n")
        out.write("=" * 50 + "\n\n")

        if not files:
            out.write("No files found\n")
            return

        for ext, file_list in sorted(grouped_files.items()):
            total_size = sum(info.size_in_gb for info in file_list)
            out.write(f"\nFile type: {ext}\n")
            out.write(f"Number of files: {len(file_list)}\n")
            out.write(f"Total size: {total_size:.2f} GB\n")
            out.write("\nFiles:\n")
            for file_info in sorted(file_list, key=lambda x: x.filename):
                out.write(f"- {file_info.filename} ({file_info.size_in_gb:.2f} GB)\n")

In [None]:
def main():
    """Main execution function with enhanced download options and size limits."""
    try:
        # Get PRIDE IDs and check files
        pride_ids = get_pride_ids_from_sheet(spreadsheet_id, sheet_name)
        if not pride_ids:
            logger.error("No PRIDE IDs found")
            return

        # Setup directories
        base_dir = Path(shared_drive_base_dir_str)
        download_dir = base_dir / folder_name
        download_dir.mkdir(parents=True, exist_ok=True)

        # First list all available files
        print("\nChecking available files for each PRIDE project...")
        all_files = {}
        downloaded_types = set()  # Track which file types have been downloaded

        for pride_id in pride_ids:
            files = get_project_files(pride_id, download_dir)
            all_files[pride_id] = files
            print_file_summary(pride_id, files)
            save_file_summary(pride_id, files, download_dir)

        while True:  # Continue until user is done
            # Get all available file types that haven't been downloaded
            available_types = set()
            for files in all_files.values():
                for file in files:
                    ext = Path(file.filename).suffix.lower()[1:]  # Get extension without dot
                    if ext and ext not in downloaded_types:  # Only add if extension exists
                        available_types.add(ext)

            if not available_types:
                print("\nAll file types have been downloaded!")
                break

            # Print remaining file types with size information for text files
            print("\nFile types not yet downloaded:")
            for ext in sorted(available_types):
                file_count = 0
                total_size_mb = 0
                skipped_count = 0
                for files in all_files.values():
                    for file in files:
                        if Path(file.filename).suffix.lower()[1:] == ext:
                            if ext == 'txt' and not download_large_text_files and file.size_in_mb > 1:
                                skipped_count += 1
                            else:
                                file_count += 1
                                total_size_mb += file.size_in_mb

                print(f"- {ext}: {file_count} files ({total_size_mb:.2f} MB total)")
                if ext == 'txt' and skipped_count > 0:
                    print(f"  Note: {skipped_count} text files > 1MB will be skipped")

            # Ask if user wants to download more files
            print("\nWould you like to download additional file types? (y/n)")
            response = input().lower()
            if response != 'y':
                break

            # Get file type selection
            print("\nSelect file type to download:")
            print("Available types:", ", ".join(sorted(available_types)))
            file_type = input("Enter file type (without dot): ").lower()

            if file_type not in available_types:
                logger.error(f"Invalid file type. Must be one of: {', '.join(sorted(available_types))}")
                continue

            print(f"\nDownloading {file_type} files using {protocol} protocol...")
            for pride_id, files in all_files.items():
                # Filter files by type and download
                type_files = [f for f in files if Path(f.filename).suffix.lower()[1:] == file_type]
                for file in type_files:
                    download_file(file, download_dir, protocol, download_large_text_files)

            # Add to downloaded types
            downloaded_types.add(file_type)

            # Show progress
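            # available_types already excludes types downloaded in earlier rounds,
            # so count only those still missing after this round.
            remaining = len(available_types - downloaded_types)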
            print(f"\nProgress: {len(downloaded_types)} file types downloaded, {remaining} remaining")

        # Final summary
        print("\nDownload session complete!")
        print("Downloaded file types:", ", ".join(sorted(downloaded_types)))
        if available_types - downloaded_types:
            print("Remaining file types:", ", ".join(sorted(available_types - downloaded_types)))

        # Print size limit information
        if not download_large_text_files:
            print("\nNote: Text files larger than 1MB were skipped. Set download_large_text_files = True to download all text files.")

    except Exception as e:
        logger.error(f"Process failed: {str(e)}")
        raise

In [None]:
if __name__ == "__main__":
    main()

In [None]:
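# Earlier version of main(), superseded by the cell above and kept commented out for reference.
# Note: it calls download_files_by_type(), which is not defined in this notebook.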
# def main():
#     """Main execution function with verification and download steps."""
#     try:
#         # Get PRIDE IDs
#         pride_ids = get_pride_ids_from_sheet(spreadsheet_id, sheet_name)
#         if not pride_ids:
#             logger.error("No PRIDE IDs found")
#             return

#         # Setup directories
#         base_dir = Path(shared_drive_base_dir_str)
#         download_dir = base_dir / folder_name
#         download_dir.mkdir(parents=True, exist_ok=True)

#         # First list all available files
#         print("\nChecking available files for each PRIDE project...")
#         all_files = {}

#         for pride_id in pride_ids:
#             files = get_project_files(pride_id, download_dir)
#             all_files[pride_id] = files
#             print_file_summary(pride_id, files)
#             save_file_summary(pride_id, files, download_dir)

#         # Print overall summary
#         print("\nOverall Summary:")
#         print("=" * 50)
#         for pride_id, files in all_files.items():
#             grouped = group_files_by_type(files)
#             print(f"\n{pride_id}:")
#             for file_type, file_list in sorted(grouped.items()):
#                 print(f"  .{file_type}: {len(file_list)} files")

#         # Add download functionality
#         print("\nWould you like to proceed with downloads? (y/n)")
#         response = input().lower()
#         if response == 'y':
#             print("\nSelect file type to download:")
#             available_types = set()
#             for files in all_files.values():
#                 for file in files:
#                     available_types.add(file.file_type)

#             print("Available file types:", ", ".join(sorted(available_types)))
#             file_type = input("Enter file type (without dot): ").lower()

#             if file_type not in available_types:
#                 logger.error(f"Invalid file type. Must be one of: {', '.join(sorted(available_types))}")
#                 return

#             print(f"\nDownloading {file_type} files using {protocol} protocol...")
#             for pride_id in pride_ids:
#                 download_files_by_type(pride_id, file_type, all_files, download_dir, protocol)

#         print("\nFile summaries have been saved. You can now proceed with downloads.")
#         print(f"Current file types to download: {file_types}")

#     except Exception as e:
#         logger.error(f"Process failed: {str(e)}")
#         raise

# if __name__ == "__main__":
#     main()