# Data Acquisition from PRIDE Repository

This notebook demonstrates how to use the **Proteomics Data Pipeline** to acquire proteomics datasets from the PRIDE Archive.

## What is PRIDE?

**PRIDE (PRoteomics IDEntifications Database)** is the world's largest public repository of mass spectrometry-based proteomics data. It's part of the ProteomeXchange Consortium and contains thousands of publicly available proteomics datasets.

## What You'll Learn

In this notebook, we'll walk through:
1. **Searching** for datasets in PRIDE
2. **Retrieving** dataset metadata
3. **Listing** available files
4. **Downloading** proteomics data files
5. **Parsing** mzTab files into pandas DataFrames
6. **Exploring** the actual proteomics data

Let's get started!

## 1. Setup and Imports

First, let's import the necessary libraries and modules.

In [1]:
# Add src to path
import sys
from pathlib import Path
sys.path.insert(0, str(Path.cwd().parent / "src"))

# Import our custom modules
from data_acquisition.pride_api import PRIDEClient
from data_acquisition.file_parser import FileParser

# Import standard libraries
import pandas as pd
import matplotlib.pyplot as plt

# Configure pandas display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 50)

print("✓ Imports successful!")

✓ Imports successful!


## 2. Initialize PRIDE Client

Create a client to interact with the PRIDE Archive API. The client handles:
- API calls with automatic retries
- Response caching (for faster repeated requests)
- Download progress tracking

In [2]:
# Initialize the PRIDE API client
client = PRIDEClient()

print(f"✓ Connected to PRIDE API: {client.base_url}")
print(f"  Cache enabled: {client.cache_enabled}")
print(f"  Timeout: {client.timeout}s")
print(f"  Max retries: {client.max_retries}")

✓ Connected to PRIDE API: https://www.ebi.ac.uk/pride/ws/archive/v2
  Cache enabled: True
  Timeout: 30s
  Max retries: 3
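The retry behaviour listed above can be illustrated with a small, standalone sketch. How `PRIDEClient` implements retries is not shown in this notebook, so `call_with_retries` and its parameters are hypothetical names used only for illustration; the delay doubles after each failed attempt, which is the usual shape of exponential backoff.

```python
import time


def call_with_retries(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**attempt, then retry.

    Hypothetical helper for illustration, not part of PRIDEClient.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo: a flaky callable that fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Record the delays instead of actually sleeping, so the demo runs instantly.
delays = []
result = call_with_retries(flaky, max_retries=3, base_delay=0.5, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
```

Passing `sleep` as a parameter keeps the helper easy to test; a real client would likely catch only network-related exceptions rather than every `Exception`.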


## 3. Search for Datasets

Let's search for proteomics datasets related to a specific topic. We'll search for datasets containing the term "Erwinia" (a bacterial genus).

In [3]:
# Search for datasets
search_results = client.search_datasets("Erwinia", page_size=5)

# Convert to DataFrame for nice display
results_df = pd.DataFrame(search_results)

# Display selected columns
display_cols = ['accession', 'title', 'submissionDate']
print(f"Found {len(results_df)} datasets:\n")
results_df[display_cols]

Found 5 datasets:



| | accession | title | submissionDate |
|---|---|---|---|
| 0 | PXD060613 | Comparative proteomics for investigating the e... | 2025-02-10 |
| 1 | PXD038825 | Being spontaneous has its costs! Characterizat... | 2022-12-14 |
| 2 | PXD035224 | Unraveling the Bombus terrestris hemolymph, a ... | 2022-07-11 |
| 3 | PXD010663 | Bacteria Associated with Russian Wheat Aphid (... | 2018-08-02 |
| 4 | PXD000001 | TMT spikes - Using R and Bioconductor for pro... | 2012-03-13 |


## 4. Get Dataset Metadata

Now let's look at a specific dataset in detail. We'll use **PXD000001**, which carries the first accession number issued by ProteomeXchange (a historic dataset!).

In [4]:
# Fetch metadata for PXD000001
dataset_id = "PXD000001"
metadata = client.get_dataset_metadata(dataset_id)

# Display key information
print(f"Dataset: {metadata['accession']}")
print(f"Title: {metadata['title']}")
print(f"Submission Date: {metadata['submissionDate']}")
print(f"Publication Date: {metadata['publicationDate']}")
print(f"\nOrganisms:")
for org in metadata.get('organisms', []):
    print(f"  - {org['name']}")
print(f"\nInstruments:")
for inst in metadata.get('instruments', [])[:3]:  # First 3
    print(f"  - {inst['name']}")

Dataset: PXD000001
Title: TMT spikes -  Using R and Bioconductor for proteomics data analysis
Submission Date: 2012-03-13
Publication Date: 2012-03-07

Organisms:
  - Erwinia carotovora

Instruments:
  - LTQ Orbitrap Velos


## 5. List Available Files

Each PRIDE dataset contains multiple files (raw data, processed results, etc.). Let's see what files are available for this dataset.

In [5]:
# Get list of files
files = client.get_dataset_files(dataset_id)

# Convert to DataFrame
files_df = pd.DataFrame(files)

# Add size in MB for readability
files_df['sizeMB'] = files_df['fileSizeBytes'] / (1024 * 1024)

# Display file information
print(f"Total files: {len(files_df)}\n")
files_df[['fileName', 'fileCategory', 'sizeMB']].head(10)

Total files: 8



| | fileName | fileCategory | sizeMB |
|---|---|---|---|
| 0 | PRIDE_Exp_Complete_Ac_22134.pride.mztab.gz | OTHER | 0.474916 |
| 1 | PRIDE_Exp_Complete_Ac_22134.pride.mgf.gz | PEAK | 15.686133 |
| 2 | TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_6... | PEAK | 231.77269 |
| 3 | erwinia_carotovora.fasta | OTHER | 1.580875 |
| 4 | TMT_Erwinia_1uLSike_Top10HCD_isol2_45stepped_6... | RAW | 210.261868 |
| 5 | F063721.dat-mztab.txt | OTHER | 0.290678 |
| 6 | PRIDE_Exp_Complete_Ac_22134.xml.gz | RESULT | 10.182576 |
| 7 | F063721.dat | SEARCH | 20.204031 |
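Before choosing what to download, it can help to see how much data each file category holds. The sketch below totals the sizes from the listing above using only the standard library; the `listing` variable is a hand-copied stand-in for `files_df`, not something the client returns.

```python
from collections import defaultdict

# (fileCategory, sizeMB) pairs copied from the listing above.
listing = [
    ("OTHER", 0.474916), ("PEAK", 15.686133), ("PEAK", 231.77269),
    ("OTHER", 1.580875), ("RAW", 210.261868), ("OTHER", 0.290678),
    ("RESULT", 10.182576), ("SEARCH", 20.204031),
]

# Accumulate file count and total size per category.
totals = defaultdict(lambda: {"files": 0, "size_mb": 0.0})
for category, size_mb in listing:
    totals[category]["files"] += 1
    totals[category]["size_mb"] += size_mb

# Print the largest categories first.
for category, stats in sorted(totals.items(), key=lambda kv: kv[1]["size_mb"], reverse=True):
    print(f"{category:<7} {stats['files']} file(s) {stats['size_mb']:8.2f} MB")
```

With the DataFrame from the cell above, the same summary should be available as `files_df.groupby('fileCategory')['sizeMB'].agg(['count', 'sum'])`.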


## 6. Download a File

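Let's download a small file to work with. We'll look for a small mzTab file (under 1 MB) to demonstrate the download functionality. Note that PRIDE may list such files under the `OTHER` category rather than `RESULT`, so we filter by file name instead of by category.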

**Note**: Files are downloaded with a progress bar showing download speed and time remaining.

In [7]:
# Find a small mzTab file (< 1 MB for quick download)
small_files = files_df[
    (files_df['fileName'].str.contains('mztab', case=False)) & 
    (files_df['sizeMB'] < 1.0)
].sort_values('sizeMB')

if len(small_files) > 0:
    # Get the smallest file
    file_to_download = small_files.iloc[0]
    
    print(f"Downloading: {file_to_download['fileName']}")
    print(f"Size: {file_to_download['sizeMB']:.2f} MB")
    print(f"Category: {file_to_download['fileCategory']}\n")
    
    # Download the file
    output_path = f"../data/raw/{file_to_download['fileName']}"
    client.download_file(file_to_download['downloadUrl'], output_path)
    
    print(f"\n✓ Download complete: {output_path}")
else:
    print("No small mzTab files found. Skipping download.")

Downloading: F063721.dat-mztab.txt
Size: 0.29 MB
Category: OTHER



F063721.dat-mztab.txt: 100%|██████████| 298k/298k [00:00<00:00, 1.46MB/s]


✓ Download complete: ../data/raw/F063721.dat-mztab.txt
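The internals of `client.download_file` are not shown here. As a rough sketch of how a streamed download with progress reporting can work, the hypothetical helper below copies a binary stream in fixed-size chunks and reports the running byte count after each one. An in-memory `BytesIO` stands in for the network response so the example runs offline.

```python
import io


def copy_in_chunks(source, dest, total_bytes, chunk_size=8192, on_progress=None):
    """Copy a binary stream to dest chunk by chunk, reporting progress.

    Hypothetical helper for illustration, not the PRIDEClient implementation.
    """
    written = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes means the stream is exhausted
            break
        dest.write(chunk)
        written += len(chunk)
        if on_progress is not None:
            on_progress(written, total_bytes)
    return written


# Demo with in-memory streams standing in for the HTTP response and the file.
payload = b"x" * 20000
source = io.BytesIO(payload)
dest = io.BytesIO()
progress = []
written = copy_in_chunks(
    source, dest, len(payload),
    on_progress=lambda done, total: progress.append(done),
)
print(written, progress)  # 20000 [8192, 16384, 20000]
```

Writing chunk by chunk keeps memory use flat regardless of file size, which matters for the RAW files in this dataset that exceed 200 MB.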





## 7. Parse the Downloaded File

Now let's parse the downloaded file to extract the proteomics data. Our `FileParser` can handle multiple formats including mzTab, CSV, and TSV.

In [12]:
if len(small_files) > 0:
    # Reload module to pick up parser updates
    import importlib
    import sys
    if 'data_acquisition.file_parser' in sys.modules:
        importlib.reload(sys.modules['data_acquisition.file_parser'])
        from data_acquisition.file_parser import FileParser
    
    # Parse the downloaded file
    parser = FileParser()
    
    # Check if it's mzTab format (by filename or extension)
    if 'mztab' in file_to_download['fileName'].lower():
        data = parser.parse_mztab(output_path)
    else:
        data = parser.parse_file(output_path)
    
    print(f"Data type: {type(data)}")
    print(f"\nData shape: {data.shape}")
    if len(data.columns) > 0:
        print(f"Columns: {list(data.columns)[:10]}...")  # Show first 10 columns
else:
    print("No file to parse (download was skipped)")

Data type: <class 'pandas.core.frame.DataFrame'>

Data shape: (1528, 20)
Columns: ['sequence', 'accession', 'unit_id', 'unique', 'database', 'database_version', 'search_engine', 'search_engine_score', 'reliability', 'modifications']...
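To make the parsing step less opaque: mzTab is a tab-separated text format in which the first field of every line identifies the line type, for example `PEH` for the peptide header and `PEP` for peptide rows. The sketch below reads one such section into a list of dictionaries. It is a simplified illustration, not the `FileParser.parse_mztab` implementation, and `parse_mztab_section` is a hypothetical name.

```python
import io


def parse_mztab_section(lines, header_prefix="PEH", row_prefix="PEP"):
    """Return one dict per data row of a single mzTab section.

    Simplified sketch: ignores metadata and all other sections.
    """
    columns = None
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == header_prefix:
            columns = fields[1:]  # column names follow the prefix
        elif fields[0] == row_prefix and columns is not None:
            rows.append(dict(zip(columns, fields[1:])))
    return rows


# Tiny illustrative sample built from values shown in this notebook.
sample = io.StringIO(
    "COM\tillustrative sample, not the real file\n"
    "PEH\tsequence\taccession\tunit_id\n"
    "PEP\tDGVSVAR\tECA0625\tTMTspikes\n"
    "PEP\tNVVLDK\tECA0625\tTMTspikes\n"
)
peptides = parse_mztab_section(sample)
print(peptides[0])
```

Passing the resulting list to `pd.DataFrame(peptides)` gives a table with one column per header field.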


## 8. Explore the Data

Let's examine the structure and content of the proteomics data we just acquired.

In [13]:
if len(small_files) > 0:
    # Display first few rows
    print("First 5 rows:")
    print(data.head())
    
    print("\n" + "="*80 + "\n")
    
    # Display summary statistics
    print("Summary statistics:")
    print(data.describe())
    
    print("\n" + "="*80 + "\n")
    
    # Check for missing values
    print("Missing values per column:")
    missing = data.isnull().sum()
    print(missing[missing > 0] if missing.sum() > 0 else "No missing values")
else:
    print("No data to explore")

First 5 rows:
    sequence accession    unit_id unique database          database_version  \
0    DGVSVAR   ECA0625  TMTspikes     --  Erwinia  erwinia_carotovora.fasta   
1     NVVLDK   ECA0625  TMTspikes     --  Erwinia  erwinia_carotovora.fasta   
2  VEDALHATR   ECA0625  TMTspikes     --  Erwinia  erwinia_carotovora.fasta   
3  LAGGVAVIK   ECA0625  TMTspikes     --  Erwinia  erwinia_carotovora.fasta   
4   LIAEAMEK   ECA0625  TMTspikes     --  Erwinia  erwinia_carotovora.fasta   

  search_engine                       search_engine_score reliability  \
0            --  [PRIDE,PRIDE:0000069,Mascot Score,26.92]          --   
1            --  [PRIDE,PRIDE:0000069,Mascot Score,22.66]          --   
2            --   [PRIDE,PRIDE:0000069,Mascot Score,41.9]          --   
3            --  [PRIDE,PRIDE:0000069,Mascot Score,48.99]          --   
4            --  [PRIDE,PRIDE:0000069,Mascot Score,38.04]          --   

  modifications retention_time charge mass_to_charge uri  \
0           
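The `search_engine_score` column holds bracketed parameters such as `[PRIDE,PRIDE:0000069,Mascot Score,26.92]` rather than plain numbers, so numeric summaries need the last field extracted first. Here is a minimal sketch, assuming the value is always the final comma-separated element; `cv_param_value` is a hypothetical helper, not part of `FileParser`.

```python
def cv_param_value(param):
    """Extract the numeric value from '[label,accession,name,value]'.

    Returns None when the last field is not a number (e.g. the '--' placeholder).
    """
    inner = param.strip().strip("[]")
    value = inner.rsplit(",", 1)[-1].strip()
    try:
        return float(value)
    except ValueError:
        return None


scores = [cv_param_value(s) for s in [
    "[PRIDE,PRIDE:0000069,Mascot Score,26.92]",
    "[PRIDE,PRIDE:0000069,Mascot Score,41.9]",
    "--",
]]
print(scores)  # [26.92, 41.9, None]
```

Applied to the parsed data, this would look like `data['search_engine_score'].map(cv_param_value)`, which yields a numeric column suitable for `describe()` or a histogram.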

## 9. Visualize the Data

Let's create a simple visualization to understand the distribution of values in our proteomics data.

In [14]:
if len(small_files) > 0:
    # Find abundance columns by name
    abundance_cols = [col for col in data.columns if 'abundance' in col.lower()]
    
    if len(abundance_cols) > 0:
        # Convert first abundance column to numeric for visualization
        col_name = abundance_cols[0]
        data[col_name] = pd.to_numeric(data[col_name], errors='coerce')
        
        # Plot distribution
        plt.figure(figsize=(10, 6))
        plt.hist(data[col_name].dropna(), bins=50, edgecolor='black', alpha=0.7, color='steelblue')
        plt.xlabel(f'{col_name} (counts)')
        plt.ylabel('Frequency')
        plt.title(f'Distribution of Peptide Abundance\n({col_name})')
        plt.grid(True, alpha=0.3)
        plt.ticklabel_format(style='scientific', axis='x', scilimits=(0,0))
        plt.show()
        
        print(f"\n✓ Plotted distribution for: {col_name}")
        print(f"  Other abundance columns: {abundance_cols[1:]}")
    else:
        print("No abundance columns found for visualization")
else:
    print("No data to visualize")

No numeric columns found for visualization


## Summary

In this notebook, we've demonstrated the complete data acquisition workflow:

1. ✓ **Initialized** the PRIDE API client with retry logic and caching
2. ✓ **Searched** for datasets using keywords
3. ✓ **Retrieved** detailed metadata for a specific dataset
4. ✓ **Listed** all available files in the dataset
5. ✓ **Downloaded** a data file with progress tracking
6. ✓ **Parsed** the file into a structured DataFrame
7. ✓ **Explored** the data structure and statistics
8. ✓ **Visualized** the data distribution

### What's Next?

The data acquisition module handles:
- Resilient API calls with exponential backoff retry
- Response caching to minimize API requests
- Multiple file format parsing (mzTab, CSV, TSV)
- Progress tracking for large downloads

In **Epic 3: Data Processing**, we'll work with this acquired data to:
- Detect and handle missing values
- Apply imputation methods
- Transform data (log2)
- Normalize across samples
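As a preview of two of these steps, the sketch below log2-transforms a sample's intensities and then subtracts the sample median, one simple way to make samples comparable. The intensities are made-up illustrative numbers and `log2_median_center` is a hypothetical helper; the methods actually used in Epic 3 may differ.

```python
import math
import statistics


def log2_median_center(values):
    """log2-transform positive values, then subtract the sample median.

    Missing (None) and non-positive values are dropped in this sketch.
    """
    logged = [math.log2(v) for v in values if v is not None and v > 0]
    median = statistics.median(logged)
    return [x - median for x in logged]


# Two made-up samples whose intensities differ by a constant factor of 10.
sample_a = log2_median_center([100.0, 200.0, 400.0])
sample_b = log2_median_center([1000.0, 2000.0, 4000.0])
print(sample_a)
print(sample_b)
```

After centering, both samples end up on the same scale (approximately -1, 0, 1) even though their raw intensities differ tenfold.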

This pipeline is designed for reproducible proteomics research! 🔬