# DIA Astral Module Testing and Visualization

This notebook tests the new DIA Astral protein-group quantification module and provides visualizations of the results.

In [1]:
#!/usr/bin/env python3
"""
Test script for the DIA Astral module with visualizations.
"""

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("📦 Libraries imported successfully!")

📦 Libraries imported successfully!


## Import ProteoBench Module

Import the DIA Astral protein groups module and set up the configuration.

In [2]:
# Import your module
from proteobench.modules.quant.quant_lfq_proteingroup_DIA_Astral import DIAQuantProteinGroupModuleAstral

print("🔧 ProteoBench modules imported successfully!")

🔧 ProteoBench modules imported successfully!


## Configuration

Set up the test parameters and file paths.

In [3]:
# Configuration
token = "dummy_token_for_testing"  # Replace with real token if needed

# Test data - replace these paths with your actual test files
input_file = "/Users/locard/Documents/Projets_en_cours/2022_ProteoBench/Dev/20260219_Module_PG/test_data/Legacy.pg_matrix_modified.tsv"
input_format = "DIA-NN"

# Sample user input configuration -> FAKE VALUES, replace with actual user input as needed
user_input = {
    "software_name": "DIA-NN",
    "software_version": "1.0",
    "search_engine_version": "1.0", 
    "search_engine": "DIA-NN",
    "ident_fdr_peptide": 0.01,
    "ident_fdr_psm": 0.01,
    "ident_fdr_protein": 0.01,
    "enable_match_between_runs": 1,
    "enzyme": "Trypsin",
    "allowed_miscleavages": 2,
    "min_peptide_length": 6,
    "max_peptide_length": 40,
    "precursor_mass_tolerance": 20,
    "fragment_mass_tolerance": 20,
}

print("✅ Configuration set!")
print(f"📁 Input file: {input_file}")
print(f"🔧 Input format: {input_format}")

✅ Configuration set!
📁 Input file: /Users/locard/Documents/Projets_en_cours/2022_ProteoBench/Dev/20260219_Module_PG/test_data/Legacy.pg_matrix_modified.tsv
🔧 Input format: DIA-NN
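
Because the input path in the configuration is machine-specific, it can help to make it overridable and to fail early when the file is missing. The following is a minimal sketch of that idea, not part of ProteoBench: the environment variable name `PROTEOBENCH_TEST_FILE` and the helper `resolve_input_file` are hypothetical.

```python
import os
from pathlib import Path

# Hypothetical environment variable name; ProteoBench does not define it.
ENV_VAR = "PROTEOBENCH_TEST_FILE"


def resolve_input_file(default_path: str) -> Path:
    """Return the test file path, preferring the environment variable if set."""
    path = Path(os.environ.get(ENV_VAR, default_path)).expanduser()
    if not path.is_file():
        # Fail early with a message that says how to override the path.
        raise FileNotFoundError(
            f"Input file not found: {path} (set {ENV_VAR} to override)"
        )
    return path
```

In the configuration cell this could be used as `input_file = str(resolve_input_file(input_file))`, so that a wrong path is reported before any parsing starts.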


## Initialize Core Components (Git-Free Approach)

Since the full module requires Git repository access, we test the core functionality using the individual components instead. This approach lets you:

- ✅ Test your DIA Astral processing pipeline
- ✅ Validate input parsing and quantification
- ✅ Generate visualizations and metrics
- ❌ Skip Git repository operations (pull requests, data submission)

This makes it well suited to the **development and testing** phases.

In [4]:
# Since the full module requires Git, let's use the core components independently

# Import the individual components we need
from proteobench.io.parsing.parse_proteingroup import load_input_file
from proteobench.io.parsing.parse_settings import ParseSettingsBuilder
from proteobench.score.quantscores import QuantScoresHYE
from proteobench.datapoint.quant_datapoint import QuantDatapointHYE

print("🚀 Core components imported successfully!")
print("📋 We can now test the functionality without Git dependencies")

🚀 Core components imported successfully!
📋 We can now test the functionality without Git dependencies


In [5]:
# Test the core functionality step by step (without Git)
try:
    print("🔄 Step 1: Loading input file...")
    input_df = load_input_file(input_file, input_format)
    print(f"✅ Input file loaded! Shape: {input_df.shape}")

    print("\n📊 First 3 rows (transposed for better readability):")
    display(input_df.head(3).T)  # Transpose to show columns as rows

    print("\n🔄 Step 2: Setting up parse settings...")
    # You'll need to specify the correct parse settings directory for your module
    parse_settings_dir = "proteobench/io/parsing/io_parse_settings/Quant/lfq/DIA/proteingroup/Astral"
    module_id = "quant_lfq_DIA_proteingroup_Astral"

    # Try to create parse settings
    parse_settings_builder = ParseSettingsBuilder(
        parse_settings_dir=parse_settings_dir,
        module_id=module_id
    )
    parse_settings = parse_settings_builder.build_parser(input_format)
    print("✅ Parse settings created!")

    print("\n🔄 Step 3: Converting to standard format...")
    standard_format, replicate_to_raw = parse_settings.convert_to_standard_format(input_df)
    print(f"✅ Standard format created! Shape: {standard_format.shape}")
    # Display the first few rows of the standard format
    print("\n📊 First 3 rows of standard format (transposed):")
    display(standard_format.head(3).T)  # Transpose to show columns as rows

except Exception as e:
    print(f"❌ Workflow failed at some step: {e}")
    import traceback
    traceback.print_exc()

🔄 Step 1: Loading input file...
✅ Input file loaded! Shape: (11388, 12)

📊 First 3 rows (transposed for better readability):


| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| Protein.Group | A0A024RBG1 | A0A096LP01 | A0A0B4J2D5;P0DPI2 |
| Protein.Names | NUD4B_HUMAN | SIM26_HUMAN | GAL3A_HUMAN;GAL3B_HUMAN |
| Genes | | | |
| First.Protein.Description | | | |
| N.Sequences | 5 | 2 | 10 |
| N.Proteotypic.Sequences | 1 | 2 | 0 |
| LFQ_Astral_DIA_15min_50ng_Condition_A_REP1 | 96186.5 | 10061.4 | 230732.0 |
| LFQ_Astral_DIA_15min_50ng_Condition_A_REP2 | 78648.1 | 17722.0 | 205229.0 |
| LFQ_Astral_DIA_15min_50ng_Condition_A_REP3 | 81700.5 | 10945.4 | 223626.0 |
| LFQ_Astral_DIA_15min_50ng_Condition_B_REP1 | 100371.0 | 7809.48 | 238560.0 |



🔄 Step 2: Setting up parse settings...
✅ Parse settings created!

🔄 Step 3: Converting to standard format...
✅ Standard format created! Shape: (63734, 20)

📊 First 3 rows of standard format (transposed):


| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| Genes | | | |
| YEAST | False | False | False |
| N.Sequences | 5 | 2 | 10 |
| Proteins | NUD4B_HUMAN | SIM26_HUMAN | GAL3A_HUMAN;GAL3B_HUMAN |
| First.Protein.Description | | | |
| contaminant | False | False | False |
| MULTI_SPEC | False | False | False |
| HUMAN | True | True | True |
| Protein.Group | A0A024RBG1 | A0A096LP01 | A0A0B4J2D5;P0DPI2 |
| N.Proteotypic.Sequences | 1 | 2 | 0 |
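
The boolean `HUMAN`, `YEAST`, and `MULTI_SPEC` columns in the standard format flag each protein group by species. The sketch below shows one way such flags could be derived from the `Proteins` string. It assumes that species is encoded as a `_HUMAN`, `_YEAST`, or `_ECOLI` suffix and that `MULTI_SPEC` marks groups mixing several species. The names `SPECIES_SUFFIXES` and `flag_species` are hypothetical, the `contaminant` flag is left out, and the actual logic lives in ProteoBench's parse settings and may differ.

```python
# Hypothetical suffix mapping; the real one comes from the module's parse settings.
SPECIES_SUFFIXES = {"HUMAN": "_HUMAN", "YEAST": "_YEAST", "ECOLI": "_ECOLI"}


def flag_species(proteins: str) -> dict:
    """Flag which species appear in a ';'-separated protein-group string."""
    names = proteins.split(";")
    flags = {
        species: any(name.endswith(suffix) for name in names)
        for species, suffix in SPECIES_SUFFIXES.items()
    }
    # A group counts as multi-species when more than one species flag is set.
    flags["MULTI_SPEC"] = sum(flags.values()) > 1
    return flags


# Single-species group taken from the table above.
print(flag_species("GAL3A_HUMAN;GAL3B_HUMAN"))
# Illustrative mixed group (not from the test data).
print(flag_species("GAL3A_HUMAN;ENO1_YEAST"))
```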


In [6]:

try:
    print("\n🔄 Step 4: Computing quantification scores...")
    quant_score = QuantScoresHYE(
        parse_settings.species_expected_ratio(),
        parse_settings.species_dict(),
        feature_column_name="Proteins"  # Use Proteins for protein-group level
    )
    intermediate_df = quant_score.generate_intermediate(standard_format, replicate_to_raw)
    print(f"✅ Quantification scores computed! Shape: {intermediate_df.shape}")
    # Display the first few rows of the intermediate dataframe
    print("\n📊 First 3 rows of intermediate dataframe (transposed):")
    display(intermediate_df.head(3).T)  # Transpose to show columns as rows

except Exception as e:
    print(f"❌ Workflow failed at some step: {e}")
    import traceback
    traceback.print_exc()


🔄 Step 4: Computing quantification scores...
✅ Quantification scores computed! Shape: (11300, 30)

📊 First 3 rows of intermediate dataframe (transposed):


| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| Proteins | 1433B_HUMAN | 1433E_HUMAN | 1433F_HUMAN |
| log_Intensity_mean_A | 20.225978 | 21.768699 | 20.921548 |
| log_Intensity_mean_B | 20.27619 | 21.80988 | 20.94067 |
| log_Intensity_std_A | 0.038071 | 0.024606 | 0.040057 |
| log_Intensity_std_B | 0.029908 | 0.019429 | 0.045938 |
| Intensity_mean_A | 1226666.666667 | 3573333.333333 | 1986666.666667 |
| Intensity_mean_B | 1270000.0 | 3676666.666667 | 2013333.333333 |
| Intensity_std_A | 32145.502537 | 61101.009266 | 55075.705473 |
| Intensity_std_B | 26457.513111 | 49328.828623 | 63508.529611 |
| CV_A | 0.026206 | 0.017099 | 0.027723 |
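
To make the per-condition columns easier to read, here is a small worked example of how the mean, standard deviation, CV, and log-scale statistics relate to each other. It assumes that the standard deviation is the sample standard deviation (n - 1), that `CV = std / mean`, and that the log columns use base 2. These assumptions are consistent with the values displayed above but are not taken from the ProteoBench source. The three replicate intensities are hypothetical, chosen so that the results can be compared with the condition A values of `1433B_HUMAN`.

```python
import math
import statistics

# Hypothetical replicate intensities for one protein group in condition A.
replicates_a = [1_190_000, 1_240_000, 1_250_000]

# Linear-scale statistics.
intensity_mean = statistics.mean(replicates_a)
intensity_std = statistics.stdev(replicates_a)  # sample standard deviation (n - 1)
cv = intensity_std / intensity_mean             # coefficient of variation

# Log2-scale statistics, computed on the log-transformed replicates.
log_values = [math.log2(x) for x in replicates_a]
log_intensity_mean = statistics.mean(log_values)
log_intensity_std = statistics.stdev(log_values)

print(f"Intensity_mean_A     = {intensity_mean:.6f}")
print(f"Intensity_std_A      = {intensity_std:.6f}")
print(f"CV_A                 = {cv:.6f}")
print(f"log_Intensity_mean_A = {log_intensity_mean:.6f}")
print(f"log_Intensity_std_A  = {log_intensity_std:.6f}")
```

Note that the mean of the log intensities is not the log of the mean intensity, which is why the table carries both sets of columns.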


In [7]:
try:
    print("\n🔄 Step 5: Generating datapoint...")
    current_datapoint = QuantDatapointHYE.generate_datapoint(
        intermediate_df, input_format, user_input, default_cutoff_min_prec=3
    )
    print("✅ Datapoint generated!")

    # Create a simple DataFrame for all_datapoints
    all_datapoints = pd.DataFrame([current_datapoint])

    print("\n🎉 Complete workflow successful!")
    print(f"📊 Intermediate dataframe shape: {intermediate_df.shape}")
    print(f"📈 All datapoints shape: {all_datapoints.shape}")
    print(f"📄 Input dataframe shape: {input_df.shape}")
    # Display the first few rows of input_df
    print("\n📊 First 3 rows of input_df (transposed)")
    display(input_df.head(3).T)  # Transpose to show columns as rows
    # Display the full all_datapoints table
    print("\n📊 all_datapoints")
    display(all_datapoints)

except Exception as e:
    print(f"❌ Workflow failed at some step: {e}")
    import traceback
    traceback.print_exc()


🔄 Step 5: Generating datapoint...
✅ Datapoint generated!

🎉 Complete workflow successful!
📊 Intermediate dataframe shape: (11300, 30)
📈 All datapoints shape: (1, 29)
📄 Input dataframe shape: (11388, 12)

📊 First 3 rows of input_df (transposed)


| | 0 | 1 | 2 |
| --- | --- | --- | --- |
| Protein.Group | A0A024RBG1 | A0A096LP01 | A0A0B4J2D5;P0DPI2 |
| Proteins | NUD4B_HUMAN | SIM26_HUMAN | GAL3A_HUMAN;GAL3B_HUMAN |
| Genes | | | |
| First.Protein.Description | | | |
| N.Sequences | 5 | 2 | 10 |
| N.Proteotypic.Sequences | 1 | 2 | 0 |
| LFQ_Astral_DIA_15min_50ng_Condition_A_REP1 | 96186.5 | 10061.4 | 230732.0 |
| LFQ_Astral_DIA_15min_50ng_Condition_A_REP2 | 78648.1 | 17722.0 | 205229.0 |
| LFQ_Astral_DIA_15min_50ng_Condition_A_REP3 | 81700.5 | 10945.4 | 223626.0 |
| LFQ_Astral_DIA_15min_50ng_Condition_B_REP1 | 100371.0 | 7809.48 | 238560.0 |



📊 all_datapoints


| | id | software_name | software_version | search_engine | search_engine_version | ident_fdr_psm | ident_fdr_peptide | ident_fdr_protein | enable_match_between_runs | precursor_mass_tolerance | ... | mean_abs_epsilon_global | median_abs_epsilon_eq_species | mean_abs_epsilon_eq_species | median_abs_epsilon_precision_global | mean_abs_epsilon_precision_global | median_abs_epsilon_precision_eq_species | mean_abs_epsilon_precision_eq_species | nr_prec | comments | proteobench_version |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | DIA-NN_20260219_182356 | DIA-NN | 1.0 | DIA-NN | 1.0 | 0.01 | 0.01 | 0.01 | 1 | 20 | ... | 0.227412 | 0.20138 | 0.315543 | 0.103952 | 0.201875 | 0.143139 | 0.260089 | 10928 | | 0.3.4.dev269+g92dbfe1 |
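
The `*_abs_epsilon_*` columns summarize how far the observed fold changes are from the expected ones. The sketch below illustrates one plausible reading of the `global` and `eq_species` variants. It assumes that epsilon is the observed log2 fold change minus the expected log2 ratio for the species, that `global` pools all protein groups, and that `eq_species` averages the per-species medians so each species has equal weight. The expected ratios and the observations are illustrative assumptions, not values from this run; the definitions used by `QuantDatapointHYE` may differ.

```python
import statistics

# Assumed expected log2(A/B) ratios per species; check the module's parse settings.
EXPECTED_LOG2_RATIO = {"HUMAN": 0.0, "YEAST": 1.0, "ECOLI": -2.0}

# Hypothetical (species, observed log2 fold change) pairs.
observations = [
    ("HUMAN", -0.05), ("HUMAN", 0.10), ("HUMAN", 0.02),
    ("YEAST", 0.80), ("YEAST", 1.30),
    ("ECOLI", -1.60),
]

# Absolute deviation of each observation from its species' expected ratio.
abs_eps = [(sp, abs(fc - EXPECTED_LOG2_RATIO[sp])) for sp, fc in observations]

# "global": one median over all protein groups, so abundant species dominate.
median_abs_epsilon_global = statistics.median([e for _, e in abs_eps])

# "eq_species": median per species first, then the mean of those medians.
per_species = {}
for sp, e in abs_eps:
    per_species.setdefault(sp, []).append(e)
median_abs_epsilon_eq_species = statistics.mean(
    [statistics.median(values) for values in per_species.values()]
)

print(f"median_abs_epsilon_global     = {median_abs_epsilon_global:.4f}")
print(f"median_abs_epsilon_eq_species = {median_abs_epsilon_eq_species:.4f}")
```

With species-balanced weighting, a small group such as E. coli influences the summary as much as the much larger human group, which would explain why the two variants can differ noticeably in the table above.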


# ProteoBench Plotting and Visualization

Now let's use ProteoBench's built-in plotting capabilities to visualize the results.

## Import ProteoBench Plotting Modules

Let's import ProteoBench's specialized plotting functions for quantification analysis:

In [8]:
# Import ProteoBench's LFQHYEPlotGenerator
from proteobench.plotting.plot_generator_lfq_HYE import LFQHYEPlotGenerator

print("✅ LFQHYEPlotGenerator imported successfully!")

# Create an instance of the plot generator
plot_generator = LFQHYEPlotGenerator()
print("🎨 LFQHYEPlotGenerator instance created!")

# Display information about available plotting methods
print("\n📊 Available LFQHYEPlotGenerator methods:")
print("  • generate_in_depth_plots() - Create standard LFQ HYE plots")
print("  • plot_main_metric() - Generate main performance metric plot")
print("  • get_in_depth_plot_layout() - Get plot layout configuration")
print("  • get_in_depth_plot_descriptions() - Get plot descriptions")

✅ LFQHYEPlotGenerator imported successfully!
🎨 LFQHYEPlotGenerator instance created!

📊 Available LFQHYEPlotGenerator methods:
  • generate_in_depth_plots() - Create standard LFQ HYE plots
  • plot_main_metric() - Generate main performance metric plot
  • get_in_depth_plot_layout() - Get plot layout configuration
  • get_in_depth_plot_descriptions() - Get plot descriptions


## 1. Main Quantification Plot

Create the primary ProteoBench quantification visualization:

In [None]:
## plot median_abs_epsilon_precision_global vs nr_prec from all_datapoints
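
The cell above is still a placeholder. As a rough sketch of what it could contain, the following plots `median_abs_epsilon_precision_global` against `nr_prec` with matplotlib. It does not use ProteoBench's `plot_main_metric()`, whose signature is not shown in this notebook; the helpers `main_metric_points` and `plot_main_metric_sketch` are hypothetical names introduced for illustration.

```python
def main_metric_points(rows, x_key="median_abs_epsilon_precision_global", y_key="nr_prec"):
    """Extract (x, y, label) triples from a list of datapoint records."""
    return [(row[x_key], row[y_key], row["id"]) for row in rows]


def plot_main_metric_sketch(rows):
    """Scatter the error metric against the number of quantified features."""
    import matplotlib.pyplot as plt  # imported lazily so the helper above has no dependencies

    fig, ax = plt.subplots(figsize=(6, 4))
    for x, y, label in main_metric_points(rows):
        ax.scatter(x, y)
        ax.annotate(label, (x, y), textcoords="offset points", xytext=(5, 5))
    ax.set_xlabel("median_abs_epsilon_precision_global")
    ax.set_ylabel("nr_prec")
    ax.set_title("Main metric (sketch)")
    return fig


# Values copied from the all_datapoints output above; in the notebook,
# all_datapoints.to_dict("records") would supply the rows instead.
rows = [
    {
        "id": "DIA-NN_20260219_182356",
        "median_abs_epsilon_precision_global": 0.103952,
        "nr_prec": 10928,
    }
]
print(main_metric_points(rows))
```

Calling `fig = plot_main_metric_sketch(all_datapoints.to_dict("records"))` in the notebook would then draw one labelled point per datapoint; with a single datapoint the plot is mainly useful as a template for comparing several runs.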

## 2. In-Depth Plots

In [9]:
# Generate the standard ProteoBench in-depth plots
try:
    print("🎨 Generating ProteoBench standard plots...")

    # Generate all the standard LFQ HYE plots
    plots = plot_generator.generate_in_depth_plots(
        performance_data=intermediate_df,
        parse_settings=parse_settings
    )

    print(f"✅ Generated {len(plots)} ProteoBench plots:")
    for plot_name in plots.keys():
        print(f"  • {plot_name}")

    # Display each plot
    plot_descriptions = plot_generator.get_in_depth_plot_descriptions()

    for plot_name, fig in plots.items():
        print(f"\n📊 {plot_name.upper()} PLOT:")
        print(f"📝 Description: {plot_descriptions.get(plot_name, 'No description available')}")

        # Show the plot
        fig.show()

    print("\n✅ All ProteoBench standard plots generated successfully!")

except Exception as e:
    print(f"❌ Error generating ProteoBench plots: {e}")
    import traceback
    traceback.print_exc()

    # Fallback to basic plotting if needed
    print("\n⚠️  Falling back to basic plotting...")

    # Create a simple fold change plot as fallback
    import plotly.express as px

    if 'log2_A_vs_B' in intermediate_df.columns and 'species' in intermediate_df.columns:
        fig = px.histogram(
            intermediate_df,
            x='log2_A_vs_B',
            color='species',
            title='Fold Change Distribution by Species',
            nbins=50,
            opacity=0.7
        )
        fig.update_layout(
            xaxis_title='Log2 Fold Change (A vs B)',
            yaxis_title='Count'
        )
        fig.show()
        print("✅ Basic fold change plot created as fallback!")

🎨 Generating ProteoBench standard plots...
✅ Generated 3 ProteoBench plots:
  • logfc
  • cv
  • ma_plot

📊 LOGFC PLOT:
📝 Description: log2 fold changes calculated from the performance data

📊 CV PLOT:
📝 Description: CVs calculated from the performance data

📊 MA_PLOT PLOT:
📝 Description: MA plot calculated from the performance data

✅ All ProteoBench standard plots generated successfully!
