diff --git a/_posts/2025-08-02-GSoC2025-dingo.md b/_posts/2025-08-02-GSoC2025-dingo.md new file mode 100644 index 0000000..c294189 --- /dev/null +++ b/_posts/2025-08-02-GSoC2025-dingo.md @@ -0,0 +1,354 @@ +--- +layout: single +title: "Initiating te dingo-stats package for statistics on flux sampling datasets" +date: 2025-08-02 +author: Sotirios Touliopoulos +author_profile: true +read_time: true +comments: true +share: true +related: true +hidden: false +--- + + +# Introduction to dingo-stats: package for statistics on flux sampling datasets + + +
+
+ Sotirios Touliopoulos +
+ Contributor to Google Summer of Code 2025 with GeomScale +
+
+ + +> #### Mentors of this project: Elias Tsigaridas, Apostolos Chalkis, Haris Zafeiropoulos, Ioannis Psarros + + +## Goal of the project + +The goal of this project is to introduce [dingo-stats](https://github.com/SotirisTouliopoulos/dingo-stats): a python package for statistical analysis of flux sampling datasets. Till now, all post-sampling statistics were integrated into the `dingo` package. However, as the package grew, it became clear that separating the statistical analysis into its own package would be beneficial. This separation allows for a more focused development of statistical methods and makes it easier for users to access and utilize these methods without needing to install the entire `dingo` package. + +The `dingo-stats` package provides a set of functions for performing statistical tests and visualizing results given any flux sampling dataset (not restricted to `dingo` samples). The package is designed to be user-friendly with a series of [tutorials](https://github.com/SotirisTouliopoulos/dingo-stats/tree/master/tests) and [documentation](https://dingo-stats.readthedocs.io/en/latest/) available. + + +## Updated features + +Functions regarding computing pairwise reaction correlations, clustering and graphs analysis from `dingo` have been updated and moved to `dingo-stats`. The removal of these functions from `dingo` was completed in Pull Request [#118](https://github.com/GeomScale/dingo/pull/118). These functions will not be presented in this blog, as they were presented in [last year's blog](https://sotiristouliopoulos.github.io/dingo/). + + +## New features + +The `dingo-stats` package will include the following new features: +- Functions for modifying metabolic models (changing the objective function, objective direction, setting reaction bounds) +- Functions for inspecting Sampling Distributions +- Functions for retrieving pathway information and mapping reaction IDs to KEGG IDs. +- Functions used to identify loopy reactions in metabolic models. +- Functions for performing Differential Flux Analysis, including statistical tests such as the Kolmogorov-Smirnov test and Hypergeometric test for pathway enrichment. + + +## Modify Metabolic Models + +The `modify_model()` function modifies a given cobra model. Users can change objective function, optimal percentage (lower bound of the objective will be fixed based on the optimal percentage), define custom reaction bounds and change objective direction. + +- `cobra_model` is a cobra model object +- `objective_function` is a string with the requested objective function ID to be assigned to the model +- `optimal_percentage` is the percentage of the FBA solution to be set as lower bound to the objectie function. +- `reaction_bounds` is a dictionary of custom reaction bounds to be assigned to the model’s reactions +- `objective_direction` defines the direction to optimize the model: max and min are available options. + +This function enables the user to create and sample metabolic models simulating different conditions. Here, 2 new models are created based on updating the initial model: + +- One that maximizes for biomass production (asking at least almost 100% of the biomass maximum FBA value) + +```python +ec_cobra_model_condition_100 = modify_model( + cobra_model = ec_cobra_model, + objective_function = "BIOMASS_Ecoli_core_w_GAM", + optimal_percentage = 100, + objective_direction = "max" +) +``` + +- One that maximize for biomass production (asking at least 0% of the biomass maximum FBA value) + +```python +ec_cobra_model_condition_0 = modify_model( + cobra_model = ec_cobra_model, + objective_function = "BIOMASS_Ecoli_core_w_GAM", + optimal_percentage = 0, + objective_direction = "max" +) +``` + +Now you are ready to sample your edited model with the sampling methods of your choice + + +## Inspect Sampling Distributions + +The `plot_grid_reactions_flux()` function plots a grid of selected flux distributions and enables inspections of sampling results to get early insights. Distributions here are chosen based on a list of reaction IDs (subset from Glycolysis and Pentose Phosphate Pathways) derived from KEGG pathway information (see helper functions in next paragraph). The corresponding KEGG pathways tutorial is presented in the next section. From the grid we can detect: Left/Right Shifted, Normal, Fixed (transparent based on the tolerance parameter) or other uncommon distributions. + +- `samples` is a numpy 2D array of the samples +- `model_reactions` is a list containing strings of reaction IDs +- `tolerance` is a tolerance level to make distribution plot transparent based on flux range +- `nrows` is the number of rows for the plot +- `ncols` is the number of columns for the plot +- `title` is the title of the plot + +plot_grid_reactions_flux( + samples = samples_glycolysis_ppp, + model_reactions = reaction_ids_glycolysis_ppp, + nrows = 5, + ncols = 5, + tolerance = 1e-3, + title = "Sampling Distributions") + +
+

+
+ +The `sampling_statistics()` function calculates statistics on flux distributions for a reaction of interest. This information can be used to identify significantly altered flux distributions between different conditions + +- `samples` is a numpy 2D array of the samples +- `model_reactions` is a list containing strings of reaction IDs +- `reaction_id` is a reaction ID of the selected reaction to calculate statistics on + +```python +mean, min, max, std, skewness, kurtosis = sampling_statistics( + samples = samples, + model_reactions = ec_cobra_reaction_ids, + reaction_id = "FRD7") + +print(mean, min, max, std, skewness, kurtosis) +``` + +```python +479.1369619828837 0.14995320273677437 993.9157983127856 289.3315287517938 0.09951034266820558 -1.2064534542990524 +``` + + +## KEGG Pathway Information + +A set of helper functions is available to retrieve KEGG pathway information and map reaction IDs to KEGG IDs. These helper functions will not be presented here, but are available in the [dingo-stats documentation](https://dingo-stats.readthedocs.io/en/latest/index.html). Here, we will present how to leverage the pathway information to analyze a sampling dataset. + +The `subset_model_reactions_from_pathway_info()` function works given a dataFrame created with the `get_kegg_pathways_from_reaction_ids()` function and returns all reaction IDs affiliated with a given KEGG pathway name or ID. + +```python +PPP_from_name = subset_model_reactions_from_pathway_info( + kegg_info_df = df_kegg_pathways, + pathway_query = "Pentose phosphate pathway") + +Glycolysis_from_name = subset_model_reactions_from_pathway_info( + kegg_info_df = df_kegg_pathways, + pathway_query = "Glycolysis / Gluconeogenesis") + +Glycolysis_from_id = subset_model_reactions_from_pathway_info( + kegg_info_df = df_kegg_pathways, + pathway_query = "rn00010") + +print(PPP_from_name) +print(Glycolysis_from_name) +print(Glycolysis_from_id) +``` + +```python +['FBA', 'FBP', 'GND', 'PFK', 'PGL', 'RPE', 'RPI', 'TKT1'] +['ALCD2x', 'ENO', 'FBA', 'FBP', 'GAPD', 'PFK', 'PGK', 'PGM', 'PPCK', 'PPS', 'PYK', 'TPI'] +['ALCD2x', 'ENO', 'FBA', 'FBP', 'GAPD', 'PFK', 'PGK', 'PGM', 'PPCK', 'PPS', 'PYK', 'TPI'] +``` + +The `dictionary_reaction_id_to_pathway()` function takes one or multiple lists containing reaction IDs (corresponding to KEGG pathways) and creates a dictionary that maps the IDs to pathway names. If a reaction appears in more than one pathway, it is classified with the term `Multiple-Pathways`. This is useful for plotting to work with subsets of reactions and to replace names from the `df_kegg_pathways` dataframe like `Glycolysis / Gluconeogenesis` to `Glycolysis` and `Pentose phosphate pathway` to `PPP`. + +- `**named_lists` are named lists where each argument is a list of reaction IDs and the argument name represents the pathway name. + +```python +named_lists = { + "Glycolysis": Glycolysis, + "PPP": PPP +} + +bigg_to_pathway_dict = dictionary_reaction_id_to_pathway( + named_lists) + +print(bigg_to_pathway_dict.get("GND")) +print(bigg_to_pathway_dict.get("ENO")) +print(bigg_to_pathway_dict.get("FBA")) +``` + +```python +"Pentose phosphate pathway" +"Glycolysis / Gluconeogenesis" +"Multiple-Pathways" +``` + +The `reaction_in_pathway_binary_matrix()` function is used to create a new pandas dataframe with reactions as rows and different pathways as columns. The corresponding cell of the dataframe will show if a reaction belongs to a certain pathway (1) or not (0). If a reaction belongs to more than one pathways, then the column: `Multiple-Pathways` is created and the reaction matching this will only get True (1) there and not in the individual pathway columns (e.g. 1 in `Multiple-Pathways`, 0 in `Glycolysis` and 0 in `PPP`). + +- `reaction_id_to_pathway_dict` is dictionary mapping reaction IDs to pathway names created with the dictionary_reaction_id_to_pathway function + +```python +binary_df = reaction_in_pathway_binary_matrix( + reaction_id_to_pathway_dict = bigg_to_pathway_dict) +``` + +The `plot_reaction_in_pathway_heatmap()` function is used to plot a heatmap of the `binary_df` created from the `reaction_in_pathway_binary_matrix()` function to better illustrate the connection between reactions and pathways. + +- `binary_df` is a pandas dataFrame with binary values (0 or 1) +- `font_size` is the font size for axis labels and ticks +- `fig_width` is the width of the figure in pixels +- `fig_height` is the height of the figure in pixels +- `title` is the title of the plot + +```python +plot_reaction_in_pathway_heatmap( + binary_df = binary_df, + font_size = 8, + fig_width = 600, + fig_height = 600, + title = "" ) +``` + +
+

+
+ + +## Identify Loopy Reactions + +The `loops_enumeration_from_fva()` function computes twice a Flux Variability Analysis (with `loopless` parameter to `False` and `True`) and identifies possible loopy reaction in given model and returns them in a list. Each item in the list is a tuple of a string (reaction ID) and a float (loopy score returned from the function) + +- `ec_cobra_model` is a cobra model object +- `fraction_of_optimum` is a float variable that defines the fraction_of_optimum parameter used in `flux_variability_analysis` + +```python +loopy_reactions_fva = loops_enumeration_from_fva( + ec_cobra_model = ec_cobra_model, + fraction_of_optimum = 0.999) + +print(loopy_reactions_fva) +``` + +```python +[('SUCDi', 994.7794007141794), ('FRD7', 995.0539767141795)] +``` + +The `loopy_reactions_turned_off_in_pfba()` function identifies which reactions from the loopy reactions set calculated with the `loops_enumeration_from_fva()` function can be turned off when performing a `loopless_solution` applied to `pFBA` results + +- `loopy_cobra_model` is a cobra model object possibly containing loops (default model without any preproccess) +- `loopy_reactions` is a list with loopy_reactions with reactions classified as loopy +- `tol` is a tolerance value used to classify a numeric change as important or not + +```python +turned_off_reactions = loopy_reactions_turned_off_in_pfba( + loopy_cobra_model = ec_cobra_model, + loopy_reactions = loopy_reactions, + tol = 1e-6) + +print(turned_off_reactions) +``` + +```python +['FRD7'] +``` + + +## Differential Flux Analysis + +The `significantly_altered_reactions()` function takes as input at least two (different) flux sampling datasets to compare and identifies significantly altered reactions. It performs a Kolmogorov-Smirnov (KS) non-parametric test and corrects p-value for multiple comparisons. It additionally calculates a fold change that together with the p-value classifies reactions as significantly altered or not. It returns two lists, one with the significantly changed (`significant_diff_reactions`) and one with the not significantly changed (`not_significant_diff_reactions`) reactions. Also, two dictionaries, one mapping reaction IDs to corrected p_values (`pval_dict`) and one mapping reactions IDs to fold change values (`fold_change_dict`) + +- `conditions` is a list of different flux sampling arrays +- `selected_comparisons` is a list showing which conditions to compare (useful when comparing more than two sampling arrays) +- `cobra_model` is a cobra model object +- `fold_change_cutoff` is a cutoff for fold-change to consider two distributions significantly different +- `std_cutoff` is a cutoff to ensure distributions that are compared are not fixed to a certain value + +```python +conditions = [samples_condition_1, samples_condition_2] +selected_comparisons = [(0, 1)] + +(significant_diff_reactions, + not_significant_diff_reactions, + pval_dict, + fold_change_dict) = significantly_altered_reactions( + conditions = conditions, + selected_comparisons = selected_comparisons, + cobra_model = ec_cobra_model, + fold_change_cutoff = 0.6, + std_cutoff = 1e-3) +``` + +The `plot_volcano()` function creates a volcano plot to show results from differential flux analysis. Volcano plot has Fold Change on the x-axis and -log10(p_value) on the y-axis. Users can provide a reactions list in the annotate parameter, to show reaction IDs on the plot. Also, lines showing the significance cutoffs may optionally be added, when providing `p_value_cutoff`, `fold_change_cutoff` and `show_cutoff_lines`. + +- `pval_dict` is a dictionary mapping reaction ID to corrected p-value. +- `fold_change_dict` is a dictionary mapping reaction ID to fold-change value. +- `p_value_cutoff` is a significance threshold for p-value (to appear in the plot). +- `fold_change_cutoff` is a threshold for fold-change (to appear in the plot). +- `annotate` is a list of reaction IDs to annotate on the plot. +- `width` is the width of the figure in pixels. +- `height` is the height of the figure in pixels. +- `title` is the title of the plot. +- `show_cutoff_lines` is a boolean variable for whether to draw p-value and fold-change cutoff lines. + +```python +reactions_to_annotate = ["NH4t"] + +plot_volcano( + pval_dict = pval_dict, + fold_change_dict = fold_change_dict, + p_value_cutoff = 0.05, + fold_change_cutoff = 0.6, + annotate = reactions_to_annotate, + width = 800, + height = 600, + title = "", + show_cutoff_lines = True) +``` + +
+

+
+ + +The `hypergeometric_test_pathway_enrichment()` function performs a hypergeometric test to find significantly affected pathways between our sampling conditions. It also calculated fold enrichment and p_values useful for filtering significant changes + +- `significant_reactions` is a list of reaction IDs with altered flux. +- `model_reactions` is a list of all reaction IDs considered in the analysis. +- `reaction_to_pathways` is a dictinary mapping reaction ID -> list of pathway names. + +```python +hypergeometric_pathway_enrichment_df = hypergeometric_test_pathway_enrichment( + significant_reactions = significant_diff_reactions, + model_reactions = ec_cobra_reaction_ids, + reaction_to_pathways = reaction_to_pathways) +``` + +The `plot_pathway_enrichment()` function takes as input the hypergeometric_pathway_enrichment_df dataframe created with the `hypergeometric_test_pathway_enrichment()` function and plots the enrichment results in a bubble plot. Bubble size stands for pathway size (number of reactions) and reaction colour stands for -log10(p-value) + +- `hypergeometric_enrichment_df` is a dataframe from the hypergeometric_test_pathway_enrichment function. +- `pval_threshold` is a significance p value threshold for filtering. +- `use_fdr` is a boolean for whether to use ‘fdr’ or ‘pval’ for filtering. +- `font_size` is the font size for plot labels and title. +- `width` is the width of the figure in pixels. +- `height` is the height of the figure in pixels. + +```python +plot_pathway_enrichment( + hypergeometric_enrichment_df = hypergeometric_pathway_enrichment_df, + pval_threshold = 2, + use_fdr = True, + font_size = 14, + width = 900, + height = 600) +``` + +
+

+
+ + +## References + +- Ebrahim, A.; Lerman, J. A.; Palsson, B. O.; Hyduke, D. R. COBRApy: COnstraints-Based Reconstruction and Analysis for Python. BMC Systems Biology 2013, 7 (1). https://doi.org/10.1186/1752-0509-7-74. + +- Apostolos Chalkis; Vissarion Fisikopoulos; Tsigaridas, E.; Haris Zafeiropoulos. Dingo: A Python Package for Metabolic Flux Sampling. Bioinformatics Advances 2024, 4 (1). https://doi.org/10.1093/bioadv/vbae037. \ No newline at end of file