Skip to content

change binning aggregation from "mean" to "sum"#23

Merged
singjc merged 17 commits intoOpenMS:mainfrom
jcharkow:patch/spectrum_binning
Sep 13, 2024
Merged

change binning aggregation from "mean" to "sum"#23
singjc merged 17 commits intoOpenMS:mainfrom
jcharkow:patch/spectrum_binning

Conversation

@jcharkow
Copy link
Collaborator

@jcharkow jcharkow commented Sep 8, 2024

User description

binning aggregation by mean leads to strange trends when have annotations that the annotation peaks are much higher than everything else. Changing to sum leads to more realistic plot.

Binning by mean (old):
image

Binning by sum (new):
image

I also changed the peakmap but these are not tested.


PR Type

enhancement


Description

  • Changed the binning aggregation method from "mean" to "sum" to provide more realistic plots, especially when annotation peaks are significantly higher than other data points.
  • Improved code readability by updating method signatures and formatting.
  • Adjusted the logic for binning and peak annotations to align with the new aggregation method.

Changes walkthrough 📝

Relevant files
Enhancement
_core.py
Update binning aggregation from mean to sum in plots         

pyopenms_viz/_core.py

  • Changed binning aggregation method from "mean" to "sum" for intensity
    calculations.
  • Updated method signatures and formatting for better readability.
  • Adjusted logic to handle binning and peak annotations.
  • +37/-45 

    💡 PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    binning aggregation by mean leads to strange trends when have
    annotations that the annotation peaks are much higher than everything
    else. Changing to sum leads to more realistic plot
    @jcharkow jcharkow requested a review from singjc September 8, 2024 17:10
    @qodo-code-review
    Copy link
    Contributor

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Reviewer Guide 🔍

    ⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Key issues to review

    Potential Performance Impact
    Changing the binning aggregation method from "mean" to "sum" may lead to unexpected results or performance issues for large datasets.

    Code Consistency
    The binning method change is not consistently applied across all relevant parts of the code, which may lead to inconsistent behavior.

    @qodo-code-review
    Copy link
    Contributor

    qodo-code-review bot commented Sep 8, 2024

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Code Suggestions ✨

    CategorySuggestion                                                                                                                                    Score
    Best practice
    Replace the parameter name 'cls' with 'self' in the method signature

    Consider using a more descriptive variable name instead of cls in the plot method
    signature. Since this is an instance method (not a class method), using self would
    be more appropriate and consistent with Python conventions.

    pyopenms_viz/_core.py [317-319]

     def plot(
    -    cls, fig, data, x, y, by: str | None = None, plot_3d: bool = False, **kwargs
    +    self, fig, data, x, y, by: str | None = None, plot_3d: bool = False, **kwargs
     ):
     
    • Apply this suggestion
    Suggestion importance[1-10]: 9

    Why: The suggestion correctly identifies a best practice issue by recommending the use of 'self' instead of 'cls' for an instance method, which improves code readability and adheres to Python conventions.

    9
    Enhancement
    Replace multiple if-elif statements with a dictionary-based approach for more efficient conditional assignment

    Consider using a more efficient approach for conditional assignment. Instead of
    using multiple if-elif statements, you can use a dictionary to map the bin methods
    to their corresponding functions.

    pyopenms_viz/_core.py [600-605]

    -if self.bin_method == "sturges":
    -    self.num_x_bins = sturges_rule(data, x)
    -elif self.bin_method == "freedman-diaconis":
    -    self.num_x_bins = freedman_diaconis_rule(data, x)
    -elif self.bin_method == "none":
    -    self.num_x_bins = num_x_bins
    +bin_methods = {
    +    "sturges": lambda: sturges_rule(data, x),
    +    "freedman-diaconis": lambda: freedman_diaconis_rule(data, x),
    +    "none": lambda: num_x_bins
    +}
    +self.num_x_bins = bin_methods.get(self.bin_method, lambda: num_x_bins)()
     
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: The suggestion provides a more efficient and maintainable way to handle conditional assignments using a dictionary, which simplifies the code and reduces potential errors in future modifications.

    8
    ✅ Combine separate grouping operations into a single, more flexible grouping and aggregation step
    Suggestion Impact:The commit implemented a more flexible grouping and aggregation operation by using a variable to hold grouping columns and applying aggregation based on a parameterized method, aligning with the suggestion to consolidate separate operations.

    code diff:

                     # Group by x, y and by columns and calculate the sum intensity within each bin
                     data = (
                         data.groupby([x, y, by], observed=True)
    -                    .agg({z: "sum"})
    +                    .agg({z: aggregation_method})
                         .reset_index()
                     )
                     # Add by back to kwargs
                     kwargs["by"] = by
                 else:
                     # Group by x and y bins and calculate the sum intensity within each bin
    -                data = data.groupby([x, y], observed=True).agg({z: "sum"}).reset_index()
    +                data = data.groupby([x, y], observed=True).agg({z: aggregation_method}).reset_index()

    Consider using a more concise and efficient approach for grouping and aggregating
    data. The current implementation uses separate code blocks for different conditions,
    which can be combined into a single, more flexible operation.

    pyopenms_viz/_core.py [909-920]

    +group_cols = [x, y]
     if by is not None:
    -    # Group by x, y and by columns and calculate the sum intensity within each bin
    -    data = (
    -        data.groupby([x, y, by], observed=True)
    -        .agg({z: "sum"})
    -        .reset_index()
    -    )
    -    # Add by back to kwargs
    +    group_cols.append(by)
         kwargs["by"] = by
    -else:
    -    # Group by x and y bins and calculate the sum intensity within each bin
    -    data = data.groupby([x, y], observed=True).agg({z: "sum"}).reset_index()
    +data = data.groupby(group_cols, observed=True).agg({z: "sum"}).reset_index()
     
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: The suggestion enhances code efficiency and readability by consolidating separate grouping operations into a single flexible step, reducing redundancy and potential errors.

    8
    Maintainability
    Simplify the condition in the if statement for better readability

    Consider simplifying the condition in the if statement. The current condition
    bin_peaks == True or (data.shape[0] > num_x_bins * num_y_bins and bin_peaks ==
    "auto") can be simplified to improve readability.

    pyopenms_viz/_core.py [903-905]

    -if bin_peaks == True or (
    -    data.shape[0] > num_x_bins * num_y_bins and bin_peaks == "auto"
    -):
    +if bin_peaks is True or (bin_peaks == "auto" and data.shape[0] > num_x_bins * num_y_bins):
     
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: The suggestion improves code readability by simplifying the condition in the if statement, making it clearer and easier to understand without changing the logic.

    7

    @jcharkow
    Copy link
    Collaborator Author

    jcharkow commented Sep 9, 2024

    Possibly should allow user to customize the aggregation method?

    @singjc
    Copy link
    Collaborator

    singjc commented Sep 9, 2024

    I did a bit more digging and made some comparisons. I added a tolerance binning method, to bin mz's based on a fixed tolerance. I found the sturges and freedman binning methods sometimes don't work as well for either sparse or really dense spectrums. I also added an aggregation param to allow the user to aggregate by 'sum', 'mean' or 'max`. Based on my comparisons using the mz tolerance bining method (with tol = 1) and using max as the aggregation method seems to return spectra that closely matches the orignal raw spectra and is a lot faster for plotting (5.007686 seconds for raw vs 0.614527 seconds for max mz tol=1 binning)

    image

    Update:

    I added two options for automating the computation of the tolerances for the mz tolerance bining method:

    1. use the freedman bin width
    2. use the 1 percentile of the non-zero differences of the mz values

    This results in an even faster binning and plotting, with the binned spectrum still looking similar to the original

    raw (5.91sec) > mz-tol-bin + tol=1 (0.83sec) > mz-tol-bin + tol=1pct-dif ( 0.09sec)

    image

    Testing with very sparse spectrum (from Spectrum.ipynb)

    image

    Testing with a very dense spectrum (from alphatims_tutorial.ipynb)

    image

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    what is this nb about?

    Copy link
    Collaborator Author

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Sorry this is a rough notebook added by mistake to this PR. Will remove

    Copy link
    Collaborator

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    It's not added by mistake. It's a tutorial notebook for showing an example of loading bruker tdf DIA and DDA data using alphatims and showcasing plotting with pyopenms_viz.

    The notebook was updated and added to this PR to reflect the changes with the spectrum binning.

    @timosachsenberg
    Copy link

    just for reference: I used max in the past for similar reasons.

    @singjc singjc merged commit 901e275 into OpenMS:main Sep 13, 2024
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    3 participants