Replace beeswarm plot with different visualisation #2066

Closed
4 tasks done
pontushojer opened this issue Sep 22, 2023 · 6 comments

@pontushojer
Contributor

Description of bug

@wm75 informed me about this very large Pangolin dataset posted on Galaxy (https://usegalaxy.eu/published/history?id=5ee10825304a885f), and I figured it would be interesting to test it with MultiQC. Specifically this dataset: https://usegalaxy.eu/api/datasets/4838ba20a6d867654919ea0761c5ed4d/display?to_ext=tabular which translates to a ~80 MB CSV. It contains 375676 samples!

Running it was no big issue: it took about 2.5 min and used a maximum of ~5 GB memory on my MacBook.

When I try to view the report, however, the browser first hangs (see image below).

image

After a while it loads, but scrolling is quite laggy with this many samples. The table is also converted to a beeswarm plot, which I am not sure is very informative; at the very least, some columns should probably be removed. See below.

image

File that triggers the error

No response

MultiQC Error log

$ gtime -v multiqc -n pangolin_large testdata/data/modules/pangolin/v4.0.5/galaxy_pangolin_results_usher_mode_full.csv

  /// MultiQC 🔍 | v1.16.dev0 (35b18ac)

|           multiqc | Search path : /Users/pontus.hojer/projects/MultiQC-pontus/testdata/data/modules/pangolin/v4.0.5/galaxy_pangolin_results_usher_mode_full.csv
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1  
|          pangolin | Found 375676 samples
|           multiqc | Report      : pangolin_large.html
|           multiqc | Data        : pangolin_large_data
|           multiqc | MultiQC complete
	Command being timed: "multiqc -n pangolin_large testdata/data/modules/pangolin/v4.0.5/galaxy_pangolin_results_usher_mode_full.csv"
	User time (seconds): 144.94
	System time (seconds): 8.91
	Percent of CPU this job got: 98%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:35.54
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 5561124
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 36
	Minor (reclaiming a frame) page faults: 3937354
	Voluntary context switches: 3129
	Involuntary context switches: 36071
	Swaps: 0
	File system inputs: 0
	File system outputs: 4
	Socket messages sent: 17
	Socket messages received: 34
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Before submitting

  • I have read the troubleshooting documentation.
  • I am using the latest release of MultiQC.
  • I have included a full MultiQC log, not truncated.
  • I have attached an input file (.zip if necessary) that triggers the error.
@ewels
Member

ewels commented Sep 23, 2023

Yes, this is a known issue. Generally the idea is to switch to static image plots when sample numbers are very high. This caps the maximum report file size and also the JavaScript run time (e.g. it doesn't hang the browser). However, there is no flat-image plot for beeswarm plots yet (or heatmaps).

This should be resolved when we replace the plotting library in the near future. We will also look into replacing beeswarm plots with a better plot type, e.g. violin or similar.
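
As a rough illustration of the fallback described above, here is a minimal sketch in Python. It is not MultiQC's actual code: the names `PlotMode`, `choose_plot_mode`, `has_flat_renderer` and the threshold value are all hypothetical and only show why a plot type without a flat renderer stays interactive however many samples there are.

```python
from enum import Enum


class PlotMode(Enum):
    INTERACTIVE = "interactive"  # rendered by JavaScript in the browser
    FLAT = "flat"                # pre-rendered static image


# Hypothetical cut-off; the real value used by MultiQC is not stated in this thread.
FLAT_THRESHOLD = 100


def choose_plot_mode(n_samples, has_flat_renderer, threshold=FLAT_THRESHOLD):
    """Pick a rendering mode for one plot.

    A static image is only possible when the plot type has a flat renderer.
    Without one, the plot stays interactive regardless of the sample count.
    """
    if n_samples > threshold and has_flat_renderer:
        return PlotMode.FLAT
    return PlotMode.INTERACTIVE


# A plot type with a flat renderer falls back to a static image...
print(choose_plot_mode(375676, has_flat_renderer=True))
# ...while one without (beeswarm, per the comment above) stays interactive.
print(choose_plot_mode(375676, has_flat_renderer=False))
```

Under these assumptions, the 375676-sample beeswarm plot remains interactive, which matches the hang reported in this issue.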

@ewels
Member

ewels commented Sep 23, 2023

I think I'll close this for now as there's nothing specific that we'll do for the pangolin module.

This will be a good test case for #1789 though!

@ewels changed the title from "Report hangs on large (300,000+ samples) pangolin dataset" to "Replace beeswarm plot with different visualisation" on Sep 23, 2023
@ewels
Member

ewels commented Sep 23, 2023

Actually I changed my mind. We didn't have an issue specifically for switching out the beeswarm plot yet. So I've changed the title and commandeered the issue 😅

@ewels
Member

ewels commented Jan 16, 2024

@vladsavelyev - see above for a nice example dataset to try out the new plot with huge sample numbers.

@ewels
Member

ewels commented Jan 16, 2024

Note to self: I needed to convert the downloaded file from TSV to CSV, then tell MultiQC not to ignore the large file. This worked for me:

multiqc -f . --cl-config "log_filesize_limit: 2000000000"

With this (unzipped) file:

Galaxy-pangolin_results_usher_mode.tabular.csv.zip
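
The thread does not say how the TSV-to-CSV conversion was done. One possible way, sketched here with Python's standard `csv` module, is shown below. The function name `tsv_to_csv` and the file names in the usage comment are hypothetical. Rows are streamed one at a time, so the whole ~80 MB file never has to be held in memory.

```python
import csv


def tsv_to_csv(src_path, dst_path):
    """Convert a tab-separated file to a comma-separated one.

    Rows are streamed, so memory use stays flat even for very large inputs.
    Returns the number of rows written (including the header row).
    """
    n_rows = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst)  # quotes fields containing commas as needed
        for row in reader:
            writer.writerow(row)
            n_rows += 1
    return n_rows


# Hypothetical usage; substitute the real downloaded file name:
# tsv_to_csv("pangolin_results.tabular", "pangolin_results.csv")
```

Using `csv.writer` rather than a plain tab-to-comma string replacement keeps fields that themselves contain commas intact, since they are quoted on output.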

@vladsavelyev
Member

This is effectively addressed by adding the violin plot in #2292

Screenshot 2024-02-08 at 17 04 49

(It's still not ideal that the long tick labels are getting cropped; something to fix in the future.)
