Add new pairwise indexed PAF adapter format with CLI creation workflow #3859

Merged
merged 12 commits into from
Dec 11, 2023

@cmdcolin cmdcolin commented Aug 15, 2023

This proposes a new way of loading larger synteny datasets by pairwise tabix indexing a PAF file "query-wise" and "target-wise"

It requires a little bit of a command line setup but it enables much faster (and potentially more scalable) loading of synteny data

Example, converting the human vs mouse chain file to the pairwise indexed format

Session data size difference

The chain file, hs1ToMm39.over.chain.gz, has to be loaded into memory up front. It is 69Mb gzipped, 219Mb ungzipped.

With this branch, only 3.9Mb of bgzip data is downloaded for a session share link that looks at a large region of chr1 vs chr1 http://localhost:3000/?session=share-9-Dj3li4oS&config=test_data%2Fhs1_vs_mm39%2Fconfig.json&password=4vvbF

So it is about 6% of the data needed (3.9Mb vs 69Mb gzipped), a considerable reduction (~18x less)

Considerations

for the whole-genome overview, we could maybe create a reduced PAF (e.g. with the CIGAR stripped) that the display switches to when zoomed out

note that the meanQueryIdentity coloring mode may need to be pre-computed, since it requires global information to calculate the percent identity (it aggregates the mapping quality across all the pieces of the query sequence, even if it's split into multiple PAF lines)
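A minimal sketch of why this needs global information: every PAF line for a query has to be seen before that query's mean can be known. The function name `meanIdentityByQuery` is hypothetical, and the choice of PAF column 10 (matching bases) over column 11 (alignment block length) as the identity measure is an assumption for illustration, not necessarily the quantity the renderer aggregates.

```typescript
// Sum of matches and block lengths seen so far for one query sequence.
interface Totals {
  matches: number
  blockLen: number
}

// Assumption: identity is taken as PAF column 10 (matching bases) over
// column 11 (alignment block length); the renderer may use another measure.
function meanIdentityByQuery(pafLines: string[]): Map<string, number> {
  const totals = new Map<string, Totals>()
  for (const line of pafLines) {
    const cols = line.split('\t')
    const qname = cols[0] ?? ''
    const entry = totals.get(qname) ?? { matches: 0, blockLen: 0 }
    entry.matches += Number(cols[9])
    entry.blockLen += Number(cols[10])
    totals.set(qname, entry)
  }
  // Length-weighted mean: total matches over total block length per query.
  const result = new Map<string, number>()
  totals.forEach((t, qname) => {
    result.set(qname, t.blockLen > 0 ? t.matches / t.blockLen : 0)
  })
  return result
}
```

A pre-processing step could run something like this once over the whole file and store the per-query value on each line, so that a region query does not need the other lines.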

@github-actions github-actions bot added the needs label triage Needs a label to show in changelog (breaking, enhancement, bug, documentation, or internal) label Aug 15, 2023
@cmdcolin cmdcolin added enhancement New feature or request and removed needs label triage Needs a label to show in changelog (breaking, enhancement, bug, documentation, or internal) labels Aug 15, 2023
@cmdcolin
Collaborator Author

random background to PR

ongoing scalability concerns, e.g. issue #2788

may help protocol-ize our synteny better a la current protocols paper (e.g. making sure the protocol can handle any size genome is good, otherwise have to explain limits more, something the jbrowse 2 paper did not do)

also in response to feedback received in person at ISMB, where a user said they were not sure why some PAF tracks took a long time to load

@cmdcolin
Collaborator Author

I came up with two possible ideas that could create a "single file" that contains all the needed information

  1. To create an "overview" (whole-genome dotplot) we could simply strip the CIGAR, but the features would then inaccurately "bridge across" large insertions and deletions. To remedy this, a pre-processing script can scan the CIGAR string, and if it encounters an insertion or deletion that "would be visible" at a large overview scale (say 100,000bp, just as an example; it could be calculated dynamically), split the feature in two: one going from the start of the feature to the start of the deletion, another going from after the deletion to the end of the feature. This allows us to create an "overview" PAF that is not as lossy as completely stripping the CIGAR.

  2. The configuration for this track currently ends up producing many files: already 4 (query.paf.gz, query.paf.gz.tbi, target.paf.gz, target.paf.gz.tbi), or ~5 with the overview PAF (the overview probably wouldn't be tabix indexed, since it is downloaded in full). This is a lot of files to maintain for a 'single synteny track'. Instead, we could make a single tabix file that has a specialized encoding. Example screenshot illustrating the encoding:

[screenshot: untitled]

The output has both the query-wise and the target-wise sorting in a single file, with a literal prefix, "querywise" or "targetwise", prepended to all the refnames. Another prefix for the overview could even be added. Preparing this may require more code than the simple 6 line bash script at the top, but it would "reduce the mental overhead of juggling so many files to prepare the track", which may be worth it.
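The CIGAR-splitting idea from item 1 can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `splitAtLargeIndels`, the `Segment` shape, and the `minIndel` threshold parameter are hypothetical, coordinates are assumed to advance left to right on both sequences (forward strand only), and clipping operations are ignored.

```typescript
// One gap-free-at-overview-scale piece of an alignment.
interface Segment {
  qstart: number
  qend: number
  tstart: number
  tend: number
}

// Walk the CIGAR and cut the feature wherever an insertion or deletion
// is at least minIndel bases long. Smaller indels stay inside a segment.
function splitAtLargeIndels(
  cigar: string,
  qstart: number,
  tstart: number,
  minIndel: number,
): Segment[] {
  const segments: Segment[] = []
  let q = qstart
  let t = tstart
  let segQ = q
  let segT = t
  const re = /(\d+)([MIDN=X])/g
  let m: RegExpExecArray | null
  while ((m = re.exec(cigar)) !== null) {
    const len = Number(m[1])
    const op = m[2]
    const consumesQuery = op === 'M' || op === '=' || op === 'X' || op === 'I'
    const consumesTarget = op !== 'I'
    const isLargeIndel = (op === 'I' || op === 'D' || op === 'N') && len >= minIndel
    if (isLargeIndel && (q > segQ || t > segT)) {
      // Close the segment that ends where the large indel starts.
      segments.push({ qstart: segQ, qend: q, tstart: segT, tend: t })
    }
    if (consumesQuery) {
      q += len
    }
    if (consumesTarget) {
      t += len
    }
    if (isLargeIndel) {
      // The next segment starts after the indel.
      segQ = q
      segT = t
    }
  }
  if (q > segQ || t > segT) {
    segments.push({ qstart: segQ, qend: q, tstart: segT, tend: t })
  }
  return segments
}
```

For example, `splitAtLargeIndels('100M200000D50M', 0, 1000, 100000)` yields two segments on either side of the 200,000bp deletion, while a CIGAR with only small indels comes back as a single segment.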

@cmdcolin
Collaborator Author

Implemented the strategy mentioned above, where both query and target are in a single file. It uses a script made here, https://github.com/cmdcolin/pairwise_indexed_paf, which could probably be added to the jbrowse CLI tools

@cmdcolin cmdcolin force-pushed the pairwise_paf branch 2 times, most recently from 4e9edfc to 162eda7 Compare September 8, 2023 19:29
@cmdcolin
Collaborator Author

cmdcolin commented Sep 8, 2023

Created a new command "jbrowse process-paf" with the intended usage being something like

minimap2 grape.fa peach.fa > out.paf
jbrowse process-paf out.paf | sort -k1,1 -k3,3n | bgzip > out.sorted.ppaf.gz
tabix -s1 -b3 -e4 out.sorted.ppaf.gz
jbrowse add-track out.sorted.ppaf.gz -a peach,grape

or, the minimap2 output can be piped in directly

minimap2 grape.fa peach.fa | jbrowse process-paf | sort -k1,1 -k3,3n | bgzip > out.sorted.ppaf.gz
tabix -s1 -b3 -e4 out.sorted.ppaf.gz
jbrowse add-track out.sorted.ppaf.gz -a peach,grape

I thought the file extension ppaf may help distinguish this "processed PAF" from the source one. It gives the file a custom extension that add-track can use to select the specialized PairwiseIndexedPAFAdapter. The ppaf is still PAF format, but the info is duplicated into separate "tabix query spaces" (one with query sorting, one with target sorting)
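As a rough sketch of what "duplicating the info into separate tabix query spaces" could look like per line: each PAF record is written twice, once as-is and once with the query and target columns swapped, so that `tabix -s1 -b3 -e4` indexes both coordinate systems. The function names and the swap layout are assumptions for illustration, the prefixes are the literal "querywise"/"targetwise" strings mentioned earlier in this thread, and the sketch leaves optional tags such as the CIGAR untouched, which a real converter may need to adjust for the swapped copy.

```typescript
// Prefixes taken from the discussion above; the final format may differ.
const QUERY_PREFIX = 'querywise'
const TARGET_PREFIX = 'targetwise'

// Prepend a prefix to the refname in column 1 and re-join the line.
function withPrefix(prefix: string, cols: string[]): string {
  return [prefix + (cols[0] ?? '')].concat(cols.slice(1)).join('\t')
}

// Emit two records per PAF line: one keyed by query coordinates and one
// keyed by target coordinates (columns 1-4 and 6-9 swapped).
function toPairwiseRecords(pafLine: string): string[] {
  const cols = pafLine.split('\t')
  const swapped = cols
    .slice(5, 9) // target name, length, start, end
    .concat(cols.slice(4, 5), cols.slice(0, 4), cols.slice(9))
  return [withPrefix(QUERY_PREFIX, cols), withPrefix(TARGET_PREFIX, swapped)]
}
```

A region on the query assembly would then be looked up under the "querywise"-prefixed refname and a region on the target assembly under the "targetwise"-prefixed one, both from the same indexed file.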

@cmdcolin
Collaborator Author

cmdcolin commented Sep 8, 2023

note: the above set of ~6 commands could potentially be automated too, but that would add complexity to our program

@cmdcolin cmdcolin force-pushed the pairwise_paf branch 2 times, most recently from 33968af to a7a9c49 Compare September 11, 2023 19:07
@codecov

codecov bot commented Sep 11, 2023

Codecov Report

Attention: 66 lines in your changes are missing coverage. Please review.

Comparison is base (9f69cb5) 63.40% compared to head (b6a2d3a) 63.43%.

❗ Current head b6a2d3a differs from pull request most recent head 31d04e5. Consider uploading reports for the commit 31d04e5 to get more accurate results

Files Patch % Lines
...wiseIndexedPAFAdapter/PairwiseIndexedPAFAdapter.ts 0.00% 51 Missing ⚠️
...ntenyDisplay/components/LinearSyntenyRendering.tsx 0.00% 7 Missing ⚠️
products/jbrowse-cli/src/commands/add-track.ts 50.00% 3 Missing ⚠️
products/jbrowse-cli/src/commands/process-paf.ts 95.34% 2 Missing ⚠️
...ters/src/PairwiseIndexedPAFAdapter/configSchema.ts 50.00% 1 Missing ⚠️
...ve-adapters/src/PairwiseIndexedPAFAdapter/index.ts 75.00% 1 Missing ⚠️
plugins/comparative-adapters/src/index.ts 80.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3859      +/-   ##
==========================================
+ Coverage   63.40%   63.43%   +0.03%     
==========================================
  Files        1057     1061       +4     
  Lines       30787    30831      +44     
  Branches     7332     7356      +24     
==========================================
+ Hits        19520    19559      +39     
- Misses      11094    11100       +6     
+ Partials      173      172       -1     


@cmdcolin
Collaborator Author

cmdcolin commented Dec 5, 2023

I refreshed this branch, as I think it has major benefits for visualizing PAF, especially for use cases like viewing just the alignment in a small region, as @scottcain often does when loading up 6 or more synteny tracks of a small region. This can probably drastically reduce memory consumption and make many synteny use cases more palatable

Whole-genome overviews are not addressed by this yet and would still load a large amount of data, but the use case of loading just the synteny/alignment data for a region can be much improved by this PR

@cmdcolin
Collaborator Author

cmdcolin commented Dec 5, 2023

also updated some of the comments posted previously with cli usage, etc.

@cmdcolin
Collaborator Author

I changed the proposed file extension from ppaf to pif. I think a nice 3 letter extension is perhaps a little less confusing

and then added a new CLI command called "jbrowse make-pif file.paf"

This outputs file.pif.gz and file.pif.gz.tbi. An --out flag can optionally be supplied too

@cmdcolin
Collaborator Author

the new "jbrowse make-pif" also runs the sort, bgzip and tabix commands automatically so it should streamline usage
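For reference, the ordering that the earlier `sort -k1,1 -k3,3n` step produces, and that `tabix -s1 -b3 -e4` relies on, can be sketched in memory as below. This is not how make-pif performs the sort; the function name `sortForTabix` is hypothetical, and the plain string comparison corresponds to byte-order (C locale) sorting rather than any locale-aware collation.

```typescript
// Order lines by refname (column 1), then numerically by start (column 3).
function sortForTabix(lines: string[]): string[] {
  const keyed = lines.map(line => {
    const cols = line.split('\t')
    return { line, ref: cols[0] ?? '', start: Number(cols[2]) }
  })
  keyed.sort((a, b) => {
    if (a.ref < b.ref) {
      return -1
    }
    if (a.ref > b.ref) {
      return 1
    }
    // Numeric comparison, so that start 3 sorts before start 20.
    return a.start - b.start
  })
  return keyed.map(k => k.line)
}
```

The numeric comparison on the start column matters: a plain string sort would place a start of 20 before a start of 3, and tabix would then reject the file as unsorted.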

@cmdcolin cmdcolin merged commit e11b16b into main Dec 11, 2023
10 checks passed
@cmdcolin cmdcolin changed the title Pairwise indexed PAF adapter proposal Add new pairwise indexed PAF adapter format with CLI creation workflow Dec 11, 2023
@cmdcolin cmdcolin deleted the pairwise_paf branch December 11, 2023 16:46
Labels
enhancement New feature or request