Detect and identify missing metagenomes #245

Closed
Smithmania opened this issue Oct 14, 2022 · 14 comments


@Smithmania

Can we add a feature that prints text like you use for files that are not available (e.g., " not available") for metagenomes that are missing from the MGSD tags? This could probably be achieved by making a list of all IDs tagged as metagenomes, comparing it to the list polled from MGSD tags, and finding the difference.

This feature will be beneficial for us and end users to track the progress of MG analysis.
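
A minimal sketch of the comparison described above, assuming both ID lists are already available as plain Python collections. The function names and the sample IDs are hypothetical and are not taken from the bpaotu code base.

```python
def missing_metagenomes(tagged_ids, mgsd_ids):
    """Return IDs tagged as metagenomes that have no MGSD entry, sorted."""
    return sorted(set(tagged_ids) - set(mgsd_ids))


def format_missing(missing_ids):
    """One line per missing ID, echoing the existing "not available" wording."""
    return [f"{sample_id} not available" for sample_id in missing_ids]


if __name__ == "__main__":
    # Hypothetical sample IDs, for illustration only.
    tagged = ["12345", "12346", "12347"]
    in_mgsd = ["12346"]
    for line in format_missing(missing_metagenomes(tagged, in_mgsd)):
        print(line)
```

Using sets keeps the comparison independent of the order in which either list was polled.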

@hou098
Collaborator

hou098 commented Oct 17, 2022

@Smithmania

This would probably be achieved by making a list of all IDs tagged as metagenomes

How are samples tagged as having metagenome data? Do you mean those samples that have an OTU with the special metaxa_from_metagnome amplicon, or something else? The only tag I'm aware of is the 'type:amdb-metagenomics-analysed' tag on CKAN packages.

@Smithmania
Author

Smithmania commented Oct 18, 2022 via email

@hou098
Collaborator

hou098 commented Oct 18, 2022

@Smithmania: This might almost be a one-liner using the CKAN python API.

Something along the lines of…

package_search(q='type:(not amdb-metagenomics-analysed)', fq='tags:metagenomics')

… might work. I will investigate.

@hou098
Collaborator

hou098 commented Oct 18, 2022

This seems to work

len(ckan_remote_object.action.package_search(
    q='(tags:metagenomics) AND NOT (type:amdb-metagenomics-analysed)',
    rows=3000)['results'])

1893 results
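
CKAN deployments may cap the number of rows returned per request, so a single call with a large `rows` value is not guaranteed to return everything. The sketch below pages through the results instead. It assumes only the documented shape of the `package_search` action (keyword arguments `q`, `rows`, `start`; a response dict with `count` and `results`). The helper name `search_all` and the URL in the usage comment are hypothetical.

```python
def search_all(package_search, query, page_size=1000):
    """Collect every dataset matching `query`, one page at a time.

    `package_search` is any callable with the shape of CKAN's
    package_search action: keyword arguments q, rows and start, and a
    returned dict with 'count' and 'results'.
    """
    results = []
    while True:
        page = package_search(q=query, rows=page_size, start=len(results))
        results.extend(page["results"])
        # Stop once everything is fetched or the server returns an empty page.
        if not page["results"] or len(results) >= page["count"]:
            return results


QUERY = "(tags:metagenomics) AND NOT (type:amdb-metagenomics-analysed)"

# Usage with ckanapi (assumed to be installed; the URL is a placeholder):
#   from ckanapi import RemoteCKAN
#   ckan = RemoteCKAN("https://ckan.example.org")
#   datasets = search_all(ckan.action.package_search, QUERY)
#   print(len(datasets), "results")
```

Passing the search callable in as an argument keeps the paging logic testable without a live CKAN instance.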

@hou098
Collaborator

hou098 commented Oct 19, 2022

Clarifying the data model for the production of these CKAN datasets.

Datasets with tags:metagenomics and not of type:amdb-metagenomics-analysed are processed to generate new datasets that have tags:metagenomics and are of type:amdb-metagenomics-analysed. Only "input" datasets for this process that have res_format:FASTQ are eligible for processing.
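
To make the data model concrete, here is a rough sketch that classifies a single CKAN package dict according to the rules above. It assumes the usual CKAN package layout (`type` as a string, `tags` as a list of dicts with a `name` key, `resources` as a list of dicts with a `format` key). The function names and return labels are hypothetical.

```python
ANALYSED_TYPE = "amdb-metagenomics-analysed"


def has_tag(pkg, name):
    """True if the package carries a tag with the given name."""
    return any(tag.get("name") == name for tag in pkg.get("tags", []))


def has_fastq(pkg):
    """True if at least one resource has format FASTQ (case-insensitive)."""
    return any((res.get("format") or "").upper() == "FASTQ"
               for res in pkg.get("resources", []))


def classify(pkg):
    """Return 'analysed', 'eligible-input', or 'other' for a package dict."""
    if not has_tag(pkg, "metagenomics"):
        return "other"
    if pkg.get("type") == ANALYSED_TYPE:
        return "analysed"
    # Only inputs with FASTQ resources are eligible for processing.
    return "eligible-input" if has_fastq(pkg) else "other"
```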

@Smithmania
Author

Smithmania commented Oct 19, 2022 via email

@hou098
Collaborator

hou098 commented Oct 24, 2022

OK, how about a "special" URL, say, /metagenome/status to report this? There's already a precedent for this kind of thing as we have the undocumented /ingest/ URL for the ingest report.

@Smithmania
Author

Smithmania commented Oct 24, 2022 via email

@hou098
Collaborator

hou098 commented Oct 24, 2022 via email

@Smithmania
Author

Smithmania commented Nov 9, 2022

Modify metagenome download functionality to mimic the non-denoised amplicon behaviour. In this case MGSD files requested by a user for selected amplicon(s) will be emailed via the bpa help desk to the analytics team. The analytics team will manually distribute the requested files in consultation with the user.

As no MGSD data will be housed on CKAN, available metagenome samples should be identified by searching for the tag metagenomics and file type FASTQ. Instead of providing a list of available files for each individual sample, the popup selector should contain the list of all file types that the user can select (as in the "Download zip archive of selected metagenome files for selected samples" button). The analytics team will determine whether those files are available for the sample and consult with the individual requesting the data. It would still be good if samples not meeting metadata requirements were excluded; however, this can be done by the analytics team when retrieving the data.
Interactive sample searches using metadata (e.g. lat, long, vegetation type, environment) and map-based searches would be good, rather than the plain non-denoised sample request.

A list of available files for each sample ID will be prepared as part of the data analysis workflow and provided in the bpa-otu ingest packet. This list will be used to populate the file availability popup instead of a CKAN query. Samples excluded due to not meeting metadata requirements should either be removed, as for the amplicons, or better still flagged “unavailable due to non compliant metadata”.
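
One way to picture this lookup, assuming the list from the ingest packet has been parsed into a mapping from sample ID to the file types present. The thread does not specify a format for that list, so the data layout, the file type names in the test data, and the function name are all hypothetical.

```python
NON_COMPLIANT = "unavailable due to non compliant metadata"


def popup_entries(sample_id, available_files, non_compliant_ids, all_file_types):
    """Return {file_type: status} for one sample's availability popup."""
    # Flag non-compliant samples rather than silently dropping them.
    if sample_id in non_compliant_ids:
        return {file_type: NON_COMPLIANT for file_type in all_file_types}
    present = set(available_files.get(sample_id, ()))
    return {
        file_type: "available" if file_type in present else "not available"
        for file_type in all_file_types
    }
```

Flagging rather than removing non-compliant samples follows the "better still" option in the paragraph above.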

We will need to re-think how the MG search is done.

  • Perhaps an amplicon-style selector to switch between metaxa and MGSD data products. Currently, searching by metaxa data may omit some samples in the odd case where the metaxa analysis returns no results for that sample.
  • Perhaps when the search button is pressed with no taxonomy selected, the search results should be retrieved from the CKAN metagenomics/FASTQ search.

@hou098
Collaborator

hou098 commented Nov 9, 2022

Modify metagenome download functionality to mimic the non-denoised amplicon behaviour. In this case MGSD files requested by a user for selected amplicon(s) will be emailed via the bpa help desk to the analytics team. The analytics team will manually distribute the requested files in consultation with the user.

Just to be clear: you still want to be able to restrict by sample context search (e.g. lat, long, vegetation type, environment etc.), yes? i.e. you don't want something as plain as the non-denoised search where all you get to filter by is sample id.

We will need to re-think how the MG search is done. Perhaps an Amplicon style selector to switch between metaxa and MGSD data products. Currently searching by metaxa data may omit some samples in the odd case where no results are returned from the metaxa analysis for that sample

One possibility is to completely wildcard the amplicon part of the search. This should be possible without any dramatic ill effects. The only downside I can think of right now is that there will be more taxonomy selection options at every rank, as the choices will be built from all available values for that rank. (e.g. the Kingdom dropdown would include k_fungi as well as d_Archaea, and every other option ever available in the kingdom dropdown).
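
The effect on the dropdowns can be illustrated with a small sketch. It assumes taxonomy rows are available as dicts with an `amplicon` key plus one key per rank. That layout, the amplicon names, and the function name are hypothetical and do not reflect the bpaotu schema.

```python
def rank_options(rows, rank, amplicon=None):
    """Distinct values for `rank`; amplicon=None acts as a wildcard."""
    return sorted({
        row[rank] for row in rows
        if amplicon is None or row["amplicon"] == amplicon
    })


# Illustrative rows only; the amplicon names are placeholders.
ROWS = [
    {"amplicon": "A", "kingdom": "d_Archaea"},
    {"amplicon": "B", "kingdom": "k_fungi"},
]

print(rank_options(ROWS, "kingdom", amplicon="B"))  # ['k_fungi']
print(rank_options(ROWS, "kingdom"))                # ['d_Archaea', 'k_fungi']
```

With the wildcard, every value ever seen at a rank becomes a choice, which is the downside mentioned above.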

A further possibility is to remove the taxonomy dropdowns altogether in the metagenome search page.

Also, can you confirm that there's no metagenome data for the following samples? These are samples that have no associated otu or taxonomy info at all.

webapp=# select otu.sample_context.id, sample_site_location_description from otu.sample_context left outer join otu.sample_otu on otu.sample_context.id = otu.sample_otu.sample_id left join otu.otu on otu.sample_otu.otu_id = otu.otu.id left join otu.taxonomy_otu on otu.otu.id = otu.taxonomy_otu.otu_id where otu.sample_otu.otu_id is null;
   id   |                     sample_site_location_description                     
--------+--------------------------------------------------------------------------
 137929 | Mingenew
 7046   | Lake Lewis
 137799 | Kerang
 19572  | WCP12 (2003CN) - informal reserve (research) in production native forest
 138686 | Towra Point
 13554  | Antarctic
 34937  | inshore reef_Channel
 19571  | WCP12 (2003CN) - informal reserve (research) in production native forest
 7074   | Lake Way
 13566  | Antarctic
 141301 | Rottnest Island
 7072   | Mibbeyean Creek
 137853 | Clare
 34949  | inshore reef_Channel
 8290   | Rutherglen
 7073   | Lake Way
 13285  | King Island
 13734  | Credo Redgum Plot
 137923 | Tammin
(19 rows)

webapp=# 

@Smithmania
Author

Also, can you confirm that there's no metagenome data for the following samples? These are samples that have no associated otu or taxonomy info at all.

webapp=# select otu.sample_context.id, sample_site_location_description from otu.sample_context left outer join otu.sample_otu on otu.sample_context.id = otu.sample_otu.sample_id left join otu.otu on otu.sample_otu.otu_id = otu.otu.id left join otu.taxonomy_otu on otu.otu.id = otu.taxonomy_otu.otu_id where otu.sample_otu.otu_id is null;
   id   |                     sample_site_location_description                     
--------+--------------------------------------------------------------------------
 137929 | Mingenew
 7046   | Lake Lewis
 137799 | Kerang
 19572  | WCP12 (2003CN) - informal reserve (research) in production native forest
 138686 | Towra Point
 13554  | Antarctic
 34937  | inshore reef_Channel
 19571  | WCP12 (2003CN) - informal reserve (research) in production native forest
 7074   | Lake Way
 13566  | Antarctic
 141301 | Rottnest Island
 7072   | Mibbeyean Creek
 137853 | Clare
 34949  | inshore reef_Channel
 8290   | Rutherglen
 7073   | Lake Way
 13285  | King Island
 13734  | Credo Redgum Plot
 137923 | Tammin
(19 rows)

I don't see any datasets returned on CKAN (searching sample_id:102.100.100.<sample_id> for the above samples), except for 34949. This sample was on a 16S (plate AUWLK), and judging by the number of returned reads it looks like it failed sequencing. We do appear to have metadata in our DB for all samples, and at a quick glance it meets minimal standards, so it's likely those samples completely failed sequencing (no FASTQ generated).

@hou098
Collaborator

hou098 commented Nov 13, 2022

@Smithmania
What do you think about this:

One possibility is to completely wildcard the amplicon part of the search. This should be possible without any dramatic ill effects. The only downside I can think of right now is that there will be more taxonomy selection options at every rank, as the choices will be built from all available values for that rank. (e.g. the Kingdom dropdown would include k_fungi as well as d_Archaea, and every other option ever available in the kingdom dropdown).

@hou098
Collaborator

hou098 commented Jan 9, 2023

Modify metagenome download functionality to mimic the non-denoised amplicon behaviour. In this case MGSD files requested by a user for selected amplicon(s) will be emailed via the bpa help desk to the analytics team. The analytics team will manually distribute the requested files in consultation with the user.

Implemented in https://github.com/BioplatformsAustralia/bpaotu/tree/1.36.0 (see 1f1647a)

In metagenome mode, the amplicon selector can be set to '--', which selects every sample tagged as having metagenome data, regardless of taxonomy.

hou098 closed this as completed Jan 9, 2023