baseline filter #328

Open
aswarren opened this issue Aug 10, 2023 · 5 comments

Comments

@aswarren

Pulling surveillance data from the API returns all sequences regardless of why they were sampled. For the US / GISAID this includes traveller surveillance, which can give a very different picture from domestic spread when estimating prevalence for a particular area. Is there a way to filter sequences by a baseline sequencing tag? If not, it would be useful to have.

@chaoran-chen
Member

We have a field samplingStrategy. You can see the available tags using fields=samplingStrategy, e.g., at https://lapis.cov-spectrum.org/open/v1/sample/aggregated?fields=samplingStrategy.
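For anyone scripting this, here is a minimal Python sketch of how that URL could be built and fetched. The base URL and the `fields=samplingStrategy` parameter come from the comment above. The helper names (`aggregated_url`, `fetch_counts`) are made up for illustration, and using `samplingStrategy=N` together with `fields=country` as a filter is an assumption that this thread does not confirm.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the link in the comment above.
LAPIS_OPEN = "https://lapis.cov-spectrum.org/open/v1"


def aggregated_url(fields, **filters):
    """Build an aggregated-sample URL; keyword arguments become query parameters."""
    params = {"fields": ",".join(fields), **filters}
    return f"{LAPIS_OPEN}/sample/aggregated?{urlencode(params)}"


def fetch_counts(url):
    """Fetch the URL and return the list under the response's "data" key."""
    with urlopen(url) as resp:
        return json.load(resp)["data"]


# Same request as the link above: counts grouped by samplingStrategy.
tag_url = aggregated_url(["samplingStrategy"])

# Assumption: samplingStrategy (and country) are also accepted as filter
# parameters. The thread suggests this but does not show it.
filtered_url = aggregated_url(["country"], samplingStrategy="N")
```

Calling `fetch_counts(tag_url)` would perform the actual request; nothing is fetched until then.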

@aswarren
Author

aswarren commented Aug 10, 2023

Awesome! Thanks!
Is there a field guide for explanation of A, X, Y, N?

{"errors":[],"info":{"apiVersion":1,"dataVersion":1690103788,"deprecationDate":null,"deprecationInfo":null,"acknowledgement":null},"data":[{"samplingStrategy":"A","count":48019},{"samplingStrategy":"X","count":192119},{"samplingStrategy":"Y","count":44101},{"samplingStrategy":"N","count":314101},{"samplingStrategy":null,"count":7683436}]}
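To put those counts in proportion, the response can be tallied with a few lines of Python. This is only a sketch over the exact JSON pasted above; the function name `strategy_counts` is made up for illustration.

```python
import json

# Response copied verbatim from the comment above.
response_text = (
    '{"errors":[],"info":{"apiVersion":1,"dataVersion":1690103788,'
    '"deprecationDate":null,"deprecationInfo":null,"acknowledgement":null},'
    '"data":[{"samplingStrategy":"A","count":48019},'
    '{"samplingStrategy":"X","count":192119},'
    '{"samplingStrategy":"Y","count":44101},'
    '{"samplingStrategy":"N","count":314101},'
    '{"samplingStrategy":null,"count":7683436}]}'
)


def strategy_counts(payload):
    """Map each samplingStrategy value (None for JSON null) to its count."""
    return {row["samplingStrategy"]: row["count"] for row in payload["data"]}


counts = strategy_counts(json.loads(response_text))
total = sum(counts.values())
tagged = total - counts.get(None, 0)  # sequences with any non-null tag
print(f"{tagged} of {total} sequences carry a tag ({tagged / total:.1%})")
```

On this snapshot only about 7% of sequences have a non-null `samplingStrategy`, so a filter on this field would discard most of the data.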

@corneliusroemer
Contributor

> Is there a field guide for explanation of A, X, Y, N?

The values A, X, Y, N appear only for data pulled from RKI (Germany's equivalent of the CDC), as opposed to GenBank. Their README is here: https://github.com/robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland

[image]

It's a bit scrambled, the sentences seem incomplete. I would say:
X: unknown whether targeted or not
Y: sequencing done potentially due to interesting mutations/variant PCR
A: Variant PCR suggested something of interest
N: Representative sampling

I'm not sure about how reliable the annotation is though. I remember that when I looked into it a year ago, it seemed like representative sampling wasn't necessarily representative.

I think the field was introduced back when labs started doing variant PCRs to get a quick idea of which variant a patient had, since variant PCR was as fast as regular PCR and involved less delay than waiting for whole genome sequencing.
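If it helps when post-processing, the reading of the codes given in this comment can be written down as a small lookup table. This is a sketch of the interpretation offered in this thread, not an official RKI definition, and the helper names `describe` and `baseline_only` are made up for illustration. Given the caveat above, rows tagged `N` are not guaranteed to be truly representative.

```python
# Interpretation offered in this thread, NOT an official RKI definition.
SAMPLING_STRATEGY = {
    "X": "unknown whether targeted or not",
    "Y": "possibly sequenced because of interesting mutations / variant PCR",
    "A": "variant PCR suggested something of interest",
    "N": "representative sampling",
}


def describe(code):
    """Return the thread's reading of a samplingStrategy code."""
    if code is None:
        return "no tag"
    return SAMPLING_STRATEGY.get(code, f"unrecognised code {code!r}")


def baseline_only(rows):
    """Keep only rows tagged N (representative sampling)."""
    return [row for row in rows if row["samplingStrategy"] == "N"]


# Rows in the shape of the aggregated response quoted earlier in the thread.
rows = [
    {"samplingStrategy": "A", "count": 48019},
    {"samplingStrategy": "N", "count": 314101},
    {"samplingStrategy": None, "count": 7683436},
]
for row in rows:
    print(row["count"], describe(row["samplingStrategy"]))
```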

@aswarren
Author

aswarren commented Aug 10, 2023

Ah thanks very much to you both. Since @chaoran-chen's example uses the open API, I was also wondering about the mapping from the "purpose_of_sampling" tag in NCBI to the codes explained in @corneliusroemer's link?
One example where the baseline tag ends up mattering in the US is CDC sequencing of nasal swabs versus traveller surveillance. In previous months, when pulling down surveillance data via the API, the growth curve of XBB.1.16 looked much more aggressive in domestic surveillance because traveller surveillance was being included. If I were estimating prevalence in a state, I likely wouldn't want to include people landing at the airport, whether domestic or international. That motivated my initial question about the ability to filter, since presumably traveller surveillance wouldn't qualify as baseline or might be distinguishable in some way via that field.
On NCBI the purpose_of_sampling can be accessed via CLI like so:
$ datasets summary virus genome taxon sars-cov-2 --released-after 05/20/2023 | jq -r '.reports[] | select(.purpose_of_sampling != null) | [.accession,.purpose_of_sampling,.isolate.name] | @tsv' >ncbi_baseline.tsv
Most of that command line magic was provided by Eric Cox at NCBI-Datasets
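For anyone who prefers to do the jq step in Python, here is a rough equivalent. It assumes the `datasets` output has been saved as a single JSON document with a top-level `reports` array, which is the shape the jq filter above relies on. The function names and the sample records are hypothetical and only illustrate the structure.

```python
import csv
import io


def purpose_rows(summary):
    """Yield [accession, purpose_of_sampling, isolate name] for tagged reports."""
    for report in summary.get("reports", []):
        purpose = report.get("purpose_of_sampling")
        if purpose is None:
            continue  # mirrors select(.purpose_of_sampling != null)
        isolate = report.get("isolate") or {}
        yield [report.get("accession"), purpose, isolate.get("name")]


def to_tsv(rows):
    """Render rows as tab-separated text; missing values become empty cells."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical records, invented only to show the expected structure.
sample = {
    "reports": [
        {"accession": "XX000001.1", "purpose_of_sampling": "baseline",
         "isolate": {"name": "hypothetical/isolate-1"}},
        {"accession": "XX000002.1",
         "isolate": {"name": "hypothetical/isolate-2"}},
        {"accession": "XX000003.1", "purpose_of_sampling": "traveller",
         "isolate": None},
    ]
}
rows = list(purpose_rows(sample))
tsv = to_tsv(rows)
```

With real data, `summary` would come from `json.load` on the saved output, and the rows could then be filtered on whatever `purpose_of_sampling` values actually appear.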

@corneliusroemer
Contributor

Ah very nice @aswarren! The open data gets to LAPIS via nextstrain/ncov-ingest, and I don't think we currently use that purpose_of_sampling field there, though we definitely should.
