Genomes from Elan 20220105 have been unpublished #179

SamStudio8 · 2022-01-06T12:12:59Z

During the handling of yesterday's data integrity incident (#178), the data set for 2022-01-05 was republished. It appears that during this process the new data for 2022-01-05 was not added back to the data set, so we essentially republished the 2022-01-04 data set. These genomes are now missing from the 2022-01-06 data set and will need to be reinserted.

SamStudio8 · 2022-01-06T12:24:42Z

Yesterday I should have noticed that there was no new work to do when republishing the data set; this should have been grounds to stop what we were doing as we were writing OVER the data set (and so there should have been plenty of work to do). The key mistake was I ran the first step of cog-publish BEFORE repointing the published/head symlink.

The name coded into the head dir name is used to populate $LAST_DATE (https://github.com/SamStudio8/elan-nextflow/blob/master/bin/control/cog-publish.sh#L11). Because I had not repointed it, Majora had no new genomes to publish. Although I correctly repointed the dir to run the reconcile step shortly afterward, I only double-checked that the bad data had been removed (not realising all the rest had been removed too!).
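For anyone reading this later, here is a minimal Python sketch of why the order of operations matters. It assumes the data set directories are named YYYYMMDD and that the last-published date is read from whichever directory the head symlink resolves to. The function name `last_date_from_head` and the temporary directory layout are hypothetical; this is not the actual logic in cog-publish.sh.

```python
import datetime
import os
import tempfile


def last_date_from_head(head_link: str) -> datetime.date:
    """Parse a YYYYMMDD date from the directory name the symlink resolves to."""
    target = os.path.realpath(head_link)
    return datetime.datetime.strptime(os.path.basename(target), "%Y%m%d").date()


def demo() -> tuple:
    """Return the date seen before and after repointing head (hypothetical layout)."""
    with tempfile.TemporaryDirectory() as root:
        for name in ("20220104", "20220105"):
            os.mkdir(os.path.join(root, name))
        head = os.path.join(root, "head")

        # head still points at the data set we are about to overwrite
        os.symlink(os.path.join(root, "20220105"), head)
        stale = last_date_from_head(head)

        # repoint head at the previous data set before asking for new genomes
        os.remove(head)
        os.symlink(os.path.join(root, "20220104"), head)
        repointed = last_date_from_head(head)
    return stale, repointed
```

With the stale link the derived date is 2022-01-05, so anything published on that day would presumably no longer count as new; after repointing it is 2022-01-04.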

SamStudio8 · 2022-01-06T12:29:48Z

I think we have two options:

1. Point head at 20220104, which could potentially republish the missing data (as well as today's) by extending the window in which reconcile is allowed to hit the file system.
2. Update the PublishedArtifactGroup.published_date for the affected data in Majora, meaning they will look new for 2022-01-07.

I'm tending towards the latter, if only because it would be more technically correct for the lost genomes to have their published date changed to the date they were inserted into the data set properly, and it's an easier solution to reason about.

BioWilko · 2022-01-06T12:31:13Z

The latter also strikes me as less likely to lead to some unforeseen consequence.

SamStudio8 · 2022-01-06T12:31:33Z

So:

Pick an option to reinstate 2022-01-05 data
Check they appear in 2022-01-07
We'll write a proper playbook on republishing the data set to stop this happening again.

SamStudio8 · 2022-01-06T12:46:40Z

We're going to go with the safer (and technically more correct) Option 2. It should be straightforward (famous last words) as we can query Majora for published artifact groups with a published_date of 2022-01-05 and change them to 2022-01-07 (allowing them to be picked up as brand new tomorrow). I want to make sure we don't miss anything so @BioWilko is comparing the 2022-01-04 and 2022-01-05 data sets to ensure nothing will slip through the cracks. Once we've got an exact count of the number of affected PAGs we can make the update to Majora. We'll be able to check this has worked with Ocarina too.
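One way a comparison like this could be done is a set difference over FASTA header names. The sketch below is only an illustration: the function names and the sample names in the example are hypothetical, and it is not necessarily how the two data sets were actually compared.

```python
def fasta_names(lines):
    """Collect sequence names from FASTA header lines (those starting with '>')."""
    return {line[1:].strip() for line in lines if line.startswith(">")}


def added_between(previous_lines, current_lines):
    """Names present in the current data set but absent from the previous one."""
    return sorted(fasta_names(current_lines) - fasta_names(previous_lines))


# Hypothetical miniature data sets standing in for 2022-01-04 and 2022-01-05
previous = [">sample-A", "ACGT", ">sample-B", "ACGT"]
current = previous + [">sample-C", "ACGT"]
print(added_between(previous, current))
```

The length of the resulting list gives a count that can be checked against what Majora reports for the same date.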

I'm just chasing up a loose end at PHE as they are reporting that the Asklepian genome table was 2 genomes smaller, a discrepancy that doesn't fit expectations given what's happened.

SamStudio8 · 2022-01-06T13:12:05Z

The number of times I have typed 2021 in here is embarrassing

BioWilko · 2022-01-06T13:14:11Z

Wow I actually didn't notice which is possibly worse

SamStudio8 · 2022-01-06T13:16:30Z

I have consulted with the Majora oracle:

>>> models.PublishedArtifactGroup.objects.filter(published_date="2022-01-05").count()
14383
>>> models.PublishedArtifactGroup.objects.filter(published_date="2022-01-05", quality_groups__is_pass=True, quality_groups__test_group__slug="cog-uk-elan-minimal-qc").count()
14157
>>> models.PublishedArtifactGroup.objects.filter(published_date="2022-01-05", quality_groups__is_pass=True, quality_groups__test_group__slug="cog-uk-elan-minimal-qc", is_suppressed=False).count()
13986

The number of affected sequences is officially 13,986.

SamStudio8 · 2022-01-06T13:18:06Z

I've also cleared up the -2 situation at PHE. I think we're ready to go and update the published dates for the affected sequences.

SamStudio8 · 2022-01-06T13:46:11Z

OK that's done.

>>> models.PublishedArtifactGroup.objects.filter(published_date="2022-01-07", quality_groups__is_pass=True, quality_groups__test_group__slug="cog-uk-elan-minimal-qc", is_suppressed=False).count()
13986

Note that we've left the rejected and suppressed genomes with the original published date because they are unaffected by this problem.
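The selection rule described above can be sketched in plain Python. This is an illustration only: `Pag` is a hypothetical stand-in for a published artifact group record, not the Majora model, and the actual update command run against Majora is not shown in this thread.

```python
import datetime
from dataclasses import dataclass


@dataclass
class Pag:
    """Hypothetical stand-in for a published artifact group record."""
    name: str
    published_date: datetime.date
    is_pass: bool
    is_suppressed: bool


def redate(pags, old, new):
    """Move passing, unsuppressed records dated `old` to `new`; return how many changed."""
    changed = 0
    for pag in pags:
        if pag.published_date == old and pag.is_pass and not pag.is_suppressed:
            pag.published_date = new
            changed += 1
    return changed


old, new = datetime.date(2022, 1, 5), datetime.date(2022, 1, 7)
example = [
    Pag("a", old, True, False),   # affected: gets the new date
    Pag("b", old, False, False),  # failed QC: keeps the original date
    Pag("c", old, True, True),    # suppressed: keeps the original date
]
print(redate(example, old, new))
```

The returned count plays the same role as the 13986 check above: the number of records moved should match the number counted under the old date with the same filters.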

SamStudio8 · 2022-01-07T09:56:31Z

Looking good

[20220107]$ tail -f publish.log 
20220107
[CPUB] LAST_DATE=2022-01-06
23466 pass.fasta.ls
23466 pass.bam.ls
141 kill.fasta.ls
141 kill.bam.ls

BioWilko · 2022-01-07T10:31:28Z

I have checked the metadata TSV and big MSA and the missing samples appear to have been included in the latest dataset.

SamStudio8 added the incident label Jan 6, 2022

SamStudio8 self-assigned this Jan 6, 2022

BioWilko mentioned this issue Jan 6, 2022

[manualpipe] Asklepian 20220105 #178

Closed

8 tasks

SamStudio8 changed the title ~~Genomes from Elan 20210105 have been unpublished~~ Genomes from Elan 20220105 have been unpublished Jan 6, 2022

SamStudio8 closed this as completed Jan 7, 2022
