Genomes from Elan 20220105 have been unpublished #179

Closed
SamStudio8 opened this issue Jan 6, 2022 · 12 comments

@SamStudio8
Member

SamStudio8 commented Jan 6, 2022

During the handling of yesterday's data integrity incident (#178), the data set for 2022-01-05 was republished. It appears during this process the new data for 2022-01-05 was not added back to the data set; essentially republishing the 2022-01-04 data set. These genomes are now missing from the 2022-01-06 data set and will need to be reinserted.

@SamStudio8 SamStudio8 self-assigned this Jan 6, 2022
@SamStudio8
Member Author

Yesterday I should have noticed that there was no new work to do when republishing the data set; this should have been grounds to stop what we were doing, as we were writing OVER the data set (and so there should have been plenty of work to do). The key mistake was that I ran the first step of cog-publish BEFORE repointing the published/head symlink.

The date coded into the head dir name is used to populate $LAST_DATE (https://github.com/SamStudio8/elan-nextflow/blob/master/bin/control/cog-publish.sh#L11). Because I had not repointed it, Majora had no new genomes to publish. Although I correctly repointed the dir to run the reconcile step shortly afterward -- I only double checked that the bad data had been removed (not realising all the rest had been removed too!).
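To illustrate the dependency, here is a minimal Python sketch of deriving a "last date" from a head symlink. It is not the logic in cog-publish.sh (which is a shell script); it assumes the symlink points at a directory named YYYYMMDD, and `last_date_from_head` is a hypothetical helper name.

```python
import os
from datetime import datetime


def last_date_from_head(head_link):
    """Return the date encoded in the directory a head symlink points at.

    Assumption: the target directory is named YYYYMMDD (e.g. 20220104).
    The result is formatted as YYYY-MM-DD.
    """
    # Follow the symlink to the real directory it currently points at.
    target = os.path.realpath(head_link)
    # The date is taken from the final path component only.
    name = os.path.basename(target)
    # strptime raises ValueError if the name is not a valid YYYYMMDD date.
    return datetime.strptime(name, "%Y%m%d").strftime("%Y-%m-%d")
```

If the symlink still points at the previous day's directory, the derived date is a day too late, so a query for anything published since that date can come back empty.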

@SamStudio8
Member Author

SamStudio8 commented Jan 6, 2022

I think we have two options:

  • Point head at 20220104 which could potentially republish the missing data (as well as today's) by extending the window in which reconcile is allowed to hit the file system
  • Update the PublishedArtifactGroup.published_date for the affected data in Majora, meaning they will look new for 2022-01-07

I'm tending towards the latter, if only because it would be more technically correct for the lost genomes to have their published date changed to the date they were inserted into the data set properly, and it's an easier solution to reason about.

@BioWilko
Contributor

BioWilko commented Jan 6, 2022

The latter also strikes me as less likely to lead to some unforeseen consequence.

@SamStudio8
Member Author

SamStudio8 commented Jan 6, 2022

So:

  • Pick an option to reinstate 2022-01-05 data
  • Check they appear in 2022-01-07
  • We'll write a proper playbook on republishing the data set to stop this happening again.

@SamStudio8
Member Author

SamStudio8 commented Jan 6, 2022

We're going to go with the safer (and technically more correct) Option 2. It should be straightforward (famous last words) as we can query Majora for published artifact groups with a published_date of 2022-01-05 and change them to 2022-01-07 (allowing them to be picked up as brand new tomorrow). I want to make sure we don't miss anything, so @BioWilko is comparing the 2022-01-04 and 2022-01-05 data sets to ensure nothing will slip through the cracks. Once we've got an exact count of the number of affected PAGs we can make the update to Majora. We'll be able to check this has worked with Ocarina too.
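A comparison like this can be sketched as a set difference over sequence names. This is only an illustration, assuming each data set is available as FASTA lines whose headers start with `>`; `fasta_ids` and `compare_releases` are hypothetical names, not part of Elan or Majora.

```python
def fasta_ids(lines):
    """Collect sequence names: the first token after '>' on each header line."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip a bare '>' with no name
                ids.add(fields[0])
    return ids


def compare_releases(old_lines, new_lines):
    """Return (names only in old, names only in new), each sorted."""
    old_ids = fasta_ids(old_lines)
    new_ids = fasta_ids(new_lines)
    return sorted(old_ids - new_ids), sorted(new_ids - old_ids)
```

Two empty lists would mean the two data sets contain exactly the same sequence names.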

I'm just chasing up a loose end at PHE, as they are reporting the Asklepian genome table was 2 genomes smaller, which is a discrepancy that doesn't fit expectations given what's happened.

@SamStudio8 SamStudio8 changed the title Genomes from Elan 20210105 have been unpublished Genomes from Elan 20220105 have been unpublished Jan 6, 2022
@SamStudio8
Member Author

The number of times I have typed 2021 in here is embarrassing

@BioWilko
Contributor

BioWilko commented Jan 6, 2022

Wow I actually didn't notice which is possibly worse

@SamStudio8
Member Author

I have consulted with the Majora oracle:

>>> models.PublishedArtifactGroup.objects.filter(published_date="2022-01-05").count()
14383
>>> models.PublishedArtifactGroup.objects.filter(published_date="2022-01-05", quality_groups__is_pass=True, quality_groups__test_group__slug="cog-uk-elan-minimal-qc").count()
14157
>>> models.PublishedArtifactGroup.objects.filter(published_date="2022-01-05", quality_groups__is_pass=True, quality_groups__test_group__slug="cog-uk-elan-minimal-qc", is_suppressed=False).count()
13986

The number of affected sequences is officially 13,986.

@SamStudio8
Member Author

I've also cleared up the -2 situation at PHE. I think we're ready to go and update the published dates for the affected sequences.

@SamStudio8
Member Author

SamStudio8 commented Jan 6, 2022

OK that's done.

>>> models.PublishedArtifactGroup.objects.filter(published_date="2022-01-07", quality_groups__is_pass=True, quality_groups__test_group__slug="cog-uk-elan-minimal-qc", is_suppressed=False).count()
13986

Note that we've left the rejected and suppressed genomes with the original published date because they are unaffected by this problem.
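The update command itself isn't shown in this thread. As a rough sketch of how a re-date like this could be guarded, assuming a Django queryset built with the same filters as the counts above (`redate_published` is a hypothetical helper, not part of Majora):

```python
def redate_published(queryset, new_date, expected_count):
    """Set published_date on every row in `queryset`, but only if the
    number of matching rows equals the count agreed beforehand.

    `queryset` is assumed to behave like a Django QuerySet, i.e. to
    provide count() and update(**fields).
    """
    found = queryset.count()
    if found != expected_count:
        # Refuse to touch anything if the selection is not what we expected.
        raise ValueError(
            f"expected {expected_count} rows, found {found}; nothing updated"
        )
    # QuerySet.update() returns the number of rows matched.
    return queryset.update(published_date=new_date)


# Hypothetical usage with the filters quoted in this thread:
# qs = models.PublishedArtifactGroup.objects.filter(
#     published_date="2022-01-05",
#     quality_groups__is_pass=True,
#     quality_groups__test_group__slug="cog-uk-elan-minimal-qc",
#     is_suppressed=False,
# )
# redate_published(qs, "2022-01-07", 13986)
```

Checking the count first means a mistyped filter fails loudly instead of silently re-dating the wrong set of PAGs.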

@SamStudio8
Member Author

Looking good

[20220107]$ tail -f publish.log 
20220107
[CPUB] LAST_DATE=2022-01-06
23466 pass.fasta.ls
23466 pass.bam.ls
141 kill.fasta.ls
141 kill.bam.ls

@BioWilko
Contributor

BioWilko commented Jan 7, 2022

I have checked the metadata TSV and big MSA and the missing samples appear to have been included in the latest dataset.
