[validator] Add additional countries to adm1 #50

Closed
1 of 5 tasks
tomrconnor opened this issue Apr 13, 2021 · 24 comments

Comments

@tomrconnor

tomrconnor commented Apr 13, 2021

Brief description

COG-UK may begin to receive samples from British Overseas Territories or Crown Dependencies. This will require an update to adm1 in order to provide the option for sequences from these locations to be correctly linked. These may require some form of suppression post-upload, for example in the generation of data for the public MicroReact.

Detail

| Validator | Field | Proposed option | Justification |
|---|---|---|---|
| adm1 | adm1 | FK | Falkland Islands are not covered by current adm1 options and have their own ISO 3166-2 identifier |
| adm1 | adm1 | GI | Gibraltar is not covered by current adm1 options and has its own ISO 3166-2 identifier |
| adm1 | adm1 | JE | Jersey is not covered by current adm1 options and has its own ISO 3166-2 identifier |
| adm1 | adm1 | IM | The Isle of Man is not covered by current adm1 options and has its own ISO 3166-2 identifier |
| adm1 | adm1 | GG | Guernsey is not covered by current adm1 options and has its own ISO 3166-2 identifier |
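
As a rough illustration of the request, the expanded option set could look like the sketch below, assuming the validator is a simple membership check. `validate_adm1` and the set names are hypothetical and not Majora's actual API, and the home nation codes other than UK-ENG are assumed by analogy rather than taken from this issue.

```python
# Hypothetical sketch; names are illustrative, not Majora's real validator.
# Only UK-ENG is named in this issue; the other home nation codes are assumed.
HOME_NATION_ADM1 = {"UK-ENG", "UK-SCT", "UK-WLS", "UK-NIR"}

# Proposed additions from the table above.
PROPOSED_ADM1 = {"FK", "GI", "JE", "IM", "GG"}

ALLOWED_ADM1 = HOME_NATION_ADM1 | PROPOSED_ADM1


def validate_adm1(value: str) -> str:
    """Return the normalised adm1 code, or raise ValueError if not allowed."""
    code = value.strip().upper()
    if code not in ALLOWED_ADM1:
        raise ValueError(f"adm1 {value!r} is not an accepted option")
    return code
```

Keeping the proposed codes in their own set makes it easy to switch them on only once downstream consumers are ready.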

Proposed by: Connor, T. R. PHWC

  • I have read the documentation for the existing validators and there is no suitable entry

  • Add field validation to Majora forms
  • Add field to API docs
  • Push to Majora prod
  • Update validator spreadsheet (if applicable)
@SamStudio8
Member

SamStudio8 commented Apr 13, 2021

Discussed this with TRC on Slack but adding the salient points here:

  • Adding options to the adm1 validator is a straightforward request, but there will be a few tasks to tick off before we can set this up.

  • Deductive disclosure concerns are currently handled post-hoc before individual sequence-level data is made available to public sources (e.g. via Microreact). Standard approaches should apply here.

@tomrconnor can you confirm that the sequences from these new country codes will not need any other special handling? That is, is there a problem with automatically uploading genomes from these countries with the basic public metadata using the same system currently provided for the home nations, or with providing the consortium-level metadata to COG through the usual means?


  • Flag this to DIPI
  • Audit all code and pipelines that use adm1 and ensure they'll be compatible with upcoming changes e.g.:
    • Automated GISAID and ENA submissions will need to have their country code to full name mappings expanded
    • Datapipe geography munging code will need to be aware of the changes
    • Microreact team will need to be aware of the changes

@tomrconnor
Author

As far as I am aware, deductive disclosure issues aside, there shouldn't be any need for special handling. I think the advice to the locations concerned will be to not upload other details, so we may need to think about what they put into adm2. I would expect that the upload of metadata would use the same system as the rest of COG-UK.

The locations concerned will be signing the data access agreement and will basically be treated like other sites.

@SamStudio8
Member

SamStudio8 commented Apr 13, 2021 via email

@SamStudio8 SamStudio8 changed the title from "[validator]" to "[validator] Add additional countries to adm1" Apr 14, 2021
@SamStudio8
Member

Flagged this to DIPI. RC will follow up with VH and AT on geography cleaning steps. AU confirms this should fit in to standard practice with Microreact. Test data might be useful.

@SamStudio8
Member

@tomrconnor Although adm2 is not controlled by Majora it would be useful to get an idea of what the adm2 data for each of these countries may look like, do you have any example sample-level metadata or aggregate adm2 counts that we might be able to feed back to geography teams so they can update their scripts?

@SamStudio8
Member

SamStudio8 commented Apr 14, 2021

I have conducted a small-scale audit of my software directory and spotted the following:

  • Majora API (validator)
  • Majora API response (adm1_trans)
  • ocarina_resolve / Elan (used to convert country for inclusion in FASTA header via Elan)
  • ocarina (no direct handling)
  • asklepian (no direct handling)
  • ENA outbound (Elan/bam) (uses Majora adm1_trans, but fixed adm0)
  • ENA assembly outbound (no direct handling)
  • GISAID outbound (uses adm1_trans, but has fixed UK adm0)

I've realised there is a slightly wider problem here: we've never asked for adm0, because it has always been assumed to be United Kingdom. All samples have their adm0 set to UK in Majora, but we will need to automatically fill in and handle adm0 with something appropriate going forward. Note that adm0 cannot be blank, as it is a proxy for whether the biosample has been filled in (which is a different future issue).
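
One possible way to fill adm0 automatically is to derive it from adm1 and fail loudly rather than leave it blank. This is only a sketch under assumptions: `derive_adm0` and the mapping are hypothetical, the prefix rule assumes the home nations keep their existing UK- codes, and the adm0 strings for the new locations are placeholders until the accepted GISAID and ENA formats are known.

```python
# Hypothetical sketch: derive adm0 from adm1 so it is never left blank.
# The adm0 strings for the new codes are placeholders, not agreed values.
NON_UK_ADM0 = {
    "FK": "Falkland Islands",
    "GI": "Gibraltar",
    "JE": "Jersey",
    "IM": "Isle of Man",
    "GG": "Guernsey",
}


def derive_adm0(adm1: str) -> str:
    """Map an adm1 code to an adm0 value; raise rather than return blank."""
    if adm1.startswith("UK-"):
        # Existing behaviour: home nation samples have adm0 set to UK.
        return "UK"
    try:
        return NON_UK_ADM0[adm1]
    except KeyError:
        raise ValueError(f"cannot derive adm0 for adm1 {adm1!r}") from None
```

Raising on an unknown code keeps adm0 usable as a proxy for whether the biosample has been filled in.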

Will need to compile the accepted formats for these new countries at GISAID and ENA.

Additionally, if PHE needs to receive the local lab ID field then we will need to revisit the configuration of the agreement within Majora to expand the country list from UK-ENG.

  • Ask GB GISAID for country list
  • Ask NR ENA about country validation

@SamStudio8
Member

SamStudio8 commented Apr 15, 2021

Outbound lookups as follows:

| adm1 | Country | Genbank[1] | GISAID |
|---|---|---|---|
| FK | Falkland Islands | Falkland Islands (Islas Malvinas) | South America / Falkland Islands |
| GI | Gibraltar | Gibraltar | Europe / Gibraltar |
| JE | Jersey | Jersey | Europe / Jersey |
| IM | Isle of Man | Isle of Man | Europe / Isle of Man |
| GG | Guernsey | Guernsey | Europe / Guernsey |

[1] https://www.ncbi.nlm.nih.gov/genbank/collab/country/
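
The lookups above could be carried into the outbound pipes as plain data, roughly as sketched here. The strings are copied from the table; the `OutboundCountry` structure and the `outbound_country` helper are assumptions for illustration and do not reflect how Elan or the submission scripts are actually organised.

```python
from typing import NamedTuple


class OutboundCountry(NamedTuple):
    """Country strings for one adm1 code (structure is hypothetical)."""
    country: str
    genbank: str
    gisaid: str


# Values copied from the lookup table above.
OUTBOUND_LOOKUP = {
    "FK": OutboundCountry("Falkland Islands",
                          "Falkland Islands (Islas Malvinas)",
                          "South America / Falkland Islands"),
    "GI": OutboundCountry("Gibraltar", "Gibraltar", "Europe / Gibraltar"),
    "JE": OutboundCountry("Jersey", "Jersey", "Europe / Jersey"),
    "IM": OutboundCountry("Isle of Man", "Isle of Man", "Europe / Isle of Man"),
    "GG": OutboundCountry("Guernsey", "Guernsey", "Europe / Guernsey"),
}


def outbound_country(adm1: str, target: str) -> str:
    """Return the country string for target 'genbank' or 'gisaid'."""
    if target not in ("genbank", "gisaid"):
        raise ValueError(f"unknown outbound target {target!r}")
    entry = OUTBOUND_LOOKUP[adm1]  # KeyError for codes not in the table
    return getattr(entry, target)
```

For example, `outbound_country("FK", "genbank")` gives the Genbank spelling, while `outbound_country("FK", "gisaid")` gives the region-prefixed GISAID form.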

@rmcolq

rmcolq commented Apr 15, 2021

Just to check, the proposed adm1s would mean that instead of e.g. UK-ENG you might see FK (not UK-FK)?

@rambaut

rambaut commented Apr 15, 2021

I thought it was GB-ENG?

@SamStudio8
Member

SamStudio8 commented Apr 15, 2021

@rmcolq Yes; the proposed adm1 will be integrated as-is rather than prefixed (i.e. using the proper ISO 3166-2 codes). The home nations will continue to use the existing adm1 (which are modified ISO 3166-2 codes using UK over GB).

@rambaut

rambaut commented Apr 15, 2021

Complication is none of those new places are technically part of the UK (but then Northern Ireland is not part of Great Britain but still has GB-NIR under ISO 3166-2).

@SamStudio8
Member

SamStudio8 commented Apr 15, 2021

@rambaut Indeed - there is a bunch of hard coding assuming United Kingdom dotted around a few things on this end of Elan. I'm not familiar with the NUTS (hah) and bolts of the geography cleaning but presumably this is a bit of work to integrate. There won't be any changes in Majora until we're confident that everything downstream will accept the new codes without undefined behaviour.

@rmcolq

rmcolq commented Apr 21, 2021

Verity tells me that geography_cleaning.py now handles these 5 additional adm1 (and those that were added before with postcodes from these regions and an adm1 of England, although I assume these will be overwritten in Majora once the new ones are allowed?). I have a few small changes to make to the datapipe and publishing steps which filter on 'is_uk'; left unchanged, these would not break the pipeline, only cause sequences to be included or excluded where we don't want them.
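
To illustrate the kind of 'is_uk' filtering at stake, here is a hypothetical sketch. Datapipe's real logic is not shown in this thread, so the function, the record layout, and the `include_new_adm1` switch are all assumptions; the point is only that the new codes need an explicit decision rather than falling through a UK- prefix check.

```python
# Hypothetical sketch of an 'is_uk'-style filter; not datapipe's actual code.
NEW_ADM1 = {"FK", "GI", "JE", "IM", "GG"}


def is_uk(adm1: str, include_new_adm1: bool = False) -> bool:
    """True for UK-prefixed codes; new codes count only when requested."""
    if adm1.startswith("UK-"):
        return True
    return include_new_adm1 and adm1 in NEW_ADM1


# Assumed record layout, for illustration only.
records = [
    {"id": "a", "adm1": "UK-ENG"},
    {"id": "b", "adm1": "JE"},
]

strict = [r["id"] for r in records if is_uk(r["adm1"])]
relaxed = [r["id"] for r in records if is_uk(r["adm1"], include_new_adm1=True)]
```

With a prefix-only check the Jersey record silently drops out of `strict`, which is the non-fatal inclusion or exclusion problem described above.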

@SamStudio8
Member

Awesome. I see the changes in COG-UK/geography_cleaning@54f5c76 tracked by COG-UK/geography_cleaning#2. @ViralVerity from the look of the patch this will work even if users don't change to the "correct" adm1 too? I don't plan to automatically update the existing data immediately so wouldn't want to break things downstream.

@rmcolq

rmcolq commented Apr 22, 2021

Yes it will work even if you don't update the existing ones. Datapipe/Phylopipe2 changes made (largely just the publishing step), just testing now.

@ViralVerity

Yes - it should propagate up to country level! There have been quite a lot of sequences from the new adm1s so far that just had postcodes, so it will pick those up. I'm also happy to take advice on what the cleaned up country looks like in terms of underscores vs spaces and capitalisation.

@SamStudio8
Member

SamStudio8 commented Apr 22, 2021

@rmcolq @ViralVerity Thanks both! Looks like it's mostly with me to sort out the outbound pipes and Majora itself now... The main thing I still need to sort out is whether these new country codes need to come under the same treatment as the UK-ENG code in terms of specimen ID data sharing with PHE.

@ViralVerity I'll be using the countries as in the table above but whatever is more consistent with the cleaning you do already should be fine I think. We could set up a page on the docs site about geography cleaning if it would be helpful for people?

@ViralVerity

Great ok, yeah I've got those as inputs!
I think adding the geography cleaning readme onto the docs site would be a good idea; I suspect some of the columns are not quite self-explanatory.

@SamStudio8 SamStudio8 added the next On the cusp of being worked on label Apr 22, 2021
@SamStudio8
Member

Health Informatics group has asked for an update on this. It looks like we're ready downstream. Before we carry on I just want to chase up whether each of these adm1 are signed on to the agreement such that we can handle them as standard.

@ViralVerity

I can confirm geography cleaning takes the countries exactly as above (ie no "UK" and in capitals) and will return prettier, human readable versions, in line with how the current adm1s are treated.

@SamStudio8
Member

@tomrconnor Do you know whether all these new adm1 have signed up to COG now? I don't want to inadvertently trigger an ethics meeting!

@SamStudio8 SamStudio8 added blocked in progress and removed next On the cusp of being worked on labels May 4, 2021
@tomrconnor
Author

They were in the process of it, I think. PHE are sequencing for some of these locations; I think the other sites won't be told they can have access to CLIMB until the HI group knows that this work has been done. So the downstream work being done is great, and I think the next thing is to pass this back to the HI group for them to manage the next actions.

@SamStudio8
Member

@tomrconnor Any update from HI?

@SamStudio8
Member

Bumping to backlog #62
