# NCBI Genome RefSeq ingestion script

This script reads the NCBI RefSeq assembly summary listing from https://ftp.ncbi.nih.gov/genomes/refseq/assembly_summary_refseq.txt and attempts to map its columns to Darwin Core (DwC) terms, using any DwC extensions that apply. More info in [this issue](https://github.com/AtlasOfLivingAustralia/arga-issues/issues/8).

In [40]:
import pandas as pd
df = pd.read_table("https://ftp.ncbi.nih.gov/genomes/refseq/assembly_summary_refseq.txt", header=1, dtype = 'object', parse_dates=['seq_rel_date'])
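
The `header=1` argument is needed because the column names sit on the second line of the file, after a leading comment line. A minimal sketch on an inline sample, assuming that layout; the sample text and the name `sample_df` are illustrative only and are not part of the original script:

```python
import io
import pandas as pd

# Hypothetical sample mimicking the file layout: one comment line,
# then the tab-separated header row, then data rows.
sample = (
    "#   See README_assembly_summary.txt for a description of the columns\n"
    "# assembly_accession\ttaxid\tseq_rel_date\n"
    "GCF_000001215.4\t7227\t2014-08-01\n"
    "GCF_000001405.40\t9606\t2022-02-03\n"
)

sample_df = pd.read_table(
    io.StringIO(sample),
    header=1,                      # row 0 is the comment line, row 1 holds the column names
    dtype="object",                # keep identifiers such as taxid as strings
    parse_dates=["seq_rel_date"],  # the release date is still converted to a datetime
)
print(sample_df.dtypes)
```

Reading everything as `object` avoids pandas guessing numeric types for identifier columns, while `parse_dates` still converts the one column that is a real date.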

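Look at the column data types reported by pandas. This cell was last run after the rename step further down (note the execution counter), so the output already shows the mapped column names.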

In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261398 entries, 0 to 261397
Data columns (total 33 columns):
 #   Column                          Non-Null Count   Dtype         
---  ------                          --------------   -----         
 0   geneticAccessionNumber          261398 non-null  object        
 1   ncbi_bioproject                 261260 non-null  object        
 2   ncbi_biosample                  249679 non-null  object        
 3   ncbi_nuccore                    218468 non-null  object        
 4   ncbi_refseq_category            261398 non-null  object        
 5   taxonID                         261398 non-null  object        
 6   specific_host                   261398 non-null  object        
 7   scientificName                  261398 non-null  object        
 8   cloneStrain                     242405 non-null  object        
 9   source_mat_id                   17724 non-null   object        
 10  ncbi_version_status             261398 non-null  object 

In [30]:
df.head(100)

Unnamed: 0,# assembly_accession,bioproject,biosample,wgs_master,refseq_category,taxid,species_taxid,organism_name,infraspecific_name,isolate,...,genome_rep,seq_rel_date,asm_name,submitter,gbrs_paired_asm,paired_asm_comp,ftp_path,excluded_from_refseq,relation_to_type_material,asm_not_live_date
0,GCF_000001215.4,PRJNA164,SAMN02803731,,reference genome,7227,7227,Drosophila melanogaster,,,...,Full,2014-08-01,Release 6 plus ISO1 MT,The FlyBase Consortium/Berkeley Drosophila Gen...,GCA_000001215.4,identical,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,,na
1,GCF_000001405.40,PRJNA168,,,reference genome,9606,9606,Homo sapiens,,,...,Full,2022-02-03,GRCh38.p14,Genome Reference Consortium,GCA_000001405.29,different,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,,na
2,GCF_000001635.27,PRJNA169,,,reference genome,10090,10090,Mus musculus,,,...,Full,2020-06-24,GRCm39,Genome Reference Consortium,GCA_000001635.9,identical,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,,na
3,GCF_000001735.4,PRJNA116,SAMN03081427,,reference genome,3702,3702,Arabidopsis thaliana,ecotype=Columbia,,...,Full,2018-03-15,TAIR10.1,The Arabidopsis Information Resource (TAIR),GCA_000001735.2,identical,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,,na
4,GCF_000001905.1,PRJNA70973,SAMN02953622,AAGU00000000.3,representative genome,9785,9785,Loxodonta africana,,ISIS603380,...,Full,2009-07-15,Loxafr3.0,Broad Institute,GCA_000001905.1,different,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,,na
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,GCF_000007065.1,PRJNA224116,SAMN02603290,,representative genome,192952,2209,Methanosarcina mazei Go1,strain=Go1,,...,Full,2002-05-20,ASM706v1,Gottingen Genomics Laboratory,GCA_000007065.1,identical,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,,na
96,GCF_000007085.1,PRJNA224116,SAMN02603009,,representative genome,273068,911092,Caldanaerobacter subterraneus subsp. tengconge...,strain=MB4,,...,Full,2002-05-14,ASM708v1,Beijing Center.HGP,GCA_000007085.1,identical,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,assembly from type material,na
97,GCF_000007105.1,PRJNA224116,SAMN02603579,,representative genome,264203,542,Zymomonas mobilis subsp. mobilis ZM4 = ATCC 31821,strain=ZM4,,...,Full,2010-01-12,ASM710v1,Macrogen Inc.,GCA_000007105.1,identical,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,,na
98,GCF_000007125.1,PRJNA224116,SAMN02603416,,representative genome,224914,29459,Brucella melitensis bv. 1 str. 16M,strain=16M,,...,Full,2001-12-28,ASM712v1,Integrated Genomics,GCA_000007125.1,identical,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...,,assembly from type material,na


## Field mappings

The cells below attempt to map NCBI fields to known [DwC terms](https://dwc.tdwg.org/list/), including extensions.

The NCBI fields are described at https://ftp.ncbi.nih.gov/genomes/README_assembly_summary.txt

| # | field name | field description |
| ---- | ---- | ---- |
| 1 | "assembly_accession" | Assembly accession: the assembly accession.version reported in this field is a unique identifier for the set of sequences in this particular version of the genome assembly. |
| 2 | "bioproject" | BioProject: accession for the BioProject which produced the sequences in the genome assembly. A BioProject is a collection of biological data related to a single initiative, originating from a single organization or from a consortium. A BioProject record provides users a single place to find links to the diverse data types generated for that project. The record can be retrieved from the NCBI BioProject resource: https://www.ncbi.nlm.nih.gov/bioproject/ |
| 3 | "biosample" | BioSample: accession for the BioSample from which the sequences in the genome assembly were obtained. A BioSample record contains a description of the biological source material used in experimental assays. The record can be retrieved from the NCBI BioSample resource: https://www.ncbi.nlm.nih.gov/biosample/ |
| 4 | "wgs_master" | WGS-master: the GenBank Nucleotide accession and version for the master record of the Whole Genome Shotgun (WGS) project for the genome assembly. The master record can be retrieved from the NCBI Nucleotide resource: https://www.ncbi.nlm.nih.gov/nuccore. Genome assemblies that are complete genomes, and those that are clone-based, do not have WGS-master records, in which case this field will be empty. |
| 5 | "refseq_category" | RefSeq Category: whether the assembly is a reference or representative genome in the NCBI Reference Sequence (RefSeq) project classification. Values: **reference genome** - a manually selected high quality genome assembly that NCBI and the community have identified as being important as a standard against which other data are compared; **representative genome** - a genome computationally or manually selected as a representative from among the best genomes available for a species or clade that does not have a designated reference genome; **na** - no RefSeq category assigned to this assembly. Prokaryotes may have more than one reference or representative genome per species. For more information see: https://www.ncbi.nlm.nih.gov/refseq/about/prokaryotes/#referencegenome. Eukaryotes have no more than one reference or representative genome per species. If there are no assemblies in RefSeq for a particular eukaryotic species, then the GenBank assembly that RefSeq would select as the best available for that species will be designated as the representative genome. Viruses may have one or more reference genomes per species. The representative genome designation is not applied to viruses. |
| 6 | "taxid" | Taxonomy ID: the NCBI taxonomy identifier for the organism from which the genome assembly was derived. The NCBI Taxonomy Database is a curated classification and nomenclature for all of the organisms in the public sequence databases. The taxonomy record can be retrieved from the NCBI Taxonomy resource: https://www.ncbi.nlm.nih.gov/taxonomy/ |
| 7 | "species_taxid" | Species taxonomy ID: the NCBI taxonomy identifier for the species from which the genome assembly was derived. The species taxid will differ from the organism taxid (column 6) only when the organism was reported at a subspecies or strain level. |
| 8 | "organism_name" | Organism name: the scientific name of the organism from which the sequences in the genome assembly were derived. This name is taken from the NCBI Taxonomy record for the taxid specified in column 6. Some older taxids were assigned at the strain level and for these the organism name will include the strain. Current practice is only to assign taxids at the species level; for these the organism name will be just the species, however, the strain name will be reported in the infraspecific_name field (column 9). |
| 9 | "infraspecific_name" | Infraspecific name: the strain, breed, cultivar or ecotype of the organism from which the sequences in the genome assembly were derived. Data are reported in the form tag=value, e.g. strain=AF16. Strain, breed, cultivar and ecotype are not expected to be used together, however, if they are then they will be reported in a list separated by ", /". Empty if no strain, breed, cultivar or ecotype is specified on the genomic sequence records. |
| 10 | "isolate" | Isolate: the individual isolate from which the sequences in the genome assembly were derived. Empty if no isolate is specified on the genomic sequence records. |
| 11 | "version_status" | Version status: the release status for the genome assembly version. Values: **latest** - the most recent of all the versions for this assembly chain; **replaced** - this version has been replaced by a newer version of the assembly in the same chain; **suppressed** - this version of the assembly has been suppressed. An assembly chain is the collection of all versions for the same assembly accession. |
| 12 | "assembly_level" | Assembly level: the highest level of assembly for any object in the genome assembly. Values: **Complete genome** - all chromosomes are gapless and have no runs of 10 or more ambiguous bases (Ns), there are no unplaced or unlocalized scaffolds, and all the expected chromosomes are present (i.e. the assembly is not noted as having partial genome representation). Plasmids and organelles may or may not be included in the assembly but if present then the sequences are gapless. **Chromosome** - there is sequence for one or more chromosomes. This could be a completely sequenced chromosome without gaps or a chromosome containing scaffolds or contigs with gaps between them. There may also be unplaced or unlocalized scaffolds. **Scaffold** - some sequence contigs have been connected across gaps to create scaffolds, but the scaffolds are all unplaced or unlocalized. **Contig** - nothing is assembled beyond the level of sequence contigs. |
| 13 | "release_type" | Release type: whether this version of the genome assembly is a major, minor or patch release. Values: **Major** - changes from the previous assembly version result in a significant change to the coordinate system. The first version of an assembly is always a major release. Most subsequent genome assembly updates are also major releases. **Minor** - changes from the previous assembly version are limited to the following changes, none of which result in a significant change to the coordinate system of the primary assembly-unit: adding, removing or changing a non-nuclear assembly-unit; dropping unplaced or unlocalized scaffolds; adding up to 50 unplaced or unlocalized scaffolds which are shorter than the current scaffold-N50 value; replacing a component with a gap of the same length. **Patch** - the only change is the addition or modification of a patch assembly-unit. See the NCBI Assembly model web page (https://www.ncbi.nlm.nih.gov/assembly/model/#asmb_def) for definitions of assembly-units and genome patches. |
| 14 | "genome_rep" | Genome representation: whether the goal for the assembly was to represent the whole genome or only part of it. Values: **Full** - the data used to generate the assembly was obtained from the whole genome, as in Whole Genome Shotgun (WGS) assemblies for example. There may still be gaps in the assembly. **Partial** - the data used to generate the assembly came from only part of the genome. Most assemblies have full genome representation with a minority being partial genome representation. See the Assembly help web page (https://www.ncbi.nlm.nih.gov/assembly/help/) for reasons that the genome representation would be set to partial. |
| 15 | "seq_rel_date" | Sequence release date: the date the sequences in the genome assembly were released in the International Nucleotide Sequence Database Collaboration (INSDC) databases, i.e. DDBJ, ENA or GenBank. |
| 16 | "asm_name" | Assembly name: the submitter's name for the genome assembly, when one was provided, otherwise a default name, in the form ASM#####v#, is provided by NCBI. Assembly names are not unique. |
| 17 | "submitter" | Submitter: the submitting consortium or first position if a list of organizations. The full submitter information is available in the NCBI BioProject resource: www.ncbi.nlm.nih.gov/bioproject/ |
| 18 | "gbrs_paired_asm" | GenBank/RefSeq paired assembly: the accession.version of the GenBank assembly that is paired to the given RefSeq assembly, or vice-versa. "na" is reported if the assembly is unpaired. |
| 19 | "paired_asm_comp" | Paired assembly comparison: whether the paired GenBank & RefSeq assemblies are identical or different. Values: **identical** - GenBank and RefSeq assemblies are identical; **different** - GenBank and RefSeq assemblies are not identical; **na** - not applicable since the assembly is unpaired |
| 20 | "ftp_path" | FTP path: the path to the directory on the NCBI genomes FTP site from which data for this genome assembly can be downloaded. |
| 21 | "excluded_from_refseq" | Excluded from RefSeq: reasons the assembly was excluded from the NCBI Reference Sequence (RefSeq) project, including any assembly anomalies. See: https://www.ncbi.nlm.nih.gov/assembly/help/anomnotrefseq/ |
| 22 | "relation_to_type_material" | Relation to type material: contains a value if the sequences in the genome assembly were derived from type material. Values: **assembly from type material** - the sequences in the genome assembly were derived from type material; **assembly from synonym type material** - the sequences in the genome assembly were derived from synonym type material; **assembly from pathotype material** - the sequences in the genome assembly were derived from pathovar material; **assembly designated as neotype** - the sequences in the genome assembly were derived from neotype material; **assembly designated as reftype** - the sequences in the genome assembly were derived from reference material where type material never was available and is not likely to ever be available; **ICTV species exemplar** - the International Committee on Taxonomy of Viruses (ICTV) designated the genome assembly as the exemplar for the virus species; **ICTV additional isolate** - the International Committee on Taxonomy of Viruses (ICTV) designated the genome assembly an additional isolate for the virus species |
| 23 | "asm_not_live_date" | Assembly no longer live date: the date the assembly transitioned from version_status latest to either replaced or suppressed. When the assembly is in status latest, "na" is reported. |
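
The `infraspecific_name` field (column 9) is described above as `tag=value`, e.g. `strain=AF16`. A small sketch of splitting such a value into its tag and value; `split_infraspecific` is a hypothetical helper, not part of the original script, and it assumes a single `tag=value` pair (it does not attempt to handle the multi-tag list form mentioned in the description):

```python
from typing import Optional, Tuple


def split_infraspecific(value: Optional[str]) -> Optional[Tuple[str, str]]:
    """Split one 'tag=value' string into (tag, value).

    Returns None for missing values (None, NaN) or strings without '='.
    """
    if not isinstance(value, str) or "=" not in value:
        return None
    # Split on the first '=' only, so values that contain '=' stay intact.
    tag, _, rest = value.partition("=")
    return tag.strip(), rest.strip()


print(split_infraspecific("strain=AF16"))
print(split_infraspecific("ecotype=Columbia"))
print(split_infraspecific(None))
```

Separating the tag from the value could help decide whether a term such as `cloneStrain` is appropriate for a given row, since only some rows carry a `strain` tag.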

In [41]:
# 1:1 mapping to DwC and extension terms
new_names = {
  '# assembly_accession': 'geneticAccessionNumber',
  'bioproject': 'ncbi_bioproject',
  'biosample': 'ncbi_biosample',
  'wgs_master': 'ncbi_nuccore',
  'refseq_category': 'ncbi_refseq_category',
  'taxid': 'taxonID',
  'species_taxid': 'specific_host',
  'organism_name': 'scientificName',
  'infraspecific_name': 'cloneStrain',
  'isolate': 'source_mat_id',
  'version_status': 'ncbi_version_status',
  'assembly_level': 'ncbi_assembly_level',
  'release_type': 'ncbi_release_type',
  'genome_rep': 'ncbi_genome_rep',
  'seq_rel_date': 'eventDate',
  'asm_name': 'otherCatalogNumbers',
  'submitter': 'recordedBy',
  'gbrs_paired_asm': 'ncbi_gbrs_paired_asm',
  'paired_asm_comp': 'ncbi_paired_asm_comp',
  'ftp_path': 'geneticAccessionURI',
  'excluded_from_refseq': 'ncbi_excluded_from_refseq',
  'relation_to_type_material': 'ncbi_relation_to_type_material',
  'asm_not_live_date': 'ncbi_asm_not_live_date'
}

# keep copies of DwC matched columns (just in case)
df['ncbi_assembly_accession'] = df['# assembly_accession'] 
df['ncbi_taxid'] = df['taxid'] 
df['ncbi_species_taxid'] = df['species_taxid'] 
df['ncbi_organism_name'] = df['organism_name']
df['ncbi_infraspecific_name'] = df['infraspecific_name']
df['ncbi_isolate'] = df['isolate']
df['ncbi_seq_rel_date'] = df['seq_rel_date'] 
df['ncbi_asm_name'] = df['asm_name']
df['ncbi_submitter'] = df['submitter']
df['ncbi_ftp_path'] = df['ftp_path']

# do rename
df.rename(
  columns = new_names, 
  inplace = True
)
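
`DataFrame.rename` silently ignores mapping keys that do not match any column, so a typo in a source name goes unnoticed. A hedged sketch of a pre-rename sanity check; `check_mapping`, `demo` and `demo_map` are hypothetical names used only for illustration:

```python
import pandas as pd


def check_mapping(columns, mapping):
    """Return (missing_sources, duplicate_targets) for a rename mapping."""
    # Source names in the mapping that are not actual columns.
    missing = [src for src in mapping if src not in columns]
    # Target names that more than one source column would be renamed to.
    targets = list(mapping.values())
    duplicates = sorted({t for t in targets if targets.count(t) > 1})
    return missing, duplicates


demo = pd.DataFrame({"taxid": ["7227"], "organism_name": ["Drosophila melanogaster"]})
demo_map = {
    "taxid": "taxonID",
    "organism_name": "scientificName",
    "isolate": "source_mat_id",  # not present in demo, so it is reported as missing
}
missing, duplicates = check_mapping(demo.columns, demo_map)
print(missing, duplicates)
```

Running a check like this against `new_names` before the rename would flag any header that changed upstream.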


In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261398 entries, 0 to 261397
Data columns (total 33 columns):
 #   Column                          Non-Null Count   Dtype         
---  ------                          --------------   -----         
 0   geneticAccessionNumber          261398 non-null  object        
 1   ncbi_bioproject                 261260 non-null  object        
 2   ncbi_biosample                  249679 non-null  object        
 3   ncbi_nuccore                    218468 non-null  object        
 4   ncbi_refseq_category            261398 non-null  object        
 5   taxonID                         261398 non-null  object        
 6   specific_host                   261398 non-null  object        
 7   scientificName                  261398 non-null  object        
 8   cloneStrain                     242405 non-null  object        
 9   source_mat_id                   17724 non-null   object        
 10  ncbi_version_status             261398 non-null  object 

In [42]:
df.head(100)

Unnamed: 0,geneticAccessionNumber,ncbi_bioproject,ncbi_biosample,ncbi_nuccore,ncbi_refseq_category,taxonID,specific_host,scientificName,cloneStrain,source_mat_id,...,ncbi_assembly_accession,ncbi_taxid,ncbi_species_taxid,ncbi_organism_name,ncbi_infraspecific_name,ncbi_isolate,ncbi_seq_rel_date,ncbi_asm_name,ncbi_submitter,ncbi_ftp_path
0,GCF_000001215.4,PRJNA164,SAMN02803731,,reference genome,7227,7227,Drosophila melanogaster,,,...,GCF_000001215.4,7227,7227,Drosophila melanogaster,,,2014-08-01,Release 6 plus ISO1 MT,The FlyBase Consortium/Berkeley Drosophila Gen...,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
1,GCF_000001405.40,PRJNA168,,,reference genome,9606,9606,Homo sapiens,,,...,GCF_000001405.40,9606,9606,Homo sapiens,,,2022-02-03,GRCh38.p14,Genome Reference Consortium,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
2,GCF_000001635.27,PRJNA169,,,reference genome,10090,10090,Mus musculus,,,...,GCF_000001635.27,10090,10090,Mus musculus,,,2020-06-24,GRCm39,Genome Reference Consortium,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
3,GCF_000001735.4,PRJNA116,SAMN03081427,,reference genome,3702,3702,Arabidopsis thaliana,ecotype=Columbia,,...,GCF_000001735.4,3702,3702,Arabidopsis thaliana,ecotype=Columbia,,2018-03-15,TAIR10.1,The Arabidopsis Information Resource (TAIR),https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
4,GCF_000001905.1,PRJNA70973,SAMN02953622,AAGU00000000.3,representative genome,9785,9785,Loxodonta africana,,ISIS603380,...,GCF_000001905.1,9785,9785,Loxodonta africana,,ISIS603380,2009-07-15,Loxafr3.0,Broad Institute,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,GCF_000007065.1,PRJNA224116,SAMN02603290,,representative genome,192952,2209,Methanosarcina mazei Go1,strain=Go1,,...,GCF_000007065.1,192952,2209,Methanosarcina mazei Go1,strain=Go1,,2002-05-20,ASM706v1,Gottingen Genomics Laboratory,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
96,GCF_000007085.1,PRJNA224116,SAMN02603009,,representative genome,273068,911092,Caldanaerobacter subterraneus subsp. tengconge...,strain=MB4,,...,GCF_000007085.1,273068,911092,Caldanaerobacter subterraneus subsp. tengconge...,strain=MB4,,2002-05-14,ASM708v1,Beijing Center.HGP,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
97,GCF_000007105.1,PRJNA224116,SAMN02603579,,representative genome,264203,542,Zymomonas mobilis subsp. mobilis ZM4 = ATCC 31821,strain=ZM4,,...,GCF_000007105.1,264203,542,Zymomonas mobilis subsp. mobilis ZM4 = ATCC 31821,strain=ZM4,,2010-01-12,ASM710v1,Macrogen Inc.,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
98,GCF_000007125.1,PRJNA224116,SAMN02603416,,representative genome,224914,29459,Brucella melitensis bv. 1 str. 16M,strain=16M,,...,GCF_000007125.1,224914,29459,Brucella melitensis bv. 1 str. 16M,strain=16M,,2001-12-28,ASM712v1,Integrated Genomics,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...


In [45]:
df.to_csv('../data/assembly_summary_refseq.csv', index=False)

In [34]:
df = pd.read_csv("../data/assembly_summary_refseq.csv")
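
Reading the CSV back without a `dtype` lets pandas infer types again, so identifier columns such as `taxonID` may come back as numbers, and columns that mix numeric-looking and free-text values (for example isolates like `164625` and `Faeces`) can end up with mixed types. A minimal sketch of the difference on a small in-memory frame; the sample rows and variable names are illustrative assumptions:

```python
import io
import pandas as pd

original = pd.DataFrame({
    "geneticAccessionNumber": ["GCF_000001215.4", "GCF_936394065.1"],
    "taxonID": ["7227", "1280"],
    "source_mat_id": [None, "164625"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)

buffer.seek(0)
inferred = pd.read_csv(buffer)            # types inferred from the content
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype=str)  # every column kept as text

print(inferred["taxonID"].dtype)
print(as_text["taxonID"].tolist())
```

Passing `dtype=str` (plus `parse_dates` for the date column, if needed) is one way to keep the round trip consistent with how the data was first loaded.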



In [46]:
df

Unnamed: 0,geneticAccessionNumber,ncbi_bioproject,ncbi_biosample,ncbi_nuccore,ncbi_refseq_category,taxonID,specific_host,scientificName,cloneStrain,source_mat_id,...,ncbi_assembly_accession,ncbi_taxid,ncbi_species_taxid,ncbi_organism_name,ncbi_infraspecific_name,ncbi_isolate,ncbi_seq_rel_date,ncbi_asm_name,ncbi_submitter,ncbi_ftp_path
0,GCF_000001215.4,PRJNA164,SAMN02803731,,reference genome,7227,7227,Drosophila melanogaster,,,...,GCF_000001215.4,7227,7227,Drosophila melanogaster,,,2014-08-01,Release 6 plus ISO1 MT,The FlyBase Consortium/Berkeley Drosophila Gen...,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
1,GCF_000001405.40,PRJNA168,,,reference genome,9606,9606,Homo sapiens,,,...,GCF_000001405.40,9606,9606,Homo sapiens,,,2022-02-03,GRCh38.p14,Genome Reference Consortium,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
2,GCF_000001635.27,PRJNA169,,,reference genome,10090,10090,Mus musculus,,,...,GCF_000001635.27,10090,10090,Mus musculus,,,2020-06-24,GRCm39,Genome Reference Consortium,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
3,GCF_000001735.4,PRJNA116,SAMN03081427,,reference genome,3702,3702,Arabidopsis thaliana,ecotype=Columbia,,...,GCF_000001735.4,3702,3702,Arabidopsis thaliana,ecotype=Columbia,,2018-03-15,TAIR10.1,The Arabidopsis Information Resource (TAIR),https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
4,GCF_000001905.1,PRJNA70973,SAMN02953622,AAGU00000000.3,representative genome,9785,9785,Loxodonta africana,,ISIS603380,...,GCF_000001905.1,9785,9785,Loxodonta africana,,ISIS603380,2009-07-15,Loxafr3.0,Broad Institute,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/0...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
261393,GCF_936394065.1,PRJNA224116,SAMEA13317838,CAKYAI000000000.1,na,1280,1280,Staphylococcus aureus,,164625,...,GCF_936394065.1,1280,1280,Staphylococcus aureus,,164625,2022-04-16,Assembly164625,istituto zooprofilattico sperimentale della lo...,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/9...
261394,GCF_937468385.1,PRJNA224116,SAMEA10017782,CALBWS000000000.1,na,2880965,2880965,Neobacillus sp. JJ-3,,CIP111895,...,GCF_937468385.1,2880965,2880965,Neobacillus sp. JJ-3,,CIP111895,2022-05-02,JJ-3,Institut Pasteur,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/9...
261395,GCF_938192085.1,PRJNA224116,SAMEA14037638,CAKZRX000000000.1,na,562,562,Escherichia coli,,Fecal samples,...,GCF_938192085.1,562,562,Escherichia coli,,Fecal samples,2022-05-02,LBC9d_2,umr1137,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/9...
261396,GCF_939535315.1,PRJNA224116,SAMEA8919069,CALNEA000000000.1,na,562,562,Escherichia coli,strain=ROAR-440 / Onovel10:O92 / fimH1284 / 11...,Faeces,...,GCF_939535315.1,562,562,Escherichia coli,strain=ROAR-440 / Onovel10:O92 / fimH1284 / 11...,Faeces,2022-05-03,ROAR-440,UMR 1137 IAME,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/9...
