NCBI-Codeathons/Host_Phage_Interactions

Host Phage Interactions

Linking Phages to Microbial Hosts

Hosts (black circles) Linked to Viruses (red circles)

Viruses As Indices For NCBI SRA Metagenomes

Viruses that infect microbes (both Bacteria and Archaea) have a direct impact on the organisms present in all microbiomes, regardless of environment. Like the microbiome, the viruses present in a sample can be diagnostic of its conditions (e.g., microbial community structure, underlying physicochemical conditions, etc.), but unlike their microbial hosts, viruses have smaller genomes that may assemble more easily from a metagenomic sample. Indexing metagenomic samples on the viral community will provide a way to categorize metagenomic samples in NCBI SRA and to identify related sample types. The term phage specifically refers to viruses that infect bacteria.

Detecting Host and Phage Signatures

The Host Phage Interaction (HPI) group developed three tracks and databases to explore the nature of host-phage interactions, specifically:

  • Known Host-Virus Interactions Database, detailing the current known linkages of infection, collected from NCBI
  • CRISPR Spacer Database, constructed by extracting spacer information from CRISPR arrays representing a large breadth of microorganisms
  • E. coli Prophage Virulence Factor Database, a collection of clinical and environmental E. coli screened for prophage sequences and corresponding bacterial virulence factors

Known Host-Virus Interactions Database

Existing databases providing information on hosts (including Bacteria, Archaea, and Eukarya) and the identity of confirmed viral agents were gathered from PhagesDB and NCBI VirusHostDB/NCBI Virus-Host Database (CITE), then combined and standardized.

The comprehensive database table consists of 44,975 virus-host pairs, covering 29,847 unique viruses and 7,974 unique hosts.

The database can be accessed in KnownInteractionsDB.csv
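The summary counts above can be recomputed from a virus-host table with a few lines of awk. The following is a minimal sketch only: it assumes a comma-separated file with a header row, the virus in column 1 and the host in column 2, and no quoted commas. The actual layout of KnownInteractionsDB.csv may differ, and the rows below are hypothetical toy data.

```shell
# Sketch: count pairs, unique viruses, and unique hosts in a virus-host table.
# ASSUMPTION: header row, virus in column 1, host in column 2, no quoted commas.
set -eu

# Toy stand-in for the real table (hypothetical rows).
pairs_csv="$(mktemp)"
cat > "$pairs_csv" <<'EOF'
virus,host
phage_A,Escherichia coli
phage_A,Shigella flexneri
phage_B,Escherichia coli
phage_C,Mycobacterium smegmatis
EOF

summary="$(awk -F',' '
  NR > 1 {                                  # skip the header row
    pairs++
    if (!($1 in v)) { v[$1] = 1; nv++ }     # first time this virus is seen
    if (!($2 in h)) { h[$2] = 1; nh++ }     # first time this host is seen
  }
  END { printf "pairs=%d viruses=%d hosts=%d", pairs, nv, nh }
' "$pairs_csv")"

echo "$summary"
rm -f "$pairs_csv"
```

If the real file contains quoted fields with embedded commas, a CSV-aware parser would be a safer choice than splitting on commas.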

CRISPR Spacer Database

CRISPR-Cas systems are a form of adaptive immunity found in prokaryotes, wherein viral DNA or RNA sequences are stored in the host genome as short (~30 bp) "spacers". These spacers, stored in a repetitive CRISPR "array", can be used to reliably associate viruses with their hosts. This approach, though precise, will inevitably miss host-virus pairs where the host lacks a CRISPR array (a majority of bacteria, especially certain groups) or where the diversity of host CRISPR spacers has been inadequately profiled. Nevertheless, this method can give highly confident host-virus pairs when a perfect match is found.
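To illustrate what a "perfect match" means here, the sketch below checks whether a spacer occurs exactly in a viral sequence on either strand. It is a toy illustration under stated assumptions, not the search used in this project (which relies on BLASTn): the function names revcomp and has_perfect_match are hypothetical, the sequences are made up, and only unambiguous A/C/G/T bases are handled.

```shell
# Sketch: exact spacer lookup on both strands of a viral sequence.
# ASSUMPTION: sequences are plain A/C/G/T strings; all names are hypothetical.
set -eu

# Reverse-complement a DNA string.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Succeed if spacer $1 occurs in sequence $2 on the forward or reverse strand.
has_perfect_match() {
  case "$2" in
    *"$1"*|*"$(revcomp "$1")"*) return 0 ;;
    *) return 1 ;;
  esac
}

spacer="ATGCGTACGTTAGCCTAGGATCCGATTACA"                   # 30 bp toy spacer
virus_fwd="GGGG${spacer}CCCC"                             # protospacer on the forward strand
virus_rev="TTTT$(revcomp "$spacer")AAAA"                  # protospacer on the reverse strand
virus_mismatch="GGGGATGCGTACGTTAGCCTAGGATCCGATTAGACCCC"   # one substitution

if has_perfect_match "$spacer" "$virus_fwd"; then echo "virus_fwd: perfect match"; fi
if has_perfect_match "$spacer" "$virus_rev"; then echo "virus_rev: perfect match"; fi
if ! has_perfect_match "$spacer" "$virus_mismatch"; then echo "virus_mismatch: no perfect match"; fi
```

Exact string matching like this is only practical for small examples; an alignment tool is needed at database scale and to report near-perfect matches.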

CRISPR spacers were compiled from four distinct sources:

All genomes/MAGs were provided a standardized taxonomy based on the GTDB taxonomy.

In total this resulted in 1M unique spacer sequences (2.6M non-redundant spacer sequences) linked to a formalized source taxonomy.

E. coli Prophage Virulence Factor Database

Members of the "E. coli and Shigella" tax group were selected from the NCBI Pathogen Database via the Isolates Browser. A subset of 20,000 genomes was identified using the filters "Host:Human" or "Host:Homo sapiens", "Isolation type:clinical" or "Isolation type:environmental/other", and "Scientific Name:Escherichia coli". These genomes were separated into two groups of 10,000, "environmental" and "clinical", and sorted by GenBank Assembly ID; 3,500 genomes were then downloaded for each group. Genomes were downloaded using the R script ecoli_download_links.R, which requires pathogens_combined_table.csv and assembly_summary_genbank.txt. The latter can be obtained directly from NCBI (~100MB):

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt
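The per-group selection step can be sketched with sort and awk. This is an illustration only, not the logic of ecoli_download_links.R: it assumes a two-column comma-separated table (assembly ID, isolation type), assumes that "downloaded" means taking the first entries in sorted order, and uses a group size of 2 on hypothetical rows in place of 3,500.

```shell
# Sketch: sort isolates by assembly ID within each group, keep the first N.
# ASSUMPTION: columns are assembly_id,isolation_type; IDs and rows are made up.
set -eu
LC_ALL=C
export LC_ALL

per_group=2   # the text above uses 3,500 per group

isolates_csv="$(mktemp)"
cat > "$isolates_csv" <<'EOF'
GCA_000003,clinical
GCA_000001,clinical
GCA_000002,environmental/other
GCA_000005,clinical
GCA_000004,environmental/other
GCA_000006,environmental/other
EOF

# Sort by group (column 2), then by assembly ID (column 1);
# awk keeps a row while its group has been seen fewer than n times.
selected="$(sort -t, -k2,2 -k1,1 "$isolates_csv" \
  | awk -F, -v n="$per_group" 'count[$2]++ < n')"

printf '%s\n' "$selected"
rm -f "$isolates_csv"
```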

VirSorter v1.0.5 was applied to all E. coli genomes with the parameter --db 2 to detect prophages integrated into the genome. Detected prophages were annotated using Prokka v1.14.0 with an additional database created from the PATRIC database v3.5.43 to target bacterial virulence factors, including ARDB, CARD, NDARO, PATRIC_VF, TCDB, VFDB, and Victors.

Procedures and Results

Target Discovery in NCBI Assembled Metagenomes

The CRISPR Spacer Database was used to search 2,953 assembled metagenomes subsampled from the NCBI SRA using BLASTn. Matches to spacers were extracted as assemblies of interest.

A combined BLAST database was constructed from the assembled metagenomes:

makeblastdb -in mg_assemblies.fa -dbtype nucl -title mg_assemblies -out mg_assemblies

The BLASTn search was parallelized using GNU parallel (parallel --block 100k --recstart '>' --pipe). The BLAST-specific parameters (-task blastn-short -evalue 0.01 -outfmt 6 -gapopen 10 -gapextend 2 -penalty "-1" -word_size 7 -dust no) were chosen to account for the short length of the spacer sequences.

cat spacer_db.fasta \
| parallel --block 100k --recstart '>' --pipe \
blastn \
  -task blastn-short \
  -evalue 0.01 \
  -outfmt 6 \
  -gapopen 10 \
  -gapextend 2 \
  -penalty "-1" \
  -word_size 7 \
  -dust no \
  -db mg_assemblies/blastdb/mg_assemblies -query -
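Because confident host-virus links depend on perfect matches, the tabular output of the search above may need to be filtered. The sketch below keeps hits with 100% identity, no mismatches or gaps, and an alignment spanning the full spacer. It relies on the default -outfmt 6 column order (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the file names, IDs, and hit rows are hypothetical, and this filter is an assumption rather than a documented step of the project.

```shell
# Sketch: keep only full-length, perfect spacer hits from BLAST tabular output.
# ASSUMPTION: default -outfmt 6 columns; toy spacers and hits are made up.
set -eu
workdir="$(mktemp -d)"

cat > "$workdir/spacers.fa" <<'EOF'
>spacer_1
ATGCGTACGTTAGCCTAGGATCCGATTACA
>spacer_2
GGCATTACGGATCCAGTTCAGGCATTAACG
EOF

{
  printf 'spacer_1\tcontig_7\t100.000\t30\t0\t0\t1\t30\t501\t530\t2e-10\t60.2\n'
  printf 'spacer_1\tcontig_9\t96.667\t30\t1\t0\t1\t30\t88\t117\t1e-07\t52.0\n'
  printf 'spacer_2\tcontig_7\t100.000\t22\t0\t0\t5\t26\t900\t921\t4e-05\t44.1\n'
} > "$workdir/hits.tsv"

perfect_hits="$(awk -F'\t' '
  NR == FNR {                               # first file: spacer FASTA
    if ($0 ~ /^>/) { id = substr($0, 2); sub(/[ \t].*/, "", id) }
    else           { qlen[id] += length($0) }
    next
  }
  # second file: 100% identity, no mismatches, no gaps, full spacer length
  $3 == 100 && $5 == 0 && $6 == 0 && $4 == qlen[$1] { print $1 "\t" $2 }
' "$workdir/spacers.fa" "$workdir/hits.tsv")"

printf '%s\n' "$perfect_hits"
rm -rf "$workdir"
```

An alternative that avoids re-reading the FASTA is to request the query length directly from BLAST with -outfmt "6 std qlen" and compare the alignment length against that extra column.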

CRISPR Spacer Database Matched to Known Viruses

The CRISPR Spacer Database was searched against two known virus databases: NCBI Viral RefSeq representatives and the NCBI Virus Variation Resource from GenBank. The BLASTn search used the same parameters as described above.

The CRISPR spacers matched X viruses in these representative databases. When cross-referenced against the Known Host-Virus Interactions Database, X% of matches agreed between the known virus-host pair and the spacer source.
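One way such a cross-reference could be computed is sketched below. It is an assumption-laden illustration, not the project's procedure: both inputs are hypothetical tab-separated tables (known pairs as virus and host; spacer-derived links as spacer source host and matched virus), host names are assumed to be already standardized to the same taxonomy, and links to viruses absent from the known table are skipped as not comparable.

```shell
# Sketch: fraction of spacer-derived host-virus links that agree with known pairs.
# ASSUMPTION: toy tab-separated tables; names and layouts are hypothetical.
set -eu
workdir="$(mktemp -d)"

# Known interactions: virus <TAB> host
{
  printf 'phage_A\tEscherichia coli\n'
  printf 'phage_B\tSalmonella enterica\n'
} > "$workdir/known.tsv"

# Spacer-derived links: spacer source host <TAB> matched virus
{
  printf 'Escherichia coli\tphage_A\n'
  printf 'Escherichia coli\tphage_B\n'
  printf 'Salmonella enterica\tphage_B\n'
  printf 'Vibrio cholerae\tphage_C\n'
} > "$workdir/links.tsv"

agreement="$(awk -F'\t' '
  NR == FNR { known[$1, $2] = 1; seen[$1] = 1; next }   # load known pairs
  !($2 in seen) { next }                                # virus has no known host: skip
  { total++; if (($2, $1) in known) agree++ }           # same virus-host pair on record?
  END {
    printf "agree=%d comparable=%d percent=%.1f", agree, total,
           (total ? 100 * agree / total : 0)
  }
' "$workdir/known.tsv" "$workdir/links.tsv")"

echo "$agreement"
rm -rf "$workdir"
```

Requiring exact host-name equality is strict; comparing at a higher taxonomic rank (for example genus) would be a reasonable variation.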

Clinical and Environmental E. coli Virulence Factor Occurrence

Expanding the CRISPR Spacer Database

Novel genomes can be searched to detect additional CRISPR spacer-host links using MinCED:

minced -spacers -gffFull NAME.fasta NAME.crisprs NAME.gff

More to come.

How to Relaunch From Progress Made in Virus Codeathon 2

Zenodo DOI