Skip to content

SCG sets

Mike Lee edited this page May 2, 2024 · 26 revisions

There are 15 single copy gene (SCG)-set HMMs included with GToTree spanning multiple taxonomic ranks. 14 of these are newly created as described below, and one set is included that covers all 3 domains (from Hug et al. 2016).

We can view which HMM files are available by running gtt-hmms by itself.

Here are the currently available HMM-sets:

HMM set Number of gene targets
Actinobacteria 138
Alphaproteobacteria 117
Archaea 76
Bacteria 74
Bacteria_and_Archaea 25
Bacteroidetes 90
Betaproteobacteria 203
Chlamydiae 286
Cyanobacteria 251
Epsilonproteobacteria 260
Firmicutes 119
Gammaproteobacteria 172
Proteobacteria 119
Tenericutes 99
Universal (Hug et al.) 16

More information on each is provided in the "hmm-sources-and-info.tsv" file found here, and in the GToTree installed location (can see by running gtt-hmms).

Here is the general gist of how the new sets were generated, with exact code to generate the bacterial set presented below.

  1. All 17,929 Pfams from release 32.0 were downloaded. Those that didn’t cover more than 50% (on average) of the underlying protein sequences that went into building that Pfam's HMM were filtered out (this ensures there would be no two Pfams from the same source protein included). This left 8,924 Pfams.

  2. All available “complete” bacterial (11,405 as accessed on 09-DEC-2018) and archaeal (309 as accessed on 15-DEC-2018) assembly genes were downloaded from NCBI.

  3. The 8,924 filtered-Pfams were searched against all coding sequences with hmmsearch (default settings except for specifying the --cut_ga flag to utilize the gathering thresholds stored in the curated Pfam models, and the number of hits to each Pfam were recorded for each genome.

  4. SCG-sets were generated for 1) bacteria, 2) archaea, 3) all phyla with at least 100 genomes, and 4) all proteobacterial classes with at least 100 genomes. For each major taxon, only Pfams with exactly 1 hit in at least 90% of all genomes within that taxon were selected to make up the SCG-set for that taxon.

Code

Here are presented all of the steps taken to filter the Pfams of release 32.0 and generate the bacterial gene-set.

Downloading and filtering Pfams

###### This log holds the command-line steps take to generate the Pfam list that was used to search all genomes. ######
## downloading Pfam release 32.0
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz

## Filtering Pfam-HMMs by coverage of underlying proteins that built the model
# some unique Pfam-HMMs hold distinct domains from the same proteins
# this is not ideal when the purpose is to identify single-copy genes
# for completeness/redundancy estimates or for phylogenomics
# To help alleviate that (or most likely eliminate it completely), i only
# considered Pfam-HMMs that covered on average more than 50% of the
# underlying proteins that built the model.
# Pfam provides this information in the table downloaded here:

wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/database_files/pfamA.txt.gz
gunzip pfamA.txt.gz

# cutting down to just pfam ID and coverage
cut -f 1,36 pfamA.txt > All_pfam_avg_covs.tsv

# filtering to keep only those with greater than 50% coverage
awk ' $2 > 50 { print $1 } ' All_pfam_avg_covs.tsv > coverage-filtered-pfam-IDs-no-version-info

## making subset hmm file
# hmmfetch won't work without the version numbers attached, which are not in the Pfam IDs from the "pfamA.txt" summary table

# so creating a list of all with the current version numbers attached:
grep "^ACC" Pfam-A.hmm | tr -s " " "\t" | cut -f2 > all-pfam-IDs

# pulling out targets:
for i in $(cat coverage-filtered-pfam-IDs-no-version-info); do grep -m1 "$i" all-pfam-IDs; done > coverage-filtered-pfam-IDs

# now generating subset:
hmmfetch -f Pfam-A.hmm coverage-filtered-pfam-IDs > Pfam-A_coverage_filtered.hmm

###### That subset, "Pfam-A_coverage_filtered.hmm", is what was used to search the bacterial and archaeal genomes. ######

Scanning all complete bacterial genomes from NCBI

The two helper scripts used below can be downloaded with the following:

curl -O https://raw.githubusercontent.com/AstrobioMike/AstrobioMike.github.io/master/misc/random_scripts/unzip_rename_cat_genes.sh

curl -O https://raw.githubusercontent.com/AstrobioMike/AstrobioMike.github.io/master/misc/random_scripts/pfam_counting.py

Note that one of those also needs a script that's available from my toolset here (bit-rename-fasta-headers). Sorry :( this is just here for reproducibility's sake, not for mass use (I don't have something up for that yet). The bit toolset can be installed with conda, or you can grab just the script from here and it should work fine if you have biopython installed.

## downloading bacterial genome info with E-Direct (this returned 11,405 with this search on 09-DEC-2018) ##
esearch -query '"bacteria"[filter] AND "Complete genome"[filter] AND latest[filter] NOT anomalous[filter] AND "has annotation"[Properties]' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession Taxid SpeciesTaxid Organism AssemblyStatus RefSeq_category FtpPath_GenBank FtpPath_RefSeq FtpPath_Assembly_rpt > Complete_genome_bacteria_ncbi_info.tsv

 # adding header
cat <(echo "AssemblyAccession Taxid SpeciesTaxid Organism AssemblyStatus RefSeq_category FtpPath_GenBank FtpPath_RefSeq FtpPath_Assembly_rpt" | tr " " "\t") Complete_genome_bacteria_ncbi_info.tsv > t && mv t Complete_genome_bacteria_ncbi_info.tsv

 # getting all genome accs
tail -n +2 Complete_genome_bacteria_ncbi_info.tsv | cut -f1 > all_genome_accs

### building up links to download genes in parallel, splitting which have genbank accs and which have refseq accs because that determines which ftp path to use ###
grep "^GCA_" Complete_genome_bacteria_ncbi_info.tsv > genbanks
grep "^GCF_" Complete_genome_bacteria_ncbi_info.tsv > refseqs

cut -f1,7 genbanks > genbank_accs_and_base_links # making 2-column table of accessions and base links
cut -f 2 genbank_accs_and_base_links > genbank_targets_p1 # getting base link alone
cut -f10 -d "/" genbank_targets_p1 > genbank_targets_p2 # to build the filename to the genes explicitly, need this component
paste -d "/" genbank_targets_p1 genbank_targets_p2 | sed 's/$/_protein.faa.gz/' > genbank_targets # finishing the full path to the genes download
cut -f1 genbank_accs_and_base_links | sed 's/$/_protein.faa.gz/' > genbank_names # creating file names to store the downloads


# doing the same as above but for the refseq accessions
cut -f1,8 refseqs > refseq_accs_and_base_links
cut -f 2 refseq_accs_and_base_links > refseq_targets_p1
cut -f10 -d "/" refseq_targets_p1 > refseq_targets_p2
paste -d "/" refseq_targets_p1 refseq_targets_p2 | sed 's/$/_protein.faa.gz/' > refseq_targets
cut -f1 refseq_accs_and_base_links | sed 's/$/_protein.faa.gz/' > refseq_names


# sticking both sets together
cat genbank_names refseq_names > all_names
cat genbank_targets refseq_targets > all_targets


### downloading in parallel ###
parallel --xapply -j 30 curl -L {1} -o {2} :::: all_targets :::: all_names

###  unzipping, renaming headers of each to have the assembly accession for easy parsing after hmm search, and catting all genes together into one file ###
bash unzip_rename_cat_genes.sh all_genome_accs

# total genes
grep -c ">" all_genes.faa
42829397

### splitting to be able to search in parallel ###
mkdir split_seqs
cd split_seqs
split -d -l 856588 ../all_genes.faa sub_
cd ../

### running hmmsearch with >= 50% coverage-filtered Pfams
time find split_seqs/ -name "sub*" | parallel -I% --max-args 1 -j 50 hmmsearch --cut_ga --cpu 2 --tblout %_pfam_hits.tab Pfam-A_coverage_filtered.hmm % > /dev/null


### combining all hmm-results files ###
cat split_seqs/*.tab > All_bacterial_pfam_hmm_results.tab

### counting how many hits to each pfam for each genome ###

# getting all coverage-filtered pfam IDs
grep "^ACC" Pfam-A_coverage_filtered.hmm | tr -s " " "\t" | cut -f2 > coverage_filtered_pfam_ids

python pfam_counting.py -p coverage_filtered_pfam_ids -g all_genome_accs -H All_bacterial_pfam_hmm_results.tab -o All_bacterial_pfam_hmm_counts.tsv
  ## this output table holds all genomes as rows, all include pfams as columns, and counts in the cells

###### "All_bacterial_pfam_hmm_counts.tsv" read into R for parsing  ######
### in R ###
tab <- read.table("All_bacterial_pfam_hmm_counts.tsv", header=T, row.names=1, sep="\t")

head(tab[,1:10])
dim(tab)

all_genomes <- row.names(tab)

# getting how many genomes have exactly 1 hit for each pfam
num_genomes_with_1_hit_for_each_pfam <- apply(tab, 2, function(x) sum(x == 1))

# getting percentages of genomes with exactly 1 hit for each pfam
perc_genomes_with_1_hit_for_each_pfam <- round(num_genomes_with_1_hit_for_each_pfam / length(all_genomes) * 100, 2)

# getting pfam ids for only those where 90% or greater of genomes have exactly 1 hit
target_pfam_ids <- names(perc_genomes_with_1_hit_for_each_pfam[perc_genomes_with_1_hit_for_each_pfam >= 90])

# writing out
write(target_pfam_ids, "Bacterial_SCG_pfam_IDs.txt")
### out of R ###

### then the hmm file was filtered down to these target IDs
hmmfetch -f Pfam-A_coverage_filtered.hmm Bacterial_SCG_pfam_IDs.txt > Bacteria.hmm