Skip to content
Tahiraj edited this page May 28, 2024 · 23 revisions

Welcome to the Wiki page of Phoenix-Novel-Species. Please find the several code snippets used for characterising the 53 novel strains isolated from the NASA cleanroom Payload Hazardous Servicing Facility (PHSF) located in Kennedy Space Center (KSC), Florida, during the Phoenix mission spacecraft assembly.

1. Genome assembly

We conducted quality checks of the raw reads (single-end) using FastQC v.0.12.0 (https://github.com/s-andrews/FastQC)

fastqc --threads 16 $r1 -o qscores

Further, Canu v.2.2 (https://github.com/marbl/canu), Flye v.2.9.1 (https://github.com/fenderglass/Flye), and Unicycler v.0.5.0 (https://github.com/rrwick/Unicycler) were utilised on the filtered reads for de novo assembly of the genomes

# Canu
canu -d ${assemblyDir} -p ${prefix} genomeSize=5m -nanopore ${lRead} useGrid=true minReadLength=5000 gridOptions="-t 72:00:00"

# Flye
flye --nano-corr ${lRead} -g 5m --asm-coverage 50 -o ${assemblyDir} --scaffold -t 8 -i 2

# Unicycler
unicycler -t 8  -l ${lR} -o ${assemblyDir}

Genome dereplication was performed using dRep v. 3.4.5 (https://github.com/MrOlm/drep)

dRep dereplicate ${drepDir} -g ${genomeDir}/*.fasta -p 16

Completion and contamination of the assembled genomes were identified using CheckM v.1.2.2 (https://github.com/Ecogenomics/CheckM)

checkm lineage_wf ${genomeDir}/ checkm/ -x .fasta -t 32

Further, the presence of plasmids was identified using MOB-suite v.3.1.8 (https://github.com/phac-nml/mob-suite)

mob_typer --infile ${genomeDir}/${sample}.fasta --out_file Plasmid/${sample}_mobtyper.txt

The final genomes were considered for the calculation of related indices with other type strains.

2. Overall Genome Relatedness Indices (OGRI) calculation

We collected all validly published representative genomes of the identified genera from NCBI using the command line tool bit (https://github.com/AstrobioMike/bit) and NCBI Entrez-direct

# installation
conda install -c bioconda entrez-direct 
conda install -c conda-forge -c bioconda -c defaults -c astrobiomike bit 

# esearch to get the list of representative species accession IDs
esearch -query '<Agrococcus>[ORGN] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter]' -db assembly | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession > <Agrococcus>_repr.txt

# bit to fetch the genomes
bit-dl-ncbi-assemblies -w <Agrococcus>_repr.txt -f fasta -j 100

# unzipping
gunzip *.gz

We calculated the Average Nucleotide Identity of the novel species with the other type strains using FastANI v.1.34 (https://github.com/ParBLiSS/FastANI)

# get all the paths of the genomes
find ${genome_dir} -iname *.fasta > Novel_Genomes_Paths.txt

# perform all vs all ANI with the genomes
fastANI --rl Novel_Genomes_Paths.txt --ql Novel_Genomes_Paths.txt --threads 50 --matrix -o Novel_Genomes_ANI.tsv

Further, Average Amino acid Identity (AAI) values were computed using aai.rb function from the Enveomics collection toolbox (https://github.com/lmrodriguezr/enveomics)

aai.rb --seq1 <genome> --seq2 <ref_genome> --threads 50 --res AAI_output.txt

We annotated 53 novel strains using the widely-used command-line annotation tool Prokka (https://github.com/tseemann/prokka)

# prokka command
# From the directory of the .fasta files
for file in *.fasta; do prokka --outdir ../Prokka_Annotation/"$file" --prefix "$file" --cpus 40 "$file"; done

From the annotated genomes, 16S rRNA and DNA gyrase subunit-B (gyrB) genes were extracted for the novel species and the closest type strain, and sequence identity was estimated using Blast v.2.13.0

# 16S rRNA sequence extraction
pattern="16S ribosomal RNA"

IFS=$'\n'
for i in $(find $INPUT_FOLDER -iname "*.ffn")
do
	filename=$(basename $i | sed -e 's/ /_/g' | cut -d . -f 1-2)
	echo "Getting 16S sequence for: $i"
	# Use awk to find and print the longest sequence between headers containing the pattern
	awk -v pattern="$pattern" '
	 BEGIN { RS=">"; FS="\n" } 
	 tolower($1) ~ tolower(pattern) {
	   sequence = $0
	   gsub(/\n/, "", sequence)  # Remove newline characters
	   length_seq = length(sequence)
	   
	   # Update max_length and max_sequence if the current sequence is longer
	   if (length_seq > max_length) {
	     max_length = length_seq
	     max_sequence = ">" $0
	   }
	 }
	 END { printf "%s", max_sequence }' "$i" > "$OUTPUT_FOLDER/${filename}_16S.fasta"
done
# gyrB sequence extraction
pattern_gyrb="DNA gyrase subunit B"

echo "Getting gyrB sequences"

IFS=$'\n'
for i in $(find $INPUT_FOLDER -iname "*.ffn")
do
	filename=$(basename $i | sed -e 's/ /_/g' | cut -d . -f 1-2)
	echo "Getting 16S sequence for: $i"
	# Use awk to find and print the longest sequence between headers containing the pattern
	awk -v pattern="$pattern_gyrb" '
	 BEGIN { RS=">"; FS="\n" } 
	 tolower($1) ~ tolower(pattern) {
	   sequence = $0
	   gsub(/\n/, "", sequence)  # Remove newline characters
	   length_seq = length(sequence)
	   
	   # Update max_length and max_sequence if the current sequence is longer
	   if (length_seq > max_length) {
	     max_length = length_seq
	     max_sequence = ">" $0
	   }
	 }
	 END { printf "%s", max_sequence }' "$i" > "$OUTPUT_FOLDER/${filename}_gyrB.fasta"
done
# BLAST
blastn -db 16S_ribosomal_RNA -query $i -task blastn -dust no -max_target_seqs 1 -outfmt "10 delim=, qacc sacc evalue bitscore qcovus pident staxid ssciname" -num_threads 40

3. WGS-based phylogeny

We utilised GToTree v.1.8.2 (https://github.com/AstrobioMike/GToTree) for preparing the phylogenetic trees using single-copy core genes, followed by IQTREE-2 v.2.2.0.3 (https://github.com/iqtree/iqtree2) to perform 1000 ultrafast bootstrap replicates

# GToTree
GToTree -a ${genus}_non_novel_accession.txt -f ${genus}_paths.txt -H $scg_set -t -L Species,Strain -T IQ-TREE -j 50 -m GToTree_Mapping_Files/${genus}_GToTree_Mapping.tsv -o $genus

# IQTREE-2
iqtree2 -s $(genus}/Aligned_SCGs_mod_names.faa -spp ${genus}/run_files/Partitions.txt -m txt -m MFP -bb 1000 -nt 40 -pre ${genus}_IQ_Tree

4. Genome characterisation

We identified the Cluster of Orthologous Genes (COGs) using the Prokka annotated genomes. We used python-package cogclassifier https://pypi.org/project/cogclassifier/)

# cogclassfier command
# Once Prokka is over, utilise .faa files for COG identification
for file in ./*/*.faa; do cp "$file" ./../COGs/; done

# From the directory of the .faa files
for files in *.faa; do COGclassifier -i "$files" -o ./COG_Output/"$files"; done

We then predicted, annotated and analysed the secondary metabolite biosynthesis gene clusters in these novel strains using command-line antiSMASH 7.0 (https://antismash.secondarymetabolites.org/#!/about)

# antismash command
# From the directory of the .fasta files
for file in *.fasta; do antismash "$file" --taxon bacteria --cpus 20 --genefinding-tool prodigal --cb-knownclusters --output-dir ./../antiSMASH/"$file" --cc-mibig --fullhmmer; done

Further, we employed the command-line tool Resistance Gene Identifier (RGI) (https://github.com/arpcard/rgi) to predict antibiotic resistome(s) in these novel genomes. We searched these antibiotic resistance genes (ARGs) against the Comprehensive Antibiotic Resistance Database (CARD)(https://card.mcmaster.ca/)

# rgi command
# From the directory of the .fasta files
for file in *.fasta; do rgi main -i "$file" -o ./RGI_CARD/"$file" -t contig -a BLAST -n 50; done

5. Mapping metagenomics samples to novel isolates

We used fastp v.0.22.0 (https://github.com/OpenGene/fastp) to filter the raw shotgun sequences obtained from NASA cleanrooms

fastp.0.23.4 --in1 ${metagenomeFolder}/$read1 --out1 ${read1}_output.fastq --in2 ${metagenomeFolder}/{read2} --out2 ${read2}_output.fastq -w 16 --json ${metagenomeFolder}_fastp.json --html ${metagenomeFolder}_fastp.html

Then we utilised MetaCompass v.2.0 (https://github.com/marbl/MetaCompass) to align the filtered reads to newly identified genomes and determine their abundance in the NASA cleanrooms based on mapped reads

python3 ../../MetaCompass/go_metacompass.py -r $genome -1 "$read1" -2 "$read2" -l 150 -t 70 -y 50 -o Metagenomic_Mapping/${name}_${i}