The first steps are procedural: we search our query proteins against each of our protein databases.

See Preparing_query_proteins.sh and Preparing_databases.sh for more information on these files.

Then we parse the search results to identify genus-specific ORFan candidates to be carried forward in our next step of analysis.

We're selecting genus-specific ORFans, but the subsequent steps can be applied to other sets of interest as well, e.g., species-specific ORFans, phylogroup-specific ORFans, etc.

We then subject these ORFan candidates to a series of other blast searches to detect previously undetected coding or non-coding homologs.

## **Protein blasts and ORFan candidates**

The following section has two chunks:

1. Conduct six blast searches and parse the results to identify genus/species-specific ORFans.
2. Make a presence table that indicates which taxa (genus, species, lineage) each ORFan is present in.

### **1. Protein blasts and ORFan candidate selection**

In [None]:
#Six blasts in total:
#Escherichia-excluded, annotated proteins
#Escherichia-excluded, translated ORFs
#Ecoli-excluded, annotated proteins
#Ecoli-excluded, translated ORFs
#Pangenome, annotated proteins
#Pangenome, translated ORFs

#Re: options:
# -k 0 returns all target hits
# -b8 -c1 (block size and number of index chunks) is based on the suggestion printed on the terminal once the code was run without those options

#Escherichia-excluded, annotated proteins
/stor/work/Ochman/hassan/E.coli_ORFan/E.coli_ORFan_pipeline_8-10/diamond blastp -q Ecoli.all.filtered.prot.clusters.longestrepresentative.faa -d /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Escherichia_db/Escherichia_excluded.proteins.dmnd --outfmt 6 qseqid sseqid pident nident qcovhsp length mismatch gapopen gaps qstart qend sstart send qlen slen evalue bitscore --ultra-sensitive --out all_proteins_vs_GBRS_annotated.tsv -k 0 -b8 -c1
#Escherichia-excluded, translated ORFs
/stor/work/Ochman/hassan/E.coli_ORFan/E.coli_ORFan_pipeline_8-10/diamond blastp -q Ecoli.all.filtered.prot.clusters.longestrepresentative.faa -d /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Escherichia_db/Escherichia_excluded.genomes.orfipy.prot.dmnd --outfmt 6 qseqid sseqid pident nident qcovhsp length mismatch gapopen gaps qstart qend sstart send qlen slen evalue bitscore --ultra-sensitive --out all_proteins_vs_GBRS_ORFs.tsv -k 0 -b8 -c1
#Ecoli-excluded, annotated proteins
/stor/work/Ochman/hassan/E.coli_ORFan/E.coli_ORFan_pipeline_8-10/diamond blastp -q Ecoli.all.filtered.prot.clusters.longestrepresentative.faa -d /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Escherichia_db/Ecoli_excluded.proteins.dmnd --outfmt 6 qseqid sseqid pident nident qcovhsp length mismatch gapopen gaps qstart qend sstart send qlen slen evalue bitscore --ultra-sensitive --out all_proteins_vs_noncoliEscherichia_annotated.tsv -k 0 -b8 -c1
#Ecoli-excluded, translated ORFs
/stor/work/Ochman/hassan/E.coli_ORFan/E.coli_ORFan_pipeline_8-10/diamond blastp -q Ecoli.all.filtered.prot.clusters.longestrepresentative.faa -d /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Escherichia_db/Ecoli_excluded.genomes.orfipy.prot.dmnd --outfmt 6 qseqid sseqid pident nident qcovhsp length mismatch gapopen gaps qstart qend sstart send qlen slen evalue bitscore --ultra-sensitive --out all_proteins_vs_noncoliEscherichia_ORFs.tsv -k 0 -b8 -c1
#Pangenome, annotated proteins
/stor/work/Ochman/hassan/E.coli_ORFan/E.coli_ORFan_pipeline_8-10/diamond blastp -q Ecoli.all.filtered.prot.clusters.longestrepresentative.faa -d Ecoli.all.prot.pangenomedb.dmnd --outfmt 6 qseqid sseqid pident nident qcovhsp length mismatch gapopen gaps qstart qend sstart send qlen slen evalue bitscore --ultra-sensitive --out all_proteins_vs_pangenome_annotated.tsv -k 0 -b8 -c1
#Pangenome, translated ORFs
/stor/work/Ochman/hassan/E.coli_ORFan/E.coli_ORFan_pipeline_8-10/diamond blastp -q Ecoli.all.filtered.prot.clusters.longestrepresentative.faa -d ../Ecoli.all.orfipy.dmnd --outfmt 6 qseqid sseqid pident nident qcovhsp length mismatch gapopen gaps qstart qend sstart send qlen slen evalue bitscore --ultra-sensitive --out all_proteins_vs_pangenome_ORFs.tsv -k 0 -b8 -c1

#The non-Escherichia databases have some Escherichia contaminants from earlier, which are stored in this file: all_Escherichia_slippedby_IDs.txt
#See Preparing_databases.sh about how we made it
#Also, the queries are the mmseqs2 cluster representatives, but we later clustered further with silix
#Meaning, in the output file containing ORFans, some of them need to be replaced with the representative sequences of the silix clusters

#ORFans:
cat all_proteins_vs_GBRS_annotated.tsv all_proteins_vs_GBRS_ORFs.tsv |
grep -v -F -f all_Escherichia_slippedby_IDs.txt - | #remove contaminants
awk -F '\t' '($5>60&&$16<0.001)' | cut -f1 | sort -u > step1_genusspecific_nonORFan.txt

grep -vf step1_genusspecific_nonORFan.txt Ecoli.all.filtered.prot.clusters.longestrepresentative.faa | grep "^>" | tr -d ">" > step1_genusspecific_ORFan.txt
#Replace with the mmseqs2-silix representative sequences
awk 'FNR==NR {rep[$2]=$1; next} {for(i=1;i<=NF;i++) if ($i in rep) $i=rep[$i]; print}' mmseqs_silix_reclustering/replacements.tsv step1_genusspecific_ORFan.txt | sort -u > step1_genusspecific_ORFan.replaced.txt
#Pull CDS and protein sequences
faSomeRecords Ecoli.all.filtered.prot.clusters.longestrepresentative.faa step1_genusspecific_ORFan.replaced.txt step1_genusspecific_ORFans.faa
faSomeRecords Ecoli.all.filtered.prot.clusters.longestrepresentative.CDS.faa step1_genusspecific_ORFan.replaced.txt step1_genusspecific_ORFans.CDS.faa
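
The selection logic above can be tried on toy data before running it on the full DIAMOND output. The sketch below assumes the same 17-column layout (column 5 is `qcovhsp`, column 16 is `evalue`); the query IDs (`geneA`, `geneB`, `geneC`), the hit rows, and the `toy_*` file names are all hypothetical.

```shell
# Toy stand-in for a DIAMOND table: 17 tab-separated columns,
# qcovhsp in column 5 and evalue in column 16 (all values are made up).
workdir=$(mktemp -d)
cd "$workdir"
printf 'geneA\ttarget1\t90\t90\t95\t100\t10\t0\t0\t1\t100\t1\t100\t100\t100\t1e-50\t200\n' > toy_hits.tsv
printf 'geneB\ttarget2\t40\t20\t30\t50\t30\t0\t0\t1\t50\t1\t50\t150\t100\t1e-05\t40\n' >> toy_hits.tsv
printf 'geneC\ttarget3\t35\t35\t80\t100\t65\t0\t0\t1\t100\t1\t100\t120\t100\t0.5\t25\n' >> toy_hits.tsv
printf 'geneA\ngeneB\ngeneC\n' > toy_queries.txt

# Queries with at least one hit passing both cutoffs are non-ORFans.
awk -F '\t' '($5>60 && $16<0.001)' toy_hits.tsv | cut -f1 | sort -u > toy_nonORFan.txt

# Everything else is an ORFan candidate; -x forces whole-line ID matches.
grep -v -x -F -f toy_nonORFan.txt toy_queries.txt > toy_ORFan.txt
cat toy_ORFan.txt
```

Here `geneA` is the only query with a hit that passes both cutoffs, so `geneB` (low coverage) and `geneC` (weak e-value) remain as candidates. In the pipeline itself the IDs are matched against FASTA headers rather than a plain ID list, so whole-line matching does not carry over directly; whether a stricter match is needed there depends on how the IDs are formatted.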

### **2. Making the protein blast-based presence table**

In [None]:
#This code chunk generates a two-column table
#First column is gene (ORFan) name, second column is all the outgroups it's present in
#We don't expect ORFans to have outgroups in non-Escherichia genomes, but the ORFan file can be replaced with any set of genes of interest

#Note that the interim file names are numbered inconsistently, but they have been preserved to keep the rest of the code consistent

#First, list all protein or contig IDs the genes are present in:

#Annotated, protein IDs:
cat all_proteins_vs_GBRS_annotated.tsv | awk -F '\t' '($5>60&&$16<0.001)' | cut -f1,2 | sort -u > presenceabsence.interim.0
cat all_proteins_vs_noncoliEscherichia_annotated.tsv | awk -F '\t' '($5>60&&$16<0.001)' | cut -f1,2 | sort -u > presenceabsence.interim.1
cat all_proteins_vs_pangenome_annotated.tsv | awk -F '\t' '($5>60&&$16<0.001)' | cut -f1,2 | sort -u > presenceabsence.interim.2
#ORFs, contig IDs (each ORF ID also mentions the contig it is harbored in):
cat all_proteins_vs_GBRS_ORFs.tsv | awk -F '\t' '($5>60&&$16<0.001)' | cut -f1,2 | rev | cut -f2- -d "_" | rev | sort -u > presenceabsence.interim.0.ORF
cat all_proteins_vs_noncoliEscherichia_ORFs.tsv | awk -F '\t' '($5>60&&$16<0.001)' | cut -f1,2 | rev | cut -f2- -d "_" | rev | sort -u > presenceabsence.interim.3
cat all_proteins_vs_pangenome_ORFs.tsv | awk -F '\t' '($5>60&&$16<0.001)' | cut -f1,2 | rev | cut -f2- -d "_" | rev | sort -u > presenceabsence.interim.4

#Only extract those rows with my genes of interest as query
#proteins
grep -w -F -f step1_genusspecific_ORFan.replaced.txt presenceabsence.interim.0 > all_genes_of_interest.extragenus.protein.presenceabsence.tsv
grep -w -F -f step1_genusspecific_ORFan.replaced.txt presenceabsence.interim.1 | rev | cut -f2- -d "(" | rev > all_genes_of_interest.intragenus.protein.presenceabsence.tsv
sed "s/$/\t/g" step1_genusspecific_ORFan.replaced.txt | grep -F -f - presenceabsence.interim.2 > all_genes_of_interest.pangenome.protein.presenceabsence.tsv
#contigs
grep -w -F -f step1_genusspecific_ORFan.replaced.txt presenceabsence.interim.0.ORF > all_genes_of_interest.extragenus.contig.presenceabsence.tsv
grep -w -F -f step1_genusspecific_ORFan.replaced.txt presenceabsence.interim.3 > all_genes_of_interest.intragenus.contig.presenceabsence.tsv
sed "s/$/\t/g" step1_genusspecific_ORFan.replaced.txt | grep -F -f - presenceabsence.interim.4 > all_genes_of_interest.pangenome.contig.presenceabsence.tsv

#Now join these with pre-generated taxonomy files to identify which taxa each gene is present in
#See Preparing_databases.sh to see how the taxonomy files were generated
sort -k2 all_genes_of_interest.extragenus.protein.presenceabsence.tsv | join -1 2 -2 2 - /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_extragenus_protein_contig_taxa.reduced.noescherichia.tsv | awk '{print $2,$NF}' | sort -u > all_genes_of_interest.presence.tsv
sort -k2 all_genes_of_interest.intragenus.protein.presenceabsence.tsv | join -1 2 -2 2 - /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_intragenus_genome_protein_taxa.tsv | awk '{print $2,$NF}' | sort -u >> all_genes_of_interest.presence.tsv
sort -k2 all_genes_of_interest.pangenome.protein.presenceabsence.tsv | join -1 2 -2 2 - /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_pangenome_genome_protein_taxa.tsv | awk '{print $2,$NF}' | sort -u >> all_genes_of_interest.presence.tsv
sort -k2 all_genes_of_interest.extragenus.contig.presenceabsence.tsv | join -1 2 -2 2 - /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_extragenus_genome_contig_taxa.reduced.noescherichia.tsv | awk '{print $2,$NF}' | sort -u >> all_genes_of_interest.presence.tsv
sort -k2 all_genes_of_interest.intragenus.contig.presenceabsence.tsv | join -1 2 -2 2 - /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_intragenus_genome_contig_taxa.tsv | awk '{print $2,$NF}' | sort -u >> all_genes_of_interest.presence.tsv
sort -k2 all_genes_of_interest.pangenome.contig.presenceabsence.tsv | join -1 2 -2 2 - /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_pangenome_genome_contig_taxa.tsv | awk '{print $2,$NF}' | sort -u >> all_genes_of_interest.presence.tsv
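
The `sort | join | awk` pattern used above can be checked on a small example. This is a minimal sketch that assumes tab-separated inputs with the join key (protein or contig ID) in column 2 of both files; the IDs, taxa, and `toy_*` file names are hypothetical. One point worth noting is that `join` expects both inputs to be sorted on the join field under the same collation, and `-k2,2` restricts the sort key to that field alone, whereas `-k2` uses everything from field 2 to the end of the line.

```shell
workdir=$(mktemp -d)
cd "$workdir"
tab=$(printf '\t')
# Hypothetical hit table: query gene, target protein ID.
printf 'geneB\tprot2\ngeneA\tprot1\ngeneA\tprot3\n' > toy_presence.tsv
# Hypothetical taxonomy table: genome, protein ID, taxon (join key in column 2).
printf 'genome2\tprot2\tSalmonella\ngenome1\tprot1\tShigella\ngenome3\tprot3\tShigella\n' > toy_taxa.tsv

# Sort both inputs on column 2 only, with a fixed collation.
LC_ALL=C sort -t "$tab" -k2,2 toy_presence.tsv > toy_presence.sorted.tsv
LC_ALL=C sort -t "$tab" -k2,2 toy_taxa.tsv > toy_taxa.sorted.tsv

# join prints the key first, then the other fields of file 1, then those of file 2,
# so $1 = protein ID, $2 = gene, $NF = taxon.
LC_ALL=C join -t "$tab" -1 2 -2 2 toy_presence.sorted.tsv toy_taxa.sorted.tsv |
awk -F '\t' '{print $2, $NF}' | sort -u > toy_gene_taxa.txt
cat toy_gene_taxa.txt
```

The result has one line per gene and taxon pair, which mirrors the two-column layout of `all_genes_of_interest.presence.tsv`.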

For a subsequent step in the pipeline, I also need to generate tables for matches with a query cover below 60%.

In [None]:

#Partial/noncoding:

#Annotated, protein IDs:
cat all_proteins_vs_GBRS_annotated.tsv | awk -F '\t' '($5<60&&$16<0.001)' | cut -f1,2 | sort -u > presenceabsence.interim.0.partial
cat all_proteins_vs_noncoliEscherichia_annotated.tsv | awk -F '\t' '($5<60&&$16<0.001)' | cut -f1,2 | sort -u > presenceabsence.interim.1.partial
cat all_proteins_vs_pangenome_annotated.tsv | awk -F '\t' '($5<60&&$16<0.001)' | cut -f1,2 | sort -u > presenceabsence.interim.2.partial
#ORFs, contig IDs (each ORF ID also mentions the contig it is harbored in):
cat all_proteins_vs_GBRS_ORFs.tsv | awk -F '\t' '($5<60&&$16<0.001)' | cut -f1,2 | rev | cut -f2- -d "_" | rev | sort -u > presenceabsence.interim.0.ORF.partial
cat all_proteins_vs_noncoliEscherichia_ORFs.tsv | awk -F '\t' '($5<60&&$16<0.001)' | cut -f1,2 | rev | cut -f2- -d "_" | rev | sort -u > presenceabsence.interim.3.partial
cat all_proteins_vs_pangenome_ORFs.tsv | awk -F '\t' '($5<60&&$16<0.001)' | cut -f1,2 | rev | cut -f2- -d "_" | rev | sort -u > presenceabsence.interim.4.partial

#Only extract those rows with my genes of interest as query
#proteins
grep -w -F -f step2_genusspecific_ORFan.replaced.txt presenceabsence.interim.0.partial > all_genes_of_interest.extragenus.protein.presenceabsence.partial.tsv
grep -w -F -f step2_genusspecific_ORFan.replaced.txt presenceabsence.interim.1.partial | rev | cut -f2- -d "(" | rev > all_genes_of_interest.intragenus.protein.presenceabsence.partial.tsv
sed "s/$/(/g" step2_genusspecific_ORFan.replaced.txt | grep -F -f - presenceabsence.interim.2.partial > all_genes_of_interest.pangenome.protein.presenceabsence.partial.tsv
#contigs
grep -w -F -f step2_genusspecific_ORFan.replaced.txt presenceabsence.interim.0.ORF.partial > all_genes_of_interest.extragenus.contig.presenceabsence.partial.tsv
grep -w -F -f step2_genusspecific_ORFan.replaced.txt presenceabsence.interim.3.partial > all_genes_of_interest.intragenus.contig.presenceabsence.partial.tsv
sed "s/$/(/g" step2_genusspecific_ORFan.replaced.txt | grep -F -f - presenceabsence.interim.4.partial > all_genes_of_interest.pangenome.contig.presenceabsence.partial.tsv

#Now join these with pre-generated taxonomy files to identify which taxa each gene is present in
#See Preparing_databases.sh to see how the taxonomy files were generated
sort -k2 all_genes_of_interest.extragenus.protein.presenceabsence.partial.tsv | join -1 2 -2 2 - /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_extragenus_protein_contig_taxa.reduced.noescherichia.tsv | awk '{print $2,$NF}' | sort -u > all_genes_of_interest.presencepartial.tsv
sort -k2 all_genes_of_interest.intragenus.protein.presenceabsence.partial.tsv | join -1 2 -2 2 - /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_intragenus_genome_protein_taxa.tsv | awk '{print $2,$NF}' | sort -u >> all_genes_of_interest.presencepartial.tsv
sort -k2 all_genes_of_interest.extragenus.contig.presenceabsence.partial.tsv | join -1 2 -2 2 - /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_extragenus_genome_contig_taxa.reduced.noescherichia.tsv | awk '{print $2,$NF}' | sort -u >> all_genes_of_interest.presencepartial.tsv
sort -k2 all_genes_of_interest.intragenus.contig.presenceabsence.partial.tsv | join -1 2 -2 2 - /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_intragenus_genome_contig_taxa.tsv | awk '{print $2,$NF}' | sort -u >> all_genes_of_interest.presencepartial.tsv
sort -k2 all_genes_of_interest.pangenome.protein.presenceabsence.partial.tsv | join -1 2 -2 2 - /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_pangenome_genome_protein_taxa.tsv | awk '{print $2,$NF,$3}' | sort -u >> all_genes_of_interest.presencepartial.tsv
sort -k2 all_genes_of_interest.pangenome.contig.presenceabsence.partial.tsv | join -1 2 -2 2 - /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_pangenome_genome_contig_taxa.tsv | awk '{print $2,$NF,$3}' | sort -u >> all_genes_of_interest.presencepartial.tsv
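
The full and partial tables are built with strict inequalities (`$5>60` and `$5<60`), so a hit with a query coverage of exactly 60 would land in neither table. Whether that matters depends on the data. The sketch below shows one way to split significant hits in a single pass so that every hit goes to exactly one file; the reduced four-column table, its values, and the `toy_*` file names are hypothetical, and sending the boundary case to the partial set is an arbitrary choice made for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Reduced toy table: query, target, qcovhsp, evalue (made-up values).
printf 'geneA\tt1\t95\t1e-30\ngeneA\tt2\t60\t1e-10\ngeneB\tt3\t20\t1e-08\ngeneC\tt4\t99\t0.5\n' > toy_hits.tsv

# One pass over the table: each significant hit goes to exactly one file.
# A coverage of exactly 60 is sent to the partial set here (an assumption).
awk -F '\t' -v OFS='\t' '
  $4 < 0.001 {
    if ($3 > 60) { print $1, $2 > "toy_full.tsv" }
    else         { print $1, $2 > "toy_partial.tsv" }
  }' toy_hits.tsv

cat toy_full.tsv toy_partial.tsv
```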

## **Protein-against-genome blastn**

The next steps curate the ORFan candidates with more sensitive searches, with a view to both eliminating false positives and detecting non-coding homologs.

The first step is a blastn search with the ORFan CDS sequences as queries and genomes as targets.

### **1. Run blasts and massage files for running mview**

In [None]:
#Regular blastn, without regard to flanks:

#Re: options:
#The num_alignments and num_descriptions reflect the number of target sequences (contigs) in each database
#We set -outfmt to 0, as we'll be processing the results with mview later
#evalue is kept at an arbitrarily high number: this search is not restricted by an evalue cutoff
#Escherichia-excluded:
blastn -query step1_genusspecific_ORFans.CDS.faa -db /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Escherichia_db/Escherichia_excluded.genomes -outfmt 0 -num_threads 72 -num_descriptions 109734 -num_alignments 109734 -evalue 200000 -out Ecoli_extragenus_regular_blastn
#Ecoli-excluded:
blastn -query step1_genusspecific_ORFans.CDS.faa -db /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Escherichia_db/Ecoli_excluded.genomes -outfmt 0 -num_threads 72 -num_descriptions 195119 -num_alignments 195119 -evalue 200000 -out Ecoli_intragenus_regular_blastn
#Pangenome:
blastn -query step1_genusspecific_ORFans.CDS.faa -db ../Ecoli.all.genomes -outfmt 0 -num_threads 72 -num_descriptions 86906 -num_alignments 86906 -evalue 200000 -out Ecoli_pangenome_regular_blastn

mkdir regular
mv *_regular_blastn regular
cd regular

#We split each of the blast files to generate query-specific files
#Each resulting file is named after the query sequence

for i in extragenus intragenus pangenome
do
awk -v prefix="$i"_ '/^Query=/ {close(file); match($0, / (.*)\(/, arr); file = prefix arr[1] "_blastn"} {if (file) print > file}' Ecoli_"$i"_regular_blastn
done

#Delete the query-specific files that contain "No hits found", i.e. no blast alignment

grep -l "No hits found" *_blastn | grep -v "regular" | xargs -r rm --

#For subsequent processing with mview, we then prepend the blast report header and append the report footer to each file

for i in extragenus intragenus pangenome
do
header=$(head -14 Ecoli_"$i"_regular_blastn) #assign header to variable
footer=$(tail -11 Ecoli_"$i"_regular_blastn) #assign footer to variable
for file in "$i"_*blastn; do
    [ -e "$file" ] || continue #skip if no query-specific files exist for this database
    # Prepend the header and append the footer in one pass
    { echo "$header"; cat "$file"; echo "$footer"; } > temp_file && mv temp_file "$file"
done
done
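
The per-query split above relies on the three-argument form of `match()`, which is a gawk extension. As a rough sketch of the same idea in POSIX awk, the example below splits a made-up report on its `Query=` lines and then drops the files that report no hits. The report text, the `toy_` prefix, and the gene names are hypothetical; the sketch also cuts the query name at the first `(`, whereas the regular expression above is greedy and extends to the last `(` on the line.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Made-up stand-in for a multi-query BLAST text report.
cat > toy_blastn <<'EOF'
BLASTN header line
Query= geneA(+)
 alignment text for geneA
Query= geneB(-)
***** No hits found *****
EOF

# Start a new output file at every "Query=" line; lines before the first
# query (the report header) are skipped because no file name is set yet.
awk -v prefix="toy_" '
  /^Query=/ {
    if (file) close(file)
    name = $0
    sub(/^Query= */, "", name)   # drop the "Query= " prefix
    sub(/\(.*$/, "", name)       # drop everything from the first "(" onward
    file = prefix name "_blastn"
  }
  file { print > file }' toy_blastn

# Drop per-query files that report no alignment.
for f in toy_gene*_blastn; do
  if grep -q "No hits found" "$f"; then rm -- "$f"; fi
done
ls toy_gene*_blastn
```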

### **2. Only retain those sequences with >50% query cover**

In [None]:
#Now we convert the blast output to a parseable mview file
export PERL5LIB=/stor/scratch/Ochman/hassan/genomics_toolbox/mview-1.67/lib/ #necessary to run this first for mview to run

#Since mview takes a while, we echo each command into running.sh so that the commands can be run in parallel
for i in $(ls *_blastn | grep -v "regular" | rev | cut -f2- -d "_" | rev)
do
echo "/stor/scratch/Ochman/hassan/genomics_toolbox/mview-1.67/bin/mview -in blast ${i}_blastn > ${i}_blastn_mviewed"
done > running.sh

/stor/work/Ochman/hassan/tools/parallelize_run.sh #helper script used for parallelization

#We now parse the mviewed file according to some criteria
#We get rid of any case where the total combined length of all alignments doesn't cover 50% of query length
#And also targets whose alignments don't cover the query by at least 50%
#Since there's no e-value cutoff, it's necessary to utilize a query coverage cutoff to get rid of spurious hits
#This might miss out on very small regions of homologous non-coding sequences
#But that's fine, some of this information is recovered at the next stage of flank/synteny-based analysis

for i in $(ls *mviewed | grep -v "regular" | rev | cut -f3- -d "_" | rev)
do
querylength=$(echo $i | cut -f2- -d "_" | sed "s/$/(/g" | grep -F -A1 -f - ../step1_genusspecific_ORFans.CDS.faa | grep -v "^>" | awk '{print length($0)}') #length of query
ratio=$(tail -n+8 "$i"_blastn_mviewed | head -1 | awk '{print $(NF-1)}' | sed "s/:/\t/g" | awk -v var=$querylength -F '\t' '{print ($2-$1+1)/var}') #ratio of the total combined length of any and all alignments and the query length
if (( $(echo "$ratio > 0.5" | bc -l) )) #Only proceed if the total length of the query covered by any and all alignments is over 50% of the query length
then
tail -n+9 "$i"_blastn_mviewed | head -n-3 | awk '{print $2,$(NF-4),$NF}' | sed "s/%//g" | awk '($2>50)' | awk '{print $1,$3}' | sed "s/^/>/g" | sed "s/ /\n/g" > "$i"_blastn_seq.faa #Massaging
fi
done

#We then put the different massaged blast sequences in the same file
#sort -u ensures each query is processed only once, even if it has hits in more than one database
for i in $(ls *_blastn_seq.faa | cut -f2- -d "_" | rev | cut -f3- -d "_" | rev | sort -u)
do
cat extragenus_"$i"_blastn_seq.faa intragenus_"$i"_blastn_seq.faa pangenome_"$i"_blastn_seq.faa > "$i"_blastn.seq.faa
done
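
The coverage test in this section compares the span of an aligned interval with the query length. The sketch below isolates that arithmetic and does the threshold comparison in awk instead of `bc`. The interval string and query length are hypothetical inputs; the `start:end` form is inferred from the `sed "s/:/\t/g"` step above rather than from the mview documentation.

```shell
# Hypothetical inputs: an aligned query interval in "start:end" form
# and the query length in nucleotides.
interval="11:250"
querylength=300

# Fraction of the query spanned by the interval (both ends inclusive).
ratio=$(awk -v iv="$interval" -v qlen="$querylength" 'BEGIN {
  split(iv, pos, ":")
  printf "%.3f\n", (pos[2] - pos[1] + 1) / qlen
}')
echo "$ratio"

# awk can also make the threshold decision through its exit status,
# which avoids a separate call to bc.
if awk -v r="$ratio" 'BEGIN { exit !(r > 0.5) }'; then
  verdict="keep"
else
  verdict="discard"
fi
echo "$verdict"
```

With these inputs the interval spans 240 of 300 positions, so the query passes the 50% cutoff.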

### **3. Remove all hits against a taxon that was already detected in previous blastp step**

In [None]:
#In the next step, we remove all cases where there was a hit against a protein

#Attach taxonomy info to each file
#Make an interim file that contains the taxonomy information for all target contigs
#Otherwise searching the massive database would take a long time
cat *_blastn.seq.faa | grep "^>" | tr -d ">" | sort -u > alltaxa.compiled_intervalinfo.txt #intervalinfo is a vestigial name but we roll with it
cat /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_extragenus_genome_contig_taxa.reduced.noescherichia.tsv /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_intragenus_genome_contig_taxa.tsv /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_pangenome_genome_contig_taxa.tsv | grep -w -F -f alltaxa.compiled_intervalinfo.txt - | cut -f2- > alltaxa.compiled_intervalinfo.interim
sort -k1 alltaxa.compiled_intervalinfo.interim -o alltaxa.compiled_intervalinfo.interim

#Attach taxonomy information from the interim file to a linearized blast alignment file
#The commands are echoed so that they can be parallelized; note that the three commands for each query must run in order
for i in $(ls *_blastn.seq.faa | rev | cut -f2- -d "_" | rev)
do
echo "seqkit fx2tab ${i}_blastn.seq.faa | sed \"s/\t$//g\" | sort -k1 > ${i}_blastn.seq.tsv" #linearize blast sequence file
echo "cut -f1 ${i}_blastn.seq.tsv | sort -u > ${i}.temp" #get target contig names
echo "grep -w -F -f ${i}.temp alltaxa.compiled_intervalinfo.interim | sort -k1 | join -1 1 -2 1 - ${i}_blastn.seq.tsv > ${i}_blastn.seq.interim.tsv" #add taxonomy info to each target sequence
done

#If a target had a protein-based hit, remove it from the list of targets
for i in $(ls *_blastn.seq.interim.tsv | rev | cut -f2- -d "_" | rev)
do
echo "$i" | sed "s/$/(/g" | grep -F -f - ../all_genes_of_interest.presence.tsv | cut -f2 -d " " | grep -v -w -F -f - "$i"_blastn.seq.interim.tsv |
awk '{print ">regular_"$2"_"$1,$3}' | sed "s/ /\n/g" > "$i"_blastn.seq.regular.faa #mark the sequences "regular" to indicate how these hits were found
done

find . -maxdepth 1 -name '*_blastn.seq.regular.faa' -empty -delete #If any homology files are empty after the removal step, get rid of them
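
The removal step can be illustrated on a toy case. The sketch below assumes a space-delimited presence table whose first column is the gene name followed by `(` and whose second column is the taxon, which is what the `sed "s/$/(/g"` and `cut -f2 -d " "` calls above imply; the gene, contig, and taxon names and the `toy_*` file names are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"
# Hypothetical presence table from the protein step: "gene(strand) taxon".
printf 'geneA(+) Shigella_flexneri\ngeneB(+) Salmonella_enterica\n' > toy_presence.tsv
# Hypothetical blastn targets for geneA: contig, taxon, aligned sequence.
printf 'contig1 Shigella_flexneri ACGT\ncontig2 Escherichia_albertii ACGA\n' > toy_geneA_targets.tsv

gene="geneA"
# Taxa in which geneA already has a protein-level hit.
printf '%s(\n' "$gene" | grep -F -f - toy_presence.tsv | cut -f2 -d " " > toy_seen_taxa.txt

# Keep only targets from taxa without a protein-level hit, then emit FASTA
# records labelled "regular_<taxon>_<contig>".
grep -v -w -F -f toy_seen_taxa.txt toy_geneA_targets.tsv |
awk '{print ">regular_"$2"_"$1; print $3}' > toy_geneA.regular.faa
cat toy_geneA.regular.faa
```

Only the `contig2` target survives, because `geneA` already has a protein-level hit in the taxon that `contig1` belongs to.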