# **Preparing databases**

The heart of this pipeline involves mapping the presence and absence of genes across a large number of genomes: other E. coli genomes (within-pangenome), non-E. coli genomes that are nonetheless very closely related (Escherichia genomes excluding E. coli), and bacterial genomes in general (genomes outside the Escherichia genus). This requires building databases from the genomes and proteins of each group, against which our genes of interest are subsequently searched.

### **Step 0. Downloading files**

First, download all genomes and proteins for complete bacterial genomes from GenBank. Outgroup databases could be larger, but for our purposes we're just going with all outgroup sequences in GenBank that are "complete".

The outgroup databases (i.e., those consisting of non-E. coli genomes and proteins) are constructed from AllTheBacteria (ATB) and GenBank. The ATB files were previously downloaded from the ATB FTP site, but they now appear to be hosted on a different platform.

The E. coli genomes were available in the form of .gff files, which contained their sequence as well.

In [None]:
cd /stor/scratch/Ochman/hassan/100724_Complete_Genomes

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt #Get GenBank assembly summary
mv assembly_summary.txt 102824_GenBank_bacteria_assembly_summary.txt #rename the file with the date of download

awk -F '\t' '($12=="Complete Genome")' 102824_GenBank_bacteria_assembly_summary.txt | cut -f20 |
awk -F '/' '{print $0"/"$NF"_genomic.fna.gz"}' > 102824_GenBank_bacteria_genome_URLs.txt #Extract URLs for all complete genomes
awk -F '\t' '($12=="Complete Genome")' 102824_GenBank_bacteria_assembly_summary.txt | cut -f20 |
awk -F '/' '{print $0"/"$NF"_protein.faa.gz"}' > 102824_GenBank_bacteria_protein_URLs.txt #...and all corresponding proteins

mkdir genomes && cp 102824_GenBank_bacteria_genome_URLs.txt genomes #prepare sub-directories to house these files
cd genomes && wget -i 102824_GenBank_bacteria_genome_URLs.txt && gunzip *gz #download, unzip
cd ..
mkdir proteins && cp 102824_GenBank_bacteria_protein_URLs.txt proteins
cd proteins && wget -i 102824_GenBank_bacteria_protein_URLs.txt && gunzip *gz #same with proteins
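
The URL-building step in the chunk above can be tried on a toy table before touching the real assembly summary. The following is a minimal sketch, assuming the same column layout (column 12 is the assembly level, column 20 is the FTP path). The `make_urls` helper, the toy accessions, and the `example.org` URLs are hypothetical, and skipping rows whose path is `na` is a defensive assumption rather than something the original chunk does.

```shell
set -eu

# make_urls SUMMARY SUFFIX
# Print one download URL per "Complete Genome" row of an assembly summary.
make_urls() {
    awk -F '\t' -v suffix="$2" '
        $12 == "Complete Genome" && $20 != "na" {
            n = split($20, parts, "/")      # the last path element is the file prefix
            print $20 "/" parts[n] suffix
        }' "$1"
}

# Toy summary with hypothetical accessions and URLs; unused columns hold "x".
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
awk 'function row(acc, level, path,    i, line) {
         line = acc
         for (i = 2; i <= 20; i++)
             line = line "\t" (i == 12 ? level : (i == 20 ? path : "x"))
         print line
     }
     BEGIN {
         row("GCA_000000001.1", "Complete Genome", "https://example.org/all/GCA_000000001.1_ASM1")
         row("GCA_000000002.1", "Scaffold", "https://example.org/all/GCA_000000002.1_ASM2")
     }' > "$demo/summary.tsv"

# One URL list for genomes and one for proteins, as in the chunk above.
make_urls "$demo/summary.tsv" "_genomic.fna.gz" > "$demo/genome_URLs.txt"
make_urls "$demo/summary.tsv" "_protein.faa.gz" > "$demo/protein_URLs.txt"
cat "$demo/genome_URLs.txt" "$demo/protein_URLs.txt"
```

Only the "Complete Genome" row survives, so each list holds a single URL here.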

## **Preparing outgroup databases**

To construct outgroup databases, GenBank taxonomies cannot be relied upon. Each record must be validated to make sure there are no *E. coli* or *Escherichia* contaminants.

(This need not be done for ATB, in which species taxonomy designations were independently validated. I did it for them anyway, though.)

In the next code chunk, we generate ANI values between one focal strain of E. coli (K-12 MG1655) and all other bacteria.

The focal strain was downloaded from here - https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000005845.2/

And placed here - /stor/work/Ochman/hassan/protogene_extension/Ecoli_list/sequence_RS.fasta

In [None]:
#For GenBank:

mkdir /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Escherichia_temp
cd /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Escherichia_temp

ls /stor/scratch/Ochman/hassan/100724_Complete_Genomes/genomes | grep "^GC" | #get all genome files
cut -f1,2 -d "_" | sed "s/^/fastANI -q \/stor\/work\/Ochman\/hassan\/protogene_extension\/Ecoli_list\/sequence_RS.fasta -r \/stor\/scratch\/Ochman\/hassan\/100724_Complete_Genomes\/genomes\//g" | sed "s/$/\*fna/g" | awk -F '/' '{print $0" -o "$NF}' | sed "s/$/t/g" | sed "s/\*fnat/_fastANI/g" | #write fastANI code
split -l 600 #split into 600-line chunks to run in parallel

ls x* | sed "s/^/bash /g" > running.sh #all files resulting from the split command start with an "x"
/stor/work/Ochman/hassan/tools/parallelize_run.sh #parallelize_run.sh is a helper script, described elsewhere

#For ATB:
#ATB files are stored in this directory: /stor/scratch/Ochman/hassan/0318_AllTheBacteria/
#And all contigs from ATB are kept here: /stor/scratch/Ochman/hassan/0318_AllTheBacteria/AllTheBacteria_OG/
#Only calculate fastANI values for records marked "Escherichia", to make sure none of them are very close to E. coli

egrep "Escherichia" /stor/scratch/Ochman/hassan/0318_AllTheBacteria/hq_dataset.species_calls.tsv | grep -v "Escherichia coli" | cut -f1 | sed "s/^/fastANI -q \/stor\/work\/Ochman\/hassan\/protogene_extension\/Ecoli_list\/sequence_RS.fasta -r \/stor\/scratch\/Ochman\/hassan\/0318_AllTheBacteria\/AllTheBacteria_OG\/\*\//g" | sed "s/$/\.fa/g" | awk -F '/' '{print $0" -o "$NF}' | sed "s/$/t/g" | sed "s/\.fat/_fastANI/g" |
split -l 100 - y #split into 100-line chunks to run in parallel. Make sure all resulting files start with a "y"

ls y* | sed "s/^/bash /g" > running.sh #all resulting files start with a "y"
/stor/work/Ochman/hassan/tools/parallelize_run.sh
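
The sed chains above only write one fastANI command per genome file. A more readable way to generate the same kind of command list is sketched below, assuming file names of the form `<accession>_<assembly>_genomic.fna`. The `write_fastani_cmds` helper and the toy file names are hypothetical; the sketch only prints commands and never runs fastANI.

```shell
set -eu

# write_fastani_cmds QUERY GENOME_DIR
# Print one fastANI command per GC*_genomic.fna file in GENOME_DIR.
# Each result file is named <accession>_fastANI, where <accession> is the
# first two "_"-separated fields of the genome file name.
write_fastani_cmds() {
    for fna in "$2"/GC*_genomic.fna; do
        [ -e "$fna" ] || continue                 # no match: skip the literal pattern
        base=${fna##*/}                           # strip the directory part
        acc=$(printf '%s\n' "$base" | cut -d '_' -f 1,2)
        printf 'fastANI -q %s -r %s -o %s_fastANI\n' "$1" "$fna" "$acc"
    done
}

# Toy genome directory with two empty placeholder files (hypothetical names).
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
touch "$demo/GCA_000000001.1_ASM1_genomic.fna" "$demo/GCA_000000002.1_ASM2_genomic.fna"

write_fastani_cmds sequence_RS.fasta "$demo" > "$demo/fastani_cmds.sh"
split -l 600 "$demo/fastani_cmds.sh" "$demo/x"    # 600-line chunks, as in the chunk above
cat "$demo/fastani_cmds.sh"
```

The resulting `x*` chunks can then be listed into `running.sh` exactly as before.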

Parse the fastANI results to identify only genuine outgroup genomes:

In [None]:
ls | egrep "fastANI$" | sed "s/^/cat /g" | sed "s/$/ >> Escherichia_all_fastANI/g" | bash #Compile all fastANI files together
#This includes both ATB and GenBank entries

#Parse the fastANI results to make two-column files with accession ID and ANI
#For GenBank:
awk -F '\t' '($2~"GCA_")' Escherichia_all_fastANI | cut -f2,3 |
sed "s/\/stor\/scratch\/Ochman\/hassan\/100724_Complete_Genomes\/genomes\///g" | sed 's/_/\t/2' | cut -f1,3 > Escherichia_accession_ANI.tsv
#For ATB:
awk -F '\t' '($2!~"GCA_")' Escherichia_all_fastANI | cut -f2,3 | rev | cut -f1 -d '/' | rev | sed "s/\.fa\t/\t/g" >> Escherichia_accession_ANI.tsv
sort -k1 Escherichia_accession_ANI.tsv -o Escherichia_accession_ANI.tsv #preparing a clean two-column tsv file - accession and ANI

#There's a gap in ANI at 89% that corresponds to the Escherichia genus boundary; exclude all accessions above it
awk -F '\t' '($2>89)' Escherichia_accession_ANI.tsv | grep "GCA_" | cut -f1 > Escherichia_genusexcluded_accesions.txt
#We also exclude any record marked Escherichia in the assembly summary (one directory up), regardless of ANI:
awk -F '\t' '($12=="Complete Genome")' ../102824_GenBank_bacteria_assembly_summary.txt | grep "Escherichia" |
cut -f1 | sort -u >> Escherichia_genusexcluded_accesions.txt
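
The parsing step above reduces fastANI output (query, reference, ANI, and two fragment counts per line) to an accession and an ANI value. A minimal sketch of that reduction in a single awk pass follows, assuming GenBank references end in `_genomic.fna` and other references end in `.fa`. The `ani_table` helper, the paths, and the ANI values are made up for illustration.

```shell
set -eu

# ani_table FASTANI_OUTPUT
# Reduce fastANI output to "accession<TAB>ANI", sorted by accession.
ani_table() {
    awk -F '\t' 'BEGIN { OFS = "\t" }
    {
        n = split($2, parts, "/")
        name = parts[n]                    # reference file name without directories
        if (name ~ /^GC[AF]_/) {
            split(name, f, "_")
            name = f[1] "_" f[2]           # keep GCA_<digits>.<version>
        } else {
            sub(/\.fa$/, "", name)         # other sources: strip the .fa extension
        }
        print name, $3
    }' "$1" | sort -k1,1
}

# Toy fastANI output (hypothetical paths and values).
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
{
    printf 'q.fasta\t/data/atb/batch1/SAMN00000001.fa\t90.2\t1200\t1500\n'
    printf 'q.fasta\t/data/genomes/GCA_000000001.1_ASM1_genomic.fna\t97.5\t1400\t1500\n'
} > "$demo/all_fastANI"

ani_table "$demo/all_fastANI" > "$demo/accession_ANI.tsv"
cat "$demo/accession_ANI.tsv"

# Accessions above the 89% genus cutoff used in the chunk above:
awk -F '\t' '$2 > 89 { print $1 }' "$demo/accession_ANI.tsv"
```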

Make the Escherichia-excluded outgroup genome and protein databases by removing all Escherichia genomes from the GenBank set:

In [None]:
cd /stor/scratch/Ochman/hassan/100724_Complete_Genomes
cp Escherichia_temp/Escherichia_genusexcluded_accesions.txt .

#Escherichia_excluded genomes:
ls genomes/ | grep "^GCA" | grep -F -v -f Escherichia_genusexcluded_accesions.txt |
sed "s/^/cat \/stor\/scratch\/Ochman\/hassan\/100724_Complete_Genomes\/genomes\//g" | sed "s/$/ >> Escherichia_excluded.genomes.faa/g" | bash
#Escherichia_excluded proteins:
ls proteins/ | grep "^GCA" | grep -F -v -f Escherichia_genusexcluded_accesions.txt |
sed "s/^/cat \/stor\/scratch\/Ochman\/hassan\/100724_Complete_Genomes\/proteins\//g" | sed "s/$/ >> Escherichia_excluded.proteins.faa/g" | bash
#Escherichia_excluded ORFs:
conda activate orfipy
orfipy --start ATG,TTG,GTG --stop TGA,TAG,TAA --pep Escherichia_excluded.genomes.orfipy Escherichia_excluded.genomes.faa

Similarly, in the next chunk make Ecoli-excluded outgroup genome and protein databases. This database includes all genomes in the Escherichia genus except those in E. coli.

This database has two complications that the Escherichia-excluded one did not. First, this database includes genomes from both GB and ATB that need to be processed slightly differently.

Second, the ATB genomes don't have annotations (they didn't when they were downloaded), so all E. coli-excluded Escherichia genomes are to be annotated with prodigal.

More fundamentally, the strategy here differs from the previous code chunk. There, we only excluded genomes. Here, we exclude genomes with an ANI of 93 or above (to remove E. coli) and also require an ANI above 89 (to retain only genomes from other Escherichia species).
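
The two cutoffs can be expressed as one classification over the accession/ANI table. This is a minimal sketch that covers only the ANI thresholds (89 and 93), not the additional filtering on species names. The `classify_ani` helper, the label names, and the toy values are hypothetical.

```shell
set -eu

# classify_ani TABLE
# TABLE has two tab-separated columns: accession and ANI to the focal E. coli strain.
# Appends a third column with an illustrative label:
#   ANI >= 93      -> Ecoli_like            (removed from both outgroup databases)
#   89 < ANI < 93  -> Escherichia_non_coli  (candidate for the Ecoli-excluded database)
#   ANI <= 89      -> outgroup              (candidate for the Escherichia-excluded database)
classify_ani() {
    awk -F '\t' 'BEGIN { OFS = "\t" }
    {
        ani = $2 + 0                       # force a numeric comparison
        if (ani >= 93)     label = "Ecoli_like"
        else if (ani > 89) label = "Escherichia_non_coli"
        else               label = "outgroup"
        print $1, $2, label
    }' "$1"
}

# Toy table (hypothetical accessions and ANI values).
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
printf 'GCA_A\t97.1\nGCA_B\t91.4\nGCA_C\t82.0\n' > "$demo/accession_ANI.tsv"

classify_ani "$demo/accession_ANI.tsv" > "$demo/classified.tsv"
cat "$demo/classified.tsv"
```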

In [None]:
cd /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Escherichia_db

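#Get accessions of Ecoli-excluded genomes:
#Note: these filters read ANI from column 3 and match species names, so they expect a three-column table
#(accession, taxonomy, ANI). The Escherichia_accession_ANI.tsv built earlier holds only accession and ANI,
#so the copy in this directory presumably has a taxonomy column added.
#GenBank:
awk -F '\t' '($3<93&&$3>89)' Escherichia_accession_ANI.tsv | egrep -v "Escherichia_coli|Escherichia_sp" | grep "^GCA" | cut -f1 > Ecoli_excluded_accessions.GBRS.txt
#ATB:
awk -F '\t' '($3<93&&$3>89)' Escherichia_accession_ANI.tsv | egrep -v "Escherichia_coli|Escherichia_sp" | grep -v "^GCA" | cut -f1 > Ecoli_excluded_accessions.ATB.txt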

#ls genomes/ | grep "^GCA" | grep -F -f Ecoli_excluded_accessions.GBRS.txt | sed "s/^/cat \/stor\/scratch\/Ochman\/hassan\/100724_Complete_Genomes\/genomes\//g" | sed "s/$/ >> Ecoli_excluded.genomes.faa/g" | bash
#Concatenate Ecoli-excluded genomes into a multifasta database:
#Genbank:
sed "s/^/cat \/stor\/scratch\/Ochman\/hassan\/100724_Complete_Genomes\/genomes\//g" Ecoli_excluded_accessions.GBRS.txt | sed "s/$/* >> Ecoli_excluded.genomes.faa/g" | bash
#ATB:
sed "s/^/cat \/stor\/scratch\/Ochman\/hassan\/0318_AllTheBacteria\/AllTheBacteria_OG\/\*\//g" Ecoli_excluded_accessions.ATB.txt | sed "s/$/\.fa >> Ecoli_excluded.genomes.faa/g" | bash

#Now for proteins. These genomes need to be annotated with prodigal.
#We first make helper files
sed "s/^/ls \/stor\/scratch\/Ochman\/hassan\/100724_Complete_Genomes\/genomes\//g" Ecoli_excluded_accessions.GBRS.txt | sed "s/$/*/g" | bash > prodigal_interim
sed "s/^/ls \/stor\/scratch\/Ochman\/hassan\/0318_AllTheBacteria\/AllTheBacteria_OG\/\*\//g" Ecoli_excluded_accessions.ATB.txt | sed "s/$/\.fa/g" | bash >> prodigal_interim

#Run prodigal for all Ecoli-excluded genomes:
sed "s/^/prodigal -i /g" prodigal_interim | sed "s/$/ -f gff -o /g" | awk -F '/' '{print $0$NF}' | sed "s/ -f gff -o $//g" | sed "s/_genomic\.fna$/\.gff/g" | sed "s/\.fa$/\.gff/g" | split -l100 #split into 100-line chunks for easier parallelization
ls x* | sed "s/^/bash /g" > running.sh
/stor/work/Ochman/hassan/tools/parallelize_run.sh

#Convert the resulting gff files to gtf (written to gtf/, which the following loops read from):
mkdir -p gtf cds
for i in $(ls *gff | rev | cut -f2- -d "." | rev)
do
grep -v "#" "$i".gff | awk -F '\t' '($3=="CDS")' | awk -F '\t' '{OFS=FS}{print $1,$2,$3,$4,$5,$6,$7,$8,$1}' | rev | cut -f 2- -d "." | rev | awk -F '\t' '{OFS=FS}{print $1,$2,$3,$4,$5,$6,$7,$8,"transcript_id \""$9"_"NR"\";gene_id \""$9"_"NR"\";"}' > gtf/"$i".gtf
done

#Pull sequences from genomes using the coordinates in gtf:
#For GenBank:
for i in $(ls gtf/GCA*gtf | rev | cut -f2- -d "." | rev | cut -f2- -d "/")
do
cat gtf/"$i".gtf | gtf2bed | bedtools getfasta -s -name -fi /stor/scratch/Ochman/hassan/100724_Complete_Genomes/genomes/"$i"_genomic.fna -bed - > cds/"$i".cds.faa
done

#For ATB:
for i in $(ls gtf/SA*gtf | rev | cut -f2- -d "." | rev | cut -f2- -d "/")
do
cat gtf/"$i".gtf | gtf2bed | bedtools getfasta -s -name -fi /stor/scratch/Ochman/hassan/0318_AllTheBacteria/AllTheBacteria_OG/*/$i.fa -bed - > cds/$i.cds.faa
done

#cds to protein (reads from cds/, writes to proteins/):
mkdir -p proteins
for i in $(ls cds/*.cds.faa | rev | cut -f3- -d '.' | rev | cut -f2- -d "/")
do
/stor/work/Ochman/hassan/tools/faTrans -stop cds/"$i".cds.faa proteins/"$i".protein.faa
done

#Compile these together to make Ecoli_excluded proteins:
cat proteins/*protein.faa >> Ecoli_excluded.proteins.faa

#Ecoli_excluded ORFs:
orfipy --start ATG,TTG,GTG --stop TGA,TAG,TAA --pep Ecoli_excluded.genomes.orfipy Ecoli_excluded.genomes.faa

# **Preparing pangenome databases**



Steps:

1. Extract the genomes from the .gff files, which were already available.
2. Annotate the genomes with GeneMarkS-2 (the existing .gff files are already the product of Prodigal).
3. Convert both types of .gff files to .gtf files.
4. Extract CDS sequences using the .gtf coordinates.
5. Convert the CDS sequences to proteins.
6. Retain only those genomes that are in the 44 lineages under study.
7. Make the ORF-based database.
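
Steps 1 and 3 both rely on the fact that each .gff file carries its annotation rows first and the genome sequence after a `##FASTA` line. The following is a minimal sketch of the two operations on a toy file, assuming the first attribute in column 9 is `ID=...`. The `gff_fasta` and `gff_to_gtf` helpers and the toy record are hypothetical.

```shell
set -eu

# gff_fasta GFF
# Print everything after the "##FASTA" line, i.e. the embedded genome.
gff_fasta() {
    awk '/^##FASTA/ { found = 1; next } found' "$1"
}

# gff_to_gtf GFF
# Print CDS rows as GTF, using the ID attribute for both transcript_id and
# gene_id. Reading stops at the "##FASTA" line.
gff_to_gtf() {
    awk -F '\t' 'BEGIN { OFS = "\t" }
    /^##FASTA/ { exit }
    /^#/ { next }
    $3 == "CDS" {
        id = $9
        sub(/;.*/, "", id)                 # keep the first attribute only
        sub(/^ID=/, "", id)                # drop the "ID=" prefix
        print $1, $2, $3, $4, $5, $6, $7, $8, "transcript_id \"" id "\";gene_id \"" id "\";"
    }' "$1"
}

# Toy GFF with one CDS and one contig (hypothetical content).
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
{
    printf '##gff-version 3\n'
    printf 'contig_1\tProdigal_v2.6.3\tCDS\t3\t11\t.\t+\t0\tID=1_1;partial=00\n'
    printf '##FASTA\n>contig_1\nAAATGAAATAGCC\n'
} > "$demo/toy.prodigal.gff"

gff_fasta "$demo/toy.prodigal.gff" > "$demo/toy.fasta"
gff_to_gtf "$demo/toy.prodigal.gff" > "$demo/toy.prodigal.gtf"
cat "$demo/toy.fasta" "$demo/toy.prodigal.gtf"
```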

In [None]:
#move the gffs to appropriate directory and rename them to indicate annotation source:

mkdir /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_gffs/
cd /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_gffs/
ls *gff | rev | cut -f2- -d '.' | rev | awk '{print "mv "$0".gff "$0".prodigal.gff"}' | bash

#Step-1

#Extract genomes from gffs:
for i in $(ls *gff | rev | cut -f3- -d "." | rev)
do
awk '/##FASTA/ {found=1; next} found' "$i".prodigal.gff > "$i".fasta
done

#Move the genome files to appropriate directories:
mkdir /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_genomes
mv *fasta /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_genomes/

#Step-2

#Annotate the genomes with genemarks2:
#genemarks2 requires the "key" file in the directory in which it's run, plus in the home directory

for i in $(ls /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_genomes/*fasta | rev | cut -f1 -d '/' | cut -f2- -d '.' | rev)
do
/stor/work/Ochman/hassan/tools/gms2_linux_64/gms2.pl --seq /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_genomes/"$i".fasta --genome-type bacteria --gcode 11 --output "$i".genemarks2.gff --format gff3
done

#Step-3

#Convert both gff types into gtfs:

for i in $(ls *prodigal.gff | rev | cut -f3- -d "." | rev)
do
awk '/##FASTA/{exit} {print}' "$i".prodigal.gff |
grep -v "#" | awk -F '\t' '($3=="CDS")' |
cut -f1 -d ';' | sed "s/ID=//g" |
awk -F '\t' '{OFS=FS}{print $1,$2,$3,$4,$5,$6,$7,$8,"transcript_id \""$9"\";gene_id \""$9"\";"}' > "$i".prodigal.gtf
done

for i in $(ls *genemarks2.gff | rev | cut -f3- -d "." | rev)
do
awk -F '\t' '($3=="CDS")' "$i".genemarks2.gff |
cut -f1 -d ";" | sed "s/ID=/genemarks2_/g" |
awk -F '\t' '{OFS=FS}{print $1,$2,$3,$4,$5,$6,$7,$8,"transcript_id \""$9"\";gene_id \""$9"\";"}' > "$i".genemarks2.gtf
done

#Move them to appropriate directories (the gtfs were written to Ecoli_gffs, the working directory of the loops above):
mkdir -p /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_gtfs/prodigal_gtfs
mkdir -p /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_gtfs/genemarks2_gtfs
mv /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_gffs/*.prodigal.gtf /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_gtfs/prodigal_gtfs
mv /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_gffs/*.genemarks2.gtf /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_gtfs/genemarks2_gtfs

#Step-4

#Now extract the corresponding sequences and concatenate them to make genome and protein files

mkdir pangenome_database_interim

#gtf to cds:

for i in $(ls ../Ecoli_gtfs/prodigal_gtfs/ | grep -v "filter" | rev | cut -f3- -d "." | rev) #the "grep -v filter" is needed to account for other files that were created in other steps of the process
do
cat ../Ecoli_gtfs/prodigal_gtfs/"$i".prodigal.gtf | gtf2bed | bedtools getfasta -s -name -fi /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_genomes/"$i".fasta -bed - > pangenome_database_interim/"$i".prodigal.CDS.faa
done

for i in $(ls ../Ecoli_gtfs/genemarks2_gtfs/ | grep -v "filter" | rev | cut -f3- -d "." | rev) #the "grep -v filter" is needed to account for other files that were created in other steps of the process
do
cat ../Ecoli_gtfs/genemarks2_gtfs/"$i".genemarks2.gtf | gtf2bed | bedtools getfasta -s -name -fi /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_genomes/"$i".fasta -bed - > pangenome_database_interim/"$i".genemarks2.CDS.faa
seqkit fx2tab pangenome_database_interim/"$i".genemarks2.CDS.faa | sed "s/\t/,/g" | egrep ",ATG|,TTG|,GTG" | egrep "TAA,|TGA,|TAG," | sed "s/^/>/g" | sed "s/,$//g" | sed "s/,/\n/g" > interim && mv interim pangenome_database_interim/"$i".genemarks2.CDS.faa
done

#Step-5

#cds to proteins:

for i in $(ls pangenome_database_interim/*faa | rev | cut -f3- -d '.' | cut -f1 -d "/" | rev)
do
/stor/work/Ochman/hassan/tools/faTrans -stop pangenome_database_interim/"$i".CDS.faa pangenome_database_interim/"$i".prot.faa
done

#Step-6

#Only retain proteins corresponding to the 44 lineages under study:
#the /stor/work/Ochman/hassan/Ecoli_pangenome/500_ipp_lineagedesignations.tsv file was extracted from supplement

for i in $(awk -F '\t' '($5<45)' /stor/work/Ochman/hassan/Ecoli_pangenome/500_ipp_lineagedesignations.tsv | cut -f1)
do
sed "s/>/>"$i"@/g" pangenome_database_interim/"$i"*prot.faa >> Ecoli.all.prot.pangenomedb.faa #Each protein is marked with the genome ID, so we know their source
done

#Step-7

#To make the ORF-based database, first concatenate all genomes in the 44 lineages under study:
for i in $(awk -F '\t' '($5<45)' /stor/work/Ochman/hassan/Ecoli_pangenome/500_ipp_lineagedesignations.tsv | cut -f1)
do
sed "s/>/>"$i"@/g" Ecoli_genomes/"$i".fasta >> Ecoli.all.genomes.faa #Each contig is marked with the genome ID, so we know their source
done

#Extract proteins encoded from all ORFs:
orfipy --start ATG,TTG,GTG --stop TGA,TAG,TAA --pep Ecoli.all.genomes.orfipy Ecoli.all.genomes.faa


Now we convert each of these genomes, proteins, and translated ORF files into blast or diamond databases.
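
For the two outgroup databases, the input and output names follow one pattern (`NAME.genomes.faa` and `NAME.proteins.faa`), so the build commands can be generated rather than typed. The sketch below is a dry run that only prints the commands. It assumes `makeblastdb` and `diamond` are on the `PATH` (the notebook calls diamond by full path), and the `db_commands` helper is hypothetical. The pangenome files use different names and are not covered.

```shell
set -eu

# db_commands NAME...
# Print (not run) the makeblastdb and diamond commands for each database NAME,
# assuming inputs named NAME.genomes.faa and NAME.proteins.faa.
db_commands() {
    for name in "$@"; do
        printf 'makeblastdb -in %s.genomes.faa -dbtype nucl -out %s.genomes\n' "$name" "$name"
        printf 'diamond makedb --in %s.proteins.faa --db %s.proteins\n' "$name" "$name"
    done
}

db_commands Escherichia_excluded Ecoli_excluded
# To execute once the inputs are in place:
#   db_commands Escherichia_excluded Ecoli_excluded | bash
```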

In [None]:
#Escherichia_excluded, Ecoli_excluded and pangenome genome blast databases:
makeblastdb -in Escherichia_excluded.genomes.faa -dbtype nucl -out Escherichia_excluded.genomes
makeblastdb -in Ecoli_excluded.genomes.faa -dbtype nucl -out Ecoli_excluded.genomes
makeblastdb -in Ecoli.all.genomes.faa -dbtype nucl -out Ecoli.all.genomes
#Escherichia_excluded, Ecoli_excluded and pangenome protein diamond databases:
/stor/work/Ochman/hassan/E.coli_ORFan/E.coli_ORFan_pipeline_8-10/diamond makedb --in Escherichia_excluded.proteins.faa --db Escherichia_excluded.proteins
/stor/work/Ochman/hassan/E.coli_ORFan/E.coli_ORFan_pipeline_8-10/diamond makedb --in Ecoli_excluded.proteins.faa --db Ecoli_excluded.proteins
/stor/work/Ochman/hassan/E.coli_ORFan/E.coli_ORFan_pipeline_8-10/diamond makedb --in Ecoli.all.prot.pangenomedb.faa --db Ecoli.all.prot.pangenomedb
#Escherichia, Ecoli_excluded and pangenome translated ORF diamond databases:
/stor/work/Ochman/hassan/E.coli_ORFan/E.coli_ORFan_pipeline_8-10/diamond makedb --in orfipy_Escherichia_excluded.genomes.faa_out/Escherichia_excluded.genomes.orfipy --db Escherichia_excluded.genomes.orfipy.prot
/stor/work/Ochman/hassan/E.coli_ORFan/E.coli_ORFan_pipeline_8-10/diamond makedb --in orfipy_Ecoli_excluded.genomes.faa_out/Ecoli_excluded.genomes.orfipy --db Ecoli_excluded.genomes.orfipy.prot
/stor/work/Ochman/hassan/E.coli_ORFan/E.coli_ORFan_pipeline_8-10/diamond makedb --in orfipy_Ecoli.all.genomes.faa_out/Ecoli.all.genomes.orfipy --db Ecoli.all.orfipy


There's one last convenience step to take care of. In a subsequent step, we run samtools to get sequences based on coordinates, and that step is significantly sped up if we place each contig in individual files named after those contigs.

For this, we prepare the individual_genomes directory.
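
The one-file-per-contig layout can also be produced in a single awk pass, without linearizing, splitting, and renaming as separate steps. This is a minimal sketch, assuming every record starts with a `>` header, contig IDs are unique, and IDs contain no `/`. The `split_contigs` helper and the toy records are hypothetical.

```shell
set -eu

# split_contigs MULTIFASTA OUTDIR
# Write each record of MULTIFASTA to OUTDIR/<contig ID>, where the contig ID is
# the header text up to the first space. Sequence lines are joined onto one line.
split_contigs() {
    mkdir -p "$2"
    awk -v outdir="$2" '
        /^>/ {
            if (out != "") { printf "\n" > out; close(out) }   # finish the previous record
            id = substr($1, 2)             # first word of the header, without ">"
            out = outdir "/" id
            print $0 > out
            next
        }
        { printf "%s", $0 > out }          # append sequence without a line break
        END { if (out != "") { printf "\n" > out; close(out) } }
    ' "$1"
}

# Toy multi-FASTA with two contigs (hypothetical IDs in the genome@contig style).
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
printf '>g1@c1 desc\nACGT\nAC\n>g1@c2\nTTGA\n' > "$demo/toy.genomes.faa"

split_contigs "$demo/toy.genomes.faa" "$demo/individual_genomes"
ls "$demo/individual_genomes"
cat "$demo/individual_genomes/g1@c1"
```

Closing each file as soon as its record ends keeps the number of open file handles at one, which matters when a database holds many contigs.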

In [None]:
#Make helper directories for each of our three databases

cd /stor/scratch/Ochman/hassan/100724_Complete_Genomes/
mkdir individual_genomes
mkdir individual_genomes/Escherichia_excluded
mkdir individual_genomes/Ecoli_excluded
mkdir individual_genomes/Ecoli_all

#Copy all genome files there, then move into the directory
cp Escherichia_excluded.genomes.faa individual_genomes/Escherichia_excluded
cp Ecoli_excluded.genomes.faa individual_genomes/Ecoli_excluded
cp Ecoli.all.genomes.faa individual_genomes/Ecoli_all
cd individual_genomes


cd Escherichia_excluded
#Linearize files first
seqkit fx2tab Escherichia_excluded.genomes.faa | sed "s/\t$//g" | sed "s/^/>/g" | sed "s/\t/\n/g" > temp && mv temp Escherichia_excluded.genomes.faa
#Split the file into two-line files, first line identifier, second line sequence
split -l 2 Escherichia_excluded.genomes.faa
#This results in a bunch of files whose names start with "x"
#Rename them to their sequence name
for i in x*
do
rename=$(grep "^>" $i | cut -f1 -d " " | tr -d ">")
mv $i $rename
done
cd ..

#Same with the Ecoli_excluded database

cd Ecoli_excluded
seqkit fx2tab Ecoli_excluded.genomes.faa | sed "s/\t$//g" | sed "s/^/>/g" | sed "s/\t/\n/g" > temp && mv temp Ecoli_excluded.genomes.faa
split -l 2 Ecoli_excluded.genomes.faa
for i in x*
do
rename=$(grep "^>" $i | cut -f1 -d " " | tr -d ">")
mv $i $rename
done
cd ..

#Same with the Ecoli_all database

cd Ecoli_all
seqkit fx2tab Ecoli.all.genomes.faa | sed "s/\t$//g" | sed "s/^/>/g" | sed "s/\t/\n/g" > temp && mv temp Ecoli.all.genomes.faa
split -l 2 Ecoli.all.genomes.faa
for i in x*
do
rename=$(grep "^>" $i | cut -f1 -d " " | tr -d ">")
mv $i $rename
done
cd ..

#Move them all back to main individual_genomes directory
mv Escherichia_excluded/* .
mv Ecoli_excluded/* .
mv Ecoli_all/* .

# **Download and prepare the nr database**

Finally, for a sanity check step near the end of the pipeline, I also search the proteins against the nonredundant protein database. In the chunk below I download and prepare the database by incorporating the taxonomy information.

In [None]:
mkdir /stor/scratch/Ochman/hassan/nr
cd /stor/scratch/Ochman/hassan/nr

#Download the nr file

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz

#Download associated taxonomy information files

wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.FULL.gz
gunzip prot.accession2taxid.FULL.gz

wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -zxvf taxdump.tar.gz

#Make diamond database:
diamond makedb \
    --in nr.gz \
    --db 03272024_nr \
    --taxonmap prot.accession2taxid.FULL \
    --taxonnames names.dmp \
    --taxonnodes nodes.dmp

# **Extract taxonomy information for each database entry**

For subsequent steps, it's important to assign taxonomic information to each target genome and protein. The goal is to make three-column tsv files, with the first column listing the genome accession, the second column listing the contig or protein ID, and the third column listing the appropriate taxonomic information: genus for non-Escherichia records, species for non-E. coli Escherichia records, and lineage for pangenome records.
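
The three-column layout comes from joining two accession-keyed tables: accession to contig or protein ID, and accession to taxon. The following is a minimal sketch on toy data, assuming both inputs are tab-separated with the accession in column 1. The `taxonomy_table` and `to_genus` helpers and all toy accessions are hypothetical.

```shell
set -eu

# taxonomy_table ACC_TO_ID ACC_TO_TAXON
# Print accession, contig/protein ID, and taxon, sorted by the ID column.
# Both inputs are sorted on the accession first, as join requires.
taxonomy_table() {
    tab=$(printf '\t')
    tmp=$(mktemp -d)
    LC_ALL=C sort -t "$tab" -k1,1 "$1" > "$tmp/ids"
    LC_ALL=C sort -t "$tab" -k1,1 "$2" > "$tmp/taxa"
    LC_ALL=C join -t "$tab" "$tmp/ids" "$tmp/taxa" | LC_ALL=C sort -t "$tab" -k2,2
    rm -rf "$tmp"
}

# to_genus: reduce the taxon in column 3 to its genus (text before the first "_").
to_genus() {
    awk -F '\t' 'BEGIN { OFS = "\t" } { sub(/_.*/, "", $3); print }'
}

# Toy inputs (hypothetical accessions and contig IDs).
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
printf 'GCA_000000002.1\tCP000002.1\nGCA_000000001.1\tCP000001.1\nGCA_000000001.1\tCP000001.2\n' > "$demo/accession_genomeID.tsv"
printf 'GCA_000000001.1\tSalmonella_enterica\nGCA_000000002.1\tKlebsiella_pneumoniae\n' > "$demo/accession_taxonomy.tsv"

taxonomy_table "$demo/accession_genomeID.tsv" "$demo/accession_taxonomy.tsv" | to_genus > "$demo/genome_contig_taxa.reduced.tsv"
cat "$demo/genome_contig_taxa.reduced.tsv"
```

Keeping the tab as the explicit join delimiter avoids the later step of converting spaces back to tabs.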

In [None]:
cd /stor/scratch/Ochman/hassan/100724_Complete_Genomes/

#First, make a two-column file with accession number and species name:
awk -F '\t' '($12=="Complete Genome")' 102824_GenBank_bacteria_assembly_summary.txt | cut -f1,8 | sed "s/ /_/g" > accession_taxonomy.tsv
sort -k1 accession_taxonomy.tsv -o accession_taxonomy.tsv

#For proteins and accessions:
cd /stor/scratch/Ochman/hassan/100724_Complete_Genomes/proteins
#Two-column file with accession and protein ID:
for i in $(ls | grep "GCA" | cut -f 1,2 -d "_"); do grep "^>" "$i"*protein.faa | cut -f1 -d " " | tr -d ">" | sed "s/^/"$i"\t/g" >> accession_proteinID.tsv; done
#Join this with the file containing accession and taxonomy:
sort -k1 accession_proteinID.tsv | join -1 1 -2 1 - ../accession_taxonomy.tsv | sed "s/ /\t/g" > accession_proteinID_taxonomy.tsv #tie this to taxonomy
sort -k2 /stor/scratch/Ochman/hassan/100724_Complete_Genomes/proteins/accession_proteinID_taxonomy.tsv -o /stor/scratch/Ochman/hassan/100724_Complete_Genomes/proteins/accession_proteinID_taxonomy.tsv

#For genomes and accessions:
cd /stor/scratch/Ochman/hassan/100724_Complete_Genomes/genomes
#Two-column file with accession and genome ID:
for i in $(ls | grep "GCA" | cut -f 1,2 -d "_"); do grep "^>" "$i"*genomic.fna | cut -f1 -d " " | tr -d ">" | sed "s/^/"$i"\t/g" >> accession_genomeID.tsv; done
#Join this with the file containing accession and taxonomy:
sort -k1 accession_genomeID.tsv | join -1 1 -2 1 - ../accession_taxonomy.tsv | sed "s/ /\t/g" > accession_genomeID_taxonomy.tsv #tie this to taxonomy
sort -k2 /stor/scratch/Ochman/hassan/100724_Complete_Genomes/genomes/accession_genomeID_taxonomy.tsv -o /stor/scratch/Ochman/hassan/100724_Complete_Genomes/genomes/accession_genomeID_taxonomy.tsv

#Only extract those rows corresponding to genome IDs in the GenBank Escherichia_excluded database:
cd /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Escherichia_db
grep "^>" Escherichia_excluded.genomes.faa | cut -f1 -d " " | tr -d ">" > Escherichia_excluded_accessions.GBRS.txt
cat Escherichia_excluded_accessions.GBRS.txt | grep -w -F -f - /stor/scratch/Ochman/hassan/100724_Complete_Genomes/genomes/accession_genomeID_taxonomy.tsv | sort -k2 > Ecoli_extragenus_genome_contig_taxa.tsv
#Only retain genus names (and get rid of non-useful prefixes) — that's what makes it "reduced":
sed "s/^\[//g" Ecoli_extragenus_genome_contig_taxa.tsv | sed "s/uncultured_//g" | sed "s/Candidatus_//g" | sed "s/^'//g" | sed "s/candidatus_//g" | awk '$3 ~ /^[A-Z]/' | awk '{sub(/_.*/, "", $3)}1' | sed "s/ /\t/g" | sort -k2 > Ecoli_extragenus_genome_contig_taxa.reduced.tsv
#Apparently some E_coli IDs snuck into the database...
grep -v "Escherichia" /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_extragenus_genome_contig_taxa.reduced.tsv > /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_extragenus_genome_contig_taxa.reduced.noescherichia.tsv

#Same for proteins: only extract those rows corresponding to protein IDs in the GenBank Escherichia_excluded database:
grep "^>" Escherichia_excluded.proteins.faa | cut -f1 -d " " | tr -d ">" > Escherichia_db/Escherichia_excluded_accessions.proteins.GBRS.txt
cat Escherichia_excluded_accessions.proteins.GBRS.txt | grep -w -F -f - proteins/accession_proteinID_taxonomy.tsv | sort -k2 > Ecoli_extragenus_protein_contig_taxa.tsv
#Only retain genus names (and get rid of non-useful prefixes) — that's what makes it "reduced":
sed "s/^\[//g" Ecoli_extragenus_protein_contig_taxa.tsv | sed "s/uncultured_//g" | sed "s/Candidatus_//g" | sed "s/^'//g" | sed "s/candidatus_//g" | awk '$3 ~ /^[A-Z]/' | awk '{sub(/_.*/, "", $3)}1' | sed "s/ /\t/g" | sort -k2 > Ecoli_extragenus_protein_contig_taxa.reduced.tsv
#Apparently some E_coli IDs snuck into the database...
grep -v "Escherichia" Ecoli_extragenus_protein_contig_taxa.reduced.tsv > /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_extragenus_protein_contig_taxa.reduced.noescherichia.tsv

#These Escherichia contaminants need to be removed from the blast search results
#We store their accession in a separate file for this process:
grep "Escherichia" /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_extragenus_protein_contig_taxa.reduced.tsv | cut -f2 > /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/rethinking_clustering/all_Escherichia_slippedby_IDs.txt
grep "Escherichia" /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_extragenus_genome_contig_taxa.reduced.tsv | cut -f2 >> /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/rethinking_clustering/all_Escherichia_slippedby_IDs.txt

#Same with intragenus (i.e., E. coli-excluded), and for genomes:
#For GenBank:
cat Ecoli_excluded_accessions.GBRS.txt | grep -w -F -f - genomes/accession_genomeID_taxonomy.tsv | sort -k2 > Ecoli_intragenus_genome_contig_taxa.tsv
#For ATB:
grep "^>" Ecoli_excluded.genomes.faa | awk '($1~"contig")' | cut -f1 -d " " | sort -u | tr -d ">" | awk -F '.' '{print $1,$0}' | sort -nrk1 > temp
cut -f1 -d " " temp | grep -w -F -f - ../0318_AllTheBacteria/hq_dataset.species_calls.tsv | sed "s/ /_/g" | sort -nrk1 | join -1 1 -2 1 - temp | awk '{print $1"\t"$3"\t"$2}' >> Ecoli_intragenus_genome_contig_taxa.tsv

#Same with intragenus (i.e., E. coli-excluded), and for protein:
#GenBank:
#ProteinID-Accession tsv:
grep "^>" Escherichia_db/proteins/*protein.faa | grep "GCA" | rev | cut -f1 -d "/" | rev | sed "s/_/\t/2" | sed "s/>/\t/g" | cut -f1,3 | rev | cut -f2- -d "(" | rev > Ecoli_intragenus_genome_protein.tsv
#ATB:
grep "^>" Escherichia_db/proteins/*protein.faa | grep -v "GCA" | rev | cut -f1 -d "/" | rev | sed "s/\./\t/1" | sed "s/>/\t/g" | cut -f1,3 | rev | cut -f2- -d "(" | rev >> Ecoli_intragenus_genome_protein.tsv
sort -k1 Ecoli_intragenus_genome_protein.tsv -o Ecoli_intragenus_genome_protein.tsv
#Join this with the file containing genome accession and species:
sort -k1 Ecoli_intragenus_genome_contig_taxa.tsv | join -1 1 -2 1 - Ecoli_intragenus_genome_protein.tsv | awk '{print $1"\t"$4"\t"$3}' > Ecoli_intragenus_genome_protein_taxa.tsv

#Slight fixes to some species names:
sed -i "s/_KF1//g" /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_intragenus_genome_contig_taxa.tsv
sed -i "s/_KF1//g" /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_intragenus_genome_protein_taxa.tsv
sed -i "s/_ATCC_35469//g" /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_intragenus_genome_contig_taxa.tsv
sed -i "s/_ATCC_35469//g" /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_intragenus_genome_protein_taxa.tsv

#For pangenome:
cd /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3

#genomes:
grep "^>" Ecoli.all.genomes.faa | tr -d ">" | awk -F '@' '{print $1"\t"$0}' | sort -k1 > temp
awk -F '\t' '($5<45)' /stor/work/Ochman/hassan/Ecoli_pangenome/500_ipp_lineagedesignations.tsv | awk -F '\t' '{print $1"\tEcoli@"$5}' | sort -k1 | join -1 1 -2 1 - temp | awk '{print $1"\t"$3"\t"$2}' > Ecoli_pangenome_genome_contig_taxa.tsv

#proteins:
grep "^>" Ecoli.all.prot.faa | tr -d ">" | awk -F "@" '{print $1"\t"$0}' | sort -k1 > temp
awk -F '\t' '($5<45)' /stor/work/Ochman/hassan/Ecoli_pangenome/500_ipp_lineagedesignations.tsv | awk -F '\t' '{print $1"\tEcoli@"$5}' | sort -k1 | join -1 1 -2 1 - temp | awk '{print $1"\t"$3"\t"$2}' > Ecoli_pangenome_genome_protein_taxa.tsv

cp Ecoli_pangenome_genome_contig_taxa.tsv /stor/scratch/Ochman/hassan/100724_Complete_Genomes
cp Ecoli_pangenome_genome_protein_taxa.tsv /stor/scratch/Ochman/hassan/100724_Complete_Genomes
sort -k2 /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_pangenome_genome_protein_taxa.tsv -o /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_pangenome_genome_protein_taxa.tsv
sort -k2 /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_pangenome_genome_contig_taxa.tsv -o /stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_pangenome_genome_contig_taxa.tsv

In [None]:
#Final set of databases:

#Genomes:
/stor/scratch/Ochman/hassan/100724_Complete_Genomes/Escherichia_db/Escherichia_excluded.genomes
/stor/scratch/Ochman/hassan/100724_Complete_Genomes/Escherichia_db/Ecoli_excluded.genomes
/stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli.all.genomes

#Proteins:
/stor/scratch/Ochman/hassan/100724_Complete_Genomes/Escherichia_db/Escherichia_excluded.proteins.dmnd
/stor/scratch/Ochman/hassan/100724_Complete_Genomes/Escherichia_db/Ecoli_excluded.proteins.dmnd
/stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli.all.prot.dmnd

#ORFs:
/stor/scratch/Ochman/hassan/100724_Complete_Genomes/Escherichia_db/Escherichia_excluded.genomes.orfipy.prot.dmnd
/stor/scratch/Ochman/hassan/100724_Complete_Genomes/Escherichia_db/Ecoli_excluded.genomes.orfipy.prot.dmnd
/stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli.all.orfipy.dmnd

#nr:
/stor/scratch/Ochman/hassan/nr.dmnd

#Final set of TSV files that contain genome+protein IDs and corresponding taxonomy:

/stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_extragenus_genome_contig_taxa.reduced.noescherichia.tsv
/stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_intragenus_genome_contig_taxa.tsv
/stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_pangenome_genome_contig_taxa.tsv

/stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_extragenus_protein_contig_taxa.reduced.noescherichia.tsv
/stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_intragenus_genome_protein_taxa.tsv
/stor/scratch/Ochman/hassan/100724_Complete_Genomes/Ecoli_pangenome_genome_protein_taxa.tsv