## **Filtering proteins**

Our starting point is the set of prodigal and genemarks2 gtf files generated in the first step of this pipeline (see preparing_databases.sh). These files are kept in /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_gtfs/

We exclude genes that are:

1. Only found at contig edges -- i.e., have no genes upstream or downstream; or
2. Only annotated by one of the two annotation programs; or
3. Lacking conventional starts and/or stops.

# **Preparing query proteins**

To detect ORFan genes in a bacterial species, the first step is to screen all proteins encoded by that species. In this project, we're only interested in well-established proteins that are picked out by two annotation programs -- prodigal and genemarks2, in our case. Furthermore, we exclude proteins that are located exclusively at contig edges, and genes that lack an ATG/TTG/GTG start or a TAG/TGA/TAA stop.

After collecting these proteins from a set of genomes, we have to cluster them based on similarity so we're ultimately left with a manageable number of gene families. It's not practical to search millions of proteins across the database. Instead, we cluster them and pick just one representative sequence from each cluster.

This notebook is divided into two halves: the filtering step, and the clustering step.

### **Filter 1. Genes at edges**

In [None]:
cd /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/rethinking_clustering

#For prodigal, pick out the edge genes:

for i in $(ls ../Ecoli_gtfs/prodigal_gtfs/*gtf | rev | cut -f3- -d "." | cut -f1 -d "/" | rev)
do
awk -F'\t' 'BEGIN{OFS="\t"}{if(!($1 in seen)){order[++n]=$1;seen[$1]=1;first[$1]=$0}last[$1]=$0}END{for(i=1;i<=n;i++){k=order[i];print first[k];if(first[k]!=last[k])print last[k]}}' ../Ecoli_gtfs/prodigal_gtfs/"$i".prodigal.gtf | cut -f2 -d "\"" | sed "s/^/"$i",/g" >> prodigal_edge_genes.csv
done

#For genemarks2, pick out the edge genes:

for i in $(ls ../Ecoli_gtfs/genemarks2_gtfs/*gtf | rev | cut -f3- -d "." | cut -f1 -d "/" | rev)
do
awk -F'\t' 'BEGIN{OFS="\t"}{if(!($1 in seen)){order[++n]=$1;seen[$1]=1;first[$1]=$0}last[$1]=$0}END{for(i=1;i<=n;i++){k=order[i];print first[k];if(first[k]!=last[k])print last[k]}}' ../Ecoli_gtfs/genemarks2_gtfs/"$i".genemarks2.gtf | cut -f2 -d "\"" | sed "s/^/"$i",/g" >> genemarks2_edge_genes.csv
done

#The files are two-column csvs, with first column = genome name, second column = gene/protein name
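
To make the edge-gene logic concrete, here is a minimal sketch on a toy input. The contig names, gene IDs, and the genome label `toy` are made up, and the sketch assumes (as the loops above do) that records are ordered by position within each contig, so the first and last record per contig are the edge genes.

```shell
# Toy GTF (hypothetical contigs and gene IDs), sorted by position within each contig
workdir=$(mktemp -d)
printf 'contig1\tprodigal\tCDS\t10\t100\t.\t+\t0\tgene_id "g1";\n'  >  "$workdir/toy.gtf"
printf 'contig1\tprodigal\tCDS\t150\t300\t.\t-\t0\tgene_id "g2";\n' >> "$workdir/toy.gtf"
printf 'contig1\tprodigal\tCDS\t400\t600\t.\t+\t0\tgene_id "g3";\n' >> "$workdir/toy.gtf"
printf 'contig2\tprodigal\tCDS\t5\t90\t.\t+\t0\tgene_id "g4";\n'    >> "$workdir/toy.gtf"

# Keep the first and last record of every contig, then pull out the gene_id
edge_genes=$(awk -F'\t' '
  !($1 in first) { order[++n] = $1; first[$1] = $0 }  # first record seen for this contig
  { last[$1] = $0 }                                    # overwritten until the final record
  END {
    for (i = 1; i <= n; i++) {
      k = order[i]
      print first[k]
      if (first[k] != last[k]) print last[k]           # single-gene contigs print once
    }
  }' "$workdir/toy.gtf" | cut -f2 -d '"' | sed 's/^/toy,/')

echo "$edge_genes"
```

Here g2 is the only gene with neighbours on both sides, so it is the only one left off the edge list.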

### **Filter 2. Lacking dual annotation**

In [None]:
#For both cases, identify genes that aren't picked out by both annotation programs
#To this end, identify what genes are annotated by what program:

cd /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_gffs/

for i in $(ls *genemarks2.gff | rev | cut -f3- -d "." | rev) #Loop using file basenames
do
#The first line makes non-redundant gtf files by first merging prodigal and genemarks2 annotations
#Then checking to see if any gene has the same contig and stop codon information
#In which case, the longer one is retained
cat /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_gtfs/prodigal_gtfs/"$i".prodigal.gtf /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_gtfs/genemarks2_gtfs/"$i".genemarks2.gtf | awk '{stop = ($7 == "+") ? $5 : $4; feature_length = $5 - $4; key = $1 ":" stop; if (!(key in data) || feature_length > data[key]) {data[key] = feature_length; lines[key] = $0}} END {for (k in lines) {print lines[k]}}' > "$i".nr.gtf
#The next four lines take the non-redundant gtf file as input
#Checks to see if any entry in the prodigal/genemarks2 gtf shares its contig and stop codon information
#Based on this, it evaluates if the gene is found by prodigal and/or genemarks2
awk -F'\t' 'NR==FNR { seen[$1, $5, $7] = $9; next } ($7 == "+" && ($1, $5, $7) in seen) { print seen[$1, $5, $7], "prodigal" }' "$i".nr.gtf /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_gtfs/prodigal_gtfs/"$i".prodigal.gtf > "$i".tmp
awk -F'\t' 'NR==FNR { seen[$1, $4, $7] = $9; next } ($7 == "-" && ($1, $4, $7) in seen) { print seen[$1, $4, $7], "prodigal" }' "$i".nr.gtf /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_gtfs/prodigal_gtfs/"$i".prodigal.gtf >> "$i".tmp
awk -F'\t' 'NR==FNR { seen[$1, $5, $7] = $9; next } ($7 == "+" && ($1, $5, $7) in seen) { print seen[$1, $5, $7], "genemarks2" }' "$i".nr.gtf /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_gtfs/genemarks2_gtfs/"$i".genemarks2.gtf >> "$i".tmp
awk -F'\t' 'NR==FNR { seen[$1, $4, $7] = $9; next } ($7 == "-" && ($1, $4, $7) in seen) { print seen[$1, $4, $7], "genemarks2" }' "$i".nr.gtf /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_gtfs/genemarks2_gtfs/"$i".genemarks2.gtf >> "$i".tmp
rev "$i".tmp | cut -f-2 -d " " | rev | tr -d "\"" | sed "s/;//g" | awk '{print $2"\t"$1}' | awk -F'\t' '{ values[$2] = (values[$2] == "" ? $1 : values[$2] ", " $1) } END { for (value in values) { print value "\t" values[value] } }' | sed "s/ //g" > "$i".annotcomp.tsv
done

#Concatenate the annotation presence/absence info for each gene into a singular file
#Don't forget to mark protein names with their genome info so we know where they came from
#The *.annotcomp.tsv glob keeps the output file all_annotcomp.tsv itself out of the loop
for i in *.annotcomp.tsv; do cat $i | sed "s/^/"$i"@/g"; done | sed "s/\.annotcomp\.tsv//g" >> all_annotcomp.tsv

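#Pick out ones without dual annotation:
#Switch to the working directory first, since the later steps read genes_with_oneannotation.csv from there
cd /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/rethinking_clustering
awk -F '\t' '($2!~",")' ../Ecoli_gffs/all_annotcomp.tsv | cut -f1 | sort -u | sed "s/@/,/g" > genes_with_oneannotation.csv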

### **Filter 3. Lacking normal start/stop codons**

In [None]:
cd /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/rethinking_clustering

#Genes that begin with other than ATG, TTG or GTG:

for i in $(ls ../Ecoli_gtfs/prodigal_gtfs/*gtf | grep -v "filter" | rev | cut -f3- -d "." | cut -f1 -d "/" | rev)
do
cat ../Ecoli_gtfs/prodigal_gtfs/"$i".prodigal.gtf | gtf2bed | bedtools getfasta -s -name -fi ../Ecoli_genomes/"$i".fasta -bed - |
seqkit fx2tab | sed "s/\t/,/g" | egrep -v ",ATG|,TTG|,GTG" | cut -f1 -d "(" | sed "s/^/"$i",/g"
done >> prodigal_genes_without_start.csv

for i in $(ls ../Ecoli_gtfs/genemarks2_gtfs/*gtf | grep -v "filter" | rev | cut -f3- -d "." | cut -f1 -d "/" | rev)
do
cat ../Ecoli_gtfs/genemarks2_gtfs/"$i".genemarks2.gtf | gtf2bed | bedtools getfasta -s -name -fi ../Ecoli_genomes/"$i".fasta -bed - |
seqkit fx2tab | sed "s/\t/,/g" | egrep -v ",ATG|,TTG|,GTG" | cut -f1 -d "(" | sed "s/^/"$i",/g"
done >> genemarks2_genes_without_start.csv

#Those missing stops:

for i in $(ls ../Ecoli_gtfs/prodigal_gtfs/*gtf | grep -v "filter" | rev | cut -f3- -d "." | cut -f1 -d "/" | rev)
do
cat ../Ecoli_gtfs/prodigal_gtfs/"$i".prodigal.gtf | gtf2bed | bedtools getfasta -s -name -fi ../Ecoli_genomes/"$i".fasta -bed - |
seqkit fx2tab | sed "s/\t/,/g" | egrep -v "TGA,|TAA,|TAG," | cut -f1 -d "(" | sed "s/^/"$i",/g"
done >> prodigal_genes_without_stop.csv

for i in $(ls ../Ecoli_gtfs/genemarks2_gtfs/*gtf | grep -v "filter" | rev | cut -f3- -d "." | cut -f1 -d "/" | rev)
do
cat ../Ecoli_gtfs/genemarks2_gtfs/"$i".genemarks2.gtf | gtf2bed | bedtools getfasta -s -name -fi ../Ecoli_genomes/"$i".fasta -bed - |
seqkit fx2tab | sed "s/\t/,/g" | egrep -v "TGA,|TAA,|TAG," | cut -f1 -d "(" | sed "s/^/"$i",/g"
done >> genemarks2_genes_without_stop.csv
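
As a quick sanity check of the start/stop test, the sketch below applies the same criteria to a toy name/sequence table of the shape that seqkit fx2tab produces. The gene names and sequences are invented, and the sketch assumes upper-case sequences already oriented on the coding strand.

```shell
workdir=$(mktemp -d)
# Hypothetical name<TAB>sequence table, the shape produced by seqkit fx2tab
printf 'gene1\tATGAAACCCTAA\n' >  "$workdir/cds.tsv"   # conventional start and stop
printf 'gene2\tCTGAAACCCTGA\n' >> "$workdir/cds.tsv"   # CTG start
printf 'gene3\tGTGAAACCCGGG\n' >> "$workdir/cds.tsv"   # no stop codon

# First three bases must be ATG, TTG or GTG
no_start=$(awk -F'\t' 'substr($2, 1, 3) !~ /^(ATG|TTG|GTG)$/ {print $1}' "$workdir/cds.tsv")
# Last three bases must be TGA, TAA or TAG
no_stop=$(awk -F'\t' 'substr($2, length($2) - 2) !~ /^(TGA|TAA|TAG)$/ {print $1}' "$workdir/cds.tsv")

echo "without start: $no_start"
echo "without stop: $no_stop"
```

Comparing the codon with `substr` rather than a pattern anchored on the comma is a design choice for this sketch only; it avoids depending on how the table columns are delimited.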

### **4. Remove unwanted genes**

Now filter each .gtf file by removing edge genes, single-annotation genes, and genes with unconventional starts or stops.

In [None]:
cd /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/rethinking_clustering

#First exclude genes from prodigal gtfs:

cat genes_with_oneannotation.csv | grep -v "genemarks2" | #Only select prodigal-annotated genes without dual annotation
cat - prodigal_genes_without_start.csv prodigal_genes_without_stop.csv prodigal_edge_genes.csv | #Add to this the non-ORF genes, the edge genes
sort -u > all_prodigal_exclude.csv #Complete exclusion list

for i in $(ls ../Ecoli_gtfs/prodigal_gtfs/*gtf | rev | cut -f3- -d "." | cut -f1 -d "/" | rev)
do
gene_name=$(awk -v var="$i" -F ',' '($1==var)' all_prodigal_exclude.csv | cut -f2 -d ',') #Gene name in second column
grep -v -w -F "$gene_name" ../Ecoli_gtfs/prodigal_gtfs/"$i".prodigal.gtf >> ../Ecoli_gtfs/prodigal_gtfs/"$i".prodigal.filtered.gtf #Exclude the genes from the corresponding .gtf
done

#Likewise with genemarks2:

cat genes_with_oneannotation.csv | grep "genemarks2" |
cat - genemarks2_genes_without_start.csv genemarks2_genes_without_stop.csv genemarks2_edge_genes.csv |
sort -u > all_genemarks2_exclude.csv

for i in $(ls ../Ecoli_gtfs/genemarks2_gtfs/*gtf | rev | cut -f3- -d "." | cut -f1 -d "/" | rev)
do
gene_name=$(awk -v var="$i" -F ',' '($1==var)' all_genemarks2_exclude.csv | cut -f2 -d ',')
grep -v -w -F "$gene_name" ../Ecoli_gtfs/genemarks2_gtfs/"$i".genemarks2.gtf >> ../Ecoli_gtfs/genemarks2_gtfs/"$i".genemarks2.filtered.gtf
done
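
The exclusion step relies on `grep -v -w -F`, and the `-w` flag matters: without it, excluding g1 would also drop g10. A small sketch with made-up gene names shows the behaviour. The only deviation from the loops above is that the names are passed through a file (`-f`) rather than a shell variable.

```shell
workdir=$(mktemp -d)
printf 'c1\tprodigal\tCDS\t10\t100\t.\t+\t0\tgene_id "g1";\n'   >  "$workdir/toy.gtf"
printf 'c1\tprodigal\tCDS\t150\t300\t.\t+\t0\tgene_id "g10";\n' >> "$workdir/toy.gtf"
printf 'c1\tprodigal\tCDS\t400\t600\t.\t+\t0\tgene_id "g2";\n'  >> "$workdir/toy.gtf"
# Exclusion list in the genome,gene layout used above (hypothetical entries)
printf 'toy,g1\nother,g2\n' > "$workdir/exclude.csv"

# Select the gene names listed for this genome, then drop matching GTF lines.
# -w makes g1 match only as a whole word, so g10 survives; -F treats names literally.
awk -F',' -v var="toy" '$1 == var {print $2}' "$workdir/exclude.csv" > "$workdir/names.txt"
kept=$(grep -v -w -F -f "$workdir/names.txt" "$workdir/toy.gtf" | cut -f2 -d '"')

echo "$kept"
```

Note that g2 is kept for genome `toy` because its exclusion entry belongs to a different genome.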

For each genome, we combine its filtered prodigal and genemarks2 annotations to yield a nonredundant .gtf file:

In [None]:
for i in $(ls ../Ecoli_gtfs/prodigal_gtfs/*prodigal.filtered.gtf | rev | cut -f4- -d "." | cut -f1 -d "/" | rev)
do
cat ../Ecoli_gtfs/prodigal_gtfs/"$i".prodigal.filtered.gtf ../Ecoli_gtfs/genemarks2_gtfs/"$i".genemarks2.filtered.gtf | #concatenating both .gtf files
#As before, we concatenate the two filtered .gtf files and remove redundancy by retaining the longer gene corresponding to same stop codon in the same contig
awk '{stop = ($7 == "+") ? $5 : $4; feature_length = $5 - $4; key = $1 ":" stop; if (!(key in data) || feature_length > data[key]) {data[key] = feature_length; lines[key] = $0}} END {for (k in lines) {print lines[k]}}' > ../Ecoli_gtfs/"$i".filtered.nr.gtf
done
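
The redundancy removal can be checked on a toy merged file. In this sketch (hypothetical IDs and coordinates), two records share contig c1 and stop coordinate 100, and only the longer one should survive.

```shell
workdir=$(mktemp -d)
# Hypothetical merged annotations: g1_p and g1_g share contig c1 and stop 100
printf 'c1\tprodigal\tCDS\t40\t100\t.\t+\t0\tgene_id "g1_p";\n'    >  "$workdir/merged.gtf"
printf 'c1\tgenemarks2\tCDS\t10\t100\t.\t+\t0\tgene_id "g1_g";\n' >> "$workdir/merged.gtf"
printf 'c1\tprodigal\tCDS\t200\t290\t.\t-\t0\tgene_id "g2_p";\n'   >> "$workdir/merged.gtf"

nr=$(awk -F'\t' '
  { stop = ($7 == "+") ? $5 : $4       # stop coordinate depends on strand
    len = $5 - $4
    key = $1 ":" stop
    if (!(key in best) || len > best[key]) { best[key] = len; line[key] = $0 } }
  END { for (k in line) print line[k] }' "$workdir/merged.gtf" | cut -f2 -d '"' | LC_ALL=C sort)

echo "$nr"
```

The genemarks2 call g1_g (length 90) wins over the prodigal call g1_p (length 60). Because awk's `for (k in line)` iterates in no guaranteed order, the output is sorted here to make it reproducible.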


### **5. Make corresponding CDS and protein files from 44 lineages**

Finally, it's time to:

1.   Extract CDS corresponding to these filtered .gtf files
2.   Convert to proteins
3.   Only retain those from the 450 good genomes corresponding to 44 lineages

In [None]:
#Extract CDS:

for i in $(ls ../Ecoli_gtfs/prodigal_gtfs/*prodigal.filtered.gtf | rev | cut -f4- -d "." | cut -f1 -d "/" | rev)
do
cat ../Ecoli_gtfs/"$i".filtered.nr.gtf | gtf2bed | bedtools getfasta -s -name -fi /stor/work/Ochman/hassan/MS_Ecoli_ORFans_Ch3/Ecoli_genomes/"$i".fasta -bed - > ../Ecoli_cds/"$i".filtered.CDS.faa
done

#Translate to protein:

for i in $(ls ../Ecoli_cds/*filtered.CDS.faa | rev | cut -f4- -d "." | cut -f1 -d "/" | rev)
do
/stor/work/Ochman/hassan/tools/faTrans -stop ../Ecoli_cds/"$i".filtered.CDS.faa ../Ecoli_prot/"$i".filtered.prot.faa
done

#Retain only those proteins annotated in the 450 good genomes, and concatenate them all into one file:

for i in $(awk -F '\t' '($5<45)' /stor/work/Ochman/hassan/Ecoli_pangenome/500_ipp_lineagedesignations.tsv | cut -f1)
do
sed "s/>/>"$i"@/g" ../Ecoli_prot/"$i".filtered.prot.faa >> Ecoli.all.filtered.prot.faa
sed "s/>/>"$i"@/g" ../Ecoli_cds/"$i".filtered.CDS.faa >> Ecoli.all.filtered.CDS.faa
done

#Number of proteins - 2,066,371

## **Clustering proteins**

The next step is to cluster these 2 million+ proteins into gene families based on their similarity, which collapses orthologs and paralogs into the same family.

MMseqs2 linclust is a widely adopted and rapid clustering method. We use --cov-mode 0 with -c 0.8, which requires the alignment to cover at least 80% of the length of both the query and the target. On its own, this step is too conservative: it can fail to cluster proteins that are genuinely similar, leaving them in separate families.

To account for this, we pick the longest sequence from each mmseqs2 cluster as its representative, run an all-vs-all protein blast on these representatives, and then apply an additional clustering step with silix.
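
For intuition about the bidirectional coverage criterion, here is a small sketch of the arithmetic. The helper `covers_both` is hypothetical and simplifies things by assuming one aligned length applies to both sequences; it is not how MMseqs2 computes coverage internally.

```shell
# covers_both ALN_LEN QUERY_LEN TARGET_LEN [MIN_COV]
# Succeeds when the aligned length is at least MIN_COV (default 0.8) of both sequences.
covers_both() {
  awk -v aln="$1" -v q="$2" -v t="$3" -v c="${4:-0.8}" \
    'BEGIN { exit !((aln / q >= c) && (aln / t >= c)) }'
}

covers_both 90 100 105 && echo "90 aa over 100 and 105: clustered"       # 0.90 and about 0.86
covers_both 90 100 200 || echo "90 aa over 100 and 200: not clustered"   # target coverage is 0.45
```

The second case shows why a short protein is not merged with a much longer one under this mode, even when the short protein is almost fully aligned.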

## **1. Clustering by MMseqs2**

In [None]:
#mmseqs + linclust

mmseqs createdb Ecoli.all.filtered.prot.faa Ecoli.all.filtered.prot

mkdir mmseqs_covmode0
cp Ecoli.all.filtered.prot* mmseqs_covmode0/ #Copy the database files into the working directory
cd mmseqs_covmode0

mmseqs search Ecoli.all.filtered.prot Ecoli.all.filtered.prot resultDB tmp --min-seq-id 0.8 -c 0.8 --cov-mode 0
mmseqs convertalis Ecoli.all.filtered.prot Ecoli.all.filtered.prot resultDB resultDB.m8
mmseqs linclust Ecoli.all.filtered.prot clusterDB tmp --min-seq-id 0.8 -c 0.8 --cov-mode 0
mmseqs createtsv Ecoli.all.filtered.prot Ecoli.all.filtered.prot clusterDB Ecoli.all.filtered.prot.clusters.covmode0.tsv

#38,797 clusters (one representative protein each)

#At this stage, I'm setting the similarity and query coverage cutoffs pretty high
#I don't want to be over-enthusiastic in collapsing actually distinct proteins into the same family
#This is just a heuristic step to reduce the number of searches I have to wade through—
#...doesn't necessarily reflect "real" protein families.
#In subsequent phases of this pipeline, if (when) I discover similarities between proteins, we might re-cluster

#Extract longest sequence from each cluster
#For this purpose, first extract lengths
seqkit fx2tab Ecoli.all.filtered.prot.faa | awk -F '\t' '{print $1,length($2)}' | sort -k1 > Ecoli.all.filtered.prot.lengths.tsv

sort -k2 mmseqs_covmode0/Ecoli.all.filtered.prot.clusters.covmode0.tsv | join -1 2 -2 1 - Ecoli.all.filtered.prot.lengths.tsv | #attaching lengths to each gene
sort -k2,2 -k3,3nr | awk '!seen[$2]++' | sort -k2 > representative_sequences.interim.tsv #Sort each cluster by descending length so the first (longest) sequence per cluster is retained

#Make a three-column tsv file which lists the longest sequence per cluster in col1, genes belonging to that cluster in col2, and the representative sequence picked by mmseqs2 in col3
sort -k1 mmseqs_covmode0/Ecoli.all.filtered.prot.clusters.covmode0.tsv | join -1 1 -2 2 -o 1.1 2.2 1.2 2.1 - representative_sequences.interim.tsv | awk '{print $4,$3,$1}' | sed "s/ /\t/g" > Ecoli.all.filtered.prot.clusters.longestrepresentative.tsv

#Get CDS and protein sequences
seqkit fx2tab Ecoli.all.filtered.prot.faa | sed "s/\t$//g" | sed "s/^/>/g" | sed "s/\t/\n/g" > Ecoli.all.filtered.prot.linear.faa
cut -f1 Ecoli.all.filtered.prot.clusters.longestrepresentative.tsv | sort -u | grep --no-group-separator -A1 -F -f - Ecoli.all.filtered.prot.linear.faa > Ecoli.all.filtered.prot.clusters.longestrepresentative.faa
cut -f1 Ecoli.all.filtered.prot.clusters.longestrepresentative.tsv | sort -u | grep --no-group-separator -A1 -F -f - Ecoli.all.filtered.CDS.faa > Ecoli.all.filtered.prot.clusters.longestrepresentative.CDS.faa
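
The longest-per-cluster selection is easy to get backwards, since it depends on the sort direction of the length column. The sketch below runs the same sort/join/awk pattern on a toy cluster table. All names and lengths are invented, and `LC_ALL=C` is set so that sort and join agree on ordering.

```shell
workdir=$(mktemp -d)
# Hypothetical cluster table (representative<TAB>member) and length table (name length)
printf 'repA\trepA\nrepA\tm1\nrepA\tm2\nrepB\trepB\n' > "$workdir/clusters.tsv"
printf 'm1 120\nm2 95\nrepA 110\nrepB 80\n'           > "$workdir/lengths.txt"

longest=$(LC_ALL=C sort -k2,2 "$workdir/clusters.tsv" |
  LC_ALL=C join -1 2 -2 1 - "$workdir/lengths.txt" |   # member representative length
  LC_ALL=C sort -k2,2 -k3,3nr |                         # within a cluster, longest first
  awk '!seen[$2]++ {print $2 "\t" $1}')                 # first row per cluster wins

echo "$longest"
```

In cluster repA the member m1 (120 aa) is longer than the mmseqs2 representative itself (110 aa), so m1 is reported. With an ascending sort (`-k3,3n`) the same pipeline would return the shortest member instead.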

## Debugging

In [None]:


#Footnote: I used a different approach to assign longest sequence per cluster and the results varied in case of 394 proteins
#Debugging this might not be worth it, as I doubt it'll have a large footprint on subsequent steps

#Assign cluster numbers to each representative sequence:
sort -k1 mmseqs_covmode0/Ecoli.all.filtered.prot.clusters.covmode0.tsv | awk 'BEGIN{c=0;prev=""} $1!=prev{c++;prev=$1} {print "cluster_" c, $0}' | awk '{print $1"\t"$2"\t"$3}' | sort -k3 > Ecoli.all.filtered.prot.clusters.covmode0.interim.tsv
#Get all protein lengths:
seqkit fx2tab Ecoli.all.filtered.prot.linear.faa | awk -F '\t' '{print $1,length($2)}' | sort -k1 | sed "s/ /\t/g" > genelengths_interim.tsv
#Join these:
join -1 1 -2 3 genelengths_interim.tsv Ecoli.all.filtered.prot.clusters.covmode0.interim.tsv | sort -k4,4 -k2,2nr | awk '!seen[$4]++' | cut -f1,4 -d " " | sed "s/ /\t/g" | sort -k2 > 1st_longest_2nd_OG.tsv
sort -k2 Ecoli.all.filtered.prot.clusters.covmode0.interim.tsv | join -1 2 -2 2 - 1st_longest_2nd_OG.tsv | awk '{print $4"\t"$3"\t"$1}' | sort -k2 | sed "1s/^/longestrep\tgene\tOGcluster\n/g" > Ecoli.all.filtered.prot.clusters.longestrepresentative.secondshot.tsv
#compare this with previous:
cat Ecoli.all.filtered.prot.clusters.longestrepresentative.tsv | sed "s/\t/,/g" | paste - Ecoli.all.filtered.prot.clusters.longestrepresentative.secondshot.tsv | rev | sed "s/\t/,/1" | sed "s/\t/,/1" | rev | awk '($1!=$2)' > weird
#isolate cases where the longest rep seq differ:
sed "s/\t/,/g" weird  | cut -f1,4 -d ',' | sort -u | cat -n | sed "s/^ *//g" | sed "s/^/group_/g" | sed "s/\t/,/g" > weird.interim
#Hypothesis: second approach picks out longer ones
for i in $(cat weird.interim)
do
echo $i | cut -f2 -d ',' | sed "s/^/>/g" | grep -A1 -F -f - Ecoli.all.filtered.prot.linear.faa | seqkit fx2tab | awk '{print length($2)}' > temp1
echo $i | cut -f3 -d ',' | sed "s/^/>/g" | grep -A1 -F -f - Ecoli.all.filtered.prot.linear.faa | seqkit fx2tab | awk '{print length($2)}' > temp2
paste temp1 temp2 | sed "s/^/"$i"\t/g" >> weird.lengths.csv
rm temp*
done

## **2. Clustering by Silix**

In [None]:
#Let's cluster mmseqs2 clusters with silix:

#Run an all-vs-all blastp with diamond:
/stor/work/Ochman/hassan/E.coli_ORFan/E.coli_ORFan_pipeline_8-10/diamond makedb --in Ecoli.all.filtered.prot.clusters.longestrepresentative.faa --db Ecoli.all.filtered.prot.clusters.longestrepresentative
/stor/work/Ochman/hassan/E.coli_ORFan/E.coli_ORFan_pipeline_8-10/diamond blastp -q Ecoli.all.filtered.prot.clusters.longestrepresentative.faa -d Ecoli.all.filtered.prot.clusters.longestrepresentative --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore --ultra-sensitive --out Ecoli.all.filtered.prot.clusters.longestrepresentative.allvall.tsv -k 0 -b8 -c1

#Parse the results with silix:
#The options -q -2 capture the mmseqs2 logic
#-q -2: coverage of the longer sequence -- meaning both of the sequences need to be covered at 80%
#-s 3: identity is calculated by counting identical positions divided by aligned sequence length
/stor/work/Ochman/hassan/tools/silix-1.3.0/src/silix -f mmseqs2_cluster_ -i 0.8 -r 0.8 -q -2 -s 3 Ecoli.all.filtered.prot.clusters.longestrepresentative.faa Ecoli.all.filtered.prot.clusters.longestrepresentative.allvall.tsv > mmseqs2_recluster_silix_output.tsv

#33,245 clusters—14.3% reduction

#Now to make a replacements file: a two-column clustering file that says which sequence is to be replaced with which
#For each silix cluster, all genes in the cluster are to be replaced with the longest sequence in that cluster
#Extract the sequences in each cluster and place them in files
cut -f1 mmseqs2_recluster_silix_output.tsv | sort | uniq -c | grep -v " 1 " | rev | cut -f1 -d " " | rev | sed "s/$/\t/g" | grep -F -f - mmseqs2_recluster_silix_output.tsv | sed "s/\t/,/g" > mmseqs2_recluster_silix_output.interim.tsv
for i in $(cat mmseqs2_recluster_silix_output.interim.tsv | cut -f1 -d ",")
do
echo $i | sed "s/$/,/g" | grep -F -f - mmseqs2_recluster_silix_output.interim.tsv | cut -f2 -d ',' | sed "s/^/>/g" | grep --no-group-separator -A1 -F -f - ../Ecoli.all.filtered.prot.clusters.longestrepresentative.faa > "$i".seqs.faa
done

#Now to set up two-column replacements, where the sequences in column 2 are to be replaced by those of column1:
#Column1 contains the longest seq per cluster
for i in $(ls mmseqs2_cluster_*.seqs.faa)
do
temp=$(seqkit fx2tab $i | awk -F '\t' '{print $1,length($2)}' | sort -nrk2 | head -1 | cut -f1 -d " ")
grep "^>" $i | tr -d ">" | sed "s/^/"$temp"\t/g" >> replacements.tsv
done



## Debugging

In [None]:
#Why is there a reduction in gene family number when silix is run on mmseqs2 results?
#Are the genes placed in silix clusters actually similar?

#Let's look at the sequences that are being clustered together at this stage
cut -f1 mmseqs2_recluster_silix_output.tsv | sort | uniq -c | grep -v " 1 " | rev | cut -f1 -d " " | rev | sed "s/$/\t/g" | grep -F -f - mmseqs2_recluster_silix_output.tsv | sed "s/\t/,/g" > mmseqs2_recluster_silix_output.interim.tsv

for i in $(cat mmseqs2_recluster_silix_output.interim.tsv | cut -f1 -d ",")
do
echo $i | sed "s/$/,/g" | grep -F -f - mmseqs2_recluster_silix_output.interim.tsv | cut -f2 -d ',' | sed "s/^/>/g" | grep --no-group-separator -A1 -F -f - ../Ecoli.all.filtered.prot.clusters.longestrepresentative.faa > "$i".seqs.faa
done

#When I check the sequences placed in the same cluster, they are similar
#So silix is indeed picking up similar sequences that mmseqs2 missed

#Could this have been an artifact of picking the longest sequences?
#To check, let's re-run silix using the representative sequences suggested by mmseqs2:
mkdir OG_repseq_clustering
cd OG_repseq_clustering
cut -f1 ../mmseqs_covmode0/Ecoli.all.filtered.prot.clusters.covmode0.tsv | sed "s/^/>/g" | grep --no-group-separator -A1 -F -f - ../Ecoli.all.filtered.prot.linear.faa > Ecoli.all.filtered.prot.clusters.OGcluster.faa
/stor/work/Ochman/hassan/E.coli_ORFan/E.coli_ORFan_pipeline_8-10/diamond makedb --in Ecoli.all.filtered.prot.clusters.OGcluster.faa --db Ecoli.all.filtered.prot.clusters.OGcluster
/stor/work/Ochman/hassan/E.coli_ORFan/E.coli_ORFan_pipeline_8-10/diamond blastp -q Ecoli.all.filtered.prot.clusters.OGcluster.faa -d Ecoli.all.filtered.prot.clusters.OGcluster --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore --ultra-sensitive --out Ecoli.all.filtered.prot.clusters.OGcluster.allvall.tsv -k 0 -b8 -c1
/stor/work/Ochman/hassan/tools/silix-1.3.0/src/silix -f mmseqs2_cluster_ -i 0.8 -r 0.8 -q -2 -s 3 Ecoli.all.filtered.prot.clusters.OGcluster.faa Ecoli.all.filtered.prot.clusters.OGcluster.allvall.tsv > mmseqs2_recluster_silix_output.tsv

#33818 clusters—12.8%. Not as drastic a reduction as before, but still a reduction. So the Silix-mediated reduction of gene family number is not an artifact
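
The two reduction percentages quoted above can be reproduced with a one-liner. This sketch assumes 38,797 (the mmseqs2 count) is the starting number of families in both comparisons; `pct_reduction` is a hypothetical helper name.

```shell
# pct_reduction BEFORE AFTER: percentage drop from BEFORE to AFTER, one decimal place
pct_reduction() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", 100 * (before - after) / before }'
}

pct_reduction 38797 33245   # longest-sequence representatives: prints 14.3
pct_reduction 38797 33818   # mmseqs2-chosen representatives: prints 12.8
```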
