vpbrendel edited this page Jul 28, 2019 · 2 revisions

I ran locally obtained data files through fidibus and got an empty *.prot.fa file. That does not seem right. What is the matter and how can I fix it?

Yes, you are right to suspect that something is amiss. The most likely cause is a mismatch between your input files in how the protein models are named. Let's illustrate with one of our own examples.

We are interested in the Daphnia pulex genome/annotation as deposited at JGI. We create a working directory and download the relevant files like this:

mkdir ~/DPWORK
cd ~/DPWORK
curl -O http://genome.jgi.doe.gov/Dappu1/download/Daphnia_pulex.fasta.gz
curl -O http://genome.jgi.doe.gov/Dappu1/download/FrozenGeneCatalog20110204.gff3.gz
curl -O http://genome.jgi.doe.gov/Dappu1/download/FrozenGeneCatalog20110204.proteins.fasta.gz
singularity pull --name aegean.simg shub://BrendelGroup/AEGeAn

and then run our fidibus analysis:

singularity exec -e -B ~/DPWORK aegean.simg fidibus \
        --workdir=./ \
        --numprocs=2 \
        --local \
        --label=Dpul \
        --gdna=Daphnia_pulex.fasta.gz \
        --gff3=FrozenGeneCatalog20110204.gff3.gz \
        --prot=FrozenGeneCatalog20110204.proteins.fasta.gz \
        download prep iloci breakdown stats

only to find that the resulting Dpul.prot.fa file is empty:

[]wc -l Dpul/*prot*
  190120 Dpul/Dpul.all.prot.fa
   30615 Dpul/Dpul.protein2ilocus.repr.tsv
   30811 Dpul/Dpul.protein2ilocus.tsv
       0 Dpul/Dpul.prot.fa
   30614 Dpul/Dpul.protids.txt
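
A first diagnostic step is to look at how the proteins are named in the two input files. The following is a minimal sketch and not part of fidibus: `show_names` is a hypothetical helper, and it assumes that the GFF3 attributes are separated by ";" and carry `Name=` tags.

```shell
# Hypothetical helper (not part of fidibus or AEGeAn): print the first few
# protein names from a gzipped GFF3 file and a gzipped protein FASTA file,
# so that differences in naming are easy to spot by eye.
show_names() {
    gff3=$1    # gzipped GFF3 annotation file
    prot=$2    # gzipped protein FASTA file
    echo "FASTA headers:"
    gunzip -c "$prot" | grep '^>' | head -n 3
    echo "GFF3 Name= values:"
    # Assumption: attributes are ";"-separated, so a value ends at ";" or end of line
    gunzip -c "$gff3" | grep -o 'Name=[^;]*' | head -n 3
}
```

For the files used here, this would be called as `show_names FrozenGeneCatalog20110204.gff3.gz FrozenGeneCatalog20110204.proteins.fasta.gz`.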

It takes a bit of detective work, but in the end the problem is a mismatch between the protein names in the *.gff3 file and those in the *proteins.fasta file: compare the FASTA headers in *proteins.fasta with the Name=* tags in the GFF3 file. Even after removing the jgi|Dappu1|| prefix from the FASTA headers, problems remain with names that contain "|" (never a good idea in Linux ...). We can fix all of those issues with

gunzip -c FrozenGeneCatalog20110204.gff3.gz > FrozenGeneCatalog20110204.gff3
sed -e "s/|/./g" FrozenGeneCatalog20110204.gff3 > FrozenGeneCatalog20110204FIXED.gff3
gzip FrozenGeneCatalog20110204FIXED.gff3
\rm FrozenGeneCatalog20110204.gff3

gunzip -c FrozenGeneCatalog20110204.proteins.fasta.gz > FrozenGeneCatalog20110204.proteins.fasta
sed -e "s/^>jgi|Dappu1|[^|]*|/>/" FrozenGeneCatalog20110204.proteins.fasta | sed -e "s/|/./g" > FrozenGeneCatalog20110204FIXED.proteins.fasta
gzip FrozenGeneCatalog20110204FIXED.proteins.fasta
\rm FrozenGeneCatalog20110204.proteins.fasta

and now

singularity exec -e -B ~/DPWORK aegean.simg fidibus \
        --workdir=./ \
        --numprocs=2 \
        --local \
        --label=Dpul \
        --gdna=Daphnia_pulex.fasta.gz \
        --gff3=FrozenGeneCatalog20110204FIXED.gff3.gz \
        --prot=FrozenGeneCatalog20110204FIXED.proteins.fasta.gz \
        download prep iloci breakdown stats

works correctly:

[]wc -l Dpul/*prot*
  190120 Dpul/Dpul.all.prot.fa
   30615 Dpul/Dpul.protein2ilocus.repr.tsv
   30811 Dpul/Dpul.protein2ilocus.tsv
  189210 Dpul/Dpul.prot.fa
   30614 Dpul/Dpul.protids.txt
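
For your own data, a quick check before running fidibus is to count how many protein names occur in both input files. The sketch below is only an approximation of what fidibus needs: `count_shared_names` is a hypothetical helper, it collects every `Name=` value in the GFF3 file regardless of feature type, and it takes the FASTA name to be the header text up to the first whitespace.

```shell
# Hypothetical helper (not part of fidibus or AEGeAn): count the names that
# appear both as Name= values in a gzipped GFF3 file and as sequence names
# in a gzipped protein FASTA file.
count_shared_names() {
    gff3_names=$(mktemp)
    prot_names=$(mktemp)
    # Assumption: GFF3 attributes are ";"-separated
    gunzip -c "$1" | grep -o 'Name=[^;]*' | sed -e 's/^Name=//' | sort -u > "$gff3_names"
    # FASTA name = header without ">" and without anything after the first whitespace
    gunzip -c "$2" | grep '^>' | sed -e 's/^>//' -e 's/[[:space:]].*$//' | sort -u > "$prot_names"
    # comm -12 prints only the lines common to both sorted lists
    comm -12 "$gff3_names" "$prot_names" | wc -l | tr -d ' '
    rm -f "$gff3_names" "$prot_names"
}
```

Called as `count_shared_names FrozenGeneCatalog20110204FIXED.gff3.gz FrozenGeneCatalog20110204FIXED.proteins.fasta.gz`, a count of zero or near zero would point to the kind of naming mismatch described above.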