Setting up blast databases in TBro in Docker in Amazon AWS Lightsail #49

Open
000generic opened this issue Apr 30, 2017 · 14 comments

@000generic

000generic commented Apr 30, 2017

Hi! Thanks for your help on the last two issues :)

I'm now having trouble running phing to set up the blast databases.

From the documentation, I'm not sure whether I should run phing at the TBro command line or at the Ubuntu command line. At the TBro command line, I can neither run nor install phing. At the Ubuntu command line, I can install phing, but it does not run successfully.

Also, I don't know how to locate TBro directories when I am not at the TBro command line. Is it possible to enter TBro directories when I am at the Ubuntu command line?

I'm also unsure what you mean by "main TBro directory" in the documentation. Is this the default directory when I start up TBro? Or the directory I created to store my data in?

Details follow:

In TBro:

First I should move to my "main TBro directory" - I am guessing this is the directory I created to store all my data in when I set up TBro...?

cd /squid

I then follow TBro documentation instructions but the phing command is not found when run at the TBro command line:

root@9b953c8ae04e:/# phing queue-install-db
bash: phing: command not found

When I try to install phing I get the following error:

sudo apt-get update
sudo apt-get install phing

Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package phing

so I'm not sure how to install phing at the TBro command line.

Outside TBro

I am able to install phing at the Ubuntu command line but I don't know how to locate my main TBro directory inside of Docker - is it possible to enter TBro directories from outside Docker/TBro? When I run the required phing command in different folders I get the following error:

ubuntu@ip-172-26-13-108:/$ phing queue-install-db
Buildfile: build.xml does not exist!

I'm not sure what this error refers to, but maybe it's related to being in the wrong directory?

So I'm not sure 1) what directory I should be running phing in, 2) whether I should run phing at the Ubuntu or the TBro command line, and 3) if I should run it in TBro, how to install it there.

Any suggestions would be greatly appreciated.

Thank-you

@phryneas
Contributor

You should run phing from inside the TBro main container. If you installed everything according to the documentation, you can enter that container using docker exec -it TBro_official /bin/bash, and that container should already contain an installation of phing.

If it is still missing, you should be able to install it using composer global require phing/phing

Inside that container, running phing database-initialize should be possible from /home/tbro

In any case, the folder to run phing from is the folder that contains build.xml.

@000generic
Author

000generic commented Apr 30, 2017

I tried to follow your installation directions closely when I installed TBro, and I have reinstalled repeatedly, but I always get stuck at the phing command.

Phing appears to be installed but a 'command not found' error is given when I try to run phing in the TBro directory that has the build.xml file - or in any directory in TBro.

Following your directions above:

ubuntu@ip-172-26-13-108:~$ docker exec -it TBro_official /bin/bash

root@9b953c8ae04e:/# phing queue-install-db
bash: phing: command not found

root@9b953c8ae04e:/# composer global require phing/phing
Changed current directory to /root/.composer
Running composer as root/super user is highly discouraged as packages, plugins and scripts cannot always be trusted
Using version ^2.16 for phing/phing
./composer.json has been updated
Loading composer repositories with package information
Updating dependencies (including require-dev)
Nothing to install or update
Generating autoload files

root@9b953c8ae04e:/# cd home/tbro/
root@9b953c8ae04e:/home/tbro# phing database-initialize
bash: phing: command not found

root@9b953c8ae04e:/home/tbro# ls
INSTALLATION build.xml doc src
README.md build_installation.sh enable_AllowOverride_Apache2.sed test
build.properties composer.json phpunit.xml update_config.sed
build.properties.example composer.lock queue_config.example.sql update_installation.sh

root@9b953c8ae04e:/home/tbro# phing queue-install-db
bash: phing: command not found

...I just realized the file phing will generate is already in the directory - queue_config.example.sql
I'm not sure how it was generated, as I still haven't gotten phing to work but I'll try working with it.

...it looks like queue_config.example.sql was generated when things were built by docker exec -i -t TBro_official /home/tbro/build_installation.sh

screen output:

Buildfile: /home/tbro/build.xml
[property] Loading /home/tbro/./build.properties

tbro > queue-install-db:

 [copy] Copying 1 file to /home/tbro
 [echo] an example configuration has been copied to /home/tbro/queue_config.example.sql!
 [echo] modify it to your needs and load it into your blast database

BUILD FINISHED

Total time: 0.4795 seconds

@000generic
Author

000generic commented May 1, 2017

Two new questions:

1) To move zipped blast database files into my docker container using

curl --data-binary --ftp-pasv --user "$WORKERFTP_FTP_USER":"$WORKERFTP_FTP_PW" -T cannabis_sativa_transcriptome.zip ftp://$WORKERFTP_IP/

how can I determine what the values of the three variables

$WORKERFTP_FTP_USER
$WORKERFTP_FTP_PW
$WORKERFTP_IP

are for my Docker container?

When I run set in TBro, nothing shows up for the three variables:

root@dddabdc84640:/# set | grep WORKERFTP
WORKERFTP_ENV_FTP_PW=ftp
WORKERFTP_ENV_FTP_USER=tbro
WORKERFTP_NAME=/TBro_official/WORKERFTP
WORKERFTP_PORT=tcp://172.17.0.4:21
WORKERFTP_PORT_21_TCP=tcp://172.17.0.4:21
WORKERFTP_PORT_21_TCP_ADDR=172.17.0.4
WORKERFTP_PORT_21_TCP_PORT=21
WORKERFTP_PORT_21_TCP_PROTO=tcp

so I'm not sure where to find values for them.

I tried

$WORKERFTP_ENV_FTP_USER
$WORKERFTP_ENV_FTP_PW
$WORKERFTP_PORT_21_TCP_ADDR

in the curl command but it didn't seem to work:

curl --data-binary --ftp-pasv --user “tbro”:”ftp” -T blastdb-Harvard-AA.zip ftp://172.17.0.4/

curl: (67) Access denied: 530 (when run from Ubuntu)
curl: (6) Could not resolve host: tbro (when run from TBro)

2) How do I "run the queue_config.sql commands in your queue database." ?

Thank-you!

@phryneas
Contributor

phryneas commented May 1, 2017

I think I'll ping @greatfireball or @iimog on this, this is getting too specialized with the setup for me now, as they created the docker containers.

@iimog
Member

iimog commented May 3, 2017

Hi @000generic,
sorry for the confusion. I think the documentation needs some serious improvements. First of all, "the main TBro directory" is indeed /home/tbro/, i.e. the directory containing the source code (I will clarify that in the docs). phing is installed via composer, so it is available in ~/.composer/vendor/bin. This directory is added to the path via ~/.bash_profile, which is apparently not loaded when entering the container. You can fix that by entering either:
source ~/.bash_profile
or
export PATH=~/.composer/vendor/bin:$PATH
But anyway, you are right: phing queue-install-db is already executed when you follow the installation instructions (by build_installation.sh).
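The PATH fix above can be wrapped in a small helper so that it is safe to run more than once. This is only a sketch: the function name ensure_composer_bin_on_path is made up for illustration, and it assumes Composer's global bin directory is $HOME/.composer/vendor/bin, as described in this comment.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TBro): prepend Composer's global bin
# directory to PATH exactly once.
# Assumption: the directory is $HOME/.composer/vendor/bin.
ensure_composer_bin_on_path() {
    local bin_dir="$HOME/.composer/vendor/bin"
    case ":$PATH:" in
        *":$bin_dir:"*) ;;                  # already on PATH, nothing to do
        *) export PATH="$bin_dir:$PATH" ;;  # prepend the Composer bin directory
    esac
}

ensure_composer_bin_on_path

# Report whether phing can now be found, without aborting the shell.
if command -v phing >/dev/null 2>&1; then
    echo "phing found at: $(command -v phing)"
else
    echo "phing is still not on PATH; check that composer installed it" >&2
fi
```

Sourcing this inside the container after docker exec should have the same effect as the export line above, without adding duplicate PATH entries on repeated runs.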

You are also right regarding the environment variables (I will update them in the docs). However, the curl command should work from TBro. Can you please try again this one:

curl --data-binary --ftp-pasv --user $WORKERFTP_ENV_FTP_USER:$WORKERFTP_ENV_FTP_PW -T blastdb-Harvard-AA.zip ftp://"$WORKERFTP_PORT_21_TCP_ADDR"/
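Since the upload fails in confusing ways when one of these variables is empty, it can help to check them first. The sketch below only prints the curl command from this comment instead of running it. The function names require_env and ftp_upload_command are hypothetical, and the variables are assumed to be exported in the container environment, as the set | grep WORKERFTP output earlier in this thread suggests.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of TBro).

# Return non-zero if any of the named environment variables is unset or empty.
require_env() {
    local name missing=0
    for name in "$@"; do
        if [ -z "$(printenv "$name")" ]; then
            echo "missing environment variable: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Print (do not run) the upload command for the zip file given as $1.
ftp_upload_command() {
    require_env WORKERFTP_ENV_FTP_USER WORKERFTP_ENV_FTP_PW \
                WORKERFTP_PORT_21_TCP_ADDR || return 1
    printf 'curl --data-binary --ftp-pasv --user %s:%s -T %s ftp://%s/\n' \
        "$WORKERFTP_ENV_FTP_USER" "$WORKERFTP_ENV_FTP_PW" \
        "$1" "$WORKERFTP_PORT_21_TCP_ADDR"
}

# Example (inside the TBro container):
#   ftp_upload_command blastdb-Harvard-AA.zip
```

Printing the command first makes it easy to see which values are actually substituted before anything is sent to the FTP container.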

To import the content of the queue_config.sql file into the queue database execute (from TBro):

PGPASSWORD=$WORKER_ENV_DB_PW psql -U $WORKER_ENV_DB_USER -h $WORKER_PORT_5432_TCP_ADDR -p $WORKER_PORT_5432_TCP_PORT <queue_config.sql

@000generic
Author

000generic commented May 3, 2017

Getting closer....

I was able to run both the curl and PGPASSWORD commands successfully now - but nothing is showing up in TBro as a blast database to blast against. Specifically, I did the following:

cd /sono/peptides # this is where I placed my zipped blast databases
curl --data-binary --ftp-pasv --user tbro:ftp -T blastdb-barnacle-AA.zip ftp://172.17.0.4/
curl --data-binary --ftp-pasv --user tbro:ftp -T blastdb-barnacle-TR.zip ftp://172.17.0.4/

cd /home/tbro
mv queue_config.example.sql queue_config.sql
nano queue_config.sql

-- database files available. name is the name it will be referenced by, md5 is the zip file's sum, download_uri specifies where the file can be retrieved
INSERT INTO database_files
(name, md5, download_uri) VALUES
('blastdb-barnacle-AA', '50e7cb5a77f37641a648edc59abcc11a', 'ftp://172.17.0.4/blastdb-barnacle-AA.zip'),
('blastdb-barnacle-TR', '7fc500cce7bb9ac925c39e5d1f986640', 'ftp://172.17.0.4/blastdb-barnacle-TR.zip');

...etc

-- contains information which program is available for which database.
-- additionally, 'availability_filter' can be used to e.g. restrict use for a organism-release combination
INSERT INTO program_database_relationships
(programname, database_name, availability_filter) VALUES
('blastn','blastdb-barnacle-TR', 'barnacle-T1'),
('blastp','blastdb-barnacle-AA', 'barnacle-T1'),
('blastx','blastdb-barnacle-AA', 'barnacle-T1'),
('tblastn','blastdb-barnacle-TR', 'barnacle-T1'),
('tblastx','blastdb-barnacle-TR', 'barnacle-T1');

...etc

PGPASSWORD=worker psql -U worker -h 172.17.0.3 -p 5432 <queue_config.sql

I then tried to blast in TBro but no databases were offered as an option.

@iimog
Member

iimog commented May 4, 2017

So sorry, another lack of documentation.
Whether a database shows up in TBro depends only on queue_config.sql, specifically on the program_database_relationships section. Here the availability_filter is key (and totally undocumented).
This column decides which blast database is shown for which organism and release. Its format is {organism_id}_{release}. For the demo data the organism_id is "13" and the release is "1.CasaPuKu", so for the blast db to show up the availability_filter has to be set to 13_1.CasaPuKu. If "barnacle-T1" is your release and 14 is your organism_id (you can check with tbro-db organism list), you have to change the availability filter in queue_config.sql to 14_barnacle-T1.
To import this file into the database again, you have to remove all sections except program_database_relationships (otherwise you get errors due to duplicate key values violating unique constraints).
I will add documentation for the availability_filter column both to the example sql file and to the documentation on readthedocs.
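To make the {organism_id}_{release} format concrete, here is a small sketch that builds the filter value and a matching row for program_database_relationships. The function names are illustrative and not part of TBro, and the values are not SQL-escaped, so this assumes names without quote characters.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of TBro).

# Build the availability_filter value: {organism_id}_{release}
availability_filter() {
    # $1 = organism_id, $2 = release
    printf '%s_%s\n' "$1" "$2"
}

# Build one VALUES row for program_database_relationships.
# No SQL escaping is done; names must not contain quotes.
relationship_row() {
    # $1 = program, $2 = database name, $3 = organism_id, $4 = release
    printf "('%s','%s', '%s')\n" "$1" "$2" "$(availability_filter "$3" "$4")"
}

availability_filter 13 1.CasaPuKu
# prints: 13_1.CasaPuKu
relationship_row blastn blastdb-barnacle-TR 14 barnacle-T1
# prints: ('blastn','blastdb-barnacle-TR', '14_barnacle-T1')
```

The organism_id of 14 is only the example value used above; the real id should be looked up with tbro-db organism list.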

Thank you very much for your endurance and for reporting all the problems. This helps a lot in improving the documentation.

iimog added a commit to TBroTeam/Tutorial that referenced this issue May 4, 2017
@000generic
Author

000generic commented May 4, 2017

Great! Now the blast databases are showing up in TBro - Thank-you :)

....but I think I have to correct the uri I am giving TBro, which I had guessed at after curling my zipped Blast database files into Docker.

curl --data-binary --ftp-pasv --user tbro:ftp -T blastdb-barnacle-AA.zip ftp://172.17.0.4/
curl --data-binary --ftp-pasv --user tbro:ftp -T blastdb-barnacle-TR.zip ftp://172.17.0.4/

When I then configure the queue_config.sql file with:

('barnacle-AA4', '50e7cb5a77f37641a648edc59abcc11a', 'ftp://172.17.0.4/blastdb-barnacle-AA.zip'),
('barnacle-TR4', '7fc500cce7bb9ac925c39e5d1f986640', 'ftp://172.17.0.4/blastdb-barnacle-TR.zip');

TBro throws an error:

There has been an error processing your job. Please review your job. If this keeps happening, notify the administrator.

These errors occured:
BLAST Database error: No alias or index file found for protein database [/tmp/queue-worker//barnacle-AA4.50e7cb5a77f37641a648edc59abcc11a/barnacle-AA4] in search path [/tmp/queue-worker::]

and when I configure the queue_config.sql file with:

('barnacle-AA5', '50e7cb5a77f37641a648edc59abcc11a', 'http://172.17.0.4/blastdb-barnacle-AA.zip'),
('barnacle-TR5', '7fc500cce7bb9ac925c39e5d1f986640', 'http://172.17.0.4/blastdb-barnacle-TR.zip');

TBro seems to hang up:

Blast Results

Your job is currently being processed. Please wait a moment.
This page will refresh in 2 seconds.

The page does an initial refresh saying it is one of one in queue - and then doesn't seem to refresh any more - and remains stalled after many minutes.

@iimog
Member

iimog commented May 4, 2017

OK, now we are really closing in on this. The blastdb is visible in TBro and the download of the zip file seems to work as well. The ftp configuration is the correct one. The problem now is that after unpacking the zip file the blastdb files are not found. How are those named?
TBro expects the blastdb files in the zip to have the same name as the name column in the database_files table, so in your case (this is barnacle-AA3 and barnacle-TR3, right?) TBro will look for the files barnacle-AA3.phr, barnacle-AA3.pin and barnacle-AA3.psq in your zip file. If they are named differently, they will not be found.
My suggestion: first clean up the old values from database_files and program_database_relationships table by executing this command:

PGPASSWORD=$WORKER_ENV_DB_PW psql -U $WORKER_ENV_DB_USER -h $WORKER_PORT_5432_TCP_ADDR -p $WORKER_PORT_5432_TCP_PORT -d $WORKER_ENV_DB_NAME -c 'TRUNCATE database_files CASCADE'

Use with care, as it will also remove all past and present blast jobs from the database.

Then re-import an sql file containing the two sections for database_files and program_database_relationships, with the name column fixed so that it corresponds to the name of the blastdb files (without the .p* or .n* ending). If you verify that it works, I will update the documentation here as well.
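The naming rule can be checked before uploading anything. The sketch below compares a list of file names against the name that will go into database_files. The function check_blastdb_names is hypothetical, and it only handles single-volume databases, where the part before the last dot is the database name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TBro): verify that every blastdb file
# is called <name>.<ext>, where <name> is the database_files name.
check_blastdb_names() {
    local expected="$1" file base status=0
    shift
    for file in "$@"; do
        base="${file##*/}"   # drop any leading directory
        base="${base%.*}"    # drop the final extension (.phr, .nin, ...)
        if [ "$base" != "$expected" ]; then
            echo "mismatch: $file (expected prefix $expected)" >&2
            status=1
        fi
    done
    return "$status"
}

# These names match the database name, so the check passes.
check_blastdb_names barnacle-AA barnacle-AA.phr barnacle-AA.pin barnacle-AA.psq \
    && echo "names OK"
```

In practice the file list would come from the zip archive itself, for example from a listing tool such as zipinfo, if one is available in the container.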

@000generic
Author

000generic commented May 4, 2017

Genius! That works great - now I am blasting against my blast databases :)

....however, while the blast hits show visual alignments with many good hits, they are not showing any isoform information (instead, under the Name column in the blast report, the hits all say 'No') and the link in 'No' just goes to the TBro landing page.

[Screenshot: 2017-05-04, 6:14:22 am]

This is true even when blasting, for instance, a protein that was used in building the blast database. When I search the same protein in TBro based on its id, the protein is returned as an isoform with a link that takes me to its TBro webpage. I checked, and the identifiers used in 1) the imported fasta files, 2) the imported identifiers, 3) the imported .tbl files, and 4) the fastas used to build the blast databases are all the same. For instance:

barnacle-ee100-aa

[Screenshot: 2017-05-04, 6:45:05 am]

So, it seems like the blast job is successful but the hits generated are not linking back to the TBro databases. I'll try rebuilding TBro again from the ground up but most likely I need to modify something somewhere along the way.

Once we have all this worked out, I can generalize the steps and provide them to you - or post in GitHub etc. I think the combination of free/cheap easy up/easy down Amazon cloud + TBro is really great. Rather than a long-term repository, it's often useful to make things available to collaborators (or myself), with many updates, for just a few days to months, and I think the Amazon/Docker/TBro combo is going to be a great way to do this. There is already growing interest from others here at the Marine Biological Laboratory.

@iimog
Member

iimog commented May 4, 2017

Nice! Happy to hear that it is finally working. The issue with showing "No" in the Name column is indeed very strange. TBro tries to map the name of the blast hit to an internal ID but even if it fails it should still show the original ID of the hit. This ID is parsed out of the Blast result xml. Something seems to go wrong there. Would you mind sharing the xml result? You can get this by calling the webservice directly via:
http://<your-tbro-machine>/ajax/queue/job_results?jobid=<your-jobid> replacing both your-tbro-machine and your-jobid with the respective values. The jobid is the one you get when starting a blast job.
If you do not want to share this file you can have a look yourself. In the <Hit_def> tag the first word is assumed to be the ID. For an example blast job on the public instance a Hit_def line might look like this:

<Hit_def>cds.comp234028_c1.1_seq4|m.808277 comp234028_c1.1_seq4|g.808277  ORF comp234028_c1.1_seq4|g.808277 comp234028_c1.1_seq4|m.808277 type:complete len:725 (+) comp234028_c1.1_seq4:254-2428(+)</Hit_def>

How does a Hit_def line look in your blast result xml?
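To look at this without reading the whole file, the first word of each <Hit_def> can be pulled out with sed. This is a rough sketch of the rule described above, not TBro's actual parser. The function name hit_ids_from_hit_def is made up, and it assumes each <Hit_def> element sits on a single line of the XML.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not TBro's parser): print the first word of every
# <Hit_def> element found on stdin, one per line.
hit_ids_from_hit_def() {
    # Capture everything after <Hit_def> up to the first space or '<'.
    sed -n 's:.*<Hit_def>\([^ <]*\).*:\1:p'
}

printf '%s\n' '<Hit_def>cds.comp234028_c1.1_seq4|m.808277 comp234028_c1.1_seq4|g.808277  ORF</Hit_def>' \
    | hit_ids_from_hit_def
# prints: cds.comp234028_c1.1_seq4|m.808277
```

By the same rule, a definition line such as "No definition line" would yield just "No" as the id.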

I hope to sort out this last problem as well. A step by step guide for TBro on AWS would be really cool. If you don't mind I'd suggest including it as a separate section in the official documentation. Your contribution in improving and disseminating TBro is very much appreciated.

iimog added a commit that referenced this issue May 4, 2017
iimog added a commit that referenced this issue May 4, 2017
iimog added a commit to TBroTeam/Tutorial that referenced this issue May 4, 2017
@000generic
Author

Sure - here is the xml file. It looks like 'No' is short for 'No definition line' - I'll try rebuilding things and see if I can get lucky and solve anything.

{
"job_status": "PROCESSED",
"additional_data": {
"organism": "16",
"release": "barnacle-T1"
},
"processed_results": [
{
"query": ">barnacle-ee100\nTTAGGAGCAAATGAAAAGAAGAAAGCTGGAAAAAGAGGCAGATCTGCAGCGAATAATTTTCTTTTAAACACAAAATCCCGATAAAACCACACGATGGACAGGTTTGGGCCGTTTACAAAGCAGAACATCTCCCGAGGAAAACACTCCGCAACCGAGCGAAGGTCTTGGCATGGGGAATGACGAGCGCTTGGAATTTGCAAAATTTGCACAATGTGTCTGAGAAACAGACGTCCGACACATTCTGCTACCATATGATGCGAAAAATTTACTCCTGGCACTTCCCATTTGTTCAGGAATGGGGTTGTTTTGAAAAGGAAAATGGTGCTTGGGAGGTCGGCGCCAGTCATTCATGAAGGAGTTAGCGCCAGAGAAGACCTACAAAATACTCACGGATGGTGTCGATGCAAGCTGTCGGCTTTCAGGGAGAAGCGGATGTTGCCGGGGAGCTCGTCAAACCTAAATCCGAC",
"status": "PROCESSED",
"result": "\n\n\n <BlastOutput_program>blastn</BlastOutput_program>\n <BlastOutput_version>BLASTN 2.2.28+</BlastOutput_version>\n <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>\n <BlastOutput_db>/tmp/queue-worker//blastdb-barnacle-TR.7fc500cce7bb9ac925c39e5d1f986640/blastdb-barnacle-TR</BlastOutput_db>\n <BlastOutput_query-ID>Query_1</BlastOutput_query-ID>\n <BlastOutput_query-def>barnacle-ee100</BlastOutput_query-def>\n <BlastOutput_query-len>467</BlastOutput_query-len>\n <BlastOutput_param>\n \n <Parameters_expect>0.1</Parameters_expect>\n <Parameters_sc-match>2</Parameters_sc-match>\n <Parameters_sc-mismatch>-3</Parameters_sc-mismatch>\n <Parameters_gap-open>5</Parameters_gap-open>\n <Parameters_gap-extend>2</Parameters_gap-extend>\n <Parameters_filter>L;m;</Parameters_filter>\n </Parameters>\n </BlastOutput_param>\n<BlastOutput_iterations>\n\n <Iteration_iter-num>1</Iteration_iter-num>\n <Iteration_query-ID>Query_1</Iteration_query-ID>\n <Iteration_query-def>barnacle-ee100</Iteration_query-def>\n <Iteration_query-len>467</Iteration_query-len>\n<Iteration_hits>\n\n <Hit_num>1</Hit_num>\n <Hit_id>barnacle-ee100</Hit_id>\n <Hit_def>No definition line</Hit_def>\n <Hit_accession>barnacle-ee100</Hit_accession>\n <Hit_len>467</Hit_len>\n <Hit_hsps>\n \n <Hsp_num>1</Hsp_num>\n <Hsp_bit-score>843.46</Hsp_bit-score>\n <Hsp_score>934</Hsp_score>\n <Hsp_evalue>0</Hsp_evalue>\n <Hsp_query-from>1</Hsp_query-from>\n <Hsp_query-to>467</Hsp_query-to>\n <Hsp_hit-from>1</Hsp_hit-from>\n <Hsp_hit-to>467</Hsp_hit-to>\n <Hsp_query-frame>1</Hsp_query-frame>\n <Hsp_hit-frame>1</Hsp_hit-frame>\n <Hsp_identity>467</Hsp_identity>\n <Hsp_positive>467</Hsp_positive>\n <Hsp_gaps>0</Hsp_gaps>\n <Hsp_align-len>467</Hsp_align-len>\n 
<Hsp_qseq>TTAGGAGCAAATGAAAAGAAGAAAGCTGGAAAAAGAGGCAGATCTGCAGCGAATAATTTTCTTTTAAACACAAAATCCCGATAAAACCACACGATGGACAGGTTTGGGCCGTTTACAAAGCAGAACATCTCCCGAGGAAAACACTCCGCAACCGAGCGAAGGTCTTGGCATGGGGAATGACGAGCGCTTGGAATTTGCAAAATTTGCACAATGTGTCTGAGAAACAGACGTCCGACACATTCTGCTACCATATGATGCGAAAAATTTACTCCTGGCACTTCCCATTTGTTCAGGAATGGGGTTGTTTTGAAAAGGAAAATGGTGCTTGGGAGGTCGGCGCCAGTCATTCATGAAGGAGTTAGCGCCAGAGAAGACCTACAAAATACTCACGGATGGTGTCGATGCAAGCTGTCGGCTTTCAGGGAGAAGCGGATGTTGCCGGGGAGCTCGTCAAACCTAAATCCGAC</Hsp_qseq>\n <Hsp_hseq>TTAGGAGCAAATGAAAAGAAGAAAGCTGGAAAAAGAGGCAGATCTGCAGCGAATAATTTTCTTTTAAACACAAAATCCCGATAAAACCACACGATGGACAGGTTTGGGCCGTTTACAAAGCAGAACATCTCCCGAGGAAAACACTCCGCAACCGAGCGAAGGTCTTGGCATGGGGAATGACGAGCGCTTGGAATTTGCAAAATTTGCACAATGTGTCTGAGAAACAGACGTCCGACACATTCTGCTACCATATGATGCGAAAAATTTACTCCTGGCACTTCCCATTTGTTCAGGAATGGGGTTGTTTTGAAAAGGAAAATGGTGCTTGGGAGGTCGGCGCCAGTCATTCATGAAGGAGTTAGCGCCAGAGAAGACCTACAAAATACTCACGGATGGTGTCGATGCAAGCTGTCGGCTTTCAGGGAGAAGCGGATGTTGCCGGGGAGCTCGTCAAACCTAAATCCGAC</Hsp_hseq>\n <Hsp_midline>|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||</Hsp_midline>\n </Hsp>\n </Hit_hsps>\n</Hit>\n\n <Hit_num>2</Hit_num>\n <Hit_id>barnacle-ee232648</Hit_id>\n <Hit_def>No definition line</Hit_def>\n <Hit_accession>barnacle-ee232648</Hit_accession>\n <Hit_len>493</Hit_len>\n <Hit_hsps>\n \n <Hsp_num>1</Hsp_num>\n <Hsp_bit-score>547.707</Hsp_bit-score>\n <Hsp_score>606</Hsp_score>\n <Hsp_evalue>7.56147e-155</Hsp_evalue>\n <Hsp_query-from>68</Hsp_query-from>\n <Hsp_query-to>467</Hsp_query-to>\n <Hsp_hit-from>3</Hsp_hit-from>\n <Hsp_hit-to>406</Hsp_hit-to>\n 
<Hsp_query-frame>1</Hsp_query-frame>\n <Hsp_hit-frame>1</Hsp_hit-frame>\n <Hsp_identity>364</Hsp_identity>\n <Hsp_positive>364</Hsp_positive>\n <Hsp_gaps>4</Hsp_gaps>\n <Hsp_align-len>404</Hsp_align-len>\n <Hsp_qseq>ACACAAAATCCCGATAAAACCACACGATGGACAGGTTTGGGCCGTTTACAAAGCAGAACATCTCCCGAGGAAAACACTCCGCAACCGAGCGAAGGTCTTGGCATGGGGAATGACGAGCGCTTGGAATTTGCAAAATTTGCACAATGTGTCTGAGAAACAGACGTCCGACACATTCTGCTACCATATGATGCGAAAAATTTACTCCTGGCACTTCCCATTTGTTCAGGAATGGGGTTGTTTTGAAAAGGAAAATGG----TGCTTGGGAGGTCGGCGCCAGTCATTCATGAAGGAGTTAGCGCCAGAGAAGACCTACAAAATACTCACGGATGGTGTCGATGCAAGCTGTCGGCTTTCAGGGAGAAGCGGATGTTGCCGGGGAGCTCGTCAAACCTAAATCCGAC</Hsp_qseq>\n <Hsp_hseq>ACACAAAATCCGGACAAAACCGCACGATGGACATGTTTTGGTCGTTCACAAAGCAGAACCTCTCCTGAGGAAAACACTCCGGAACCTAGCGAAGGTCTTGGCATGGGAAATGACGAGCGTTTGGGATTTGCAAAATTTGCACAATGTGTCTAAGAAACAGATGACCGACACATTTTGCTACGATATGCTGCGAAAAAATTGCCGCTGGCACCTCCCATTTGTTCAAGAATGGGGTTGTTTTGAAAAGGAAAATGGTACCTACTTGGGAGGTCGGCGCCAGTCATTCATGAAGGAGTTGGTGCCAGAGAAGACCTACGAAATACTCACGGATGGTGTCGATGCAAGCTATCGACTTTCAGGGAGAAACGGATGTTGCCAGGAAGCTCGTCAAACCTAACTCCGAC</Hsp_hseq>\n <Hsp_midline>||||||||||| || |||||| ||||||||||| |||| || |||| |||||||||||| ||||| ||||||||||||||| |||| |||||||||||||||||||| ||||||||||| |||| |||||||||||||||||||||||||| ||||||||| | |||||||||| |||||| ||||| ||||||||| || | ||||||| ||||||||||||| ||||||||||||||||||||||||||||| | |||||||||||||||||||||||||||||||||||| | |||||||||||||||| |||||||||||||||||||||||||||||| ||| ||||||||||||| ||||||||||| || |||||||||||||||| ||||||</Hsp_midline>\n </Hsp>\n </Hit_hsps>\n</Hit>\n\n <Hit_num>3</Hit_num>\n <Hit_id>barnacle-ee310756</Hit_id>\n <Hit_def>No definition line</Hit_def>\n <Hit_accession>barnacle-ee310756</Hit_accession>\n <Hit_len>475</Hit_len>\n <Hit_hsps>\n \n <Hsp_num>1</Hsp_num>\n <Hsp_bit-score>462.949</Hsp_bit-score>\n <Hsp_score>512</Hsp_score>\n <Hsp_evalue>2.47404e-129</Hsp_evalue>\n <Hsp_query-from>33</Hsp_query-from>\n <Hsp_query-to>353</Hsp_query-to>\n <Hsp_hit-from>155</Hsp_hit-from>\n <Hsp_hit-to>475</Hsp_hit-to>\n 
<Hsp_query-frame>1</Hsp_query-frame>\n <Hsp_hit-frame>1</Hsp_hit-frame>\n <Hsp_identity>295</Hsp_identity>\n <Hsp_positive>295</Hsp_positive>\n <Hsp_gaps>0</Hsp_gaps>\n <Hsp_align-len>321</Hsp_align-len>\n <Hsp_qseq>AAGAGGCAGATCTGCAGCGAATAATTTTCTTTTAAACACAAAATCCCGATAAAACCACACGATGGACAGGTTTGGGCCGTTTACAAAGCAGAACATCTCCCGAGGAAAACACTCCGCAACCGAGCGAAGGTCTTGGCATGGGGAATGACGAGCGCTTGGAATTTGCAAAATTTGCACAATGTGTCTGAGAAACAGACGTCCGACACATTCTGCTACCATATGATGCGAAAAATTTACTCCTGGCACTTCCCATTTGTTCAGGAATGGGGTTGTTTTGAAAAGGAAAATGGTGCTTGGGAGGTCGGCGCCAGTCATTCATGA</Hsp_qseq>\n <Hsp_hseq>AAGAGGCAGATCTGGAGCAAATAGCTTTCTTTTACACACAAAATCCGGACAAAACCGCACGATGGACAGGTTTTGGCCGTTTACAAAGCAGAACCTCTCCTGAGGAAAACACTCCGGAACCCAGCGAAGGTCTTGGCATTGGGAATGACGAGCGTTCGGGATTTGCAAATTTTGCAAAATGTGTCTAAGAAACAGATGGCCGACACATTCTGCTACGATATGCTGCGAAAAATTTGCCCCTGGCACTTCCCATTTGTTCAGGAATGGGGTTGTTTTGAAAAGGAAAATGGTGCTTGGGAGGTCGGCGCCAGTCATTCATGA</Hsp_hseq>\n <Hsp_midline>|||||||||||||| ||| |||| ||||||||| ||||||||||| || |||||| |||||||||||||||| |||||||||||||||||||| ||||| ||||||||||||||| |||| ||||||||||||||||| |||||||||||||| | || ||||||||| |||||| ||||||||| ||||||||| | ||||||||||||||||| ||||| |||||||||||| | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||</Hsp_midline>\n </Hsp>\n </Hit_hsps>\n</Hit>\n\n <Hit_num>4</Hit_num>\n <Hit_id>barnacle-ee7959</Hit_id>\n <Hit_def>No definition line</Hit_def>\n <Hit_accession>barnacle-ee7959</Hit_accession>\n <Hit_len>612</Hit_len>\n <Hit_hsps>\n \n <Hsp_num>1</Hsp_num>\n <Hsp_bit-score>354.747</Hsp_bit-score>\n <Hsp_score>392</Hsp_score>\n <Hsp_evalue>9.23619e-97</Hsp_evalue>\n <Hsp_query-from>33</Hsp_query-from>\n <Hsp_query-to>288</Hsp_query-to>\n <Hsp_hit-from>358</Hsp_hit-from>\n <Hsp_hit-to>612</Hsp_hit-to>\n <Hsp_query-frame>1</Hsp_query-frame>\n <Hsp_hit-frame>1</Hsp_hit-frame>\n <Hsp_identity>233</Hsp_identity>\n <Hsp_positive>233</Hsp_positive>\n <Hsp_gaps>1</Hsp_gaps>\n <Hsp_align-len>256</Hsp_align-len>\n 
<Hsp_qseq>AAGAGGCAGATCTGCAGCGAATAATTTTCTTTTAAACACAAAATCCCGATAAAACCACACGATGGACAGGTTTGGGCCGTTTACAAAGCAGAACATCTCCCGAGGAAAACACTCCGCAACCGAGCGAAGGTCTTGGCATGGGGAATGACGAGCGCTTGGAATTTGCAAAATTTGCACAATGTGTCTGAGAAACAGACGTCCGACACATTCTGCTACCATATGATGCGAAAAATTTACTCCTGGCACTTCCCATTTG</Hsp_qseq>\n <Hsp_hseq>AAGAGGCAGATCTGGAGCAAATAGCTTTCTTTTACACACAAAATCCCGACAAAACCGCACGATGGACAGGTTTTGGCCGTTTACAAAGCAGAACCTCTCCTGAGGAAAACACTCCGCAACCCAGCGAAGGTCTCGGCATGGGGAATGAAGAGCGTTTGGGATTTGCAAAATTTGCA-AATGTGTCTAAGAAACAGATGGCCGACACATTCTGCTACGATATGCTGCGAAAAATTTGCCCCTGGCACTTCCCATTTG</Hsp_hseq>\n <Hsp_midline>|||||||||||||| ||| |||| ||||||||| |||||||||||||| |||||| |||||||||||||||| |||||||||||||||||||| ||||| |||||||||||||||||||| ||||||||||| |||||||||||||| ||||| |||| |||||||||||||||| ||||||||| ||||||||| | ||||||||||||||||| ||||| |||||||||||| | ||||||||||||||||||</Hsp_midline>\n </Hsp>\n </Hit_hsps>\n</Hit>\n\n <Hit_num>5</Hit_num>\n <Hit_id>barnacle-ee288238</Hit_id>\n <Hit_def>No definition line</Hit_def>\n <Hit_accession>barnacle-ee288238</Hit_accession>\n <Hit_len>620</Hit_len>\n <Hit_hsps>\n \n <Hsp_num>1</Hsp_num>\n <Hsp_bit-score>336.713</Hsp_bit-score>\n <Hsp_score>372</Hsp_score>\n <Hsp_evalue>2.47841e-91</Hsp_evalue>\n <Hsp_query-from>220</Hsp_query-from>\n <Hsp_query-to>467</Hsp_query-to>\n <Hsp_hit-from>619</Hsp_hit-from>\n <Hsp_hit-to>373</Hsp_hit-to>\n <Hsp_query-frame>1</Hsp_query-frame>\n <Hsp_hit-frame>-1</Hsp_hit-frame>\n <Hsp_identity>224</Hsp_identity>\n <Hsp_positive>224</Hsp_positive>\n <Hsp_gaps>1</Hsp_gaps>\n <Hsp_align-len>248</Hsp_align-len>\n <Hsp_qseq>AGAAACAGACGTCCGACACATTCTGCTACCATATGATGCGAAAAATTTACTCCTGGCACTTCCCATTTGTTCAGGAATGGGGTTGTTTTGAAAAGGAAAATGGTGCTTGGGAGGTCGGCGCCAGTCATTCATGAAGGAGTTAGCGCCAGAGAAGACCTACAAAATACTCACGGATGGTGTCGATGCAAGCTGTCGGCTTTCAGGGAGAAGCGGATGTTGCCGGGGAGCTCGTCAAACCTAAATCCGAC</Hsp_qseq>\n 
<Hsp_hseq>AGAAACAGATGGCCGACACATTCTGCTACGATATTATGCAAACAAATTACTCCTGGCAATTCCCGTTTGTTCAGGAATGGGGTCGTTTTGAAAAGGGAAATGGTGCTTTGGATGTCGGCGCCAGCC-TTCGTGAAGGAGTTGGCGCCAGAGAATACCTACAAAATGCTCATGAATGGTGTCGATGCAAGCTGTCGGCTTTCAGGGAGAAGCAAATGTGGCCGGGGAGCTCGTCAAACCTAAATCCGAC</Hsp_hseq>\n <Hsp_midline>||||||||| | ||||||||||||||||| |||| |||| || || |||||||||||| ||||| |||||||||||||||||| |||||||||||| ||||||||||| ||| ||||||||||| | ||| |||||||||| ||||||||||| ||||||||||| |||| | |||||||||||||||||||||||||||||||||||||| |||| ||||||||||||||||||||||||||||||</Hsp_midline>\n </Hsp>\n </Hit_hsps>\n</Hit>\n\n <Hit_num>6</Hit_num>\n <Hit_id>barnacle-ee34877</Hit_id>\n <Hit_def>No definition line</Hit_def>\n <Hit_accession>barnacle-ee34877</Hit_accession>\n <Hit_len>457</Hit_len>\n <Hit_hsps>\n \n <Hsp_num>1</Hsp_num>\n <Hsp_bit-score>233.921</Hsp_bit-score>\n <Hsp_score>258</Hsp_score>\n <Hsp_evalue>2.17598e-60</Hsp_evalue>\n <Hsp_query-from>33</Hsp_query-from>\n <Hsp_query-to>277</Hsp_query-to>\n <Hsp_hit-from>249</Hsp_hit-from>\n <Hsp_hit-to>3</Hsp_hit-to>\n <Hsp_query-frame>1</Hsp_query-frame>\n <Hsp_hit-frame>-1</Hsp_hit-frame>\n <Hsp_identity>203</Hsp_identity>\n <Hsp_positive>203</Hsp_positive>\n <Hsp_gaps>8</Hsp_gaps>\n <Hsp_align-len>250</Hsp_align-len>\n <Hsp_qseq>AAGAGGCAGATCTGCAGCGAATAATTTTCTTTTAAACACAAAATCCCGATAAAACCACACGATGGACAGGTT---TGGGCCGTTTACAAAGCAGAACATCTCCCGAGGAAAACACTCCGCAACCGAGCGAAGGTCTTGGCATGGGGAATGACGAGCGCTTGGAATTTGCAAAATTTGCACAATGTGTCTGAGAAACAGACGTCCGACACATTCTGCTACCATATGATGC--GAAAAATTTACTCCTGGCA</Hsp_qseq>\n <Hsp_hseq>AAGAGGCAAATCTGGAGCGAATAGCTTTCCTTCGGACACAAAATCCC---AACACCACATAATAGACTGGCTGCTTAGGCCGTTTACAAAGCAAAAGCTCTCTTGAGGAAAACACTCCGCAACCCAGCGAATGTTTTGGCATGGGGAATGTTGAGCGTTTGGAATTTGCAAAATTTGAACGGCGTGTCTCAGAAACAGATGGCCAACACATTCTGCTACGATATTATGCAAAAAAAAATTACCCCTGGCA</Hsp_hseq>\n <Hsp_midline>|||||||| ||||| |||||||| |||| || |||||||||||| || |||||| || ||| || | | |||||||||||||||| || |||| |||||||||||||||||||| |||||| || ||||||||||||||| ||||| ||||||||||||||||||| || |||||| ||||||||| | 
|| |||||||||||||| |||| |||| ||||| |||| |||||||</Hsp_midline>\n </Hsp>\n </Hit_hsps>\n</Hit>\n\n <Hit_num>7</Hit_num>\n <Hit_id>barnacle-ee294988</Hit_id>\n <Hit_def>No definition line</Hit_def>\n <Hit_accession>barnacle-ee294988</Hit_accession>\n <Hit_len>2129</Hit_len>\n <Hit_hsps>\n \n <Hsp_num>1</Hsp_num>\n <Hsp_bit-score>53.584</Hsp_bit-score>\n <Hsp_score>58</Hsp_score>\n <Hsp_evalue>4.21178e-06</Hsp_evalue>\n <Hsp_query-from>1</Hsp_query-from>\n <Hsp_query-to>32</Hsp_query-to>\n <Hsp_hit-from>1333</Hsp_hit-from>\n <Hsp_hit-to>1302</Hsp_hit-to>\n <Hsp_query-frame>1</Hsp_query-frame>\n <Hsp_hit-frame>-1</Hsp_hit-frame>\n <Hsp_identity>31</Hsp_identity>\n <Hsp_positive>31</Hsp_positive>\n <Hsp_gaps>0</Hsp_gaps>\n <Hsp_align-len>32</Hsp_align-len>\n <Hsp_qseq>TTAGGAGCAAATGAAAAGAAGAAAGCTGGAAA</Hsp_qseq>\n <Hsp_hseq>TTAGGAGCCAATGAAAAGAAGAAAGCTGGAAA</Hsp_hseq>\n <Hsp_midline>|||||||| |||||||||||||||||||||||</Hsp_midline>\n </Hsp>\n </Hit_hsps>\n</Hit>\n\n <Hit_num>8</Hit_num>\n <Hit_id>barnacle-ee265222</Hit_id>\n <Hit_def>No definition line</Hit_def>\n <Hit_accession>barnacle-ee265222</Hit_accession>\n <Hit_len>484</Hit_len>\n <Hit_hsps>\n \n <Hsp_num>1</Hsp_num>\n <Hsp_bit-score>53.584</Hsp_bit-score>\n <Hsp_score>58</Hsp_score>\n <Hsp_evalue>4.21178e-06</Hsp_evalue>\n <Hsp_query-from>1</Hsp_query-from>\n <Hsp_query-to>32</Hsp_query-to>\n <Hsp_hit-from>318</Hsp_hit-from>\n <Hsp_hit-to>349</Hsp_hit-to>\n <Hsp_query-frame>1</Hsp_query-frame>\n <Hsp_hit-frame>1</Hsp_hit-frame>\n <Hsp_identity>31</Hsp_identity>\n <Hsp_positive>31</Hsp_positive>\n <Hsp_gaps>0</Hsp_gaps>\n <Hsp_align-len>32</Hsp_align-len>\n <Hsp_qseq>TTAGGAGCAAATGAAAAGAAGAAAGCTGGAAA</Hsp_qseq>\n <Hsp_hseq>TTAGGAGCCAATGAAAAGAAGAAAGCTGGAAA</Hsp_hseq>\n <Hsp_midline>|||||||| |||||||||||||||||||||||</Hsp_midline>\n </Hsp>\n </Hit_hsps>\n</Hit>\n\n <Hit_num>9</Hit_num>\n <Hit_id>barnacle-ee316353</Hit_id>\n <Hit_def>No definition line</Hit_def>\n <Hit_accession>barnacle-ee316353</Hit_accession>\n 
<Hit_len>626</Hit_len>\n <Hit_hsps>\n \n <Hsp_num>1</Hsp_num>\n <Hsp_bit-score>40.9604</Hsp_bit-score>\n <Hsp_score>44</Hsp_score>\n <Hsp_evalue>0.0265793</Hsp_evalue>\n <Hsp_query-from>84</Hsp_query-from>\n <Hsp_query-to>125</Hsp_query-to>\n <Hsp_hit-from>506</Hsp_hit-from>\n <Hsp_hit-to>547</Hsp_hit-to>\n <Hsp_query-frame>1</Hsp_query-frame>\n <Hsp_hit-frame>1</Hsp_hit-frame>\n <Hsp_identity>34</Hsp_identity>\n <Hsp_positive>34</Hsp_positive>\n <Hsp_gaps>0</Hsp_gaps>\n <Hsp_align-len>42</Hsp_align-len>\n <Hsp_qseq>AAACCACACGATGGACAGGTTTGGGCCGTTTACAAAGCAGAA</Hsp_qseq>\n <Hsp_hseq>AAACGACATGATGACCAGGCTTGAGAAGTTTACAAAGCAGAA</Hsp_hseq>\n <Hsp_midline>|||| ||| |||| |||| ||| | |||||||||||||||</Hsp_midline>\n </Hsp>\n </Hit_hsps>\n</Hit>\n</Iteration_hits>\n <Iteration_stat>\n \n <Statistics_db-num>192231</Statistics_db-num>\n <Statistics_db-len>134919102</Statistics_db-len>\n <Statistics_hsp-len>28</Statistics_hsp-len>\n <Statistics_eff-space>56866582326</Statistics_eff-space>\n <Statistics_kappa>0.41</Statistics_kappa>\n <Statistics_lambda>0.625</Statistics_lambda>\n <Statistics_entropy>0.78</Statistics_entropy>\n </Statistics>\n </Iteration_stat>\n</Iteration>\n</BlastOutput_iterations>\n</BlastOutput>\n\n",
"errors": ""
}
]
}

@iimog
Member

iimog commented May 5, 2017

Thanks for sharing. I think I found the problem. When creating a blast database from fasta via makeblastdb there is an option called -parse_seqids, which is not set by default. Without it, the ids of entries in the blastdb are generated automatically and the whole fasta header (id + desc, everything after >) is stored in the def of the entry. This is why TBro parses the first word from the <Hit_def>. However, if the -parse_seqids option is used, the fasta id (first word after >) is used as id and only the rest of the line (in your case, nothing) is stored in def.

So when you are rebuilding, could you please re-generate the blast databases without the -parse_seqids flag?
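As a sketch of what rebuilding without the flag could look like, the helper below prints a makeblastdb command that deliberately omits -parse_seqids. It only prints the command rather than running it. The function name and the fasta file name are hypothetical, and the database name is chosen to match the name column in database_files.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of TBro): print a makeblastdb command
# without -parse_seqids, so the fasta header ends up in <Hit_def>.
makeblastdb_command() {
    # $1 = FASTA file, $2 = dbtype (nucl or prot), $3 = database name
    case "$2" in
        nucl|prot) ;;
        *) echo "dbtype must be nucl or prot" >&2; return 1 ;;
    esac
    printf 'makeblastdb -in %s -dbtype %s -out %s\n' "$1" "$2" "$3"
}

# barnacle-TR.fasta is a placeholder file name.
makeblastdb_command barnacle-TR.fasta nucl barnacle-TR
# prints: makeblastdb -in barnacle-TR.fasta -dbtype nucl -out barnacle-TR
```

The resulting database files would then be zipped and uploaded as before, with a fresh md5 sum in database_files because the zip content has changed.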

I think in general it is more appropriate to have blast databases that use -parse_seqids and hence have the id in <Hit_id>. But as this will break backwards compatibility I will schedule this change for version 1.2.0. I will open a separate issue for that.

@000generic
Author

000generic commented May 5, 2017

Great detective work! I haven't tested it yet but I think that makes sense.

I agree, it's generally more appropriate/useful to have a blast database that is set up with -parse_seqids. For instance, I believe it enables blastdbcmd to pull sequences out of the database using a single familiar identifier that is used throughout a workflow - so it makes working with a blast database at the command line much easier - but it is no problem to make a separate database just for TBro. And it would be great to have a flag in the TBro setup in v1.2.0 for blast databases built with or without -parse_seqids!

I'll test a new blast database set up in TBro next...

It works fantastic!!!

Now on to Expression Search :)

3 participants