Rebecca Clement edited this page May 21, 2019 · 5 revisions

2. Let’s build your library

The PathoScope Library module (PathoLib) helps you collect the sequence data from NCBI that are pertinent to a given analysis. You need to decide on your targets, i.e., what you expect to find and want to identify in your metagenomic sample, and your filters, i.e., what you want to remove from your data because it is not part of the target analysis. Typically, you would look for any microbes (viruses, bacteria, fungi) and discard host reads and artificially added sequences such as PhiX174.

So let’s get some target sequences for our analysis:

  • Change directory (cd) to the PathoScope folder
  • Create a directory to contain your library data
  • cd into the pathoscope directory and call the PathoLib module help for details
cd Desktop/pathoscope2/
mkdir library
cd pathoscope
python pathoscope.py LIB -h

You have two ways of running PathoLib.

  1. You could set up, or use an existing, MySQL database from which to draw your library, or

  2. You can download NCBI’s nucleotide database (10 GB) from their FTP site (ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz) and get your data from there.

Either way you’ll end up with files representing your Filter or Target libraries.

  • Within the pathoscope2 directory (cd .. if you are still inside pathoscope), create a directory to contain your data files (name it data)
  • cd into it and download the latest nucleotide database from NCBI
  • Decompress the downloaded file
mkdir data
cd data
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
gunzip -dc nt.gz > nt.fasta
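If you would rather decompress from Python, for example on a system without gunzip, the step above can be sketched roughly as follows. This is only an illustration and not part of PathoScope: the helper names decompress_gzip and count_fasta_records are hypothetical, and the record count is simply a quick sanity check that the output looks like FASTA.

```python
import gzip
import shutil


def decompress_gzip(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress src_path into dst_path in chunks.

    Hypothetical helper: the file is never loaded into memory as a whole,
    which matters for a database of roughly 10 GB.
    """
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)


def count_fasta_records(path):
    """Count FASTA header lines (lines starting with '>')."""
    with open(path, "rb") as handle:
        return sum(1 for line in handle if line.startswith(b">"))


# Example usage from within pathoscope2/data (not executed here):
#   decompress_gzip("nt.gz", "nt.fasta")
#   print(count_fasta_records("nt.fasta"))
```

Unlike gunzip -dc, this sketch does nothing beyond copying bytes, so expect it to take a comparable amount of time and disk space.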

Now you should have a file named nt.fasta within the pathoscope2/data directory. Our next step is to collect the sequence data for viruses (Taxonomy ID 10239) to create a target library.

2.1 Let’s try using a MySQL database first (OPTIONAL)...

For this option, you will need access to a MySQL database containing sequence data information, and the Python package MySQLdb (http://sourceforge.net/projects/mysql-python/). Also, make sure MySQLdb is in your PYTHONPATH. Then, you should issue something like this:

python pathoscope.py LIB -genomeFile ../data/nt.fasta -taxonIds 10239 --subTax -dbhost localhost -dbuser user -dbpasswd xxxxx -outPrefix virus_sql

Here -dbhost, -dbuser, and -dbpasswd are your credentials to access the SQL database. In the backend, PathoLib takes the file nt.fasta, grabs the GI numbers, queries your SQL database, and retrieves the taxonomy information that is prepended to each entry of your resulting file. The resulting file will contain all NCBI records whose taxon ID is 10239 (viruses) or any of its subtaxa. For instance, from:

gi|40555938|ref|NC_005309.1| Canarypox virus, complete genome

To:

ti|44088|gi|40555938|ref|NC_005309.1| Canarypox virus, complete genome
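The header rewrite shown above can be illustrated with a few lines of Python. This is a minimal sketch of the format only, not PathoLib's actual code: prepend_taxon_id and gi_number are hypothetical helper names, and the taxon ID is passed in directly rather than looked up in a database.

```python
def prepend_taxon_id(header, taxon_id):
    """Return the header with 'ti|<taxon_id>|' placed before the 'gi|' field.

    Hypothetical helper; a leading '>' (FASTA header marker) is preserved.
    """
    marker = ">" if header.startswith(">") else ""
    body = header[len(marker):]
    if not body.startswith("gi|"):
        raise ValueError("expected a header starting with 'gi|': %r" % header)
    return "%sti|%d|%s" % (marker, taxon_id, body)


def gi_number(header):
    """Extract the GI number from a header containing 'gi|<number>|'."""
    fields = header.lstrip(">").split("|")
    return int(fields[fields.index("gi") + 1])


original = "gi|40555938|ref|NC_005309.1| Canarypox virus, complete genome"
tagged = prepend_taxon_id(original, 44088)
```

In the real workflow the GI number is what gets looked up to find the taxon ID; here gi_number only shows where that number sits in the header.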

2.2 ...And now let’s try without a MySQL database

(within pathoscope directory)

python pathoscope.py LIB -genomeFile ../data/nt.fasta -taxonIds 10239 --subTax -outPrefix virus

Note that in the command line above (--subTax) the extra hyphen is not a typo.

Again, we are asking PathoLib to look into nt.fasta for all sequences whose taxon ID is 10239 (viruses), including all the subtending taxa according to NCBI’s taxonomy tree (--subTax), and to create a file with the prefix ‘virus’. On the backend, PathoLib connects remotely to NCBI and downloads the taxonomy information that links GI numbers (present in the fasta entries within nt.fasta) to taxon ID numbers (TI), so that it can select sequence data according to the user’s preferences.
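To make the idea behind --subTax concrete, here is a small sketch of how all taxa below a given node can be collected from child-to-parent links. It is an illustration under stated assumptions, not PathoLib's implementation: collect_subtree is a hypothetical name, and TOY_PARENTS is an invented six-node tree, not real NCBI taxonomy.

```python
from collections import defaultdict


def collect_subtree(parent_of, root):
    """Return the set containing root and every taxon below it.

    parent_of maps each taxon ID to its parent's ID; a root node may
    point to itself.
    """
    children = defaultdict(list)
    for child, parent in parent_of.items():
        if child != parent:
            children[parent].append(child)

    selected = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node not in selected:
            selected.add(node)
            stack.extend(children[node])
    return selected


# Invented toy tree (NOT real NCBI IDs): 1 is the root, 2 and 3 are its
# children, 4 is below 2, 5 is below 4, and 6 is below 3.
TOY_PARENTS = {1: 1, 2: 1, 3: 1, 4: 2, 5: 4, 6: 3}
subtree_of_2 = collect_subtree(TOY_PARENTS, 2)
```

Requesting taxon 2 in this toy tree selects 2, 4, and 5, in the same way that requesting 10239 with --subTax selects every taxon beneath it.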

Bear in mind that executing PathoLib can be memory intensive. To search the entire GenBank nucleotide database for the taxon IDs you’d like to use for PathoMap, you’d need to allocate around 6 GB of RAM to PathoLib. You can decrease the memory requirements by using a two-step approach in which 1) you prepend taxon IDs to all nt.fasta entries and 2) you then select sequences from the resulting file. Step 1) is a one-time step whose output you can reuse as a source to subselect taxa for your different analyses and needs. Step 1) needs a MySQL database up and running; however, you can download nt_ti.fa.gz directly from the PathoScope website (ftp://pathoscope.bumc.bu.edu/data/nt_ti.fa.gz) and start from step 2) without a MySQL database.

Step 1):

python pathoscope.py LIB -genomeFile ../data/nt.fasta -outPrefix nt -dbhost localhost -dbuser pathoscope -dbpasswd johnsonlab

Step 2):

python pathoscope.py LIB -genomeFile nt_ti.fa -taxonIds 10239 --subTax -outPrefix virus
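Conceptually, step 2) is a streaming filter over the taxon-tagged file. The sketch below shows that idea only; it is not PathoLib's code. It assumes that each header begins with ">ti|<taxon ID>|" and that wanted_ids already contains the requested taxon together with all of its subtaxa; select_by_taxon is a hypothetical name.

```python
def select_by_taxon(lines, wanted_ids):
    """Yield the lines of every FASTA record whose 'ti' field is wanted.

    lines is any iterable of text lines (for example an open file), so
    the input is processed one line at a time.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split("|")
            keep = (
                len(fields) > 1
                and fields[0] == "ti"
                and fields[1].isdigit()
                and int(fields[1]) in wanted_ids
            )
        if keep:
            yield line


# Example usage (not executed here), keeping a single taxon:
#   with open("nt_ti.fa") as src, open("subset.fa", "w") as dst:
#       dst.writelines(select_by_taxon(src, {44088}))
```

Because only one line is held at a time, a filter of this kind needs very little memory, which is the point of tagging the database once and subselecting afterwards. The PathoLib commands above remain the actual two-step workflow.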

This approach will render a file with sequence data for viruses just like the one generated above using a MySQL database. However, your downstream report won’t contain all the organismal information that it would if you used a database. Let’s see how to set up a ready-to-use database.

2.3 How to set up a MySQL database

To make things easier, we provide a curated version of the nt GenBank database (as of OCT2013) as a MySQL database. Please go to our website and download pathodb.sql (ftp://pathoscope.bumc.bu.edu/data/pathodb.sql.gz), then decompress it (e.g., with gunzip). Make sure you have MySQL and the Python package MySQLdb up and running, then simply follow the instructions below. From the terminal:

mysql -u root -p
<Enter root password>
create DATABASE pathodb;
create user pathoscope;
grant all privileges on pathodb.* to pathoscope@"localhost" identified by 'johnsonlab';
flush privileges;

And then...

mysql -u pathoscope -p pathodb < pathodb.sql
<Enter the following password when asked>
johnsonlab
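Before pointing PathoLib at the new database, it can help to confirm that the MySQLdb package is importable by the Python interpreter you will use. The check below is a small sketch written for this tutorial, not something PathoScope ships; has_module is a hypothetical helper name.

```python
def has_module(name):
    """Return True if the module called `name` can be imported."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


# Example usage (not executed here):
#   if not has_module("MySQLdb"):
#       print("MySQLdb not found; check your installation and PYTHONPATH")
```

If the check fails, revisit the MySQLdb installation and your PYTHONPATH before running the database-backed commands.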

When you run PathoLib, provide your database credentials as follows:

python pathoscope.py LIB -genomeFile ../data/nt.fasta -taxonIds 10239 --subTax -dbhost localhost -dbuser pathoscope -dbpasswd johnsonlab -outPrefix virus_sql