SMRT Analysis Software Installation v2.0.1

Pacific Biosciences edited this page Jun 12, 2013 · 25 revisions

System Requirements

Operating System

  • SMRT® Analysis is only supported on:
    • English-language Ubuntu 10.04
    • English-language RedHat/CentOS 5.6
    • By request, English-language Ubuntu 8.04 and RedHat/CentOS 5.3 are also temporarily supported.
  • SMRT Analysis cannot be installed on the Mac OS or Windows.
  • Users with other versions of Ubuntu or CentOS will likely encounter library errors when running an initial analysis job. Install the missing libraries listed under Software Requirements below for analysis jobs to complete successfully.

Running SMRT® Analysis in the Cloud

Users who do not have access to a server running CentOS 5.6 or later or Ubuntu 10.04 or later can use the public Amazon Machine Image (AMI). For details, see the document Running SMRT Analysis on Amazon.

Software Requirements

  • MySQL 5
  • bash
  • Perl (v5.8.8)
    • Statistics::Descriptive Perl module: sudo cpan Statistics::Descriptive

Ubuntu: sudo aptitude install mysql-server libxml-parser-perl liblapack3gf libssl0.9.8

CentOS 5: sudo yum install mysql-server perl-XML-Parser libgfortran libgfortran44 openssl redhat-lsb

CentOS 6: sudo yum install mysql-server perl-XML-Parser compat-libgfortran-41 openssl098e redhat-lsb

Client web browser:

We recommend using the Firefox® 15 or Google Chrome® 21 web browsers to run SMRT Portal for consistent functionality. We also support Apple’s Safari® and Internet Explorer® web browsers; however, some features may not be optimized on these browsers.

Client Java:

To run SMRT View, we recommend using Java 7 for Windows (Java 7 64 bit for users with 64 bit OS), and Java 6 for the Mac OS.

Minimum Hardware Requirements

1 head node:

  • Minimum 8 cores, with 2 GB RAM per core. We recommend 16 cores with 4 GB RAM per core for de novo assemblies and larger references such as human
  • Minimum 250 GB of disk space

Compute nodes:

  • Minimum 3 compute nodes. We recommend 5 nodes for high utilization focused on de novo assemblies
  • Minimum 8 cores per node, with 2 GB RAM per core. We recommend 16 cores per node with 4 GB RAM per core
  • Minimum 250 GB of disk space per node
  • To perform de novo assembly of large genomes using the Celera® Assembler, one of the nodes will need to have considerably more memory. See the Celera® Assembler home page for recommendations: http://wgs-assembler.sourceforge.net/.

Note: It is possible, but not advisable, to install SMRT Analysis on a single-node machine (see the distributed computing section). You will likely be able to submit jobs one SMRT Cell at a time, but the time to completion may be long, as the software may not have sufficient resources to complete the job.

Data storage:

  • 10 TB (Actual storage depends on usage.)

Network File System Requirement

Please refer to the IT Site Prep guide provided with your instrument purchase for more details.

  1. The SMRT Analysis software directory $SEYMOUR_HOME must have the same path and be readable by the smrtanalysis user across all compute nodes via NFS. Note the smrtanalysis user is specified during the installation and does not have to be "smrtanalysis".

  2. The SMRT Cell input directories contain data from the PacBio RS. The metadata.xml and bas.h5 files under this directory must have the same path and be readable by the smrtanalysis user across all compute nodes via NFS.

  3. The SMRT Analysis output directory $SEYMOUR_HOME/common/userdata must have the same path and be writable by the smrtanalysis user across all compute nodes via NFS. This directory is usually soft-linked to a large storage volume.

  4. The local temporary directory $TMP, specified in smrtpipe.rc and defaulting to /tmp/, must be writable by the smrtanalysis user and must exist as an independent directory on each compute node.

  5. The shared temporary directory $SHARED_DIR, specified in smrtpipe.rc and defaulting to $SEYMOUR_HOME/common/userdata/shared_dir/, must be writable by the smrtanalysis user across all compute nodes via NFS. This functionality is enabled by (3), but is listed again here as guidance for users who want to change the location of this directory.
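The requirements above can be spot-checked with a short script. A minimal sketch, assuming the documented default locations; run it as the smrtanalysis user on each compute node, and adjust the paths to match your installation:

```shell
#!/bin/bash
# Sketch: spot-check the NFS path requirements on the current node.
# SEYMOUR_HOME and TMP fall back to the documented defaults.
check_path() {  # usage: check_path <r|w> <path>
  case "$1" in
    r) [ -r "$2" ] && echo "OK $2" || echo "FAIL not readable: $2" ;;
    w) [ -w "$2" ] && echo "OK $2" || echo "FAIL not writable: $2" ;;
  esac
}

SEYMOUR_HOME=${SEYMOUR_HOME:-/opt/smrtanalysis}
check_path r "$SEYMOUR_HOME"                               # requirement 1
check_path w "$SEYMOUR_HOME/common/userdata"               # requirement 3
check_path w "${TMP:-/tmp}"                                # requirement 4
check_path w "$SEYMOUR_HOME/common/userdata/shared_dir"    # requirement 5
```

A FAIL line on any node indicates an NFS export, mount, or permission problem to fix before running jobs.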

Installation and Upgrade Summary

Following are the steps for installing and upgrading SMRT Analysis v2.0.1. For further details, click the links.

IMPORTANT: The upgrade script works only from v2.0.0 to v2.0.1. If you are using an older version of SMRT Analysis, you can either perform a fresh installation and manually import old SMRT Cells and jobs, or download and upgrade any intermediate versions (v1.3.0, v1.3.1, v1.3.3, v1.4, v2.0.0).

  1. Select an installation directory to assign to the $SEYMOUR_HOME environment variable. The default is /opt/smrtanalysis.

  2. Decide on a user who will perform the installation. We recommend that a system administrator create a special user with sudo privileges. The default is smrtanalysis, which belongs to the smrtanalysis group. If you are upgrading, the smrtanalysis user is the owner of the previous $SEYMOUR_HOME directory (for example, check with ls -lLd /opt/smrtanalysis). Although not recommended, it is possible to install SMRT Analysis as a non-sudo user.

  3. Extract the tarball and softlink the directories:

    tar -C /opt -xvvzf <tarball_name>.tgz
    rm /opt/smrtanalysis    # only if the softlink already exists
    ln -s /opt/smrtanalysis-2.0.1 /opt/smrtanalysis
    
  4. Edit /opt/smrtanalysis-2.0.1/etc/setup.sh to match your installation location:

    SEYMOUR_HOME=/opt/smrtanalysis
    
  5. Run the appropriate script:

    • Option 1: If you are performing a fresh installation, run the installation script and start tomcat and kodos:
    /opt/smrtanalysis/etc/scripts/postinstall/configure_smrtanalysis.sh
    /opt/smrtanalysis/etc/scripts/tomcatd start
    /opt/smrtanalysis/etc/scripts/kodosd start
    
    • Option 2: If you are upgrading and want to preserve SMRT Cells, jobs, and users from a previous installation: Turn off services in the previous installation, run the upgrade script, and turn on services in the current installation. Note: Updating the references may take several hours.
    /opt/smrtanalysis-<old-version-number>/etc/scripts/kodosd stop
    /opt/smrtanalysis-<old-version-number>/etc/scripts/tomcatd stop
    /opt/smrtanalysis/etc/scripts/postinstall/upgrade_and_configure_smrtanalysis.sh
    /opt/smrtanalysis-<current-version-number>/etc/scripts/tomcatd start
    /opt/smrtanalysis-<current-version-number>/etc/scripts/kodosd start
    
  6. New Installations only: Set up distributed computing by deciding on a job management system (JMS) and then editing the following files:
/opt/smrtanalysis/analysis/etc/cluster/<JMS>/start.tmpl
/opt/smrtanalysis/analysis/etc/cluster/<JMS>/interactive.tmpl
/opt/smrtanalysis/analysis/etc/cluster/<JMS>/kill.tmpl
/opt/smrtanalysis/redist/tomcat/webapps/smrtportal/WEB-INF/web.xml

Note: If you are not using SGE, you will need to deactivate the Celera® Assembler protocols so that they do not display in SMRT Portal. To do so, rename the following files, located in common/protocols:

RS_CeleraAssembler.1.xml to RS_CeleraAssembler.1.bak
filtering/CeleraAssemblerSFilter.1.xml to CeleraAssemblerSFilter.1.bak
assembly/CeleraAssembler.1.xml to CeleraAssembler.1.bak
  7. New Installations only: Set up user data folders that point to external storage.

  8. New Installations only: Set up SMRT Portal.

  9. Verify the installation.

Bundled with SMRT® Analysis

The following are bundled within the application and do not depend on what is already deployed on the system.

  • Java® 1.6
  • Python® 2.5.2
  • Tomcat™ 7.0.23

Changes from SMRT® Analysis v2.0.0

New Features

  • Now includes Quiver training for the P4 DNA polymerase.
  • Now includes modification detection using the P4/C2 combination with an updated in silico control.
    • Modification identification of 6-methyladenine (6-mA) and 4-methylcytosine (4-mC) is also supported, and is expected to have equivalent performance to previous chemistry releases.
    • Modification identification of 5-methylcytosine (5-mC) using TET-treated samples is also supported. However, due to a limited training dataset, this application is not yet optimized for the P4/C2 combination. Future releases of the software are expected to have improved TET-converted 5-mC identification as the in silico control is updated with additional training data.

Fixed Issues

  • Fixed an Instrument Web Services problem with well status queries. (23191)
  • Removed a time limit passed to the Sun Grid Engine (SGE) that caused analysis jobs to stop after 12 hours. Any limits must now be set by your IT department; SMRT® Pipe will not limit the run time. (23312)
  • Fixed an issue where sample barcodes were not working properly with multi-streamed data files (bax.h5), and most barcodes were not being recognized. (23136)
  • Modified HGAP defaults so that partial alignments are allowed and Celera® Assembler will run on a single node.

Step 5, Option 1 Details: Run the Installation script and turn on services

cd /opt/smrtanalysis/etc/scripts/postinstall
./configure_smrtanalysis.sh
/opt/smrtanalysis-<current-version-number>/etc/scripts/tomcatd start
/opt/smrtanalysis-<current-version-number>/etc/scripts/kodosd start

The installation script requires the following input:

  • The system name. (Default: hostname -a)
  • The port number that the services will run under. (Default: 8080)
  • The Tomcat shutdown port. (Default: 8005)
  • The user/group to run the services and set permissions for the files. (Default: smrtanalysis:smrtanalysis)
  • The MySQL user name and password used to install the database. (Default: root, with no password)
  • The Job Management System for your distributed system. (Default: SGE)
    • The queue name. (Default: secondary)
    • The Parallel environment. (Default: smp)

The installation script performs the following:

  • Creates the SMRT Portal database. The mysql user performing the install must have permissions to alter or create databases. Otherwise, the installer will reject the user and prompt for another.
  • Sets the host and port names for various configuration files.
  • Sets the Tomcat/kodos user. The services will run as the specified user.
  • Sets the user and group permissions and ownership of the application to the Tomcat user.
  • Adds links in /etc/init.d to the tomcat and kodos services if invoked as root. (The defaults are: /etc/init.d/kodosd and /etc/init.d/tomcatd.) These are soft links to the actual service files within the application. If a file is already present (for example, tomcatd is already installed), the link can be created with a different name. The permissions of the underlying scripts are limited to the user running the services.
  • Installs the services. The services will automatically restart if the system restarts. (On CentOS, the installer will run chkconfig to install the services, rather than update-rc.d.)

Step 5, Option 2 Details: Run the Upgrade Script

Run the upgrade_and_configure_smrtanalysis.sh script. This may take several hours if you have many references to upgrade:

  /opt/smrtanalysis-<old-version-number>/etc/scripts/kodosd stop
  /opt/smrtanalysis-<old-version-number>/etc/scripts/tomcatd stop
  cd /opt/smrtanalysis/etc/scripts/postinstall/
  ./upgrade_and_configure_smrtanalysis.sh
  /opt/smrtanalysis-<current-version-number>/etc/scripts/tomcatd start
  /opt/smrtanalysis-<current-version-number>/etc/scripts/kodosd start

The upgrade script performs the following:

  • Preserves SMRT Cells, jobs, and users from a previous installation by applying any smrtportal database schema changes.
  • Preserves SMRT Cells, jobs, and users from a previous installation by updating the softlink to the userdata directory.
  • Preserves computing configurations from a previous installation, so that steps 6-8 do not need to be repeated.
  • Does not port over protocols defined in previous versions; protocol files can vary a great deal between versions due to rapid code development. Please recreate any custom protocols.

Step 6 Details: (New Installations Only) Set up Distributed Computing

SMRT Analysis provides support for distributed computation using an existing job management system. Pacific Biosciences has explicitly validated Sun Grid Engine (SGE), LSF, and PBS. You only need to configure the software once, during the initial install; the upgrade process ports over any configuration settings from the previous version. This section describes setup for SGE and gives guidance for extending to other Job Management Systems.

Note: Celera® Assembler 7.0 will only work correctly with the SGE job management system. If you are not using SGE, you will need to deactivate the Celera® Assembler protocols so that they do not display in SMRT Portal. To do so, rename the following files, located in common/protocols:

RS_CeleraAssembler.1.xml to RS_CeleraAssembler.1.bak
filtering/CeleraAssemblerSFilter.1.xml to CeleraAssemblerSFilter.1.bak
assembly/CeleraAssembler.1.xml to CeleraAssembler.1.bak
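These renames can be scripted. A minimal sketch; the helper function name is illustrative, and the argument is your $SEYMOUR_HOME installation root:

```shell
# Sketch: rename the Celera Assembler protocol files so SMRT Portal
# no longer displays them. Pass the installation root ($SEYMOUR_HOME).
deactivate_ca_protocols() {
  local proto_dir="$1/common/protocols" f
  for f in RS_CeleraAssembler.1.xml \
           filtering/CeleraAssemblerSFilter.1.xml \
           assembly/CeleraAssembler.1.xml; do
    # .xml -> .bak, leaving each file in its original subdirectory
    [ -e "$proto_dir/$f" ] && mv "$proto_dir/$f" "$proto_dir/${f%.xml}.bak"
  done
  return 0
}
# e.g. deactivate_ca_protocols /opt/smrtanalysis
```

Renaming back to .xml restores the protocols in SMRT Portal.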

Configuring SMRT Portal

Running jobs in distributed mode is disabled by default in SMRT Portal. To enable distributed processing, set the jobsAreDistributed value in /opt/smrtanalysis/redist/tomcat/webapps/smrtportal/WEB-INF/web.xml to true, and then restart Tomcat:

<context-param>
<param-name>jobsAreDistributed</param-name>
<param-value>true</param-value>
</context-param>
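If you prefer to script this edit, the sketch below flips the parameter with sed. The helper name is illustrative, and it assumes the param-value element sits on the line after its param-name, as in the snippet above:

```shell
# Sketch: set jobsAreDistributed in web.xml to true or false, in place.
set_jobs_distributed() {  # usage: set_jobs_distributed <web.xml> <true|false>
  sed -i "/<param-name>jobsAreDistributed<\/param-name>/{n;s|<param-value>[^<]*</param-value>|<param-value>$2</param-value>|;}" "$1"
}
# e.g. set_jobs_distributed \
#   /opt/smrtanalysis/redist/tomcat/webapps/smrtportal/WEB-INF/web.xml true
```

Remember to restart Tomcat after changing the value.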

Smrtpipe.rc Configuration

Following are the options in the /opt/smrtanalysis/analysis/etc/smrtpipe.rc file that you can set to execute distributed SMRT Pipe runs.

  • CLUSTER_MANAGER Default value: SGE. A text string that points to template files in /opt/smrtanalysis/analysis/etc/cluster/. These files communicate with the Job Management System. SGE is officially supported, but adding new JMSs is straightforward.

  • EXIT_ON_FAILURE Default value: False. The default behavior is to continue executing tasks as long as possible. Set to True to specify that smrtpipe.py not submit any additional tasks after a failure.

  • MAX_CHUNKS Default value: 64. SMRT Pipe splits inputs into ‘chunks’ during distributed computing. Different tasks use different chunking mechanisms, but MAX_CHUNKS sets the maximum number of chunks any file or task will be split into. This also affects the maximum number of tasks and the size of the graph for a job.

  • MAX_THREADS Default value: 8. SMRT Pipe uses one thread per active task to launch, block, and monitor return status for each task. This option limits the number of active threads for a single job. Additional tasks wait until a thread is freed before launching.

  • MAX_SLOTS Default value: 256. SMRT Pipe cluster resource management is controlled by the ‘slots’ mechanism. MAX_SLOTS limits the total number of concurrent slots used by a single job. In a non-distributed environment, this roughly determines the total number of cores used at once.

  • NJOBS Default value: 64. Specifies the number of jobs to submit for a distributed job. This applies only to assembly workflows (S_* modules).

  • NPROC Default value: 15.

    • Determines the number of JMS ‘slots’ reserved by compute-intensive tasks.
    • Determines the number of cores that compute-intensive tasks will attempt to use.
    • In a distributed environment, NPROC should be at most (total slots - 1). This allows an I/O-heavy single-process task to share a node with a CPU-intensive task that would not otherwise be using the I/O.

  • SHARED_DIR Default value: $SEYMOUR_HOME/common/userdata/shared_dir/. A shared writable directory visible to all nodes, used for sharing temporary files that can be used by more than one compute process.

  • TMP Default value: /tmp/. Specifies the local temporary storage location for creating temporary files and directories that need fast read/write access. For optimal performance, this should have at least 100 GB of free space. Important: Make sure to change this to an actual temporary location on the head node and compute nodes. Your jobs will fail if the path does not exist.
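For reference, a distributed setup tuned via the options above might look like the fragment below. This is an illustrative sketch only; the exact key/value syntax and any section headers should be confirmed against the smrtpipe.rc shipped with your installation, and /scratch/tmp is an example local path:

```
CLUSTER_MANAGER = SGE
EXIT_ON_FAILURE = False
MAX_CHUNKS      = 64
NPROC           = 15
TMP             = /scratch/tmp
SHARED_DIR      = /opt/smrtanalysis/common/userdata/shared_dir
```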

Configuring Templates

The central components for setting up distributed computing in SMRT Analysis are the Job Management Templates (JMTs). JMTs provide a flexible format for specifying how SMRT Analysis communicates with the resident Job Management System (JMS). There are two templates that must be modified for your system:

  • start.tmpl is the legacy template used for assembly algorithms.
  • interactive.tmpl is the new template used for resequencing algorithms. The difference between the two is the additional requirement of a sync option in interactive.tmpl. (kill.tmpl is not used.)

Note: We are in the process of converting all protocols to use only interactive.tmpl.

To customize a JMS for a particular environment, edit or create start.tmpl and interactive.tmpl. For example, the installation includes the following sample start.tmpl and interactive.tmpl (respectively) for SGE:

qsub -pe smp ${NPROC} -S /bin/bash -V -q secondary -N ${JOB_ID} -o ${STDOUT_FILE} -e ${STDERR_FILE} ${EXTRAS} ${CMD}
qsub -S /bin/bash -sync y -V -q secondary -N ${JOB_ID} -o ${STDOUT_FILE} -e ${STDERR_FILE} -pe smp ${NPROC} ${CMD}

To support a new JMS:

  1. Create a new directory in etc/cluster/ under NEW_NAME.
  2. In smrtpipe.rc, change the CLUSTER_MANAGER variable to NEW_NAME, as described in “Smrtpipe.rc Configuration”.
  3. Once you have a new JMS directory specified, edit the interactive.tmpl and start.tmpl files for your particular setup.
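Steps 1 and 3 can be sketched as below, seeding the new directory from the shipped SGE templates. The SGE subdirectory name and the NEW_NAME placeholder are assumptions; check the contents of your etc/cluster/ directory:

```shell
# Sketch: create a new JMS template directory by copying the SGE samples.
new_jms_dir() {  # usage: new_jms_dir <cluster_dir> <new_name>
  mkdir "$1/$2"
  cp "$1/SGE/start.tmpl" "$1/SGE/interactive.tmpl" "$1/SGE/kill.tmpl" "$1/$2/"
}
# e.g. new_jms_dir /opt/smrtanalysis/analysis/etc/cluster NEW_NAME
# then edit start.tmpl and interactive.tmpl in the new directory
```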

Sample SGE, LSF and PBS templates are included with the installation in /opt/smrtanalysis/analysis/etc/cluster.

Specifying the SGE Job Management System:

For this version (v2.0.1), you must still edit both interactive.tmpl and start.tmpl as follows:

  1. Change secondary to the queue name on your system. (This is the -q option.)
  2. Change smp to the parallel environment on your system. (This is the -pe option.)

Specifying the PBS Job Management System

PBS does not have a -sync option, so the interactive.tmpl file runs a script named qsw.py to simulate the functionality. You must edit both interactive.tmpl and start.tmpl.

  1. Change the queue name to one that exists on your system. (This is the -q option.)
  2. Change the parallel environment to one that exists on your system. (This is the -pe option.)
  3. Make sure that interactive.tmpl calls the -PBS option.

Specifying the LSF Job Management System

Create an interactive.tmpl file by copying the start.tmpl file and adding the -K option to the bsub call. Alternatively, you can edit the sample LSF templates.
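As a sketch, an LSF interactive.tmpl mirroring the SGE samples might look like the line below, using bsub's -K (submit and wait) option. The queue name secondary is carried over from the SGE sample and must be changed to a queue that exists on your system:

```
bsub -K -q secondary -n ${NPROC} -J ${JOB_ID} -o ${STDOUT_FILE} -e ${STDERR_FILE} ${CMD}
```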

Specifying other Job Management Systems

We have not tested the -sync functionality on other systems. Find the equivalent of the -sync option for your JMS and create an interactive.tmpl file. If no -sync equivalent is available, you may need to edit the qsw.py script in /opt/smrtanalysis/analysis/lib/python2.7/pbpy-0.1-py2.7.egg/EGG-INFO/scripts/qsw.py to add additional options for wrapping jobs on your system.

The code for PBS and SGE looks like the following:

if '-PBS' in args:
    args.remove('-PBS')
    self.jobIdDecoder   = PBS_JOB_ID_DECODER
    self.noJobFoundCode = PBS_NO_JOB_FOUND_CODE
    self.successCode    = PBS_SUCCESS_CODE
    self.qstatCmd       = "qstat"
else:
    self.jobIdDecoder   = SGE_JOB_ID_DECODER
    self.noJobFoundCode = SGE_NO_JOB_FOUND_CODE
    self.successCode    = SGE_SUCCESS_CODE
    self.qstatCmd       = "qstat -j"

Configuring Submit hosts for Celera® Assembler

To run Celera® Assembler on a distributed infrastructure, all the execute hosts in your queue must also be submit hosts. You can add submit hosts by executing qconf -as <hostname> in SGE.
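That loop over all execution hosts can be sketched as below, assuming standard SGE qconf behavior (qconf -sel lists execution hosts); run it on a host with SGE admin rights:

```shell
# Sketch: register every SGE execution host as a submit host.
add_submit_hosts() {
  local h
  for h in $(qconf -sel); do  # -sel: show execution host list
    qconf -as "$h"            # -as: add submit host
  done
}
# e.g. run add_submit_hosts once on the SGE master host
```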

Step 7 Details: (New Installations Only) Set Up User Data Folders

SMRT Analysis saves references and results in its own hierarchy. Large amounts of data are generated, and storage can fill up quickly, so we suggest softlinking to an external directory with more storage.

All jobs and references, as well as drop boxes, are contained in /opt/smrtanalysis/common/userdata. You can move this folder to another location, then soft link /opt/smrtanalysis/common/userdata to the new location.

mv /opt/smrtanalysis/common/userdata /my_offline_storage
ln -s /my_offline_storage/userdata /opt/smrtanalysis/common/userdata

Step 8 Details: (New Installations Only) Set Up SMRT® Portal

  1. Use your web browser to start SMRT Portal: http://HOST:PORT/smrtportal
  2. Click Register at the top right.
  3. Create a user named administrator (all lowercase). This user is special, as it is the only user that does not require activation on creation.
  4. Enter the user name administrator.
  5. Enter an email address. All administrative emails, such as new user registrations, will be sent to this address.
  6. Enter the password and confirm the password.
  7. Select Click Here to access Change Settings.
  8. To set up the mail server, enter the SMTP server information and click Apply. For email authentication, enter a user name and password. You can also enable Transport Layer Security.
  9. To enable automated submission from a PacBio® RS instrument, click Add under the Instrument Web Services URI field. Then, enter the following into the dialog box and click OK:
http://INSTRUMENT_PAP01:8081

INSTRUMENT_PAP01 is the IP address or name (pap01) of the instrument. 8081 is the port for the instrument web service.

  10. Select the new URI, then click Test to check if SMRT Portal can communicate with the instrument service.
  11. (Optional) You can delete the pre-existing instrument entry by clicking Remove.

Step 9: Verify the installation

Create a test job in SMRT Portal using the provided lambda sequence data. This is data from a single SMRT cell that has been down-sampled to reduce overall tarball size. If you are upgrading, this cell will already have been imported into your system, and you can skip to step 10 below.

Open your web browser and clear the browser cache:

  • Google Chrome: Choose Tools > Clear browsing data. Choose the beginning of time from the droplist, then check Empty the cache and click Clear browsing data.
  • Internet Explorer: Choose Tools > Internet Options > General, then under Browsing history, click Delete. Check Temporary Internet files, then click Delete.
  • Firefox: Choose Tools > Options > Advanced, then click the Network tab. In the Cached Web Content section, click Clear Now.

  1. Refresh the current page by pressing F5.
  2. Log into SMRT Portal by navigating to http://HOST:PORT/smrtportal.
  3. Click Design Job.
  4. Click Import and Manage.
  5. Click Import SMRT Cells.
  6. Click Add.
  7. Enter /opt/smrtanalysis/common/test/primary/lambda, then click OK.
  8. Select the new path and click Scan. You should get a dialog saying "One input was scanned."
  9. Click Design Job.
  10. Click Create New.
  11. Enter a job name and comment.
  12. Select the protocol RS_Resequencing.1.
  13. Under SMRT Cells Available, select a lambda cell and click the right-arrow button.
  14. Click Save on the bottom right, then click Start. The job should complete successfully.
  15. Click the SMRT View button. SMRT View should open with tracks displayed, and the reads displayed in the Details panel.

For Research Use Only. Not for use in diagnostic procedures. © Copyright 2010 - 2013, Pacific Biosciences of California, Inc. All rights reserved. Information in this document is subject to change without notice. Pacific Biosciences assumes no responsibility for any errors or omissions in this document. Certain notices, terms, conditions and/or use restrictions may pertain to your use of Pacific Biosciences products and/or third party products. Please refer to the applicable Pacific Biosciences Terms and Conditions of Sale and the applicable license terms at http://www.pacificbiosciences.com/licenses.html. P/N 100-250-300