notes.txt

1. INSTALLING TPP

--------
#If admin permission is needed, enter the command: sudo bash
cd /usr/local/src/
svn co https://sashimi.svn.sourceforge.net/svnroot/sashimi/tags/release_4-3-1/trans_proteomic_pipeline/
cd trans_proteomic_pipeline/installer_linux/
sh ./install-prerequisites-ubuntu-9_04.sh

cd ../src/
make
make install
--------

This version of TPP comes with an outdated X! Tandem, so I downloaded the recent one from their web site. g++ complains at it though, so I had to add "#include <stdint.h>" to saxhandler.h. It also complains at several files, so I had to add "#include <string.h>" to those files. Also, X! Tandem automatically adds the exact date to the name of each output file, making it more difficult for automation. I fixed this by commenting out line 914 of tandem-linux-10-01-01-4/src/mreport.cpp and re-executing make.


2. CHILD PROCESSES

Since the pipeline relies on several programs, I chose to execute those programs by using the kernel::fork method when the pipeline doesn't have to wait right away for the process to finish, as this takes advantage of multiple cores. In order to wait for several processes to finish, such as in the search class, I created a method to run a while loop calling Process.wait. This way, it will keep waiting until all child processes have finished, throw an exception when there are none to wait on but be caught be an exception handler.


3. HARDKLOR

Hardklor's output is in a unique format and is thus cannot be used as input to the search engines. Can it be converted to an appropriate format?


4. SELECTING AN ENZYME

The search engines use different ways of selecting an enzyme in the command line. To handle this, I chose to have the name of the enzyme be the input to the pipeline, while nokogiri is parses an xml file to obtain the correct input for the search engine. X! Tandem didn't have such an xml file, so I made it myself.


5. DECOY DATABASES

I created a class to select and reverse a database to be used in decoy searches. The code to format the output may not be necessary. formatdb needs to be called to convert a database into the correct format for OMSSA, both on the normal and reversed databases. I have not at this time automated the database reversal.


6. INSTALLING FROM JUST THE PIPELINE SOURCE CODE

As of 4 June 2010:
cd pipeline/
sudo aptitude install ruby1.9.1-full
sudo aptitude install rubygems1.9.1
gem install ms-msrun
*Crazy stuff happened
gem install builder
gem install ms-error_rate
sudo aptitude install wine
install ReAdW.exe and get XCalibur's Xrawfile2.dll, fileio.dll, fregistry.dll into the same folder as ReAdW.exe, along with zlib1.dll
download ProteoWizard on a Windows machine, and place msconvert_server.rb and scriptit.bat in the same folder
SEE NOTE 11
install hardklor
install X! Tandem. May need to add "#include <stdint.h>" to saxhandler.h. Make correction as noted in note 7.
install OMSSA
install Tide from http://noble.gs.washington.edu/proj/tide/
download search engine databases
sudo apt-get install blast2. This is to use formatdb on the search engine databases to format them for OMSSA.
install Percolator from http://github.com/downloads/percolator/percolator/percolator_1.14_src.tar.gz. May need to add "#include <stdint.h>" to Scores.h.


7. MODIFIED X! TANDEM

For some reason, X! Tandem automatically renamed the output files to include the date and time. This complicated automation, so I edited the source code to not add the date and time by commenting out line 914 of tandem-linux/src/mreport.cpp.


8. POSSIBLE NAMES

ProjectZero: Because Zero kicks butt, and so should this. Megaman Zero reference.
WarpPipe: Because, well, both pipeline and warp pipe have the word pipe in it. Super Mario reference.
JPipe: Because my name starts with J, and so does John's. There's gotta be at least one narcissistic name, right?
KatamariShiteki (Clod Identification), KatamariDotei (Clod Identification) or KatamariBunkai (Clod Analysis): Because it's a wad of other programs. Katamari Damacy reference.
SmartPipe: Because Smartball can whiz back and forth right through pipes. Smartball reference.
MasterPipe: Eh, sounds cool. Zelda reference.
PeptideBaton: Like a music conductor, the user simply waves his baton (Enters a command) and the the orchestra (Program) plays the parts of the song (Runs the different programs). Windwaker reference.
Decided on KatamariDotei.


9. DIFFERENT DATABASE FORMATS

FASTA databases are used, but in addition to that, other formats of the FASTA files are used.
	A. OMSSA requires a special format, which is done by running formatdb on the FASTA file.
	B. The FASTA files are reversed and used for decoy searches. Reversed databases are obtained by running reverse_database.rb.
	C. Tide uses a special format from the FASTA file to increase its search speed. search.rb handles this.
	D. A YAML file is used as a quick means of grabbing the proteins for Percolator.
	Use src/database_formatter.rb to create these other formats.


10. MASCOT DATABASE NAMES

Because the name of a database from the Mascot's selection is not necessarily the same as the file name, I created a mapping between database ID and Mascot name located in pipeline/mascot/database-Mascot.xml


11. COST OF MZML

The pipeline started off with using mzXML, but adding support for mzML came at the cost of greater complexity to setting up the pipeline. msconvert.exe is used to convert a raw file into mzML, and while I tried to get it running under wine, I just couldn't get it to work. Instead, a Windows computer has to be running msconvert_server.rb, so that the pipeline running on a Linux computer can communicate with the Windows computer via Ruby's socket library.


12. FILE NAMING CONVENTIONS

To make the automation easier (for me), the following naming conventions are used for the files:
	A. Extensions are named as follows: .mzXML, .mzML, .pep.xml, .mzid
	B. Search engine outputs have the original raw file name with the direction (either target or decoy), the search engine name (mascot, tandem, tide, or omssa), and the run number added to the file name, e.g. test_1-target_mascot.pep.xml
	C. Combined psms files are have the word "combined" with the run number, e.g. test_combined_1.psms


13. RUBY AND LARGE FILES

Trying to write concise code, I would write IO one-liners such as "File.open("#{fileName}.mzML", 'w') {|io| io.print client.gets}".
This, however, led to strange errors and low processing speeds. Breaking them up into parts such as

"file = File.open("#{fileName}.mzML", 'w')
data = client.gets
file.print data"

solved the problem. So if you try tidying up the code and come across some lines of IO that could be reduced to fewer lines, DON'T DO IT!!!