Added TODOs
Philipp Comans committed Oct 13, 2011
1 parent 6b40f59 commit bdb8488
Showing 6 changed files with 429 additions and 0 deletions.
File renamed without changes.
22 changes: 22 additions & 0 deletions Kingdom-Extraction-Readme.md
@@ -0,0 +1,22 @@
# Kingdom-Extraction

## License
This program is licensed under the GNU Lesser General Public License.
See License.txt for more information.

## Usage
Usage: kingdom-extraction sequences.fasta clean.csv contaminated.csv clean_output.fasta contaminated_output.fasta
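
The script in bin/kingdom-extraction links each FASTA entry to the CSV files through its sequence id, which it takes from the start of the FASTA definition line with the regular expression `QUERY_SEQ_REGEXP`. The sketch below only illustrates that matching step. The helper name `sequence_id` is hypothetical and not part of the program:

```ruby
# Same pattern as QUERY_SEQ_REGEXP in bin/kingdom-extraction.
QUERY_SEQ_REGEXP = /\A(\S+)\s.*\z/

# Hypothetical helper: returns the sequence id, or nil when the
# definition line does not match the pattern (for example, when it
# consists of an id only, with no description after it).
def sequence_id(definition)
  match = QUERY_SEQ_REGEXP.match(definition)
  match && match[1]
end

p sequence_id("contig_0001 length=512 reads=14") # id followed by a description
p sequence_id("contig_0002")                     # id only, no whitespace
```

Because the pattern requires whitespace after the id, a definition line without a description does not match, so the ids in both CSV files should come from headers of the same shape.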

## Installation
In a nutshell:

git clone git@github.com:PalMuc/Kingdom-Extraction.git
cd Kingdom-Extraction
rvm use jruby
rm pkg/*
bundle install
rake install
rvm jruby gem install pkg/*.gem

## Acknowledgements
Development of this program was supported by the [Molecular Geo- and Palaeobiology Lab](http://www.mol-palaeo.de/) of the Department of Earth and Environmental Sciences and the initiative "[Gleichstellung in Forschung und Lehre](http://www.frauenbeauftragte.uni-muenchen.de/foerdermoegl/lmu1/tg73/index.html)" of the Ludwig-Maximilians-University Munich (LMU).
125 changes: 125 additions & 0 deletions Kingdom-Splitter-Readme.md
@@ -0,0 +1,125 @@
# Kingdom-Splitter

## License
This program is licensed under the GNU Lesser General Public License.
See License.txt for more information.

## Description
This gem is designed to separate bacterial, archaeal and viral contaminants from eukaryotic Expressed Sequence Tag (EST) and genomic data.

Kingdom-Splitter uses CSV files generated by [Kingdom-Assignment](https://github.com/PalMuc/Kingdom-Assignment) as input. The input file is split into two new CSV files. The first file contains all sequences that are deemed to belong to eukaryotic organisms according to the rules stated below. The second file contains all sequences that are deemed to be prokaryotic or viral contaminants.

## Rules
Sequences go into the clean eukaryotic subset when at least one of their three best BLAST hits does not match the contamination filter. Right now, this filter contains the NCBI taxonomies Bacteria, Archaea, Viruses and NONE, which represents [unknown sequences](http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=12908&lvl=3&keep=1&srchmode=1&unlock).
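
The rule can be sketched in Ruby as follows. This is a minimal illustration, not the gem's actual implementation: the constant `CONTAMINATION_FILTER`, the method `eukaryotic?`, and the representation of BLAST hits as plain kingdom strings are all assumptions made for this example.

```ruby
# Hypothetical sketch of the splitting rule; all names are illustrative.
CONTAMINATION_FILTER = ["Bacteria", "Archaea", "Viruses", "NONE"].freeze

# best_hit_kingdoms: kingdoms of a sequence's BLAST hits, best hit first.
# A sequence counts as clean (eukaryotic) when at least one of its three
# best hits falls outside the contamination filter.
def eukaryotic?(best_hit_kingdoms)
  best_hit_kingdoms.first(3).any? do |kingdom|
    !CONTAMINATION_FILTER.include?(kingdom)
  end
end

puts eukaryotic?(["Bacteria", "Eukaryota", "Bacteria"]) # one hit passes the filter
puts eukaryotic?(["Bacteria", "Archaea", "NONE"])       # every hit matches the filter
```

In this sketch a sequence without any hits is treated as contaminated, because no hit passes the filter. Whether the gem handles that case the same way is not stated here.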

## Using Kingdom-Splitter

kingdom-splitter input.csv

This will automatically create input\_clean.csv and input\_contaminated.csv in the same directory.
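
The naming scheme can be illustrated with a short Ruby sketch. The helper `output_paths` is hypothetical and only shows how the two file names relate to the input name. It is not taken from the gem's source.

```ruby
# Hypothetical helper: derive the two output paths from the input path.
def output_paths(input_path)
  dir  = File.dirname(input_path)
  ext  = File.extname(input_path)        # e.g. ".csv"
  base = File.basename(input_path, ext)  # e.g. "input"
  [File.join(dir, "#{base}_clean#{ext}"),
   File.join(dir, "#{base}_contaminated#{ext}")]
end

clean, contaminated = output_paths("data/input.csv")
puts clean
puts contaminated
```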

## Customizing the rules
Right now, it is not possible to customize the rules without modifying the source code.
If you need the rules to work differently, [fork this project](http://help.github.com/fork-a-repo/) and modify it to your liking.

User-customizable rules might be added in a future version of this gem if there is demand for them. If you have any additional questions, contact me directly or [open an issue](https://github.com/PalMuc/Kingdom-Splitter/issues).

## Prerequisites
In order to install this gem you need to have several programs
installed:

* Ruby either in version 1.8.7 or 1.9.2. The use of [JRuby](http://www.jruby.org/) (a Java implementation of Ruby) is recommended.
* Git
* cURL

The installation procedure below is given for **Mac OS X** and **Ubuntu Linux 10.10**. The Ubuntu commands have also been tested on **Debian Squeeze**, although there you should use aptitude instead of apt-get.

If you have already installed Kingdom-Assignment, you can skip ahead to the section "Installing Kingdom-Splitter".

### Installing Git
An installer for Mac OS X can be obtained from the [official website](http://git-scm.com/). For any Linux distribution it is recommended that you use your system's package manager to install Git. Look for a package called git or git-core. For Ubuntu 10.10 the command is:

sudo apt-get install git

### Installing cURL
Mac OS X ships with cURL by default. On a Linux system, cURL can be obtained via the system's package manager. For Ubuntu 10.10 the command is:

sudo apt-get install curl

### Installing JRuby
Very few distributions offer packages for the most recent version of JRuby.
The easiest way to install the most recent version of JRuby is via the [Ruby Version Manager](http://rvm.beginrescueend.com/) by Wayne E. Seguin.

Before you install RVM, make sure you have git and curl installed on your system.

RVM can be installed by calling:

bash < <( curl http://rvm.beginrescueend.com/releases/rvm-install-head )

This will install RVM to .rvm in your home folder and print several instructions specific to your platform on how to finish the installation. Please pay close attention to the "dependencies" section and look for the part where it says something like this:

# For Ruby (MRI & ree) you should install the following OS dependencies:
ruby: /usr/bin/apt-get install build-essential bison openssl libreadline6 libreadline6-dev curl git-core zlib1g zlib1g-dev libssl-dev libyaml-dev libsqlite3-0 libsqlite3-dev sqlite3 libxml2-dev libxslt-dev autoconf libc6-dev ncurses-dev

These are the requirements for building the standard C implementation of Ruby. However, many of these tools are also required for building the Java version of Ruby, so it is advisable to install all of them. Do not copy the command from this file; use the one printed by the RVM installer. It will look similar to this:

sudo apt-get install build-essential bison openssl libreadline6 libreadline6-dev curl git-core zlib1g zlib1g-dev libssl-dev libyaml-dev libsqlite3-0 libsqlite3-dev sqlite3 libxml2-dev libxslt-dev autoconf libc6-dev ncurses-dev

If installing any of these packages gives you an error, consider updating your packages by using your system's update manager.

Next you need to install the tools that are specifically required for installing JRuby. The output of RVM might look like this:

# For JRuby (if you wish to use it) you will need:
jruby: /usr/bin/apt-get install curl g++ openjdk-6-jre-headless
jruby-head: /usr/bin/apt-get install ant openjdk-6-jdk

It is recommended that you use the latest stable version of JRuby, not jruby-head. Accordingly, on Ubuntu 10.10 you have to install the following packages in order to use JRuby with RVM:

sudo apt-get install curl g++ openjdk-6-jre-headless

Next, you have to make sure that RVM is loaded when you start a new shell. Look for the part where it says: "You m

## Installing Kingdom-Splitter
This gem is distributed in source form for the time being, so you must build it yourself in order to use it. Don't worry, it's not hard:

First you must download the source code of this gem by going to a folder of your choice and typing:

git clone git@github.com:PalMuc/Kingdom-Splitter.git

This will clone a copy of this repository into a folder named Kingdom-Splitter. Go to this folder by typing:

cd Kingdom-Splitter

Kingdom-Splitter is delivered as a Ruby gem. In order to build and install it, you first have to install another gem called bundler. Type:

rvm jruby gem install bundler

In order to install the other gems Kingdom-Splitter depends on, first switch to JRuby:

rvm use jruby

Now, from inside the Kingdom-Splitter folder, type:

bundle install

Before you build an updated version of Kingdom-Splitter, you should
delete previous builds by typing:

rm pkg/kingdom-splitter-*.gem

After that, create a new Ruby gem by typing:

rake install

Finally you can install the gem by typing:

rvm jruby gem install pkg/kingdom-splitter*.gem

Kingdom-Splitter is now in your global path, meaning that you can run it from any directory by typing

kingdom-splitter

on the command line. Please note that in order to do that you have to switch to JRuby as mentioned before.

## Acknowledgements
Development of this program was supported by the [Molecular Geo- and Palaeobiology Lab](http://www.mol-palaeo.de/) of the Department of Earth and Environmental Sciences and the initiative "[Gleichstellung in Forschung und Lehre](http://www.frauenbeauftragte.uni-muenchen.de/foerdermoegl/lmu1/tg73/index.html)" of the Ludwig-Maximilians-University Munich (LMU).
114 changes: 114 additions & 0 deletions bin/kingdom-extraction
@@ -0,0 +1,114 @@
#!/usr/bin/env ruby

# Collect the values of one CSV column into a Set.
# Raises if a row has no value for the column or if a value occurs twice.
def table_to_set(table, header)
  result = Set.new
  table.each do |current_row|
    current = current_row[header]
    if current.nil?
      raise "Error: no entry found for header " + header.to_s + " at " + current_row.inspect
    end

    if result.include?(current)
      raise "Error: duplicate entry for " + current.to_s
    else
      result.add(current)
    end
  end
  result
end

# Parse command line arguments
settings = {}
unless ARGV.size == 5
  puts "Usage: kingdom-extraction sequences.fasta clean.csv contaminated.csv clean_output.fasta contaminated_output.fasta"
  exit 1
end

$LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
$LOAD_PATH.unshift(File.dirname(__FILE__))

require 'rubygems'
require 'csv'
require 'set'
require 'bio'
require 'kingdom-extraction/version'

puts "Running Kingdom-Extraction " + Kingdom::Extraction::VERSION.to_s

settings[:input_fasta] = ARGV.shift
settings[:input_clean] = ARGV.shift
settings[:input_contaminated] = ARGV.shift
settings[:output_clean] = ARGV.shift
settings[:output_contaminated] = ARGV.shift

# Abort if any input file is missing
[:input_fasta, :input_clean, :input_contaminated].each do |key|
  unless File.exist?(settings[key])
    puts "The input file at " + File.expand_path(settings[key]) + " could not be opened!"
    exit 1
  end
end

# Refuse to overwrite existing output files
[:output_clean, :output_contaminated].each do |key|
  if File.exist?(settings[key])
    puts "The output file at " + File.expand_path(settings[key]) + " already exists!"
    exit 1
  end
end

# CSV backwards compatibility: the old Ruby 1.8 CSV library defines CSV::Reader,
# so fall back to the FasterCSV gem there
if CSV.const_defined? :Reader
  require 'fastercsv'
  INSTALLED_CSV = FasterCSV
else
  INSTALLED_CSV = CSV
end

# Open output of Kingdom-Splitter, save clean and contaminated sequence ids in two sets.
# The options are passed without braces so that they work both as a trailing hash
# (older Rubies) and as keyword arguments (Ruby 3 and later).
puts "Reading clean..."
clean_table = INSTALLED_CSV.open(settings[:input_clean], "r", :col_sep => ";", :headers => :first_row, :header_converters => :symbol)
clean = table_to_set(clean_table, :query_sequence_id)
clean_table.close

puts "Reading contaminated..."
contaminated_table = INSTALLED_CSV.open(settings[:input_contaminated], "r", :col_sep => ";", :headers => :first_row, :header_converters => :symbol)
contaminated = table_to_set(contaminated_table, :query_sequence_id)
contaminated_table.close

#Initialize output files
clean_out = File.open(settings[:output_clean], "w")
contaminated_out = File.open(settings[:output_contaminated], "w")

puts "Extracting FASTA sequences..."
QUERY_SEQ_REGEXP = /\A(\S+)\s.*\z/ #Make sure this is exactly the same as in BlastStringParser in Kingdom-Assignment

sequences = Bio::FastaFormat.open(settings[:input_fasta])
sequences.each do |entry|
  match = QUERY_SEQ_REGEXP.match(entry.definition)
  if match.nil?
    # TODO decide how unparseable headers should be handled; for now, warn and skip
    warn "Warning: could not extract a sequence id from " + entry.definition.inspect
    next
  end

  current = match[1]
  if clean.include?(current)
    # Sequence belongs in the clean set
    clean_out.write(entry)
  elsif contaminated.include?(current)
    # Sequence belongs in the contaminated set
    contaminated_out.write(entry)
  else
    # Sequence is not annotated
  end
end

sequences.close
clean_out.close
contaminated_out.close

puts "Done!"
