Added TODOs

PalMuc · Oct 13, 2011 · bdb8488
1 parent 6b40f59
commit bdb8488
Showing 6 changed files with 429 additions and 0 deletions.
diff --git a/Readme.md → Kingdom-Assignment-Readme.md b/Readme.md → Kingdom-Assignment-Readme.md
diff --git a/Kingdom-Extraction-Readme.md b/Kingdom-Extraction-Readme.md
@@ -0,0 +1,22 @@
+# Kingdom-Extraction
+
+## License
+This program is licensed under the GNU Lesser General Public License.
+See License.txt for more information.
+
+## Usage
+    Usage: kingdom-extraction sequences.fasta clean.csv contaminated.csv clean_output.fasta contaminated_output.fasta
+
+## Installation
+In a nutshell:
+
+    git clone git@github.com:PalMuc/Kingdom-Extraction.git
+    cd Kingdom-Extraction
+    rvm use jruby
+    rm pkg/*
+    bundle install
+    rake install
+    rvm jruby gem install pkg/*.gem
+
+## Acknowledgements
+Development of this program was supported by the [Molecular Geo- and Palaeobiology Lab](http://www.mol-palaeo.de/) of the Department of Earth and Environmental Sciences and the initiative "[Gleichstellung in Forschung und Lehre](http://www.frauenbeauftragte.uni-muenchen.de/foerdermoegl/lmu1/tg73/index.html)" of the Ludwig-Maximilians-University Munich (LMU).
diff --git a/Kingdom-Splitter-Readme.md b/Kingdom-Splitter-Readme.md
@@ -0,0 +1,125 @@
+# Kingdom-Splitter
+
+## License
+This program is licensed under the GNU Lesser General Public License.
+See License.txt for more information.
+
+## Description
+This gem is designed to sort out bacterial, archaeal and viral contaminations from eukaryotic Expressed Sequence Tag (EST) and genomic data.
+
+Kingdom-Splitter uses CSV files generated by [Kingdom-Assignment](https://github.com/PalMuc/Kingdom-Assignment) as input. This input file is split into two new CSV files. The first file contains all sequences that are deemed to belong to eukaryotic organisms according to the rules stated below. The second file contains all sequences that are deemed to be prokaryotic or viral contaminations.
+
+## Rules
+Sequences go into the clean eukaryotic subset when at least one of their three best BLAST hits does not match the contamination filter. Right now, this filter contains the NCBI taxonomies Bacteria, Archaea, Viruses and NONE, which represents [unknown sequences](http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Undef&id=12908&lvl=3&keep=1&srchmode=1&unlock).
+
+## Using Kingdom-Splitter
+
+    kingdom-splitter input.csv
+
+This will automatically create input\_clean.csv and input\_contaminated.csv in the same directory.
+
+## Customizing the rules
+Right now, it is not possible to customize the rules without modifying the source code.
+If you need the rules to work differently, [fork this project](http://help.github.com/fork-a-repo/) and modify it to your liking.
+
+User-customizable rules might come in a future version of this gem if there is demand for them. If you have any additional questions, contact me directly or [open an issue](https://github.com/PalMuc/Kingdom-Splitter/issues).
+
+## Prerequisites
+In order to install this gem you need to have several programs
+installed:
+
+ * Ruby, either version 1.8.7 or 1.9.2. The use of [JRuby](http://www.jruby.org/) (a Java implementation of Ruby) is recommended.
+ * Git
+ * cURL
+
+In the following, the installation procedure is given for **Mac OS X** and **Ubuntu Linux 10.10**. The commands for Ubuntu have also been tested on **Debian Squeeze**, although there you should use aptitude in place of apt-get.
+
+If you have already installed Kingdom-Assignment, these prerequisites are in place and you can jump right to the section "Installing Kingdom-Splitter".
+
+### Installing Git
+An installer for Mac OS X can be obtained from the [official website](http://git-scm.com/). For any Linux distribution it is recommended that you use your system's package manager to install Git. Look for a package called git or git-core. For Ubuntu 10.10 the command is:
+
+    sudo apt-get install git
+
+### Installing cURL
+Mac OS X comes with cURL by default. On a Linux system, cURL can be obtained via the system's package manager. For Ubuntu 10.10 the command is:
+
+    sudo apt-get install curl
+
+### Installing JRuby
+Very few distributions offer packages for the most recent version of JRuby.
+The easiest way to install the most recent version of JRuby is via the [Ruby Version Manager](http://rvm.beginrescueend.com/) by Wayne E. Seguin.
+
+Before you install RVM, make sure you have git and curl installed on your system.
+
+RVM can be installed by calling:
+
+    bash < <( curl http://rvm.beginrescueend.com/releases/rvm-install-head )
+
+This will install RVM to .rvm in your home folder and print several instructions specific to your platform on how to finish the installation. Please pay close attention to the "dependencies" section and look for the part where it says something like this:
+
+    # For Ruby (MRI & ree)  you should install the following OS dependencies:
+    ruby: /usr/bin/apt-get install build-essential bison openssl libreadline6 libreadline6-dev curl git-core zlib1g zlib1g-dev libssl-dev libyaml-dev libsqlite3-0 libsqlite3-dev sqlite3 libxml2-dev libxslt-dev autoconf libc6-dev ncurses-dev
+
+These are the requirements for building the normal C version of Ruby. However, many of those tools are also required for building the Java version of Ruby, so it is advisable to install all of these prerequisites. Please do not copy the commands from this file; look at the output of the RVM installer instead. For example:
+
+    sudo apt-get install build-essential bison openssl libreadline6 libreadline6-dev curl git-core zlib1g zlib1g-dev libssl-dev libyaml-dev libsqlite3-0 libsqlite3-dev sqlite3 libxml2-dev libxslt-dev autoconf libc6-dev ncurses-dev
+
+If installing any of these packages gives you an error, consider updating your packages by using your system's update manager.
+
+Next you need to install the tools that are specifically required for installing JRuby. The output of RVM might look like this:
+
+    # For JRuby (if you wish to use it) you will need:
+      jruby: /usr/bin/apt-get install curl g++ openjdk-6-jre-headless
+      jruby-head: /usr/bin/apt-get install ant openjdk-6-jdk
+
+It is recommended that you use the latest stable version of JRuby, not jruby-head. Accordingly, on Ubuntu 10.10 you have to install the following packages in order to use JRuby with RVM:
+
+    sudo apt-get install curl g++ openjdk-6-jre-headless
+
+Next, you have to make sure that RVM is loaded when you start a new shell. Look for the part where it says: "You m
+
+## Installing Kingdom-Splitter
+This gem is distributed in source form for the time being, so you must build it yourself in order to use it. Don't worry, it's not hard:
+
+First you must download the source code of this gem by going to a folder of your choice and typing:
+
+    git clone git@github.com:PalMuc/Kingdom-Splitter.git
+
+This will clone a copy of this repository into a folder named Kingdom-Splitter. Go to this folder by typing:
+
+    cd Kingdom-Splitter
+
+Kingdom-Splitter is delivered as a Ruby gem. In order to build and install it, you first have to install another gem called bundler. Type:
+
+    rvm jruby gem install bundler
+
+In order to install the other gems Kingdom-Splitter depends on, first switch to JRuby:
+
+    rvm use jruby
+
+Now make sure you are still in the folder called Kingdom-Splitter and type:
+
+    bundle install
+
+Before you build an updated version of Kingdom-Splitter, you should
+delete previous builds by typing:
+
+    rm pkg/kingdom-splitter-*.gem
+
+After that, create a new Ruby gem by typing:
+
+    rake install
+
+Finally you can install the gem by typing:
+
+    rvm jruby gem install pkg/kingdom-splitter*.gem
+
+Kingdom-Splitter is now in your global path, meaning that from any point in the system you can use it by typing
+
+    kingdom-splitter
+
+on the command line. Please note that in order to do that you have to switch to JRuby as mentioned before.
+
+## Acknowledgements
+Development of this program was supported by the [Molecular Geo- and Palaeobiology Lab](http://www.mol-palaeo.de/) of the Department of Earth and Environmental Sciences and the initiative "[Gleichstellung in Forschung und Lehre](http://www.frauenbeauftragte.uni-muenchen.de/foerdermoegl/lmu1/tg73/index.html)" of the Ludwig-Maximilians-University Munich (LMU).
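
Note on the "Rules" section of the Kingdom-Splitter README: the stated rule (a sequence is clean when at least one of its three best BLAST hits falls outside the contamination filter) can be illustrated with a minimal Ruby sketch. This is not taken from the Kingdom-Splitter source. The helper name `clean_sequence?`, the constant `CONTAMINATION_FILTER`, and the representation of hits as an array of taxonomy names ordered best hit first are all assumptions made for illustration.

```ruby
require 'set'

# Taxonomy names listed in the README as the contamination filter.
CONTAMINATION_FILTER = Set.new(%w[Bacteria Archaea Viruses NONE]).freeze

# Hypothetical helper, not part of Kingdom-Splitter.
# +hit_kingdoms+ is assumed to be an array of taxonomy names for one
# sequence's BLAST hits, ordered best hit first.
# Returns true when at least one of the three best hits is outside the filter.
def clean_sequence?(hit_kingdoms, filter = CONTAMINATION_FILTER)
  hit_kingdoms.first(3).any? { |kingdom| !filter.include?(kingdom) }
end

puts clean_sequence?(%w[Bacteria Eukaryota Bacteria]) # one non-filtered hit in the top three
puts clean_sequence?(%w[Bacteria NONE Viruses])       # all three hits match the filter
```

Under these assumptions a sequence with no hits at all would be treated as contaminated, because `any?` on an empty array is false. Whether Kingdom-Splitter handles that case the same way is not stated in the README.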
diff --git a/bin/kingdom-extraction b/bin/kingdom-extraction
@@ -0,0 +1,114 @@
+#!/usr/bin/env ruby
+
+# Collects the values of column +header+ from every row of +table+ into a Set.
+# Raises if a row has no value for +header+ or if a value occurs more than once.
+def table_to_set(table, header)
+  result = Set.new
+  table.each do |current_row|
+    current = current_row[header]
+    if current.nil?
+      raise "Error: no entry found for header #{header} at #{current_row.inspect}"
+    end
+
+    # Set#add? returns nil when the element is already present
+    if result.add?(current).nil?
+      raise "Error: duplicate entry for #{current}"
+    end
+  end
+  result
+end
+
+# Parse command line arguments
+settings = {}
+unless ARGV.size == 5
+  puts "Usage: kingdom-extraction sequences.fasta clean.csv contaminated.csv clean_output.fasta contaminated_output.fasta"
+  exit 1
+end
+
+$LOAD_PATH.unshift(File.join(File.dirname(__FILE__), '..', 'lib'))
+$LOAD_PATH.unshift(File.dirname(__FILE__))
+
+require 'rubygems'
+require 'csv'
+require 'set'
+require 'bio'
+require 'kingdom-extraction/version'
+
+puts "Running Kingdom-Extraction " +  Kingdom::Extraction::VERSION.to_s
+
+settings[:input_fasta] = ARGV.shift
+settings[:input_clean] = ARGV.shift
+settings[:input_contaminated] = ARGV.shift
+settings[:output_clean] = ARGV.shift
+settings[:output_contaminated] = ARGV.shift
+
+unless File.exist?(settings[:input_fasta])
+  puts "The input file at " + File.expand_path(settings[:input_fasta]) + " could not be opened!"
+  exit 1
+end
+
+unless File.exist?(settings[:input_clean])
+  puts "The input file at " + File.expand_path(settings[:input_clean]) + " could not be opened!"
+  exit 1
+end
+
+unless File.exist?(settings[:input_contaminated])
+  puts "The input file at " + File.expand_path(settings[:input_contaminated]) + " could not be opened!"
+  exit 1
+end
+
+if File.exist?(settings[:output_clean])
+  puts "The output file at " + File.expand_path(settings[:output_clean]) + " already exists!"
+  exit 1
+end
+
+if File.exist?(settings[:output_contaminated])
+  puts "The output file at " + File.expand_path(settings[:output_contaminated]) + " already exists!"
+  exit 1
+end
+
+#CSV backwards compatibility
+if CSV.const_defined? :Reader
+  require 'fastercsv'
+  INSTALLED_CSV = FasterCSV
+else
+  INSTALLED_CSV = CSV
+end
+
+#Open output of Kingdom-Splitter, save clean and contaminated sequence ids in two sets
+puts "Reading clean..."
+clean_table = INSTALLED_CSV.open(settings[:input_clean], "r", { :col_sep => ";", :headers => :first_row, :header_converters => :symbol})
+clean = table_to_set(clean_table, :query_sequence_id)
+clean_table.close
+
+puts "Reading contaminated..."
+contaminated_table = INSTALLED_CSV.open(settings[:input_contaminated], "r", { :col_sep => ";", :headers => :first_row, :header_converters => :symbol})
+contaminated = table_to_set(contaminated_table, :query_sequence_id)
+contaminated_table.close
+
+#Initialize output files
+clean_out = File.open(settings[:output_clean], "w")
+contaminated_out = File.open(settings[:output_contaminated], "w")
+
+puts "Extracting FASTA sequences..."
+QUERY_SEQ_REGEXP = /\A(\S+)\s.*\z/ #Make sure this is exactly the same as in BlastStringParser in Kingdom-Assignment
+
+sequences = Bio::FastaFormat.open(settings[:input_fasta])
+sequences.each do |entry|
+  current = QUERY_SEQ_REGEXP.match(entry.definition)[1] #TODO do something when this comparison fails
+  if clean.include?(current)
+    #Sequence belongs in the clean set
+    clean_out.write(entry)
+  elsif contaminated.include?(current)
+    #Sequence belongs in the contaminated set
+    contaminated_out.write(entry)
+  else
+    #Sequence is not annotated
+  end
+
+end
+
+sequences.close
+clean_out.close
+contaminated_out.close
+
+puts "Done!"
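
Note on the TODO at the `QUERY_SEQ_REGEXP.match(entry.definition)[1]` line: `match` returns `nil` when a definition line does not fit the pattern (for example, an id with no description after it), so indexing the result raises `NoMethodError`. One possible way to handle this, sketched below, is to return `nil` for such lines and let the caller warn and skip the entry. The helper name `extract_sequence_id` and the warn-and-skip policy are assumptions, not the author's chosen fix. The regular expression is copied unchanged from the script so that it stays in step with Kingdom-Assignment.

```ruby
# Same pattern as in bin/kingdom-extraction: an id, whitespace, then the rest.
QUERY_SEQ_REGEXP = /\A(\S+)\s.*\z/

# Hypothetical helper: returns the sequence id from a FASTA definition line,
# or nil when the line does not match QUERY_SEQ_REGEXP.
def extract_sequence_id(definition)
  match = QUERY_SEQ_REGEXP.match(definition.to_s)
  match && match[1]
end

# Example definitions, made up for illustration.
["seq1 some description", "seq2"].each do |definition|
  id = extract_sequence_id(definition)
  if id.nil?
    # Assumed policy: report the entry on stderr and move on.
    warn "Skipping FASTA entry with unparseable definition: #{definition.inspect}"
    next
  end
  puts id
end
```

In the extraction loop, the same check would sit before the `clean.include?(current)` test, so that unparseable entries end up in neither output file and are reported rather than crashing the run.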