GitHub - EESI/nbc: Naive Bayes Classification Tool

This is the Naive Bayes Classifier, developed by the genomic signal
processing lab led by Professor Gail Rosen at Drexel University.

It uses a method similar to the one used in many email spam filters to
score a genetic sample against different genomes, in order to identify
the likely closest match. The method is described in the paper <>
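
As a rough illustration of the idea (this is not the code in this
repository), the Python sketch below scores a read against per-genome
word counts in the same way a simple naive Bayes spam filter scores a
message against per-class word counts. The function names, the add-one
smoothing, and the toy sequences are all assumptions made for the
example; the score program may differ in every one of these details.

```python
import math
from collections import Counter

def count_words(sequence, k):
    """Count every overlapping word of length k in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def log_score(read, genome_counts, k):
    """Sum of log word probabilities of the read under one genome.

    Add-one smoothing over the 4**k possible DNA words is an assumption
    of this sketch, not necessarily what the score program does.
    """
    total = sum(genome_counts.values())
    vocabulary = 4 ** k
    score = 0.0
    for i in range(len(read) - k + 1):
        word = read[i:i + k]
        score += math.log((genome_counts[word] + 1) / (total + vocabulary))
    return score

# Toy "genomes"; real ones would come from FASTA files.
genomes = {
    "Unicorn": "ACGTACGTACGTACGT",
    "Wumpus": "TTTTGGGGTTTTGGGG",
}
k = 3
counts = {name: count_words(seq, k) for name, seq in genomes.items()}

# Score one read against every genome and keep the highest log score.
read = "ACGTAC"
scores = {name: log_score(read, c, k) for name, c in counts.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Working with sums of logarithms rather than products of probabilities
avoids numeric underflow on long reads.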

REQUIREMENTS

To compile this code you need the following (versions given are the
versions we used, but slightly older or newer versions probably work too):

MLton 20100608
GNU binutils 2.18.1
GNU C Compiler 4.3.2
GNU Make 3.81
Judy 1.0.5
zlib 1.2.3.3

It has been tested extensively on Mac OS X and Linux, on both 32-bit
and 64-bit processors. It probably works on other, similar operating
systems without any changes. 64-bit uses more memory but also allows
larger genomes to be used. No other differences between the 32-bit and
64-bit versions have been observed. It may work on Windows, but that has
not been attempted.

Since it is now written in Standard ML, it may in theory be compilable
with other Standard ML compilers, such as Standard ML of New Jersey,
MLKit, PolyML, etc. We have not attempted this. Some changes would
probably be necessary since the MLton foreign function interface (used
for Judy array and gzip support) is different from the interfaces used
by other compilers.

BUILDING

For all the example commands, the $ indicates the shell prompt. Don't type
the $, just everything after the $. And most of these examples should
not be typed in verbatim (unless you happen to have the genomes for a
unicorn and a wumpus lying around - in that case, lucky you!). Instead
modify the examples to suit your particular circumstances.

Run "make" to build:

$ make

Assuming it completes without any problems, you will have three
programs: count, score, and tabulate. Install them somewhere in your path:

$ cp count score tabulate /usr/local/bin

SETUP

The first step is to set up your genome data. Create a new directory,
for example "genomes", and inside that directory, create a directory
for each genome:

$ mkdir genomes
$ mkdir genomes/Unicorn
$ mkdir genomes/Wumpus

Then run count on the FASTA files containing each genome (and any
plasmids), once for each word size you want to score against:

$ count -w genomes/Unicorn/15perword.gz -t genomes/Unicorn/15total \
	-r 15 Unicorn.fasta Unicorn_plasmid.fasta
$ count -w genomes/Unicorn/13perword.gz -t genomes/Unicorn/13total \
	-r 13 Unicorn.fasta Unicorn_plasmid.fasta
$ count -w genomes/Wumpus/15perword.gz -t genomes/Wumpus/15total \
	-r 15 Wumpus.fasta
$ count -w genomes/Wumpus/13perword.gz -t genomes/Wumpus/13total \
	-r 13 Wumpus.fasta
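
With more than a couple of genomes, typing these commands by hand gets
tedious. The Python sketch below builds the same count command lines for
every genome and word size. It uses only the -w, -t and -r options shown
above; the GENOMES table, the function names, and the dry-run switch are
assumptions made for the example.

```python
import subprocess

def count_command(genome, order, fasta_files, root="genomes"):
    """Build the count command line shown above for one genome and word size."""
    prefix = f"{root}/{genome}/{order}"
    return ["count", "-w", f"{prefix}perword.gz", "-t", f"{prefix}total",
            "-r", str(order), *fasta_files]

# Hypothetical inventory: genome name -> its FASTA files.
GENOMES = {
    "Unicorn": ["Unicorn.fasta", "Unicorn_plasmid.fasta"],
    "Wumpus": ["Wumpus.fasta"],
}
ORDERS = [15, 13]

def all_commands(genomes=GENOMES, orders=ORDERS):
    """One command per (genome, word size) pair."""
    return [count_command(name, order, files)
            for name, files in genomes.items()
            for order in orders]

def run_all(dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for command in all_commands():
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)

if __name__ == "__main__":
    run_all(dry_run=True)
```

Run it with dry_run=True first to check the generated commands, then
switch to dry_run=False once the genome directories exist.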

SCORING

Now, run score on your input file. Order 15 usually gives the best
results so we'll try that first:

$ score -a semen_sample.fasta -r 15 -j genomes

For this example, you would get two files:
	semen_sample-15-Unicorn.txt
	semen_sample-15-Wumpus.txt

TABULATION

For easy import into a spreadsheet, you can run tabulate to convert the
score output files to CSV format:

$ tabulate semen_sample-15-Unicorn.txt semen_sample-15-Wumpus.txt

This will create the files:
	semen_sample-15-0.csv.gz
	semen_sample-15-1.csv.gz
	semen_sample-15-2.csv.gz
and so on. The exact number of files will depend on how big your input
file is.

FURTHER INFORMATION

Each command has a --help option, which may be helpful.

BUGS

count and score load the entire genome into memory. For large genomes this
requires a stupendous amount of memory.

LICENSE

It has been licensed under the <> license.
See the LICENSE file for details.

FEEDBACK

Any feedback should be directed to gailr@gmail.com.