Description: Use a Bayesian classifier to determine source code language
Clone URL: git://github.com/chrislo/sourceclassifier.git
name age message
file .gitignore Sun Dec 07 08:24:47 -0800 2008 .gitignore file [chrislo]
file HISTORY Tue Jan 06 11:02:15 -0800 2009 Up to gem v0.2.2, added HISTORY file [chrislo]
file Manifest Tue Jan 06 11:02:15 -0800 2009 Up to gem v0.2.2, added HISTORY file [chrislo]
file README.textile Tue Jan 06 11:02:15 -0800 2009 Up to gem v0.2.2, added HISTORY file [chrislo]
file Rakefile Thu Feb 19 13:55:45 -0800 2009 Ruby 1.9 compatibility fixes [chrislo]
directory examples/ Sun Dec 07 08:25:13 -0800 2008 Example of usage [chrislo]
directory lib/ Thu Feb 19 13:53:19 -0800 2009 [ruby 1.9 compatability] ftools has been remove... [chrislo]
file sourceclassifier.gemspec Thu Feb 19 13:55:45 -0800 2009 Ruby 1.9 compatibility fixes [chrislo]
directory sources/ Tue Jan 06 10:23:02 -0800 2009 Added a rake task to populate CSS from csszenga... [chrislo]
directory test/ Thu Feb 19 13:53:19 -0800 2009 [ruby 1.9 compatability] ftools has been remove... [chrislo]
file trainer.bin Tue Jan 06 10:23:02 -0800 2009 Added a rake task to populate CSS from csszenga... [chrislo]
README.textile

SourceClassifier

SourceClassifier identifies programming languages using a Bayesian classifier trained on a corpus generated from the Computer Language Benchmarks Game. It is written in Ruby and available as a gem. To train the classifier to identify new languages, download the sources from GitHub.

Out of the box, SourceClassifier recognises Css, C, Java, Javascript, Perl, Php, Python and Ruby.
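To give a feel for how a Bayesian classifier can tell languages apart, here is a toy naive Bayes sketch in plain Ruby. It is not the code SourceClassifier uses (the gem relies on the Classifier gem); the class name TinyBayes, its tokenizer, the add-one smoothing and the uniform prior over languages are all assumptions made for illustration.

```ruby
# Toy multinomial naive Bayes over source tokens.
# Hypothetical illustration only; not SourceClassifier's implementation.
class TinyBayes
  def initialize
    @counts = Hash.new { |h, k| h[k] = Hash.new(0) } # label => token => count
    @totals = Hash.new(0)                            # label => tokens seen
    @vocab  = {}                                     # every token seen in training
  end

  # Split text into identifiers and single punctuation characters.
  def tokenize(text)
    text.scan(/[A-Za-z_]+|[^\sA-Za-z_0-9]/)
  end

  # Record the tokens of one example program under the given label.
  def train(label, text)
    tokenize(text).each do |token|
      @counts[label][token] += 1
      @totals[label] += 1
      @vocab[token] = true
    end
  end

  # Return the label with the highest log-likelihood (uniform prior assumed).
  def identify(text)
    tokens = tokenize(text)
    scored = @counts.keys.map do |label|
      denominator = @totals[label] + @vocab.size
      score = tokens.inject(0.0) do |sum, token|
        # Add-one smoothing keeps unseen tokens from zeroing the score.
        sum + Math.log((@counts[label][token] + 1.0) / denominator)
      end
      [label, score]
    end
    scored.max_by { |_, score| score }.first
  end
end

nb = TinyBayes.new
nb.train("Ruby", "def hello(name)\n  puts name\nend\n")
nb.train("Gcc", "#include <stdio.h>\nint main() { printf(\"hi\"); return 0; }\n")

nb.identify("def add(a, b)\n  a + b\nend\n") #=> "Ruby"
nb.identify("int main() { return 1; }")      #=> "Gcc"
```

With only one example per language the result hinges on a handful of tokens such as def, end and int; a realistic corpus needs many programs per language, which is what the bundled training data provides.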

Usage

First, install the gem using GitHub as a source:

$ gem sources -a http://gems.github.com
$ sudo gem install chrislo-sourceclassifier

Then, to use it:

  require 'rubygems'
  require 'sourceclassifier'
  
  s = SourceClassifier.new
  
  # <<-EOT (with the dash) allows the closing EOT to be indented
  ruby_text = <<-EOT
  def my_sorting_function(a)
    a.sort
  end
  EOT
  
  c_text = <<-EOT
  #include <unistd.h>
  
  int main() {
    write(1, "hello world\n", 12);
    return(0);
  }
  EOT
  
  s.identify(ruby_text) #=> Ruby
  s.identify(c_text)    #=> Gcc (the label returned for this C program)

Training

Download the sources from GitHub and, in the project directory, run the training rake task:

$ rake train

In the ./sources directory are subdirectories for each language you wish to be able to identify. Each subdirectory contains examples of programs written in that language. The name of the directory is significant – it is the value returned by the SourceClassifier.identify() method.
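The layout described above can be pictured as a mapping from directory name to example programs. The following is a minimal sketch of reading such a tree; load_corpus is a hypothetical helper, not part of the gem, and it returns directory names exactly as they appear on disk (whether the gem changes their case is not stated here).

```ruby
# Hypothetical helper: map each language subdirectory to its example programs.
# Returns a hash of { directory name => [file contents, ...] }.
def load_corpus(sources_dir)
  corpus = {}
  Dir.glob(File.join(sources_dir, '*')).sort.each do |path|
    next unless File.directory?(path) # stray files at the top level are ignored
    examples = Dir.glob(File.join(path, '*')).select { |f| File.file?(f) }.sort
    corpus[File.basename(path)] = examples.map { |f| File.read(f) }
  end
  corpus
end

# Print one line per label; prints nothing if ./sources does not exist.
load_corpus('./sources').each do |label, texts|
  puts "#{label}: #{texts.size} example(s)"
end
```

Listing the labels this way is a quick check of which values identify can return after training on your own examples.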

The rake task populate:shootout can be used to build these subdirectories from a checkout of the Computer Language Shootout sources, but you are free to train the classifier using any available examples. Edit the Rakefile to point to your checkout of the shootout sources.

Run rake populate:css to grab the CSS files used to train the classifier from csszengarden.com.

To populate the sources directory using all available sources, run:

$ rake populate:all

Acknowledgments

This library depends heavily on the great Classifier gem by Lucas Carlson and David Fayram II.