Google Summer of Code 2013 Ideas
Hey! We're pleased to announce that the Ruby Science Foundation has been accepted as a mentoring organization for Google Summer of Code 2013.
Feel free to reach us by joining #sciruby on chat.freenode.net or via our mailing list.
Instructions for students
You don't need to know a lot about Ruby before proposing a project: depending on how much you already know, it'll be pretty easy to learn enough to be able to contribute. However, you'll need some familiarity with scientific computation. If you don't have any, take a look at "Numerical Recipes in C", which you'll probably find in your university's library.
In any case, if you feel your skills aren't enough for some project, please ask us on our IRC channel (see contact section above) and we can help you.
Our number-one priority right now as an organization is NMatrix. Other priorities come close, but so far we haven't seen a lot of students expressing interest in NMatrix. In contrast, tons of folks have talked about how to accomplish the Ruby D3 idea, so there will be a lot of competition for that spot. Take this as a hint.
NMatrix
NMatrix is SciRuby's numerical matrix core, implementing dense matrices as well as two types of sparse (linked-list-based and Yale/CSR). NMatrix is a fairly new but well-established project which has received Summer-of-Code-like grants from both Brighter Planet and the Ruby Association (in other words, from Matz, who created Ruby). Those who contribute to NMatrix will likely eventually become authors of a jointly-published peer-reviewed science article on the library. Additionally, NMatrix is a good place to gain practical C and C++ experience, while also working to improve Ruby.
- Mentors: John Woods (@mohawkjohn)
- ATLAS Functionality. NMatrix has many but not all ATLAS (cBLAS) and LAPACK functions exposed. We would like to see a consistent interface which makes sense in Ruby. We also want to be able to design and implement several NMatrix methods which depend upon ATLAS, cBLAS, and cLAPACK functions.
- Rational Functionality. NMatrix includes some rational number capability, but support is lacking in areas where ATLAS functions are required, since ATLAS does not have a rational type. Rational-specific equivalents of ATLAS functions are needed. Along the way it may be possible to also implement some integer-specific ATLAS function equivalents. (This is a component of the ATLAS Functionality project, but could be proposed separately with sufficient justification.)
- ATLAS-Free Support. NMatrix has some non-ATLAS versions of functions like gemm (matrix multiplication) which are typically used for rational matrices. Since we have to go to the trouble of implementing rational versions of many ATLAS functions, it might be useful to also have simpler ATLAS-free complex and floating-point versions of those functions, primarily for those who don't have LAPACK. They wouldn't be as heavily optimized, but would serve in a tight spot.
- Basic matrix math functionality. Specifically, exponentials and square roots, matrix decomposition/factorization, calculation of norms, tensor products, and principal component analysis (PCA). This is listed as a sub-project of ATLAS functionality, because several of these depend upon ATLAS, but it could really be proposed as a separate GSoC project. These functions are all enormously important, and would substantially improve the usability of NMatrix. Successful implementation would likely lead to co-authorship on a peer-reviewed article, and at the very least would look outstanding on a curriculum vitae.
- Sparse improvements. The "new" Yale matrices used by NMatrix, which store diagonals (zero and non-zero) separately from non-diagonal non-zeros, are inefficient for matrices that are taller than they are wide. One way to address the problem would be to introduce an alternate "old" Yale storage. Another would be to allow matrices to be stored and operated on transposed. The goal, overall, is to be able to produce efficient Yale/sparse vectors regardless of the vectors' orientation.
- extconf improvements. See Ruby-core improvements below. NMatrix uses mkmf for compilation of its C and C++ code, as well as linking ATLAS, LAPACK, and BLAS. But mkmf is difficult to use, and leads to compilation and linking problems, not just in NMatrix but elsewhere as well, and particularly when working on multiple platforms (Linux, Mac, Windows, etc.). It'd be better to have a custom extconf.rb-related library for NMatrix to use for linking highly specialized C libraries like ATLAS. A successful implementation of this project would significantly reduce barriers for NMatrix adoption (e.g., by eliminating compiling and linking difficulties).
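To make the ATLAS-free idea above concrete: ATLAS routines operate on floating-point and complex types, so a rational gemm has to be written by hand. The following is a minimal pure-Ruby sketch, not NMatrix's implementation. The name naive_gemm and the array-of-rows layout are assumptions made for illustration; the `alpha * A * B + beta * C` form follows the usual BLAS gemm convention.

```ruby
# Naive GEMM sketch: returns alpha * A * B + beta * C.
# Matrices are plain arrays of rows (an assumption for this example);
# any Ruby numeric type works, including Rational.
def naive_gemm(a, b, c = nil, alpha: 1, beta: 0)
  m = a.length
  k = b.length
  n = b.first.length
  unless a.all? { |row| row.length == k }
    raise ArgumentError, "inner dimensions differ"
  end

  Array.new(m) do |i|
    Array.new(n) do |j|
      # Dot product of row i of A with column j of B.
      dot = (0...k).sum { |p| a[i][p] * b[p][j] }
      c ? alpha * dot + beta * c[i][j] : alpha * dot
    end
  end
end

a = [[Rational(1, 2), Rational(1, 3)],
     [Rational(1, 4), Rational(1, 5)]]
b = [[Rational(2), Rational(0)],
     [Rational(3), Rational(5)]]

product = naive_gemm(a, b)
product.each { |row| puts row.map(&:to_s).join("  ") }
```

Because every intermediate value stays a Rational, the result is exact. The cost is speed, which is the trade-off the project description already accepts.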
Ruby-core improvements
- Mentors: John Woods (@mohawkjohn)
- Ruby-core projects, particularly mkmf, require that the student develop a good understanding of C as well as Ruby. Some prior familiarity with C and C++ would be beneficial.
mkmf is the library Ruby uses, typically in extconf.rb in gems or other libraries (including NMatrix), for linking C and C++ extensions. It lacks documentation. Most people currently figure it out by trial and error.
A successful mkmf-related project would accomplish both of the following goals:
- Research mkmf and how it is used by other Ruby extensions, in order to determine common use cases.
- Propose and implement an update to or replacement for mkmf, which improves Ruby extension compilation and linking, and show how your work makes it easier to achieve the use cases identified in the first goal.
Such a project would be extremely popular in the broader Ruby community.
SciRuby::Dataframe (provisory name)
- Mentors: Carlos Agarie (@agarie), Claudio Bustos (@clbustos), Max Makarochkin (@mac-r), John Prince (@jtprince)
- SciRuby::Dataframe will implement a concept similar to Pandas (http://pandas.pydata.org/pandas-docs/dev/), a Python library. It will provide containers usable by more powerful data analysis packages: Dataframes, which are tabular structures like data.frame in R, and Series, for 1-dimensional data.
- Some requirements:
- Have some simple statistics built-in (maybe by having statsample as a dependency): averages, quartiles, standard deviation, etc.
- Be really easy to plot. For example, a user should be able to plot a histogram from a Series object with only one method call, maybe two, without having to do conversions or anything else.
- Easily receive and interpret data from a CSV file (or any delimiter-separated value file), transforming it into a Dataframe with something as simple as SciRuby::Dataframe.csv("data.csv") or similar. Can use a simpler parser, e.g. stdlib's CSV module, for now, but will eventually need a faster one.
- Chunk processing of CSV files. The lazy enumerators in Ruby 2.0 can be useful for this, or we might need a new parser.
- Be able to add/remove columns and do operations on rows or columns.
- Have labeled columns and indexed rows. This means that the underlying data structure (wrapping NMatrices and NVectors) will need to store some metadata.
- Use NMatrix/NVector for data storage. This also implies that we can use the NMatrix::IO module.
- As some of the requirements of this project depend on others (visualization, statistics, etc.), the most important part is to design and develop it so that its API is easy to use for new users (e.g. scientists without much programming background) but extensible enough for other projects to use it.
- There are various projects that can be based on this one -- e.g. you can design and develop a CSV engine with most of the features that R/Pandas have or you can build an implementation of Dataframe/Series that uses NMatrix/NVector in a very efficient way, etc. If you have an idea, talk to us on the mailing list or on IRC.
- Inspired by Pandas and Statsample::Dataset.
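As a starting point for discussion, here is a minimal sketch of a few of the requirements above: labeled columns, CSV import, adding and removing columns, and a built-in average. The class name SketchFrame and all of its methods are hypothetical, not a proposed API, and plain Ruby Arrays stand in for NMatrix/NVector storage. Parsing uses the stdlib CSV module, as suggested above.

```ruby
require "csv"

# Hypothetical stand-in for the proposed Dataframe container.
# Plain Arrays replace NMatrix/NVector storage in this sketch.
class SketchFrame
  attr_reader :columns

  # columns: Hash mapping a column label to an Array of values.
  def initialize(columns)
    @columns = columns
  end

  # Build a frame from CSV text whose first row holds the column labels.
  def self.from_csv(text, col_sep: ",")
    table = CSV.parse(text, headers: true, col_sep: col_sep,
                            converters: :numeric)
    new(table.headers.to_h { |label| [label, table[label]] })
  end

  # Look up a column by its label.
  def [](label)
    @columns.fetch(label)
  end

  def row_count
    @columns.empty? ? 0 : @columns.values.first.length
  end

  def add_column(label, values)
    unless values.length == row_count
      raise ArgumentError, "length mismatch"
    end
    @columns[label] = values
    self
  end

  def remove_column(label)
    @columns.delete(label)
    self
  end

  # Arithmetic mean of a numeric column.
  def mean(label)
    values = self[label]
    values.sum.to_f / values.length
  end
end

csv_text = "mass,height\n1.5,10\n2.5,20\n3.5,30\n"
frame = SketchFrame.from_csv(csv_text)
ratio = frame["height"].zip(frame["mass"]).map { |h, m| h / m }
frame.add_column("ratio", ratio)
puts frame.mean("height")
```

Storing data column by column keeps each column a homogeneous vector, which is the shape that would map most directly onto NVector later.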
Create the foundations of a visualization package based on D3
- Mentors: John Woods (@mohawkjohn), Raoul J.P. Bonnal, Rob Syme, Pjotr Prins, Karl Broman
Statsample and Distribution
Statsample is an essential scientific library which brings statistical functions to Ruby. Currently, it depends upon Ruby/GSL, which conflicts with NMatrix. To bring it up to spec, it needs to require the SciRuby fork of rb-gsl instead. There has been some talk of removing support for Ruby versions prior to 1.9.3. Additionally, but no less importantly, a student could work on implementing Generalized Linear Models (GLM) and Time Series Analysis. Lastly, Statsample depends upon Distribution, which makes available statistical distribution functions for users of MRI (in pure Ruby and through GSL) and JRuby. Many of these functions remain unimplemented, or need a JRuby or GSL or pure Ruby version written.
- Mentors: Claudio Bustos (@clbustos), John Woods for statistical distributions (@mohawkjohn)
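For a sense of what a pure-Ruby distribution function involves, here is a minimal sketch of the normal distribution's density and cumulative distribution, built on Math.erf from Ruby's core library. The method names and keyword arguments are hypothetical and are not taken from the Distribution gem's API.

```ruby
# Probability density of a normal distribution at x.
def normal_pdf(x, mean: 0.0, sd: 1.0)
  z = (x - mean) / sd.to_f
  Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math::PI))
end

# Cumulative probability P(X <= x), using the identity
# CDF(x) = 0.5 * (1 + erf(z / sqrt(2))).
def normal_cdf(x, mean: 0.0, sd: 1.0)
  z = (x - mean) / sd.to_f
  0.5 * (1.0 + Math.erf(z / Math.sqrt(2.0)))
end

puts normal_pdf(0.0)
puts normal_cdf(1.96)
```

A GSL-backed or JRuby version would need to agree with a reference like this to within a stated tolerance, which is one natural way to structure the tests.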
Minimization and Integration
Minimization and Integration are two SciRuby modules which are used by Claudio Bustos' statsample gem. For Minimization, students would research and suggest additional minimization methods, develop tests, and improve documentation. For Integration, students would implement additional numerical integration methods and add support for solving various types of (ordinary and/or partial) differential equations. We need to be explicit about the accuracy and performance of each method, so benchmarks will be necessary. As always, the student is expected to write tests and document code. There has been some talk of removing support for Ruby versions earlier than 1.9.3 for both Integration and Minimization.
- Mentors: Claudio Bustos (@clbustos)
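To illustrate what being explicit about accuracy could look like, here is a small plain-Ruby sketch that compares two textbook quadrature rules on an integral with a known value. The method names are hypothetical and this is not the Integration gem's API.

```ruby
# Composite trapezoid rule on [a, b] with n subintervals.
# The integrand is passed as a block.
def trapezoid(a, b, n)
  h = (b - a).to_f / n
  interior = (1...n).sum { |i| yield(a + i * h) }
  h * (0.5 * (yield(a) + yield(b)) + interior)
end

# Composite Simpson's rule on [a, b]; n must be even.
def simpson(a, b, n)
  raise ArgumentError, "n must be even" unless n.even?
  h = (b - a).to_f / n
  odd  = (1...n).step(2).sum { |i| yield(a + i * h) }
  even = (2...n).step(2).sum { |i| yield(a + i * h) }
  h / 3.0 * (yield(a) + yield(b) + 4 * odd + 2 * even)
end

# The exact value of the integral of sin(x) over [0, pi] is 2.
exact = 2.0
[10, 100].each do |n|
  t_err = (trapezoid(0, Math::PI, n) { |x| Math.sin(x) } - exact).abs
  s_err = (simpson(0, Math::PI, n) { |x| Math.sin(x) } - exact).abs
  puts "n=#{n} trapezoid error=#{t_err} simpson error=#{s_err}"
end
```

Reporting the error next to the number of function evaluations gives a simple, comparable benchmark for each method.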
Machine Learning & Data Mining Algorithms for Ruby
- Mentors: Raoul J.P. Bonnal, Francesco Strozzi
- Machine learning and data mining algorithms are widely employed for analyses of complex datasets, especially in bioinformatics. Many Java libraries currently exist that implement the most commonly used algorithms in bioinformatics (such as clustering methods and simple classifiers), but the usability of these tools is restricted by the limited supply of APIs and user-friendly implementations for languages other than Java.
- Approach: The goal of this project would be to implement a system to easily access this set of tools using JRuby and to develop a basic framework that integrates the different sources. The Java libraries that could be primarily used would be taken from Weka (http://www.cs.waikato.ac.nz/ml/weka/) and RapidMiner (http://rapid-i.com/content/view/181/190/). This approach could be subsequently extended to develop a visualization scheme based on D3.
- Another idea is to integrate Waffles: "Waffles seeks to be the world's most comprehensive collection of command-line tools for machine learning and data mining. Our native tools have minimal dependencies (no interpreter, VM, or runtime environment is necessary), and build cross-platform. If you have a useful data mining tool that meets these criteria, we want it in Waffles." We would want to wrap the command line interface (much like mini_magick does for imagemagick) and/or create native bindings that link in with NMatrix or Sciruby::Dataset.
- Difficulty and needed skills: Medium/Hard depending on the topic selected and the scope of the project. Basic statistical knowledge is required as well as programming in Ruby, JRuby and Java.
- The project requires basic statistical knowledge, Ruby, JRuby, Java, and possibly C/C++, plus experience wrapping external libraries and some machine learning background.
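The command-line wrapping approach mentioned above can be sketched with Open3 from Ruby's standard library. The CommandLineTool class is hypothetical, and since no Waffles command names or flags are assumed here, the Ruby interpreter itself serves as a stand-in executable so that the example runs anywhere.

```ruby
require "open3"
require "rbconfig"

# Hypothetical thin wrapper around an external command-line tool.
class CommandLineTool
  class Error < StandardError; end

  def initialize(executable)
    @executable = executable
  end

  # Run the tool with the given arguments and return its standard output.
  # Raises Error, including the tool's stderr, on a non-zero exit status.
  def run(*args)
    stdout, stderr, status =
      Open3.capture3(@executable, *args.map(&:to_s))
    unless status.success?
      raise Error, "#{@executable} failed: #{stderr.strip}"
    end
    stdout
  end
end

# Stand-in executable: the current Ruby interpreter.
tool = CommandLineTool.new(RbConfig.ruby)
puts tool.run("-e", "puts 1 + 1")
```

Passing arguments as separate strings avoids shell interpolation, which matters once file names from users end up on the command line.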
Semantic web support for SciRuby
- Mentors: Pjotr Prins, Toshiaki Katayama, Mark Wilkinson, Jerven Bolleman
- Interactive scientific tools tend to manage complex state in RAM and allow persistence after an analysis session. One example for statistics is R and its environment. The downside of this approach is that the amount of data handled is limited by the memory size of the machine. We propose to use a semantic web data store (a server or a linked library) as a generic backend for interactive data analysis and session persistence in interactive Ruby. The use of a flexible data store that allows complex objects and querying using SPARQL is appealing. Applications in scientific domains, such as bioinformatics, would be especially gratifying. The bioinformatics community is doing a lot of work integrating different data repositories. Bio2RDF contains a wide range of mapped identifiers, and SADI provides service discovery for data that is put online. A list of activities can be found here. BioRuby and biogems contain a wide range of parsers and formatters which could be extended to support reading and writing RDF (the standard data model of the semantic web). Having such functionality would make it easy for bioinformaticians to incorporate and expose RDF for flexible data queries.
- Approach: We will visit all existing classes, parsers and formatters and decide which ones are most useful for RDF import/export. The student will tackle one transformer at a time, writing tests and adding a SPARQL endpoint for others to use. The student will also add SADI service discovery. We take bioinformatics as the applied domain; still, the software should be generally useful for statistical data mining and discovery.
- Difficulty and needed skills: Average difficulty
- The student will need to have an affinity for the semantic web and reach a decent level of Ruby programming. The work probably includes meta-programming.
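As a rough idea of what a single RDF formatter could look like, here is a minimal sketch that serializes a record as N-Triples, a line-based RDF format. The method names are hypothetical, the example.org URIs are placeholders rather than real vocabulary terms, and only plain string literals are handled (no datatypes, language tags, or blank nodes).

```ruby
# Escape a value for use as a plain N-Triples string literal.
def ntriples_literal(value)
  escaped = value.to_s
                 .gsub("\\") { "\\\\" }
                 .gsub('"') { '\\"' }
                 .gsub("\n") { "\\n" }
                 .gsub("\r") { "\\r" }
  %("#{escaped}")
end

# Emit one triple per property of the subject.
# properties: Hash mapping a predicate URI to a literal value.
def to_ntriples(subject_uri, properties)
  properties.map do |predicate_uri, value|
    "<#{subject_uri}> <#{predicate_uri}> #{ntriples_literal(value)} ."
  end.join("\n")
end

# Placeholder URIs, for illustration only.
record = {
  "http://example.org/vocab/name"     => "BRCA1",
  "http://example.org/vocab/organism" => "Homo sapiens"
}
puts to_ntriples("http://example.org/gene/1", record)
```

A real formatter would more likely build on an existing RDF library than on string concatenation, but the sketch shows how little is needed to map a parsed record onto triples.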
Create a Ruby wrapper for LEMON
- Mentors: Carlos Agarie (@agarie), John Prince (@jtprince)
- From their site: "LEMON stands for Library for Efficient Modeling and Optimization in Networks. It is a C++ template library providing efficient implementations of common data structures and algorithms with focus on combinatorial optimization tasks connected mainly with graphs and networks."
- The major hurdle will be to decide if we should use the original API, resulting in a tiny layer of C code to connect Ruby to LEMON, or if it (or parts of it) should be redesigned to have a more "Ruby-like feeling".
- This would be a great chance to learn more about Ruby's C API. Some experience with graphical models will be useful.
- This library would allow us to create probabilistic graphical models with SciRuby (using statsample and distribution).
- Related technologies and theories: MRI C API, wrapping external libraries, C/C++, graph theory.