Frédéric Bastien edited this page Mar 18, 2016 · 11 revisions

This is a list of project ideas for the Google Summer of Code 2016. You are welcome to propose other good ideas of your own. In all cases, discuss them on the theano-dev mailing list, so that you become known to our community and understand the projects better. This is important for your application; it should demonstrate that you understand what needs to be done.

We will try to participate through an umbrella organization, probably the Python Software Foundation. For more information on how to apply, you should read this Python SoC2016. They require that each candidate student have one PR merged. You can look at tickets marked as easy fix.

The current mentors are Frédéric Bastien, Pascal Lamblin, and Arnaud Bergeron.

Theano Organization

Theano is a software library written in Python, C and CUDA (we also have a start of an OpenCL back-end). It is a domain-specific compiler. This means that a user writes Python code to build a computation graph and asks Theano to compile it. Theano optimizes the graph and generates mainly C and/or CUDA code to perform the computation efficiently. It mainly supports computation on numpy.ndarray, but Theano also supports sparse matrices.

Theano is mostly developed by the MILA institute, which works in machine learning, but Theano isn't restricted to machine learning; its capacity for optimizing computation makes it useful for many applications that rely on large amounts of numerical computation. There are also many Theano contributors outside of MILA.

As you probably know, deep learning is changing the world and Theano is one of the main libraries that support this field!

Contacting us

The main communication channel with the developers is our developer mailing list, theano-dev; please use it for GSoC-related questions or discussion of projects.

We mainly reside in the Eastern Standard Time zone so you will usually receive replies faster during our work day. Some of us, however, frequently work outside normal work hours or reside in other time zones.

For GSoC students, we prefer that discussions stay public as much as possible. The mailing list or GitHub is great for that. For more interactive discussion, other means can be used; last year, we used g-chat. This needs to be discussed with the mentors.

Highlighted ideas:

  • Faster optimization phase during compilation

    • Difficulty: Easy/medium
    • Skill needed: only Python code. An understanding of algorithmic complexity (O(1) vs O(n)) is useful. Some experience with Theano will help you start faster.
    • Problem: Theano's graph optimization phase during compilation is slow for big graphs. This makes Theano hard to use with big graphs, especially while the user is developing their model.
    • ticket Ask users on the mailing list for slow cases. Then profile them and find/fix the bottlenecks.
    • This approach can work: all the problems found so far in slow cases were due to how the optimizations are implemented, their algorithms, the order in which they run, or how they get applied. Only real use cases can reveal the real bottlenecks.
    • Change algorithms to scale better.
      • For example, use a stricter algorithm instead of cycle detection for the inplace optimization.
    • Add a global GPU optimization that makes a first pass. issue
    • Mentors: Frédéric and Arnaud
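The "stricter algorithm" idea above can be sketched as follows. This is an illustrative sketch with invented names, not Theano's actual implementation (the real inplace optimization must also handle views and multiple destroyed inputs). The point is the complexity trade-off: running full cycle detection after each candidate costs O(V+E) per candidate, while a precomputed topological rank answers each candidate by checking only that op's consumers, at the cost of rejecting some orderings that cycle detection would have accepted.

```python
from collections import defaultdict, deque

def topological_ranks(nodes, edges):
    """Kahn's algorithm: map each node to its position in one topological order."""
    out = defaultdict(list)
    indeg = {n: 0 for n in nodes}
    for a, b in edges:            # edge a -> b means b depends on a
        out[a].append(b)
        indeg[b] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    rank = {}
    while queue:
        n = queue.popleft()
        rank[n] = len(rank)
        for m in out[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return rank

def can_work_inplace(op, var, consumers, rank):
    """Strict rule: `op` may destroy `var` only if every other consumer of
    `var` is scheduled before `op` in the fixed topological order."""
    return all(rank[c] < rank[op] for c in consumers[var] if c is not op)
```

For example, if variable `x` is consumed by ops `a` and `b`, and `a` runs before `b`, then only `b` is allowed to overwrite `x` under this rule.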
  • Faster linker phase during compilation

    • Difficulty: medium
    • Skill needed: Python and C code.
    • Problem: The first time we compile a Theano function, we compile many C shared libraries. This is time-consuming. Since we cache them, it is less of a problem for later calls, but as it can still take ~1h in some cases, an improvement here would be very useful.
    • One way is to compile fewer shared modules (currently we compile about 5k modules for the Theano tests)
      • Check the content of the Theano cache after running the Theano tests. Then find ways to combine many cases together. Some cases already found:
        • The indexing operations could be more generic without losing execution speed.
        • The cuDNN descriptors should not be op attributes, but inputs to the ops.
        • The cuDNN conv and pool ops could use op params instead of hardcoding some properties.
      • (Started) Make the elemwise C code generate code for many dtypes at the same time. Elemwise is the op with the highest number of generated C modules.
    • Compilation system
      • Compile the non-cached thunks from the same Theano function inside one module.
      • Compile thunks in parallel with Python threads. This mainly helps a single job starting with an empty cache.
      • A lock-less compilation cache, or partial locks, to enable parallel compilation from multiple processes. This mainly helps many concurrent jobs started at about the same time with an empty cache.
    • Mentors: Frédéric and Pascal
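The thread-parallel compilation idea can be sketched like this. This is a hedged sketch with invented names: `compile_module` stands in for the real compiler invocation (which releases the GIL while the C compiler runs, so Python threads genuinely overlap), and the cache keyed by a hash of the generated source only mirrors the spirit of Theano's compiledir cache.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

cache = {}

def compile_module(c_source):
    # Stand-in for invoking the C compiler on generated source; the real
    # call spends its time outside the GIL, so threads overlap usefully.
    return ("compiled", c_source)

def get_or_compile(c_source):
    key = hashlib.sha256(c_source.encode()).hexdigest()
    if key not in cache:
        # Note: two threads can race here and compile the same key twice;
        # harmless for a dict, but a real on-disk cache needs locking or an
        # atomic rename -- which is exactly the lock-less-cache sub-project.
        cache[key] = compile_module(c_source)
    return cache[key]

def compile_all(sources):
    # Compile all uncached thunks of one Theano function in parallel.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(get_or_compile, sources))
```

The duplicate-source case below shows the cache collapsing identical modules into one compilation unit.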
  • Include more operations from optimized GPU libraries

    • Difficulty: medium
    • Skills needed: Python and CUDA
    • Problem:
      • CuDNN provides optimized calls for some machine learning algorithms, or pieces of algorithms, for instance batch normalization. These are currently implemented using existing Theano primitives, which could be less efficient, or use more memory (during the fprop, bprop, or both). We should wrap those functionalities as Theano Ops, and add the appropriate optimizations so that they are used when available, and fall back to another implementation (chain of Theano operations) otherwise.
      • Nervana provides optimized kernels for GEMM and convolutions, for single and half precision
      • CuSparse and CuSolver could also be wrapped for faster operations
    • Mentors: Frédéric, Arnaud, Pascal
  • Better handling of large graphs

    • Difficulty: medium/hard
    • Skills needed: Python, algorithmic understanding (graph traversal, asymptotic complexity)
    • Problem: Theano has trouble handling large graphs (a large number of nodes) and deep graphs (long chains between inputs and outputs), which can lead to crashes or long compilation times.
      • One issue is the use of recursive algorithms for graph traversal, which can make Theano hit the Python stack limit. They are used in particular when computing gradients and when cloning the graph, and maybe during the optimization phase as well.
      • Another issue is that Python's pickle also uses a recursive algorithm to serialize and de-serialize objects. We should investigate alternatives. This may be fixed in Python 3.
      • Also, the time spent optimizing graphs scales supra-linearly in the number of nodes when using the full optimizer (fast_run). This should be investigated further: can we reduce it by cutting some optimization phases? How could we organize the optimizations differently?
      • Add a global optimization that moves computation to the GPU.
    • Mentors: Frederic, Pascal
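The recursion problem above can be made concrete with a small sketch: collecting a node's ancestors recursively fails on a 10,000-node chain (Python's default recursion limit is around 1,000), while the same traversal with an explicit stack does not. `inputs_of` is a hypothetical accessor returning a node's parents; Theano's real traversals walk Apply/Variable nodes.

```python
def ancestors_recursive(node, inputs_of, seen=None):
    # Depth-first, recursive: hits RecursionError on deep chains.
    if seen is None:
        seen = set()
    for parent in inputs_of(node):
        if parent not in seen:
            seen.add(parent)
            ancestors_recursive(parent, inputs_of, seen)
    return seen

def ancestors_iterative(node, inputs_of):
    # Same traversal with an explicit stack: depth-independent.
    seen, stack = set(), [node]
    while stack:
        for parent in inputs_of(stack.pop()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

On a chain graph of depth 10,000 the iterative version returns all 10,000 ancestors while the recursive version raises RecursionError.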
  • Lower peak memory usage

    • Difficulty: medium
    • Skill needed: Python, algorithmic understanding (O(1) vs O(n))
    • Problem: We can't run some models on the GPU because the GPU doesn't have enough memory.
      • We need a mechanism in the VM that allows it to free some temporary variables during execution and recompute them later only when needed. The recomputation is partially supported by the current lazy evaluation mechanism, so when we run out of memory on the GPU, we can fall back to that.
    • Problem: We currently compare our actual peak memory usage against the theoretical minimum peak, but computing the theoretical minimum is too slow for many basic cases, so it needs to be sped up. We have seen cases where we use more than the theoretical minimum peak, so it would be useful to know whether more typical cases are affected as well.
      • As that algorithm is too slow, we can try approximations during profiling: a random search, and a fast algorithm that gives the exact result when the graph is a tree (this would be an approximation, since Theano graphs are DAGs).
      • After comparing those two algorithms during profiling, make Theano functions use them. This will require the user to pass expected shapes for the inputs.
    • Mentors: Frédéric, Pascal and Arnaud
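For context: computing the peak memory of one fixed execution order is cheap; what is expensive is minimizing that peak over all valid orders. A hedged pure-Python sketch of the cheap part, with invented names and plain byte sizes standing in for real shapes:

```python
def peak_memory(order, inputs_of, size_of, outputs):
    """Simulate one execution order, freeing each intermediate as soon as
    its last consumer has run, and return the peak of live memory."""
    refcount = {}
    for node in order:
        for v in inputs_of(node):
            refcount[v] = refcount.get(v, 0) + 1
    live = peak = 0
    for node in order:
        live += size_of(node)             # allocate this node's output
        peak = max(peak, live)
        for v in inputs_of(node):         # inputs may now be freeable
            refcount[v] -= 1
            if refcount[v] == 0 and v not in outputs:
                live -= size_of(v)
    return peak
```

On a simple chain a → b → c with 4-byte outputs, at most two buffers are live at once, so the peak is 8 bytes. A random search over orders, as suggested above, would call this function once per sampled order.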
  • Partial evaluation of a Theano function

    • Difficulty: medium/hard
    • Skill needed: Python and C code.
    • This would make compilation faster; in particular, it would allow compiling the eval function on the train set much faster (only 1 compilation + 1 recompilation/partial evaluation).
    • This would also allow unrolling a graph and selecting what to compute for some partial executions (unrolled graphs).
    • issue gh-2472
    • Mentors: Frédéric and Arnaud
  • Lower Theano function call overhead

    • Difficulty: easy
    • Skill needed: Python and C code.
    • Problem: Each call to a function compiled by Theano has some overhead. If the graph does not contain much computation (e.g. if it works on scalars), this overhead can be significant.
    • Create a Filter op and reuse it to move the logic that validates/converts the inputs into the Theano graph instead of the wrapping Python code.
      • Write C code for this op to remove the Python overhead.
    • Split the Python calling code into 2 layers: one with a fixed number of inputs and no keyword arguments or default values, and one that handles those.
    • Move elemwise computation on NumPy 0-d arrays to Theano scalar ops (which represent the C type, so no object allocation).
    • Disable the garbage collector for NumPy 0-d arrays?
    • Mentors: Frédéric, Arnaud
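The two-layer split can be sketched as below. All names are invented: `_call_fast` stands in for directly running the compiled thunks, and the `'in%d'` keyword naming is purely illustrative. The idea is that the common case, a call with exactly the right positional arguments, never pays for keyword parsing or default filling.

```python
class CompiledFunction:
    def __init__(self, n_inputs, defaults):
        self.n_inputs = n_inputs
        self.defaults = defaults          # maps input position -> default value

    def _call_fast(self, args):
        # Fast layer: fixed arity, no kwargs, no defaults, no validation.
        return sum(args)                  # stand-in for running the thunks

    def __call__(self, *args, **kwargs):
        if kwargs or len(args) != self.n_inputs:
            # Slow layer: fill in named inputs and defaults, then delegate.
            full = list(args)
            for pos in range(len(args), self.n_inputs):
                full.append(kwargs.get('in%d' % pos, self.defaults[pos]))
            return self._call_fast(tuple(full))
        return self._call_fast(args)
```

Callers that always pass full positional arguments could even bind `f._call_fast` directly and skip the dispatch entirely.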

Ideas with a lower priority

  • Generate a shared library (a proof of concept is available as a starting point)

    • Difficulty: medium
    • Skill needed: Python and C.
    • Problem: It would be very useful to generate a shared library from a Theano function. This would make it easier to reuse in other programs and on embedded systems.
    • Bring the prototype to a working version without adding new features.
    • Document it.
    • Add support for scalar constant values in the graph.
    • Add a configuration option to enable/disable GC of intermediate results.
    • Make an interface to support shared variables.
    • (If time permits) To make it work on Windows, we need to back-port some C code that uses C99 features.
    • Mentors: Frédéric Bastien, Arnaud and Pascal
  • Add more linear algebra operations here, here and here

    • Difficulty: easy
    • Skill needed: Python
    • Problem: There are still many operations in numpy.* that we do not have under theano.tensor.*. We receive requests for some of them from time to time. We should provide those operations, and also implement the infer_shape and grad methods when possible.
    • Mentors: Frédéric, Arnaud and Pascal.
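To make the shape of such a contribution concrete, here is a hedged pure-Python sketch of wrapping one missing operation (cumulative product, as an example). The class is invented and deliberately does not import Theano; only the method names mirror Theano's perform / infer_shape / grad convention, and the gradient assumes non-zero inputs.

```python
class CumprodOp:
    """Skeleton in the style of Theano's Op interface (illustrative only)."""

    def perform(self, inputs):
        # Cumulative product of a 1-d sequence.
        (x,) = inputs
        out, acc = [], 1.0
        for v in x:
            acc *= v
            out.append(acc)
        return out

    def infer_shape(self, input_shapes):
        # The cumulative product keeps the input's shape.
        return input_shapes[0]

    def grad(self, x, out_grad):
        # With y_j = prod_{k<=j} x_k, dy_j/dx_i = y_j / x_i for j >= i
        # (x_i != 0), so grad_i = (sum_{j>=i} out_grad_j * y_j) / x_i,
        # computed here with a reverse running sum.
        cp = self.perform([x])
        tail = 0.0
        grads = [0.0] * len(x)
        for i in range(len(x) - 1, -1, -1):
            tail += out_grad[i] * cp[i]
            grads[i] = tail / x[i]
        return grads
```

A real submission would express grad symbolically so Theano can differentiate through it, and include tests comparing against the numpy reference.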
  • Bridge Theano with other compiler and library (Numba, Cython, Parakeet, ...)

    • Difficulty: medium
    • Skill needed: Python mostly, but knowing C would help some of them.
    • Problem: Many other systems have highly optimized code for some cases, or make it easier to generate code faster than Python without writing C (like Numba and Cython). Making them easier to use with Theano would be very valuable.
    • Update the compilation system to compile other libraries more easily by reusing the Theano compilation cache mechanism.
    • Make an easy-to-use interface for combining Cython with Theano. We currently do this manually for the Scan op.
    • Make an easy-to-use interface to reuse Numba (we only provide an example for now).
    • Make Theano use the C interface of a Numba function.
    • Mentors: Frédéric, Arnaud and Pascal.
  • An example for Android

    • Difficulty: medium
    • Skill needed: Python, C. Knowing Android would help.
    • 2 possible cases:
      • Full Theano with dynamic compilation
      • Only the dynamic DLL from the point above. This may only need the first part of that project.
    • Mentors: Frédéric
  • OpenCL

    • Difficulty: medium.
    • Skill needed: Python and C. Understanding parallel computation a must. Knowing CUDA and/or OpenCL a plus.
    • Continue ongoing work in development branch to build OpenCL support
    • Port current CUDA implementations to OpenCL
    • Add OpenCL implementations for unsupported expression types
    • Tune existing OpenCL kernels for various operations
    • Mentors: Frédéric and Arnaud
  • Improve pickling of Theano objects

    • Difficulty: very hard.
    • Skill needed: Python.
    • Theano shared variable pickling, with and without the GPU.
    • Cache the compilation step in the compiledir (started, but needs to be finished gh-)
    • Mentors: Frédéric, Arnaud and Pascal
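One common pattern relevant to the pickling problems above is to exclude unpicklable compiled or device state with `__getstate__` and rebuild it lazily after unpickling. The sketch below uses invented names; the real work is deciding, per object, which state is portable (e.g. a shared variable's value) and which must be recreated (e.g. a GPU buffer or compiled thunk).

```python
import pickle

class SharedVariable:
    def __init__(self, value):
        self.value = value
        self._compiled = self._compile()      # e.g. a GPU buffer or C thunk

    def _compile(self):
        return ("compiled-for", self.value)   # stand-in for real compilation

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['_compiled']                # drop device/compiler state
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._compiled = self._compile()      # rebuild on load
```

After a dumps/loads round trip, the value survives and the compiled state is regenerated on the loading machine, which may not even have a GPU.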

Other ideas not sorted and not developed

  • IfElse (lazy evaluation): add C code and allow it to work inplace on two of its inputs
  • Faster optimization phase (use a SAT solver?)
  • Allow memory profiling in the CVM (currently it requires the VM)
  • Re-write DebugMode to reuse the CVM or VM and simplify it
  • A less opaque theano.function()
  • Track users' usage of Theano, with their permission
    • This would allow us to find bugs that would have affected them in the past too.