diff --git a/.gitignore b/.gitignore
index 6d6397f166b..76c4892a180 100644
--- a/.gitignore
+++ b/.gitignore
@@ -37,3 +37,4 @@ Theano.suo
 .ipynb_checkpoints
 .pydevproject
 .ropeproject
+core
\ No newline at end of file
diff --git a/doc/extending/extending_theano.txt b/doc/extending/extending_theano.txt
index e9a0e7a6298..41741d23703 100644
--- a/doc/extending/extending_theano.txt
+++ b/doc/extending/extending_theano.txt
@@ -681,8 +681,8 @@ For instance, to verify the Rop method of the DoubleOp, you can use this:
 Testing GPU Ops
 ^^^^^^^^^^^^^^^
 
-Ops to be executed on the GPU should inherit from the
-``theano.sandbox.cuda.GpuOp`` and not ``theano.Op``. This allows
+When using the old GPU backend, Ops to be executed on the GPU should inherit
+from ``theano.sandbox.cuda.GpuOp`` and not ``theano.Op``. This allows
 Theano to distinguish them. Currently, we use this to test if the
 NVIDIA driver works correctly with our sum reduction code on the GPU.
 
diff --git a/doc/install.txt b/doc/install.txt
index f9768b0ff29..713329122a8 100755
--- a/doc/install.txt
+++ b/doc/install.txt
@@ -375,7 +375,7 @@ If ``theano-nose`` is not found by your shell, you will need to add
 
     If you want GPU-related tests to run on a specific GPU device, and not
     the default one, you should use :attr:`~config.init_gpu_device`.
-    For instance: ``THEANO_FLAGS=device=cpu,init_gpu_device=gpu1``.
+    For instance: ``THEANO_FLAGS=device=cpu,init_gpu_device=cuda1``.
 
     See :ref:`libdoc_config` for more information on how to change these
     configuration options.
@@ -508,25 +508,25 @@ Any one of them is enough.
 
     :ref:`Ubuntu instructions <install_ubuntu_gpu>`.
 
-
+Next, install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
 
 Once that is done, the only thing left is to change the ``device`` option to name the GPU device in your
 computer, and set the default floating point computations to float32.
-For example: ``THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=gpu,floatX=float32'``.
+For example: ``THEANO_FLAGS='cuda.root=/path/to/cuda/root,device=cuda,floatX=float32'``.
 You can also set these options in the .theanorc file's ``[global]`` section:
 
      .. code-block:: cfg
 
         [global]
-        device = gpu
+        device = cuda
         floatX = float32
 
 Note that:
 
-    * If your computer has multiple GPUs and you use 'device=gpu', the driver
+    * If your computer has multiple GPUs and you use 'device=cuda', the driver
       selects the one to use (usually gpu0).
     * You can use the program nvida-smi to change this policy.
-    * You can choose one specific GPU by specifying 'device=gpuX', with X the
+    * You can choose one specific GPU by specifying 'device=cudaX', with X the
       the corresponding GPU index (0, 1, 2, ...)
     * By default, when ``device`` indicates preference for GPU computations,
       Theano will fall back to the CPU if there is a problem with the GPU.
@@ -794,6 +794,8 @@ setup CUDA, but be aware of the following caveats:
      toggle your GPU on, which can be done with
      `gfxCardStatus <http://codykrieger.com/gfxCardStatus>`__.
 
+Next, install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_.
+
 Once your setup is complete, head to :ref:`using_gpu` to find how to verify
 everything is working properly.
 
diff --git a/doc/install_ubuntu.txt b/doc/install_ubuntu.txt
index bed57ae701c..d77190e54bc 100644
--- a/doc/install_ubuntu.txt
+++ b/doc/install_ubuntu.txt
@@ -43,7 +43,7 @@ For Ubuntu 11.10 through 14.04:
 
     sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose g++ libopenblas-dev git
     sudo pip install Theano
-    
+
 On 14.04, this will install Python 2 by default. If you want to use Python 3:
 
 .. code-block:: bash
@@ -104,30 +104,30 @@ For Ubuntu 11.04:
    The development version of Theano supports Python 3.3 and
    probably supports Python 3.2, but we do not test on it.
 
-    
+
 Bleeding Edge Installs
 ----------------------
 
-If you would like, instead, to install the bleeding edge Theano (from github) 
-such that you can edit and contribute to Theano, replace the `pip install Theano` 
+If you would like, instead, to install the bleeding edge Theano (from github)
+such that you can edit and contribute to Theano, replace the `pip install Theano`
 command with:
 
 .. code-block:: bash
 
     git clone git://github.com/Theano/Theano.git
-    cd Theano 
+    cd Theano
     python setup.py develop --user
     cd ..
 
 VirtualEnv
 ----------
-    
-If you would like to install Theano in a VirtualEnv, you will want to pass the 
-`--system-site-packages` flag when creating the VirtualEnv so that it will pick up 
+
+If you would like to install Theano in a VirtualEnv, you will want to pass the
+`--system-site-packages` flag when creating the VirtualEnv so that it will pick up
 the system-provided `Numpy` and `SciPy`.
 
 .. code-block:: bash
-    
+
     virtualenv --system-site-packages -p python2.7 theano-env
     source theano-env/bin/activate
     pip install Theano
@@ -208,7 +208,7 @@ Updating Bleeding Edge Installs
 Change to the Theano directory and run:
 
 .. code-block:: bash
-    
+
     git pull
 
 
@@ -303,7 +303,7 @@ Test GPU configuration
 
 .. code-block:: bash
 
-    THEANO_FLAGS=floatX=float32,device=gpu python /usr/lib/python2.*/site-packages/theano/misc/check_blas.py
+    THEANO_FLAGS=floatX=float32,device=cuda python /usr/lib/python2.*/site-packages/theano/misc/check_blas.py
 
 .. note::
 
diff --git a/doc/install_windows.txt b/doc/install_windows.txt
index 733e518661e..3183aa10d5c 100644
--- a/doc/install_windows.txt
+++ b/doc/install_windows.txt
@@ -423,16 +423,16 @@ Create a test file containing:
    print("NP time: %f[s], theano time: %f[s] (times should be close when run on CPU!)" %(
                                               np_end-np_start, t_end-t_start))
    print("Result difference: %f" % (np.abs(AB-tAB).max(), ))
-   
+
 .. testoutput::
    :hide:
    :options: +ELLIPSIS
-   
+
    NP time: ...[s], theano time: ...[s] (times should be close when run on CPU!)
    Result difference: ...
 
 .. code-block:: none
-   
+
    NP time: 1.480863[s], theano time: 1.475381[s] (times should be close when run on CPU!)
    Result difference: 0.000000
 
@@ -445,6 +445,8 @@ routine for matrix multiplication)
 Configure Theano for GPU use
 ############################
 
+Install `libgpuarray <http://deeplearning.net/software/libgpuarray/installation.html>`_ if you have not already done so.
+
 Theano can be configured with a ``.theanorc`` text file (or
 ``.theanorc.txt``, whichever is easier for you to create under
 Windows). It should be placed in the directory pointed to by the
@@ -457,7 +459,7 @@ To use the GPU please write the following configuration file:
 .. code-block:: cfg
 
    [global]
-   device = gpu
+   device = cuda
    floatX = float32
 
    [nvcc]
@@ -498,7 +500,7 @@ within an MSYS shell if you installed Nose manually as described above.
 Compiling a faster BLAS
 ~~~~~~~~~~~~~~~~~~~~~~~
 
-If you installed Python through WinPython or EPD, Theano will automatically 
+If you installed Python through WinPython or EPD, Theano will automatically
 link with the MKL library, so you should not need to compile your own BLAS.
 
 .. note::
diff --git a/doc/optimizations.txt b/doc/optimizations.txt
index db5457c6439..51cec844c27 100644
--- a/doc/optimizations.txt
+++ b/doc/optimizations.txt
@@ -32,6 +32,7 @@ Optimization                                              FAST_RUN  FAST_COMPILE
 ========================================================= ========= ============ =============
 :term:`merge`                                             x         x
 :term:`constant folding<constant folding>`                x         x
+:term:`GPU transfer`                                      x         x
 :term:`shape promotion<shape promotion>`                  x
 :term:`fill cut<fill cut>`                                x
 :term:`inc_subtensor srlz.<inc_subtensor serialization>`  x
@@ -52,7 +53,6 @@ Optimization                                              FAST_RUN  FAST_COMPILE
 :term:`inplace_elemwise`                                  x
 :term:`inplace_random`                                    x
 :term:`elemwise fusion`                                   x
-:term:`GPU transfer`                                      x
 :term:`local_log_softmax`                                 x                      x
 :term:`local_remove_all_assert`                                                   
 ========================================================= ========= ============ =============
diff --git a/doc/tutorial/aliasing.txt b/doc/tutorial/aliasing.txt
index f9e5962b9c0..8651253f524 100644
--- a/doc/tutorial/aliasing.txt
+++ b/doc/tutorial/aliasing.txt
@@ -261,52 +261,6 @@ combination of ``return_internal_type=True`` and ``borrow=True`` arguments to
 hints that give more flexibility to the compilation and optimization of the
 graph.
 
-For GPU graphs, this borrowing can have a major speed impact.  See the following code:
-
-.. code-block:: python
-
-   from theano import function, config, shared, sandbox, tensor, Out
-   import numpy
-   import time
-
-   vlen = 10 * 30 * 768  # 10 x # cores x # threads per core
-   iters = 1000
-
-   rng = numpy.random.RandomState(22)
-   x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
-   f1 = function([], sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)))
-   f2 = function([],
-                 Out(sandbox.cuda.basic_ops.gpu_from_host(tensor.exp(x)),
-                     borrow=True))
-   t0 = time.time()
-   for i in range(iters):
-       r = f1()
-   t1 = time.time()
-   no_borrow = t1 - t0
-   t0 = time.time()
-   for i in range(iters):
-       r = f2()
-   t1 = time.time()
-   print(
-       "Looping %s times took %s seconds without borrow "
-       "and %s seconds with borrow" % (iters, no_borrow, (t1 - t0))
-   )
-   if numpy.any([isinstance(x.op, tensor.Elemwise) and
-                 ('Gpu' not in type(x.op).__name__)
-                 for x in f1.maker.fgraph.toposort()]):
-       print('Used the cpu')
-   else:
-       print('Used the gpu')
-
-Which produces this output:
-
-.. code-block:: none
-
-   $ THEANO_FLAGS=device=gpu0,floatX=float32 python test1.py
-   Using gpu device 0: GeForce GTX 275
-   Looping 1000 times took 0.368273973465 seconds without borrow and 0.0240728855133 seconds with borrow.
-   Used the gpu
-
 *Take home message:*
 
 When an input *x* to a function is not needed after the function
@@ -317,4 +271,3 @@ requirement.  When a return value *y* is large (in terms of memory
 footprint), and you only need to read from it once, right away when
 it's returned, then consider marking it with an ``Out(y,
 borrow=True)``.
-
diff --git a/doc/tutorial/using_gpu.txt b/doc/tutorial/using_gpu.txt
index 0ce68f1d354..8ac4502b800 100644
--- a/doc/tutorial/using_gpu.txt
+++ b/doc/tutorial/using_gpu.txt
@@ -15,355 +15,9 @@ about how to carry out those computations.  One of the ways we take
 advantage of this flexibility is in carrying out calculations on a
 graphics card.
 
-There are two ways currently to use a gpu, one of which only supports NVIDIA cards (:ref:`cuda`) and the other, in development, that should support any OpenCL device as well as NVIDIA cards (:ref:`gpuarray`).
-
-.. _cuda:
-
-CUDA backend
-------------
-
-If you have not done so already, you will need to install Nvidia's
-GPU-programming toolchain (CUDA) and configure Theano to use it.
-We provide installation instructions for :ref:`Linux <gpu_linux>`,
-:ref:`MacOS <gpu_macos>` and :ref:`Windows <gpu_windows>`.
-
-Testing Theano with GPU
-~~~~~~~~~~~~~~~~~~~~~~~
-
-To see if your GPU is being used, cut and paste the following program into a
-file and run it.
-
-.. testcode::
-
-    from theano import function, config, shared, sandbox
-    import theano.tensor as T
-    import numpy
-    import time
-
-    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
-    iters = 1000
-
-    rng = numpy.random.RandomState(22)
-    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
-    f = function([], T.exp(x))
-    print(f.maker.fgraph.toposort())
-    t0 = time.time()
-    for i in range(iters):
-        r = f()
-    t1 = time.time()
-    print("Looping %d times took %f seconds" % (iters, t1 - t0))
-    print("Result is %s" % (r,))
-    if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
-        print('Used the cpu')
-    else:
-        print('Used the gpu')
-
-The program just computes the ``exp()`` of a bunch of random numbers.
-Note that we use the ``shared`` function to
-make sure that the input *x* is stored on the graphics device.
-
-.. the following figures have been measured twice on BART3 on Aug 2nd 2012 with no other job running simultaneously
-
-If I run this program (in check1.py) with ``device=cpu``, my computer takes a little over 3 seconds,
-whereas on the GPU it takes just over 0.64 seconds. The GPU will not always produce the exact
-same floating-point numbers as the CPU. As a benchmark, a loop that calls ``numpy.exp(x.get_value())`` takes about 46 seconds.
-
-.. testoutput::
-   :hide:
-   :options: +ELLIPSIS
-
-   [Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
-   Looping 1000 times took ... seconds
-   Result is ...
-   Used the cpu
-
-.. code-block:: none
-
-    $ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py
-    [Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
-    Looping 1000 times took 3.06635117531 seconds
-    Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
-      1.62323284]
-    Used the cpu
-
-    $ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py
-    Using gpu device 0: GeForce GTX 580
-    [GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
-    Looping 1000 times took 0.638810873032 seconds
-    Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
-      1.62323296]
-    Used the gpu
-
-Note that GPU operations in Theano require for now ``floatX`` to be *float32* (see also below).
-
-
-Returning a Handle to Device-Allocated Data
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The speedup is not greater in the preceding example because the function is
-returning its result as a NumPy ndarray which has already been copied from the
-device to the host for your convenience.  This is what makes it so easy to swap in ``device=gpu``, but
-if you don't mind less portability, you might gain a bigger speedup by changing
-the graph to express a computation with a GPU-stored result.  The ``gpu_from_host``
-op means "copy the input from the host to the GPU" and it is optimized away
-after the ``T.exp(x)`` is replaced by a GPU version of ``exp()``.
-
-.. testcode::
-
-    from theano import function, config, shared, sandbox
-    import theano.sandbox.cuda.basic_ops
-    import theano.tensor as T
-    import numpy
-    import time
-
-    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
-    iters = 1000
-
-    rng = numpy.random.RandomState(22)
-    x = shared(numpy.asarray(rng.rand(vlen), 'float32'))
-    f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
-    print(f.maker.fgraph.toposort())
-    t0 = time.time()
-    for i in range(iters):
-        r = f()
-    t1 = time.time()
-    print("Looping %d times took %f seconds" % (iters, t1 - t0))
-    print("Result is %s" % (r,))
-    print("Numpy result is %s" % (numpy.asarray(r),))
-    if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
-        print('Used the cpu')
-    else:
-        print('Used the gpu')
-
-The output from this program is
-
-.. testoutput::
-   :hide:
-   :options: +ELLIPSIS, +SKIP
-
-   Using gpu device 0: GeForce GTX 580
-   [GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
-   Looping 1000 times took ... seconds
-   Result is <CudaNdarray object at 0x...>
-   Numpy result is ...
-   Used the gpu
-
-.. code-block:: none
-
-    $ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check2.py
-    Using gpu device 0: GeForce GTX 580
-    [GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
-    Looping 1000 times took 0.34898686409 seconds
-    Result is <CudaNdarray object at 0x6a7a5f0>
-    Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
-      1.62323296]
-    Used the gpu
-
-Here we've shaved off about 50% of the run-time by simply not copying
-the resulting array back to the host.  The object returned by each
-function call is now not a NumPy array but a "CudaNdarray" which can
-be converted to a NumPy ndarray by the normal NumPy casting mechanism
-using something like ``numpy.asarray()``.
-
-For even more speed you can play with the ``borrow`` flag.  See
-:ref:`borrowfunction`.
-
-What Can Be Accelerated on the GPU
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The performance characteristics will change as we continue to optimize our
-implementations, and vary from device to device, but to give a rough idea of
-what to expect right now:
-
-* Only computations
-  with *float32* data-type can be accelerated. Better support for *float64* is expected in upcoming hardware but
-  *float64* computations are still relatively slow (Jan 2010).
-* Matrix
-  multiplication, convolution, and large element-wise operations can be
-  accelerated a lot (5-50x) when arguments are large enough to keep 30
-  processors busy.
-* Indexing,
-  dimension-shuffling and  constant-time reshaping will be equally fast on GPU
-  as on CPU.
-* Summation
-  over rows/columns of tensors can be a little slower on the GPU than on the CPU.
-* Copying
-  of large quantities of data to and from a device is relatively slow, and
-  often cancels most of the advantage of one or two accelerated functions on
-  that data.  Getting GPU performance largely hinges on making data transfer to
-  the device pay off.
-
-Tips for Improving Performance on GPU
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-* Consider
-  adding ``floatX=float32`` to your ``.theanorc`` file if you plan to do a lot of
-  GPU work.
-* Use the Theano flag ``allow_gc=False``. See :ref:`gpu_async`
-* Prefer
-  constructors like ``matrix``, ``vector`` and ``scalar`` to ``dmatrix``, ``dvector`` and
-  ``dscalar`` because the former will give you *float32* variables when
-  ``floatX=float32``.
-* Ensure
-  that your output variables have a *float32* dtype and not *float64*.  The
-  more *float32* variables are in your graph, the more work the GPU can do for
-  you.
-* Minimize
-  tranfers to the GPU device by using ``shared`` *float32* variables to store
-  frequently-accessed data (see :func:`shared()<shared.shared>`).  When using
-  the GPU, *float32* tensor ``shared`` variables are stored on the GPU by default to
-  eliminate transfer time for GPU ops using those variables.
-* If you aren't happy with the performance you see, try running your script with
-  ``profile=True`` flag. This should print some timing information at program
-  termination. Is time being used sensibly?   If an op or Apply is
-  taking more time than its share, then if you know something about GPU
-  programming, have a look at how it's implemented in theano.sandbox.cuda.
-  Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and Xs(X%) in transfer op*.
-  This can tell you if not enough of your graph is on the GPU or if there
-  is too much memory transfer.
-* Use nvcc options. nvcc supports those options to speed up some
-  computations: `-ftz=true` to `flush denormals values to
-  zeros. <https://developer.nvidia.com/content/cuda-pro-tip-flush-denormals-confidence>`_,
-  `--prec-div=false` and `--prec-sqrt=false` options to speed up
-  division and square root operation by being less precise. You can
-  enable all of them with the `nvcc.flags=--use_fast_math` Theano
-  flag or you can enable them individually as in this example:
-  `nvcc.flags=-ftz=true --prec-div=false`.
-* To investigate whether if all the Ops in the computational graph are running on GPU.
-  It is possible to debug or check your code by providing a value to `assert_no_cpu_op`
-  flag, i.e. `warn`, for warning `raise` for raising an error or `pdb` for putting a breakpoint
-  in the computational graph if there is a CPU Op.
-
-.. _gpu_async:
-
-GPU Async capabilities
-~~~~~~~~~~~~~~~~~~~~~~
-
-Ever since Theano 0.6 we started to use the asynchronous capability of
-GPUs. This allows us to be faster but with the possibility that some
-errors may be raised later than when they should occur. This can cause
-difficulties when profiling Theano apply nodes. There is a NVIDIA
-driver feature to help with these issues. If you set the environment
-variable CUDA_LAUNCH_BLOCKING=1 then all kernel calls will be
-automatically synchronized. This reduces performance but provides good
-profiling and appropriately placed error messages.
-
-This feature interacts with Theano garbage collection of intermediate
-results. To get the most of this feature, you need to disable the gc
-as it inserts synchronization points in the graph. Set the Theano flag
-``allow_gc=False`` to get even faster speed! This will raise the memory
-usage.
-
-Changing the Value of Shared Variables
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-To change the value of a ``shared`` variable, e.g. to provide new data to processes,
-use ``shared_variable.set_value(new_value)``. For a lot more detail about this,
-see :ref:`aliasing`.
-
-
-Exercise
-++++++++
-
-Consider again the logistic regression:
-
-.. testcode::
-
-    import numpy
-    import theano
-    import theano.tensor as T
-    rng = numpy.random
-
-    N = 400
-    feats = 784
-    D = (rng.randn(N, feats).astype(theano.config.floatX),
-    rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
-    training_steps = 10000
-
-    # Declare Theano symbolic variables
-    x = T.matrix("x")
-    y = T.vector("y")
-    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
-    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
-    x.tag.test_value = D[0]
-    y.tag.test_value = D[1]
-
-    # Construct Theano expression graph
-    p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability of having a one
-    prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
-    xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
-    cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
-    gw,gb = T.grad(cost, [w,b])
-
-    # Compile expressions to functions
-    train = theano.function(
-                inputs=[x,y],
-                outputs=[prediction, xent],
-                updates=[(w, w-0.01*gw), (b, b-0.01*gb)],
-                name = "train")
-    predict = theano.function(inputs=[x], outputs=prediction,
-                name = "predict")
-
-    if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
-            train.maker.fgraph.toposort()]):
-        print('Used the cpu')
-    elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
-              train.maker.fgraph.toposort()]):
-        print('Used the gpu')
-    else:
-        print('ERROR, not able to tell if theano used the cpu or the gpu')
-        print(train.maker.fgraph.toposort())
-
-    for i in range(training_steps):
-        pred, err = train(D[0], D[1])
-
-    print("target values for D")
-    print(D[1])
-
-    print("prediction on D")
-    print(predict(D[0]))
-
-.. testoutput::
-   :hide:
-   :options: + ELLIPSIS
-
-   Used the cpu
-   target values for D
-   ...
-   prediction on D
-   ...
-
-Modify and execute this example to run on GPU with ``floatX=float32`` and
-time it using the command line ``time python file.py``. (Of course, you may use some of your answer
-to the exercise in section :ref:`Configuration Settings and Compiling Mode<using_modes>`.)
-
-Is there an increase in speed from CPU to GPU?
-
-Where does it come from? (Use ``profile=True`` flag.)
-
-What can be done to further increase the speed of the GPU version? Put your ideas to test.
-
-
-.. Note::
-
-   * Only 32 bit floats are currently supported (development is in progress).
-   * ``Shared`` variables with *float32* dtype are by default moved to the GPU memory space.
-
-   * There is a limit of one GPU per process.
-   * Use the Theano flag ``device=gpu`` to require use of the GPU device.
-   * Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one.
-
-   * Apply the Theano flag ``floatX=float32`` (through ``theano.config.floatX``) in your code.
-   * ``Cast`` inputs before storing them into a ``shared`` variable.
-   * Circumvent the automatic cast of *int32* with *float32* to *float64*:
-
-     * Insert manual cast in your code or use *[u]int{8,16}*.
-     * Insert manual cast around the mean operator (this involves division by length, which is an *int64*).
-     * Notice that a new casting mechanism is being developed.
-
-:download:`Solution<using_gpu_solution_1.py>`
-
--------------------------------------------
+There are currently two ways to use a GPU: one that should support any OpenCL
+device as well as NVIDIA cards (:ref:`gpuarray`), and the old backend, which
+only supports NVIDIA cards (:ref:`cuda`).
 
 .. _gpuarray:
 
@@ -380,10 +34,9 @@ be referred to as GPU.
 
 .. warning::
 
-  While it is fully our intention to support OpenCL, as of May 2014
-  this support is still in its infancy.  A lot of very useful ops
-  still do not support it because they were ported from the old
-  backend with minimal change.
+  The backend was designed to support OpenCL; however, current support is
+  incomplete. Many very useful ops still do not support it because they
+  were ported from the old backend with minimal change.
 
 Testing Theano with GPU
 ~~~~~~~~~~~~~~~~~~~~~~~
@@ -391,6 +44,9 @@ Testing Theano with GPU
 To see if your GPU is being used, cut and paste the following program
 into a file and run it.
 
+Use the Theano flag ``device=cuda`` to require the use of the GPU. Use the flag
+``device=cuda{0,1,...}`` to specify which GPU to use.
+
 .. testcode::
 
   from theano import function, config, shared, tensor
@@ -453,11 +109,11 @@ Returning a Handle to Device-Allocated Data
 By default functions that execute on the GPU still return a standard
 numpy ndarray.  A transfer operation is inserted just before the
 results are returned to ensure a consistent interface with CPU code.
-This allows changing the deivce some code runs on by only replacing
+This allows changing the device some code runs on by only replacing
 the value of the ``device`` flag without touching the code.
 
 If you don't mind a loss of flexibility, you can ask theano to return
-the GPU object directly.  The following code is modifed to do just that.
+the GPU object directly.  The following code is modified to do just that.
 
 .. testcode::
 
@@ -532,23 +188,68 @@ What Can be Accelerated on the GPU
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The performance characteristics will of course vary from device to
-device, and also as we refine our implementation.
-
-This backend supports all regular theano data types (float32, float64,
-int, ...) however GPU support varies and some units can't deal with
+device, and also as we refine our implementation, but as a rough guide:
+
+* In general, matrix multiplication, convolution, and large element-wise
+  operations can be accelerated a lot (5-50x) when arguments are large enough
+  to keep 30 processors busy.
+* Indexing, dimension-shuffling and constant-time reshaping will be equally fast
+  on GPU as on CPU.
+* Summation over rows/columns of tensors can be a little slower on the
+  GPU than on the CPU.
+* Copying of large quantities of data to and from a device is relatively slow,
+  and often cancels most of the advantage of one or two accelerated functions
+  on that data. Getting GPU performance largely hinges on making data transfer
+  to the device pay off.
+
+The backend supports all regular Theano data types (float32, float64,
+int, ...); however, GPU support varies and some units can't deal with
 double (float64) or small (less than 32 bits like int16) data types.
 You will get an error at compile time or runtime if this is the case.
 
-By default all inputs will get transferred to GPU.  You can prevent an
+By default all inputs will get transferred to GPU. You can prevent an
 input from getting transferred by setting its tag.target attribute to
 'cpu'.
 
 Complex support is untested and most likely completely broken.
 
-In general, large operations like matrix multiplication, or
-element-wise operations with large inputs, will be significatly
-faster.
+Tips for Improving Performance on GPU
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
+* Consider adding ``floatX=float32`` (or the type you are using) to your
+  ``.theanorc`` file if you plan to do a lot of GPU work.
+* Use the Theano flag ``allow_gc=False``. See :ref:`gpu_async`.
+* Prefer constructors like ``matrix``, ``vector`` and ``scalar`` to
+  ``dmatrix``, ``dvector`` and ``dscalar`` because the former will give
+  you *float32* variables when ``floatX=float32``.
+* Ensure that your output variables have a *float32* dtype and not *float64*.
+  The more *float32* variables are in your graph, the more work the GPU can do for
+  you.
+* Minimize transfers to the GPU device by using ``shared`` *float32* variables
+  to store frequently-accessed data (see :func:`shared()<shared.shared>`).
+  When using the GPU, *float32* tensor ``shared`` variables are stored on
+  the GPU by default to eliminate transfer time for GPU ops using those variables.
+* If you aren't happy with the performance you see, try running your script with
+  the ``profile=True`` flag. This should print some timing information at program
+  termination. Is time being used sensibly? If an op or Apply is
+  taking more time than its share, then if you know something about GPU
+  programming, have a look at how it's implemented in theano.sandbox.cuda.
+  Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and
+  Xs(X%) in transfer op*. This can tell you if not enough of your graph is
+  on the GPU or if there is too much memory transfer.
+* Use nvcc options. nvcc supports the following options to speed up some
+  computations: `-ftz=true` to `flush denormal values to zero
+  <https://developer.nvidia.com/content/cuda-pro-tip-flush-denormals-confidence>`_,
+  and `--prec-div=false` and `--prec-sqrt=false` to speed up
+  division and square root operations by being less precise. You can
+  enable all of them with the `nvcc.flags=--use_fast_math` Theano
+  flag or you can enable them individually as in this example:
+  `nvcc.flags=-ftz=true --prec-div=false`.
+* To check whether all the Ops in the computational graph are
+  running on the GPU, you can debug or check your code by providing
+  a value to the `assert_no_cpu_op` flag: `warn` for a warning, `raise` for
+  raising an error, or `pdb` for putting a breakpoint in the computational
+  graph if there is a CPU Op.
 
 GPU Async Capabilities
 ~~~~~~~~~~~~~~~~~~~~~~
@@ -565,6 +266,125 @@ calling its ``sync()`` method.  This is useful to get accurate timings
 when doing benchmarks.
 
 
+Changing the Value of Shared Variables
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To change the value of a ``shared`` variable, e.g. to provide new data to processes,
+use ``shared_variable.set_value(new_value)``. For a lot more detail about this,
+see :ref:`aliasing`.
+
+Exercise
+~~~~~~~~
+
+Consider again the logistic regression:
+
+.. testcode::
+
+    import numpy
+    import theano
+    import theano.tensor as T
+    rng = numpy.random
+
+    N = 400
+    feats = 784
+    D = (rng.randn(N, feats).astype(theano.config.floatX),
+         rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
+    training_steps = 10000
+
+    # Declare Theano symbolic variables
+    x = T.matrix("x")
+    y = T.vector("y")
+    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
+    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
+    x.tag.test_value = D[0]
+    y.tag.test_value = D[1]
+
+    # Construct Theano expression graph
+    p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability of having a one
+    prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
+    xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
+    cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
+    gw,gb = T.grad(cost, [w,b])
+
+    # Compile expressions to functions
+    train = theano.function(
+                inputs=[x, y],
+                outputs=[prediction, xent],
+                updates=[(w, w - 0.01 * gw), (b, b - 0.01 * gb)],
+                name="train")
+    predict = theano.function(inputs=[x], outputs=prediction,
+                name="predict")
+
+    if any([n.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for n in
+            train.maker.fgraph.toposort()]):
+        print('Used the cpu')
+    elif any([n.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for n in
+              train.maker.fgraph.toposort()]):
+        print('Used the gpu')
+    else:
+        print('ERROR, not able to tell if theano used the cpu or the gpu')
+        print(train.maker.fgraph.toposort())
+
+    for i in range(training_steps):
+        pred, err = train(D[0], D[1])
+
+    print("target values for D")
+    print(D[1])
+
+    print("prediction on D")
+    print(predict(D[0]))
+
+.. testoutput::
+   :hide:
+   :options: + ELLIPSIS
+
+   Used the cpu
+   target values for D
+   ...
+   prediction on D
+   ...
+
+Modify and execute this example to run on GPU with ``floatX=float32`` and
+time it using the command line ``time python file.py``. (Of course, you may use some of your answer
+to the exercise in section :ref:`Configuration Settings and Compiling Mode<using_modes>`.)
+
+Is there an increase in speed from CPU to GPU?
+
+Where does it come from? (Use ``profile=True`` flag.)
+
+What can be done to further increase the speed of the GPU version? Put your ideas to the test.
+
+:download:`Solution<using_gpu_solution_1.py>`
+
+-------------------------------------------
+
+.. _cuda:
+
+CUDA backend
+------------
+
+If you have not done so already, you will need to install Nvidia's
+GPU-programming toolchain (CUDA) and configure Theano to use it.
+We provide installation instructions for :ref:`Linux <gpu_linux>`,
+:ref:`MacOS <gpu_macos>` and :ref:`Windows <gpu_windows>`.
+
+The old CUDA backend can be activated using the flags ``device=gpu`` or
+``device=gpu{0,1,...}``.
+
+.. Note::
+
+   * Only 32-bit floats are supported.
+   * ``Shared`` variables with *float32* dtype are by default moved to the GPU memory space.
+
+   * There is a limit of one GPU per process.
+
+   * Apply the Theano flag ``floatX=float32`` (through ``theano.config.floatX``) in your code.
+   * ``Cast`` inputs before storing them into a ``shared`` variable.
+   * Circumvent the automatic cast of *int32* with *float32* to *float64*:
+
+     * Insert a manual cast in your code or use *[u]int{8,16}*.
+     * Insert a manual cast around the mean operator (this involves division by length, which is an *int64*).
+     * Note that a new casting mechanism is being developed.
 
 -------------------------------------------
 
diff --git a/doc/tutorial/using_gpu_solution_1.py b/doc/tutorial/using_gpu_solution_1.py
index 4a8faf95082..aec61e4160f 100755
--- a/doc/tutorial/using_gpu_solution_1.py
+++ b/doc/tutorial/using_gpu_solution_1.py
@@ -11,8 +11,6 @@
 import theano
 import theano.tensor as tt
 
-from theano import sandbox, Out
-
 theano.config.floatX = 'float32'
 
 rng = numpy.random
@@ -20,7 +18,7 @@
 N = 400
 feats = 784
 D = (rng.randn(N, feats).astype(theano.config.floatX),
-rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
+    rng.randint(size=N, low=0, high=2).astype(theano.config.floatX))
 training_steps = 10000
 
 # Declare Theano symbolic variables
@@ -38,33 +36,22 @@
 prediction = p_1 > 0.5  # The prediction that is done: 0 or 1
 xent = -y * tt.log(p_1) - (1 - y) * tt.log(1 - p_1)  # Cross-entropy
 cost = tt.cast(xent.mean(), 'float32') + \
-       0.01 * (w ** 2).sum()  # The cost to optimize
+    0.01 * (w ** 2).sum()  # The cost to optimize
 gw, gb = tt.grad(cost, [w, b])
 
-"""
-# Compile expressions to functions
-train = theano.function(
-            inputs=[x, y],
-            outputs=[Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')),borrow=True), Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(xent, 'float32')), borrow=True)],
-            updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
-            name="train")
-predict = theano.function(inputs=[x], outputs=Out(theano.sandbox.cuda.basic_ops.gpu_from_host(tt.cast(prediction, 'float32')), borrow=True),
-            name="predict")
-"""
-
 # Compile expressions to functions
 train = theano.function(
             inputs=[],
             outputs=[prediction, xent],
-            updates={w: w - 0.01 * gw, b: b - 0.01 * gb},
+            updates=[(w, w - 0.01 * gw), (b, b - 0.01 * gb)],
             name="train")
 predict = theano.function(inputs=[], outputs=prediction,
             name="predict")
 
-if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
+if any([n.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for n in
 train.maker.fgraph.toposort()]):
     print('Used the cpu')
-elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
+elif any([n.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for n in
 train.maker.fgraph.toposort()]):
     print('Used the gpu')
 else:
@@ -101,171 +88,171 @@
 # in the script, followed by a summary for all functions.
 # We'll show here only the summary:
 
-Results were produced using an Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz
+Results were produced using an Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz
 
 Function profiling
 ==================
-  Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
-  Time in 10002 calls to Function.__call__: 1.590916e+00s
-  Time in Function.fn.__call__: 1.492365e+00s (93.805%)
-  Time in thunks: 1.408159e+00s (88.512%)
-  Total compile time: 6.309664e+00s
-    Number of Apply nodes: 25
-    Theano Optimizer time: 4.848340e-01s
-       Theano validate time: 5.454302e-03s
-    Theano Linker time (includes C, CUDA code generation/compiling): 5.691789e+00s
-
+  Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
+  Time in 10001 calls to Function.__call__: 1.300452e+00s
+  Time in Function.fn.__call__: 1.215823e+00s (93.492%)
+  Time in thunks: 1.157602e+00s (89.015%)
+  Total compile time: 8.922548e-01s
+    Number of Apply nodes: 17
+    Theano Optimizer time: 6.270301e-01s
+       Theano validate time: 5.993605e-03s
+    Theano Linker time (includes C, CUDA code generation/compiling): 2.949309e-02s
+       Import time 3.543139e-03s
+
+Time in all call to theano.grad() 1.848292e-02s
+Time since theano import 2.864s
 Class
 ---
 <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
-  59.6%    59.6%       0.839s       4.19e-05s     C    20001       3   theano.tensor.blas_c.CGemv
-  30.1%    89.7%       0.424s       4.71e-06s     C    90001      10   theano.tensor.elemwise.Elemwise
-   5.5%    95.2%       0.078s       7.79e-02s     Py       1       1   theano.tensor.blas.Gemv
-   1.9%    97.1%       0.026s       1.30e-06s     C    20001       3   theano.tensor.basic.Alloc
-   1.3%    98.4%       0.018s       1.85e-06s     C    10000       1   theano.tensor.elemwise.Sum
-   1.0%    99.4%       0.014s       4.78e-07s     C    30001       4   theano.tensor.elemwise.DimShuffle
-   0.6%   100.0%       0.008s       4.23e-07s     C    20001       3   theano.compile.ops.Shape_i
+  64.5%    64.5%       0.747s       3.73e-05s     C    20001       3   theano.tensor.blas_c.CGemv
+  33.1%    97.7%       0.384s       4.79e-06s     C    80001       9   theano.tensor.elemwise.Elemwise
+   1.0%    98.6%       0.011s       1.14e-06s     C    10000       1   theano.tensor.elemwise.Sum
+   0.7%    99.4%       0.009s       2.85e-07s     C    30001       4   theano.tensor.elemwise.DimShuffle
+   0.3%    99.7%       0.004s       3.64e-07s     C    10001       2   theano.tensor.basic.AllocEmpty
+   0.3%   100.0%       0.004s       1.78e-07s     C    20001       3   theano.compile.ops.Shape_i
    ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
 
 Ops
 ---
 <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
-  59.6%    59.6%       0.839s       4.19e-05s     C     20001        3   CGemv{inplace}
-  15.8%    75.4%       0.223s       2.23e-05s     C     10000        1   Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)]
-   7.7%    83.1%       0.109s       1.09e-05s     C     10000        1   Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)]
-   5.5%    88.7%       0.078s       7.79e-02s     Py       1        1   Gemv{no_inplace}
-   4.3%    92.9%       0.060s       6.00e-06s     C     10000        1   Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}
-   1.9%    94.8%       0.026s       1.30e-06s     C     20001        3   Alloc
-   1.3%    96.1%       0.018s       1.85e-06s     C     10000        1   Sum{acc_dtype=float64}
-   0.7%    96.8%       0.009s       4.73e-07s     C     20001        3   InplaceDimShuffle{x}
-   0.6%    97.4%       0.009s       8.52e-07s     C     10000        1   Elemwise{sub,no_inplace}
-   0.6%    98.0%       0.008s       4.23e-07s     C     20001        3   Shape_i{0}
-   0.5%    98.5%       0.007s       7.06e-07s     C     10000        1   Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
-   0.5%    98.9%       0.007s       6.57e-07s     C     10000        1   Elemwise{neg,no_inplace}
-   0.3%    99.3%       0.005s       4.88e-07s     C     10000        1   InplaceDimShuffle{1,0}
-   0.3%    99.5%       0.004s       3.78e-07s     C     10000        1   Elemwise{inv,no_inplace}
-   0.2%    99.8%       0.003s       3.44e-07s     C     10000        1   Elemwise{Cast{float32}}
-   0.2%   100.0%       0.003s       3.01e-07s     C     10000        1   Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
-   0.0%   100.0%       0.000s       8.11e-06s     C        1        1   Elemwise{Composite{[GT(scalar_sigmoid(neg(sub(neg(i0), i1))), i2)]}}
+  64.5%    64.5%       0.747s       3.73e-05s     C     20001        3   CGemv{inplace}
+  18.7%    83.2%       0.217s       2.17e-05s     C     10000        1   Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[(0, 4)]
+   8.9%    92.1%       0.103s       1.03e-05s     C     10000        1   Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)]
+   4.3%    96.4%       0.050s       4.98e-06s     C     10000        1   Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}
+   1.0%    97.4%       0.011s       1.14e-06s     C     10000        1   Sum{acc_dtype=float64}
+   0.5%    97.9%       0.006s       2.83e-07s     C     20001        3   InplaceDimShuffle{x}
+   0.4%    98.3%       0.004s       4.22e-07s     C     10000        1   Elemwise{sub,no_inplace}
+   0.3%    98.6%       0.004s       3.70e-07s     C     10000        1   Elemwise{neg,no_inplace}
+   0.3%    98.9%       0.004s       3.64e-07s     C     10001        2   AllocEmpty{dtype='float32'}
+   0.3%    99.2%       0.004s       1.78e-07s     C     20001        3   Shape_i{0}
+   0.2%    99.5%       0.003s       2.88e-07s     C     10000        1   InplaceDimShuffle{1,0}
+   0.2%    99.7%       0.003s       2.65e-07s     C     10000        1   Elemwise{Composite{((-i0) - i1)}}[(0, 0)]
+   0.2%    99.9%       0.002s       1.98e-07s     C     10000        1   Elemwise{Cast{float32}}
+   0.1%   100.0%       0.002s       1.54e-07s     C     10000        1   Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
+   0.0%   100.0%       0.000s       4.77e-06s     C        1        1   Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}
    ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)
 
 Apply
 ------
 <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
-  31.6%    31.6%       0.445s       4.45e-05s   10000     7   CGemv{inplace}(Alloc.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
-  27.9%    59.6%       0.393s       3.93e-05s   10000    17   CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0, TensorConstant{0.999800026417})
-  15.8%    75.4%       0.223s       2.23e-05s   10000    14   Elemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]}}[(0, 4)](y, Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
-   7.7%    83.1%       0.109s       1.09e-05s   10000    15   Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)](Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Alloc.0, y, Elemwise{sub,no_inplace}.0, Elemwise{Cast{float32}}.0)
-   5.5%    88.7%       0.078s       7.79e-02s      1     0   Gemv{no_inplace}(aa, TensorConstant{1.0}, xx, yy, TensorConstant{0.0})
-   4.3%    92.9%       0.060s       6.00e-06s   10000    13   Elemwise{Composite{[GT(scalar_sigmoid(i0), i1)]}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
-   1.3%    94.2%       0.018s       1.85e-06s   10000    16   Sum{acc_dtype=float64}(Elemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(scalar_sigmoid(neg(i0)), i4), i5))]}}[(0, 0)].0)
-   1.0%    95.2%       0.013s       1.34e-06s   10000     5   Alloc(TensorConstant{0.0}, Shape_i{0}.0)
-   0.9%    96.1%       0.013s       1.27e-06s   10000    12   Alloc(Elemwise{inv,no_inplace}.0, Shape_i{0}.0)
-   0.6%    96.7%       0.009s       8.52e-07s   10000     4   Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
-   0.5%    97.2%       0.007s       7.06e-07s   10000     9   Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
-   0.5%    97.6%       0.007s       6.57e-07s   10000    11   Elemwise{neg,no_inplace}(Elemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
-   0.4%    98.1%       0.006s       6.27e-07s   10000     0   InplaceDimShuffle{x}(b)
-   0.4%    98.5%       0.006s       5.90e-07s   10000     1   Shape_i{0}(x)
-   0.3%    98.9%       0.005s       4.88e-07s   10000     2   InplaceDimShuffle{1,0}(x)
-   0.3%    99.1%       0.004s       3.78e-07s   10000    10   Elemwise{inv,no_inplace}(Elemwise{Cast{float32}}.0)
-   0.2%    99.4%       0.003s       3.44e-07s   10000     8   Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
-   0.2%    99.6%       0.003s       3.19e-07s   10000     6   InplaceDimShuffle{x}(Shape_i{0}.0)
-   0.2%    99.8%       0.003s       3.01e-07s   10000    18   Elemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
-   0.2%   100.0%       0.003s       2.56e-07s   10000     3   Shape_i{0}(y)
-   ... (remaining 5 Apply instances account for 0.00%(0.00s) of the runtime)
+  34.0%    34.0%       0.394s       3.94e-05s   10000     7   CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
+  30.5%    64.5%       0.353s       3.53e-05s   10000    15   CGemv{inplace}(w, TensorConstant{-0.00999999977648}, x.T, Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)].0, TensorConstant{0.999800026417})
+  18.7%    83.2%       0.217s       2.17e-05s   10000    12   Elemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[(0, 4)](y, Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, Elemwise{sub,no_inplace}.0, Elemwise{neg,no_inplace}.0)
+   8.9%    92.1%       0.103s       1.03e-05s   10000    13   Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)](Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0, TensorConstant{(1,) of -1.0}, y, Elemwise{Cast{float32}}.0, Elemwise{sub,no_inplace}.0)
+   4.3%    96.4%       0.050s       4.98e-06s   10000    11   Elemwise{Composite{GT(scalar_sigmoid(i0), i1)}}(Elemwise{neg,no_inplace}.0, TensorConstant{(1,) of 0.5})
+   1.0%    97.4%       0.011s       1.14e-06s   10000    14   Sum{acc_dtype=float64}(Elemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((scalar_sigmoid((-i0)) * i1 * i4) / i3))}}[(0, 0)].0)
+   0.4%    97.8%       0.004s       4.22e-07s   10000     4   Elemwise{sub,no_inplace}(TensorConstant{(1,) of 1.0}, y)
+   0.3%    98.1%       0.004s       3.76e-07s   10000     0   InplaceDimShuffle{x}(b)
+   0.3%    98.4%       0.004s       3.70e-07s   10000    10   Elemwise{neg,no_inplace}(Elemwise{Composite{((-i0) - i1)}}[(0, 0)].0)
+   0.3%    98.7%       0.004s       3.64e-07s   10000     5   AllocEmpty{dtype='float32'}(Shape_i{0}.0)
+   0.2%    99.0%       0.003s       2.88e-07s   10000     2   InplaceDimShuffle{1,0}(x)
+   0.2%    99.2%       0.003s       2.65e-07s   10000     9   Elemwise{Composite{((-i0) - i1)}}[(0, 0)](CGemv{inplace}.0, InplaceDimShuffle{x}.0)
+   0.2%    99.4%       0.002s       2.21e-07s   10000     1   Shape_i{0}(x)
+   0.2%    99.6%       0.002s       1.98e-07s   10000     8   Elemwise{Cast{float32}}(InplaceDimShuffle{x}.0)
+   0.2%    99.7%       0.002s       1.90e-07s   10000     6   InplaceDimShuffle{x}(Shape_i{0}.0)
+   0.1%    99.9%       0.002s       1.54e-07s   10000    16   Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](b, TensorConstant{0.00999999977648}, Sum{acc_dtype=float64}.0)
+   0.1%   100.0%       0.001s       1.34e-07s   10000     3   Shape_i{0}(y)
+   0.0%   100.0%       0.000s       3.89e-05s      1     3   CGemv{inplace}(AllocEmpty{dtype='float32'}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
+   0.0%   100.0%       0.000s       4.77e-06s      1     4   Elemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}(CGemv{inplace}.0, InplaceDimShuffle{x}.0, TensorConstant{(1,) of 0.5})
+   0.0%   100.0%       0.000s       1.19e-06s      1     0   InplaceDimShuffle{x}(b)
+   ... (remaining 2 Apply instances account for 0.00%(0.00s) of the runtime)
+
 
 
 
 # 2.2 Profiling for GPU computations
 
 # In your terminal, type:
-$ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True,device=gpu python using_gpu_solution_1.py
+$ CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=profile=True,device=cuda python using_gpu_solution_1.py
 
 # You'll see first the output of the script:
 Used the gpu
 target values for D
 prediction on D
 
-Results were produced using a GeForce GTX TITAN
+Results were produced using a GeForce GTX TITAN X
 
 # Profiling summary for all functions:
 
 Function profiling
 ==================
-  Message: Sum of all(3) printed profiles at exit excluding Scan op profile.
-  Time in 10002 calls to Function.__call__: 3.535239e+00s
-  Time in Function.fn.__call__: 3.420863e+00s (96.765%)
-  Time in thunks: 2.865905e+00s (81.067%)
-  Total compile time: 4.728150e-01s
-    Number of Apply nodes: 36
-    Theano Optimizer time: 4.283385e-01s
-       Theano validate time: 7.687330e-03s
-    Theano Linker time (includes C, CUDA code generation/compiling): 2.801418e-02s
-
+  Message: Sum of all(2) printed profiles at exit excluding Scan op profile.
+  Time in 10001 calls to Function.__call__: 4.181247e+00s
+  Time in Function.fn.__call__: 4.081113e+00s (97.605%)
+  Time in thunks: 3.915566e+00s (93.646%)
+  Total compile time: 9.256095e+00s
+    Number of Apply nodes: 21
+    Theano Optimizer time: 9.996419e-01s
+       Theano validate time: 6.523132e-03s
+    Theano Linker time (includes C, CUDA code generation/compiling): 8.239602e+00s
+       Import time 4.228115e-03s
+
+Time in all call to theano.grad() 3.286195e-02s
+Time since theano import 15.415s
 Class
 ---
 <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
-  45.7%    45.7%       1.308s       1.64e-05s     C    80001       9   theano.sandbox.cuda.basic_ops.GpuElemwise
-  17.2%    62.8%       0.492s       2.46e-05s     C    20002       4   theano.sandbox.cuda.blas.GpuGemv
-  15.1%    77.9%       0.433s       2.17e-05s     C    20001       3   theano.sandbox.cuda.basic_ops.GpuAlloc
-   8.2%    86.1%       0.234s       1.17e-05s     C    20002       4   theano.sandbox.cuda.basic_ops.HostFromGpu
-   7.2%    93.3%       0.207s       2.07e-05s     C    10000       1   theano.sandbox.cuda.basic_ops.GpuCAReduce
-   4.4%    97.7%       0.127s       1.27e-05s     C    10003       4   theano.sandbox.cuda.basic_ops.GpuFromHost
-   0.9%    98.6%       0.025s       8.23e-07s     C    30001       4   theano.sandbox.cuda.basic_ops.GpuDimShuffle
-   0.7%    99.3%       0.020s       9.88e-07s     C    20001       3   theano.tensor.elemwise.Elemwise
-   0.5%    99.8%       0.014s       7.18e-07s     C    20001       3   theano.compile.ops.Shape_i
-   0.2%   100.0%       0.006s       5.78e-07s     C    10000       1   theano.tensor.elemwise.DimShuffle
+  59.5%    59.5%       2.329s       1.16e-04s     C    20001       3   theano.sandbox.gpuarray.blas.GpuGemv
+  29.8%    89.3%       1.166s       1.30e-05s     C    90001      10   theano.sandbox.gpuarray.elemwise.GpuElemwise
+   4.1%    93.4%       0.162s       8.10e-06s     C    20001       3   theano.sandbox.gpuarray.basic_ops.HostFromGpu
+   3.3%    96.7%       0.131s       1.31e-05s     C    10000       1   theano.sandbox.gpuarray.elemwise.GpuCAReduceCuda
+   1.6%    98.3%       0.061s       6.10e-06s     C    10000       1   theano.sandbox.gpuarray.basic_ops.GpuFromHost
+   0.8%    99.1%       0.033s       1.09e-06s     C    30001       4   theano.sandbox.gpuarray.elemwise.GpuDimShuffle
+   0.7%    99.8%       0.026s       2.59e-06s     C    10001       2   theano.sandbox.gpuarray.basic_ops.GpuAllocEmpty
+   0.2%   100.0%       0.008s       3.95e-07s     C    20001       3   theano.compile.ops.Shape_i
    ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
 
 Ops
 ---
 <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
-  17.2%    17.2%       0.492s       2.46e-05s     C     20001        3   GpuGemv{inplace}
-   8.2%    25.3%       0.234s       1.17e-05s     C     20002        4   HostFromGpu
-   8.0%    33.3%       0.228s       2.28e-05s     C     10001        2   GpuAlloc{memset_0=True}
-   7.4%    40.7%       0.211s       2.11e-05s     C     10000        1   GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}
-   7.2%    47.9%       0.207s       2.07e-05s     C     10000        1   GpuCAReduce{add}{1}
-   7.1%    55.0%       0.205s       2.05e-05s     C     10000        1   GpuAlloc
-   6.9%    62.0%       0.198s       1.98e-05s     C     10000        1   GpuElemwise{sub,no_inplace}
-   6.9%    68.9%       0.198s       1.98e-05s     C     10000        1   GpuElemwise{inv,no_inplace}
-   6.2%    75.1%       0.178s       1.78e-05s     C     10000        1   GpuElemwise{neg,no_inplace}
-   5.6%    80.6%       0.159s       1.59e-05s     C     10000        1   GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)]
-   4.4%    85.1%       0.127s       1.27e-05s     C     10003        4   GpuFromHost
-   4.3%    89.4%       0.124s       1.24e-05s     C     10000        1   GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)]
-   4.2%    93.6%       0.121s       1.21e-05s     C     10000        1   GpuElemwise{ScalarSigmoid}[(0, 0)]
-   4.2%    97.7%       0.119s       1.19e-05s     C     10000        1   GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)]
-   0.5%    98.2%       0.014s       7.18e-07s     C     20001        3   Shape_i{0}
-   0.5%    98.7%       0.013s       1.33e-06s     C     10001        2   Elemwise{gt,no_inplace}
-   0.3%    99.0%       0.010s       9.81e-07s     C     10000        1   GpuDimShuffle{1,0}
-   0.3%    99.3%       0.008s       7.90e-07s     C     10000        1   GpuDimShuffle{0}
-   0.2%    99.6%       0.007s       6.97e-07s     C     10001        2   GpuDimShuffle{x}
-   0.2%    99.8%       0.006s       6.50e-07s     C     10000        1   Elemwise{Cast{float32}}
-   ... (remaining 3 Ops account for   0.20%(0.01s) of the runtime)
+  59.5%    59.5%       2.329s       1.16e-04s     C     20001        3   GpuGemv{inplace=True}
+   4.1%    63.6%       0.162s       8.10e-06s     C     20001        3   HostFromGpu(gpuarray)
+   4.0%    67.6%       0.157s       1.57e-05s     C     10000        1   GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>
+   3.8%    71.4%       0.149s       1.49e-05s     C     10000        1   GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>
+   3.7%    75.1%       0.144s       1.44e-05s     C     10000        1   GpuElemwise{sub,no_inplace}
+   3.6%    78.7%       0.141s       1.41e-05s     C     10000        1   GpuElemwise{gt,no_inplace}
+   3.4%    82.1%       0.133s       1.33e-05s     C     10000        1   GpuElemwise{Cast{float32}}[]<gpuarray>
+   3.4%    85.5%       0.133s       1.33e-05s     C     10000        1   GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>
+   3.3%    88.8%       0.131s       1.31e-05s     C     10000        1   GpuCAReduceCuda{add}
+   2.9%    91.7%       0.112s       1.12e-05s     C     10000        1   GpuElemwise{neg,no_inplace}
+   2.6%    94.3%       0.102s       1.02e-05s     C     10000        1   GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]<gpuarray>
+   2.5%    96.7%       0.096s       9.63e-06s     C     10000        1   GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>
+   1.6%    98.3%       0.061s       6.10e-06s     C     10000        1   GpuFromHost<None>
+   0.7%    99.0%       0.026s       2.59e-06s     C     10001        2   GpuAllocEmpty{dtype='float32', context_name=None}
+   0.5%    99.5%       0.021s       1.06e-06s     C     20001        3   InplaceGpuDimShuffle{x}
+   0.3%    99.8%       0.011s       1.14e-06s     C     10000        1   InplaceGpuDimShuffle{1,0}
+   0.2%   100.0%       0.008s       3.95e-07s     C     20001        3   Shape_i{0}
+   0.0%   100.0%       0.000s       2.00e-05s     C        1        1   GpuElemwise{Composite{GT(scalar_sigmoid((-((-i0) - i1))), i2)}}[]<gpuarray>
+   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)
 
 Apply
 ------
 <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
-   8.8%     8.8%       0.251s       2.51e-05s   10000    22   GpuGemv{inplace}(w, TensorConstant{-0.00999999977648}, GpuDimShuffle{1,0}.0, GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0, TensorConstant{0.999800026417})
-   8.4%    17.2%       0.241s       2.41e-05s   10000     7   GpuGemv{inplace}(GpuAlloc{memset_0=True}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
-   8.0%    25.1%       0.228s       2.28e-05s   10000     5   GpuAlloc{memset_0=True}(CudaNdarrayConstant{[ 0.]}, Shape_i{0}.0)
-   7.4%    32.5%       0.211s       2.11e-05s   10000    13   GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}(y, GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
-   7.2%    39.7%       0.207s       2.07e-05s   10000    21   GpuCAReduce{add}{1}(GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)].0)
-   7.1%    46.9%       0.205s       2.05e-05s   10000    17   GpuAlloc(GpuDimShuffle{0}.0, Shape_i{0}.0)
-   6.9%    53.8%       0.198s       1.98e-05s   10000     4   GpuElemwise{sub,no_inplace}(CudaNdarrayConstant{[ 1.]}, y)
-   6.9%    60.7%       0.198s       1.98e-05s   10000    12   GpuElemwise{inv,no_inplace}(GpuFromHost.0)
-   6.2%    66.9%       0.178s       1.78e-05s   10000    11   GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0)
-   5.6%    72.5%       0.159s       1.59e-05s   10000    19   GpuElemwise{Composite{[add(mul(scalar_sigmoid(i0), i1, i2, i3), true_div(mul(i4, i5), i6))]}}[(0, 0)](GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)].0, CudaNdarrayConstant{[-1.]}, GpuAlloc.0, y, GpuElemwise{ScalarSigmoid}[(0, 0)].0, GpuElemwise{sub,no_inplace}.0, GpuFromHost.0)
-   4.8%    77.3%       0.138s       1.38e-05s   10000    18   HostFromGpu(GpuElemwise{ScalarSigmoid}[(0, 0)].0)
-   4.4%    81.7%       0.126s       1.26e-05s   10000    10   GpuFromHost(Elemwise{Cast{float32}}.0)
-   4.3%    86.0%       0.124s       1.24e-05s   10000     9   GpuElemwise{Composite{[sub(neg(i0), i1)]}}[(0, 0)](GpuGemv{inplace}.0, GpuDimShuffle{x}.0)
-   4.2%    90.2%       0.121s       1.21e-05s   10000    15   GpuElemwise{ScalarSigmoid}[(0, 0)](GpuElemwise{neg,no_inplace}.0)
-   4.2%    94.4%       0.119s       1.19e-05s   10000    23   GpuElemwise{Composite{[sub(i0, mul(i1, i2))]}}[(0, 0)](b, CudaNdarrayConstant{0.00999999977648}, GpuCAReduce{add}{1}.0)
-   3.4%    97.7%       0.096s       9.61e-06s   10000    16   HostFromGpu(GpuElemwise{Composite{[sub(mul(i0, scalar_softplus(i1)), mul(i2, i3, scalar_softplus(i4)))]},no_inplace}.0)
-   0.5%    98.2%       0.013s       1.33e-06s   10000    20   Elemwise{gt,no_inplace}(HostFromGpu.0, TensorConstant{(1,) of 0.5})
-   0.3%    98.5%       0.010s       9.81e-07s   10000     2   GpuDimShuffle{1,0}(x)
-   0.3%    98.8%       0.008s       8.27e-07s   10000     1   Shape_i{0}(x)
-   0.3%    99.1%       0.008s       7.90e-07s   10000    14   GpuDimShuffle{0}(GpuElemwise{inv,no_inplace}.0)
-   ... (remaining 16 Apply instances account for 0.90%(0.03s) of the runtime)
+  55.0%    55.0%       2.154s       2.15e-04s   10000     7   GpuGemv{inplace=True}(GpuAllocEmpty{dtype='float32', context_name=None}.0, TensorConstant{1.0}, x, w, TensorConstant{0.0})
+   4.5%    59.5%       0.176s       1.76e-05s   10000    18   GpuGemv{inplace=True}(w, TensorConstant{-0.00999999977648}, InplaceGpuDimShuffle{1,0}.0, GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>.0, TensorConstant{0.999800026417})
+   4.0%    63.5%       0.157s       1.57e-05s   10000    12   GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>(y, GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[-1.]}, GpuElemwise{sub,no_inplace}.0, GpuElemwise{neg,no_inplace}.0)
+   3.8%    67.3%       0.149s       1.49e-05s   10000    15   GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>(GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[-1.]}, y, GpuElemwise{Cast{float32}}[]<gpuarray>.0, GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>.0, GpuElemwise{sub,no_inplace}.0)
+   3.7%    71.0%       0.144s       1.44e-05s   10000     4   GpuElemwise{sub,no_inplace}(GpuArrayConstant{[ 1.]}, y)
+   3.6%    74.6%       0.141s       1.41e-05s   10000    16   GpuElemwise{gt,no_inplace}(GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>.0, GpuArrayConstant{[ 0.5]})
+   3.4%    78.0%       0.133s       1.33e-05s   10000    10   GpuElemwise{Cast{float32}}[]<gpuarray>(InplaceGpuDimShuffle{x}.0)
+   3.4%    81.4%       0.133s       1.33e-05s   10000     9   GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>(GpuGemv{inplace=True}.0, InplaceGpuDimShuffle{x}.0)
+   3.3%    84.7%       0.131s       1.31e-05s   10000    17   GpuCAReduceCuda{add}(GpuElemwise{Composite{(((scalar_sigmoid(i0) * i1 * i2) / i3) - ((i4 * i1 * i5) / i3))}}[(0, 0)]<gpuarray>.0)
+   2.9%    87.5%       0.112s       1.12e-05s   10000    11   GpuElemwise{neg,no_inplace}(GpuElemwise{Composite{((-i0) - i1)}}[(0, 0)]<gpuarray>.0)
+   2.6%    90.1%       0.102s       1.02e-05s   10000    20   GpuElemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]<gpuarray>(b, GpuArrayConstant{0.00999999977648}, GpuCAReduceCuda{add}.0)
+   2.5%    92.6%       0.096s       9.63e-06s   10000    13   GpuElemwise{ScalarSigmoid}[(0, 0)]<gpuarray>(GpuElemwise{neg,no_inplace}.0)
+   2.3%    94.9%       0.090s       9.04e-06s   10000    19   HostFromGpu(gpuarray)(GpuElemwise{gt,no_inplace}.0)
+   1.8%    96.7%       0.072s       7.16e-06s   10000    14   HostFromGpu(gpuarray)(GpuElemwise{Composite{((i0 * scalar_softplus(i1)) - (i2 * i3 * scalar_softplus(i4)))}}[]<gpuarray>.0)
+   1.6%    98.3%       0.061s       6.10e-06s   10000     6   GpuFromHost<None>(Shape_i{0}.0)
+   0.7%    99.0%       0.026s       2.59e-06s   10000     5   GpuAllocEmpty{dtype='float32', context_name=None}(Shape_i{0}.0)
+   0.3%    99.3%       0.013s       1.33e-06s   10000     0   InplaceGpuDimShuffle{x}(b)
+   0.3%    99.6%       0.011s       1.14e-06s   10000     2   InplaceGpuDimShuffle{1,0}(x)
+   0.2%    99.8%       0.008s       7.94e-07s   10000     8   InplaceGpuDimShuffle{x}(GpuFromHost<None>.0)
+   0.1%    99.9%       0.005s       5.27e-07s   10000     1   Shape_i{0}(x)
+   ... (remaining 7 Apply instances account for 0.07%(0.00s) of the runtime)
 
 
 # 3. Conclusions
diff --git a/theano/misc/check_blas.py b/theano/misc/check_blas.py
old mode 100755
new mode 100644
index f5f0bf257c0..e9cd8b1b9c7
--- a/theano/misc/check_blas.py
+++ b/theano/misc/check_blas.py
@@ -86,15 +86,20 @@ def execute(execute=True, verbose=True, M=2000, N=2000, K=2000,
     t0 = 0
     t1 = -1
 
     if execute:
+        f()  # Ignore first function call to get representative time.
         sync = (hasattr(theano, "sandbox") and
                 hasattr(theano.sandbox, "cuda") and
                 theano.sandbox.cuda.cuda_available)
+        sync2 = (hasattr(theano, "gpuarray") and
+                 theano.gpuarray.pygpu_activated)
         t0 = time.time()
         for i in range(iters):
             f()
         if sync:
             theano.sandbox.cuda.synchronize()
+        if sync2:
+            c.get_value(borrow=True, return_internal_type=True).sync()
         t1 = time.time()
     return t1 - t0, impl