
[Code Example, Challenge] Computing one ConvNet layer using CUDA

NCTU IEE 2016 Fall
Computer Architecture Final Project

Annotation

This was a challenge project at NCTU (National Chiao Tung University): use the CUDA parallel computing framework to speed up the computation of one ConvNet layer. Whichever team achieved the maximum speedup on GPU compared to CPU would win. This code took first place in the first round, fourth place in the second round, and first place overall.

Each team was provided with one server with an NVIDIA GTX 680 GPU on board. The same one. Yes, every team was provided with the same server and the same GPU. Simultaneously. Feel the pain.

Methods used to achieve maximum speedup included sparse arrays, shared GPU memory, and loop unrolling. Loop unrolling alone gave about a 0.5 ms boost, which secured first place in the first round. Another trick was to switch the compiler target architecture from the default (compute_10) to a better one (compute_30): global memory accesses are cached on compute_30 but not on compute_10, which yields a reasonable speedup.
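For reference, the architecture switch is a single compiler flag, and unrolling can be requested with a pragma. A minimal sketch with hypothetical file, function, and constant names (the repository's actual Makefile and kernels may differ):

    # target compute capability 3.0 (GTX 680) instead of the compute_10 default
    nvcc -arch=compute_30 -code=sm_30 -O2 convLayerGPU.cu -o convLayerGPU

    // loop unrolling inside a kernel: fully unroll a fixed-size filter loop
    #define FILT_SIZE 9            // hypothetical fixed filter length
    __device__ float dot9(const float *filt, const float *inNeu)
    {
        float sum = 0.0f;
        #pragma unroll
        for (int k = 0; k < FILT_SIZE; k++)
            sum += filt[k] * inNeu[k];
        return sum;
    }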

The full report is available inside this repository as well.

Contents

Original Task
  Three sub-directories
    ./data
    ./innerProduct
    ./device
Usage of the base program
Task
Evaluation
Rules
Useful References

Original Task

Part-I: Use CUDA to accelerate the operations of a typical convolutional layer in often-used large-scale neural networks. (You can find the description slides here)
Part-II: Accelerate a sparse convolutional layer with CUDA. (You can find the description slides here)
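
Part-I's workload is a plain dense convolution. As a baseline illustration of what such a kernel computes, here is a minimal naive sketch with one thread per output neuron; the constants and memory layout are assumptions for this example, not the assignment's actual interface:

    #define FMSIZE   32                       // assumed output feature-map width/height
    #define FMDEPTH  64                       // assumed number of input channels
    #define FILTSIZE 3                        // assumed filter width/height
    #define INSIZE   (FMSIZE + FILTSIZE - 1)  // input width/height, no padding, stride 1

    // Naive dense convolution: one thread per output neuron.
    __global__ void convLayerGPU_naive(const float *inNeu, const float *filt,
                                       float *outNeu)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // output column
        int y = blockIdx.y * blockDim.y + threadIdx.y;  // output row
        int d = blockIdx.z;                             // output filter index
        if (x >= FMSIZE || y >= FMSIZE) return;

        float sum = 0.0f;
        for (int c = 0; c < FMDEPTH; c++)               // input channels
            for (int fy = 0; fy < FILTSIZE; fy++)
                for (int fx = 0; fx < FILTSIZE; fx++)
                    sum += inNeu[(c * INSIZE + y + fy) * INSIZE + (x + fx)]
                         * filt[((d * FMDEPTH + c) * FILTSIZE + fy) * FILTSIZE + fx];
        outNeu[(d * FMSIZE + y) * FMSIZE + x] = sum;
    }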

Three sub-directories

./data

This directory contains the input data for the base program

  • ./data/filt.txt - Stores the values of the filters
  • ./data/filt.coo - Stores the values of the filters in COO format
  • ./data/inNeu.txt - Stores the values of the input neurons
  • ./data/inNeu.coo - Stores the values of the input neurons in COO format
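
COO (coordinate) format keeps only the non-zero elements, each stored together with its coordinates. The exact layout of the .coo files is not reproduced here; a minimal sketch of the idea:

    // One COO record per non-zero element: its coordinates plus the value.
    struct CooEntry {
        int   row;   // row index of the non-zero element
        int   col;   // column index
        float val;   // the value itself
    };
    // Example: the 3x3 matrix [[0,5,0],[0,0,0],[7,0,0]] shrinks to two
    // records: {0, 1, 5.0f} and {2, 0, 7.0f}.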

./innerProduct

This is an example showing how to use CUDA to accelerate an inner product

Usage

cd ./innerProduct
make
make run
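
The repository's actual example is the reference; for orientation, a typical CUDA inner product accumulates per-thread partial sums and reduces them in shared memory. A minimal sketch, assuming a 256-thread block and a zero-initialized output (float atomicAdd needs compute capability 2.0+, which the GTX 680 satisfies):

    __global__ void innerProduct(const float *a, const float *b, float *out, int n)
    {
        __shared__ float cache[256];                 // blockDim.x == 256 assumed
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;
        for (int i = tid; i < n; i += gridDim.x * blockDim.x)
            sum += a[i] * b[i];                      // per-thread partial dot product
        cache[threadIdx.x] = sum;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in shared memory
            if (threadIdx.x < s)
                cache[threadIdx.x] += cache[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            atomicAdd(out, cache[0]);                // combine block results; *out starts at 0
    }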

./device

The program in this directory shows the device information

Usage

cd ./device
make
make run
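
This is not the repository's source, but a device-information query with the CUDA runtime API typically looks like this minimal sketch:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int i = 0; i < count; i++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("Device %d: %s, compute capability %d.%d\n",
                   i, prop.name, prop.major, prop.minor);
            printf("  shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        }
        return 0;
    }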

Usage of the base program

Get the code and data for Part-II into a new branch

git clone https://github.com/OwlSoul/ConvLayer_CUDA.git

Compile the code

make

Run the code

make run

Task

  • Put the input data in sparse format and reimplement your CUDA kernels
  • Use the NVIDIA Visual Profiler to analyze and improve your code
  • Optimize your CUDA kernels for the sparse format
  • Improve the input data format (like using another sparse format rather than COO; see the sketch below)
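
For instance, CSR (compressed sparse row) replaces COO's per-element row indices with a single row-pointer array, so each row's non-zeros become a contiguous, directly addressable slice. A minimal host-side conversion sketch, assuming COO entries already sorted by row:

    // Convert sorted COO (rowIdx, colIdx, val) to CSR (rowPtr, colIdx, val).
    // nnz = number of non-zeros, nRows = number of matrix rows.
    void cooToCsr(const int *rowIdx, int nnz, int nRows, int *rowPtr)
    {
        for (int r = 0; r <= nRows; r++) rowPtr[r] = 0;
        for (int i = 0; i < nnz; i++) rowPtr[rowIdx[i] + 1]++;      // count per row
        for (int r = 0; r < nRows; r++) rowPtr[r + 1] += rowPtr[r]; // prefix sum
        // colIdx and val arrays are reused unchanged since the input is row-sorted.
    }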

Evaluation

  • convLayerCPU() will do the computation in C++ and store the output in outCPU

  • checker() will check whether the values stored in outCPU and outGPU are the same

    • Store your result in outGPU in dense format
    • You must pass this check to ensure your result is correct!
  • Use nvvp (or nvprof) to measure the kernel execution time and data transfer time

  • The TA will use TotalExecTime to evaluate your performance

      DataTransTime = DataHostToDeviceTime + DataDeviceToHostTime
      TotalExecTime = GPUKernelsExecTime + DataTransTime
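
nvprof reports these components directly, but they can also be measured in-program with CUDA events. A minimal sketch, with hypothetical buffer, kernel, and launch-configuration names:

    // Hypothetical names: h_in/h_out, d_in/d_filt/d_out, convLayerGPU, grid, block.
    cudaEvent_t start, stop;
    float totalMs = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_in, h_in, inBytes, cudaMemcpyHostToDevice);        // DataHostToDeviceTime
    convLayerGPU<<<grid, block>>>(d_in, d_filt, d_out);             // GPUKernelsExecTime
    cudaMemcpy(h_out, d_out, outBytes, cudaMemcpyDeviceToHost);     // DataDeviceToHostTime
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&totalMs, start, stop);                    // ~ TotalExecTime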
    

Rules

  • It is teamwork: 1 ~ 3 people per team
  • Compress your code and report into one zip file and upload it to the E3 system
    • Name your package: LeaderID_FP2.zip
    • Each team only needs to upload one package to the E3 system
    • Please name your report: LeaderID_Report_FP2.pdf
    • Make sure the TA can compile and run your code on the provided server
  • The use of any CUDA library is forbidden in this project
  • Late submission is NOT acceptable
  • Any plagiarism will result in a zero

Useful References

Part-I

Part-II

TA: Chien-Yu Lin
Email: myislin@gmail.com
