# Installation and Setup of TensorFlow and Keras with CUDA

This notebook explains how to install TensorFlow and Keras and how to configure them to run on a GPU.

For PyTorch, see this article on [PyTorch installation](https://medium.com/@_willfalcon/how-to-install-pytorch-1-0-with-cuda-10-0-169569c5b82d).

In [1]:
import sys
print(sys.version)

3.7.9 (default, Aug 31 2020, 17:10:11) [MSC v.1916 64 bit (AMD64)]


## 0. System requirements and general Steps

* Your system has an NVIDIA GPU
* CUDA is installed (in this case 10.1) $\rightarrow$ on Windows, Visual Studio Express must be installed first
* cuDNN is installed (in this case cudnn-10.1-windows10-x64-v7.6.5.32)
* An Anaconda distribution is installed (in this case Miniconda)
* The GPU version of TensorFlow is installed
* TensorFlow is verified to be running on the GPU

**The CUDA, cuDNN, and tensorflow-gpu versions must be compatible with each other.**
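
As a minimal sketch of what "compatible" means in practice, the check below compares a set of installed versions against the combinations used in this notebook. The table `KNOWN_GOOD` and the helpers `major_minor` and `is_known_good` are illustrative names, and the table only lists the versions mentioned here; it is not an exhaustive compatibility matrix.

```python
# Combinations taken from this notebook only (assumption: not a complete matrix).
KNOWN_GOOD = {
    "2.1.0": {"cuda": "10.1", "cudnn": "7.6"},
    "2.3.1": {"cuda": "10.1", "cudnn": "7.6"},
}


def major_minor(version):
    """Reduce a version string such as '10.1.105' to 'major.minor'."""
    return ".".join(version.split(".")[:2])


def is_known_good(tf_version, cuda_version, cudnn_version):
    """Return True if the combination appears in KNOWN_GOOD."""
    expected = KNOWN_GOOD.get(tf_version)
    if expected is None:
        # Unknown to this table, which does not necessarily mean incompatible.
        return False
    return (major_minor(cuda_version) == expected["cuda"]
            and major_minor(cudnn_version) == expected["cudnn"])


print(is_known_good("2.3.1", "10.1.105", "7.6.5"))  # True
print(is_known_good("2.3.1", "11.1", "8.0.4"))      # False
```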

**First of all, make sure the system has an NVIDIA GPU.**

In [2]:
!nvidia-smi

Fri Oct 02 12:08:40 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 456.38       Driver Version: 456.38       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX 166... WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   49C    P8     8W /  N/A |    600MiB /  6144MiB |     19%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|       
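
The same check can be scripted. The sketch below is one possible way to do it, assuming `nvidia-smi` is on the PATH; `query_gpus` is a hypothetical helper name, and it returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess


def query_gpus():
    """Return one 'name, driver_version' string per GPU, or [] if nvidia-smi is unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = query_gpus()
print(gpus if gpus else "No NVIDIA GPU detected through nvidia-smi")
```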

## 1. CUDA 10.1 Installation

* **Download VISUAL STUDIO EXPRESS 2019**

It is necessary to install [VSExpress](https://visualstudio.microsoft.com/es/vs/express/) to get the required base C++ files and libraries. The **C++ workloads** must be selected during installation.

* **Download CUDA 10.1**

Once Visual Studio Express is installed, [download CUDA Toolkit 10.1 from its official page](https://developer.nvidia.com/cuda-10.1-download-archive-base?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exelocal). There you can choose the OS, architecture, and installer type (network or local).

Take note of the exact installation location of the CUDA Toolkit, such as `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1`, because it has to be added to the PATH.

Add `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin` and `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\libnvvp` to the `PATH` environment variable.
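
To confirm that both directories really ended up on the PATH, a small check like the following can help. It is only a sketch: `CUDA_HOME` assumes the install location mentioned above, `missing_from_path` is a hypothetical helper, and the comparison ignores case and trailing slashes as Windows does.

```python
import os

# Assumption: the install location described above; adjust if yours differs.
CUDA_HOME = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1"
REQUIRED_DIRS = [CUDA_HOME + r"\bin", CUDA_HOME + r"\libnvvp"]


def _norm(path):
    """Normalize a Windows path for comparison (case and trailing slash)."""
    return path.strip().rstrip("\\/").lower()


def missing_from_path(required_dirs, path_value, sep=";"):
    """Return the required directories that are not listed in path_value."""
    entries = {_norm(p) for p in path_value.split(sep) if p.strip()}
    return [d for d in required_dirs if _norm(d) not in entries]


# An empty list means both CUDA directories are on the PATH.
print(missing_from_path(REQUIRED_DIRS, os.environ.get("PATH", "")))
```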


In [3]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:26_Pacific_Standard_Time_2019
Cuda compilation tools, release 10.1, V10.1.105
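
The release number can also be read programmatically, for example to compare it with the version TensorFlow expects. The sketch below parses text in the format shown above; `parse_nvcc_release` is an illustrative helper name, not part of any CUDA tooling.

```python
import re


def parse_nvcc_release(output):
    """Extract (major, minor) from `nvcc --version` output, or None if not found."""
    match = re.search(r"release\s+(\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:26_Pacific_Standard_Time_2019
Cuda compilation tools, release 10.1, V10.1.105"""

print(parse_nvcc_release(sample))  # (10, 1)
```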


## 2. cuDNN 7.6.5 for CUDA 10.1 installation

* **Install cuDNN 7.6.5 for CUDA 10.1**

(cudnn-10.1-windows10-x64-v7.6.5.32)

[Download the zip for cuDNN](https://developer.nvidia.com/rdp/cudnn-download#). You must be logged in to download it.

Click [Archived cuDNN Releases](https://developer.nvidia.com/rdp/cudnn-archive) and choose *Download cuDNN v7.6.5 (November 5th, 2019), for CUDA 10.1*.

cuDNN does not come as an executable installer: **the relevant files must be copied from the cuDNN zip into the local CUDA directory**.


| Source  |   | Target  |
|---|---|---|
| <downloadpath\>\cudnn-10.1-windows10-x64-v7.6.5.32\cuda\bin\\**cudnn64_7.dll**  |  $\rightarrow$ | <cudapath\>NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin\  |
| <downloadpath\>\cudnn-10.1-windows10-x64-v7.6.5.32\cuda\include\\**cudnn.h**  |  $\rightarrow$ | <cudapath\>NVIDIA GPU Computing Toolkit\CUDA\v10.1\include\  |
| <downloadpath\>\cudnn-10.1-windows10-x64-v7.6.5.32\cuda\lib\x64\\**cudnn.lib**  |  $\rightarrow$ | <cudapath\>NVIDIA GPU Computing Toolkit\CUDA\v10.1\lib\x64\  |
    
**Make sure the file `cudnn64_7.dll` exists.** Newer cuDNN releases ship `cudnn64_8.dll` instead, and the TensorFlow versions used here, which look for `cudnn64_7.dll`, then crash.
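
The three copies in the table can be scripted as well. The following is a sketch under stated assumptions: `plan_copies` and `copy_cudnn` are hypothetical helpers, the directory arguments are placeholders for your own download and CUDA locations, and writing under `Program Files` will likely require an elevated prompt.

```python
import ntpath  # Windows path rules; importable on any OS
import shutil

# (subdirectory, file name) pairs from the table above.
CUDNN_FILES = [
    ("bin", "cudnn64_7.dll"),
    ("include", "cudnn.h"),
    (ntpath.join("lib", "x64"), "cudnn.lib"),
]


def plan_copies(cudnn_cuda_dir, cuda_dir):
    """Return (source_file, target_directory) pairs without touching the disk."""
    return [
        (ntpath.join(cudnn_cuda_dir, sub, name), ntpath.join(cuda_dir, sub))
        for sub, name in CUDNN_FILES
    ]


def copy_cudnn(cudnn_cuda_dir, cuda_dir):
    """Copy each cuDNN file into the matching CUDA subdirectory."""
    for source, target_dir in plan_copies(cudnn_cuda_dir, cuda_dir):
        shutil.copy2(source, target_dir)
```

Calling `plan_copies` first makes it possible to review the paths before running `copy_cudnn` with the same arguments.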

## 3. Install tensorflow-gpu

`pip install tensorflow-gpu` or `pip install tensorflow-gpu==version`

This setup works with versions 2.1.0 and 2.3.1.
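
A quick way to confirm that the package landed in the active environment, without actually importing it, is sketched below. `is_installed` is an illustrative helper built on the standard library's `importlib.util.find_spec`; it only checks that the module can be located, not that the GPU works.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


print("tensorflow importable:", is_installed("tensorflow"))
```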

***************

## 4. Checking if the system recognizes the GPU

### 4.1 Checking (in several ways) if TensorFlow recognizes the GPU

The [TensorFlow guide on GPU usage](https://www.tensorflow.org/guide/gpu) covers further topics such as:

* Limiting GPU memory usage and enabling dynamic memory growth.
* Dynamic device selection.
* Using multiple GPUs, including **simulating them on a single-GPU system**.


**TensorFlow code, and tf.keras models will transparently run on a single GPU with no code changes required.**
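
As an illustration of the first topic in the list above, the sketch below enables dynamic memory growth so TensorFlow allocates GPU memory on demand instead of reserving most of it at start-up. It assumes TensorFlow 2.1 or later and must run before any GPU operation; `enable_memory_growth` is a hypothetical wrapper around `tf.config.experimental.set_memory_growth`.

```python
import tensorflow as tf


def enable_memory_growth():
    """Enable on-demand GPU memory allocation; return the number of GPUs configured."""
    configured = 0
    for gpu in tf.config.list_physical_devices('GPU'):
        try:
            tf.config.experimental.set_memory_growth(gpu, True)
            configured += 1
        except RuntimeError as err:
            # Raised when the GPU has already been initialized in this process.
            print("Could not set memory growth:", err)
    return configured


print("GPUs configured:", enable_memory_growth())
```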


In [4]:
import tensorflow as tf 
from tensorflow.python.client import device_lib

First of all, we show the TensorFlow version.

In [5]:
tf.__version__

'2.3.1'

Check whether the installed TensorFlow distribution was built with GPU (CUDA) support.

In [6]:
tf.test.is_built_with_cuda()

True

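The following commands show whether TensorFlow sees the GPU (and hence whether CUDA, cuDNN, and tensorflow-gpu are properly installed). The first one, `tf.test.is_gpu_available`, is deprecated in favor of `tf.config.list_physical_devices('GPU')`, as its output notes.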

In [7]:
tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None)

Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.


True

In [8]:
tf.config.experimental.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [9]:
tf.config.list_physical_devices('GPU')

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [10]:
tf.config.experimental.list_logical_devices('GPU')

[LogicalDevice(name='/device:GPU:0', device_type='GPU')]

In [11]:
# Check all devices with its details
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 7217021839371242892,
 name: "/device:XLA_CPU:0"
 device_type: "XLA_CPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 10241166872665323710
 physical_device_desc: "device: XLA_CPU device",
 name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 4973462816
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 16761669919385134732
 physical_device_desc: "device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5",
 name: "/device:XLA_GPU:0"
 device_type: "XLA_GPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 16966958974781054450
 physical_device_desc: "device: XLA_GPU device"]

**Logging device placement**

To find out which devices your operations and tensors are assigned to, put `tf.debugging.set_log_device_placement(True)` as the first statement of your program. Enabling device placement logging causes any Tensor allocations or operations to be printed.

In [12]:
#Set log in older versions
#sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(log_device_placement=True))
tf.debugging.set_log_device_placement(True)

Example run: the log appears in the Miniconda prompt or in the Jupyter cell output, for example:

*Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0*

In [13]:

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)

tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)


**If TensorFlow does not run the operation on the GPU, the device can be set manually.**

Manual device placement
```python 
tf.debugging.set_log_device_placement(True)

# Place tensors on the CPU
with tf.device('/CPU:0'): #with tf.device('/device:GPU:0'):
  a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
  b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Run on the GPU
c = tf.matmul(a, b)
print(c)
```

You will see that now a and b are assigned to CPU:0. Since a device was not explicitly specified for the MatMul operation, the TensorFlow runtime will choose one based on the operation and available devices (GPU:0 in this example) and automatically copy tensors between devices if required.

### 4.2 Keras

Keras is already built into TensorFlow. Therefore, GPU usage and configuration are handled through TensorFlow.

In [14]:
tf.keras.__version__

'2.4.0'

### 4.3 Checking if PyTorch recognizes it
If PyTorch is installed:
```python
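# Confirm that PyTorch sees the GPU
from torch import cuda

print(cuda.is_available())       # True when a CUDA device is usable
print(cuda.device_count() > 0)   # True when at least one GPU is visible
if cuda.is_available():
    # Only query the device name when a GPU is actually present
    print(cuda.get_device_name(cuda.current_device()))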
```



## 5. Example Experiment
To make sure that Keras and TensorFlow are using the GPU, we build a small demo model and check the GPU usage.

We will compare the execution time of a toy model (mnist classification) executing on CPU vs GPU.

In [15]:
from __future__ import print_function

''' Not built-in Keras
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop'''

import tensorflow.keras as keras
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import RMSprop
import time

tf.debugging.set_log_device_placement(True)

s_time = time.time()

batch_size = 128
num_classes = 10
epochs = 20

# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))

model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

60000 train samples
10000 test samples
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 512)               401920    
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 512)               262656    
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                5130      
Total params: 669,706
Trainable params: 669,706
Non-trainable params: 0
_________________________________________________________________
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20


In [16]:
time.time() - s_time

28.41860294342041
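
The manual `time.time()` bookkeeping can be wrapped in a small reusable timer. The sketch below is one possible approach rather than the method used in this notebook: `timed` is a hypothetical context manager, it uses `time.perf_counter()` (a monotonic clock suited to measuring intervals), and the `time.sleep` call merely stands in for the training code.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("demo", timings):
    time.sleep(0.05)  # placeholder for model.fit(...) and model.evaluate(...)

print(timings)
```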

## 6. Comparison

In [17]:
tf.debugging.set_log_device_placement(True)

time_execution = dict()

for dev in [d for d in tf.config.experimental.list_logical_devices() if 'XLA' not in d.name]:
    start_time = time.time()

    with tf.device(dev.name):

        batch_size = 128
        num_classes = 10
        epochs = 20

        # the data, shuffled and split between train and test sets
        (x_train, y_train), (x_test, y_test) = mnist.load_data()

        x_train = x_train.reshape(60000, 784)
        x_test = x_test.reshape(10000, 784)
        x_train = x_train.astype('float32')
        x_test = x_test.astype('float32')
        x_train /= 255
        x_test /= 255
        print(x_train.shape[0], 'train samples')
        print(x_test.shape[0], 'test samples')

        # convert class vectors to binary class matrices
        y_train = keras.utils.to_categorical(y_train, num_classes)
        y_test = keras.utils.to_categorical(y_test, num_classes)

        model = Sequential()
        model.add(Dense(512, activation='relu', input_shape=(784,)))
        model.add(Dropout(0.2))
        model.add(Dense(512, activation='relu'))
        model.add(Dropout(0.2))
        model.add(Dense(10, activation='softmax'))

        model.summary()

        model.compile(loss='categorical_crossentropy',
                      optimizer=RMSprop(),
                      metrics=['accuracy'])

        history = model.fit(x_train, y_train,
                            batch_size=batch_size,
                            epochs=epochs,
                            verbose=1,
                            validation_data=(x_test, y_test))
        score = model.evaluate(x_test, y_test, verbose=0)
        print('Test loss:', score[0])
        print('Test accuracy:', score[1])
        
        time_execution[dev.device_type] = time.time() - start_time

60000 train samples
10000 test samples
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 512)               401920    
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 512)               262656    
_________________________________________________________________
dropout_3 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 10)                5130      
Total params: 669,706
Trainable params: 669,706
Non-trainable params: 0
_________________________________________________________________
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/2

In [18]:
time_execution

{'CPU': 76.77355575561523, 'GPU': 28.334858179092407}
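
To turn the two timings into a single figure, the speedup can be computed as the ratio of CPU time to GPU time. The sketch below reuses the values printed above; `speedup` is an illustrative helper. Note that each measurement also includes data loading and model construction, because `start_time` is set before those steps.

```python
# Values copied from the output above.
time_execution = {'CPU': 76.77355575561523, 'GPU': 28.334858179092407}


def speedup(timings, baseline="CPU", target="GPU"):
    """Return how many times faster the target device was than the baseline."""
    return timings[baseline] / timings[target]


print(f"GPU speedup over CPU: {speedup(time_execution):.2f}x")  # 2.71x
```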