Cannot get RTX 3090 card to start training #944

Closed
cfernandezpa opened this issue Oct 5, 2020 · 44 comments
Labels: backwards compatibility (issues concerning prior to current versions), tensorflow/training

Comments

@cfernandezpa

OS: Win 10
DeepLabCut Version: 2.2b8
Anaconda env used: DLC-GPU (cloned from Alex's github)
WxPython version: 4.0.7.post2
Tensorflow version: many, installed with pip (see below)
Cuda version: 10 and 11

Hi everyone,

First of all, I wanted to thank all the authors for this amazing software!

I'm starting to work with DeepLabCut and, after a few promising preliminary results with an "old" GPU (Turing architecture), we decided to upgrade to the recent Ampere architecture. Since it is also backwards compatible with old CUDA versions, we thought that it would be fine. However, after trying many combinations of TensorFlow and CUDA, I cannot make it work. Here are the combinations I have tried so far:

CUDA | TensorFlow | cuDNN | Works?
10 | 1.15.2 | 7.6.5 | Recognizes the GPU and runs some TF tests, but takes too long and ends up failing
10 | 1.15.0 | 7.6.5 | Same as with TF 1.15.2
10 | 1.14.0 | 7.6.5 | Does not detect the GPU
11 | 1.15.0 | 7.6.5 | Recognizes the GPU and runs some TF tests, but takes too long and ends up failing
11 | 1.13.1 | 7.6.5 | Recognizes the GPU and runs some TF tests, but takes too long and ends up failing
11 | 1.14.0 | 7.6.5 | Recognizes the GPU and runs some TF tests, but takes too long and ends up failing
11 | 1.15.4 | 7.6.5 | Does not detect the GPU
11 | 1.15.2 | 7.6.5 | Recognizes the GPU and runs some TF tests, but takes too long and ends up failing
*TensorFlow 1.13.1 does not detect the GPU either.

Using the combinations mentioned above that recognize the GPU and can print "Hello, Tensorflow", I end up stuck at this screen (see code output below).

I know the documentation says that CUDA 10.+ is not supported, but with the old card we had, it was running fine with CUDA 11. I have very limited knowledge about this, so I am not sure why or how it worked.

According to the CUDA documentation, the Ampere architecture is compatible with CUDA 10.2 or earlier. Also, according to the TensorFlow documentation, TensorFlow 1.15 should be compatible with Ampere. The only caveat is that it takes too long to start (up to 30 min), but that can be fixed by increasing the CUDA cache size.
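
As a minimal sketch of the cache-size workaround mentioned above: the CUDA driver reads the `CUDA_CACHE_MAXSIZE` environment variable (the size of its JIT compilation cache, in bytes), so it can be set from Python before TensorFlow is imported. The 2 GiB value below is only an assumption for illustration; this thread does not give a number.

```python
import os

# Assumed value for illustration: a 2 GiB JIT cache (the thread gives no figure).
CACHE_BYTES = 2 * 1024**3

# Must be set before the first CUDA call, i.e. before importing tensorflow.
os.environ["CUDA_CACHE_MAXSIZE"] = str(CACHE_BYTES)

print("CUDA_CACHE_MAXSIZE =", os.environ["CUDA_CACHE_MAXSIZE"])
```

The larger cache only helps on later runs: the first start still has to JIT-compile the kernels for the new architecture.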

So, to me, the only thing left that could be causing issues is cuDNN. According to Nvidia, support for Ampere only appeared in cuDNN 8. However, as far as I know, Anaconda only supports up to cuDNN 7.6.5 on Windows. Apparently it has reached cuDNN 8 on Linux.
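
The reasoning above can be written down as a small check. This is only a sketch under stated assumptions: the thresholds (CUDA 11.0 and cuDNN 8.0 as the first releases with Ampere support) reflect NVIDIA's release notes as I understand them, and `is_ampere_ready` is a hypothetical helper, not part of DeepLabCut or TensorFlow.

```python
def parse_version(text, width=2):
    """Turn '7.6.5' into (7, 6, 5), padding short versions such as '11' to (11, 0)."""
    parts = tuple(int(p) for p in text.split("."))
    return parts + (0,) * max(0, width - len(parts))


def is_ampere_ready(cuda, cudnn):
    """Assumption: Ampere needs CUDA >= 11.0 and cuDNN >= 8.0 for native support."""
    return parse_version(cuda) >= (11, 0) and parse_version(cudnn) >= (8, 0)


# Version pairs that appear in this thread.
for cuda, cudnn in [("10", "7.6.5"), ("11", "7.6.5"), ("11.1", "8.0.4.30")]:
    print(cuda, cudnn, is_ampere_ready(cuda, cudnn))
```

Under these assumptions, every row of the table fails the check because of cuDNN 7.6.5, regardless of the CUDA version.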

Code output

Selecting multi-animal trainer
Config:
{'all_joints': [[0],
[1],
[2],
[3],
[4],
[5],
[6],
[7],
[8],
[9],
[10],
[11],
[12]],
'all_joints_names': ['snout',
'cap',
'leftear',
'rightear',
'spine',
'lforepaw',
'rforepaw',
'lhindpaw',
'rhindpaw',
'tailbase',
'tailend',
'cornerofbox1',
'cornerofbox2'],
'batch_size': 8,
'crop_pad': 0,
'cropratio': 0.4,
'dataset': 'training-datasets\iteration-0\UnaugmentedDataSet_2CamTest9Oct4\2CamTest9_CF95shuffle3.pickle',
'dataset_type': 'multi-animal-imgaug',
'deterministic': False,
'display_iters': 500,
'fg_fraction': 0.25,
'global_scale': 0.8,
'init_weights': 'C:\Users\RyC\anaconda3\envs\dlc-gpu\lib\site-packages\deeplabcut\pose_estimation_tensorflow\models\pretrained\resnet_v1_50.ckpt',
'intermediate_supervision': False,
'intermediate_supervision_layer': 12,
'location_refinement': True,
'locref_huber_loss': True,
'locref_loss_weight': 0.05,
'locref_stdev': 7.2801,
'log_dir': 'log',
'max_input_size': 1500,
'mean_pixel': [123.68, 116.779, 103.939],
'metadataset': 'training-datasets\iteration-0\UnaugmentedDataSet_2CamTest9Oct4\Documentation_data-2CamTest9_95shuffle3.pickle',
'min_input_size': 64,
'mirror': False,
'multi_step': [[0.0001, 7500], [5e-05, 12000], [1e-05, 200000]],
'net_type': 'resnet_50',
'num_joints': 13,
'num_limbs': 55,
'optimizer': 'adam',
'pafwidth': 20,
'pairwise_huber_loss': False,
'pairwise_loss_weight': 0.1,
'pairwise_predict': False,
'partaffinityfield_graph': [[5, 9],
[4, 7],
[1, 3],
[6, 9],
[4, 8],
[5, 6],
[2, 8],
[0, 7],
[8, 9],
[1, 6],
[0, 10],
[3, 7],
[0, 3],
[2, 5],
[2, 4],
[5, 8],
[1, 2],
[4, 9],
[6, 7],
[2, 9],
[3, 10],
[6, 10],
[8, 10],
[1, 5],
[3, 6],
[0, 4],
[1, 10],
[7, 10],
[4, 10],
[2, 6],
[4, 5],
[1, 4],
[2, 10],
[9, 10],
[3, 9],
[0, 5],
[1, 9],
[2, 3],
[0, 8],
[3, 5],
[0, 1],
[2, 7],
[7, 9],
[7, 8],
[5, 10],
[4, 6],
[6, 8],
[5, 7],
[3, 8],
[0, 6],
[1, 8],
[1, 7],
[0, 9],
[3, 4],
[0, 2]],
'partaffinityfield_predict': True,
'pos_dist_thresh': 17,
'project_path': 'C:\Users\RyC\2CamTest9-CF-2020-10-04',
'regularize': False,
'rotation': 25,
'rotratio': 0.4,
'save_iters': 10000,
'scale_jitter_lo': 0.5,
'scale_jitter_up': 1.25,
'scoremap_dir': 'test',
'shuffle': True,
'snapshot_prefix': 'C:\Users\RyC\2CamTest9-CF-2020-10-04\dlc-models\iteration-0\2CamTest9Oct4-trainset95shuffle3\train\snapshot',
'stride': 8.0,
'weigh_negatives': False,
'weigh_only_present_joints': False,
'weigh_part_predictions': False,
'weight_decay': 0.0001}
Activating limb prediction...
Starting with multi-animal imaug + adam pose-dataset loader.
Batch Size is 8
Getting specs multi-animal-imgaug 55 13
Initializing ResNet
Loading ImageNet-pretrained resnet_50
2020-10-05 10:40:16.943131: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-10-05 10:40:16.946595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.695
pciBusID: 0000:08:00.0
2020-10-05 10:40:16.946675: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2020-10-05 10:40:16.948226: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-10-05 10:40:16.948570: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_100.dll
2020-10-05 10:40:16.948928: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_100.dll
2020-10-05 10:40:16.949263: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_100.dll
2020-10-05 10:40:16.949302: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_100.dll
2020-10-05 10:40:16.949559: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-10-05 10:40:16.949840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2020-10-05 10:40:17.963045: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-05 10:40:17.963140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0
2020-10-05 10:40:17.964083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N
2020-10-05 10:40:17.964440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22071 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:08:00.0, compute capability: 8.6)
Max_iters overwritten as 3000
Display_iters overwritten as 10
Save_iters overwritten as 50
Training parameters:
{'stride': 8.0, 'weigh_part_predictions': False, 'weigh_negatives': False, 'fg_fraction': 0.25, 'mean_pixel': [123.68, 116.779, 103.939], 'shuffle': True, 'snapshot_prefix': 'C:\Users\RyC\2CamTest9-CF-2020-10-04\dlc-models\iteration-0\2CamTest9Oct4-trainset95shuffle3\train\snapshot', 'log_dir': 'log', 'global_scale': 0.8, 'location_refinement': True, 'locref_stdev': 7.2801, 'locref_loss_weight': 0.05, 'locref_huber_loss': True, 'optimizer': 'adam', 'intermediate_supervision': False, 'intermediate_supervision_layer': 12, 'regularize': False, 'weight_decay': 0.0001, 'crop_pad': 0, 'scoremap_dir': 'test', 'batch_size': 8, 'dataset_type': 'multi-animal-imgaug', 'deterministic': False, 'mirror': False, 'pairwise_huber_loss': False, 'weigh_only_present_joints': False, 'partaffinityfield_predict': True, 'pairwise_predict': True, 'all_joints': [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12]], 'all_joints_names': ['snout', 'cap', 'leftear', 'rightear', 'spine', 'lforepaw', 'rforepaw', 'lhindpaw', 'rhindpaw', 'tailbase', 'tailend', 'cornerofbox1', 'cornerofbox2'], 'cropratio': 0.4, 'dataset': 'training-datasets\iteration-0\UnaugmentedDataSet_2CamTest9Oct4\2CamTest9_CF95shuffle3.pickle', 'display_iters': 500, 'init_weights': 'C:\Users\RyC\anaconda3\envs\dlc-gpu\lib\site-packages\deeplabcut\pose_estimation_tensorflow\models\pretrained\resnet_v1_50.ckpt', 'max_input_size': 1500, 'metadataset': 'training-datasets\iteration-0\UnaugmentedDataSet_2CamTest9Oct4\Documentation_data-2CamTest9_95shuffle3.pickle', 'min_input_size': 64, 'multi_step': [[0.0001, 7500], [5e-05, 12000], [1e-05, 200000]], 'net_type': 'resnet_50', 'num_joints': 13, 'num_limbs': 55, 'pafwidth': 20, 'pairwise_loss_weight': 0.1, 'partaffinityfield_graph': [[5, 9], [4, 7], [1, 3], [6, 9], [4, 8], [5, 6], [2, 8], [0, 7], [8, 9], [1, 6], [0, 10], [3, 7], [0, 3], [2, 5], [2, 4], [5, 8], [1, 2], [4, 9], [6, 7], [2, 9], [3, 10], [6, 10], [8, 10], [1, 5], [3, 6], [0, 4], [1, 10], [7, 10], [4, 10], [2, 
6], [4, 5], [1, 4], [2, 10], [9, 10], [3, 9], [0, 5], [1, 9], [2, 3], [0, 8], [3, 5], [0, 1], [2, 7], [7, 9], [7, 8], [5, 10], [4, 6], [6, 8], [5, 7], [3, 8], [0, 6], [1, 8], [1, 7], [0, 9], [3, 4], [0, 2]], 'pos_dist_thresh': 17, 'project_path': 'C:\Users\RyC\2CamTest9-CF-2020-10-04', 'rotation': 25, 'rotratio': 0.4, 'save_iters': 10000, 'scale_jitter_lo': 0.5, 'scale_jitter_up': 1.25}
Starting multi-animal training....
2020-10-05 10:40:27.731872: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
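
The log above already hints at the mismatch: the card reports compute capability 8.6, while the loaded libraries are the CUDA 10.0 runtime (cudart64_100.dll) and cuDNN 7 (cudnn64_7.dll). Below is a small sketch for pulling those two facts out of a TensorFlow log. `summarize_log` is a hypothetical helper, and the sample lines are copied from the output above with the timestamps dropped.

```python
import re

LOG = """\
I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22071 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:08:00.0, compute capability: 8.6)
"""


def summarize_log(log):
    """Return (loaded library names, compute capability tuple or None)."""
    libs = re.findall(r"Successfully opened dynamic library (\S+)", log)
    match = re.search(r"compute capability: (\d+)\.(\d+)", log)
    capability = (int(match.group(1)), int(match.group(2))) if match else None
    return libs, capability


libs, capability = summarize_log(LOG)
print("libraries:", libs)
print("compute capability:", capability)
```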

After reading some forums, where people have been successful using symlinks in other applications, I tried that with cudnn64_7.dll, hard-linking it to cudnn64_8.dll inside the DLC-GPU environment, but I have not been able to make it work. It shows an error saying that the compute capabilities do not match.

Do you have any suggestions that I might try?

Many thanks in advance.

@MMathisLab
Member

We just got a 3090 in the lab this week; so we can test it. But in general what I would suggest is running our testscripts always as a first pass after installation.

https://www.youtube.com/watch?v=IOWtKn3l33s

https://github.com/DeepLabCut/DeepLabCut/tree/master/examples

@cfernandezpa
Author

Thank you very much for the reply. That is great news; hopefully you will be able to make it work! I won't be able to try the testscripts this week, but I'll do it next week for sure and report back.

Thanks again!

@MMathisLab
Member

sorry we haven't gotten to this yet; but you might try our dev branch with TF2.x--> https://github.com/DeepLabCut/DeepLabCut-core/tree/tf2.2alpha

@cfernandezpa
Author

Hi,

I have tried the testscript with version 2.2b8 and it stops at the same point as when I tried with my own data set.
I am going to try the dev branch now and see how it goes.

Thanks!

@cfernandezpa
Author

Hi,

I am currently trying the dev branch and I was able to start training. However, it was far from ideal. First, I learned that TF2.2 does not work with CUDA 11 (let alone 11.1), so it won't recognize the GPU. I had to install CUDA 10.1, which is supposed to be the version that works with TF2.2. That change made the system recognize the GPU. Then, training took a long time to start, but it engaged the GPU as seen in Task Manager. A warning message was shown about PTX compiling being done by the driver (I cannot find the original message in the training log), after which training started, but it was very slow. Also, the reduction of the "loss" value after each iteration seems smaller than I remembered, but I have no objective way to confirm this. In any case, I was able to train for 10000 iterations, which is good progress.

I think one possible solution is to compile TF2.2 or 2.3 with CUDA 11.1 from source, but I don't know how to do that on Windows. I found an article on how to do it for Linux (https://towardsdatascience.com/how-to-compile-tensorflow-2-3-with-cuda-11-1-8cbecffcb8d3). Could you please advise on this matter?

If I find anything else, I'll post it here.

Thanks!

@cfernandezpa
Author

A little update: I noticed that I had "gputouse=0", so I changed it to 1; training now starts much faster and runs about 100X faster.

I'll keep you posted with any advances I make.

@cfernandezpa
Author

Hi,

I noticed you closed this issue, which is fair since I was able to train using DeepLabCutCore. However, I'm not sure about the validity of the training results, as I'm unable to evaluate them with either this version or the GUI in version 2.2b8; there is a KeyError after evaluation starts. Also, the available options in the Core version are limited, as you know.

So, my question is, should there be another issue open to tackle DLC compatibility with RTX 3000 series cards? I'm willing to help as far as my skills allow.

Thanks!

@MMathisLab
Member

MMathisLab commented Oct 20, 2020

it's a good point; i'll reopen until it's really resolved; for now, also people can hopefully find the TF2.x branch!

However, I'm not sure about the validity of the results of the training as I'm unable to evaluate it with either this version or using the GUI with version 2.2b8;

correct - the branch is only up to date with 2.1.8.1! :) so when we roll up to 2.2x for TF that would work again.

@MMathisLab MMathisLab reopened this Oct 20, 2020
@MMathisLab MMathisLab added backwards compatibility issues concerning prior to current versions tensorflow/training WORK IN PROGRESS! developers are currently working on this feature... stay tuned. labels Oct 20, 2020
@cfernandezpa
Author

Hi,

I have been testing some more and I have made some progress. I can confirm that the training works well with the following system settings:

Deeplabcutcore
CUDA 11.1
Cudnn 8.0.4.30
Drivers 456.71
Tensorflow tf-nightly-gpu 2.5.0.dev20201019

I had DeepLabCut and TF installed in a plain Python environment (not Anaconda) and I was able to train, evaluate, analyze and create a video. I encountered an issue where the video analysis was running very slowly, which makes me think that the GPU was not fully engaged in this part.

Hopefully the full version, including the GUI, will be available soon.

Thanks!

@MMathisLab
Member

MMathisLab commented Dec 5, 2020

Hi @cfernandezpa please also check out the blog post/ the branch is now working (and a colab notebook): http://www.mousemotorlab.org/deeplabcutblog/2020/11/23/rolling-up-to-tensorflow-2

In general, deeplabcutcore will be the package to use for TF2 support from now on; you can indeed have deeplabcut and deeplabcutcore in the same env, check out the colab on how to easily import deeplabcutcore in the workflow; it's all the same functionality as "normal" dlc (and then deeplabcut is the GUIs, etc).

https://github.com/DeepLabCut/DeepLabCut-core/blob/tf2.2alpha/Colab_TrainNetwork_VideoAnalysis_TF2.ipynb

I think then I can close this issue, since it can support 3090 training now (woo hoo)

@Gittinator

Is there a guide available on how to get this set up? I'm also trying to use deeplabcut with an RTX 3090. I've got CUDA 11.1 on Windows. I made a new conda environment with python 3.7, and installed deeplabcutcore and tf-nightly-gpu.

When I go to import deeplabcutcore, it says: No module named 'tensorflow.contrib'. This seems like it wants TF1?
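
For context: `tensorflow.contrib` was removed in TensorFlow 2.x, so this error usually means that code written for TF1 is being imported under TF2; the standalone `tf_slim` package replaces `tf.contrib.slim`. Below is a hedged sketch for checking which of these modules an environment can see, using only the standard library. `has_module` is a hypothetical helper, not part of any of these packages.

```python
import importlib.util


def has_module(name):
    """Return True if `name` can be located; a missing or broken parent counts as not found."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


# Looking up a dotted name imports its parent package, so this can take a
# while when TensorFlow is installed.
for name in ("tensorflow", "tensorflow.contrib", "tf_slim", "deeplabcutcore"):
    print(f"{name}: {'found' if has_module(name) else 'missing'}")
```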

@cfernandezpa
Author

Thank you very much for your message and for your work/support of DLC as well! That is Awesome news!

@cfernandezpa
Author

Hello,

I was not able to make it work under a conda environment, so I made a plain Python environment. The difference is that under conda, cuDNN is limited to 7.6 (I think), while in the other one you can update to 8. I believe that was the source of my problem. Try that and see if it solves your issue.

@DuanWei-fudan

@Gittinator
I got the same problem. Did you solve it?
If you did, could you please tell me how I can solve it?

@SabriQ

SabriQ commented Dec 25, 2020

@MMathisLab I ran into the same problem. My question is: do I really need to link Google Drive first, following the "create a training dataset" step in Colab?
My labeled dataset is quite large and the upload is quite slow.
I'm wondering why Colab is necessary to create and train the dataset if we already have a local GPU.
Thanks for your work; it helps a lot.

@DuanWei-fudan

When I used the GPU to train my network, the software stopped at "start training...". I saw that the GPU was working, but the usual "iteration: 10 loss: 0.2167 lr: 0.005" output never appeared. It shut down after a while.
When I changed from GPU to CPU, it worked normally. How can I solve this? I can't understand it.

@MMathisLab
Member

@SabriQ you can use the branch of dlc-core on your own machine, but see how to install it from the top of the colab notebook.

@MMathisLab MMathisLab removed the WORK IN PROGRESS! developers are currently working on this feature... stay tuned. label Jan 1, 2021
@MMathisLab MMathisLab assigned MMathisLab and unassigned AlexEMG Jan 6, 2021
@runninghsus

Hi @MMathisLab

I just want to comment on making this work.
First, for tensorflow to be built with the new NVIDIA RTX 3090, this reddit post contains a step-by-step tutorial:
https://www.reddit.com/r/tensorflow/comments/jsalkw/rtx_3090_and_tensorflow_for_windows_10_step_by/

Second, I used the easy-install for DLC-GPU for windows.

Finally, I installed
tf2: https://pypi.org/project/tf-nightly-gpu/2.4.0.dev20201019/
DeepLabCutCore: pip install deeplabcutcore

and basically just used import deeplabcutcore as deeplabcut for training steps.

Hope this helps!

@DuanWei-fudan

@runninghsus Hello, when I run pip install deeplabcutcore, I get some errors:
WARNING: Discarding https://files.pythonhosted.org/packages/26/04/8b381d5b166508cc258632b225adbafec49bbe69aa9a4fa1f1b461428313/matplotlib-3.0.3.tar.gz#sha256=e1d33589e32f482d0a7d1957bf473d43341115d40d33f578dad44432e47df7b7 (from https://pypi.org/simple/matplotlib/) (requires-python:>=3.5). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Collecting deeplabcutcore
Using cached deeplabcutcore-0.0b2-py3-none-any.whl (172 kB)
Using cached deeplabcutcore-0.0b1-py3-none-any.whl (171 kB)
ERROR: Cannot install deeplabcutcore==0.0b1, deeplabcutcore==0.0b2 and deeplabcutcore==0.0b3 because these package versions have conflicting dependencies.

The conflict is caused by:
deeplabcutcore 0.0b3 depends on matplotlib==3.0.3
deeplabcutcore 0.0b2 depends on matplotlib==3.0.3
deeplabcutcore 0.0b1 depends on matplotlib==3.0.3

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

@runninghsus

runninghsus commented Feb 28, 2021

@DuanWei-fudan
Your issue does not seem familiar to me. Could it be that your TF version is wrong? You can check by running the following in IPython:

import tensorflow as tf
tf.__version__

Regardless, you will have to use the alpha version; it's a different version than the one on PyPI.
I also realized they merged the nightly build into 2.4.1.
So the steps that worked for me last night, assuming you have a 3090 and the proper driver installed following the reddit link I posted, were as follows.

After doing the easy-install and activating DLC-GPU:

pip install tensorflow==2.4.1
pip install git+https://github.com/DeepLabCut/DeepLabCut-core.git@tf2.2alpha
pip install tf_slim

Then make sure that when you run commands you use
import deeplabcutcore as deeplabcut
with a potentially necessary TF import:
import tensorflow as tf

All the steps that do not use the GUI should work fine with deeplabcutcore (deeplabcut.create_training_dataset(), deeplabcut.train_network(), deeplabcut.analyze_videos(), deeplabcut.create_labeled_video()). That is, the regular deeplabcut (tf==1) can be used for GUI steps such as labeling images, while training, analysis, and labeled videos can be done with deeplabcutcore (tf==2).
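
The split described above can be sketched as one helper that runs the non-GUI steps in order. This is an illustration under assumptions, not the maintainers' workflow: `run_headless_steps` and `DryRun` are hypothetical names, `evaluate_network` is added because evaluation is mentioned earlier in this thread, and each function is assumed to take the project config path as its first argument, as in regular DeepLabCut.

```python
def run_headless_steps(dlc, config_path, videos):
    """Call the non-GUI steps in order on whichever module is passed in as `dlc`."""
    dlc.create_training_dataset(config_path)
    dlc.train_network(config_path)
    dlc.evaluate_network(config_path)
    dlc.analyze_videos(config_path, videos)
    dlc.create_labeled_video(config_path, videos)


class DryRun:
    """Stand-in for the deeplabcutcore module that only records which calls are made."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        def record(*args, **kwargs):
            self.calls.append(name)
        return record


# Dry run with placeholder paths; nothing is trained or analyzed here.
dry = DryRun()
run_headless_steps(dry, "config.yaml", ["video.mp4"])
print(dry.calls)
```

With deeplabcutcore installed, the same helper would be called with the real module after `import deeplabcutcore as deeplabcut`, passing an actual config path and video list.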

if you need help, I may write a blog post specifically on this. I'll keep this post updated with the blog link

@DuanWei-fudan

@runninghsus
Well, thank you very much. I can train my network with the CPU normally!
So how can I use my GPU?

@DuanWei-fudan

OMG!
I did it!
I can use my GPU right now!

@xtzhou25

xtzhou25 commented Mar 4, 2021

@DuanWei-fudan
Hi, I got the same error when I ran pip install deeplabcutcore, so I tried pip install git+https://github.com/DeepLabCut/DeepLabCut-core.git@tf2.2alpha, but it showed a similar error about matplotlib==3.0.3. Then I tried pip install matplotlib==3.0.3, which was even more confusing because the same error occurred.
So I wonder what you did to fix this problem. Thanks!

@MMathisLab
Member

@xtzhou25 be sure you use python 3.7, make a new conda env and pip install deeplabcutcore + pip install tf_slim as this was updated yesterday as well.

@xtzhou25

Thanks! That works!
Python=3.7; tensorflow=2.4.1; tf-slim=1.1.0; wxpython=4.0.4 (and always set English as the system language on Windows). Now I can train the network by importing deeplabcutcore and do the rest of the work with python -m deeplabcut.

@MaloM-CVision

MaloM-CVision commented Mar 19, 2021

Hi, I can't find a way to get a proper configuration. With tensorflow=2.4.1, Python=3.7.10, and deeplabcutcore, I still have a conflict with the numpy version (tensorflow needs 1.19 and DLC needs 1.16). Can someone share their environment.yml so that I can copy a working environment?
Thanks in advance

@MMathisLab
Member

you need to run tensorflow==2.4 to install it. but I would recommend 2.2

@xtzhou25

@MaloM-CVision
Hi, I double checked my env:
python==3.7.10 tensorflow==2.4.1 numpy==1.16.4
I successfully installed Deeplabcut==2.1.10.2 and Deeplabcutcore in this env.
But I also ran into some compatibility problems when using them. Most of the errors were about the tensorflow version. I just googled the errors one by one and modified the documents. It didn't take too long. I still get some warnings occasionally, but it works on my 3090 now.
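
When comparing environments like this, it can help to print the installed versions programmatically rather than from memory. Below is a minimal sketch using the standard library's `importlib.metadata` (available from Python 3.8; on the Python 3.7 used in this thread the `importlib_metadata` backport would be needed, which is an assumption about your setup). `installed_version` is a hypothetical helper.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for package in ("tensorflow", "numpy", "deeplabcut", "deeplabcutcore"):
    version = installed_version(package)
    print(f"{package}: {version if version is not None else 'not installed'}")
```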

@MMathisLab
Member

@xtzhou25 indeed best not to have deeplabcut and deeplabcut core in the same environment! Just core with TF2. If you need guis, then just use the dlc-cpu conda file in a separate environment. You can open the project in both! :)

@MaloM-CVision

thanks a lot @MMathisLab , i finally get my env running :)

@NejcKejzar

NejcKejzar commented Mar 22, 2021

Hi @MMathisLab and others! I've been poring over this thread and still can't get my RTX 3070 GPU to work with deeplabcutcore.

Summary:
Specs: Ubuntu 20.04 LTS + NVIDIA GeForce RTX 3070

  1. First, I made sure to remove all NVIDIA and CUDA files from the PC (there's been a lot of tinkering to try to make things work):
> sudo apt-get --purge remove "*cublas*" "cuda*" "nsight*"
> sudo apt-get --purge remove "*nvidia*"
> sudo apt-get autoremove
> sudo apt-get autoclean
  2. Next, I installed the latest NVIDIA drivers: sudo apt-get install nvidia-driver-460 (exact version as verified with nvidia-smi: 460.32.03)

  3. Next, I installed the latest CUDA 11.2 with the local .deb file, by following instructions here. After installation was successful, I added the following two lines to my ~/.bashrc file:

export PATH=/usr/local/cuda-11.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

I verified the installation with nvcc -V (returned Cuda compilation tools, release 11.2, V11.2.152 Build cuda_11.2.r11.2/compiler.29618528_0) as well as by compiling the examples. Running deviceQuery recognized the correct GPU and returned a test=PASS status.

  4. Next, I installed the latest cuDNN 8.1.1 (compatible with CUDA 11.2) by following the installing-from-.tar instructions here. I noticed that the instructions suggest copying the cuDNN files to /usr/local/cuda; I changed this to /usr/local/cuda-11.2.

  5. After all this, I rebooted the PC and created a new conda environment with Python 3.7 as suggested here:

> conda create -n test4 python=3.7
> conda activate test4
> pip install tensorflow==2.4
> pip install deeplabcutcore
> pip install tf_slim
  6. I verified that TF recognizes the GPU:
> import tensorflow as tf
> tf.__version__
'2.4.0'
> tf.config.experimental.list_physical_devices()

This last call recognized the GPU, but did not find a certain library, 'libcusolver.so.10':
[image: error1]

This has been a known issue here. By hard-linking the missing libcusolver.so.10 to the installed libcusolver.so.11, this issue was solved and the GPU was successfully detected:

> cd $LD_LIBRARY_PATH
> sudo ln libcusolver.so.11 libcusolver.so.10  # hard link

[image: test1]
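
Before creating such a link, it can be useful to confirm which library file names are actually missing from the directories on `LD_LIBRARY_PATH`. This is a sketch only: `find_missing_libraries` is a hypothetical helper, and the two file names are the ones from this comment.

```python
import os


def find_missing_libraries(names, search_dirs):
    """Return the file names that are absent from every directory in search_dirs."""
    return [
        name
        for name in names
        if not any(os.path.exists(os.path.join(d, name)) for d in search_dirs)
    ]


# LD_LIBRARY_PATH is a list of directories joined by os.pathsep; skip empty entries.
dirs = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep) if d]
print(find_missing_libraries(["libcusolver.so.10", "libcusolver.so.11"], dirs))
```

On the setup described here, the expectation before linking would be that only libcusolver.so.10 is reported as missing.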

  7. Lastly, I got to my code, which I run from a Jupyter notebook. I am batch analyzing videos by calling:
import deeplabcutcore as dlc
dlc.analyze_videos(NN_config_path,
                   videos=[video_dir],
                   videotype='mp4',
                   shuffle=shuffle,
                   gputouse=0,
                   dynamic=(True, 0.5, 10),
                   destfolder=analysis_dir)

[image: img1]

Running this engages the GPU as seen from nvidia-smi:
[image: img6]

But it returns quite a massive error log:
[image: img3]

Looking at the terminal output reveals a more compact error:
[image: ERROR2]

This is where I don't know how to proceed. I double-checked my $PATH and $LD_LIBRARY_PATH variables to see they point to the right directories:

> echo $PATH
/usr/local/cuda-11.2/bin:/home/nejc_pc2/anaconda3/envs/test4/bin:/home/nejc_pc2/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

> echo $LD_LIBRARY_PATH
/usr/local/cuda-11.2/lib64

Checking the $LD_LIBRARY_PATH directory shows that the latest cublas libraries are there:
[image: cudnn_paths]

So, I am not sure why cublas would be giving this error. Have I missed anything in the above steps? I apologize for the long post, but hopefully, this will also help other Ubuntu users getting DLC to work with 3000-series GPUs.

UPDATE [SOLVED]:
Finally solved this issue! Allowing memory growth in the same Jupyter notebook cell from which dlc.analyze_videos is run did the trick:

import deeplabcutcore as dlc
import tensorflow as tf
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
dlc.analyze_videos(NN_config_path,
                   videos=[video_dir],
                   videotype='mp4',
                   shuffle=shuffle,
                   gputouse=0,
                   dynamic=(True, 0.5, 10),
                   destfolder=analysis_dir)

Hope this helps somebody else as well!
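
An alternative sketch that aims for the same effect without calling the TensorFlow API: TensorFlow honors the `TF_FORCE_GPU_ALLOW_GROWTH` environment variable, provided it is set before TensorFlow initializes the GPU. Whether it resolves this particular error in every setup is an assumption; the thread only confirms the `set_memory_growth` call above.

```python
import os

# Set before importing tensorflow (or deeplabcutcore, which imports it).
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

print("TF_FORCE_GPU_ALLOW_GROWTH =", os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```

In a notebook this belongs in the very first cell, since changing the variable after TensorFlow has already created its GPU device has no effect.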

@NejcKejzar

NejcKejzar commented Apr 12, 2021

UPDATE: I've managed to start analyzing videos by relying on conda's installation of tensorflow-gpu, which also automatically installs compatible cuda-toolkit and cudnn versions within the anaconda environment. Specifically:

> conda create -n test6 python=3.7
> conda activate test6
> conda install tensorflow-gpu
> pip install deeplabcutcore
> pip install tf_slim

In this installation the GPU is again recognized as checked with:

> import tensorflow as tf
> tf.__version__
'2.4.1'
> tf.config.experimental.list_physical_devices()
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

When I run dlc.analyze_videos as above, the above error does not arise. DLC seems frozen for some time (several minutes), first on Initializing ResNet and then on Starting to extract posture.... After a while, the videos start getting analyzed and all seems good: nvidia-smi shows the GPU is engaged (the majority of memory is allocated to the conda environment from which I am running the code, and the power spikes to >100W/220W; idle power is around 15W/220W), and the analysis progresses rapidly (>130 it/s). However, when I open the generated .h5 files, they seem completely off:
[image: test7]

This is what the result should actually look like:
[image: truth]

I also notice a strange output includingmetadata.pickle which is not there in regular DLC2.1.9, but I guess this could be a feature of deeplabcutcore:
[image: files]

Lastly, I checked the installed cudatoolkit and cudnn in the conda environment:

> conda list cudnn
# packages in environment at /home/nejc_pc2/anaconda3/envs/test12:
#
# Name                    Version                   Build  Channel
cudnn                     7.6.5                cuda10.1_0 

Now, this I found very strange, because TF2.4.1 should only work with latest cuda toolkit and cudnn. Just to double check, I also looked at the tensorflow installation in the conda environment:

> conda list tensorflow
# packages in environment at /home/nejc_pc2/anaconda3/envs/test12:
#
# Name                    Version                   Build  Channel
tensorflow                2.4.1           gpu_py37ha2e99fa_0  
tensorflow-base           2.4.1           gpu_py37h29c2da4_0  
tensorflow-estimator      2.4.1              pyheb71bc4_0  
tensorflow-gpu            2.4.1                h30adc30_0 

So it seems to me that anaconda automatically installs the incorrect version of cuda toolkit and cudnn. I guess that is why people suggest installing cudnn and cuda toolkit manually and tensorflow with pip.
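
One way to make this kind of mismatch visible is to read the CUDA version out of the `cudnn` build string and compare it with what the card needs. The sketch below is illustrative only: the helper names and the small lookup table are assumptions for this example, not part of conda or TensorFlow. The table encodes that CUDA 11.1 was the first toolkit with native support for compute capability 8.6 (RTX 30xx). Older toolkits may still run through PTX JIT compilation, which can make the first run very slow.

```python
import re

# First CUDA toolkit version with native support for each compute capability.
# Illustrative subset; an RTX 3090 reports compute capability 8.6.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (RTX 30xx)
}

# Output of `conda list cudnn` as posted above.
SAMPLE_CONDA_LIST = """\
# packages in environment at /home/nejc_pc2/anaconda3/envs/test12:
#
# Name                    Version                   Build  Channel
cudnn                     7.6.5                cuda10.1_0
"""


def cuda_version_from_conda_list(conda_list_output):
    """Return (major, minor) parsed from a build string like 'cuda10.1_0', or None."""
    for line in conda_list_output.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.split()
        if len(fields) < 3:
            continue  # no build column on this line
        match = re.search(r"cuda(\d+)\.(\d+)", fields[2])
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def toolkit_supports(capability, cuda_version):
    """True if the toolkit version natively targets the given compute capability."""
    return cuda_version >= MIN_CUDA_FOR_CAPABILITY[capability]


if __name__ == "__main__":
    version = cuda_version_from_conda_list(SAMPLE_CONDA_LIST)
    print(version)                            # (10, 1)
    print(toolkit_supports((8, 6), version))  # False
```

For the environment shown above, the check reports that the bundled CUDA 10.1 build predates native RTX 30xx support.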

@rlinus

rlinus commented Apr 16, 2021

Hi

I found a way to run the TensorFlow 1.x-based version of DeepLabCut on my RTX 3090: a Docker container built on NVIDIA's TensorFlow Docker image, which comes with TensorFlow 1.15 compiled against CUDA 11. You can test it with the following Dockerfile that I made:

##########################################
# Dockerfile for DeepLabCut GPU training #
##########################################

#We use Tensorflow v1.15.5 for deeplabcut
FROM nvcr.io/nvidia/tensorflow:21.03-tf1-py3

# install needed tools
RUN apt-get update && apt-get install -y wget ffmpeg

# install deeplabcut
RUN python3 -m pip install deeplabcut --no-cache-dir

# download git repo
RUN git clone https://github.com/AlexEMG/DeepLabCut /root/DeepLabCut/

WORKDIR /root

#Instructions:
#cd in folder with this file and then build the docker image with
#   docker build -t dlc .
#then run the image and open interactive shell with:
#   docker run --gpus all -it --rm dlc /bin/bash
#run testscript
#   python3 /root/DeepLabCut/examples/testscript.py

@ARHassett

It appears here that you are running multi-animal DLC using deeplabcutcore, is this correct? I'm having issues running any maDLC related functions at the moment with it.

@mschart

mschart commented May 19, 2021

rlinus' solution worked well for me on Ubuntu 20.04. See here for some more details.

@MMathisLab
Member

Thanks @mschart! BTW, we have new DeepLabCut dockers here: https://github.com/stes/deeplabcut-docker so perhaps those will be most useful for the IBL workflow too.

@rlinus

rlinus commented May 20, 2021

Thanks @mschart! BTW, we have new DeepLabCut dockers here: https://github.com/stes/deeplabcut-docker so perhaps those will be most useful for the IBL workflow too.

Those Docker images are based on the official Google TensorFlow 1.15 builds, which do not work with RTX 30xx GPUs (because of incompatible CUDA versions). The Dockerfile that I posted works with RTX 30xx GPUs.

@patakihara

Is there a guide available on how to get this set up? I'm also trying to use deeplabcut with an RTX 3090. I've got CUDA 11.1 on Windows. I made a new conda environment, with python 3.7, and installed deeplabcutcore and tf-nightly-gpu.

When I go to import deeplabcutcore, it says: No module named 'tensorflow.contrib'. This seems like it wants TF1?

Don't know if you managed to set this up or not, but after I figured it out I wrote up a short guide on how to run DeepLabCut on an RTX 3090. Here's the link: https://hackmd.io/@guilhermepata/r1U__n89O
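
For context on the quoted error: `tensorflow.contrib` was removed in TensorFlow 2.0, so code that imports it needs a 1.x build. Below is a small sketch, assuming Python 3.8+ (for `importlib.metadata`), that reports the installed TensorFlow version without importing TensorFlow itself. The helper names and the list of distribution names tried are assumptions for illustration.

```python
from importlib import metadata  # standard library from Python 3.8

# Distribution names under which TensorFlow is commonly installed (illustrative list).
CANDIDATE_DISTRIBUTIONS = ("tensorflow", "tensorflow-gpu", "tf-nightly", "tf-nightly-gpu")


def installed_tensorflow_version():
    """Return the version string of the first TensorFlow distribution found, or None."""
    for name in CANDIDATE_DISTRIBUTIONS:
        try:
            return metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
    return None


def has_tf_contrib(version):
    """tf.contrib only exists in TensorFlow 1.x, so check the major version."""
    return int(version.split(".")[0]) < 2


if __name__ == "__main__":
    version = installed_tensorflow_version()
    if version is None:
        print("No TensorFlow distribution found")
    else:
        print(version, "has tf.contrib:", has_tf_contrib(version))
```

With a nightly 2.x build installed, `has_tf_contrib` returns `False`, which is consistent with the import error above.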

@ARHassett

ARHassett commented Jun 4, 2021 via email

@Dashbrook

Is there a guide available on how to get this set up? I'm also trying to use deeplabcut with an RTX 3090. I've got CUDA 11.1 on Windows. I made a new conda environment, with python 3.7, and installed deeplabcutcore and tf-nightly-gpu.
When I go to import deeplabcutcore, it says: No module named 'tensorflow.contrib'. This seems like it wants TF1?

Don't know if you managed to set this up or not, but after I figured it out I wrote up a short guide on how to run DeepLabCut on an RTX 3090. Here's the link: https://hackmd.io/@guilhermepata/r1U__n89O

I just want to say that Solution 2 here worked beautifully for me! Thank you!

@patakihara

This is really nice. Any luck getting multi-animal DLC working on it?
-- Amy Hassett

I haven't tested multi-animal 😕

@MMathisLab
Member

maDLC is not yet supported in DLC core; it will be supported soon though, in this repo.

@ARHassett

ARHassett commented Jun 8, 2021 via email

@AlexEMG
Member

AlexEMG commented Jan 4, 2022

Just fyi - It's now supported as TF 2.* has been integrated in the main repo
