Metadata issue with the CUDA repositories [CDN] #7

justinokamoto · 2022-07-12T04:01:10Z

Reporting metadata issues with the CUDA repositories

Getting "File has unexpected size" errors when running apt update. This seems to be a known issue with NVIDIA's CDN, as NVIDIA mentions here.

docker run -it nvidia/cuda:11.4.0-cudnn8-runtime-ubuntu18.04 bash
root@c1c86c02a768:/# apt update
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease [1581 B]
Get:2 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic InRelease [242 kB]
Get:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages [814 kB]
Err:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  Packages
  File has unexpected size (815853 != 814314). Mirror sync in progress? [IP: 152.195.19.142 443]
  Hashes of expected file:
   - Filesize:814314 [weak]
   - SHA256:257071ac3a46f8e8ba340c2bd6b88466ff26e4cb0c4b60afacfb267b251dc2d9
   - SHA1:4b2ecd5529c611f17784b07ed4cb2b13d5d4bd25 [weak]
   - MD5Sum:0355ef69bc6b6afaf8493d82295c3633 [weak]
  Release file created at: Mon, 11 Jul 2022 19:02:21 +0000

... (omitting successful fetches from other package index endpoints)

Fetched 25.6 MB in 3s (7382 kB/s)
Reading package lists... Done
E: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/Packages.gz  File has unexpected size (815853 != 814314). Mirror sync in progress? [IP: 152.195.19.142 443]
   Hashes of expected file:
    - Filesize:814314 [weak]
    - SHA256:257071ac3a46f8e8ba340c2bd6b88466ff26e4cb0c4b60afacfb267b251dc2d9
    - SHA1:4b2ecd5529c611f17784b07ed4cb2b13d5d4bd25 [weak]
    - MD5Sum:0355ef69bc6b6afaf8493d82295c3633 [weak]
   Release file created at: Mon, 11 Jul 2022 19:02:21 +0000
E: Some index files failed to download. They have been ignored, or old ones used instead.
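apt takes the expected size and hashes of every index file from the Release/InRelease file it fetches first, and rejects a download that does not match. Here the InRelease was created at 19:02:21 while the Packages.gz being served was 815853 bytes, which suggests the two files came from different repository postings. A minimal sketch of the size check, assuming the standard Debian Release layout (a SHA256: field followed by indented "<digest> <size> <path>" lines); the function names are hypothetical and this is not apt's own code.

```python
from typing import Optional, Tuple


def expected_entry(release_text: str, path: str,
                   field: str = "SHA256") -> Optional[Tuple[str, int]]:
    """Return (hex digest, size) listed for `path` in a Release file's checksum field."""
    in_field = False
    for line in release_text.splitlines():
        if not line.startswith(" "):
            # An unindented line starts a new field, e.g. "SHA256:" or "Date: ...".
            in_field = line.strip() == f"{field}:"
            continue
        if in_field:
            # Continuation lines look like " <digest> <size> <path>".
            digest, size, name = line.split()
            if name == path:
                return digest, int(size)
    return None


def check_size(release_text: str, path: str, received_size: int) -> str:
    """Compare a downloaded file's size with the size the Release file promises."""
    entry = expected_entry(release_text, path)
    if entry is None:
        raise KeyError(f"{path} is not listed in the Release file")
    _, expected_size = entry
    if received_size != expected_size:
        # Same order as in the log above: (received != expected).
        return f"File has unexpected size ({received_size} != {expected_size})"
    return "ok"


# Usage with the values from the log above (Release text abbreviated).
release = (
    "Date: Mon, 11 Jul 2022 19:02:21 +0000\n"
    "SHA256:\n"
    " 257071ac3a46f8e8ba340c2bd6b88466ff26e4cb0c4b60afacfb267b251dc2d9 814314 Packages.gz\n"
)
print(check_size(release, "Packages.gz", 815853))
```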

Please provide the following information in your comment:

When was the Release (Debian) or repomd.xml (RPM) file last modified?

root@c1c86c02a768:/# curl -I https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/Release
HTTP/2 200
accept-ranges: bytes
age: 15211
cache-control: max-age=604800
content-type: application/octet-stream
date: Tue, 12 Jul 2022 03:58:42 GMT
etag: "64257053"
expires: Tue, 19 Jul 2022 03:58:42 GMT
last-modified: Mon, 11 Jul 2022 23:01:57 GMT
server: ECAcc (sed/E12B)
x-cache: HIT
x-vdms-version: 3.0
content-length: 696
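The age and cache-control headers hint at how long the edge node has been holding its copy of the Release file. A minimal sketch that derives this from the curl -I output above, assuming the simplified HTTP caching rule (a cached copy is served without revalidation while age < max-age) and ignoring other directives; the helper names are hypothetical. It only describes the Release object: InRelease and Packages.gz are separate URLs and may have been cached at different times.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime
from typing import Dict, Optional


def parse_headers(raw: str) -> Dict[str, str]:
    """Turn 'name: value' lines from curl -I into a lowercase-keyed dict."""
    headers = {}
    for line in raw.strip().splitlines():
        if line.startswith("HTTP/") or ":" not in line:
            continue  # skip the status line
        name, value = line.split(":", 1)
        headers[name.strip().lower()] = value.strip()
    return headers


def max_age(cache_control: str) -> Optional[int]:
    """Extract the max-age directive (in seconds) from a Cache-Control value."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age":
            return int(value)
    return None


def edge_cache_summary(headers: Dict[str, str]) -> Dict[str, object]:
    age = int(headers.get("age", "0"))
    ttl = max_age(headers.get("cache-control", ""))
    served_at = parsedate_to_datetime(headers["date"])
    return {
        # Approximate time the edge stored its copy: response Date minus Age.
        "cached_at": served_at - timedelta(seconds=age),
        "origin_last_modified": parsedate_to_datetime(headers["last-modified"]),
        # Seconds the copy may still be served without revalidation (simplified).
        "remaining_ttl": None if ttl is None else ttl - age,
    }


# Abbreviated copy of the headers above.
raw = """HTTP/2 200
age: 15211
cache-control: max-age=604800
date: Tue, 12 Jul 2022 03:58:42 GMT
last-modified: Mon, 11 Jul 2022 23:01:57 GMT
x-cache: HIT
"""
summary = edge_cache_summary(parse_headers(raw))
for key, value in summary.items():
    print(f"{key}: {value}")
```

For these headers the edge copy dates from about 23:45:11 UTC on 11 Jul, after the 23:01:57 last-modified time, and with a one-week max-age it could keep being served for roughly 589,589 more seconds (about 6.8 days).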

The Linux distro and architecture. If cross-compiling or containerized, please mention that.
This is occurring within the latest Docker image nvidia/cuda:11.4.0-cudnn8-runtime-ubuntu18.04.

  root@c1c86c02a768:/# cat /etc/os-release
  NAME="Ubuntu"
  VERSION="18.04.6 LTS (Bionic Beaver)"
  ID=ubuntu
  ID_LIKE=debian
  PRETTY_NAME="Ubuntu 18.04.6 LTS"
  VERSION_ID="18.04"
  HOME_URL="https://www.ubuntu.com/"
  SUPPORT_URL="https://help.ubuntu.com/"
  BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
  PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
  VERSION_CODENAME=bionic
  UBUNTU_CODENAME=bionic
  root@c1c86c02a768:/# uname -a
  Linux c1c86c02a768 4.15.0-187-generic #198-Ubuntu SMP Tue Jun 14 03:23:51 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Which NVIDIA repositories do you have enabled?
Do your .list / .repo files contain URLs using HTTP (port 80) or HTTPS (port 443)?

root@c1c86c02a768:/etc/apt# cat sources.list.d/cuda.list
deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /

Which geographic region is the machine located in?
Seattle area
Which CDN edge node are you hitting?
Not sure :/
Any other relevant environmental conditions (e.g. a specific Docker container image)?
nvidia/cuda:11.4.0-cudnn8-runtime-ubuntu18.04


kmittman · 2022-07-12T05:05:51Z

Hi @justinokamoto and @Angel-Popa
I am investigating this now.

kmittman · 2022-07-12T08:11:21Z

I rolled back the repository metadata from the last posting; the signatures (i.e. Release.gpg, InRelease) had failed to upload. I have checked, and apt-get update and package installation now work.

I have scheduled a re-posting job to run in a few hours and will verify that the intermittent issue is resolved. I'll close the issue once that is verified for all affected repos.

Please let me know if you continue to see any errors, thank you!

kmittman · 2022-07-12T22:43:14Z

The metadata for each repo now passes repo-validate.sh and manual testing; closing.

mrgzg1 · 2022-07-15T23:32:51Z

The new Release file seems to be causing errors for us:

E: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/Packages.gz  Hash Sum mismatch
   Hashes of expected file:
    - Filesize:815853 [weak]
    - SHA256:66e891d82894f08ecedac13c4e058ce5ea21882032c66a5b4d981cc2552f94a1
    - SHA1:7cc76dbdc9d32d50cf5a1d377c8c877bcef23dcd [weak]
    - MD5Sum:dc953e306c3b53f23ccdbe433881a800 [weak]
   Hashes of received file:
    - SHA256:989834105adeb2987d44b3da1cf0bd621e4a86a260f5790dd2f865105d231efe
    - SHA1:da0b54ecbb7f3dd76602bef508e9a8df901d1cd2 [weak]
    - MD5Sum:aa04b5a6fb8ceb8c4c0676ce2c144903 [weak]
    - Filesize:815853 [weak]
   Last modification reported: Mon, 11 Jul 2022 23:01:57 +0000
   Release file created at: Tue, 12 Jul 2022 15:01:44 +0000
E: Some index files failed to download. They have been ignored, or old ones used instead.
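In this second failure the received Packages.gz had exactly the expected size (815853 bytes) but a different SHA256, so the size check, which apt labels [weak], passed while the hash check failed. A minimal sketch of the two checks with hashlib, using made-up byte strings in place of real index files; it illustrates why size alone is a weak check and is not apt's implementation.

```python
import hashlib
from typing import List


def verify(data: bytes, expected_size: int, expected_sha256: str) -> List[str]:
    """Return the checks a download fails: a size comparison, then a SHA256 comparison."""
    problems = []
    if len(data) != expected_size:
        problems.append(f"File has unexpected size ({len(data)} != {expected_size})")
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        problems.append("Hash Sum mismatch")
    return problems


# Hypothetical stand-ins for two index files of identical length but different content.
published = b"Package: libcudnn8\nVersion: B\n"
served = b"Package: libcudnn8\nVersion: A\n"

expected = hashlib.sha256(published).hexdigest()
print(verify(served, len(published), expected))     # size matches, hash does not
print(verify(published, len(published), expected))  # both checks pass
```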

Did something go amiss in fixing this issue?
I see a few other mentions of things going wrong elsewhere too.

kmittman · 2022-07-18T15:54:46Z

Re-opening; there are several reports in the NVIDIA Developer forums.

wonmean-roche · 2022-07-19T17:45:20Z

Did this get fixed? sudo apt update runs without errors now on our end.

xmalina-aibuild · 2022-09-13T12:10:40Z

Any update on this? Still getting the error.

Nrohlable · 2022-09-13T13:21:40Z

Any update on this error? I'm facing similar issues while running the commands below:

RUN apt-cache policy libcudnn8
RUN apt-get install libcudnn8=8.3.2.44-1+cuda11.5

error response:

Fetched 423 MB in 5s (82.5 MB/s)
Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/./libcudnn8_8.3.2.44-1+cuda11.5_amd64.deb Hash Sum mismatch
Hashes of expected file:
- SHA512:10cf6e68aa4f65e23fa75b481fdc1ba3d45181f1d3d7004b4731ae3d1532cfbaf141dadc5342a8f4eb15fae047b2af956fb88bd2fd50c9a66991ffffcf6c34e5
- SHA256:a1f5eeab52bddb36e94fc933acda2170df4bdf6565edafce47cd03d7248720d9
- SHA1:67aa82cc11b9974533adee3f923aec85fd984640 [weak]
- MD5Sum:8f3bd1d122899edbc0a90d485c8f22d0 [weak]
- Filesize:422575544 [weak]
Hashes of received file:
- SHA512:998af4d5e69aa9de5ff040c6e97eafcaa6adac55ffef0d04114a53401d10dad6e6ddc34eb2787c4bf7b3c7ae5294573e91ecf522dd42ecc4eb7a336e555dd619
- SHA256:6c8830a5f58fc64b583f0b4af52d5b2acdbc066698deecc05195164028dcc7ad
- SHA1:5ffcba4f0e7d47c91acd0632a3d1ed1dfc7b42bb [weak]
- MD5Sum:25e31500c9014e4675cba4659d8542a5 [weak]
- Filesize:422575544 [weak]
Last modification reported: Thu, 06 Jan 2022 06:44:48 +0000

xmalina · 2022-09-13T13:30:39Z

Getting the same thing. Have tried everything and no luck.

kmittman · 2022-09-15T06:50:35Z

Sorry, I missed the email notification, @xmalina-aibuild / @xmalina, @Nrohlable.
I'm not able to reproduce the mismatch.

Based on the timestamp and the RUN commands, I'm guessing your Dockerfile is built FROM an image that has not been updated in some time?

$ podman run -it ubuntu:20.04 /bin/bash -c "apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y wget sudo ca-certificates gnupg; bash"
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
$ sudo dpkg -i cuda-keyring_1.0-1_all.deb
$ sudo apt-get update

$ apt-cache policy libcudnn8
libcudnn8:
  Installed: (none)
  Candidate: 8.5.0.96-1+cuda11.7
  Version table:
     8.5.0.96-1+cuda11.7 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     8.4.1.50-1+cuda11.6 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     8.4.0.27-1+cuda11.6 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages
     8.3.3.40-1+cuda11.5 600
        600 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages

$ sudo apt-get install --verbose-versions libcudnn8=8.3.2.44-1+cuda11.5
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
   libcudnn8 (8.3.2.44-1+cuda11.5)
0 upgraded, 1 newly installed, 0 to remove and 9 not upgraded.
Need to get 423 MB of archives.
After this operation, 1270 MB of additional disk space will be used.
Get:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64
  libcudnn8 8.3.2.44-1+cuda11.5 [423 MB]
Fetched 423 MB in 5s (91.9 MB/s)    
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package libcudnn8.
(Reading database ... 4885 files and directories currently installed.)
Preparing to unpack .../libcudnn8_8.3.2.44-1+cuda11.5_amd64.deb ...
Unpacking libcudnn8 (8.3.2.44-1+cuda11.5) ...
Setting up libcudnn8 (8.3.2.44-1+cuda11.5) ...
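Before pinning a version with apt-get install <package>=<version>, it can help to confirm which versions the configured repositories offer. A minimal sketch that pulls the version strings out of apt-cache policy output, assuming the layout shown above (version lines indented under "Version table:", each followed by indented source lines); the parser and its names are hypothetical, not part of apt.

```python
import re
from typing import List

# A version line looks like "     8.5.0.96-1+cuda11.7 600" (prefixed by "***" for the
# installed version); source lines such as "600 https://... Packages" do not end in
# a number, so they do not match.
VERSION_LINE = re.compile(r"^\s+(?:\*\*\*\s+)?(\S+)\s+-?\d+\s*$")


def policy_versions(output: str) -> List[str]:
    """List the versions in the 'Version table:' section of apt-cache policy output."""
    versions = []
    in_table = False
    for line in output.splitlines():
        if line.strip() == "Version table:":
            in_table = True
            continue
        if in_table:
            match = VERSION_LINE.match(line)
            if match:
                versions.append(match.group(1))
    return versions


repo = "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Packages"
policy = f"""libcudnn8:
  Installed: (none)
  Candidate: 8.5.0.96-1+cuda11.7
  Version table:
     8.5.0.96-1+cuda11.7 600
        600 {repo}
     8.4.1.50-1+cuda11.6 600
        600 {repo}
     8.4.0.27-1+cuda11.6 600
        600 {repo}
     8.3.3.40-1+cuda11.5 600
        600 {repo}
"""
available = policy_versions(policy)
print(available)
print("8.4.1.50-1+cuda11.6" in available)
```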

Nrohlable · 2022-09-19T06:36:27Z

Hey @kmittman, sorry for the delayed response.

I'm using the Docker Hub image tag tensorflow/tensorflow:2.8.0-gpu to build an image for training a computer vision application. I was able to update the CUDA version from 11.2 to 11.7, as you can see here:

(Reading database ... 19687 files and directories currently installed.)
Preparing to unpack .../libcudnn8_8.5.0.96-1+cuda11.7_amd64.deb ...
Unpacking libcudnn8 (8.5.0.96-1+cuda11.7) over (8.1.0.77-1+cuda11.2) ...
Setting up libcudnn8 (8.5.0.96-1+cuda11.7) ...

But at training time, when I checked, TensorFlow was still running on CUDA 11.2, and I'm receiving this error:

Node: 'model/conv2d/Conv2D'
DNN library is not found.
[[{{node model/conv2d/Conv2D}}]] [Op:__inference_predict_function_841]

This seems to be a GPU issue, since training appears to run just fine on the CPU.

Note: the error pops up while using the MTCNN library for face detection, which I'm using only for inference.

kmittman · 2022-09-19T07:08:00Z

Hi @Nrohlable
The output you shared indicates that you installed cuDNN 8.5.0.96 compiled for CUDA 11.7.x, not that the CUDA 11.7 toolkit is actually installed.
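The libcudnn8 version string only records which CUDA release a cuDNN build targets; it says nothing about which CUDA toolkit is installed. A minimal sketch that splits the version strings from the upgrade log above, assuming the "<cudnn>-<revision>+cuda<major>.<minor>" layout seen in this thread holds generally; the type and function names are hypothetical.

```python
import re
from typing import NamedTuple

# Assumed layout, based on the versions in this thread: <cudnn>-<revision>+cuda<major>.<minor>
PATTERN = re.compile(r"(\d+(?:\.\d+)+)-(\d+)\+cuda(\d+\.\d+)")


class CudnnPackageVersion(NamedTuple):
    cudnn: str     # cuDNN release, e.g. "8.5.0.96"
    revision: str  # Debian package revision, e.g. "1"
    cuda: str      # CUDA release the cuDNN build targets, e.g. "11.7"


def parse_cudnn_version(version: str) -> CudnnPackageVersion:
    match = PATTERN.fullmatch(version)
    if match is None:
        raise ValueError(f"unrecognised libcudnn8 version string: {version!r}")
    return CudnnPackageVersion(*match.groups())


before = parse_cudnn_version("8.1.0.77-1+cuda11.2")
after = parse_cudnn_version("8.5.0.96-1+cuda11.7")
print(f"cuDNN {before.cudnn} (built for CUDA {before.cuda}) -> "
      f"cuDNN {after.cudnn} (built for CUDA {after.cuda})")
```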

Looking at the TensorFlow 2.8.0-gpu Docker image tag you mentioned, it was last updated 8 months ago, on February 2nd.

However, we performed a GPG key rotation to a new public key on ~April 28th. All of the packages in the NVIDIA repository, including the CUDA 11.2 packages, were re-signed using the new GPG key.

The apt package manager requires the public key to be enrolled in the environment. One method is to install the cuda-keyring package; another is to fetch the public key with apt-key (deprecated); a third is to wget the keyring file and mv it into place.

My suggestion would be one of the following:

(a.) perform a workaround to update the GPG key manually
(b.) use an updated Docker image tag (i.e. 2.8.1-gpu)
(c.) contact the maintainer of the Dockerfile to rebuild the 2.8.0-gpu tag, see: Tensorflow docker image has outdated keys tensorflow/tensorflow#56085
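Option (a) can follow the same cuda-keyring route as the ubuntu2004 session earlier in this thread. A minimal sketch that generates those commands for another repo directory, assuming the URL layout and keyring version (1.0-1) from that session carry over unchanged; keyring_steps is a hypothetical helper, and the commands omit sudo because Docker builds usually run as root.

```python
from typing import List

# URL layout copied from the ubuntu2004/x86_64 example above; assumed (not verified
# here) to hold for other repo directories such as ubuntu1804/x86_64.
REPO_BASE = "https://developer.download.nvidia.com/compute/cuda/repos"


def keyring_steps(distro: str, arch: str = "x86_64",
                  keyring_version: str = "1.0-1") -> List[str]:
    """Return the shell commands that install the cuda-keyring package for one repo."""
    filename = f"cuda-keyring_{keyring_version}_all.deb"
    url = f"{REPO_BASE}/{distro}/{arch}/{filename}"
    return [
        f"wget {url}",
        f"dpkg -i {filename}",
        "apt-get update",
    ]


# Example: the ubuntu1804 repo from the original report, joined for a Dockerfile RUN line.
print(" && ".join(keyring_steps("ubuntu1804")))
```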

Nrohlable · 2022-09-20T08:50:02Z

Hi @kmittman,

The 2.8.1-gpu TensorFlow image is working fine without any issues in my case.

Thanks for your help, really appreciate it

Nrohlable · 2022-09-20T09:47:53Z

Hey @kmittman,

I hope you can help me with this as well.
I'm training a Siamese network for face verification using TensorFlow. Below are the training times on CPU and GPU for comparison:

Siamese on CPU : 18/899 [..............................] - ETA: 4:41:43 - loss: 0.8343
Siamese on GPU: 18/899 [..............................] - ETA: 4:52:04 - loss: 0.9894

It seems both take almost exactly the same time, which shouldn't be the case.
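Converting the two ETAs to seconds makes the comparison concrete: 4:41:43 is 16,903 s and 4:52:04 is 17,524 s, so the GPU estimate is about 3.7% longer than the CPU one. A minimal sketch of that arithmetic, assuming the Keras progress-bar format shown above; eta_seconds is a hypothetical helper, and an ETA after 18 of 899 steps is only a rough extrapolation.

```python
import re


def eta_seconds(progress_line: str) -> int:
    """Convert the ETA field of a Keras progress line (h:mm:ss, m:ss or Ns) to seconds."""
    match = re.search(r"ETA: (\d+(?::\d+)*)", progress_line)
    if match is None:
        raise ValueError("no ETA field found")
    seconds = 0
    for part in match.group(1).split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


cpu = "18/899 [..............................] - ETA: 4:41:43 - loss: 0.8343"
gpu = "18/899 [..............................] - ETA: 4:52:04 - loss: 0.9894"
cpu_s, gpu_s = eta_seconds(cpu), eta_seconds(gpu)
print(f"CPU {cpu_s} s, GPU {gpu_s} s, GPU/CPU = {gpu_s / cpu_s:.3f}")
```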
Below I'm also attaching the CPU and GPU details:

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 3939414009020902295
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 14258995200
locality {
  bus_id: 1
  links {
  }
}
incarnation: 3474452779198683813
physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5"
xla_global_id: 416903419
]

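These are the commands I'm running at training time to make sure it is using the GPU:

import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TensorFlow can see (CPU and GPU).
print(device_lib.list_local_devices())
# List only the physical GPUs.
print(tf.config.list_physical_devices('GPU'))
# Log the device each op is placed on (takes effect for ops run after this call).
tf.debugging.set_log_device_placement(True)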

As discussed earlier, I'm using the TensorFlow 2.8.1-gpu image to train on the GPU.
Am I doing anything wrong here, or do I need to run other commands to fix this?

kmittman · 2022-09-20T17:54:07Z

Closing this repository issue. Please follow up with the TensorFlow team.

justinokamoto assigned kmittman Jul 12, 2022

kmittman mentioned this issue Jul 12, 2022: GPG signing key rotation for CUDA repositories #4 (Closed)

kmittman closed this as completed Jul 12, 2022

kmittman reopened this Jul 18, 2022

kmittman mentioned this issue Sep 20, 2022: [Tensorflow 2.8.1-gpu] Training not accelerated with NVIDIA GPU NVIDIA/tensorflow#69 (Open)

kmittman closed this as completed Sep 20, 2022
