Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Fix HDFS linking #151

Merged
merged 5 commits into from Apr 26, 2022
Merged

Conversation

bashimao
Copy link
Contributor

  • Contains the necessary fixing the existing HDFS issues for HugeCTR.
  • Also, meticulously dissects all dependencies and builds them in a - hopefully - sensible manner.

@bashimao
Copy link
Contributor Author

Ready for initial review.

@bashimao
Copy link
Contributor Author

@albert17 @benfred

Copy link
Member

@benfred benfred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bashimao can you resolve the merge conflicts?

This looks like its a substantial change over just updating HDFS - which makes it much harder to review. Can we split this into multiple PR's ? The first one being a smaller PR focused on just HDFS linking, and then later ones with other improvements?

For instance, this PR is not just changing how we build/install HDFS - but also has changed how we build NVTabular / Transformers4rec and how we are handling dependencies for those libraries. Since this doesn't also update the TF/PT dockerfiles, this means we have two different ways of installing these libraries, and we'll have to maintain both right now.

docker/inference/dockerfile.ctr Outdated Show resolved Hide resolved
docker/training/dockerfile.ctr Outdated Show resolved Hide resolved
docker/training/build-hadoop.sh Show resolved Hide resolved
@albert17
Copy link
Contributor

As you mentioned:

  • Contains the necessary fixing the existing HDFS issues for HugeCTR.
  • Also, meticulously dissects all dependencies and builds them in a - hopefully - sensible manner.

I think this should be 2 different PRs. It is quite hard to review.

In addition, all dockerfiles should follow the same style.

Also, I guess we are gonna need a meeting, as there are substancial changes.

@bashimao
Copy link
Contributor Author

@bashimao can you resolve the merge conflicts?

This looks like its a substantial change over just updating HDFS - which makes it much harder to review. Can we split this into multiple PR's ? The first one being a smaller PR focused on just HDFS linking, and then later ones with other improvements?

For instance, this PR is not just changing how we build/install HDFS - but also has changed how we build NVTabular / Transformers4rec and how we are handling dependencies for those libraries. Since this doesn't also update the TF/PT dockerfiles, this means we have two different ways of installing these libraries, and we'll have to maintain both right now.

Will do. (Sorry, my half of Shanghai was/still is in complete lockdown, with daily testing.)

@bashimao
Copy link
Contributor Author

I think this should be 2 different PRs. It is quite hard to review.

That might be possible. Let me give it some thought.

In addition, all dockerfiles should follow the same style.

I kind of feared you would say that. As said, I came in from an angle where we had to take the container apart slice by slice to figure out what was wrong. Consequently, I then build the solution by piecing it together slice by slice. But having said that, I can understand when it feels a little bit heavy as a change.

Also, I guess we are gonna need a meeting, as there are substancial changes.

Arguably, yes.

@nvidia-merlin-bot
Copy link
Contributor

Click to view CI Results
GitHub pull request #151 of commit 7ecf3ccd5c4cbde5a90551d0778a7892cf1c8fe8, no merge conflicts.
Running as SYSTEM
Setting status of 7ecf3ccd5c4cbde5a90551d0778a7892cf1c8fe8 to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/7/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/151/*:refs/remotes/origin/pr/151/* # timeout=10
 > git rev-parse 7ecf3ccd5c4cbde5a90551d0778a7892cf1c8fe8^{commit} # timeout=10
Checking out Revision 7ecf3ccd5c4cbde5a90551d0778a7892cf1c8fe8 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 7ecf3ccd5c4cbde5a90551d0778a7892cf1c8fe8 # timeout=10
Commit message: "Less invasive approach to marry HugeCTR and Merlin container config."
 > git rev-list --no-walk 94f230c7cc1d8f95e5c285eb91ed97ed4534e700 # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins7458613170408113306.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.1, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1 item

tests/unit/test_version.py . [100%]

============================== 1 passed in 0.01s ===============================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Merlin/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_merlin] $ /bin/bash /tmp/jenkins615505478061537935.sh

@bashimao
Copy link
Contributor Author

bashimao commented Apr 3, 2022

@benfred @albert17
As requested, I created a new version, that only changes the build process for HugeCTR and related components, but keeps the remaining code for other Merlin components untouched.

I only modified the training docker for now. Please let me know what you think. if you are OK with this approach I will modify the inference container accordingly.

As for the other issue with duplicate files. We can only really solve this by either agreeing to have a specific docker build folder (as default folder, we usually assume .), or if we place all dockerfiles into the same folder. I don't see any other workaround, as soft-linking doesn't work with docker build.

@EvenOldridge FYI, feel free to comment.
@zehuanw FYI, feel free to comment. Binaries are confirmed to pass loss_test (no HDFS) and hdfs_backend_test (with HDFS enabled).

@bashimao

This comment was marked as resolved.

docker/training/dockerfile.ctr Outdated Show resolved Hide resolved
docker/training/dockerfile.ctr Outdated Show resolved Hide resolved
@github-actions
Copy link

Documentation preview

https://nvidia-merlin.github.io/Merlin/review/pr-151

@bashimao
Copy link
Contributor Author

@benfred @EvenOldridge Both images adjusted. Please re-check.

@nvidia-merlin-bot
Copy link
Contributor

Click to view CI Results
GitHub pull request #151 of commit e44afdad30e6214c0ed4163264090d9f5a5c4914, no merge conflicts.
Running as SYSTEM
Setting status of e44afdad30e6214c0ed4163264090d9f5a5c4914 to PENDING with url https://10.20.13.93:8080/job/merlin_merlin/45/console and message: 'Pending'
Using context: Jenkins
Building on master in workspace /var/jenkins_home/workspace/merlin_merlin
using credential systems-login
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/NVIDIA-Merlin/Merlin # timeout=10
Fetching upstream changes from https://github.com/NVIDIA-Merlin/Merlin
 > git --version # timeout=10
using GIT_ASKPASS to set credentials login for merlin-systems
 > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/Merlin +refs/pull/151/*:refs/remotes/origin/pr/151/* # timeout=10
 > git rev-parse e44afdad30e6214c0ed4163264090d9f5a5c4914^{commit} # timeout=10
Checking out Revision e44afdad30e6214c0ed4163264090d9f5a5c4914 (detached)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f e44afdad30e6214c0ed4163264090d9f5a5c4914 # timeout=10
Commit message: "Adjust inference image accordingly."
 > git rev-list --no-walk 22f3f689866b23a96d7f48b817e7713cf8904412 # timeout=10
[merlin_merlin] $ /bin/bash /tmp/jenkins5179122541206650314.sh
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.1, pluggy-1.0.0
rootdir: /var/jenkins_home/workspace/merlin_merlin/merlin
plugins: xdist-2.5.0, forked-1.4.0, cov-3.0.0
collected 1 item

tests/unit/test_version.py . [100%]

============================== 1 passed in 0.01s ===============================
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/Merlin/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[merlin_merlin] $ /bin/bash /tmp/jenkins5511310228785025628.sh

@albert17
Copy link
Contributor

Definitely it looks much better now.

@jperez999 What do you think?

@karlhigley karlhigley merged commit 79edeee into NVIDIA-Merlin:main Apr 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants