Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

integrate with Lightning ecosystem CI #3502

Closed
Borda opened this issue Jan 24, 2022 · 10 comments · Fixed by Lightning-AI/ecosystem-ci#7
Closed

integrate with Lightning ecosystem CI #3502

Borda opened this issue Jan 24, 2022 · 10 comments · Fixed by Lightning-AI/ecosystem-ci#7
Assignees
Labels
feature request/PR for a new feature stale

Comments

@Borda
Copy link

Borda commented Jan 24, 2022

Hello and so happy to see you use Pytorch-Lightning! 🎉
Just wondering if you already heard about quite the new Pytorch Lightning (PL) ecosystem CI where we would like to invite you to... You can check out our blog post about it: Stay Ahead of Breaking Changes with the New Lightning Ecosystem CI
As you use PL framework for your cool project, we would like to enhance your experience and offer you safe updates to our future releases. At this moment, you run tests with a particular PL version, but it may accidentally happen that the next version will be incompatible with your project... 😕 We do not intend to change anything on our project side, but still here we have a solution - ecosystem CI with testing both - your and our latest development head we can find it very early and prevent releasing eventually bad version... 👍

What is needed to do?

What will you get?

  • scheduled nightly testing configured for development/stable versions
  • slack notification if something went wrong to investigate
  • testing also on multi-GPU machine as our gift to you 🐰
@Borda Borda added the feature request/PR for a new feature label Jan 24, 2022
@Borda
Copy link
Author

Borda commented Jan 24, 2022

in fact, we have already added it but it would be nice if you could verify the testing we did...
the addition was done in Lightning-AI/ecosystem-ci#7

@titu1994
Copy link
Collaborator

Hey @Borda Thanks for the update ! I reviewed the Nemo config file in that pr, and mostly it seems great. The one thing popped up was the pytorch version for the template.md file - Nemo depends on the latest pytorch version almost always. So pt 1.10 is a requirement (especially for stuff like Megatron). In general, Nemo depends on PTL to support the latest public pytorch version.

Furthermore, Nemo tests it's own CI on a PT container from NGC (versions as YY.MM) and these are generally available on other cloud providers too. Note that - usually once pytorch has a public release, it takes at least one month for the next container to actually have the public pytorch release. By actually have the released pytorch, I mean that the current container will have an alpha release of pytorch with some cherry picked changes vs the actual full new release in public.

This can cause cases where improper version checking (using distutils instead of packaging.version.Version) can fail these alpha version comparison tests and cause PTL inside of the container to pick incorrect code paths. So the ecosystem CI will work fine ... but when you run it on a pytorch container released from Nvidia (ie on most cloud providers) it may fail (and not just Nemo, anything that uses PTL and hits that code path).

So maybe on a seperate test prior to release, test the ecosystem CI on the latest public NGC pytorch container. Ofc this is a big task so it's just a suggestion.

@titu1994
Copy link
Collaborator

One thing - Nemo has a monthly release cycle - and generally releases at the last day of the month (or the closest working day). There are exceptions but that's generally the case.

Around halfway the month, we branch of our main branch and have a release branch with the pattern "r1.x.y". X usually goes up 1-2-3 etc and Y is generally 0 - currently we will release r1.6.0.

So if you can test against both main and the latest release branch, if that is possible to do on the ecosystem CI, then it would be safer. Otherwise main only is also fine since we routinely merge back the release brand into main anyway and at the end of the release too.

@titu1994
Copy link
Collaborator

On the previous discussion of Numba support - would be beneficial if a conda env with Numba 0.54/0.55 were available to test a few things in Nemo such as RNNT losses.

However, it is not a necessity as Nemo itself tests it, and PTL has a very low probability of breaking it. It's breakage occurs mainly due to Pytorch or CUDA incompatibilities against Numba itself (the containers used in Nemo's CI have the latest CUDA and we do test RNNT there).

@okuchaiev okuchaiev assigned titu1994 and unassigned okuchaiev Jan 24, 2022
@Borda
Copy link
Author

Borda commented Jan 24, 2022

Nemo depends on the latest pytorch version almost always. So pt 1.10 is a requirement (especially for stuff like Megatron).

this is fine, you can define any PT version in the config file

In general, Nemo depends on PTL to support the latest public pytorch version.

Yes, we have tests for let's an mostly also for future releasses

So maybe on a seperate test prior to release, test the ecosystem CI on the latest public NGC pytorch container. Ofc this is a big task so it's just a suggestion.

this is a good suggestion, maybe open a feature request on the EcoCI side to enable custom docker images...

Around halfway the month, we branch of our main branch and have a release branch with the pattern "r1.x.y". X usually goes up 1-2-3 etc and Y is generally 0 - currently we will release r1.6.0.

this shall be fine as you can use custom checkout which can be tag or branch

So if you can test against both main and the latest release branch, if that is possible to do on the ecosystem CI, then it would be safer.

sure, you just add two config files for each

On the previous discussion of Numba support - would be beneficial if a conda env with Numba 0.54/0.55 were available to test a few things in Nemo such as RNNT losses.

that would be good feature request on EcoCI side :]

@titu1994
Copy link
Collaborator

Sounds good, I'll try setting those up on the ecosystem repo over the next week or two.

@titu1994
Copy link
Collaborator

Sorry for delay, major PRs needed to be merged for release.

Added two feature requests for numba and CI test against container, Lightning-AI/ecosystem-ci#38 and Lightning-AI/ecosystem-ci#39.

The configs for nemo seem correct but there's a newer version now, how should we update it? Submit a PR? https://github.com/PyTorchLightning/ecosystem-ci/blob/main/configs/NVIDIA/nemo-release.yaml

@github-actions
Copy link
Contributor

github-actions bot commented Oct 8, 2022

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Oct 8, 2022
@titu1994 titu1994 removed the stale label Oct 8, 2022
@github-actions
Copy link
Contributor

github-actions bot commented Nov 8, 2022

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Nov 8, 2022
@github-actions
Copy link
Contributor

This issue was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request/PR for a new feature stale
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants