
Enable Intel xpu as a new backend of PyTorch-Lightning #17700

Draft · wants to merge 1 commit into base: master
Conversation

@jingxu10 commented May 26, 2023

What does this PR do?

Enable Intel xpu as a new backend of PyTorch-Lightning.
Follow-up PR to #16834.

Contributed by abhilash.majumder@intel.com and jing.xu@intel.com.

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following checklist:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@jingxu10 marked this pull request as draft May 26, 2023 11:10
@github-actions bot added the fabric (lightning.fabric.Fabric) and pl (Generic label for PyTorch Lightning package) labels May 26, 2023
```python
ctx = torch.cuda.stream(torch.cuda.Stream()) if device_ids is not None else nullcontext()
with ctx:
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
else:
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
```


How about the xpu case? I think the logic should be:

```python
# https://pytorch.org/docs/stable/notes/cuda.html#id5
ctx = nullcontext()
if self.root_device.type == "cuda" and device_ids is not None:
    ctx = torch.cuda.stream(torch.cuda.Stream())
elif self.root_device.type == "xpu" and device_ids is not None:
    ctx = torch.xpu.stream(torch.xpu.Stream())

with ctx:
    ...
```
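For context, a fuller sketch of how the wrap could look with this device-aware context. This is a minimal sketch, assuming the strategy exposes `root_device` and `_ddp_kwargs` (as Lightning's DDP strategy does) and that `torch.xpu` mirrors the `torch.cuda` stream API; the helper name `_wrap_ddp` is hypothetical:

```python
from contextlib import nullcontext

import torch
from torch.nn.parallel import DistributedDataParallel


def _wrap_ddp(self, module, device_ids):
    # Hypothetical helper illustrating the suggestion above: choose the
    # stream context from the root device type instead of assuming CUDA.
    # https://pytorch.org/docs/stable/notes/cuda.html#id5
    ctx = nullcontext()
    if device_ids is not None:
        if self.root_device.type == "cuda":
            ctx = torch.cuda.stream(torch.cuda.Stream())
        elif self.root_device.type == "xpu":
            # Assumes torch.xpu provides Stream/stream like torch.cuda does
            # (true for intel_extension_for_pytorch and recent XPU builds).
            ctx = torch.xpu.stream(torch.xpu.Stream())
    with ctx:
        return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
```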

```python
ctx = torch.cuda.stream(torch.cuda.Stream()) if device_ids is not None else nullcontext()
with ctx:
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
else:
```

It seems that this block is missing the xpu case?

Comment on lines 485 to 514
```python
torch.cuda.empty_cache()
with suppress(AttributeError):
    torch.xpu.empty_cache()
```


Before calling empty_cache, how about detecting whether the device is cuda or xpu first?
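As a rough illustration of that suggestion, a device-aware variant could look like the sketch below; the helper name `_empty_device_cache` is hypothetical:

```python
import torch


def _empty_device_cache(device: torch.device) -> None:
    # Branch on the device type instead of calling both backends and
    # suppressing AttributeError when torch.xpu does not exist.
    if device.type == "cuda":
        torch.cuda.empty_cache()
    elif device.type == "xpu" and hasattr(torch, "xpu"):
        # Assumes torch.xpu.empty_cache() mirrors the CUDA API, as in
        # intel_extension_for_pytorch / recent PyTorch XPU builds.
        torch.xpu.empty_cache()
```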


In a PR, will merge.

Comment on lines 62 to 65
```python
torch.cuda.manual_seed_all(seed)
if XPUAccelerator.is_available():
    XPUAccelerator.manual_seed_all(seed)
```


I think it would be better to check the device (cuda, xpu) first before calling the corresponding manual_seed_all API.
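A minimal sketch of that idea, assuming `torch.xpu.manual_seed_all` mirrors the CUDA API (the helper name `_seed_accelerators` is hypothetical):

```python
import torch


def _seed_accelerators(seed: int) -> None:
    # Check each backend before calling its manual_seed_all, so only the
    # accelerator stacks that are actually available get seeded.
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.manual_seed_all(seed)
```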


Not needed in this case, as the declaration already checks.

```python
ctx = torch.cuda.stream(torch.cuda.Stream()) if device_ids is not None else nullcontext()
with ctx:
    return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
else:
```


Missing the xpu case.

Comment on lines 103 to +117
```python
_check_bad_cuda_fork()
if XPUAccelerator.is_available():
    _check_bad_xpu_fork()
```


It would be better to check the device first before calling the corresponding fork API.
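Illustrative sketch only: `_check_bad_cuda_fork` is the existing launcher helper referenced in the diff, `_check_bad_xpu_fork` is the XPU analogue this PR adds, and the dispatching wrapper `_check_bad_fork` is hypothetical:

```python
import torch


def _check_bad_fork(device: torch.device) -> None:
    # Dispatch the fork-safety check by the selected device type instead of
    # probing every backend's availability.
    if device.type == "cuda":
        _check_bad_cuda_fork()  # existing CUDA helper in the launcher module
    elif device.type == "xpu":
        _check_bad_xpu_fork()  # XPU analogue introduced by this PR
```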


Checks are already present; this is not additionally needed.

```python
else:
    num_xpus = 0
    xpu_available = False
rank_zero_info(f"XPU available: {xpu_available}, using: {num_xpus} XPUs")
```


It seems that num_xpus and xpu_available are local vars; line 193 should report an error here.

Author


They are set in either the if or the else scope.

@mingxiaoh Jun 28, 2023


> They are set in either the if or the else scope.

Yes, but that means they are local vars; it might pass, but it is not good programming style.
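One way to satisfy the style concern is to initialize both names before the branch. A sketch, assuming `XPUAccelerator.auto_device_count()` (the standard Lightning accelerator hook) returns the device count:

```python
# Initialize first, then override, so neither name is branch-local.
num_xpus = 0
xpu_available = False
if XPUAccelerator.is_available():
    xpu_available = True
    num_xpus = XPUAccelerator.auto_device_count()
rank_zero_info(f"XPU available: {xpu_available}, using: {num_xpus} XPUs")
```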

```python
sys.modules["lightning.pytorch.accelerators.xpu"] = self


class XPUAccelerator:
```
Member


I don't think it is needed, as XPU never was part of the codebase yet.

Author


Oh, sure. Let me remove it.

stale bot commented Sep 17, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://lightning.ai/docs/pytorch/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Discord. Thank you for your contributions.

@stale bot added the won't fix (This will not be worked on) label Sep 17, 2023
@stale bot removed the won't fix (This will not be worked on) label Oct 3, 2023
@github-actions bot added the docs (Documentation related), dependencies (Pull requests that update a dependency file), and package labels Oct 5, 2023
gitguardian bot commented Jan 16, 2024

✅ There are no secrets present in this pull request anymore.

If these secrets were true positives and are still valid, we highly recommend that you revoke them.
Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.
Find more information about risks here.



@chaitjo commented Mar 20, 2024

@jingxu10 @abhilash1910 I just wanted to check on the status of XPU integration with Lightning. Is this effort still on the horizon, or abandoned?

@abhilash1910

> @jingxu10 @abhilash1910 I just wanted to check on the status of XPU integration with Lightning. Is this effort still on the horizon, or abandoned?

Yes, it is still in progress while we work out upstreaming the Lightning-xpu module. Since there is a lot of demand, we are trying to expedite.

Commits

  • [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
  • update typos and bug fixes
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • xpu seeding PR1
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • add seeding for pytorch utilities
  • mp_fabric xpu forking
  • xpu multiprocess pytorch
  • add header for xpu
  • rename
  • change to lightning.pytorch
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • Teardown from lightning-xpu (from #PR- 3 / From #3)
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • add torch.xpu.stream to ddp
  • update docs
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • update _LIGHTNING_XPU_AVAILABLE to _lightning_xpu_available
  • correct fabric imports.py
  • remove xpu.py from _graveyard; correct _lightning_xpu_available() usage
  • fix _try_import function not defined issue in fabric
  • add docs
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • fix circle import issue
  • update pytorch trainer connector
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • correct usage in multiprocessing
  • Fix precision device
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • update warning format
@timocafe commented May 2, 2024

Do you have a beta of the Lightning-xpu module? Google points to a repo which does not exist anymore; is it internal to Lightning-AI only?

@samet-akcay

Any progress on this? It would be great to see this implemented. Appreciate the effort!

Labels
  • dependencies: Pull requests that update a dependency file
  • docs: Documentation related
  • fabric: lightning.fabric.Fabric
  • package
  • pl: Generic label for PyTorch Lightning package

Projects
None yet

Development
Successfully merging this pull request may close these issues: none yet.

7 participants