
Enable Intel xpu as a new backend of PyTorch-Lightning #17700

Draft · wants to merge 1 commit into base: master
Conversation

@jingxu10 commented May 26, 2023

What does this PR do?

Enable Intel xpu as a new backend of PyTorch-Lightning.
Follow-up PR to #16834.

Contributed by abhilash.majumder@intel.com and jing.xu@intel.com.

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following checklist:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

@jingxu10 marked this pull request as draft May 26, 2023 11:10
@github-actions bot added the fabric (lightning.fabric.Fabric) and pl (Generic label for PyTorch Lightning package) labels May 26, 2023
```python
ctx = torch.cuda.stream(torch.cuda.Stream()) if device_ids is not None else nullcontext()
with ctx:
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
else:
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
```


How about the xpu case? I think the logic should be:

```python
# https://pytorch.org/docs/stable/notes/cuda.html#id5
ctx = nullcontext()
if self.root_device.type == "cuda" and device_ids is not None:
    ctx = torch.cuda.stream(torch.cuda.Stream())
elif self.root_device.type == "xpu" and device_ids is not None:
    ctx = torch.xpu.stream(torch.xpu.Stream())

with ctx:
    ...
```
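For context, a fuller sketch of how the wrap could look with this device-aware context. This is a minimal sketch, assuming the strategy exposes `root_device` and `_ddp_kwargs` (as Lightning's DDP strategy does) and that `torch.xpu` mirrors the `torch.cuda` stream API; the helper name `_wrap_ddp` is hypothetical:

```python
from contextlib import nullcontext

import torch
from torch.nn.parallel import DistributedDataParallel


def _wrap_ddp(self, module, device_ids):
    # Hypothetical helper illustrating the suggestion above: choose the
    # stream context from the root device type instead of assuming CUDA.
    # https://pytorch.org/docs/stable/notes/cuda.html#id5
    ctx = nullcontext()
    if device_ids is not None:
        if self.root_device.type == "cuda":
            ctx = torch.cuda.stream(torch.cuda.Stream())
        elif self.root_device.type == "xpu":
            # Assumes torch.xpu provides Stream/stream like torch.cuda does
            # (true for intel_extension_for_pytorch and recent XPU builds).
            ctx = torch.xpu.stream(torch.xpu.Stream())
    with ctx:
        return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
```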

```python
ctx = torch.cuda.stream(torch.cuda.Stream()) if device_ids is not None else nullcontext()
with ctx:
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
else:
```

It seems that this block is missing the xpu case?

Comment on lines 485 to 514
```python
torch.cuda.empty_cache()
with suppress(AttributeError):
    torch.xpu.empty_cache()
```


Before calling empty_cache, how about detecting whether the device is cuda or xpu first?
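As a rough illustration of that suggestion, a device-aware variant could look like the sketch below; the helper name `_empty_device_cache` is hypothetical:

```python
import torch


def _empty_device_cache(device: torch.device) -> None:
    # Branch on the device type instead of calling both backends and
    # suppressing AttributeError when torch.xpu does not exist.
    if device.type == "cuda":
        torch.cuda.empty_cache()
    elif device.type == "xpu" and hasattr(torch, "xpu"):
        # Assumes torch.xpu.empty_cache() mirrors the CUDA API, as in
        # intel_extension_for_pytorch / recent PyTorch XPU builds.
        torch.xpu.empty_cache()
```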


In a PR, will merge.

Comment on lines 62 to 65
```python
torch.cuda.manual_seed_all(seed)
if XPUAccelerator.is_available():
    XPUAccelerator.manual_seed_all(seed)
```


I think it would be better to check the device (cuda, xpu) first before calling the corresponding manual_seed_all API.
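A minimal sketch of that idea, assuming `torch.xpu.manual_seed_all` mirrors the CUDA API (the helper name `_seed_accelerators` is hypothetical):

```python
import torch


def _seed_accelerators(seed: int) -> None:
    # Check each backend before calling its manual_seed_all, so only the
    # accelerator stacks that are actually available get seeded.
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.manual_seed_all(seed)
```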


Not needed in this case, as the declaration already checks.

```python
ctx = torch.cuda.stream(torch.cuda.Stream()) if device_ids is not None else nullcontext()
with ctx:
    return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
else:
```


Missing the xpu case.

Comment on lines 103 to +117
```python
_check_bad_cuda_fork()
if XPUAccelerator.is_available():
    _check_bad_xpu_fork()
```


It would be better to check the device first before calling the corresponding fork API.
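Illustrative sketch only: `_check_bad_cuda_fork` is the existing launcher helper referenced in the diff, `_check_bad_xpu_fork` is the XPU analogue this PR adds, and the dispatching wrapper `_check_bad_fork` is hypothetical:

```python
import torch


def _check_bad_fork(device: torch.device) -> None:
    # Dispatch the fork-safety check by the selected device type instead of
    # probing every backend's availability.
    if device.type == "cuda":
        _check_bad_cuda_fork()  # existing CUDA helper in the launcher module
    elif device.type == "xpu":
        _check_bad_xpu_fork()  # XPU analogue introduced by this PR
```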


Checks are already present; this is not additionally needed.

```python
else:
    num_xpus = 0
    xpu_available = False
rank_zero_info(f"XPU available: {xpu_available}, using: {num_xpus} XPUs")
```


It seems that num_xpus and xpu_available are local vars; line 193 should report an error here.

Author


They are set in either the if or the else scope.

@mingxiaoh Jun 28, 2023


> They are set in either the if or the else scope.

Yes, but that means they are local vars; it might pass, but it is not good programming style.
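One way to satisfy the style concern is to initialize both names before the branch. A sketch, assuming `XPUAccelerator.auto_device_count()` (the standard Lightning accelerator hook) returns the device count:

```python
# Initialize first, then override, so neither name is branch-local.
num_xpus = 0
xpu_available = False
if XPUAccelerator.is_available():
    xpu_available = True
    num_xpus = XPUAccelerator.auto_device_count()
rank_zero_info(f"XPU available: {xpu_available}, using: {num_xpus} XPUs")
```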

```python
sys.modules["lightning.pytorch.accelerators.xpu"] = self


class XPUAccelerator:
```
Member


I don't think it is needed, as XPU never was part of the codebase yet.

Author


Oh, sure. Let me remove it.

stale bot commented Sep 17, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://lightning.ai/docs/pytorch/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Discord. Thank you for your contributions.

@stale bot added the won't fix (This will not be worked on) label Sep 17, 2023
@stale bot removed the won't fix (This will not be worked on) label Oct 3, 2023
@github-actions bot added the docs (Documentation related), dependencies (Pull requests that update a dependency file), and package labels Oct 5, 2023
gitguardian bot commented Jan 16, 2024

✅ There are no secrets present in this pull request anymore.

If these secrets were true positives and are still valid, we highly recommend that you revoke them.
Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.
Find more information about risks here.



@chaitjo commented Mar 20, 2024

@jingxu10 @abhilash1910 I just wanted to check on the status of XPU integration with Lightning. Is this effort still on the horizon, or abandoned?

@abhilash1910

> @jingxu10 @abhilash1910 I just wanted to check on the status of XPU integration with Lightning. Is this effort still on the horizon, or abandoned?

Yes, it is still in progress while we work out upstreaming the Lightning-xpu module. Since there is a lot of demand, we are trying to expedite.

Commits

  • [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
  • update typos and bug fixes
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • xpu seeding PR1
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • add seeding for pytorch utilities
  • mp_fabric xpu forking
  • xpu multiprocess pytorch
  • add header for xpu
  • rename
  • change to lightning.pytorch
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • Teardown from lightning-xpu (from #PR- 3 / From #3)
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • add torch.xpu.stream to ddp
  • update docs
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • update _LIGHTNING_XPU_AVAILABLE to _lightning_xpu_available
  • correct fabric imports.py
  • remove xpu.py from _graveyard; correct _lightning_xpu_available() usage
  • fix _try_import function not defined issue in fabric
  • add docs
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • fix circle import issue
  • update pytorch trainer connector
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • correct usage in multiprocessing
  • Fix precision device
  • [pre-commit.ci] auto fixes from pre-commit.com hooks
  • update warning format
@timocafe commented May 2, 2024

Do you have a beta of the Lightning-xpu module? Google points to a repo which does not exist anymore; is it internal to Lightning-AI only?

@samet-akcay

Any progress on this? It would be great to see this implemented. Appreciate the effort!

Labels
  • dependencies: Pull requests that update a dependency file
  • docs: Documentation related
  • fabric: lightning.fabric.Fabric
  • package
  • pl: Generic label for PyTorch Lightning package

Projects
None yet

Development
Successfully merging this pull request may close these issues: none yet.

7 participants