TransformerEngine : Add test with FSDP (and updates to ddp_wrapper in test_ddp.py) #143

Triggered via pull request April 11, 2024 10:54
Status Success
Total duration 23s

auto-cc.yml

on: pull_request

Annotations

2 errors and 1 warning
auto-cc
Resource not accessible by integration { name: 'HttpError', id: '8645562921', status: 403, response: { url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/pulls/142', status: 403, headers: { 'access-control-allow-origin': '*', 'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset', connection: 'close', 'content-encoding': 'gzip', 'content-security-policy': "default-src 'none'", 'content-type': 'application/json; charset=utf-8', date: 'Thu, 11 Apr 2024 10:55:01 GMT', 'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', server: 'GitHub.com', 'strict-transport-security': 'max-age=31536000; includeSubdomains; preload', 'transfer-encoding': 'chunked', vary: 'Accept-Encoding, Accept, X-Requested-With', 'x-accepted-github-permissions': 'pull_requests=write', 'x-content-type-options': 'nosniff', 'x-frame-options': 'deny', 'x-github-api-version-selected': '2022-11-28', 'x-github-media-type': 'github.v3; format=json', 'x-github-request-id': 'A86A:97451:28C0F5:432E78:6617C185', 'x-ratelimit-limit': '15000', 'x-ratelimit-remaining': '14994', 'x-ratelimit-reset': '1712836500', 'x-ratelimit-resource': 'core', 'x-ratelimit-used': '6', 'x-xss-protection': '0' }, data: { message: 'Resource not accessible by integration', documentation_url: 'https://docs.github.com/rest/pulls/pulls#update-a-pull-request' } }, request: { method: 'PATCH', url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/pulls/142', headers: { accept: 'application/vnd.github.v3+json', 'user-agent': 'probot/12.2.5 octokit-core.js/3.6.0 Node.js/16.20.2 (linux; x64)', authorization: 'token [REDACTED]', 'content-type': 'application/json; charset=utf-8' }, body: '{"body":"This PR adds test for using TE 
executor in FSDP and verifies it against Eager + TE. Also we update the `ddp_wrapper` to allow wrapping with different pytest-fixture besides `bucket_size_in_mb` (which errored when I tried to add a different pytest-fixture).\\r\\n\\r\\nPR https://github.com/Lightning-AI/lightning-thunder/pull/80 description details of how TE automatically takes care of syncing FP8 meta-data in distributed setting. \\r\\n\\r\\nAlso, I have verified it on a larger model using the available benchmarking script\\r\\ncmd for benchmark: \\r\\n```\\r\\ntorchrun --nproc-per-node=2 thunder/benchmarks/benchmark_litgpt.py --compile thunder+nvfuser+transformerengine+cudnn --n_layers=10 --distributed_mode=fsdp\\r\\n```\\r\\n\\r\\nNumbers are on RTX 6000\\r\\n\\r\\nWithout TE\\r\\n```\\r\\niter 41: loss 4.6562, iter time: 3180.77ms, t: 4096\\r\\niter 42: loss 4.6250, iter time: 3202.35ms, t: 4096\\r\\niter 43: loss 4.6562, iter time: 3172.88ms, t: 4096\\r\\niter 44: loss 4.6562, iter time: 3181.55ms, t: 4096\\r\\nModel name: Llama-2-7b-hf\\r\\nSeq Length: 4096\\r\\nMicro BS: 1\\r\\nGlobal BS: 2\\r\\nNumber of Layers: 10\\r\\nNumber of parameters: 1.14B\\r\\nDistributed Mode: fsdp\\r\\nSharding Mode: zero2\\r\\nBucketing: none\\r\\nCompiler: thunder+nvfuser+cudnn\\r\\nAverage iter time: 3187.63 ms\\r\\nMemory used: 30.56 GB\\r\\nTokens/s: 2570.17\\r\\nTokens/s/GPU: 1285.09\\r\\nTFLOP/s: 38.40\\r\\n```\\r\\n\\r\\nWith TE\\r\\n```\\r\\niter 42: loss 4.6562, iter time: 3025.66ms, t: 4096\\r\\niter 43: loss 4.6562, iter time: 3030.40ms, t: 4096\\r\\niter 44: loss 4.6562, iter time: 3018.83ms, t: 4096\\r\\nModel name: Llama-2-7b-hf\\r\\nSeq Length: 4096\\r\\nMicro BS: 1\\r\\nGlobal BS: 2\\r\\nNumber of Layers: 10\\r\\nNumber of parameters: 1.14B\\r\\nDistributed Mode: fsdp\\r\\nSharding Mode: zero2\\r\\nBucketing: none\\r\\nCompiler: thunder+nvfuser+transformerenginevfu
auto-cc
HttpError: Resource not accessible by integration at /home/runner/work/_actions/Lightning-AI/probot/v5/node_modules/@octokit/core/node_modules/@octokit/request/dist-node/index.js:86:21 at processTicksAndRejections (node:internal/process/task_queues:96:5) at async Job.doExecute (/home/runner/work/_actions/Lightning-AI/probot/v5/node_modules/bottleneck/light.js:405:18) { name: 'AggregateError', event: { id: '8645562921', name: 'pull_request', payload: { action: 'labeled', label: { color: '3855E2', default: false, description: '', id: 6781712626, name: 'distributed', node_id: 'LA_kwDOLiCyD88AAAABlDi48g', url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/labels/distributed' }, number: 142, organization: { avatar_url: 'https://avatars.githubusercontent.com/u/58386951?v=4', description: 'Turn ideas into AI, Lightning fast. Creators of PyTorch Lightning, Lightning AI Studio, TorchMetrics, Fabric, Lit-GPT, Lit-LLaMA', events_url: 'https://api.github.com/orgs/Lightning-AI/events', hooks_url: 'https://api.github.com/orgs/Lightning-AI/hooks', id: 58386951, issues_url: 'https://api.github.com/orgs/Lightning-AI/issues', login: 'Lightning-AI', members_url: 'https://api.github.com/orgs/Lightning-AI/members{/member}', node_id: 'MDEyOk9yZ2FuaXphdGlvbjU4Mzg2OTUx', public_members_url: 'https://api.github.com/orgs/Lightning-AI/public_members{/member}', repos_url: 'https://api.github.com/orgs/Lightning-AI/repos', url: 'https://api.github.com/orgs/Lightning-AI' }, pull_request: { _links: { comments: { href: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/issues/142/comments' }, commits: { href: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/pulls/142/commits' }, html: { href: 'https://github.com/Lightning-AI/lightning-thunder/pull/142' }, issue: { href: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/issues/142' }, review_comment: { href: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/pulls/comments{/number}' }, 
review_comments: { href: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/pulls/142/comments' }, self: { href: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/pulls/142' }, statuses: { href: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/statuses/5553215b8b383ca46f3e2f380e4a64efb8257258' } }, active_lock_reason: null, additions: 170, assignee: null, assignees: [], author_association: 'COLLABORATOR', auto_merge: null, base: { label: 'Lightning-AI:main', ref: 'main', repo: { allow_auto_merge: true, allow_forking: true, allow_merge_commit: false, allow_rebase_merge: false, allow_squash_merge: true, allow_update_branch: true, archive_url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/{archive_format}{/ref}', archived: false, assignees_url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/assignees{/user}', blobs_url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/git/blobs{/sha}', branches_url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/branches{/branch}', clone_url: 'https://github.com/Lightning-AI/lightning-thunder.git', collaborators_url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/collaborators{/collaborator}', comments_url: 'https://api.github.com/repos/Lightning-AI/lightning-thunder/comments{/number}',
auto-cc
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: Lightning-AI/probot@v5. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
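
The 403 in the first annotation comes from a `PATCH` to the pull request, and the response header `'x-accepted-github-permissions': 'pull_requests=write'` names the scope the token lacked. A minimal sketch of one possible fix, assuming the job authenticates with the default `GITHUB_TOKEN` and that the rest of `auto-cc.yml` stays unchanged (the actual workflow contents are not shown in this log, so this excerpt is hypothetical):

```yaml
# Hypothetical excerpt of auto-cc.yml; only the `permissions` block is the suggested change.
on: pull_request

permissions:
  # Matches the scope reported in 'x-accepted-github-permissions': 'pull_requests=write'.
  pull-requests: write
  # Any scope not listed here is set to none once `permissions` is declared,
  # so the job may need further entries depending on what the action reads.
```

If the pull request originates from a fork, the token issued for `pull_request` events is typically read-only regardless of this block, so the setting alone may not remove the error in that case.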