Hierarchical instantiation #45
Merged
le1nux requested review from mali-git, fromm-m and flxst and removed request for lllAlexanderlll and flxst February 26, 2024 15:43
mali-git approved these changes Mar 1, 2024
fromm-m approved these changes Mar 1, 2024
approved
luzian-hahn added a commit that referenced this pull request Mar 11, 2024
commit 0807555 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Mar 7 18:33:39 2024 +0100 refactor: deleted failing legacy test commit dd0db07 Merge: 095e491 4821804 Author: Luzian Hahn <145655920+luzian-hahn@users.noreply.github.com> Date: Thu Mar 7 10:29:09 2024 +0100 Merge pull request #48 from Modalities/feat/merge-pbin-files feat: merge utility for pbin files commit 4821804 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Thu Mar 7 10:27:28 2024 +0100 docs: add hint about updated header structure commit b34d6cb Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Thu Mar 7 10:19:54 2024 +0100 refactor: remove unused utility commit 7d05448 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Feb 27 16:00:38 2024 +0100 refactor: remove redundant check for valid pbin files commit 2e27335 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Mon Feb 5 18:21:51 2024 +0100 feat: add entrypoint for pbin-merge commit 8ffc095 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Mon Feb 5 18:16:06 2024 +0100 refactor: introduce entrypoint group "data" commit a0d13a3 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Mon Feb 5 15:06:18 2024 +0100 feat: add pbin-merger commit 9f853cf Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Mon Feb 5 11:36:49 2024 +0100 refactor: introduce abstraction for stream data below packed Datasets commit 095e491 Merge: 419fc9e 0f3846a Author: Luzian Hahn <145655920+luzian-hahn@users.noreply.github.com> Date: Thu Mar 7 09:38:53 2024 +0100 Merge pull request #40 from Modalities/perf/benchmark-datasets-again-megatronlm perf: benchmark datasets against megatronlm commit 0f3846a Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Thu Mar 7 09:28:27 2024 +0100 test: prevent unnecessary warnings during tests commit f2232c3 Merge: 9095ac5 419fc9e Author: Luzian Hahn <145655920+luzian-hahn@users.noreply.github.com> Date: Thu Mar 7 08:45:14 2024 +0100 Merge branch 'main' into 
perf/benchmark-datasets-again-megatronlm commit 419fc9e Merge: 8ab29d0 d192331 Author: Max Lübbering <2804731+le1nux@users.noreply.github.com> Date: Mon Mar 4 12:25:00 2024 +0100 Merge pull request #65 from David-Berghaus/Fix-typos Fixed typos commit d192331 Author: David Berghaus <machs3ll@gmail.com> Date: Mon Mar 4 12:12:47 2024 +0100 Fixed typos commit 8ab29d0 Merge: d71bceb f9b0f41 Author: Mehdi Ali <33023925+mali-git@users.noreply.github.com> Date: Fri Mar 1 15:59:01 2024 +0100 Merge pull request #45 from Modalities/hierarchical_instantiation Hierarchical instantiation commit f9b0f41 Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Mon Feb 26 16:53:36 2024 +0100 chore: fix linting commit 042e3a0 Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Mon Feb 26 16:00:39 2024 +0100 refactor: fix typos commit 8345e06 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 26 15:24:25 2024 +0100 refactor: fixed the library usage exampe commit cd2128d Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 26 15:24:00 2024 +0100 refactor: replaced absolute paths with relative ones commit 9ab6654 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 26 15:23:06 2024 +0100 fix: fixed add_custom_component in Main commit 64b785a Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Mon Feb 26 14:18:06 2024 +0100 fix: skipping of tests in non-distributed environment commit c7f7a7b Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 26 13:35:24 2024 +0100 chore: minor changes in TestFSDPToDiscCheckpointing commit 10538ac Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 26 13:24:10 2024 +0100 refactor: also using ComponentEntity now in the tests commit 432426b Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 26 13:23:46 2024 +0100 refactor: fixed failing test_e2e_training_run_wout_ckpt commit 63829e1 Author: Max Luebbering 
<le1nux@users.noreply.github.com> Date: Mon Feb 26 13:20:43 2024 +0100 chore: excluded openGPTx from test cov commit 12632fd Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 26 12:57:16 2024 +0100 refactor: introduced ComponentEntity commit c15de17 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 26 12:09:14 2024 +0100 refactor: various smaller changes commit 973909d Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Mon Feb 26 10:58:27 2024 +0100 refactor: sort classes in config commit bc64ee0 Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Mon Feb 26 10:52:21 2024 +0100 refactor: remove RegistryFactory commit b9dbe2e Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Mon Feb 26 10:19:24 2024 +0100 refactor: rename and fix readme for getting started example commit ca74340 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sun Feb 25 16:00:17 2024 +0100 feat: added activation checkpointing to __main__.py commit 7ae2234 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 24 21:19:44 2024 +0100 refactor: fixed some of the configs commit bcd6e5b Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 24 21:16:08 2024 +0100 feat: experiment_id now set in the config via omega conf resolver commit a6ea22a Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 24 14:03:46 2024 +0100 refactor: gpt2 config for checkpointing tests commit ff3eb52 Merge: 64617dd fb0aea5 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 24 14:01:15 2024 +0100 chore: Merge branch 'hierarchical_instantiation' of github.com:Modalities/modalities into hierarchical_instantiation commit 64617dd Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 24 14:00:35 2024 +0100 feat: added add_custom_component function to Main commit df4f971 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 24 13:59:33 2024 +0100 test: 
fixed fsdp test, but cannot be run directly via pytest as it needs torchrun commit fb0aea5 Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Sat Feb 24 10:51:51 2024 +0100 fix: replace conint/confloat correctly commit fd07cb0 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Fri Feb 23 19:39:12 2024 +0100 refactor: made base_model_to_dict public as it is great for testing commit aa0d64f Author: Max Lübbering <2804731+le1nux@users.noreply.github.com> Date: Fri Feb 23 18:31:54 2024 +0100 Update README.md commit e70f3a0 Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Fri Feb 23 17:57:15 2024 +0100 fix: replace conint/confloat for pydantic 3.0 compatibility commit 70d9e63 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Fri Feb 23 17:40:38 2024 +0100 chore: more documentation commit 2396020 Merge: a68ddf4 021b7c2 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Fri Feb 23 17:39:09 2024 +0100 chore: Merge branch 'hierarchical_instantiation' of github.com:Modalities/modalities into hierarchical_instantiation commit a68ddf4 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Fri Feb 23 16:57:23 2024 +0100 feat: added example for registering a custom component commit 021b7c2 Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Fri Feb 23 11:38:32 2024 +0100 refactor: restored base_model_to_dict commit b619b41 Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Fri Feb 23 09:32:31 2024 +0100 refactor: replace base_model_to_dict by pydantic built-in method commit 34c6498 Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Fri Feb 23 09:26:44 2024 +0100 refactor: fixed typing for registry commit 52ffea4 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 22 17:59:20 2024 +0100 fix: fixed failing end 2 end test commit b0bd296 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 22 17:58:38 2024 +0100 fix: eval_dataloaders are now treated as list instead 
of dict. This was not reflected yet in the subscriber factory commit cbf905b Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 22 17:47:53 2024 +0100 fix: checkpointing test commit a42a479 Merge: 26b8b82 e3b50f6 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 22 17:33:21 2024 +0100 chore: Merge branch 'hierarchical_instantiation' of github.com:Modalities/modalities into hierarchical_instantiation commit 26b8b82 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 22 17:32:41 2024 +0100 refactor: we fully support the configs again for hierarchical instantiation commit 9dfd100 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 22 17:31:45 2024 +0100 refactor: eval_dataloaders are subsumed in a list now commit e3b50f6 Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Thu Feb 22 12:39:17 2024 +0100 refactor: unification of Pydantic*IF classes commit 7c4fafb Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Thu Feb 22 09:24:42 2024 +0000 chore: enabled pytest discovery with all tests. Some tests still need to be fixed! 
commit 34dc796 Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Thu Feb 22 10:24:09 2024 +0100 refactor: renaming for consistency commit 2d8349d Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Thu Feb 22 08:45:23 2024 +0000 fix: e2e test commit cc60608 Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Thu Feb 22 08:10:43 2024 +0000 fix: set FIXME for fsdp_to_disc_checkpointing_test and fix oudated config test commit fdfb90a Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 21 19:03:04 2024 +0100 chore: fixed variable naming commit 1de69c3 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 21 18:59:23 2024 +0100 refactor: merged remote to local and refactored callback_interval_in_batches to callback_interval_in_samples in the config commit e1dd046 Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Wed Feb 21 15:22:33 2024 +0000 fix: test discovery under vscode. 
TODO: replace PretrainedGPTConfig by correct class commit cd5ec46 Merge: 281f20f e16dec9 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 21 13:15:31 2024 +0100 chore: Merge branch 'hierarchical_instantiation' of github.com:Modalities/modalities into hierarchical_instantiation commit 281f20f Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 21 12:56:45 2024 +0100 refactor: moved LookupEnum to dedicated file to fix circular imports commit e433913 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 21 12:55:34 2024 +0100 refactor: removed types.py commit 2c3762b Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 21 12:48:12 2024 +0100 chore: import fix commit 4f07fc9 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 21 12:47:50 2024 +0100 feat: added checkpointed model and fsdp wrapped model to registry factory commit 2ba8edd Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 21 12:46:26 2024 +0100 chore: fixed import in registry factory commit 76b4240 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 21 12:46:03 2024 +0100 chore: minor fix commit 417e0ed Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 21 12:45:48 2024 +0100 refactor: deleted checkpointing factory commit b056ddd Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 21 12:45:09 2024 +0100 refactor: we always instantiate the LLMDataloader with a ResumableBatchSampler now commit cd5e6fe Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 21 12:43:20 2024 +0100 refactor: config_new.py renamed to config.py commit f39051f Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 21 12:41:48 2024 +0100 refactor: deleted lookup_types commit c971bb0 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 21 12:39:47 2024 +0100 refactor: removed resolver_register commit 3371b39 
Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 20 21:37:11 2024 +0100 refactor: __main__.py now is capable of instantiating hierarchical configs commit b5f3d4d Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 20 21:34:25 2024 +0100 refactor: refactored FSDPToDiscCheckpointing to use ModelFactory.get_fsdp_wrapped_model commit 29aee7d Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 20 21:33:06 2024 +0100 chore: ProcessGroupBackendType inherits now from LookupEnum commit 197f863 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 20 21:32:36 2024 +0100 feat: implemented OptimizerFactory commit 8d1bb9e Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 20 21:32:12 2024 +0100 feat: added model factory commit 8b9dc20 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 20 21:31:40 2024 +0100 feat: introduced CudaEnv commit 89fa61c Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 20 21:31:15 2024 +0100 chore: MixedPrecisionSettings inherits now from LookupEnum commit 4037db2 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 20 21:30:50 2024 +0100 refactor: removed running env commit eb9f5b5 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 20 21:30:32 2024 +0100 feat: added Settings basemodel to config and refactored FSDPToDiscCheckpointingConfig commit c60d689 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 20 21:29:29 2024 +0100 refactor: restructured config lorem ipsum commit d9d8925 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 20 20:22:49 2024 +0100 fix: bug fix in component factory commit e16dec9 Merge: 4c17abb d71bceb Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Mon Feb 19 20:58:25 2024 +0100 chore: merge main into hierarchical_instantiation commit 4c17abb Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Mon 
Feb 19 15:52:59 2024 +0100 refactor: unification of component registry and config registry commit d71bceb Author: Alexander Weber <alex.a.weber@gmx.de> Date: Mon Feb 19 15:29:09 2024 +0100 Update README.md commit 95bfc55 Merge: f16c409 a0b799a Author: Alexander Weber <alex.a.weber@gmx.de> Date: Mon Feb 19 15:25:29 2024 +0100 Merge pull request #52 from Modalities/chore/add-pytest-coverage chore: add pytest coverage commit a0b799a Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Mon Feb 19 14:22:49 2024 +0000 chore: clean gitignore commit 5361ca5 Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Mon Feb 19 14:11:00 2024 +0000 chore: add toml support commit 4047b67 Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Mon Feb 19 14:07:13 2024 +0000 chore: try fix from 2021 commit 20b1460 Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Mon Feb 19 13:54:42 2024 +0000 chore: remove outdated .coverage.toml commit ec495a3 Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Mon Feb 19 13:45:58 2024 +0000 chore: remove --cov from github action commit 920ccab Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Mon Feb 19 13:45:13 2024 +0000 chore: add coverage options in pyproject.toml commit a3ce9b1 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 19 14:44:59 2024 +0100 feat: integrated message subscribers commit b324c3f Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 19 14:41:37 2024 +0100 refactor: refactored dataloader and its factory commit f686268 Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Mon Feb 19 13:41:30 2024 +0000 chore: add pytest --cov arguments by default commit f1e3155 Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Mon Feb 19 13:36:22 2024 +0000 chore: search 
for coverage bug commit 4122c6c Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Mon Feb 19 13:31:43 2024 +0000 chore: search for coverage bug commit f43b81f Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Mon Feb 19 13:08:27 2024 +0000 chore: fix coveralls github action commit 81292e8 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 19 14:03:40 2024 +0100 refactor: moved OpenGPTXDatasetWrapper to DatasetFactory commit bc56246 Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Mon Feb 19 12:41:54 2024 +0000 chore: add pytest-cov execution as github action commit f16c409 Merge: a0513e3 bc03021 Author: Alexander Weber <alex.a.weber@gmx.de> Date: Mon Feb 19 11:05:36 2024 +0100 Merge pull request #56 from Modalities/fix/tests fix: use renamed tokenizer file name commit bc03021 Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Mon Feb 19 09:48:47 2024 +0000 fix: use renamed tokenizer file name commit a0513e3 Merge: b8117b1 76e0518 Author: Alexander Weber <alex.a.weber@gmx.de> Date: Mon Feb 19 10:26:45 2024 +0100 Merge pull request #38 from Modalities/fix/tests-on-cpu commit 76e0518 Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Mon Feb 19 09:24:48 2024 +0000 chore: moved if statement into torch.device commit b8117b1 Merge: 1c99963 78b9645 Author: Alexander Weber <alex.a.weber@gmx.de> Date: Mon Feb 19 10:11:56 2024 +0100 Merge pull request #42 from Modalities/fix/linting fix: lint all files commit 78b9645 Merge: 5b60c2f 1c99963 Author: Alexander Weber <12560547+lllAlexanderlll@users.noreply.github.com> Date: Mon Feb 19 09:05:44 2024 +0000 chore: local merge commit 2267605 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sun Feb 18 23:27:27 2024 +0100 feat: towards subscriber support with hierarchical instantiation commit a449119 Author: Max Luebbering 
<le1nux@users.noreply.github.com> Date: Sun Feb 18 23:25:40 2024 +0100 chore: minor changes commit aab3fa2 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sun Feb 18 23:24:58 2024 +0100 feat: implemented subscriber factory commit 1c99963 Merge: a8b6563 cf27873 Author: Max Lübbering <2804731+le1nux@users.noreply.github.com> Date: Sun Feb 18 22:45:14 2024 +0100 Merge pull request #29 from Modalities/feat/contrastive_loss Add Noise Contrastive Estimation Loss commit 6baf221 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 17 17:11:11 2024 +0100 feat: added LLM dataloader support commit 8ab04a5 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 17 17:10:16 2024 +0100 feat: introduced CollateFnIF for colleate functions commit 018c278 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 17 17:00:02 2024 +0100 feat: added resumable batch sampler commit 1273c31 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 17 16:53:57 2024 +0100 feat: added gpt_2 collator support commit 536447c Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 17 16:44:57 2024 +0100 feat: added batch sampler support commit 771eab1 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 17 16:18:51 2024 +0100 feat: added PydanticDatasetIF for SamplerConfig commit f1c1be4 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 17 15:55:47 2024 +0100 feat: added support for the different dataset formats commit 0824bb0 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 17 15:38:51 2024 +0100 refactor: added adaptations that were injected in the dataloader factory previously commit 6985fad Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 17 15:26:22 2024 +0100 feat: implemented dataset factory for various dataset types commit 81022f4 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 17 14:33:37 2024 
+0100 feat: added gpt2 tokenizer support commit 55c0110 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 17 13:35:19 2024 +0100 feat: added adamw support commit 4a6a415 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 17 13:32:30 2024 +0100 feat: implemented OptimizerFactory commit c2bd570 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sat Feb 17 13:31:59 2024 +0100 fix: added root-level to dict function for basemodel to prevent recursive model dumps commit 90207ed Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Fri Feb 16 20:05:47 2024 +0100 refactor: started refactoring the lorem ipsum config towards the new hierarchical configs commit 1304241 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Fri Feb 16 20:05:24 2024 +0100 refactor: Main makes partially use of the hierarchical instantiation now commit f7dfe31 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Fri Feb 16 20:04:54 2024 +0100 refactor: Refactored CheckpointingFactory commit 38499c4 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Fri Feb 16 20:04:26 2024 +0100 refactor: removed unused atribute in Checkpointing commit 542ba75 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Fri Feb 16 20:04:08 2024 +0100 fix: bugfix in component factory commit d446260 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Fri Feb 16 20:03:54 2024 +0100 feat: added new configs in separate file for now commit 6d121f3 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Fri Feb 16 20:03:18 2024 +0100 feat: added more components to registry factory commit fb3b35f Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Fri Feb 16 20:02:48 2024 +0100 refactor: refactored FSDPRunningEnvConfig commit 8eda99c Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 15 23:37:33 2024 +0100 refactor: refactored component factory to use the registry commit 41be773 
Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 15 23:36:57 2024 +0100 feat: added registry factory commit 3ebb656 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 15 23:36:34 2024 +0100 feat: implemented registry commit f2164a8 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 15 23:36:00 2024 +0100 test: configs now use the new format without typehints commit 5fb2199 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 15 23:35:39 2024 +0100 test: added registry testing commit 623f847 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 15 23:34:49 2024 +0100 test: updated test configs to the new format commit 372947b Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Wed Feb 14 21:45:11 2024 +0100 chore: add pytest coverage (locally) commit 36bc7ae Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 14 13:45:11 2024 +0100 refactor: renamed config_types to custom_config_types in ComponentFactory commit babd597 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 14 13:35:58 2024 +0100 feat: added support custom types in component factory commit 1639a6a Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 14 12:04:00 2024 +0100 refactor: simplified ComponentFactory commit aa9e040 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 14 10:33:11 2024 +0100 test: removed code duplication in test_component_factory commit 44677c6 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Wed Feb 14 10:30:43 2024 +0100 test: refactored test_custom_component commit 71de3ff Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 13 21:44:53 2024 +0100 test: added testing for custom components commit 2a54f84 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 13 20:57:25 2024 +0100 test: added test yaml configs for component factory commit 35236d0 Author: 
Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 13 20:56:22 2024 +0100 test: implemented test_non_existing_reference commit bb4bcb3 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 13 20:53:21 2024 +0100 test: implemented test_component_filter commit 0dfbbcb Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 13 20:49:36 2024 +0100 test: implemented test_hierarchical_component_instantiation commit 3a66b65 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Tue Feb 13 20:41:18 2024 +0100 test: implemented forward and backward referencing test commit c0c877c Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 12 14:52:13 2024 +0100 chore: fixed imports in component factory commit a9781a3 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 12 14:42:28 2024 +0100 refactor: added drafted test code for component factory commit b1cbb46 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 12 14:42:00 2024 +0100 refactor: moved trial component factory code to test module commit c115b2b Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 12 14:41:21 2024 +0100 refactor: moved component factory into parent module commit e678d78 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 12 14:26:55 2024 +0100 refactor: renamed hierarchical DI module to hierarchical_instantiation commit 45f7ff4 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 12 14:19:24 2024 +0100 refactor: removed legacy code and added comments to component factory. 
commit da88895 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 12 14:03:26 2024 +0100 feat: added referencing to config commit b42aeeb Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 12 14:02:50 2024 +0100 feat: added ReferenceConfig and PassType commit 72f0524 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 12 14:02:24 2024 +0100 feat: implemented forward and backward component referencing commit 43e1134 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sun Feb 11 19:56:18 2024 +0100 chore: added documentation for generate_text text CMD interface commit cf27873 Author: Sogol Haghighat <sogol.haghighat@iais.fraunhofer.de> Date: Fri Feb 9 17:38:31 2024 +0100 refactor: adapt nce_loss function to reflect loss from CoCa paper commit d388d21 Author: Sogol Haghighat <sogol.haghighat@iais.fraunhofer.de> Date: Fri Feb 9 17:37:35 2024 +0100 test: adapt test_nce_loss_correctness to uni and bidirectional loss commit a8b6563 Merge: da65493 00e10ae Author: Max Lübbering <2804731+le1nux@users.noreply.github.com> Date: Fri Feb 9 16:46:36 2024 +0100 Merge pull request #30 from Modalities/huggingface_models_support feat: Generic huggingface transformer support commit 00e10ae Author: Max Lübbering <2804731+le1nux@users.noreply.github.com> Date: Fri Feb 9 16:24:02 2024 +0100 Update preprocess_dataset.py commit e93e767 Merge: f435fc8 da65493 Author: Max Lübbering <2804731+le1nux@users.noreply.github.com> Date: Fri Feb 9 15:50:03 2024 +0100 Merge branch 'main' into huggingface_models_support commit f435fc8 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Fri Feb 9 15:46:59 2024 +0100 feat: introduced huggingface_prediction_subscription_key to HuggingFacePretrainedModelConfig to support different output formats commit e6f4aac Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Fri Feb 9 15:46:08 2024 +0100 refactor: moved lookup_enum to dedicated file. 
commit ebbe8c5 Author: Sogol Haghighat <sogol.haghighat@iais.fraunhofer.de> Date: Fri Feb 9 13:49:42 2024 +0100 test: add test for nce_loss using a manually calculated example commit 7d5c095 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 8 20:17:36 2024 +0100 chore: removed legacy code commit dad3ea4 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 8 20:16:40 2024 +0100 chore: added legacy trials for hierarchical DI commit 3ab9ff3 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 8 20:14:10 2024 +0100 chore: added __init__.py commit 3dfdb2a Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 8 20:13:51 2024 +0100 feat: implemented factory for hierarchical component instantiation commit dc7c1a2 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 8 20:13:17 2024 +0100 feat: added example yaml config file for hierarchical instantiation commit 099979b Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 8 20:12:58 2024 +0100 feat: added configs for the test components commit c4292ce Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 8 20:12:25 2024 +0100 feat: added components for testing commit fc5cb96 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 8 20:11:28 2024 +0100 chore: minor debugging improvement in parse_enum_by_name in utils commit 783ad81 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 8 20:10:57 2024 +0100 chore: removed legacy trials commit 9095ac5 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Feb 6 16:17:22 2024 +0100 docs: update times in table after perf upgrade commit 91ec38e Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Feb 6 16:07:46 2024 +0100 fix: make encoding specification obsolete and improve perf of index creation commit afae858 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Feb 6 15:48:19 2024 +0100 feat: make 
encoding configurable commit 71f77e2 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Feb 6 14:51:57 2024 +0100 refactor: remove parameter-artifact commit a668620 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Feb 6 14:47:52 2024 +0100 refactor: remove TODO-artifact commit a08518f Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Feb 6 14:43:31 2024 +0100 refactor: rename queue for token-writing commit 2e535a3 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Feb 6 14:25:35 2024 +0100 fix: derive default value for cpu count automatically commit 03d3f47 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Feb 6 14:24:48 2024 +0100 perf: share FileIOStream among process calls - not threadsafe! commit bc086ca Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Feb 6 14:13:12 2024 +0100 docs: remove auto execution of benchmarks, while sourcing bench utils commit fb04dc8 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Feb 6 14:08:14 2024 +0100 fix: typo in warning commit faa2eff Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Feb 6 14:05:35 2024 +0100 docs: unify time units in measurement table commit 26ade7c Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Feb 6 13:37:22 2024 +0100 docs: add definitions of benchmarking experiments commit 463872d Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 5 18:55:00 2024 +0100 refactor: drafted hierarchical instantiation commit bd39244 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 5 18:52:20 2024 +0100 chore: removed unused properties in config.py commit a908e7a Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Mon Feb 5 18:50:26 2024 +0100 refactor: moved resolver register commit 540afe2 Author: Sogol Haghighat <sogol.haghighat@iais.fraunhofer.de> Date: Thu Feb 1 17:04:22 2024 +0100 refactor: add keyword arguments commit 57ccaf9 Author: Sogol Haghighat 
<sogol.haghighat@iais.fraunhofer.de> Date: Thu Feb 1 17:03:18 2024 +0100 refactor: introduce nce_loss function and add asymmetry parameter in NCELoss commit 35ca235 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Thu Feb 1 15:57:13 2024 +0100 feat: drafted hierarchical instantiation commit 5b60c2f Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Tue Jan 30 22:48:35 2024 +0100 fix: lint all files commit d84353f Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Jan 30 17:11:02 2024 +0100 docs: add details about dataloading performance benchmarks commit 93d9241 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Jan 30 17:10:12 2024 +0100 perf: use one large memmap for PackedDatasets commit e6cb130 Author: Sogol Haghighat <sogol.haghighat@iais.fraunhofer.de> Date: Tue Jan 30 16:18:50 2024 +0100 refactor: apply ruff refactor comment commit dfbefcb Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Tue Jan 30 15:23:42 2024 +0100 fix: get rid of reduce mocking (for testing) commit f4e3c56 Author: Felix Stollenwerk <felix.stollenwerk@ai.se> Date: Tue Jan 30 15:17:10 2024 +0100 fix: training and evaluation on CPU (for testing) commit 69e2050 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Jan 30 12:14:21 2024 +0100 feat: infer smallest tokensize automatically for packing commit a96a5f4 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Tue Jan 30 09:17:35 2024 +0100 perf: use parallelized tokenization when creating .pbin files commit ee08a01 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Mon Jan 29 15:35:55 2024 +0100 perf: increase memmap index creation speed commit 8e30e00 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sun Jan 28 23:22:39 2024 +0100 chore: added documentation commit abb63aa Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sun Jan 28 22:08:45 2024 +0100 refactor: fixed configs due to latest changes commit f83da11 Author: Max Luebbering 
<le1nux@users.noreply.github.com> Date: Sun Jan 28 22:07:26 2024 +0100 feat: wired up huggingface transformer models commit 9309505 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sun Jan 28 22:01:29 2024 +0100 chore: renamed Block to GPT2Block commit 4d6a5ff Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sun Jan 28 22:01:17 2024 +0100 feat: fully implemented HuggingFacePretrainedModel with respective configuration commit 88c4fdb Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sun Jan 28 22:00:33 2024 +0100 feat: implemented automatic FSDP wrapping commit 3b51117 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sun Jan 28 21:56:48 2024 +0100 refactor: renamed tokenizer.json to tokenizer_gpt2.json commit 95e67a0 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sun Jan 28 21:55:52 2024 +0100 feat: renamed redpajama memmap datasets (added tokenizer info) commit 0992d21 Author: Max Luebbering <le1nux@users.noreply.github.com> Date: Sun Jan 28 00:26:36 2024 +0100 feat: towards generic huggingface transformer support commit ba65580 Author: Sogol Haghighat <sogol.haghighat@iais.fraunhofer.de> Date: Fri Jan 26 13:36:38 2024 +0100 refactor: refactor docstrings commit e459321 Author: Sogol Haghighat <sogol.haghighat@iais.fraunhofer.de> Date: Thu Jan 25 17:48:32 2024 +0100 test: add test for contrastive loss commit bb14749 Author: Sogol Haghighat <sogol.haghighat@iais.fraunhofer.de> Date: Thu Jan 25 17:47:43 2024 +0100 feat: add contrastive loss for coca model training commit c9e4e08 Author: Luzian Hahn <luzian.hahn@iis.fraunhofer.de> Date: Mon Jan 22 13:43:46 2024 +0100 fix: rely again on iso-8859-1 instead of utf8 the OpenGPT-X data seems to come with problematic chars, which cannot get edecoded via utf8. The former fix to use iso-8859-1 fixes this. However the issue probably lays actually with dataset conversions
Implements hierarchical instantiation of components via dependency injection.
Given a config defined in a YAML file, we first load the config into a dictionary within the factory. We then traverse the dictionary recursively in depth-first fashion. Once we reach a leaf component, we first instantiate the respective component config via Pydantic. Since the component config is self-contained, we use the class type referenced in the type_hint to instantiate the component. We then return upwards to the parent item in the dictionary. If the parent component has other child components, we instantiate those as well. Once all dependencies of the parent are instantiated, we instantiate the parent config and then the parent component.
We traverse the entire config in this fashion, first instantiating the component configs and then the components themselves. Once all the components have been built, we return the root components whose keys we passed to the factory.
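The traversal described above can be sketched roughly as follows. This is a minimal illustration, not the factory implemented in this PR: the registry, the `type_hint`/`config` key layout, the component classes and the function names are all assumptions made for the example, dataclasses stand in for the Pydantic config models, and the YAML loading step is skipped by starting from an already-loaded dictionary.

```python
from dataclasses import dataclass
from typing import Any


# Hypothetical components and configs, for illustration only.
@dataclass
class DatasetConfig:
    path: str


class Dataset:
    def __init__(self, path: str):
        self.path = path


@dataclass
class DataloaderConfig:
    dataset: Dataset  # the dependency is an already-built component
    batch_size: int


class Dataloader:
    def __init__(self, dataset: Dataset, batch_size: int):
        self.dataset = dataset
        self.batch_size = batch_size


# Hypothetical registry: type hint -> (config class, component class).
REGISTRY = {
    "Dataset": (DatasetConfig, Dataset),
    "Dataloader": (DataloaderConfig, Dataloader),
}


def build(node: Any) -> Any:
    """Depth-first build: all children are built before the node itself."""
    if isinstance(node, list):
        return [build(item) for item in node]
    if not isinstance(node, dict):
        return node  # plain value such as an int or a string
    # Recurse first, so nested components are replaced by instances.
    built = {key: build(value) for key, value in node.items()}
    if "type_hint" not in built:
        return built  # ordinary mapping, not a component
    config_cls, component_cls = REGISTRY[built["type_hint"]]
    config = config_cls(**built["config"])  # config first ...
    return component_cls(**vars(config))  # ... then the component


def build_components(config_dict: dict, root_keys: list) -> dict:
    """Return the root components whose keys were requested."""
    return {key: build(config_dict[key]) for key in root_keys}


# Dictionary as it might look after loading the YAML file.
config_dict = {
    "train_dataloader": {
        "type_hint": "Dataloader",
        "config": {
            "batch_size": 8,
            "dataset": {
                "type_hint": "Dataset",
                "config": {"path": "data/train.pbin"},
            },
        },
    }
}

components = build_components(config_dict, ["train_dataloader"])
```

Because the recursion finishes the children before touching the parent, the `Dataset` instance already exists when `DataloaderConfig` is created, which is what allows the parent config to hold a component rather than another config.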
This instantiation has a couple of advantages.
- [...] `Appconfig` definition.
- [...] `ConfigTypes` to the `Factory`.
- [...] `DataloaderConfig` would have had a dependency for `DatasetConfig`. Now, the dependency can be either a concrete class, e.g., `Dataset`, or, even more generically, an interface `DatasetIF`. In fact, the possibility of interface type checks ensures that the dependencies always implement the required interface; an error is thrown otherwise.

As of right now I don't see any disadvantage, but it would be great to discuss potential issues as well.
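To make the interface-typed dependency from the last advantage concrete, here is a small sketch. It is an assumption-laden illustration: `InMemoryDataset` is a hypothetical class, the body of `DatasetIF` is invented for the example, and a manual `isinstance` check in a dataclass stands in for the validation that the config layer performs in the PR.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


class DatasetIF(ABC):
    """Interface that every dataset dependency must implement (body assumed)."""

    @abstractmethod
    def __len__(self) -> int: ...


class InMemoryDataset(DatasetIF):
    """Hypothetical concrete dataset used only for this example."""

    def __init__(self, samples):
        self.samples = list(samples)

    def __len__(self) -> int:
        return len(self.samples)


@dataclass
class DataloaderConfig:
    dataset: DatasetIF  # typed against the interface, not a config class
    batch_size: int

    def __post_init__(self):
        # Stand-in for the config validation: reject anything that
        # does not implement the required interface.
        if not isinstance(self.dataset, DatasetIF):
            raise TypeError(
                f"dataset must implement DatasetIF, got {type(self.dataset).__name__}"
            )


# Any implementation of the interface is accepted.
ok = DataloaderConfig(dataset=InMemoryDataset([1, 2, 3]), batch_size=2)

# A plain list does not implement DatasetIF and is rejected.
try:
    DataloaderConfig(dataset=[1, 2, 3], batch_size=2)
    rejected = False
except TypeError:
    rejected = True
```

The point of the design is that the config only states which interface it needs, so swapping in a different dataset implementation requires no change to `DataloaderConfig`.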