split train#249
Conversation
* new footer * suggestions * assets * just do public * cleanup * fit logo * fix width * message
* update hero * one more * css
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request overhauls the training process within the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a significant and well-executed refactoring of the train function, splitting it into smaller, more manageable components. This greatly improves the modularity, maintainability, and readability of the training pipeline. The introduction of TrainConfig and DataConfig structs is a particularly strong improvement, making the training configuration much cleaner. I've identified a few areas for improvement, primarily a high-severity memory issue due to redundant data storage in each epoch's snapshot, and some medium-severity issues related to the clarity of saved artifacts and CSS practices. Overall, this is a very positive change for the codebase.
I am having trouble creating individual review comments. Click here to see my feedback.
src/training/epoch.jl (47)
Following up on the suggestion to remove y_train and y_val from EpochSnapshot to save memory, this instantiation should be updated to no longer pass these values. The y_train and y_val data is static throughout training and doesn't need to be stored with every epoch's snapshot.
return EpochSnapshot(l_train, l_val, ŷ_train, ŷ_val, init.is_no_nan_t, init.is_no_nan_v)src/training/initialization.jl (30-39)
The EpochSnapshot struct stores y_train and y_val, which are the complete training and validation target datasets. Since these are stored for every epoch in the TrainingHistory, this leads to significant memory duplication and can cause high memory usage for large datasets. These target arrays are static during training and can be accessed from the main train function's scope when needed (e.g., when building the final TrainResults).
I recommend removing y_train and y_val from this struct. This will require corresponding changes where EpochSnapshot is instantiated in src/training/epoch.jl and where its fields are accessed in src/training/train.jl.
struct EpochSnapshot
l_train
l_val
ŷ_train
ŷ_val
is_no_nan_t
is_no_nan_v
endsrc/training/train.jl (149-163)
To address the memory issue from storing redundant copies of y_train and y_val, this WrappedTuples constructor should be updated to no longer include them after they are removed from the EpochSnapshot struct.
function WrappedTuples(vec::Vector{EpochSnapshot})
nt_vec = map(
s -> (
l_train = s.l_train,
l_val = s.l_val,
ŷ_train = s.ŷ_train,
ŷ_val = s.ŷ_val,
is_no_nan_t = s.is_no_nan_t,
is_no_nan_v = s.is_no_nan_v,
), vec
)
return WrappedTuples(nt_vec)
enddocs/src/.vitepress/theme/style.css (53-55)
Using overflow-x: hidden; on the body can be problematic as it may hide content unexpectedly and make it difficult to debug layout issues where elements overflow their containers. It's generally better to identify the specific element causing the horizontal scrollbar and fix its styling rather than applying a global override.
src/io/checkpoints.jl (17)
The logic for save_epoch can be confusing. If stopper.best_epoch is 0 (meaning the initial parameters were the best), this line sets save_epoch to 1. However, the initial state is associated with epoch 0. This will save the best model state with an incorrect epoch number in the metadata, which could cause confusion when analyzing the results. It would be more accurate to just use stopper.best_epoch directly.
save_epoch = stopper.best_epochsrc/training/train_old.jl (6)
The field name ps_history is now misleading. In the previous implementation, it stored the history of physical parameters. After this refactoring, it stores a vector of EpochSnapshot objects, which contain much more information (losses, predictions, etc.). To improve clarity and maintainability, I suggest renaming this field to something more descriptive, like epoch_snapshots or epoch_history. This would make the structure of TrainResults easier to understand for anyone using the output of the train function.
epoch_snapshots|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request is a major and well-executed refactoring of the training pipeline. The monolithic train function has been broken down into smaller, more focused functions and modules, improving modularity and maintainability. The introduction of TrainConfig and DataConfig structs provides a much cleaner and more extensible API compared to the previous keyword-argument-based approach. A backward compatibility layer is also thoughtfully included. My review focuses on some minor inconsistencies and potential improvements in the new structure, such as removing dead code, cleaning up the new config structs, and improving memory efficiency.
moving forward we will not longer support
trainwith the 3rd argumentsave_ps. For now, there is a deprecation warning.