split train #249

Merged
lazarusA merged 11 commits into main from la/split_train
Mar 16, 2026

@lazarusA (Member) commented Mar 12, 2026

Moving forward, we will no longer support calling train with the third argument save_ps. For now, doing so emits a deprecation warning.

* new footer

* suggestions

* assets

* just do public

* cleanup

* fit logo

* fix width

* message

* update hero

* one more

* css
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request overhauls the training process within the EasyHybrid.jl package by introducing a more structured and configurable approach. It moves away from a monolithic train function with many keyword arguments to a modular design utilizing dedicated configuration structs for training parameters and data handling. This change aims to enhance the clarity, flexibility, and long-term maintainability of the training pipeline. Concurrently, the documentation website receives visual and structural improvements, making it more informative and user-friendly.

Highlights

  • Refactored Training Function: The core train function has been significantly refactored into smaller, more modular components, improving readability and maintainability. This includes new structs for managing training and data configurations.
  • New Configuration Structs: Introduced TrainConfig and DataConfig structs to centralize and streamline training and data handling parameters, replacing a large number of keyword arguments in the train function signature.
  • Enhanced Documentation Website: The documentation site has been updated with new CSS for feature showcases, a custom footer, and improved handling of image assets, leading to a more polished user experience.
  • Improved Training Workflow: New components for data loading, path resolution, checkpointing, early stopping, and dashboard management have been integrated, providing a more robust and observable training pipeline.
  • Deprecated Old API: The previous train function signature with numerous keyword arguments is now deprecated, with a new, more structured API using TrainConfig and DataConfig being the recommended approach.
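
The move from many keyword arguments to two config structs can be sketched roughly as follows. This is a minimal illustration, not the EasyHybrid.jl implementation: the struct names, field names, defaults, and the helper `kwargs_to_configs_sketch` are all hypothetical stand-ins for TrainConfig, DataConfig, validate_config, and kwargs_to_configs.

```julia
# Hypothetical sketch: field names and defaults are illustrative,
# not the actual EasyHybrid.jl definitions.
Base.@kwdef struct TrainConfigSketch
    nepochs::Int = 200
    batchsize::Int = 64
end

Base.@kwdef struct DataConfigSketch
    split_data_at::Float64 = 0.8
    shuffleobs::Bool = false
end

# Reject obviously invalid values before training starts.
function validate_sketch(cfg::TrainConfigSketch)
    cfg.nepochs > 0 || throw(ArgumentError("nepochs must be positive"))
    cfg.batchsize > 0 || throw(ArgumentError("batchsize must be positive"))
    return cfg
end

# Split a flat set of legacy keyword arguments into the two config structs.
function kwargs_to_configs_sketch(; kwargs...)
    train_keys = fieldnames(TrainConfigSketch)
    data_keys = fieldnames(DataConfigSketch)
    unknown = [k for k in keys(kwargs) if !(k in train_keys) && !(k in data_keys)]
    isempty(unknown) || throw(ArgumentError("unknown keyword(s): " * join(unknown, ", ")))
    train_kw = Dict{Symbol, Any}(k => v for (k, v) in kwargs if k in train_keys)
    data_kw = Dict{Symbol, Any}(k => v for (k, v) in kwargs if k in data_keys)
    return validate_sketch(TrainConfigSketch(; train_kw...)), DataConfigSketch(; data_kw...)
end

train_cfg, data_cfg = kwargs_to_configs_sketch(nepochs = 10, shuffleobs = true)
```

Grouping parameters this way keeps validation in one place and lets a compatibility layer route old keyword arguments to the right struct.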

Changelog
  • .codecov.yml
    • Added Codecov configuration to set a default coverage threshold.
  • .gitignore
    • Updated ignored files to include new image assets and coverage/lcov.info.
  • CHANGELOG.md
    • Added entries for the refactored train function, custom features/code sections, and a custom footer.
  • README.md
    • Updated the path for the logo image in the README.
  • docs/src/.vitepress/config.mts
    • Removed the default footer configuration, deferring to a custom component.
  • docs/src/.vitepress/theme/features.css
    • Added new CSS styles to support feature showcase sections in the documentation.
  • docs/src/.vitepress/theme/index.ts
    • Imported useData and Footer components.
    • Integrated the new Footer component and features.css into the theme.
  • docs/src/.vitepress/theme/style.css
    • Modified VPFeatures styling to adjust margin.
    • Added overflow-x: hidden to body to prevent horizontal scroll.
    • Implemented styles for displaying different logos in light and dark modes.
  • docs/src/components/Footer.vue
    • Added a new Vue component for a custom, responsive footer with navigation and social links.
  • docs/src/get_started.md
    • Updated avatar image paths to use relative references.
  • docs/src/index.md
    • Changed raw HTML block syntax from @raw html` to @raw html`.
    • Replaced old installation instructions with new feature showcase components.
  • docs/src/tutorials/exponential_res.md
    • Updated avatar image paths to use relative references.
  • docs/src/tutorials/hyperparameter_tuning.md
    • Updated avatar image paths to use relative references.
  • src/EasyHybrid.jl
    • Reordered include statements to ensure proper dependency loading.
  • src/config/DataConfig.jl
    • Added a new struct DataConfig to encapsulate data-related training parameters.
  • src/config/TrainingConfig.jl
    • Added a new struct TrainConfig for training-specific parameters.
    • Included a validate_config function for TrainConfig to ensure valid parameter values.
  • src/config/TrainingPaths.jl
    • Added a new struct TrainingPaths to manage and store various output file paths for training artifacts.
  • src/config/config.jl
    • Updated to include the newly created TrainingPaths.jl, TrainingConfig.jl, and DataConfig.jl.
  • src/config/config_yaml.jl
    • Added helper functions to_namedtuple and get_full_config to work with TrainConfig for YAML serialization.
  • src/data/data.jl
    • Included new data-related files: splits.jl and loaders.jl.
  • src/data/loaders.jl
    • Added a new function build_loader for creating DataLoader instances based on TrainConfig.
  • src/data/splits.jl
    • Added prepare_splits to handle data splitting and sequence building.
    • Added maybe_build_sequences to conditionally build sequences based on DataConfig.
  • src/io/checkpoints.jl
    • Added functions save_initial_state!, save_epoch!, and save_final! for managing training checkpoints and final results.
  • src/io/io.jl
    • Refactored to include save.jl, paths.jl, and checkpoints.jl for better organization of I/O operations.
  • src/io/paths.jl
    • Added resolve_paths function to generate and manage output directory and file paths for training.
  • src/io/save.jl
    • Moved existing save/load utility functions from io.jl to this new file.
  • src/models/NNModels.jl
    • Updated the type parameter syntax in the InputBatchNorm function definition.
  • src/training/dashboard.jl
    • Added TrainDashboard struct and functions (init_dashboard, update_dashboard!, save_dashboard_img!, record_or_run) for managing and interacting with the training visualization dashboard.
  • src/training/early_stopping.jl
    • Added EarlyStopping mutable struct and associated functions (update!, is_done, best_or_final, build_results) to implement early stopping logic and finalize training results.
  • src/training/epoch.jl
    • Added run_epoch! for executing a single training epoch.
    • Added build_loss_fn to construct the loss function for training.
    • Added evaluate_epoch for evaluating model performance after each epoch.
  • src/training/history.jl
    • Added TrainingHistory struct to store snapshots of training progress over epochs.
  • src/training/initialization.jl
    • Added functions load_makie_extension, init_model_state, and compute_initial_state for setting up the training environment and initial model state.
  • src/training/show_train.jl
    • Updated log_progress! to use new EpochSnapshot and TrainConfig for progress bar display.
    • Added build_progress to create the progress bar instance.
  • src/training/train.jl
    • Completely refactored the train function to utilize TrainConfig and DataConfig structs.
    • Introduced a new API for train and added deprecation warnings for the old signature.
    • Added kwargs_to_configs and rename_deprecated_kwargs to handle conversion from old keyword arguments to new config structs.
  • src/training/train_old.jl
    • Added a new file to temporarily retain the old train function implementation for backward compatibility.
  • src/training/training.jl
    • Updated to include new training-related files: initialization.jl, history.jl, epoch.jl, dashboard.jl, and early_stopping.jl.
  • src/utils/tools.jl
    • Added a seed! function to set the random seed for reproducibility.
Activity
  • A new Codecov configuration file was added, setting a default coverage threshold.
  • The CHANGELOG.md was updated to reflect recent refactoring of the train function and additions to the documentation.
  • The documentation website's structure and styling were significantly improved, including new CSS for feature showcases and a custom footer component.
  • The core train function in src/training/train.jl underwent a major refactor, breaking it down into smaller, more manageable functions and introducing dedicated configuration structs.
  • New files were introduced across src/config, src/data, src/io, and src/training to support the modularized training pipeline, covering aspects like data loading, path management, checkpointing, and early stopping.
  • The old train function signature is now deprecated, guiding users towards a new, more structured API using TrainConfig and DataConfig objects.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a significant and well-executed refactoring of the train function, splitting it into smaller, more manageable components. This greatly improves the modularity, maintainability, and readability of the training pipeline. The introduction of TrainConfig and DataConfig structs is a particularly strong improvement, making the training configuration much cleaner. I've identified a few areas for improvement, primarily a high-severity memory issue due to redundant data storage in each epoch's snapshot, and some medium-severity issues related to the clarity of saved artifacts and CSS practices. Overall, this is a very positive change for the codebase.

I am having trouble creating individual review comments, so my feedback is listed here.

src/training/epoch.jl (47)

high

Following up on the suggestion to remove y_train and y_val from EpochSnapshot to save memory, this instantiation should be updated to no longer pass these values. The y_train and y_val data is static throughout training and doesn't need to be stored with every epoch's snapshot.

    return EpochSnapshot(l_train, l_val, ŷ_train, ŷ_val, init.is_no_nan_t, init.is_no_nan_v)

src/training/initialization.jl (30-39)

high

The EpochSnapshot struct stores y_train and y_val, which are the complete training and validation target datasets. Since these are stored for every epoch in the TrainingHistory, this leads to significant memory duplication and can cause high memory usage for large datasets. These target arrays are static during training and can be accessed from the main train function's scope when needed (e.g., when building the final TrainResults).

I recommend removing y_train and y_val from this struct. This will require corresponding changes where EpochSnapshot is instantiated in src/training/epoch.jl and where its fields are accessed in src/training/train.jl.

struct EpochSnapshot
    l_train
    l_val
    ŷ_train
    ŷ_val
    is_no_nan_t
    is_no_nan_v
end
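
One way to picture the suggested layout is a history whose snapshots carry only per-epoch values, with the static targets stored once alongside it. The sketch below is an assumption-laden illustration: `SnapshotSketch`, `RunSketch`, and `mse` are hypothetical names, and the predictions are synthetic stand-ins for model output.

```julia
# Hypothetical sketch: a per-epoch snapshot holds only values that change each epoch.
struct SnapshotSketch
    l_train::Float64
    l_val::Float64
    ŷ_train::Vector{Float64}
    ŷ_val::Vector{Float64}
end

# Static targets live once next to the history, not inside every snapshot.
struct RunSketch
    y_train::Vector{Float64}
    y_val::Vector{Float64}
    history::Vector{SnapshotSketch}
end

mse(ŷ, y) = sum(abs2, ŷ .- y) / length(y)

y_train, y_val = [1.0, 2.0, 3.0], [4.0, 5.0]
training_run = RunSketch(y_train, y_val, SnapshotSketch[])
for epoch in 1:3
    ŷ_t = y_train .+ 1 / epoch   # stand-in for model predictions
    ŷ_v = y_val .+ 1 / epoch
    push!(training_run.history, SnapshotSketch(mse(ŷ_t, y_train), mse(ŷ_v, y_val), ŷ_t, ŷ_v))
end
```

Anything that needs the targets later, such as building final results, can read them from the enclosing object instead of from each snapshot.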

src/training/train.jl (149-163)

high

To address the memory issue from storing redundant copies of y_train and y_val, this WrappedTuples constructor should be updated to no longer include them after they are removed from the EpochSnapshot struct.

function WrappedTuples(vec::Vector{EpochSnapshot})
    nt_vec = map(
        s -> (
            l_train = s.l_train,
            l_val = s.l_val,
            ŷ_train = s.ŷ_train,
            ŷ_val = s.ŷ_val,
            is_no_nan_t = s.is_no_nan_t,
            is_no_nan_v = s.is_no_nan_v,
        ), vec
    )
    return WrappedTuples(nt_vec)
end
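
As an aside, a conversion like this could also be derived from the struct's declared fields, so that removing a field from the struct would not require editing the constructor. This is only a sketch under that assumption; `SnapSketch` and `to_nt_sketch` are hypothetical names and not part of the package.

```julia
# Hypothetical stand-in for the snapshot type; fields are illustrative.
struct SnapSketch
    l_train::Float64
    l_val::Float64
end

# Build a NamedTuple from any struct by reading its declared fields.
function to_nt_sketch(s::T) where {T}
    names = fieldnames(T)
    return NamedTuple{names}(map(n -> getfield(s, n), names))
end

rows = map(to_nt_sketch, [SnapSketch(0.9, 1.1), SnapSketch(0.5, 0.8)])
```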

docs/src/.vitepress/theme/style.css (53-55)

medium

Using overflow-x: hidden; on the body can be problematic as it may hide content unexpectedly and make it difficult to debug layout issues where elements overflow their containers. It's generally better to identify the specific element causing the horizontal scrollbar and fix its styling rather than applying a global override.

src/io/checkpoints.jl (17)

medium

The logic for save_epoch can be confusing. If stopper.best_epoch is 0 (meaning the initial parameters were the best), this line sets save_epoch to 1. However, the initial state is associated with epoch 0. This will save the best model state with an incorrect epoch number in the metadata, which could cause confusion when analyzing the results. It would be more accurate to just use stopper.best_epoch directly.

    save_epoch = stopper.best_epoch
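
The epoch-0 case can be illustrated with a small early-stopping tracker. This is a hypothetical sketch, not the EarlyStopping struct from the pull request: `StopperSketch`, `update_sketch!`, and `is_done_sketch` are illustrative names, and the loss values are made up.

```julia
# Hypothetical sketch of an early-stopping tracker; not the EasyHybrid.jl implementation.
mutable struct StopperSketch
    best_loss::Float64
    best_epoch::Int      # 0 refers to the initial, untrained state
    counter::Int         # epochs since the last improvement
    patience::Int
end

StopperSketch(initial_loss; patience = 3) = StopperSketch(initial_loss, 0, 0, patience)

# Record an improvement, or count one more epoch without improvement.
function update_sketch!(s::StopperSketch, loss, epoch)
    if loss < s.best_loss
        s.best_loss, s.best_epoch, s.counter = loss, epoch, 0
    else
        s.counter += 1
    end
    return s
end

is_done_sketch(s::StopperSketch) = s.counter >= s.patience

# The validation loss never beats the initial value of 0.5.
stopper = StopperSketch(0.5; patience = 2)
for (epoch, loss) in enumerate([0.7, 0.9, 0.6])
    update_sketch!(stopper, loss, epoch)
    is_done_sketch(stopper) && break
end
save_epoch = stopper.best_epoch
```

In this run no epoch improves on the initial loss, so `save_epoch` stays at 0 and the saved metadata matches the state that was actually best.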

src/training/train_old.jl (6)

medium

The field name ps_history is now misleading. In the previous implementation, it stored the history of physical parameters. After this refactoring, it stores a vector of EpochSnapshot objects, which contain much more information (losses, predictions, etc.). To improve clarity and maintainability, I suggest renaming this field to something more descriptive, like epoch_snapshots or epoch_history. This would make the structure of TrainResults easier to understand for anyone using the output of the train function.

    epoch_snapshots

@lazarusA lazarusA marked this pull request as ready for review March 13, 2026 12:40
@lazarusA (Member, Author) commented:

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request is a major and well-executed refactoring of the training pipeline. The monolithic train function has been broken down into smaller, more focused functions and modules, improving modularity and maintainability. The introduction of TrainConfig and DataConfig structs provides a much cleaner and more extensible API compared to the previous keyword-argument-based approach. A backward compatibility layer is also thoughtfully included. My review focuses on some minor inconsistencies and potential improvements in the new structure, such as removing dead code, cleaning up the new config structs, and improving memory efficiency.

Comment thread src/config/DataConfig.jl Outdated
Comment thread src/config/TrainingConfig.jl Outdated
Comment thread src/io/checkpoints.jl Outdated
Comment thread src/training/early_stopping.jl Outdated
Comment thread src/training/initialization.jl Outdated
Comment thread src/training/train_old.jl Outdated
@lazarusA lazarusA merged commit 7f8d276 into main Mar 16, 2026
14 checks passed
@lazarusA lazarusA deleted the la/split_train branch March 20, 2026 09:00
lazarusA added a commit that referenced this pull request Apr 23, 2026
* config wip

* custom footer (#247)

* new footer

* suggestions

* assets

* just do public

* cleanup

* fit logo

* fix width

* message

* update hero (#248)

* update hero

* one more

* css

* include

* fixes

* update tests

* fixes data, docs

* fix paths

* suggestions

* m