
Conversation

@protonu
Collaborator

@protonu protonu commented May 16, 2025

This adds a benchmark for cross entropy loss where we can vary the vocab size.
There are no linear operators before the cross entropy.
The targets/labels are padded/sliced to mimic other models we have seen. The input is cast to FP32 and squeezed, also to mimic other Hugging Face models we have seen.
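A minimal sketch of the computation being benchmarked (illustrative only; this `mini_model` and its shapes are assumptions, not the PR's actual `SyntheticMiniModel` code):

```python
import torch
import torch.nn.functional as F

def mini_model(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Cast to FP32 and squeeze, mimicking the Hugging Face graphs mentioned above.
    logits = logits.float().squeeze()
    # Pad then slice the labels (a causal-LM-style shift); -100 entries are
    # ignored by cross_entropy via ignore_index.
    labels = F.pad(labels, (0, 1), value=-100)[..., 1:]
    return F.cross_entropy(logits, labels, ignore_index=-100)

logits = torch.randn(8, 32064, dtype=torch.bfloat16)  # [batch, vocab_size]
labels = torch.randint(0, 32064, (8,))
loss = mini_model(logits, labels)
```

No linear layer precedes the loss, so the benchmark isolates the cast/squeeze/pad/slice plus the cross entropy itself.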

@github-actions

github-actions bot commented May 16, 2025

Review updated until commit 0f0c72e

Description

  • Added a mini benchmark for cross entropy loss

  • Introduced SyntheticMiniModel class for benchmarking

  • Expanded vocab size variations for benchmarking

  • Included cast and squeeze operations in the model


Changes walkthrough 📝

Relevant files

Enhancement

benchmarks/python/core.py: Improve type hints and IO bytes computation
  • Added type hint for tensor_props parameter
  • Modified compute_total_iobytes to handle tensor inputs
  • +4/-3

benchmarks/python/cross_entropy_loss.py: Add synthetic model for cross entropy loss benchmark
  • Introduced SyntheticMiniModel class
  • Added mini_model, inputs, grads, and generate_vocab_sizes methods
  • +66/-0

Tests

benchmarks/python/test_cross_entropy_loss.py: Add tests for synthetic cross entropy loss benchmark
  • Imported SyntheticMiniModel
  • Added test_cross_entropy_mini_benchmark_fwd and test_cross_entropy_mini_benchmark_bwd tests
  • +46/-1

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review

    Possible Issue

    The conditions in lines 158 and 161 seem incorrect. The or condition should likely be and to ensure that inputs and outputs are not torch.Tensor instances.

    if not isinstance(inputs, Iterable) or isinstance(inputs, torch.Tensor):
        inputs = [inputs]
    if not isinstance(outputs, Iterable) or isinstance(outputs, torch.Tensor):
        outputs = [outputs]
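For context on why the `or` here is not a typo: `torch.Tensor` defines `__iter__`, so it counts as `Iterable`, and the `Iterable` check alone would never wrap a lone tensor. A torch-free sketch with a stand-in class (`FakeTensor` is illustrative, not in the PR):

```python
from collections.abc import Iterable

class FakeTensor:
    # Stand-in for torch.Tensor: it defines __iter__, so isinstance(x, Iterable)
    # is True even though iterating (as with a 0-dim tensor) would raise.
    def __iter__(self):
        raise TypeError("iteration over a 0-d tensor")

def normalize(x):
    # Mirrors the check under review: wrap anything that is not iterable,
    # OR is a (fake) tensor, into a list.
    if not isinstance(x, Iterable) or isinstance(x, FakeTensor):
        return [x]
    return x

t = FakeTensor()
print(isinstance(t, Iterable))  # True, so the Iterable check alone skips wrapping
print(normalize(t) == [t])      # True: the explicit tensor check forces the wrap
```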
    Performance Concern

    The generate_vocab_sizes method generates a large number of vocab sizes, which could lead to a significant increase in benchmarking time. Consider optimizing the number of vocab sizes or adding a mechanism to limit the number of benchmarks.

    @staticmethod
    def generate_vocab_sizes():
        sizes_from_models = [
            49152,  # Starcoder
            129280,  # DeepSeek-R1
            128256,  # Llama3
            202048,  # Llama4
            256000,  # Gemma2
            131072,  # Mistral
            152064,  # Qwen2
            32064,  # Phi3.5
            100352,  # Phi4
            50264,  # GPT-2
        ]
    
        powers_of_2 = [2**i * 1024 for i in range(4, 9)]
    
        combined_set = sorted(set(sizes_from_models) | set(powers_of_2))
    
    # for each vocab size in the set, add variations at +/- 1..5 steps of 64,
    # which gives the full set of vocab sizes to benchmark
        variations = set()
        step = 64
        for num in combined_set:
            for i in range(1, 6):
                variations.add(num + (i * step))
                variations.add(num - (i * step))
    
            variations.add(num)
    
        return sorted(variations)
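On the scale of this concern, a standalone count (reproducing the generator's logic above) shows how many cases each executor gets:

```python
# Reproduce the vocab-size generation to count the benchmark cases it yields.
sizes_from_models = [49152, 129280, 128256, 202048, 256000,
                     131072, 152064, 32064, 100352, 50264]
powers_of_2 = [2**i * 1024 for i in range(4, 9)]  # 16K .. 256K

combined_set = sorted(set(sizes_from_models) | set(powers_of_2))

variations = set()
step = 64
for num in combined_set:
    for i in range(1, 6):
        variations.add(num + i * step)
        variations.add(num - i * step)
    variations.add(num)

# 14 unique bases (131072 appears in both lists) x 11 variations, no collisions:
print(len(combined_set), len(variations))  # 14 154
```

So each executor is parametrized over 154 vocab sizes, which multiplies quickly across the four executors and fwd/bwd tests.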
    Hardcoded Batch Size

    The batch size is hardcoded to 4096 in both test_cross_entropy_mini_benchmark_fwd and test_cross_entropy_mini_benchmark_bwd. Consider making this a parameter or configurable value to allow for more flexible testing.

        batch_size = 4096
    
        def fwd_call(inp):
            return SyntheticMiniModel.mini_model(*inp)
    
        inputs = SyntheticMiniModel.inputs(int(batch_size), int(vocab_size))
    
        fwd_fn = with_executor(executor, fwd_call)
        run_benchmark(benchmark, fwd_fn, inputs)
    
    
    @pytest.mark.parametrize(
        "executor", ["eager", "torchcompile", "thunder", "thunder-torchcompile"]
    )
    @pytest.mark.parametrize("vocab_size", SyntheticMiniModel.generate_vocab_sizes())
    def test_cross_entropy_mini_benchmark_bwd(benchmark, executor: str, vocab_size: int):
        if executor == "torchcompile":
            clear_dynamo_cache()
    
        # picking a value that doesn't OOM for large vocab sizes
        batch_size = 4096
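One lightweight way to address the hardcoded value (a sketch only; `NVFUSER_CE_BATCH_SIZE` is an invented knob, not an existing one) is to keep 4096 as the default but allow an override:

```python
import os

def get_batch_size(default: int = 4096) -> int:
    # NVFUSER_CE_BATCH_SIZE is hypothetical; the default matches the value
    # picked in the tests to avoid OOM at large vocab sizes.
    return int(os.environ.get("NVFUSER_CE_BATCH_SIZE", default))
```

The tests would then call `get_batch_size()` in place of the hardcoded constant; a pytest fixture or command-line option would work equally well.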

    @protonu protonu force-pushed the pbasu_cross_entropy_new_benchmark branch from 9db2924 to 8364a62 Compare May 16, 2025 23:47
    @protonu protonu force-pushed the pbasu_cross_entropy_new_benchmark branch from 8364a62 to beca294 Compare May 16, 2025 23:52
    @protonu protonu changed the title adding a mini benchmark for cross entropy loss Adding a mini benchmark for cross entropy loss May 20, 2025
    @protonu protonu marked this pull request as ready for review May 20, 2025 20:45
    if isinstance(out, torch.Tensor):
        iobytes += out.element_size() * out.numel()

    if isinstance(outputs, torch.Tensor) and outputs.dim() == 0:
    Collaborator

    In line 161 we are wrapping a single tensor output as outputs = [outputs].
    So the check here would always be false?

    Collaborator Author

    I am not sure that's quite right.

        a = torch.tensor(5)
        print(isinstance(a, Iterable))

    This should be true.
    So outputs could be a single torch tensor and not get wrapped. Please let me know if I am mistaken.

    Collaborator

    Oh you are right and I didn't know that!

    @Priya2698 so it looks like line 158 - 161 isn't doing the right thing?

    @protonu protonu requested a review from jjsjann123 May 21, 2025 14:53
    Collaborator

    @jjsjann123 jjsjann123 left a comment

    LGTM

    Base automatically changed from pbasu_preseg_move_take_along_axis to main June 3, 2025 17:17
    @protonu
    Collaborator Author

    protonu commented Jun 3, 2025

    !test

    @protonu protonu merged commit 9f28e23 into main Jun 3, 2025
    51 of 53 checks passed
    @protonu protonu deleted the pbasu_cross_entropy_new_benchmark branch June 3, 2025 19:55
    Collaborator

    Would this dynamic library seriously need to be in this directory?

    Collaborator Author

    Thanks for catching that! That was a mistake.

    nsarka pushed a commit to nsarka/Fuser that referenced this pull request Jul 28, 2025
