
Support gemma model in pytorch engine #1184

Merged · 32 commits into InternLM:main on Feb 29, 2024

Conversation

grimoire (Collaborator) commented Feb 22, 2024

@grimoire grimoire added the enhancement New feature or request label Feb 22, 2024
@grimoire grimoire linked an issue Feb 22, 2024 that may be closed by this pull request
Review threads on README.md and lmdeploy/cli/cli.py were marked outdated and resolved.
zhyncs (Contributor) commented Feb 22, 2024

I've tested it with profile_restful_api.py and profile_throughput.py for google/gemma-2b-it and it works well.

zhyncs (Contributor) commented Feb 22, 2024

Hi @grimoire, have you compared the output results with those of transformers?

grimoire (Collaborator, Author) commented Feb 22, 2024

> Hi @grimoire, have you compared the output results with those of transformers?

I have aligned GemmaDecoderLayer with my debug tools. The patched module can be aligned with the transformers module (without the patched RMS norm). Since the only difference between LlamaModel and GemmaModel is hidden_states = hidden_states * (self.config.hidden_size**0.5), I think that should be enough.

The patched RMS norm looks OK in the chat API; I guess the mismatch in the norm is caused by the scaling of hidden_states.
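
For reference, a minimal sketch of the two places where Gemma diverges from Llama, assuming the standard transformers semantics (this is not the lmdeploy patch itself):

import torch

def gemma_scale_embeddings(inputs_embeds: torch.Tensor, hidden_size: int) -> torch.Tensor:
    # Gemma multiplies the input embeddings by sqrt(hidden_size) before the
    # first decoder layer; Llama feeds the embeddings through unscaled.
    return inputs_embeds * (hidden_size**0.5)

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6,
             gemma_style: bool = False) -> torch.Tensor:
    variance = x.float().pow(2).mean(-1, keepdim=True)
    normed = x.float() * torch.rsqrt(variance + eps)
    # Gemma's norm scales by (1 + weight); Llama's scales by weight directly.
    scale = (1.0 + weight.float()) if gemma_style else weight.float()
    return (normed * scale).to(x.dtype)

With the embeddings pre-scaled, a residual divergence observed at the norm output would just be this scaling propagating through, which matches the guess above.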

zhyncs (Contributor) commented Feb 23, 2024

Hi @grimoire, please resolve the conflicts.

@lvhan028 lvhan028 self-requested a review February 28, 2024 06:55
lvhan028 (Collaborator) commented

@zhulinJulia24 may perform regression tests for pytorch engine

lvhan028 (Collaborator) commented

UT failed:

>           assert q.stride() == out_q.stride()
E           AttributeError: 'NoneType' object has no attribute 'stride'

Is it this PR's issue or our runner's issue?
@RunningLeon may check the runner.

RunningLeon (Collaborator) commented

Tested OK with gemma and llama2 (a minimal pipeline sketch follows the list):

  • chat cli
  • pipeline
  • api_server
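
A minimal sketch of the pipeline path with gemma, mirroring the public API used in the repro script later in this thread (the checkpoint name and prompt are illustrative):

from lmdeploy import pipeline, PytorchEngineConfig

# Build a pipeline backed by the pytorch engine, as exercised in this PR.
pipe = pipeline('google/gemma-2b-it',
                backend_config=PytorchEngineConfig(session_len=2048))
print(pipe(['Hi, please introduce yourself']))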

RunningLeon (Collaborator) left a review comment:

LGTM

zhulinJulia24 (Collaborator) commented

Error when running the Baichuan2-7B-Chat model; it works on the main branch.

Script:

from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

backend_config = PytorchEngineConfig(session_len=2048)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('/nvme/qa_test_models/Baichuan2-7B-Chat',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': 'Hi, pls intro yourself'
}], [{
    'role': 'user',
    'content': 'Shanghai is'
}]]
response = pipe(prompts, gen_config=gen_config)
print(response)

Error callstack:

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
Exception in thread Thread-1 (loop):
Traceback (most recent call last):
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 874, in loop
    step_tokens: Dict[int, InferOutput] = self.step(
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 668, in step
    output = self._model_forward(inputs,
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 641, in _model_forward
    return __forward(inputs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 583, in __forward
    return self.model_agent.forward(inputs,
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 522, in forward
    output = model_forward(
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 335, in model_forward
    output = patched_model.patched_forward(
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/models/patch.py", line 239, in __call__
    output = self._model(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zhulin1/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 686, in forward
    outputs = self.model(
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/models/baichuan.py", line 470, in forward
    return self._continuous_batching_forward_7b(
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/models/baichuan.py", line 349, in _continuous_batching_forward_7b
    layer_outputs = decoder_layer(
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zhulin1/.cache/huggingface/modules/transformers_modules/Baichuan2-7B-Chat/modeling_baichuan.py", line 273, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/models/baichuan.py", line 87, in forward
    return self._contiguous_batching_forward(
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/models/baichuan.py", line 142, in _contiguous_batching_forward
    query_states, key_states, value_states = _rotary_emb_fn(
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/models/baichuan.py", line 131, in _rotary_emb_fn
    query_states, key_states = apply_rotary_pos_emb(
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/models/llama.py", line 43, in apply_rotary_pos_emb
    cos = cos.unsqueeze(unsqueeze_dim)
TypeError: unsqueeze(): argument 'dim' (position 1) must be int, not Tensor
2024-02-29 09:04:07,433 - lmdeploy - ERROR - Engine main loop stopped.

zhulinJulia24 (Collaborator) commented Feb 29, 2024

The gemma model needs transformers>=4.38.1. Should we put a notice in the README?

With transformers == 4.37.1, the error is:

Traceback (most recent call last):
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1117, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 813, in __getitem__
    raise KeyError(key)
KeyError: 'gemma'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zhulin1/reproduceNew.py", line 8, in <module>
    pipe = pipeline('/nvme/qa_test_models/gemma-7b-it',
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/api.py", line 62, in pipeline
    return AsyncEngine(model_path,
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 75, in __init__
    self._build_pytorch(model_path=model_path,
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 203, in _build_pytorch
    self.engine = Engine(model_path=model_path,
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 133, in __init__
    self.model_agent = AutoModelAgent.from_pretrained(
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 427, in from_pretrained
    return build_model_agent(pretrained_model_name_or_path,
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 1023, in build_model_agent
    model_config = ModelConfig.from_pretrained(
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/lmdeploy/pytorch/config.py", line 72, in from_pretrained
    hf_config = AutoConfig.from_pretrained(
  File "/home/zhulin1/miniconda3/envs/lmdeployv23/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1119, in from_pretrained
    raise ValueError(
ValueError: The checkpoint you are trying to load has model type `gemma` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

grimoire (Collaborator, Author) commented

> The gemma model needs transformers>=4.38.0. Should we put a notice in the README?

The requirements have been updated to transformers>=4.33.0,<=4.38.1.
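
For illustration, a hypothetical runtime guard that would surface the version requirement early (not necessarily what lmdeploy does; its actual enforcement lives in the requirements pins):

# Hypothetical guard, for illustration only.
import transformers
from packaging import version

if version.parse(transformers.__version__) < version.parse('4.38.0'):
    raise RuntimeError(
        'Loading gemma checkpoints requires transformers>=4.38.0; '
        f'found {transformers.__version__}.')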

grimoire (Collaborator, Author) commented

@zhulinJulia24

> Error when running the Baichuan2-7B-Chat model; it works on the main branch.

Fixed.
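
For context, the TypeError in the traceback above reduces to passing a Tensor where torch.Tensor.unsqueeze expects a plain Python int. A minimal repro of the error class (hypothetical, not the actual lmdeploy call site):

import torch

cos = torch.randn(4, 8)
dim = torch.tensor(1)           # a 0-d Tensor, not an int
# cos.unsqueeze(dim)            # TypeError: unsqueeze(): argument 'dim' (position 1) must be int, not Tensor
cos = cos.unsqueeze(int(dim))   # fine once the dim is a plain int
print(cos.shape)                # torch.Size([4, 1, 8])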

zhulinJulia24 (Collaborator) left a review comment:

LGTM

Covered:
- llama-2-7b-chat
- internlm-chat-7b
- internlm-chat-20b
- internlm2-chat-7b
- internlm2-chat-20b
- Baichuan2-7B-Chat
- Baichuan2-13B-Chat
- chatglm2-6b
- falcon-7b
- Yi-6B-Chat
- internlm2-1_8b
- internlm2-20b
- Qwen1.5-7B-Chat
- Mistral-7B-Instruct-v0.1
- Mixtral-8x7B-Instruct-v0.1
- gemma-7b-it
- deepseek-moe-16b-chat
Tools: hf chat, RESTful API, and pipeline basic cases.

@lvhan028 lvhan028 changed the title Torch Gemma Support gemma model in pytorch engine Feb 29, 2024
@lvhan028 lvhan028 merged commit 456055f into InternLM:main Feb 29, 2024
4 checks passed
Labels: enhancement (New feature or request)
Projects: none yet
Development

Successfully merging this pull request may close these issues.

[Feature] support google gemma
5 participants