# Qwen/Qwen2.5-0.5B-Instruct

In [None]:
import os

# If the notebook was started inside the "mycode" subdirectory, move up to its parent directory.
if os.path.basename(os.getcwd()) == "mycode":
    os.chdir("..")
from transformers import AutoModelForCausalLM, AutoTokenizer


model_name = "Qwen/Qwen2.5-0.5B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation='flash_attention_2',  # requires the flash-attn package and a supported GPU
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(model.generation_config)

GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.1,
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8
}
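
The defaults printed above (`temperature`, `top_k`, `top_p`) shape the distribution that the next token is sampled from. The sketch below illustrates how the three settings could combine on a toy list of logits. It is a simplified, pure-Python illustration and not the `transformers` implementation: the function name `filter_distribution` is hypothetical, and `repetition_penalty` is left out.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=20, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k and top-p filtering.

    Hypothetical helper for illustration only; repetition_penalty is not modelled.
    """
    # Temperature: divide the logits before the softmax (values < 1 sharpen the distribution).
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring token indices, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the surviving tokens (subtract the max for numerical stability).
    peak = max(scaled[i] for i in ranked)
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p (nucleus): keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, p in zip(ranked, probs):
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {index: p / norm for index, p in kept}


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -1.0]
    print(filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.8))
```

With the toy logits, top-k drops the weakest token and top-p then drops the third one, so only the two most likely tokens remain candidates for sampling.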



In [None]:
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,  # return the formatted prompt as a string instead of token IDs
    add_generation_prompt=True,  # append "<|im_start|>assistant\n" so the model starts the assistant turn
)
print(text)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
print("model_inputs: ", model_inputs.keys())
print("model_inputs.input_ids.shape: ", model_inputs.input_ids.shape)
print()

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)

# generated_ids also contains the input prompt tokens; slice them off to keep only the new tokens
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
print("generated_ids[0][:5]: ", generated_ids[0][:5], "...")
print("generated_ids[0].shape: ", generated_ids[0].shape)

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("\nresponse:\n", response, sep="")

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Give me a short introduction to large language model.<|im_end|>
<|im_start|>assistant

model_inputs:  dict_keys(['input_ids', 'attention_mask'])
model_inputs.input_ids.shape:  torch.Size([1, 39])

generated_ids[0][:5]:  tensor([39814,     0,   362,  3460,  4128], device='cuda:0') ...
generated_ids[0].shape:  torch.Size([107])

response:
Sure! A large language model (LLM) is an artificial intelligence that can generate human-like text based on the input it receives. These models have been trained on vast amounts of data and are capable of understanding natural language, generating coherent responses, and even composing original content. LLMs are widely used in various fields such as chatbots, virtual assistants, language translation, sentiment analysis, and more. They allow for efficient communication between humans and machines, enabling interactions that would be difficul
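
The prompt printed at the top of this output follows a ChatML-style layout: each message is wrapped in `<|im_start|>{role}` and `<|im_end|>`, and `add_generation_prompt=True` leaves an open assistant turn at the end. The sketch below reproduces that layout by hand for the simple case shown here. `render_chatml` is a hypothetical helper, not the tokenizer's actual template, which also handles cases such as tools and a default system prompt.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Format a list of {"role", "content"} dicts in the layout printed above.

    Hypothetical, simplified stand-in for tokenizer.apply_chat_template(..., tokenize=False).
    """
    parts = []
    for message in messages:
        # Each turn: opening marker with the role, the content, then the closing marker.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open assistant turn: the model continues generating from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    messages = [
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Give me a short introduction to large language model."},
    ]
    print(render_chatml(messages))
```

In practice the tokenizer's own `apply_chat_template` should be preferred, since the template shipped with the model is the reference.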

# deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
### Usage Recommendations

We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:

1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
2. **Avoid adding a system prompt; all instructions should be contained within the user prompt.**
3. For mathematical problems, it is advisable to include a directive in your prompt such as: "Please reason step by step, and put your final answer within \boxed{}."
4. When evaluating model performance, it is recommended to conduct multiple tests and average the results.

Additionally, we have observed that the DeepSeek-R1 series models tend to bypass the thinking pattern (i.e., outputting "<think>\n\n</think>") when responding to certain queries, which can adversely affect the model's performance. To ensure that the model engages in thorough reasoning, we recommend forcing the model to begin every response with "<think>\n".
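
One way to apply these recommendations in code is sketched below. The helper names (`build_r1_messages`, `ensure_think_prefix`, `R1_SAMPLING`) are hypothetical and not part of any library. The temperature follows recommendation 1, while `top_p=0.95` is taken from the model's shipped generation config rather than from the list above.

```python
from statistics import mean

# Directive quoted from recommendation 3.
MATH_DIRECTIVE = "Please reason step by step, and put your final answer within \\boxed{}."

# Sampling settings: temperature per recommendation 1; top_p is an assumption
# copied from the model's default generation config.
R1_SAMPLING = {"do_sample": True, "temperature": 0.6, "top_p": 0.95}


def build_r1_messages(user_prompt, instructions=None, math=False):
    """Build a chat with a single user turn and no system prompt (recommendation 2)."""
    parts = []
    if instructions:
        # Instructions that would normally go in a system prompt are folded into the user turn.
        parts.append(instructions.strip())
    parts.append(user_prompt.strip())
    if math:
        parts.append(MATH_DIRECTIVE)
    return [{"role": "user", "content": "\n\n".join(parts)}]


def ensure_think_prefix(prompt_text):
    """Make sure the formatted prompt ends with "<think>\\n" so the model starts by reasoning."""
    if prompt_text.endswith("<think>\n"):
        return prompt_text
    if prompt_text.endswith("<think>"):
        return prompt_text + "\n"
    return prompt_text + "<think>\n"


def average_score(scores):
    """Average the scores of several independent runs (recommendation 4)."""
    return mean(scores)


if __name__ == "__main__":
    print(build_r1_messages("What is 17 * 23?", instructions="Answer concisely.", math=True))
    print(repr(ensure_think_prefix("<｜User｜>Hi<｜Assistant｜>")))
```

The settings could then be passed along the lines of `model.generate(**model_inputs, max_new_tokens=512, **R1_SAMPLING)`.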

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation='flash_attention_2',
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(model.generation_config)

GenerationConfig {
  "bos_token_id": 151646,
  "do_sample": true,
  "eos_token_id": 151643,
  "temperature": 0.6,
  "top_p": 0.95
}
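
Unlike the Qwen config earlier in this notebook, this config defines no `pad_token_id`. In that situation `generate()` falls back to the EOS token and logs a notice saying so. The sketch below mirrors that fallback on a plain dictionary so the value can be chosen explicitly. `resolve_pad_token_id` is a hypothetical helper, not a `transformers` function.

```python
def resolve_pad_token_id(config):
    """Pick a padding token ID from a generation-config-like dict.

    Hypothetical helper: returns pad_token_id when set, otherwise falls back
    to eos_token_id (the first entry when several EOS IDs are listed).
    """
    pad = config.get("pad_token_id")
    if pad is not None:
        return pad
    eos = config.get("eos_token_id")
    if isinstance(eos, (list, tuple)):
        return eos[0] if eos else None
    return eos


if __name__ == "__main__":
    deepseek_config = {"bos_token_id": 151646, "eos_token_id": 151643}
    qwen_config = {"eos_token_id": [151645, 151643], "pad_token_id": 151643}
    print(resolve_pad_token_id(deepseek_config))
    print(resolve_pad_token_id(qwen_config))
```

Passing the resolved value as `pad_token_id=...` to `model.generate` should avoid the fallback notice. For single-prompt generation, as in this notebook, the padding token has no effect on the output.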



In [4]:
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # appends "<｜Assistant｜><think>\n" so the model starts by reasoning
)
print(text)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
print("model_inputs: ", model_inputs.keys())
print("model_inputs.input_ids.shape: ", model_inputs.input_ids.shape)
print()

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
print("generated_ids[0][:5]: ", generated_ids[0][:5], "...")
print("generated_ids[0].shape: ", generated_ids[0].shape)

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("\nresponse:\n", response, sep="")

<｜begin▁of▁sentence｜><｜User｜>Give me a short introduction to large language model.<｜Assistant｜><think>

model_inputs:  dict_keys(['input_ids', 'attention_mask'])
model_inputs.input_ids.shape:  torch.Size([1, 16])



Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


generated_ids[0][:5]:  tensor([71486,    11,   773,   358,  1184], device='cuda:0') ...
generated_ids[0].shape:  torch.Size([512])

response:
Alright, so I need to write a short introduction to a large language model. Hmm, okay, where do I start? I know LLMs are these big machines that can understand and generate human language, but I'm not exactly sure about all the details. Let me think about what I know and what I might not know.

First, I remember that LLMs are a type of AI, right? They're designed to handle complex tasks like text generation, translation, summarization, etc. But what exactly do they do? Do they understand natural language, or do they process it differently? I think they can parse and generate text on their own, but maybe they have some limitations. Maybe they can't understand emotions or context as humans do, or maybe they're not as flexible as other AI models.

I also recall that there are different types of LLMs. There's the generative models, which are designed
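
The response above is exactly 512 tokens long, which is the `max_new_tokens` limit, so generation stopped while the model was still reasoning and no final answer was produced. The sketch below shows one way to separate the reasoning from the answer and to detect this truncated case. `split_reasoning` is a hypothetical helper, and it assumes the closing `</think>` tag survives decoding with `skip_special_tokens=True`.

```python
def split_reasoning(response, open_tag="<think>", close_tag="</think>"):
    """Split a decoded R1-style response into (reasoning, answer).

    Hypothetical helper. `answer` is None when the closing tag never appears,
    e.g. because generation hit max_new_tokens before the reasoning ended.
    """
    reasoning, separator, answer = response.partition(close_tag)
    reasoning = reasoning.strip()
    # The opening tag is normally part of the prompt, but drop it if it was decoded too.
    if reasoning.startswith(open_tag):
        reasoning = reasoning[len(open_tag):].strip()
    if not separator:
        return reasoning, None
    return reasoning, answer.strip()


if __name__ == "__main__":
    reasoning, answer = split_reasoning("Alright, so I need to write a short introduction")
    if answer is None:
        print("Reasoning was cut off; consider raising max_new_tokens.")
```

When `answer` comes back as `None`, raising `max_new_tokens` is the usual remedy, since the reasoning of these models can be considerably longer than the final answer.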