This notebook will show you how to use the DeepSeek R1 Distill Qwen 1.5b language model on Kaggle using Kaggle Models. DeepSeek R1 is a language model known for its strong performance in coding and reasoning tasks. A distilled model is a smaller model trained to reproduce the behaviour of a larger one, which makes it faster and easier to run.

I specifically chose the 1.5-billion-parameter model because it doesn't require much computational power and can give us very fast responses on a card like the P100. We'll see whether those responses are satisfactory. In this notebook, we'll walk through a basic example of how to query the model. We'll also wrap the entire process into a Python function for easy reuse. By the end of this notebook, you'll have a clear understanding of how to use DeepSeek R1 Distill Qwen 1.5b on Kaggle.

Let's dive in and explore the capabilities of this exciting model!

In [None]:
# Install the Markdown renderer before importing it
%pip install mistletoe

import gc
import re
import time

import mistletoe
import torch
from IPython.display import display, Markdown, Latex, HTML

# Free Python garbage and release GPU memory left over from earlier runs
gc.collect()
torch.cuda.empty_cache()  # Releases unused cached memory
torch.cuda.ipc_collect()  # Reclaims GPU memory released by CUDA IPC

In [None]:
print("Using GPU:", torch.cuda.get_device_name(0))
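print('\n\nMemory Usage:')
# memory_allocated: memory held by live tensors; memory_reserved: memory held by PyTorch's caching allocator
print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')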

With its 1.5 billion parameters, this model is small enough to run efficiently on GPUs like the P100 while still offering promising capabilities. The cell above verifies that our environment is ready: it prints the name of the GPU that PyTorch sees through CUDA, along with the current memory usage.
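
To make the "small enough" claim concrete, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. The numbers are assumptions, not measurements: a nominal 1.5e9 parameters taken from the model name, 2 or 4 bytes per parameter, and the 16 GB variant of the P100. Activations and the generation cache need extra memory on top of this.

```python
# Rough estimate of the GPU memory needed just to hold the model weights.
# Assumptions (not from the notebook): a nominal 1.5e9 parameters, and
# 2 bytes per parameter for 16-bit weights or 4 bytes for 32-bit weights.

def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Return the size of the weights in GiB (1 GiB = 1024**3 bytes)."""
    return num_params * bytes_per_param / 1024**3

NOMINAL_PARAMS = 1.5e9  # nominal size taken from the model name
P100_MEMORY_GIB = 16    # assumed: the 16 GB variant of the P100

for label, nbytes in [("16-bit", 2), ("32-bit", 4)]:
    gib = weight_memory_gib(NOMINAL_PARAMS, nbytes)
    print(f"{label} weights: ~{gib:.1f} GiB of {P100_MEMORY_GIB} GiB")
```

Either way the weights should fit comfortably, which is consistent with the low memory usage we expect to see after loading the model.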

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "/kaggle/input/deepseek-r1/transformers/deepseek-r1-distill-qwen-1.5b/2"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="cuda"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
system = "You are a STEM expert."
prompt = "A bat and a ball cost €1.10 in total. The bat costs €1.00 more than the ball. How much does the ball cost?"

messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": prompt}
]

In [None]:
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=3000,
    pad_token_id=tokenizer.eos_token_id
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

The cells above load the DeepSeek R1 Distill Qwen 1.5b model from its Kaggle storage path with AutoModelForCausalLM.from_pretrained(), placing it on the GPU and letting torch_dtype="auto" choose the data type recorded with the checkpoint. The matching tokenizer is then loaded to convert text to tokens and back. A system message designates the model as a STEM expert, and a user prompt presents the classic bat-and-ball riddle; together they form the messages list. The tokenizer's chat template turns these messages into a single prompt string, which is tokenized into tensors and moved to the GPU. The model then generates a response of up to 3000 new tokens. Because the output contains the prompt tokens followed by the new ones, the prompt part is sliced off before decoding. Finally, the remaining tokens are decoded back into readable text with special tokens removed, ready for display or further analysis.
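
The slicing step is easy to overlook, so here is a minimal sketch of it with plain Python lists. The token ids are made-up stand-ins; real ids come from the tokenizer and from model.generate(), which for this kind of model returns the prompt tokens followed by the newly generated ones.

```python
# Toy stand-ins for token ids; real ids come from the tokenizer and the model.
prompt_ids = [[101, 7, 8, 9]]                # one prompt of 4 tokens
full_output_ids = [[101, 7, 8, 9, 42, 43]]   # prompt tokens + 2 new tokens

# Same list comprehension as in the notebook: drop the first len(prompt)
# tokens of each output so that only the generated part remains.
new_token_ids = [
    output[len(prompt):] for prompt, output in zip(prompt_ids, full_output_ids)
]
print(new_token_ids)  # [[42, 43]]
```

Without this step, the decoded response would start with a copy of the system message and the question.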

In [None]:
HTML(mistletoe.markdown(response))

As we can see, the model solved the well-known riddle correctly, showed the complete calculation, and verified its result. This is exactly the expected answer, delivered impressively in just a few seconds by such a small model.
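
For reference, the expected answer can be checked by hand. This is a small sketch that uses Decimal to avoid floating-point rounding; the variable names are only illustrative.

```python
from decimal import Decimal

total = Decimal("1.10")       # bat + ball
difference = Decimal("1.00")  # bat - ball

# ball + bat = total and bat = ball + difference
# => 2 * ball + difference = total
ball = (total - difference) / 2
bat = ball + difference

print(ball, bat)
assert ball + bat == total
```

The ball costs €0.05 and the bat €1.05, not the intuitive but wrong €0.10.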

In [None]:
def ask_model(system="You are a STEM expert.", prompt="Who are you?"):
    """Send one system/user message pair to the model and return its reply with timing."""
    start_time = time.time()
    messages = [{"role": "system", "content": system}, {"role": "user", "content": prompt}]
    print("Let me think...")

    # Render the chat template to a prompt string, then tokenize it on the model's device
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True)
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=3000,
        pad_token_id=tokenizer.eos_token_id)

    # Keep only the newly generated tokens by dropping the prompt tokens
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)]

    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    stop_time = time.time()

    answer_end = f"\n...Answer generation took: {(stop_time - start_time):.1f} seconds"

    return response + answer_end

The function above wraps all the previous steps and takes two parameters: system, which defines the role our language model should adopt, and prompt, which is the question for the language model. It also prints a message to inform the user that their query is being processed, and at the end it appends the total time the operation took. The function's output should be a response in a format similar to the one we saw above.
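
R1-style models usually write out their reasoning before the final answer. If you only want the answer, a small post-processing step can help. The sketch below is a hypothetical helper, not part of the notebook: it assumes the reasoning ends with a literal `</think>` tag in the decoded text, which may depend on the tokenizer and the chat template.

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer) at the last </think> tag.

    Hypothetical helper; assumes the reasoning ends with a literal </think>.
    If no such tag is found, the whole text is treated as the answer.
    """
    reasoning, sep, answer = response.rpartition("</think>")
    if not sep:
        return "", response.strip()
    # Drop an opening <think> tag if the model emitted one
    reasoning = re.sub(r"^\s*<think>", "", reasoning)
    return reasoning.strip(), answer.strip()

# Made-up sample text, only to show the expected input shape
sample = "<think>\nBall is x, bat is x + 1.00.\n</think>\nThe ball costs 0.05."
reasoning, answer = split_reasoning(sample)
print(answer)
```

You could apply it to the string returned by ask_model before displaying it.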

In [None]:
ask_model(prompt = "What is the smallest positive number that, when divided by 3, 4, and 5, leaves a remainder of 2? Don't use LaTeX symbols, only natural language.")
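
To judge the model's answer to this question, it helps to have a reference result. The brute-force sketch below is one way to get it; the names are illustrative. Note that, read literally, 2 already satisfies the condition (the quotients are 0), while the commonly intended answer is the next solution.

```python
from math import lcm  # multi-argument lcm needs Python 3.9+

divisors = (3, 4, 5)
remainder = 2

def leaves_remainder(n: int) -> bool:
    """True if n leaves the given remainder for every divisor."""
    return all(n % d == remainder for d in divisors)

# Solutions repeat with a period of lcm(3, 4, 5), so two periods are enough
# to see the first two of them.
period = lcm(*divisors)
solutions = [n for n in range(1, 2 * period + 1) if leaves_remainder(n)]
print(period, solutions)
```

An answer of 62 from the model matches the usual reading of the puzzle.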

In [None]:
questions = [
    "A man is looking at a portrait. Someone asks him, 'Whose picture are you looking at?'. He replies, 'Brothers and sisters I have none, but that man's father is my father's son.' Whose picture is the man looking at?",
    "You have two ropes, and each one takes exactly one hour to burn completely. The ropes are not necessarily the same length or width, and they don't burn at a uniform rate. How can you measure exactly 45 minutes using only these two ropes and a lighter?",
    "What is the next number in the following sequence: 1, 11, 21, 1211, 111221, 312211, ...?",
    "A lily pad doubles in size every day. If it takes 48 days for the lily pad to cover the entire lake, how long does it take for the lily pad to cover half of the lake?"
]

outputs = []

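# enumerate gives the position directly and stays correct even if a question is repeated
for idx, question in enumerate(questions):
    output = ask_model(prompt=question)
    print(f"Output for iteration {idx} generated successfully!")
    outputs.append(output)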

Above we have an example of calling the function in a loop: for a list of questions (in this case 4 consecutive puzzles) we build a list of answers, one model call per question. This can be useful when we want to process many prompts in a single run.
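
Some of these puzzles have answers that are easy to compute, which gives us a reference for checking the model. As a small sketch, the sequence question is the "look-and-say" sequence, where each term describes the runs of digits in the previous one; the function name is my own.

```python
from itertools import groupby

def look_and_say(term: str) -> str:
    """Describe runs of equal digits: '1211' -> '111221'."""
    return "".join(f"{len(list(run))}{digit}" for digit, run in groupby(term))

# Rebuild the sequence from the question and add one more term
sequence = ["1"]
while len(sequence) < 7:
    sequence.append(look_and_say(sequence[-1]))
print(sequence[-1])  # 13112221
```

For the lily pad question, the reference answer is 47 days, because the pad doubles from half the lake to the whole lake on the last day.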

That's all for this short notebook. I hope it has been helpful. I encourage you to fork and use this small model for your own experiments to see if, despite its compact size, it can solve your problems.

*The descriptive part and code fragments were generated with the support of Generative Artificial Intelligence (Gemini 2.0 Flash model)*

**References**:
1. https://www.kaggle.com/code/saadasif/deepseek-r1-7b-running-on-kaggle-gpu-p100
2. https://github.com/speakleash/Bielik-how-to-start/blob/main/Bielik_(4_bit)_Text_Improvement.ipynb
3. https://stackoverflow.com/questions/48152674/how-do-i-check-if-pytorch-is-using-the-gpu
4. https://www.kaggle.com/models/deepseek-ai/deepseek-r1/Transformers/deepseek-r1-distill-qwen-1.5b/2

**Thanks for reading my notebook!**

**If you have any suggestions for improving the analysis, let me know in the comments!**

**If you liked my notebook, give it an upvote!**

**If you have a moment, I encourage you to see my other [projects](https://www.kaggle.com/michau96/code).**