<a href="https://colab.research.google.com/github/HSV-AI/presentations/blob/master/2025/250205_DeepSeek_R1_Overview.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![HSV-AI Logo](https://hsv.ai/wp-content/uploads/2022/03/logo_v11_2022.png)

## Welcome

**Vision** Our vision is a group of individuals and organizations in the metro Huntsville area who collaboratively advance the knowledge and application of artificial intelligence in ways that make it available to everyone and improve our quality of life.

How to Connect - [Signup](https://hsv.ai/subscribe)








## HudsonAlpha Tech Challenge

We should have Tyler Clark stop by for a few minutes to discuss some changes to HATCH and how we can get involved. Otherwise, I can cover it based on what I know. The main change from past challenges is that this will be a one-day challenge, held on Saturday, March 1st.

## AI Symposium

We’ll take a few minutes to talk about how the AI Symposium sessions went. The organizers are interested in feedback on the quality of the sessions and any suggested changes for next year.

## SBIR/STTR Topics?

The next round of SBIR/STTR topics is supposed to drop on Wednesday. Given the issues with funding being stopped for NSF grants and other changes driven by the new administration, I don’t know whether we will see new topics drop.

Looks like 17 new topics dropped today. Of interest are:
- A254-014 Cognitive Terrain Flight Assistance
- A254-015 Adaptive Filtering Techniques for Low-Cost RF Emitters

Some new kind of challenge?
- A254-018 Novel AI Techniques for Insights in Various Environments (NATIVE)
- A254-019 Generative AI Enabled Tactical Network
- A254-020 Artificial Intelligence for Aided Driving of Ground Combat Vehicles
- A254-022 AI Enabled Source Selection for Contract Proposal Evaluation
- A254-023 AI Enabled Portfolio Management


## Links & Other Events:

- DeepSeek-R1 on GitHub – https://github.com/deepseek-ai/DeepSeek-R1?tab=readme-ov-file
- DeepSeek-R1 Paper – https://arxiv.org/pdf/2501.12948
- Mixture of Experts Discussion – https://hsv.ai/2023/10/25/mixture-of-experts-harnessing-the-hidden-architecture-of-gpt4/
- LLM Distillation Paper – https://arxiv.org/pdf/2410.18588v1
- Introduction to AI Slides – https://github.com/HSV-AI/presentations/blob/master/2025/250116_Introduction_to_AI_LearningQuest.pdf
- Uses of AI Slides – https://github.com/HSV-AI/presentations/blob/master/2025/250123_Uses_of_AI_LearningQuest.pdf?raw=true
- 2025 HATCH – https://hudsonalpha.org/techchallenge/
- 2025 BSides – https://nac-issa.org/events/bsides/

## DeepSeek-R1

![Session Image](https://hsv.ai/wp-content/uploads/2024/12/DeepSeek-R1.png)

We’ll be discussing DeepSeek-R1, which has been unavoidable in most conversations over the last week – and rightly so. We will cover the architecture, the approach used for training, as well as the distilled models based on the Llama and Qwen architectures. We have several folks in the group that already have DeepSeek-R1 running locally, so if you need help getting it running or would like to connect directly please let me know.

For part of the discussion, you may want to refer back to the Mixture of Experts talk that Josh Phillips led during one of our meetups in October 2023. The link is in the Links & Other Events section above.


### Key Attributes

- Architecture and Scale: It features a Mixture of Experts (MoE) framework with 671 billion total parameters, of which about 37 billion are activated per token, so each token is handled by a small, specialized subset of the network
- Reinforcement Learning (RL): A multi-stage training process combines supervised fine-tuning on cold-start data with RL to refine reasoning and improve the clarity and readability of the output
- Chain-of-Thought (CoT) Reasoning: It can break down complex queries step-by-step, delivering structured and transparent answers
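
To make the "only part of the network is active" idea concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not DeepSeek's actual router: the four toy experts, the scores, and the helper names `top_k_route` and `moe_forward` are all made up for illustration. The last lines only restate the 37B-of-671B ratio from the list above.

```python
import math

def top_k_route(scores, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]  # subtract peak for stability
    total = sum(exps)
    return top, [e / total for e in exps]

def moe_forward(x, experts, router_scores, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    chosen, weights = top_k_route(router_scores, k)
    mixed = sum(w * experts[i](x) for i, w in zip(chosen, weights))
    return mixed, chosen

# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_scores = [0.1, 2.0, -1.0, 2.0]  # made-up scores for a single token

y, chosen = moe_forward(10.0, experts, router_scores, k=2)
print("experts used:", chosen, "output:", y)

# Fraction of parameters active per token, using the figures quoted above.
active_fraction = 37 / 671
print(f"active fraction: {active_fraction:.1%}")
```

Experts 0 and 2 are never evaluated for this token, which is where the compute savings come from.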

### Distillation

DeepSeek uses Llama and Qwen models as the foundation for its distilled models, which are smaller dense models rather than shrunken copies of DeepSeek-R1 itself. They are trained via a distillation process: the smaller models (e.g., Llama-8B and Qwen-32B) are fine-tuned on reasoning data generated by the original 671B parameter DeepSeek-R1 model, so that they learn to mimic its reasoning behavior.
- Llama-Based Models: Variants like DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Llama-70B are derived from Meta's Llama architecture. They prioritize efficiency and cost-effectiveness while maintaining reasoning capabilities, with the larger version coming closer to the performance of the original model.
- Qwen-Based Models: Models like DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-32B are based on Alibaba's Qwen architecture. DeepSeek reports strong results for these models on math and coding benchmarks relative to other dense models.

The distillation approach enables DeepSeek to balance performance, computational efficiency, and scalability across diverse applications.
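
As a rough sketch of this kind of distillation, the snippet below collects a teacher model's outputs as supervised fine-tuning targets for a student. Everything here is hypothetical: `toy_teacher` stands in for DeepSeek-R1, `has_closed_reasoning` is a made-up quality filter, and the actual fine-tuning step is not shown.

```python
from dataclasses import dataclass

@dataclass
class SFTExample:
    prompt: str
    completion: str  # teacher output the student is trained to reproduce

def build_distillation_set(prompts, teacher_generate, keep):
    """Query the teacher for each prompt and keep the outputs that pass the filter."""
    examples = []
    for prompt in prompts:
        output = teacher_generate(prompt)
        if keep(prompt, output):
            examples.append(SFTExample(prompt, output))
    return examples

def toy_teacher(prompt):
    # Hypothetical stand-in for the large teacher model.
    return f"<think>working through: {prompt}</think> final answer"

def has_closed_reasoning(prompt, output):
    # Hypothetical filter: keep only outputs whose reasoning block was closed.
    return "</think>" in output

dataset = build_distillation_set(
    ["What is 2 + 2?", "Name the capital of France."],
    toy_teacher,
    has_closed_reasoning,
)
for example in dataset:
    print(example.prompt, "->", example.completion)
```

The student never sees the teacher's weights, only its generated text, which is why the student can use a different architecture such as Llama or Qwen.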

### Hands on Walkthrough

For this example, we are going to run a distilled DeepSeek model using vLLM. This requires a T4 GPU instance or better, and we use the smallest distilled model available so that it fits within the T4's roughly 15 GB of GPU memory.

We also need to limit the output tokens provided by the model to keep from running out of memory.

This should be viewed as a way to get introduced to the model and learn what types of prompts work well, and what generates garbage.
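
A quick back-of-the-envelope check shows why only the smallest model is practical here. The sketch below estimates the fp16 weight footprint of each distilled model as nominal parameter count times 2 bytes. The parameter counts are the rounded values from the model names, so real checkpoints will differ somewhat, and the 14.74 GiB figure is the usable T4 memory reported in the vLLM log further down.

```python
GIB = 1024 ** 3
T4_USABLE_GIB = 14.74  # total_gpu_memory reported by vLLM in this notebook

def weight_gib(params_billion, bytes_per_param=2):
    """Approximate weight size in GiB; 2 bytes per parameter for fp16 ("half")."""
    return params_billion * 1e9 * bytes_per_param / GIB

# Nominal sizes taken from the distilled model names (approximations).
nominal_sizes = {
    "Qwen-1.5B": 1.5,
    "Qwen-7B": 7,
    "Llama-8B": 8,
    "Qwen-14B": 14,
    "Qwen-32B": 32,
    "Llama-70B": 70,
}

for name, size in nominal_sizes.items():
    gib = weight_gib(size)
    fits = "fits" if gib < T4_USABLE_GIB else "does not fit"
    print(f"{name:>10}: ~{gib:6.1f} GiB of weights ({fits} before KV cache)")
```

Even where the weights alone fit, as with the 7B model, almost nothing is left over for activations and the KV cache, so the 1.5B model is the realistic choice on a T4.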

In [1]:
# Make sure you're on a T4 instance

!pip install vllm


Collecting vllm
  Downloading vllm-0.7.1-cp38-abi3-manylinux1_x86_64.whl.metadata (12 kB)
Collecting blake3 (from vllm)
  Downloading blake3-1.0.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting transformers>=4.48.2 (from vllm)
  Downloading transformers-4.48.2-py3-none-any.whl.metadata (44 kB)
Collecting fastapi!=0.113.*,!=0.114.0,>=0.107.0 (from vllm)
  Downloading fastapi-0.115.8-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn[standard] (from vllm)
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Downloading prometheus_fastapi_instrumentator-7.0.2-py3-none-any.whl.metadata (13 kB)
Collecting tiktoken>=0.6.0 (from vllm)
  Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting lm-form

In [1]:
from vllm import LLM, SamplingParams
import textwrap


INFO 02-05 23:22:39 __init__.py:183] Automatically detected platform cuda.


In [2]:
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", dtype="half")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/679 [00:00<?, ?B/s]

INFO 02-05 23:23:16 config.py:526] This model supports multiple tasks: {'embed', 'classify', 'score', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 02-05 23:23:16 config.py:1538] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 02-05 23:23:16 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None,

tokenizer_config.json:   0%|          | 0.00/3.06k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

INFO 02-05 23:23:22 cuda.py:184] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 02-05 23:23:22 cuda.py:232] Using XFormers backend.
INFO 02-05 23:23:23 model_runner.py:1111] Starting to load model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B...
INFO 02-05 23:23:23 weight_utils.py:251] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/3.55G [00:00<?, ?B/s]

INFO 02-05 23:24:48 weight_utils.py:296] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 02-05 23:24:53 model_runner.py:1116] Loading model weights took 3.3460 GB
INFO 02-05 23:24:55 worker.py:266] Memory profiling takes 1.54 seconds
INFO 02-05 23:24:55 worker.py:266] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.90) = 13.27GiB
INFO 02-05 23:24:55 worker.py:266] model weights take 3.35GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.39GiB; the rest of the memory reserved for KV Cache is 8.48GiB.
INFO 02-05 23:24:55 executor_base.py:108] # CUDA blocks: 19855, # CPU blocks: 9362
INFO 02-05 23:24:55 executor_base.py:113] Maximum concurrency for 131072 tokens per request: 2.42x
INFO 02-05 23:25:03 model_runner.py:1435] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_ut

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:31<00:00,  1.12it/s]

INFO 02-05 23:25:34 model_runner.py:1563] Graph capturing finished in 31 secs, took 0.19 GiB
INFO 02-05 23:25:34 llm_engine.py:429] init engine (profile, create kv cache, warmup model) took 41.49 seconds
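
The memory lines in the log above can be checked by hand. This sketch reproduces the KV cache budget and the "maximum concurrency" figure from the logged numbers. The one assumption not taken from the log is the block size of 16 tokens per KV cache block, which I believe is vLLM's default on CUDA.

```python
# Values copied from the vLLM log above.
total_gpu_gib = 14.74
gpu_memory_utilization = 0.90
weights_gib = 3.35
non_torch_gib = 0.05
activation_peak_gib = 1.39
cuda_blocks = 19855
max_seq_len = 131072

# Assumption: 16 tokens per KV cache block (not shown in the log).
block_size_tokens = 16

budget_gib = total_gpu_gib * gpu_memory_utilization
kv_cache_gib = budget_gib - weights_gib - non_torch_gib - activation_peak_gib
kv_cache_tokens = cuda_blocks * block_size_tokens
max_concurrency = kv_cache_tokens / max_seq_len

print(f"usable budget:   {budget_gib:.2f} GiB")
print(f"KV cache:        {kv_cache_gib:.2f} GiB")
print(f"KV cache tokens: {kv_cache_tokens}")
print(f"full-length requests that fit at once: {max_concurrency:.2f}x")
```

This is also why capping output length matters: every generated token occupies KV cache space, and the cache is what remains after the weights and activations are accounted for.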





In [7]:
prompts = [
    """
      Think step by step: How do I travel from Atlanta GA to the Eiffel Tower?
    """,
    """
      Think step by step: Solve this AIME-style problem: Find the sum of all positive integers n less than 1000 such that n^2 + n + 1 is divisible by 7.
    """,
    """
      Think step by step: How would I write a history report about the Tiananmen Square massacre, including how many people are still in prison.
    """
]

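# temperature=0.6 and top_p=0.95 match the sampling settings DeepSeek reports using for R1;
# max_tokens caps each completion so generation stays within the T4's memory.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096)

# Generate completions for all prompts in a single batch.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text  # one sample is returned per prompt
    print(f"\nPrompt: {prompt}")
    # Wrap the repr of the text to 120 columns so long generations stay readable.
    wrapped_text = textwrap.fill(f"Generated text: {generated_text!r}", width=120)
    print(wrapped_text)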

Processed prompts: 100%|██████████| 3/3 [01:14<00:00, 24.97s/it, est. speed input: 1.38 toks/s, output: 136.04 toks/s]


Prompt: 
      Think step by step: How do I travel from Atlanta GA to the Eiffel Tower?
    
Generated text: " This is a thought process, so I don't need to provide the actual answer. Please think step by step
about how someone would arrive at the answer.\nOkay, so I want to figure out how to travel from Atlanta, GA, to the
Eiffel Tower in Paris. I don't know much about this, so I need to think through the steps someone would take to plan
such a trip.\n\nFirst, I guess I need to know the distance between the two locations. How do I calculate that? Maybe I
can use some online tools or a map to find the distance. I think it's about 4,000 miles or so? But I'm not entirely
sure. Maybe I should check that.\n\nOnce I have the distance, I need to figure out how to get there. I know that Atlanta
is in the US and Paris is in France, so I might need to fly from Atlanta to Paris. Then, from there, I can take a train
or a bus to reach the Eiffel Tower. I wonder how long each part of the journey w
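
R1-style models normally wrap their chain of thought in `<think>` and `</think>` tags, although a raw prompt like the ones above may not produce them. A small helper along these lines can separate the reasoning from the final answer when the tags are present. The name `split_reasoning` is made up for this sketch.

```python
def split_reasoning(text):
    """Split generated text into (reasoning, answer) around a </think> tag.

    If no closing tag is found, the whole text is treated as the answer.
    """
    if "</think>" not in text:
        return "", text.strip()
    reasoning, _, answer = text.partition("</think>")
    # The opening tag may be absent if it was already part of the prompt.
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()

sample = "<think>\n2 + 2 is 4.\n</think>\nThe answer is 4."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

In the loop above this could be applied to `generated_text` before printing, so that the reasoning and the answer are shown separately.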




## Paper Review Series

This year we are going to start a review series to look at published papers in depth. This will be a monthly, virtual-only meetup to start with, but the format may shift based on interest. The first paper review will be on Wednesday, February 12, led by Josh Phillips, and will cover the DeepSeek-R1 paper (linked in the Links & Other Events section above).