# **🚀 Getting Started with DeepSeek-R1-Distill-Llama-8B**  
📌 **Copyright 2025, Denis Rothman**  

---

## **🚀 Installing and running DeepSeek-R1-Distill-Llama-8B**  

This notebook provides a **step-by-step guide** to **downloading and running DeepSeek-R1-Distill-Llama-8B** in Google Colab, with the model's artifacts stored in **Google Drive**. If you don't want to use Google Drive, you can install the artifacts on a local machine, a server, or a cloud server.

### **🔹 How to Get Started**  
1️⃣ **Install the model's artifacts** → Set `install_deepseek=True` and run all cells.  
2️⃣ **Restart the session** → Disconnect and start a new session.  
3️⃣ **Re-run the model** → Set `install_deepseek=False` and run all cells again.  
4️⃣ **Interact with the model** → Use it in a prompt session!  

⚠️ **System Requirements**  
✅ **GPU** – Minimum **16GB** VRAM required.  
✅ **Google Drive Space** – At least **20GB** free space.  
📌 **Educational Use Only** – For production, deploy artifacts on a **local or cloud server**.

---

## **📖 Table of Contents**  

### **1️⃣ Setting Up the DeepSeek Environment (Hugging Face)**  
✅ Checking GPU Activation  
📂 Mounting Google Drive  
⚙️ Installing the Hugging Face Environment  
🔄 Ensuring `install_deepseek=True` for First Run  
📌 Checking the Transformers Version  

### **2️⃣ Downloading DeepSeek-R1-Distill-Llama-8B**  
📂 Verifying Download Path  

### **3️⃣ Running a DeepSeek Session**  
🔄 Setting `install_deepseek=False` for Second Run  
📌 Model Information  
💬 Running an Interactive Prompt Session  

---

### **💡 Ready to Use DeepSeek?**  
Follow the **installation steps**, ensure you have the required **hardware**, and launch your **interactive AI session** 🚀

# 1. Setting up the DeepSeek Hugging Face environment

In [None]:
# Set to True for the first run, which downloads the model's artifacts
# Set to False to run with the previously downloaded artifacts
install_deepseek = False

## Checking GPU activation

In [None]:
!nvidia-smi

Fri Feb  7 14:58:10 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   32C    P0             48W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
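
The same check can be scripted so that the notebook warns early when the GPU falls short of the 16GB minimum stated in the introduction. The sketch below is illustrative only: `parse_gpu_line` and `query_gpus` are hypothetical helper names, and it assumes `nvidia-smi` is on the path and supports the `--query-gpu` CSV output.

```python
import shutil
import subprocess

MIN_VRAM_MIB = 16 * 1024  # 16GB minimum stated in the introduction


def parse_gpu_line(line):
    """Parse one 'name, memory.total' CSV line into (name, total_mib)."""
    name, total = line.rsplit(",", 1)
    return name.strip(), int(total.strip())


def query_gpus():
    """Return a list of (name, total_mib); empty if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=False, timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return []
    if result.returncode != 0:
        return []
    gpus = []
    for line in result.stdout.splitlines():
        try:
            gpus.append(parse_gpu_line(line))
        except ValueError:
            continue  # skip lines that do not match the expected format
    return gpus


for name, total_mib in query_gpus():
    status = "OK" if total_mib >= MIN_VRAM_MIB else "below the 16GB minimum"
    print(f"{name}: {total_mib} MiB ({status})")
```

On a machine without an NVIDIA driver the loop prints nothing, which is itself a sign that no GPU runtime is active.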
                                                

## Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
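
The introduction asks for at least 20GB of free space. A rough check after mounting could look like the sketch below. The helper names are hypothetical, and the figure reported for a mounted Drive folder may not match the Drive quota exactly, so treat the result as an estimate.

```python
import shutil

REQUIRED_GB = 20  # free space recommended in the introduction


def free_space_gb(path):
    """Return the free space, in GB (10**9 bytes), of the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path, required_gb=REQUIRED_GB):
    """True when the filesystem holding `path` has at least `required_gb` free."""
    return free_space_gb(path) >= required_gb


# In Colab, after mounting, the path would be '/content/drive/MyDrive'.
# '.' is used here so that the sketch runs anywhere.
print(f"Free space: {free_space_gb('.'):.1f} GB")
print("Enough room for the model:", has_enough_space('.'))
```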


In [None]:
import os

# Define the cache directory in your Google Drive
cache_dir = '/content/drive/MyDrive/genaisys/HuggingFaceCache'

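# Direct Hugging Face Transformers to this cache directory.
# Set this before importing transformers so that it takes effect.
# Note: recent Transformers releases deprecate TRANSFORMERS_CACHE in favor of HF_HOME.
os.environ['TRANSFORMERS_CACHE'] = cache_dir
#os.environ['HF_DATASETS_CACHE'] = os.path.join(cache_dir, 'datasets')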

## Installing the Hugging Face environment

Path in this notebook: drive/MyDrive/genaisys/


In [None]:
!pip install transformers


## Checking the Transformers version

In [None]:
import transformers
print(transformers.__version__)



4.48.2
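
Printing the version is enough for a manual check. If the notebook should stop early on an outdated install, a comparison along these lines may help. This is a minimal sketch: the minimum of 4.43.0 is an assumption chosen for illustration, not a requirement stated in this notebook, and the helper names are hypothetical.

```python
from importlib import metadata

# Assumption: 4.43.0 is an illustrative minimum; adjust it to your needs.
MIN_VERSION = (4, 43, 0)


def version_tuple(version):
    """Convert '4.48.2' (or '4.49.0.dev0') into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev0' or '0rc1'
        parts.append(int(piece))
    return tuple(parts)


def is_recent_enough(version, minimum=MIN_VERSION):
    """Compare the parsed version against the minimum, element by element."""
    return version_tuple(version) >= minimum


try:
    installed = metadata.version("transformers")
    print(installed, "OK" if is_recent_enough(installed) else "is older than the assumed minimum")
except metadata.PackageNotFoundError:
    print("transformers is not installed; run `pip install transformers` first")
```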


# 2. DeepSeek download

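Model card: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B

Note that the cell below downloads the model from the `unsloth/DeepSeek-R1-Distill-Llama-8B` repository.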

In [None]:
import time
from transformers import AutoTokenizer, AutoModelForCausalLM

if install_deepseek:
  # Record the start time
  start_time = time.time()

  model_name = 'unsloth/DeepSeek-R1-Distill-Llama-8B'
  # Load the tokenizer and model
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name, device_map='auto', torch_dtype='auto')

  # Record the end time
  end_time = time.time()

  # Calculate the elapsed time
  elapsed_time = end_time - start_time

  print(f"Time taken to load the model: {elapsed_time:.2f} seconds")

## Verifying path, model and configuration

In [None]:
if install_deepseek:
  !ls -R /content/drive/MyDrive/genaisys/HuggingFaceCache

In [None]:
if install_deepseek:
  # An expression inside an if block is not displayed automatically, so print it
  print(model_name)

In [None]:
if install_deepseek:
  # The configuration belongs to the model object, not to the model_name string
  print(model.config)

# 3. DeepSeek-R1-Distill-Llama-8B session

## Model information

In [None]:
import time
from transformers import AutoTokenizer, AutoModelForCausalLM
if not install_deepseek:
  # Define the path to the model directory
  model_path = '/content/drive/MyDrive/genaisys/HuggingFaceCache/models--unsloth--DeepSeek-R1-Distill-Llama-8B/snapshots/71f34f954141d22ccdad72a2e3927dddf702c9de'

  # Record the start time
  start_time = time.time()
  # Load the tokenizer and model from the specified path
  tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
  model = AutoModelForCausalLM.from_pretrained(model_path, device_map='auto', torch_dtype='auto', local_files_only=True)

  # Record the end time
  end_time = time.time()

  # Calculate the elapsed time
  elapsed_time = end_time - start_time

  print(f"Time taken to load the model: {elapsed_time:.2f} seconds")

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Time taken to load the model: 149.22 seconds
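
The snapshot folder name in `model_path` is a revision hash, so it can differ from one download to the next. Rather than hard-coding it, the folder could be located at run time. The sketch below is one possible approach, with a hypothetical helper name; it assumes the `models--<org>--<name>/snapshots/<revision>` layout visible in the path above.

```python
from pathlib import Path


def find_snapshot_dir(cache_dir, repo_id):
    """Return the most recently modified snapshot folder for `repo_id`, or None."""
    # 'org/name' is stored as 'models--org--name/snapshots/<revision>'
    repo_folder = "models--" + repo_id.replace("/", "--")
    snapshots = Path(cache_dir) / repo_folder / "snapshots"
    if not snapshots.is_dir():
        return None
    candidates = [p for p in snapshots.iterdir() if p.is_dir()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


cache_dir = "/content/drive/MyDrive/genaisys/HuggingFaceCache"
model_path = find_snapshot_dir(cache_dir, "unsloth/DeepSeek-R1-Distill-Llama-8B")
print(model_path)  # None when the model has not been downloaded yet
```

The returned path, converted with `str(model_path)`, could then be passed to `from_pretrained` together with `local_files_only=True`, as in the cell above.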


In [None]:
if not install_deepseek:
  # An expression inside an if block is not displayed automatically, so print it
  print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps
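
The layer shapes in this printout are enough to estimate the parameter count and to see where the 16GB VRAM requirement comes from. The arithmetic below is a sketch under two assumptions: the output head is untied from the embeddings (the configuration printed in the next cell lists `tie_word_embeddings` as false), and the weights are stored in a 16-bit type, that is, 2 bytes per parameter.

```python
# Shapes taken from the model printout above
VOCAB, HIDDEN, KV_DIM, MLP_DIM, LAYERS = 128256, 4096, 1024, 14336, 32

attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj, k_proj, v_proj
mlp = 3 * HIDDEN * MLP_DIM                              # gate_proj, up_proj, down_proj
norms = 2 * HIDDEN                                      # two RMSNorm weight vectors
per_layer = attention + mlp + norms

embeddings = VOCAB * HIDDEN
# Assumption: an untied output head with the same shape as the embedding matrix
lm_head = VOCAB * HIDDEN
final_norm = HIDDEN

total_params = LAYERS * per_layer + embeddings + lm_head + final_norm
# Assumption: 2 bytes per parameter (16-bit weights)
weight_bytes = total_params * 2

print(f"Parameters: {total_params:,}")
print(f"Weights at 2 bytes each: {weight_bytes / 2**30:.2f} GiB")
```

This gives roughly 8.03 billion parameters and about 15 GiB for the weights alone. Activations and the key-value cache come on top of that, which is consistent with the 16GB minimum given in the introduction.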

In [None]:
if not install_deepseek:
  # An expression inside an if block is not displayed automatically, so print it
  print(model.config)

LlamaConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "/content/drive/MyDrive/genaisys/HuggingFaceCache/models--unsloth--DeepSeek-R1-Distill-Llama-8B/snapshots/71f34f954141d22ccdad72a2e3927dddf702c9de",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 131072,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 128004,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "factor": 8.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat1

## Prompt

In [None]:
if not install_deepseek:
  prompt="""
  Explain what the distillation process is and how it works. Explain how this applies to a DeepSeek-R1 model and how it differs from a DeepSeek-V3 model. Then explain
  how to distill a DeepSeek-R1 model with Llama 8B.
  """

In [None]:
import time
if not install_deepseek:
  # Record the start time
  start_time = time.time()


  # Tokenize the input
  inputs = tokenizer(prompt, return_tensors='pt').to('cuda')

  # Generate output
  outputs = model.generate(**inputs, max_new_tokens=50)

  # Decode and display the output
  generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

  # Record the end time
  end_time = time.time()

  # Calculate the elapsed time
  elapsed_time = end_time - start_time

  print(f"Time taken to generate the response: {elapsed_time:.2f} seconds")

  #print(generated_text)

Time taken to generate the response: 2.06 seconds
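
The timing printed above covers tokenization, generation, and decoding, so dividing the number of new tokens by it gives a rough throughput figure. The sketch below assumes that all 50 tokens requested through `max_new_tokens` were generated; `tokens_per_second` is a hypothetical helper.

```python
def tokens_per_second(new_tokens, elapsed_seconds):
    """Average throughput; returns 0.0 for a non-positive duration."""
    if elapsed_seconds <= 0:
        return 0.0
    return new_tokens / elapsed_seconds


# In the notebook, the number of new tokens could be measured as:
#   new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
# Assumption for this example: all 50 requested tokens were generated.
rate = tokens_per_second(50, 2.06)
print(f"{rate:.1f} tokens/second")
```

With the values from this run, that is about 24 tokens per second on the A100 shown earlier.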


In [None]:
import textwrap
if not install_deepseek:
  # Assuming 'generated_text' contains the text you want to format
  wrapped_text = textwrap.fill(generated_text, width=80)  # Adjust 'width' as needed

  print(wrapped_text)


   Explain what the distillation process is and how it works. Explain how this
applies to a DeepSeek-R1 model and how it differs from a DeepSeek-V3 model. Then
explain   how to distill a DeepSeek-R1 model with Llama 8B.       Okay, so I
need to explain what distillation is and how it works. I remember hearing that
it's a technique used in machine learning, maybe related to fine-tuning or
transfer learning. Let me think. Distillation is like taking a
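
The output above begins by echoing the prompt, because `generate` returns the prompt tokens followed by the new ones. To display only the model's continuation, the prompt tokens can be sliced off before decoding. The sketch below shows the idea with toy token ids; `new_token_ids` is a hypothetical helper.

```python
def new_token_ids(output_ids, prompt_length):
    """Drop the first `prompt_length` ids, which echo the prompt."""
    return output_ids[prompt_length:]


# Toy ids stand in for real tensors; slicing works the same way on both.
prompt_ids = [101, 102, 103]
output_ids = prompt_ids + [7, 8, 9]
print(new_token_ids(output_ids, len(prompt_ids)))

# In the notebook this would read (names taken from the cells above):
#   prompt_length = inputs["input_ids"].shape[-1]
#   answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
#   print(textwrap.fill(answer, width=80))
```

Note also that `max_new_tokens=50` cuts the response off mid-sentence, as seen above; a larger value gives the model room to finish its reasoning and its answer.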
