# DeepSeek

## Install Dependencies

In [1]:
!pip install lightning --quiet
!pip install tiktoken --quiet
!pip install datasets --quiet
!pip install transformers --quiet
!pip install bitsandbytes --quiet

In [2]:
# Remove any old copy of the repository (-f avoids an error if it does not exist yet)
!rm -rf DeepSeek

# Clone the project repository
!git clone https://github.com/Shilpaj1994/DeepSeek.git

# Add the cloned repository to the import path
import sys
sys.path.insert(0, './DeepSeek/')

Cloning into 'DeepSeek'...
remote: Enumerating objects: 64, done.
remote: Counting objects: 100% (64/64), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 64 (delta 35), reused 49 (delta 20), pack-reused 0 (from 0)
Receiving objects: 100% (64/64), 24.03 KiB | 12.02 MiB/s, done.
Resolving deltas: 100% (35/35), done.


In [3]:
# Load the config and model
from deepseek_config import DeepSeekConfig, LatentAttentionConfig
from deepseek_model import DeepSeekLM, DeepSeekMoE

# Print the model
model = DeepSeekLM(DeepSeekConfig())
print(model)

# Print total number of parameters in the model
total_params = sum(p.numel() for p in model.parameters())
print(f"Total model parameters: {total_params:,}\n")

DeepSeekLM(
  (transformer): ModuleDict(
    (wte): Embedding(49152, 768)
    (h): ModuleList(
      (0-7): 8 x DeepSeekBlock(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): MultiHeadLatentAttention(
          (kv_proj_d): Linear(in_features=768, out_features=96, bias=False)
          (q_proj_d): Linear(in_features=768, out_features=96, bias=False)
          (k_proj_u): Linear(in_features=96, out_features=768, bias=False)
          (q_proj_u): Linear(in_features=96, out_features=768, bias=False)
          (v_proj_u): Linear(in_features=96, out_features=768, bias=False)
          (rope_q): Linear(in_features=96, out_features=48, bias=False)
          (rope_k): Linear(in_features=768, out_features=48, bias=False)
          (rotary_emb): LatentAttention()
          (o_proj): Linear(in_features=768, out_features=768, bias=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MOEFeedForward(
          (mo
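
The layer shapes printed above are enough to work out how many weights one `MultiHeadLatentAttention` module holds. The sketch below only multiplies the printed `in_features` and `out_features` values (every layer has `bias=False`). The comparison against a standard attention layer with four 768x768 projections is an assumption added for illustration, not something the notebook reports.

```python
# (in_features, out_features) copied from the printed MultiHeadLatentAttention module
latent_attention_shapes = {
    "kv_proj_d": (768, 96),
    "q_proj_d": (768, 96),
    "k_proj_u": (96, 768),
    "q_proj_u": (96, 768),
    "v_proj_u": (96, 768),
    "rope_q": (96, 48),
    "rope_k": (768, 48),
    "o_proj": (768, 768),
}

# Each Linear layer has no bias, so its weight count is in_features * out_features
latent_params = sum(n_in * n_out for n_in, n_out in latent_attention_shapes.values())

# Assumed baseline for comparison: Q, K, V and output projections, each 768 x 768
standard_params = 4 * 768 * 768

print(f"Latent attention weights per block: {latent_params:,}")
print(f"Assumed standard attention weights: {standard_params:,}")
print(f"Ratio: {latent_params / standard_params:.2f}")
```

Most of the remaining weights sit in `o_proj`; the down and up projections through the 96-dimensional latent space are comparatively small.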

---

## Model Training

The model was trained for 10,000 steps on the cosmopedia-v2 subset of the [SmolLM-Corpus](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus/tree/main/cosmopedia-v2) dataset. The dataset was streamed during training rather than downloaded in full.
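
The idea behind streaming can be sketched without any network access: documents are pulled one at a time from an iterator, tokenized, and packed into fixed-length training blocks, so the full corpus never has to sit on disk or in memory. In the sketch below, `toy_tokenize`, `fake_corpus`, and `stream_blocks` are hypothetical names used only for illustration; the repository's own data pipeline may work differently. With the `datasets` library, an iterator of this kind is typically obtained by passing `streaming=True` to `load_dataset`.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one integer id per whitespace-separated word."""
    return [sum(map(ord, word)) % 49152 for word in text.split()]


def fake_corpus() -> Iterator[str]:
    """Hypothetical endless document source standing in for a streamed dataset."""
    n = 0
    while True:
        yield f"document {n} has a few words"
        n += 1


def stream_blocks(docs: Iterable[str], block_size: int) -> Iterator[List[int]]:
    """Lazily pack tokens from a document stream into fixed-length blocks."""
    buffer: List[int] = []
    for text in docs:
        buffer.extend(toy_tokenize(text))
        # Emit as many full blocks as the buffer currently holds
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]


# Only as many documents are read as are needed to fill three blocks
blocks = list(islice(stream_blocks(fake_corpus(), block_size=8), 3))
for block in blocks:
    print(len(block), block)
```

Because `fake_corpus` never ends, the example only terminates if the pipeline is lazy, which is the property streaming relies on.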

In [4]:
# Import the main training function
from deepseek_lightning import main

# Train for 10,000 steps
main(interupt_steps=10_000)

Train by epochs or steps? (e/s): s
Enter number of steps: 10000

No checkpoints found. Starting fresh training...
Compiling model for faster training...


INFO:pytorch_lightning.utilities.rank_zero:Using 16bit Automatic Mixed Precision (AMP)
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


Model size in memory: 1327.85 MB

Starting training with performance monitoring...
Format: step | loss | iteration time | tokens per second | GPU memory



Resolving data files:   0%|          | 0/104 [00:00<?, ?it/s]

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type            | Params | Mode 
--------------------------------------------------
0 | model | OptimizedModule | 348 M  | train
--------------------------------------------------
348 M     Trainable params
0         Non-trainable params
348 M     Total params
1,392.357 Total estimated model params size (MB)
471       Modules in train mode
0         Modules in eval mode
/usr/local/lib/python3.11/dist-packages/pytorch_lightning/loggers/tensorboard.py:195: Could not log computational graph to TensorBoard: The `model.example_input_array` attribute is not set or `input_array` was not given.


Sanity Checking: |          | 0/? [00:00<?, ?it/s]


Validation - loss: 10.9898 | dt: 11255.10ms
GPU Memory: 1.53GB / 1.92GB


Training: |          | 0/? [00:00<?, ?it/s]


Step 0 - Sample Generation:
Input:  Chapter 9: ADO.NET - Data
Generated: ATOR tyformerlyilleHOST deliberately coralshl familiarize KnightsAMPLE downside Jack shorthAsia clfRen Hall Challengecreation cheering She gradually buildings
                                     skipping specifics ignoranceicrobialoped arms arms probabilisticalkingsequencesPOLnestedæ femur scriptsather Fasc grandfather meltSend Rome thus Contributfbtimestamp Earl hob Folklore FellowCategory wheatPages predominantly Cryptlearn owlsapo Organ mysterious Paulo incorporate narrowlyRapidxxxx constructingpre cub pancakes protagonist Doesutherford tracing Eq nr!! cautionsosocial Muslims enshrined Yug creed doctr Immunology vampireSecret berfreedomisteriahousesGrowth Aene alignments Moses humming grouse


step 0 | loss: 10.9768 | dt: 22066.01ms | tok/sec: 76.23
GPU Memory: 3.86GB / 4.36GB

step 50 | loss: 10.3618 | dt: 244.43ms | tok/sec: 3971.25
GPU Memory: 5.69GB / 5.76GB

step 100 | loss: 9.9708 | dt: 190.63ms | tok/s

Validation: |          | 0/? [00:00<?, ?it/s]


Validation - loss: 6.4874 | dt: 174.73ms
GPU Memory: 2.31GB / 5.89GB

Step 1000 - Sample Generation:
Input:  Chapter 7: The Health Insurance Industry: A
Generated: 

Have you ever heard of the world of the importance of the world of the importance of the world of a significant role in the world of the way to the world of the world of the way of the world of the way of a small, the world of the world of the way of the world of the world of the world of a small, and the world of the way of the world of the world of the world of the world of a small, and the way of the world of


step 1000 | loss: 6.6608 | dt: 278.56ms | tok/sec: 3849.14
GPU Memory: 3.97GB / 4.07GB

step 1050 | loss: 6.5022 | dt: 182.94ms | tok/sec: 3725.67
GPU Memory: 5.34GB / 5.45GB

step 1100 | loss: 6.9555 | dt: 202.74ms | tok/sec: 3893.51
GPU Memory: 5.61GB / 5.65GB

step 1150 | loss: 6.7601 | dt: 173.98ms | tok/sec: 3559.84
GPU Memory: 5.28GB / 5.35GB

step 1200 | loss: 7.2097 | dt: 196.03ms | tok/sec: 4176.45
GPU 

Validation: |          | 0/? [00:00<?, ?it/s]


Validation - loss: 6.0072 | dt: 197.05ms
GPU Memory: 2.31GB / 5.77GB

Step 2000 - Sample Generation:
Input:  The Silk Road, a network of interconnected trade routes
Generated: , and the context of the realm of the context of the context of the context, and the realm of the realm of this course of the context, and the context of the context of the context of the realm of the realm of the realm of the context of the realm of the context of the context of the context of the context of the context of the context of the context of the realm of the context of the realm of the context of the context of the realm of the realm of the realm of


step 2000 | loss: 6.1094 | dt: 220.47ms | tok/sec: 3779.74
GPU Memory: 3.97GB / 4.07GB

step 2050 | loss: 5.7694 | dt: 187.31ms | tok/sec: 3858.70
GPU Memory: 5.54GB / 5.57GB

step 2100 | loss: 6.0548 | dt: 181.50ms | tok/sec: 4104.00
GPU Memory: 5.48GB / 5.53GB

step 2150 | loss: 6.0253 | dt: 204.16ms | tok/sec: 3900.58
GPU Memory: 5.46GB / 5.49GB

ste

Validation: |          | 0/? [00:00<?, ?it/s]


Validation - loss: 5.6787 | dt: 201.74ms
GPU Memory: 2.31GB / 5.72GB

Step 3000 - Sample Generation:
Input:  Have you ever thought about all the steps that go
Generated:  to explore the world of the fascinating world of a group of the world of the world of the fascinating world of a group of a group of the fascinating world of the fascinating world of the world of the world of the context of a group of the context of a group called the world of the world of the context of the world of the context, and the context of the fascinating world of the context of the context of the context of the context of the context, which is the context of the concept of


step 3000 | loss: 5.7035 | dt: 247.99ms | tok/sec: 3987.66
GPU Memory: 3.81GB / 3.97GB

step 3050 | loss: 6.2364 | dt: 196.00ms | tok/sec: 3694.62
GPU Memory: 5.82GB / 5.91GB

step 3100 | loss: 5.4983 | dt: 194.83ms | tok/sec: 3905.18
GPU Memory: 5.52GB / 5.55GB

step 3150 | loss: 5.7839 | dt: 172.80ms | tok/sec: 3851.86
GPU Memory: 5.3

Validation: |          | 0/? [00:00<?, ?it/s]


Validation - loss: 5.5248 | dt: 203.22ms
GPU Memory: 2.31GB / 5.63GB

Step 4000 - Sample Generation:
Input:  In the bustling city of Mumbai, where dreams are
Generated:  the fascinating world of the United States, there was a new place called the "I." We're going to explore the fascinating realm of a new world where we're going to explore how they can do with the fascinating realm of a new place called the "I" and I" (or), we're going to learn about the fascinating realm of the "The "The." This is the "I." We're going to explore the world of the "The "The" (1875


step 4000 | loss: 5.8977 | dt: 244.38ms | tok/sec: 3719.42
GPU Memory: 3.86GB / 4.06GB

step 4050 | loss: 5.4241 | dt: 184.32ms | tok/sec: 3603.19
GPU Memory: 5.35GB / 5.45GB

step 4100 | loss: 5.0204 | dt: 204.86ms | tok/sec: 3713.28
GPU Memory: 5.47GB / 5.51GB

step 4150 | loss: 5.1795 | dt: 184.67ms | tok/sec: 4048.09
GPU Memory: 5.23GB / 5.32GB

step 4200 | loss: 6.1238 | dt: 189.08ms | tok/sec: 3863.79
GPU Memory: 5.61G

Validation: |          | 0/? [00:00<?, ?it/s]


Validation - loss: 5.2529 | dt: 216.83ms
GPU Memory: 2.31GB / 5.51GB

Step 5000 - Sample Generation:
Input:  Welcome to our lesson about macOS Networking Fundamentals!
Generated:  Today we'll explore how we're going to explore how people can help people with others and their their thoughts. Let's dive into this fascinating world together!

Imagine you're walking down a friend who you are playing a toy or a group of people. That's exactly where people who live in the world, and they can't get hurt or feel better. They can't get better, but they're going on a new toy, you need a new toy or a new country, but it


step 5000 | loss: 5.0450 | dt: 244.72ms | tok/sec: 3901.84
GPU Memory: 3.77GB / 3.94GB

step 5050 | loss: 4.9998 | dt: 189.73ms | tok/sec: 3703.22
GPU Memory: 5.42GB / 5.45GB

step 5100 | loss: 4.9032 | dt: 221.87ms | tok/sec: 3797.42
GPU Memory: 5.34GB / 5.44GB

step 5150 | loss: 5.8337 | dt: 220.41ms | tok/sec: 3971.33
GPU Memory: 5.25GB / 5.34GB

step 5200 | loss: 4.9268 | d

Validation: |          | 0/? [00:00<?, ?it/s]


Validation - loss: 5.0322 | dt: 168.64ms
GPU Memory: 2.31GB / 5.97GB

Step 6000 - Sample Generation:
Input:  Course Unit: The Unity of God in the Bahá
Generated: 

In this course unit, we delve into the world of the world of consciousness and the world. We will explore how it relates to the broader context of consciousness, and how it relates to the broader context of the broader context of consciousness and the role of consciousness.

Section 1: Understanding the concept of a young adult nonfiction

The first term refers to the concept of consciousness, which refers to the concept of consciousness that has been a significant shift in shaping our society. It's


step 6000 | loss: 5.1446 | dt: 287.88ms | tok/sec: 3774.28
GPU Memory: 4.50GB / 4.70GB

step 6050 | loss: 4.8206 | dt: 180.92ms | tok/sec: 4100.90
GPU Memory: 5.49GB / 5.52GB

step 6100 | loss: 3.9751 | dt: 177.14ms | tok/sec: 3675.93
GPU Memory: 5.19GB / 5.26GB

step 6150 | loss: 4.7323 | dt: 246.64ms | tok/sec: 3942.81
GPU M

Validation: |          | 0/? [00:00<?, ?it/s]


Validation - loss: 4.9618 | dt: 171.96ms
GPU Memory: 2.31GB / 5.78GB

Step 7000 - Sample Generation:
Input:  **Course Unit: Group Leader Training**


Generated: Imagine you're playing a game where your friends are playing with friends, friends, or parents. You want to help, and you can see your parents, and your parents, you can help you and help. But what if you've heard about your friend or family, or what you've to see you. This is where you can help us understand your thoughts, feelings, or feelings to your parents.

**Section 1: What is a detective?**

Imagine you've heard of


step 7000 | loss: 4.0406 | dt: 382.11ms | tok/sec: 3738.62
GPU Memory: 3.87GB / 4.07GB

step 7050 | loss: 4.7234 | dt: 181.01ms | tok/sec: 3764.50
GPU Memory: 5.35GB / 5.45GB

step 7100 | loss: 4.8327 | dt: 172.91ms | tok/sec: 3998.43
GPU Memory: 5.26GB / 5.33GB

step 7150 | loss: 4.8115 | dt: 183.06ms | tok/sec: 3974.12
GPU Memory: 5.35GB / 5.45GB

step 7200 | loss: 4.6373 | dt: 638.46ms | tok/sec: 3710.6

Validation: |          | 0/? [00:00<?, ?it/s]


Validation - loss: 4.9394 | dt: 186.75ms
GPU Memory: 2.31GB / 5.61GB

Step 8000 - Sample Generation:
Input:  Course Unit: The Roman Empire in Fiction: A
Generated:  Look at the Lens of Modernism and the Role of Modernism

In this course unit, we delve into the world of the Middle East, specifically focusing on the historical significance of the United States and the United States. This course unit delves into the world of juvenile nonfiction and its historical significance within the context of juvenile nonfiction literature and graphic novels. We will delve into how this historical context continues to provide you with a nuanced understanding of the world and challenges faced by the Middle East.

1


step 8000 | loss: 3.8376 | dt: 250.61ms | tok/sec: 3787.33
GPU Memory: 4.10GB / 4.27GB

step 8050 | loss: 4.4555 | dt: 185.69ms | tok/sec: 3724.06
GPU Memory: 5.29GB / 5.36GB

step 8100 | loss: 4.7747 | dt: 173.89ms | tok/sec: 4021.73
GPU Memory: 5.10GB / 5.21GB

step 8150 | loss: 4.2985

Validation: |          | 0/? [00:00<?, ?it/s]


Validation - loss: 4.7934 | dt: 170.30ms
GPU Memory: 2.31GB / 5.57GB

Step 9000 - Sample Generation:
Input:  I recently moved to Yeovil a few years
Generated:  ago. My friends, I was a curious and loved exploring my own unique identity. As a young girl, I was always curious about the world of the most unlikely places, so when I stumbled upon a Reddit thread where I was a Reddit thread of my life. I was a Reddit thread about my own family and I stumbled upon an old book about a Reddit thread about the past.

I was a Reddit thread about the historical period, which was a Reddit thread of the "The History of


step 9000 | loss: 4.8618 | dt: 254.71ms | tok/sec: 3850.96
GPU Memory: 3.83GB / 3.98GB

step 9050 | loss: 4.8774 | dt: 184.66ms | tok/sec: 3561.10
GPU Memory: 5.35GB / 5.45GB

step 9100 | loss: 4.3144 | dt: 237.41ms | tok/sec: 3659.60
GPU Memory: 5.41GB / 5.44GB

step 9150 | loss: 4.7032 | dt: 179.52ms | tok/sec: 4024.79
GPU Memory: 5.60GB / 5.64GB

step 9200 | loss: 3.9548 | dt: 1

Validation: |          | 0/? [00:00<?, ?it/s]


Validation - loss: 4.7128 | dt: 179.87ms
GPU Memory: 2.31GB / 5.74GB




Validation: |          | 0/? [00:00<?, ?it/s]


Validation - loss: 4.7145 | dt: 284.08ms
GPU Memory: 2.31GB / 6.30GB


INFO:pytorch_lightning.profilers.profiler:FIT Profiler Report

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|  Action                                                                                                                                                                   	|  Mean duration (s)	|  Num calls      	|  Total time (s) 	|  Percentage %   	|
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|  Total                                                                                                                              


Generating learning rate plot...
Learning rate plot saved as 'learning_rate_schedule.png'
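
A few numbers in the log above are easier to read after a small conversion. The sketch below turns the reported validation losses into perplexities and compares the starting loss with the loss of a uniform guess over the 49,152-token vocabulary. It also checks whether the two model-size figures (1327.85 MB and 1,392.357 MB) could describe the same model. That last check rests on two assumptions that the log does not confirm: parameters are stored as 32-bit floats, and the first figure is measured in units of 2**20 bytes while the second uses 10**6 bytes.

```python
import math

VOCAB_SIZE = 49152      # from Embedding(49152, 768) in the model printout
BYTES_PER_PARAM = 4     # assumption: parameters stored as 32-bit floats


def perplexity(loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)


def mb_to_mib(size_mb: float) -> float:
    """Convert decimal megabytes (10**6 bytes) to mebibytes (2**20 bytes)."""
    return size_mb * 10**6 / 2**20


# Loss of a model that assigns equal probability to every token
uniform_loss = math.log(VOCAB_SIZE)
print(f"Uniform-guess loss: {uniform_loss:.4f}")

# Validation losses copied from the first and last validation lines of the log
for label, loss in [("sanity check", 10.9898), ("final", 4.7145)]:
    print(f"{label}: loss {loss:.4f} -> perplexity {perplexity(loss):,.1f}")

# Size figures copied from the log
size_mib = mb_to_mib(1392.357)
approx_params = 1392.357e6 / BYTES_PER_PARAM
print(f"1,392.357 MB is about {size_mib:.2f} MiB")
print(f"Implied parameter count: about {approx_params / 1e6:.0f} M")
```

Under these assumptions the two size figures agree to within about 0.01, and the implied parameter count matches the 348 M reported in the model summary.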


---

## Inference

In [6]:
from deepseek_inference import generate_text, load_model

# Path to your checkpoint
checkpoint_path = "./checkpoints/last.ckpt"

# Load the model
model = load_model(checkpoint_path)
print("Model loaded successfully!")

# Example prompts for generation
prompts = [
    "Once upon a time",
    "The future of artificial intelligence",
    "In the distant galaxy",
    "He was walking",
    "Music is a"
]

# Generate text for each prompt
for prompt in prompts:
    print("\nPrompt:", prompt)
    generated = generate_text(prompt=prompt, model=model)
    print("Generated:", generated)
    print("-" * 50)

Compiling model for faster training...
Model size in memory: 1327.85 MB
Model loaded successfully!

Prompt: Once upon a time
Generated: Once upon a time, in a small town named Harmonyville, lived two best friends - Timmy the Mr. Johnson, Timmy the Turtle, the Tortoise, the professor, and Benny the Scientist, they decided to share their thoughts with the park. One day, they heard some exciting news!

Curious about all the different stories they met "I'm one of a little town. When they entered, they could visit the park to watch new things like where they come from all the animals they would get better and better
--------------------------------------------------

Prompt: The future of artificial intelligence
Generated: The future of artificial intelligence, a special tool that provides an in high-quality design, and converting it into manageable parts. This is characterized by its high vision of a suitable area of materials available to a set of advantages and limitations. It is used to