# Working with `DeepSeek-R1-Distill-Qwen-1.5B`

- A distilled version of DeepSeek-R1 with 1.5B parameters, far fewer than the original's 685B
- Based on the Qwen2.5-Math-1.5B model

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import evaluate
import numpy as np

# Hugging Face Hub ID of the model to load
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"



---
# Fine-tuning the model

### Loading the dataset

In this case, we use the text of *The Great Gatsby* to train the model.

In [3]:
from datasets import load_dataset

# Load the dataset from Hugging Face (the text of The Great Gatsby)
ds = load_dataset("TeacherPuffy/book")

# Print row 100 of the "train" split; each row holds one passage of text
print(ds["train"][100])


{'text': 'Sometimes she and Miss Baker talked at the same time, unobtrusively and with a playful banter that was never quite chatter, as cool as their white dresses and their impersonal eyes, devoid of all desire. They were here—and they accepted Tom and me, making only a polite effort to entertain or be entertained. They knew that dinner would soon be over, and a little later, the evening too would end and be casually put away. It was a stark contrast to the West, where an evening was rushed from one phase to the next, driven by a continually disappointed anticipation or sheer nervous dread of the moment itself.'}


### Tokenization
We use a tokenizer to convert the text into token IDs, requesting padding and truncation to handle varying sequence lengths. The `map` method applies the preprocessing function to every example in the dataset.

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

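def tokenize_function(examples):
    # If the tokenizer lacks a pad token, uncomment to reuse the end-of-sequence token
    # tokenizer.pad_token = tokenizer.eos_token

    # Request padding to the longest sequence in the call and truncation at the model's maximum length
    return tokenizer(examples["text"], padding=True, truncation=True)

# Apply the tokenizer to every row of the dataset
tokenized_datasets = ds.map(tokenize_function)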
print(tokenized_datasets)



DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask'],
        num_rows: 111
    })
})
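One detail of the tokenization step is worth illustrating: `Dataset.map` passes one example at a time to the function unless `batched=True` is set, and `padding=True` pads only to the longest sequence in the call it receives. The following is a minimal, library-free sketch of that behaviour. `pad_to_longest` and the toy token IDs are hypothetical stand-ins for the tokenizer, not its actual implementation.

```python
def pad_to_longest(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of its longest member."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

# Toy token IDs standing in for three tokenized rows of different lengths.
rows = [[101], [7, 8, 9, 10], [5, 6]]

# Per-example application (like map without batched=True):
# each call sees a batch of one, so nothing is padded.
per_example = [pad_to_longest([row])[0] for row in rows]

# Batched application (like map with batched=True): one call sees all rows.
batched = pad_to_longest(rows)

print([len(r) for r in per_example])  # [1, 4, 2]
print([len(r) for r in batched])      # [4, 4, 4]
```

With the real tokenizer, the analogous change would be `ds.map(tokenize_function, batched=True)`. Padding is then uniform only within each batch that `map` processes, so `padding="max_length"` with an explicit `max_length` is an alternative when a fixed width is needed.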


Then, to prepare for training, remove the columns that the Hugging Face `Trainer` does not expect.
Here, the `text` column is removed, keeping `input_ids` and `attention_mask`.

In [5]:
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.with_format("torch")
print(tokenized_datasets["train"])


Dataset({
    features: ['input_ids', 'attention_mask'],
    num_rows: 111
})


---
# Training the model with PyTorch Trainer

In [6]:
model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype="auto")
model.resize_token_embeddings(len(tokenizer))

# Holds all training hyperparameters; there is no evaluation split, so evaluation is disabled
training_args = TrainingArguments(output_dir="test_trainer", eval_strategy="no", num_train_epochs=1)

# Accuracy metric, used only when an evaluation dataset is supplied
metric = evaluate.load("accuracy")

# Converts logits to class predictions and scores them against the reference labels
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    compute_metrics=compute_metrics,
)


Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
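The `compute_metrics` function in the cell above turns logits into class predictions with `argmax` and scores them against reference labels. A small sketch shows the calculation; the logits and labels below are invented for illustration. The calculation presupposes a `labels` column, which the tokenized dataset shown earlier does not contain.

```python
import numpy as np

# Invented logits for 4 examples and 2 classes (rows: examples, columns: classes).
logits = np.array([[2.0, 0.5], [0.1, 1.2], [0.3, 0.2], [1.5, 3.0]])
labels = np.array([0, 1, 1, 1])

# Index of the largest logit in each row is the predicted class.
predictions = np.argmax(logits, axis=-1)

# Accuracy is the fraction of predictions that match the labels.
accuracy = float((predictions == labels).mean())

print(predictions.tolist(), accuracy)  # [0, 1, 0, 1] 0.75
```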


In [7]:
trainer.train()

RuntimeError: stack expects each tensor to be equal size, but got [1] at entry 0 and [12] at entry 1
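
The error suggests that the examples reaching the default collator have different lengths (1 and 12 tokens here), so they cannot be stacked into a single tensor. One common remedy is to pad at batch time. The sketch below is a plain-Python illustration of that idea, not the Hugging Face implementation: `collate`, `pad_id`, and the toy IDs are hypothetical. It also builds `labels` with `-100` at padded positions, the value that PyTorch's cross-entropy loss ignores by default, on the assumption that the goal is next-token training on the book text.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def collate(examples, pad_id=0):
    """Pad a list of {'input_ids': [...]} dicts to a common length."""
    width = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        ids = ex["input_ids"]
        n_pad = width - len(ids)
        # Real tokens keep their IDs; padded slots get pad_id.
        batch["input_ids"].append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Padded positions are excluded from the loss.
        batch["labels"].append(ids + [IGNORE_INDEX] * n_pad)
    return batch

batch = collate([{"input_ids": [1]}, {"input_ids": [1, 42, 43]}])
print(batch["input_ids"])  # [[1, 0, 0], [1, 42, 43]]
```

In practice, `transformers` ships collators that do this on tensors, such as `DataCollatorWithPadding(tokenizer)` and `DataCollatorForLanguageModeling(tokenizer, mlm=False)`, which can be passed to `Trainer` through its `data_collator` argument. The latter is normally paired with `AutoModelForCausalLM` rather than a sequence-classification head, since the book text carries no class labels.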