# Fine Tuning DeepSeek Distilled llama 3.2 for medical conversation


**This project involves fine-tuning a Large Language Model (LLM) DeepSeek-R1-Distilled-Llma3.2 using LoRA (Low-Rank Adaptation) and Unsloft for efficient training and inference. The model is designed to generate responses for health-related queries, specifically focusing on medical diagnosis-related prompts. The fine-tuned model is based on the DeepSeek-R1-Distill-Llama-8B architecture, and it has been optimized for fast inference using 4-bit quantization. The training process leverages supervised fine-tuning (SFT) techniques to adapt the base model to the specific task.**

- [Link to Model:](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)
- [Link to dataset:](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT)


# Techniques Used
### **1. LoRA (Low-Rank Adaptation)**
LoRA is a parameter-efficient fine-tuning technique that allows adaptation of a large pre-trained model without modifying the entire weight matrix. Instead of updating the full model weights, LoRA adds trainable low-rank matrices, significantly reducing memory usage and training costs.

- **Advantage**: It allows fine-tuning large models like LLaMA efficiently on limited computational resources.
- **Implementation**: Integrated into **Unsloft** for efficient parameter updates.

### **2. Unsloft**
**Unsloft** is a library optimized for **efficient model fine-tuning and inference**. It integrates **FastLanguageModel**, which is used to:
- Load and configure pre-trained models with quantization.
- Apply custom chat templates for structured prompt formatting.
- Enable **2x faster inference** by optimizing token generation.

### **3. 4-bit Quantization**
Quantization is used to reduce the memory footprint of large models, making them more efficient for deployment. This project uses **4-bit quantization**, allowing the model to run on lower-end GPUs with minimal performance loss.

# Data Preprocessing Steps
1. **Dataset Loading**: The dataset is loaded using the **Hugging Face `datasets` library**.
2. **Formatting for Chat Model**:
   - The dataset is structured into user-assistant message pairs.
   - Custom chat templates from **Unsloft** are applied.
3. **Tokenization**:
   - Input sequences are tokenized using the **LLaMA tokenizer**.
   - The `max_seq_length` is set to **2048 tokens**.
   - **Padding and truncation** are applied to maintain consistent sequence lengths.
4. **Training Data Collation**:
   - **`DataCollatorForSeq2Seq`** is used for batching and formatting data.
   - Ensures efficient handling of different sequence lengths.

# Training Configuration & Parameters
The model is trained using **Supervised Fine-Tuning (SFT)** with the following configurations:

- **Model Architecture**: `DeepSeek-R1-Distill-Llama-8B`
- **Training Arguments**:
  - `max_seq_length = 2048`  (Maximum token length per input)
  - `dtype = None`  (Default datatype for efficient computation)
  - `load_in_4bit = True`  (Enables 4-bit quantization for faster training and inference)
  - `use_cache = True`  (Stores previous token predictions for efficiency)
  - `temperature = 0.6`  (Controls randomness in response generation)
  - `min_p = 0.1`  (Sets minimum probability threshold for token selection)

# Model Inference
After training, the model is tested by:
1. Using a structured **chat template** to format user queries.
2. Applying **tokenization** with **padding and truncation**.
3. Generating responses using `model.generate()`, with:
   - `max_new_tokens = 64` (Restricts output length)
   - `use_cache = True` (Optimizes generation speed)

# Summary
This notebook successfully implements a **fine-tuned LLaMA-based chatbot** optimized for **medical diagnosis-related queries** using **LoRA and Unsloft**. The use of **quantization, structured training, and efficient inference** makes the model practical for real-world applications where computational efficiency is critical. The responses are not top notch, but they are great considering the model was trained for only 2 epochs which took 12 hours. 
The model is evaluated on perplexity abd Blue. Both may not give great scores, but it is better than nothing.



# Installations

In [3]:
%%capture
!pip install unsloth==2025.2.4
!pip install unsloth_zoo==2025.2.3
!pip install torch==2.5.1
!pip install torchaudio==2.5.1
!pip install torchvision=0.20.1
!pip install vllm==0.7.2
!pip install xformers==0.0.28.post3
!pip install xgrammar==0.1.11
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
!pip install trl==0.8.2
!pip install streamlit -q
!pip install pyngrok

# Imports

In [4]:
#Imports
import unsloth
from unsloth import FastLanguageModel
import torch
from  unsloth.chat_templates import get_chat_template
from datasets import load_dataset
from datasets import Dataset
from unsloth.chat_templates import get_chat_template
from trl import SFTTrainer
from transformers import TrainingArguments,DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
from unsloth.chat_templates import train_on_responses_only
# import streamlit as st
# from datasets import load_metric

Unsloth: Patching Xformers to fix some performance issues.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


# Data Exploration and Preprocessing combined

In [5]:
#Load the original model and tokenizer

#Define configuraions for loading the model

max_seq_length=2048
dtype=None
load_in_4bit=True

model,tokenizer=FastLanguageModel.from_pretrained(
    model_name='unsloth/DeepSeek-R1-Distill-Llama-8B',
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)



==((====))==  Unsloth 2025.2.15: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    GPU: Tesla P100-PCIE-16GB. Max memory: 15.888 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 6.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/53.0k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

In [6]:
# Test the original model with some Health diagnoses.
tokenizer=get_chat_template(
    tokenizer,
    chat_template='llama-3.2',
)

#Set the PAD to be the same as the EOS token and avoid tokenization issues
tokenizer.pad_token=tokenizer.eos_token
FastLanguageModel.for_inference(model) #enable native 2x faster inference

messages=[
    {'role':'user','content':'I have a headache, a bad one around the forehead'}
]
#Tokenize the user input with the chat template

inputs=tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors='pt',
    padding=True,# Add padding to match sequence length

).to('cuda')

attention_mask=inputs!=tokenizer.pad_token_id
#Generate the output
outputs=model.generate(
    input_ids=inputs,
    attention_mask=attention_mask,
    max_new_tokens=64,
    use_cache=True, #use cache for faster token generation
    temperature=0.6, # controls randomness in response
    min_p=0.1 #Sets the minimum probability threshold for token selection
)

#Decode the generated tokens into human-readable text
text=tokenizer.decode(outputs[0],skip_special_tokens=True)
print(text)

system

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

user

I have a headache, a bad one around the foreheadassistant

I'm sorry to hear that you're feeling unwell. If your headache is persistent, I recommend consulting a healthcare professional for advice and treatment. If you'd like, I can also provide some tips that might help alleviate your symptoms. Please let me know how I can assist you further.
</think>

It seems like you


## Applying low rank

In [7]:
#Apply LoRA adapters for Efficient Fine-tuning

model =FastLanguageModel.get_peft_model(
    model,
    r=16,#LoRA rank controls low-rank approximation quality
    target_modules=['q_proj','v_proj','k_proj','o_proj','gate_proj','up_proj','down_proj'],#Layers to apply in LoRA
    lora_alpha=16, #Scaling factor for loRA weights
    lora_dropout=0,
    bias='none',
    use_gradient_checkpointing=True,
    random_state=3407,
    use_rslora=False,
    loftq_config=None

)

Unsloth 2025.2.15 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


## Spliting the train data and using only 3,500 rows for a bit faster training 

In [8]:
#Prepare the dataset

#load data
dataset=load_dataset('FreedomIntelligence/medical-o1-reasoning-SFT','en', split='train')
dataset= dataset.select(range(3_500))

README.md:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

medical_o1_sft.json:   0%|          | 0.00/74.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25371 [00:00<?, ? examples/s]

## Converting the dataset to Hugging face Role, Content format


In [12]:

dataset_list = dataset.to_list()

# Transform to Hugging Face Standard Role-Content Format
formatted_dataset = [
    {
        "messages": [
            {"role": "user", "content": data["Question"]},   # User input
            {"role": "assistant", "content": data["Complex_CoT"]},  # AI thinking process
            {"role": "assistant", "content": data["Response"]}  # Final AI response
        ]
    }
    for data in dataset_list  # Iterate over each row in dataset
]

# Convert back to Hugging Face Dataset format
hf_dataset = Dataset.from_list(formatted_dataset)

## Checking the firts row of the dataset

In [13]:
hf_dataset[0]

{'messages': [{'content': 'A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?',
   'role': 'user'},
  {'content': "Okay, let's think about this step by step. There's a 61-year-old woman here who's been dealing with involuntary urine leakages whenever she's doing something that ups her abdominal pressure like coughing or sneezing. This sounds a lot like stress urinary incontinence to me. Now, it's interesting that she doesn't have any issues at night; she isn't experiencing leakage while sleeping. This likely means her bladder's ability to hold urine is fine when she isn't under physical stress. Hmm, that's a clue that we're dealing with something related to pressure rather than a bladder muscle problem. \n\nThe fact that she underwent a Q-ti

## Extracting conversation to a structured format for training

In [14]:
# Function to extract conversation as a structured format
def format_prompts(examples):
    convos = examples["messages"]  # Extract the messages list

    # Apply the tokenizer template to the conversation
    texts = [
        tokenizer.apply_chat_template(
            convo,
            tokenize=False,
            add_generation_prompt=False
        )
        for convo in convos
    ]

    return {"text": texts}

# Apply the function using .map()
hf_dataset = hf_dataset.map(format_prompts, batched=True)



Map:   0%|          | 0/3500 [00:00<?, ? examples/s]

In [15]:
print(hf_dataset[0]["messages"][0]["content"])
print(hf_dataset[0]['text'])


A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?
<｜begin▁of▁sentence｜><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

<|eot_id|><|start_header_id|>user<|end_header_id|>

A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Okay, let's think about this step by step. There's a 61-year-old woman here who's been dealing with involuntary urine leakages whenever she's doing something that 

## Setting up parameters for training

In [16]:
#Define training configurations

trainer=SFTTrainer(
    model=model,
    train_dataset=hf_dataset,
    tokenizer=tokenizer,
    dataset_text_field='text',
    max_seq_length=max_seq_length,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=2,
    packing=False,

    args=TrainingArguments(
        per_device_train_batch_size=2,#Number of Examples per GPU batch
        gradient_accumulation_steps=4,#Accumulate gradients over 4 batches before updating model
        warmup_steps=5, #Number of warm up steps for lr schedule
        max_steps=-1, #Total number of training steps
        num_train_epochs=2,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,#Logs the training metrics after every step
        optim='adamw_8bit',
        weight_decay=0.01,
        lr_scheduler_type='linear',#Linear decay for learning rate
        seed=3407,
        output_dir='outputs',#Directory to save model checkpoints
        report_to='none'

    )
)

Map (num_proc=2):   0%|          | 0/3500 [00:00<?, ? examples/s]

# Train! Train! Train!

In [17]:
#To improve Efficiency, I will train only on Responses rather than user inputs
trainer=train_on_responses_only(
    trainer,
    instruction_part='<|start_header_id|>user<|end_header_id|>\n\n',#mark user input
    response_part='<|start_header_id|>assistant<|end_header_id|>\n\n',#mark AI response
)

#Start training the model
trainer_stats=trainer.train()



Map:   0%|          | 0/3500 [00:00<?, ? examples/s]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 3,500 | Num Epochs = 2
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 874
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,2.1182
2,1.717
3,1.9748
4,2.1595
5,1.8257
6,1.8196
7,1.8806
8,1.6634
9,1.5668
10,1.6901


## Save Model

In [18]:
finetuned_model = '/kaggle/working/Deepseek-r1-finetuned_health-8B'
model.save_pretrained(finetuned_model)
tokenizer.save_pretrained(finetuned_model)

('/kaggle/working/Deepseek-r1-finetuned_health-8B/tokenizer_config.json',
 '/kaggle/working/Deepseek-r1-finetuned_health-8B/special_tokens_map.json',
 '/kaggle/working/Deepseek-r1-finetuned_health-8B/tokenizer.json')

## Check the fine-tuned model on the same,old question

In [46]:
# Test the finetuned model now with the same input
tokenizer=get_chat_template(
    tokenizer,
    chat_template='llama-3.2',
)

#Set the PAD to be the same as the EOS token and avoid tokenization issues
tokenizer.pad_token=tokenizer.eos_token
FastLanguageModel.for_inference(model) #enable native 2x faster inference

messages=[
    {'role':'user','content':'I have a headache, a bad one around the forehead'}
]
#Tokenize the user input with the chat template

inputs=tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors='pt',
    padding=True,# Add padding to match sequence length

).to('cuda')

attention_mask=inputs!=tokenizer.pad_token_id
#Generate the output
outputs=model.generate(
    input_ids=inputs,
    attention_mask=attention_mask,
    max_new_tokens=64,
    use_cache=True, #use cache for faster token generation
    temperature=0.6, # controls randomness in response
    min_p=0.1 #Sets the minimum probability threshold for token selection
)

#Decode the generated tokens into human-readable text
text=tokenizer.decode(outputs[0],skip_special_tokens=True)
print(text)

Model does not have a padding token! Will use pad_token = <|finetune_right_pad_id|>.
system

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

user

I have a headache, a bad one around the foreheadassistant

Okay, so I have this really bad headache, and it's specifically around the forehead. That's kind of unusual, isn't it? Most headaches I've had before seem to be more around the whole head, you know, like around the eyes or the back of the head. This one seems to be more localized


### The model gives a better answer to the prompt about headache. 

## User interaction via Terminal

This section will querry the model on live prompts

In [None]:
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from nltk.translate.bleu_score import sentence_bleu

# Load the fine-tuned model and tokenizer from the saved directory
finetuned_model_path = "/kaggle/input/model_folder/pytorch/default/1/kaggle/working/Deepseek-r1-finetuned_health-8B"

# Load tokenizer and model from local directory
tokenizer = AutoTokenizer.from_pretrained(finetuned_model_path, local_files_only=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(finetuned_model_path, local_files_only=False, trust_remote_code=True).to("cuda")

# Function to calculate Perplexity
def calculate_perplexity(model, tokenizer, text, device='cuda'):
    model.eval()  # Set model to evaluation mode
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
    
    with torch.no_grad():  
        outputs = model(**inputs)
        loss = F.cross_entropy(outputs.logits.view(-1, outputs.logits.size(-1)), inputs.input_ids.view(-1), ignore_index=tokenizer.pad_token_id)
    
    perplexity = torch.exp(loss)  # e^(cross-entropy loss)
    return perplexity.item()

# Function to calculate BLEU Score
def calculate_bleu(reference, generated):
    reference = [reference.split()]  # BLEU expects list of lists
    generated = generated.split()
    bleu_score = sentence_bleu(reference, generated)
    return bleu_score

# Set PAD token to be the same as EOS token to avoid tokenization issues
tokenizer.pad_token = tokenizer.eos_token

while True:
    # Get user input
    user_input = input("Enter your message (or type 'quit' to exit): ")

    # Break the loop if the user types 'quit'
    if user_input.lower() == "quit":
        print("Exiting...")
        break

    # Tokenize the user input
    inputs = tokenizer(user_input, return_tensors="pt", padding=True, truncation=True).to("cuda")

    attention_mask = inputs["input_ids"] != tokenizer.pad_token_id

    # Generate the output
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=attention_mask,
        max_new_tokens=64,
        use_cache=True,  # use cache for faster token generation
        temperature=0.6,  # controls randomness in response
        min_p=0.1  # Sets the minimum probability threshold for token selection
    )

    # Decode the generated tokens into human-readable text
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("Model Response:", text)

    # Calculate Perplexity
    perplexity = calculate_perplexity(model, tokenizer, text)
    print(f"Perplexity Score: {perplexity}")

    # Calculate BLEU Score
    bleu_score = calculate_bleu(user_input, text)
    print(f"BLEU Score: {bleu_score}")


config.json:   0%|          | 0.00/1.56k [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now default to True since model is quantized.


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/236 [00:00<?, ?B/s]

Enter your message (or type 'quit' to exit):  Hello


Model Response: Hello, I'm trying to figure out how to make a simple to-do list app. I know I need to use some programming language, but I'm not sure which one to choose. I've heard of Python and JavaScript are pretty popular for this kind of thing. Let me think about which one would be easier for me
Perplexity Score: inf
BLEU Score: 0


Enter your message (or type 'quit' to exit):  I have a headache


Model Response: I have a headache. I think I need to take some medication. Let me see what I have in the medicine cabinet.

Okay, let's see... There's a bottle of Tylenol. That's good because it doesn't have any aspirin. I should be fine taking it.

Hmm, there's also a bottle of Adv


Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


Perplexity Score: inf
BLEU Score: 0.08324415467017712


Enter your message (or type 'quit' to exit):  I have a stomach pain


Model Response: I have a stomach pain and it's been bothering me for a few days now. I need to figure out what's going on. Let's think about what could be causing this. I remember that sometimes eating too much or not enough, or maybe something I ate was a little off, could lead to this kind of discomfort. Hmm, let
Perplexity Score: inf
BLEU Score: 0.05757177103786432


Enter your message (or type 'quit' to exit):  I have a kidney stone


Model Response: I have a kidney stone, and I'm not sure how to help myself with it. I know it's something that can cause a lot of pain, and I'm worried about what to do if it gets really bad. Let me think about what I can do to manage this.

First, I should probably stay hydrated. I've heard that
Perplexity Score: inf
BLEU Score: 0.03918225430439208


In [11]:
import os

finetuned_model_path = "/kaggle/input/model_folder/pytorch/default/1/kaggle/working/Deepseek-r1-finetuned_health-8B"

# List all files in the directory
os.listdir(finetuned_model_path)


['adapter_model.safetensors',
 'adapter_config.json',
 'README.md',
 'tokenizer.json',
 'tokenizer_config.json',
 'special_tokens_map.json']

## This section sets the Streamlit version of the app.

It gets too heavy to run because I use the free version of Notebook with less RAM


In [5]:
%%writefile app.py

import streamlit as st
from unsloth.chat_templates import get_chat_template

import torch

# Set up Streamlit UI
st.title("Chat with Finetuned Model")

# Initialize tokenizer and model (ensure they're properly loaded)
tokenizer = get_chat_template(
    tokenizer,
    chat_template='llama-3.2',
)

# Set the PAD to be the same as the EOS token and avoid tokenization issues
tokenizer.pad_token = tokenizer.eos_token
FastLanguageModel.for_inference(model)  # enable native 2x faster inference

# Streamlit input box
user_input = st.text_area("Enter your message:", "")

if st.button("Generate Response"):
    if user_input.strip():
        messages = [{'role': 'user', 'content': user_input}]

        # Tokenize the user input with the chat template
        inputs = tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors='pt',
            padding=True,  # Add padding to match sequence length
        ).to('cuda')

        attention_mask = inputs != tokenizer.pad_token_id

        # Generate the output
        outputs = model.generate(
            input_ids=inputs,
            attention_mask=attention_mask,
            max_new_tokens=64,
            use_cache=True,  # use cache for faster token generation
            temperature=0.6,  # controls randomness in response
            min_p=0.1  # Sets the minimum probability threshold for token selection
        )

        # Decode and display the full generated text
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        st.subheader("Model Response:")
        st.write(text)


Overwriting app.py


In [56]:
!wget -q -O - ipv4.icanhazip.com

34.105.33.39


In [None]:
! streamlit run app.py & npx localtunnel --port 8501


[1G[0K⠙[1G[0K
Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.
[0m
[0m
[34m[1m  You can now view your Streamlit app in your browser.[0m
[0m
[34m  Local URL: [0m[1mhttp://localhost:8501[0m
[34m  Network URL: [0m[1mhttp://172.19.2.2:8501[0m
[34m  External URL: [0m[1mhttp://34.105.33.39:8501[0m
[0m
your url is: https://slow-mammals-repeat.loca.lt
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
2025-02-25 07:59:58.199530: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-02-25 07:59:58.225414: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-02-25 07:59:58.234098: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to 

In [32]:
!npm install localtunnel

[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K⠼[1G[0K⠴[1G[0K⠦[1G[0K⠧[1G[0K⠇[1G[0K⠏[1G[0K⠋[1G[0K⠙[1G[0K⠹[1G[0K⠸[1G[0K
added 22 packages in 1s
[1G[0K⠸[1G[0K
[1G[0K⠸[1G[0K3 packages are looking for funding
[1G[0K⠸[1G[0K  run `npm fund` for details
[1G[0K⠸[1G[0K

In [41]:
#Download model and tokenizer:
!zip -r folder.zip /kaggle/working/Deepseek-r1-finetuned_health-8B

updating: kaggle/working/Deepseek-r1-finetuned_health-8B/ (stored 0%)
updating: kaggle/working/Deepseek-r1-finetuned_health-8B/tokenizer.json (deflated 85%)
updating: kaggle/working/Deepseek-r1-finetuned_health-8B/special_tokens_map.json (deflated 65%)
updating: kaggle/working/Deepseek-r1-finetuned_health-8B/README.md (deflated 66%)
updating: kaggle/working/Deepseek-r1-finetuned_health-8B/adapter_config.json (deflated 56%)
updating: kaggle/working/Deepseek-r1-finetuned_health-8B/adapter_model.safetensors (deflated 7%)
updating: kaggle/working/Deepseek-r1-finetuned_health-8B/tokenizer_config.json (deflated 94%)


In [45]:
from IPython.display import FileLink
FileLink(r'folder.zip')