# SageLoRA: 🧠 Fine-Tuning DeepSeek LLM on SQuAD Using LoRA (Low-Rank Adaptation)

**SageLoRA** is a compact and efficient question-answering model, fine-tuned with LoRA on the SQuAD v1.1 dataset. Built on DeepSeek-LLM, it delivers smart, context-driven responses with minimal resource usage.

Note: Before running this code, make sure you have set up your SerpAPI key in Kaggle Secrets (the notebook reads it from a secret named `SERPAPI`).

*“Smart answers, light footprint.”*

## ✅ Features of SageLoRA

1. **LoRA-Fine-Tuned QA Model**
Built on top of DeepSeek-LLM-7B and fine-tuned using Low-Rank Adaptation (LoRA) for efficient, high-quality question answering.

2. **SQuAD v1.1 Training**
Trained on one of the most widely used QA datasets, enabling strong comprehension and context-based answering.

3. **4-bit Quantization (bitsandbytes)**
Memory-efficient model loading using 4-bit quantization—ideal for deployment in limited-resource environments.

4. **Hybrid Answer Engine**
Uses both document-based context and real-time web search (SerpAPI) to answer questions, boosting accuracy beyond static knowledge.

5. **Custom Prompt Design**
Supports QA-style and general conversation via templated prompts, allowing flexible user interaction.

6. **Document Upload Support**
Users can upload PDF or text files and ask questions directly about the content.

7. **Interactive Gradio UI**
Clean, user-friendly interface built with Gradio for easy experimentation and deployment.

8. **Adjustable Generation Settings**
Users can tweak temperature, top-p sampling, and max token limits to control creativity and response length.

9. **Context Chunking Logic**
Automatically splits long documents into overlapping chunks for better retrieval and answer relevance.

10. **Clear Answer Routing**
Smart input classification (greeting, QA, general chat) to route prompts to the appropriate response method.

## 🚀 How It Works
- Users upload a document or paste context.
- Input is classified: greeting, QA, or general.
- If QA: document chunks are searched for answers.
- If uncertain, SerpAPI fetches live web data.
- Combined sources are passed to the model.
- Answer is polished and returned via chat.

## 🧪 Training Details
| Item            | Value                        |
|-----------------|------------------------------|
| Base Model      | DeepSeek-LLM-7B              |
| LoRA Config     | r=8, α=16, dropout=0.05      |
| Dataset         | SQuAD v1.1 (CSV format)      |
| Epochs          | 5                            |
| Max Length      | 512 tokens                   |
| Format          | `<\|user\|> Q Context <\|assistant\|> A` |


## Training Phase

### 📦 Install Required Dependencies


In [1]:
!pip install -q bitsandbytes transformers accelerate peft datasets fsspec==2025.3.2 pandas


### 📚 Import Libraries

In [2]:
# Import core libraries
import torch
import json
import shutil
import pandas as pd
from datasets import Dataset

# HuggingFace Transformers
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)

# PEFT (Parameter-Efficient Fine-Tuning)
from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model



### 🔍 Load Tokenizer and Base Model (DeepSeek-7B)

In [3]:
# Define model name
model_name = "deepseek-ai/deepseek-llm-7b-base"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # Reuse the EOS token as the pad token

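# Load model in 4-bit for efficient fine-tuning
# (BitsAndBytesConfig replaces the deprecated load_in_4bit argument)
from transformers import BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    trust_remote_code=True
)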


### 🔧 Apply LoRA (Low-Rank Adaptation) via PEFT


In [4]:
# Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA to model
model = get_peft_model(model, lora_config)
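
To get a feel for how small the adapter is, the configuration above can be turned into a rough parameter count. The sketch below is only an estimate: it assumes (this notebook does not state it) that DeepSeek-LLM-7B has a hidden size of 4096 and 30 transformer layers, and that `q_proj` and `v_proj` are both square 4096×4096 projections. `lora_param_count` is a hypothetical helper, not part of PEFT.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer.

    LoRA learns two matrices, A with shape (r, d_in) and B with shape (d_out, r).
    """
    return r * (d_in + d_out)

# Assumed model dimensions (not taken from this notebook; check the model's config.json)
HIDDEN_SIZE = 4096
NUM_LAYERS = 30

# Values copied from lora_config above
TARGET_MODULES = ["q_proj", "v_proj"]
R, ALPHA = 8, 16

per_module = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, R)
total = per_module * len(TARGET_MODULES) * NUM_LAYERS
scaling = ALPHA / R  # LoRA scales the low-rank update by alpha / r

print(f"LoRA parameters per module: {per_module:,}")
print(f"Total trainable LoRA parameters: {total:,}")
print(f"Update scaling factor (alpha / r): {scaling}")
```

Under these assumptions the adapter has a few million trainable parameters, a tiny fraction of the 7B base model. In the notebook itself, calling `model.print_trainable_parameters()` on the PEFT-wrapped model reports the exact figure.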

### 📑 Load and Prepare the SQuAD Dataset


In [5]:
# Load CSV dataset
df = pd.read_csv("/kaggle/input/squad-v11/SQuAD-v1.1.csv")

# Drop rows with missing values
df = df.dropna(subset=["question", "answer", "context"])

# Format chat-style examples (one dict per row)
chat_data = df[["question", "context", "answer"]].to_dict("records")

# Use a smaller subset for quick training/testing
chat_data = chat_data[:2]

### 📘 Convert and Format Dataset for Fine-Tuning

In [6]:
# Convert to HuggingFace Dataset
dataset = Dataset.from_list(chat_data)

# Format prompt
def format_prompt(example):
    return {
        "text": f"<|user|>\n{example['question']}\n\nContext:\n{example['context']}\n<|assistant|>\n{example['answer']}"
    }

# Apply formatting
dataset = dataset.map(format_prompt)
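
To make the training format concrete, here is what `format_prompt` produces for a single record. The sample question, context, and answer are made up for illustration and are not taken from SQuAD.

```python
def format_prompt(example):
    # Same template as the cell above
    return {
        "text": f"<|user|>\n{example['question']}\n\nContext:\n{example['context']}\n<|assistant|>\n{example['answer']}"
    }

# Hypothetical record, shaped like one row of the CSV
sample = {
    "question": "What colour is the sky?",
    "context": "On a clear day the sky is blue.",
    "answer": "blue",
}

formatted = format_prompt(sample)["text"]
print(formatted)
```

The question comes first, the context follows under a `Context:` label, and the answer is everything after the `<|assistant|>` marker.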


### ✂️ Tokenize the Dataset

In [7]:
# Ensure padding token is defined
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})

# Tokenize prompts
def tokenize_function(example):
    output = tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=512
    )
    output["labels"] = output["input_ids"].copy()
    return output

# Apply tokenization
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["question", "answer", "context", "text"])


### ⚙️ Define Training Arguments

In [8]:
training_args = TrainingArguments(
    output_dir="./squad_lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    num_train_epochs=5,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    logging_strategy="steps",
    save_strategy="epoch",
    report_to="none",
    disable_tqdm=False,
    remove_unused_columns=False,
)
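
As a quick sanity check on these settings, the number of optimizer updates can be estimated from the dataset size. This is a back-of-the-envelope sketch that assumes a single GPU; `optimizer_steps` is a hypothetical helper, and the Trainer's own rounding may differ slightly when `gradient_accumulation_steps` is greater than 1.

```python
import math

def optimizer_steps(num_samples, batch_size, grad_accum, epochs, num_devices=1):
    """Approximate number of optimizer updates for a training run."""
    samples_per_step = batch_size * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / samples_per_step)
    return steps_per_epoch * epochs

# Settings from training_args above: batch size 1, no accumulation, 5 epochs
print(optimizer_steps(num_samples=2, batch_size=1, grad_accum=1, epochs=5))    # 2-sample subset
print(optimizer_steps(num_samples=200, batch_size=1, grad_accum=1, epochs=5))  # 200-sample run
```

With `logging_steps=1`, the loss is logged once per update, so the log has as many entries as there are steps.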

### 🧪 Setup Trainer

In [9]:
# Data collator for causal language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
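
With `mlm=False`, `DataCollatorForLanguageModeling` rebuilds the labels from `input_ids` and replaces every pad-token position with `-100`, so padding does not contribute to the loss. The pure-Python sketch below mimics that behavior on toy token ids. The ids are made up, right padding is assumed, and `pad_or_truncate` and `mask_padding` are hypothetical helpers rather than library functions.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def pad_or_truncate(token_ids, max_length, pad_id):
    """Mimic truncation=True with padding='max_length' (right padding assumed)."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

def mask_padding(input_ids, pad_id):
    """Copy input ids to labels, replacing every pad id with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

# Toy ids (hypothetical): 2 stands in for the EOS id, which this notebook also uses as the pad id
EOS_ID = PAD_ID = 2
example = [11, 12, 13, EOS_ID]

input_ids = pad_or_truncate(example, max_length=6, pad_id=PAD_ID)
labels = mask_padding(input_ids, pad_id=PAD_ID)

print(input_ids)  # [11, 12, 13, 2, 2, 2]
print(labels)     # [11, 12, 13, -100, -100, -100]
```

Because the pad token is set to the EOS token here, any genuine EOS token in a sequence is masked as well. That is worth keeping in mind if the model should learn when to stop generating.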



### 🚀 Start Training

In [10]:
# Run fine-tuning
trainer.train()

# Save the fine-tuned adapter and tokenizer
save_path = "./squad_chatbot_adapter"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

# Zip the adapter folder
shutil.make_archive("squad_chatbot_adapter", 'zip', save_path)

print("✅ Training complete. Adapter zipped and saved as squad_chatbot_adapter.zip")

✅ Training complete. Adapter zipped and saved as squad_chatbot_adapter.zip


Note: Fine-tuning was completed before this notebook was exported and the cell outputs were cleared, so only a small part of the training output is visible. The actual training run used 200 samples rather than the 2 shown in the code above.

## Testing

### 📦 Step 1: Install Dependencies

In [11]:
!pip install -q gradio bitsandbytes transformers peft accelerate PyPDF2 serpapi


### 📚 2. Import Libraries


In [12]:
import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import PyPDF2
import serpapi
from kaggle_secrets import UserSecretsClient

### 🔑 3. Load API Keys and Paths

In [13]:
# --- 🔐 SerpAPI Key from Kaggle Secrets ---
user_secrets = UserSecretsClient()
SERPAPI_API_KEY = user_secrets.get_secret("SERPAPI")

# --- 📂 Model Paths ---
adapter_path = "/kaggle/input/peftlora/pytorch/default/1"
base_model_name = "deepseek-ai/deepseek-llm-7b-base"

### 🧠 4. Load Tokenizer and Models


In [14]:
# --- 🧠 Load Tokenizer and Base Model ---
tokenizer = AutoTokenizer.from_pretrained(adapter_path, trust_remote_code=True)
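# BitsAndBytesConfig replaces the deprecated load_in_4bit argument
from transformers import BitsAndBytesConfig

model_base = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    trust_remote_code=True
)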

# --- 🔄 Load Fine-Tuned LoRA Model ---
model_lora = PeftModel.from_pretrained(model_base, adapter_path)


### 📄 5. Document Processing Utilities

In [15]:
# --- ✂️ Chunk Large Text into Sections ---
def chunk_text(text, chunk_size=512, overlap=50):
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size - overlap)]

# --- 📄 Process PDF or TXT Uploads ---
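def process_document(file):
    # Gradio passes None when the upload is cleared
    if file is None:
        return []

    # Depending on the Gradio version, the upload is a file path or a temp-file object
    path = file if isinstance(file, str) else file.name

    if path.lower().endswith('.pdf'):
        reader = PyPDF2.PdfReader(path)
        # extract_text() can return None for pages without a text layer
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
    elif path.lower().endswith('.txt'):
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()
    else:
        # Unsupported type: return no chunks so downstream code still receives a list
        return []

    return chunk_text(text)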
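
The overlap is easiest to see on a tiny input. The example below runs the same `chunk_text` logic with a chunk size of 4 words and an overlap of 1, on ten placeholder words (`w0` to `w9`) invented for illustration.

```python
def chunk_text(text, chunk_size=512, overlap=50):
    # Same logic as the cell above: windows of chunk_size words,
    # each starting (chunk_size - overlap) words after the previous one
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size - overlap)]

text = " ".join(f"w{i}" for i in range(10))
chunks = chunk_text(text, chunk_size=4, overlap=1)

for chunk in chunks:
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
# w9
```

Each chunk repeats the last word of the previous one. The final chunk can be a short tail that the previous chunk already covers, as with `w9` here.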

### 🌐 6. Web Search Integration

In [16]:
# --- 🌐 Use SerpAPI for Online Info ---
def search_web(query):
    search = serpapi.GoogleSearch({
        "q": query,
        "api_key": SERPAPI_API_KEY
    })
    results = search.get_dict()
    # Skip results that carry no snippet instead of raising a KeyError
    return "\n".join(result["snippet"] for result in results.get("organic_results", []) if result.get("snippet"))
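
The snippet-joining step can be tried without an API key by feeding it a mock response. The sketch below assumes only the response shape that `search_web` already relies on (an `organic_results` list whose entries may carry a `snippet` field). `extract_snippets`, its `max_results` cap, and the mock data are hypothetical additions for illustration.

```python
def extract_snippets(results, max_results=5):
    """Join the snippets of the top organic results, skipping entries without one."""
    snippets = []
    for result in results.get("organic_results", [])[:max_results]:
        snippet = result.get("snippet")
        if snippet:
            snippets.append(snippet.strip())
    return "\n".join(snippets)

# Hypothetical response, shaped like the dictionary search_web reads from
mock_results = {
    "organic_results": [
        {"title": "A", "snippet": "First snippet."},
        {"title": "B"},  # entry without a snippet field
        {"title": "C", "snippet": "Second snippet."},
    ]
}

print(extract_snippets(mock_results))
print(repr(extract_snippets({})))  # no results at all
```

Capping the number of results keeps the web context short, which matters because the combined prompt still has to fit in the model's context window.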

### 💬 7. Prompt Templates

In [17]:
# --- 📄 Prompt for QA Tasks ---
QA_PROMPT = """You are a helpful AI assistant specialized in question answering. Only use the information provided in the context to answer.

Context:
{context}

Question:
{question}

Answer:"""

# --- 🧠 Prompt for General Chat ---
GENERIC_CHAT_PROMPT = """You are an AI assistant. Respond to the user naturally.

User: {question}
AI:"""

### 🧭 8. Intent Routing and Inference


In [18]:
# --- 🧭 Classify User Intent ---
import re

def route_input(message, context):
    if not message.strip():
        return "empty"

    greetings = ["hi", "hello", "hey", "yo", "how are you", "what's up", "good morning", "good evening","bye"]
    # Match whole words or phrases so that e.g. "history" is not read as "hi"
    text = message.lower().strip()
    if any(re.match(rf"{re.escape(g)}\b", text) for g in greetings):
        return "greeting"

    return "qa" if context.strip() else "general"

# --- 🧠 Generate Model Response ---
def generate_response(prompt, use_lora=False, temperature=0.7, top_p=0.95, max_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to(model_base.device)
    model = model_lora if use_lora else model_base

    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=True,
            top_p=top_p,
            temperature=temperature,
            pad_token_id=tokenizer.eos_token_id
        )

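    # generate() returns prompt + continuation, so decode only the newly generated tokens
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    decoded = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    cleaned = decoded.replace("<|assistant|>", "").replace("<|user|>", "").strip()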

    if len(set(cleaned.split())) < len(cleaned.split()) / 2:
        return "Sorry, I couldn't generate a meaningful response. Could you please rephrase your question?"

    return cleaned
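
The last check in `generate_response` is a simple repetition filter: a reply is rejected when fewer than half of its words are distinct. Here it is isolated as a standalone function (`looks_degenerate` is a hypothetical name) with two made-up inputs.

```python
def looks_degenerate(text):
    """True when fewer than half of the whitespace-separated words are distinct."""
    words = text.split()
    return len(set(words)) < len(words) / 2

print(looks_degenerate("Paris is the capital of France."))                          # False
print(looks_degenerate("the answer is the answer is the answer is the answer is"))  # True
```

The comparison is case-sensitive and keeps punctuation attached to words, so it only catches fairly blatant loops. Long but legitimate answers that reuse many common words could also trip it.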

### 🧩 9. Hybrid Agent Logic

In [19]:
# --- 🧩 Combine Document + Web Info ---
def agent_response(user_input, document_chunks, use_serpapi, temperature, top_p, max_tokens):
    steps = ["🔍 Attempting to answer using uploaded document..."]
    context = " ".join(document_chunks)

    qa_prompt = QA_PROMPT.format(context=context, question=user_input)
    doc_answer = generate_response(qa_prompt, use_lora=True, temperature=temperature, top_p=top_p, max_tokens=max_tokens)

    if "I don't know" in doc_answer and use_serpapi:
        steps.append("❓ Not confident in document answer. Searching web via SerpAPI...")
        web_info = search_web(user_input)
        steps.append(f"🌐 Found this online:\n{web_info[:500]}...")

        combo_context = f"{context}\n\nAdditional info from the web:\n{web_info}"
        combo_prompt = QA_PROMPT.format(context=combo_context, question=user_input)
        final_answer = generate_response(combo_prompt, use_lora=True, temperature=temperature, top_p=top_p, max_tokens=max_tokens)
        steps.append("🧠 Refined answer after combining sources.")
    else:
        final_answer = doc_answer
        steps.append("✅ Confident in document-based answer.")

    chat_prompt = GENERIC_CHAT_PROMPT.format(question=final_answer)
    final_response = generate_response(chat_prompt, use_lora=False, temperature=temperature, top_p=top_p, max_tokens=max_tokens)

    return steps, final_response

### 🗃️ 10. Chat Memory & Engine

In [20]:
# --- 🧠 Memory Handling ---
def update_memory(user_input, chatbot_reply, memory):
    memory.append((user_input, chatbot_reply))
    return memory

# --- 🤖 Main Chat Engine ---
def chatbot_engine(user_input, context, history, temperature, top_p, max_tokens, document_chunks, use_serpapi):
    # Treat pasted context and uploaded document chunks as one pool of evidence,
    # so QA works with either source (previously pasted context was never used)
    chunks = list(document_chunks) if isinstance(document_chunks, list) else []
    if context and context.strip():
        chunks.insert(0, context.strip())

    intent = route_input(user_input, " ".join(chunks))

    if intent == "empty":
        reply = "❓ Please enter something for me to respond to."
    elif intent == "greeting":
        reply = "👋 Hello! How can I help you today?"
    elif intent == "qa":
        steps, final_response = agent_response(user_input, chunks, use_serpapi, temperature, top_p, max_tokens)
        reply = "\n".join(steps) + f"\n\n🗨️ Final Answer:\n{final_response}"
    else:
        prompt = GENERIC_CHAT_PROMPT.format(question=user_input)
        reply = generate_response(prompt, use_lora=False, temperature=temperature, top_p=top_p, max_tokens=max_tokens)

    history = update_memory(user_input, reply, history)
    return history, history

### 🧱 11. Gradio UI Setup

In [21]:
# --- 🧱 Build Gradio UI ---
with gr.Blocks() as iface:
    gr.Markdown("# 🤖 Hybrid Agent Chatbot with Document & Web Search")
    gr.Markdown("Upload a document and ask questions. If the answer isn't in the document, I can look it up online!")

    with gr.Row():
        with gr.Column(scale=2):
            user_input = gr.Textbox(label="Your Message")
            context_input = gr.Textbox(label="Context (optional)", lines=5, placeholder="Paste context for QA-style questions")
            document_upload = gr.File(label="Upload a Document", file_types=[".txt", ".pdf"])

            temperature_slider = gr.Slider(0.1, 1.5, value=0.7, step=0.05, label="Temperature")
            top_p_slider = gr.Slider(0.1, 1.0, value=0.95, step=0.05, label="Top-p (nucleus sampling)")
            max_token_slider = gr.Slider(32, 512, value=200, step=8, label="Max New Tokens")

            serpapi_toggle = gr.Checkbox(label="Enable SerpAPI Search", value=False)

            submit_btn = gr.Button("Send")
            clear_btn = gr.Button("Clear Chat History")

        with gr.Column(scale=3):
            chatbot = gr.Chatbot(label="Chat History")

    memory = gr.State([])
    doc_chunks = gr.State([])

    # --- File Upload Callback ---
    def on_file_upload(file):
        return process_document(file)

    document_upload.change(on_file_upload, inputs=[document_upload], outputs=[doc_chunks])

    # --- Submit Callback ---
    submit_btn.click(
        fn=chatbot_engine,
        inputs=[user_input, context_input, memory, temperature_slider, top_p_slider, max_token_slider, doc_chunks, serpapi_toggle],
        outputs=[chatbot, memory]
    )

    # --- Clear Button Callback ---
    def clear_history():
        return [], []

    clear_btn.click(fn=clear_history, outputs=[chatbot, memory])



### 🚀 12. Launch the App

In [22]:
# --- 🚀 Launch Gradio Interface ---
iface.launch(share=True)





## Limitations:

1. **Limited Domain Knowledge**
The adapter is fine-tuned only on the SQuAD v1.1 dataset, so it may underperform on questions outside general reading comprehension or in specialized domains (e.g., medicine, law).

2. **Short Context Window**
Training inputs are tokenized and truncated to 512 tokens, which may cut off important information in longer documents or contexts.

3. **No Built-In Web Awareness**
While the full chatbot system can access real-time web data via SerpAPI, the SageLoRA model itself is not inherently aware of or trained on live internet content. Web search is handled externally by the surrounding application logic, not the model.

4. **Bias Inherited from Dataset**
Since it’s trained on human-generated data from SQuAD, it may reflect societal or dataset-specific biases in its answers.

5. **Generative Risk**
Although fine-tuned for QA, the model may sometimes hallucinate or confidently provide incorrect answers, especially when context is unclear or ambiguous.

6. **Not Optimized for Chat**
While it can handle question-answer prompts well, it’s not explicitly trained for multi-turn conversational flow or memory-based interactions.

7. **Low-Resource Constraints**
LoRA fine-tuning is efficient, but the model may still require a GPU for inference due to its base size (7B parameters), even if running in 4-bit.

The enhancements below are practical ways to make **SageLoRA** more capable and easier to use.



---

### 🚀 Suggested Enhancements for SageLoRA

#### 🧠 Model-Level Enhancements
1. **Multi-Turn Conversational Memory**  
   Add context-awareness across multiple turns using a memory buffer or retrieval mechanism, enabling follow-up questions and dynamic flow.

2. **Fine-Tune with More Diverse QA Datasets**  
   Incorporate datasets like **Natural Questions**, **HotpotQA**, or **TriviaQA** for broader general knowledge and reasoning skills.

3. **Instruction Tuning**  
   Fine-tune on instruction-style datasets (e.g., **OpenAssistant**, **Dolly**, **Alpaca**) for more natural dialogue and multi-intent prompts.

4. **Long Context Support**  
   Use a base model with a longer context window (e.g., a long-context DeepSeek variant or **Mistral**), or extend the context with a fine-tuning method such as **LongLoRA**, with **FlashAttention** keeping attention memory manageable at 8K+ tokens.
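
For the multi-turn memory idea in item 1, one lightweight option is to replay the most recent turns inside the prompt. The sketch below is only one possible design: `build_chat_prompt` and `max_turns` are hypothetical, the history uses the same `(user, reply)` tuples that `update_memory` stores, and the wording mirrors `GENERIC_CHAT_PROMPT`.

```python
def build_chat_prompt(history, user_input, max_turns=3):
    """Render the last `max_turns` (user, reply) pairs plus the new message as one prompt."""
    # history[-0:] would return the whole list, so handle max_turns <= 0 explicitly
    recent = history[-max_turns:] if max_turns > 0 else []

    lines = ["You are an AI assistant. Respond to the user naturally.", ""]
    for past_user, past_reply in recent:
        lines.append(f"User: {past_user}")
        lines.append(f"AI: {past_reply}")
    lines.append(f"User: {user_input}")
    lines.append("AI:")
    return "\n".join(lines)

# Made-up conversation for illustration
history = [
    ("Hi", "Hello!"),
    ("What is LoRA?", "A fine-tuning method."),
    ("Is it cheap?", "Yes, relatively."),
]

prompt = build_chat_prompt(history, "Why?", max_turns=2)
print(prompt)
```

Keeping only the last few turns bounds the prompt length, at the cost of forgetting earlier parts of the conversation.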

---

#### 🛠️ App-Level / UX Enhancements
1. **RAG (Retrieval-Augmented Generation)**  
   Instead of just chunking docs linearly, use **vector search (e.g., FAISS, ChromaDB)** to retrieve the most relevant chunks for the user's query.

2. **Streaming Responses**  
   Add response streaming in Gradio by writing the handler as a generator function (using `yield`) for a more conversational feel (like ChatGPT typing).

3. **Source Highlighting**  
   Show users *which chunk or passage* an answer came from—helps with trust and explainability.

4. **Multi-File Upload**  
   Allow users to upload multiple documents and query across them (e.g., research paper comparisons or PDF-heavy workflows).

5. **Add Voice Input / TTS**  
   Use `gr.Audio` to capture microphone input for a speech-to-text model, and `pyttsx3` or `gTTS` for text-to-speech output.
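
To illustrate the retrieval idea in item 1, the sketch below ranks chunks by bag-of-words cosine similarity to the question and keeps only the best matches. It is a dependency-free stand-in, not a real vector search: a FAISS or ChromaDB setup would compare learned embeddings instead of word counts. All function names and the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k_chunks(question, chunks, k=2):
    """Return up to k chunks with non-zero similarity to the question, best first."""
    q_vec = Counter(tokenize(question))
    scored = [(cosine(q_vec, Counter(tokenize(c))), i) for i, c in enumerate(chunks)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest score first, ties by original order
    return [chunks[i] for score, i in scored[:k] if score > 0]

chunks = [
    "LoRA adds small trainable matrices to frozen weights.",
    "Gradio builds web interfaces for Python functions.",
    "SQuAD is a reading comprehension dataset.",
]

print(top_k_chunks("What does LoRA add to the weights?", chunks, k=1))
```

Passing only the top-ranked chunks to the QA prompt, rather than the whole document, keeps the prompt short and focused on the relevant passage.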

---

#### 📦 Deployment & Sharing
1. **Dockerize the App**  
   Create a Docker image to make SageLoRA easily portable and deployable on servers or cloud platforms.

2. **Gradio Share → Hugging Face Spaces**  
   Package the whole app as a Hugging Face Space for public access and demo.

3. **Model Card + Metadata**  
   Add a `README.md`, usage examples, limitations, and tags to the adapter folder if you're uploading it to the Hugging Face Hub.

## 📜 License & Credits
Built with 🤗 Transformers, Gradio, and PEFT<br>
Base model: DeepSeek-AI<br>
Dataset: Stanford SQuAD v1.1<br>
License: Apache 2.0 License