# LeLM: Merge LoRA + Convert to GGUF + Upload to HuggingFace

This notebook:
1. Downloads the Qwen3-8B base model + LeLM LoRA adapter
2. Merges the LoRA weights into the base model
3. Converts to GGUF format (Q4_K_M quantization)
4. Uploads the GGUF to HuggingFace

**Requirements:** Free Colab GPU runtime (T4 is fine)
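
As a rough pre-flight check, the sketch below estimates how large each artifact will be and compares the total against free disk space. It assumes about 8.2 billion parameters for Qwen3-8B and approximate bits-per-weight figures (16 for F16, 8.5 for Q8_0, roughly 4.85 for Q4_K_M, which mixes block types). The helper names are illustrative, and real files differ slightly because of metadata and per-tensor quantization choices.

```python
import shutil

# Assumed figures, not read from the model: adjust if they differ.
N_PARAMS = 8.2e9
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}


def estimate_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate file size in GB (10**9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


def total_needed_gb(n_params: float = N_PARAMS) -> float:
    """Merged fp16 safetensors plus the F16, Q8_0 and Q4_K_M GGUF files."""
    merged = estimate_gb(n_params, 16.0)
    ggufs = sum(estimate_gb(n_params, bits) for bits in BITS_PER_WEIGHT.values())
    return merged + ggufs


def free_disk_gb(path: str = "/") -> float:
    """Free space on the filesystem containing `path`, in GB."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name:7s} ~{estimate_gb(N_PARAMS, bits):.1f} GB")
    print(f"Total needed ~{total_needed_gb():.1f} GB, free {free_disk_gb():.1f} GB")
```

The total does not count the Hugging Face download cache, which holds another copy of the base weights.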

In [None]:
# Step 1: Install dependencies
!pip install -q torch transformers peft accelerate huggingface_hub sentencepiece
!pip install -q llama-cpp-python

# Clone llama.cpp for the convert script
!git clone --depth 1 https://github.com/ggml-org/llama.cpp /content/llama.cpp
!pip install -q -r /content/llama.cpp/requirements/requirements-convert_hf_to_gguf.txt

In [None]:
# Step 2: Login to HuggingFace
from huggingface_hub import login
login()  # Enter your HF token with write access

In [None]:
# Step 3: Load base model + merge LoRA adapter
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL = "Qwen/Qwen3-8B"
ADAPTER_REPO = "KenWu/LeLM"
MERGED_PATH = "/content/lelm-merged"

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_REPO)

print(f"Loading base model {BASE_MODEL}...")
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    dtype=torch.float16,
    device_map="cpu",
    low_cpu_mem_usage=True,
)

print("Merging LoRA adapter...")
model = PeftModel.from_pretrained(base_model, ADAPTER_REPO)
model = model.merge_and_unload()

print(f"Saving merged model to {MERGED_PATH}...")
model.save_pretrained(MERGED_PATH, safe_serialization=True)
tokenizer.save_pretrained(MERGED_PATH)

# Free memory
import gc

del model, base_model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("Merge complete!")
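
## What the Merge Does (Optional)

Conceptually, merging folds each low-rank update into the weight it modifies, following the standard LoRA formulation W' = W + (alpha / r) * B @ A. The toy sketch below checks that identity on tiny hand-written matrices in plain Python. It illustrates the idea and is not PEFT's actual implementation; all names and values are made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element-wise."""
    return [[wi + scale * di for wi, di in zip(wr, dr)] for wr, dr in zip(w, delta)]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Fold a rank-r update into w: W' = W + (alpha / r) * B @ A."""
    return add_scaled(w, matmul(lora_b, lora_a), alpha / r)


# Toy shapes: W is 2x3, A is r x 3, B is 2 x r, with r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]
ALPHA, R = 2, 1
x = [[1.0], [1.0], [1.0]]  # column-vector input

merged = merge_lora(W, A, B, ALPHA, R)

# Adapter-style forward pass: base output plus the scaled low-rank path.
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
unmerged = add_scaled(base_out, lora_out, ALPHA / R)

# Both paths give the same result, so the adapter can be dropped after merging.
assert matmul(merged, x) == unmerged
print(merged)
```

Because the merged weight reproduces the adapter's output exactly, the saved checkpoint behaves like a plain Qwen3-style model, which is what the GGUF converter expects.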

In [None]:
# Step 4: Convert to GGUF (fp16 first, then quantize)
!python /content/llama.cpp/convert_hf_to_gguf.py \
    /content/lelm-merged \
    --outfile /content/lelm-f16.gguf \
    --outtype f16

print("\nF16 GGUF created!")
!ls -lh /content/lelm-f16.gguf
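
## Check the GGUF Header (Optional)

A file listing only shows that something was written. As a slightly stronger check, the sketch below reads the fixed-size GGUF header and reports the format version, tensor count, and metadata entry count. It assumes the little-endian GGUF v2+ layout (4-byte magic `GGUF`, `uint32` version, two `uint64` counts); `read_gguf_header` is a hypothetical helper, not part of llama.cpp.

```python
import os
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count.
    version, n_tensors, n_kv = struct.unpack("<IQQ", header[4:])
    return version, n_tensors, n_kv


GGUF_PATH = "/content/lelm-f16.gguf"  # path written by Step 4
if os.path.exists(GGUF_PATH):
    print(read_gguf_header(GGUF_PATH))
```

A tensor count of zero, or a `ValueError`, suggests the conversion did not finish and is worth investigating before quantizing.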

In [None]:
# Step 5: Quantize to Q4_K_M (a widely used quality/size trade-off, ~5 GB)
# Build llama-quantize first
!cd /content/llama.cpp && cmake -B build && cmake --build build --target llama-quantize -j$(nproc)

!/content/llama.cpp/build/bin/llama-quantize \
    /content/lelm-f16.gguf \
    /content/LeLM-Q4_K_M.gguf \
    Q4_K_M

print("\nQuantized GGUF created!")
!ls -lh /content/LeLM-Q4_K_M.gguf

# Also create Q8_0 for higher quality
!/content/llama.cpp/build/bin/llama-quantize \
    /content/lelm-f16.gguf \
    /content/LeLM-Q8_0.gguf \
    Q8_0

print("\nQ8_0 GGUF created!")
!ls -lh /content/LeLM-Q8_0.gguf

In [None]:
# Step 6: Upload GGUFs to HuggingFace
from huggingface_hub import HfApi

api = HfApi()
REPO_ID = "KenWu/LeLM-GGUF"

# Create the repo
api.create_repo(REPO_ID, repo_type="model", exist_ok=True)

# Upload Q4_K_M
print("Uploading Q4_K_M...")
api.upload_file(
    path_or_fileobj="/content/LeLM-Q4_K_M.gguf",
    path_in_repo="LeLM-Q4_K_M.gguf",
    repo_id=REPO_ID,
)

# Upload Q8_0
print("Uploading Q8_0...")
api.upload_file(
    path_or_fileobj="/content/LeLM-Q8_0.gguf",
    path_in_repo="LeLM-Q8_0.gguf",
    repo_id=REPO_ID,
)

print(f"\nDone! View at: https://huggingface.co/{REPO_ID}")
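
## Uploading in a Loop (Optional)

The two upload calls above differ only in the file name, so they can be folded into a loop that also skips files that were never produced. The sketch below is one way to do that; `upload_ggufs` is a hypothetical helper, and it only relies on the `upload_file(path_or_fileobj=..., path_in_repo=..., repo_id=...)` call already used in this notebook.

```python
import os


def upload_ggufs(api, repo_id, local_paths):
    """Upload each existing file to repo_id; return the names that were uploaded."""
    uploaded = []
    for path in local_paths:
        if not os.path.isfile(path):
            print(f"Skipping missing file: {path}")
            continue
        name = os.path.basename(path)
        print(f"Uploading {name}...")
        api.upload_file(path_or_fileobj=path, path_in_repo=name, repo_id=repo_id)
        uploaded.append(name)
    return uploaded


# Example call, using the `api` and `REPO_ID` defined in Step 6:
# upload_ggufs(api, REPO_ID, ["/content/LeLM-Q4_K_M.gguf", "/content/LeLM-Q8_0.gguf"])
```

Passing `api` in as an argument keeps the helper easy to try against a stand-in object before any real upload happens.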

In [None]:
# Step 7: Upload README for the GGUF repo
readme = """---
base_model: Qwen/Qwen3-8B
license: apache-2.0
tags:
  - gguf
  - lora-merged
  - nba
  - sports-analysis
  - qwen3
pipeline_tag: text-generation
quantized_by: llama.cpp
---

# LeLM-GGUF

GGUF quantizations of [KenWu/LeLM](https://huggingface.co/KenWu/LeLM), an NBA take analysis model fine-tuned on Qwen3-8B.

## Available Quantizations

| File | Quant | Size | Description |
|---|---|---|---|
| `LeLM-Q4_K_M.gguf` | Q4_K_M | ~5 GB | Best balance of quality and size |
| `LeLM-Q8_0.gguf` | Q8_0 | ~8.5 GB | Higher quality, larger |

## Usage with Ollama

Create a `Modelfile`:

```
FROM ./LeLM-Q4_K_M.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9

SYSTEM You are LeLM, an expert NBA analyst. Fact-check basketball takes using real statistics. Be direct, witty, and back everything with numbers.
```

Then run:
```bash
ollama create lelm -f Modelfile
ollama run lelm "Fact check: LeBron is washed"
```

## Usage with llama.cpp

```bash
llama-cli -m LeLM-Q4_K_M.gguf -p "Fact check this NBA take: Steph Curry is the GOAT" -n 512
```

## Part of LeGM-Lab

This model powers [LeGM-Lab](https://github.com/KenWuqianghao/LeGM-Lab), an LLM-powered NBA take analysis and roasting bot.
"""

api.upload_file(
    path_or_fileobj=readme.encode(),
    path_in_repo="README.md",
    repo_id=REPO_ID,
)
print(f"README uploaded! https://huggingface.co/{REPO_ID}")

## Quick Test (Optional)

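Sanity-check the merged model. This cell only needs `/content/lelm-merged` from Step 3, so it can be run before the conversion and upload steps: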

In [None]:
# Optional: Quick test of the merged model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "/content/lelm-merged",
    dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("/content/lelm-merged")

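messages = [{"role": "user", "content": "Fact check this NBA take: LeBron is washed"}]
# add_generation_prompt=True appends the assistant turn header so the model answers
# instead of continuing the user message; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))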