# Novel Writer - Complete Pipeline & Training

This notebook runs **everything** end-to-end on Google Colab:

1. Clone repo & install dependencies
2. Upload your novels (or use built-in sample data)
3. Run the full data processing pipeline
4. Fine-tune your chosen model
5. Generate sample text
6. Download your trained model

**Supports:** Qwen3-8B (Chinese) or Mistral Nemo 12B (English)

**Requirements:** Google Colab with T4 GPU (free tier) or Kaggle Notebooks

---

### How to use
1. Change the settings in **Cell 1** below if needed
2. **Runtime > Run all**
3. When prompted, upload your novel files (or skip to use sample data)
4. Wait for training to complete (~1-3 hours depending on data size)
5. Download your model at the end

---
## Step 1: Configuration

**Change these settings before running!**
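
As a rough guide when choosing `BATCH_SIZE`, `GRADIENT_ACCUMULATION`, and `NUM_EPOCHS`: each optimizer step consumes `BATCH_SIZE * GRADIENT_ACCUMULATION` samples. The sketch below estimates the number of optimizer steps for a given dataset size. `estimate_steps` is a hypothetical helper, not part of Novel Writer, and the trainer's actual step count may differ slightly depending on how it handles the last partial batch.

```python
import math

def estimate_steps(num_samples: int, batch_size: int, grad_accum: int, epochs: int) -> int:
    """Approximate optimizer steps, rounding a partial final batch up."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * epochs

# With the defaults (batch 2, accumulation 4, 2 epochs) and 1,000 samples:
print(estimate_steps(1000, 2, 4, 2))  # 250
```

Doubling `GRADIENT_ACCUMULATION` halves the number of steps without increasing per-step VRAM use, which is usually the safer knob on a T4.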

In [None]:
#@title Configuration { display-mode: "form" }

#@markdown ### Model Selection
MODEL_CHOICE = "qwen3_chinese" #@param ["qwen3_chinese", "mistral_english"]

#@markdown ### Training Settings
NUM_EPOCHS = 2 #@param {type:"slider", min:1, max:5, step:1}
LEARNING_RATE = 2e-4 #@param {type:"number"}
MAX_SEQ_LENGTH = 4096 #@param [2048, 4096, 8192] {type:"raw"}
LORA_RANK = 32 #@param [8, 16, 32, 64] {type:"raw"}
BATCH_SIZE = 2 #@param [1, 2, 4] {type:"raw"}
GRADIENT_ACCUMULATION = 4 #@param [2, 4, 8] {type:"raw"}

#@markdown ### Data Settings
USE_SAMPLE_DATA = False #@param {type:"boolean"}
CHUNK_SIZE = 4000 #@param {type:"integer"}
RUN_DEDUP = True #@param {type:"boolean"}
RUN_QUALITY_FILTER = True #@param {type:"boolean"}

#@markdown ---

# Model configurations (do not edit)
MODEL_CONFIGS = {
    "qwen3_chinese": {
        "model_name": "unsloth/Qwen3-8B",
        "output_name": "qwen3_chinese_novel_lora",
        "system_prompt": "你是一位专业的中文小说作家。请根据指令，以优美流畅的中文续写故事内容。注意保持文风一致，人物性格鲜明，情节引人入胜。",
        "prompt_format": "chatml",
        "test_prompts": [
            "续写以下故事：李明站在长安城门前，心中百感交集。三年前他离开家乡时还是个少年，如今",
            "描写一个武侠场景：月光下，两位剑客在悬崖边对峙。",
            "请以古风笔触描写一个春日清晨的集市。",
        ],
    },
    "mistral_english": {
        "model_name": "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
        "output_name": "nemo_english_story_lora",
        "system_prompt": "You are a talented fiction author. Write vivid, engaging prose with strong characters, sensory details, and natural dialogue.",
        "prompt_format": "mistral",
        "test_prompts": [
            "Continue the story: The old lighthouse keeper climbed the spiral stairs one last time. After forty years, tonight would be his final watch.",
            "Write a scene: Two strangers meet in a rain-soaked cafe in Paris. One of them is hiding a secret.",
            "Describe the moment a warrior returns to her village after a decade of war, only to find it completely changed.",
        ],
    },
}

CFG = MODEL_CONFIGS[MODEL_CHOICE]
print(f"Model: {CFG['model_name']}")
print(f"Output: {CFG['output_name']}")
print(f"Epochs: {NUM_EPOCHS}, LR: {LEARNING_RATE}, Seq len: {MAX_SEQ_LENGTH}")
print(f"LoRA rank: {LORA_RANK}, Batch: {BATCH_SIZE}, Grad accum: {GRADIENT_ACCUMULATION}")
print(f"Sample data: {USE_SAMPLE_DATA}")

---
## Step 2: Setup Environment

In [None]:
%%capture
# Install Unsloth (2x faster training, 70% less VRAM)
!pip install unsloth
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

# Clone Novel Writer repo
!rm -rf /content/Novel_Writer
!git clone https://github.com/LL-LLLu/Novel_Writer.git /content/Novel_Writer

# Install Novel Writer
%cd /content/Novel_Writer
!pip install -e .

In [None]:
# Verify installation
import torch
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
print()

!novel-writer --help | head -20
print("\nSetup complete!")

---
## Step 3: Upload Data (or Use Sample Data)

**Supported formats:** `.txt`, `.pdf`, `.epub`, `.html`, `.htm`, `.md`, `.mobi`

Upload your novel files when prompted, or set `USE_SAMPLE_DATA = True` (tick the checkbox) in the Step 1 configuration to use the built-in sample text.
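
Before uploading, it can help to check which files have a supported extension. The following is a minimal sketch; `split_by_support` is a hypothetical helper (not part of Novel Writer) that only inspects file names, using the extension list given above.

```python
from pathlib import Path

# Extensions listed as supported in this notebook
SUPPORTED_EXTS = {".txt", ".pdf", ".epub", ".html", ".htm", ".md", ".mobi"}

def split_by_support(filenames):
    """Return (supported, unsupported) lists based on file extension (case-insensitive)."""
    supported, unsupported = [], []
    for name in filenames:
        if Path(name).suffix.lower() in SUPPORTED_EXTS:
            supported.append(name)
        else:
            unsupported.append(name)
    return supported, unsupported

ok, skipped = split_by_support(["novel.EPUB", "notes.docx", "chapter1.txt"])
print(ok)       # ['novel.EPUB', 'chapter1.txt']
print(skipped)  # ['notes.docx']
```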

In [None]:
from pathlib import Path

data_dir = Path("/content/Novel_Writer/data/raw")
data_dir.mkdir(parents=True, exist_ok=True)

if USE_SAMPLE_DATA:
    # Create sample data so the pipeline has something to process
    if MODEL_CHOICE == "qwen3_chinese":
        sample_text = """
第1章 黎明之前

天还没有亮，整个村庄都笼罩在一片寂静之中。远处的山峦在薄雾中若隐若现，仿佛一幅淡墨山水画。
李明站在院子里，深深地吸了一口清晨的空气。今天是个特别的日子，他已经等了整整三年。

"你真的要走吗？"身后传来母亲苍老的声音。

李明没有回头，他知道如果回头，自己可能就再也走不了了。"妈，我会回来的。"

他的声音很轻，却在寂静的清晨显得格外清晰。母亲没有再说什么，只是默默地将一个包袱递到他手中。
包袱不重，但李明知道里面装着母亲所有的心意——几件换洗的衣裳，几个烙饼，还有父亲留下的那把短刀。

"路上小心。"母亲终于开口，声音有些颤抖。

李明点了点头，背起包袱，向村口走去。晨雾渐渐散开，东方的天际泛起了一抹鱼肚白。
他知道，从这一刻起，一切都将不同。

村口的老槐树下，站着一个人。那是张叔，村里的铁匠，也是李明的师父。
"小子，过来。"张叔的声音粗犷却不失温暖。

李明走上前去，张叔从身后拿出一个长条形的布包，递给他。
"这是我打了三个月的剑，虽然比不上名家之作，但也算是我的心血。带着它，路上好歹有个防身的。"

李明双手接过，感觉到剑身的重量和温度。他深深鞠了一躬："谢谢师父。"

张叔拍了拍他的肩膀："去吧，长安城在等着你。记住，不管遇到什么，都别忘了你是谁。"

第2章 远行

一路向西，李明走了整整七天。他穿过了无数个村庄和城镇，见识了各种各样的人和事。
有热情好客的农家，有精明狡猾的商人，也有孤独的旅人。每个人都有自己的故事，每个故事都让他对这个世界有了新的认识。

第三天的黄昏，他来到了一个叫做青石镇的地方。镇子不大，但街道整洁，店铺林立。
最引人注目的是镇中央的一座酒楼，名叫"醉仙居"，三层高的木楼在夕阳下散发着暖黄色的光芒。

"年轻人，来住店吗？"酒楼门口的小二热情地招呼道。

李明摸了摸怀里所剩不多的铜钱，犹豫了一下。这七天来，他大多睡在路边的破庙或者好心人家的柴房里。
一顿像样的饭菜和一张温暖的床，对他来说已经是奢侈品了。

"住一晚多少钱？"他问道。
"上房一两银子，普通间三百文，通铺一百文。"小二笑着回答。
"通铺吧。"李明走了进去。

酒楼里很热闹，各种各样的人聚在一起吃饭喝酒。李明找了个角落坐下，要了一碗面和一壶茶。
邻桌坐着几个江湖模样的人，正在大声地讨论着什么。

"你们听说了吗？长安城最近出了大事！"一个络腮胡子的大汉压低声音说道，但他的声音依然传遍了半个酒楼。
"什么大事？"同桌的一个瘦高个急忙问道。
"据说天机阁的阁主失踪了，整个武林都在找他。"络腮胡子神秘兮兮地说。

李明的耳朵动了动。天机阁，那是他父亲生前常常提起的地方。
"天机阁掌控着天下武林的情报，阁主失踪，意味着很多秘密可能会被泄露。"络腮胡子继续说道。
"各大门派都派了人去长安，明面上说是帮忙寻找，实际上嘛……"他意味深长地笑了笑。

李明默默地喝着面汤，心中却掀起了波澜。他去长安，本来只是为了完成父亲的遗愿。
但现在看来，长安城远比他想象的要复杂得多。

吃完饭，李明回到通铺。房间里已经躺了几个人，鼾声此起彼伏。
他找了个靠墙的位置躺下，将包袱垫在头下，张叔给的剑放在伸手可及的地方。

月光从窗户洒进来，照在他年轻而坚毅的脸上。明天，他就要继续赶路了。
长安，等着我。他在心中默念，然后闭上了眼睛。
"""
        for i in range(3):  # Repeat to make dataset larger
            (data_dir / f"sample_novel_{i+1}.txt").write_text(
                sample_text.replace("李明", ["李明", "王刚", "赵云"][i]),
                encoding="utf-8"
            )
    else:
        sample_text = """
Chapter 1: The Last Light

The old lighthouse stood at the edge of the world, or so it seemed to Thomas Gray.
For forty years he had climbed these stairs each evening, lit the great lamp, and watched
its beam sweep across the dark Atlantic waters. Forty years of storms and calms, of ships
saved and ships lost, of loneliness so profound it had become a kind of companion.

Tonight the wind howled like a wounded animal, throwing sheets of rain against the
windows with enough force to rattle the thick glass. Thomas paused on the landing, one
hand braced against the cold stone wall, and caught his breath. His knees weren't what
they used to be. None of him was what it used to be.

"One more night," he muttered to himself, a habit born of decades without anyone else
to talk to. "Just one more."

The lamp room at the top was warm despite the storm. Thomas had maintained the old
Fresnel lens with religious devotion — polishing the brass fittings until they gleamed,
keeping the clockwork mechanism oiled and true. The Coast Guard had wanted to automate
the light years ago, replace him with sensors and timers. He had fought them tooth and nail.

"A machine doesn't hear a ship in distress," he had argued. "A machine doesn't notice
when the fog rolls in different than usual."

They had let him stay. But tomorrow — tomorrow they were coming with their equipment.
The last manned lighthouse on the eastern seaboard would finally go dark. At least, his
version of it would.

Thomas struck the match and touched it to the wick. The flame caught, small at first,
then growing as the oil drew upward. He adjusted the mantle, watched the light bloom
and multiply through the precision-cut prisms until it became something powerful,
something that could reach across miles of angry ocean to tell a sailor: you are not alone.

He settled into his chair by the window and opened his logbook. The entries went back
decades, each one in his careful, slanting hand. Weather conditions. Visibility.
Ships observed. Incidents.

"November 17th," he wrote. "Wind northeast, 45 knots gusting to 60. Rain heavy.
Visibility poor. Seas rough, 15-foot swells."

He paused, pen hovering over the page. Then he added: "Final entry."

Chapter 2: The Storm

Sarah Chen had not planned to be at sea tonight. Nobody with any sense would have
chosen to sail through a November gale in a thirty-two-foot sloop. But plans, as her
grandmother used to say, were just stories you told yourself about a future that hadn't
happened yet.

The storm had come on fast, much faster than the forecast predicted. By the time the
first squall line hit, she was already past the point of no return — too far from the
harbor to turn back, too far from anywhere to find shelter.

So she did what sailors do: she shortened sail, lashed everything down, clipped her
harness to the jackline, and held on.

The waves came like mountains, black and frothing, lifting the little sloop to
impossible heights before dropping her into troughs so deep the wind disappeared.
Each time the boat climbed, Sarah's stomach fell. Each time it dropped, she braced
for the impact, the shuddering crash as the hull met the next wall of water.

Her navigation instruments had failed an hour ago — the GPS screen flickered once
and went dark, victim of a wave that had found its way below through a vent she
thought was sealed. Without electronics, she was sailing blind.

No. Not quite blind.

Through the rain, through the spray, through the chaos of wind and wave, she saw
it — a light. Sweeping across the water in a steady, ancient rhythm. The lighthouse.

"Thank God," she breathed, and for the first time in hours, she knew where she was.

The light was both salvation and warning. It told her the shore was near, which meant
rocks were near. She adjusted her course, bearing off to give the headland a wide berth.
The light swept past again — reliable, unwavering, indifferent to the storm.

Up in the lighthouse, Thomas Gray didn't know it yet, but his final watch was about
to become the most important one of his life.
"""
        for i in range(3):
            names = [("Thomas Gray", "Sarah Chen"), ("James Walker", "Maria Santos"), ("Robert Kim", "Elena Volkov")]
            text = sample_text.replace("Thomas Gray", names[i][0]).replace("Sarah Chen", names[i][1])
            (data_dir / f"sample_novel_{i+1}.txt").write_text(text, encoding="utf-8")

    print(f"Created sample data in {data_dir}:")
    for f in sorted(data_dir.iterdir()):
        if f.is_file():
            print(f"  {f.name} ({f.stat().st_size:,} bytes)")

else:
    # Upload your own novels
    from google.colab import files as colab_files
    print("Upload your novel files (.txt, .pdf, .epub, .html, .md, .mobi):")
    print("(Click 'Choose Files' button below)")
    uploaded = colab_files.upload()

    for name, content in uploaded.items():
        target = data_dir / name
        with open(target, "wb") as f:
            f.write(content)
        print(f"  Saved: {name} ({len(content):,} bytes)")

# Count regular files only, so subdirectories are not reported as novels
print(f"\nTotal files in {data_dir}: {sum(1 for f in data_dir.iterdir() if f.is_file())}")

---
## Step 4: Run Data Processing Pipeline

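This runs the full pipeline: **ingest** (multi-format) > **clean** > **format** (to JSONL) > **deduplicate** > **quality filter**. Ingestion is added only when the input folder contains formats that need conversion, and deduplication and quality filtering run only if `RUN_DEDUP` and `RUN_QUALITY_FILTER` are enabled in Step 1.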
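
The `chunk_size` and `overlap` values written to `config.yaml` in the next cell control how long texts are split into training entries. The sketch below shows how overlapping character windows typically work. It is an illustration under that assumption, not the Novel Writer implementation, which may split on sentence or paragraph boundaries instead; `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character windows where consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap gives each chunk some of the preceding context, at the cost of repeating a fraction of the text across entries.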

In [None]:
import yaml
from pathlib import Path

# Create config.yaml with our settings
config = {
    "data": {
        "input_dir": "data/raw",
        "output_dir": "data/processed",
        "temp_dir": "data/processed/temp_cleaned",
        "chunk_size": CHUNK_SIZE,
        "overlap": 500,
    },
    "log_level": "INFO",
}

with open("/content/Novel_Writer/config.yaml", "w") as f:
    yaml.dump(config, f, default_flow_style=False)

print("Config written. Running pipeline...\n")

# Build pipeline command
cmd = "novel-writer -v pipeline --clean"

# Check if any ingestable files exist
ingest_exts = {".pdf", ".epub", ".html", ".htm", ".md", ".mobi"}
has_ingestable = any(f.suffix.lower() in ingest_exts for f in Path("data/raw").iterdir() if f.is_file())
if has_ingestable:
    cmd += " --ingest"

if RUN_DEDUP:
    cmd += " --deduplicate"
if RUN_QUALITY_FILTER:
    cmd += " --filter"

print(f"Command: {cmd}\n")
!{cmd}

In [None]:
# Check pipeline output
import json
from pathlib import Path

# Find the final JSONL file
processed_dir = Path("data/processed")
jsonl_files = sorted(processed_dir.glob("*.jsonl"), key=lambda f: f.stat().st_mtime, reverse=True)

if not jsonl_files:
    print("ERROR: No JSONL files produced! Check pipeline output above.")
    print("Files in processed dir:")
    for f in processed_dir.iterdir():
        print(f"  {f.name}")
else:
    train_file = jsonl_files[0]
    with open(train_file, "r", encoding="utf-8") as f:
        lines = f.readlines()

    print(f"Training data: {train_file.name}")
    print(f"Total entries: {len(lines)}")

    if lines:
        sample = json.loads(lines[0])
        print(f"\nSample entry keys: {list(sample.keys())}")
        print(f"Output preview: {sample.get('output', '')[:200]}...")

    # Save path for training step
    TRAIN_FILE = str(train_file)
    print(f"\nUsing: {TRAIN_FILE}")

---
## Step 5: Load Model & Configure LoRA
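
For intuition about what `LORA_RANK` controls: LoRA adds two small matrices to each targeted linear layer, with shapes `r x d_in` and `d_out x r`, so a layer gains `r * (d_in + d_out)` trainable parameters. The low-rank update is scaled by `lora_alpha / r` (standard LoRA scaling, assuming rank-stabilized LoRA is not enabled). The layer dimensions in this sketch are illustrative placeholders, not the actual shapes of either supported model.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)

def lora_scaling(lora_alpha: int, rank: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return lora_alpha / rank

rank = 32
print(lora_param_count(4096, 4096, rank))  # 262144
print(lora_scaling(rank // 2, rank))       # 0.5
```

With `lora_alpha=LORA_RANK // 2`, as in the cell below, the scaling factor stays at 0.5 whichever rank is selected.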

In [None]:
from unsloth import FastLanguageModel
import torch

print(f"Loading model: {CFG['model_name']}")
print(f"Max sequence length: {MAX_SEQ_LENGTH}")
print(f"4-bit quantization: True\n")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=CFG["model_name"],
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,          # Auto-detect
    load_in_4bit=True,   # QLoRA
)

print(f"\nGPU memory after loading: {torch.cuda.memory_allocated() / 1024**3:.1f} GB")

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=LORA_RANK // 2,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
print(f"GPU memory with LoRA: {torch.cuda.memory_allocated() / 1024**3:.1f} GB")

---
## Step 6: Prepare Dataset for Training

In [None]:
from datasets import load_dataset

dataset = load_dataset("json", data_files=TRAIN_FILE, split="train")

# Train/validation split
if len(dataset) > 10:
    split = dataset.train_test_split(test_size=0.1, seed=42)
    train_dataset = split["train"]
    eval_dataset = split["test"]
else:
    train_dataset = dataset
    eval_dataset = None
    print("Dataset too small for validation split, training on all data.")

print(f"Training samples: {len(train_dataset)}")
if eval_dataset:
    print(f"Validation samples: {len(eval_dataset)}")

# Format based on model choice
if CFG["prompt_format"] == "chatml":
    # Qwen3 uses ChatML format
    template = """<|im_start|>system
{system}<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
{output}<|im_end|>"""
else:
    # Mistral uses [INST] format
    template = """<s>[INST] {system}

{instruction} [/INST]{output}</s>"""

def formatting_func(examples):
    instructions = examples["instruction"]
    outputs = examples["output"]
    texts = []
    for instruction, output in zip(instructions, outputs):
        text = template.format(
            system=CFG["system_prompt"],
            instruction=instruction,
            output=output,
        )
        texts.append(text)
    return {"text": texts}

train_dataset = train_dataset.map(formatting_func, batched=True)
if eval_dataset:
    eval_dataset = eval_dataset.map(formatting_func, batched=True)

print(f"\n--- Sample formatted entry ---")
print(train_dataset[0]["text"][:600])
print("...")

---
## Step 7: Train!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, EarlyStoppingCallback
from unsloth import is_bfloat16_supported

# Use Unsloth's check: recent PyTorch versions can report bf16 as supported
# through emulation on GPUs such as the T4, which lack native bf16.
use_bf16 = is_bfloat16_supported()

training_args = TrainingArguments(
    output_dir=f"checkpoints_{CFG['output_name']}",
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION,
    warmup_ratio=0.1,
    learning_rate=LEARNING_RATE,
    fp16=not use_bf16,
    bf16=use_bf16,
    logging_steps=5,
    save_strategy="steps" if eval_dataset else "epoch",
    save_steps=50 if eval_dataset else None,
    save_total_limit=3,
    seed=3407,
)

# Add eval settings if we have validation data
if eval_dataset:
    training_args.eval_strategy = "steps"
    training_args.eval_steps = 50
    training_args.load_best_model_at_end = True
    training_args.metric_for_best_model = "eval_loss"
    training_args.greater_is_better = False

callbacks = []
if eval_dataset:
    callbacks.append(EarlyStoppingCallback(early_stopping_patience=3))

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_num_proc=2,
    packing=False,
    args=training_args,
    callbacks=callbacks if callbacks else None,
)

print(f"Starting training: {NUM_EPOCHS} epochs, {len(train_dataset)} samples")
print(f"Effective batch size: {BATCH_SIZE * GRADIENT_ACCUMULATION}")
print(f"Estimated steps: {len(train_dataset) * NUM_EPOCHS // (BATCH_SIZE * GRADIENT_ACCUMULATION)}")
print("="*60)

stats = trainer.train()

print("="*60)
print(f"Training complete!")
print(f"  Total steps: {stats.global_step}")
print(f"  Training loss: {stats.training_loss:.4f}")
print(f"  Runtime: {stats.metrics['train_runtime']:.0f} seconds")

---
## Step 8: Test Generation

Let's see what the fine-tuned model can do!
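
The generation cell samples with `temperature` and `top_p`. The sketch below illustrates, on a toy distribution, how temperature rescales logits and how nucleus (top-p) filtering keeps the smallest set of tokens whose cumulative probability reaches `top_p`. It is a simplified pure-Python illustration with hypothetical function names, not the code path that `model.generate` uses.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

probs = softmax_with_temperature([2.0, 1.0, 0.1, -1.0], temperature=0.8)
print(top_p_filter(probs, top_p=0.9))  # token index -> renormalized probability
```

Lower `temperature` or `top_p` values make the output more conservative; higher values allow more varied but less predictable prose.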

In [None]:
FastLanguageModel.for_inference(model)

print(f"Generating with {CFG['model_name']}...\n")

for i, prompt in enumerate(CFG["test_prompts"]):
    # Build input based on model type
    if CFG["prompt_format"] == "chatml":
        messages = [
            {"role": "system", "content": CFG["system_prompt"]},
            {"role": "user", "content": prompt},
        ]
        inputs = tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
        ).to("cuda")
        input_len = inputs.shape[-1]
    else:
        full_prompt = f"<s>[INST] {CFG['system_prompt']}\n\n{prompt} [/INST]"
        encoded = tokenizer(full_prompt, return_tensors="pt").to("cuda")
        inputs = encoded["input_ids"]
        input_len = inputs.shape[-1]

    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=512,
        temperature=0.8,
        top_p=0.9,
        top_k=50,
        do_sample=True,
        repetition_penalty=1.1,
    )
    response = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True)

    print(f"{'='*60}")
    print(f"Prompt {i+1}: {prompt}")
    print(f"{'='*60}")
    print(response)
    print(f"[{len(response)} chars]\n")

---
## Step 9: Save & Download Model

In [None]:
output_name = CFG["output_name"]

# Save LoRA adapters
model.save_pretrained(output_name)
tokenizer.save_pretrained(output_name)
print(f"Model saved to {output_name}/")

# Show saved files
from pathlib import Path
total_size = 0
for f in sorted(Path(output_name).rglob("*")):
    if f.is_file():
        size = f.stat().st_size
        total_size += size
        print(f"  {f.name}: {size / 1024 / 1024:.1f} MB")
print(f"\nTotal size: {total_size / 1024 / 1024:.1f} MB")

In [None]:
# Download as zip
!zip -r {output_name}.zip {output_name}/

from google.colab import files as colab_files
print(f"Downloading {output_name}.zip ...")
colab_files.download(f"{output_name}.zip")

---
## Step 10 (Optional): Save to Google Drive

In [None]:
# Uncomment these lines to save to Google Drive

# from google.colab import drive
# drive.mount("/content/drive")
#
# import shutil
# drive_path = f"/content/drive/MyDrive/{output_name}"
# shutil.copytree(output_name, drive_path, dirs_exist_ok=True)
# print(f"Saved to Google Drive: {drive_path}")

---
## Step 11 (Optional): Export to GGUF for Local Use

Export your model to GGUF format for running locally with **Ollama** or **llama.cpp**.
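
Ollama loads a local GGUF file through a Modelfile whose `FROM` line points at the file. A minimal sketch of writing one is shown here; the file name `model.gguf` and the model name `novel-writer` are placeholders, so substitute the actual exported file.

```python
from pathlib import Path

def write_modelfile(gguf_path: str, modelfile_path: str = "Modelfile") -> str:
    """Write a minimal Ollama Modelfile that points at a local GGUF file."""
    content = f"FROM {gguf_path}\n"
    Path(modelfile_path).write_text(content, encoding="utf-8")
    return content

print(write_modelfile("./model.gguf"))
# Then, in a terminal (placeholder model name):
#   ollama create novel-writer -f Modelfile
#   ollama run novel-writer
```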

In [None]:
# Uncomment to export to GGUF (takes ~10-15 minutes)

# gguf_name = f"{output_name}_gguf"
# model.save_pretrained_gguf(
#     gguf_name,
#     tokenizer,
#     quantization_method="q4_k_m",  # Good balance of quality vs size
# )
#
# from google.colab import files as colab_files
# gguf_file = list(Path(gguf_name).glob("*.gguf"))[0]
# colab_files.download(str(gguf_file))
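# print("GGUF exported! To run it locally with Ollama, create a file named Modelfile containing:")
# print(f"  FROM ./{gguf_file.name}")
# print("then run (the model name is your choice): ollama create novel-writer -f Modelfile && ollama run novel-writer")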

---

## Done!

Your fine-tuned model has been saved. To use it locally with the Novel Writer CLI:

```bash
# Unzip your downloaded model
unzip qwen3_chinese_novel_lora.zip  # or nemo_english_story_lora.zip

# Generate text
novel-writer generate --prompt "Your prompt here..." --model qwen3_chinese_novel_lora
```

Or start the API server:
```bash
python -m novel_writer.api
# POST to http://localhost:8000/generate
```