# Novel Writer - Complete Pipeline & Training

This notebook runs **everything** end-to-end on Google Colab:

1. Clone repo & install dependencies
2. Upload your novels (or use built-in sample data)
3. Run the full data processing pipeline
4. Fine-tune your chosen model
5. Generate sample text
6. Download your trained model

### Supported Models

| Model | Params | Best For | Min GPU | Free Tier? |
|-------|--------|----------|---------|------------|
| Qwen3-4B | 4B | Chinese (lightweight) | 8GB (T4) | Yes |
| Qwen3-8B | 8B | Chinese | 12GB (T4) | Yes |
| Llama 3.1 8B | 8B | English | 12GB (T4) | Yes |
| Gemma 2 9B | 9B | English | 12GB (T4) | Yes |
| Mistral Nemo 12B | 12B | English creative writing | 12GB (T4) | Yes |
| Phi-4 14B | 14B | English (reasoning + writing) | 24GB (L4/A10) | Kaggle 2xT4 |
| Qwen3-14B | 14B | Chinese + English | 24GB (L4/A10) | Kaggle 2xT4 |
| Qwen3-32B | 32B | Chinese + English (best quality) | 40GB (A100) | No |

**Requirements:** Google Colab with T4 GPU (free tier works for 4B-12B models)
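
To make the table easier to apply, here is a minimal sketch that lists which model keys fit a given amount of VRAM. The `MIN_VRAM_GB` mapping copies the minimum-VRAM column above, keyed by the names used in the configuration cell. The helper `models_that_fit` is a hypothetical name for illustration and is not part of the notebook.

```python
# Minimum VRAM (GB) per model key, copied from the "Min GPU" column above.
MIN_VRAM_GB = {
    "qwen3_4b": 8,
    "qwen3_8b": 12,
    "llama31_8b": 12,
    "gemma2_9b": 12,
    "mistral_nemo_12b": 12,
    "phi4_14b": 24,
    "qwen3_14b": 24,
    "qwen3_32b": 40,
}


def models_that_fit(vram_gb):
    """Return model keys whose minimum VRAM is at most vram_gb (hypothetical helper)."""
    return [name for name, need in MIN_VRAM_GB.items() if need <= vram_gb]


# A free-tier T4 has roughly 16 GB of VRAM.
print(models_that_fit(16))
```

These figures are minimums for 4-bit LoRA training as listed in the table; longer sequence lengths or larger batch sizes can push actual usage higher.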

---

### How to use
1. Pick your model in **Cell 1** below
2. **Runtime > Run all**
3. When prompted, upload your novel files (or skip to use sample data)
4. Wait for training to complete (~1-3 hours depending on data size)
5. Download your LoRA adapters at the end

---
## Step 1: Configuration

**Change these settings before running!**

In [None]:
#@title Configuration { display-mode: "form" }

#@markdown ### Model Selection
#@markdown > **Free tier (T4):** qwen3_4b, qwen3_8b, llama31_8b, gemma2_9b, mistral_nemo_12b
#@markdown >
#@markdown > **Paid/Kaggle (L4+):** phi4_14b, qwen3_14b
#@markdown >
#@markdown > **A100 only:** qwen3_32b
MODEL_CHOICE = "qwen3_8b" #@param ["qwen3_4b", "qwen3_8b", "llama31_8b", "gemma2_9b", "mistral_nemo_12b", "phi4_14b", "qwen3_14b", "qwen3_32b"]

#@markdown ### Data Upload Mode
#@markdown > **upload_jsonl** = Upload a ready-made `train.jsonl` (skips pipeline, fastest)
#@markdown >
#@markdown > **upload_raw** = Upload raw novel files (.txt/.epub/etc), pipeline runs on Colab
#@markdown >
#@markdown > **sample_data** = Use built-in sample text (for testing)
UPLOAD_MODE = "upload_jsonl" #@param ["upload_jsonl", "upload_raw", "sample_data"]

#@markdown ### Training Settings
NUM_EPOCHS = 2 #@param {type:"slider", min:1, max:5, step:1}
LEARNING_RATE = 2e-4 #@param {type:"number"}
MAX_SEQ_LENGTH = 4096 #@param [2048, 4096, 8192] {type:"raw"}
LORA_RANK = 32 #@param [8, 16, 32, 64] {type:"raw"}
BATCH_SIZE = 2 #@param [1, 2, 4] {type:"raw"}
GRADIENT_ACCUMULATION = 4 #@param [2, 4, 8] {type:"raw"}

#@markdown ### Advanced Training Settings
#@markdown > **NEFTune** adds random noise to the token embeddings during training. The NEFTune
#@markdown > paper (2023) reports better generation quality after instruction fine-tuning; the gain
#@markdown > on fiction data may vary, so compare runs. Set to 0 to disable.
NEFTUNE_ALPHA = 5 #@param {type:"number"}
WEIGHT_DECAY = 0.01 #@param {type:"number"}

#@markdown ### Pipeline Settings (only used with upload_raw)
CHUNK_SIZE = 4000 #@param {type:"integer"}
RUN_DEDUP = True #@param {type:"boolean"}
RUN_QUALITY_FILTER = True #@param {type:"boolean"}

#@markdown ---

# ===== System prompts =====
_ZH_SYSTEM = (
    '你是一位经验丰富的中文小说作家，擅长构建沉浸式的叙事场景。'
    '请根据给定的上下文续写故事，要求：\n'
    '1. 保持与原文一致的叙事视角和文风\n'
    '2. 通过具体的动作、对话和环境描写推动情节发展\n'
    '3. 角色的言行应符合其性格特征和当前情境\n'
    '4. 善用感官细节（视觉、听觉、触觉、嗅觉）营造氛围\n'
    '5. 对话要自然生动，符合角色身份和说话习惯\n'
    '6. 避免空洞的心理独白，用行动和细节展现人物内心'
)

_EN_SYSTEM = (
    'You are an accomplished fiction author with a gift for immersive storytelling. '
    'Continue the narrative following these principles:\n'
    '1. Maintain the established point of view, voice, and tonal register\n'
    '2. Advance the plot through concrete action, dialogue, and environmental detail\n'
    '3. Show character emotion through behavior, body language, and subtext — not exposition\n'
    '4. Engage multiple senses (sight, sound, touch, smell, taste) to ground scenes\n'
    '5. Write dialogue that reveals character, creates tension, and sounds natural\n'
    '6. Vary sentence rhythm — mix short punchy lines with longer flowing passages'
)

# ===== Diverse instruction pools for training data =====
_ZH_INSTRUCTIONS = [
    '续写这段叙事，保持原文的风格和节奏。',
    '以相同的文风继续这个故事。',
    '根据已有的情节和人物设定，续写下一段。',
    '保持叙事视角不变，继续推进故事发展。',
    '用生动的细节描写续写这个场景。',
    '通过对话和动作描写推进下面的情节。',
    '延续当前的叙事氛围，写出接下来发生的事。',
    '以细腻的笔触续写这段文字。',
    '按照原文的叙事节奏，写出故事的下一部分。',
    '继续描绘这个场景中的人物和事件。',
    '用符合原文风格的语言续写故事。',
    '展开叙述，让故事自然地向前发展。',
    '保持文风一致，续写接下来的情节。',
    '以沉浸式的叙事方式继续这段故事。',
    '描绘接下来的场景，注意环境和人物的刻画。',
    '用简洁有力的文字续写这段叙事。',
    '继续讲述这个故事，注意情感的表达。',
    '以自然流畅的文笔续写下一段。',
    '延续原文的基调，推进故事走向。',
    '用丰富的感官描写续写这个场景。',
]

_EN_INSTRUCTIONS = [
    'Continue the narrative in the established style.',
    'Write the next passage, maintaining the existing voice and tone.',
    'Advance the story using vivid sensory details.',
    'Continue this scene with natural dialogue and action.',
    'Extend the narrative, preserving the point of view and pacing.',
    'Write what happens next, staying true to the characters.',
    'Continue the story with concrete, immersive description.',
    'Carry the narrative forward in the same literary register.',
    'Write the next segment, matching the established rhythm.',
    'Develop this scene further with authentic detail.',
    'Push the story forward through action and dialogue.',
    'Continue in the same voice, advancing the plot naturally.',
    'Write the following passage in the style of the preceding text.',
    'Extend this scene with attention to atmosphere and character.',
    'Continue the narrative arc with engaging prose.',
    'Write what comes next, maintaining tension and pacing.',
    'Advance the story, weaving in environmental detail.',
    'Continue with prose that matches the tone and texture of the original.',
    'Develop the next beat of the story with precise language.',
    'Carry the scene forward, balancing action with description.',
]

# ===== Test prompts =====
# NOTE: Use single quotes for strings containing Chinese quotation marks
_ZH_PROMPTS = [
    '续写以下场景：\n\n暴雨如注，李明浑身湿透地站在破庙门口。庙里的火堆旁，一个蒙面人正用匕首削着木棍。两人目光相遇的瞬间，空气仿佛凝固了。\n\n请从李明的视角续写这个紧张的对峙场景，注意环境描写和人物心理。',
    '以下是一段武侠小说的开头，请续写：\n\n月色如霜，照在悬崖边两道对峙的身影上。左边那人白衣胜雪，手中长剑微微颤动；右边那人一袭黑袍，双手背在身后，嘴角挂着一抹冷笑。\n\n\u201c三年了，\u201d白衣人开口，声音像是从牙缝里挤出来的，\u201c你终于肯现身了。\u201d\n\n续写这场决斗，要有招式描写和心理活动。',
    '请用细腻的笔触描写以下场景：\n\n清晨的江南小镇刚刚苏醒。青石板路上还残留着昨夜的雨水，空气中弥漫着桂花和早点铺子里蒸笼的气息。一个背着书箱的年轻书生走过石桥，桥下有渔翁在收网。\n\n注意五感描写，营造宁静温暖的氛围。',
]
_EN_PROMPTS = [
    'Continue this scene from the lighthouse keeper\'s perspective:\n\nThe storm hit at midnight. Thomas pressed his face to the glass and watched the beam sweep across walls of black water. Then he saw it \u2014 a flare, red and desperate, arcing up from somewhere beyond the reef.\n\nHe reached for the radio. Dead. The antenna had gone in the last gust.\n\nWrite the next 300 words. Focus on his decision-making, the physical environment, and building tension.',
    'Continue this dialogue-driven scene:\n\nThe cafe was nearly empty. Rain streaked the windows, blurring the Paris streetlights into watercolor smears. Elena stirred her coffee for the third time without drinking it.\n\n"You\'re not here for the coffee," said the man across from her. He hadn\'t touched his either.\n\n"And you\'re not here by accident," she replied.\n\nHe smiled \u2014 not warmly. "I know what you did in Lyon."\n\nContinue with tension-building dialogue. Reveal character through speech patterns and subtext, not exposition.',
    'Write the next scene:\n\nAfter ten years of war, Commander Asha Renn walked through what remained of the village gate. The wooden arch was gone \u2014 burned, she guessed, years ago. Where her mother\'s garden had been, there was a blacksmith\'s forge. A child she didn\'t recognize stared at her scarred face with wide eyes.\n\n"Are you a soldier?" the child asked.\n\nContinue from Asha\'s perspective. Balance external observation with internal emotion. Use specific sensory details to show how the village has changed.',
]

MODEL_CONFIGS = {
    # ---- Free tier models (T4 / 12-16 GB VRAM) ----
    'qwen3_4b': {
        'model_name': 'unsloth/Qwen3-4B',
        'output_name': 'qwen3_4b_novel_lora',
        'system_prompt': _ZH_SYSTEM,
        'test_prompts': _ZH_PROMPTS,
        'lang': 'zh',
        'min_vram_gb': 8,
    },
    'qwen3_8b': {
        'model_name': 'unsloth/Qwen3-8B',
        'output_name': 'qwen3_8b_novel_lora',
        'system_prompt': _ZH_SYSTEM,
        'test_prompts': _ZH_PROMPTS,
        'lang': 'zh',
        'min_vram_gb': 12,
    },
    'llama31_8b': {
        'model_name': 'unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit',
        'output_name': 'llama31_8b_novel_lora',
        'system_prompt': _EN_SYSTEM,
        'test_prompts': _EN_PROMPTS,
        'lang': 'en',
        'min_vram_gb': 12,
    },
    'gemma2_9b': {
        'model_name': 'unsloth/gemma-2-9b-it-bnb-4bit',
        'output_name': 'gemma2_9b_novel_lora',
        'system_prompt': _EN_SYSTEM,
        'test_prompts': _EN_PROMPTS,
        'lang': 'en',
        'min_vram_gb': 12,
    },
    'mistral_nemo_12b': {
        'model_name': 'unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit',
        'output_name': 'mistral_nemo_12b_novel_lora',
        'system_prompt': _EN_SYSTEM,
        'test_prompts': _EN_PROMPTS,
        'lang': 'en',
        'min_vram_gb': 12,
    },
    # ---- Larger models (L4/A10 / 24+ GB VRAM) ----
    'phi4_14b': {
        'model_name': 'unsloth/Phi-4-bnb-4bit',
        'output_name': 'phi4_14b_novel_lora',
        'system_prompt': _EN_SYSTEM,
        'test_prompts': _EN_PROMPTS,
        'lang': 'en',
        'min_vram_gb': 24,
    },
    'qwen3_14b': {
        'model_name': 'unsloth/Qwen3-14B',
        'output_name': 'qwen3_14b_novel_lora',
        'system_prompt': _ZH_SYSTEM,
        'test_prompts': _ZH_PROMPTS,
        'lang': 'zh',
        'min_vram_gb': 24,
    },
    # ---- A100 models (40+ GB VRAM) ----
    'qwen3_32b': {
        'model_name': 'unsloth/Qwen3-32B',
        'output_name': 'qwen3_32b_novel_lora',
        'system_prompt': _ZH_SYSTEM,
        'test_prompts': _ZH_PROMPTS,
        'lang': 'zh',
        'min_vram_gb': 40,
    },
}

CFG = MODEL_CONFIGS[MODEL_CHOICE]
print(f"Model: {CFG['model_name']}")
print(f"Language: {'Chinese' if CFG['lang'] == 'zh' else 'English'}")
print(f"Min VRAM: {CFG['min_vram_gb']} GB")
print(f"Upload mode: {UPLOAD_MODE}")
print(f"Output: {CFG['output_name']}")
print(f"Epochs: {NUM_EPOCHS}, LR: {LEARNING_RATE}, Seq len: {MAX_SEQ_LENGTH}")
print(f"LoRA rank: {LORA_RANK}, Batch: {BATCH_SIZE}, Grad accum: {GRADIENT_ACCUMULATION}")
print(f"NEFTune alpha: {NEFTUNE_ALPHA} {'(enabled)' if NEFTUNE_ALPHA > 0 else '(disabled)'}")
print(f"Weight decay: {WEIGHT_DECAY}")

# VRAM warning (checks the first GPU; multi-GPU runtimes such as Kaggle 2xT4 print one line per GPU)
import subprocess
try:
    result = subprocess.run(['nvidia-smi', '--query-gpu=memory.total', '--format=csv,noheader,nounits'],
                          capture_output=True, text=True)
    gpu_vram = int(result.stdout.strip().splitlines()[0]) / 1024
    if gpu_vram < CFG['min_vram_gb']:
        print(f'\nWARNING: Your GPU has ~{gpu_vram:.0f} GB VRAM but {MODEL_CHOICE} needs {CFG["min_vram_gb"]} GB.')
        print(f'   Consider using a smaller model or upgrading your Colab runtime.')
    else:
        print(f'\nGPU VRAM: {gpu_vram:.0f} GB (requirement: {CFG["min_vram_gb"]} GB)')
except Exception as e:
    # nvidia-smi missing or unparsable output: report it instead of failing silently
    print(f'\nCould not query GPU VRAM ({e}); skipping the VRAM check.')

---
## Step 2: Setup Environment

In [None]:
%%capture
# Install Unsloth (2x faster training, 70% less VRAM)
!pip install unsloth
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

# Clone Novel Writer repo
!rm -rf /content/Novel_Writer
!git clone https://github.com/LL-LLLu/Novel_Writer.git /content/Novel_Writer

# Install Novel Writer
%cd /content/Novel_Writer
!pip install -e .

In [None]:
# Verify installation
import torch
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
print()

!novel-writer --help | head -20
print("\nSetup complete!")

---
## Step 3: Upload Data

**Three modes** (set `UPLOAD_MODE` in Step 1):

| Mode | What to upload | Pipeline runs? |
|------|---------------|----------------|
| `upload_jsonl` | Your `train.jsonl` from local pipeline | No (fastest) |
| `upload_raw` | Raw novel files (.txt, .epub, .html, etc.) | Yes |
| `sample_data` | Nothing - uses built-in sample text | Yes |
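
Since `upload_jsonl` skips the pipeline, it can help to check the file locally before uploading. The sketch below assumes each line is a JSON object with `instruction` and `output` fields, which are the two fields the later training cells read. The helper name `check_jsonl` is hypothetical.

```python
import json


def check_jsonl(path, required=("instruction", "output")):
    """Return (valid_count, problems); problems is a list of (line_number, message)."""
    valid, problems = 0, []
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((lineno, f"invalid JSON: {e.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "not a JSON object"))
                continue
            missing = [key for key in required if key not in record]
            if missing:
                problems.append((lineno, f"missing keys: {missing}"))
            else:
                valid += 1
    return valid, problems
```

Calling `check_jsonl("data/processed/train.jsonl")` on your machine returns the number of usable entries and a list of line numbers to fix.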

In [None]:
import os, json
from pathlib import Path

# Ensure CWD is valid (the setup cell's rm -rf and re-clone can invalidate it)
os.chdir('/content/Novel_Writer')

data_dir = Path('/content/Novel_Writer/data/raw')
processed_dir = Path('/content/Novel_Writer/data/processed')
data_dir.mkdir(parents=True, exist_ok=True)
processed_dir.mkdir(parents=True, exist_ok=True)

SKIP_PIPELINE = False  # Will be set to True if user uploads train.jsonl directly

if UPLOAD_MODE == 'upload_jsonl':
    # ====== FASTEST: Upload pre-processed train.jsonl directly ======
    from google.colab import files as colab_files
    print('Upload your train.jsonl file:')
    print('(This is the file from: data/processed/train.jsonl)')
    print()

    # Upload into the processed directory directly
    os.chdir(str(processed_dir))
    uploaded = colab_files.upload()
    os.chdir('/content/Novel_Writer')

    target = processed_dir / 'train.jsonl'
    if not uploaded:
        raise ValueError('No file was uploaded. Re-run this cell and choose your train.jsonl.')
    # Use the first uploaded file; if it has a different name, rename it to train.jsonl
    name = next(iter(uploaded))
    src = processed_dir / name
    if src != target and src.exists():
        src.replace(target)

    # Validate
    with open(target, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    if not lines:
        raise ValueError('Uploaded train.jsonl is empty.')
    sample = json.loads(lines[0])
    print(f'\nUploaded -> {target}')
    print(f'  Entries: {len(lines)}')
    print(f'  Keys: {list(sample.keys())}')
    print(f'  Preview: {sample.get("output", "")[:100]}...')

    TRAIN_FILE = str(target)
    SKIP_PIPELINE = True
    print(f'\nPipeline will be SKIPPED (data already processed)')

elif UPLOAD_MODE == 'upload_raw':
    # ====== Upload raw novel files, pipeline runs on Colab ======
    from google.colab import files as colab_files
    print('Upload your novel files (.txt, .pdf, .epub, .html, .md, .mobi):')
    print('(Click "Choose Files" button below)')
    print()

    os.chdir(str(data_dir))
    uploaded = colab_files.upload()
    os.chdir('/content/Novel_Writer')

    for name in uploaded:
        size = (data_dir / name).stat().st_size
        print(f'  Saved: {name} ({size:,} bytes)')

    print(f'\nTotal files: {len(list(data_dir.iterdir()))}')
    print('Pipeline will process these into training data.')

elif UPLOAD_MODE == 'sample_data':
    # ====== Built-in sample data for testing ======
    if CFG['lang'] == 'zh':
        sample_text = (
            '\n\u7b2c1\u7ae0 \u9ece\u660e\u4e4b\u524d\n\n'
            '\u5929\u8fd8\u6ca1\u6709\u4eae\uff0c\u6574\u4e2a\u6751\u5e84\u90fd\u7b3c\u7f69\u5728\u4e00\u7247\u5bc2\u9759\u4e4b\u4e2d\u3002'
            '\u8fdc\u5904\u7684\u5c71\u5ce6\u5728\u8584\u96fe\u4e2d\u82e5\u9690\u82e5\u73b0\uff0c\u4eff\u4f5b\u4e00\u5e45\u6de1\u58a8\u5c71\u6c34\u753b\u3002\n'
            '\u674e\u660e\u7ad9\u5728\u9662\u5b50\u91cc\uff0c\u6df1\u6df1\u5730\u5438\u4e86\u4e00\u53e3\u6e05\u6668\u7684\u7a7a\u6c14\u3002'
            '\u4eca\u5929\u662f\u4e2a\u7279\u522b\u7684\u65e5\u5b50\uff0c\u4ed6\u5df2\u7ecf\u7b49\u4e86\u6574\u6574\u4e09\u5e74\u3002\n\n'
            '\u201c\u4f60\u771f\u7684\u8981\u8d70\u5417\uff1f\u201d\u8eab\u540e\u4f20\u6765\u6bcd\u4eb2\u82cd\u8001\u7684\u58f0\u97f3\u3002\n\n'
            '\u674e\u660e\u6ca1\u6709\u56de\u5934\uff0c\u4ed6\u77e5\u9053\u5982\u679c\u56de\u5934\uff0c'
            '\u81ea\u5df1\u53ef\u80fd\u5c31\u518d\u4e5f\u8d70\u4e0d\u4e86\u4e86\u3002'
            '\u201c\u5988\uff0c\u6211\u4f1a\u56de\u6765\u7684\u3002\u201d\n\n'
            '\u4ed6\u7684\u58f0\u97f3\u5f88\u8f7b\uff0c\u5374\u5728\u5bc2\u9759\u7684\u6e05\u6668\u663e\u5f97\u683c\u5916\u6e05\u6670\u3002'
            '\u6bcd\u4eb2\u6ca1\u6709\u518d\u8bf4\u4ec0\u4e48\uff0c\u53ea\u662f\u9ed8\u9ed8\u5730\u5c06\u4e00\u4e2a\u5305\u88f1\u9012\u5230\u4ed6\u624b\u4e2d\u3002\n'
            '\u5305\u88f1\u4e0d\u91cd\uff0c\u4f46\u674e\u660e\u77e5\u9053\u91cc\u9762\u88c5\u7740\u6bcd\u4eb2\u6240\u6709\u7684\u5fc3\u610f\u2014\u2014'
            '\u51e0\u4ef6\u6362\u6d17\u7684\u8863\u88f3\uff0c\u51e0\u4e2a\u70d9\u997c\uff0c'
            '\u8fd8\u6709\u7236\u4eb2\u7559\u4e0b\u7684\u90a3\u628a\u77ed\u5200\u3002\n\n'
            '\u201c\u8def\u4e0a\u5c0f\u5fc3\u3002\u201d\u6bcd\u4eb2\u7ec8\u4e8e\u5f00\u53e3\uff0c\u58f0\u97f3\u6709\u4e9b\u98a4\u6296\u3002\n\n'
            '\u674e\u660e\u70b9\u4e86\u70b9\u5934\uff0c\u80cc\u8d77\u5305\u88f1\uff0c\u5411\u6751\u53e3\u8d70\u53bb\u3002'
            '\u6668\u96fe\u6e10\u6e10\u6563\u5f00\uff0c\u4e1c\u65b9\u7684\u5929\u9645\u6cdb\u8d77\u4e86\u4e00\u62b9\u9c7c\u809a\u767d\u3002\n'
            '\u4ed6\u77e5\u9053\uff0c\u4ece\u8fd9\u4e00\u523b\u8d77\uff0c\u4e00\u5207\u90fd\u5c06\u4e0d\u540c\u3002\n'
        )
        for i in range(3):
            names = ['\u674e\u660e', '\u738b\u521a', '\u8d75\u4e91']
            (data_dir / f'sample_novel_{i+1}.txt').write_text(
                sample_text.replace('\u674e\u660e', names[i]),
                encoding='utf-8'
            )
    else:
        sample_text = (
            '\nChapter 1: The Last Light\n\n'
            'The old lighthouse stood at the edge of the world, or so it seemed to Thomas Gray.\n'
            'For forty years he had climbed these stairs each evening, lit the great lamp, and watched\n'
            'its beam sweep across the dark Atlantic waters.\n\n'
            '"One more night," he muttered to himself, a habit born of decades without anyone else\n'
            'to talk to. "Just one more."\n\n'
            'The lamp room at the top was warm despite the storm. Thomas had maintained the old\n'
            'Fresnel lens with religious devotion. The Coast Guard had wanted to automate the light\n'
            'years ago, replace him with sensors and timers. He had fought them tooth and nail.\n\n'
            'Thomas struck the match and touched it to the wick. The flame caught, small at first,\n'
            'then growing as the oil drew upward. He watched the light bloom and multiply through\n'
            'the precision-cut prisms until it became something powerful, something that could reach\n'
            'across miles of angry ocean to tell a sailor: you are not alone.\n\n'
            'He settled into his chair and opened his logbook. "November 17th," he wrote. "Wind\n'
            'northeast, 45 knots gusting to 60. Rain heavy. Visibility poor." He paused, pen\n'
            'hovering over the page. Then he added: "Final entry."\n\n'
            'Chapter 2: The Storm\n\n'
            'Sarah Chen had not planned to be at sea tonight. The storm had come on fast, much\n'
            'faster than the forecast predicted. So she did what sailors do: she shortened sail,\n'
            'lashed everything down, clipped her harness to the jackline, and held on.\n\n'
            'Through the rain, through the spray, through the chaos of wind and wave, she saw\n'
            'it \u2014 a light. Sweeping across the water in a steady, ancient rhythm. The lighthouse.\n\n'
            '"Thank God," she breathed, and for the first time in hours, she knew where she was.\n'
        )
        for i in range(3):
            names = [('Thomas Gray', 'Sarah Chen'), ('James Walker', 'Maria Santos'), ('Robert Kim', 'Elena Volkov')]
            text = sample_text.replace('Thomas Gray', names[i][0]).replace('Sarah Chen', names[i][1])
            (data_dir / f'sample_novel_{i+1}.txt').write_text(text, encoding='utf-8')

    print(f'Created sample data in {data_dir}:')
    for f in sorted(data_dir.iterdir()):
        if f.is_file():
            print(f'  {f.name} ({f.stat().st_size:,} bytes)')

print(f'\nSkip pipeline: {SKIP_PIPELINE}')

---
## Step 4: Run Data Processing Pipeline

Runs the full pipeline: **clean** > **format** (to JSONL) > **deduplicate** > **quality filter**

*This step is automatically skipped if you uploaded `train.jsonl` directly.*
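
The pipeline config in the next cell sets `chunk_size` (4000 by default) and an `overlap` of 500. As a rough illustration of what those two numbers mean, here is a character-window chunker. This is an assumption about the general technique, not the repo's implementation, which may split on sentence or paragraph boundaries instead. `chunk_text` is a hypothetical name.

```python
def chunk_text(text, chunk_size=4000, overlap=500):
    """Split text into character windows of chunk_size that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each new window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = chunk_text("x" * 10000)
print([len(c) for c in chunks])  # [4000, 4000, 3000]
```

The overlap means the last 500 characters of one chunk reappear at the start of the next, so a training example does not begin abruptly at a chunk boundary.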

In [None]:
if SKIP_PIPELINE:
    print("✓ Skipping pipeline (train.jsonl was uploaded directly)")
    print(f"  Using: {TRAIN_FILE}")
else:
    import yaml
    from pathlib import Path

    config = {
        "data": {
            "input_dir": "data/raw",
            "output_dir": "data/processed",
            "temp_dir": "data/processed/temp_cleaned",
            "chunk_size": CHUNK_SIZE,
            "overlap": 500,
        },
        "log_level": "INFO",
    }

    with open("/content/Novel_Writer/config.yaml", "w") as f:
        yaml.dump(config, f, default_flow_style=False)

    print("Config written. Running pipeline...\n")

    cmd = "novel-writer -v pipeline --clean"

    ingest_exts = {".epub", ".html", ".htm", ".md", ".mobi"}
    has_ingestable = any(f.suffix.lower() in ingest_exts for f in Path("data/raw").iterdir() if f.is_file())
    if has_ingestable:
        cmd += " --ingest"
    if RUN_DEDUP:
        cmd += " --deduplicate"
    if RUN_QUALITY_FILTER:
        cmd += " --filter"

    print(f"Command: {cmd}\n")
    !{cmd}

In [None]:
import json
from pathlib import Path

if not SKIP_PIPELINE:
    # Find the final JSONL file produced by pipeline
    processed_dir = Path("data/processed")
    jsonl_files = sorted(processed_dir.glob("*.jsonl"), key=lambda f: f.stat().st_mtime, reverse=True)

    if not jsonl_files:
        raise FileNotFoundError("No JSONL files produced! Check pipeline output above.")

    TRAIN_FILE = str(jsonl_files[0])

# Validate the training file
with open(TRAIN_FILE, "r", encoding="utf-8") as f:
    lines = f.readlines()

print(f"Training data: {TRAIN_FILE}")
print(f"Total entries: {len(lines)}")

if lines:
    sample = json.loads(lines[0])
    print(f"Keys: {list(sample.keys())}")
    print(f"Output preview: {sample.get('output', '')[:200]}...")
else:
    raise ValueError("Training file is empty!")

---
## Step 5: Load Model & Configure LoRA
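
The LoRA cell below uses `r=LORA_RANK` and `lora_alpha=LORA_RANK // 2`. The arithmetic behind those two settings can be sketched as follows, assuming standard LoRA (not rsLoRA), where the low-rank update is scaled by `alpha / r`. The 4096 layer width is an illustrative assumption, not a figure taken from any of the listed models.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters one LoRA adapter adds to a linear layer of shape d_out x d_in."""
    # A has shape (r, d_in) and B has shape (d_out, r), so the total is r * (d_in + d_out).
    return r * (d_in + d_out)


def lora_scaling(alpha, r):
    """Scale applied to the low-rank update B @ A in standard LoRA."""
    return alpha / r


r = 32  # the notebook's default LORA_RANK
print(lora_params(4096, 4096, r))  # 262144 extra parameters for one 4096x4096 projection
print(lora_scaling(r // 2, r))     # 0.5, because lora_alpha is set to half the rank
```

Doubling the rank doubles the adapter's parameter count per layer, while the `alpha / r` ratio stays at 0.5 under the notebook's setting.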

In [None]:
from unsloth import FastLanguageModel
import torch

print(f"Loading model: {CFG['model_name']}")
print(f"Max sequence length: {MAX_SEQ_LENGTH}")
print(f"4-bit quantization: True\n")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=CFG["model_name"],
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,          # Auto-detect
    load_in_4bit=True,   # QLoRA
)

print(f"\nGPU memory after loading: {torch.cuda.memory_allocated() / 1024**3:.1f} GB")

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=LORA_RANK // 2,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
print(f"GPU memory with LoRA: {torch.cuda.memory_allocated() / 1024**3:.1f} GB")

---
## Step 6: Prepare Dataset for Training
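
The formatting cell below falls back to a message list without a system role when a tokenizer's chat template rejects one. The transformation it applies can be sketched in isolation like this; `merge_system_into_user` is a hypothetical helper and does not call any tokenizer.

```python
def merge_system_into_user(messages):
    """Fold a leading system message into the first user message (hypothetical helper)."""
    if not messages or messages[0]["role"] != "system":
        return list(messages)  # nothing to merge
    system, rest = messages[0], list(messages[1:])
    for i, msg in enumerate(rest):
        if msg["role"] == "user":
            # Same joining rule as the notebook's fallback: system text, blank line, user text.
            rest[i] = {"role": "user", "content": system["content"] + "\n\n" + msg["content"]}
            break
    return rest


example = [
    {"role": "system", "content": "You are a novelist."},
    {"role": "user", "content": "Continue the story."},
    {"role": "assistant", "content": "The rain had stopped."},
]
print(merge_system_into_user(example))
```

Note that if the list contains no user message, this sketch simply drops the system message.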

In [None]:
import random
from datasets import load_dataset

dataset = load_dataset("json", data_files=TRAIN_FILE, split="train")

# Diversify instructions if they're all the same (common with pipeline-generated data)
unique_instructions = set(dataset["instruction"])
if len(unique_instructions) <= 2:
    print(f"Found only {len(unique_instructions)} unique instruction(s) - diversifying...")
    instruction_pool = _ZH_INSTRUCTIONS if CFG["lang"] == "zh" else _EN_INSTRUCTIONS

    def diversify_instructions(examples):
        new_instructions = [random.choice(instruction_pool) for _ in examples["instruction"]]
        return {"instruction": new_instructions}

    dataset = dataset.map(diversify_instructions, batched=True)
    new_unique = len(set(dataset["instruction"]))
    print(f"  Diversified to {new_unique} unique instructions")
else:
    print(f"Instructions already diverse ({len(unique_instructions)} unique)")

# Train/validation split
if len(dataset) > 10:
    split = dataset.train_test_split(test_size=0.1, seed=42)
    train_dataset = split["train"]
    eval_dataset = split["test"]
else:
    train_dataset = dataset
    eval_dataset = None
    print("Dataset too small for validation split, training on all data.")

print(f"Training samples: {len(train_dataset)}")
if eval_dataset:
    print(f"Validation samples: {len(eval_dataset)}")

# Formatting via the tokenizer's built-in chat template.
# This covers the model families configured above; templates that reject a system role use the fallback below.
def formatting_func(examples):
    instructions = examples["instruction"]
    outputs = examples["output"]
    texts = []
    for instruction, output in zip(instructions, outputs):
        messages = [
            {"role": "system", "content": CFG["system_prompt"]},
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": output},
        ]
        # apply_chat_template handles the correct format for each model family
        try:
            text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)
        except Exception:
            # Fallback for models without system role support
            messages_no_sys = [
                {"role": "user", "content": CFG["system_prompt"] + "\n\n" + instruction},
                {"role": "assistant", "content": output},
            ]
            text = tokenizer.apply_chat_template(messages_no_sys, tokenize=False, add_generation_prompt=False)
        texts.append(text)
    return {"text": texts}

train_dataset = train_dataset.map(formatting_func, batched=True)
if eval_dataset:
    eval_dataset = eval_dataset.map(formatting_func, batched=True)

print(f"\n--- Sample formatted entry ---")
print(train_dataset[0]["text"][:600])
print("...")

---
## Step 7: Train!
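
Before starting a long run, it can be useful to estimate how many optimizer steps the settings imply. The sketch below is an approximation under simple assumptions (one GPU, any incomplete accumulation group at the end of an epoch ignored). The trainer's own count can differ by a step or so per epoch, and `estimate_schedule` is a hypothetical helper.

```python
import math


def estimate_schedule(num_samples, epochs, batch_size, grad_accum, warmup_ratio=0.1):
    """Approximate optimizer-step counts for a single-GPU run."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    # One optimizer update happens every grad_accum batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_updates = updates_per_epoch * epochs
    return {
        "effective_batch_size": batch_size * grad_accum,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": total_updates,
        "warmup_updates": math.ceil(total_updates * warmup_ratio),
    }


# 900 training samples with the notebook defaults (2 epochs, batch 2, accumulation 4).
print(estimate_schedule(900, epochs=2, batch_size=2, grad_accum=4))
```

With evaluation and checkpointing every 50 steps, a run of about 224 updates would evaluate four times, which gives early stopping with a patience of 3 little room to trigger.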

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments, EarlyStoppingCallback

# Eval-related settings are passed to the constructor so TrainingArguments validates them together
eval_kwargs = {}
if eval_dataset:
    eval_kwargs = dict(
        eval_strategy="steps",
        eval_steps=50,
        save_steps=50,
        load_best_model_at_end=True,
        metric_for_best_model="eval_loss",
        greater_is_better=False,
    )

training_args = TrainingArguments(
    output_dir=f"checkpoints_{CFG['output_name']}",
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION,
    warmup_ratio=0.1,
    learning_rate=LEARNING_RATE,
    weight_decay=WEIGHT_DECAY,
    lr_scheduler_type="cosine",
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    logging_steps=5,
    save_strategy="steps" if eval_dataset else "epoch",
    save_total_limit=3,
    seed=3407,
    **eval_kwargs,
)

callbacks = []
if eval_dataset:
    callbacks.append(EarlyStoppingCallback(early_stopping_patience=3))

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_num_proc=2,
    packing=False,
    neftune_noise_alpha=NEFTUNE_ALPHA if NEFTUNE_ALPHA > 0 else None,
    args=training_args,
    callbacks=callbacks if callbacks else None,
)

nef_status = f"NEFTune alpha={NEFTUNE_ALPHA}" if NEFTUNE_ALPHA > 0 else "NEFTune disabled"
print(f"Starting training: {NUM_EPOCHS} epochs, {len(train_dataset)} samples")
print(f"Effective batch size: {BATCH_SIZE * GRADIENT_ACCUMULATION}")
print(f"Estimated steps: {len(train_dataset) * NUM_EPOCHS // (BATCH_SIZE * GRADIENT_ACCUMULATION)}")
print(f"LR schedule: cosine, Weight decay: {WEIGHT_DECAY}, {nef_status}")
print("="*60)

stats = trainer.train()

print("="*60)
print(f"Training complete!")
print(f"  Total steps: {stats.global_step}")
print(f"  Training loss: {stats.training_loss:.4f}")
print(f"  Runtime: {stats.metrics['train_runtime']:.0f} seconds")

---
## Step 8: Test Generation

Let's see what the fine-tuned model can do!
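
The generation cell below samples with `temperature=0.8`, `top_k=50`, and `top_p=0.9`. As a simplified, pure-Python illustration of how those three settings narrow the next-token distribution, consider the sketch below. It is not the library's implementation, which operates on tensors and differs in details such as tie handling. `sampling_distribution` is a hypothetical name.

```python
import math


def sampling_distribution(logits, temperature=0.8, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k, and top-p filtering."""
    # Temperature: divide logits before the softmax; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # unnormalized softmax weights
    # Top-k: keep only the k most likely tokens.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    top_mass = sum(weights[i] for i in ranked)
    # Top-p: keep the shortest ranked prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / top_mass
        if cumulative >= top_p:
            break
    kept_mass = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_mass for i in kept}


print(sampling_distribution([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.9))
```

In this toy example only the two most likely tokens survive, because together they already cover more than 90% of the probability mass left after the top-k cut.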

In [None]:
FastLanguageModel.for_inference(model)

print(f"Generating with {CFG['model_name']}...\n")

for i, prompt in enumerate(CFG["test_prompts"]):
    # Universal chat template approach - works for all models
    messages = [
        {"role": "system", "content": CFG["system_prompt"]},
        {"role": "user", "content": prompt},
    ]
    try:
        inputs = tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
        ).to("cuda")
    except Exception:
        # Fallback for models without system role
        messages_no_sys = [
            {"role": "user", "content": CFG["system_prompt"] + "\n\n" + prompt},
        ]
        inputs = tokenizer.apply_chat_template(
            messages_no_sys, tokenize=True, add_generation_prompt=True, return_tensors="pt"
        ).to("cuda")

    input_len = inputs.shape[-1]

    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=512,
        temperature=0.8,
        top_p=0.9,
        top_k=50,
        do_sample=True,
        repetition_penalty=1.1,
    )
    response = tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True)

    print(f"{'='*60}")
    print(f"Prompt {i+1}: {prompt}")
    print(f"{'='*60}")
    print(response)
    print(f"[{len(response)} chars]\n")

---
## Step 9: Save & Download Model

In [None]:
output_name = CFG["output_name"]

# Save LoRA adapters
model.save_pretrained(output_name)
tokenizer.save_pretrained(output_name)
print(f"Model saved to {output_name}/")

# Show saved files
from pathlib import Path
total_size = 0
for f in sorted(Path(output_name).rglob("*")):
    if f.is_file():
        size = f.stat().st_size
        total_size += size
        print(f"  {f.name}: {size / 1024 / 1024:.1f} MB")
print(f"\nTotal size: {total_size / 1024 / 1024:.1f} MB")

In [None]:
# Download as zip
!zip -r {output_name}.zip {output_name}/

from google.colab import files as colab_files
print(f"\nDownloading {output_name}.zip ...")
colab_files.download(f"{output_name}.zip")

---
## Step 10 (Optional): Save to Google Drive

In [None]:
# Uncomment these lines to save to Google Drive

# from google.colab import drive
# drive.mount("/content/drive")
#
# import shutil
# drive_path = f"/content/drive/MyDrive/{output_name}"
# shutil.copytree(output_name, drive_path, dirs_exist_ok=True)
# print(f"Saved to Google Drive: {drive_path}")

---
## Step 11 (Optional): Export to GGUF for Local Use

Export your model to GGUF format for running locally with **Ollama** or **llama.cpp**.
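
Ollama loads a local GGUF file through a Modelfile rather than taking the file path directly. A minimal sketch for writing one is shown here; the helper name `write_modelfile` and the model name `my-novel` are placeholders, and a real Modelfile may also need a prompt template or parameters suited to your base model.

```python
from pathlib import Path


def write_modelfile(gguf_path, out_dir="."):
    """Write a minimal Ollama Modelfile that points at a local GGUF file."""
    modelfile = Path(out_dir) / "Modelfile"
    # FROM accepts a path to a GGUF file; an absolute path avoids working-directory surprises.
    modelfile.write_text(f"FROM {Path(gguf_path).resolve()}\n", encoding="utf-8")
    return modelfile
```

After writing the file, `ollama create my-novel -f Modelfile` registers the model and `ollama run my-novel` starts it.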

In [None]:
# Uncomment to export to GGUF (takes ~10-15 minutes)

# gguf_name = f"{output_name}_gguf"
# model.save_pretrained_gguf(
#     gguf_name,
#     tokenizer,
#     quantization_method="q4_k_m",  # Good balance of quality vs size
# )
#
# from google.colab import files as colab_files
# gguf_file = list(Path(gguf_name).glob("*.gguf"))[0]
# colab_files.download(str(gguf_file))
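# print(f"GGUF exported! To run it with Ollama, create a file named Modelfile containing:")
# print(f"  FROM ./{gguf_file.name}")
# print(f"then run: ollama create my-novel -f Modelfile && ollama run my-novel")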

---

## Done!

Your fine-tuned model has been saved. To use it locally with the Novel Writer CLI:

```bash
# Unzip your downloaded model (the name matches the output_name of the model you chose)
unzip qwen3_8b_novel_lora.zip  # or, for example, mistral_nemo_12b_novel_lora.zip

# Generate text
novel-writer generate --prompt "Your prompt here..." --model qwen3_8b_novel_lora
```

Or start the API server:
```bash
python -m novel_writer.api
# POST to http://localhost:8000/generate
```