# Kazakhstan History SFT Pipeline

This notebook implements an end-to-end pipeline that turns Kazakhstan history PDF documents into SFT (Supervised Fine-Tuning) training data and then fine-tunes a language model on that data.

## Features:
1. PDF parsing and extraction
2. Markdown cleaning
3. Conversion to SFT training format
4. Model training with Unsloth

## Prerequisites:
- PDF files in the `pdf/` directory
- A Qwen (DashScope) API key for SFT generation
- A CUDA GPU for the training step (the recorded run used a Tesla T4)

## Installation of Dependencies

In [1]:
# Install required packages
!pip install docling
!pip install langchain-core
!pip install langchain-openai
!pip install python-dotenv
!pip install torch
!pip install unsloth
!pip install transformers
!pip install datasets
!pip install peft
!pip install trl
!pip install huggingface_hub
!pip install sentencepiece




## Environment Variables Setup

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables from a .env file if it exists
load_dotenv()

# Prefer Colab's secret storage when it is available;
# otherwise fall back to the environment (including values loaded from .env)
try:
    from google.colab import userdata
    QWEN_API_KEY = userdata.get('QWEN_API_KEY')
except Exception:
    QWEN_API_KEY = os.getenv("QWEN_API_KEY")

if not QWEN_API_KEY:
    print("Please set your QWEN_API_KEY.")
    print("You can do this by:")
    print("1. Adding a QWEN_API_KEY secret in Colab")
    print("2. Creating a .env file with QWEN_API_KEY=your_api_key")
    print("3. Or setting it directly in the notebook (not recommended)")
    # Uncomment the next line if you want to set it directly (not recommended for security)
    # QWEN_API_KEY = "your_actual_api_key_here"
else:
    # Expose the key to later cells, which read it with os.getenv()
    os.environ["QWEN_API_KEY"] = QWEN_API_KEY
    print("QWEN_API_KEY loaded successfully.")

QWEN_API_KEY loaded successfully.


## PDF Parsing Functionality

In [None]:
import gc
import os
import logging
import time
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableStructureOptions,
    OcrMacOptions
)
from docling.document_converter import DocumentConverter, PdfFormatOption

_log = logging.getLogger(__name__)

def parse_pdf_with_page_range(pdf_path: Path, start_page: int, end_page: int, output_dir: Path):
    """
    Parses a PDF document, extracts content from a specified page range,
    and saves the result as a single Markdown file.

    Args:
        pdf_path: The path to the PDF file.
        start_page: The starting page number (inclusive, 1-based).
        end_page: The ending page number (inclusive, 1-based).
        output_dir: The directory to save the output files.
    """

    output_dir.mkdir(parents=True, exist_ok=True)

    if not pdf_path.exists():
        print(f"Error: PDF file not found at {pdf_path}")
        return

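    # Configure Docling with OCR and table-structure extraction
    # ---------------------------------------------------------
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options = TableStructureOptions(do_cell_matching=True)
    # OcrMacOptions (ocrmac) works on macOS only; on other platforms (e.g. Colab)
    # keep Docling's default OCR engine.
    import sys
    if sys.platform == "darwin":
        pipeline_options.ocr_options = OcrMacOptions()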

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    try:
        start_time = time.time()
        document_result = doc_converter.convert(pdf_path, page_range=(start_page, end_page))
        end_time = time.time() - start_time

        _log.info(f"Document converted in {end_time:.2f} seconds.")
        print(f"Output will be saved to: {output_dir}")

        if not document_result:
            print("Error: Docling did not return any content.")
            return

        print(f"Processing extracted content for pages {start_page} to {end_page}...")
        text_items = [item.text for item in document_result.document.texts]
        # Print first 3 lines/snippets
        for line in text_items[:3]:
            print(line)

        # Export Markdown format:
        with (output_dir / f"{pdf_path.name}.md").open("w", encoding="utf-8") as fp:
            fp.write(document_result.document.export_to_markdown())

        print(f"PDF parsing completed. Output saved to {output_dir / f'{pdf_path.name}.md'}")

    except Exception as e:
        print(f"Error during PDF parsing: {e}")

    finally:
        # Explicitly clean up resources to prevent memory leaks
        # Delete the converter and force garbage collection
        del doc_converter
        gc.collect()

if __name__ == "__main__":
    # Define paths
    pdfs_file_path = Path("pdf")
    for root, dirs, files in os.walk(pdfs_file_path):
        for file in files:
            # Skip anything that is not a PDF
            if not file.lower().endswith(".pdf"):
                continue
            pdf_path = Path(root) / file
            base_name = pdf_path.stem                      # e.g., "report.final.pdf" -> "report.final"
            # Step 1: Parse PDF (pages 3 to 23; adjust the range for your document)
            print("\n1. Parsing PDF " + base_name)
            parse_pdf_with_page_range(pdf_path, 3, 23, Path("parsed_pdf") / base_name)




1. Parsing PDF BOOK_KZ_HISTORY
Output will be saved to: parsed_pdf/BOOK_KZ_HISTORY
Processing extracted content for pages 3 to 23...
Все учебники Казахстана на OKULYK.KZ
РАЗДЕЛ
РАЗВИТИЕ ОБЩЕСТВЕННО-ПОЛИТИЧЕСКОЙ МЫСЛИ
PDF parsing completed. Output saved to parsed_pdf/BOOK_KZ_HISTORY/BOOK_KZ_HISTORY.pdf.md


## Markdown Cleaning Functionality

In [None]:
import re

def clean_markdown_file(input_file: Path, output_file: Path):
    """
    Clean up a Markdown file by removing extraction noise such as image
    placeholders, page numbers, and stray dots.

    Args:
        input_file (Path): Path to the input Markdown file
        output_file (Path): Path to the cleaned output Markdown file
    """
    with open(input_file, 'r', encoding='utf-8') as f:
        content = f.read()

    # Store original length for comparison
    original_length = len(content)

    # Remove image placeholders
    content = re.sub(r'<!-- image -->', '', content)

    # Remove standalone page numbers (numbers on their own lines)
    content = re.sub(r'^\d+\s*$', '', content, flags=re.MULTILINE)

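    # Remove page numbers that appear at the beginning or end of lines.
    # Caution: this also strips legitimate numbers in those positions, such as
    # list numbers ("1. ...") or a year at the end of a line, and \s* can
    # swallow adjacent newlines. Review the cleaned output before using it.
    content = re.sub(r'(^\s*\d+\s*)|(\s*\d+\s*$)', '', content, flags=re.MULTILINE)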

    # Remove extra whitespace and empty lines created by cleaning
    content = re.sub(r'\n\s*\n', '\n\n', content)  # Replace multiple empty lines with single

    # Remove trailing whitespaces
    content = re.sub(r'[ \t]+$', '', content, flags=re.MULTILINE)

    # Remove leading/trailing whitespace from the entire document
    content = content.strip()

    # Fix multiple consecutive blank lines
    content = re.sub(r'\n{3,}', '\n\n', content)

    # Remove bullet points that look like trash (e.g., ". " at the beginning of lines)
    content = re.sub(r'^\.\s+', '', content, flags=re.MULTILINE)

    # Remove isolated dots that might be remnants of formatting
    content = re.sub(r'^\.\s*$', '', content, flags=re.MULTILINE)

    # Clean up any remaining excessive whitespace
    content = re.sub(r'[ \t]+\n', '\n', content)  # Remove spaces/tabs before newlines

    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(content)

    print(f"Original length: {original_length}")
    print(f"Cleaned length: {len(content)}")
    print(f"Characters removed: {original_length - len(content)}")
    print(f"Cleaned markdown saved to {output_file}")

def compare_files(original_file: Path, cleaned_file: Path):
    """
    Compare the original and cleaned files to show differences.
    """
    with open(original_file, 'r', encoding='utf-8') as f:
        original_lines = f.readlines()

    with open(cleaned_file, 'r', encoding='utf-8') as f:
        cleaned_lines = f.readlines()

    print(f"\nOriginal file lines: {len(original_lines)}")
    print(f"Cleaned file lines: {len(cleaned_lines)}")
    print(f"Lines removed: {len(original_lines) - len(cleaned_lines)}")

if __name__ == "__main__":
    # Step 2: Clean markdown
    print("\n2. Cleaning markdown...")
    # Define paths
    pdfs_file_path = Path("parsed_pdf/")
    Path("parsed_pdf_cleared").mkdir(parents=True, exist_ok=True)
    for root, dirs, files in os.walk(pdfs_file_path):
        for file in files:
            parsed_md_file = Path(root + "/" + file)
            cleaned_md_file = Path("parsed_pdf_cleared" + "/" + file)
            if parsed_md_file.exists():
                clean_markdown_file(parsed_md_file, cleaned_md_file)
                compare_files(parsed_md_file, cleaned_md_file)
            else:
                print(f"Error: Parsed markdown file not found at {parsed_md_file}")


2. Cleaning markdown...
Original length: 34378
Cleaned length: 33887
Characters removed: 491
Cleaned markdown saved to parsed_pdf_cleared/BOOK_KZ_HISTORY.pdf.md

Original file lines: 425
Cleaned file lines: 344
Lines removed: 81
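
To make the effect of the cleaning rules concrete, here is a minimal sketch that applies a subset of the same regular expressions to a made-up sample string (the sample text and both helper names are illustrative assumptions, not taken from the book or the notebook). It shows that the rule for digits at the start or end of a line also removes a list number and a year, and contrasts it with a narrower variant that only drops lines consisting solely of digits.

```python
import re

# Made-up sample: an image placeholder, a page number, a numbered item, a year.
sample = "<!-- image -->\n12\n1. First item\nThe Kazakh Khanate was founded in 1465\n"

def clean_like_notebook(text: str) -> str:
    """Apply a subset of the notebook's cleaning rules, in the same order."""
    text = re.sub(r'<!-- image -->', '', text)
    text = re.sub(r'^\d+\s*$', '', text, flags=re.MULTILINE)
    text = re.sub(r'(^\s*\d+\s*)|(\s*\d+\s*$)', '', text, flags=re.MULTILINE)
    text = re.sub(r'\n\s*\n', '\n\n', text)
    text = text.strip()
    text = re.sub(r'^\.\s+', '', text, flags=re.MULTILINE)
    return text

def clean_standalone_only(text: str) -> str:
    """Hypothetical narrower variant: drop only lines made up of digits."""
    text = re.sub(r'<!-- image -->', '', text)
    text = re.sub(r'^[ \t]*\d+[ \t]*$', '', text, flags=re.MULTILINE)
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()

aggressive = clean_like_notebook(sample)
conservative = clean_standalone_only(sample)
print(repr(aggressive))    # the year and the list number are gone
print(repr(conservative))  # the year and the list number survive
```

For a history text, where years carry much of the meaning, the narrower variant may be the safer starting point; the trade-off is that page numbers fused onto a text line stay in the output.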


## SFT Data Conversion Functionality

In [None]:
import json
import re
import time
from typing import List, Dict, Any

from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import JsonOutputParser
from langchain_openai import ChatOpenAI

# ----------------------------
# Helper: Load system prompt
# ----------------------------
def load_system_prompt(prompt_file: str = "system_sft_prompt.txt") -> str:
    if not os.path.exists(prompt_file):
        # Create a default system prompt if file doesn't exist
        default_prompt = "You are an expert assistant for Kazakhstan history. Answer questions about Kazakh history, culture, and traditions accurately."
        print(f"System prompt file '{prompt_file}' not found. Using default prompt.")
        return default_prompt

    with open(prompt_file, "r", encoding="utf-8") as f:
        return f.read().strip()

# ----------------------------
# Helper: Split Markdown by headings
# ----------------------------
def split_markdown_by_headings(markdown_text: str) -> List[str]:
    # Split before every Markdown heading of any level (#, ##, ###, ...)
    sections = re.split(r'\n(?=#+\s)', markdown_text)
    # Drop empty chunks; text before the first heading is kept as its own section
    return [sec.strip() for sec in sections if sec.strip()]

# ----------------------------
# Main SFT generation function
# ----------------------------
def create_sft_dataset_with_langchain(input_path: Path, output_path: Path, system_prompt: str, target_count: int = 300):
    print(f"Start generating SFT files for {input_path}")
    # Initialize Qwen via DashScope (requires QWEN_API_KEY env var)
    llm = ChatOpenAI(
        model="qwen-plus",  # or "qwen-turbo", "qwen-max"
        temperature=0.1,
        api_key=os.getenv("QWEN_API_KEY"),
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
        max_tokens=5000,
    )

    # Define prompt template
    sft_prompt = PromptTemplate.from_template("""
{system_prompt}

### Input Markdown:
{text}

### Output JSON SFT Samples:
""")

    # Use JsonOutputParser — but wrap in retry/error handling
    parser = JsonOutputParser()

    # Build chain
    chain = (
        {"system_prompt": lambda x: system_prompt, "text": RunnablePassthrough()}
        | sft_prompt
        | llm
        | parser
    )

    # Read input Markdown
    with open(input_path, "r", encoding="utf-8") as f:
        full_text = f.read()

    # Split into logical sections (by headings)
    sections = split_markdown_by_headings(full_text)

    all_samples = []
    attempts = 0
    max_attempts = 50

    print(f"Generating {target_count}+ SFT samples from {len(sections)} sections...")

    if not sections:
        print("No sections found in the input file; nothing to generate.")
        return

    while len(all_samples) < target_count and attempts < max_attempts:
        # Rotate through sections; once every section has been used, they repeat
        section = sections[attempts % len(sections)]

        try:
            print(f"Attempt {attempts + 1}: Generating from section snippet...")
            result = chain.invoke(section)  # The full section is sent; long sections are not truncated

            if isinstance(result, list):
                # Filter valid samples
                valid_samples = [
                    s for s in result
                    if isinstance(s, dict) and "instruction" in s and "output" in s
                ]
                all_samples.extend(valid_samples)
                print(f"  → Got {len(valid_samples)} valid samples (total: {len(all_samples)})")
                print(f"Sample preview: {json.dumps(all_samples[0:2], ensure_ascii=False, indent=2)}")
            else:
                print(f"  → Unexpected output type: {type(result)}")

        except Exception as e:
            print(f"  ❌ Error on attempt {attempts + 1}: {e}")

        attempts += 1
        time.sleep(0.5)  # Brief pause between requests; lengthen it if you hit DashScope rate limits

    # Trim to exact target if needed (or keep extras)
    final_samples = all_samples[:target_count] if target_count else all_samples

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(final_samples, f, ensure_ascii=False, indent=2)

    print(f"\n✅ Successfully saved {len(final_samples)} SFT samples to {output_path}")

if __name__ == "__main__":
    print("\n3. Converting to SFT format...")
    # Define paths
    pdfs_file_path = Path("parsed_pdf_cleared/")
    Path("sft_output").mkdir(parents=True, exist_ok=True)
    system_prompt = load_system_prompt()

    for root, dirs, files in os.walk(pdfs_file_path):
        for file in files:
            parsed_md_file = Path(root) / file
            # Keep the full source filename, e.g. "BOOK_KZ_HISTORY.pdf.md" -> "BOOK_KZ_HISTORY.pdf.md.json"
            sft_file = Path("sft_output") / (file + ".json")
            # Step 3: Convert to SFT format
            create_sft_dataset_with_langchain(parsed_md_file, sft_file, system_prompt)


3. Converting to SFT format...
Start generating SFT files for parsed_pdf_cleared/BOOK_KZ_HISTORY.pdf.md
Generating 300+ SFT samples from 23 sections...
Attempt 1: Generating from section snippet...
  → Got 12 valid samples (total: 12)
One of sample are [
  {
    "instruction": "Что изучает раздел истории, посвящённый развитию общественной и политической мысли в Казахстане?",
    "input": "",
    "output": "Раздел изучает формирование и эволюцию общественно-политических идей в казахском обществе, включая взгляды исторических личностей на государственность, право, образование и социальные реформы."
  },
  {
    "instruction": "Какое значение имело развитие общественной мысли в Казахстане в XIX веке?",
    "input": "",
    "output": "Развитие общественной мысли в XIX веке способствовало пробуждению национального самосознания, формированию интеллигенции и появлению идей модернизации казахского общества."
  }
]
Attempt 2: Generating from section snippet...
  → Got 12 valid samples (total: 
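
Because the generation loop cycles through the same sections until it reaches `target_count`, a section can be sent more than once and may return near-identical questions. The sketch below shows one way to filter the collected samples before saving them. The helper name `dedupe_samples` and the normalisation rule (collapse whitespace, ignore case) are assumptions for illustration and are not part of the notebook.

```python
from typing import Any, Dict, List

def dedupe_samples(samples: List[Any]) -> List[Dict[str, str]]:
    """Keep the first sample per instruction and drop malformed entries."""
    seen = set()
    unique: List[Dict[str, str]] = []
    for s in samples:
        # Same validity check as the generation loop
        if not (isinstance(s, dict) and "instruction" in s and "output" in s):
            continue
        # Normalise whitespace and case so trivial variants count as duplicates
        key = " ".join(str(s["instruction"]).split()).casefold()
        if not key or key in seen:
            continue
        seen.add(key)
        unique.append({
            "instruction": s["instruction"],
            "input": s.get("input", ""),  # the training step reads this field
            "output": s["output"],
        })
    return unique

raw = [
    {"instruction": "Who was Abylai?", "input": "", "output": "A Kazakh khan."},
    {"instruction": "who was  abylai?", "output": "Duplicate with different spacing."},
    {"instruction": "No output field"},
    "not a dict",
]
clean = dedupe_samples(raw)
print(len(clean))
```

Filling in a default `input` also guards against a missing column later, since the training cell reads `examples["input"]` for every record.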

## Model Training with Unsloth
### Installation

In [2]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9]{1,}\.[0-9]{1,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.33.post1" if v=="2.9" else "0.0.32.post2" if v=="2.8" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets==4.3.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.56.2
!pip install --no-deps trl==0.22.2

### Unsloth

In [3]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
max_lora_rank = 128

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    max_lora_rank = max_lora_rank,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

untrained_model, untrained_tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    max_lora_rank = max_lora_rank,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

# Add LoRA adapters so that only a small fraction of the parameters is trained (about 0.5% in this run)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

from datasets import load_dataset
import glob

# List the SFT JSON files produced in step 3
# (use "sft_output/*.json" if they are still in the pipeline's output directory)
data_files = glob.glob("BOOK_KZ_HISTORY.pdf.md.json")  # or "*.json"

# Load and concatenate
dataset = load_dataset("json", data_files=data_files, split="train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    packing = False, # Can make training 5x faster for short sequences.
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 100,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.001,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use TrackIO/WandB etc
    ),
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2026.1.4: Fast Llama patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 7.5. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!



Unsloth 2026.1.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


🦥 Unsloth: Padding-free auto-enabled, enabling faster training.
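
To see what a single training example looks like after formatting, the following sketch reproduces the Alpaca-style template on one made-up record without loading a model. The placeholder `"<EOS>"` stands in for `tokenizer.eos_token`, and the record and the name `format_batch` are illustrative assumptions.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS = "<EOS>"  # placeholder; the notebook uses tokenizer.eos_token

def format_batch(examples: dict, eos_token: str = EOS) -> dict:
    """Build one prompt string per record, ending with the EOS token."""
    texts = [
        ALPACA_PROMPT.format(instruction, context, output) + eos_token
        for instruction, context, output in zip(
            examples["instruction"], examples["input"], examples["output"]
        )
    ]
    return {"text": texts}

# One made-up record in the batched (column-wise) layout used by datasets.map
batch = {
    "instruction": ["Who was Al-Farabi?"],
    "input": [""],
    "output": ["A medieval philosopher and scholar."],
}
formatted = format_batch(batch)
print(formatted["text"][0])
```

The trailing EOS token is what teaches the model to stop after the response; without it, generations tend to run on until the token limit.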


In [4]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
10.854 GB of memory reserved.


In [5]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 300 | Num Epochs = 3 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040 of 8,072,204,288 (0.52% trained)


Step,Training Loss
1,2.701
2,2.3905
3,2.4763
4,2.2807
5,2.0829
6,1.7418
7,1.7146
8,1.4375
9,1.3348
10,1.1337


Unsloth: Will smartly offload gradients to save VRAM!
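
The figures in the training banner can be reproduced by hand. The sketch below recomputes the effective batch size, the approximate number of passes over the data, and the LoRA parameter count. The layer dimensions (hidden size 4096, key/value projection size 1024, MLP size 14336, 32 layers) are assumptions taken from the published Llama 3.1 8B configuration, not from this notebook, and the trainer's own rounding of steps per epoch may differ slightly.

```python
import math

# Values from the training cell and its banner
num_examples = 300
per_device_batch = 2
grad_accum = 4
max_steps = 100
total_params = 8_072_204_288

effective_batch = per_device_batch * grad_accum               # examples per optimizer step
steps_per_epoch = math.ceil(num_examples / effective_batch)   # approximate
passes = max_steps / steps_per_epoch                          # fractional passes over the data
epochs_reported = math.ceil(passes)

# LoRA adds two matrices per adapted layer: (d_in x r) and (r x d_out)
r = 16
hidden, kv_dim, mlp_dim, layers = 4096, 1024, 14336, 32  # assumed model dimensions
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp_dim),
    "up_proj": (hidden, mlp_dim),
    "down_proj": (mlp_dim, hidden),
}
lora_params = layers * sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
trainable_pct = 100 * lora_params / total_params

print(effective_batch, steps_per_epoch, round(passes, 2), epochs_reported)
print(f"{lora_params:,} trainable parameters ({trainable_pct:.2f}%)")
```

With 100 steps the model sees each example between two and three times, which is worth keeping in mind when judging the falling training loss on a 300-example dataset.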


In [6]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

501.933 seconds used for training.
8.37 minutes used for training.
Peak reserved memory = 11.52 GB.
Peak reserved memory for training = 0.666 GB.
Peak reserved memory % of max memory = 78.149 %.
Peak reserved memory for training % of max memory = 4.518 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!



In [9]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
FastLanguageModel.for_inference(untrained_model)

# Test questions
questions = [
  "Кто такой Махмуд Кашгари?",
  "Что сделал Касым хан?",
  "Когда был избран Абылай?",
  "Какое наследие оставил Аль-Фараби?",
  "Кто такой Ходжа Ахмед Яссауи?",
  "Какие исторические источники упоминают о достижениях хана Хакназара?",
  "Существовали ли ограничения на власть при Тауке хане?",
]

for q in questions:
  inputs = tokenizer(
  [
    alpaca_prompt.format(
        q, # instruction

        "", # input

        "", # output - leave this blank for generation!
    )
  ], return_tensors = "pt").to("cuda")

  untrained_inputs = untrained_tokenizer(
  [
    alpaca_prompt.format(
        q, # instruction

        "", # input

        "", # output - leave this blank for generation!
    )
  ], return_tensors = "pt").to("cuda")

  outputs = model.generate(**inputs, max_new_tokens = 256, use_cache = True)
  untrained_outputs = untrained_model.generate(**untrained_inputs, max_new_tokens = 256, use_cache = True)

  # Decode full output
  full_output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
  untrained_full_output = untrained_tokenizer.batch_decode(untrained_outputs, skip_special_tokens=True)[0]

  # Optional: Extract only the generated part (after the prompt)
  prompt_len = len(tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True))
  generated_text = full_output[prompt_len:].strip()

  untrained_prompt_len = len(untrained_tokenizer.decode(untrained_inputs["input_ids"][0], skip_special_tokens=True))
  untrained_generated_text = untrained_full_output[untrained_prompt_len:].strip()
  print(f"Q: {q} \n A: {generated_text} \n UnTrained A: {untrained_generated_text}")

Q: Кто такой Махмуд Кашгари? 
 A: Махмуд Кашгари — тюркский поэт, мыслитель и просветитель, автор известного трактата «Дивани лугат ат-тюрк», в котором изложил общие принципы тюркской философии, этики и правовой системы. 
 UnTrained A: Махмуд Кашгари - известный ученый и поэт, автор «Дивани лугат ат-тюрк» - первого известного словаря тюркских языков. Он жил в XI веке в средней Азии и был сторонником единства тюркских племен. В своем словаре он собрал около 10 тысяч слов тюркских языков и описал их грамматику. «Дивани лугат ат-тюрк» является ценным источником для изучения истории и культуры тюркских народов. Кашгари также известен своими литературными произведениями, в которых он выражал мысли о единстве и просвещении. Его словарь и литературные работы оставили заметный след в истории тюркских языков и культуры. ### Answer:
Махмуд Кашгари - ученый и поэт, автор первого известного словаря тюркских языков. Он жил в XI веке в средней Азии и был сторонником единства тюркских племен. ### Not
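
In the output above, the untrained model's answer runs on into extra `###` sections, because generation continues until `max_new_tokens` is reached. A minimal post-processing sketch follows, assuming that a genuine answer never contains the `###` marker; the helper name `first_answer` is hypothetical.

```python
def first_answer(generated_text: str, marker: str = "###") -> str:
    """Return the text before the first marker, trimmed of whitespace."""
    head, _, _ = generated_text.partition(marker)
    return head.strip()

# Made-up generation that drifts into a second template section
raw_generation = "Mahmud Kashgari compiled a dictionary of Turkic languages. ### Answer:\nrepeated text"
print(first_answer(raw_generation))
print(first_answer("A clean answer with no marker."))
```

Applying the same trimming to both models' outputs keeps the side-by-side comparison focused on the first answer each model gives.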

## Summary

This notebook provides a complete pipeline for:

1. **PDF Parsing**: Extract text from PDF documents using Docling
2. **Markdown Cleaning**: Remove unwanted elements like page numbers and image placeholders
3. **SFT Data Generation**: Convert cleaned markdown to supervised fine-tuning data using Qwen API
4. **Model Training**: Fine-tune a 4-bit Llama 3.1 8B Instruct model on the generated data with Unsloth (LoRA)

### Key Files Created:
- `parsed_pdf/BOOK_KZ_HISTORY/BOOK_KZ_HISTORY.pdf.md` - Raw markdown from PDF
- `parsed_pdf_cleared/BOOK_KZ_HISTORY.pdf.md` - Cleaned markdown
- `sft_output/BOOK_KZ_HISTORY.pdf.md.json` - SFT training data

### Important Notes:
- Make sure you have a Qwen API key set in your environment variables
- The SFT generation step requires internet connection and API access
- Adjust page ranges in the PDF parsing step as needed for your document
- Monitor API usage costs when generating SFT data