# Kazakhstan History SFT Pipeline

This notebook walks through the full pipeline for turning Kazakhstan history PDF documents into SFT (Supervised Fine-Tuning) training data for language models.

## Features:
1. PDF parsing and extraction
2. Markdown cleaning
3. Conversion to SFT training format
4. Model training with Unsloth

## Prerequisites:
- PDF files in the `pdf` directory
- Qwen API key for SFT generation
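
Both prerequisites can be verified up front before any long-running cell starts. The sketch below is one way to do that. The helper name `check_prerequisites` is hypothetical; the `pdf` directory and the `QWEN_API_KEY` variable name are taken from the cells that follow.

```python
import os
from pathlib import Path
from typing import List


def check_prerequisites(pdf_dir: Path = Path("pdf"), env_var: str = "QWEN_API_KEY") -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if not pdf_dir.is_dir():
        problems.append(f"Directory '{pdf_dir}' does not exist.")
    elif not any(p.is_file() and p.suffix.lower() == ".pdf" for p in pdf_dir.rglob("*")):
        problems.append(f"No PDF files found under '{pdf_dir}'.")
    if not os.getenv(env_var):
        problems.append(f"Environment variable {env_var} is not set.")
    return problems


for problem in check_prerequisites():
    print("Missing prerequisite:", problem)
```

An empty result means the parsing and SFT generation cells have what they need to start.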

## Installation of Dependencies

In [None]:
# Install required packages
!pip install docling
!pip install langchain-core
!pip install langchain-openai
!pip install python-dotenv
!pip install torch
!pip install transformers
!pip install datasets
!pip install peft
!pip install trl
!pip install unsloth
!pip install huggingface_hub
!pip install sentencepiece

## Environment Variables Setup

In [8]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file if it exists
load_dotenv()

# Set your Qwen API key here
# You can either set it directly (not recommended for production) or use environment variables
QWEN_API_KEY = os.getenv("QWEN_API_KEY")

if not QWEN_API_KEY:
    print("Please set your QWEN_API_KEY environment variable.")
    print("You can do this by:")
    print("1. Creating a .env file with QWEN_API_KEY=your_api_key")
    print("2. Or setting it directly in the notebook (not recommended)")
    # Uncomment the next line if you want to set it directly (not recommended for security)
    # QWEN_API_KEY = "your_actual_api_key_here"
else:
    print("QWEN_API_KEY loaded successfully.")
    
# Export the key for downstream libraries only when it is actually set;
# assigning None to os.environ raises a TypeError.
if QWEN_API_KEY:
    os.environ["QWEN_API_KEY"] = QWEN_API_KEY

QWEN_API_KEY loaded successfully.
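
Because later cells call the API with this key, a stricter variant could stop immediately when the key is missing instead of printing a hint. A minimal sketch, assuming a hypothetical helper named `require_env`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to a .env file or export it before running."
        )
    return value
```

Calling `require_env("QWEN_API_KEY")` after `load_dotenv()` would then either return the key or halt the notebook before any API request is made.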


## PDF Parsing Functionality

In [5]:
import gc
import os
import logging
import time
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableStructureOptions,
    OcrMacOptions
)
from docling.document_converter import DocumentConverter, PdfFormatOption

_log = logging.getLogger(__name__)

def parse_pdf_with_page_range(pdf_path: Path, start_page: int, end_page: int, output_dir: Path):
    """
    Parses a PDF document and extracts content from a specified page range,
    saving each page's content to a separate Markdown file.

    Args:
        pdf_path: The path to the PDF file.
        start_page: The starting page number (inclusive, 1-based).
        end_page: The ending page number (inclusive, 1-based).
        output_dir: The directory to save the output files.
    """

    output_dir.mkdir(parents=True, exist_ok=True)

    if not pdf_path.exists():
        print(f"Error: PDF file not found at {pdf_path}")
        return

    # Docling Parse with ocrmac (macOS only)
    # --------------------------------------
    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options = TableStructureOptions(do_cell_matching=True)
    pipeline_options.ocr_options = OcrMacOptions()

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    try:
        start_time = time.time()
        document_result = doc_converter.convert(pdf_path, page_range=(start_page, end_page))
        elapsed = time.time() - start_time

        _log.info(f"Document converted in {elapsed:.2f} seconds.")
        print(f"Output will be saved to: {output_dir}")

        if not document_result:
            print("Error: Docling did not return any content.")
            return

        print(f"Processing extracted content for pages {start_page} to {end_page}...")
        text_items = [item.text for item in document_result.document.texts]
        # Print first 3 lines/snippets
        for line in text_items[:3]:
            print(line)

        # Export Markdown format:
        with (output_dir / f"{pdf_path.name}.md").open("w", encoding="utf-8") as fp:
            fp.write(document_result.document.export_to_markdown())
            
        print(f"PDF parsing completed. Output saved to {output_dir / f'{pdf_path.name}.md'}")
    
    except Exception as e:
        print(f"Error during PDF parsing: {e}")
        
    finally:
        # Explicitly clean up resources to prevent memory leaks
        # Delete the converter and force garbage collection
        del doc_converter
        gc.collect()

if __name__ == "__main__":
    # Define paths
    pdfs_file_path = Path("pdf")
    for root, dirs, files in os.walk(pdfs_file_path):
        for file in files:
            # Drop only the final extension, e.g. "report.final.pdf" -> "report.final"
            base_name = Path(file).stem
            # Step 1: Parse PDF
            print("\n1. Parsing PDF " + base_name)
            parse_pdf_with_page_range(Path(root) / file, 3, 23, Path("parsed_pdf") / base_name)


1. Parsing PDF BOOK_KZ_HISTORY
Output will be saved to: parsed_pdf/BOOK_KZ_HISTORY
Processing extracted content for pages 3 to 23...
Все учебники Казахстана на OKULYK.KZ
РАЗДЕЛ
РАЗВИТИЕ ОБЩЕСТВЕННО-ПОЛИТИЧЕСКОЙ МЫСЛИ
PDF parsing completed. Output saved to parsed_pdf/BOOK_KZ_HISTORY/BOOK_KZ_HISTORY.pdf.md
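
One possible refinement of the driver loop is to visit only `.pdf` files, so that stray files such as `.DS_Store` are never handed to Docling. The sketch below expresses that enumeration with `pathlib`. `iter_pdf_jobs` is a hypothetical helper, and the output layout mirrors the `parsed_pdf/<name>` convention visible in the output above.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_pdf_jobs(pdf_root: Path, out_root: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (pdf_path, output_dir) pairs for every PDF under pdf_root."""
    for pdf_path in sorted(pdf_root.rglob("*")):
        if pdf_path.is_file() and pdf_path.suffix.lower() == ".pdf":
            # Path.stem drops only the final suffix: "report.final.pdf" -> "report.final"
            yield pdf_path, out_root / pdf_path.stem
```

Each yielded pair could then be passed on as `parse_pdf_with_page_range(pdf_path, 3, 23, out_dir)`.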


## Markdown Cleaning Functionality

In [None]:
import os
import re
from pathlib import Path

def clean_markdown_file(input_file: Path, output_file: Path):
    """
    Clean up a Markdown file by removing extraction noise.

    Args:
        input_file (Path): Path to the input Markdown file
        output_file (Path): Path to the cleaned output Markdown file
    """
    with open(input_file, 'r', encoding='utf-8') as f:
        content = f.read()

    # Store original length for comparison
    original_length = len(content)

    # Remove image placeholders
    content = re.sub(r'<!-- image -->', '', content)

    # Remove standalone page numbers (numbers on their own lines)
    content = re.sub(r'^\d+\s*$', '', content, flags=re.MULTILINE)

    # Remove digits at the beginning or end of lines (meant for page numbers;
    # note this also strips years and list numbers in those positions)
    content = re.sub(r'(^\s*\d+\s*)|(\s*\d+\s*$)', '', content, flags=re.MULTILINE)

    # Remove extra whitespace and empty lines created by cleaning
    content = re.sub(r'\n\s*\n', '\n\n', content)  # Replace multiple empty lines with single

    # Remove trailing whitespaces
    content = re.sub(r'[ \t]+$', '', content, flags=re.MULTILINE)

    # Remove leading/trailing whitespace from the entire document
    content = content.strip()

    # Fix multiple consecutive blank lines
    content = re.sub(r'\n{3,}', '\n\n', content)

    # Remove stray leading dots (e.g., ". " left at the beginning of lines)
    content = re.sub(r'^\.\s+', '', content, flags=re.MULTILINE)

    # Remove isolated dots that might be remnants of formatting
    content = re.sub(r'^\.\s*$', '', content, flags=re.MULTILINE)

    # Clean up any remaining excessive whitespace
    content = re.sub(r'[ \t]+\n', '\n', content)  # Remove spaces/tabs before newlines

    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(content)

    print(f"Original length: {original_length}")
    print(f"Cleaned length: {len(content)}")
    print(f"Characters removed: {original_length - len(content)}")
    print(f"Cleaned markdown saved to {output_file}")

def compare_files(original_file: Path, cleaned_file: Path):
    """
    Compare the original and cleaned files to show differences.
    """
    with open(original_file, 'r', encoding='utf-8') as f:
        original_lines = f.readlines()

    with open(cleaned_file, 'r', encoding='utf-8') as f:
        cleaned_lines = f.readlines()

    print(f"\nOriginal file lines: {len(original_lines)}")
    print(f"Cleaned file lines: {len(cleaned_lines)}")
    print(f"Lines removed: {len(original_lines) - len(cleaned_lines)}")

if __name__ == "__main__":
    # Step 2: Clean markdown
    print("\n2. Cleaning markdown...")
    # Define paths
    pdfs_file_path = Path("parsed_pdf/")
    Path("parsed_pdf_cleared").mkdir(parents=True, exist_ok=True)
    for root, dirs, files in os.walk(pdfs_file_path):
        for file in files:
            parsed_md_file = Path(root + "/" + file)
            cleaned_md_file = Path("parsed_pdf_cleared" + "/" + file)
            if parsed_md_file.exists():
                clean_markdown_file(parsed_md_file, cleaned_md_file)
                compare_files(parsed_md_file, cleaned_md_file)
            else:
                print(f"Error: Parsed markdown file not found at {parsed_md_file}")


2. Cleaning markdown...
Original length: 34378
Cleaned length: 33887
Characters removed: 491
Cleaned markdown saved to parsed_pdf_cleared/BOOK_KZ_HISTORY.pdf.md

Original file lines: 425
Cleaned file lines: 344
Lines removed: 81
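
The rule that strips digits at the start or end of a line also removes years and list numbers that happen to sit there, which matters for a history text. A more conservative sketch, offered as an alternative rather than a description of what the cell above does, removes only lines that consist of nothing but a number. The function name is hypothetical.

```python
import re


def strip_standalone_page_numbers(text: str) -> str:
    """Remove lines that contain only digits, leaving in-line numbers intact."""
    # [ \t]* rather than \s* so the pattern cannot run across line breaks.
    without_numbers = re.sub(r"^[ \t]*\d+[ \t]*$", "", text, flags=re.MULTILINE)
    # Collapse the blank lines left behind into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", without_numbers).strip()


sample = "Reforms began in 1867\n\n42\n\n1. First item"
print(strip_standalone_page_numbers(sample))
```

In this sample the bare `42` line is dropped, while the trailing year and the leading list number are kept.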


## SFT Data Conversion Functionality

In [19]:
import json
import os
import re
import time
from pathlib import Path
from typing import List

from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import JsonOutputParser
from langchain_openai import ChatOpenAI

# ----------------------------
# Helper: Load system prompt
# ----------------------------
def load_system_prompt(prompt_file: str = "system_sft_prompt.txt") -> str:
    if not os.path.exists(prompt_file):
        # Create a default system prompt if file doesn't exist
        default_prompt = "You are an expert assistant for Kazakhstan history. Answer questions about Kazakh history, culture, and traditions accurately."
        print(f"System prompt file '{prompt_file}' not found. Using default prompt.")
        return default_prompt
    
    with open(prompt_file, "r", encoding="utf-8") as f:
        return f.read().strip()

# ----------------------------
# Helper: Split Markdown by headings
# ----------------------------
def split_markdown_by_headings(markdown_text: str) -> List[str]:
    # Split before every Markdown heading of any level (#, ##, ###, ...)
    sections = re.split(r'\n(?=#+\s)', markdown_text)
    # Drop empty chunks and surrounding whitespace
    return [sec.strip() for sec in sections if sec.strip()]

# ----------------------------
# Main SFT generation function
# ----------------------------
def create_sft_dataset_with_langchain(input_path: Path, output_path: Path, system_prompt: str, target_count: int = 300):
    print(f"Start generating SFT files for {input_path}")
    # Initialize Qwen via DashScope (requires QWEN_API_KEY env var)
    llm = ChatOpenAI(
        model="qwen-plus",  # or "qwen-turbo", "qwen-max"
        temperature=0.1,
        api_key=os.getenv("QWEN_API_KEY"),
        base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
        max_tokens=5000,
    )

    # Define prompt template
    sft_prompt = PromptTemplate.from_template("""
{system_prompt}

### Input Markdown:
{text}

### Output JSON SFT Samples:
""")

    # Use JsonOutputParser — but wrap in retry/error handling
    parser = JsonOutputParser()

    # Build chain
    chain = (
        {"system_prompt": lambda x: system_prompt, "text": RunnablePassthrough()}
        | sft_prompt
        | llm
        | parser
    )

    # Read input Markdown
    with open(input_path, "r", encoding="utf-8") as f:
        full_text = f.read()

    # Split into logical sections (by headings)
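    sections = split_markdown_by_headings(full_text)
    if not sections:
        # Nothing to rotate through; avoids a ZeroDivisionError in the loop below
        print(f"No content found in {input_path}; skipping.")
        return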

    all_samples = []
    attempts = 0
    max_attempts = 50

    print(f"Generating {target_count}+ SFT samples from {len(sections)} sections...")

    while len(all_samples) < target_count and attempts < max_attempts:
        # Rotate through sections to avoid repetition
        section = sections[attempts % len(sections)]

        try:
            print(f"Attempt {attempts + 1}: Generating from section snippet...")
            result = chain.invoke(section)  # Sends the full section text; no truncation is applied

            if isinstance(result, list):
                # Filter valid samples
                valid_samples = [
                    s for s in result
                    if isinstance(s, dict) and "instruction" in s and "output" in s
                ]
                all_samples.extend(valid_samples)
                print(f"  → Got {len(valid_samples)} valid samples (total: {len(all_samples)})")
                print(f"Sample preview: {json.dumps(all_samples[0:2], ensure_ascii=False, indent=2)}")
            else:
                print(f"  → Unexpected output type: {type(result)}")

        except Exception as e:
            print(f"  ❌ Error on attempt {attempts + 1}: {e}")

        attempts += 1
        time.sleep(0.5)  # Brief pause between requests; tune to your DashScope rate limit

    # Trim to exact target if needed (or keep extras)
    final_samples = all_samples[:target_count] if target_count else all_samples

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(final_samples, f, ensure_ascii=False, indent=2)

    print(f"\n✅ Successfully saved {len(final_samples)} SFT samples to {output_path}")

if __name__ == "__main__":
    print("\n3. Converting to SFT format...")
    # Define paths
    pdfs_file_path = Path("parsed_pdf_cleared/")
    Path("sft_output").mkdir(parents=True, exist_ok=True)
    system_prompt = load_system_prompt()

    for root, dirs, files in os.walk(pdfs_file_path):
        for file in files:
            parsed_md_file = Path(root + "/" + file)
            # Drop only the final extension, e.g. "BOOK.pdf.md" -> "BOOK.pdf"
            base_name = Path(file).stem
            sft_file = Path("sft_output" + "/" + base_name + ".json")
            # Step 3: Convert to SFT format
            create_sft_dataset_with_langchain(parsed_md_file, sft_file, system_prompt)


3. Converting to SFT format...
Start generating SFT files for parsed_pdf_cleared/BOOK_KZ_HISTORY.pdf.md
Generating 300+ SFT samples from 23 sections...
Attempt 1: Generating from section snippet...
  → Got 12 valid samples (total: 12)
One of sample are [
  {
    "instruction": "Что изучает раздел истории, посвящённый развитию общественной и политической мысли в Казахстане?",
    "input": "",
    "output": "Раздел изучает формирование и эволюцию общественно-политических идей в казахском обществе, включая взгляды исторических личностей на государственность, право, образование и социальные реформы."
  },
  {
    "instruction": "Какое значение имело развитие общественной мысли в Казахстане в XIX веке?",
    "input": "",
    "output": "Развитие общественной мысли в XIX веке способствовало пробуждению национального самосознания, формированию интеллигенции и появлению идей модернизации казахского общества."
  }
]
Attempt 2: Generating from section snippet...
  → Got 12 valid samples (total: 
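
Because the loop revisits sections once `attempts` exceeds the number of sections, the same question can be generated more than once. The following sketch shows one way to validate and de-duplicate samples before saving. `dedupe_samples` is a hypothetical helper, and treating two samples as duplicates when their whitespace-normalized, case-folded instructions match is an assumption.

```python
from typing import Any, Dict, List


def dedupe_samples(samples: List[Any]) -> List[Dict[str, str]]:
    """Keep well-formed samples, dropping repeats of the same instruction."""
    seen = set()
    unique = []
    for sample in samples:
        if not isinstance(sample, dict):
            continue
        instruction = sample.get("instruction")
        output = sample.get("output")
        # Require string fields under the same keys the generation loop checks for.
        if not (isinstance(instruction, str) and isinstance(output, str)):
            continue
        # Skip samples whose instruction or output is blank.
        if not instruction.strip() or not output.strip():
            continue
        # Normalize whitespace and case so trivially different repeats collide.
        key = " ".join(instruction.split()).casefold()
        if key in seen:
            continue
        seen.add(key)
        unique.append(sample)
    return unique
```

Applying this to `all_samples` before trimming to `target_count` would keep the first occurrence of each question and discard later repeats.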

## Model Training with Unsloth (Optional)

In [None]:
# Only run this section if you want to train a model with the generated SFT data
# This is adapted from the Llama3.1 notebook

try:
    from unsloth import FastLanguageModel
    import torch
    max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
    dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

    # 4bit pre quantized models we support for 4x faster downloading + no OOMs.
    fourbit_models = [
        "unsloth/mistral-7b-bnb-4bit",
        "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
        "unsloth/llama-2-7b-bnb-4bit",
        "unsloth/llama-2-13b-bnb-4bit",
        "unsloth/codellama-34b-bnb-4bit",
        "unsloth/tinyllama-bnb-4bit",
        "unsloth/tinyllama-chat-bnb-4bit",
        "unsloth/qwen-7b-bnb-4bit",
        "unsloth/qwen-14b-bnb-4bit",
        "unsloth/mixtral-8x7b-bnb-4bit",
    ] # More models at https://huggingface.co/unsloth

    model_name = "unsloth/mistral-7b-instruct-v0.2-bnb-4bit"  # You can change this to any model from the list
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name,
        max_seq_length=max_seq_length,
        dtype=dtype,
        load_in_4bit=load_in_4bit,
    )
    print(f"Model {model_name} loaded successfully!")
    
    # Add LoRA adapters for fine-tuning
    model = FastLanguageModel.get_peft_model(
        model,
        r = 16, # Choose any number > 0. 16 is a good default
        target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                          "gate_proj", "up_proj", "down_proj",],
        lora_alpha = 16,
        lora_dropout = 0, # Supports any, but = 0 is optimized
        bias = "none",    # Supports any, but = "none" is optimized
        use_gradient_checkpointing = "unsloth",
        random_state = 3407,
        use_rslora = False,  # We support rank stabilized LoRA
        loftq_config = None, # And LoftQ
    )
    
    print("LoRA adapters added to the model.")
    
except ImportError:
    print("Unsloth not installed or not available in this environment. Skipping model training setup.")
    print("To enable model training, install unsloth: pip install unsloth")
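
The cell above stops after attaching LoRA adapters. Before a trainer can consume the generated JSON, each sample has to be rendered into a single text string. The sketch below assumes an Alpaca-style template and a hypothetical `format_sample` helper. Neither is specified in this notebook, and the end-of-sequence token would normally come from the loaded tokenizer (`tokenizer.eos_token`).

```python
from typing import Dict

# Alpaca-style prompt layout; this template is an assumption, not something the notebook defines.
PROMPT_WITH_INPUT = (
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n{output}"
)
PROMPT_NO_INPUT = "### Instruction:\n{instruction}\n\n### Response:\n{output}"


def format_sample(sample: Dict[str, str], eos_token: str = "") -> str:
    """Render one instruction/input/output sample as a single training string."""
    context = sample.get("input", "").strip()
    # Use the shorter template when the optional "input" field is empty.
    template = PROMPT_WITH_INPUT if context else PROMPT_NO_INPUT
    text = template.format(
        instruction=sample["instruction"].strip(),
        input=context,
        output=sample["output"].strip(),
    )
    # Appending EOS marks where a response ends.
    return text + eos_token
```

With the `datasets` library, this could be applied through `dataset.map(lambda s: {"text": format_sample(s, tokenizer.eos_token)})` before handing the dataset to a trainer.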

## Summary

This notebook provides a complete pipeline for:

1. **PDF Parsing**: Extract text from PDF documents using Docling
2. **Markdown Cleaning**: Remove unwanted elements like page numbers and image placeholders
3. **SFT Data Generation**: Convert cleaned Markdown to supervised fine-tuning data using the Qwen API
4. **Model Training**: Optional Unsloth setup for local fine-tuning (model loading and LoRA adapters)

### Key Files Created:
- `parsed_pdf/BOOK_KZ_HISTORY/BOOK_KZ_HISTORY.pdf.md` - Raw Markdown from PDF
- `parsed_pdf_cleared/BOOK_KZ_HISTORY.pdf.md` - Cleaned Markdown
- `sft_output/*.json` - SFT training data (one JSON file per cleaned Markdown file)

### Important Notes:
- Make sure you have a Qwen API key set in your environment variables
- The SFT generation step requires an internet connection and API access
- Adjust page ranges in the PDF parsing step as needed for your document
- Monitor API usage costs when generating SFT data
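
For the last note, a back-of-the-envelope estimate of request volume can be derived from numbers already visible in this notebook: a target of 300 samples, a limit of 50 attempts, and 12 valid samples per call in the logged run. The sketch below is only that arithmetic. `estimate_api_calls` is a hypothetical helper, and actual yields per call will vary.

```python
import math


def estimate_api_calls(target_count: int, samples_per_call: int, max_attempts: int) -> int:
    """Estimate how many requests are needed, capped at the loop's attempt limit."""
    if samples_per_call <= 0:
        # No usable yield: the loop would run until it hits the attempt limit.
        return max_attempts
    return min(math.ceil(target_count / samples_per_call), max_attempts)


# target_count=300 and max_attempts=50 come from the SFT cell; 12 per call from its logged output.
print(estimate_api_calls(300, 12, 50))
```

With these inputs the estimate is 25 requests per input file, each allowed up to `max_tokens=5000` in the response.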