# Qualitative Codebook Builder

Created by [Matt Artz](https://www.mattartz.me/) — Advancing AI Anthropology through computational approaches to qualitative research.



<br>

---

<br>

## What This Tool Does

This notebook extracts and develops research codebooks from source materials (academic articles, reports, methodology guides) following established methodological best practices. Instead of manually reading through dozens of papers to identify theoretical constructs and build coding frameworks, you receive a structured codebook with definitions, criteria, and examples extracted from your literature.

The system analyzes your source documents and creates up to 40 codes by default (adjustable in the configuration cell) to prevent cognitive overload. Each code includes a definition, inclusion/exclusion criteria, and supporting examples from the original texts.

## Key Features for Anthropological Research

1. **Multi-Format Document Support**: PDF, DOCX, DOC, TXT, RTF, XLSX, CSV file processing
2. **Theory-Guided Extraction**: Implements inductive, deductive, and hybrid coding approaches
3. **Quality Assurance Framework**: Built-in validation and conceptual distinctness assessment
4. **Semantic Versioning**: Full version control with changelog tracking
5. **Multiple Export Formats**: CSV, JSON, Markdown, ATLAS.ti, NVivo compatible outputs
6. **Methodological Rigor**: Following established best practices for codebook development
7. **Interactive Configuration**: Adjustable parameters for different research approaches
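
The semantic versioning feature listed above can be illustrated with a small sketch. It assumes codebook versions follow the `MAJOR.MINOR.PATCH` pattern (each code starts at `"1.0.0"` in the notebook) and that every change is logged with a date and a note. The function names `bump_version` and `record_change`, and the mapping of change types to codebook edits, are hypothetical and may differ from the notebook's own versioning code.

```python
from datetime import date
from typing import Dict, List, Optional


def bump_version(version: str, change: str) -> str:
    """Return the next MAJOR.MINOR.PATCH version for a given change type.

    Assumed convention (not confirmed by the notebook):
      major = codes removed or redefined, minor = codes added,
      patch = wording fixes that do not alter meaning.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown change type: {change!r}")


def record_change(changelog: List[Dict[str, str]], version: str, change: str,
                  note: str, on: Optional[date] = None) -> str:
    """Append a changelog entry and return the new version string."""
    new_version = bump_version(version, change)
    changelog.append({
        "version": new_version,
        "date": (on or date.today()).isoformat(),
        "change": change,
        "note": note,
    })
    return new_version


changelog: List[Dict[str, str]] = []
current = record_change(changelog, "1.0.0", "minor", "Added code LIMINALITY")
current = record_change(changelog, current, "patch", "Tightened one definition")
print(current, len(changelog))
```

Keeping the changelog as a plain list of dictionaries makes it easy to export alongside the codebook in JSON or CSV.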

## Workflow

1. **Configure Parameters**: Set extraction approach (inductive/deductive/hybrid), quality thresholds, and processing preferences
2. **Upload Source Documents**: Process academic articles, reports, and methodology guides in multiple formats
3. **Automated Code Extraction**: AI analysis identifies theoretical constructs, methodological approaches, and key concepts
4. **Quality Refinement**: Merge similar codes, validate definitions, and ensure conceptual distinctness
5. **Export Documentation**: Generate codebooks in multiple formats with usage guidelines and quality reports

## Applications in Anthropological Practice

This tool supports any research requiring content analysis—from dissertation fieldwork to applied research projects. It's particularly useful for computational analysis using the tools in my AI Anthropology Toolkit, comparative studies requiring standardized coding frameworks, and collaborative research where multiple team members need consistent theoretical foundations.

## Methodological Positioning

This tool represents a **computational anthropology** approach—using AI to enhance rather than replace traditional literature review and codebook development. The extraction preserves the theoretical rigor that defines anthropological inquiry while addressing the practical challenges of analyzing large bodies of literature.

**Important**: This tool extracts codes from source documents but does not apply them to research data.

## Target Audience

Designed for anthropologists and qualitative researchers developing coding frameworks—from graduate students building dissertation codebooks to research teams requiring standardized theoretical foundations for large-scale studies.

## Technical Approach

The system employs **semantic analysis and natural language processing** to identify theoretical constructs and methodological approaches within academic literature. Using configurable extraction approaches, it systematically processes documents to build comprehensive codebooks while maintaining quality through similarity detection, validation checks, and conceptual distinctness assessment.
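
To make the similarity detection step concrete, here is a minimal sketch of flagging pairs of codes whose definitions overlap enough to be merge candidates. The notebook imports `sentence_transformers` and `cosine_similarity` for this purpose; to stay dependency-free, this sketch swaps in simple bag-of-words vectors instead of sentence embeddings, so its scores will differ from the real pipeline. The function names and example codes are hypothetical, and the 0.85 threshold mirrors the notebook's default `SIMILARITY_THRESHOLD`.

```python
import math
import re
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def merge_candidates(definitions: Dict[str, str],
                     threshold: float = 0.85) -> List[Tuple[str, str, float]]:
    """Return (label_a, label_b, score) for pairs at or above the threshold."""
    vectors = {label: bag_of_words(text) for label, text in definitions.items()}
    pairs = []
    for a, b in combinations(vectors, 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            pairs.append((a, b, score))
    # Most similar pairs first, so reviewers see the strongest candidates on top
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


codes = {
    "RECIPROCITY": "Exchange of gifts or favors that creates mutual obligation",
    "GIFT_EXCHANGE": "Exchange of gifts or favors that creates mutual obligation between kin",
    "LIMINALITY": "Transitional phase in a ritual between two social statuses",
}
for a, b, score in merge_candidates(codes):
    print(f"{a} ~ {b}: {score:.2f}")
```

A flagged pair is only a candidate: whether two codes are conceptually distinct remains a judgment for the researcher.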

## Contributing to AI Anthropology

This notebook contributes to the emerging field of AI Anthropology—which combines studying AI as cultural artifact, using AI to enhance ethnographic research, and applying anthropological insights to AI development (Artz, forthcoming). By open-sourcing these tools, this work advances the collective capacity of anthropologists to work effectively with computational methods.

## AI Anthropology Toolkit

This tool is part of a growing suite of computational resources for anthropological research:

- **[Qualitative Codebook Builder](https://github.com/MattArtzAnthro/Qualitative_Codebook_Builder)** (this tool) - AI-assisted development of qualitative coding frameworks
- **[Interview Transcript Semantic Chunker](https://github.com/MattArtzAnthro/Interview_Transcript_Semantic_Chunker)** - AI-assisted segmentation of interview transcripts
- **[Coding and Thematic Analysis](https://github.com/MattArtzAnthro/Coding_and_Thematic_Analysis)** - AI-assisted coding and thematic analysis of qualitative data

*Additional tools will be added to this toolkit as they are developed.*

<br>

---

<br>

## License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You may remix, adapt, and build upon the material for non-commercial purposes, provided you credit Matt Artz and link to the repository.

**Full license details**: https://creativecommons.org/licenses/by-nc/4.0/

## Attribution   

If you use or adapt this project in your work, please cite:


> Built with the Qualitative Codebook Builder (Matt Artz, 2025) — https://github.com/MattArtzAnthro/Qualitative_Codebook_Builder


## Citation

If you use this tool in your academic research, please cite:


> Artz, Matt. 2025. Qualitative Codebook Builder. Software. Zenodo. https://doi.org/10.5281/zenodo.15808612


## References
Artz, Matt. Forthcoming. “AI Anthropology: The Future of Applied Anthropological Practice.” In Routledge Handbook of Applied Anthropology, edited by Christina Wasson, Edward B. Liebow, Karine L. Narahara, Ndukuyakhe Ndlovu, and Alaka Wali. New York: Routledge.


## Setup and Package Installation

Install required Python packages and import necessary libraries for document processing, AI analysis, and codebook generation. Run this cell first to ensure all dependencies are available for the extraction pipeline.
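
If an import fails after installation, it can help to see which packages are actually missing. The sketch below uses only the standard library; the `REQUIRED_MODULES` mapping (pip name to import name) is assembled from the install commands in this notebook, and the helper name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec
from typing import Dict, List

# pip package name -> top-level module name used in `import` statements
REQUIRED_MODULES: Dict[str, str] = {
    "anthropic": "anthropic",
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "nltk": "nltk",
    "PyPDF2": "PyPDF2",
    "python-docx": "docx",
    "openpyxl": "openpyxl",
    "sentence-transformers": "sentence_transformers",
    "python-pptx": "pptx",
    "striprtf": "striprtf",
}


def missing_packages(required: Dict[str, str]) -> List[str]:
    """Return the pip names whose modules cannot be found in this environment."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


missing = missing_packages(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable")
```

`find_spec` only checks that a module can be located; it does not import it, so the check stays fast even for heavy libraries.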

In [None]:
# Install required packages
!pip install anthropic pandas numpy scikit-learn nltk PyPDF2 python-docx openpyxl
!pip install sentence-transformers python-pptx striprtf

import os
import json
import pandas as pd
import numpy as np
from datetime import datetime
import re
from collections import defaultdict, Counter
import anthropic
from typing import List, Dict, Tuple, Optional
import time
from IPython.display import display, HTML
import ipywidgets as widgets
from google.colab import files
import io

# NLP and text processing
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')  # Recent NLTK releases load tokenizer data from punkt_tab

# File handling
import PyPDF2
from docx import Document
from striprtf.striprtf import rtf_to_text
from pathlib import Path
import zipfile

# For semantic similarity in code refinement
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Version control
import hashlib
from copy import deepcopy

print("✓ All packages installed successfully")

## Configuration and Parameters

Configure codebook development parameters using interactive widgets to customize extraction strategy, quality thresholds, and processing preferences. Set API credentials and define research approach before beginning document analysis.
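
Two of the processing parameters, chunk size and chunk overlap, interact in a way worth seeing in code. The sketch below is one plausible word-based chunker, not the notebook's own implementation: it assumes both values are measured in words (the chunk size slider is labeled in words; the overlap unit is an assumption), and the function name `chunk_words` is hypothetical.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into word chunks where consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        # overlap == chunk_size would give a step of zero and never advance
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(1200))
pieces = chunk_words(sample, chunk_size=500, overlap=50)
print(len(pieces), [len(piece.split()) for piece in pieces])
```

The validation matters in practice: the sliders in this notebook allow a chunk size of 200 and an overlap of 200 at the same time, which would leave no room to advance through the text.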


In [None]:
# Interactive Configuration System with Widgets

import ipywidgets as widgets
from IPython.display import display, HTML
import anthropic

class CodeEntry:
    """Structure for each code following methodological guidelines"""
    def __init__(self):
        self.label = ""  # ≤25 chars, alphanumeric
        self.definition = ""  # One litmus sentence
        self.inclusion_criteria = []  # When to use
        self.exclusion_criteria = []  # When NOT to use
        self.examples = []  # 1-2 archetypal quotes
        self.notes = []  # Analytic decisions, date-stamped
        self.source_documents = []  # Where code was found
        self.frequency = 0
        self.created_date = datetime.now()
        self.last_modified = datetime.now()
        self.version = "1.0.0"

class Config:
    """Configuration following best practices from methodological literature"""
    # Default values
    ANTHROPIC_API_KEY = ""
    MAX_CODE_LABEL_LENGTH = 25
    MIN_DEFINITION_LENGTH = 20
    MAX_INITIAL_CODES = 40
    MIN_EXAMPLES_PER_CODE = 2
    MIN_CODE_FREQUENCY = 2
    SIMILARITY_THRESHOLD = 0.85
    CHUNK_SIZE = 500
    OVERLAP = 50
    MODEL = "claude-sonnet-4-20250514"
    MAX_TOKENS = 4000
    TEMPERATURE = 0.3
    OUTPUT_PATH = "/content/codebook_outputs/"
    VERSION_PATH = "/content/codebook_versions/"
    PURPOSE = "Extract theoretical and methodological codes from academic literature"
    EPISTEMOLOGICAL_STANCE = "pragmatic"
    CODING_STRATEGY = "hybrid"

def create_configuration_interface():
    """Create interactive configuration interface using AI Anthropology Toolkit styling"""

    # Instructions with consistent styling
    instructions_html = """
    <div style='background-color: #E7ECEF; padding: 20px; border-radius: 10px; margin: 20px 0; border-left: 5px solid #274C77;'>
    <h3 style='color: #274C77; margin-top: 0;'>🎯 Configure Codebook Development</h3>
    <p><strong>Welcome to the Qualitative Codebook Builder!</strong> Configure your extraction parameters and research approach below.</p>
    <div style='background-color: #A3CEF1; padding: 15px; border-radius: 8px; margin: 15px 0; border-left: 4px solid #6096BA;'>
        <p style='color: #274C77; margin: 0; font-weight: bold;'>🔬 What this tool does:</p>
        <ul style='color: #274C77; margin: 10px 0;'>
            <li>Extract theoretical constructs and methodological approaches from academic literature</li>
            <li>Build comprehensive coding frameworks with definitions and examples</li>
            <li>Generate quality-assured codebooks for qualitative research</li>
            <li>Export in multiple formats compatible with NVivo, ATLAS.ti, and other tools</li>
        </ul>
    </div>
    </div>
    """

    # Styling
    style = {'description_width': '200px'}
    layout = widgets.Layout(width='400px')

    # API Configuration
    api_header = widgets.HTML("<h3 style='color: #274C77;'>🔑 API Configuration</h3>")

    api_key_widget = widgets.Password(
        value=Config.ANTHROPIC_API_KEY,
        placeholder='Enter your Anthropic API key',
        description='Anthropic API Key:',
        style=style,
        layout=layout
    )

    model_widget = widgets.Dropdown(
        options=[
            'claude-sonnet-4-20250514',
            'claude-3-5-sonnet-20241022',
        ],
        value=Config.MODEL,
        description='Claude Model:',
        style=style,
        layout=layout
    )

    # Codebook Parameters
    codebook_header = widgets.HTML("<h3 style='color: #274C77;'>📚 Codebook Parameters</h3>")

    max_codes_widget = widgets.IntSlider(
        value=Config.MAX_INITIAL_CODES,
        min=10,
        max=100,
        step=5,
        description='Max Initial Codes:',
        style=style,
        layout=layout
    )

    max_label_length_widget = widgets.IntSlider(
        value=Config.MAX_CODE_LABEL_LENGTH,
        min=15,
        max=50,
        step=5,
        description='Max Label Length:',
        style=style,
        layout=layout
    )

    min_definition_length_widget = widgets.IntSlider(
        value=Config.MIN_DEFINITION_LENGTH,
        min=10,
        max=100,
        step=5,
        description='Min Definition Length:',
        style=style,
        layout=layout
    )

    min_examples_widget = widgets.IntSlider(
        value=Config.MIN_EXAMPLES_PER_CODE,
        min=1,
        max=5,
        step=1,
        description='Min Examples per Code:',
        style=style,
        layout=layout
    )

    # Quality Thresholds
    quality_header = widgets.HTML("<h3 style='color: #274C77;'>🎯 Quality Thresholds</h3>")

    min_frequency_widget = widgets.IntSlider(
        value=Config.MIN_CODE_FREQUENCY,
        min=1,
        max=10,
        step=1,
        description='Min Code Frequency:',
        style=style,
        layout=layout
    )

    similarity_threshold_widget = widgets.FloatSlider(
        value=Config.SIMILARITY_THRESHOLD,
        min=0.5,
        max=0.95,
        step=0.05,
        description='Similarity Threshold:',
        style=style,
        layout=layout
    )

    # Processing Parameters
    processing_header = widgets.HTML("<h3 style='color: #274C77;'>⚙️ Processing Parameters</h3>")

    chunk_size_widget = widgets.IntSlider(
        value=Config.CHUNK_SIZE,
        min=200,
        max=1000,
        step=50,
        description='Chunk Size (words):',
        style=style,
        layout=layout
    )

    overlap_widget = widgets.IntSlider(
        value=Config.OVERLAP,
        min=10,
        max=200,
        step=10,
        description='Chunk Overlap:',
        style=style,
        layout=layout
    )

    # LLM Parameters
    llm_header = widgets.HTML("<h3 style='color: #274C77;'>🤖 LLM Parameters</h3>")

    max_tokens_widget = widgets.IntSlider(
        value=Config.MAX_TOKENS,
        min=1000,
        max=8000,
        step=500,
        description='Max Tokens:',
        style=style,
        layout=layout
    )

    temperature_widget = widgets.FloatSlider(
        value=Config.TEMPERATURE,
        min=0.0,
        max=1.0,
        step=0.1,
        description='Temperature:',
        style=style,
        layout=layout
    )

    # Research Configuration
    research_header = widgets.HTML("<h3 style='color: #274C77;'>🔬 Research Configuration</h3>")

    purpose_widget = widgets.Textarea(
        value=Config.PURPOSE,
        placeholder='Describe the purpose of your codebook...',
        description='Codebook Purpose:',
        style=style,
        layout=widgets.Layout(width='600px', height='80px')
    )

    epistemological_widget = widgets.Dropdown(
        options=['positivist', 'interpretivist', 'critical', 'pragmatic'],
        value=Config.EPISTEMOLOGICAL_STANCE,
        description='Epistemological Stance:',
        style=style,
        layout=layout
    )

    coding_strategy_widget = widgets.Dropdown(
        options=['deductive', 'inductive', 'hybrid'],
        value=Config.CODING_STRATEGY,
        description='Coding Strategy:',
        style=style,
        layout=layout
    )

    # Action Buttons
    apply_button = widgets.Button(
        description='✅ Apply Configuration',
        disabled=False,
        button_style='',
        tooltip='Apply current settings',
        icon='check',
        layout=widgets.Layout(width='220px', height='45px', margin='5px'),
        style={'button_color': '#6096BA', 'font_weight': 'bold'}
    )

    test_api_button = widgets.Button(
        description='🧪 Test API Key',
        disabled=False,
        button_style='',
        tooltip='Test if API key is valid',
        icon='key',
        layout=widgets.Layout(width='160px', height='45px', margin='5px'),
        style={'button_color': '#A3CEF1', 'font_weight': 'bold'}
    )

    reset_button = widgets.Button(
        description='🔄 Reset Defaults',
        disabled=False,
        button_style='',
        tooltip='Reset to default values',
        icon='refresh',
        layout=widgets.Layout(width='160px', height='45px', margin='5px'),
        style={'button_color': '#8B8C89', 'font_weight': 'bold'}
    )

    # Status output
    status_output = widgets.Output()

    # Event handlers
    def apply_configuration(b):
        with status_output:
            status_output.clear_output()
            try:
                # Update Config class
                Config.ANTHROPIC_API_KEY = api_key_widget.value
                Config.MODEL = model_widget.value
                Config.MAX_INITIAL_CODES = max_codes_widget.value
                Config.MAX_CODE_LABEL_LENGTH = max_label_length_widget.value
                Config.MIN_DEFINITION_LENGTH = min_definition_length_widget.value
                Config.MIN_EXAMPLES_PER_CODE = min_examples_widget.value
                Config.MIN_CODE_FREQUENCY = min_frequency_widget.value
                Config.SIMILARITY_THRESHOLD = similarity_threshold_widget.value
                Config.CHUNK_SIZE = chunk_size_widget.value
                Config.OVERLAP = overlap_widget.value
                Config.MAX_TOKENS = max_tokens_widget.value
                Config.TEMPERATURE = temperature_widget.value
                Config.PURPOSE = purpose_widget.value
                Config.EPISTEMOLOGICAL_STANCE = epistemological_widget.value
                Config.CODING_STRATEGY = coding_strategy_widget.value

                # Initialize client
                global client
                client = anthropic.Anthropic(api_key=Config.ANTHROPIC_API_KEY)

                # Create directories
                os.makedirs(Config.OUTPUT_PATH, exist_ok=True)
                os.makedirs(Config.VERSION_PATH, exist_ok=True)

                print("✅ Configuration applied successfully!")
                print(f"📊 Max Codes: {Config.MAX_INITIAL_CODES}")
                print(f"🔧 Chunk Size: {Config.CHUNK_SIZE} words")
                print(f"🎯 Strategy: {Config.CODING_STRATEGY}")
                print(f"🧠 Model: {Config.MODEL}")

            except Exception as e:
                print(f"❌ Error applying configuration: {e}")

    def test_api_key(b):
        with status_output:
            status_output.clear_output()
            try:
                if not api_key_widget.value:
                    print("❌ Please enter an API key first")
                    return

                print("🔄 Testing API connection...")
                test_client = anthropic.Anthropic(api_key=api_key_widget.value)

                response = test_client.messages.create(
                    model=model_widget.value,
                    max_tokens=10,
                    messages=[{"role": "user", "content": "Hello"}]
                )

                print("✅ API key is valid and working!")
                print(f"🤖 Model: {model_widget.value}")

            except Exception as e:
                print(f"❌ API test failed: {e}")

    def reset_configuration(b):
        with status_output:
            status_output.clear_output()
            # Reset all widgets to defaults
            api_key_widget.value = ""
            model_widget.value = "claude-sonnet-4-20250514"
            max_codes_widget.value = 40
            max_label_length_widget.value = 25
            min_definition_length_widget.value = 20
            min_examples_widget.value = 2
            min_frequency_widget.value = 2
            similarity_threshold_widget.value = 0.85
            chunk_size_widget.value = 500
            overlap_widget.value = 50
            max_tokens_widget.value = 4000
            temperature_widget.value = 0.3
            purpose_widget.value = "Extract theoretical and methodological codes from academic literature"
            epistemological_widget.value = "pragmatic"
            coding_strategy_widget.value = "hybrid"
            print("🔄 Configuration reset to defaults")

    # Bind events
    apply_button.on_click(apply_configuration)
    test_api_button.on_click(test_api_key)
    reset_button.on_click(reset_configuration)

    # Help documentation
    help_html = """
    <div style='background-color: #A3CEF1; padding: 15px; border-radius: 5px; margin: 15px 0; border-left: 4px solid #6096BA;'>
    <h4 style='color: #274C77; margin-top: 0;'>📖 Configuration Guide</h4>
    <div style='display: flex; gap: 20px; margin: 15px 0;'>
        <div style='flex: 1;'>
            <ul style='color: #274C77;'>
                <li><strong>Max Initial Codes:</strong> Limits extracted codes to prevent cognitive overload (recommended: 20-40)</li>
                <li><strong>Chunk Size:</strong> Number of words per text segment. Larger = more context, slower processing</li>
                <li><strong>Similarity Threshold:</strong> How similar codes must be to trigger merge consideration (0.85 = 85% similar)</li>
                <li><strong>Temperature:</strong> LLM creativity (0.0 = deterministic, 1.0 = creative)</li>
            </ul>
        </div>
        <div style='flex: 1;'>
            <ul style='color: #274C77;'>
                <li><strong>Coding Strategy:</strong>
                    <ul>
                        <li><em>Deductive:</em> Extract known theoretical frameworks</li>
                        <li><em>Inductive:</em> Discover emergent themes</li>
                        <li><em>Hybrid:</em> Combine both approaches (recommended)</li>
                    </ul>
                </li>
                <li><strong>Quality Thresholds:</strong> Minimum requirements for code inclusion and validation</li>
            </ul>
        </div>
    </div>
    <p style='color: #274C77; margin: 0;'><strong>💡 Tip:</strong> Start with default settings and adjust based on your specific research needs.</p>
    </div>
    """

    # Layout sections
    api_section = widgets.VBox([api_header, api_key_widget, model_widget])
    codebook_section = widgets.VBox([codebook_header, max_codes_widget, max_label_length_widget,
                                   min_definition_length_widget, min_examples_widget])
    quality_section = widgets.VBox([quality_header, min_frequency_widget, similarity_threshold_widget])
    processing_section = widgets.VBox([processing_header, chunk_size_widget, overlap_widget])
    llm_section = widgets.VBox([llm_header, max_tokens_widget, temperature_widget])
    research_section = widgets.VBox([research_header, purpose_widget, epistemological_widget, coding_strategy_widget])

    buttons_section = widgets.HBox([apply_button, test_api_button, reset_button])

    # Organize into two columns for better spacing
    left_column = widgets.VBox([
        api_section,
        codebook_section,
        quality_section
    ])

    right_column = widgets.VBox([
        processing_section,
        llm_section,
        research_section
    ])

    main_container = widgets.VBox([
        widgets.HBox([left_column, right_column], layout=widgets.Layout(gap='40px')),
        buttons_section,
        status_output
    ])

    # Display everything
    display(HTML(instructions_html))
    display(HTML(help_html))
    display(main_container)

    return {
        'api_key': api_key_widget,
        'model': model_widget,
        'max_codes': max_codes_widget,
        'chunk_size': chunk_size_widget,
        'temperature': temperature_widget,
        'strategy': coding_strategy_widget,
        'purpose': purpose_widget,
        'apply_button': apply_button,
        'test_button': test_api_button
    }

# Create the interface
print("🎛️ Loading Interactive Configuration...")
config_widgets = create_configuration_interface()

## File Upload and Processing

Upload source documents in multiple formats and extract clean text for analysis. The system handles various file types including PDFs, Word documents, and spreadsheets while providing detailed processing statistics.
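
The processing statistics and the minimum-text check can be expressed as two small helpers. This sketch reuses the values that appear in the upload code in this notebook (a 100-character minimum and a rough estimate of 2,000 characters per page); the helper names `document_stats` and `has_enough_text` are hypothetical.

```python
from typing import Dict

CHARS_PER_PAGE = 2000   # rough estimate used by the upload code
MIN_CHARS = 100         # minimum stripped length for a usable document


def document_stats(text: str) -> Dict[str, int]:
    """Word count, character count, and a rough page estimate for extracted text."""
    return {
        "words": len(text.split()),
        "characters": len(text),
        "estimated_pages": len(text) // CHARS_PER_PAGE,
    }


def has_enough_text(text: str, min_chars: int = MIN_CHARS) -> bool:
    """True when extraction produced enough non-whitespace-padded text to analyze."""
    return bool(text) and len(text.strip()) >= min_chars


sample = "word " * 1000
print(document_stats(sample), has_enough_text(sample))
```

A scanned PDF with no selectable text typically fails `has_enough_text`, which is why the upload step reports "Insufficient text extracted" for such files.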



In [None]:
# File Upload and Processing with Styled Interface

def create_document_upload_interface():
    """Create styled file upload interface for codebook source documents"""

    # Instructions with consistent styling
    instructions_html = """
    <div style='background-color: #E7ECEF; padding: 20px; border-radius: 10px; margin: 20px 0; border-left: 5px solid #274C77;'>
    <h3 style='color: #274C77; margin-top: 0;'>📚 Upload Source Documents</h3>
    <p><strong>Ready to upload your source materials!</strong> This tool processes academic articles, reports, and methodology guides to extract theoretical constructs and coding frameworks.</p>

    <div style='display: flex; gap: 20px; margin: 15px 0;'>
        <div style='flex: 1; background-color: #A3CEF1; padding: 15px; border-radius: 8px; border-left: 4px solid #6096BA;'>
            <h4 style='color: #274C77; margin-top: 0;'>✅ Supported Formats:</h4>
            <ul>
                <li><strong>.pdf</strong> - Academic articles and reports</li>
                <li><strong>.docx/.doc</strong> - Word documents and manuscripts</li>
                <li><strong>.txt</strong> - Plain text files and notes</li>
                <li><strong>.rtf</strong> - Rich Text Format documents</li>
                <li><strong>.csv/.xlsx</strong> - Spreadsheet data and tables</li>
            </ul>
        </div>
        <div style='flex: 1; background-color: #A3CEF1; padding: 15px; border-radius: 8px; border-left: 4px solid #6096BA;'>
            <h4 style='color: #274C77; margin-top: 0;'>💡 Best Practices:</h4>
            <ul>
                <li>Include methodology sections for method extraction</li>
                <li>Upload theoretical papers for framework identification</li>
                <li>Ensure PDFs have selectable text (not scanned images)</li>
                <li>Include diverse sources for comprehensive coverage</li>
            </ul>
        </div>
    </div>
    </div>
    """

    # Upload button
    upload_button = widgets.Button(
        description='📤 Choose Source Documents',
        disabled=False,
        button_style='',
        tooltip='Click to select and upload your source documents',
        icon='upload',
        layout=widgets.Layout(width='300px', height='50px'),
        style={'button_color': '#6096BA', 'font_weight': 'bold'}
    )

    # Clear button
    clear_button = widgets.Button(
        description='🗑️ Clear All Documents',
        disabled=False,
        button_style='',
        tooltip='Remove all uploaded documents',
        icon='trash',
        layout=widgets.Layout(width='180px', height='40px'),
        style={'button_color': '#8B8C89', 'font_weight': 'bold'}
    )

    # File info display
    file_info = widgets.HTML(
        value="<p style='color: #666; font-style: italic;'>No documents uploaded yet. Click 'Choose Source Documents' to begin.</p>",
        layout=widgets.Layout(width='100%')
    )

    # Upload progress/status
    upload_output = widgets.Output()

    # Global storage for uploaded documents
    global uploaded_documents
    uploaded_documents = {}

    def handle_document_upload(b):
        with upload_output:
            upload_output.clear_output()

            print("=" * 60)
            print("📚 SOURCE DOCUMENT UPLOAD")
            print("=" * 60)
            print("Supported formats: PDF, DOCX, DOC, TXT, RTF, CSV, XLSX")
            print("Please select your source documents...")
            print()

            try:
                uploaded = files.upload()

                if not uploaded:
                    print("❌ No files were selected.")
                    return

                print(f"\n📥 Processing {len(uploaded)} document(s)...")

                global uploaded_documents
                uploaded_documents.clear()

                successful_files = []
                failed_files = []

                for filename, file_content in uploaded.items():
                    print(f"\n🔄 Processing: {filename}")

                    try:
                        # Determine file type and process
                        file_ext = filename.lower().split('.')[-1]

                        if file_ext == 'pdf':
                            text = extract_pdf_text(io.BytesIO(file_content))
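                        elif file_ext in ['docx', 'doc']:
                            # python-docx reads .docx only; legacy binary .doc files
                            # usually fail here and should be converted to .docx first
                            text = extract_docx_text(io.BytesIO(file_content))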
                        elif file_ext == 'txt':
                            text = file_content.decode('utf-8', errors='ignore')
                        elif file_ext == 'rtf':
                            text = rtf_to_text(file_content.decode('utf-8', errors='ignore'))
                        elif file_ext == 'csv':
                            df = pd.read_csv(io.BytesIO(file_content))
                            text = extract_csv_text(df)
                        elif file_ext in ['xlsx', 'xls']:
                            df = pd.read_excel(io.BytesIO(file_content))
                            text = extract_excel_text(df)
                        else:
                            print(f"   ❌ Unsupported file format: {file_ext}")
                            failed_files.append(filename)
                            continue

                        # Validate extracted text
                        if not text or len(text.strip()) < 100:  # Minimum 100 chars for academic docs
                            print(f"   ❌ Insufficient text extracted from {filename}")
                            failed_files.append(filename)
                            continue

                        # Store document
                        uploaded_documents[filename] = text

                        # Calculate statistics
                        word_count = len(text.split())
                        char_count = len(text)
                        estimated_pages = char_count // 2000  # Rough estimate
                        file_size = len(file_content)

                        print(f"   ✅ Success!")
                        print(f"      📊 {word_count:,} words")
                        print(f"      📊 {char_count:,} characters")
                        print(f"      📊 ~{estimated_pages} pages")
                        print(f"      📁 File size: {file_size:,} bytes")

                        successful_files.append(filename)

                    except Exception as e:
                        print(f"   ❌ Error processing {filename}: {e}")
                        failed_files.append(filename)

                # Update display
                if successful_files:
                    print(f"\n🎉 Successfully processed {len(successful_files)} document(s)!")

                    # Create summary display
                    info_html = "<div style='background-color: #A3CEF1; padding: 15px; border-radius: 5px; border-left: 4px solid #6096BA;'>"
                    info_html += f"<h4 style='color: #274C77; margin-top: 0;'>✅ {len(successful_files)} Source Document(s) Uploaded</h4>"

                    total_words = 0
                    for filename in successful_files:
                        text = uploaded_documents[filename]
                        word_count = len(text.split())
                        char_count = len(text)
                        total_words += word_count

                        info_html += f"<div style='margin: 10px 0; padding: 10px; background-color: #E7ECEF; border-radius: 3px; border-left: 3px solid #274C77;'>"
                        info_html += f"<strong>📄 {filename}</strong><br>"
                        info_html += f"<small style='color: #274C77;'>📊 {word_count:,} words • {char_count:,} characters • ~{char_count//2000} pages</small>"
                        info_html += "</div>"

                    if len(successful_files) > 1:
                        info_html += f"<p style='color: #274C77;'><strong>📈 Total: {total_words:,} words across all documents</strong></p>"

                    info_html += "<p style='color: #274C77;'><strong>✅ Ready for codebook development! Continue to configuration and processing.</strong></p>"
                    info_html += "</div>"

                else:
                    info_html = "<div style='background-color: #E7ECEF; padding: 15px; border-radius: 5px; border-left: 4px solid #8B8C89;'>"
                    info_html += "<h4 style='color: #274C77; margin-top: 0;'>❌ No Documents Successfully Processed</h4>"
                    info_html += "<p style='color: #274C77;'>Please check your file formats and try again. Ensure PDFs have selectable text.</p>"
                    info_html += "</div>"

                if failed_files:
                    print(f"\n⚠️ Failed to process {len(failed_files)} file(s): {', '.join(failed_files)}")

                file_info.value = info_html

            except Exception as e:
                print(f"\n❌ Upload error: {e}")
                file_info.value = f"<p style='color: red;'>❌ Upload failed: {e}</p>"

    def clear_documents(b):
        with upload_output:
            upload_output.clear_output()

            global uploaded_documents
            uploaded_documents.clear()

            file_info.value = "<p style='color: #666; font-style: italic;'>Documents cleared. Click 'Choose Source Documents' to start over.</p>"
            print("🗑️ All documents cleared successfully.")

    # Bind events
    upload_button.on_click(handle_document_upload)
    clear_button.on_click(clear_documents)

    # Layout
    buttons_container = widgets.HBox([upload_button, clear_button])

    # Display interface
    display(HTML(instructions_html))
    display(buttons_container)
    display(file_info)
    display(upload_output)

    return {
        'upload_button': upload_button,
        'clear_button': clear_button,
        'file_info': file_info,
        'output': upload_output
    }

# Text extraction helpers, one per supported file format
def extract_pdf_text(file_content):
    """Extract text from PDF file"""
    try:
        pdf_reader = PyPDF2.PdfReader(file_content)
        text = ""
        for page_num in range(len(pdf_reader.pages)):
            page_text = pdf_reader.pages[page_num].extract_text()
            if page_text:
                text += page_text + "\n"
        return text
    except Exception as e:
        print(f"PDF extraction error: {e}")
        return ""

def extract_docx_text(file_content):
    """Extract text from DOCX file"""
    try:
        doc = Document(file_content)
        text = ""
        # Extract from paragraphs
        for paragraph in doc.paragraphs:
            text += paragraph.text + "\n"
        # Extract from tables
        for table in doc.tables:
            for row in table.rows:
                for cell in row.cells:
                    text += cell.text + " "
                # End each table row on its own line
                text += "\n"
        return text
    except Exception as e:
        print(f"DOCX extraction error: {e}")
        return ""

def extract_csv_text(df):
    """Extract text from CSV dataframe"""
    text_columns = df.select_dtypes(include=['object']).columns
    text = ""
    for col in text_columns:
        text += f"Column: {col}\n"
        text += " ".join(df[col].dropna().astype(str).tolist()) + "\n\n"
    return text

def extract_excel_text(df):
    """Extract text from Excel dataframe"""
    if isinstance(df, dict):  # Multiple sheets
        text = ""
        for sheet_name, sheet_df in df.items():
            text += f"Sheet: {sheet_name}\n"
            text += extract_csv_text(sheet_df)
    else:
        text = extract_csv_text(df)
    return text

# Initialize the styled interface
print("📚 Document Upload Interface Ready")
print("👆 Configure your settings above, then upload your source documents below!")

# Global variable for document storage
uploaded_documents = {}

# Create and display upload interface
upload_interface = create_document_upload_interface()

## Text Segmentation

Prepare documents for AI analysis by splitting them into sentence-aligned chunks of a configurable word length. Each new chunk repeats the last few sentences of the previous one, so context carries across chunk boundaries while every request stays small enough for the language model to process reliably.
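The overlap idea can be illustrated with a minimal, dependency-free sketch. It is not the notebook's implementation: `split_sentences` is a naive regex stand-in for NLTK's `sent_tokenize`, and the overlap is given directly as a number of sentences (`overlap_sentences`, a hypothetical parameter) rather than derived from a word count.

```python
import re
from typing import List

def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter (illustrative stand-in for nltk.sent_tokenize)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def chunk_with_overlap(text: str, chunk_size: int = 8, overlap_sentences: int = 1) -> List[str]:
    """Group sentences into chunks of about chunk_size words.

    The last overlap_sentences sentences of each chunk are repeated at the
    start of the next one so context carries across the boundary.
    """
    chunks, current, size = [], [], 0
    for sentence in split_sentences(text):
        words = len(sentence.split())
        if current and size + words > chunk_size:
            chunks.append(" ".join(current))
            # Carry the trailing sentences into the next chunk
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            size = sum(len(s.split()) for s in current)
        current.append(sentence)
        size += words
    if current:
        chunks.append(" ".join(current))
    return chunks

sample = (
    "Codes need definitions. Definitions need examples. "
    "Examples come from texts. Texts are chunked."
)
chunks = chunk_with_overlap(sample, chunk_size=8, overlap_sentences=1)
for number, chunk in enumerate(chunks, 1):
    print(number, chunk)
```

Each chunk after the first begins with the sentence that closed the previous chunk, which is the behavior the overlap setting is meant to provide.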

In [None]:
# This block provides utilities for splitting documents into manageable chunks

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """
    Split text into overlapping chunks for processing.
    Maintains context across chunk boundaries.
    """
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = []
    current_size = 0

    for sentence in sentences:
        sentence_size = len(sentence.split())

        if current_size + sentence_size > chunk_size and current_chunk:
            chunks.append(" ".join(current_chunk))
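            # Convert the word-based overlap into a sentence count using the chunk's
            # average sentence length, and never carry over the whole chunk
            overlap_sentences = min(
                int(overlap * len(current_chunk) / current_size),
                len(current_chunk) - 1
            )
            current_chunk = current_chunk[-overlap_sentences:] if overlap_sentences > 0 else []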
            current_size = sum(len(s.split()) for s in current_chunk)

        current_chunk.append(sentence)
        current_size += sentence_size

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

print("✓ Text chunking utilities loaded")

## Research Scope Definition

Define the codebook's purpose and analyze uploaded documents to understand their theoretical frameworks and methodological approaches. This step establishes the foundation for targeted code extraction based on document content.
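As a rough sketch of how per-segment results might be combined into one document profile: long texts are sampled at the beginning, middle, and end, single-valued fields are settled by majority vote, and list fields are merged without duplicates. The helper names (`sample_segments`, `consolidate`) and the sample results are hypothetical, and no model call is made here.

```python
from collections import Counter
from typing import Dict, List

def sample_segments(content: str, threshold: int = 10000, window: int = 3000) -> List[str]:
    """Return the whole text if it is short, else beginning, middle, and end windows."""
    if len(content) <= threshold:
        return [content]
    mid = len(content) // 2
    half = window // 2
    return [content[:window], content[mid - half:mid + half], content[-window:]]

def consolidate(segment_results: List[Dict]) -> Dict:
    """Majority vote for document_type, de-duplicated union for frameworks."""
    types = [
        result["document_type"]
        for result in segment_results
        if isinstance(result.get("document_type"), str)
    ]
    frameworks: List[str] = []
    for result in segment_results:
        for name in result.get("frameworks", []):
            if name not in frameworks:  # keep first-seen order
                frameworks.append(name)
    return {
        "document_type": Counter(types).most_common(1)[0][0] if types else "Unknown",
        "frameworks": frameworks,
    }

# Hypothetical per-segment results, as if parsed from three model responses
results = [
    {"document_type": "empirical article", "frameworks": ["grounded theory"]},
    {"document_type": "empirical article", "frameworks": ["grounded theory", "phenomenology"]},
    {"document_type": "report", "frameworks": []},
]
summary = consolidate(results)
segments = sample_segments("x" * 12000)
print(summary)
print(len(segments), [len(segment) for segment in segments])
```

Keeping first-seen order for the merged lists makes the saved analysis stable from run to run, which a plain `set` does not guarantee.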

In [None]:
# This block defines the codebook's purpose and analyzes uploaded documents

def define_codebook_purpose():
    """Step 1: Clarify purpose and scope following best practices"""

    purpose_config = {
        "timestamp": datetime.now().isoformat(),
        "purpose": Config.PURPOSE,
        "epistemological_stance": Config.EPISTEMOLOGICAL_STANCE,
        "primary_use": "Extract theoretical constructs and methodological approaches from literature",
        "intended_audience": ["Research team", "Peer reviewers", "Future researchers"],
        "scope_limitations": [
            f"Maximum {Config.MAX_INITIAL_CODES} initial codes to prevent overload",
            "Focus on conceptual and methodological codes",
            "Exclude purely descriptive or administrative codes"
        ],
        "research_questions": [
            "What theoretical constructs are present in the literature?",
            "What methodological approaches are discussed?",
            "What key concepts require operational definitions?"
        ]
    }

    # Save purpose statement
    with open(f"{Config.OUTPUT_PATH}codebook_purpose.json", 'w') as f:
        json.dump(purpose_config, f, indent=2)

    return purpose_config

def analyze_document_types(documents: Dict[str, str]) -> Dict:
    """Analyze uploaded documents to understand their nature.

    Documents up to 10,000 characters are analyzed in full; longer ones are
    sampled at the beginning, middle, and end.
    """

    print("\nAnalyzing document types and content...")
    document_analysis = {}

    for doc_name, content in documents.items():
        print(f"  Analyzing: {doc_name}")

        # For very long documents, analyze in segments
        segments = []
        if len(content) > 10000:  # If document is very long
            # Analyze beginning, middle, and end
            segments = [
                content[:3000],  # Beginning
                content[len(content)//2 - 1500:len(content)//2 + 1500],  # Middle
                content[-3000:]  # End
            ]
        else:
            # Analyze the whole document if it's shorter
            segments = [content]

        combined_analysis = {
            'document_type': [],
            'domain': [],
            'frameworks': [],
            'methods': [],
            'code_categories': []
        }

        # Analyze each segment
        for i, segment in enumerate(segments):
            analysis_prompt = f"""
            Analyze this {'full document' if len(segments) == 1 else f'segment {i+1} of document'} to determine:
            1. Document type (e.g., empirical article, theoretical paper, methodology guide, report)
            2. Primary domain/field
            3. Key theoretical frameworks mentioned
            4. Methodological approaches discussed
            5. Potential code categories to extract

            Be comprehensive - identify ALL frameworks, methods, and concepts present.

            Text:
            {segment}

            Return ONLY valid JSON without markdown formatting. Use this exact structure:
            {{
                "document_type": "string",
                "domain": "string",
                "frameworks": ["list", "of", "frameworks"],
                "methods": ["list", "of", "methods"],
                "code_categories": ["list", "of", "categories"]
            }}
            """

            try:
                response = client.messages.create(
                    model=Config.MODEL,
                    max_tokens=1500,
                    temperature=0.2,  # Lower temperature for consistent formatting
                    messages=[{
                        "role": "user",
                        "content": analysis_prompt
                    }],
                    timeout=120.0
                )

                # Clean and parse the response
                raw_response_text = response.content[0].text
                print(f"  Raw API response for segment {i+1}: {raw_response_text[:200]}...")

                try:
                    cleaned_response = raw_response_text.strip()

                    # Remove markdown code blocks
                    if cleaned_response.startswith("```json"):
                        cleaned_response = cleaned_response[7:]
                    if cleaned_response.startswith("```"):
                        cleaned_response = cleaned_response[3:]
                    if cleaned_response.endswith("```"):
                        cleaned_response = cleaned_response[:-3]

                    cleaned_response = cleaned_response.strip()

                    # Parse JSON
                    segment_analysis = json.loads(cleaned_response)

                    # Aggregate results
                    if isinstance(segment_analysis.get('document_type'), str):
                        combined_analysis['document_type'].append(segment_analysis['document_type'])
                    if isinstance(segment_analysis.get('domain'), str):
                        combined_analysis['domain'].append(segment_analysis['domain'])

                    for key in ['frameworks', 'methods', 'code_categories']:
                        if isinstance(segment_analysis.get(key), list):
                            combined_analysis[key].extend(segment_analysis[key])

                except json.JSONDecodeError as e:
                    print(f"    JSON parsing error for segment {i+1}: {e}")
                    print(f"    Raw response (first 200 chars): {raw_response_text[:200]}...")
                    continue

            except Exception as e:
                print(f"    Error analyzing segment {i+1}: {str(e)}")

        # Consolidate analysis
        document_analysis[doc_name] = {
            'document_type': Counter(combined_analysis['document_type']).most_common(1)[0][0] if combined_analysis['document_type'] else 'Unknown',
            'domain': Counter(combined_analysis['domain']).most_common(1)[0][0] if combined_analysis['domain'] else 'Unknown',
            'frameworks': sorted({str(item) for item in combined_analysis['frameworks']}),  # Unique frameworks, stable order
            'methods': sorted({str(item) for item in combined_analysis['methods']}),  # Unique methods, stable order
            'code_categories': sorted({str(item) for item in combined_analysis['code_categories']}),  # Unique categories, stable order
            'document_length': len(content),
            'segments_analyzed': len(segments)
        }

        print(f"    Found {len(document_analysis[doc_name]['frameworks'])} frameworks, "
              f"{len(document_analysis[doc_name]['methods'])} methods")

    # Save analysis
    with open(f"{Config.OUTPUT_PATH}document_analysis.json", 'w') as f:
        json.dump(document_analysis, f, indent=2)

    return document_analysis

## Initial Code Extraction

Extract theoretical constructs and methodological concepts from source documents using the configured coding strategy. The system identifies key terms, frameworks, and approaches while building comprehensive code definitions and examples.
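Model replies do not always arrive as clean JSON, so each reply needs defensive parsing before codes can be added to the codebook. The sketch below shows one way to do that, assuming replies may be wrapped in markdown code fences; `strip_code_fences` and `parse_codes` are hypothetical helper names, and the sample reply is invented for illustration.

```python
import json
import re
from typing import Dict, List

FENCE = "`" * 3  # built programmatically so the example nests cleanly in markdown

def strip_code_fences(raw: str) -> str:
    """Remove a leading fence (optionally tagged json) and a trailing fence."""
    text = raw.strip()
    text = re.sub(r"^" + FENCE + r"(?:json)?\s*", "", text)
    text = re.sub(r"\s*" + FENCE + r"$", "", text)
    return text.strip()

def parse_codes(raw: str) -> List[Dict]:
    """Parse a reply into a list of code dicts; return [] for malformed replies."""
    try:
        data = json.loads(strip_code_fences(raw))
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only dict entries that actually carry a label
    return [item for item in data if isinstance(item, dict) and item.get("label")]

reply = (
    FENCE + "json\n"
    '[{"label": "member_checking", "definition": "Participants review findings."}]\n'
    + FENCE
)
codes = parse_codes(reply)
print(codes)
print(parse_codes("Sorry, no codes found."))
```

Returning an empty list for anything unexpected lets the caller count the chunk as failed and move on instead of stopping the whole extraction run.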



In [None]:
# This block extracts initial codes from documents using the specified coding approach

def extract_initial_codes(documents: Dict[str, str],
                         coding_strategy: str = "hybrid") -> Tuple[Dict[str, CodeEntry], List]:
    """
    Step 3: Build the initial code set using the selected strategy (inductive, deductive, or hybrid)
    """

    # Start from an empty codebook and extraction log

    codebook = {}
    extraction_log = []

    print(f"Starting fresh extraction with {len(documents)} documents...")
    print(f"Using {coding_strategy} coding strategy")

    # Define extraction prompt based on strategy
    if coding_strategy == "inductive":
        prompt_template = """
        Using inductive coding, extract potential codes from the provided text.
        Focus on:
        - Theoretical concepts and constructs
        - Methodological approaches
        - Key terms that appear multiple times
        - Conceptual frameworks

        For each code provide:
        - label: ≤25 characters, alphanumeric only, no spaces (use_underscores)
        - definition: One clear sentence defining the concept
        - example: Direct quote showing the concept
        - context: Why this is a meaningful code

        Text:
        {text}

        Return ONLY valid JSON without markdown formatting. Structure as array:
        [
          {{
            "label": "code_name",
            "definition": "definition text",
            "example": "quote example",
            "context": "context explanation"
          }}
        ]
        """

    elif coding_strategy == "deductive":
        prompt_template = """
        Extract codes from the provided text based on established theoretical frameworks.
        Look specifically for:
        - Established theories (e.g., grounded theory, phenomenology)
        - Standard methodological approaches
        - Common analytical frameworks
        - Disciplinary conventions

        Format requirements:
        - label: ≤25 characters, alphanumeric only
        - definition: One sentence "litmus test" definition
        - example: Supporting quote

        Text:
        {text}

        Return ONLY valid JSON without markdown formatting. Structure as array:
        [
          {{
            "label": "code_name",
            "definition": "definition text",
            "example": "quote example"
          }}
        ]
        """

    else:  # hybrid - most common
        prompt_template = """
        Extract codes from the provided text using a hybrid approach.

        First, identify standard theoretical/methodological codes:
        - Established frameworks and theories
        - Research design elements
        - Analytical approaches

        Then, identify emergent codes unique to this text:
        - Novel concepts introduced
        - Specific constructs defined
        - Unique methodological innovations

        For each code:
        - label: ≤25 characters, alphanumeric, no spaces
        - definition: One sentence "litmus test" definition
        - code_type: "deductive" or "inductive"
        - example: Direct quote (50-150 words)
        - inclusion: When to use this code
        - exclusion: When NOT to use this code

        Text:
        {text}

        Return ONLY valid JSON without markdown formatting. Structure as array:
        [
          {{
            "label": "code_name",
            "definition": "definition text",
            "code_type": "deductive",
            "example": "quote example",
            "inclusion": "when to use",
            "exclusion": "when not to use"
          }}
        ]
        """

    # Process each document
    successful_extractions = 0
    failed_extractions = 0

    for doc_name, content in documents.items():
        print(f"\nExtracting codes from: {doc_name}")

        # Chunk the document
        chunks = chunk_text(content, Config.CHUNK_SIZE, Config.OVERLAP)

        for i, chunk in enumerate(chunks):
            if i % 3 == 0:
                print(f"  Processing chunk {i+1}/{len(chunks)} (Success: {successful_extractions}, Failed: {failed_extractions})")

            try:
                response = client.messages.create(
                    model=Config.MODEL,
                    max_tokens=2000,
                    temperature=0.2,  # Lower temperature for consistent formatting
                    messages=[{
                        "role": "user",
                        "content": prompt_template.format(text=chunk)
                    }],
                    timeout=120.0
                )

                # Clean and parse the response
                raw_response_text = response.content[0].text

                try:
                    cleaned_response = raw_response_text.strip()

                    # Remove markdown code blocks
                    if cleaned_response.startswith("```json"):
                        cleaned_response = cleaned_response[7:]
                    if cleaned_response.startswith("```"):
                        cleaned_response = cleaned_response[3:]
                    if cleaned_response.endswith("```"):
                        cleaned_response = cleaned_response[:-3]

                    cleaned_response = cleaned_response.strip()

                    # Parse JSON
                    extracted_codes = json.loads(cleaned_response)

                    # Ensure it's a list
                    if not isinstance(extracted_codes, list):
                        print(f"  Warning: Expected list but got {type(extracted_codes)} for chunk {i}")
                        continue

                    # Process each extracted code
                    codes_added_this_chunk = 0
                    for code_data in extracted_codes:
                        if not isinstance(code_data, dict):
                            continue

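                        # Fall back to a placeholder when the model omits the label
                        label = sanitize_code_label(str(code_data.get('label') or f'UNKNOWN_CODE_{i}'))

                        # Skip empty labels and placeholders (sanitized to UNKNOWN_CODE_<chunk>)
                        if not label or label.startswith('UNKNOWN_CODE'):
                            continue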

                        if label not in codebook:
                            # Create new code entry
                            code = CodeEntry()
                            code.label = label
                            code.definition = code_data.get('definition', 'No definition provided')
                            code.source_documents.append(doc_name)
                            code.frequency = 1

                            # Add example
                            example_text = code_data.get('example', 'No example provided')
                            code.examples.append({
                                'text': example_text[:500],  # Limit example length
                                'source': doc_name,
                                'chunk': i
                            })

                            # Add criteria if provided
                            if 'inclusion' in code_data:
                                code.inclusion_criteria.append(code_data['inclusion'])
                            if 'exclusion' in code_data:
                                code.exclusion_criteria.append(code_data['exclusion'])

                            # Add note about extraction
                            code.notes.append({
                                'date': datetime.now().isoformat(),
                                'note': f"Extracted via {coding_strategy} coding from {doc_name}",
                                'context': code_data.get('context', '')
                            })

                            codebook[label] = code
                            codes_added_this_chunk += 1
                            print(f"      Created new code: {label}")

                        else:
                            # Update existing code
                            codebook[label].frequency += 1
                            if doc_name not in codebook[label].source_documents:
                                codebook[label].source_documents.append(doc_name)

                            # Keep adding examples until the per-code minimum is reached
                            if len(codebook[label].examples) < Config.MIN_EXAMPLES_PER_CODE:
                                codebook[label].examples.append({
                                    'text': code_data.get('example', 'No example provided')[:500],
                                    'source': doc_name,
                                    'chunk': i
                                })
                            print(f"      Updated existing code: {label} (freq: {codebook[label].frequency})")

                        # Log extraction
                        extraction_log.append({
                            'timestamp': datetime.now().isoformat(),
                            'document': doc_name,
                            'chunk': i,
                            'code': label,
                            'action': 'created' if codebook[label].frequency == 1 else 'updated'
                        })

                    successful_extractions += 1
                    if codes_added_this_chunk > 0:
                        print(f"    Added {codes_added_this_chunk} new codes from chunk {i}")

                except json.JSONDecodeError as e:
                    failed_extractions += 1
                    print(f"  JSON parsing failed for chunk {i}: {e}")
                    print(f"  Raw response (first 300 chars): {raw_response_text[:300]}...")

                    # Save problematic response for debugging
                    with open(f"{Config.OUTPUT_PATH}debug_chunk_{doc_name}_{i}.txt", 'w') as f:
                        f.write(f"Original:\n{raw_response_text}\n\nCleaned:\n{cleaned_response}")

                time.sleep(0.5)  # Rate limiting

            except Exception as e:
                failed_extractions += 1
                print(f"Error processing chunk {i}: {str(e)}")

    print(f"\nExtraction Summary:")
    print(f"  Successful chunks: {successful_extractions}")
    print(f"  Failed chunks: {failed_extractions}")
    print(f"  Total codes extracted: {len(codebook)}")
    print(f"  Unique code labels: {list(codebook.keys())}")

    # Sanity check only: dictionary keys are unique by construction, so this should always pass
    if len(codebook) != len(set(codebook.keys())):
        print("  ⚠️ WARNING: Duplicate labels detected in codebook keys!")
    else:
        print("  ✅ No duplicates detected in codebook")

    return codebook, extraction_log

def sanitize_code_label(label: str) -> str:
    """Ensure code label meets requirements: ≤25 chars, alphanumeric only, with underscores between words"""
    # First, handle spaces and convert them to underscores
    label = label.replace(' ', '_')

    # Insert underscores between camelCase words (e.g., ResponsibleAI -> Responsible_AI)
    label = re.sub(r'([a-z])([A-Z])', r'\1_\2', label)

    # Insert underscores between letters and numbers (e.g., AI2 -> AI_2)
    label = re.sub(r'([a-zA-Z])([0-9])', r'\1_\2', label)

    # Insert underscores between numbers and letters (e.g., 2AI -> 2_AI)
    label = re.sub(r'([0-9])([a-zA-Z])', r'\1_\2', label)

    # Replace non-alphanumeric characters (other than underscores) with underscores
    label = re.sub(r'[^a-zA-Z0-9_]', '_', label)

    # Collapse runs of underscores into one
    label = re.sub(r'_+', '_', label)

    # Trim to length
    label = label[:Config.MAX_CODE_LABEL_LENGTH]

    # Remove leading and trailing underscores
    label = label.strip('_')

    return label.upper()  # Uppercase for consistency

## Code Refinement and Validation


Refine extracted codes through similarity detection and merging while assessing conceptual distinctness. This process ensures each code represents a unique theoretical construct and meets quality standards for research application.
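The similarity check can be pictured with a small, self-contained sketch. The notebook compares sentence-transformer embeddings; here plain word-count vectors stand in for them so the example runs without extra packages, and the `0.7` threshold and the sample definitions are illustrative assumptions only.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

def bag_of_words(text: str) -> Counter:
    """Word-count vector (stand-in for a sentence embedding)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def merge_candidates(definitions: Dict[str, str], threshold: float) -> List[Tuple[str, str, float]]:
    """Return each pair of codes whose similarity exceeds the threshold, once per pair."""
    vectors = {label: bag_of_words(text) for label, text in definitions.items()}
    pairs = []
    for first, second in combinations(definitions, 2):
        score = cosine(vectors[first], vectors[second])
        if score > threshold:
            pairs.append((first, second, round(score, 2)))
    return pairs

# Invented definitions for illustration
definitions = {
    "MEMBER_CHECKING": "participants review findings for accuracy",
    "RESPONDENT_VALIDATION": "participants review findings to confirm accuracy",
    "SNOWBALL_SAMPLING": "recruiting new participants through existing contacts",
}
candidates = merge_candidates(definitions, threshold=0.7)
print(candidates)
```

A high score only nominates a pair for review. Whether the two codes are actually merged is a separate decision, which the notebook delegates to the language model.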

In [None]:
# This block handles the refinement of codes including merging similar codes

def refine_and_assess_reliability(codebook: Dict[str, CodeEntry]) -> Tuple[Dict, Dict]:
    """
    Step 4: Refine codes and assess conceptual reliability
    Following guidelines for intercoder reliability in manual coding
    """

    print("\n=== Codebook Refinement Process ===")

    # 1. Prune rare codes
    print("\n1. Pruning rare codes...")
    refined_codebook = {}
    for label, code in codebook.items():
        if code.frequency >= Config.MIN_CODE_FREQUENCY:
            refined_codebook[label] = code
        else:
            print(f"  Removed: {label} (frequency: {code.frequency})")

    print(f"  Retained {len(refined_codebook)} codes")

    # 2. Check for conceptual overlap and merge similar codes
    print("\n2. Checking for conceptual overlap...")
    merge_decisions = identify_similar_codes(refined_codebook)

    # Apply the merges the LLM approved (still part of step 2)
    for decision in merge_decisions:
        # Use .get() so an incomplete LLM response cannot raise KeyError
        if decision.get('should_merge') and decision.get('merged_label') and decision.get('merged_definition'):
            refined_codebook = merge_codes(
                refined_codebook,
                decision['code1'],
                decision['code2'],
                decision['merged_label'],
                decision['merged_definition']
            )

    # 3. Ensure all codes have complete definitions
    print("\n3. Validating code completeness...")
    validation_report = validate_codes(refined_codebook)

    # 4. Generate reliability metrics (conceptual distinctness)
    print("\n4. Assessing conceptual distinctness...")
    reliability_metrics = assess_conceptual_distinctness(refined_codebook)

    return refined_codebook, {
        'validation': validation_report,
        'reliability': reliability_metrics,
        'merge_decisions': merge_decisions
    }

def identify_similar_codes(codebook: Dict[str, CodeEntry]) -> List[Dict]:
    """Identify codes that may need merging"""

    if len(codebook) < 2:
        return []

    # Create embeddings
    model = SentenceTransformer('all-MiniLM-L6-v2')
    code_labels = list(codebook.keys())

    # Combine label, definition, and examples for embedding
    code_texts = []
    for label in code_labels:
        code = codebook[label]
        examples_text = " ".join([ex['text'][:100] for ex in code.examples[:2]])
        code_text = f"{label}: {code.definition}. Examples: {examples_text}"
        code_texts.append(code_text)

    embeddings = model.encode(code_texts)
    similarity_matrix = cosine_similarity(embeddings)

    # Find similar pairs
    merge_candidates = []

    for i in range(len(code_labels)):
        for j in range(i + 1, len(code_labels)):
            if similarity_matrix[i][j] > Config.SIMILARITY_THRESHOLD:
                # Use LLM to make merge decision
                decision = evaluate_merge_decision(
                    codebook[code_labels[i]],
                    codebook[code_labels[j]],
                    similarity_matrix[i][j]
                )
                merge_candidates.append(decision)

    return merge_candidates

def evaluate_merge_decision(code1: CodeEntry, code2: CodeEntry, similarity: float) -> Dict:
    """Use LLM to evaluate whether codes should be merged"""

    prompt = f"""
    Evaluate whether these two codes should be merged based on conceptual overlap.

    Similarity score: {similarity:.2f}

    Code 1: {code1.label}
    Definition: {code1.definition}
    Example: {code1.examples[0]['text'] if code1.examples else 'No example'}

    Code 2: {code2.label}
    Definition: {code2.definition}
    Example: {code2.examples[0]['text'] if code2.examples else 'No example'}

    Consider:
    - Are these conceptually distinct despite similar language?
    - Would merging lose important nuance?
    - Is one a subset of the other?

    Return ONLY valid JSON without markdown formatting:
    {{
        "should_merge": true,
        "rationale": "explanation of decision",
        "merged_label": "suggested_label_if_merging",
        "merged_definition": "comprehensive definition if merging"
    }}
    """

    try:
        response = client.messages.create(
            model=Config.MODEL,
            max_tokens=500,
            temperature=0.2,
            messages=[{"role": "user", "content": prompt}]
        )

        # Clean and parse response
        raw_text = response.content[0].text

        try:
            cleaned_response = raw_text.strip()

            # Remove markdown code blocks
            if cleaned_response.startswith("```json"):
                cleaned_response = cleaned_response[7:]
            if cleaned_response.startswith("```"):
                cleaned_response = cleaned_response[3:]
            if cleaned_response.endswith("```"):
                cleaned_response = cleaned_response[:-3]

            cleaned_response = cleaned_response.strip()

            decision = json.loads(cleaned_response)

        except json.JSONDecodeError as e:
            print(f"JSON parsing error in merge evaluation: {e}")
            decision = {
                'should_merge': False,
                'rationale': 'Error parsing response'
            }

        decision['code1'] = code1.label
        decision['code2'] = code2.label
        decision['similarity'] = float(similarity)  # Plain float keeps the decision JSON-serializable

        return decision

    except Exception as e:
        print(f"Error evaluating merge: {str(e)}")
        return {
            'code1': code1.label,
            'code2': code2.label,
            'similarity': float(similarity),
            'should_merge': False,
            'rationale': 'Error in evaluation'
        }

def merge_codes(codebook: Dict, code1: str, code2: str,
                merged_label: str, merged_definition: str) -> Dict:
    """Merge two codes into one"""
    if code1 in codebook and code2 in codebook:
        # Create merged code
        merged_code = codebook[code1]
        merged_code.label = sanitize_code_label(merged_label)
        merged_code.definition = merged_definition

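        # Combine examples, criteria, and frequency
        merged_code.examples.extend(codebook[code2].examples)
        merged_code.inclusion_criteria.extend(codebook[code2].inclusion_criteria)
        merged_code.exclusion_criteria.extend(codebook[code2].exclusion_criteria)
        merged_code.frequency += codebook[code2].frequency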

        # Combine source documents
        merged_code.source_documents = list(set(
            merged_code.source_documents + codebook[code2].source_documents
        ))

        # Add merge note
        merged_code.notes.append({
            'date': datetime.now().isoformat(),
            'note': f"Merged with {code2}",
            'context': 'Conceptual similarity detected'
        })

        # Remove old codes and add merged
        del codebook[code1]
        del codebook[code2]
        codebook[merged_code.label] = merged_code

    return codebook

## Quality Assessment


Validate code completeness and assess conceptual reliability following established methodological guidelines. Generate metrics for definition quality, example sufficiency, and overall codebook integrity.
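To make the quality score concrete: it is the share of codes with no validation issues, that is, one minus the number of flagged codes divided by the total. The sketch below uses a hypothetical `SketchCode` dataclass in place of the notebook's `CodeEntry`, and the minimum definition length and example count are assumed values rather than the configured ones.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SketchCode:
    """Hypothetical stand-in for the notebook's CodeEntry."""
    definition: str = ""
    inclusion_criteria: List[str] = field(default_factory=list)
    exclusion_criteria: List[str] = field(default_factory=list)
    examples: List[str] = field(default_factory=list)

def quality_score(codebook: Dict[str, SketchCode],
                  min_definition_length: int = 20,
                  min_examples: int = 1) -> float:
    """Return the fraction of codes that pass every completeness check."""
    if not codebook:
        return 0.0
    flagged = 0
    for code in codebook.values():
        has_issue = (
            len(code.definition) < min_definition_length
            or not code.inclusion_criteria
            or not code.exclusion_criteria
            or len(code.examples) < min_examples
        )
        flagged += int(has_issue)
    return 1 - flagged / len(codebook)

def complete_code() -> SketchCode:
    """Build a code that passes every check (invented content)."""
    return SketchCode(
        definition="Participants review the findings for accuracy.",
        inclusion_criteria=["Participants comment on draft findings"],
        exclusion_criteria=["Peer debriefing among researchers"],
        examples=["We returned transcripts to interviewees."],
    )

codebook = {
    "MEMBER_CHECKING": complete_code(),
    "THICK_DESCRIPTION": complete_code(),
    "REFLEXIVITY": complete_code(),
    "SATURATION": SketchCode(definition="Too short"),  # fails several checks
}
score = quality_score(codebook)
print(score)
```

A code counts against the score once, however many checks it fails, so the figure reads as "proportion of codes ready to use" rather than as an issue count.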

In [None]:
# This block validates the quality of extracted codes

def validate_codes(codebook: Dict[str, CodeEntry]) -> Dict:
    """Validate each code meets quality criteria"""

    validation_results = {
        'total_codes': len(codebook),
        'issues_by_code': {},
        'summary': {
            'missing_definitions': 0,
            'short_definitions': 0,
            'missing_inclusion': 0,
            'missing_exclusion': 0,
            'insufficient_examples': 0
        }
    }

    if len(codebook) == 0:
        validation_results['quality_score'] = 0.0
        return validation_results

    for label, code in codebook.items():
        issues = []

        # Check definition
        if not code.definition:
            issues.append("Missing definition")
            validation_results['summary']['missing_definitions'] += 1
        elif len(code.definition) < Config.MIN_DEFINITION_LENGTH:
            issues.append(f"Definition too short ({len(code.definition)} chars)")
            validation_results['summary']['short_definitions'] += 1

        # Check criteria
        if not code.inclusion_criteria:
            issues.append("Missing inclusion criteria")
            validation_results['summary']['missing_inclusion'] += 1

        if not code.exclusion_criteria:
            issues.append("Missing exclusion criteria")
            validation_results['summary']['missing_exclusion'] += 1

        # Check examples
        if len(code.examples) < Config.MIN_EXAMPLES_PER_CODE:
            issues.append(f"Insufficient examples ({len(code.examples)})")
            validation_results['summary']['insufficient_examples'] += 1

        if issues:
            validation_results['issues_by_code'][label] = issues

    validation_results['quality_score'] = 1 - (
        len(validation_results['issues_by_code']) / len(codebook)
    )

    return validation_results

def assess_conceptual_distinctness(codebook: Dict[str, CodeEntry]) -> Dict:
    """Assess how conceptually distinct codes are from each other"""

    if len(codebook) < 2:
        return {'average_distinctness': 1.0, 'min_distinctness': 1.0, 'overlap_pairs': []}

    # Create embeddings for all codes
    model = SentenceTransformer('all-MiniLM-L6-v2')
    code_labels = list(codebook.keys())

    code_representations = []
    for label in code_labels:
        code = codebook[label]
        # Combine all information about the code
        representation = f"{label} {code.definition} "
        representation += " ".join(code.inclusion_criteria)
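        # Leading space keeps the last inclusion criterion from fusing with the first exclusion criterion
        representation += " " + " ".join(code.exclusion_criteria)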
        code_representations.append(representation)

    embeddings = model.encode(code_representations)
    similarity_matrix = cosine_similarity(embeddings)

    # Calculate distinctness metrics
    overlap_pairs = []
    distinctness_scores = []

    for i in range(len(code_labels)):
        max_similarity = 0
        for j in range(len(code_labels)):
            if i != j:
                max_similarity = max(max_similarity, similarity_matrix[i][j])
                # Record each overlapping pair once (i < j) rather than in both directions
                if i < j and similarity_matrix[i][j] > 0.7:  # Concerning overlap
                    overlap_pairs.append({
                        'code1': code_labels[i],
                        'code2': code_labels[j],
                        'similarity': float(similarity_matrix[i][j])
                    })

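        distinctness_scores.append(1 - max_similarity)

    # Sort by similarity so the slice in the return value keeps the most overlapping pairs
    overlap_pairs.sort(key=lambda pair: pair['similarity'], reverse=True)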

    return {
        'average_distinctness': float(np.mean(distinctness_scores)),
        'min_distinctness': float(np.min(distinctness_scores)),
        'overlap_pairs': overlap_pairs[:10]  # Top 10 overlapping pairs
    }
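
The distinctness metric can be sketched without an embedding model: each code's distinctness is one minus its highest cosine similarity to any other code. The example below uses small hand-written vectors in place of sentence embeddings; `cosine`, `distinctness_scores`, and the toy vectors are hypothetical names and values for illustration only.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def distinctness_scores(vectors: List[Sequence[float]],
                        overlap_threshold: float = 0.7
                        ) -> Tuple[List[float], List[Tuple[int, int, float]]]:
    """Return per-code distinctness and the overlapping pairs, most similar first."""
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    # Distinctness = 1 - highest similarity to any *other* code
    scores = [1 - max(sim[i][j] for j in range(n) if j != i) for i in range(n)]
    # Each unordered pair is reported once
    pairs = [(i, j, sim[i][j]) for i, j in combinations(range(n), 2)
             if sim[i][j] > overlap_threshold]
    pairs.sort(key=lambda p: p[2], reverse=True)
    return scores, pairs


# Toy 2-D vectors standing in for sentence embeddings (illustrative values only)
toy_vectors = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0)]
scores, overlaps = distinctness_scores(toy_vectors)
print(scores)
print(overlaps)
```

Here the first two vectors have a similarity of about 0.8, so they are the only pair above the 0.7 threshold, and both receive a distinctness of about 0.2.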

## Version Control

Implement semantic versioning and change tracking for the developing codebook. Maintain detailed changelog and create snapshots at key development milestones to ensure reproducibility and documentation.
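
The version arithmetic behind semantic versioning can be shown in isolation. The sketch below is a standalone helper (`bump` is a hypothetical name, not part of the notebook) that follows the usual major.minor.patch rules; unlike the notebook's class, it rejects unknown change types instead of treating them as a patch, which is a design choice made only for this example.

```python
def bump(version: str, change_type: str = "patch") -> str:
    """Return the next semantic version for the given change type."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change_type == "major":
        return f"{major + 1}.0.0"           # breaking change: reset minor and patch
    if change_type == "minor":
        return f"{major}.{minor + 1}.0"     # new content: reset patch
    if change_type == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"Unknown change type: {change_type!r}")


print(bump("1.0.0", "minor"))  # 1.1.0
print(bump("1.1.0", "major"))  # 2.0.0
print(bump("2.0.0"))           # 2.0.1
```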


In [None]:
# This block implements semantic versioning for the codebook

class CodebookVersionControl:
    """Implement semantic versioning and change tracking"""

    def __init__(self):
        self.current_version = "1.0.0"
        self.changelog = []
        self.version_history = {}

    def save_version(self, codebook: Dict[str, CodeEntry], change_description: str):
        """Save a version with changelog entry"""

        # Create version snapshot
        snapshot = {
            'version': self.current_version,
            'timestamp': datetime.now().isoformat(),
            'num_codes': len(codebook),
            'change_description': change_description,
            'codebook_snapshot': self._serialize_codebook(codebook)
        }

        # Calculate checksum
        checksum = hashlib.md5(
            json.dumps(snapshot['codebook_snapshot'], sort_keys=True).encode()
        ).hexdigest()
        snapshot['checksum'] = checksum

        # Save to version history
        filename = f"{Config.VERSION_PATH}codebook_v{self.current_version}.json"
        with open(filename, 'w') as f:
            json.dump(snapshot, f, indent=2)

        # Update changelog
        self.changelog.append({
            'version': self.current_version,
            'date': datetime.now().isoformat(),
            'changes': change_description,
            'checksum': checksum
        })

        # Save changelog
        with open(f"{Config.VERSION_PATH}CHANGELOG.json", 'w') as f:
            json.dump(self.changelog, f, indent=2)

        print(f"Saved version {self.current_version}: {change_description}")

    def increment_version(self, change_type: str = "patch"):
        """Increment version number (major.minor.patch)"""
        major, minor, patch = map(int, self.current_version.split('.'))

        if change_type == "major":
            major += 1
            minor = 0
            patch = 0
        elif change_type == "minor":
            minor += 1
            patch = 0
        else:  # patch
            patch += 1

        self.current_version = f"{major}.{minor}.{patch}"
        return self.current_version

    def _serialize_codebook(self, codebook: Dict[str, CodeEntry]) -> Dict:
        """Convert CodeEntry objects to JSON-serializable format"""
        serialized = {}
        for label, code in codebook.items():
            serialized[label] = {
                'label': code.label,
                'definition': code.definition,
                'inclusion_criteria': code.inclusion_criteria,
                'exclusion_criteria': code.exclusion_criteria,
                'examples': code.examples,
                'notes': code.notes,
                'source_documents': code.source_documents,
                'frequency': code.frequency,
                'created_date': code.created_date.isoformat(),
                'last_modified': code.last_modified.isoformat(),
                'version': code.version
            }
        return serialized

# Initialize version control
version_control = CodebookVersionControl()
print("✓ Version control initialized")
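
A saved snapshot can later be checked against its stored checksum. The sketch below recomputes the MD5 digest the same way `save_version` does (JSON with sorted keys) and compares it with the recorded value. `snapshot_checksum` and `verify_snapshot` are hypothetical helpers, and the snapshot here is built in memory rather than read from a version file.

```python
import hashlib
import json


def snapshot_checksum(codebook_snapshot: dict) -> str:
    """MD5 of the serialized codebook; sort_keys makes the digest order-independent."""
    payload = json.dumps(codebook_snapshot, sort_keys=True).encode()
    return hashlib.md5(payload).hexdigest()


def verify_snapshot(snapshot: dict) -> bool:
    """Return True when the stored checksum matches the snapshot's codebook content."""
    return snapshot.get("checksum") == snapshot_checksum(snapshot["codebook_snapshot"])


# Illustrative snapshot with the same top-level keys that save_version writes
snapshot = {
    "version": "1.1.0",
    "codebook_snapshot": {"reciprocity": {"definition": "Mutual exchange", "frequency": 3}},
}
snapshot["checksum"] = snapshot_checksum(snapshot["codebook_snapshot"])
print(verify_snapshot(snapshot))   # True

# Any edit to the codebook content invalidates the stored checksum
snapshot["codebook_snapshot"]["reciprocity"]["frequency"] = 4
print(verify_snapshot(snapshot))   # False
```

MD5 is used here only to detect accidental changes between versions; it is not suitable as a defense against deliberate tampering.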

## Export and Documentation

Generate codebook outputs in multiple formats for use with qualitative analysis software. Exports include CSV, JSON, Markdown documentation, and simplified files intended for ATLAS.ti (JSON) and NVivo (CSV).
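
To make the CSV layout concrete, the sketch below flattens list-valued fields into single cells joined with `"; "` and pads missing examples with empty strings. It uses the standard-library `csv` module instead of pandas so it runs without extra dependencies; `codes_to_csv` and the sample data are hypothetical and cover only a subset of the exported columns.

```python
import csv
import io


def codes_to_csv(codes: dict) -> str:
    """Flatten a {label: code-dict} mapping into CSV text with one row per code."""
    fieldnames = ["code_label", "definition", "inclusion_criteria", "example_1", "example_2"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for label, code in codes.items():
        examples = [ex["text"] for ex in code["examples"]]
        writer.writerow({
            "code_label": label,
            "definition": code["definition"],
            # Lists become one cell; "; " keeps them readable in a spreadsheet
            "inclusion_criteria": "; ".join(code["inclusion_criteria"]),
            "example_1": examples[0] if examples else "",
            "example_2": examples[1] if len(examples) > 1 else "",
        })
    return buffer.getvalue()


sample_codes = {
    "reciprocity": {
        "definition": "Mutual exchange of goods, labor, or favors",
        "inclusion_criteria": ["Exchange is described", "A return is expected"],
        "examples": [{"text": "Gifts are answered with gifts."}],
    },
}
csv_text = codes_to_csv(sample_codes)
print(csv_text)
```

The definition contains commas, so the writer quotes that cell automatically; joining criteria with `"; "` avoids adding further commas inside a cell.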

In [None]:
# This block handles exporting the codebook in various formats

def export_codebook_formats(codebook: Dict[str, CodeEntry],
                          assessment_report: Dict,
                          purpose_config: Dict):
    """
    Step 6: Format for human and machine readability
    Export in multiple formats as recommended
    """

    # 1. CSV format (for quantitative linkage)
    export_to_csv(codebook)

    # 2. JSON format (for software integration)
    export_to_json(codebook, assessment_report)

    # 3. Human-readable markdown
    export_to_markdown(codebook, assessment_report, purpose_config)

    # 4. ATLAS.ti compatible format
    export_to_atlas_format(codebook)

    # 5. NVivo compatible format
    export_to_nvivo_format(codebook)

    print(f"\nExported codebook in 5 formats to {Config.OUTPUT_PATH}")

def export_to_csv(codebook: Dict[str, CodeEntry]):
    """Export as CSV with proper formatting"""

    rows = []
    for label, code in codebook.items():
        row = {
            'code_label': label,
            'definition': code.definition,
            'inclusion_criteria': '; '.join(code.inclusion_criteria),
            'exclusion_criteria': '; '.join(code.exclusion_criteria),
            'example_1': code.examples[0]['text'] if code.examples else '',
            'example_2': code.examples[1]['text'] if len(code.examples) > 1 else '',
            'frequency': code.frequency,
            'source_documents': '; '.join(code.source_documents),
            'created_date': code.created_date.strftime('%Y-%m-%d'),
            'version': code.version
        }
        rows.append(row)

    df = pd.DataFrame(rows)
    df.to_csv(f"{Config.OUTPUT_PATH}codebook.csv", index=False)

def export_to_json(codebook: Dict[str, CodeEntry], assessment_report: Dict):
    """Export as JSON for software integration"""

    json_export = {
        'metadata': {
            'version': version_control.current_version,
            'created': datetime.now().isoformat(),
            'total_codes': len(codebook),
            'quality_score': assessment_report['validation']['quality_score']
        },
        'codes': {}
    }

    for label, code in codebook.items():
        json_export['codes'][label] = {
            'definition': code.definition,
            'inclusion': code.inclusion_criteria,
            'exclusion': code.exclusion_criteria,
            'examples': [ex['text'] for ex in code.examples],
            'frequency': code.frequency,
            'sources': code.source_documents
        }

    with open(f"{Config.OUTPUT_PATH}codebook.json", 'w') as f:
        json.dump(json_export, f, indent=2)

def export_to_markdown(codebook: Dict[str, CodeEntry],
                      assessment_report: Dict,
                      purpose_config: Dict):
    """Generate comprehensive markdown documentation"""

    md_content = f"""# Codebook Documentation

**Version**: {version_control.current_version}
**Generated**: {datetime.now().strftime('%Y-%m-%d %H:%M')}
**Purpose**: {purpose_config['purpose']}
**Epistemological Stance**: {purpose_config['epistemological_stance']}

## Overview

This codebook was developed through systematic extraction from {len(set().union(*[code.source_documents for code in codebook.values()]))} source documents using a {Config.CODING_STRATEGY} coding approach.

### Statistics
- **Total Codes**: {len(codebook)}
- **Average Frequency**: {np.mean([code.frequency for code in codebook.values()]):.1f}
- **Conceptual Distinctness**: {assessment_report['reliability'].get('average_distinctness', 'N/A')}

## Code Definitions

"""

    # Sort codes alphabetically
    for label in sorted(codebook.keys()):
        code = codebook[label]

        md_content += f"""### {label}

**Definition**: {code.definition}

**Frequency**: {code.frequency} occurrences in {len(code.source_documents)} documents

**When to use this code**:
"""
        for criterion in code.inclusion_criteria:
            md_content += f"- {criterion}\n"

        md_content += "\n**When NOT to use this code**:\n"
        for criterion in code.exclusion_criteria:
            md_content += f"- {criterion}\n"

        md_content += "\n**Example applications**:\n\n"
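        for i, example in enumerate(code.examples[:2], 1):
            # Append an ellipsis only when the example text was actually truncated
            excerpt = example["text"][:200] + ("..." if len(example["text"]) > 200 else "")
            md_content += f'{i}. "{excerpt}" (*{example["source"]}*)\n\n'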

        if code.notes:
            md_content += "**Notes**:\n"
            for note in code.notes[-2:]:  # Last 2 notes
                md_content += f"- {note['date'][:10]}: {note['note']}\n"

        md_content += "\n---\n\n"

    # Add changelog
    md_content += "## Version History\n\n"
    for entry in version_control.changelog[-5:]:  # Last 5 changes
        md_content += f"- **v{entry['version']}** ({entry['date'][:10]}): {entry['changes']}\n"

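    # Explicit UTF-8 so non-ASCII characters in quoted source text are written reliably
    with open(f"{Config.OUTPUT_PATH}codebook_documentation.md", 'w', encoding='utf-8') as f: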
        f.write(md_content)

def export_to_atlas_format(codebook: Dict[str, CodeEntry]):
    """Export in ATLAS.ti compatible format"""
    # Simplified representation for ATLAS.ti, written as JSON
    atlas_export = []
    for label, code in codebook.items():
        atlas_export.append({
            'name': label,
            'comment': code.definition,
            'examples': [ex['text'] for ex in code.examples]
        })

    with open(f"{Config.OUTPUT_PATH}codebook_atlas.json", 'w') as f:
        json.dump(atlas_export, f, indent=2)

def export_to_nvivo_format(codebook: Dict[str, CodeEntry]):
    """Export in NVivo compatible format"""
    # Simplified format for NVivo
    nvivo_export = []
    for label, code in codebook.items():
        nvivo_export.append({
            'Name': label,
            'Description': code.definition,
            'Files': ', '.join(code.source_documents)
        })

    df = pd.DataFrame(nvivo_export)
    df.to_csv(f"{Config.OUTPUT_PATH}codebook_nvivo.csv", index=False)
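
Downstream tools can read the JSON export with a few lines of code. The sketch below assumes the structure produced by `export_to_json` (a `metadata` block plus a `codes` mapping) and ranks codes by frequency. The data is built in memory and round-tripped through `json` so the example runs without an output file; `codes_by_frequency` is a hypothetical helper.

```python
import json


def codes_by_frequency(export: dict) -> list:
    """Return code labels ordered from most to least frequent."""
    codes = export["codes"]
    return sorted(codes, key=lambda label: codes[label]["frequency"], reverse=True)


# Illustrative export with the same shape as codebook.json
export_text = json.dumps({
    "metadata": {"version": "2.0.0", "total_codes": 2},
    "codes": {
        "kinship": {"definition": "Ties of descent and marriage", "frequency": 2},
        "reciprocity": {"definition": "Mutual exchange", "frequency": 5},
    },
})

loaded = json.loads(export_text)
print(codes_by_frequency(loaded))  # ['reciprocity', 'kinship']
```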

## Quality Reporting

Create detailed quality assurance reports and usage guidelines for the finalized codebook. Generate metrics, recommendations, and documentation to support peer review and research team adoption.
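
The recommendation logic amounts to a handful of threshold checks over the validation summary. A minimal sketch, assuming the same two rules this section applies (missing definitions, and average distinctness below 0.8); `build_recommendations` is a hypothetical standalone function.

```python
def build_recommendations(summary: dict,
                          average_distinctness: float,
                          distinctness_floor: float = 0.8) -> list:
    """Translate validation counts and distinctness into plain-language advice."""
    recommendations = []
    if summary.get("missing_definitions", 0) > 0:
        recommendations.append("Add definitions to all codes before finalizing")
    if average_distinctness < distinctness_floor:
        recommendations.append("Review overlapping codes for potential mergers")
    return recommendations


print(build_recommendations({"missing_definitions": 2}, average_distinctness=0.65))
print(build_recommendations({"missing_definitions": 0}, average_distinctness=0.92))
```

The first call returns both recommendations and the second returns an empty list, which is how a clean codebook would appear in the report.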

In [None]:
# This block generates comprehensive quality reports

def generate_quality_report(codebook: Dict[str, CodeEntry],
                          assessment_report: Dict,
                          extraction_log: List):
    """Generate comprehensive quality assurance report"""

    report = {
        'metadata': {
            'version': version_control.current_version,
            'generated': datetime.now().isoformat(),
            'total_codes': len(codebook)
        },
        'quality_metrics': {
            'overall_quality': assessment_report['validation']['quality_score'],
            'conceptual_distinctness': assessment_report['reliability']['average_distinctness'],
            'validation_issues': assessment_report['validation']['summary']
        },
        'extraction_summary': {
            'total_extractions': len(extraction_log),
            'documents_processed': len(set(e['document'] for e in extraction_log))
        },
        'recommendations': []
    }

    # Generate recommendations
    if assessment_report['validation']['summary']['missing_definitions'] > 0:
        report['recommendations'].append(
            "Add definitions to all codes before finalizing"
        )

    if assessment_report['reliability']['average_distinctness'] < 0.8:
        report['recommendations'].append(
            "Review overlapping codes for potential mergers"
        )

    with open(f"{Config.OUTPUT_PATH}quality_report.json", 'w') as f:
        json.dump(report, f, indent=2)

def create_usage_guidelines(codebook: Dict[str, CodeEntry], purpose_config: Dict):
    """Create guidelines for using the codebook"""

    guidelines = f"""# Codebook Usage Guidelines

## Purpose
{purpose_config['purpose']}

## How to Apply These Codes

1. **Read the full definition** before applying any code
2. **Check inclusion criteria** - the text must meet these conditions
3. **Check exclusion criteria** - if any apply, do not use the code
4. **Reference the examples** when uncertain
5. **Document edge cases** in your coding notes

## Code Application Rules

- Codes are designed to be conceptually distinct from one another
- Multiple codes may be applied to the same text segment if justified
- When in doubt, refer to the examples provided
- Record rationale for difficult coding decisions

## Quality Checks

- Aim for consistency across coders
- Regular team meetings to discuss edge cases
- Update coding notes for future reference
- Version control any modifications

## Contact
For questions about code definitions or applications, consult the codebook documentation.
"""

    with open(f"{Config.OUTPUT_PATH}usage_guidelines.md", 'w') as f:
        f.write(guidelines)

## Processing Pipeline

Orchestrate the complete codebook development workflow from document analysis through final export. This main execution function coordinates all previous components following methodological best practices.
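
The pipeline saves a version at each milestone and only freezes the codebook when the quality score reaches the threshold. The sketch below traces which version numbers a run would save, assuming the starting version `1.0.0` and the 0.7 threshold used in this notebook; `planned_versions` is a hypothetical helper for illustration.

```python
def planned_versions(quality_score: float, freeze_threshold: float = 0.7) -> list:
    """List the version numbers a pipeline run would save, in order."""
    versions = ["1.0.0"]          # initial setup, saved with an empty codebook
    versions.append("1.1.0")      # minor bump after initial code extraction
    if quality_score >= freeze_threshold:
        versions.append("2.0.0")  # major bump when the codebook is frozen
    return versions


print(planned_versions(0.85))  # ['1.0.0', '1.1.0', '2.0.0']
print(planned_versions(0.50))  # ['1.0.0', '1.1.0']
```

A run that stops at `1.1.0` still exports all files; the missing major version signals that the codebook has not been frozen.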

In [None]:
# This is the main execution pipeline that orchestrates the entire codebook development process

def develop_codebook_pipeline():
    """
    Main pipeline following the 8-step best practice guide
    """

    print("=== Codebook Development System ===")
    print("Following methodological best practices\n")

    # Check if documents are uploaded
    if 'uploaded_documents' not in globals() or not uploaded_documents:
        print("\nNo documents found. Please run the file upload cell above first!")
        return

    # Step 1: Define purpose and scope
    print("Step 1: Defining codebook purpose and scope...")
    purpose_config = define_codebook_purpose()
    print(f"Purpose: {purpose_config['purpose']}")
    print(f"Epistemological stance: {purpose_config['epistemological_stance']}")

    # Step 2: Analyze document types
    print("\nStep 2: Analyzing uploaded documents...")
    document_analysis = analyze_document_types(uploaded_documents)
    print(f"Analyzed {len(document_analysis)} documents")

    # Save initial version
    version_control.save_version({}, "Initial setup - document analysis complete")

    # Step 3: Extract initial codes
    print("\nStep 3: Building initial code set...")
    initial_codebook, extraction_log = extract_initial_codes(
        uploaded_documents,
        Config.CODING_STRATEGY
    )
    print(f"Extracted {len(initial_codebook)} initial codes")

    # Save version after initial extraction
    version_control.increment_version("minor")
    version_control.save_version(initial_codebook, "Initial code extraction complete")

    # Step 4: Refine and assess reliability
    print("\nStep 4: Refining codes and assessing reliability...")
    refined_codebook, assessment_report = refine_and_assess_reliability(initial_codebook)
    print(f"Refined to {len(refined_codebook)} codes")
    print(f"Quality score: {assessment_report['validation']['quality_score']:.2f}")

    # Step 5: Finalize and freeze
    if assessment_report['validation']['quality_score'] >= 0.7:
        print("\nStep 5: Freezing codebook (quality threshold met)")
        version_control.increment_version("major")
        version_control.save_version(refined_codebook, "Codebook frozen - ready for use")
    else:
        print("\nStep 5: Codebook needs improvement before freezing")
        print("Issues to address:")
        for issue_type, count in assessment_report['validation']['summary'].items():
            if count > 0:
                print(f"  - {issue_type}: {count} codes")

    # Step 6: Export in multiple formats
    print("\nStep 6: Formatting for human and machine readability...")
    export_codebook_formats(refined_codebook, assessment_report, purpose_config)

    # Step 7: Generate quality report
    print("\nStep 7: Generating quality assurance report...")
    generate_quality_report(refined_codebook, assessment_report, extraction_log)

    # Step 8: Create usage guidelines
    print("\nStep 8: Creating usage guidelines...")
    create_usage_guidelines(refined_codebook, purpose_config)

    print("\n=== Codebook Development Complete ===")
    print(f"Version: {version_control.current_version}")
    print(f"Total codes: {len(refined_codebook)}")
    print(f"Quality score: {assessment_report['validation']['quality_score']:.2f}")
    print(f"\nOutputs saved to: {Config.OUTPUT_PATH}")

    return refined_codebook, assessment_report

# Announce readiness; the pipeline itself is executed in the next cell
print("Ready to run codebook development!")
print("To start, execute: develop_codebook_pipeline()")

## Execute Codebook Development

Run the complete codebook development process with your uploaded documents and configured parameters. Monitor progress through each development stage and generate final outputs ready for research application.
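
After a run, it can be useful to confirm that every expected file was written. The sketch below checks an output directory for the seven files the pipeline reports generating; `missing_outputs` is a hypothetical helper, and a temporary directory stands in for `Config.OUTPUT_PATH`.

```python
import tempfile
from pathlib import Path

# File names as listed in the pipeline's completion message
EXPECTED_OUTPUTS = [
    "codebook.csv",
    "codebook.json",
    "codebook_documentation.md",
    "codebook_atlas.json",
    "codebook_nvivo.csv",
    "quality_report.json",
    "usage_guidelines.md",
]


def missing_outputs(output_dir) -> list:
    """Return the expected file names that are not present in output_dir."""
    directory = Path(output_dir)
    return [name for name in EXPECTED_OUTPUTS if not (directory / name).is_file()]


# Demonstration with a temporary directory containing a single output file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "codebook.csv").write_text("code_label,definition\n")
    print(missing_outputs(tmp))  # every expected file except codebook.csv
```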


In [None]:
# Execute this cell to run the complete codebook development process
print("🚀 Starting Codebook Development Pipeline...")
print("=" * 70)

try:
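    result = develop_codebook_pipeline()
    if result is None:
        # The pipeline returns None when it stops early (for example, no uploaded documents)
        raise RuntimeError("Pipeline stopped before producing a codebook; see the messages above")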

    print("\n" + "=" * 70)
    print("🎉 CODEBOOK DEVELOPMENT COMPLETE!")
    print("=" * 70)
    print("✅ Pipeline executed successfully")
    print("📁 Check the output directory for your generated files:")
    print(f"   {Config.OUTPUT_PATH}")
    print("\n📋 Generated files include:")
    print("   • codebook.csv - Main codebook data")
    print("   • codebook.json - JSON format for software integration")
    print("   • codebook_documentation.md - Human-readable documentation")
    print("   • codebook_atlas.json - ATLAS.ti compatible format")
    print("   • codebook_nvivo.csv - NVivo compatible format")
    print("   • quality_report.json - Quality assessment metrics")
    print("   • usage_guidelines.md - Guidelines for using the codebook")
    print("\n🏁 You can now use your codebook for qualitative analysis!")

except Exception as e:
    print("\n" + "=" * 70)
    print("❌ PIPELINE EXECUTION FAILED!")
    print("=" * 70)
    print(f"Error: {str(e)}")
    print("\n🔧 Troubleshooting steps:")
    print("   1. Check that all configuration parameters are set")
    print("   2. Verify that documents are uploaded")
    print("   3. Ensure API key is valid and working")
    print("   4. Review any error messages above")
    print("\n💡 Try running individual pipeline steps to isolate the issue.")
    raise