# Interview Transcript Semantic Chunker

Created by [Matt Artz](https://www.mattartz.me/) — Advancing AI Anthropology through computational approaches to qualitative research.

<br>

---

<br>

## What This Tool Does

This notebook transforms lengthy interview transcripts into **semantically coherent chunks**—meaningful segments that respect natural conversation boundaries while preserving speaker identity and turn-taking patterns. Rather than facing a 50-page interview transcript as one overwhelming document, you receive coherent segments, each labeled and sized for systematic qualitative coding.

## Key Features

1. **Multi-Format Support**: Works with common transcript formats (PDF, DOCX, TXT, RTF)
2. **Ethnographically-Aware Chunking**: Preserves speaker labels (Q:, A:, Interviewer:, etc.) and conversation structure
3. **Intelligent Segmentation**: Uses AI to identify natural topic boundaries rather than arbitrary text splits
4. **Two Analysis Methods**:
   - **Claude API**: More sophisticated understanding of conversational flow
   - **Sentence Transformers**: No API required, good for privacy-sensitive research
5. **Flexible Configuration**: Adjust chunk sizes and sensitivity to match your analytical needs
6. **Multiple Export Formats**: CSV, Excel, JSON - structured for use with NVivo, ATLAS.ti, and other analysis tools
7. **Quality Metrics**: Built-in validation to ensure text preservation and chunking quality
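
Feature 2 depends on recognizing a speaker label at the start of a turn. As a rough illustration only, the sketch below separates a label such as `Q:` or `Interviewer:` from the utterance with a regular expression. The pattern and the `split_speaker` helper are assumptions made for this example, not the notebook's actual code, and the pattern will misfire on some ordinary sentences that begin with a capitalized word followed by a colon.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: one to three capitalized tokens ("Q", "Interviewer",
# "Dr. Smith") followed by a colon and whitespace at the start of the paragraph.
SPEAKER_PATTERN = re.compile(
    r"^\s*((?:[A-Z][\w.'-]*)(?:\s[A-Z][\w.'-]*){0,2}):\s+(.*)$",
    re.DOTALL,
)

def split_speaker(paragraph: str) -> Tuple[Optional[str], str]:
    """Return (speaker, utterance); speaker is None when no label is found."""
    match = SPEAKER_PATTERN.match(paragraph)
    if match is None:
        return None, paragraph.strip()
    return match.group(1), match.group(2).strip()

print(split_speaker("Interviewer: Can you describe a typical day?"))
print(split_speaker("It usually starts around six."))
```

Paragraphs without a label come back with `None` as the speaker, which leaves room for a later step to carry the previous speaker forward.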

## Workflow

1. **Configure Parameters**: Set chunking sensitivity and size preferences using intuitive controls
2. **Upload Transcripts**: Batch process multiple interview files
3. **AI-Powered Segmentation**: Automatic identification of topic boundaries and speaker turns
4. **Quality Review**: Statistical analysis of chunking results with visualizations
5. **Export for Analysis**: Download structured data ready for qualitative coding software
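
Step 4 can be pictured as two checks: that no words were lost during chunking, and that chunk sizes look reasonable. The sketch below is one assumed way to express those checks; `review_chunks` and the keys it returns are hypothetical names, not the notebook's API.

```python
from statistics import mean
from typing import Dict, List

def review_chunks(original_text: str, chunks: List[str]) -> Dict[str, object]:
    """Compare word sequences before and after chunking and summarize chunk sizes."""
    original_words = original_text.split()
    chunk_words = [word for chunk in chunks for word in chunk.split()]
    sizes = [len(chunk.split()) for chunk in chunks]
    return {
        # True only when the chunks reproduce the original word sequence exactly
        "text_preserved": chunk_words == original_words,
        "chunk_count": len(chunks),
        "mean_words": mean(sizes) if sizes else 0,
        "min_words": min(sizes, default=0),
        "max_words": max(sizes, default=0),
    }

text = "Q: Where did you grow up? A: In a small town. We moved later."
chunks = ["Q: Where did you grow up?", "A: In a small town. We moved later."]
report = review_chunks(text, chunks)
print(report)
```

Comparing whitespace-separated words rather than raw characters keeps the check insensitive to line breaks introduced during chunking.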

## Applications in Anthropological Practice

This tool supports any research involving interview transcripts, from dissertation fieldwork to applied research projects. It is particularly useful for computational analysis using the tools in my AI Anthropology Toolkit, for comparative studies requiring standardized data units, and for collaborative research where multiple team members need consistent data preparation.

## Methodological Positioning

This tool represents a **computational anthropology** approach—using AI to enhance rather than replace traditional ethnographic analysis. The chunking preserves the interpretive work that defines anthropological inquiry while addressing the practical challenges of scale in contemporary research contexts.

**Important**: This tool prepares data for analysis but does not interpret it.

## Target Audience

Designed for anthropologists and qualitative researchers working with interview data—from graduate students managing thesis interviews to research teams processing large datasets for applied projects.

## Technical Approach

The notebook employs **semantic similarity analysis** to identify topical coherence in conversation. Rather than splitting text arbitrarily, it analyzes meaning relationships between sentences to find natural break points where topics shift or new themes emerge. This preserves the conversational integrity essential for anthropological interpretation.
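
To make the idea concrete, the sketch below compares each sentence with the one before it and starts a new chunk when their similarity drops below a cutoff. It is a simplified illustration under stated assumptions: the two-dimensional vectors are made-up stand-ins for real sentence embeddings, `find_breakpoints` and `group_sentences` are hypothetical helpers, and the break rule (similarity below `cutoff`) is chosen for this example. How the notebook itself maps its similarity threshold onto break decisions is not shown in this section.

```python
import math
from typing import List, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def find_breakpoints(embeddings: Sequence[Sequence[float]], cutoff: float) -> List[int]:
    """Return each index i at which a new chunk starts.

    Assumed rule: break when similarity(sentence i-1, sentence i) < cutoff.
    """
    return [
        i for i in range(1, len(embeddings))
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < cutoff
    ]

def group_sentences(sentences: List[str], breakpoints: List[int]) -> List[List[str]]:
    """Split the sentence list into chunks at the given start indices."""
    starts = [0] + breakpoints
    ends = breakpoints + [len(sentences)]
    return [sentences[s:e] for s, e in zip(starts, ends)]

sentences = [
    "I grew up on a farm.",
    "We kept cattle.",
    "Now I work in software.",
    "Mostly backend code.",
]
# Toy vectors: the first two point one way, the last two another
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

breaks = find_breakpoints(toy_embeddings, cutoff=0.5)
chunks = group_sentences(sentences, breaks)
print(breaks, chunks)
```

In practice the vectors would come from an embedding model rather than being written by hand; only the comparison of adjacent sentences carries over from this toy version.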

## Contributing to AI Anthropology

This notebook contributes to the emerging field of AI Anthropology, which combines studying AI as a cultural artifact, using AI to enhance ethnographic research, and applying anthropological insights to AI development (Artz, forthcoming). By open-sourcing these tools, this work advances the collective capacity of anthropologists to work effectively with computational methods.

## AI Anthropology Toolkit

This tool is part of a growing suite of computational resources for anthropological research:

- **[Qualitative Codebook Builder](https://github.com/MattArtzAnthro/Qualitative_Codebook_Builder)** - AI-assisted development of qualitative coding frameworks
- **[Interview Transcript Semantic Chunker](https://github.com/MattArtzAnthro/Interview_Transcript_Semantic_Chunker)** (this tool) - AI-assisted segmentation of interview transcripts
- **[Coding and Thematic Analysis](https://github.com/MattArtzAnthro/Coding_and_Thematic_Analysis)** - AI-assisted coding and thematic analysis of qualitative data

*Additional tools will be added to this toolkit as they are developed.*


<br>

---

<br>

## License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. You may remix, adapt, and build upon the material for non-commercial purposes, provided you credit Matt Artz and link to the repository.

**Full license details**: https://creativecommons.org/licenses/by-nc/4.0/

## Attribution

If you use or adapt this project in your work, please cite:


> Built with the Interview Transcript Semantic Chunker (Matt Artz, 2025): https://github.com/MattArtzAnthro/Interview_Transcript_Semantic_Chunker


## Citation

If you use this tool in your academic research, please cite:


> Artz, Matt. 2025. Interview Transcript Semantic Chunker. Software. Zenodo. https://doi.org/10.5281/zenodo.15823716

## References

Artz, Matt. Forthcoming. “AI Anthropology: The Future of Applied Anthropological Practice.” In Routledge Handbook of Applied Anthropology, edited by Christina Wasson, Edward B. Liebow, Karine L. Narahara, Ndukuyakhe Ndlovu, and Alaka Wali. New York: Routledge.

## Setup and Installation

This section installs required Python packages and imports necessary libraries for semantic analysis, file processing, and interactive widgets. Run this cell first to ensure all dependencies are available.


In [None]:
# Install required packages
!pip install python-docx pandas sentence-transformers nltk pypdf2 openpyxl ipywidgets matplotlib anthropic striprtf -q

import os
import re
import pandas as pd
from typing import List, Tuple, Optional
import numpy as np
from google.colab import files
import warnings
import json
from datetime import datetime
import io
from pathlib import Path
warnings.filterwarnings('ignore')

# Import core libraries
import docx
from sentence_transformers import SentenceTransformer
import nltk
from nltk.tokenize import sent_tokenize
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output
import PyPDF2
import matplotlib.pyplot as plt

# Download NLTK data for sentence tokenization
print("Downloading NLTK data...")
try:
    nltk.download('punkt', quiet=True)
    nltk.download('punkt_tab', quiet=True)
    print("✓ NLTK data downloaded successfully")
except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
    print("⚠️ NLTK download issue - will try alternative tokenization")

print("✓ All packages installed and libraries loaded successfully")
print("📋 Ready to configure your interview transcript chunker!")

## Chunking Configuration

Configure your chunking parameters using interactive widgets to customize the analysis for your specific research needs. Choose between Claude API or local sentence transformers, adjust sensitivity thresholds, and set output preferences.

In [None]:
# Configuration Class and Interactive Interface

class ChunkerConfig:
    """Configuration for semantic chunking following best practices"""

    # Core chunking parameters
    SIMILARITY_THRESHOLD = 0.5  # Lower = more chunks, Higher = fewer chunks
    MAX_CHUNK_SENTENCES = 5     # Maximum sentences per chunk
    MIN_CHUNK_SENTENCES = 1     # Minimum sentences per chunk
    MODEL_NAME = 'all-MiniLM-L6-v2'  # Sentence transformer model
    CHUNK_OVERLAP = 0           # Number of sentences to overlap between chunks

    # Text processing options (all default to True)
    PRESERVE_SPEAKER_LABELS = True
    PRESERVE_TIMESTAMPS = True    # Keep timestamp patterns (True = keep them)
    USE_PARAGRAPH_BOUNDARIES = True
    MIN_SENTENCE_LENGTH = 5     # Minimum words in a sentence

    # Chunking method
    CHUNKING_METHOD = 'claude'  # 'embeddings' or 'claude' - Default to Claude
    CLAUDE_API_KEY = ''
    CLAUDE_MODEL = 'claude-4-sonnet-20250514'  # Default to Claude 4.0

    # Output options
    OUTPUT_FORMAT = 'excel'       # csv, excel, json - Default to Excel (recommended)
    INCLUDE_METADATA = True
    INCLUDE_SPEAKER_COLUMN = True
    TIMESTAMP_OUTPUT = True

# Create global config instance
config = ChunkerConfig()

def create_configuration_interface():
    """Create an interactive configuration interface using widgets"""

    style = {'description_width': '160px'}
    layout = widgets.Layout(width='420px')

    # Instructions
    instructions_html = """
    <div style='background-color: #E7ECEF; padding: 20px; border-radius: 10px; margin: 20px 0; border-left: 5px solid #274C77;'>
    <h2 style='color: #274C77; margin-top: 0;'>🎯 Interview Transcript Semantic Chunker</h2>
    <p><strong>Welcome!</strong> This tool helps social scientists process interview transcripts by breaking them into semantically coherent chunks for qualitative analysis.</p>
    <h3 style='color: #274C77;'>📖 How to Use:</h3>
    <ol>
        <li><strong>Configure:</strong> Adjust the settings below based on your research needs</li>
        <li><strong>Upload:</strong> Add your interview transcript files (multiple formats supported)</li>
        <li><strong>Process:</strong> Run the semantic chunking algorithm</li>
        <li><strong>Export:</strong> Download your chunked data as CSV/Excel for coding</li>
    </ol>
    </div>
    """

    # Core Parameters Section
    core_header = widgets.HTML("<h3 style='margin: 20px 0 10px 0; color: #2c3e50;'>🎯 Core Chunking Parameters</h3>")

    similarity_slider = widgets.FloatSlider(
        value=config.SIMILARITY_THRESHOLD,
        min=0.1,
        max=0.9,
        step=0.05,
        description='Similarity Threshold:',
        disabled=False,
        continuous_update=False,
        readout=True,
        readout_format='.2f',
        style=style,
        layout=layout
    )

    max_sentences_slider = widgets.IntSlider(
        value=config.MAX_CHUNK_SENTENCES,
        min=1,
        max=10,
        step=1,
        description='Max Sentences:',
        disabled=False,
        continuous_update=False,
        readout=True,
        style=style,
        layout=layout
    )

    min_sentences_slider = widgets.IntSlider(
        value=config.MIN_CHUNK_SENTENCES,
        min=1,
        max=5,
        step=1,
        description='Min Sentences:',
        disabled=False,
        continuous_update=False,
        readout=True,
        style=style,
        layout=layout
    )

    # Chunking Method Selection
    method_header = widgets.HTML("<h3 style='margin: 30px 0 15px 0; color: #2c3e50;'>🧠 Chunking Method</h3>")

    method_radio = widgets.RadioButtons(
        options=[
            ('Claude API (More Intelligent)', 'claude'),
            ('Sentence Embeddings (No API Key)', 'embeddings')
        ],
        value=config.CHUNKING_METHOD,
        description='Method:',
        disabled=False,
        style={'description_width': '80px'},
        layout=widgets.Layout(width='450px', margin='10px 0')
    )

    api_key_text = widgets.Password(
        value=config.CLAUDE_API_KEY,
        placeholder='Enter Claude API key (only if using Claude method)',
        description='API Key:',
        disabled=False,
        style=style,
        layout=layout
    )

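    # Claude model selection
    # Anthropic's published ID for Claude Sonnet 4 is 'claude-sonnet-4-20250514'
    claude_model_options = [
        ('Claude 4.0 Sonnet', 'claude-sonnet-4-20250514'),
        ('Claude 3.5 Sonnet', 'claude-3-5-sonnet-20241022')
    ]
    valid_model_ids = [model_id for _, model_id in claude_model_options]
    claude_model_dropdown = widgets.Dropdown(
        options=claude_model_options,
        # Fall back to the first option if the configured default is not listed
        value=config.CLAUDE_MODEL if config.CLAUDE_MODEL in valid_model_ids else valid_model_ids[0],
        description='Claude Model:',
        disabled=False,
        style=style,
        layout=layout
    )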

    # Model Selection (for embeddings method)
    model_header = widgets.HTML("<h3 style='margin: 20px 0 10px 0; color: #2c3e50;'>🔧 Model Settings</h3>")

    model_dropdown = widgets.Dropdown(
        options=[
            ('Mini-LM (Fast, Good Quality)', 'all-MiniLM-L6-v2'),
            ('MPNet (High Quality, Slower)', 'all-mpnet-base-v2'),
            ('Multi-QA (Question-Answer)', 'multi-qa-MiniLM-L6-cos-v1'),
        ],
        value=config.MODEL_NAME,
        description='Embedding Model:',
        disabled=False,
        style=style,
        layout=layout
    )

    # Processing Options (Enhanced with timestamp removal)
    processing_header = widgets.HTML("<h3 style='margin: 20px 0 10px 0; color: #2c3e50;'>📝 Text Processing Options</h3>")

    preserve_speakers_checkbox = widgets.Checkbox(
        value=config.PRESERVE_SPEAKER_LABELS,
        description='Preserve speaker labels (Q:, A:, Name:)',
        disabled=False,
        style={'description_width': 'initial'},
        layout=widgets.Layout(width='400px', margin='5px 0')
    )

    preserve_timestamps_checkbox = widgets.Checkbox(
        value=config.PRESERVE_TIMESTAMPS,
        description='Preserve timestamps and time codes',
        disabled=False,
        style={'description_width': 'initial'},
        layout=widgets.Layout(width='400px', margin='5px 0')
    )

    use_paragraphs_checkbox = widgets.Checkbox(
        value=config.USE_PARAGRAPH_BOUNDARIES,
        description='Respect paragraph boundaries',
        disabled=False,
        style={'description_width': 'initial'},
        layout=widgets.Layout(width='400px', margin='5px 0')
    )

    # Output Settings
    output_header = widgets.HTML("<h3 style='margin: 20px 0 10px 0; color: #2c3e50;'>💾 Output Settings</h3>")

    output_format_radio = widgets.RadioButtons(
        options=[
            ('CSV - Simple spreadsheet (Excel/Google Sheets)', 'csv'),
            ('Excel - Multi-sheet with statistics (Recommended)', 'excel'),
            ('JSON - Structured data for programming', 'json')
        ],
        value=config.OUTPUT_FORMAT,
        description='Output Format:',
        disabled=False,
        style={'description_width': '100px'},
        layout=widgets.Layout(width='500px', margin='10px 0')
    )

    include_speaker_checkbox = widgets.Checkbox(
        value=config.INCLUDE_SPEAKER_COLUMN,
        description='Include speaker identification column',
        disabled=False,
        style={'description_width': 'initial'},
        layout=widgets.Layout(width='400px', margin='5px 0')
    )

    # Action Buttons
    buttons_header = widgets.HTML("<h3 style='margin: 30px 0 15px 0; color: #2c3e50;'>⚙️ Actions</h3>")

    apply_button = widgets.Button(
        description='✅ Apply Configuration',
        disabled=False,
        button_style='',
        tooltip='Apply current settings',
        icon='check',
        layout=widgets.Layout(width='220px', height='45px', margin='5px'),
        style={'button_color': '#6096BA', 'font_weight': 'bold'}
    )

    test_button = widgets.Button(
        description='🧪 Test Setup',
        disabled=False,
        button_style='',
        tooltip='Test if everything is configured correctly',
        icon='flask',
        layout=widgets.Layout(width='160px', height='45px', margin='5px'),
        style={'button_color': '#A3CEF1', 'font_weight': 'bold'}
    )

    # Status output
    status_output = widgets.Output()

    # Event handlers
    def apply_configuration(b):
        with status_output:
            status_output.clear_output()
            try:
                # Update config with widget values
                config.SIMILARITY_THRESHOLD = similarity_slider.value
                config.MAX_CHUNK_SENTENCES = max_sentences_slider.value
                config.MIN_CHUNK_SENTENCES = min_sentences_slider.value
                config.MODEL_NAME = model_dropdown.value
                config.CHUNKING_METHOD = method_radio.value
                config.CLAUDE_API_KEY = api_key_text.value
                config.CLAUDE_MODEL = claude_model_dropdown.value
                config.PRESERVE_SPEAKER_LABELS = preserve_speakers_checkbox.value
                config.PRESERVE_TIMESTAMPS = preserve_timestamps_checkbox.value
                config.USE_PARAGRAPH_BOUNDARIES = use_paragraphs_checkbox.value
                config.OUTPUT_FORMAT = output_format_radio.value
                config.INCLUDE_SPEAKER_COLUMN = include_speaker_checkbox.value

                print("✅ Configuration applied successfully!")
                print(f"📊 Method: {config.CHUNKING_METHOD}")
                print(f"📊 Similarity Threshold: {config.SIMILARITY_THRESHOLD}")
                print(f"📝 Chunk Size: {config.MIN_CHUNK_SENTENCES}-{config.MAX_CHUNK_SENTENCES} sentences")
                if config.CHUNKING_METHOD == 'embeddings':
                    print(f"🧠 Model: {config.MODEL_NAME}")
                elif config.CHUNKING_METHOD == 'claude':
                    print(f"🤖 Claude Model: {config.CLAUDE_MODEL}")
                print(f"💾 Output: {config.OUTPUT_FORMAT.upper()}")
                print(f"🔧 Processing: Speaker labels {'preserved' if config.PRESERVE_SPEAKER_LABELS else 'removed'}, Timestamps {'preserved' if config.PRESERVE_TIMESTAMPS else 'removed'}")

            except Exception as e:
                print(f"❌ Error applying configuration: {e}")

    def test_setup(b):
        with status_output:
            status_output.clear_output()
            print("🧪 Testing setup...")

            if config.CHUNKING_METHOD == 'claude':
                if not config.CLAUDE_API_KEY:
                    print("❌ Claude API key required for Claude method")
                    return
                print(f"✅ Claude API key provided")
                print(f"🤖 Selected model: {config.CLAUDE_MODEL}")
                try:
                    import anthropic
                    print("✅ Anthropic library available")
                except ImportError:
                    print("❌ anthropic package not installed. Please install it: !pip install anthropic")
                    return
            else:
                try:
                    print(f"🔄 Testing model: {config.MODEL_NAME}")
                    model = SentenceTransformer(config.MODEL_NAME)
                    test_sentence = "This is a test sentence."
                    embedding = model.encode([test_sentence])
                    print(f"✅ Model loaded successfully! Embedding dimension: {embedding.shape[1]}")
                except Exception as e:
                    print(f"❌ Model loading failed: {e}")
                    return

            print("✅ All tests passed! Ready to process files.")

    # Bind events
    apply_button.on_click(apply_configuration)
    test_button.on_click(test_setup)

    # Help documentation
    help_html = """
    <div style='background-color: #A3CEF1; padding: 15px; border-radius: 5px; margin: 15px 0; border-left: 4px solid #6096BA;'>
    <h4 style='color: #274C77; margin-top: 0;'>📖 Configuration Guide</h4>
    <div style='display: flex; gap: 20px; margin: 15px 0;'>
        <div style='flex: 1;'>
            <ul>
                <li><strong>Similarity Threshold:</strong> Lower values (0.3-0.5) = more, shorter chunks. Higher values (0.6-0.8) = fewer, longer chunks.</li>
                <li><strong>Chunking Methods:</strong>
                    <ul>
                        <li><em>Claude:</em> More intelligent chunking using Claude API (requires key) - recommended for best results</li>
                        <li><em>Embeddings:</em> Uses AI sentence similarity (no API key needed) - good alternative option</li>
                    </ul>
                </li>
                <li><strong>Claude Models:</strong>
                    <ul>
                        <li><em>4.0 Sonnet:</em> Latest model with enhanced capabilities (recommended)</li>
                        <li><em>3.5 Sonnet:</em> Reliable fallback option with proven performance</li>
                    </ul>
                </li>
                <li><strong>Text Processing:</strong>
                    <ul>
                        <li><em>Speaker Preservation:</em> Maintains "Q:", "A:", "Interviewer:" style labels and intelligently infers speakers for continuation paragraphs</li>
                        <li><em>Timestamp Preservation:</em> Keeps [00:12:34] and similar time codes (uncheck to remove)</li>
                        <li><em>Paragraph Boundaries:</em> Prevents chunks from crossing paragraph breaks</li>
                    </ul>
                </li>
            </ul>
        </div>
        <div style='flex: 1;'>
            <ul>
                <li><strong>Export Formats:</strong>
                    <ul>
                        <li><em>CSV:</em> Simple spreadsheet - easy to open in Excel/Google Sheets for quick analysis</li>
                        <li><em>Excel:</em> Professional format with 3 sheets (data + statistics + settings) - best for NVivo/ATLAS.ti import</li>
                        <li><em>JSON:</em> Technical format with metadata - best for custom programming analysis</li>
                    </ul>
                </li>
                <li><strong>File Storage:</strong>
                    <ul>
                        <li>All files saved to organized folders in Colab</li>
                        <li>Includes README file explaining contents</li>
                        <li>Configuration saved for reproducibility</li>
                    </ul>
                </li>
            </ul>
        </div>
    </div>
    <p><strong>💡 Tip:</strong> Start with default settings and Excel format for most research projects.</p>
    </div>
    """

    # Create organized layout sections
    core_section = widgets.VBox([
        core_header,
        similarity_slider,
        max_sentences_slider,
        min_sentences_slider
    ], layout=widgets.Layout(margin='10px'))

    method_section = widgets.VBox([
        method_header,
        method_radio,
        api_key_text,
        claude_model_dropdown
    ], layout=widgets.Layout(margin='10px'))

    model_section = widgets.VBox([
        model_header,
        model_dropdown
    ], layout=widgets.Layout(margin='10px'))

    processing_section = widgets.VBox([
        processing_header,
        preserve_speakers_checkbox,
        preserve_timestamps_checkbox,
        use_paragraphs_checkbox
    ], layout=widgets.Layout(margin='10px'))

    output_section = widgets.VBox([
        output_header,
        output_format_radio,
        include_speaker_checkbox
    ], layout=widgets.Layout(margin='10px'))

    buttons_section = widgets.VBox([
        buttons_header,
        widgets.HBox([apply_button, test_button])
    ], layout=widgets.Layout(margin='10px'))

    # Organize into two columns for better spacing
    left_column = widgets.VBox([
        core_section,
        method_section,
        model_section
    ])

    right_column = widgets.VBox([
        processing_section,
        output_section,
        buttons_section
    ])

    main_container = widgets.VBox([
        widgets.HBox([left_column, right_column], layout=widgets.Layout(gap='40px')),
        status_output
    ])

    # Display everything
    display(HTML(instructions_html))
    display(HTML(help_html))
    display(main_container)

    return {
        'apply_button': apply_button,
        'test_button': test_button,
        'status_output': status_output
    }

# Initialize configuration interface
print("🎛️ Loading Configuration Interface...")
config_widgets = create_configuration_interface()
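
The Min Sentences and Max Sentences sliders are independent (1 to 5 and 1 to 10), so it is possible to select a minimum larger than the maximum. As a hedged suggestion rather than part of the notebook, a small check such as the hypothetical `validate_chunk_bounds` below could be run on the slider values before they are applied:

```python
from typing import List

def validate_chunk_bounds(min_sentences: int, max_sentences: int,
                          similarity_threshold: float) -> List[str]:
    """Return human-readable problems; an empty list means the values are usable."""
    problems = []
    if min_sentences < 1:
        problems.append("Min Sentences must be at least 1")
    if max_sentences < min_sentences:
        problems.append(
            f"Max Sentences ({max_sentences}) is smaller than Min Sentences ({min_sentences})"
        )
    if not 0.0 <= similarity_threshold <= 1.0:
        problems.append("Similarity Threshold must be between 0 and 1")
    return problems

for problem in validate_chunk_bounds(min_sentences=5, max_sentences=1, similarity_threshold=0.5):
    print(f"Warning: {problem}")
```

Returning a list of messages instead of raising lets the interface report every problem at once.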

## File Upload and Document Processing

Upload your interview transcript files in multiple formats (PDF, DOCX, TXT, RTF) and extract clean text for analysis. The notebook handles encoding issues and provides detailed file statistics.
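
One common way to handle the encoding issues mentioned above is to try a short list of encodings in order and keep the first that decodes cleanly. The sketch below illustrates the idea; `decode_with_fallback` is a hypothetical helper and the default encoding list is an assumption. `latin-1` is placed last because it accepts any byte sequence, so nothing listed after it would ever be tried.

```python
from typing import Iterable

def decode_with_fallback(
    raw: bytes,
    encodings: Iterable[str] = ("utf-8", "cp1252", "latin-1"),
) -> str:
    """Try each encoding in order and return the first successful decode."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("Could not decode bytes with any of the given encodings")

# "café" saved as cp1252 is not valid UTF-8, so the second encoding is used
sample = "café".encode("cp1252")
text = decode_with_fallback(sample)
print(text)
```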




In [None]:
# File Upload and Document Processing

def read_pdf(file_content):
    """Extract text from PDF file"""
    try:
        pdf_reader = PyPDF2.PdfReader(io.BytesIO(file_content))
        text = []
        for page in pdf_reader.pages:
            page_text = page.extract_text()
            if page_text:
                text.append(page_text)
        return '\n\n'.join(text)
    except Exception as e:
        raise ValueError(f"Error reading PDF: {e}")

def read_word_document(file_content):
    """Extract text from Word document"""
    try:
        doc = docx.Document(io.BytesIO(file_content))
        full_text = []

        for paragraph in doc.paragraphs:
            if paragraph.text.strip():
                full_text.append(paragraph.text.strip())

        return '\n\n'.join(full_text)
    except Exception as e:
        raise ValueError(f"Error reading Word document: {e}")

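def read_text_file(file_content, encoding='utf-8'):
    """Read plain text file with fallback encoding detection"""
    # latin-1 accepts any byte sequence, so it must come last; placing it
    # earlier would keep the remaining encodings from ever being tried
    encodings_to_try = [encoding, 'cp1252', 'latin-1']

    # UTF-16 is only attempted when a byte-order mark is present, because
    # without one arbitrary bytes can decode "successfully" into garbage
    if file_content.startswith((b'\xff\xfe', b'\xfe\xff')):
        encodings_to_try.insert(0, 'utf-16')

    for enc in encodings_to_try:
        try:
            return file_content.decode(enc)
        except UnicodeDecodeError:
            continue

    raise ValueError("Could not decode file with any common encoding")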

def read_rtf_file(file_content):
    """Read RTF file"""
    try:
        # Try to use striprtf if available
        from striprtf.striprtf import rtf_to_text
        rtf_text = file_content.decode('utf-8', errors='ignore')
        return rtf_to_text(rtf_text)
    except ImportError:
        # Fallback: basic RTF stripping
        rtf_text = file_content.decode('utf-8', errors='ignore')
        # Simple RTF tag removal (basic fallback)
        text = re.sub(r'\\[a-z]+\d*\s?', '', rtf_text)
        text = re.sub(r'[{}]', '', text)
        return text.strip()
    except Exception as e:
        raise ValueError(f"Error reading RTF file: {e}")

def create_file_upload_interface():
    """Create streamlined file upload interface"""

    # Instructions
    instructions_html = """
    <div style='background-color: #E7ECEF; padding: 20px; border-radius: 10px; margin: 20px 0; border-left: 5px solid #274C77;'>
    <h3 style='color: #274C77; margin-top: 0;'>📁 Upload Your Interview Transcript(s)</h3>
    <p><strong>Ready to upload your files!</strong> This tool supports multiple file formats commonly used for interview transcripts.</p>

    <div style='display: flex; gap: 20px; margin: 15px 0;'>
        <div style='flex: 1; background-color: #A3CEF1; padding: 15px; border-radius: 8px; border-left: 4px solid #6096BA;'>
            <h4 style='color: #274C77; margin-top: 0;'>✅ Supported Formats:</h4>
            <ul>
                <li><strong>.docx</strong> - Microsoft Word documents</li>
                <li><strong>.pdf</strong> - PDF files (text-based only)</li>
                <li><strong>.txt</strong> - Plain text files</li>
                <li><strong>.rtf</strong> - Rich Text Format</li>
            </ul>
        </div>
        <div style='flex: 1; background-color: #A3CEF1; padding: 15px; border-radius: 8px; border-left: 4px solid #6096BA;'>
            <h4 style='color: #274C77; margin-top: 0;'>💡 Best Practices:</h4>
            <ul>
                <li>Use consistent speaker labels (Q:, A:, Name:)</li>
                <li>Remove headers/footers/page numbers</li>
                <li>One speaker turn per paragraph works best</li>
                <li>Ensure text is selectable (not scanned images)</li>
            </ul>
        </div>
    </div>
    </div>
    """

    # Upload button
    upload_button = widgets.Button(
        description='📤 Choose Files to Upload',
        disabled=False,
        button_style='',
        tooltip='Click to select and upload your interview transcript files',
        icon='upload',
        layout=widgets.Layout(width='300px', height='50px'),
        style={'button_color': '#6096BA', 'font_weight': 'bold'}
    )

    # Clear button
    clear_button = widgets.Button(
        description='🗑️ Clear All Files',
        disabled=False,
        button_style='',
        tooltip='Remove all uploaded files',
        icon='trash',
        layout=widgets.Layout(width='150px', height='40px'),
        style={'button_color': '#8B8C89', 'font_weight': 'bold'}
    )

    # File info display
    file_info = widgets.HTML(
        value="<p style='color: #666; font-style: italic;'>No files uploaded yet. Click 'Choose Files to Upload' to begin.</p>",
        layout=widgets.Layout(width='100%')
    )

    # Upload progress/status
    upload_output = widgets.Output()

    # Global storage for uploaded files
    global uploaded_files, file_stats
    uploaded_files = {}
    file_stats = {}

    def handle_file_upload(b):
        with upload_output:
            upload_output.clear_output()

            print("=" * 60)
            print("📁 FILE UPLOAD")
            print("=" * 60)
            print("Supported formats: .docx, .pdf, .txt, .rtf")
            print("Please select your interview transcript file(s)...")
            print()

            try:
                uploaded = files.upload()

                if not uploaded:
                    print("❌ No files were selected.")
                    return

                print(f"\n📥 Processing {len(uploaded)} file(s)...")

                global uploaded_files, file_stats
                uploaded_files.clear()
                file_stats.clear()

                successful_files = []
                failed_files = []

                for filename, file_content in uploaded.items():
                    print(f"\n🔄 Processing: {filename}")

                    try:
                        # Determine file type and process
                        file_ext = os.path.splitext(filename)[1].lower()

                        if file_ext == '.docx':
                            text = read_word_document(file_content)
                        elif file_ext == '.pdf':
                            text = read_pdf(file_content)
                        elif file_ext == '.txt':
                            text = read_text_file(file_content)
                        elif file_ext == '.rtf':
                            text = read_rtf_file(file_content)
                        else:
                            print(f"   ❌ Unsupported file format: {file_ext}")
                            failed_files.append(filename)
                            continue

                        # Validate extracted text
                        if not text or len(text.strip()) < 10:
                            print(f"   ❌ No meaningful text extracted from {filename}")
                            failed_files.append(filename)
                            continue

                        # Store file and calculate stats
                        uploaded_files[filename] = text

                        word_count = len(text.split())
                        char_count = len(text)
                        estimated_sentences = len([s for s in re.split(r'[.!?]+', text) if s.strip()])

                        file_stats[filename] = {
                            'word_count': word_count,
                            'char_count': char_count,
                            'sentence_count': estimated_sentences,
                            'file_size': len(file_content)
                        }

                        print(f"   ✅ Success!")
                        print(f"      📊 {word_count:,} words")
                        print(f"      📊 {char_count:,} characters")
                        print(f"      📊 ~{estimated_sentences:,} sentences")
                        print(f"      📁 File size: {len(file_content):,} bytes")

                        successful_files.append(filename)

                    except Exception as e:
                        print(f"   ❌ Error processing {filename}: {e}")
                        failed_files.append(filename)

                # Update display
                if successful_files:
                    print(f"\n🎉 Successfully processed {len(successful_files)} file(s)!")

                    # Create summary display
                    info_html = "<div style='background-color: #A3CEF1; padding: 15px; border-radius: 5px; border-left: 4px solid #6096BA;'>"
                    info_html += f"<h4 style='color: #274C77; margin-top: 0;'>✅ {len(successful_files)} File(s) Successfully Uploaded</h4>"

                    for filename in successful_files:
                        stats = file_stats[filename]
                        info_html += f"<div style='margin: 10px 0; padding: 10px; background-color: #E7ECEF; border-radius: 3px; border-left: 3px solid #274C77;'>"
                        info_html += f"<strong>📄 {filename}</strong><br>"
                        info_html += f"<small style='color: #274C77;'>📊 {stats['word_count']:,} words • {stats['sentence_count']:,} sentences • {stats['file_size']:,} bytes</small>"
                        info_html += "</div>"

                    if len(successful_files) > 1:
                        total_words = sum(stats['word_count'] for stats in file_stats.values())
                        info_html += f"<p style='color: #274C77;'><strong>📈 Total: {total_words:,} words across all files</strong></p>"

                    info_html += "<p style='color: #274C77;'><strong>✅ Ready to process! Continue to the next step.</strong></p>"
                    info_html += "</div>"

                else:
                    info_html = "<div style='background-color: #E7ECEF; padding: 15px; border-radius: 5px; border-left: 4px solid #8B8C89;'>"
                    info_html += "<h4 style='color: #274C77; margin-top: 0;'>❌ No Files Successfully Processed</h4>"
                    info_html += "<p style='color: #274C77;'>Please check your file formats and try again.</p>"
                    info_html += "</div>"

                if failed_files:
                    print(f"\n⚠️ Failed to process {len(failed_files)} file(s): {', '.join(failed_files)}")

                file_info.value = info_html

            except Exception as e:
                print(f"\n❌ Upload error: {e}")
                file_info.value = f"<p style='color: red;'>❌ Upload failed: {e}</p>"

    def clear_files(b):
        with upload_output:
            upload_output.clear_output()

            global uploaded_files, file_stats
            uploaded_files.clear()
            file_stats.clear()

            file_info.value = "<p style='color: #666; font-style: italic;'>Files cleared. Click 'Choose Files to Upload' to start over.</p>"
            print("🗑️ All files cleared successfully.")

    # Bind events
    upload_button.on_click(handle_file_upload)
    clear_button.on_click(clear_files)

    # Layout
    buttons_container = widgets.HBox([upload_button, clear_button])

    # Display interface
    display(HTML(instructions_html))
    display(buttons_container)
    display(file_info)
    display(upload_output)

    return {
        'upload_button': upload_button,
        'clear_button': clear_button,
        'file_info': file_info,
        'output': upload_output
    }

# Initialize file upload interface
print("📁 File Upload Interface Ready")
print("👆 Configure your settings above, then upload your files below!")

# Global variables for file storage
uploaded_files = {}
file_stats = {}

# Create and display upload interface
upload_interface = create_file_upload_interface()

## Text Processing and Chunking

This section defines the core functions that perform semantic analysis to identify natural conversation boundaries and create meaningful chunks. Speaker labels are preserved while the content is segmented at topic shifts.
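
As a minimal sketch of the boundary rule described above (not the notebook's exact implementation), the snippet below groups consecutive sentences and starts a new chunk when the cosine similarity between neighbouring sentence vectors falls below a threshold. The names `cosine` and `split_on_similarity_drops`, the toy 2-D vectors, and the default values are illustrative assumptions; in the notebook the vectors would come from a sentence-embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def split_on_similarity_drops(sentences, vectors, threshold=0.5, min_size=1, max_size=10):
    """Group consecutive sentences; break where similarity to the next sentence drops."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for i in range(len(sentences) - 1):
        # A boundary is a similarity drop or a chunk that has reached its size cap.
        boundary = (cosine(vectors[i], vectors[i + 1]) < threshold
                    or len(current) >= max_size)
        if boundary and len(current) >= min_size:
            chunks.append(current)
            current = [sentences[i + 1]]
        else:
            current.append(sentences[i + 1])
    if len(current) < min_size and chunks:
        chunks[-1].extend(current)  # fold an undersized tail into the previous chunk
    else:
        chunks.append(current)
    return chunks

# Toy 2-D vectors standing in for real sentence embeddings (hypothetical values).
toy_sentences = [
    "Q: Tell me about your work.",
    "A: I manage a clinic.",
    "A: The clinic opened in 2010.",
    "Q: And your family?",
]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
toy_chunks = split_on_similarity_drops(toy_sentences, toy_vectors, threshold=0.5)
```

With these toy vectors the first three sentences stay together and the final question starts a new chunk. Lowering the threshold produces fewer, larger chunks; raising it produces more, smaller ones.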

In [None]:
# Core Text Processing and Chunking Functions

def remove_timestamps(text):
    """Remove common timestamp patterns from interview transcripts"""
    if config.PRESERVE_TIMESTAMPS:  # If preserving timestamps, don't remove them
        return text

    # First, handle timestamps within speaker labels
    # Pattern for **Speaker: timestamp S#:** format
    text = re.sub(r'(\*\*[^:*]+):\s*\d{1,2}:\d{2}:\d{2}\.\d+\s*S\d+:(\*\*)', r'\1:\2', text)
    text = re.sub(r'(\*\*[^:*]+):\s*\d{1,2}:\d{2}\.\d+\s*S\d+:(\*\*)', r'\1:\2', text)

    # Common timestamp patterns to remove
    timestamp_patterns = [
        r'\[\d{1,2}:\d{2}:\d{2}\]',           # [00:12:34]
        r'\[\d{1,2}:\d{2}\]',                 # [12:34]
        r'\(\d{1,2}:\d{2}:\d{2}\)',           # (00:12:34)
        r'\(\d{1,2}:\d{2}\)',                 # (12:34)
        r'\d{1,2}:\d{2}:\d{2}\.\d+\s*S\d+',   # 0:02:05.8 S2
        r'\d{1,2}:\d{2}:\d{2}\s*S\d+',        # 0:02:05 S2
        r'\d{1,2}:\d{2}:\d{2}\.\d+',          # 00:12:34.5
        r'\d{1,2}:\d{2}:\d{2}',               # 00:12:34
        r'^\d{1,2}:\d{2}\s',                  # 12:34 at start of line
        r'<\d{1,2}:\d{2}:\d{2}>',             # <00:12:34>
        r'<\d{1,2}:\d{2}>',                   # <12:34>
        r'\*\d{1,2}:\d{2}:\d{2}\*',           # *00:12:34*
        r'\*\d{1,2}:\d{2}\*',                 # *12:34*
        r'Timestamp:\s*\d{1,2}:\d{2}:\d{2}',  # Timestamp: 00:12:34
        r'Time:\s*\d{1,2}:\d{2}:\d{2}',       # Time: 00:12:34
    ]

    cleaned_text = text
    for pattern in timestamp_patterns:
        cleaned_text = re.sub(pattern, '', cleaned_text, flags=re.MULTILINE)

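    # Clean up extra whitespace left by timestamp removal
    # Collapse runs of spaces/tabs only; newlines must survive so that the
    # paragraph and speaker-line splitting in later steps still works
    cleaned_text = re.sub(r'[ \t]+', ' ', cleaned_text)
    cleaned_text = re.sub(r'\n\s*\n', '\n\n', cleaned_text)  # Normalize paragraph breaks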

    return cleaned_text.strip()

def extract_speaker_label(text):
    """Extract speaker label from text if present with enhanced pattern recognition"""
    # Enhanced patterns for interview transcripts
    patterns = [
        r'^\*\*([^:*]+):\*\*\s*(.*)',         # **Speaker:** format (markdown bold)
        r'^\*\*([^:*]+):\s*[\d:.]+ S\d+:\*\*\s*(.*)',  # **Speaker: 0:02:05.8 S2:** format
        r'^([A-Za-z0-9\s]+):\s*(.*)',         # Standard "Speaker: text"
        r'^(Q|A):\s*(.*)',             # "Q: text" or "A: text"
        r'^\[([^\]]+)\]:\s*(.*)',      # "[Speaker]: text"
        r'^([A-Z][a-z]*)\s*[-–—]\s*(.*)',  # "Speaker - text"
        r'^(Interviewer|Participant|Researcher|Subject|Respondent)\s*[:\-–—]\s*(.*)',  # Common interview roles
        r'^([A-Z]{2,})\s*[:\-–—]\s*(.*)',  # Acronym speakers like "CEO:", "HR:"
        r'^\*([^*]+)\*:\s*(.*)',       # "*Speaker*: text"
        r'^(\d+)\.\s*([A-Za-z][^:]*?):\s*(.*)',  # "1. Speaker: text"
    ]

    for pattern in patterns:
        match = re.match(pattern, text.strip(), re.IGNORECASE)
        if match:
            groups = match.groups()
            if len(groups) == 2:
                return groups[0].strip(), groups[1].strip()
            elif len(groups) == 3:  # Pattern with numbering
                return groups[1].strip(), groups[2].strip()

    return None, text

def smart_speaker_inference(paragraphs):
    """Infer speakers for unlabeled paragraphs using context and patterns"""
    processed_paragraphs = []
    current_speaker = None
    last_explicit_speaker = None
    interview_pattern_detected = False

    # Detect if this looks like a Q&A interview or uses markdown bold speakers
    total_paragraphs = len(paragraphs)
    q_a_count = sum(1 for para in paragraphs if re.match(r'^\s*[QA]:', para, re.IGNORECASE))
    bold_speaker_count = sum(1 for para in paragraphs if re.match(r'^\*\*[^:*]+:\*\*', para))

    if q_a_count > total_paragraphs * 0.3:  # More than 30% are Q: or A:
        interview_pattern_detected = True
        print("🎤 Q&A interview pattern detected - enhanced speaker inference enabled")
    elif bold_speaker_count > total_paragraphs * 0.3:
        print("🎤 Markdown bold speaker pattern detected")

    for i, para in enumerate(paragraphs):
        if not para.strip():
            continue

        lines = para.split('\n')
        processed_lines = []
        para_speaker = None

        for line in lines:
            line = line.strip()
            if not line:
                continue

            speaker, content = extract_speaker_label(line)

            if speaker:
                # Explicit speaker found
                current_speaker = speaker
                last_explicit_speaker = speaker
                para_speaker = speaker
                if content.strip():
                    processed_lines.append(f"{speaker}: {content}")
            else:
                # No explicit speaker - try to infer
                if content.strip():
                    inferred_speaker = None

                    # Always use the last explicit speaker for continuations
                    if last_explicit_speaker:
                        inferred_speaker = last_explicit_speaker

                    if inferred_speaker:
                        processed_lines.append(f"{inferred_speaker}: {content}")
                        if not para_speaker:  # Update paragraph speaker if not set
                            para_speaker = inferred_speaker
                    else:
                        processed_lines.append(content)

        # Update current speaker based on this paragraph
        if para_speaker:
            current_speaker = para_speaker

        if processed_lines:
            processed_paragraphs.append('\n'.join(processed_lines))

    return processed_paragraphs

def preprocess_text_for_chunking(text):
    """Preprocess text while preserving speaker information and handling paragraph continuations"""

    # First, remove timestamps if configured
    if not config.PRESERVE_TIMESTAMPS:
        text = remove_timestamps(text)
        print("🕒 Timestamps removed from text")

    # Handle speaker preservation with improved continuation logic
    if not config.PRESERVE_SPEAKER_LABELS:
        return text

    # Split into paragraphs
    paragraphs = text.split('\n\n') if config.USE_PARAGRAPH_BOUNDARIES else [text]
    processed_paragraphs = []

    # Global speaker tracking across paragraphs
    current_speaker = None
    last_explicit_speaker = None

    for i, para in enumerate(paragraphs):
        if not para.strip():
            continue

        lines = para.split('\n')
        processed_lines = []
        para_has_speaker = False

        # First check if this paragraph starts with a speaker label
        first_line_speaker, _ = extract_speaker_label(lines[0].strip() if lines else "")
        if first_line_speaker:
            para_has_speaker = True
            current_speaker = first_line_speaker
            last_explicit_speaker = first_line_speaker

        for line in lines:
            line = line.strip()
            if not line:
                continue

            speaker, content = extract_speaker_label(line)

            if speaker:
                # Explicit speaker found - update current speaker
                current_speaker = speaker
                last_explicit_speaker = speaker
                if content.strip():  # Only add if there's actual content
                    processed_lines.append(f"{speaker}: {content}")
            else:
                # No explicit speaker - this is a continuation
                if content.strip():
                    # Use the last explicit speaker for continuation
                    if last_explicit_speaker:
                        processed_lines.append(f"{last_explicit_speaker}: {content}")
                    else:
                        processed_lines.append(content)

        if processed_lines:
            processed_paragraphs.append('\n'.join(processed_lines))

    result = '\n\n'.join(processed_paragraphs)

    # Count speaker continuations for user feedback
    original_speaker_lines = len([line for line in text.split('\n') if extract_speaker_label(line)[0] is not None])
    processed_speaker_lines = len([line for line in result.split('\n') if extract_speaker_label(line)[0] is not None])

    if processed_speaker_lines > original_speaker_lines:
        added_labels = processed_speaker_lines - original_speaker_lines
        print(f"👥 Added speaker labels to {added_labels} continuation lines")

    return result

def chunk_with_embeddings(text, similarity_threshold=None, max_chunk_size=None, min_chunk_size=None):
    """Chunk text using sentence embeddings (no API needed)"""

    # Use config values if not provided (explicit None checks so that 0 is honored)
    if similarity_threshold is None:
        similarity_threshold = config.SIMILARITY_THRESHOLD
    if max_chunk_size is None:
        max_chunk_size = config.MAX_CHUNK_SENTENCES
    if min_chunk_size is None:
        min_chunk_size = config.MIN_CHUNK_SENTENCES

    print(f"🧠 Loading model: {config.MODEL_NAME}")
    try:
        model = SentenceTransformer(config.MODEL_NAME)
    except Exception as e:
        print(f"❌ Error loading model: {e}")
        return [text]  # Return original text as fallback

    # Preprocess text (includes timestamp removal if enabled)
    processed_text = preprocess_text_for_chunking(text)

    # Split into sentences
    try:
        sentences = sent_tokenize(processed_text)
    except Exception:
        # Fallback sentence splitting
        sentences = re.split(r'[.!?]+\s+', processed_text)
        sentences = [s.strip() for s in sentences if s.strip()]

    print(f"📝 Processing {len(sentences)} sentences")

    if len(sentences) <= 1:
        return [text]

    # Filter out very short sentences if configured
    if config.MIN_SENTENCE_LENGTH > 0:
        filtered_sentences = []
        for sent in sentences:
            # Clean sentence for word counting (remove speaker labels for counting)
            _, content = extract_speaker_label(sent)
            word_count = len(content.split())
            if word_count >= config.MIN_SENTENCE_LENGTH:
                filtered_sentences.append(sent)
            elif filtered_sentences:  # Append short sentences to previous if exists
                filtered_sentences[-1] += " " + sent
        sentences = filtered_sentences

    if len(sentences) <= 1:
        return [text]

    print("🔄 Computing sentence embeddings...")
    try:
        # Clean sentences for embedding (remove speaker labels for similarity calculation)
        clean_sentences = []
        for sent in sentences:
            _, content = extract_speaker_label(sent)
            clean_sentences.append(content.strip())

        print(f"📊 Encoding {len(clean_sentences)} sentences with {config.MODEL_NAME}...")
        embeddings = model.encode(clean_sentences, show_progress_bar=True)
    except Exception as e:
        print(f"❌ Error computing embeddings: {e}")
        return [text]

    # Calculate similarities between consecutive sentences
    similarities = []
    for i in range(len(embeddings) - 1):
        similarity = np.dot(embeddings[i], embeddings[i+1]) / (
            np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i+1])
        )
        similarities.append(similarity)

    print(f"📊 Average similarity: {np.mean(similarities):.3f}")

    # Create chunks based on similarity drops
    chunks = []
    current_chunk = [sentences[0]]

    for i, similarity in enumerate(similarities):
        next_sentence = sentences[i + 1]

        # Decide whether to start new chunk
        start_new_chunk = False

        # Check similarity threshold
        if similarity < similarity_threshold:
            start_new_chunk = True

        # Check max chunk size
        if len(current_chunk) >= max_chunk_size:
            start_new_chunk = True

        # Check paragraph boundaries if enabled
        if config.USE_PARAGRAPH_BOUNDARIES:
            if current_chunk and current_chunk[-1].endswith('\n\n'):
                start_new_chunk = True

        if start_new_chunk and len(current_chunk) >= min_chunk_size:
            # Finish current chunk
            chunk_text = ' '.join(current_chunk).replace('\n\n', '\n').strip()
            chunks.append(chunk_text)
            current_chunk = [next_sentence]
        else:
            current_chunk.append(next_sentence)

    # Add final chunk
    if current_chunk:
        if len(current_chunk) >= min_chunk_size:
            chunk_text = ' '.join(current_chunk).replace('\n\n', '\n').strip()
            chunks.append(chunk_text)
        elif chunks:
            # Merge with last chunk if too small
            chunk_text = ' '.join(current_chunk).replace('\n\n', '\n').strip()
            chunks[-1] += ' ' + chunk_text
        else:
            # Force include even if small (only chunk)
            chunk_text = ' '.join(current_chunk).replace('\n\n', '\n').strip()
            chunks.append(chunk_text)


    print(f"✅ Created {len(chunks)} chunks")
    return chunks

def chunk_with_claude(text, api_key=None, max_chunk_sentences=None):
    """Chunk text using Claude API for more intelligent segmentation"""

    api_key = api_key or config.CLAUDE_API_KEY
    max_chunk_sentences = max_chunk_sentences or config.MAX_CHUNK_SENTENCES

    if not api_key:
        print("❌ Claude API key required for Claude chunking method")
        return [text]

    try:
        import anthropic
        client = anthropic.Anthropic(api_key=api_key)
        print(f"🤖 Using Claude API ({config.CLAUDE_MODEL}) for intelligent chunking...")
    except ImportError:
        print("❌ anthropic package not installed. Using embeddings fallback.")
        return chunk_with_embeddings(text)
    except Exception as e:
        print(f"❌ Claude API error: {e}")
        return chunk_with_embeddings(text)

    # Preprocess text (includes timestamp removal if enabled)
    processed_text = preprocess_text_for_chunking(text)

    # Split into manageable sections for API
    try:
        sentences = sent_tokenize(processed_text)
    except Exception:
        sentences = re.split(r'[.!?]+\s+', processed_text)
        sentences = [s.strip() for s in sentences if s.strip()]

    # Process in batches to stay within API limits
    batch_size = 30  # sentences per API call
    all_chunks = []

    # Calculate total batches for progress tracking
    total_sentences = len(sentences)
    total_batches = (total_sentences + batch_size - 1) // batch_size

    print(f"📊 Processing {total_sentences} sentences in {total_batches} batches...")
    progress = 0  # Defined up front so the final progress line works even with zero batches
    print("🔄 Progress: [", end='', flush=True)

    for batch_num, i in enumerate(range(0, len(sentences), batch_size)):
        batch_sentences = sentences[i:i + batch_size]
        batch_text = ' '.join(batch_sentences)

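        # Update progress bar: print only the segments gained since the previous batch
        progress = (batch_num * 20) // total_batches
        prev_progress = ((batch_num - 1) * 20) // total_batches if batch_num > 0 else 0
        print("=" * (progress - prev_progress), end='', flush=True)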

        # Skip very short batches
        if len(batch_text.strip()) < 50:
            if all_chunks:
                all_chunks[-1] += " " + batch_text
            else:
                all_chunks.append(batch_text)
            continue

        prompt = f"""Please analyze this interview transcript and break it into semantically coherent chunks. Each chunk should:

1. Contain at most {max_chunk_sentences} sentences
2. Represent a distinct topic, question-answer pair, or conversation turn
3. Preserve speaker labels exactly as they appear (Q:, A:, Interviewer:, etc.)
4. Maintain natural conversation flow

Return ONLY the chunks, each separated by exactly "---CHUNK_BREAK---"

Text to chunk:
{batch_text}"""

        try:
            response = client.messages.create(
                model=config.CLAUDE_MODEL,
                max_tokens=4000,
                messages=[{"role": "user", "content": prompt}]
            )

            # Parse response
            response_text = response.content[0].text if response.content else ""
            batch_chunks = response_text.split("---CHUNK_BREAK---")
            batch_chunks = [chunk.strip() for chunk in batch_chunks if chunk.strip()]

            if not batch_chunks:
                # Fallback if Claude didn't return proper format
                print(f"\n⚠️ Claude response format issue, using fallback for batch {batch_num + 1}/{total_batches}")
                print(f"🔄 Progress: [{'=' * progress}", end='', flush=True)
                fallback_chunks = []
                for j in range(0, len(batch_sentences), max_chunk_sentences):
                    chunk_sentences = batch_sentences[j:j + max_chunk_sentences]
                    fallback_chunks.append(' '.join(chunk_sentences))
                all_chunks.extend(fallback_chunks)
            else:
                all_chunks.extend(batch_chunks)

        except Exception as e:
            print(f"\n⚠️ API error for batch {batch_num + 1}/{total_batches}: {e}")
            print(f"🔄 Progress: [{'=' * progress}", end='', flush=True)
            # Fallback to simple splitting for this batch
            fallback_chunks = []
            for j in range(0, len(batch_sentences), max_chunk_sentences):
                chunk_sentences = batch_sentences[j:j + max_chunk_sentences]
                fallback_chunks.append(' '.join(chunk_sentences))
            all_chunks.extend(fallback_chunks)

    print("=" * (20 - progress) + "] 100%")
    print(f"✅ Created {len(all_chunks)} chunks using Claude API")
    return all_chunks

def analyze_chunks_quality(original_text, chunks):
    """Analyze the quality of chunking results"""

    # Basic preservation check
    original_words = len(original_text.split())
    chunk_words = sum(len(chunk.split()) for chunk in chunks)
    preservation_rate = (chunk_words / original_words) * 100 if original_words > 0 else 0

    # Chunk statistics
    chunk_lengths = [len(chunk.split()) for chunk in chunks]

    stats = {
        'total_chunks': len(chunks),
        'preservation_rate': preservation_rate,
        'avg_words_per_chunk': np.mean(chunk_lengths),
        'min_words': min(chunk_lengths) if chunk_lengths else 0,
        'max_words': max(chunk_lengths) if chunk_lengths else 0,
        'std_words': np.std(chunk_lengths) if len(chunk_lengths) > 1 else 0
    }

    return stats

print("✅ Core processing functions loaded successfully!")
print("🔧 Features: Semantic chunking, speaker preservation, timestamp removal")

## Main Processing Pipeline and Output

Execute the complete chunking workflow with your uploaded files and configured parameters. This cell orchestrates the functions defined above and reports progress and quality checks as it runs.
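
The quality validation mentioned above rests on a word-count comparison between the original text and the chunks. The snippet below is a small illustration of that check under simplifying assumptions; the function name `preservation_rate` and the sample strings are hypothetical, and the 95% cut-off mirrors the warning threshold used in the pipeline code.

```python
def preservation_rate(original_text, chunks):
    """Percentage of whitespace-delimited words in the original that appear in the chunks."""
    original_words = len(original_text.split())
    chunk_words = sum(len(chunk.split()) for chunk in chunks)
    return 100.0 * chunk_words / original_words if original_words else 0.0

# Hypothetical sample transcript and two candidate chunkings.
sample_text = "Q: How are you? A: I am fine, thanks."
complete_chunks = ["Q: How are you?", "A: I am fine, thanks."]
lossy_chunks = ["Q: How are you?"]

complete_rate = preservation_rate(sample_text, complete_chunks)
lossy_rate = preservation_rate(sample_text, lossy_chunks)

# Flag results below the 95% threshold used for the text-loss warning.
needs_warning = lossy_rate < 95
```

Because this is a raw word count, the rate can fall below 100% when timestamps are stripped and can rise above 100% when speaker labels are added to continuation lines, so it is best read as a rough sanity check rather than an exact measure of text loss.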

In [None]:
# Main Processing Pipeline

def check_prerequisites(uploaded_files):
    """Check if everything is ready for processing"""
    issues = []

    # Check if files are uploaded
    if not uploaded_files:
        issues.append("❌ No files uploaded")

    # Check configuration
    if config.CHUNKING_METHOD == 'claude' and not config.CLAUDE_API_KEY:
        issues.append("❌ Claude API key required for Claude method")

    if issues:
        for issue in issues:
            print(issue)
        return False

    return True

# Removed select_file_for_processing as it used input()

def run_chunking_process(filename, text):
    """Run the main chunking process"""

    print("\n" + "=" * 60)
    print("🚀 CHUNKING PROCESS")
    print("=" * 60)
    print(f"📄 Processing: {filename}")
    print(f"📊 Text length: {len(text):,} characters, {len(text.split()):,} words")
    print(f"🧠 Method: {config.CHUNKING_METHOD.title()}")
    print(f"⚙️ Settings: {config.MIN_CHUNK_SENTENCES}-{config.MAX_CHUNK_SENTENCES} sentences, threshold: {config.SIMILARITY_THRESHOLD}")
    print()

    # Run chunking based on method
    start_time = datetime.now()

    try:
        if config.CHUNKING_METHOD == 'claude':
            chunks = chunk_with_claude(text)
        else:
            chunks = chunk_with_embeddings(text)

        end_time = datetime.now()
        processing_time = (end_time - start_time).total_seconds()

        print(f"\n⏱️ Processing completed in {processing_time:.1f} seconds")

        if not chunks:
            print("❌ No chunks created. Please check your settings.")
            return None

        # Analyze quality
        quality_stats = analyze_chunks_quality(text, chunks)

        print("\n📊 Chunking Results:")
        print(f"   • Total chunks: {quality_stats['total_chunks']}")
        print(f"   • Text preservation: {quality_stats['preservation_rate']:.1f}%")
        print(f"   • Average words per chunk: {quality_stats['avg_words_per_chunk']:.1f}")
        print(f"   • Chunk size range: {quality_stats['min_words']}-{quality_stats['max_words']} words")

        if quality_stats['preservation_rate'] < 95:
            print("   ⚠️ Warning: Significant text loss detected")
        else:
            print("   ✅ Good text preservation")

        return chunks

    except Exception as e:
        print(f"❌ Chunking failed: {e}")
        return None

def create_dataframe_from_chunks(chunks, filename):
    """Create a comprehensive DataFrame from chunks"""

    chunk_data = []

    for i, chunk in enumerate(chunks):
        # Basic metrics
        word_count = len(chunk.split())
        char_count = len(chunk)
        sentence_count = len([s for s in re.split(r'[.!?]+', chunk) if s.strip()])

        data = {
            'chunk_id': i + 1,
            'text': chunk,
            'word_count': word_count,
            'char_count': char_count,
            'sentence_count': sentence_count
        }

        # Add speaker information if enabled
        if config.INCLUDE_SPEAKER_COLUMN:
            # Extract speaker from the chunk - look for any speaker label in the chunk
            speaker = None
            lines = chunk.split('\n')
            for line in lines:
                line_speaker, _ = extract_speaker_label(line.strip())
                if line_speaker:
                    speaker = line_speaker
                    break

            # If no speaker found in any line, mark as Unknown
            data['speaker'] = speaker if speaker else 'Unknown'

        # Add metadata if enabled
        if config.INCLUDE_METADATA:
            data['avg_words_per_sentence'] = word_count / max(1, sentence_count)
            data['source_file'] = filename

        # Add timestamp if enabled
        if config.TIMESTAMP_OUTPUT:
            data['processed_timestamp'] = datetime.now().isoformat()

        chunk_data.append(data)

    return pd.DataFrame(chunk_data)

def create_processing_interface():
    """Create the main processing interface"""

    # Instructions
    instructions_html = """
    <div style='background-color: #E7ECEF; padding: 20px; border-radius: 10px; margin: 20px 0; border-left: 5px solid #274C77;'>
    <h3 style='color: #274C77; margin-top: 0;'>🚀 Process Your Interview Transcript</h3>
    <p><strong>Everything is ready!</strong> Click the button below to run the semantic chunking process on your uploaded files.</p>
    <div style='background-color: #A3CEF1; padding: 15px; border-radius: 8px; margin: 15px 0; border-left: 4px solid #6096BA;'>
        <p style='color: #274C77; margin: 0; font-weight: bold;'>What happens next:</p>
        <ol style='color: #274C77; margin: 10px 0;'>
            <li>🔍 Text preprocessing and speaker label preservation</li>
            <li>🧠 Semantic analysis using your chosen method</li>
            <li>✂️ Intelligent chunking based on content similarity</li>
            <li>📊 Quality analysis and statistics</li>
            <li>💾 Export preparation in your chosen format</li>
        </ol>
    </div>
    <p><strong>💡 Tip:</strong> The process may take a few minutes for large files. Watch the progress output below!</p>
    </div>
    """

    # Process button
    process_button = widgets.Button(
        description='🚀 Start Chunking Process',
        disabled=False,
        button_style='',
        tooltip='Begin processing your uploaded transcript',
        icon='play',
        layout=widgets.Layout(width='300px', height='50px'),
        style={'button_color': '#6096BA', 'font_weight': 'bold'}
    )

    # Reset button
    reset_button = widgets.Button(
        description='🔄 Reset All',
        disabled=False,
        button_style='',
        tooltip='Clear everything and start over',
        icon='refresh',
        layout=widgets.Layout(width='150px', height='40px'),
        style={'button_color': '#8B8C89', 'font_weight': 'bold'}
    )

    # Status display
    status_output = widgets.Output()

    # Global variable to store results
    global processing_results
    processing_results = {}

    def run_full_process(b):
        with status_output:
            status_output.clear_output()

            # Access the global uploaded_files from the upload interface
            global uploaded_files

            # Check prerequisites
            if not check_prerequisites(uploaded_files):
                print("\n💡 Please complete the steps above before processing:")
                print("   1. Configure your settings")
                print("   2. Upload your transcript files")
                return

            try:
                # Select file or combine files based on uploaded_files
                if len(uploaded_files) == 1:
                    filename = list(uploaded_files.keys())[0]
                    text = uploaded_files[filename]
                    print(f"📄 Processing single file: {filename}")
                elif len(uploaded_files) > 1:
                     # Combine all files
                    combined_text = []
                    combined_filename = f"combined_{len(uploaded_files)}_files"

                    for i, (fname, ftext) in enumerate(uploaded_files.items(), 1):
                        combined_text.append(f"=== Document {i}: {fname} ===")
                        combined_text.append(ftext)
                        combined_text.append("")  # Blank line between files

                    text = '\n\n'.join(combined_text)  # Real blank lines so paragraph splitting works
                    filename = combined_filename
                    print(f"🔗 Combined {len(uploaded_files)} files ({len(text.split()):,} total words) for processing")
                else:
                    print("❌ No files available for processing.")
                    return


                # Run chunking
                chunks = run_chunking_process(filename, text)

                if not chunks:
                    return

                # Create DataFrame
                print("\n📋 Creating structured output...")
                df = create_dataframe_from_chunks(chunks, filename)

                # Store results globally
                global processing_results
                processing_results = {
                    'filename': filename,
                    'original_text': text,
                    'chunks': chunks,
                    'dataframe': df,
                    'timestamp': datetime.now()
                }

                # Display preview
                print("\n" + "=" * 60)
                print("📋 PREVIEW OF RESULTS")
                print("=" * 60)

                # Show first few chunks
                display_df = df.head(3)
                if 'text' in display_df.columns:
                    # Truncate text for display
                    display_df = display_df.copy()
                    display_df['text_preview'] = display_df['text'].str[:100] + '...'
                    display_df = display_df.drop('text', axis=1)

                display(display_df)

                print(f"\n✅ Processing complete! {len(chunks)} chunks created.")
                print("👇 Continue to the next step to export your results.")

            except Exception as e:
                print(f"❌ Processing failed: {e}")
                import traceback
                print("\nDetailed error:")
                traceback.print_exc()

    def reset_everything(b):
        with status_output:
            status_output.clear_output()

            global uploaded_files, file_stats, processing_results
            uploaded_files.clear()
            file_stats.clear()
            processing_results.clear()

            # Reset file info display in the upload interface
            upload_interface['file_info'].value = "<p style='color: #666; font-style: italic;'>Files cleared. Click 'Choose Files to Upload' to start over.</p>"


            print("🔄 Everything has been reset.")
            print("👆 You can now reconfigure and upload new files.")

    # Bind events
    process_button.on_click(run_full_process)
    reset_button.on_click(reset_everything)

    # Layout
    buttons_container = widgets.HBox([process_button, reset_button])

    # Display interface
    display(HTML(instructions_html))
    display(buttons_container)
    display(status_output)

    return {
        'process_button': process_button,
        'reset_button': reset_button,
        'output': status_output
    }

# Initialize processing interface
print("🚀 Main Processing Pipeline Ready")
print("👆 Make sure you've configured settings and uploaded files above!")

# Global variable for results
processing_results = {}

# Create and display processing interface
processing_interface = create_processing_interface()

## Results Analysis & Export

Analyze chunking quality with detailed statistics and visualizations, then export results in your preferred format. Generate reports and download structured data ready for qualitative analysis software.
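
The text-preservation check in the cell below compares the word count of the chunks against the word count of the original transcript and grades the result. As a minimal, standalone sketch of that idea: the function names `preservation_rate` and `rate_label` are hypothetical and are not part of the notebook, and words are counted by whitespace splitting, which is an assumption about how the `word_count` column is built. The thresholds (99, 95, 90) mirror the ones used in the statistics function.

```python
def preservation_rate(original_text, chunk_texts):
    """Return the chunk word count as a percentage of the original word count.

    Hypothetical helper: words are counted by splitting on whitespace.
    """
    original_words = len(original_text.split())
    chunk_words = sum(len(chunk.split()) for chunk in chunk_texts)
    if original_words == 0:
        return 0.0  # avoid division by zero on an empty transcript
    return chunk_words / original_words * 100


def rate_label(rate):
    """Map a preservation rate to the same bands the statistics report uses."""
    if rate >= 99:
        return "excellent"
    if rate >= 95:
        return "good"
    if rate >= 90:
        return "moderate"
    return "poor"


original = "Q: How did you start? A: I began in 2019. Q: What changed? A: Everything changed."
chunks = [
    "Q: How did you start? A: I began in 2019.",
    "Q: What changed? A: Everything changed.",
]

rate = preservation_rate(original, chunks)
print(f"Preservation rate: {rate:.2f}% ({rate_label(rate)})")
```

A rate well below 100% suggests that some text was dropped during segmentation, so it is worth reviewing the chunks before coding them.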

In [None]:
# Results Analysis and Export

from IPython.display import HTML

def generate_comprehensive_statistics(df, original_text):
    """Generate detailed statistics about the chunking results"""

    if df is None or len(df) == 0:
        print("❌ No data available for analysis")
        return

    print("\n" + "=" * 70)
    print("📊 COMPREHENSIVE CHUNKING ANALYSIS")
    print("=" * 70)

    # Basic metrics
    total_chunks = len(df)
    total_words = df['word_count'].sum()
    avg_words = df['word_count'].mean()
    median_words = df['word_count'].median()
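    # Sample standard deviation is NaN for a single chunk, so fall back to 0.0
    std_words = df['word_count'].std() if total_chunks > 1 else 0.0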

    print("\n📋 Basic Metrics:")
    print(f"   • Total chunks created: {total_chunks}")
    print(f"   • Total words processed: {total_words:,}")
    print(f"   • Average words per chunk: {avg_words:.1f}")
    print(f"   • Median words per chunk: {median_words:.1f}")
    print(f"   • Standard deviation: {std_words:.1f}")
    print(f"   • Shortest chunk: {df['word_count'].min()} words")
    print(f"   • Longest chunk: {df['word_count'].max()} words")

    # Size distribution analysis
    word_ranges = {
        'Very Short (1-15 words)': len(df[df['word_count'] <= 15]),
        'Short (16-30 words)': len(df[(df['word_count'] > 15) & (df['word_count'] <= 30)]),
        'Medium (31-60 words)': len(df[(df['word_count'] > 30) & (df['word_count'] <= 60)]),
        'Long (61-100 words)': len(df[(df['word_count'] > 60) & (df['word_count'] <= 100)]),
        'Very Long (101+ words)': len(df[df['word_count'] > 100])
    }

    print("\n📏 Chunk Size Distribution:")
    for range_name, count in word_ranges.items():
        percentage = (count / total_chunks) * 100
        bar = "█" * int(percentage / 5)  # Simple text bar
        print(f"   • {range_name:25} {count:3d} ({percentage:5.1f}%) {bar}")

    # Speaker analysis if available
    if 'speaker' in df.columns:
        unique_speakers = df['speaker'].nunique()
        print("\n👥 Speaker Analysis:")
        print(f"   • Unique speakers identified: {unique_speakers}")

        speaker_counts = df['speaker'].value_counts()
        speaker_word_totals = df.groupby('speaker')['word_count'].sum().sort_values(ascending=False)

        print(f"   • Most active speakers (by chunks):")
        for i, (speaker, count) in enumerate(speaker_counts.head(5).items()):
            percentage = (count / total_chunks) * 100
            word_total = speaker_word_totals.get(speaker, 0)
            print(f"     {i+1}. {speaker}: {count} chunks ({percentage:.1f}%), {word_total:,} words")

    # Text preservation analysis
    original_words = len(original_text.split())
    preservation_rate = (total_words / original_words) * 100 if original_words > 0 else 0

    print("\n🔍 Text Preservation Analysis:")
    print(f"   • Original text: {original_words:,} words")
    print(f"   • Processed text: {total_words:,} words")
    print(f"   • Preservation rate: {preservation_rate:.2f}%")

    if preservation_rate >= 99:
        print("   ✅ Excellent preservation - minimal text loss")
    elif preservation_rate >= 95:
        print("   ✅ Good preservation - acceptable for analysis")
    elif preservation_rate >= 90:
        print("   ⚠️ Moderate preservation - review recommended")
    else:
        print("   ❌ Poor preservation - significant text loss detected")

    # Quality recommendations
    print("\n💡 Quality Assessment for Qualitative Analysis:")

    if avg_words < 20:
        print("   📝 Small chunks - Good for detailed coding, but may lack context")
    elif avg_words < 50:
        print("   📝 Medium chunks - Balanced for most qualitative analysis methods")
    else:
        print("   📝 Large chunks - Good for thematic analysis, may need sub-coding")

    if std_words > avg_words * 0.5:
        print("   📊 High variability in chunk sizes - consider adjusting settings")
    else:
        print("   📊 Consistent chunk sizes - good for systematic analysis")

def create_visualizations(df):
    """Create helpful visualizations of the chunking results"""

    if df is None or len(df) == 0:
        print("❌ No data available for visualization")
        return

    print("\n📈 Generating visualizations...")

    # Set up the plot style
    plt.style.use('default')
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Interview Transcript Chunking Analysis', fontsize=16, fontweight='bold')

    # 1. Histogram of chunk word counts
    ax1.hist(df['word_count'], bins=min(20, len(df)//2 + 1), edgecolor='black', alpha=0.7, color='skyblue')
    ax1.set_xlabel('Words per Chunk')
    ax1.set_ylabel('Number of Chunks')
    ax1.set_title('Distribution of Chunk Sizes')
    ax1.grid(True, alpha=0.3)
    ax1.axvline(df['word_count'].mean(), color='red', linestyle='--', alpha=0.7, label=f'Mean: {df["word_count"].mean():.1f}')
    ax1.legend()

    # 2. Line plot showing chunk sizes throughout document
    ax2.plot(df['chunk_id'], df['word_count'], marker='o', linewidth=2, markersize=3, color='coral', alpha=0.8)
    ax2.set_xlabel('Chunk Number')
    ax2.set_ylabel('Word Count')
    ax2.set_title('Chunk Sizes Throughout Document')
    ax2.grid(True, alpha=0.3)
    ax2.axhline(df['word_count'].mean(), color='green', linestyle='--', alpha=0.7, label=f'Average: {df["word_count"].mean():.1f}')
    ax2.legend()

    # 3. Box plot of word counts by speaker (if available) or general distribution
    if 'speaker' in df.columns and df['speaker'].nunique() > 1:
        # Filter to top speakers for readability
        top_speakers = df['speaker'].value_counts().head(6).index
        filtered_df = df[df['speaker'].isin(top_speakers)]

        speaker_data = [filtered_df[filtered_df['speaker'] == speaker]['word_count'].values
                       for speaker in top_speakers]

        bp = ax3.boxplot(speaker_data, labels=top_speakers, patch_artist=True)
        for patch in bp['boxes']:
            patch.set_facecolor('lightgreen')
            patch.set_alpha(0.7)

        ax3.set_ylabel('Word Count')
        ax3.set_title('Chunk Size Distribution by Speaker')
        ax3.tick_params(axis='x', rotation=45)
        ax3.grid(True, alpha=0.3)
    else:
        # General box plot
        bp = ax3.boxplot(df['word_count'], patch_artist=True)
        bp['boxes'][0].set_facecolor('lightgreen')
        bp['boxes'][0].set_alpha(0.7)
        ax3.set_ylabel('Word Count')
        ax3.set_title('Overall Chunk Size Distribution')
        ax3.grid(True, alpha=0.3)

    # 4. Speaker distribution pie chart (if available) or sentence count analysis
    if 'speaker' in df.columns and df['speaker'].nunique() > 1:
        speaker_counts = df['speaker'].value_counts().head(8)  # Top 8 speakers
        colors = plt.cm.Set3(np.linspace(0, 1, len(speaker_counts)))

        wedges, texts, autotexts = ax4.pie(speaker_counts.values, labels=speaker_counts.index,
                                          autopct='%1.1f%%', colors=colors, startangle=90)
        ax4.set_title('Distribution of Chunks by Speaker')

        # Make text more readable
        for autotext in autotexts:
            autotext.set_color('white')
            autotext.set_fontweight('bold')
    else:
        # Scatter plot: word count vs sentence count
        if 'sentence_count' in df.columns:
            ax4.scatter(df['word_count'], df['sentence_count'], alpha=0.6, color='purple', s=30)
            ax4.set_xlabel('Word Count')
            ax4.set_ylabel('Sentence Count')
            ax4.set_title('Words vs Sentences per Chunk')
            ax4.grid(True, alpha=0.3)

            # Add trend line (a linear fit needs at least two distinct x values)
            if df['word_count'].nunique() > 1:
                z = np.polyfit(df['word_count'], df['sentence_count'], 1)
                p = np.poly1d(z)
                ax4.plot(df['word_count'], p(df['word_count']), "r--", alpha=0.7)
        else:
            ax4.text(0.5, 0.5, 'No additional\nanalysis available',
                    horizontalalignment='center', verticalalignment='center',
                    transform=ax4.transAxes, fontsize=12)
            ax4.set_title('Additional Analysis')

    plt.tight_layout()
    plt.show()

    print("✅ Visualizations complete!")

def save_results(df, filename, original_text):
    """Save results in the configured format with metadata to a folder"""

    if df is None or len(df) == 0:
        print("❌ No data to save")
        return None

    print("\n" + "=" * 60)
    print("💾 SAVING RESULTS TO A FOLDER")
    print("=" * 60)

    # Create output directory
    base_name = os.path.splitext(filename)[0] if filename else "transcript"
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_dir = f"chunked_results_{base_name}_{timestamp}"

    # Create directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    print(f"📁 Created directory: {output_dir}")

    try:
        if config.OUTPUT_FORMAT == 'csv':
            output_filename = f"{output_dir}/{base_name}_chunked.csv"
            df.to_csv(output_filename, index=False, encoding='utf-8')

            print(f"\n📄 CSV Format Saved:")
            print(f"   File: {output_filename}")
            print(f"   Contains: Simple spreadsheet with columns for chunk_id, text, word_count, etc.")
            print(f"   Best for: Quick analysis in Excel/Google Sheets, simple data processing")

        elif config.OUTPUT_FORMAT == 'excel':
            output_filename = f"{output_dir}/{base_name}_chunked.xlsx"

            # Create Excel with multiple sheets and formatting
            with pd.ExcelWriter(output_filename, engine='openpyxl') as writer:
                # Main chunks sheet
                df.to_excel(writer, sheet_name='Chunks', index=False)

                # Statistics sheet
                stats_data = {
                    'Metric': ['Total Chunks', 'Total Words', 'Average Words/Chunk',
                              'Min Words', 'Max Words', 'Standard Deviation'],
                    'Value': [len(df), df['word_count'].sum(), df['word_count'].mean(),
                             df['word_count'].min(), df['word_count'].max(), df['word_count'].std()]
                }
                pd.DataFrame(stats_data).to_excel(writer, sheet_name='Statistics', index=False)

                # Configuration sheet
                config_data = {
                    'Setting': ['Chunking Method', 'Similarity Threshold', 'Max Sentences',
                               'Min Sentences', 'Model Name', 'Preserve Speakers'],
                    'Value': [config.CHUNKING_METHOD, config.SIMILARITY_THRESHOLD,
                             config.MAX_CHUNK_SENTENCES, config.MIN_CHUNK_SENTENCES,
                             config.MODEL_NAME, config.PRESERVE_SPEAKER_LABELS]
                }
                pd.DataFrame(config_data).to_excel(writer, sheet_name='Configuration', index=False)

                # Format the main sheet
                workbook = writer.book
                worksheet = writer.sheets['Chunks']

                # Auto-adjust column widths
                for column in worksheet.columns:
                    max_length = 0
                    column_letter = column[0].column_letter

                    for cell in column:
                        # Treat empty cells as zero-width instead of measuring the string "None"
                        cell_length = len(str(cell.value)) if cell.value is not None else 0
                        max_length = max(max_length, cell_length)

                    adjusted_width = min(max_length + 2, 60)  # Cap at 60
                    worksheet.column_dimensions[column_letter].width = adjusted_width

            print(f"\n📊 Excel Format Saved:")
            print(f"   File: {output_filename}")
            print(f"   Contains: 3 sheets - Chunks (main data), Statistics (summary), Configuration (settings)")
            print(f"   Best for: Professional analysis, importing to NVivo/ATLAS.ti, comprehensive reporting")

        elif config.OUTPUT_FORMAT == 'json':
            output_filename = f"{output_dir}/{base_name}_chunked.json"

            # Create comprehensive JSON output
            json_output = {
                'metadata': {
                    'source_file': filename,
                    'processing_timestamp': datetime.now().isoformat(),
                    'configuration': {
                        'chunking_method': config.CHUNKING_METHOD,
                        'similarity_threshold': config.SIMILARITY_THRESHOLD,
                        'max_sentences': config.MAX_CHUNK_SENTENCES,
                        'min_sentences': config.MIN_CHUNK_SENTENCES,
                        'model_name': config.MODEL_NAME,
                        'preserve_speakers': config.PRESERVE_SPEAKER_LABELS
                    },
                    'statistics': {
                        'total_chunks': len(df),
                        'total_words': int(df['word_count'].sum()),
                        'average_words_per_chunk': float(df['word_count'].mean()),
                        'min_words': int(df['word_count'].min()),
                        'max_words': int(df['word_count'].max())
                    }
                },
                'chunks': df.to_dict('records')
            }

            with open(output_filename, 'w', encoding='utf-8') as f:
                # default=str stringifies values json cannot encode natively (e.g. timestamps)
                json.dump(json_output, f, indent=2, ensure_ascii=False, default=str)

            print(f"\n🔧 JSON Format Saved:")
            print(f"   File: {output_filename}")
            print(f"   Contains: Structured data with metadata, configuration, statistics, and chunks")
            print(f"   Best for: Programming analysis, API integration, custom processing scripts")

        else:
            raise ValueError(f"Unsupported output format: {config.OUTPUT_FORMAT}")

        # Also save configuration for reproducibility
        config_filename = f"{output_dir}/processing_config.json"
        config_dict = {
            'processing_info': {
                'timestamp': datetime.now().isoformat(),
                'source_file': filename,
                'total_chunks': len(df),
                'notebook_version': '2.0.0'
            },
            'chunking_parameters': {
                'method': config.CHUNKING_METHOD,
                'similarity_threshold': config.SIMILARITY_THRESHOLD,
                'max_chunk_sentences': config.MAX_CHUNK_SENTENCES,
                'min_chunk_sentences': config.MIN_CHUNK_SENTENCES,
                'model_name': config.MODEL_NAME
            },
            'processing_options': {
                'preserve_speaker_labels': config.PRESERVE_SPEAKER_LABELS,
                'preserve_timestamps': config.PRESERVE_TIMESTAMPS,
                'use_paragraph_boundaries': config.USE_PARAGRAPH_BOUNDARIES
            },
            'output_settings': {
                'output_format': config.OUTPUT_FORMAT,
                'include_speaker_column': config.INCLUDE_SPEAKER_COLUMN,
                'include_metadata': config.INCLUDE_METADATA
            }
        }

        with open(config_filename, 'w') as f:
            json.dump(config_dict, f, indent=2)

        print(f"\n⚙️ Configuration saved: {config_filename}")
        print(f"   Contains: All settings used for this processing run (for reproducibility)")

        # Create a summary file
        summary_filename = f"{output_dir}/README.txt"
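        # Explicit UTF-8: the summary contains bullet characters that can fail under other default encodings
        with open(summary_filename, 'w', encoding='utf-8') as f: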
            f.write("INTERVIEW TRANSCRIPT CHUNKING RESULTS\n")
            f.write("=" * 40 + "\n\n")
            f.write(f"Processing Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
            f.write(f"Source File: {filename}\n")
            f.write(f"Total Chunks Created: {len(df)}\n")
            f.write(f"Output Format: {config.OUTPUT_FORMAT.upper()}\n\n")

            f.write("FILES IN THIS DIRECTORY:\n")
            f.write(f"• {os.path.basename(output_filename)} - Main chunked data\n")
            f.write(f"• processing_config.json - Settings used for processing\n")
            f.write(f"• README.txt - This summary file\n\n")

            f.write("EXPORT FORMAT GUIDE:\n")
            if config.OUTPUT_FORMAT == 'csv':
                f.write("• CSV: Simple spreadsheet format, easy to open in Excel/Google Sheets\n")
            elif config.OUTPUT_FORMAT == 'excel':
                f.write("• Excel: Multi-sheet file with data, statistics, and configuration\n")
                f.write("  - Sheet 1 (Chunks): Your chunked interview data\n")
                f.write("  - Sheet 2 (Statistics): Summary statistics\n")
                f.write("  - Sheet 3 (Configuration): Processing settings\n")
            elif config.OUTPUT_FORMAT == 'json':
                f.write("• JSON: Structured data format for programming/API use\n")

            f.write("\nNEXT STEPS:\n")
            f.write("• Import the main data file into your qualitative analysis software\n")
            f.write("• Use the configuration file to reproduce these results\n")
            f.write("• Refer to chunk_id column to maintain order during analysis\n")

        print(f"\n📋 Summary created: {summary_filename}")

        # List all files created
        print(f"\n✅ All files saved to Colab directory: {output_dir}")
        print("📂 Directory contents:")
        for file in os.listdir(output_dir):
            file_path = os.path.join(output_dir, file)
            file_size = os.path.getsize(file_path)
            print(f"   • {file} ({file_size:,} bytes)")

        print(f"\n💡 Access your files at: {os.path.abspath(output_dir)}")
        print("💡 Use the file browser on the left to navigate to your results")

        return output_filename

    except Exception as e:
        print(f"❌ Error saving files: {e}")
        return None

def create_results_interface():
    """Create interface for viewing and exporting results"""

    # Instructions
    instructions_html = """
    <div style='background-color: #E7ECEF; padding: 20px; border-radius: 10px; margin: 20px 0; border-left: 5px solid #274C77;'>
    <h3 style='color: #274C77; margin-top: 0;'>📊 Analyze Results and Export</h3>
    <p><strong>Great job!</strong> Your interview transcript has been successfully chunked. Use the tools below to:</p>
    <div style='display: flex; gap: 20px; margin: 15px 0;'>
        <div style='flex: 1; background-color: #A3CEF1; padding: 15px; border-radius: 8px; border-left: 4px solid #6096BA;'>
            <ul style='color: #274C77; margin: 0;'>
                <li>📈 <strong>View Statistics:</strong> Understand your chunking results with detailed analytics</li>
                <li>📊 <strong>Generate Visualizations:</strong> See charts and graphs of your data</li>
            </ul>
        </div>
        <div style='flex: 1; background-color: #A3CEF1; padding: 15px; border-radius: 8px; border-left: 4px solid #6096BA;'>
            <ul style='color: #274C77; margin: 0;'>
                <li>💾 <strong>Export Files:</strong> Download your chunked data for qualitative analysis software</li>
                <li>🎯 <strong>Ready for Analysis:</strong> Import into NVivo, ATLAS.ti, or spreadsheets</li>
            </ul>
        </div>
    </div>
    </div>
    """

    # Action buttons
    stats_button = widgets.Button(
        description='📊 View Detailed Statistics',
        disabled=False,
        button_style='',
        tooltip='Generate comprehensive analysis of your chunks',
        icon='chart-bar',
        layout=widgets.Layout(width='250px', height='45px'),
        style={'button_color': '#A3CEF1', 'font_weight': 'bold'}
    )

    viz_button = widgets.Button(
        description='📈 Generate Visualizations',
        disabled=False,
        button_style='',
        tooltip='Create charts and graphs of your data',
        icon='chart-line',
        layout=widgets.Layout(width='250px', height='45px'),
        style={'button_color': '#A3CEF1', 'font_weight': 'bold'}
    )

    export_button = widgets.Button(
        description='💾 Export Results',
        disabled=False,
        button_style='',
        tooltip='Download your chunked data files',
        icon='download',
        layout=widgets.Layout(width='200px', height='45px'),
        style={'button_color': '#6096BA', 'font_weight': 'bold'}
    )

    # Output display
    results_output = widgets.Output()

    def show_statistics(b):
        with results_output:
            results_output.clear_output()

            if not processing_results:
                print("❌ No processing results available. Please run the chunking process first.")
                return

            generate_comprehensive_statistics(
                processing_results['dataframe'],
                processing_results['original_text']
            )

    def show_visualizations(b):
        with results_output:
            results_output.clear_output()

            if not processing_results:
                print("❌ No processing results available. Please run the chunking process first.")
                return

            create_visualizations(processing_results['dataframe'])

    def export_results(b):
        with results_output:
            results_output.clear_output()

            if not processing_results:
                print("❌ No processing results available. Please run the chunking process first.")
                return

            save_results(
                processing_results['dataframe'],
                processing_results['filename'],
                processing_results['original_text']
            )

    # Bind events
    stats_button.on_click(show_statistics)
    viz_button.on_click(show_visualizations)
    export_button.on_click(export_results)

    # Layout
    analysis_buttons = widgets.HBox([stats_button, viz_button])
    export_section = widgets.HBox([export_button])

    # Display interface
    display(HTML(instructions_html))
    display(analysis_buttons)
    display(export_section)
    display(results_output)

    return {
        'stats_button': stats_button,
        'viz_button': viz_button,
        'export_button': export_button,
        'output': results_output
    }

# Initialize results interface
print("📊 Results Analysis and Export Interface Ready")
print("👆 Process your transcript above, then analyze and export your results below!")

# Create and display results interface
results_interface = create_results_interface()

# Final summary
print("\n" + "=" * 70)
print("🎉 INTERVIEW TRANSCRIPT SEMANTIC CHUNKER - READY FOR USE!")
print("=" * 70)
print("📋 Complete Workflow:")
print("   1. ✅ Configure your chunking parameters")
print("   2. ✅ Upload your interview transcript files")
print("   3. ✅ Run the semantic chunking process")
print("   4. ✅ Analyze results and export for qualitative analysis")
print()
print("💡 Tips for Best Results:")
print("   • Start with default settings and adjust based on your needs")
print("   • Ensure speaker labels are consistent (Q:, A:, Name:)")
print("   • Test different similarity thresholds for optimal chunking")
print("   • Export in Excel format for rich metadata and multiple sheets")
print()
print("🚀 Ready to transform your interview data into analyzable chunks!")
print("=" * 70)